I have this link:

https://yandex.ru/referats/?t=astronomy+geology+gyroscope+literature+marketing+mathematics+music+polit+agrobiologia+law+psychology+geography+physics+philosophy+chemistry+estetica

When the page is opened, text is generated in this block:

    <div class="referats__text">
      <div>Научно-фантастический рассказ</div>
      <strong>ТЕМА ЗАГОЛОВОК</strong>
      <p>ТЕКСТ1</p>
      <p>ТЕКСТ2</p>
    </div>

I need to extract only the text inside the <p> elements. How can I do that?

  • problem solved. - Lite Support

2 answers

To get the text of every <p> element on a given web page, you can use beautifulsoup4:

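    #!/usr/bin/env python3
    from urllib.request import urlopen

    import bs4  # $ pip install beautifulsoup4

    # URL taken from the question
    url = "https://yandex.ru/referats/?t=astronomy+geology+gyroscope+literature+marketing+mathematics+music+polit+agrobiologia+law+psychology+geography+physics+philosophy+chemistry+estetica"

    # Name the parser explicitly so bs4 does not have to guess one
    soup = bs4.BeautifulSoup(urlopen(url), 'html.parser')
    # Text of every <p> element on the page, in document order
    paragraphs = [p.get_text() for p in soup.find_all('p')]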
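If only the paragraphs inside the `referats__text` block are wanted, and not every <p> on the page, a CSS selector can narrow the search. This is a minimal sketch that runs on the markup quoted in the question instead of the live page; the extra paragraph outside the block is an assumption added only to show the filtering.

```python
import bs4  # $ pip install beautifulsoup4

# Sample markup copied from the question. The last <p> is NOT from the
# original page; it is added here to show that it gets filtered out.
html = """
<div class="referats__text">
  <div>Научно-фантастический рассказ</div>
  <strong>ТЕМА ЗАГОЛОВОК</strong>
  <p>ТЕКСТ1</p>
  <p>ТЕКСТ2</p>
</div>
<p>unrelated paragraph outside the block</p>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Select every <p> that is a descendant of <div class="referats__text">
paragraphs = [p.get_text() for p in soup.select('div.referats__text p')]
print(paragraphs)
```

For the real page, `html` would be replaced by `urlopen(url)`; whether the live site still uses this class name is not something the sketch can guarantee.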

    This works on Ubuntu with Python 3 and saves the text of fragments like

        <p>ТЕКСТ1</p> <p>ТЕКСТ2</p>

    to the file text.txt. I do not have Windows, so I could not test it there.

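        #!/usr/bin/env python3
        from urllib.request import urlopen

        url = "https://yandex.ru/referats/?t=astronomy+geology+gyroscope+literature+marketing+mathematics+music+polit+agrobiologia+law+psychology+geography+physics+philosophy+chemistry+estetica"
        page = urlopen(url).read().decode('utf-8')

        page_out = ""
        inside_p = False  # True while the current character lies between <p> and </p>
        for i in range(3, len(page)):
            # The three characters just before position i are an opening tag.
            # Only a bare "<p>" matches; "<p class=...>" would be missed.
            if page[i-3:i] == "<p>":
                inside_p = True
            # The closing tag starts at position i
            if page[i:i+4] == "</p>":
                inside_p = False
            if inside_p:
                page_out += page[i]

        # Write as UTF-8 explicitly so the Cyrillic text is saved on any platform
        with open('text.txt', 'w', encoding='utf-8') as f:
            f.write(page_out)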
    • For parsing HTML, it is better to use an HTML parser than to manipulate strings by hand. - jfs
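As an illustration of that comment, the same extraction can be done with the standard library's `html.parser`, with no third-party packages. This is only a sketch: the class name `ParagraphCollector` is made up for this example, and the `class="x"` attribute in the sample markup is an assumption added to show that tags with attributes are handled, which the string-slicing loop would miss.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.in_p = False  # True while the parser is inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.paragraphs.append('')  # start a new paragraph

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


# Markup modelled on the question; class="x" is added for illustration only.
html = ('<div class="referats__text"> <strong>ТЕМА ЗАГОЛОВОК</strong> '
        '<p>ТЕКСТ1</p> <p class="x">ТЕКСТ2</p> </div>')

parser = ParagraphCollector()
parser.feed(html)
parser.close()
print(parser.paragraphs)  # ['ТЕКСТ1', 'ТЕКСТ2']
```

For the live page, `html` would be the decoded string returned by `urlopen(url).read().decode('utf-8')`.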