Parsing a <div> class in Python 3

Question

I have this link:

https://yandex.ru/referats/?t=astronomy+geology+gyroscope+literature+marketing+mathematics+music+polit+agrobiologia+law+psychology+geography+physics+philosophy+chemistry+estetica

When the page is opened, the site generates text inside this block:

    <div class="referats__text">
      <div>Science fiction story</div>
      <strong>TOPIC HEADING</strong>
      <p>TEXT1</p>
      <p>TEXT2</p>
    </div>

I need to extract only the <p> elements. How can I do that?

jfs · Answer 1 · 2017-03-02T23:09:27

To get the text of every <p> on a given web page, you can use beautifulsoup4:

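    #!/usr/bin/env python3
    from urllib.request import urlopen

    import bs4  # $ pip install beautifulsoup4

    # The page from the question.
    url = "https://yandex.ru/referats/?t=astronomy+geology+gyroscope+literature+marketing+mathematics+music+polit+agrobiologia+law+psychology+geography+physics+philosophy+chemistry+estetica"

    # Name the parser explicitly so the result does not depend on what is installed.
    soup = bs4.BeautifulSoup(urlopen(url), 'html.parser')
    paragraphs = [p.get_text() for p in soup.find_all('p')]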
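If only the paragraphs inside the referats__text block are wanted, rather than every <p> on the page, a CSS selector can narrow the search. The following is a minimal sketch that runs on an inline sample instead of the live page; SAMPLE_HTML and referat_paragraphs are names made up for this illustration, and the sample markup is only assumed to match what the site returns.

```python
import bs4  # third-party: pip install beautifulsoup4

# Sample markup shaped like the block in the question (contents are placeholders).
SAMPLE_HTML = """
<p>outside the block</p>
<div class="referats__text">
  <div>Science fiction story</div>
  <strong>TOPIC HEADING</strong>
  <p>TEXT1</p>
  <p>TEXT2</p>
</div>
"""


def referat_paragraphs(html):
    """Return the text of each <p> nested inside div.referats__text."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # CSS selector: any <p> that is a descendant of <div class="referats__text">.
    return [p.get_text(strip=True) for p in soup.select("div.referats__text p")]


paragraphs = referat_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

The same function should work on the downloaded page by passing it the decoded response body, provided the class name on the live site is still referats__text.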

Artem · Answer 2 · 2017-03-02T21:51:45

This works on Ubuntu with Python 3. It saves the text of fragments like

  <p>TEXT1</p> <p>TEXT2</p>

to the file text.txt. I have no Windows machine, so I can't check it there.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    from urllib.request import urlopen

    url = "https://yandex.ru/referats/?t=astronomy+geology+gyroscope+literature+marketing+mathematics+music+polit+agrobiologia+law+psychology+geography+physics+philosophy+chemistry+estetica"
    page = urlopen(url).read().decode('utf-8')

    page_out = ""
    inside = False  # True while the scan is between "<p>" and "</p>"
    for i in range(3, len(page)):
        if page[i-3:i] == "<p>":
            inside = True
        if page[i:i+4] == "</p>":
            inside = False
        if inside:
            page_out += page[i]

    # Write as UTF-8 explicitly so the result does not depend on the OS default.
    with open('text.txt', 'w', encoding='utf-8') as f:
        f.write(page_out)

For parsing HTML, it is better to use an HTML parser than to manipulate strings by hand.
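As a rough illustration of that advice without any third-party packages, the standard library's html.parser module can collect paragraph text. This is only a sketch: ParagraphCollector and extract_paragraphs are hypothetical names, and the sample string stands in for the downloaded page. Unlike a literal search for "<p>", it also handles tags that carry attributes and nested inline markup.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._inside = False   # True while between <p ...> and </p>
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self.paragraphs.append("".join(self._chunks).strip())
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return a list with the text of each <p> in the given HTML string."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = (
    '<div class="referats__text"><div>Story</div><strong>TOPIC</strong>'
    '<p>TEXT1</p><p class="x">TEXT2 <b>bold</b></p></div>'
)
result = extract_paragraphs(sample)
print(result)
```

Each list element could then be written to text.txt on its own line, which keeps the paragraphs separated.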
