Problems with getting the full set of tags when parsing the site

Question

For educational purposes, I am trying to parse the car classifieds site https://ab.onliner.by. The goal is to collect links to the car listings, which look like "https://ab.onliner.by/car/ID". Inspecting the site in the browser clearly shows that the car ID I need sits inside this tag:

<a href="/car/4164123"><img width="80" height="80" src="https://content.onliner.by/automarket/2218487/80x80/496c37de7ec4ec3eabf6eb66e6c9bb24.jpeg"></a>

The problem is that this tag is not in the returned HTML code.

    import requests
    from bs4 import BeautifulSoup

    def get_html(url):
        response = requests.get(url)
        return response.text

    print(get_html('https://ab.onliner.by'))

So the question is: what am I missing or doing wrong?

You are looking at the DOM generated in the browser rather than at the actual source code of the page.
Right-click → "View page source" (or "Page source code") in the same browser and you will see that the links are not there: they are generated on the fly by JavaScript.
There are a number of headless browsers that can do this entirely in the background, but I haven't gotten to them yet.
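The difference between the rendered DOM and the raw page source can be checked in code. The following is a minimal sketch that uses only the standard library's `html.parser`; the names `CarLinkParser` and `find_car_links` are made up for illustration, and the `raw` string is a hypothetical stand-in for a server response, not the site's actual markup.

```python
from html.parser import HTMLParser


class CarLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with /car/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("/car/"):
            self.links.append(href)


def find_car_links(html):
    """Return every /car/... href found in the given HTML string."""
    parser = CarLinkParser()
    parser.feed(html)
    return parser.links


# The kind of tag quoted in the question, as seen in the rendered DOM.
rendered = '<a href="/car/4164123"><img width="80" height="80"></a>'
# Hypothetical stand-in for a response whose links are added later by a script.
raw = '<div id="app"></div><script src="app.js"></script>'

print(find_car_links(rendered))  # ['/car/4164123']
print(find_car_links(raw))       # []
```

Passing `requests.get(url).text` to `find_car_links` should show whether the server response itself contains any such links; an empty list would be consistent with the links being added by JavaScript.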

Sergey Nudnov · Accepted Answer · 2019-05-12T03:32:08

To get the page as it looks after its scripts have run, use the Selenium package with Chrome or Firefox.

    from selenium import webdriver
    import time

    chrome_driver = 'C:/Tools/ChromeDriver/chromedriver.exe'
    chrome_options = webdriver.ChromeOptions()
    # Selenium 3 style; Selenium 4 takes the driver path via a Service object instead
    driver = webdriver.Chrome(executable_path=chrome_driver, options=chrome_options)
    driver.get('https://ab.onliner.by')

    # Pause so that the page's JavaScript has time to run.
    # time.sleep is a crude and not very reliable approach;
    # it is better to read up on and use Expected Conditions from Selenium itself:
    # from selenium.webdriver.support.ui import WebDriverWait
    # from selenium.webdriver.support import expected_conditions as EC
    time.sleep(5)

    print(driver.page_source)

The window size can also be set through the options: chrome_options.add_argument('window-size=1920x935')
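The comments in the answer's code recommend Expected Conditions over `time.sleep`. One way this might look is sketched below, assuming the links appear in the DOM with an `href` of the form `/car/ID` as quoted in the question, and that a Chrome driver can be found without an explicit path. The helper names `to_car_urls` and `fetch_car_urls` and the 10-second timeout are illustrative choices, not part of the original answer.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://ab.onliner.by"
CAR_URL = re.compile(r"^https://ab\.onliner\.by/car/\d+$")


def to_car_urls(hrefs, base_url=BASE_URL):
    """Make hrefs absolute, keep only car links, drop duplicates, keep order."""
    urls = []
    for href in hrefs:
        if not href:
            continue
        url = urljoin(base_url, href)
        if CAR_URL.match(url) and url not in urls:
            urls.append(url)
    return urls


def fetch_car_urls(timeout=10):
    """Open the site in Chrome and wait until car links are present."""
    # Imported here so that to_car_urls stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Assumption: the Chrome driver is on PATH or can be located automatically.
    driver = webdriver.Chrome()
    try:
        driver.get(BASE_URL)
        # Assumption: the rendered links carry href="/car/ID" as in the question.
        locator = (By.CSS_SELECTOR, 'a[href^="/car/"]')
        # Polls until at least one matching element exists, or raises
        # TimeoutException once `timeout` seconds have passed.
        anchors = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(locator)
        )
        return to_car_urls(a.get_attribute("href") for a in anchors)
    finally:
        driver.quit()
```

Calling `fetch_car_urls()` needs Selenium and Chrome installed. The explicit wait returns as soon as the first matching links show up instead of always pausing for a fixed five seconds, so it does not guarantee that every link on the page has been rendered yet.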
