Hello!
I am trying to write a parser for financial news sites, and I am stuck on this one: http://ir.debenhams.com/news-releases .
I am trying to extract a table that is loaded into it from this site: http://tools.eurolandir.com/tools/pressreleases/?companycode=uk-deb&lang=en-gb , where the table itself is loaded by an ajax request.
The problem is that when I request the page with Simple HTML DOM, the table is not in it. I tried loading everything with curl, but either I did not set all the right parameters or I did something else wrong. The data is not loaded, even when I request the second site directly.
Please at least point me in the right direction.
1 answer
If the page content is generated by JS, then that JS has to be executed to get the HTML you see in the browser. For automation tasks like this there is a headless browser (a browser without a graphical interface): phantomjs.
Here is an easy way to try it out:
Run phantomjs in Docker:

    docker run -d -p 8910:8910 wernight/phantomjs phantomjs --webdriver=8910

Now you have a local WebDriver interface on localhost:8910, which is convenient to use through the php-webdriver library, for example:

    <?php
    require __DIR__ . '/vendor/autoload.php';

    $driver = \Facebook\WebDriver\Remote\RemoteWebDriver::create(
        'localhost:8910',
        \Facebook\WebDriver\Remote\DesiredCapabilities::phantomjs()
    );
    $driver->get('http://tools.eurolandir.com/tools/pressreleases/?companycode=uk-deb&lang=en-gb');
    sleep(2); // give the JS time to run; there are more robust ways to wait, which you can look up
    $html = $driver->getPageSource();
    echo $html;

If you have not yet figured out how to use Docker, you can install phantomjs on your system and work with it through a library such as jonnyw/php-phantomjs (it seems that it can even install phantomjs itself).
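The fixed `sleep(2)` above can be replaced with an explicit wait that polls until the element you need is present. This is only a sketch: the function name `fetchRenderedHtml` is hypothetical, the CSS selector is an assumption (inspect the rendered page to find the real one), and it assumes Composer's autoloader has already been required, as in the script above.

```php
<?php
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

/**
 * Load a page through a running WebDriver server and return its HTML
 * once an element matching $cssSelector is present in the DOM.
 * (Hypothetical helper; not part of php-webdriver.)
 */
function fetchRenderedHtml(
    string $serverUrl,
    string $pageUrl,
    string $cssSelector,
    int $timeoutSeconds = 10
): string {
    $driver = RemoteWebDriver::create($serverUrl, DesiredCapabilities::phantomjs());
    try {
        $driver->get($pageUrl);
        // Poll every 500 ms; until() throws if the timeout expires first.
        $driver->wait($timeoutSeconds, 500)->until(
            WebDriverExpectedCondition::presenceOfElementLocated(
                WebDriverBy::cssSelector($cssSelector)
            )
        );
        return $driver->getPageSource();
    } finally {
        // Always close the browser session, even when the wait fails.
        $driver->quit();
    }
}
```

It could be called as `fetchRenderedHtml('localhost:8910', $url, 'table')`, where `'table'` is a placeholder selector. Waiting for a specific element means the script returns as soon as the data arrives and fails loudly when it never does, instead of silently returning a half-rendered page.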
- I got it working through the php-phantomjs library. It worked for a similar link, highlandgold.com/home/investors/regulatory-news , which uses the same source for the iframe, but for some reason it did not give the same result for ir.debenhams.com/news-releases : everything downloads, but the text is not filled in. Are there any ideas or recommendations on using the library? I tried enabling lazy mode and setting a 12-second wait. - Konstantin Petrov
- @Konstantin Petrov there is an iframe there as well. A parent page cannot access the iframe content if the document is loaded from a different domain. Think of the iframe as a separate window (browser tab). Your spider can simply treat frames as links, that is, you need to parse the page specified in src. - Yegor Banin
- I tried it. For these two sites, I downloaded the data from the iframe via PhantomJS. But then something odd turned up: the sites link to tools.eurolandir.com/tools/pressreleases/… and tools.eurolandir.com/tools/pressreleases/… respectively, which appears to be the same site, yet the first link downloads completely while the second comes back with unfilled div elements (and I need them). Do you have any idea what the difference between them is? - Konstantin Petrov
- @Konstantin Petrov how do you check that the elements are empty? I just tried it, and both links load fine for me. Maybe the JS does not have enough time to run (I wait 2 seconds)? - Yegor Banin
- I checked the start of the downloaded file and inspected by hand what it downloads. I put the code here: ru.stackoverflow.com/questions/898286/… - Konstantin Petrov
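The suggestion in the comments to treat iframes as links can be sketched roughly as follows: collect every iframe `src` from the parent page, turn it into an absolute URL, and then load each of those URLs as a page in its own right. The function name `extractIframeUrls` is hypothetical, the base URL is assumed to be absolute, and the relative-URL handling is deliberately simplified (it does not normalise `../` segments).

```php
<?php
/**
 * Return the absolute URLs of all iframes found in an HTML document.
 * $baseUrl must be the absolute URL the HTML was downloaded from.
 */
function extractIframeUrls(string $html, string $baseUrl): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parse warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host'];
    // Directory part of the base path, e.g. "/a/b" gives "/a/".
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    $urls = [];
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = trim($iframe->getAttribute('src'));
        if ($src === '') {
            continue; // iframe without a usable src
        }
        if (preg_match('#^https?://#i', $src)) {
            $urls[] = $src;                          // already absolute
        } elseif (strpos($src, '//') === 0) {
            $urls[] = $base['scheme'] . ':' . $src;  // protocol-relative
        } elseif ($src[0] === '/') {
            $urls[] = $origin . $src;                // root-relative
        } else {
            $urls[] = $origin . $dir . $src;         // path-relative
        }
    }
    return $urls;
}
```

Each returned URL can then be loaded through PhantomJS in the same way as the parent page, since the frame's content is rendered by its own JS on its own domain.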