Download a site via PHP (for parsing) with all of its text content

Question

Hello!
I am trying to write a parser for financial news sites, and I got stuck on this one: http://ir.debenhams.com/news-releases .
I want to extract a table that is loaded into that page from http://tools.eurolandir.com/tools/pressreleases/?companycode=uk-deb&lang=en-gb , where the table itself is filled in by an AJAX request.
The problem is that when I request the page with Simple HTML DOM, the table is not in the response. I also tried loading everything with curl, but either I did not set the right options or I did something else wrong. The data is missing even when I download directly from the second site.
Please tell me at least which direction to dig in.

It is much easier to load tools.eurolandir.com/tools/pressreleases/... directly from PHP and parse the response from that address.
So the problem is that even when loading directly from the second site, the page is not fully populated, because the site fills in its own content after the initial load.
You can parse the page inside the iframe and get the necessary information from that frame.
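
A minimal sketch of the "parse the iframe" suggestion, assuming the parent page has already been downloaded into a string. The function name `extractIframeSources` and the inlined sample HTML are hypothetical; only PHP's bundled DOM extension is used. It lists the `src` of every iframe so that each one can then be requested as an ordinary page.

```php
<?php
declare(strict_types=1);

/**
 * Return the src attribute of every <iframe> found in an HTML string.
 * (Hypothetical helper, not part of any library mentioned in this thread.)
 */
function extractIframeSources(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = trim($iframe->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Assumed parent page, inlined so the example runs without network access.
$parentHtml = '<html><body><h1>News</h1>'
    . '<iframe src="http://tools.eurolandir.com/tools/pressreleases/?companycode=uk-deb&amp;lang=en-gb"></iframe>'
    . '</body></html>';

foreach (extractIframeSources($parentHtml) as $src) {
    echo $src, PHP_EOL;
}
```

Note that this only finds the frame address. If the framed page builds its table with JavaScript, a plain HTTP request to that address will still return the unfilled markup.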

Accepted Answer · 2018-10-24T14:54:16

If the page content is generated by JavaScript, then that JavaScript has to be executed to get the HTML you see in the browser. For automation tasks like this there is a headless browser (one without a graphical interface): PhantomJS.

Here is an easy way to try it out:

Run PhantomJS in Docker:

 docker run -d -p 8910:8910 wernight/phantomjs phantomjs --webdriver=8910

Now you have a local WebDriver endpoint on localhost:8910, which is convenient to drive with the php-webdriver library, for example.

 <?php
 require __DIR__ . '/vendor/autoload.php';

 // Connect to the PhantomJS WebDriver endpoint started above.
 $driver = \Facebook\WebDriver\Remote\RemoteWebDriver::create(
     'localhost:8910',
     \Facebook\WebDriver\Remote\DesiredCapabilities::phantomjs()
 );

 $driver->get('http://tools.eurolandir.com/tools/pressreleases/?companycode=uk-deb&lang=en-gb');
 sleep(2); // wait for the JS to finish; you can search for more robust ways to do this

 $html = $driver->getPageSource();
 echo $html;
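
One possible replacement for the fixed `sleep(2)` is to poll until the content you need actually appears. The sketch below is a generic helper in plain PHP; the name `waitUntil`, its default timings, and the stand-in condition are all assumptions made for illustration. php-webdriver also provides its own `$driver->wait()` mechanism, which is probably the better choice once you know which element to wait for.

```php
<?php
declare(strict_types=1);

/**
 * Call $condition repeatedly until it returns a truthy value or the timeout expires.
 * Returns that truthy value, or null on timeout. (Hypothetical helper.)
 */
function waitUntil(callable $condition, float $timeoutSeconds = 10.0, float $intervalSeconds = 0.25)
{
    $deadline = microtime(true) + $timeoutSeconds;
    do {
        $result = $condition();
        if ($result) {
            return $result;
        }
        usleep((int) round($intervalSeconds * 1000000));
    } while (microtime(true) < $deadline);

    return null;
}

// Stand-in condition so the example runs without a browser:
// the "page" becomes ready on the third poll.
// With a real driver the closure could instead call $driver->getPageSource()
// and return the HTML once an expected marker (for example '<table') is present.
$polls = 0;
$html = waitUntil(function () use (&$polls) {
    $polls++;
    return $polls >= 3 ? '<table><tr><td>ready</td></tr></table>' : false;
}, 2.0, 0.01);

echo $html === null ? 'timed out' : "ready after {$polls} polls", PHP_EOL;
```

Polling returns as soon as the data is there and fails explicitly on timeout, whereas a fixed sleep either wastes time or silently returns a half-rendered page.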

If you have not worked with Docker yet, you can install PhantomJS on your system and drive it with a library such as jonnyw/php-phantomjs (it seems it can even install PhantomJS itself).

It worked for a similar page, highlandgold.com/home/investors/regulatory-news, which uses the same source for its iframe, but for some reason it did not work for ir.debenhams.com/news-releases: everything downloads, but the text is not filled in. Are there any ideas or recommendations on using the library? I tried enabling lazy mode and setting a 12-second wait.
A parent page cannot access the content of an iframe if the framed document is loaded from a different domain.
Your spider can simply treat frames as links; that is, you need to parse the page specified in src.
For these two sites, I downloaded the data from the iframe via PhantomJS.
But then I ran into odd sites that link to tools.eurolandir.com/tools/pressreleases/… and tools.eurolandir.com/tools/pressreleases/… respectively. They appear to be the same site, yet the first link downloads completely while the second comes back with unfilled div elements (and I need their content).
Please tell me if you have any ideas: what is the difference between them?
@Konstantin Petrov how do you check that the elements are empty?
I checked what was downloaded at the start of the file and consulted the manual about what it downloads; I have put the code here.
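
One way to answer the "are the elements really empty?" question is to count them in the downloaded HTML rather than inspect it by eye. This is only a sketch: the function name `countEmptyDivs` and the sample snapshot are hypothetical, and "empty" is assumed to mean "contains no text other than whitespace".

```php
<?php
declare(strict_types=1);

/**
 * Count <div> elements that contain no text at all (ignoring whitespace).
 * Returns [emptyCount, totalCount]. (Hypothetical helper.)
 */
function countEmptyDivs(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $empty = 0;
    $total = 0;
    foreach ($doc->getElementsByTagName('div') as $div) {
        $total++;
        // textContent includes the text of all descendants.
        if (trim($div->textContent) === '') {
            $empty++;
        }
    }
    return [$empty, $total];
}

// Assumed snapshot; in practice this would be the string returned by getPageSource().
$snapshot = '<html><body>'
    . '<div class="title">Results</div><div class="date"> </div><div class="body"></div>'
    . '</body></html>';

[$empty, $total] = countEmptyDivs($snapshot);
echo "{$empty} of {$total} div elements have no text", PHP_EOL;
```

Running the same check on both pages, before and after a longer wait, should show whether the second page is simply slower to fill in or never receives its data at all.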
