Getting the HTML code of a page in PHP after JavaScript and other client scripts have run

Question

It is hard to check external (relative to PHP) HTML pages for, say, obscene expressions that a user would actually see in the browser window. For example, suppose the page example.com/index.html has this code embedded:

    var1 = 'Нец'; var2 = 'ензу'; var3 = 'ра';
    document.write(var1 + var2 + var3); // writes "Нецензура" (roughly "obscenity")

See for yourself what happens, that is, what the user sees!
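To make the problem concrete, here is a minimal sketch in plain JavaScript of why checking the raw source is not enough. The `fakeDocument` object is a hypothetical stand-in for the browser's `document`, introduced only so the snippet runs outside a browser; the variable names are illustrative too.

```javascript
// Hypothetical stand-in for the browser's document object (illustration only).
const written = [];
const fakeDocument = { write: (s) => written.push(s) };

// The page script as a static check (e.g. a plain HTTP fetch) would see it.
const source = "var a = 'Нец'; var b = 'ензу'; var c = 'ра'; doc.write(a + b + c);";

// Execute the script roughly as a browser would, with doc bound to the stand-in.
new Function('doc', source)(fakeDocument);

const rendered = written.join('');
const staticHit = source.includes('Нецензура');     // the word is split across literals
const renderedHit = rendered.includes('Нецензура'); // the word exists only after execution
console.log(staticHit, renderedHit);
```

The word never appears contiguously in the source, so a search over the downloaded file misses it, while the text written after execution contains it. This is why the page has to be rendered by something that actually runs the scripts.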

What I studied before writing this question: 1. PhantomJS. It produces a perfect picture as a PDF, i.e. it renders the HTML page correctly, but I have no way to analyze the resulting PDF files. I also tried page.content, in various versions of this script:

    var webPage = require('webpage');
    var page = webPage.create();

    page.open('http://www.phantomjs.org', function (status) {
        if (status !== 'success') {
            console.log(page.content);
            console.log('Unable to load the address!');
            phantom.exit();
        } else {
            // Change timeout as required to allow sufficient time
            window.setTimeout(function () {
                console.log(page.content);
                phantom.exit();
            }, 20000);
        }
    });

It returns ONLY the end of the HTML file, and since PhantomJS has NO documentation, I cannot do anything about it.

Third, calling PhantomJS requires creating separate JS files, which also looks very clumsy.

Please tell me how to read the HTML after the JavaScript has run, from inside PHP (on FreeBSD). I need to get, in PHP, a string containing the TEXT, the LINKS to external objects (i.e. URLs), and the HTML formatting.

I do not know how you are going to use it, but one option is Selenium.
Examples can be found here. It is generally used for testing, but perhaps this option will suit you.
Selenium does not suit me: it is meant for testing a single page, while I need to process thousands of URLs on limited resources. Opening browsers and using the Java engine takes a very long time.
As for PhantomJS, there is very little documentation (!). Here is an example, a "TODO" that someone has to do someday :) phantomjs.org/api/webpage/method/get-page.html
There is NOTHING on using this method or the others. Study the site carefully and you will see that there is no documentation for serious work.

Dennis dennis 82 ten · Answer 1 · 2016-01-31T12:26:28

After a painful search, I found the simplest solution: 1. To get the complete data from PhantomJS into PHP, namely the "rendered" HTML of the site, you need to use shell_exec specifically and pass the parameters (i.e. the URL you want to process) after the script path, separated by a space. PHP:

    $phantom_script = dirname(__FILE__) . '/get-website.js' . ' http://google.com';
    $response = shell_exec('phantomjs ' . $phantom_script);

and in the JS file get-website.js:

    var args = require('system').args;
    var webPage = require('webpage');
    var page = webPage.create();
    var address = args[1];

    page.open(address, function (status) {
        if (status !== 'success') {
            console.log(page.content); // content = Null
            console.log('Unable to load the address! PHP');
            phantom.exit();
        } else {
            // Change timeout as required to allow sufficient time
            window.setTimeout(function () {
                console.log(page.content);
                phantom.exit();
            }, 1000);
        }
    });
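As a possible refinement, here is a sketch of the same script with the URL argument validated before the page is opened, so that a missing parameter from the PHP side fails fast. The helper name `pickAddress`, the http(s)-only check, the usage message, and the non-zero exit code are all assumptions added for illustration; they are not part of the original answer. The PhantomJS-specific part is guarded so that it only runs where the global `phantom` object exists.

```javascript
// Hypothetical helper: pick the target URL out of the argument list.
// In PhantomJS, system.args[0] is the script name and args[1] the first parameter.
function pickAddress(args) {
  var address = args[1];
  // Assumption for this sketch: only http(s) URLs are accepted.
  if (typeof address !== 'string' || !/^https?:\/\//.test(address)) {
    return null; // missing argument or not an http(s) URL
  }
  return address;
}

// Everything below runs only under PhantomJS, where `phantom` is defined.
if (typeof phantom !== 'undefined') {
  var address = pickAddress(require('system').args);
  if (address === null) {
    console.log('Usage: phantomjs get-website.js <url>');
    phantom.exit(1);
  } else {
    var page = require('webpage').create();
    page.open(address, function (status) {
      if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit(1);
        return;
      }
      // Give client-side scripts time to run before reading the DOM.
      window.setTimeout(function () {
        console.log(page.content);
        phantom.exit();
      }, 1000);
    });
  }
}
```

Note that shell_exec returns whatever the script prints, so on failure the PHP side would receive the error message rather than HTML; the caller would need to tell the two apart.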

The only remaining problem is Cyrillic: after the content is received, even in UTF-8, it is hopelessly garbled. But that is a separate question.
