Page parsing

Question

Good day! I need to pull some data from a site, but the request fails with an error:

Warning: file_get_contents(http://whois.domaintools.com/195.90.131.231) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden

I am trying to load the page with simple_html_dom:

$site = file_get_html("http://whois.domaintools.com/$ip");

How do I solve this problem?

Try fetching the page body over a socket and then passing it to simple_html_dom; write a real browser's headers to the socket.
I tried opening it through curl, but the result is the same:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "whois.domaintools.com/$ip");
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0');
curl_setopt($ch, CURLOPT_REFERER, "whois.domaintools.com/$ip");
curl_exec($ch);
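For comparison, a fuller version of a cURL attempt like the one above might look as follows. This is only a minimal sketch: the helper names `makeBrowserCurl` and `fetchWithCurl` are hypothetical, the header values are illustrative, and nothing here guarantees that the site will accept the request. It adds an explicit scheme in the URL, returns the body instead of printing it, and reports the HTTP status so a 403 is visible to the caller.

```php
<?php
// Hypothetical helper: configure a cURL handle with browser-like headers.
function makeBrowserCurl(string $url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects, if any
        CURLOPT_ENCODING       => '',   // accept and decode any supported compression
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);
    return $ch;
}

// Hypothetical helper: run the request and return [HTTP status, body or null].
// A status of 0 means no HTTP response was received at all.
function fetchWithCurl(string $url): array
{
    $ch = makeBrowserCurl($url);
    $body = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$status, $body === false ? null : $body];
}
```

A call such as `[$status, $html] = fetchWithCurl("http://whois.domaintools.com/$ip");` would then let you check `$status` before passing `$html` to the parser.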

Accepted Answer · 2013-04-13T14:21:23

Here you go, sign for it.

<pre><?php
// Browser-like request headers, sent in place of PHP's defaults.
$headers = "Host: whois.domaintools.com\r\n"
    . "Accept: */*\r\n"
    . "Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4\r\n"
    . "Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.3\r\n"
    . "Cache-Control: max-age=0\r\n"
    . "Accept-Encoding: gzip,deflate,sdch\r\n"
    . "User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/17.0.963.79 Chrome/17.0.963.79 Safari/535.11\r\n";

$sock = fsockopen('whois.domaintools.com', 80);
if (!$sock) {
    die("Could not connect\n");
}
// $headers already ends with CRLF, so one more CRLF terminates the request.
$query = "GET /195.90.131.231 HTTP/1.0\r\n" . $headers . "\r\n";
fwrite($sock, $query);

$res = "";
while (!feof($sock)) {
    $res .= fread($sock, 2048);
}
fclose($sock);

// Split headers from body only once: the gzip body may itself contain "\r\n\r\n".
$sp = explode("\r\n\r\n", $res, 2);
echo htmlspecialchars(gunzip($sp[1])) . "\r\n";

// Strip the 10-byte gzip header and 8-byte trailer, then inflate the raw deflate stream.
function gunzip($zipped)
{
    $offset = 0;
    if (substr($zipped, 0, 2) == "\x1f\x8b") {
        $offset = 2;
    }
    if (substr($zipped, $offset, 1) == "\x08") {
        return gzinflate(substr($zipped, $offset + 8, -8));
    }
    return "Unknown Format";
}
?></pre>

cURL and file_get_contents send headers whose user agent gives the client away as PHP, and the site apparently returns 403 because of that user agent.
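If the user agent really is the cause, the same browser-like headers can also be sent without a raw socket by giving `file_get_contents` a stream context. The following is a minimal sketch under that assumption: `browserHeaders` and `fetchWithHeaders` are hypothetical helper names, the header values are illustrative, and `allow_url_fopen` must be enabled for HTTP URLs.

```php
<?php
// Hypothetical helper: a small set of browser-like request headers.
function browserHeaders(): array
{
    return [
        'Accept: */*',
        'Accept-Language: en-US,en;q=0.8',
        'User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11',
    ];
}

// Hypothetical helper: fetch a URL with custom headers via a stream context.
// Returns the response body, or null if nothing could be read.
function fetchWithHeaders(string $url, array $headers): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'        => 'GET',
            'header'        => implode("\r\n", $headers) . "\r\n",
            'ignore_errors' => true, // still return the body on 4xx/5xx responses
            'timeout'       => 10,
        ],
    ]);
    // @ suppresses the warning; failure is reported through the null return.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

Usage would be along the lines of `$html = fetchWithHeaders("http://whois.domaintools.com/$ip", browserHeaders());`, after which a non-null `$html` can be handed to `str_get_html()` from simple_html_dom.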

Right, right, and who was the API written for... Or am I just nitpicking now?
@klopp, well, the author wanted to pull text out of HTML, so I gave him an example of how to get the HTML; the fact that he chose the wrong approach is his problem.
You pointed him to the right path, and I helped him down the wrong one :)
Well, if you think about it, the more people hack things together like this, the higher our salaries :)
If you open the example now, you will see a message along these lines:
> Thank you for using DomainTools for your domain research.
> To protect registrants, the number of Whois lookups that are allowed is limited.
> You can create an account or log in to your account before doing more.

Answer 2 · 2013-04-13T14:00:39

To start with, grab the first tool of any spider: wget. Run it and look:

wget -S "http://whois.domaintools.com/195.90.131.231"
--17:54:15--  http://whois.domaintools.com/195.90.131.231
           => `195.90.131.231'
Resolving whois.domaintools.com... done.
Connecting to whois.domaintools.com[199.93.60.254]:80... connected.
HTTP request sent, awaiting response...
 1 HTTP/1.1 403 Forbidden
 2 Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
 3 Pragma: no-cache
 4 Set-Cookie: csrftoken=a8346261cd9fdd81fa9af22a80be95ca; path=/; domain=.domaintools.com
 5 Set-Cookie: dtsession=ak67m6mt66fn012ppi629vmdc3; expires=Tue, 11-Apr-2023 13:54:15 GMT; path=/; domain=.domaintools.com
 6 Content-Type: text/html
 7 Expires: Thu, 19 Nov 1981 08:52:00 GMT
 8 Server: lighttpd/1.4.30
 9 Date: Sat, 13 Apr 2013 13:54:16 GMT
10 Connection: close

17:54:16 ERROR 403: Forbidden.

Think about it.
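The same check that `wget -S` does by eye can be made from PHP before anything is handed to a parser. Below is a minimal sketch, assuming `allow_url_fopen` is enabled; the helper names `statusCodeFromHeaders` and `fetchWithStatus` are hypothetical. It relies on the HTTP stream wrapper exposing the raw response header lines under `wrapper_data` in `stream_get_meta_data()`.

```php
<?php
// Hypothetical helper: extract the HTTP status code from raw response header lines.
function statusCodeFromHeaders(array $headerLines): ?int
{
    $code = null;
    foreach ($headerLines as $line) {
        // A status line looks like "HTTP/1.1 403 Forbidden". After redirects
        // there can be several status lines, so the last one wins.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: fetch a URL and return [status code or null, body or null].
function fetchWithStatus(string $url): array
{
    $context = stream_context_create([
        'http' => ['ignore_errors' => true, 'timeout' => 10],
    ]);
    $fp = @fopen($url, 'r', false, $context);
    if ($fp === false) {
        return [null, null];
    }
    $meta = stream_get_meta_data($fp);   // header lines live in 'wrapper_data'
    $body = stream_get_contents($fp);
    fclose($fp);
    $headerLines = isset($meta['wrapper_data']) && is_array($meta['wrapper_data'])
        ? $meta['wrapper_data']
        : [];
    return [statusCodeFromHeaders($headerLines), $body === false ? null : $body];
}
```

With this, a 403 like the one in the wget output shows up as `$status === 403` instead of as a warning from `file_get_contents`.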

Better still, stop messing around and read up on the domaintools.com API; everything is described there in great detail. As homework, I suggest finding the link to the documentation on their website. If it takes you more than two clicks and 10 seconds, well, that's sad...
