There is a site where, in one of its sections, you can enter a number into a search field (which makes experimenting easy). If the number is in the database, a page with certain data is displayed; if it is not, you get an error message.

The database holds only about 250-300 thousand pages. I need to write an algorithm that iterates over the numbers and exports the information (for example, into a CSV file). No captcha or other protection against automated requests was detected. Each page has a URL of the form \Member\XXX (where XXX is the number).

I have never done anything like this. So far, Google suggests first downloading the entire directory of interest to the local machine (via HTTrack) and then parsing each page for the information I need. Can someone tell me whether I am thinking in the right direction, or are there better ways to get this data?

Update from 11/24/15:

So far I have gotten as far as downloading individual pages using:

<?php
$limit = 10;
for ($i = 1; $i <= $limit; $i++) {
    $html = file_get_contents('http://site.ru/Member/Detail/' . $i);
    $handle = fopen("$i.html", 'a+');
    fwrite($handle, $html);
    fclose($handle);
}

Can I save the pages not into the project folder but into some subdirectory? If I also extract the information I need and save it to a file inside the same loop, will anything bad happen? Will 300k pages be processed without problems?
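A minimal sketch of what this might look like, assuming a pages/ subdirectory, a members.csv output file and a placeholder <title> extraction (these names and the regex are illustrative, not taken from the real site):

<?php
// Sketch: save each page into a subdirectory and append a CSV row in the same loop.
// The directory name, the CSV file name and the extraction regex are assumptions.
$limit = 10;
$dir = __DIR__ . '/pages';
if (!is_dir($dir)) {
    mkdir($dir, 0777, true);                        // create the subdirectory once
}
$csv = fopen(__DIR__ . '/members.csv', 'a');
for ($i = 1; $i <= $limit; $i++) {
    $html = @file_get_contents('http://site.ru/Member/Detail/' . $i);
    if ($html === false) {
        continue;                                   // skip numbers that are not in the database
    }
    file_put_contents("$dir/$i.html", $html);       // save into the subdirectory
    if (preg_match('~<title>(.*?)</title>~s', $html, $m)) {
        fputcsv($csv, [$i, trim($m[1])]);           // placeholder field written inside the loop
    }
}
fclose($csv);

Appending to the CSV inside the loop is not a problem in itself; with 300k pages the download time, not the file writes, will be the bottleneck.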

2) While parsing, I ran into the fact that in place of the email address there are many decoy links, and the correct address is hidden among them using CSS rules, each time in a different place (I hope I explained that clearly). How can this be worked around? What literature is there on more advanced parsing?
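One common variant of this trick is that the decoy links are hidden with something like an inline display:none while the real one stays visible. The sketch below assumes exactly that (the assumption must be checked against the real page source) and uses DOMDocument/DOMXPath rather than regular expressions:

<?php
// Sketch: pick out the visible e-mail link among decoys, assuming the decoys
// are hidden with an inline "display:none" style. This is an assumption about
// the site; the real CSS rules may differ and must be checked in the page source.
$html = file_get_contents('pages/1.html');
$doc = new DOMDocument();
@$doc->loadHTML($html);                             // suppress warnings from messy markup
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[starts-with(@href, "mailto:")]') as $a) {
    $style = str_replace(' ', '', strtolower($a->getAttribute('style')));
    if (strpos($style, 'display:none') === false) {
        echo substr($a->getAttribute('href'), 7), "\n";   // the visible address
    }
}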

  • If you are given an exhaustive answer, mark it as correct (the checkmark next to the selected answer). - Nicolas Chabanovsky

2 answers

I think it is better to download everything to the local disk first, and only then parse it.

Let me explain.
You definitely have to go through 300 thousand pages; that is not up for discussion. In any case, you will have to "visit" each page and read its source. But after that you either save it locally, or parse it, analyze it, write it into the CSV and "forget" about it.

First, and most importantly, there is the debugging of the parsing itself. How many times will you have to run the page-parsing code before you get the correct result? Countless times. If you do not keep the pages locally, that means constant web traffic, constant requests, delays (depending on your connection speed and the server's response time)... And, as Qwertiy wrote, they can simply ban your IP.
None of this is a problem if the pages are stored locally.

Secondly, since each page will be an .html file on the local disk, the files can be manipulated. For example, the parsing code does its job, saves what it found to the CSV, and then takes the original .html and moves it to another folder named сделано ("done").
This makes it easier to track the parser's progress after debugging, once you start the full run. And if something breaks somewhere in the middle, or one of the files turns out to be faulty, you will not have to re-process the pages that were already handled.
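A rough sketch of that workflow under the same assumptions as above (the folder names, the CSV name and the extraction step are placeholders):

<?php
// Sketch: parse every saved page, append a row to the CSV, then move the
// original .html into a "done" folder so an interrupted run can be resumed.
$done = __DIR__ . '/done';
if (!is_dir($done)) {
    mkdir($done, 0777, true);
}
$csv = fopen(__DIR__ . '/members.csv', 'a');
foreach (glob(__DIR__ . '/pages/*.html') as $file) {
    $html = file_get_contents($file);
    // placeholder extraction: the real fields depend on the page markup
    if (preg_match('~<title>(.*?)</title>~s', $html, $m)) {
        fputcsv($csv, [basename($file, '.html'), trim($m[1])]);
    }
    rename($file, $done . '/' . basename($file));   // mark the page as processed
}
fclose($csv);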

And thirdly, since everything is local, you can even debug and run the script on the go, without a connection :) After all, 300 thousand records will not be processed in 5 minutes :)

In short, my advice: download everything to the local disk and parse it from there.

  • Regarding this, my answer already said: "You can save or not save as needed." I am not against downloading as such, but against using dedicated software for it. That is, you need 300K pages, and you run a site-downloading utility that, besides them, will pull down several times more pages that you do not need, plus all the scripts and images. Why, if you can download only what is required? - Qwertiy
  • Can you give a link to examples / documentation for the process you describe? I have never parsed more than a single page. The points I do not quite understand: 0) downloading the information to the local machine - will HTTrack do, or is there something more effective? 1) parsing and moving from one page to the next (i.e. the iteration algorithm); 2) saving / appending the information of interest into a single CSV file. - Bear
  • 0) Since you have a ready-made list of URLs \Member\XXX and you only need to download those pages, without any dependent pages, HTTrack or Teleport will not suit you: they download sites in their entirety. It is better to write a PHP script that goes through all the URLs and downloads them (see the documentation for file_get_contents()). 1) Moving between pages should not worry you: all the downloaded pages will be in one local folder - just take them one by one and parse them. 2) Parsing and saving works as described above. You will need regular expressions and file-system functions such as fputcsv(). - cyadvert
  • There is hardly anything specific in the documentation about this. Usually people download particular pages for particular needs, so there is little written about it - everyone writes their own. In general terms: take a URL, fetch its contents with file_get_contents(), save it locally with fopen(), fwrite(), fclose(), then move on to the next URL, and so on. - cyadvert
  • Thanks, tomorrow I will work through what you have suggested. An immediate question: how do I implement a loop in PHP that iterates the value in the link \Member\XXX, i.e. replaces XXX with a counter running from 1 to ~300k and creates a local HTML copy for each value? (See the sketch after these comments.) - Bear
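Regarding the loop over XXX: a sketch of the full-range download, with a request timeout and a short pause so 300k requests do not hammer the server (the range, the timeout and the pause length are assumptions to tune):

<?php
// Sketch: iterate the XXX part of /Member/Detail/XXX and save each existing
// page locally. Missing numbers are skipped; timeout and pause are assumptions.
$context = stream_context_create([
    'http' => ['timeout' => 10],                    // give up on a page after 10 seconds
]);
if (!is_dir(__DIR__ . '/pages')) {
    mkdir(__DIR__ . '/pages', 0777, true);
}
for ($i = 1; $i <= 300000; $i++) {
    $html = @file_get_contents('http://site.ru/Member/Detail/' . $i, false, $context);
    if ($html === false) {
        continue;                                   // number not in the database, or the request failed
    }
    file_put_contents(__DIR__ . "/pages/$i.html", $html);
    usleep(200000);                                 // 0.2 s pause between requests
}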

Well, you still have to download the pages, but you can do it yourself programmatically rather than pulling down the whole site with dedicated programs. You simply request the right addresses, get the page markup and parse it right away. Whether to also save the pages is up to you.

As for protection against robots: they can still ban your IP. One wallpaper site did exactly that after about 3 thousand downloads. It is solved easily - you use free proxy servers from a public list.
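A minimal sketch of this fetch-and-parse-in-flight approach, rotating through a list of proxies (the proxy addresses and the extraction regex are purely illustrative assumptions):

<?php
// Sketch: fetch each page through a rotating proxy and parse it immediately,
// writing straight to CSV without saving the HTML. Proxies and regex are assumptions.
$proxies = ['tcp://1.2.3.4:8080', 'tcp://5.6.7.8:3128'];    // hypothetical proxy list
$csv = fopen('members.csv', 'a');
for ($i = 1; $i <= 300000; $i++) {
    $context = stream_context_create([
        'http' => [
            'proxy'           => $proxies[$i % count($proxies)],
            'request_fulluri' => true,              // required by most HTTP proxies
            'timeout'         => 10,
        ],
    ]);
    $html = @file_get_contents('http://site.ru/Member/Detail/' . $i, false, $context);
    if ($html === false) {
        continue;                                   // missing number or a dead proxy
    }
    if (preg_match('~<title>(.*?)</title>~s', $html, $m)) {
        fputcsv($csv, [$i, trim($m[1])]);           // placeholder field
    }
}
fclose($csv);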

  • Could you explain in more detail how to download only the directory of pages that interests me? Links to documentation / examples / videos would be very helpful. - Bear
  • @Bear, you simply load the pages in a loop. - Qwertiy