There is a site where, in one of its sections, you can enter a number into a search field (which is convenient for experimenting). If the number is in the database, a page with certain data is displayed; if it is not, an error message appears.
The database holds only about 250-300 thousand pages. I need to write an algorithm that iterates over the numbers and exports the information (for example, into a CSV file). I did not detect a captcha or any other protection against automated requests. Each page has a URL of the form /Member/XXX (where XXX is a number).
I have never done anything like this before. So far, Google suggests first downloading the entire section of interest to the local machine (via HTTrack) and then parsing each page for the information I need. Can someone tell me whether I am thinking in the right direction, or are there better ways to get this data?
Update from 11/24/15:
So far I have settled on downloading the individual pages like this:
<?php
$limit = 10;
for ($i = 1; $i <= $limit; $i++) {
    $html = file_get_contents('http://site.ru/Member/Detail/' . $i);
    $handle = fopen("$i.html", 'a+');
    fwrite($handle, $html);
    fclose($handle);
}

1) Can I save the pages not into the project folder but into some subdirectory? And if I also search for and save the information of interest to a file right inside the loop, will anything bad happen? Will 300 thousand pages be processed normally? (A rough sketch of what I mean is below, after question 2.)
2) While parsing, I ran into the fact that in place of the email address there are a lot of junk links, and the real address is hidden among them with CSS rules, each time in a different place (I hope I explained that correctly). How can this be worked around? And is there any literature on more advanced parsing? (A sketch of the approach I am considering is below.)
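
For question 1, here is a minimal sketch of what I have in mind. The pages subdirectory, the members.csv file and the <h1> field are only placeholders I made up for illustration, not the real markup:

<?php
// Rough sketch for question 1: save each raw page into a subdirectory and
// append the extracted data to a CSV right inside the same loop.
$limit = 300000;
$dir = __DIR__ . '/pages';              // any subdirectory works, as long as it exists
if (!is_dir($dir)) {
    mkdir($dir, 0777, true);
}

$csv = fopen(__DIR__ . '/members.csv', 'a');

for ($i = 1; $i <= $limit; $i++) {
    $html = @file_get_contents('http://site.ru/Member/Detail/' . $i);
    if ($html === false) {
        continue;                       // number not in the database or request failed
    }

    // save the raw page into the subdirectory
    file_put_contents($dir . '/' . $i . '.html', $html);

    // pull one field out right inside the loop and append it to the CSV;
    // the <h1> selector is only an example of a piece of data to extract
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $h1 = $doc->getElementsByTagName('h1');
    $name = $h1->length ? trim($h1->item(0)->textContent) : '';
    fputcsv($csv, [$i, $name]);
}

fclose($csv);

Opening the CSV in append mode should let the script be restarted without losing what was already written (as long as the starting $i is adjusted), and memory should stay flat because each page is processed and discarded inside the loop; 300 thousand sequential requests will simply take a long time.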
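
For question 2, here is a rough sketch of one approach I am considering, assuming the decoy links are hidden by per-page CSS classes declared in a <style> block (e.g. ".x1f { display: none; }"). The extractVisibleEmail name, the "email" container class and the file path in the usage comment are all my own guesses, not taken from the real page:

<?php
// Collect the class names that the page's own CSS marks as display: none,
// then keep only the element whose text is a valid, visible email address.
function extractVisibleEmail($html)
{
    preg_match_all('/\.([\w-]+)\s*\{[^}]*display\s*:\s*none/i', $html, $m);
    $hidden = array_flip($m[1]);

    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // walk every element inside the (assumed) email container
    foreach ($xpath->query('//*[contains(@class, "email")]//*') as $node) {
        // skip elements hidden by one of the collected classes
        $classes = preg_split('/\s+/', $node->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($classes as $c) {
            if (isset($hidden[$c])) {
                continue 2;
            }
        }
        // skip elements hidden by an inline style
        if (preg_match('/display\s*:\s*none/i', $node->getAttribute('style'))) {
            continue;
        }
        $text = trim($node->textContent);
        if (filter_var($text, FILTER_VALIDATE_EMAIL)) {
            return $text;
        }
    }
    return null;
}

// usage on an already-downloaded page (the path is just an example):
// $email = extractVisibleEmail(file_get_contents(__DIR__ . '/pages/123.html'));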