Hello. I have a service that stores prices from different retail outlets. At the moment the prices have to be updated manually, which takes a lot of time, so I want to write a parser. It would work like this: you specify a link to the product and the class/id of the tag that contains the price; the script runs and extracts the price from the site. I plan to use file_get_contents() + Symfony DomCrawler. The question itself: since every site has its own layout, is the approach described above suitable for parsing different sites, or is there a better solution?
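A minimal sketch of the idea the question describes, shown here in Python with only the standard library (the asker plans PHP with file_get_contents() + DomCrawler; the class name and HTML fragment below are made up for illustration):

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects the text of the first element whose class attribute
    contains the target class (e.g. the tag holding the price)."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._capturing = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.price is None and self.target_class in classes:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.price = data.strip()
            self._capturing = False

# Hypothetical page fragment; in practice the HTML would come from
# fetching the product URL (urllib.request.urlopen, or file_get_contents in PHP).
html = '<div><span class="product-price">1 299 руб.</span></div>'
parser = PriceParser("product-price")
parser.feed(html)
print(parser.price)  # the raw price text
```

The same "URL + selector" pair per site is exactly what DomCrawler's CSS filters give you in PHP; the weak point on any stack is that the selector breaks whenever a seller changes layout.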

Closed because the question is too general for the participants. D-side, Denis, aleksandr barakin, cheops, user207618, 19 Sep '16 at 20:59.

Please edit the question so that it describes a specific problem in enough detail to determine an appropriate answer. Do not ask several questions at once. See "How to ask a good question?" for guidance. If the question can be reworded to follow the rules set out in the help center, edit it.

  • Are the owners of the retail outlets aware that you are using their prices? If so, it would be better to arrange with them to export the prices from their side. After all, parsing a site is expensive in resources: daily requests to every page of a site amount to a small DDoS attack, and if a site has several tens of thousands of pages, a single update can take hours. - terantul
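If scraping goes ahead anyway, the load concern in the comment above can be softened by throttling requests. A minimal stdlib sketch (the delay value and the stub fetcher are arbitrary examples; a real fetcher would download the page):

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping between requests
    so the target site is not hammered."""
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # pause between consecutive requests
        results.append(fetch(url))
    return results

# Stub fetcher so the sketch runs without network access.
pages = fetch_all(["/a", "/b"], fetch=lambda u: f"<html>{u}</html>", delay=0.01)
print(pages)
```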

3 answers

If you are parsing a single site, regular expressions are fine. If there are several sites, I highly recommend the Content Downloader desktop program: it has very flexible settings with PHP support, writes directly to MySQL, and uploads to FTP. It is not entirely clear from your description how you are going to specify the link to the product: in the product card of your aggregator, or directly during parsing? What tasks does your system need to handle?

  • I don't know how else to explain it) Here's the database structure: goods | sellers | link table (product ID, seller ID, seller's price, link to the seller's website). Right now I do this: pick a product, pick a seller, set a price, and specify a link to the seller's website. What I'd like instead: I pick a seller (who already has prices and links to product cards on his website), launch the parser, and it walks the links and parses the prices. Something like that - Alex_01
  • $content = file_get_contents('link_donor'); $pattern = '/regex that matches the price markup/'; preg_match($pattern, $content, $out); The value will be in $out. You may have to strip the framing tags that were matched along with it - Sergey Strelchenko
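The regex approach from the comment above, sketched in Python (the pattern and markup are made-up examples; in the PHP version, $pattern would hold an equivalent regex):

```python
import re

# Made-up page fragment; in practice this is the downloaded product page.
content = '<span class="price">2 490</span>'

# Capture the digits (and spaces) between the framing price tags,
# so no extra tag-stripping is needed afterwards.
pattern = r'<span class="price">([\d\s]+)</span>'
match = re.search(pattern, content)
if match:
    price = match.group(1).strip()
    print(price)  # -> 2 490
```

Using a capture group for just the number avoids the "clean the framing tags" step the comment mentions, but the pattern still has to be written per site.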

I would recommend using Scrapy. It is Python, of course, not PHP, but parsers of this kind are not written in PHP. You can read about it here: "Collecting data with Scrapy". It is a very handy and flexible tool that parses asynchronously in multiple threads. You can read more at that link or on the official site.

The second option, if the first one does not work for some reason, is to use something like PhantomJS or SlimerJS: "Simple site parsing with SlimerJS".
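The "many pages in parallel" benefit this answer attributes to Scrapy can be sketched with the standard library alone. The fetcher below is a stub with made-up URLs and prices so the sketch runs offline; a real spider would use Scrapy's own request machinery:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_price(url):
    # Stub: a real implementation would download the page and
    # extract the price; here we fake both steps.
    fake_prices = {"/tv": "499", "/phone": "299"}
    return url, fake_prices[url]

urls = ["/tv", "/phone"]
# Threads overlap the (normally network-bound) fetches,
# which is why concurrent crawling is so much faster.
with ThreadPoolExecutor(max_workers=4) as pool:
    prices = dict(pool.map(fetch_price, urls))
print(prices)  # {'/tv': '499', '/phone': '299'}
```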

I used Grab::Spider for these purposes; it is in Python. There is Russian documentation, and examples can be found on Habr. I wrote the queries myself with XPath (the FirePHP module for Firefox helps), or with CSS selectors, though those seem less effective to me. Data obtained this way can be loaded into any database, since Python has plenty of libraries.
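The XPath-query approach from this answer, sketched with Python's built-in ElementTree, which supports a limited XPath subset (the markup is a made-up, well-formed example; real-world HTML usually needs lxml or an HTML parser first, which is what Grab uses internally):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment; lxml-based tools accept messier HTML.
page = ('<html><body><div class="card">'
        '<span class="price">1500</span>'
        '</div></body></html>')
root = ET.fromstring(page)

# Limited XPath: find the span whose class attribute equals "price".
node = root.find('.//span[@class="price"]')
print(node.text)  # -> 1500
```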