Hello. I have a service that stores prices from different retail outlets. At the moment the prices have to be updated manually, which takes a lot of time, so I want to write a parser. It would work like this: you specify a link to the product and the class/id of the tag that contains the price; the script runs and extracts the price from the site. I plan to use file_get_contents() + Symfony DomCrawler. The question itself: since every site has its own layout, is the approach described above suitable for parsing different sites, or is there a better solution?
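A minimal sketch of the idea the question describes, shown here in Python with only the standard library (the asker plans PHP with file_get_contents() + DomCrawler; the class name and HTML fragment below are made up for illustration):

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects the text of the first element whose class attribute
    contains the target class (e.g. the tag holding the price)."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._capturing = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.price is None and self.target_class in classes:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.price = data.strip()
            self._capturing = False

# Hypothetical page fragment; in practice the HTML would come from
# fetching the product URL (urllib.request.urlopen, or file_get_contents in PHP).
html = '<div><span class="product-price">1 299 руб.</span></div>'
parser = PriceParser("product-price")
parser.feed(html)
print(parser.price)  # the raw price text
```

The same "URL + selector" pair per site is exactly what DomCrawler's CSS filters give you in PHP; the weak point on any stack is that the selector breaks whenever a seller changes layout.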

Closed because the question is too general for the participants. D-side, Denis, aleksandr barakin, cheops, user207618, 19 Sep '16 at 20:59.

Please edit the question so that it describes a specific problem in enough detail to determine an appropriate answer. Do not ask several questions at once. See "How to ask a good question?" for guidance. If the question can be reworded to follow the rules set out in the help center, edit it.

  • Are the owners of the retail outlets aware that you are using their prices? If so, it would be better to arrange with them to export the prices from their side. After all, parsing a site is expensive in resources: daily requests to every page of a site amount to a small DDoS attack, and if a site has several tens of thousands of pages, a single update can take hours. - terantul
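If scraping goes ahead anyway, the load concern in the comment above can be softened by throttling requests. A minimal stdlib sketch (the delay value and the stub fetcher are arbitrary examples; a real fetcher would download the page):

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping between requests
    so the target site is not hammered."""
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # pause between consecutive requests
        results.append(fetch(url))
    return results

# Stub fetcher so the sketch runs without network access.
pages = fetch_all(["/a", "/b"], fetch=lambda u: f"<html>{u}</html>", delay=0.01)
print(pages)
```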

3 answers

If you are parsing a single site, regular expressions are fine. If there are several sites, I highly recommend the Content Downloader desktop program: it has very flexible settings with PHP support, writes directly to MySQL, and uploads to FTP. It is not entirely clear from your description how you are going to specify the link to the product: in the product card of your aggregator, or directly during parsing? What tasks does your system need to handle?

  • I don't know how else to explain it) Here's the database structure: goods | sellers | link table (product ID, seller ID, seller's price, link to the seller's website). Right now I do this: pick a product, pick a seller, set a price, and specify a link to the seller's website. What I'd like instead: I pick a seller (who already has prices and links to product cards on his website), launch the parser, and it walks the links and parses the prices. Something like that - Alex_01
  • $content = file_get_contents('link_donor'); $pattern = '/regex that matches the price markup/'; preg_match($pattern, $content, $out); The value will be in $out. You may have to strip the framing tags that were matched along with it - Sergey Strelchenko
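The regex approach from the comment above, sketched in Python (the pattern and markup are made-up examples; in the PHP version, $pattern would hold an equivalent regex):

```python
import re

# Made-up page fragment; in practice this is the downloaded product page.
content = '<span class="price">2 490</span>'

# Capture the digits (and spaces) between the framing price tags,
# so no extra tag-stripping is needed afterwards.
pattern = r'<span class="price">([\d\s]+)</span>'
match = re.search(pattern, content)
if match:
    price = match.group(1).strip()
    print(price)  # -> 2 490
```

Using a capture group for just the number avoids the "clean the framing tags" step the comment mentions, but the pattern still has to be written per site.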

I would recommend using Scrapy. It is Python, of course, not PHP, but parsers of this kind are not written in PHP. You can read about it here: "Collecting data with Scrapy". It is a very handy and flexible tool that parses asynchronously in multiple threads. You can read more at that link or on the official site.

The second option, if the first one does not work for some reason, is to use something like PhantomJS or SlimerJS: "Simple site parsing with SlimerJS".
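The "many pages in parallel" benefit this answer attributes to Scrapy can be sketched with the standard library alone. The fetcher below is a stub with made-up URLs and prices so the sketch runs offline; a real spider would use Scrapy's own request machinery:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_price(url):
    # Stub: a real implementation would download the page and
    # extract the price; here we fake both steps.
    fake_prices = {"/tv": "499", "/phone": "299"}
    return url, fake_prices[url]

urls = ["/tv", "/phone"]
# Threads overlap the (normally network-bound) fetches,
# which is why concurrent crawling is so much faster.
with ThreadPoolExecutor(max_workers=4) as pool:
    prices = dict(pool.map(fetch_price, urls))
print(prices)  # {'/tv': '499', '/phone': '299'}
```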

I used Grab::Spider for these purposes; it is in Python. There is Russian documentation, and examples can be found on Habr. I wrote the queries myself with XPath (the FirePHP module for Firefox helps), or with CSS selectors, though those seem less effective to me. Data obtained this way can be loaded into any database, since Python has plenty of libraries.
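The XPath-query approach from this answer, sketched with Python's built-in ElementTree, which supports a limited XPath subset (the markup is a made-up, well-formed example; real-world HTML usually needs lxml or an HTML parser first, which is what Grab uses internally):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment; lxml-based tools accept messier HTML.
page = ('<html><body><div class="card">'
        '<span class="price">1500</span>'
        '</div></body></html>')
root = ET.fromstring(page)

# Limited XPath: find the span whose class attribute equals "price".
node = root.find('.//span[@class="price"]')
print(node.text)  # -> 1500
```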