I have a "parse" method. It currently uses recursion to follow the site's links, but it only takes the first link from each page. How can I collect all the links on each level of the site in order to build a sitemap?

--index.php--

function parse($url) {
    $url = $this->readUrl($url);
    // Skip empty or already visited URLs (with or without a trailing slash)
    if (!$url
        || !empty($this->cacheurl[$url])
        || !empty($this->cacheurl[preg_replace('#/$#', '', $url)])) {
        return false;
    }
    // Stop once the page budget is exhausted
    $this->_allcount--;
    if ($this->_allcount <= 0) {
        return false;
    }
    $this->cacheurl[$url] = true;

    // str_get_html() comes from the Simple HTML DOM library;
    // request() is this class's own HTTP helper
    $data = str_get_html(request($url));
    $item = array(
        'url'   => $url,
        'title' => count($data->find('title')) ? $data->find('title', 0)->plaintext : '',
    );
    $this->result[] = $item;

    // Recurse into every link found on the page
    foreach ($data->find('a') as $a) {
        $this->parse($a->href);
    }
    $data->clear();
    unset($data);
}

function printresult() {
    foreach ($this->result as $item) {
        echo $item['title'] . ' - <small>' . $item['url'] . '</small>';
    }
    exit();
}
  • Maybe it's easier to search sites for an existing sitemap.xml? - Ordman
  • The thing is, I need a PHP script that follows the links of any site (starting from the main page) and builds a sitemap. The program's output should be sitemap.xml, so to start with I want to write a site parser - Anastasiya

1 answer

Anastasia, you are at the beginning of a very long journey.

  • What if the site has a million or more pages?
  • What if you fall into a loop during the recursive traversal (two pages that link to each other)?
  • What if the site links to another site (a social network, for example)? Then a single call could end up crawling the entire Internet :-)
  • What if you encounter links in various formats: absolute (http://ya.ru/test), relative ("index.php" on a page at http://text.ru/folder/sub-folder/), or root-relative ("/folder/index.php")?
  • Some of the links may even point to images and other media content.
  • What if some link returns an error (40x / 500)?
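The link-format problem above can be illustrated with a small helper. This is only a sketch: `resolveUrl` is a hypothetical function (not part of the asker's code), and a real crawler would also need to handle `../` segments, protocol-relative `//host/path` links, and strip `#fragment` parts.

```php
<?php
// Hypothetical helper: turn a link found on page $base into an absolute URL.
function resolveUrl(string $base, string $link): string {
    // Already absolute? Return as-is.
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative link: "/folder/index.php"
    if ($link !== '' && $link[0] === '/') {
        return $origin . $link;
    }
    // Relative link: resolve against the directory of the base path.
    $path = $parts['path'] ?? '/';
    // A path ending in "/" is already a directory; otherwise drop the file part.
    $dir = substr($path, -1) === '/' ? $path : dirname($path);
    return $origin . rtrim($dir, '/') . '/' . $link;
}
```

With this, "index.php" found on http://text.ru/folder/sub-folder/ resolves to http://text.ru/folder/sub-folder/index.php, while "/folder/index.php" resolves against the site root.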

There are many more questions like these, and not all of them have been solved even by the brightest minds at Google. For example: does this or that parameter in the address bar affect the content of the page (that is, does the param parameter in the link http://test.ru?param=value change what the page shows)?

This task is also better not done in a single script on each request. For real work, use persistent storage as a queue of URLs to process. And it's advisable not to reinvent the wheel but to look for ready-made libraries. Search for "PHP crawler" — a crawler is exactly what you are writing. The search will turn up plenty of libraries, along with descriptions of all the possible pitfalls, and more.
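The queue idea also answers the original question of taking all links level by level: a FIFO queue gives a breadth-first traversal instead of recursion. A minimal in-memory sketch, where `$fetchLinks` is a hypothetical callback (not from the asker's code) that returns the absolute links found on a page, and `$maxPages` is an assumed page budget; real code would persist the queue (a database, Redis, etc.) and restrict links to one host:

```php
<?php
// Breadth-first crawl: visits the start page, then every page it links to
// (level 1), then their links (level 2), and so on.
function crawl(string $start, callable $fetchLinks, int $maxPages = 100): array {
    $queue   = [$start];         // FIFO queue of URLs still to visit
    $seen    = [$start => true]; // prevents loops when pages link to each other
    $sitemap = [];
    while ($queue && count($sitemap) < $maxPages) {
        $url = array_shift($queue);  // take the oldest URL -> level by level
        $sitemap[] = $url;
        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;    // enqueue for a later level
            }
        }
    }
    return $sitemap;
}
```

The `$seen` map is what protects against the loop case from the bullet list above: a URL is enqueued at most once, so two pages referring to each other cannot cause infinite work.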

Search:

https://www.google.com/search?q=php%20crawler&oq=php%20crawler