Which algorithm is best for full-text search?

Take PHP and MySQL, for example.

Experience shows that the LIKE construct is complete crap for searching when the site has high traffic and a lot of data.

What would you recommend? If you know of any literature or links, post them; anything will be useful.

  • Primarily, the question is about a high-traffic site with a huge text base, where you need to search the text for words or sentences - dgfhgjljhjjd

4 answers

The best option is to look at the ready-made search engines. If you want to reinvent the wheel yourself, start from the fact that the search runs not over the content itself but over a separate index, which is built, for example, like this:

  1. words from the page are reduced to a single normal form (for nouns: nominative case, singular, and so on)
  2. they are stored in the database with additional information (for example, the ordinal position of the word on the page, the page itself, the original word form, which tags surround it: everything you might need)
  3. ...
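
The steps above can be sketched roughly as follows. This is only an illustration, not a real engine: the `normalize` function is a crude stand-in for proper lemmatization (it just lowercases and strips a few English suffixes), the index lives in a Python dict instead of a database, and all names here are made up for the example.

```python
import re
from collections import defaultdict


def normalize(word):
    """Crude stand-in for lemmatization: lowercase and strip a few suffixes."""
    w = word.lower()
    for suffix in ("ing", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def build_index(pages):
    """Map each normalized word to (page id, position on page, original form)."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        for position, original in enumerate(re.findall(r"\w+", text)):
            index[normalize(original)].append((page_id, position, original))
    return index


# Hypothetical sample pages, keyed by page id.
pages = {1: "Searching large texts", 2: "A text search engine"}
index = build_index(pages)
print(index["text"])
```

In a real project the postings would go into a table (or a dedicated engine), and the additional information from step 2, such as surrounding tags, would be extra columns next to the position and the original form.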

Next comes handling the search query. The simplest case is a single word: reduce it to the same normal form as in step 1 and look it up. If it is found, do not forget to return the original fragment (this is why the "original" word forms, their order and so on are stored). If you need to search for several words or support a query language, keep furrowing your brow; by that point either the idea of writing your own engine will have died on its own, or you will already have answers to the questions that come up :)
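
A minimal sketch of the single-word case, under the same kind of assumptions: the normalization is a placeholder (lowercase, drop a trailing "s"), the original words are kept in order per page so that a fragment can be cut out around each hit, and the function names and the `context` parameter are invented for the example.

```python
import re


def normalize(word):
    # Placeholder normalization; a real engine would use a proper lemmatizer.
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def index_pages(pages):
    """Return (original words per page, normalized word -> list of (page, position))."""
    words = {pid: re.findall(r"\w+", text) for pid, text in pages.items()}
    index = {}
    for pid, tokens in words.items():
        for pos, tok in enumerate(tokens):
            index.setdefault(normalize(tok), []).append((pid, pos))
    return words, index


def search(query, words, index, context=2):
    """Look up one word and return (page id, original fragment) for every hit."""
    hits = []
    for pid, pos in index.get(normalize(query), []):
        tokens = words[pid]
        start = max(0, pos - context)
        hits.append((pid, " ".join(tokens[start : pos + context + 1])))
    return hits


# Hypothetical sample pages.
pages = {
    1: "Search engines index large texts quickly",
    2: "Forum engines update the index on every edit",
}
words, index = index_pages(pages)
results = search("Engines", words, index)
print(results)
```

The query goes through the same `normalize` as the indexed words, which is the whole point: "Engines" and "engines" end up under one key, while the fragment is rebuilt from the stored original words.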

Maintaining this storage should be the job of a separate engine. It is either a completely standalone one that periodically crawls the pages, looks for changes and builds the index (Yandex.Site and the like), or the search index is updated whenever a page is created or edited; this simplest option is implemented in the engines of many forums.
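
The second option (updating the index when a page is saved) might look something like this. It is a sketch only: the `SearchIndex` class and its method names are hypothetical, normalization is reduced to lowercasing, and everything is kept in memory instead of a database.

```python
import re
from collections import defaultdict


class SearchIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of pages containing it
        self.page_terms = {}              # page id -> terms currently indexed for it

    def save_page(self, page_id, text):
        """Call on create or edit: drop stale terms, then add the current ones."""
        new_terms = {w.lower() for w in re.findall(r"\w+", text)}
        for term in self.page_terms.get(page_id, set()) - new_terms:
            self.postings[term].discard(page_id)
            if not self.postings[term]:
                del self.postings[term]
        for term in new_terms:
            self.postings[term].add(page_id)
        self.page_terms[page_id] = new_terms

    def lookup(self, word):
        return sorted(self.postings.get(word.lower(), set()))


idx = SearchIndex()
idx.save_page(1, "search engines build an index")
idx.save_page(2, "forums update the index")
idx.save_page(1, "search engines rebuild everything")  # page 1 was edited
print(idx.lookup("index"))
```

Remembering which terms each page contributed is what makes an edit cheap: only that page's postings are touched, without rebuilding the whole index.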

Something like that, in the most general terms, without details :) So, see the first sentence of the first paragraph.

    Sphinx with periodic reindexing of the data via cron. No other decent options have been invented yet. There is also Lucene, but it is for Java.

    PHP is not well suited to a task like this, though it is fine for the front end of a web project. It also depends on how the search will be carried out: over the file system or over a database. You have already been told that there are many engines, so use them. There are Nutch and Solr. These are heavyweights. Very nice stuff. If I were you, I would use Java; it has a lot of tools for working with the Internet.

      Analog Correlation.