This will not be a complete answer, but rather an explanation using ElasticSearch as an example; after reading it, you will understand why there is no complete how-to (and why it is easier to just use that same ElasticSearch).
The question does not state specific requirements for the engine, so let us formalize them like this: search over documents with arbitrary fields (strings, booleans, numbers, dates, IP addresses, geo points), both exact and inexact; the number of documents is unbounded; exact search should complete in milliseconds and inexact search in hundreds of milliseconds, under loads of hundreds and tens of requests per second respectively; and the engine should allow linear horizontal scaling, i.e. performance growing in proportion to cluster size. It is assumed that each document has a primary key by which it can be uniquely identified.
To search by some attribute of a document, the engine must build an attribute-to-documents map, so that when a match on an attribute is found, it leads straight to the corresponding documents.
document a:
  type: user
  admin: true
document b:
  type: user
  admin: true
document c:
  type: user
  admin: false
  nickname: bob

index type:
  user -> a, b, c
index admin:
  true -> a, b
  false -> c
index nickname:
  bob -> c
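The map above can be sketched in a few lines of Python. This is a toy illustration, not how a real engine stores its index; the document ids and field names come from the example above.

```python
# Toy inverted index: (field, value) -> set of matching document ids.
docs = {
    "a": {"type": "user", "admin": True},
    "b": {"type": "user", "admin": True},
    "c": {"type": "user", "admin": False, "nickname": "bob"},
}

index = {}
for doc_id, fields in docs.items():
    for field, value in fields.items():
        # Every (attribute, value) pair points back at the documents holding it.
        index.setdefault((field, value), set()).add(doc_id)

print(sorted(index[("type", "user")]))    # ['a', 'b', 'c']
print(sorted(index[("admin", True)]))     # ['a', 'b']
print(sorted(index[("nickname", "bob")])) # ['c']
```

Looking up a match is now a single dictionary access instead of a scan over all documents, which is the whole point of the structure.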
Such a structure is called an inverted index, and it comes with a number of problems. First, it tends to be large: the more documents there are, the more it grows, so you cannot count on it fitting into RAM or, in extreme cases, into one file on one machine. Second, it cannot be partially rebuilt when a new document is added. The structure must be optimized for search as much as possible, which means laying it out in memory efficiently; that (at the very least) rules out pointer-chasing, so adding a new document to the index requires rebuilding everything after its position. That is potentially gigabytes of data and a not-so-fast operation even when everything fits in RAM, and the index would be unavailable during every rebuild. Therefore, in practice, efficient search runs over immutable data structures that are regularly replaced with newer versions.
That is roughly the theory behind the indexes that make searching documents by their attributes possible. There is huge room for optimization here: for example, to avoid operating on giant monolithic indexes, ElasticSearch splits them into segments, and that is how it achieves near-realtime search, where documents become searchable within about a second of being added.
But the index itself, of course, can only find documents by a single exact match, while a search engine must work with several attributes at once. So there are two problems: an extra layer of abstraction (a query language for the engine) and the actual combining of sub-queries. The query language can be anything (SQL included), but combining is a bit trickier. I may be off here, but as far as I know, at query time the engine builds a bitset for each individual sub-query and then combines them with bitwise operations to get the overall set of documents satisfying the condition. You can read more about all this by searching for the boolean retrieval model (I have not read it myself, I confess).
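The combining step can be sketched with plain Python sets standing in for bitsets (real engines use compressed bitsets, not hash sets, but the boolean logic is the same; the index contents below are the example from earlier).

```python
# Per-term posting sets, as a toy stand-in for per-sub-query bitsets.
index = {
    ("type", "user"): {"a", "b", "c"},
    ("admin", True): {"a", "b"},
    ("nickname", "bob"): {"c"},
}

def query_and(*terms):
    """AND: intersect the posting sets of all sub-queries."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def query_or(*terms):
    """OR: union the posting sets of all sub-queries."""
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result

# "type = user AND admin = true"
print(sorted(query_and(("type", "user"), ("admin", True))))  # ['a', 'b']
```

In a real engine each set is a bitmap over internal document numbers, so AND and OR become cheap bitwise operations over machine words.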
After that, one more requirement arrives, without which a search engine is practically useless: the results must be ranked. Here another model comes in, tf/idf, which gives a relatively cheap way to estimate how significant the matches in one document are relative to matches elsewhere. Imagine we search for the words "main" and "distributor": obviously, documents with ten matches for "main" and one for "distributor" should be less relevant than documents with ten matches for "distributor" and one for "main". Using tf/idf, the engine normalizes the number of query matches in a document against how often the query terms occur across all documents in general: if the word "main" appears in every second document, its influence on relevance is minimal, and when computing the "weight" of a match the engine simply divides the raw weight by (roughly speaking) the number of documents containing that term.
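A toy version of that scoring, using the "main"/"distributor" example. This is the textbook tf × idf with a smoothed idf, not ElasticSearch's actual scoring function; the corpus is made up so that "main" is common and "distributor" is rare.

```python
import math

# "main" occurs in every document, "distributor" only in two.
corpus = {
    "d1": ["main"] * 10 + ["distributor"],
    "d2": ["distributor"] * 10 + ["main"],
    "d3": ["main"] * 5,
    "d4": ["main"] * 3,
}

def idf(term):
    """Inverse document frequency: rarer terms get a higher weight."""
    df = sum(1 for words in corpus.values() if term in words)
    return math.log(1 + len(corpus) / df) if df else 0.0

def score(doc_id, query_terms):
    """tf * idf summed over the query terms: raw count, discounted by commonness."""
    words = corpus[doc_id]
    return sum(words.count(t) * idf(t) for t in query_terms)

q = ["main", "distributor"]
# d2 (ten "distributor", one "main") outranks d1 (the reverse),
# because the rare term carries more weight than the common one.
print(score("d1", q), score("d2", q))
```

The division by document frequency is exactly the normalization described above: ten hits on a word found everywhere count for less than one hit on a word found almost nowhere.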
None of the above is mandatory, but it is what the ElasticSearch engine applies by default.
After all this theory we can finally touch the more complex parts of the engine, for example full-text search (and this is not even fuzzy search yet). The problem with text is that search requires an inverted index, and you cannot build a useful inverted index out of an entire string: a ten-kilobyte text would get an index entry that matches only the exact same ten-kilobyte text. So the text must be broken into components (usually words) over which the index is built and which are later searched. This is a genuinely nasty part, which in any case is exposed to the user (since only they know what kinds of matches they need to find), but the engine itself must provide the machinery for slicing text into components. Here it needs to offer at least the following:
- Stripping word endings (stemming)
- The ability to include or exclude numbers, brackets and other characters in the tokenized result
- Splitting, or not splitting, words on hyphens
- Dropping words that occur too frequently ahead of time (clearing out all the prepositions, i.e. stop words)
- Turning ä into a (character folding)
- Turning the input stream into its phonetic counterpart (e.g. the Russian «здесь», "here", becomes "sdes"), so that even a hopeless speller can find something with this engine
And, of course, all of this should work for some reasonable set of languages (in our case, at least Russian and English).
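The list above can be sketched as one deliberately crude analyzer. The stop-word list, folding table and suffix rules here are invented for illustration; a real engine ships configurable, per-language versions of each step (and a real stemmer, not a regex).

```python
import re

STOPWORDS = {"the", "a", "an", "in", "on", "of"}          # toy stop-word list
FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})       # toy character folding

def analyze(text, split_hyphens=True):
    text = text.lower().translate(FOLD)                    # fold ä -> a, etc.
    if split_hyphens:
        text = text.replace("-", " ")                      # split words on hyphens
    tokens = re.findall(r"[a-z0-9]+", text)                # keep letters and digits only
    tokens = [t for t in tokens if t not in STOPWORDS]     # drop stop words
    # Naive "stemming": strip a couple of common English endings.
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

print(analyze("The Über-Fähig distributors are searching"))
# ['uber', 'fahig', 'distributor', 'are', 'search']
```

Crucially, this same function must later be applied to every incoming query string, not only to indexed documents, or the query tokens will never match the index tokens.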
After a text document has been turned into a set of tokens and an index has been built over them, the exact same processing must be applied every time to the incoming search string, so that a user who typed, say, "something is smoking here" (in Russian) actually searches the index for "sdes" and "kuritsa" ("chicken"). Note, by the way, how easily that last example corrupts the query: the user risks getting back documents about poultry farming. Another item for the "implementing a search engine is hard" basket.
And finally, fuzzy search. Given all of the above, a relatively simple scheme applies: take each token in the index, compare it with the processed query by computing the Levenshtein distance, and discard every token (and the documents with it) whose distance exceeds the allowed maximum. On top of that, do not forget that the user of the engine will certainly want the matches and corrections highlighted, and that is a whole lot of extra pain to implement.
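The scheme just described can be sketched with the classic dynamic-programming Levenshtein distance. The filtering function and its token lists are illustrative; real engines use automata over the term dictionary rather than comparing every token, precisely because this naive loop is slow.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (or match)
        prev = cur
    return prev[-1]

def fuzzy_match(query_token, index_tokens, max_distance=2):
    """Keep only index tokens within max_distance edits of the query token."""
    return [t for t in index_tokens if levenshtein(query_token, t) <= max_distance]

print(levenshtein("kitten", "sitting"))                    # 3
print(fuzzy_match("serch", ["search", "sort", "chicken"])) # ['search']
```

Every token that survives the filter drags its documents into the result set; everything beyond the distance threshold is discarded.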
Once this machine is built and seems to work, another problem arises: scaling the engine. Here a very particular kind of pain begins:
- The data no longer fits on one physical machine
- Once the data is spread across machines, getting results means running the query on all of them and merging the results
- As a result, the benefit of scaling is zero: every request hits every machine, so in effect each machine still serves every request
- Now add the user's desire to aggregate results (an analogue of GROUP BY) and to paginate them, and the picture of hell is complete
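The merge step above is usually called scatter-gather, and a toy version makes the pagination pain visible. The shard contents and scores below are made up; the point is that to serve page N, every shard must ship its own top (from + size) candidates to the coordinator.

```python
import heapq

# Made-up shards, each holding its own hits sorted by descending score.
shards = {
    "shard0": [("doc-a", 9.1), ("doc-b", 7.4), ("doc-c", 3.2)],
    "shard1": [("doc-d", 8.8), ("doc-e", 5.0)],
    "shard2": [("doc-f", 9.5), ("doc-g", 1.1)],
}

def search(page_from, size):
    """Scatter to all shards, gather, merge, and cut the requested page."""
    candidates = []
    for hits in shards.values():
        # Each shard must over-fetch: it cannot know which of its hits
        # survive the global merge, so it returns its top (from + size).
        candidates.extend(hits[: page_from + size])
    top = heapq.nlargest(page_from + size, candidates, key=lambda h: h[1])
    return top[page_from : page_from + size]

print(search(0, 3))  # [('doc-f', 9.5), ('doc-a', 9.1), ('doc-d', 8.8)]
```

This is why deep pagination is expensive in a distributed search engine: page 1000 forces every shard to compute and transfer roughly 1000 pages of candidates, and the coordinator to merge all of them.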
ElasticSearch uses a trick for this which, used correctly, makes life easier. It splits the document store up front into shards, separate units of execution, and then distributes those shards across the cluster. Shards are usually replicated, so merely spreading requests randomly across the cluster already reduces the load on each individual node somewhat. But ElasticSearch also lets you store and search on a specific shard chosen by one of the document's attributes: for example, if you store geographic names and need to search streets, you can route by the identifier of the city the street belongs to, and ElasticSearch will only bother the shard holding that city's data.
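The routing idea reduces to hashing a key onto a shard number. A minimal sketch, assuming a fixed shard count and CRC32 as the hash (ElasticSearch actually uses a murmur3 hash of the routing value, but the principle is the same: the hash must be stable so indexing and search always agree on the shard).

```python
import zlib

NUM_SHARDS = 5  # fixed at index creation; changing it remaps every document

def shard_for(routing_key):
    # zlib.crc32 is stable across processes and runs, unlike Python's
    # built-in hash(), which is randomized per interpreter.
    return zlib.crc32(routing_key.encode("utf-8")) % NUM_SHARDS

# All streets of one city are indexed to, and searched on, a single shard:
city_shard = shard_for("city:318")
print(city_shard == shard_for("city:318"))  # True
```

A query routed by `city:318` then touches one shard instead of all five, which is where the real scaling win comes from. The flip side is visible in the comment: the shard count is baked in, because changing it would send every existing document to a different shard.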
This does not cover even a tenth of the problems you would have to solve when implementing a search engine (and not all of them are even stated correctly here), so I recommend taking ElasticSearch, Solr or Sphinx or, if you insist, one of the tools for building your own indexes (e.g. Lucene, on which ElasticSearch is built — ElasticSearch does not do the searching itself, ha — or YoctoDB), and using ready-made solutions.