I have about 500 texts, each the size of a book. What is the best way to organize a search for an exact phrase entered by the user over these texts in C#? Speed is important, and the texts never change.

UPD. If I split all the texts into sentences, would it be possible to speed up the search using a database, for example?

  • Load the texts into a database and use full-text search. - Alexander Petrov
  • @AlexanderPetrov, deploying a full-fledged database for the sake of only 500 texts is not rational: it incurs extra maintenance costs. - Alexis
  • If the phrase must consist exactly of words from the texts, you can build an index of words with their offsets in the text and search using that index. - avp
  • @AlexanderPetrov, will it be faster than without a database? - Maksim
  • A link to what an index is? You can probably find one with Google; better still, read something about how databases are implemented (if you can find it). In general, an index here is a data structure where the word is the key and the associated data are the offsets from the beginning of the text at which that word occurs. For this task it is probably also worth storing, with each word (key), the number of places where it occurs. That lets you analyze the words of the phrase, pick the rarest one, and read only those places in the text where it appears; accordingly, there is less text to compare against the other words of the phrase. - avp
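The word-offset index with rarest-word selection described in this comment could be sketched in C# as follows. This is only a sketch under stated assumptions: all names are illustrative, and it assumes the words of the phrase are separated in the text by single spaces (punctuation between words would require a more careful verification step).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of a word -> offsets index (illustrative, not from the thread).
class WordIndex
{
    readonly Dictionary<string, List<int>> _offsets = new();
    readonly string _text;

    public WordIndex(string text)
    {
        _text = text;
        int i = 0;
        while (i < text.Length)
        {
            // Skip non-letter characters, then read one word.
            while (i < text.Length && !char.IsLetter(text[i])) i++;
            int start = i;
            while (i < text.Length && char.IsLetter(text[i])) i++;
            if (i > start)
            {
                string word = text.Substring(start, i - start).ToLowerInvariant();
                if (!_offsets.TryGetValue(word, out var list))
                    _offsets[word] = list = new List<int>();
                list.Add(start);
            }
        }
    }

    // Exact-phrase search: pick the rarest word of the phrase (fewest
    // occurrences) and verify the full phrase only around its offsets.
    public bool ContainsPhrase(string phrase)
    {
        var words = phrase.ToLowerInvariant()
                          .Split(' ', StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return false;

        // Find the rarest word; if any word is absent, the phrase cannot occur.
        List<int>? rarest = null;
        int rarestPos = 0;
        for (int w = 0; w < words.Length; w++)
        {
            if (!_offsets.TryGetValue(words[w], out var list)) return false;
            if (rarest == null || list.Count < rarest.Count)
            {
                rarest = list;
                rarestPos = w;
            }
        }

        // Candidate phrase start = offset of the rarest word minus the length
        // of the preceding words (assuming single-space separation).
        int prefixLen = words.Take(rarestPos).Sum(s => s.Length + 1);
        foreach (int off in rarest!)
        {
            int start = off - prefixLen;
            if (start < 0 || start + phrase.Length > _text.Length) continue;
            if (string.Equals(_text.Substring(start, phrase.Length), phrase,
                              StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}
```

Because the 500 texts never change, such an index can be built once at startup (or serialized to disk) and reused for every query.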

3 answers

The fastest search is performed using a search index. Of course, to be able to use it, the index must first be built.

If your project already uses a database that supports full-text search, it is logical to use it. Otherwise, you should look for a dedicated search engine or library.

One of the best-known search engines is Lucene. The easiest way to install it is via NuGet; there is an example of using it in .NET.
You can also look at hOOt.


Will it always be faster than a naive scan of the text files? No. However, even if the number and size of the files are relatively small and they fit entirely in RAM, using an index can still be faster: in most cases, loading the files will not be necessary at all.

    Load the texts into a thread-safe queue (ConcurrentQueue, for example), start 500 threads, and scan texts from the queue, with each text checked in its own thread. There are also various options with asynchronous tasks, Parallel.For loops, Parallel.ForEach, etc.

    Giving code examples without specifics is meaningless.
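For illustration only, the parallel brute-force scan this answer describes might be sketched as below. All names are illustrative; the texts are assumed to already be in memory. Note that Parallel.ForEach partitions the work across the thread pool rather than literally starting one thread per text, which is usually preferable to 500 dedicated threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Sketch: scan every text in parallel, collecting those that contain the phrase.
static class BruteForceSearch
{
    public static List<string> FindTexts(IEnumerable<string> texts, string phrase)
    {
        // ConcurrentBag is a thread-safe collector for results produced
        // concurrently by the parallel loop below.
        var hits = new ConcurrentBag<string>();
        Parallel.ForEach(texts, text =>
        {
            if (text.Contains(phrase, StringComparison.OrdinalIgnoreCase))
                hits.Add(text);
        });
        return hits.ToList();
    }
}
```

This approach needs no preprocessing, but every query rescans all 500 texts, which is exactly what an index avoids.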

    As an option for using Lucene, there is a .NET port, Lucene.Net. You need to build the index in advance, and then search for the required document from the application. Here is another link: Introducing-Lucene-Net.
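A minimal sketch of the index-then-search flow with Lucene.Net, assuming the Lucene.Net 4.8 NuGet packages (Lucene.Net, Lucene.Net.Analysis.Common, Lucene.Net.QueryParser); the directory paths and field names are illustrative. Quoting the phrase makes the query parser build a phrase query, which matches the exact word sequence.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

// 1. Build the index once, in advance (the texts never change).
using var dir = FSDirectory.Open("index");
var analyzer = new StandardAnalyzer(version);
using (var writer = new IndexWriter(dir, new IndexWriterConfig(version, analyzer)))
{
    foreach (string path in System.IO.Directory.GetFiles("texts"))
    {
        var doc = new Document
        {
            // Store the path so hits can be reported; index the content.
            new StringField("path", path, Field.Store.YES),
            new TextField("content", System.IO.File.ReadAllText(path), Field.Store.NO)
        };
        writer.AddDocument(doc);
    }
}

// 2. At query time, quote the phrase so it is searched as an exact sequence.
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var parser = new QueryParser(version, "content", analyzer);
Query query = parser.Parse("\"exact phrase entered by the user\"");
TopDocs top = searcher.Search(query, 10);
foreach (ScoreDoc hit in top.ScoreDocs)
    System.Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));
```

The index is written to disk, so building it is a one-time cost; subsequent application runs only open and search it.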