I have no experience with big data and none with NoSQL either.

There is a lot of data (20 million rows), all stored in a single table, and it is assumed it will not change.

What needs to be done with the data: run simple selects, plus occasionally use the Levenshtein algorithm to search for similar strings.

The data originally comes as CSV files. The idea is to import it into SQLite and try working with that. But maybe I should pick some other solution?
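For what it's worth, a minimal sketch of what that import could look like in Python (file name, table name and columns here are placeholders, not taken from the actual data):

```python
# Minimal sketch of the planned CSV -> SQLite import.
# "data.csv", "records", "col_a"/"col_b" are hypothetical placeholders.
import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (col_a TEXT, col_b TEXT)")

with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row, if there is one
    conn.executemany("INSERT INTO records (col_a, col_b) VALUES (?, ?)", reader)

conn.commit()
conn.close()
```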

  • On the whole, anything where Levenshtein distance is implemented (or can be implemented) will suit you. In SQLite it seems to be available. What is the essence of the question? - D-side
  • I would take Postgres - Sasha
  • PostgreSQL really does have levenshtein, I would take it. 20 million rows is not a problem for almost any database, but SQLite is clearly not worth taking unless the task is a one-off. A little later I will try to write a more detailed answer. - etki
  • @Etki SQL Server can also do Levenshtein, so the answer really depends on the personal preferences of the person answering. - PashaPash ♦
  • @PashaPash hence the "I would take". - etki

1 answer

The full answer to this question hinges on unstated requirements for the specific storage, so what follows is more of an explanation than a set of tips.

Storage systems can be divided into five main types: relational, column-family, key-value, document-oriented, and graph. The first thing to do is to understand which types do not fit the conditions and drop them immediately. In this case the word "selects" was used and nothing was said about complex relationships, so key-value stores can be discarded right away and graph stores ignored. Three types remain - relational, column-family, and document-oriented - and the final choice depends on how structured the data is. If the data has no fixed structure and has varying nesting, but you still need to search by attributes, then only the document-oriented option remains; otherwise you can choose a standard relational model or a column-family store (from the client's point of view they are quite similar, except that column-family looks severely cut down - for example, column-family stores usually have no notion of a join), although nothing forbids using a document-oriented store here either.
Separately, I want to note that 20 million rows is a large number only from a human point of view. Although any storage will consume a truly impressive amount of disk space after such an import, it should not fall over: such volumes are normal for modern services, so by default we can assume that any storage can handle the data in question. Some GUIs will struggle to display the whole result set (because that requires keeping all the data in memory), but the storage itself and the client code should process the records in small batches without eating up all the RAM.
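To illustrate the "small batches" point, the client-side loop could look roughly like this (assuming the SQLite variant; the table and column names are made up):

```python
# Hedged sketch: iterate over a large result set in batches so the client
# never holds all 20 million rows in memory at once.
import sqlite3

conn = sqlite3.connect("data.db")
cursor = conn.execute("SELECT col_a FROM records")

while True:
    batch = cursor.fetchmany(10_000)  # pull 10k rows at a time
    if not batch:
        break
    for (value,) in batch:
        pass  # process each value here

conn.close()
```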

As far as I understood from the mention of CSV, the data is rigidly structured, so the simplest thing would be to take some relational database and work with it (column-family is a separate beast that I would not want to jump onto on the fly, and it is most likely not needed here anyway; besides, cf-storages are aimed at distributed storage of parts of the data, while here every query needs to process all the data at once, which cannot be done independently on each individual node). The need to compute the Levenshtein distance inside the query narrows the range of possible storages even further: among the well-known free ones, that is only SQLite and PostgreSQL. From here you can either just pick PostgreSQL or break the task down in a bit more detail.
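For the PostgreSQL variant, levenshtein() comes from the fuzzystrmatch extension (CREATE EXTENSION fuzzystrmatch); a query might look roughly like this (connection parameters, table and column names are placeholders):

```python
# Sketch of a Levenshtein query against PostgreSQL via psycopg2.
# "records" / "col_a" and the connection details are hypothetical.
import psycopg2

conn = psycopg2.connect(dbname="mydb", user="me")
cur = conn.cursor()
cur.execute(
    """
    SELECT col_a, levenshtein(col_a, %s) AS dist
    FROM records
    ORDER BY dist
    LIMIT 10
    """,
    ("search term",),
)
for row in cur.fetchall():
    print(row)
conn.close()
```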

First, as far as I understand, the product does not really need the Levenshtein distance itself, but rather to compare some input X with all values present in the storage. Second, this task clashes with the very idea of indexes, which are usually used to speed up searches: indexes are built over exact values, while here it is impossible to compute an exact value for the index, or to build a tree for faster lookup. An attempt to pre-compute the distance for all possible inputs will most likely end together with the free disk space. So it may be worth looking at additional measures.

The very first optimization that comes to mind is to move the attributes that are actually searched into separate tables. One way or another the storage will iterate over all the records (in theory it could iterate over the index alone, but I am not so sure about that), so purely to speed up reading you can duplicate the needed values in separate tables. You can also impose a condition on the application that the prefix of the search word (for example, the first third of the characters, rounded down) is always correct, and search only over that reduced subset. Likewise, if there is a maximum allowed distance, you can restrict the selection by record length and run the distance function itself against a much smaller dictionary.
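A rough sketch of that narrowing idea, continuing the PostgreSQL example above (names remain placeholders): a fixed prefix and a length window cut down the set of rows on which levenshtein() actually has to run.

```python
# PostgreSQL + fuzzystrmatch again; the prefix is assumed to be typed correctly
# and the length window follows from the maximum allowed distance.
query = """
    SELECT col_a
    FROM records
    WHERE col_a LIKE %s                       -- prefix assumed correct
      AND length(col_a) BETWEEN %s AND %s     -- bounded by max allowed distance
      AND levenshtein(col_a, %s) <= %s
"""

term = "search term"
max_dist = 2
prefix = term[: len(term) // 3]  # first third of the characters, rounded down

params = (prefix + "%", len(term) - max_dist, len(term) + max_dist, term, max_dist)
# cur.execute(query, params)  # using the cursor from the previous sketch
```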

The second optimization, which is less obvious than the third but which I want to mention a bit earlier, is building some intermediate model for fuzzy search. If you really need to find similar words rather than the Levenshtein distance as such, you can apply a preliminary phonetic transformation to the terms (if, of course, the application's requirements allow it). It may be worth converting all variant spellings of "terrace" to a single canonical "terrace", so that those words become synonyms; in that case (again, if the conditions set for the application allow it) there will be no need to compute the Levenshtein distance at all (though the search query will have to be run through the same filter before looking for matches). This is, in particular, how the snowball filter works in the Elasticsearch search engine, which may be exactly what the application needs; in addition, ES allows you to search not just for matches within a given threshold, but for the required number of closest results (although that deserves a separate post, and it is not that simple with the Levenshtein distance). NB: the official ES blog asks you to be very careful with such queries because, for obvious reasons, they are really slow and hungry. Elasticsearch itself is the answer to most of the world's search problems, but to use it correctly you need to understand what is happening inside and to continuously update both the documents themselves and whole indexes, and diving into it will be about as hard as diving into a column-family database.
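As a deliberately crude illustration of the "normalize before storing" idea (a real setup would use a proper stemmer or phonetic algorithm such as snowball or metaphone rather than this toy function):

```python
# Both the stored value and the search query go through the same filter, so
# misspellings that differ only in doubled letters or case collapse to one key
# and an exact match replaces a Levenshtein scan for this class of errors.
import re

def normalize(term: str) -> str:
    term = term.lower().strip()
    term = re.sub(r"(.)\1+", r"\1", term)  # collapse runs of repeated letters
    return term

assert normalize("Terrace") == normalize("terace") == "terace"
```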

And finally, the third optimization, the most resource-hungry and demanding and, at the same time, the most effective. You can simply load all the data being searched into RAM and work on it directly. This is the fastest option; it scales poorly, however (although the earlier tricks with a prefix/suffix and word length still apply) and requires a lot of attention, but on large volumes of data people often lean towards it (because it is simpler than maintaining a fleet of databases, and a single machine can handle a huge volume of requests). In particular, there is the Levenshtein automaton (disclaimer: I have not yet found the time to read up on it), which will most likely solve all the problems. The Elasticsearch suggested in the previous option, as far as I know, uses a Levenshtein automaton, but I cannot guess how much RAM it will need (there will certainly be some overhead; ES can swap to disk, but its authors strongly advise against that) or how to configure it correctly.
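A sketch of the in-memory variant with the same cheap length pre-filter; the plain dynamic-programming distance below is the naive approach, and a Levenshtein automaton or a ready-made library would be faster:

```python
# Keep all searchable values in RAM and scan them directly.

def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def search(values, term, max_dist=2):
    for v in values:
        if abs(len(v) - len(term)) > max_dist:
            continue  # length difference alone already exceeds the threshold
        if levenshtein(v, term) <= max_dist:
            yield v

# values = [row[0] for row in cursor]  # e.g. loaded once from the database
```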

Summing up: most likely PostgreSQL or something similar will suit the task, but the operations in question are quite resource-intensive, so I would also look at the alternatives.

  • Thank you so much for such a detailed response! - PendalF