This answer is divided into two parts.
"Lots of clever words" - a selection from the English-language SO
answer to the question
What are the methods or approaches for solving such problems?
structure:
Q: translation of the question
original title - link to the question
A: Answer
"reasoning"
on the topic
Comparing names in directories
Lots of clever words
Q : finding the best match
Getting the closest string match
A : there really are a lot of words there
and this one is closer to mere mortals:
You may find this library useful! http://code.google.com/p/google-diff-match-patch/
Supported programming languages: Java, JavaScript, Dart, C++, C#, Objective C, Lua and Python
I use it in my Lua projects.
Q : which algorithms to use to determine how similar two strings are
How are the two strings are?
A : the original is here. Free translation:
What you are looking for are called string metrics. Here is a link to en.wikipedia. It contains a long list of algorithms, many of them with similar characteristics. Among the most popular:
(-> from here on the author gives a list with brief explanations, which I have replaced with quotes from Wikipedia. <-)
Levenshtein distance : the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another.
Hamming distance : the number of positions in which the corresponding characters of two words of the same length differ. In the more general case, the Hamming distance is applied to strings of the same length over any q-ary alphabet and serves as a metric of difference (a function defining the distance in a metric space) between objects of the same dimension.
Smith-Waterman algorithm : designed to obtain a local sequence alignment.
Sørensen coefficient : a binary measure of similarity (a dimensionless measure of the similarity of the objects being compared, also known as a "measure of association", "measure of similarity", etc.)
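The metrics listed above can be sketched in a few lines of plain Python. This is only an illustration, not code from the linked answers: the function names are made up, the Sørensen coefficient is computed over character bigrams (one common choice, assumed here), and the Smith-Waterman scores (match=2, mismatch=-1, gap=-1) are arbitrary example values.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def sorensen(a: str, b: str) -> float:
    """Sørensen coefficient over sets of character bigrams (assumed tokenization)."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a and not bigrams_b:
        return 1.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Best local alignment score (score only, no traceback)."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            score = max(0,
                        previous[j - 1] + (match if ca == cb else mismatch),
                        previous[j] + gap,
                        current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best
```

Levenshtein suits names with typos, Hamming only works for fixed-length codes, and the local alignment score helps when the name is buried inside a longer string.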
UPD :
https://ru.wikipedia.org/wiki/Algorithms_List #Algorithms_ on_stroke
For example, the diff utility for comparing files is based on finding the longest common subsequence.
Diff Algorithm stackoverflow.com/q/805626
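To show what "longest common subsequence" means in practice, here is a minimal dynamic-programming sketch. The function name lcs is made up for this example, and real diff tools use more refined algorithms; this is just the textbook version.

```python
def lcs(a, b):
    """Return one longest common subsequence of two sequences as a list."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # walk the table from the top-left corner to recover the subsequence
    result, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result
```

It works on strings (character by character) and on lists of lines alike; whatever is not part of the common subsequence is what a diff would report as added or removed.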
Q : the best spell checking library for C#
What is the best spell checking library for C#?
A : Answer:
Hunspell
PostgreSQL
reasoning
I believe only one algorithm is needed to solve the problem:
split the original question / problem into several "small" ones,
then search for a tool / library to solve each of the "small" questions.
supplier data - left -> ( tool ) <- right - our database, directory
                           ^
                           |
          software package, library, module
e.g.:
As an option, you can try a fuzzy search, but the fact that the product name includes its characteristics will have to be dealt with anyway.
Q: The string contains garbage; what then?
A: search from right to left
Q: how to search?
A: we take a "tool" that can inflect words, check for errors and look for keywords (one field in the "directory" may have several keywords)
we simply look for "cabbage"; do not split "cabbage" into "fermented", "stewed", "fresh" - let the user do that
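A rough sketch of the idea above: scan the tokens of a supplier string from right to left and return the first one that fuzzily matches a keyword. Everything here is an assumption for illustration: the KEYWORDS directory, the find_keyword name and the 0.8 cutoff are invented, and difflib from the standard library stands in for a real spell checker or morphology tool.

```python
import difflib
import re

# Hypothetical directory: canonical name -> list of keywords for that entry
KEYWORDS = {
    "cabbage": ["cabbage"],
    "carrot": ["carrot", "carrots"],
}


def find_keyword(line, keywords=KEYWORDS, cutoff=0.8):
    """Return the canonical name of the first keyword found, scanning right to left."""
    # map every keyword back to its canonical directory name
    lookup = {kw: name for name, kws in keywords.items() for kw in kws}
    # keep only runs of letters, so digits and punctuation are dropped as garbage
    tokens = re.findall(r"[^\W\d_]+", line.lower())
    for token in reversed(tokens):
        match = difflib.get_close_matches(token, list(lookup), n=1, cutoff=cutoff)
        if match:
            return lookup[match[0]]
    return None
```

For instance, find_keyword("## 12kg fresh cabage") still resolves to "cabbage" despite the typo, while "fresh" is ignored and left for the user to filter on.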
Q: how will the user search such a database?
A: for the "client" the same program is written as for the "server", only simpler!
you must agree, it is easier to write a search engine than a parser
Q: OK, and what if the "tool" found no form or spelling of "cabbage"?
A: we add a second criterion to the keywords: proper names
we find a dictionary and add our own names to it as we go
we continue "splitting the original question"
a: there are several suppliers
b: change the condition to a single supplier and solve the problem for it
categories:
categories can be used in the analysis: they can serve as an additional criterion, or simply narrow the search
a: the supplier delivers goods of a single category (only building materials, or only food products)
b: everything is OK
a: the supplier delivers goods of different categories
b: ask them to split the data by files / tables into categories
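How a category can narrow the search, as a minimal sketch: candidates are filtered by the supplier's category first, and only then compared fuzzily. The DIRECTORY contents, the match_name function and the 0.6 cutoff are assumptions made up for this example.

```python
import difflib

# Hypothetical directory entries: (name, category)
DIRECTORY = [
    ("cabbage", "food"),
    ("carrot", "food"),
    ("cement", "building materials"),
]


def match_name(name, category=None, directory=DIRECTORY, cutoff=0.6):
    """Fuzzy-match a name, looking only at entries of the given category."""
    # with no category, every directory entry stays a candidate
    candidates = [n for n, c in directory if category is None or c == category]
    found = difflib.get_close_matches(name.lower(), candidates, n=1, cutoff=cutoff)
    return found[0] if found else None
```

When the supplier is known to deliver only one category, passing it in both speeds up the comparison and avoids false matches against unrelated goods.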