I do not know which side to approach. In the accounting system there is a directory of goods, about 500 thousand. there are suppliers, each with an average price of 20 thousand items. it is necessary to identify in terms of our directory what is there with the supplier in the price list, i.e. match the string in the price list and in the directory. it is clear that the hands do not master. The problem is that all the names are written differently, for example, ryazhenka 200ml and ryazhenka 0,2l - this is the most harmless example.

What are the methods or approaches for solving such problems?

Paraphrasing: there is a large classifier of 500 thousand + categories and you need to determine which category the incoming line belongs to.

  • 3
    We'll have to write custom logic with our hands. Disassemble, say, "Ryazhenka 200ml" on the product name and quantity. You have a difficult task. - VladD
  • 3
    as an option, you can try a fuzzy search , but the problem with the fact that the product name includes its characteristics will have to be solved anyway - Bald
  • one
    Here they solved a similar problem habrahabr.ru/post/348250 - CuprumBur
  • 3
    @rawman: well, people make mistakes of the same type. Be creative, if you see "700 grams", then catch the "number" "possible spaces and punctuation" "g" / "gr" / "gra" / "gram" / "gram" / "grams" / ... for new Set the lines to check: if you see a suspicious that is not embedded in known schemes, detect automatically for analysis. Try to add not specific lines, but whole schemes of lines. It is clear that if your code will catch only “500 grams”, then you will be addicted to add “501 grams”, “502 grams”, etc. - VladD
  • one
    A related option is also possible: your system finds the closest names, and the operator selects + trains the system from several points. - Dmitry Polyanin

4 answers 4

Here we need a system that, according to a certain function, selects the most similar options with a coefficient of similarity.

This function may have several modules.

  1. Barcode module. Produces a barcode option, if any.
  2. Name comparison module. Gives similar options by name.
  3. Other modules (for example, image color analysis, price in the range, manufacturer, etc.).

Next, for each product using each module, a list of similar products is calculated, each of which is taken with a coefficient. If at some position the coefficient is sufficiently large, then we take it immediately. If lower, then we offer the operator a choice of several options.

Further, if the operator sees that it is possible to improve the module of names, and there is some pattern that allows you to compare goods from the first and second base, then he either enters this pattern based on the pattern system, or gives you to think up the pattern system, and add this name.

About the pattern system:

You can bring all the names in the canonical form and compare it already. For example, first we take the first words before the punctuation mark or numbers and write to the beginning, then we write the numbers given in canonical form, for example, we translate to 0.3 liters (initially it could be in ml, milliliters or liter).

And the fact that the ready-made system will immediately start speaking is not going on, it is gradually finishing and alignment of the system.

Added: I also recommend not to rush to write code, but to verify every millimeter of code. The system that analyzes the database of 500,000 items will also be large. The system of comparison patterns and modules should be as clear as possible, simple, understandable, logical, and so on, so that it can be easily developed, improved, and so on, since it will work for a long time.

That is, it will be some kind of base (algorithms) for comparisons, coercions of values ​​and types, and so on, and all this should be easily expanded, supported, and so on.

As new data becomes available, experience accumulates and the idea of ​​the base refactoring algorithm is improved.

    Generally speaking, guaranteed to solve this problem for a data volume that cannot be checked manually is impossible. No matter how hard you try to predict exactly how these or other suppliers will call the goods, they can throw out something unexpected at any time. But you can solve it iteratively.

    First you need to conduct a lexical analysis of all directories - stupidly select lexemes for punctuation and whitespace.

    Then you can formulate the rules for the formation of syntactic structures, some of which can be matched by the same semantics. For example, after a numeric literal (which is composed of either a single set of digits or two sets of digits separated by a dot or comma), the units of measure must go, and all units are reduced to a single system, for example, to liters. As a result, 0.2 l, 0.2 l and 200 ml will give the same result as a result of parsing.

    After the formation of the syntax tree, we will have lines that we managed to fully recognize all the others. We process the recognized lines and do not touch them anymore, and continue to look at examples of what in all other lines in order to understand what rules are needed for processing. In this case, most likely, different suppliers will need different syntax rules.

      This answer is divided into two parts.

      1. "Many clever bukAf" - a sample of the English WITH
        answer to the question

        What are the methods or approaches for solving such problems?

        structure:
        Q: translation of the question
        original title - link to the question
        A: Answer

      2. "reasoning"
        on the topic

        Comparison of titles in reference books

      many clever books

      Q : finding the best match

      Getting the closest string match

      A : really a lot of bukAf there

      and this is closer to mere mortals

      You may find this library useful! http://code.google.com/p/google-diff-match-patch/
      List of supported programming Java , JavaScript , Dart , C++ , C# , Objective C , Lua and Python
      I use it in my Lua projects.

      Q : what algorithms to use to determine how two lines are similar

      How are the two strings are?

      A : the original is here. Free translation:

      What you are looking for is called String Metric algorithms . Here is a link to en.wikipedia . It contains a large list of algorithms, but many with similar characteristics. Among the most popular:

      (-> hereinafter the author lists with a brief decoding, which is replaced by quotes from Wikipedia by me. <-)

      Levenshtein distance : this is the minimum number of operations to insert one character, delete one character, and replace one character with another, necessary to turn one line into another.

      Hamming distance : the number of positions in which the corresponding characters of two words of the same length are different. In the more general case, the Hamming distance is applied to strings of the same length of any q-ary alphabets and serves as a metric of difference (a function defining the distance in metric space) of objects of the same dimension.

      Smith-Waterman algorithm : designed to obtain local sequence alignment.

      Sørensen coefficient : binary measure of similarity (a dimensionless measure of the similarity of objects being compared. Also known as “measure of association,” “measure of similarity,” etc.)

      UPD :

      https://ru.wikipedia.org/wiki/Algorithms_List #Algorithms_ on_stroke

      for example, the diff utility for comparing files is based on finding the largest common subsequence

      Diff Algorithm stackoverflow.com/q/805626

      Q : the best library for spell checking C#

      What is the best spell checking library for C #?

      A : Answer:


      Hunspell


      PostgreSQL


      reasoning

      I believe that the algorithm needed to solve the problem requires only one:
      separation of the original question / problem into several "small" ones

      then there is a search for a tool / library to solve each of the “small” questions.

       данные от поставщиков - лево -> ( средство ) <- право - наша база, справочник ^ | программный комплекс,библиотека, модуль 

      eg:

      as an option, you can try a fuzzy search, but the problem is that the name of the product includes its characteristics will have to be addressed anyway

      Q: The line contains garbage, how?
      A: search from right to left

      Q: how to search?
      A: we take the "tool" that can induce, check for errors and look for keywords (one field in the "directory" may have several keywords)
      just looking for "cabbage" do not divide the "cabbage" on the "fermented" "stewed" "fresh" - let the user do it

      Q: how the user will search on such a database
      A: for the "client" the same program is written as on the "server", only easier!
      agree, it's easier to write a search engine than a parser

      Q: OK, and if the "tool" did not find either "cabbage" or "cabbage" or "cabbage"
      A: we add the second criterion to the keywords - proper names
      we find the dictionary, in the process we fill up with our names

      continue the "split of the original question"

      a: have suppliers
      b: change the condition to have a supplier and solve the problem

      categories:

      categories can be used in the analysis they can be an additional criterion, but simply reduce the search

      a: the supplier delivers goods of the same category (only building materials products only)
      b: everything is OK

      a: the supplier delivers goods of different categories
      b: please, split by files / tables into categories

      • PS: Give me a fulcrum and I will turn the whole world up © © Don't look for Archimedes' theorem, determine the fulcrum as a correct question, discard too much, take a lever from the answer and everything will work out - qwabra

      I would begin to write tools for "Mepirovaniya." The idea is simple, to begin with, each category from the directory has only its own name. Further, according to any algorithm, a correspondence is built and the probability of the imported name for the category is calculated, and, for example, 5 variants are shown for each item. Then a person decides what is really appropriate. His choice is remembered, and for subsequent imports taken into account. Well, then, if the probability of coincidence is> 90%, then it does not ask, and when importing, only those nomenclature items that are <80%, for example, are displayed.

      • What algorithms are we talking about? which way to read? - rawman
      • Algorithms do not tell you on the go. I would base on any dictionary, i.e. there is a word (standard) and its abbreviations. the input line would lead to the standard in the dictionary and already compared these lines by the number of matches (order is not always important). - Chad