There is a large input file (10^12 lines) with the following format: First Name | Last Name | Date of Birth

Example:

Yana|Petrova|21.01.1990
Kseniya|Ivanova|22.02.1990
Kseniya|Ivvanova|22.02.1990
Jana|Petrova|21.01.1091

...

Users can enter data with errors in any field and with different spellings of first and last names. A user can also swap the first and last name in a record (the date of birth cannot be swapped). The file can contain several entries for the same user. Labeled data is assumed to be available, against which the results of the algorithm can be checked.

You need to implement a MapReduce application that identifies unique users.

Sample output file after running the algorithm:

1|Yana|Petrova|21.01.1991
2|Kseniya|Ivanova|22.02.1990
2|Kseniya|Ivvanova|22.02.1990
1|Jana|Petrova|21.01.1091

Where the first field is the unique user ID. The order of the lines in the output file does not matter; only the correctness of the unique-user IDs, which should minimize the chosen quality metric, matters.

Tell me how best to implement the algorithm, in particular the reduce stage. What is the best way to perform comparisons and recognize identical users?

    1 Answer

    In such cases, the Levenshtein distance is used for fuzzy string comparison: the number of edit operations needed to transform one string into another.

    For identical strings the distance is zero; for one or two typos it will be small. You choose a threshold value above which the strings are no longer considered equal.

    Since you assume that the first and last name can be swapped, you will have to make two comparisons (concatenating the first name with the last name in direct and in reverse order) and then take the better (smaller) of the two values.

     public boolean fuzzyEqual(String firstname1, String lastname1,
                               String firstname2, String lastname2, int threshold) {
         return threshold >= Math.min(
                 dist(firstname1 + " " + lastname1, firstname2 + " " + lastname2),
                 dist(firstname1 + " " + lastname1, lastname2 + " " + firstname2));
     }

     public int dist(String a, String b) {
         // your Levenshtein distance implementation goes here
     }
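
    A minimal sketch of the dist method, assuming the classic two-row dynamic-programming Levenshtein algorithm (the original answer deliberately leaves the implementation to the reader):

     // Classic dynamic-programming Levenshtein distance with two rolling rows.
     public int dist(String a, String b) {
         int[] prev = new int[b.length() + 1];
         int[] curr = new int[b.length() + 1];
         for (int j = 0; j <= b.length(); j++) {
             prev[j] = j; // cost of building the prefix b[0..j) from an empty string
         }
         for (int i = 1; i <= a.length(); i++) {
             curr[0] = i; // cost of deleting the first i characters of a
             for (int j = 1; j <= b.length(); j++) {
                 int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                 int deletion = prev[j] + 1;
                 int insertion = curr[j - 1] + 1;
                 curr[j] = Math.min(substitution, Math.min(deletion, insertion));
             }
             int[] tmp = prev; prev = curr; curr = tmp; // roll the rows
         }
         return prev[b.length()];
     }

    With this implementation, dist("Kseniya Ivanova", "Kseniya Ivvanova") is 1, so a threshold of 2 or 3 would already merge the two Kseniya records from the example.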

    P.S. Before applying the algorithm, the strings must, of course, be converted to upper or lower case.
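
    For example, a small hypothetical normalize helper (not part of the original answer; it assumes java.util.Locale is imported) could be applied to every field before calling fuzzyEqual:

     // Hypothetical helper: canonicalize a name field before fuzzy comparison.
     public String normalize(String s) {
         return s == null ? "" : s.trim().toLowerCase(Locale.ROOT);
     }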

    • So, in the end, what exactly do you propose to pass to fuzzyEqual()? All pairs of the 10^12 lines? If I'm not mistaken, that would be about 10^24 function calls... - avp
    • @avp, my answer addresses the original poster's algorithmic question directly: "the reduce stage is of particular interest. What is the best way to perform comparisons and recognize identical users?" What to do with a file containing 27 terabytes of first and last names is an interesting question in its own right, and I won't attempt to advise on that here. But back in 2014, 100 TB of data was sorted with Apache Spark on 200 Amazon nodes in just 23 minutes. - Nofate
    • I suspect the data had already been distributed at least across the disks of those nodes. Clearly, the first thing to do in the OP's task is to throw out exact duplicates, which make up at least 99.9% of the records (judging by the population of the country), and apply fuzzy matching only to the strongly reduced set. Perhaps after that reduction it is also worth computing each record's distance from some fixed reference string (for example, all spaces), so that not every pair has to be compared. And of course, I almost forgot: at some partitioning stage the date of birth must be used as the primary partition key. - avp
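
    Putting the comments together, below is one possible shape of the core of the reduce stage: a sketch under the assumption that the mapper emits the date of birth as the key, so each reduce group contains only records with the same date. The class, interface, and method names are illustrative, not from the answer or the comments.

     import java.util.ArrayList;
     import java.util.List;

     // Sketch of the reduce-stage clustering: all records that share one date of
     // birth arrive in a single group, and within the group a simple greedy pass
     // assigns each record to the first cluster whose representative it matches.
     public class ReduceSketch {

         static class Person {
             final String firstName;
             final String lastName;
             Person(String firstName, String lastName) {
                 this.firstName = firstName;
                 this.lastName = lastName;
             }
         }

         // fuzzyEqual from the answer above, abstracted behind an interface.
         interface FuzzyComparator {
             boolean fuzzyEqual(String f1, String l1, String f2, String l2, int threshold);
         }

         // Returns the local cluster index for every record in the group.
         static int[] clusterGroup(List<Person> group, int threshold, FuzzyComparator fuzzy) {
             List<Person> representatives = new ArrayList<>();
             int[] clusterIds = new int[group.size()];
             for (int i = 0; i < group.size(); i++) {
                 Person p = group.get(i);
                 int assigned = -1;
                 // Compare only against one representative per existing cluster:
                 // the group is already small because it holds a single birth date.
                 for (int c = 0; c < representatives.size(); c++) {
                     Person r = representatives.get(c);
                     if (fuzzy.fuzzyEqual(p.firstName, p.lastName,
                                          r.firstName, r.lastName, threshold)) {
                         assigned = c;
                         break;
                     }
                 }
                 if (assigned == -1) {
                     representatives.add(p);
                     assigned = representatives.size() - 1;
                 }
                 clusterIds[i] = assigned;
             }
             return clusterIds;
         }
     }

    A globally unique user ID could then be formed from the group key plus the local cluster index. Note that this relies on the comment's assumption that the date of birth is reliable enough to partition on; the example in the question (21.01.1990 vs 21.01.1091) shows that errors in the date field would still need separate handling.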