There is a large input file (10^12 lines) in the following format: First Name|Last Name|Date of Birth
Example:
Yana|Petrova|21.01.1990
Kseniya|Ivanova|22.02.1990
Kseniya|Ivvanova|22.02.1990
Jana|Petrova|21.01.1091
...
Users may enter data with errors in any field and with different spellings of the first and last names. A user can also swap the first name and the last name in a record (the date of birth is never swapped with another field). The file can contain several entries for the same user. Labeled data is also assumed to be available, so the results of the algorithm can be checked against it.
You need to implement a MapReduce application that identifies the unique users.
Sample output file after running the algorithm:
1|Yana|Petrova|21.01.1991
2|Kseniya|Ivanova|22.02.1990
2|Kseniya|Ivvanova|22.02.1990
1|Jana|Petrova|21.01.1091
Here the first field is the unique user ID. The order of the lines in the output file does not matter. What matters is that the assigned user IDs minimize the chosen metric.
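The metric itself is not named above. As one possible choice, the assigned IDs could be scored against the labeled data with pairwise precision, recall and F1: two records form a "positive" pair when they share an ID. The sketch below is only an assumption about what such a check might look like; the function names are hypothetical, and the quadratic pair enumeration is meant for a small labeled sample, not for the full file.

```python
from itertools import combinations


def same_user_pairs(ids):
    """Return the set of index pairs (i, j), i < j, whose records share an ID."""
    return {(i, j) for i, j in combinations(range(len(ids)), 2) if ids[i] == ids[j]}


def pairwise_scores(predicted_ids, true_ids):
    """Pairwise precision, recall and F1 of predicted IDs against labeled IDs.

    Both lists are aligned by record position; the ID values themselves
    do not have to coincide, only the grouping they induce.
    """
    pred = same_user_pairs(predicted_ids)
    true = same_user_pairs(true_ids)
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(true) if true else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For the four sample records, predicted IDs [1, 2, 2, 1] reproduce the labeled grouping exactly, while [1, 2, 2, 3] miss the Yana/Jana pair and lose recall.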
Please advise how best to implement the algorithm, in particular the reduce stage. What is the best way to perform the comparisons and recognize identical users?
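To make the question concrete, here is a minimal in-memory sketch of one possible approach, not a definitive design. It assumes a "blocking" scheme: the map step emits each record under several keys (part of the date plus a three-letter name prefix), so that a single bad field does not hide a match, and the reduce step compares records only within a block. The key layout, the 0.8 threshold, the use of `difflib.SequenceMatcher` as the similarity measure, and all function names are assumptions made for illustration. On a real cluster the final ID assignment would likely be a separate connected-components job.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def map_record(record):
    """Map: emit (blocking_key, record) under several keys per record."""
    first, last, dob = record.split("|")
    day_month, _, year = dob.rpartition(".")
    for token in (first, last):  # either name may sit in either field
        prefix = token[:3].lower()
        yield (f"{day_month}|{prefix}", record)
        yield (f"{year}|{prefix}", record)


def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()


def similar(r1, r2, threshold=0.8):
    """Fuzzy match that also tries the swapped first/last name order."""
    f1, l1, d1 = r1.lower().split("|")
    f2, l2, d2 = r2.lower().split("|")
    straight = ratio(f1, f2) + ratio(l1, l2)
    swapped = ratio(f1, l2) + ratio(l1, f2)
    names = max(straight, swapped) / 2
    return (2 * names + ratio(d1, d2)) / 3 >= threshold


def reduce_block(key, records):
    """Reduce: emit pairs of records in one block judged to be the same user."""
    for r1, r2 in combinations(sorted(set(records)), 2):
        if similar(r1, r2):
            yield (r1, r2)


def assign_ids(records, pairs):
    """Union-find over matched pairs, then number the resulting groups."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    ids = {}
    return {r: ids.setdefault(find(r), len(ids) + 1) for r in records}


def run(records):
    blocks = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in map_record(record):
            blocks[key].append(value)
    pairs = [p for key, recs in blocks.items() for p in reduce_block(key, recs)]
    return assign_ids(records, pairs)
```

Calling `run` on the four sample records groups the Yana/Jana records under one ID and the two Kseniya records under another. The trade-off in this sketch is between block size and recall: coarser keys catch more errors but make the pairwise comparison inside each reducer more expensive.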