You need to compare 2 large lists of numbers and highlight different numbers.

Given:

There are 5-6 * .csv files with lists of mobile numbers like 12345678912. There are about 1 million numbers in each file in one column. Those. only 5-6 million numbers.

Need to:

1) Combine these files into one. Excel does not allow to add more than 1 million lines to one file. In which program can this be done? And you also need to remove duplicate numbers there. 2) There will be a 2nd file with approximately the same number of numbers. It is necessary to compare these 2 files by approximately 6 million numbers and form the 3rd file (or a list with numbers) where there will be only numbers that are not in the 1st file.

I am not a programmer and I don’t understand the code. I ask you to prompt the program (Microsoft Access, something from SQL, etc.) with the help of which it can be done and instructions how.

It is also possible someone knows the services, where it can be done online?

2 people download 2 different files in one session, without the ability to view or download someone else's file. They are compared, they see the result and the second person sends the 3rd file (with the result of unique values) to the 1st person.

Thanks in advance for your advice and guidance.

  • In principle, yes, if you merge everything into one database (Access, MS SQL Server or another - I don’t know what is being used on the server where the files are loaded), then this information is obtained from the database with a single SQL query. - Ella Svetlaya

2 answers 2

The easiest way is to save these lists in a text file: one line - one record.

And then process them using command line utilities. In Linux and Mac it is out of the box. In Windows, find and install. Or if there is an opportunity to use any hosting on Linux, where there is an opportunity to log in via SSH.

uniq utility

Call

  uniq -u filename.txt > myfile.txt 

Displays in the file myfile.txt * only unique lines from the file filename.txt

Thus, problem number one is solved.

After that, merge your file with the resulting file into one. And you do again the search for unique lines:

 uniq -u summary.txt > delta.txt 

In the delta.txt file you get unique strings. But there are lines that are in the first file, but which are not in the second, and vice versa.

Then merge this file with your original one - and once again

uniq -u summ.txt> newfile.txt

You get a new list in the file newfile.txt , which includes your old numbers and those numbers from another file that you didn’t have before.

    The limitations of MS Excel in handling large csv files can be circumvented using the R language and the RStudio environment. I cite the complete code snippet for solving the problem:

     # Работающий код на примере трех файлов: # a.csv, b.csv, c.csv - файлы для демонстрации работы кода для 1-ой части вопроса; # в результате обработки a.csv и b.csv получим файл содержащий только уникальные значения # z.csv - файл для демонстрации работы кода для 2-ой части вопроса # # Код очень примитивен # подключаем библиотеку для манипуляции данными library(dplyr) # создаем общий файл из нескольких # создаем Таблицы данных с содержимым файлов file_a <- "D:/Tmp/R-cmprFls/a.csv" df_temp_a <- read.csv(file_a, head=TRUE, sep="|", dec=",") file_b <- "D:/Tmp/R-cmprFls/b.csv" df_temp_b <- read.csv(file_b, head=TRUE, sep="|", dec=",") file_c <- "D:/Tmp/R-cmprFls/c.csv" df_temp_c <- read.csv(file_c, head=TRUE, sep="|", dec=",") # для добавления количества обрабатываемых файлов # добавляем строки file_xxx, df_temp_xxx # объединяем все Таблицы данных в одну join_df_temp <- bind_rows(df_temp_a, df_temp_b, df_temp_c) # для определения уникальных строк в Таблице данных используем группировку # и ссумирование по полю Number # добавляем поле Cnt - для подсчета одинаковых Number mt_df_temp <- mutate(join_df_temp, Cnt = 1) # группируем по полю Number и суммируем по полю Cnt gr_df_temp <- group_by(mt_df_temp, Number) sm_df_temp <- summarise(gr_df_temp, Cnt = sum(Cnt)) # уникальными являются строки со значением поля Cnt = 1 rslt_1_df_temp <- filter(sm_df_temp, Cnt == 1) # сохраняем результат для 1-ой части вопроса write.table(rslt_1_df_temp, file = "D:/Tmp/R-cmprFls/rslt_1_df_temp.csv", row.names = FALSE) # 2-я часть. Объединяем результат 1-ой части с эталонным файлом # создаем Таблицу даных на основании z.csv file_z <- "D:/Tmp/R-cmprFls/z.csv" df_temp_z <- read.csv(file_z, head=TRUE, sep="|", dec=",") mt_df_temp_z <- mutate(df_temp_z, Cnt=1) # создаем результирующую Таблицу данных в соответствии с требованиями # для 2-ой части вопроса rslt_2_df_temp <- anti_join(mt_df_temp_z, rslt_1_df_temp) # сохраняем результат write.table(rslt_2_df_temp, file = "D:/Tmp/R-cmprFls/rslt_2_df_temp.csv", row.names = FALSE) 

    Data files

    a.csv

     Number 1111111111 2222222222 3333333333 4444444444 5555555555 6666666666 

    b.csv

     Number 1111111111 2222222222 3333333333 7777777777 8888888888 9999999999 

    c.csv

     Number 1111111111 2222222222 3333333333 7777777777 8888888888 9999999999 

    z.csv

     Number 1111111111 2222222222 4444444444 5555555555 7777777777 8080808080 9090909090 

    results:

    rslt_1_df_temp

     "Number" "Cnt" 4444444444 1 5555555555 1 6666666666 1 

    rslt_2_df_temp

     "Number" "Cnt" 8080808080 1 9090909090 1 2222222222 1 1111111111 1 7777777777 1