Please help me find a solution.

There are two tables with data.

One table has 30 thousand rows, the other has 32 thousand rows.

The data represent two time series. I would like to identify the extra rows and remove them from the combined sample.

In other words, each row of one table has to be compared with the rows of the other table, and any row that is not found in the other table has to be deleted.

    3 answers

    Since you did not provide sample data, this solution is untested:

    dataset <- rbind(dataset1, dataset2)
    dataset <- dataset[duplicated(dataset), ]

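    That is, we keep only the duplicated rows, which are the rows present in both tables, provided neither table repeats a row internally. Both tables must have the same columns, since rbind() requires that. You can also specify which columns to check for duplicates: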

     duplicated(dataset[, cols]) 
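As a minimal sketch of the column-restricted variant, assuming two small made-up tables that share a key column called `time` (the data and the column name are hypothetical, not taken from the question):

```r
# Toy data (hypothetical): two tables sharing the key column `time`
dataset1 <- data.frame(time = c(1, 2, 3, 4, 5), value = c(10, 20, 30, 40, 50))
dataset2 <- data.frame(time = c(2, 4, 5, 7), value = c(20, 40, 50, 70))

cols <- "time"  # columns that identify a row (assumed name)

dataset <- rbind(dataset1, dataset2)

# duplicated() flags the second and later occurrences of a key;
# drop = FALSE keeps the selection a data frame even for a single column
common <- dataset[duplicated(dataset[, cols, drop = FALSE]), ]
```

Because duplicated() flags only the later occurrence of each key, the rows kept in `common` are the copies that came from `dataset2`.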

      This sounds like a task for merge() or dplyr::inner_join().


      Example

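       library(dplyr)
       # two toy tables with partially overlapping ids
       one <- data.frame(id = sample(letters, size = 20, replace = FALSE), value = rnorm(20))
       two <- data.frame(id = sample(letters, size = 20, replace = FALSE), value = rnorm(20))
       # keep only the rows whose id occurs in both tables
       joined <- inner_join(one, two, by = "id")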
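For comparison, here is a rough sketch of the same join with base R's merge(), which the answer mentions but does not show. The toy tables mirror the ones above, and set.seed() is added only so the illustration is reproducible:

```r
set.seed(1)  # only so the toy data is reproducible
one <- data.frame(id = sample(letters, size = 20), value = rnorm(20))
two <- data.frame(id = sample(letters, size = 20), value = rnorm(20))

# By default (all = FALSE) merge() keeps only ids present in both tables;
# the clashing `value` columns receive the suffixes .x and .y
joined <- merge(one, two, by = "id")
```

The result has one row per shared id, with `value.x` coming from `one` and `value.y` from `two`.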

      If the goal is not to combine the two data sets but only to filter one of them, you can join that dataset with just the id column of the other. For example, to keep only the rows of dataset two whose id also appears in one:

       two_filtered <- inner_join(two, one %>% select(id), by='id') 
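If a join feels heavier than needed for pure filtering, the same result can probably be had with a plain logical index in base R. This is a sketch under the assumption that `id` alone identifies matching rows; the toy tables are illustrative:

```r
set.seed(1)  # only so the toy data is reproducible
one <- data.frame(id = sample(letters, size = 20), value = rnorm(20))
two <- data.frame(id = sample(letters, size = 20), value = rnorm(20))

# Keep the rows of `two` whose id also appears in `one`
two_filtered <- two[two$id %in% one$id, ]
```

Unlike a join, this never multiplies rows of `two` when an id happens to be repeated in `one`.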

        An even better solution is dplyr::intersect(), which returns the rows that occur in both tables, comparing all columns:

         library(dplyr)
         joined <- intersect(one, two)
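One thing worth checking before relying on intersect() is that it matches whole rows, not just ids. The sketch below uses two tiny hypothetical tables to show the difference from matching on `id` alone, here done with dplyr's semi_join():

```r
library(dplyr)

# Toy tables (hypothetical): ids b and c occur in both,
# but only b carries the same value in both
one <- data.frame(id = c("a", "b", "c"), value = c(1, 2, 3))
two <- data.frame(id = c("b", "c", "d"), value = c(2, 30, 4))

# A row must match in every column to be kept
common_rows <- intersect(one, two)

# Matching on id alone keeps both b and c from `two`
common_ids <- semi_join(two, one, by = "id")
```

With random `value` columns such as those generated by rnorm(), a whole-row comparison would most likely return no rows at all, so matching on the key column may be the safer choice for time series whose values can differ between tables.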