How to create a subset without a specific item?

Question

Hello. How can I create a subset that excludes a specific item? For example:

    > a1<-c(1:5)
    > a2<-c("bed","pillow","sleep","bed","pillow")
    > a3<-c(6:10)
    > df<-data.frame(a1,a2,a3)
    > df
      a1     a2 a3
    1  1    bed  6
    2  2 pillow  7
    3  3  sleep  8
    4  4    bed  9
    5  5 pillow 10

I would like to get a subset based on the second column, but without the "pillow" rows. The NOT operator (!) does not work. How do I use it correctly here? Or are there other commands that can create such a subset?

NOT seems to work for me: subset(df, a2 != 'pillow') - Evgenii Izhboldin

Answer 1 · 2016-11-16T08:43:11

Another way, and in my opinion quite a convenient one, is to subset with [ (square brackets).

For example: df[df$a2 != 'pillow',]
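The bracket form can be unpacked into two steps, which may also show why a bare ! fails in the question. This is only a sketch: the asker's exact attempt is not shown, so the failing call below is an assumption, and keep, attempt and result are illustrative names.

```r
df <- data.frame(a1 = 1:5,
                 a2 = c("bed", "pillow", "sleep", "bed", "pillow"),
                 a3 = 6:10,
                 stringsAsFactors = FALSE)

# Assumed failing attempt: `!` negates logical values, so applying it
# directly to a character column raises an error instead of filtering.
attempt <- tryCatch(!df$a2, error = function(e) conditionMessage(e))

# Step 1: `!=` compares element-wise and returns a logical vector.
keep <- df$a2 != "pillow"

# Step 2: the logical vector selects rows; the empty slot after the
# comma keeps all columns.
result <- df[keep, ]
result
```

The same logical vector can be reused for other columns or combined with further conditions using & and |.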

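Note that using subset() is not recommended in certain cases. Its help page carries this warning:

Warning
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.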
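One way such a consequence can show up is a name clash between a variable and a column. The sketch below is a constructed illustration, not taken from the original answer; the threshold variable a3 is deliberately given the same name as a column.

```r
df <- data.frame(a1 = 1:5,
                 a2 = c("bed", "pillow", "sleep", "bed", "pillow"),
                 a3 = 6:10,
                 stringsAsFactors = FALSE)

# A threshold that happens to share its name with a column of df.
a3 <- 3

# subset() looks names up inside df first, so `a3` here is the column:
# the test becomes a1 > column a3, which is FALSE for every row.
nse_result <- subset(df, a1 > a3)

# With [, `a3` is the ordinary variable defined above: a1 > 3.
bracket_result <- df[df$a1 > a3, ]

nrow(nse_result)      # 0
nrow(bracket_result)  # 2
```

Inside functions, where variable names are harder to control, the bracket form avoids this ambiguity.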

The result is the same, but [ is slightly faster than subset() (on certain data sets):

    Unit: microseconds
    expr                              min      lq      mean   median      uq     max neval
    { subset(df, a2 != "pillow") } 98.621 102.191 114.19937 104.4220 107.323 463.205   100
    { df[df$a2 != "pillow", ] }    73.631  76.532  88.98195  78.0935  80.325 380.202   100
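The claim that the result is the same can be checked directly. The sketch below also shows one case where the two forms are assumed to differ, namely a missing value in the filtered column; df_na is an illustrative name and the NA is introduced only for the demonstration.

```r
df <- data.frame(a1 = 1:5,
                 a2 = c("bed", "pillow", "sleep", "bed", "pillow"),
                 a3 = 6:10,
                 stringsAsFactors = FALSE)

# On the data from the question both forms return the same rows.
same <- isTRUE(all.equal(subset(df, a2 != "pillow"),
                         df[df$a2 != "pillow", ]))

# With a missing value in a2 the comparison yields NA for that row.
df_na <- df
df_na$a2[3] <- NA

# subset() treats NA as FALSE and drops the row.
n_subset <- nrow(subset(df_na, a2 != "pillow"))

# [ keeps an NA index as a row filled with NA values.
n_bracket <- nrow(df_na[df_na$a2 != "pillow", ])

c(same = same, n_subset = n_subset, n_bracket = n_bracket)
```

If the column may contain missing values, wrapping the condition in which() makes the bracket form drop those rows as well.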

Answer 2 · 2016-11-27T14:14:41

I will add a solution using dplyr, since this package provides, in my opinion, the best combination of ease of writing and reading code and speed of execution.

So, the solution is as follows. By the way, you can create the data directly with the data_frame() function.

    library(tidyverse)
    df <- data_frame(a1 = 1:5,
                     a2 = c("bed","pillow","sleep","bed","pillow"),
                     a3 = 6:10)
    df_sub <- df %>% filter(! a2 == 'pillow')
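The condition ! a2 == 'pillow' works because in R the unary ! binds less tightly than ==, so the comparison is evaluated first and then negated. A small base R check of that reading (implicit, explicit and direct are illustrative names):

```r
a2 <- c("bed", "pillow", "sleep", "bed", "pillow")

# As written in the filter() call: parsed as !(a2 == "pillow").
implicit <- !a2 == "pillow"

# The same condition with the grouping spelled out.
explicit <- !(a2 == "pillow")

# The equivalent single comparison.
direct <- a2 != "pillow"

identical(implicit, explicit) && identical(implicit, direct)
```

Writing a2 != 'pillow' inside filter() expresses the same test and may read more clearly.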

Check speed

    library(microbenchmark)
    microbenchmark(
      subset(df, a2 != "pillow"),
      df[df$a2 != 'pillow',],
      df %>% filter(! a2 == 'pillow'),
      filter(df, ! a2 == 'pillow')
    )

    Unit: microseconds
    expr                               min       lq      mean   median       uq      max neval
    subset(df, a2 != "pillow")      61.009  73.9805  84.59654  80.3945  91.5130  168.200   100
    df[df$a2 != "pillow", ]         61.863  72.6970  88.89838  83.8160  96.0740  314.732   100
    df %>% filter(!a2 == "pillow") 310.456 339.1070 381.22235 364.0520 404.2485  579.290   100
    filter(df, !a2 == "pillow")    228.353 249.7340 300.94876 267.4090 314.1625 1601.882   100

At first glance dplyr loses a lot in speed, but this is due to the very small size of the data frame. Also note that the very convenient pipe operator %>% adds overhead of its own: in the timings above, df %>% filter(...) is slower than the direct filter(df, ...) call.

Let's try to increase the data a thousand times and compare the speed again.

    df2 <- data_frame(a1 = rep(1:5, 1e3),
                      a2 = rep(c("bed","pillow","sleep","bed","pillow"), 1e3),
                      a3 = rep(6:10, 1e3))
    microbenchmark(
      subset(df2, a2 != "pillow"),
      df2[df2$a2 != 'pillow',],
      df2 %>% filter(! a2 == 'pillow'),
      filter(df2, ! a2 == 'pillow')
    )

    Unit: microseconds
    expr                                min       lq     mean   median       uq      max neval
    subset(df2, a2 != "pillow")     128.858 140.9740 246.4695 146.9615 169.1980 1857.316   100
    df2[df2$a2 != "pillow", ]       115.459 126.2925 188.3296 137.2685 159.5045 1587.627   100
    df2 %>% filter(!a2 == "pillow") 343.525 358.3505 384.0674 376.1680 392.7025  578.149   100
    filter(df2, !a2 == "pillow")    265.128 278.3845 375.8885 293.3515 322.5720 3016.463   100

The difference is still not in favor of dplyr. Let's increase the source dataset a million times and compare the speed again.

    df3 <- data_frame(a1 = rep(1:5, 1e6),
                      a2 = rep(c("bed","pillow","sleep","bed","pillow"), 1e6),
                      a3 = rep(6:10, 1e6))
    microbenchmark(
      subset(df3, a2 != "pillow"),
      df3[df3$a2 != 'pillow',],
      df3 %>% filter(! a2 == 'pillow'),
      filter(df3, ! a2 == 'pillow')
    )

    Unit: milliseconds
    expr                                  min        lq     mean    median       uq      max neval
    subset(df3, a2 != "pillow")     129.24171 153.90437 193.8962 200.42959 214.1032 289.8579   100
    df3[df3$a2 != "pillow", ]        88.93557 106.71604 123.6339 112.33160 143.0285 197.5125   100
    df3 %>% filter(!a2 == "pillow")  73.45358  87.98082 103.9721  92.90250 107.3194 169.9390   100
    filter(df3, !a2 == "pillow")     73.26229  88.31038 106.6175  92.96208 108.6734 239.5112   100

And here it is! On a large dataset, dplyr is faster.

You have a typo/error in df[df3$a2 != 'pillow',]: the two data frames are different.
By the way, another interesting observation: on the million-fold dataset (df3), my home machine (i5, 4 GB, Ubuntu 16.04) is faster than my work machine (i7, 8 GB, Windows 7)!
