I will add a solution using dplyr, since in my opinion this package offers the best balance between ease of writing and reading the code and speed of execution.
The solution is as follows. As a side note, you can build the data directly with the data_frame() function (in current tidyverse versions it is deprecated in favour of tibble(), which works the same way here).
library(tidyverse)

df <- data_frame(a1 = 1:5,
                 a2 = c("bed", "pillow", "sleep", "bed", "pillow"),
                 a3 = 6:10)

df_sub <- df %>% filter(!a2 == 'pillow')
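A small aside: filter(!a2 == 'pillow') and filter(a2 != 'pillow') are equivalent, because negating the equality test gives the same logical vector as the "not equal" operator. A minimal sketch confirming this on the data frame above:

# Two equivalent ways to drop the "pillow" rows
df_sub_a <- df %>% filter(!a2 == "pillow")  # negate the equality test
df_sub_b <- df %>% filter(a2 != "pillow")   # use != directly
identical(df_sub_a, df_sub_b)               # TRUE: both keep rows 1, 3 and 4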
Let's check the speed.
library(microbenchmark)

microbenchmark(
  subset(df, a2 != "pillow"),
  df[df$a2 != 'pillow', ],
  df %>% filter(!a2 == 'pillow'),
  filter(df, !a2 == 'pillow')
)

Unit: microseconds
                           expr     min       lq      mean   median       uq      max neval
     subset(df, a2 != "pillow")  61.009  73.9805  84.59654  80.3945  91.5130  168.200   100
       df[df$a2 != "pillow", ]   61.863  72.6970  88.89838  83.8160  96.0740  314.732   100
 df %>% filter(!a2 == "pillow") 310.456 339.1070 381.22235 364.0520 404.2485  579.290   100
    filter(df, !a2 == "pillow") 228.353 249.7340 300.94876 267.4090 314.1625 1601.882   100
At first glance dplyr loses badly on speed, but that is only because the data frame is tiny. Note also that the very convenient pipe operator %>% adds its own overhead: piping into filter() is noticeably slower than calling filter() directly.
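To make that last point concrete, the two dplyr calls in the benchmark do exactly the same work; the pipe simply passes df as the first argument to filter(), but through an extra layer of function calls, which is where the gap between the two dplyr rows above comes from. A minimal illustration, not part of the benchmark itself:

# Equivalent calls: %>% feeds df in as the first argument of filter(),
# and that extra indirection is the pipe's overhead
df %>% filter(a2 != "pillow")
filter(df, a2 != "pillow")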
Let's make the data a thousand times larger and compare the timings again.
df2 <- data_frame(a1 = rep(1:5, 1e3),
                  a2 = rep(c("bed", "pillow", "sleep", "bed", "pillow"), 1e3),
                  a3 = rep(6:10, 1e3))

microbenchmark(
  subset(df2, a2 != "pillow"),
  df2[df2$a2 != 'pillow', ],
  df2 %>% filter(!a2 == 'pillow'),
  filter(df2, !a2 == 'pillow')
)

Unit: microseconds
                            expr     min       lq     mean   median       uq      max neval
     subset(df2, a2 != "pillow") 128.858 140.9740 246.4695 146.9615 169.1980 1857.316   100
       df2[df2$a2 != "pillow", ] 115.459 126.2925 188.3296 137.2685 159.5045 1587.627   100
 df2 %>% filter(!a2 == "pillow") 343.525 358.3505 384.0674 376.1680 392.7025  578.149   100
    filter(df2, !a2 == "pillow") 265.128 278.3845 375.8885 293.3515 322.5720 3016.463   100
The differences are still not in favour of dplyr. Let's make the source dataset a million times larger and compare the timings once more.
df3 <- data_frame(a1 = rep(1:5, 1e6),
                  a2 = rep(c("bed", "pillow", "sleep", "bed", "pillow"), 1e6),
                  a3 = rep(6:10, 1e6))

microbenchmark(
  subset(df3, a2 != "pillow"),
  df3[df3$a2 != 'pillow', ],
  df3 %>% filter(!a2 == 'pillow'),
  filter(df3, !a2 == 'pillow')
)

Unit: milliseconds
                            expr       min        lq     mean    median       uq      max neval
     subset(df3, a2 != "pillow") 129.24171 153.90437 193.8962 200.42959 214.1032 289.8579   100
       df3[df3$a2 != "pillow", ]  88.93557 106.71604 123.6339 112.33160 143.0285 197.5125   100
 df3 %>% filter(!a2 == "pillow")  73.45358  87.98082 103.9721  92.90250 107.3194 169.9390   100
    filter(df3, !a2 == "pillow")  73.26229  88.31038 106.6175  92.96208 108.6734 239.5112   100
And there it is! On large data, dplyr is faster.
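As a quick sanity check, all four approaches should return the same rows, so the timings above compare equivalent work; a minimal sketch:

# df3 has 5 million rows, of which 2 in every 5 are "pillow",
# so each method should keep 3,000,000 rows
r1 <- subset(df3, a2 != "pillow")
r2 <- df3[df3$a2 != "pillow", ]
r3 <- df3 %>% filter(!a2 == "pillow")
r4 <- filter(df3, !a2 == "pillow")
nrow(r1); nrow(r2); nrow(r3); nrow(r4)           # each should print 3000000
all.equal(as.data.frame(r1), as.data.frame(r3))  # expected TRUE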
subset(df, a2 != 'pillow')

- Evgenii Izhboldin