I want to use data.table
to create a function that only keeps rows where the ID column(s) - stored as a vector of strings - are duplicated. Note that where there are multiple ID columns, I only want to keep rows where the combination of ID columns is duplicated.
library(data.table)
dt <- data.table(x = c(1:5,5), y = rep(c(1,3,5), each = 2), z = rep(1:3, 2))
get_duplicate_id_rows1 <- function(dt_in, id_str) {
dt_in[, if(.N > 1) .SD, by = id_str]
}
get_duplicate_id_rows1(dt, c("x", "y"))
#> x y z
#> 1: 5 5 2
#> 2: 5 5 3
get_duplicate_id_rows1(dt[, .(x,y)], c("x", "y"))
#> Empty data.table (0 rows and 2 cols): x,y
As above, my first attempt works when the data table has one non-ID column. However, when all of the columns are ID columns, then the data table has no rows. I think this is because, as per ?data.table
, .SD
includes all variables of the original data table, except the grouping rows. Consequently, .SD
has zero columns, which seems to be causing my issue.
get_duplicate_id_rows2 <- function(dt_in, id_str) {
dt_in[, if(.N > 1) .SD, by = id_str, .SDcols = names(dt_in)]
}
get_duplicate_id_rows2(dt, c("x", "y"))
#> x y x y z
#> 1: 5 5 5 5 2
#> 2: 5 5 5 5 3
get_duplicate_id_rows2(dt[, .(x,y)], c("x", "y"))
#> x y x y
#> 1: 5 5 5 5
#> 2: 5 5 5 5
My second attempt tries to circumvent my issues with my first attempt by using .SDcols
. This does resolve the issue where all the columns in my data table are ID columns. However, here the column names in id_str
are duplicated.
I think this is because one set of column names comes from the by
argument and the other set of column names comes from .SDcols
, although I'm not certain about this, because in my first attempt, the resultant data table had zero rows, not zero columns.
因此,我很想了解这里发生的事情,以及对我的问题最有效的解决方案是什么-特别是对于大型数据集,这就是为什么我从tidyverse转到data.table的原因。
Created on 2020-04-09 by the reprex package (v0.3.0)
这是另一种选择:
In the example, when
id_str = c("x", "y", "z")
indt[, if(.N > 1) .SD, id_str]
each row is distinct by itself and hence there are no rows with.N > 1
.By default, when
by
is used, these are also returned as the leftmost columns in the output, hence inget_duplicate_id_rows2
, you see x, y and then columns from.SD
as specified in.SDcols
.Lastly, regarding efficiency, you can time the various options posted here using
microbenchmark
with your actual dataset and share your results.We can use
.I
to get the index of groups with frequency count greater than 1, extract the column and subset the data.tableNOTE: It should be faster than
.SD