R data.table-仅保留具有重复ID的行(最有效的解决方案)

I want to use data.table to create a function that only keeps rows where the ID column(s) - stored as a vector of strings - are duplicated. Note that where there are multiple ID columns, I only want to keep rows where the combination of ID columns is duplicated.


dt <- data.table(x = c(1:5,5), y = rep(c(1,3,5), each = 2), z = rep(1:3, 2))

get_duplicate_id_rows1 <- function(dt_in, id_str) {
  dt_in[, if(.N > 1) .SD, by = id_str]

get_duplicate_id_rows1(dt, c("x", "y"))
#>    x y z
#> 1: 5 5 2
#> 2: 5 5 3

get_duplicate_id_rows1(dt[, .(x,y)], c("x", "y"))
#> Empty data.table (0 rows and 2 cols): x,y

As above, my first attempt works when the data table has one non-ID column. However, when all of the columns are ID columns, then the data table has no rows. I think this is because, as per ?data.table, .SD includes all variables of the original data table, except the grouping rows. Consequently, .SD has zero columns, which seems to be causing my issue.

get_duplicate_id_rows2 <- function(dt_in, id_str) {
  dt_in[, if(.N > 1) .SD, by = id_str, .SDcols = names(dt_in)]

get_duplicate_id_rows2(dt, c("x", "y"))
#>    x y x y z
#> 1: 5 5 5 5 2
#> 2: 5 5 5 5 3

get_duplicate_id_rows2(dt[, .(x,y)], c("x", "y"))
#>    x y x y
#> 1: 5 5 5 5
#> 2: 5 5 5 5

My second attempt tries to circumvent my issues with my first attempt by using .SDcols. This does resolve the issue where all the columns in my data table are ID columns. However, here the column names in id_str are duplicated.

I think this is because one set of column names comes from the by argument and the other set of column names comes from .SDcols, although I'm not certain about this, because in my first attempt, the resultant data table had zero rows, not zero columns.


Created on 2020-04-09 by the reprex package (v0.3.0)