R data.table-仅保留具有重复ID的行(最有效的解决方案)

I want to use data.table to create a function that only keeps rows where the ID column(s) - stored as a vector of strings - are duplicated. Note that where there are multiple ID columns, I only want to keep rows where the combination of ID columns is duplicated.

library(data.table)

dt <- data.table(x = c(1:5,5), y = rep(c(1,3,5), each = 2), z = rep(1:3, 2))

get_duplicate_id_rows1 <- function(dt_in, id_str) {
  dt_in[, if(.N > 1) .SD, by = id_str]
}

get_duplicate_id_rows1(dt, c("x", "y"))
#>    x y z
#> 1: 5 5 2
#> 2: 5 5 3

get_duplicate_id_rows1(dt[, .(x,y)], c("x", "y"))
#> Empty data.table (0 rows and 2 cols): x,y

As above, my first attempt works when the data table has one non-ID column. However, when all of the columns are ID columns, then the data table has no rows. I think this is because, as per ?data.table, .SD includes all variables of the original data table, except the grouping rows. Consequently, .SD has zero columns, which seems to be causing my issue.

get_duplicate_id_rows2 <- function(dt_in, id_str) {
  dt_in[, if(.N > 1) .SD, by = id_str, .SDcols = names(dt_in)]
}

get_duplicate_id_rows2(dt, c("x", "y"))
#>    x y x y z
#> 1: 5 5 5 5 2
#> 2: 5 5 5 5 3

get_duplicate_id_rows2(dt[, .(x,y)], c("x", "y"))
#>    x y x y
#> 1: 5 5 5 5
#> 2: 5 5 5 5

My second attempt tries to circumvent my issues with my first attempt by using .SDcols. This does resolve the issue where all the columns in my data table are ID columns. However, here the column names in id_str are duplicated.

I think this is because one set of column names comes from the by argument and the other set of column names comes from .SDcols, although I'm not certain about this, because in my first attempt, the resultant data table had zero rows, not zero columns.

因此,我很想了解这里发生的事情,以及对我的问题最有效的解决方案是什么-特别是对于大型数据集,这就是为什么我从tidyverse转到data.table的原因。

Created on 2020-04-09 by the reprex package (v0.3.0)