R data.table-仅保留具有重复ID的行(最有效的解决方案)

I want to use data.table to create a function that only keeps rows where the ID column(s) - stored as a vector of strings - are duplicated. Note that where there are multiple ID columns, I only want to keep rows where the combination of ID columns is duplicated.


dt <- data.table(x = c(1:5,5), y = rep(c(1,3,5), each = 2), z = rep(1:3, 2))

get_duplicate_id_rows1 <- function(dt_in, id_str) {
  dt_in[, if(.N > 1) .SD, by = id_str]

get_duplicate_id_rows1(dt, c("x", "y"))
#>    x y z
#> 1: 5 5 2
#> 2: 5 5 3

get_duplicate_id_rows1(dt[, .(x,y)], c("x", "y"))
#> Empty data.table (0 rows and 2 cols): x,y

As above, my first attempt works when the data table has one non-ID column. However, when all of the columns are ID columns, then the data table has no rows. I think this is because, as per ?data.table, .SD includes all variables of the original data table, except the grouping rows. Consequently, .SD has zero columns, which seems to be causing my issue.

get_duplicate_id_rows2 <- function(dt_in, id_str) {
  dt_in[, if(.N > 1) .SD, by = id_str, .SDcols = names(dt_in)]

get_duplicate_id_rows2(dt, c("x", "y"))
#>    x y x y z
#> 1: 5 5 5 5 2
#> 2: 5 5 5 5 3

get_duplicate_id_rows2(dt[, .(x,y)], c("x", "y"))
#>    x y x y
#> 1: 5 5 5 5
#> 2: 5 5 5 5

My second attempt tries to circumvent my issues with my first attempt by using .SDcols. This does resolve the issue where all the columns in my data table are ID columns. However, here the column names in id_str are duplicated.

I think this is because one set of column names comes from the by argument and the other set of column names comes from .SDcols, although I'm not certain about this, because in my first attempt, the resultant data table had zero rows, not zero columns.


Created on 2020-04-09 by the reprex package (v0.3.0)

  • 嘘!安静
    嘘!安静 回复


    dt[dt[rowid(x, y) > 1], on=.(x, y)]

    In the example, when id_str = c("x", "y", "z") in dt[, if(.N > 1) .SD, id_str] each row is distinct by itself and hence there are no rows with .N > 1.

    By default, when by is used, these are also returned as the leftmost columns in the output, hence in get_duplicate_id_rows2, you see x, y and then columns from .SD as specified in .SDcols.

    Lastly, regarding efficiency, you can time the various options posted here using microbenchmark with your actual dataset and share your results.

  • lquia
    lquia 回复

    We can use .I to get the index of groups with frequency count greater than 1, extract the column and subset the data.table

    dt[dt[, .I[.N >1], .(x, y)]$V1]

    NOTE: It should be faster than .SD