Using R to fill NA values based on a condition

Hello, I have a data frame such as

> tab
   COL1                          COL2 COL3 COL4
1    G1     SEQ1_10-67_-__Canis_lupus    A    B
2    G1    SEQ4.1_90-345_-__Elpah_bis    C    D
3    G1  SEQA.B_34-50_-__Felis_cattus <NA> <NA>
4    G1  SEQA.B_34-50_-__Felis_cattus <NA> <NA>
5    G1 SEQA.A-2-BICs_-__Felis_cattus    E    F
6    G1  SEQA.A_10-30_-__Felis_cattus <NA> <NA>
7    G1  SEQA.A_34-50_-__Felis_cattus <NA> <NA>
8    G1 SEQA.B-2-BICs_-__Felis_cattus    L    P
9    G2  SEQA.A_60-79_-__Felis_cattus    K    L
10   G2  SEQA.A_34-50_-__Felis_cattus    M    N
11   G2     SEQ3_10-67_-__Lupus_lupus    O    P

> dput(tab)
structure(list(COL1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), COL2 = structure(c(1L, 
3L, 8L, 8L, 7L, 4L, 5L, 9L, 6L, 5L, 2L), .Label = c("SEQ1_10-67_-__Canis_lupus", 
"SEQ3_10-67_-__Lupus_lupus", "SEQ4.1_90-345_-__Elpah_bis", "SEQA.A_10-30_-__Felis_cattus", 
"SEQA.A_34-50_-__Felis_cattus", "SEQA.A_60-79_-__Felis_cattus", 
"SEQA.A-2-BICs_-__Felis_cattus", "SEQA.B_34-50_-__Felis_cattus", 
"SEQA.B-2-BICs_-__Felis_cattus"), class = "factor"), COL3 = structure(c(1L, 
2L, NA, NA, 3L, NA, NA, 5L, 4L, 6L, 7L), .Label = c("A", "C", 
"E", "K", "L", "M", "O"), class = "factor"), COL4 = structure(c(1L, 
2L, NA, NA, 3L, NA, NA, 6L, 4L, 5L, 6L), .Label = c("B", "D", 
"F", "L", "N", "P"), class = "factor")), class = "data.frame", row.names = c(NA, 
-11L))

and I would like to fill the NA values in COL3 and COL4

The idea is, for each COL1 group: if a COL2 value contains the -BICs pattern, I extract the part before the first - and check whether other COL2 values in the same group start with that same part.

In the example, the -BICs pattern appears in row 5: SEQA.A-2-BICs_-__Felis_cattus

The part before the first - is: SEQA.A

Rows 6 and 7 have the same SEQA.A in their strings:

  1. SEQA.A_10-30_-__Felis_cattus
  2. SEQA.A_34-50_-__Felis_cattus

So I copy the COL3 and COL4 values of row 5 to them (likewise, the SEQA.B rows 3 and 4 take their values from row 8, SEQA.B-2-BICs_-__Felis_cattus) and get this expected output:

    COL1 COL2                          COL3 COL4
    G1   SEQ1_10-67_-__Canis_lupus     A    B
    G1   SEQ4.1_90-345_-__Elpah_bis    C    D
    G1   SEQA.B_34-50_-__Felis_cattus  L    P
    G1   SEQA.B_34-50_-__Felis_cattus  L    P
    G1   SEQA.A-2-BICs_-__Felis_cattus E    F
    G1   SEQA.A_10-30_-__Felis_cattus  E    F
    G1   SEQA.A_34-50_-__Felis_cattus  E    F
    G1   SEQA.B-2-BICs_-__Felis_cattus L    P
    G2   SEQA.A_60-79_-__Felis_cattus  K    L
    G2   SEQA.A_34-50_-__Felis_cattus  M    N
    G2   SEQ3_10-67_-__Lupus_lupus     O    P

I also remove the rows that contain the -BICs pattern.

Does anyone have an idea for doing this with dplyr?

Comments
  • 碧鲁翠梅 replied

    We could use fill() from tidyr after grouping by 'COL1' and by a helper column built from the occurrence of the '-BICs' substring in 'COL2' (a cumulative sum, so each '-BICs' row starts a new block). fill() then replaces the NA values in 'COL3' and 'COL4' with the previous non-NA element; after that we ungroup and filter out the rows with '-BICs' in 'COL2'.

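    library(dplyr)
    library(tidyr)
    library(stringr)
    tab %>% 
       # start a new block at each row whose COL2 contains "-BICs"
       group_by(COL1, grp = cumsum(str_detect(COL2, "-BICs"))) %>% 
       # carry the last non-NA COL3/COL4 values downward within each block
       fill(c(COL3, COL4)) %>%
       ungroup() %>% 
       # drop the "-BICs" rows and the helper column
       filter(!str_detect(COL2, "-BICs")) %>%
       select(-grp)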
    # A tibble: 7 x 4
    #  COL1  COL2                         COL3  COL4 
    #  <fct> <fct>                        <fct> <fct>
    #1 G1    SEQ1_10-67_-__Canis_lupus    A     B    
    #2 G1    SEQ4.1_90-345_-__Elpah_bis   C     D    
    #3 G1    SEQA.A_10-30_-__Felis_cattus E     F    
    #4 G1    SEQA.A_34-50_-__Felis_cattus E     F    
    #5 G2    SEQA.A_60-79_-__Felis_cattus K     L    
    #6 G2    SEQA.A_34-50_-__Felis_cattus M     N    
    #7 G2    SEQ3_10-67_-__Lupus_lupus    O     P
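
Note that fill() propagates values downward by row position, not by name, so rows that sit above their matching -BICs row (such as the two SEQA.B rows in the question's data) would not pick up that row's values. Below is a minimal sketch of an alternative that matches on the prefix itself. It assumes that the prefix is everything before the first `-` or `_`, that there is at most one -BICs row per COL1 group and prefix, and that only NA cells should be overwritten. The names `prefix`, `is_bics`, `donors`, `COL3_bics` and `COL4_bics` are helper names made up for this illustration.

```r
library(dplyr)
library(stringr)

# The question's data, rebuilt with plain character columns
tab <- data.frame(
  COL1 = c(rep("G1", 8), rep("G2", 3)),
  COL2 = c("SEQ1_10-67_-__Canis_lupus", "SEQ4.1_90-345_-__Elpah_bis",
           "SEQA.B_34-50_-__Felis_cattus", "SEQA.B_34-50_-__Felis_cattus",
           "SEQA.A-2-BICs_-__Felis_cattus", "SEQA.A_10-30_-__Felis_cattus",
           "SEQA.A_34-50_-__Felis_cattus", "SEQA.B-2-BICs_-__Felis_cattus",
           "SEQA.A_60-79_-__Felis_cattus", "SEQA.A_34-50_-__Felis_cattus",
           "SEQ3_10-67_-__Lupus_lupus"),
  COL3 = c("A", "C", NA, NA, "E", NA, NA, "L", "K", "M", "O"),
  COL4 = c("B", "D", NA, NA, "F", NA, NA, "P", "L", "N", "P"),
  stringsAsFactors = FALSE
)

# Convert factors to character (harmless if already character), then
# tag each row with its prefix and whether it is a "-BICs" row
tab2 <- tab %>%
  mutate(across(everything(), as.character),
         prefix  = str_extract(COL2, "^[^_-]+"),  # assumption: prefix ends at first "-" or "_"
         is_bics = str_detect(COL2, "-BICs"))

# One donor row per (COL1, prefix), taken from the "-BICs" rows
donors <- tab2 %>%
  filter(is_bics) %>%
  distinct(COL1, prefix, .keep_all = TRUE) %>%
  select(COL1, prefix, COL3_bics = COL3, COL4_bics = COL4)

# Fill only the NA cells from the matching donor, then drop the "-BICs" rows
result <- tab2 %>%
  left_join(donors, by = c("COL1", "prefix")) %>%
  mutate(COL3 = coalesce(COL3, COL3_bics),
         COL4 = coalesce(COL4, COL4_bics)) %>%
  filter(!is_bics) %>%
  select(COL1, COL2, COL3, COL4)

print(result)
```

Because the join is keyed on COL1 as well as the prefix, the SEQA.A rows in G2 are left untouched: there is no -BICs row in that group. If the -BICs rows should be kept, as in the expected output posted in the question, the `filter(!is_bics)` step can simply be dropped.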