Hello, I have a data frame such as
> tab
COL1 COL2 COL3 COL4
1 G1 SEQ1_10-67_-__Canis_lupus A B
2 G1 SEQ4.1_90-345_-__Elpah_bis C D
3 G1 SEQA.B_34-50_-__Felis_cattus <NA> <NA>
4 G1 SEQA.B_34-50_-__Felis_cattus <NA> <NA>
5 G1 SEQA.A-2-BICs_-__Felis_cattus E F
6 G1 SEQA.A_10-30_-__Felis_cattus <NA> <NA>
7 G1 SEQA.A_34-50_-__Felis_cattus <NA> <NA>
8 G1 SEQA.B-2-BICs_-__Felis_cattus L P
9 G2 SEQA.A_60-79_-__Felis_cattus K L
10 G2 SEQA.A_34-50_-__Felis_cattus M N
11 G2 SEQ3_10-67_-__Lupus_lupus O P
> dput(tab)
structure(list(COL1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), COL2 = structure(c(1L,
3L, 8L, 8L, 7L, 4L, 5L, 9L, 6L, 5L, 2L), .Label = c("SEQ1_10-67_-__Canis_lupus",
"SEQ3_10-67_-__Lupus_lupus", "SEQ4.1_90-345_-__Elpah_bis", "SEQA.A_10-30_-__Felis_cattus",
"SEQA.A_34-50_-__Felis_cattus", "SEQA.A_60-79_-__Felis_cattus",
"SEQA.A-2-BICs_-__Felis_cattus", "SEQA.B_34-50_-__Felis_cattus",
"SEQA.B-2-BICs_-__Felis_cattus"), class = "factor"), COL3 = structure(c(1L,
2L, NA, NA, 3L, NA, NA, 5L, 4L, 6L, 7L), .Label = c("A", "C",
"E", "K", "L", "M", "O"), class = "factor"), COL4 = structure(c(1L,
2L, NA, NA, 3L, NA, NA, 6L, 4L, 5L, 6L), .Label = c("B", "D",
"F", "L", "N", "P"), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
and I would like to fill the NA values in COL3 and COL4.
The idea is, for each COL1 group: if a COL2 value contains the -BICs pattern, then I extract the part before the first - and look within the group to see whether other COL2 values start with that same part too.
Here in the example, the pattern -BICs occurs in row 5, SEQA.A-2-BICs_-__Felis_cattus. The part before the first - is SEQA.A, and rows 6 and 7 have the same SEQA.A in their string:
SEQA.A_10-30_-__Felis_cattus
SEQA.A_34-50_-__Felis_cattus
So I add to them the COL3 and COL4 values of row 5
and get an expected output:
COL1 COL2 COL3 COL4
G1 SEQ1_10-67_-__Canis_lupus A B
G1 SEQ4.1_90-345_-__Elpah_bis C D
G1 SEQA.B_34-50_-__Felis_cattus L P
G1 SEQA.B_34-50_-__Felis_cattus L P
G1 SEQA.A-2-BICs_-__Felis_cattus E F
G1 SEQA.A_10-30_-__Felis_cattus E F
G1 SEQA.A_34-50_-__Felis_cattus E F
G1 SEQA.B-2-BICs_-__Felis_cattus L P
G2 SEQA.A_60-79_-__Felis_cattus K L
G2 SEQA.A_34-50_-__Felis_cattus M N
G2 SEQ3_10-67_-__Lupus_lupus O P
and I also remove the row that had the -BICs pattern.
Does anyone have an idea using dplyr?
We could group by 'COL1' and by a grouping column created from the occurrence of the '-BICs' substring in 'COL2', then use fill on 'COL3' and 'COL4' to replace each NA with the previous non-NA element, ungroup, and filter out the rows containing 'BICs' in 'COL2'.
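As a rough sketch of these steps, here is one possible variant rather than a definitive solution. It departs from the description above in two ways, both assumptions on my part: the grouping column (a hypothetical `key`) is the part of 'COL2' before the first `-` or `_`, which assumes sequence names never contain those characters, and `fill` uses `.direction = "downup"` because in the sample data rows 3 and 4 come before their '-BICs' row. The data frame is rebuilt with character columns to keep the example self-contained.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt with character columns for simplicity
tab <- data.frame(
  COL1 = c(rep("G1", 8), rep("G2", 3)),
  COL2 = c("SEQ1_10-67_-__Canis_lupus", "SEQ4.1_90-345_-__Elpah_bis",
           "SEQA.B_34-50_-__Felis_cattus", "SEQA.B_34-50_-__Felis_cattus",
           "SEQA.A-2-BICs_-__Felis_cattus", "SEQA.A_10-30_-__Felis_cattus",
           "SEQA.A_34-50_-__Felis_cattus", "SEQA.B-2-BICs_-__Felis_cattus",
           "SEQA.A_60-79_-__Felis_cattus", "SEQA.A_34-50_-__Felis_cattus",
           "SEQ3_10-67_-__Lupus_lupus"),
  COL3 = c("A", "C", NA, NA, "E", NA, NA, "L", "K", "M", "O"),
  COL4 = c("B", "D", NA, NA, "F", NA, NA, "P", "L", "N", "P"),
  stringsAsFactors = FALSE
)

res <- tab %>%
  # 'key' is a hypothetical helper column: the text before the first '-' or '_'
  mutate(key = sub("[-_].*$", "", COL2)) %>%
  group_by(COL1, key) %>%
  # fill NAs from the nearest non-NA value in the group, downward then upward
  fill(COL3, COL4, .direction = "downup") %>%
  ungroup() %>%
  # drop the rows that carried the '-BICs' pattern
  filter(!grepl("-BICs", COL2, fixed = TRUE)) %>%
  select(-key)

res
```

Note that `fill` copies from any non-NA row sharing the same 'COL1' and `key`, not only from the '-BICs' row. In the sample data that makes no difference, but if a group could contain NA rows without a matching '-BICs' row, a join against the '-BICs' rows would be the stricter choice.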