How do I search for and collapse duplicate rows in a data frame in R?

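I'm using R to work with RNA-seq data, which I'm new to. I'm using a reference data frame from BioMart, and when GO terms are included the rows are laid out very awkwardly (see below).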

head(goZref)
      Gene.stable.ID Transcript.stable.ID  Protein.stable.ID
1 ENSDARG00000063344   ENSDART00000131829 ENSDARP00000123357
2 ENSDARG00000063344   ENSDART00000131829 ENSDARP00000123357
3 ENSDARG00000063344   ENSDART00000144883 ENSDARP00000114467
4 ENSDARG00000063344   ENSDART00000144883 ENSDARP00000114467
5 ENSDARG00000097685   ENSDART00000156963 ENSDARP00000128236
6 ENSDARG00000097685   ENSDART00000156963 ENSDARP00000128236
                                                            Gene.description         Gene.name WikiGene.name
1 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
2 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
3 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
4 family with sequence similarity 162 member A [Source:NCBI gene;Acc:336363]           fam162a       fam162a
5                      si:ch211-235i11.3 [Source:ZFIN;Acc:ZDB-GENE-131125-9] si:ch211-235i11.3  LOC101885363
6                      si:ch211-235i11.3 [Source:ZFIN;Acc:ZDB-GENE-131125-9] si:ch211-235i11.3  LOC101885363
                                                       GO.term.name
1                                                          membrane
2                                    integral component of membrane
3                                                          membrane
4                                    integral component of membrane
5                                              nucleic acid binding
6 RNA polymerase II regulatory region sequence-specific DNA binding

I want to annotate a data frame of genes of interest (the gene names are in a character vector called genes here), but I'm struggling to automate it given all the repetition and row duplication in the references. I've tried using match but because it only finds the first instance of something I miss out on other rows. I would like to, for instance, search for "fam162a" and get something like "membrane, integral component of membrane", and then automate this for a list of 100 gene names. subset is useful in giving me multiple rows with the same gene name identifier, and I've tried to pass it to ddply but I don't really know what I'm doing and got stuck here:

test<- ddply(.data = goZref, .variables = genes, for (x in genes) {
+ paste(unique(subset(goZref, WikiGene.name==x, select= Go.term.name)), sep = ",")})
Error in parse(text = x) : <text>:1:12: unexpected symbol
1: si:dkey-224k5.13
               ^

Any help is appreciated!
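One possible way to get the collapsed output described above, sketched in base R rather than with ddply. It assumes the lookup column is WikiGene.name (as in the attempt above) and that the GO column is spelled GO.term.name, as in the head() output. The small goZref stand-in and the two-element genes vector are illustrative only, with values copied from the rows shown earlier. The parse error most likely arises because ddply's .variables argument expects column names, so each gene name is parsed as an expression. Note also that paste() needs collapse, not sep, to join the elements of one vector.

```r
# Stand-in for goZref with only the two columns needed here
# (values copied from the head() output; the real data frame has more columns)
goZref <- data.frame(
  WikiGene.name = c("fam162a", "fam162a", "fam162a", "fam162a",
                    "LOC101885363", "LOC101885363"),
  GO.term.name  = c("membrane", "integral component of membrane",
                    "membrane", "integral component of membrane",
                    "nucleic acid binding",
                    "RNA polymerase II regulatory region sequence-specific DNA binding"),
  stringsAsFactors = FALSE
)

# Illustrative genes of interest (in practice, the character vector of ~100 names)
genes <- c("fam162a", "LOC101885363")

# Keep only the reference rows that belong to the genes of interest
hits <- goZref[goZref$WikiGene.name %in% genes,
               c("WikiGene.name", "GO.term.name")]

# One row per gene: unique GO terms joined into a single comma-separated string
collapsed <- aggregate(GO.term.name ~ WikiGene.name, data = hits,
                       FUN = function(x) paste(unique(x), collapse = ", "))

# Named vector in the same order as `genes`; genes with no match become NA
go_by_gene <- collapsed$GO.term.name[match(genes, collapsed$WikiGene.name)]
names(go_by_gene) <- genes

go_by_gene[["fam162a"]]
# "membrane, integral component of membrane"
```

Because match() is applied to the already collapsed table, each gene appears there only once, so the "first instance only" limitation no longer loses rows. The go_by_gene vector can then be attached as a new column to a data frame whose rows follow the order of genes.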
