Creating sentences from a pandas DataFrame

Suppose I have the following dataset:

    pos   sentence_idx      word
    NNS     1.0            Thousands
    IN      1.0            of
    NNS     1.0            demonstrators
    VBP     1.0            have
    VBN     1.0            marched
... ... ... ...
    PRP     47959.0        they
    VBD     47959.0        responded
    TO      47959.0        to
    DT      47959.0        the
    NN      47959.0        attack

I want to build sentences from it (for which I have to use sentence_idx). I can do that with the following code:

sent = []
for i in df['sentence_idx'].unique():
  sent.append([(w,t) for w,t in zip(df[df['sentence_idx'] == i]['word'].values.tolist(),df[df['sentence_idx'] == i]['pos'].values.tolist())])

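But first, it is inefficient (it uses a for loop instead of numpy / pandas functions, and it filters the whole DataFrame twice for every sentence), and second, it looks ugly. How can I do this more efficiently?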

Edit: the result should be a list of sentences, where each element of a sentence is a (word, pos) tuple:

[[('Thousands', 'NNS'),
  ('of', 'IN'),
  ('demonstrators', 'NNS'),
  ('have', 'VBP'),
  ('marched', 'VBN'),
  ('through', 'IN'),
  ('London', 'NNP'),
  ('to', 'TO'),
  ('protest', 'VB'),
  ('the', 'DT'),
  ('war', 'NN'),
  ('in', 'IN'),
  ('Iraq', 'NNP'),
  ('and', 'CC'),
  ('demand', 'VB'),
  ('withdrawal', 'NN'),
  ('British', 'JJ'),
  ('troops', 'NNS'),
  ('from', 'IN'),
  ('that', 'DT'),
  ('country', 'NN'),
  ('.', '.')],
 [('Families', 'NNS'),
  ('of', 'IN'),
  ('soldiers', 'NNS'),
  ('killed', 'VBN'),
  ('in', 'IN'),
  ('the', 'DT'),
  ('conflict', 'NN'),
  ('joined', 'VBD'),
  ('protesters', 'NNS'),
  ('who', 'WP'),
  ('carried', 'VBD'),
  ('banners', 'NNS'),
  ('with', 'IN'),
  ('such', 'JJ'),
  ('slogans', 'NNS'),
  ('as', 'IN'),
  ('"', '``'),
  ('Bush', 'NNP'),
  ('Number', 'NN'),
  ('One', 'CD'),
  ('Terrorist', 'NN'),
  ('and', 'CC'),
  ('Stop', 'VB'),
  ('Bombings', 'NNS'),
  ('.', '.')],...
Comments
  • ut_quo replied:

    This should work:

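    def compute(group):
        # Pair each word with its POS tag within one sentence group
        return list(zip(group['word'], group['pos']))

    # One list of (word, pos) tuples per sentence_idx
    df.groupby('sentence_idx').apply(compute).tolist()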
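
To make the groupby idea easy to try out, here is a minimal, self-contained sketch. It is not the answer's exact code: the tiny `df` below is a made-up sample in the shape of the question's data, and `build_sentences` is a hypothetical helper name. As a variation, it builds the (word, pos) tuples once for the whole frame and then groups them, which avoids calling a Python function per group through `apply`. Whether that is actually faster on the full dataset is an assumption worth timing.

```python
import pandas as pd

# Made-up miniature of the question's frame (two short sentences).
df = pd.DataFrame({
    "pos": ["NNS", "IN", "NNS", "PRP", "VBD"],
    "sentence_idx": [1.0, 1.0, 1.0, 2.0, 2.0],
    "word": ["Thousands", "of", "demonstrators", "they", "responded"],
})


def build_sentences(frame):
    """Return one list of (word, pos) tuples per sentence_idx."""
    # Build every (word, pos) tuple in a single pass over the frame.
    pairs = pd.Series(list(zip(frame["word"], frame["pos"])), index=frame.index)
    # Collect the tuples per sentence; sort=False keeps sentences in order of
    # first appearance, like the loop over unique() in the question.
    return pairs.groupby(frame["sentence_idx"], sort=False).agg(list).tolist()


sentences = build_sentences(df)
print(sentences)
# [[('Thousands', 'NNS'), ('of', 'IN'), ('demonstrators', 'NNS')],
#  [('they', 'PRP'), ('responded', 'VBD')]]
```

Note that a plain `groupby('sentence_idx')` sorts the group keys by default, while the loop in the question follows the order of `unique()`. The two agree when sentence_idx is already increasing, as in the data shown; `sort=False` is used here so the order matches the loop either way.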