How do I parse XML and put each sentence into its own tuple? - Python


I am trying to put each sentence tag into its own tuple, where each tuple holds the entry attribute of every lemma tag in that sentence. So far I can collect all the words into a single list called corpus and remove stop words, but I can't get them grouped into a separate tuple per sentence. I have tried a few approaches without success.

This is my current output:

['Σιδών', 'ἐπί', 'κλέω']

This is the expected output:

[('Σιδών', 'ἐπί'), ('κλέω',)]

This is my code so far:

from xml.etree.ElementTree import parse
from cltk.corpus.greek.beta_to_unicode import Replacer
from cltk.stop.greek.stops import STOPS_LIST

replacer = Replacer()
tree = parse('Achilles Tatius (0532) - Leucippe and Clitophon (001).xml')
root = tree.getroot()
corpus = []
for sentence in root.iter('sentence'):
    for word in sentence.iter('word'):
        for lemma in word.iter('lemma'):
            entry = lemma.get('entry')
            if entry is None:
                entry = replacer.beta_code(word.get('form'))
            corpus.append(entry) 
            if entry in STOPS_LIST:
                corpus.remove(entry)
print(corpus)

Achilles Tatius (0532) - Leucippe and Clitophon (001).xml

<body>
  <sentence id="1" location="1.1.1">
    <word form="*sidw\n" id="1">
      <lemma id="94083" entry="Σιδών" POS="proper" TreeTagger="false" disambiguated="n/a">
        <analysis morph="fem nom/voc sg"/>
      </lemma>
    </word>
    <word form="e)pi\" id="2">
      <lemma id="nlsj126" entry="ἐπί" POS="preposition" TreeTagger="false" disambiguated="n/a">
        <analysis morph="indeclform (prep)"/>
      </lemma>
    </word>
  </sentence>
  <sentence id="2" location="1.1.1">
    <word form="klei/wn" id="7">
      <lemma id="57835" entry="κλέω" POS="verb" disambiguated="0.33" TreeTagger="true">
        <analysis morph="pres part act masc nom sg"/>
      </lemma>
    </word>
    <word form="to\" id="8">
      <lemma id="71882" entry="ὁ" POS="article" TreeTagger="false" disambiguated="n/a">
        <analysis morph="neut nom/voc/acc sg"/>
      </lemma>
    </word>
  </sentence>
</body>
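
For reference, here is a minimal sketch of the grouping being asked for, not a definitive solution. It collects the entries of one sentence into a temporary list and appends that list as a tuple once the sentence is finished. To keep it runnable without cltk, it makes a few assumptions: STOPS is a hypothetical stand-in for STOPS_LIST, the fallback argument is a hypothetical placeholder for replacer.beta_code, the function name sentences_to_tuples is invented for illustration, and SAMPLE is a trimmed copy of the XML above.

```python
from xml.etree.ElementTree import fromstring

# Trimmed copy of the sample XML (analysis children omitted).
# A raw string keeps the backslashes in the beta-code forms intact.
SAMPLE = r'''<body>
  <sentence id="1" location="1.1.1">
    <word form="*sidw\n" id="1"><lemma id="94083" entry="Σιδών"/></word>
    <word form="e)pi\" id="2"><lemma id="nlsj126" entry="ἐπί"/></word>
  </sentence>
  <sentence id="2" location="1.1.1">
    <word form="klei/wn" id="7"><lemma id="57835" entry="κλέω"/></word>
    <word form="to\" id="8"><lemma id="71882" entry="ὁ"/></word>
  </sentence>
</body>'''

# Hypothetical stand-in for cltk's STOPS_LIST (assumption, not the real list).
STOPS = {"ὁ"}


def sentences_to_tuples(root, stops, fallback=lambda form: form):
    """Return one tuple of lemma entries per <sentence> element.

    `fallback` is a placeholder for replacer.beta_code; by default it
    returns the raw form unchanged.
    """
    corpus = []
    for sentence in root.iter("sentence"):
        entries = []  # entries belonging to this sentence only
        for word in sentence.iter("word"):
            for lemma in word.iter("lemma"):
                entry = lemma.get("entry")
                if entry is None:
                    entry = fallback(word.get("form"))
                if entry not in stops:  # skip stop words instead of append + remove
                    entries.append(entry)
        corpus.append(tuple(entries))  # close the sentence as one tuple
    return corpus


corpus = sentences_to_tuples(fromstring(SAMPLE), STOPS)
print(corpus)  # [('Σιδών', 'ἐπί'), ('κλέω',)]
```

Note that a one-element tuple prints with a trailing comma, ('κλέω',), since ('κλέω') on its own is just a string in parentheses. With the real file, the same function could presumably be called as sentences_to_tuples(root, STOPS_LIST, replacer.beta_code).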