I am trying to remove the <u> and <a> tags from all DIV tags that have the class "sf-item" in an HTML source, because they break up the text when scraping from a web URL.
(For this demo I have assigned a sample HTML string to the BeautifulSoup method, but ideally the source would be a web URL.)
So far I have tried using re with the line below, but I am not sure how to specify a condition in re so that it removes only the substrings between <u and /u> that sit inside DIV tags of class sf-item:
data = re.sub('<u.*?u>', '', data)
I also tried removing all <u> and <a> tags from the entire source using the lines below, but somehow it doesn't work. I am unsure how to target the <u> and <a> tags only within DIV tags with class sf-item.
for tag in soup.find_all('u'):
    tag.replaceWith('')
I would appreciate any help in achieving this.
Below is the working sample Python code:
from re import sub
from bs4 import BeautifulSoup
import re
data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""
# data = re.sub('<u.*?u>', '', data) ## This works for this particular string but I cannot use on a web url
# It would solve if I can somehow specify to remove <u> and <a> only within DIV of class sf-item
soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all('u'):
    tag.replaceWith('')
fResult = []
rMessage=soup.findAll("div",{'class':"sf-item"})
for result in rMessage:
    fResult.append(sub("“|.”","","".join(result.contents[0:1]).strip()))
fResult = list(filter(None, fResult))
print(fResult)
The output I get from the above code is
['The rabbit got to the halfway point at', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at']
But I need the following output:
['The rabbit got to the halfway point at here However, it couldnt see the turtle.', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at Link. he would be able to race to the finish line ahead of place, he just kept going.']
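BeautifulSoup has a built-in method for getting the visible text from a tag (i.e. the text that would be displayed when rendered in a browser). Using it on each sf-item div, I get your expected output.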
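A minimal sketch of that approach, assuming the built-in method meant here is get_text() (also reachable as the .text property). The HTML is a shortened copy of the sample from the question, and the loop variable names are only illustrative. Because get_text() collects the text of nested tags too, the <u> and <a> tags never have to be removed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the question.
data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf-item sf-icon"><span class="supporticon is"></span></div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
"""

soup = BeautifulSoup(data, "html.parser")

fResult = []
# class_="sf-item" matches any div whose class list contains "sf-item".
for div in soup.find_all("div", class_="sf-item"):
    # get_text() returns the text of the div and all of its descendants,
    # so the link text inside <u><a>...</a></u> is kept in place.
    text = div.get_text().strip()
    if text:  # skip divs with no visible text, such as the icon div
        fResult.append(text)

print(fResult)
```

The strings still contain the line breaks and doubled spaces from the original markup, so they usually need a whitespace clean-up afterwards.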
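This gives you the right output, but with some extra spaces. If you want to reduce them all to single spaces, you can run each string in fResult through:
fResult = [re.sub(' +', ' ', item) for item in fResult]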
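Note that the pattern ' +' only collapses runs of spaces. If the extracted text also contains newlines or tabs, as it may when the markup spans several lines, a slightly broader pattern can help. The sketch below assumes fResult is a list of strings; collapse_whitespace is a hypothetical helper name, and the two input strings are made up for illustration.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Illustrative input: strings as they might look straight after text extraction.
fResult = [
    "The rabbit got to the halfway point at\n here  However, it couldn't see the turtle.",
    "He was hot and tired   and decided to stop and take a short nap.",
]

fResult = [collapse_whitespace(item) for item in fResult]
print(fResult)
```

Applying the helper per item keeps fResult a list, which matters because re.sub expects a string and not a list.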