I'm scraping content from a dynamic web page: https://www.nytimes.com/search?query=china+COVID-19 I want to get the content of all the news articles (26,783 in total). I can't iterate over pages because on this website you have to click "Show More" to load the next batch of results.
So I'm using webdriver.ActionChains. The code doesn't show any error message, but a new window pops up every few seconds, and it looks like the same page every time. The process seems endless; I interrupted it after 2 hours. I added print(article) to the code, but nothing was displayed. Can anyone help me solve this? Thank you very much!
import time
import requests
from bs4 import BeautifulSoup
import json
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
base = "https://www.nytimes.com"
browser = webdriver.Chrome('C:/chromedriver.exe')
wait = WebDriverWait(browser, 10)
browser.get('https://www.nytimes.com/search?query=china+COVID-19')
myarticle = []
while True:
    try:
        driver = webdriver.Chrome('C:/chromedriver.exe')
        driver.get('https://www.nytimes.com/search?query=china+COVID-19')
        el = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
        webdriver.ActionChains(driver).move_to_element(el).click(el).perform()
    except Exception as e:
        print(e)
        break
soup = BeautifulSoup(browser.page_source,'lxml')
search_results = soup.find('ol', {'data-testid':'search-results'})
links = search_results.find_all('a')
for link in links:
    link_url = link['href']
    response = requests.get(base + link_url)
    soup_link = BeautifulSoup(response.text, 'html.parser')
    scripts = soup_link.find_all('script')
    for script in scripts:
        if 'window.__preloadedData = ' in script.text:
            jsonStr = script.text
            jsonStr = jsonStr.split('window.__preloadedData = ')[-1]
            jsonStr = jsonStr.rsplit(';',1)[0]
            jsonData = json.loads(jsonStr)
            article = []
            for k, v in jsonData['initialState'].items():
                w=1
                try:
                    if v['__typename'] == 'TextInline':
                        article.append(v['text'])
                        #print (v['text'])
                except:
                    continue
            article = [ each.strip() for each in article ]
            article = ''.join([('' if c in string.punctuation else ' ')+c for c in article]).strip()
            print(article)
            myarticle.append(article)
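The "new windows" pop up because you re-create the driver on every iteration of the loop.
Let's go step by step. First, you create a driver here and open the page:
    browser = webdriver.Chrome('C:/chromedriver.exe')
    browser.get('https://www.nytimes.com/search?query=china+COVID-19')
Then, inside the loop, you create another driver on every iteration:
    driver = webdriver.Chrome('C:/chromedriver.exe')
    driver.get('https://www.nytimes.com/search?query=china+COVID-19')
That is why you see a new window each time. Every fresh window loads the first page of results, which still has a "Show More" button, so the loop never hits the exception and never ends, and print(article) is never reached. Note also that the parsing code reads browser.page_source, which comes from the original window where "Show More" was never clicked.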
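To fix this, create the driver once, before the loop, and reuse that same driver inside the loop. Only the iteration part needs to change.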
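A minimal sketch of that fix follows. It is not a drop-in replacement for the loop in the question. The names click_show_more, max_clicks, pause, and main are hypothetical and introduced only for illustration. The sketch clicks the element directly with el.click() instead of going through ActionChains, and it calls webdriver.Chrome() without a path, which assumes a Selenium version that can locate chromedriver itself (or chromedriver on PATH).

```python
import time

SHOW_MORE_XPATH = '//button[@type="button"][contains(.,"Show More")]'


def click_show_more(driver, max_clicks=50, pause=2.0):
    """Click "Show More" repeatedly on the page already open in `driver`.

    Returns the number of successful clicks. Stops when the button can no
    longer be found or clicked, or after `max_clicks` clicks (a hypothetical
    safety cap so the loop cannot run forever).
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            # "xpath" is the string value of Selenium's By.XPATH constant.
            el = driver.find_element("xpath", SHOW_MORE_XPATH)
            el.click()
        except Exception as e:
            # Typically raised once the button is gone or not clickable.
            print(e)
            break
        clicks += 1
        time.sleep(pause)  # crude pause so the next batch can load
    return clicks


def main():
    # Imported here so click_show_more can be tried without a browser.
    from selenium import webdriver

    # One driver, created once and reused for every click.
    browser = webdriver.Chrome()
    browser.get('https://www.nytimes.com/search?query=china+COVID-19')
    click_show_more(browser)
    # browser.page_source now holds every result loaded so far, so the
    # BeautifulSoup parsing from the question can run on it unchanged.
    return browser
```

Calling main() runs this against the real site. The cap on clicks is a design choice for the sketch: with tens of thousands of results, you will probably want to raise it and parse in batches rather than keep everything in one page.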
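As far as I can tell, there is nothing wrong with the second part, where you parse the search results, but if you run into any problems there, feel free to ask.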