Scraping a dynamic web page with Selenium and BeautifulSoup, but new windows keep popping up

I'm scraping content from a dynamic web page: https://www.nytimes.com/search?query=china+COVID-19. I want to get the content of all the news articles (26,783 in total). I can't iterate over result pages by URL, because on this site you have to click "Show More" to load the next batch of results.

So I'm using webdriver.ActionChains. The code doesn't raise any error message, but every few seconds a new window pops up, and each one looks like it's on the same page. The process seemed endless, so I interrupted it after 2 hours. I added "print(article)", but nothing was printed. Can someone help me fix this? Any help is greatly appreciated!

import time
import requests
from bs4 import BeautifulSoup
import json
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
base = "https://www.nytimes.com"
browser = webdriver.Chrome('C:/chromedriver.exe')
wait = WebDriverWait(browser, 10)

browser.get('https://www.nytimes.com/search?query=china+COVID-19')
myarticle = []

while True:
    try:
        driver = webdriver.Chrome('C:/chromedriver.exe')
        driver.get('https://www.nytimes.com/search?query=china+COVID-19')
        el = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
        webdriver.ActionChains(driver).move_to_element(el).click(el).perform()

    except Exception as e:
        print(e)
        break    

soup = BeautifulSoup(browser.page_source,'lxml')
search_results = soup.find('ol', {'data-testid':'search-results'})

links = search_results.find_all('a')
for link in links:
    link_url = link['href']

    response = requests.get(base + link_url)
    soup_link = BeautifulSoup(response.text, 'html.parser')
    scripts = soup_link.find_all('script')
    for script in scripts:
        if 'window.__preloadedData = ' in script.text:
            jsonStr = script.text
            jsonStr = jsonStr.split('window.__preloadedData = ')[-1]
            jsonStr = jsonStr.rsplit(';',1)[0]

            jsonData = json.loads(jsonStr)

            article = []
            for k, v in jsonData['initialState'].items():
                w=1
                try:
                    if v['__typename'] == 'TextInline':
                        article.append(v['text'])
                        #print (v['text'])
                except:
                    continue
            article = [ each.strip() for each in article ]
            article = ''.join([('' if c in string.punctuation else ' ')+c for c in article]).strip()
            print(article)
            myarticle.append(article)
Answer (by 下一站):

The "new windows" pop up because you re-create the driver on every iteration of the loop.

Step by step. First, you create the driver here and navigate to the page:

    browser = webdriver.Chrome('C:/chromedriver.exe')
    browser.get('https://www.nytimes.com/search?query=china+COVID-19')
    

Then, inside the loop, you create another driver on every iteration:

    while True:
        try:
            driver = webdriver.Chrome('C:/chromedriver.exe')
            driver.get('https://www.nytimes.com/search?query=china+COVID-19')
    

That is why you see a new window each time: every iteration launches a fresh Chrome instance and reloads the search page, so the "Show More" clicks never accumulate in a single session.

To fix this, you can use the following code (which covers only the iteration part):

    import time

    from selenium import webdriver
    from bs4 import BeautifulSoup

    # Initialize the driver once, outside the loop
    chromedriver_path = 'C:/chromedriver.exe'
    driver = webdriver.Chrome(chromedriver_path)

    # Navigate to the page
    driver.get('https://www.nytimes.com/search?query=china+COVID-19')
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Keep clicking while the "Show More" button is present
    while soup.find('button', {'data-testid': 'search-show-more-button'}) is not None:
        # Find the button
        button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
        # Move to it to avoid accidentally clicking other elements
        webdriver.ActionChains(driver).move_to_element(button).perform()
        # Click the button
        button.click()
        # Give the next batch of results a moment to load before re-parsing
        time.sleep(2)
        # Re-parse the page so the loop ends once the button disappears
        soup = BeautifulSoup(driver.page_source, 'html.parser')
    

As far as I can tell, there is nothing wrong with the second part, where you parse the search results, but feel free to ask if anything comes up there as well.
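
One caveat worth noting: re-reading driver.page_source right after each click races against the asynchronous loading of the next batch of results, and with 26,783 results that can bite. Below is a minimal sketch of a more defensive loop using the explicit waits the question already imports (WebDriverWait and expected_conditions); the 30-second timeout and the CSS locator built from the data-testid attribute are assumptions to illustrate the idea, not values verified against the live page:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome('C:/chromedriver.exe')
    driver.get('https://www.nytimes.com/search?query=china+COVID-19')
    wait = WebDriverWait(driver, 30)  # assumed timeout; tune to your connection

    while True:
        try:
            # Block until the "Show More" button is clickable, instead of polling page_source
            button = wait.until(EC.element_to_be_clickable(
                (By.CSS_SELECTOR, 'button[data-testid="search-show-more-button"]')))
        except TimeoutException:
            # No clickable button within the timeout: assume all results are loaded
            break
        webdriver.ActionChains(driver).move_to_element(button).perform()
        button.click()

    # Hand the fully expanded page to the parsing code from the question
    soup = BeautifulSoup(driver.page_source, 'lxml')
    search_results = soup.find('ol', {'data-testid': 'search-results'})

Once the loop exits, the rest of the question's parsing (the window.__preloadedData extraction per article link) can run unchanged on this soup object.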