Web scraping a website with a "Load More" button


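I am trying to scrape a website that has a "Load More" button, using Selenium and BeautifulSoup. I have gotten the script to click the "Load More" button and load the rest of the content, but I am having trouble scraping the content into a JSON file. Here is my script: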

import json
import time

from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://smarthistory.org/americas-before-1900/"
driver = webdriver.Chrome('/Users/rawlins/Downloads/chromedriver')
driver.get(url)
html = driver.page_source.encode('utf-8')
page_num = 0

# Keep clicking "Load More" for as long as the button is present in the DOM
while driver.find_elements(By.CSS_SELECTOR, '#load-more-cc-objects'):
    driver.find_element(By.CSS_SELECTOR, '#load-more-cc-objects').click()
    page_num += 1
    print("getting page number "+str(page_num))
    time.sleep(1)  # give the newly requested items time to render

html = driver.page_source.encode('utf-8')

data = [] 

# Parse HTML, close browser
page_soup = soup(driver.page_source, 'lxml')
containers = page_soup.findAll("div", {"class":"mb-8 hover-zoom tablescraper-selected-row opacity-100"})

for container in containers:
    item = {}
    item['type'] = "Course Material"
    item['title'] = container.find('h5', {'class' : 'm-0 mt-4 text-grey-darker text-normal leading-tight hover-connect'}).text.strip()
    item['link'] = container.a["href"]
    item['source'] = "Smarthistory"
    item['base_url'] = "https://smarthistory.org"
    item['license'] = "Attribution-NonCommercial-ShareAlike"
    data.append(item) # add the item to the list

with open("smarthistory-2.json", "w", encoding="utf-8") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

driver.quit()  # the WebDriver instance is named `driver`, not `browser`

My expected output looks like this:

[
    {
        "type": "Course Material",
        "title": "Impressionism as optical realism: Monet",
        "link": "https://smarthistory.org/impressionism-optical-realism-monet/",
        "source": "Smarthistory",
        "base_url": "https://smarthistory.org",
        "license": "Attribution-NonCommercial-ShareAlike"
    },
    {
        "type": "Course Material",
        "title": "Impressionism: painting modern life",
        "link": "https://smarthistory.org/painting-modern-life/",
        "source": "Smarthistory",
        "base_url": "https://smarthistory.org",
        "license": "Attribution-NonCommercial-ShareAlike"
    }
]
Answers
  • 咕-_-

    With Google Chrome's DevTools (F12) you can inspect the network traffic. Open the Network tab while you are on the page and click the "Load More" button. You should see a request (object?tag=DDD&page=2) appear in the list. Use that request URL inside a loop to iterate over the pages. This way you get the JSON directly, without needing to click a button.
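
The loop described above might look roughly like the sketch below. It is not tested against the site and rests on several assumptions: `endpoint` stands for the full request URL (up to the `?`) as copied from the Network tab, the response body is a JSON list, page numbering starts at 1, and an empty list means there are no more pages. The helper names `build_page_url`, `fetch_json` and `collect_pages` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_page_url(endpoint, tag, page):
    """Build a URL of the form <endpoint>?tag=<tag>&page=<page>."""
    return f"{endpoint}?{urlencode({'tag': tag, 'page': page})}"


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_pages(endpoint, tag, fetch=fetch_json, max_pages=100):
    """Request page 1, 2, 3, ... and gather the items from each page.

    Assumption: every page is a JSON list and an empty list marks the end.
    `max_pages` is a safety cap so the loop cannot run forever.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(build_page_url(endpoint, tag, page))
        if not batch:
            break
        items.extend(batch)
    return items
```

To use it, call `collect_pages(endpoint, "DDD")` with the endpoint copied from DevTools, then map each returned item to the desired fields and write the list out with `json.dump`. If the real response wraps the items in an object, or signals the last page differently, the stop condition has to be adjusted to match.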