I am using the code below to extract data from a website, but it is not enough to get all the data I need.
Example of the website link:
https://www.noon.com/saudi-en/accessories-and-supplies?f[is_fbn]=1&sort[by]=price&sort[dir]=asc&limit=150&page=1
Result of running the code:
price,title,sku,
0.75,Lightning Cable For Apple iPhone7 6centimeter WhiteSAR 0.75,
1.30,AirPods Strap WhiteSAR 1.30SAR 470% Off,
1.35,Anti-Lost Sport Silicone Strap Cable For Apple AirPods WhiteSAR 1.35SAR 354% Off,
1.40,Micro USB Fast Charging Cable 1meter WhiteSAR 1.40,
...
# rest of results
Please notice the result is not neat: there is extra information that should be in separate columns.
For example, the title should be only "AirPods Strap White",
not "AirPods Strap WhiteSAR 1.30SAR 470% Off".
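If only the HTML text is available, one rough fix is to cut the title at the currency marker; this is a sketch that assumes "SAR" always follows the clean title, which may not hold for every product:

raw_title = "AirPods Strap WhiteSAR 1.30SAR 470% Off"
clean_title = raw_title.split("SAR")[0].strip()  # -> "AirPods Strap White"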
Also, I cannot capture the sku value, which is buried at the end of the page along with other valuable data (which I do not know how to get):
{"offer_code":"dd3125025109fb4d","sku":"N15614801A","sku_config":"N15614801A","brand":null,"name":"AirPods Strap White","plp_specifications":{},"price":4.4,"sale_price":1.3,"url":"airpods-strap-white","image_key":"v1532025662/N15614801A_1","is_buyable":true,"flags":["fbn","prepaid"]},
It would be very helpful to make the result file include all of these values (this is what I am looking for):
price,title,sku,offer_code,brand,sale_price
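Those columns map directly onto fields of the JSON objects (the clean title is the "name" field). A minimal sketch that writes them with csv.DictWriter, assuming the hypothetical extract_embedded_products from the sketch above:

import csv

def write_products_csv(products, path="noonresult.csv"):
    columns = ["price", "title", "sku", "offer_code", "brand", "sale_price"]
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for p in products:
            writer.writerow({
                "price": p.get("price"),
                "title": p.get("name"),  # "name" holds the clean title
                "sku": p.get("sku"),
                "offer_code": p.get("offer_code"),
                "brand": p.get("brand"),
                "sale_price": p.get("sale_price"),
            })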
Here is the Python code I am using:
from bs4 import BeautifulSoup as soup
from concurrent.futures import ThreadPoolExecutor
import requests

number_of_threads = 6
out_filename = "noonresult.csv"
headers = "price,title,sku, \n"

def extract_data_from_url_func(url):
    print(url)
    response = requests.get(url)
    page_soup = soup(response.text, "html.parser")
    # Each product card on the listing page
    containers = page_soup.findAll('div', {'class': 'jsx-3152181095 productContainer'})
    output = ''
    for container in containers:
        price = container.find('span', {'class': 'value'}).text if container.find('span', {'class': 'value'}) else ""
        title = container.find('div', {'class': 'jsx-866269109 detailsContainer'}).text if container.find('div', {'class': 'jsx-866269109 detailsContainer'}) else ""
        sku = container.find('div', {'class': 'jsx-866269109 wrapper listView'}).a['href'] if container.find('div', {'class': 'jsx-866269109 wrapper listView'}) else ""
        output_list = [price, title, sku]
        output = output + ",".join(output_list) + "\n"
    print(output)
    return output

# Read the listing-page URLs, one per line
with open("speednoon.txt", "r") as fr:
    URLS = list(map(lambda x: x.strip(), fr.readlines()))

# Scrape the pages concurrently
with ThreadPoolExecutor(max_workers=number_of_threads) as executor:
    results = executor.map(extract_data_from_url_func, URLS)
    responses = []
    for result in results:
        responses.append(result)

with open(out_filename, "w", encoding='utf-8-sig') as fw:
    fw.write(headers)
    for response in responses:
        fw.write(response + "\n")
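For comparison, the two sketches above could replace the container scraping entirely while keeping the same thread pool; a hypothetical rewiring of the function:

# Hypothetical: return parsed JSON objects instead of a CSV fragment.
def extract_data_from_url_func(url):
    response = requests.get(url)
    return extract_embedded_products(response.text)

# Then flatten the per-URL lists and write the file once:
# all_products = [p for result in results for p in result]
# write_products_csv(all_products, out_filename)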