Web scraping: empty dataset after collecting the information

I want to build a dataset containing information scraped from websites. Below I explain what I did and what the expected output is. I am getting empty arrays for the rows and columns, and then an empty dataset, and I do not understand why. I hope you can help me.

1) Create an empty dataframe with a single column: this column should hold the list of URLs to use.

import pandas as pd

data_to_use = pd.DataFrame([], columns=['URL'])

2) Select the URLs from a previous dataset.

select_urls = dataset.URL.tolist()

This set of URLs looks like:

                             URL
0                     www.bbc.co.uk
1             www.stackoverflow.com           
2                       www.who.int
3                       www.cnn.com
4         www.cooptrasportiriolo.it
...                             ...

3) Fill the column with these URLs:

data_to_use['URL'] = select_urls
data_to_use['URLcleaned'] = data_to_use['URL'].str.replace(r'^www\.', '', regex=True)
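The cleaning step can be checked on a small sample. Note that recent pandas versions treat the `str.replace` pattern as a literal string unless `regex=True` is passed; the sample values below are made up from the listing above:

```python
import pandas as pd

# Sample values mirroring the URL column shown above
df = pd.DataFrame({'URL': ['www.bbc.co.uk', 'www.stackoverflow.com']})

# regex=True makes pandas treat the pattern as a regular expression
df['URLcleaned'] = df['URL'].str.replace(r'^www\.', '', regex=True)
print(df['URLcleaned'].tolist())  # ['bbc.co.uk', 'stackoverflow.com']
```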

4) Select a random sample to test: the first 50 rows in column URL

data_to_use = data_to_use.loc[1:50, 'URL']
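One thing worth noting about this step: `.loc` with a single column label returns a Series, so after this line `data_to_use` no longer has a `URLcleaned` column. A minimal illustration with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'URL': ['www.bbc.co.uk', 'www.who.int'],
                   'URLcleaned': ['bbc.co.uk', 'who.int']})

# A single label selects one column as a Series
as_series = df.loc[0:1, 'URL']
print(type(as_series).__name__)  # Series

# A list of labels keeps a DataFrame with both columns
as_frame = df.loc[0:1, ['URL', 'URLcleaned']]
print(type(as_frame).__name__)  # DataFrame
```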

5) Try to scrape the information

import requests
import time
from bs4 import BeautifulSoup

urls = data_to_use['URLcleaned'].tolist()

ares = []

for u in urls:  # in the selection there should be an error; I am not sure I am selecting the right one
    print(u)
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)
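As an aside, `time` is imported above but never used; presumably a pause between requests was intended. A sketch of building the scan URLs with a short delay (the 0.1-second value is an arbitrary choice, and no requests are made here):

```python
import time

urls = ['bbc.co.uk', 'stackoverflow.com']  # hypothetical cleaned URLs

scan_urls = []
for u in urls:
    scan_urls.append('https://www.urlvoid.com/scan/' + u)
    time.sleep(0.1)  # be polite to the server between requests

print(scan_urls)
```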

rows = []
cols = []

for ar in ares:
    soup = BeautifulSoup(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    try:
        dat = tab[0].select('tr')
        line = []
        for d in dat:
            row = d.select('td')
            line.append(row[1].text)
        new_header = row[0].text  # note: this runs after the loop, so only the last header is read
        if new_header not in cols:
            cols.append(new_header)
        rows.append(line)
    except IndexError:
        continue

print(rows) # this works fine. It prints the rows. The issue comes from the next line

data_to_use = pd.DataFrame(rows,columns=cols)  

Unfortunately there is something wrong in the steps above as I am not getting any results, but only [] or __.

Error from data_to_use = pd.DataFrame(rows,columns=cols):

ValueError: 1 columns passed, passed data had 12 columns
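The error says that `pd.DataFrame(rows, columns=cols)` needs the length of `cols` to match the number of values in each row: here each row has 12 cells, but `cols` holds only the single header that was assigned after the inner loop finished. A minimal reproduction with made-up values:

```python
import pandas as pd

# Each row has 2 values, but only one header was collected
rows = [['Bbc.co.uk', '9 days ago']]
cols = ['Last Analysis']

try:
    pd.DataFrame(rows, columns=cols)
except ValueError as e:
    print(e)  # same kind of mismatch as the error above

# Collecting one header per cell keeps the lengths aligned
cols = ['Website Address', 'Last Analysis']
df = pd.DataFrame(rows, columns=cols)
print(df.shape)  # (1, 2)
```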

My expected output would be:

URL                Website Address    Last Analysis  Blacklist Status  \
bbc.co.uk          Bbc.co.uk          9 days ago     0/35
stackoverflow.com  Stackoverflow.com  7 days ago     0/35

Domain Registration        IP Address     Server Location     ...
1996-08-01 | 24 years ago  151.101.64.81  (US) United States  ...
2003-12-26 | 17 years ago  ...

Finally, I should save the resulting dataset to a CSV file.

Comments
  • aumujun

    You can do this with pandas alone. Try the following code.

    import pandas as pd

    urllist = ['bbc.co.uk', 'stackoverflow.com', 'who.int', 'cnn.com']

    frames = []
    for url in urllist:
        df = pd.read_html("https://www.urlvoid.com/scan/" + url + "/")[0]
        values = df.values.tolist()  # avoid shadowing the built-in `list`
        rows = []
        cols = []
        for li in values:
            rows.append(li[1])
            cols.append(li[0])
        frames.append(pd.DataFrame([rows], columns=cols))

    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    dffinal = pd.concat(frames, ignore_index=True)

    print(dffinal)
    dffinal.to_csv("domain.csv", index=False)
    

    [screenshots: the printed DataFrame and the resulting CSV file]