Unable to webscrape an HTML table with BeautifulSoup and load it into a Pandas dataframe using Python

My objective is to access the table on the following webpage https://www.countries-ofthe-world.com/world-currencies.html and turn it into a Pandas dataframe that has columns "Country or territory", "Currency", and "ISO-4217".

I am able to access the columns correctly, but I am having a hard time figuring out how to append each row to a dataframe. Do you all have any suggestions on how I can do this? For example, on the webpage, the first row in the table is the letter "A". However, I need the first row in the dataframe to be Afghanistan, Afghan afghani, and AFN.

This is what I have so far:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.countries-ofthe-world.com/world-currencies.html"
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
webpage=urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
table = soup.find("table", {"class":"codes"})
rows = table.find_all('tr')
columns = [v.text for v in rows[0].find_all('th')] 
print(columns) # ['Country or territory', 'Currency', 'ISO-4217']
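For reference, one possible way to continue from the header row is to loop over the remaining tr elements and keep only the rows that look like data. This is a minimal sketch, not a tested solution for the live page: sample_html is a made-up stand-in for the real table, table_to_dataframe is a hypothetical helper name, and the letter rows are assumed to be either a single cell spanning all columns or the same letter repeated in every cell.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the real page: a header row, a letter row, two data rows.
sample_html = """
<table class="codes">
  <tr><th>Country or territory</th><th>Currency</th><th>ISO-4217</th></tr>
  <tr><td colspan="3">A</td></tr>
  <tr><td>Afghanistan</td><td>Afghan afghani</td><td>AFN</td></tr>
  <tr><td>Albania</td><td>Albanian lek</td><td>ALL</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Build a DataFrame from the 'codes' table, skipping letter rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "codes"})
    rows = table.find_all("tr")
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]

    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Keep a row only if it has one cell per column and the cells are not
        # all identical (assumed markup of the "A", "B", ... letter rows).
        if len(cells) == len(columns) and len(set(cells)) > 1:
            records.append(cells)

    # Build the frame once from the collected rows instead of appending row by row.
    return pd.DataFrame(records, columns=columns)

df = table_to_dataframe(sample_html)
print(df)
```

Collecting the rows in a list and constructing the dataframe once at the end tends to be simpler and faster than appending to a dataframe inside the loop.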

Please also see this image.

Thank you all for your time.

Tony

With your fix in place, it's something that can be pretty easily parsed by pd.read_html:

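from urllib.request import Request, urlopen
import pandas as pd

url = "https://www.countries-ofthe-world.com/world-currencies.html"
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
webpage = urlopen(req).read()

# read_html returns a list of DataFrames, one per table found; take the first
df = pd.read_html(webpage)[0]
print(df.head())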

         Country or territory        Currency ISO-4217
0                           A               A        A
1                 Afghanistan  Afghan afghani      AFN
2  Akrotiri and Dhekelia (UK)   European euro      EUR
3     Aland Islands (Finland)   European euro      EUR
4                     Albania    Albanian lek      ALL

The result still contains the alphabet header rows (A, B, and so on). Each of those rows repeats the same letter in every column, so you can get rid of them with something like df = df[df['Currency'] != df['ISO-4217']]
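To see that filter in isolation, here is a small sketch on a hand-built dataframe that mimics the read_html output shown above. The sample rows and the names is_letter_row and clean are illustrative only, and reset_index is an optional extra step so that Afghanistan ends up at index 0.

```python
import pandas as pd

# Hand-built sample mimicking the read_html output, including letter rows.
df = pd.DataFrame({
    "Country or territory": ["A", "Afghanistan", "Albania", "B", "Bahamas"],
    "Currency": ["A", "Afghan afghani", "Albanian lek", "B", "Bahamian dollar"],
    "ISO-4217": ["A", "AFN", "ALL", "B", "BSD"],
})

# Letter rows carry the same value in every column, so comparing two
# columns is enough to identify them.
is_letter_row = df["Currency"] == df["ISO-4217"]

# Drop the letter rows and renumber the index from 0.
clean = df[~is_letter_row].reset_index(drop=True)
print(clean)
```

The comparison relies on no real currency name being identical to its ISO code, which holds for the rows shown here.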
