The code below gets the URL for each gym location on this website, beginning with "Albertville, AL."
from urlparse import urljoin
import requests
import urllib3
from bs4 import BeautifulSoup
res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')
tds = soup.find_all('td', {'class': 'club-title'})
links = [td.find('a')['href'] for td in tds]
keywords = ['gyms']
for link in links:
    if any(keyword in link for keyword in keywords):
        print urljoin('https://www.planetfitness.com/', link)
Output for the first 2 links:
https://www.planetfitness.com/gyms/albertville-al
https://www.planetfitness.com/gyms/alexander-city-al
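However, I am trying to scrape the following from each of the output links:
- Street address
- Club hours
Below is my attempt at the street address part. I believe it doesn't work because the "ps =" line comes back empty, but I don't know what to use in place of 'p', 'class', and 'address'. Any ideas on how to fix the code so that it actually does this would be greatly appreciated!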
res1 = requests.get(urljoin('https://www.planetfitness.com/', link)).content
soup1 = BeautifulSoup(res1, 'html.parser')
ps = soup.find_all('p', {'class': 'address'})
address1 = [p.find('span')['itemprop'] for p in ps]
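Two things in the snippet above can be checked without seeing the page: `ps` is built from `soup` (the sitemap) rather than `soup1` (the club page), and `['itemprop']` returns the attribute's value (the property name), not the text inside the span. A minimal Python 3 sketch of pulling the text out by `itemprop` might look like the following. The sample markup, the property name `streetAddress`, and the helper `extract_itemprop_text` are assumptions made for illustration; the real page has to be inspected to confirm the actual tag and attribute names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real club page may differ.
SAMPLE_HTML = """
<p class="address">
  <span itemprop="streetAddress">123 Example St</span>
  <span itemprop="addressLocality">Albertville</span>
</p>
"""


def extract_itemprop_text(html, prop):
    """Return the stripped text of every tag whose itemprop equals prop."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the attribute alone, so the enclosing tag name and class
    # do not have to be guessed.
    tags = soup.find_all(attrs={"itemprop": prop})
    return [tag.get_text(strip=True) for tag in tags]


if __name__ == "__main__":
    print(extract_itemprop_text(SAMPLE_HTML, "streetAddress"))
```

To use this against the site, the `.content` of each `requests.get(...)` response could be passed in as `html` from inside the `for link in links:` loop, so that every club page is parsed rather than only the last `link`.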
This image of what the inspector shows for the street address may help.
Thanks for your help!