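I tested the code a few times against specific links at each step and it worked well. I don't know whether there is some mechanism that blocks requests (this is my first attempt at scraping/crawling).
Now, when I try to run it, I get a 403 error for every request (507 in total), so I just stopped node. I was really hopeful, because I do get the initial links, but as soon as I run profileParse on them it falls apart :(
Here is my retailX.js: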
const rp = require('request-promise');
const $ = require('cheerio');
const profileParse = require('./profileParse');
const fs = require('fs');

const writeStream = fs.createWriteStream('post.csv');
const url = 'url im scraping here';

// headers
//writeStream.write(`Name, URL \n`)

rp({
  url: url,
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
  },
  json: true
})
  .then(function (html) {
    // collect the profile links from the listing page
    const profileUrls = [];
    for (let i = 456; i < 457; i++) {
      profileUrls.push('first portion of link here' + $('td > a[class="exhibitorName"]', html)[i].attribs.href);
    }
    // request every profile page at the same time
    return Promise.all(profileUrls.map(url => profileParse(url)));
  })
  .then(profiles => {
    // write one row per profile to the csv; write() only accepts a string or Buffer
    profiles.forEach(profile => {
      // profileParse resolves to undefined when its request failed
      if (profile) writeStream.write(`${profile.name}, ${profile.url}\n`);
    });
    console.log(profiles, 'scraping done');
  })
  .catch(function (err) {
    // handle error
    console.log(err, 'THERE IS AN ERROR');
  });
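The last `.then` receives an array of `{ name, url }` objects from `Promise.all`, while a write stream's `write()` takes a string or Buffer, so each object has to become a line of text first. Here is a minimal sketch of that step. The helper names `csvEscape` and `toCsvRow` are hypothetical, not part of my code above, and the quoting rule is the usual CSV convention of wrapping a field in double quotes and doubling any quotes inside it.

```javascript
// Hypothetical helper: quote a value for CSV when it contains a comma,
// a double quote, or a line break; embedded quotes are doubled.
function csvEscape(value) {
  const text = String(value == null ? '' : value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Hypothetical helper: turn one { name, url } object into a single CSV line.
function toCsvRow(profile) {
  return [profile.name, profile.url].map(csvEscape).join(',') + '\n';
}
```

Each profile could then be written with `writeStream.write(toCsvRow(profile))`.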
And here is the second function, which scrapes each of those links, profileParse.js:
const rp = require('request-promise');
const $ = require('cheerio');

// fetch one profile page and pull the name and website out of it
const profileParse = (url) => {
  return rp({
    url: url,
    headers: {
      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
    },
    json: true
  })
    .then(function (html) {
      const info = {
        name: $('div[class="panel-body"] > h1', html).text(),
        url: $('span[class="BoothContactUrl"] > a', html).text()
      };
      return info;
    })
    .catch(function (err) {
      // handle error (the promise then resolves to undefined)
      console.log(err, 'THERE IS AN ERROR');
    });
};

module.exports = profileParse;
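Since `Promise.all` starts all 507 profile requests at once, one thing that could be tried is sending them one at a time with a pause in between. This is only a sketch under assumptions: `sleep` and `mapWithDelay` are hypothetical helpers, the delay value is arbitrary, and I don't know whether the request rate is actually what triggers the 403 responses.

```javascript
// Hypothetical helper: resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run `worker` on each item one at a time,
// waiting `delayMs` between calls instead of starting them all at once.
async function mapWithDelay(items, worker, delayMs) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause before the first request
    results.push(await worker(items[i]));
  }
  return results;
}
```

In retailX.js, `Promise.all(profileUrls.map(url => profileParse(url)))` could be swapped for something like `mapWithDelay(profileUrls, profileParse, 1000)`, which resolves to the same kind of array, just much more slowly.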