This post scrapes the front page of the Qiushibaike (糗事百科) text section, extracting each post's author, gender, age, joke text, laugh count, and comment count, and writes the results to a CSV file.
1. Page Analysis
Inspecting the URL shows that Qiushibaike's pages are fairly simple: all request parameters appear directly in the URL, meaning the page is served via GET, so a GET request is used here.
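To see concretely what "parameters in the URL" means for a GET request, here is a minimal sketch that builds (but does not send) a request and prints the resulting URL. The `page` parameter is a hypothetical illustration, not necessarily what Qiushibaike actually uses.

```python
import requests

# Build a GET request without sending it, to inspect the final URL.
# The 'page' query parameter is assumed here purely for illustration.
req = requests.Request('GET', 'http://www.qiushibaike.com/text/',
                       params={'page': 2})
prepared = req.prepare()
print(prepared.method)  # GET
print(prepared.url)     # http://www.qiushibaike.com/text/?page=2
```

Because everything needed to reproduce the page is in the URL itself, no POST body is required.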
2. Content Analysis
The posts on the page are laid out as parallel sibling blocks, so the scraper iterates over them and extracts the relevant fields from each one.
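The "parallel blocks" structure can be sketched with a toy HTML snippet: every post lives in its own sibling `div`, so `find_all()` returns them in order and a simple loop walks each one. The class names follow the real page; the authors and text are made up.

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the listing page: one sibling div per post.
html = '''
<div class="article block untagged mb15"><h2>author1</h2><span>joke one</span></div>
<div class="article block untagged mb15"><h2>author2</h2><span>joke two</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')
# class_='article' matches any element whose class list contains 'article'.
for div in soup.find_all('div', class_='article'):
    print(div.find('h2').text, '->', div.find('span').text)
```

This is exactly the traversal pattern the full script below uses, just on the live page instead of a toy snippet.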
3. Code
import requests
from bs4 import BeautifulSoup
import re

url = 'http://www.qiushibaike.com/text/'
headers = {
    'Cookie': '_qqq_uuid_="2|1',
    'Upgrade-Insecure-Requests': '1',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'max-age=0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Host': 'www.qiushibaike.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36',
    'Referer': 'https',
    'If-None-Match': '"091c1ffec42275e428d6a951055a2c5266c52a17"',
    'Connection': 'keep-alive'
}

# headers must be passed as a keyword argument; the second positional
# argument of requests.get() is params, not headers.
html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html, 'lxml')

f = open('C:\\Users\\Administrator\\Desktop\\练习杂物\\糗事百科爬虫练习.csv', 'w', encoding='utf-8')

div_list = soup.find_all(name='div', class_='article block untagged mb15')
for i in div_list:
    name = i.find('h2').text
    genders = i.find(name='div', class_=re.compile('articleGender .*'))
    if genders is None:
        gender = 'None'
        age = 'None'
    else:
        # attrs['class'] is a list of class-name strings, e.g.
        # ['articleGender', 'manIcon']; stripping the trailing 'Icon'
        # leaves the gender.
        gender = genders.attrs['class'][1][:-4]
        age = genders.text
    content = i.find('span').text
    laugh = i.find(name='span', class_='stats-vote').find('i').text
    comment = i.find(name='span', class_='stats-comments').find('i').text
    f.writelines(['姓名: ' + name, ' 性别: ' + gender, ' 年龄: ' + age,
                  ' 好笑数: ' + laugh, ' 评论数: ' + comment])
    f.writelines('\n')
    f.writelines(content + '\n')
f.close()
print('finished!')
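Note that the script writes labelled strings rather than true comma-separated fields, so the output is not strictly CSV. If genuine CSV output is wanted, the standard-library `csv` module handles delimiters and quoting automatically. A minimal sketch, using a hypothetical file name and placeholder data:

```python
import csv

# One scraped record as a dict; the values here are placeholders.
rows = [{'name': 'author1', 'gender': 'man', 'age': '25',
         'laugh': '100', 'comment': '10'}]

# newline='' is required so the csv module controls line endings itself.
with open('qiushibaike_demo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'gender', 'age',
                                           'laugh', 'comment'])
    writer.writeheader()    # column-header row
    writer.writerows(rows)  # one row per post
```

Files produced this way open cleanly in spreadsheet tools, which the label-prefixed format above does not.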