Python Assignment 20170526: A Crawler for 美股吧 (the Eastmoney US Stocks Forum)

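## Post List Page Analysis

- Post list pagination

> This tag gives the total number of posts (1706), the number of posts per page (80), and the current page (page 1).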
![Analysis of the 美股吧 post list page](http://upload-images.jianshu.io/upload_images/5298387-ca563fc7a0c2552e.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

- Building the post list URL

http://guba.eastmoney.com/list,meigu_2.html

The post list URL can be written as:

    'http://guba.eastmoney.com/list,meigu_{}.html'.format(page_num)

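Dividing the total number of posts by the number of posts per page gives the number of list pages, from which the list of post list URLs can be built. In code:

    page_data = soup.find(name='span', class_='pagernums').get('data-pager').split('|')
    page_nums = math.ceil(int(page_data[1]) / int(page_data[2]))

**Note: use the `ceil` function from the math module to round up.**

- Loop over every list page to collect the post information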
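
The pagination arithmetic above can be tried without touching the network. The sketch below assumes a `data-pager` value shaped like `list,meigu_|1706|80|1`; only the positions of the total (index 1) and the page size (index 2) come from the code above, while the sample string and the helper name `build_list_urls` are made up for illustration.

```python
import math

LIST_URL = 'http://guba.eastmoney.com/list,meigu_{}.html'


def build_list_urls(data_pager):
    """Return the URL of every post list page described by a data-pager value."""
    fields = data_pager.split('|')
    total_posts = int(fields[1])     # index 1: total number of posts
    posts_per_page = int(fields[2])  # index 2: posts per page
    # Round up so a partially filled last page is still counted
    page_count = math.ceil(total_posts / posts_per_page)
    return [LIST_URL.format(page) for page in range(1, page_count + 1)]


# Hypothetical attribute value; the real one must be read from the page
urls = build_list_urls('list,meigu_|1706|80|1')
print(len(urls))   # 22, since 1706 / 80 = 21.325 rounds up to 22
print(urls[-1])
```

Plain integer division would give 21 here and silently drop the last, partially filled page, which is why `ceil` is needed.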

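## Comment Page Analysis
- Comment page pagination
> Searching the page's HTML for 105 shows three places where the comment count appears; here a regular expression extracts it from the script.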

    {var num=40030; }var pinglun_num=105;var xgti="";if(typeof (count) != "undefined"){xgti="<a href='list,meigu.html'>相关帖子"+count+"条</a>";}

![Comment page pagination info](http://upload-images.jianshu.io/upload_images/5298387-8e7eae194d70eb22.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

- Building the comment page URL

http://guba.eastmoney.com/news,meigu,613304918_2.html

The comment page URL of a post can be written as:

    'http://guba.eastmoney.com/news,meigu,613304918_{}.html'.format(page_num)

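Dividing the total comment count `reply_count` by 30 (when comments are paginated, a page holds at most 30 comments) gives the number of comment pages, from which the list of comment page URLs can be built. In code:

    pattern = re.compile(r'var pinglun_num=(.*?);')
    # Comment count of the post
    reply_count = int(re.search(pattern, resp.text).group(1))
    page_num = math.ceil(reply_count / 30)

**Note: use the `ceil` function from the math module to round up.**

- Loop over every comment page to collect the comments
First check whether the post has any comments; if it does, iterate over the comment page URLs and return the post's comments.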
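
As a sketch of the comment pagination step, the snippet below runs the regular expression on the script fragment quoted in this section and builds the comment page URLs for post 613304918. The helper name `build_comment_urls` and the use of `\d+` instead of `.*?` in the pattern are assumptions of this example, not part of the original code.

```python
import math
import re

COMMENTS_PER_PAGE = 30  # at most 30 comments per page, as stated above
PATTERN = re.compile(r'var pinglun_num=(\d+);')


def build_comment_urls(post_url, page_source):
    """Return the comment page URLs of a post, or [] if it has no comments."""
    match = PATTERN.search(page_source)
    reply_count = int(match.group(1)) if match else 0
    page_count = math.ceil(reply_count / COMMENTS_PER_PAGE)
    stem = post_url[:-len('.html')]  # strip the extension, then append _<page>.html
    return ['{}_{}.html'.format(stem, page) for page in range(1, page_count + 1)]


sample = '{var num=40030; }var pinglun_num=105;var xgti="";'
comment_urls = build_comment_urls(
    'http://guba.eastmoney.com/news,meigu,613304918.html', sample)
print(comment_urls)  # 4 URLs, since 105 / 30 = 3.5 rounds up to 4
```

Treating a missing match as zero comments lets the caller skip the post instead of failing on `None.group(1)`.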

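## Libraries Used
- requests: sends the HTTP requests
- BeautifulSoup: parses the HTML
- re: extracts data from the page with regular expressions
- math: provides `ceil` for rounding up
- csv: saves the data to a CSV file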

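## Crawl Process
1. Start from http://guba.eastmoney.com/list,meigu.html;
2. Get the total number of posts and compute the number of post list pages;
3. Build the list of post list page URLs;
4. Iterate over the post list page URLs to get the post information:
   - For each post on the page, get its read count, comment count, and title
   - Get the comments:
     - Start from the post URL, e.g. http://guba.eastmoney.com/news,meigu,646708357.html
     - Get the total number of comments and compute the number of comment pages
     - Build the list of comment page URLs
     - Iterate over the comment page URLs to get the post's comments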
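
The steps above amount to two nested pagination loops. The sketch below shows only that control flow: `list_posts` and `list_comments` are hypothetical callables standing in for the real fetch-and-parse functions, and the stub data is invented so the example runs offline.

```python
def crawl(list_page_urls, list_posts, list_comments):
    """Yield one record per post, with its comments attached.

    list_posts(url) returns the posts found on one list page as dicts
    that carry at least an 'article_url' key; list_comments(url) returns
    all comments of one post. Both are supplied by the caller.
    """
    for page_url in list_page_urls:
        for post in list_posts(page_url):
            record = dict(post)
            record['article_comments'] = list_comments(post['article_url'])
            yield record


# Stub data standing in for real pages (invented for this example)
fake_pages = {
    'page1': [{'article_url': 'post-a', 'article_title': 'A'}],
    'page2': [{'article_url': 'post-b', 'article_title': 'B'}],
}
fake_comments = {'post-a': ['first', 'second'], 'post-b': []}

records = list(crawl(['page1', 'page2'], fake_pages.get, fake_comments.get))
for record in records:
    print(record['article_title'], len(record['article_comments']))
```

Passing the two callables in keeps the loop structure separate from the network and parsing code, so it can be exercised with stub data like this.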

## Code

    import requests
    from bs4 import BeautifulSoup
    import math
    import re
    import csv

    # Entry point: the first post list page
    start_url = 'http://guba.eastmoney.com/list,meigu_1.html'

    # Example post URL
    url = "http://guba.eastmoney.com/news,meigu,646708357.html"

    # Prefix for the relative links found on the list pages
    base_url = "http://guba.eastmoney.com"

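    # Get the information of all posts
    def get_articles_info(start_url):
        resp = get_html(start_url)
        if resp is None:
            return []
        soup = BeautifulSoup(resp.text, 'html.parser')
        # data-pager is '|'-separated: index 1 is the total post count, index 2 the posts per page
        page_data = soup.find(name='span', class_='pagernums').get('data-pager').split('|')
        page_nums = math.ceil(int(page_data[1]) / int(page_data[2]))
        print('{} pages in total'.format(page_nums))
        articles_infos = []
        # Write the header row; the data rows are appended in parser_articles_info
        with open('meigu.csv', 'a', newline='', encoding='utf-8') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(['Reads', 'Replies', 'Release time', 'Post URL', 'Post title', 'Post comments'])
        for i in range(1, page_nums + 1):
            print('Crawling page {}...'.format(i))
            articles_url = start_url.split('_')[0] + '_' + str(i) + '.html'
            # Extend the accumulator directly so earlier pages are not overwritten
            articles_infos.extend(parser_articles_info(articles_url))
        return articles_infos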

    # Get the information of every post on one list page: read count, reply count,
    # release time, post URL, post title, and all comments of the post
    # param: URL of one post list page
    def parser_articles_info(article_list_url):
        resp = get_html(article_list_url)
        if resp is None:
            return []
        articles_soup = BeautifulSoup(resp.text, 'html.parser')
        articles_infos = articles_soup.find_all(name='div', class_='articleh')
        articles = []
        for info in articles_infos:
            link = info.find(name='span', class_='l3').find(name='a')
            # Only entries whose link contains '/news' are processed
            if '/news' in link.get('href'):
                article_url = base_url + link.get('href')
                post_resp = get_html(article_url)
                article_infos = {
                    'read_count': info.find(name='span', class_='l1').text,
                    'reply_count': info.find(name='span', class_='l2').text,
                    'release_time': info.find(name='span', class_='l5').text,
                    'article_url': article_url,
                    'article_title': link.get('title'),
                    # Skip comment parsing when the post page could not be fetched
                    'article_comments': parse_comment_page(post_resp) if post_resp else [],
                }
                with open('meigu.csv', 'a', newline='', encoding='utf-8') as csv_file:
                    writer = csv.writer(csv_file)
                    writer.writerow(article_infos.values())
                articles.append(article_infos)
        return articles

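    # Fetch the HTML document for a URL; returns None unless the status code is 200
    def get_html(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        }
        # Pass the headers so the User-Agent is actually sent
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            return resp
        return None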

    # Parse a post's HTML document and extract the needed data:
    # the post content and all of its comments (only the comments are returned)
    def parse_comment_page(resp):
        soup = BeautifulSoup(resp.text, 'html.parser')
        # Regular expression for the total comment count
        pattern = re.compile(r'var pinglun_num=(.*?);')
        article_info = {}
        # Comment count of the post
        article_info['reply_count'] = int(re.search(pattern, resp.text).group(1))
        # Content of the post
        article_info['article_content'] = soup.find(name='div', class_='stockcodec').text.strip()
        page_num = math.ceil(article_info['reply_count'] / 30)
        print('{} comments, {} pages'.format(article_info['reply_count'], page_num))
        # Crawl all comments
        article_comments = []
        if article_info['reply_count'] > 0:
            for i in range(1, page_num + 1):
                comment_url = '.'.join(resp.url.split('.')[:-1]) + '_{}'.format(i) + '.html'
                print(comment_url)
                article_comments.extend(parser_article_comment(comment_url))
        else:
            article_comments.append('This post has no comments yet')
        return article_comments

    # Get the comments on one comment page of a post
    def parser_article_comment(comment_list_url):
        resp = get_html(comment_list_url)
        # Initialise up front so an empty list is returned when the request fails
        comments = []
        if resp:
            comment_soup = BeautifulSoup(resp.text, 'html.parser')
            comments_infos = comment_soup.find_all(name='div', class_='zwlitxt')
            for info in comments_infos:
                nick_link = info.find(name='span', class_='zwnick').find('a')
                comment = {}
                comment['commentator'] = nick_link.text if nick_link else None
                comment['reply_time'] = info.find(name='div', class_='zwlitime').text
                comment['reply_content'] = info.find(name='div', class_='zwlitext').text
                comments.append(comment)
        return comments

    def main():
        get_articles_info(start_url)


    if __name__ == '__main__':
        main()
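
`parser_articles_info` passes the comment list straight to `csv.writer`, which stores it as the string form of a Python list. One possible alternative, sketched here under the assumption that JSON is acceptable for that column, is to serialise the comments with `json.dumps` so they can be loaded back later. The function name `write_posts` and the sample record are illustrative only.

```python
import csv
import io
import json

FIELDS = ['read_count', 'reply_count', 'release_time',
          'article_url', 'article_title', 'article_comments']


def write_posts(posts, csv_file):
    """Write post dicts as CSV rows, storing the comment list as JSON text."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
    writer.writeheader()
    for post in posts:
        row = dict(post)
        # ensure_ascii=False keeps Chinese text readable in the file
        row['article_comments'] = json.dumps(post['article_comments'],
                                             ensure_ascii=False)
        writer.writerow(row)


# Invented sample record; a real run would pass the crawler's results
sample_posts = [{
    'read_count': '10', 'reply_count': '1', 'release_time': '05-26',
    'article_url': 'http://guba.eastmoney.com/news,meigu,646708357.html',
    'article_title': 'demo',
    'article_comments': [{'commentator': 'a', 'reply_content': 'hi, there'}],
}]

buffer = io.StringIO()  # in-memory file so the example leaves nothing on disk
write_posts(sample_posts, buffer)

# Read the CSV back and decode the JSON column
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
comments = json.loads(rows[0]['article_comments'])
print(comments[0]['reply_content'])
```

When writing to a real file, opening it with `newline=''` is what the `csv` module documentation recommends.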


## Results
![Crawl results](http://upload-images.jianshu.io/upload_images/5298387-f98fdbc345c75d57.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
> The crawler runs correctly, but crawling is very slow.
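
A likely contributor to the slowness is that every page is requested one after another. As a sketch of one possible remedy, and not something the original code does, the pages could be fetched concurrently with a thread pool from the standard library. `fetch` below is a hypothetical stand-in for `get_html`, simulated with a short sleep so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for get_html: pretend each request takes 0.1 s."""
    time.sleep(0.1)
    return 'html of ' + url


def fetch_all(urls, max_workers=8):
    """Fetch every URL concurrently; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


page_urls = ['http://guba.eastmoney.com/list,meigu_{}.html'.format(i)
             for i in range(1, 9)]

started = time.perf_counter()
pages = fetch_all(page_urls)
elapsed = time.perf_counter() - started
print(len(pages), 'pages in {:.2f} s'.format(elapsed))
```

Fetching the eight simulated pages one by one would take about 0.8 s, so the measured time gives a rough idea of what overlapping the waits can save.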

Last edited: 2017.12.07 18:20:42
