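When we want to know whether a TV series or film is any good, we often check its ratings and reviews on Douban. This article scrapes the Douban comments for one TV series, runs SnowNLP sentiment analysis on them, and finally generates a word cloud to give an intuitive overall impression.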
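1. Scraping the comments
Take the recently popular series 《猎场》 (Game of Hunting) as an example. Because Douban has anti-scraping measures, each request must carry the cookie of a logged-in session, which is saved in cookie.txt. The code is as follows: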
import codecs
import random
import time

import requests
from lxml import html

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}

# Load the cookie string copied from a logged-in browser session
cookies = {}
with open('cookie.txt', 'r') as f_cookies:
    for line in f_cookies.read().split(';'):
        if '=' not in line:
            continue  # skip empty or malformed fragments
        name, value = line.strip().split('=', 1)
        cookies[name] = value
# print(cookies)

# Request 25 pages of 20 comments each (start = 0, 20, ..., 480)
for num in range(0, 500, 20):
    url = ('https://movie.douban.com/subject/26322642/comments?start=' + str(num)
           + '&limit=20&sort=new_score&status=P&percent_type=')
    with codecs.open('comment.txt', 'a', encoding='utf-8') as f:
        try:
            r = requests.get(url, headers=header, cookies=cookies)
            result = html.fromstring(r.text)
            comment = result.xpath("//div[@class='comment']/p/text()")
            for i in comment:
                if i.strip():  # ignore whitespace-only text nodes
                    f.write(i.strip() + '\r\n')
        except Exception as e:
            print(e)
    # Pause for a random 1.05 to 6 seconds between requests
    time.sleep(1 + float(random.randint(1, 100)) / 20)
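The cookie-loading step above can also be isolated into a small helper, which makes it easier to check on its own. The following is only a minimal sketch: the function name parse_cookie_string and the sample cookie string are illustrative assumptions, not part of the original script or of any real Douban session.

```python
def parse_cookie_string(raw):
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for fragment in raw.split(';'):
        fragment = fragment.strip()
        if '=' not in fragment:
            continue  # skip empty pieces, e.g. after a trailing ';'
        # Split on the first '=' only, because values may contain '='
        name, value = fragment.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up cookie string, for illustration only
sample = 'bid=abc123; dbcl2="12345:xyz=="; '
print(parse_cookie_string(sample))
```

The resulting dictionary can be passed directly as the cookies argument of requests.get, as in the script above.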
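2. Sentiment analysis
SnowNLP is a Python library for processing Chinese text; it supports word segmentation, part-of-speech tagging, sentiment analysis, and more. Its sentiment analysis is a simple binary classification into positive and negative: the return value is the probability that the text is positive, so a score close to 1 indicates a positive comment and a score close to 0 a negative one. The code is as follows: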
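import matplotlib.pyplot as plt
import numpy as np
from snownlp import SnowNLP

# Read the scraped comments, one per line, skipping blank lines
with open('comment.txt', 'r', encoding='UTF-8') as f:
    comments = [line.strip() for line in f if line.strip()]

# Score each comment: values near 1 are positive, values near 0 negative
sentimentslist = []
for comment in comments:
    s = SnowNLP(comment)
    # print(s.sentiments)
    sentimentslist.append(s.sentiments)

# 100 bins of width 0.01 covering the full range [0, 1];
# np.arange(0, 1, 0.01) stops at 0.99 and would drop scores above it
plt.hist(sentimentslist, bins=np.linspace(0, 1, 101), facecolor='g')
plt.xlabel('Sentiments Probability')
plt.ylabel('Quantity')
plt.title('Analysis of Sentiments')
plt.show()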
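Besides the histogram, the scores can be condensed into a few numbers. The sketch below is one possible way to do that; the function name summarize_sentiments, the 0.5 cut-off between positive and negative, and the sample scores are all assumptions made for illustration, not something SnowNLP prescribes.

```python
def summarize_sentiments(scores, threshold=0.5):
    """Return counts, positive share and mean for scores in [0, 1]."""
    if not scores:
        return {'positive': 0, 'negative': 0,
                'positive_share': None, 'mean': None}
    # Scores at or above the (assumed) threshold count as positive
    positive = sum(1 for s in scores if s >= threshold)
    negative = len(scores) - positive
    return {
        'positive': positive,
        'negative': negative,
        'positive_share': positive / len(scores),
        'mean': sum(scores) / len(scores),
    }


# Made-up scores standing in for sentimentslist
sample_scores = [0.9, 0.75, 0.2, 0.5, 0.1]
print(summarize_sentiments(sample_scores))
```

Calling summarize_sentiments(sentimentslist) after the loop above would give the share of comments scored as positive under that assumed threshold.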
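3. Generating the word cloud
The word cloud relies mainly on jieba and wordcloud: the former is a word segmentation library for Chinese, and the latter generates a customizable word cloud from the segmentation results. The full code is as follows: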
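import codecs
from collections import Counter

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud

with codecs.open('comment.txt', 'r', encoding='utf-8') as f:
    text = f.read()
text_jieba = list(jieba.cut(text))
c = Counter(text_jieba)  # count occurrences of each token
word = c.most_common(800)  # keep the 800 most frequent tokens
# scipy.misc.imread has been removed from SciPy, so load the mask with Pillow
bg_pic = np.array(Image.open('src.jpg'))
wc = WordCloud(
    font_path=r'C:\Windows\Fonts\SIMYOU.TTF',  # a font that supports Chinese
    background_color='white',  # background colour
    max_words=2000,  # upper limit on the number of words drawn
    mask=bg_pic,  # words are drawn only where the mask is not white
    max_font_size=200,  # largest font size
    random_state=20  # random seed, so layout and colours are reproducible
)
wc.generate_from_frequencies(dict(word))  # build the word cloud
wc.to_file('result.jpg')
# show the word cloud
plt.imshow(wc)
plt.axis("off")
# show the mask image in a second figure
plt.figure()
plt.imshow(bg_pic, cmap=plt.cm.gray)
plt.axis("off")
plt.show()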
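The script above counts every token that jieba produces, including punctuation, line breaks and very common function words. If these dominate the word cloud, the tokens can be filtered before counting. The following is a minimal sketch of that idea and is not part of the original script: the function name count_meaningful_tokens, the tiny STOP_WORDS set and the minimum length of 2 are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; a real list would be longer
STOP_WORDS = {'这个', '就是', '一个', '没有'}


def count_meaningful_tokens(tokens, stop_words=STOP_WORDS, min_length=2):
    """Count tokens, skipping whitespace, short tokens, punctuation and stop words."""
    counts = Counter()
    for token in tokens:
        token = token.strip()
        if len(token) < min_length:
            continue  # drops empty strings and single characters
        if token in stop_words:
            continue
        if not any(ch.isalnum() for ch in token):
            continue  # pure punctuation such as '……'
        counts[token] += 1
    return counts


# Made-up token list standing in for the output of jieba.cut
sample_tokens = ['剧情', '这个', '不错', ',', '剧情', '\r\n', '演技', '……', '的']
print(count_meaningful_tokens(sample_tokens).most_common(2))
```

The returned Counter could replace c in the script above, so that c.most_common(800) is computed from the filtered tokens only.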