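Preface
BeautifulSoup is well known as a very powerful library, but there are other popular parsing libraries too. One of them is lxml, which uses XPath syntax and is likewise a fairly efficient way to parse HTML. If you are not comfortable with BeautifulSoup, give XPath a try. (Highly recommended!)
Installing lxml: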
pip install lxml
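Once lxml is installed, a quick way to get a feel for XPath is to run it against a small inline HTML string. The sketch below is only an illustration: the HTML fragment, the `title` class and the `extract_links` helper are made up for this example and are not taken from the Maoyan page.

```python
from lxml import etree

# Hypothetical HTML fragment, invented for illustration only.
SAMPLE_HTML = """
<div class="movies">
  <div class="title"><a href="/films/1">First Film</a></div>
  <div class="title"><a href="/films/2">Second Film</a></div>
</div>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> directly under a div with class 'title'."""
    root = etree.HTML(html)
    links = root.xpath('//div[@class="title"]/a')
    # xpath() returns a list; text() and @href each yield a list of strings
    return [(a.xpath('text()')[0], a.xpath('@href')[0]) for a in links]


if __name__ == '__main__':
    for name, href in extract_links(SAMPLE_HTML):
        print('{} -> {}'.format(name, href))
```

Note that `[@class="title"]` compares the whole attribute value, which is why the script below spells out multi-word classes such as `"channel-detail movie-item-title"` in full.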
Code
# -*- coding: UTF-8 -*-
# requests sends the HTTP requests; lxml parses the returned HTML
import requests
from lxml import etree
import sys

# Python 2.7 only: work around Chinese text encoding errors
reload(sys)
sys.setdefaultencoding('utf8')
# Maoyan film listing URL; offset starts at 0 and grows by 30 per page (30 films per page)
ori_url = 'http://maoyan.com/films?sortId=1&offset={}'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    # Fill in the cookie from your own browser
    'Cookie': 'your cookie',
    'Host': 'maoyan.com',
    'Referer': 'http://maoyan.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
# 11 pages in total; only the offset in the URL changes from page to page
all_url = []
for i in range(11):
    offset = str(i * 30)
    req_url = ori_url.format(offset)
    all_url.append(req_url)
movie_item = list()
i = 0  # index of the next film to add
j = 0  # index of the next film that still needs a score
for url in all_url:
    html = requests.get(url, headers=headers).text
    selector = etree.HTML(html)
    # XPath: select the link that holds each film's name and URL
    infos = selector.xpath('//div[@class="movies-list"]/dl[@class="movie-list"]//div[@class="channel-detail movie-item-title"]/a')
    # Remember where this page's films start so the scores can be matched up below
    j = i
    for info in infos:
        movie_item.append(dict())
        movie_url = 'http://maoyan.com' + info.xpath('@href')[0]
        movie_name = info.xpath('text()')[0]
        movie_item[i]['name'] = movie_name
        movie_item[i]['url'] = movie_url
        i += 1
    # XPath: select the score nodes (two cases: rated / not yet rated)
    score = selector.xpath('//div[@class="channel-detail channel-detail-orange"]')
    for item in score:
        if item.text is None:
            # No direct text: the score is split across the first two child elements
            sc = item.getchildren()[0].text + item.getchildren()[1].text
        else:
            # Direct text only, e.g. the "no rating yet" label
            sc = item.text
        movie_item[j]['score'] = sc
        j += 1
# Sort by score, ascending (scores are strings here, so the comparison is lexicographic)
movie_item = sorted(movie_item, key=lambda item: item['score'], reverse=False)
# Write the results to a local file (the ./p_data directory must already exist)
print(len(movie_item))
with open('./p_data/movieinfos.txt', 'w') as f:
    for movie in movie_item:
        f.write(str(movie['name']) + ' ' + str(movie['score']) + ' ' + str(movie['url']) + '\n')
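The script above matches names to scores by position and sorts the scores as strings. One possible alternative, sketched here under assumptions, is to walk each film's container element, read name and score with relative XPath so they stay paired, and sort on a numeric key. The HTML fragment, the `<dd>` wrapper, the `<i>` children and the helper names `parse_movies` and `score_key` are all hypothetical; they only mimic the class names used in the script and have not been checked against the live site.

```python
from lxml import etree

# Hypothetical fragment reusing the class names from the script above.
# The <dd> wrapper and the <i> children are assumptions, not verified markup.
SAMPLE_HTML = """
<dl class="movie-list">
  <dd>
    <div class="channel-detail movie-item-title"><a href="/films/1">Film A</a></div>
    <div class="channel-detail channel-detail-orange"><i>9.</i><i>1</i></div>
  </dd>
  <dd>
    <div class="channel-detail movie-item-title"><a href="/films/2">Film B</a></div>
    <div class="channel-detail channel-detail-orange">No rating yet</div>
  </dd>
  <dd>
    <div class="channel-detail movie-item-title"><a href="/films/3">Film C</a></div>
    <div class="channel-detail channel-detail-orange"><i>8.</i><i>4</i></div>
  </dd>
</dl>
"""


def parse_movies(html, base_url='http://maoyan.com'):
    """Extract name, url and score from each <dd>, keeping them paired."""
    root = etree.HTML(html)
    movies = []
    for dd in root.xpath('//dl[@class="movie-list"]/dd'):
        link = dd.xpath('.//div[@class="channel-detail movie-item-title"]/a')
        if not link:
            continue  # skip entries without a title link
        # string(...) joins all descendant text, so it covers both the split
        # "9." + "1" case and the plain-text "no rating" case
        score = dd.xpath(
            'string(.//div[@class="channel-detail channel-detail-orange"])'
        ).strip()
        movies.append({
            'name': link[0].xpath('string(.)').strip(),
            'url': base_url + link[0].get('href'),
            'score': score,
        })
    return movies


def score_key(movie):
    """Sort key: numeric score, with unrated films placed first."""
    try:
        return (1, float(movie['score']))
    except ValueError:
        return (0, 0.0)


movies = sorted(parse_movies(SAMPLE_HTML), key=score_key)
for m in movies:
    print('{} {} {}'.format(m['name'], m['score'], m['url']))
```

Because every lookup starts from the film's own `<dd>`, a film with a missing score node simply gets an empty score instead of shifting all later scores by one position.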
Final result
P.S. In my view, Blade Runner and Alien: Covenant are both very good films, made by highly skilled directors (mixed feelings).