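I recently noticed that kennethreitz, the author of requests, has released a new library called requests-html, so I took it for a spin. The library aims to make parsing HTML (for example, scraping web pages) as simple and intuitive as possible.
Official documentation: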
http://html.python-requests.org/
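Let's scrape the data for NetEase's 11选5 (pick 5 of 11) lottery.
First, open the site, then open the developer tools and locate the HTML that holds the draw results.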
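To make clear what we are looking for in the developer tools, here is a rough sketch of the markup shape that the selectors in this post rely on: `td` cells with class `start` that carry `data-period` and `data-award` attributes. The fragment and its values are invented for illustration, and the standard-library `html.parser` is used here only as a stand-in so the snippet runs offline; the actual scraper uses requests-html.

```python
from html.parser import HTMLParser

# Hypothetical fragment: the attribute names match the scraper in this post,
# but the values are invented for illustration.
SAMPLE = """
<table><tbody>
  <tr>
    <td class="start" data-period="18040201" data-award="01 02 03 04 05">01</td>
    <td class="start">02</td>
  </tr>
</tbody></table>
"""


class StartCellParser(HTMLParser):
    """Collect the attributes of every <td> whose class list contains 'start'."""

    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "td" and "start" in classes:
            self.cells.append(attr_map)


parser = StartCellParser()
parser.feed(SAMPLE)
for cell in parser.cells:
    # The second cell has no data attributes, so .get() returns None for it.
    print(cell.get("data-period"), cell.get("data-award"))
```

Note that the second cell has no `data-*` attributes at all, which is the situation a draw that has not happened yet would produce.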
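from requests_html import HTMLSession

session = HTMLSession()

def getData():
    response = session.get('http://caipiao.163.com/award/11xuan5/')
    # Narrow the search to the main section, then collect every <tbody> inside it.
    content = response.html.find('section.main', first=True)
    body = content.find('tbody')
    itemDicts = dict()
    for tr in body:
        # Each td.start cell carries the period and winning numbers as data attributes.
        cells = tr.find('td.start')
        for td in cells:
            try:
                period = td.attrs['data-period']
                award = td.attrs['data-award']
                print("No.: " + td.text + "  Period: " + period + "  Winning numbers: " + award)
                itemDicts[period] = award
            except KeyError as e:
                # Cells for draws that have not happened yet lack these attributes.
                print('except: ', e)
            finally:
                print('finally')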
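Some draws have not happened yet, so their cells lack the data attributes; that is why the lookups are wrapped in try...except. The page presents the results as a table, so we also need to sort them by period number.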
    # Still inside getData(): sort the period numbers in ascending order.
    sortItemDict = sorted(itemDicts.keys(), reverse=False)
    # print(sortItemDict)
    for key in sortItemDict:
        print("Period:", key, " Winning numbers:", itemDicts[key])
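The two steps above (skipping cells without data, then sorting by period) can also be sketched without try...except. This is only a sketch under a few assumptions: that `td.attrs` behaves like a regular dict, so `.get()` is available; that period numbers consist of digits only; and that the sample values, which are invented, resemble the real ones. `collect_awards` is a hypothetical helper name, and plain dicts stand in for `td.attrs` so the snippet runs offline.

```python
def collect_awards(cells):
    """Map period -> award, skipping cells whose draw has not happened yet."""
    awards = {}
    for attrs in cells:
        period = attrs.get("data-period")
        award = attrs.get("data-award")
        if period is None or award is None:
            continue  # attributes are absent: nothing has been drawn
        awards[period] = award
    return awards


# Plain dicts standing in for td.attrs; the values are invented.
cells = [
    {"data-period": "18040210", "data-award": "02 04 06 08 10"},
    {},  # a draw that has not happened yet
    {"data-period": "18040201", "data-award": "01 02 03 04 05"},
    {"data-period": "18040202", "data-award": "03 05 07 09 11"},
]

awards = collect_awards(cells)

# Sort by the numeric value of the period so that strings of different
# lengths would still order correctly (assumes digits only).
rows = sorted(awards.items(), key=lambda item: int(item[0]))
for period, award in rows:
    print("Period:", period, " Winning numbers:", award)

# Why key=int can matter: plain string sorting is lexicographic.
lexicographic = sorted(["9", "10", "2"])
numeric = sorted(["9", "10", "2"], key=int)
```

When every period string has the same length, as in the sample above, plain `sorted()` on the keys gives the same order, which is why the original code works as written.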
Final result:
Full code (it turned out to save a lot of work, since you can call find directly on the elements):
from requests_html import HTMLSession

session = HTMLSession()

def getData():
    response = session.get('http://caipiao.163.com/award/11xuan5/')
    # Narrow the search to the main section, then collect every <tbody> inside it.
    content = response.html.find('section.main', first=True)
    body = content.find('tbody')
    itemDicts = dict()
    for tr in body:
        # Each td.start cell carries the period and winning numbers as data attributes.
        cells = tr.find('td.start')
        for td in cells:
            try:
                period = td.attrs['data-period']
                award = td.attrs['data-award']
                print("No.: " + td.text + "  Period: " + period + "  Winning numbers: " + award)
                itemDicts[period] = award
            except KeyError as e:
                # Cells for draws that have not happened yet lack these attributes.
                print('except: ', e)
            finally:
                print('finally')
    # Print the collected results in ascending order of period number.
    sortItemDict = sorted(itemDicts.keys(), reverse=False)
    # print(sortItemDict)
    for key in sortItemDict:
        print("Period:", key, " Winning numbers:", itemDicts[key])

if __name__ == '__main__':
    getData()