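I've known for a while that the guru had released a new library, but laziness kept me from trying it myself. Today I'm running a hands-on test to see whether, in real scraping work, the library is as convenient as its author claims: HTML Parsing for Humans.
A brief introduction to the library's features is in my previous post: "Requests, parsing, and JS rendering all in one: the requests-html library".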
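- Uses only one library: requests-html
- Test target: the Jianshu home page: https://www.jianshu.com/
- Features used: User-Agent generation with user_agent(), JavaScript rendering with simulated scrolling, nested CSS and XPath parsing, and absolute_links for absolute URLs.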
That's all I'm testing for now; I'll add more when I run into a suitable scraping case~
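import requests_html

text_url = "https://www.jianshu.com/"

# Get a User-Agent string. Called without a style argument, user_agent()
# appears to return the library's built-in default rather than a random one
# (the output below matches that default).
user_agent = requests_html.user_agent()
print("User-Agent:", user_agent)

# Create a session object
session = requests_html.HTMLSession()
headers = {
    "User-Agent": user_agent
}

# Request the Jianshu home page
r = session.get(text_url, headers=headers)

# Render the JavaScript content, simulating 5 scroll-downs with a 1-second pause after each
r.html.render(scrolldown=5, sleep=1)

# Parse with a CSS selector
items = r.html.find("ul.note-list li")
# The printed label means "total number of articles fetched so far"
print("当前获得总文章数:", len(items))

for item in items:
    # Nested XPath parsing to extract the article title and author
    title = item.xpath('.//a[@class="title"]', first=True).text
    author = item.xpath('.//a[@class="nickname"]', first=True).text
    # No selectors here: lazily use absolute_links to get the full URLs of
    # the article, its comments, and the author's home page.
    # absolute_links returns a set of unique URLs; convert it to a list.
    link_list = list(item.absolute_links)
    print({
        'title': title,
        'author': author,
        'link_list': link_list,
    })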
Output:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8
当前获得总文章数: 21
{'title': '男人不再爱你,而是又爱上了别人,才会有这5种表现', 'author': '蓓蓓情', 'link_list': ['https://www.jianshu.com/p/650c7cfc56b2#comments', 'https://www.jianshu.com/u/4c41bdfecd3b', 'https://www.jianshu.com/p/650c7cfc56b2']}
{'title': '知道这些穿搭小诀窍,又可以每天多睡五分钟啦!', 'author': '时髦精小C', 'link_list': ['https://www.jianshu.com/p/5d67389aa8ee#comments', 'https://www.jianshu.com/p/5d67389aa8ee', 'https://www.jianshu.com/u/883151cfab80']}
{'title': '呼吸入腰,百病全消', 'author': '觉太极崔云鸣', 'link_list': ['https://www.jianshu.com/p/455d75b07298#comments', 'https://www.jianshu.com/p/455d75b07298', 'https://www.jianshu.com/u/097121dee43e']}
{'title': '娱乐圈明星的美脚,赵丽颖郭碧婷上榜夏誉轩可谓是五千年最美的脚', 'author': '影视红人淘', 'link_list': ['https://www.jianshu.com/p/1988e7b69df1', 'https://www.jianshu.com/u/f3edbd94dfec', 'https://www.jianshu.com/p/1988e7b69df1#comments']}
{'title': '女生必须知道的48件小事', 'author': '伪人_bc2a', 'link_list': ['https://www.jianshu.com/p/62d7a211e58d', 'https://www.jianshu.com/p/62d7a211e58d#comments', 'https://www.jianshu.com/u/7179b9ac408d']}
{'title': '当明星离开美图和PS后:杨丽萍一脸老相,吴亦凡区别实在太大了', 'author': '长安air', 'link_list': ['https://www.jianshu.com/p/7198ae55ea7e', 'https://www.jianshu.com/u/6369232b240f', 'https://www.jianshu.com/p/7198ae55ea7e#comments']}
{'title': '你爱用手账吗?', 'author': '出战', 'link_list': ['https://www.jianshu.com/u/106f6530144f', 'https://www.jianshu.com/p/d3b3bcc6aeeb#comments', 'https://www.jianshu.com/p/d3b3bcc6aeeb']}
{'title': '喜欢独处的人都具有这8项人格特质,你也有吗?', 'author': '云儿_0101', 'link_list': ['https://www.jianshu.com/p/e54b195f272c', 'https://www.jianshu.com/p/e54b195f272c#comments', 'https://www.jianshu.com/u/6312196f74af']}
{'title': '我离开你,不是不喜欢你了,而是', 'author': '轻水兰洲', 'link_list': ['https://www.jianshu.com/p/0069b3b0fa03#comments', 'https://www.jianshu.com/u/fc85cdca00ee', 'https://www.jianshu.com/p/0069b3b0fa03']}
{'title': '一万多仅仅为了信仰?谈谈苹果笔记本的优缺点', 'author': '佳简科技', 'link_list': ['https://www.jianshu.com/u/8b6a9d138f38', 'https://www.jianshu.com/p/7ed597524b2f', 'https://www.jianshu.com/p/7ed597524b2f#comments']}
{'title': '焦俊艳恐婚:婚姻,真的太有趣了!', 'author': '漫漫Chen', 'link_list': ['https://www.jianshu.com/p/33234e4ae1a4#comments', 'https://www.jianshu.com/p/33234e4ae1a4', 'https://www.jianshu.com/u/6e176873807c']}
{'title': '胸脯', 'author': '汉天真', 'link_list': ['https://www.jianshu.com/u/28139ce42555', 'https://www.jianshu.com/p/489fa707a41e#comments', 'https://www.jianshu.com/p/489fa707a41e']}
{'title': '寂静法师:心中有多少阴暗,生活有多少灾难', 'author': '阿宝阳光', 'link_list': ['https://www.jianshu.com/u/0f56e7515e3c', 'https://www.jianshu.com/p/b36afc0991f1', 'https://www.jianshu.com/p/b36afc0991f1#comments']}
{'title': '馋哭朋友圈的烧卖我也会了', 'author': '七月的桃之妖妖', 'link_list': ['https://www.jianshu.com/p/16befa7bbef5', 'https://www.jianshu.com/p/16befa7bbef5#comments', 'https://www.jianshu.com/u/be5d30cd4219']}
{'title': '“用Mac口红的女生,不配和我谈恋爱!”', 'author': '是茧里啊', 'link_list': ['https://www.jianshu.com/p/cd7dbea62fcb', 'https://www.jianshu.com/u/90fa365750f8', 'https://www.jianshu.com/p/cd7dbea62fcb#comments']}
{'title': '男人“这样”和你聊天,说白了就是不够爱你,笨女人才不懂', 'author': '爱情摇篮', 'link_list': ['https://www.jianshu.com/u/8f29ba3025ea', 'https://www.jianshu.com/p/6942d9902eaa#comments', 'https://www.jianshu.com/p/6942d9902eaa']}
{'title': '学院风、通勤风、民族风自由切换!9套冬日搭配分享', 'author': 'sukami', 'link_list': ['https://www.jianshu.com/p/2720a33bdd40#comments', 'https://www.jianshu.com/u/a17438634136', 'https://www.jianshu.com/p/2720a33bdd40']}
{'title': '人要赚钱,要简单', 'author': '学愉创业思维', 'link_list': ['https://www.jianshu.com/u/cca7c3a9641e', 'https://www.jianshu.com/p/0f781183522b', 'https://www.jianshu.com/p/0f781183522b#comments']}
{'title': '丧心病狂的Github技巧', 'author': '凯睿看世界', 'link_list': ['https://www.jianshu.com/u/5b2be4252766', 'https://www.jianshu.com/p/758e8bd48308#comments', 'https://www.jianshu.com/p/758e8bd48308']}
{'title': '真心太准了!有没敢自报星座的,我是巨蟹座的。', 'author': '伍香权', 'link_list': ['https://www.jianshu.com/u/dce559f63fef', 'https://www.jianshu.com/p/b022717d86e5#comments', 'https://www.jianshu.com/p/b022717d86e5']}
{'title': 'Flutter提升开发效率的一些方法和工具', 'author': 'Android技术干货分享', 'link_list': ['https://www.jianshu.com/u/06fd4cf1f427', 'https://www.jianshu.com/p/4b5da079bc6b', 'https://www.jianshu.com/p/4b5da079bc6b#comments']}
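One caveat about the loop above: as far as I can tell, xpath(..., first=True) returns None when nothing matches, so calling .text directly would raise an AttributeError on any list item that lacks a title or nickname link. The run above did not hit that case, but a small guard is cheap. The helper name first_text below is my own invention for illustration, not part of requests-html; treat this as a minimal sketch.

```python
def first_text(element, xpath_expr):
    """Return the text of the first XPath match, or None when nothing matches.

    `element` is assumed to be any object exposing xpath(expr, first=True),
    such as a requests-html Element. `first_text` is a hypothetical helper.
    """
    node = element.xpath(xpath_expr, first=True)
    return node.text if node is not None else None
```

Inside the loop it would be used as title = first_text(item, './/a[@class="title"]'), and items whose title comes back as None could then be skipped.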
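I did run into one thing that differed from what I expected: absolute_links returns a set, and sets are unordered, so my original plan of grabbing all the links of a single element and then assigning them to specific variables by index fell through~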
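One way around the ordering problem is to classify each link by its shape instead of its position. The sketch below assumes the URL patterns visible in the output above (/p/... for articles, /p/...#comments for comments, /u/... for author pages); the function name classify_links is hypothetical, and the patterns may not hold for every list item.

```python
from urllib.parse import urlparse


def classify_links(links):
    """Sort an unordered collection of Jianshu URLs into named slots.

    Assumes the patterns seen in the sample output: author pages live
    under /u/, articles under /p/, and comment links are article links
    with a "#comments" fragment.
    """
    result = {"article": None, "comments": None, "author": None}
    for link in links:
        parsed = urlparse(link)
        if parsed.path.startswith("/u/"):
            result["author"] = link
        elif parsed.path.startswith("/p/"):
            key = "comments" if parsed.fragment == "comments" else "article"
            result[key] = link
    return result


# Links copied from the first item in the output above
sample = {
    "https://www.jianshu.com/p/650c7cfc56b2#comments",
    "https://www.jianshu.com/u/4c41bdfecd3b",
    "https://www.jianshu.com/p/650c7cfc56b2",
}
classified = classify_links(sample)
print(classified)
```

In the scraping loop this would be called as classify_links(item.absolute_links), so the set never needs to be converted to a list or indexed.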
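The examples in this post are deliberately simple and were written to try out as many features as possible. For instance, rendering the page with render() is not really necessary here and takes too long; in practice you should analyze the site's Ajax interface and scrape that instead.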
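To give a rough idea of what the Ajax approach might look like, here is a minimal sketch that pages through a listing by URL instead of rendering and scrolling. Everything specific in it is an assumption: the page query parameter, the helper names build_page_url and fetch_pages, and the idea that the home page accepts such a parameter at all. The real endpoint and parameters need to be read from the browser's Network panel first.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.jianshu.com/"


def build_page_url(page):
    # "page" is a hypothetical parameter name; confirm the real one
    # by watching the requests the site makes while scrolling.
    return BASE_URL + "?" + urlencode([("page", page)])


def fetch_pages(fetch, max_pages):
    """Call fetch(url) for pages 1..max_pages and collect the results.

    `fetch` is any callable taking a URL, for example a thin wrapper
    around session.get(url, headers=headers) with no render() call.
    """
    return [fetch(build_page_url(page)) for page in range(1, max_pages + 1)]


# Stand-in fetch that just echoes the URL, so the sketch runs offline
urls = fetch_pages(lambda url: url, 3)
print(urls)
```

Swapping the lambda for a real request function keeps the paging logic unchanged, and each response can then be parsed with the same find and xpath calls used earlier.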