Using the logging module
import scrapy
import logging

logger = logging.getLogger(__name__)

class QbSpider(scrapy.Spider):
    name = 'qb'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        for i in range(10):
            item = {}
            item['content'] = "haha"
            # logging.warning(item)
            logger.warning(item)  # log via the module-level logger so the record carries __name__
            yield item
Run result
- Set LOG_FILE = './log.log' in settings to save warnings and errors to a log file
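Both LOG_FILE and LOG_LEVEL are standard Scrapy settings; a minimal settings.py sketch (the path and level here are just examples):

# settings.py -- send Scrapy's log output to a file instead of the console
LOG_FILE = './log.log'    # all log records are written to this file
LOG_LEVEL = 'WARNING'     # record only WARNING and above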
# Template
import logging

logging.basicConfig(level=logging.INFO,  # the original used an undefined log_level variable; INFO matches the sample output
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='parser_result.log',
                    filemode='w')

if __name__ == '__main__':
    logging.info('i am warning')
Result (this template is fairly detailed)
Fri, 21 Aug 2020 13:38:22 logfile.py[line:12] INFO i am warning
The pipeline file
import logging

logger = logging.getLogger(__name__)

class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # print(item)
        logger.warning(item)
        item['hello'] = 'world'
        return item
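The pipeline only receives items once it is activated in settings.py; a sketch assuming the project is named myspider:

# settings.py -- enable the pipeline; the number (0-1000) sets the run order, lower runs first
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}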
To save the log to a file, set LOG_FILE = './log.log' in the settings file
basicConfig format options
https://www.cnblogs.com/felixzh/p/6072417.html
Review
How to request the next page (pagination)
Tencent spider case study
By scraping job postings from Tencent's recruitment pages, we learn how to issue pagination requests
http://hr.tencent.com/position.php
Create the project
scrapy startproject tencent
Create the spider
scrapy genspider hr tencent.com
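The command generates a skeleton in tencent/spiders/hr.py roughly like the following (the exact template text can vary between Scrapy versions):

import scrapy

class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        pass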
Key points of scrapy.Request
scrapy.Request(url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, flags=None)
The most commonly used parameters:
callback: specifies which parse function the response for this URL is handed to
meta: passes data between different parse functions; meta also carries some built-in keys, such as the download delay and the request depth
dont_filter: tells Scrapy's deduplication not to filter this URL; Scrapy deduplicates URLs by default, so this matters for URLs that must be requested more than once
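A sketch showing how callback and meta combine for pagination; the XPath selectors and item fields are illustrative, not taken from the real Tencent page:

import scrapy

class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['http://hr.tencent.com/position.php']

    def parse(self, response):
        # extract postings on the current page (selector is an assumption)
        for row in response.xpath('//table[@class="tablelist"]//tr'):
            yield {'title': row.xpath('./td[1]/a/text()').get()}

        # callback: the next page is parsed by this same function
        # meta: response.meta on the next page will contain 'from_page'
        next_href = response.xpath('//a[@id="next"]/@href').get()
        if next_href:
            yield scrapy.Request(
                response.urljoin(next_href),
                callback=self.parse,
                meta={'from_page': response.url},
                # dont_filter=True would bypass URL deduplication;
                # distinct page URLs do not need it
            )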
Introducing and using Item
# Define the fields we need in advance, so a misspelled field name is caught when extracting data
items.py
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()
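In the spider an Item is used like a dict, but assigning to an undeclared field raises KeyError, which is exactly what catches typos; a brief sketch with placeholder values:

from tencent.items import TencentItem

item = TencentItem()
item['title'] = 'placeholder'      # a typo such as item['titel'] would raise KeyError
item['position'] = 'placeholder'
item['date'] = '2020-08-21'
print(dict(item))                  # an Item can be converted to a plain dict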
阳光政务平台 (Sunshine Government Affairs Platform) case study
http://wz.sun0769.com/index.php/question/questionType?type=4&page=0
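Here the page number travels in the URL's page query parameter, so the next page can be requested by incrementing it; a minimal sketch (the increment of 30 per page and the stop condition are assumptions):

import scrapy

class SunSpider(scrapy.Spider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']

    def parse(self, response):
        # ... extract the records on the current page here ...

        # build the next page URL by bumping the page parameter
        page = int(response.url.split('page=')[-1])
        if page < 90:  # illustrative stop condition
            next_url = response.url.replace('page=%d' % page, 'page=%d' % (page + 30))
            yield scrapy.Request(next_url, callback=self.parse)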