The default deduplication logic
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2017/5/7 22:43
# @Author : Aries
# @File : scrapy_filter.py
# @Software: PyCharm
import scrapy
from scrapy.http.request import Request


class FilterSpider(scrapy.Spider):
    name = 'filter'
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36",
    }

    def start_requests(self):
        yield Request(url='https://www.baidu.com/s?wd=22', headers=self.headers)

    def parse_print(self, response):
        self.logger.info(response.url)

    def parse(self, response):
        self.logger.info("--------------------------")
        yield Request(url='https://www.baidu.com/s?wd=1', callback=self.parse_print, headers=self.headers)
        yield Request(url='https://www.baidu.com/s?wd=3', callback=self.parse_print, headers=self.headers)
        yield Request(url='https://www.baidu.com/s?wd=3', callback=self.parse_print, headers=self.headers)
# The run output is as follows
2017-05-07 23:33:36 [scrapy.core.engine] INFO: Spider opened
2017-05-07 23:33:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-07 23:33:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-07 23:33:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=22> (referer: None)
2017-05-07 23:33:37 [filter] INFO: --------------------------
2017-05-07 23:33:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.baidu.com/s?wd=3> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-05-07 23:33:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=3> (referer: https://www.baidu.com/s?wd=22)
2017-05-07 23:33:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=1> (referer: https://www.baidu.com/s?wd=22)
2017-05-07 23:33:37 [filter] INFO: https://www.baidu.com/s?wd=3
2017-05-07 23:33:37 [filter] INFO: https://www.baidu.com/s?wd=1
2017-05-07 23:33:37 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-07 23:33:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
# As the run output shows: one of the two https://www.baidu.com/s?wd=3 requests was dropped by the default deduplication logic
2017-05-07 23:33:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.baidu.com/s?wd=3> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
Limitations of the default deduplication logic
Default dedup filter: 'scrapy.dupefilters.RFPDupeFilter'
It runs the URL carried by each request through request_fingerprint, which acts as a fingerprint check:
If the fingerprint already exists, the request is discarded and request_seen returns True.
If it does not exist, the fingerprint is added to self.fingerprints (here self.fingerprints = set()).
# Excerpt from Scrapy's built-in scrapy/dupefilters.py
class RFPDupeFilter(BaseDupeFilter):
    ...
    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
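To see what this gives in practice, here is a minimal check (a sketch assuming Scrapy 1.x, where request_fingerprint can be imported from scrapy.utils.request): requests for the same URL collapse to one fingerprint, which is exactly why the second wd=3 request in the log above was dropped.

from scrapy import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('https://www.baidu.com/s?wd=3')
r2 = Request('https://www.baidu.com/s?wd=3')
r3 = Request('https://www.baidu.com/s?wd=1')

# Identical URLs produce identical fingerprints, so the second wd=3 request is seen as a duplicate.
print(request_fingerprint(r1) == request_fingerprint(r2))  # True
# A different wd value changes the fingerprint, so wd=1 is crawled normally.
print(request_fingerprint(r1) == request_fingerprint(r3))  # False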
With this dedup logic, the following scenarios cannot be handled by default:
Case 1: the same URL requested on different dates (the page content changes over time)
https://www.baidu.com/s?wd=3 (crawled on 2017.05.05)
https://www.baidu.com/s?wd=3 (crawled on 2017.05.07 - dropped by default)
Case 2: only some of the URL parameters matter (here we only care about https://www.baidu.com/s?wd=3)
https://www.baidu.com/s?wd=3&s=1 (s=1)
https://www.baidu.com/s?wd=3&s=2 (s=2 differs, but wd is the same as in the previous request, so this one should be treated as a duplicate)
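The same kind of check (again a sketch, assuming Scrapy 1.x's request_fingerprint) confirms Case 2: because the extra s parameter is part of the fingerprinted URL, the two requests get different default fingerprints and both would be crawled, even though only wd matters to us.

from scrapy import Request
from scrapy.utils.request import request_fingerprint

a = Request('https://www.baidu.com/s?wd=3&s=1')
b = Request('https://www.baidu.com/s?wd=3&s=2')

# Different s values -> different fingerprints -> the default filter crawls both.
print(request_fingerprint(a) == request_fingerprint(b))  # False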
Solution:
- For Case 2: pass the wd value via meta into the request_fingerprint method, build a URL from it and return that, so the filter generates the fingerprint from wd alone
- Case 1 can be solved the same way: pass the date to request_fingerprint and return the corresponding URL combined with it, e.g. return request.url + "--" + date (a sketch follows below)
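Case 1 is not implemented in the rest of this post, but a possible sketch looks like this (hypothetical: it assumes the spider puts the crawl date into meta under "date", falling back to today's date otherwise), so the same URL becomes crawlable again on a different day:

from datetime import date

from scrapy.dupefilters import RFPDupeFilter


class DateAwareFilter(RFPDupeFilter):
    """Hypothetical filter for Case 1: fold the crawl date into the fingerprint."""

    def request_fingerprint(self, request):
        crawl_date = request.meta.get("date", date.today().isoformat())
        # Same URL + same date -> duplicate; same URL on another date -> crawled again.
        return request.url + "--" + crawl_date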
Step 1: override RFPDupeFilter
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2017/5/7 23:17
# @Author : Aries
# @File : custom_filter.py
# @Software: PyCharm
from scrapy.dupefilters import RFPDupeFilter


class CustomURLFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # If wd was passed via meta, build the fingerprint from it alone,
        # so requests that differ only in other parameters count as duplicates.
        if "wd" in request.meta:
            return "https://www.baidu.com/s" + "--" + request.meta["wd"]
        # Otherwise fall back to using the raw URL as the fingerprint.
        return request.url
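A variation worth considering (my own sketch, not part of the original example, and assuming Python 3's urllib.parse): instead of passing wd through meta on every request, the filter can parse it straight out of the URL, which removes the risk of forgetting to set the meta key.

from urllib.parse import parse_qs, urlparse

from scrapy.dupefilters import RFPDupeFilter


class QueryBasedURLFilter(RFPDupeFilter):
    """Hypothetical variation: derive wd from the URL itself instead of request.meta."""

    def request_fingerprint(self, request):
        query = parse_qs(urlparse(request.url).query)
        if "wd" in query:
            # Requests sharing the same wd value collapse to one fingerprint.
            return "https://www.baidu.com/s--" + query["wd"][0]
        return request.url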
Step 2: enable custom_settings and set the corresponding meta on each request
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2017/5/7 22:43
# @Author : Aries
# @File : scrapy_filter.py
# @Software: PyCharm
import scrapy
from scrapy.http.request import Request


class FilterSpider(scrapy.Spider):
    name = 'filter'
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36",
    }
    custom_settings = {
        'DUPEFILTER_DEBUG': True,
        'DUPEFILTER_CLASS': "lagou.custom_filter.CustomURLFilter",
    }

    def start_requests(self):
        yield Request(url='https://www.baidu.com/s?wd=22', headers=self.headers, meta={"wd": "22"})

    def parse_print(self, response):
        self.logger.info(response.url)

    def parse(self, response):
        self.logger.info("--------------------------")
        yield Request(url='https://www.baidu.com/s?wd=1', callback=self.parse_print, headers=self.headers, meta={"wd": "1"})
        yield Request(url='https://www.baidu.com/s?wd=3&s=1', callback=self.parse_print, headers=self.headers, meta={"wd": "3"})
        yield Request(url='https://www.baidu.com/s?wd=3&s=2', callback=self.parse_print, headers=self.headers, meta={"wd": "3"})
# The run output is as follows
2017-05-07 23:31:14 [scrapy.core.engine] INFO: Spider opened
2017-05-07 23:31:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-07 23:31:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-07 23:31:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=22> (referer: None)
2017-05-07 23:31:14 [filter] INFO: --------------------------
2017-05-07 23:31:14 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.baidu.com/s?wd=3&s=2>
2017-05-07 23:31:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=3&s=1> (referer: https://www.baidu.com/s?wd=22)
2017-05-07 23:31:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com/s?wd=1> (referer: https://www.baidu.com/s?wd=22)
2017-05-07 23:31:14 [filter] INFO: https://www.baidu.com/s?wd=3&s=1
2017-05-07 23:31:15 [filter] INFO: https://www.baidu.com/s?wd=1
2017-05-07 23:31:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-07 23:31:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
# As the run output shows: of the two wd=3 requests, https://www.baidu.com/s?wd=3&s=2 was filtered out by the custom filter because its wd value matches the earlier &s=1 request
2017-05-07 23:31:14 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.baidu.com/s?wd=3&s=2>