In earlier posts I worked through the Scrapy framework itself. This post builds on that crawler series with a hands-on project: using scrapy_redis for distributed crawling and MongoDB for storage.
The knowledge points I picked up from this hands-on project:
1. The difference between __init__ and __str__ in a class
2. How to call things by absolute path
3. Distributed deployment with scrapy_redis
4. CrawlSpider, and how to use LinkExtractor and Rule within it
First, "rules"
rules holds one or more Rule objects, and each Rule defines a specific action to take while crawling the site.
If several Rules match the same link, the first one defined in the collection is the one that gets used.
Here rules are mainly used to cover the listing pages and the product detail pages.
The main parameters of a Rule are: link_extractor, a Link Extractor object that defines how links are extracted from the crawled pages;
callback, the callback function;
follow, a boolean that says whether links extracted from the response by this rule should themselves be followed. If callback is None, follow defaults to True, otherwise it defaults to False.
Next, a closer look at "LinkExtractor". Its main arguments are:
allow: URLs matching the given regular expression (or list of regular expressions) are extracted; if empty, everything matches.
deny: URLs matching this regular expression (or list of regular expressions) are excluded from extraction.
allow_domains: the domains whose links will be extracted.
deny_domains: the domains whose links will never be extracted.
restrict_xpaths: XPath expressions that work together with allow to restrict where on the page links are extracted from.
For example:

rules = (
    Rule(LinkExtractor(allow=(r'category/20/index_\d+\.html',), restrict_xpaths=("//div[@class='left']",))),
    Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',), restrict_xpaths=("//div[@class='left']",)), callback='parse_item'),
)

The two Rules above define the link-extraction patterns for the listing's pagination pages and for the detail pages respectively;
the key point is using restrict_xpaths so that the links to crawl next are only taken from a specific part of the page.
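Putting Rule and LinkExtractor together, here is a minimal CrawlSpider sketch built around those same two patterns; the spider name, domain, start URL and XPath are placeholders, not a real site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']  # hypothetical site
    start_urls = ['http://example.com/category/20/index_1.html']

    rules = (
        # follow the pagination links inside the left-hand column, without a callback
        Rule(LinkExtractor(allow=(r'category/20/index_\d+\.html',),
                           restrict_xpaths=("//div[@class='left']",)), follow=True),
        # hand the detail pages to parse_item
        Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',),
                           restrict_xpaths=("//div[@class='left']",)), callback='parse_item'),
    )

    def parse_item(self, response):
        # extract whatever fields are needed from the detail page
        yield {'url': response.url, 'title': response.xpath('//title/text()').extract_first()}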
5. A quick review of the re module
pattern = re.compile(regular_expression, re.S)  # compile the pattern to match against
re.match(pattern, content): match tries to match the pattern at the very start of the string; if the match does not begin at the start, it returns None.
re.search(): scans the whole string and returns the first successful match.
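A small sketch of the difference (the sample strings are made up):

import re

pattern = re.compile(r'page=(\d+)', re.S)
print(re.match(pattern, 'page=3&cat=655'))               # match object: the pattern sits at position 0
print(re.match(pattern, 'list.html?page=3'))             # None: 'page=' is not at the start of the string
print(re.search(pattern, 'list.html?page=3').group(1))   # '3': search scans the whole string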
6. The meta parameter of Scrapy's Request
meta passes information on to the next callback. You can think of it as
assigning whatever needs to be carried over to the meta variable, except that meta only accepts a dict,
for example meta={'key1': value1}.
To read value1 in the next callback:
because meta travels along with the request, the response object that the next callback receives
carries the same meta, i.e.
response.meta
so value1 = response.meta['key1'].
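A minimal sketch of handing data from one callback to the next via meta (the URLs and field names here are placeholders):

import scrapy
from scrapy.http import Request

class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'
    start_urls = ['http://example.com/list']  # hypothetical listing page

    def parse(self, response):
        item = {'key1': 'value1'}  # meta must be a dict
        yield Request(url='http://example.com/detail',
                      callback=self.parse_detail,
                      meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']  # the dict travels with the request and comes back on the response
        item['detail_url'] = response.url
        yield item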
7. The difference between response.text and response.content
response.text returns Unicode (str) data,
while response.content returns bytes, i.e. raw binary data.
So use response.text to get text,
and response.content to get images or other files.
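For instance, with the requests library (the URLs are only illustrative):

import requests

html = requests.get('https://example.com').text              # str: decoded text, good for HTML or JSON
raw = requests.get('https://example.com/logo.png').content   # bytes: raw binary, good for images and files

with open('logo.png', 'wb') as f:  # binary data has to be written in 'wb' mode
    f.write(raw)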
8. json.loads, which loads a JSON string into a dict.
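For example, with a payload shaped like the JD price response used later in this post:

import json

price_json = '{"id": "J_100001693779", "m": "9999.00", "op": "1299.00", "p": "1299.00"}'
price_dict = json.loads(price_json)  # str -> dict
print(price_dict['p'])               # '1299.00'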
Let's start by analysing the structure of JD's mobile-phone listing page. Its URL looks like this:
http://list.jd.com/list.html?cat=9987,653,655&page=1 where cat is the id of the product category being searched and page is the page number of the listing.
We use the LinkExtractor inside a Rule to extract the listing-page links and hand the returned responses to the detail-parsing callback parse_item.
import re
import scrapy
import json
import logging
import requests
from scrapy.http import Request
from ..items import JdItem
from scrapy_redis.spiders import RedisCrawlSpider  # used for distributed deployment
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor  # the link-extraction helper used with CrawlSpider rules

logger = logging.getLogger('jd')  # logger for this spider

class JingdongSpider(RedisCrawlSpider):
    name = 'jingdong'
    allowed_domains = ['jd.com']
    # start_urls = ['http://list.jd.com/list.html?cat=9987,653,655&page=1']  # change as needed; phones are just the example here
    redis_key = 'jd:start_urls'  # the spider reads its start URLs from this Redis list
    rules = [Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_item', follow=True)]  # hand every extracted listing page to parse_item
The detail-parsing callback below extracts the shop name, the product id, the detail-page URL, the product image URL, and the request URL for the product's price. The first few fields all live inside the html elements with class="gl-item"; I won't go through them one by one here, you can check the XPath expressions against the code below.
The price deserves a mention: besides extracting it from those elements with XPath or CSS selectors, another approach is shown here, namely requesting the JSON endpoint base_price_url where the price is stored.
After the analysis above, the corresponding code is as follows:
    def parse_item(self, response):
        pattern = re.compile(r'page=(\d+)', re.S)  # pick out which listing page we are currently on
        num = re.search(pattern, response.url).group(1)
        print('Currently crawling listing page %s' % num)
        phone_items = response.xpath('//li[@class="gl-item"]')  # one node per product on the listing page
        # one phone model has several SKUs with different prices; SKU prices are loaded asynchronously via a jQuery (JSONP) endpoint
        base_price_url = 'https://p.3.cn/prices/mgets?callback=jQuery%s&skuIds=J_%s'
        for phone_item in phone_items:
            item = JdItem()  # instantiate the item first
            item['jd_shop_name'] = phone_item.xpath('./div/div[@class="p-shop"]/@data-shop_name').extract_first()  # shop name
            item['product_id'] = phone_item.xpath('.//div/@data-sku').extract_first()  # product id
            item['jd_page_url'] = 'http://' + phone_item.xpath('.//div[@class="p-img"]/a/@href').extract_first()  # detail-page url
            price_url = base_price_url % (item['product_id'], item['product_id'])  # the price url for this SKU
            jd_img_url = phone_item.xpath('.//img[@height="220"]/@src').extract_first()
            item['jd_img_url'] = 'https:' + jd_img_url  # image url
            yield Request(url=price_url, callback=self.parse_price, meta={'item': item})
The next callback, parse_price, then extracts the price:
    def parse_price(self, response):
        item = response.meta['item']  # recover the item built in the previous callback
        # the body is JSONP, stored like: jQuery5267943([{"id":"J_100001693779","m":"9999.00","op":"1299.00","p":"1299.00"}])
        price_json = response.text
        pattern = re.compile(r'jQuery\d+\(\[(.*?)\]\)')
        price_dict = re.search(pattern, price_json).group(1)
        # then load the captured text into a dict
        item['jd_product_price'] = json.loads(price_dict)['p']
        yield Request(url=item['jd_page_url'], callback=self.parse_detail, meta={'item': item})
Next comes the comment (review) data, which is likewise served as JSON from an asynchronous jQuery-style endpoint.
The corresponding code is:
    def parse_detail(self, response):
        item = response.meta['item']  # recover the item again
        product_id = response.meta['item']['product_id']
        url = 'http://club.jd.com/comment/productPageComments.action?productId=%s&score=0&sortType=6&page=0&pageSize=10' % (
            product_id)
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-US,en;q=0.5',
            'Connection': 'keep-alive',
            'Host': 'club.jd.com',
            'Referer': 'https://item.jd.com/%s.html' % product_id,  # the referring url
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:52.0) Gecko/20100101 '
                          'Firefox/52.0',
        }
        yield Request(url=url, callback=self.get_all_comment, headers=headers, method='GET', meta={'item': item})
    def get_all_comment(self, response):
        item = response.meta['item']
        product_id = response.meta['item']['product_id']
        if response:
            data = json.loads(response.text)  # parse the JSON body
            productCommentSummary = data.get('productCommentSummary', '')
            if productCommentSummary:
                item['jd_comment_num'] = productCommentSummary.get('commentCount', '')  # total number of comments
                item['jd_good_count'] = productCommentSummary.get('goodCount', '')  # positive reviews
                item['jd_gen_count'] = productCommentSummary.get('generalCount', '')  # neutral reviews
                item['jd_bad_count'] = productCommentSummary.get('poorCount', '')  # negative reviews
                item['jd_add_count'] = productCommentSummary.get('afterCount', '')  # follow-up reviews
        comment_info_list = []
        for i in range(1, 3):  # fetch comment pages 1 and 2; widen the range if you want more
            url = 'http://club.jd.com/comment/productPageComments.action?productId=%s&score=0&sortType=6&page=%s&pageSize=10' % (product_id, i)
            response = requests.get(url)
            if response:
                data = json.loads(response.text)
                # the comments themselves
                comments = data.get('comments', '')  # the second argument is returned when the key is missing
                for comment in comments:
                    comment_info_dict = {}
                    comment_info_dict['jd_content'] = comment.get('content', '')
                    comment_info_dict['jd_creationTime'] = comment.get('creationTime', '')
                    comment_info_dict['jd_userClientShow'] = comment.get('userClientShow', '')
                    comment_info_dict['jd_id'] = comment.get('id', '')
                    comment_info_dict['jd_userLevelName'] = comment.get('userLevelName', '')
                    comment_info_list.append(comment_info_dict)
        item['jd_comments'] = comment_info_list
        yield item  # this step matters: it hands the item over to the item pipelines
The complete spider is simply the code above put together; it is the core of the Scrapy project
and is saved as jingdong.py.
The other project files play the roles described in the previous post; in brief:
- items.py declares, in one place, the fields to be scraped
- middlewares.py sits as a hook between requests and responses, so its main job is to keep requests going through smoothly; it is typically used to add proxies or headers, and in this example it adds a random User-Agent
- pipelines.py stores the returned items, in this example into MongoDB
- settings.py declares configuration variables such as MONGO_URI, MONGO_DATABASE and USER_AGENTS, and is also where the Redis connection parameters are set
The corresponding code is as follows:
- items.py
import scrapy
from scrapy import Item, Field

class JdItem(Item):
    jd_img_url = Field()        # product image url
    jd_page_url = Field()       # detail-page url
    jd_product_price = Field()  # product price
    jd_shop_name = Field()      # shop name
    jd_comment_num = Field()    # total number of comments
    product_id = Field()        # product id
    jd_good_count = Field()     # positive reviews
    jd_gen_count = Field()      # neutral reviews
    jd_bad_count = Field()      # negative reviews
    jd_add_count = Field()      # follow-up reviews
    jd_comments = Field()       # comment contents
- middlewares.py (additions on top of the generated file, which otherwise stays untouched)
import random
from scrapy.utils.project import get_project_settings

class RandomUserAgent(object):
    def __init__(self):
        self.user_agents = get_project_settings().get('USER_AGENTS')

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agents)
        request.headers.setdefault('User-Agent', user_agent)
- pipelines.py
import pymongo

class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
- settings.py
# Scrapy settings for jd project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jd'
SPIDER_MODULES = ['jd.spiders']
NEWSPIDER_MODULE = 'jd.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jd (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# Do not hard-code a Host header here: these defaults are sent to every domain the spider visits
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:52.0) Gecko/20100101 '
                  'Firefox/52.0',
}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'jd.middlewares.JdSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'jd.middlewares.JdDownloaderMiddleware': 543,
    'jd.middlewares.RandomUserAgent': 400,  # enable the random User-Agent middleware defined above
}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'jd.pipelines.MongoPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 500,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)',
'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
'Mozilla/5.0 (Linux; U; Android 4.0.3; zh-cn; M032 Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
'Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13'
]
# Allow pausing and resuming; the request records kept in Redis are not lost
SCHEDULER_PERSIST = True
# IP address and port of the Redis server
REDIS_HOST = '127.0.0.1'  # the default host ip is 127.0.0.1
REDIS_PORT = 6379
# URL de-duplication handled by scrapy_redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Swap the scheduler for the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the scrapy-redis priority queue
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
MONGO_URI = 'localhost'
MONGO_DATABASE = 'JD'
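With the settings in place, the distributed crawl is started by running scrapy crawl jingdong on each worker machine and then seeding the jd:start_urls list in Redis with the first listing URL. A minimal sketch of doing that seeding with the redis-py client, assuming Redis runs locally as configured above:

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)
r.lpush('jd:start_urls', 'http://list.jd.com/list.html?cat=9987,653,655&page=1')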
That covers the whole scrapy_redis project for crawling JD product information and comments. The next post will continue using Scrapy to crawl other interesting sites. The code for this post lives on GitHub:
https://github.com/luzhisheng/jd-scrapy-redis