- Scrapy
- Unicode to UTF-8 encoding conversion
1. Installing Scrapy
conda install scrapy
Verify that the installation succeeded:
scrapy version
2. Using the scrapy shell
- Usage
scrapy shell -s ROBOTSTXT_OBEY=False "http://mp.weixin.qq.com/s?__biz=MjM5MTI0NjQ0MA==&mid=402001834&idx=1&sn=fbe58fd99b6a1b64e6764a436964ba4a&scene=21#wechat_redirect"
scrapy shell -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36' "http://www.jianshu.com/trending/weekly?utm_medium=index-banner-s&utm_source=desktop&page=5"
- Testing whether CSS and XPath expressions are correct
response.xpath('//*[(@id ="TopicsNode")]//td[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]')
topic.css('a::attr("href")').extract_first()
- Checking whether the returned page content is correct (opens the response in a browser)
view(response)
- Getting the response status code
response.status
3. Crawling v2ex.com
- URL structure
url = 'https://www.v2ex.com/go/python?p={}'.format(page_number)
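Given that pattern, the listing URLs for a crawl can be generated up front; the page range (1 to 5) here is just an assumption for illustration:

```python
# Build the listing URLs from the ?p= pagination pattern.
base = 'https://www.v2ex.com/go/python?p={}'
urls = [base.format(page_number) for page_number in range(1, 6)]

print(urls[0])   # https://www.v2ex.com/go/python?p=1
print(len(urls)) # 5
```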
4. v2ex spider code
5. Unicode to UTF-8 encoding conversion
Scrapy outputs Unicode by default; modify pipelines.py so that items are encoded as UTF-8:
import codecs
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # codecs.open writes the file with an explicit UTF-8 encoding.
        self.file = codecs.open('jianshu_data_utf-8.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters as-is
        # instead of escaping them to \uXXXX sequences.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
After the changes, enable the Item Pipeline component by adding its class path to ITEM_PIPELINES in settings.py (lower numbers run first).
ITEM_PIPELINES = {
'myproject.pipelines.PricePipeline': 300,
'myproject.pipelines.JsonWriterPipeline': 800,
}
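The effect of `ensure_ascii=False` in the pipeline above can be seen directly with `json.dumps`; the sample item below is a made-up example:

```python
import json

item = {'title': '简书'}

# Default behaviour: non-ASCII characters are escaped to \uXXXX sequences.
escaped = json.dumps(item)
# With ensure_ascii=False, the UTF-8 characters are written as-is.
raw = json.dumps(item, ensure_ascii=False)

print(escaped)  # {"title": "\u7b80\u4e66"}
print(raw)      # {"title": "简书"}
```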