0. Basics:
1) Introduction to search-engine crawlers --> incremental crawlers and distributed crawlers
http://www.zouxiaoyang.com/archives/386.html
http://docs.pythontab.com/scrapy/scrapy0.24/intro/overview.html
scrapy crawl -s LOG_FILE=./logs/liter.log -s MONGODB_COLLECTION=literature literatureSpider
#http://doc.scrapy.org/en/latest/topics/jobs.html
scrapy crawl douban8590Spider -s JOBDIR=crawls/douban8590Spider -s MONGODB_DB=douban -s MONGODB_COLLECTION=book8590
1. Run your spider with the -a option, like:
scrapy crawl myspider -a filename=text.txt
Then read the file in the __init__ method of the spider and define start_urls:
from scrapy.spider import BaseSpider   # in newer Scrapy: from scrapy import Spider

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            with open(filename, 'r') as f:
                # strip newlines so each line is a usable URL
                self.start_urls = [line.strip() for line in f]
2. Scrapy can be configured via settings so that the spider does not shut down automatically once the crawl finishes. How? (One possible approach is sketched below.)
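A minimal sketch of one approach (my assumption; the spider name and handler are hypothetical): connect a handler to the spider_idle signal and raise DontCloseSpider, so the engine never reaches its normal shutdown when the request queue runs dry:

from scrapy import signals, Spider
from scrapy.exceptions import DontCloseSpider

class KeepAliveSpider(Spider):
    name = 'keepalive'  # hypothetical name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(KeepAliveSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self, spider):
        # Keep the spider open; new requests can be scheduled here
        # (e.g. pulled from a queue) before the engine decides to close.
        raise DontCloseSpider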
3. Errors that frequently show up with the 快代理 (Kuaidaili) SVIP proxy (a retry-settings sketch follows the list):
TCP connection timed out: 60: Operation timed out.
Connection was refused by other side: 61: Connection refused.
An error occurred while connecting: 65: No route to host.
504 Gateway Time-out
404 Not Found
501 Not Implemented
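A hedged settings sketch for coping with these proxy errors (values are illustrative, not from the notes). Connection-level failures ("timed out", "refused", "no route to host") are already retried by Scrapy's default RetryMiddleware; the HTTP status codes can be added to RETRY_HTTP_CODES:

RETRY_ENABLED = True
RETRY_TIMES = 5                        # retry each failed request a few extra times
RETRY_HTTP_CODES = [504, 501, 404]     # also retry these statuses when returned via the proxy
DOWNLOAD_TIMEOUT = 30                  # fail (and retry) faster on "TCP connection timed out"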
4. AttributeError: 'Response' object has no attribute 'body_as_unicode'
This error mainly shows up when the site's response headers have no Content-Type field; Scrapy then cannot tell what kind of page it fetched and returns a plain Response instead of an HtmlResponse. The fix is simple:
a small rewrite of the parse method is enough.
from scrapy import Request
from scrapy.selector import Selector

def parse(self, response):
    # Build a Selector manually from the raw body, since Scrapy returned a
    # plain Response (no Content-Type header) instead of an HtmlResponse.
    hxs = Selector(text=response.body)
    detail_url_list = hxs.xpath('//li[@class="good-list"]/@href').extract()
    for url in detail_url_list:
        if 'goods' in url:
            yield Request(url, callback=self.parse_detail)
# This snippet comes from: http://www.sharejs.com/codes/python/9049
5. Speed up a web scraper
Here's a collection of things to try:
- use latest scrapy version (if not using already)
- check if non-standard middlewares are used
- try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs; a settings sketch follows this list)
- turn off logging: LOG_ENABLED = False (docs)
- yield each item inside the loop instead of collecting items into a list and returning it
- use a local caching DNS (see this thread)
- check whether the site throttles downloads and limits your download speed (see this thread)
- log CPU and memory usage during the spider run - see if there are any problems there
- try running the same spider under the scrapyd service
- see if grequests + lxml will perform better (ask if you need any help with implementing this solution)
- try running Scrapy on PyPy; see Running Scrapy on PyPy
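A minimal settings.py sketch combining the tuning points above (values are illustrative assumptions, not measured optima):

CONCURRENT_REQUESTS = 100            # raise the global concurrency cap
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # and the per-domain cap
LOG_ENABLED = False                  # disable logging for throughput
DNSCACHE_ENABLED = True              # in-process DNS cache (on by default)
DOWNLOAD_DELAY = 0                   # no artificial delay between requests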