Scrapy Basics Notes 1: Creating and Running a Project

1. Create a Scrapy project

scrapy startproject quotetutorial

2. Change into the newly created quotetutorial project folder and generate a spider for the project

scrapy genspider quotes quotes.toscrape.com

A quotes.py file has now been generated in the quotetutorial/quotetutorial/spiders folder.

Its contents are as follows:

   import scrapy

   class QuotesSpider(scrapy.Spider):
       name = 'quotes'  # name of the spider, used by "scrapy crawl"
       allowed_domains = ['quotes.toscrape.com']
       start_urls = ['http://quotes.toscrape.com/']  # the URL passed to genspider
       def parse(self, response):
           pass

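The file structure so far:

quotetutorial/
    scrapy.cfg
    quotetutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            quotes.py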

scrapy.cfg specifies the settings module and the deployment configuration:

[settings]
default = quotetutorial.settings
[deploy]
#url = http://localhost:6800/
project = quotetutorial
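
Because scrapy.cfg uses INI syntax, it can be inspected with Python's standard configparser module. The following is a minimal sketch rather than anything Scrapy requires you to write: the file content is copied from the listing above into a string (CFG_TEXT is an illustrative name), and the lookups show which settings module and deploy project the file points to.

```python
import configparser

# Content of scrapy.cfg as shown above, inlined so the sketch is self-contained.
CFG_TEXT = """\
[settings]
default = quotetutorial.settings
[deploy]
#url = http://localhost:6800/
project = quotetutorial
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# The settings module that the scrapy command will load for this project.
settings_module = parser.get("settings", "default")
deploy_project = parser.get("deploy", "project")

# The url line starts with '#', so configparser treats it as a comment.
has_deploy_url = parser.has_option("deploy", "url")

print(settings_module)
print(deploy_project)
print(has_deploy_url)
```

The commented-out url line only matters once the project is deployed to a server, so leaving it disabled is fine for local runs.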

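1. items.py - defines the data structures (Items) that hold the scraped data
2. middlewares.py - the project's middlewares
3. pipelines.py - item pipelines
4. settings.py - project configuration

All spiders are written in the spiders folder.

Let's add a print call to the parse method: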

import scrapy

class QuotesSpider(scrapy.Spider):
    name ='quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    def parse(self, response):
        print(response.text)

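The parse method is called after a page has been downloaded. Here it has been changed to print(response.text); run the spider as described next.

3. Run the spider

The quotetutorial folder contains another quotetutorial folder. Run the command from the outer one: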

scrapy crawl quotes
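
The working directory matters because the scrapy command locates the project through scrapy.cfg, starting in the current directory and then checking its parent directories. The sketch below models that lookup with pathlib. find_scrapy_cfg is a hypothetical helper written for illustration, not Scrapy's own function, and the temporary directory stands in for the real project folders.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_scrapy_cfg(start: Path) -> Optional[Path]:
    """Return the nearest scrapy.cfg at or above start, or None if there is none."""
    start = start.resolve()
    for folder in [start, *start.parents]:
        candidate = folder / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


# Build a throwaway copy of the layout: the outer folder holds scrapy.cfg,
# the inner package folder does not.
with tempfile.TemporaryDirectory() as tmp:
    outer = Path(tmp) / "quotetutorial"
    inner = outer / "quotetutorial"
    inner.mkdir(parents=True)
    (outer / "scrapy.cfg").write_text("[settings]\ndefault = quotetutorial.settings\n")

    found_from_outer = find_scrapy_cfg(outer)
    found_from_inner = find_scrapy_cfg(inner)
    same_file = found_from_outer == found_from_inner

print(same_file)
```

Under this model the lookup also succeeds from the inner folder, but running the command next to scrapy.cfg, as done here, leaves no doubt about which project is used.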

The log output now looks like the following. It shows what the Scrapy framework did: version information, system information, the spider's settings, the enabled middlewares, and statistics about the crawled pages. The output of print(response.text) also appears in it.

D:\study\bandwagon\repository\spider\scrapy\quotetutorial>scrapy crawl quotes
2019-02-27 21:58:22 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: quotetutorial)
2019-02-27 21:58:22 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2019-02-27 21:58:22 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotetutorial.spiders']}
2019-02-27 21:58:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-02-27 21:58:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-27 21:58:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-02-27 21:58:23 [scrapy.core.engine] INFO: Spider opened
2019-02-27 21:58:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-27 21:58:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-27 21:58:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-02-27 21:58:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
<!DOCTYPE html>
<html lang="en">
**!!! The response.text output is printed at this point; it is omitted here for brevity**
</html>
2019-02-27 21:58:31 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-27 21:58:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 2, 27, 13, 58, 31, 73758),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 2, 27, 13, 58, 23, 304498)}
2019-02-27 21:58:31 [scrapy.core.engine] INFO: Spider closed (finished)
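
The stats dictionary at the end of the log can be read with ordinary Python. As a small sketch, the snippet below copies the start_time and finish_time values from the log and works out how long the crawl took. It also compares start_time with the timestamp of the first log line; the 8-hour gap suggests the stats are recorded in UTC while the log lines use local time, which is an inference from these numbers rather than something the log states.

```python
from datetime import datetime

# Values copied from the 'start_time' and 'finish_time' stats above.
start_time = datetime(2019, 2, 27, 13, 58, 23, 304498)
finish_time = datetime(2019, 2, 27, 13, 58, 31, 73758)

elapsed = finish_time - start_time
print(f"crawl took {elapsed.total_seconds():.2f} s")

# Timestamp of the first log line (local time), for comparison with start_time.
first_log_line = datetime(2019, 2, 27, 21, 58, 22)
offset_hours = round((first_log_line - start_time).total_seconds() / 3600)
print(f"log clock is about {offset_hours} h ahead of the stats clock")
```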

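4. Export the scraped results to files in different formats or to an FTP server

Pass the output target with the -o option. The file extension selects the format, so any one of the following works:

scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.jl
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:passwd@ftp.xxx.com/path/quotes.json

Only items yielded by parse are exported. The spider above just prints the page, so the output contains no items until parse yields data.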
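
The .json and .jl targets differ in layout: the first holds all items in a single JSON array, the second (JSON Lines) holds one JSON object per line. The sketch below reproduces the two layouts with the standard json module only. The items are hypothetical placeholders shaped like what a parse method might yield, and the exact whitespace Scrapy writes may differ.

```python
import json

# Hypothetical items, shaped like what a parse method might yield.
items = [
    {"text": "first example quote", "tags": ["change", "world"]},
    {"text": "second example quote", "tags": []},
]

# quotes.json style: one JSON array holding every item.
json_output = json.dumps(items)

# quotes.jl style (JSON Lines): one JSON object per line.
jl_output = "\n".join(json.dumps(item) for item in items)

print(json_output)
print(jl_output)

# Reading back: the array needs a single parse, the lines are parsed one by one.
from_json = json.loads(json_output)
from_jl = [json.loads(line) for line in jl_output.splitlines()]
```

JSON Lines is convenient for large crawls because each line can be processed on its own, without loading the whole file.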

5. The scrapy shell interactive mode

scrapy shell quotes.toscrape.com

In [1]: quotes = response.css('.quote')
In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>
In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“The '>]
In [6]: quotes[0].css('.text::text')
Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a pr'>]

In Scrapy, a CSS selector can use ::text to select the text content of an element.

In [7]: quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
In [8]: quotes[0].css('.text::text').extract_first()
Out[8]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
In [9]: quotes[0].css('.tags .tag::text').extract_first()
Out[9]: 'change'
In [10]: quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']

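The four inputs and outputs above show that extract_first() returns the first match, while extract() returns all matches as a list. So extract_first() suits queries where a single result is expected, and extract() suits queries with many results.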
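
The two calls also differ when nothing matches: extract() returns an empty list, and extract_first() returns None instead of raising an error. The relationship can be modelled in plain Python. first_or_none below is a hypothetical stand-in written for illustration, not Scrapy's implementation, and the tag list is copied from Out[10] above.

```python
from typing import List, Optional


def first_or_none(matches: List[str]) -> Optional[str]:
    """Model of extract_first(): the first match, or None when nothing matched."""
    return matches[0] if matches else None


# What extract() returned for '.tags .tag::text' in the session above.
tags = ['change', 'deep-thoughts', 'thinking', 'world']
# What extract() returns when a selector matches nothing.
no_matches: List[str] = []

first_tag = first_or_none(tags)
missing = first_or_none(no_matches)

print(first_tag)
print(missing)
```

Indexing the list directly, as in extract()[0], would raise IndexError in the empty case, which is why extract_first() is the safer choice for optional fields.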
