Web Crawlers: Getting Started

Scrapy

What is Scrapy?

  • Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data. Although it was originally designed for page scraping (more precisely, web scraping), it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

What is a web crawler?

  • Also known as a web spider or web robot (and, in the FOAF community, more often called a web chaser), a web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules; it is widely used across the internet. Search engines use crawlers to fetch web pages, documents, and even images, audio, and video, organize these resources with indexing techniques, and serve them to users as search results. Crawlers also offer small and mid-sized sites an effective channel for promotion, and optimizing sites for search-engine crawlers was once all the rage.

Installing Scrapy

  • First, type python in a terminal to check whether Ubuntu already ships with Python; if it does not, install it.
  • Typing python should drop you into the interactive interpreter.
  • At the prompt, enter import lxml
  • and then import OpenSSL
  • If neither import raises an error, Ubuntu's bundled Python and these dependencies are in place.
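The same check can also be run as a tiny script; a minimal sketch, assuming the stock Ubuntu Python 2:

```python
# Verify that the libraries Scrapy relies on are importable.
# An ImportError here means the corresponding package is missing.
import lxml      # XML/HTML parsing library used by Scrapy's selectors
import OpenSSL   # pyOpenSSL, used for HTTPS support

print 'lxml and pyOpenSSL are available'
```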

Enter the following commands in the terminal, one after another

  • sudo apt-get install python-dev
  • sudo apt-get install libevent-dev
  • sudo apt-get install python-pip
  • sudo pip install Scrapy

Then run scrapy with no arguments; if the installation succeeded, Scrapy prints its version, a usage line, and the list of available commands.

Once Scrapy is installed, we can do a simple scrape of web data (scraping Jianshu's home page).

  • Create a new project: scrapy startproject XXX (for example: scrapy startproject jianshu)
  • Listing the files as a tree shows the structure of the generated project (see the sketch below).
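For reference, a freshly generated project follows Scrapy's standard layout (shown here for the jianshu example):

```
jianshu/
├── scrapy.cfg          # project configuration file
└── jianshu/            # the project's Python module
    ├── __init__.py
    ├── items.py        # item definitions
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/        # spiders live here
        └── __init__.py
```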
  1. The spiders folder holds the crawling logic (the code that actually fetches and parses data); it is the core of the crawler.
  2. Under the spiders folder, create your own spider to scrape the popular articles on Jianshu's home page.
  3. scrapy.cfg is the project's configuration file.
  4. settings.py configures request parameters, proxy usage, and how files are saved after scraping.
  5. items.py: the project's item file; it declares the fields to be scraped, similar to the keys of a dict.
  6. pipelines.py: the project's pipelines file, holding the methods that post-process scraped data (a minimal sketch follows this list).
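This tutorial leaves pipelines.py untouched, but to illustrate what an item pipeline does, here is a minimal sketch. The class name DropMissingTitlePipeline is hypothetical, and it would need to be registered under ITEM_PIPELINES in settings.py to take effect:

```python
# -*- coding: utf-8 -*-
from scrapy.exceptions import DropItem


class DropMissingTitlePipeline(object):
    """Hypothetical pipeline: discard items scraped without a title."""

    def process_item(self, item, spider):
        # Every enabled pipeline receives each yielded item here;
        # returning the item passes it on, raising DropItem discards it.
        if not item.get('title'):
            raise DropItem('missing title: %s' % item.get('url'))
        return item
```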

A simple scrape of Jianshu's home page data

  • Create a file jianshuSpider.py under the spiders folder and enter the following:

```python
#coding=utf-8
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from jianshu.items import JianshuItem
import urllib


class Jianshu(CrawlSpider):
    # No crawl rules are defined and parse() is overridden directly,
    # so this CrawlSpider behaves like a plain Spider here.
    name = 'jianshu'
    start_urls = ['http://www.jianshu.com/top/monthly']
    url = 'http://www.jianshu.com'

    def parse(self, response):
        selector = Selector(response)
        articles = selector.xpath('//ul[@class="article-list thumbnails"]/li')

        for article in articles:
            # Create a fresh item per article; reusing one instance across
            # yields would hand the same mutable object to the pipeline.
            item = JianshuItem()
            title = article.xpath('div/h4/a/text()').extract()
            url = article.xpath('div/h4/a/@href').extract()
            author = article.xpath('div/p/a/text()').extract()

            # Download each popular article's thumbnail;
            # note that some articles have no image.
            try:
                image = article.xpath('a/img/@src').extract()
                urllib.urlretrieve(image[0], '/Users/apple/Documents/images/%s-%s.jpg' % (author[0], title[0]))
            except (IndexError, IOError):
                print '--no---image--'

            listtop = article.xpath('div/div/a/text()').extract()
            likeNum = article.xpath('div/div/span/text()').extract()

            # Full text of the list footer (reads/comments/likes);
            # extracted here but not used further below.
            readAndComment = article.xpath('div/div[@class="list-footer"]')
            data = readAndComment[0].xpath('string(.)').extract()[0]

            item['title'] = title
            # @href is site-relative (it starts with "/"), so join it to the
            # bare domain; 'http://www.jianshu.com/' + url[0] would produce
            # a double slash.
            item['url'] = self.url + url[0]
            item['author'] = author
            item['readNum'] = listtop[0]
            # Some articles have comments disabled.
            try:
                item['commentNum'] = listtop[1]
            except IndexError:
                item['commentNum'] = ''
            item['likeNum'] = likeNum
            yield item

        # The "load more" button carries the next page's URL in @data-url.
        next_link = selector.xpath('//*[@id="list-container"]/div/button/@data-url').extract()
        if len(next_link) == 1:
            next_link = self.url + str(next_link[0])
            print '----' + next_link
            yield Request(next_link, callback=self.parse)
```
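With the item definition and settings shown next in place, the spider is launched from the project root with scrapy crawl jianshu; Scrapy finds it by its name attribute.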

  • In items.py:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class JianshuItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    author = Field()
    url = Field()
    readNum = Field()
    commentNum = Field()
    likeNum = Field()
```
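Note that JianshuItem behaves like a dictionary: the spider fills it with item['title'] = title and so on, which is why the item file only needs to declare the field names.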
  • In settings.py:

```python
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jianshu.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'jianshu.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'jianshu.pipelines.SomePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

FEED_URI = u'/Users/apple/Documents/jianshu-monthly.csv'
FEED_FORMAT = 'CSV'
```

  • The run still has a problem: the saved CSV file is 0 bytes.
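A few hedged guesses, since the failure isn't diagnosed here: with ROBOTSTXT_OBEY = True, a robots.txt rule disallowing the start URL would make Scrapy fetch nothing; if Jianshu has changed its markup, the //ul[@class="article-list thumbnails"]/li XPath matches zero nodes and no items are yielded; and if the installed Scrapy version does not normalize the feed format's case, FEED_FORMAT = 'csv' (lowercase) may be required. The item_scraped_count entry in the crawl log's final stats shows whether any items were produced at all.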