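I. Installing Scrapy
Not much to say here: just install Ubuntu.
Setting up the environment on Windows is a fool's errand.
XPath example
1. Create a new project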
scrapy startproject tutorial
2. Run the project
scrapy crawl dmoz
3. Open the interactive shell for testing
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
4. Suppose you want to match the src attribute in the following snippet
<div class="float-r">
<img src="/img/moz/obooksm.gif" width="84" height="55" alt="[Book Mozilla]">
</div>
Use the following line:
response.xpath("//div[@class = 'float-r']/img/@src").extract()
The output is:
[u'/img/moz/obooksm.gif']
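The same query can be tried offline without Scrapy. The block below is a minimal sketch using only the standard library, under two assumptions: the snippet is wrapped in html/body tags and the img tag is self-closed so that it parses as XML, and, because ElementTree supports only a subset of XPath (it has no `/@src` step), the attribute is read with `.get("src")` after selecting the img element. The name `extract_src` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The snippet from step 4, wrapped and with <img> self-closed so it is valid XML.
SNIPPET = """
<html><body>
<div class="float-r">
<img src="/img/moz/obooksm.gif" width="84" height="55" alt="[Book Mozilla]"/>
</div>
</body></html>
"""


def extract_src(markup):
    """Return the src of every <img> directly inside <div class="float-r">."""
    root = ET.fromstring(markup.strip())
    # Equivalent in spirit to //div[@class='float-r']/img/@src:
    # select the <img> elements, then read the attribute from each one.
    images = root.findall(".//div[@class='float-r']/img")
    return [img.get("src") for img in images]


srcs = extract_src(SNIPPET)
print(srcs)
```

In a real crawl the page is rarely well-formed XML, which is why the Scrapy shell with response.xpath is the more practical way to test expressions.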
Main code (the spider)
# Older Scrapy releases exposed these under scrapy.contrib (SgmlLinkExtractor)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Assumes the TorrentItem class from the item code lives in tutorial/items.py
from tutorial.items import TorrentItem


class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Follow every link whose URL matches /tor/<digits> and hand it to parse_torrent
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        # response.xpath() replaces the deprecated HtmlXPathSelector(response).select()
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent
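To see which links the rule would follow, the allow pattern can be checked on its own. This is a rough sketch that assumes the link extractor treats each allow entry as a regular expression searched against the absolute URL; the sample URLs are hypothetical and are not taken from the site.

```python
import re

# The allow pattern from the spider's rule.
ALLOW = re.compile(r"/tor/\d+")

# Hypothetical URLs, invented for illustration only.
candidate_urls = [
    "http://www.mininova.org/tor/2676093",
    "http://www.mininova.org/today",
    "http://www.mininova.org/tor/abc",
]

# Keep only the URLs in which the pattern occurs somewhere.
followed = [url for url in candidate_urls if ALLOW.search(url)]
print(followed)
```

Only the first URL contains /tor/ followed by at least one digit, so only that page would reach parse_torrent.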
Item code
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field


class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()
Run the spider and save the output as JSON
scrapy crawl mininova.org -o scraped_data.json -t json
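Once the crawl has finished, the file can be read back with the standard json module. The sketch below assumes the JSON export is a single array with one object per item, and that each scraped field is a list because .extract() returns a list. To stay self-contained it first writes a small hypothetical sample to a temporary directory instead of using real crawl output; `load_items` is an illustrative helper name.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the crawl output; these values are not real scraped data.
sample = [
    {
        "url": "http://www.mininova.org/tor/1",
        "name": ["Example torrent"],
        "description": ["<div id='description'>demo</div>"],
        "size": ["1.0 MB"],
    },
]


def load_items(path):
    """Load the exported items from a JSON file into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Write the sample where a real run would have produced scraped_data.json.
path = os.path.join(tempfile.mkdtemp(), "scraped_data.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(sample, handle)

items = load_items(path)
for item in items:
    # Fields are lists, so take the first entry when one exists.
    name = item["name"][0] if item["name"] else ""
    print(item["url"], name)
```

With real output, point load_items at the scraped_data.json produced by the command above.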