Scrapy Official Documentation
1. Install Scrapy
Terminal command:
pip3 install scrapy
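To confirm the installation worked, you can print the installed version:
scrapy version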
2. Create a project
Terminal commands:
scrapy startproject <projectname>
cd <projectname>  # enter the project directory
scrapy genspider <spidername> <url domain>
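As a concrete example, this post uses projectname = mySpider, spidername = epilepsy_spider, and url domain = baidu.com, so the commands become:
scrapy startproject mySpider
cd mySpider
scrapy genspider epilepsy_spider baidu.com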
Here <url domain> is the domain of the site you want to crawl. Running these commands creates a folder named after <projectname> (mySpider in this example) in the current directory, with the following structure:
- scrapy.cfg    Project configuration; it mainly gives the Scrapy command-line tool a base config, while the settings that actually affect crawling live in settings.py
- items.py      Defines the objects (Items) that hold the scraped data
- pipelines.py  Processing of scraped items, e.g. persisting structured data
- settings.py   Project settings, e.g. crawl depth, concurrency, download delay
- spiders/      Spider directory: create spider files here and write the crawling rules
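For reference, the generated tree for the mySpider example typically looks like the sketch below; the exact files vary a little with the Scrapy version (recent versions also create a middlewares.py), and epilepsy_spider.py only appears after running genspider:
mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            epilepsy_spider.py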
3. Write items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
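Items declared this way behave like dictionaries: fields defined with scrapy.Field() are read and written with [] syntax. A minimal sanity check, assuming the mySpider layout above (run from a Python shell in the project root):
from mySpider.items import MovieItem

item = MovieItem()
item['name'] = 'some scraped text'  # assign a declared field
print(item['name'])                 # dict-style access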
4. Write epilepsy_spider.py, which contains the start URL of the pages to crawl and the XPath expression
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import MovieItem
class MeijuSpider(scrapy.Spider):
    name = 'epilepsy_spider'         # the name used by `scrapy crawl`
    allowed_domains = ['baidu.com']
    start_urls = ['https://baike.baidu.com/medicine/disease/%E7%99%AB%E7%97%AB/1613?from=lemma']

    def parse(self, response):
        # extract() returns the matched nodes as a list of HTML strings
        movies = response.xpath('//*[@id="medical_content"]/ul').extract()
        item = MovieItem()
        item['name'] = movies
        print(item['name'])
        yield item
When creating the spider project inside a PyCharm project, remember to mark the spider project's parent folder as a Resource Root: right-click the folder -> Mark Directory as -> Resource Root.
How to use XPath expressions is covered in the next post.
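Before hard-coding an XPath expression in parse(), it is handy to test it interactively with scrapy shell. The selector below is the one used in the spider above; whether the page still matches that structure is an assumption:
scrapy shell "https://baike.baidu.com/medicine/disease/%E7%99%AB%E7%97%AB/1613?from=lemma"
>>> response.xpath('//*[@id="medical_content"]/ul').extract()         # raw HTML of the matched <ul> nodes
>>> response.xpath('//*[@id="medical_content"]/ul//text()').getall()  # plain text only (getall() is the newer equivalent of extract())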
5. Write pipelines.py to define where the scraped data is saved
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class MoviePipeline(object):
    def process_item(self, item, spider):
        # append each extracted block to a local text file
        with open("./epilepsy.txt", "a") as fp:
            for i in item['name']:
                fp.write(i + '\n')
        return item  # return the item so any later pipelines can still process it
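As the generated comment says, the pipeline only runs once it is registered in settings.py. A minimal sketch, assuming the project name mySpider from above (the number sets the execution order; lower runs first):
# in settings.py
ITEM_PIPELINES = {
    'mySpider.pipelines.MoviePipeline': 300,
}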
6. Run the spider
Terminal command:
scrapy crawl <spidername>
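With the spider name defined above, this becomes:
scrapy crawl epilepsy_spider
Scrapy can also dump the yielded items straight to a file with the -o option, no custom pipeline needed, for example: scrapy crawl epilepsy_spider -o epilepsy.json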