(Optional) Create virtual environment
prefer Python 3
mkvirtualenv --python=/usr/bin/python3 python3
check the pip version with
pip --version
to make sure Python 3 is being used
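For example, assuming virtualenvwrapper provides mkvirtualenv and Scrapy still needs installing into the fresh environment:

    mkvirtualenv --python=/usr/bin/python3 python3   # mkvirtualenv also activates the new env
    pip --version                                    # the path should point inside the env, python 3.x
    pip install scrapy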
Steps
scrapy startproject name
scrapy genspider botname domain   (spider name, then the domain to crawl - genspider takes a domain, not a full url)
ROBOTSTXT_OBEY in settings.py should be True so the spider only crawls permitted pages and stays a good web citizen
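Recent project templates already generate the setting as True; the relevant line in settings.py:

    # settings.py
    ROBOTSTXT_OBEY = True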
- inside the project folder, run
scrapy crawl botname
- test in the shell (next section)
- scrapy crawl botname -o xx.json (or xx.csv) to write the results to a file; a minimal spider sketch follows below
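A minimal sketch of the spider file after filling in the genspider skeleton (botname, example.com, and the h1 extraction are all placeholders):

    import scrapy

    class BotnameSpider(scrapy.Spider):
        name = "botname"                      # the name used by `scrapy crawl botname`
        start_urls = ["http://example.com"]   # placeholder start page

        def parse(self, response):
            # each yielded dict becomes one record in xx.json / xx.csv
            for heading in response.xpath("//h1/text()").extract():
                yield {"heading": heading}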
Shell to debug and test
scrapy shell
- test that a url is valid and download it - fetch(url)
- check the downloaded html in the browser - view(response)
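A typical session, with example.com standing in for the real target:

    $ scrapy shell
    >>> fetch("http://example.com")   # downloads the page; a 200 status means the url is valid
    >>> response.status
    200
    >>> view(response)                # opens the downloaded html in the browser
    >>> response.xpath("//h1/text()").extract_first()
    u'Example Domain'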
Alternative xpath testing tool
http://www.freeformatter.com/xpath-tester.html
XPath docs
A Selector, as it is named, selects html content from a response:
from scrapy.selector import Selector
Since this is such a common operation, response.selector.xpath() is shortened to response.xpath()
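A self-contained sketch of the equivalence (the tiny html body is made up):

    from scrapy.http import HtmlResponse
    from scrapy.selector import Selector

    body = "<html><body><span>good</span></body></html>"
    response = HtmlResponse(url="http://example.com", body=body, encoding="utf-8")

    # the long form and the shortcut return the same thing
    Selector(response=response).xpath("//span/text()").extract()   # [u'good']
    response.xpath("//span/text()").extract()                      # [u'good']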
Extra
css can also be used as a selector via response.css() (Scrapy translates css to xpath under the hood); these notes stick with xpath
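The same selection both ways (the class name is hypothetical):

    response.css("span.price::text").extract_first()
    response.xpath("//span[@class='price']/text()").extract_first()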
//name
or //*
- selects every instance of the html tag name (or, with //*, every element), anywhere in the document
text()
- the node's text content, as unicode
'//name[1]' - XPath indexes from 1, so this is the Python-side ('//name')[0]; use either
.
- prefix the path with '.' when selecting inside a selector that is not the full response, to keep the search relative to that node; the leading // can also simply be omitted
@
- grabs an attribute, e.g. //a/@href
if an itemprop attribute exists, use it over class to extract (all of the above appear together in the sketch below)
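Putting the pieces together against a made-up fragment (every name here is hypothetical):

    # suppose the fetched page contains:
    # <div itemprop="review"><a href="/r/1">First</a><a href="/r/2">Second</a></div>

    response.xpath("//a")                            # every <a> anywhere; //* would match every element
    response.xpath("//a/text()").extract()           # [u'First', u'Second'] -- unicode text
    response.xpath("//a[1]/text()").extract_first()  # xpath counts from 1 ...
    response.xpath("//a/text()")[0].extract()        # ... python counts from 0
    review = response.xpath("//div[@itemprop='review']")[0]
    review.xpath(".//a/@href").extract()             # '.' keeps the search inside this selector
    response.xpath("//a/@href").extract_first()      # '@' grabs attributes: u'/r/1'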
Tools to get xpath fast -
https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl