Overview
This section scrapes the list of global air monitoring stations.
Target
World Meteorological Organization
Implementation logic
- Data source analysis
- Table design
# WMO station information list
station_name station_id index_nbr latitude longitude obs_rems
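The six columns above can be mirrored in code as a small record type. The following is a minimal sketch, not the author's actual table definition: the names `Station` and `row_to_station` are hypothetical, every field is kept as a string because the source does not state the column types, and it assumes each row arrives as a list of six values in the column order shown.

```python
from dataclasses import dataclass
from typing import Sequence

# Column order taken from the table design above.
COLUMNS = ("station_name", "station_id", "index_nbr",
           "latitude", "longitude", "obs_rems")


@dataclass
class Station:
    # All fields are strings: the real column types are an assumption left open.
    station_name: str
    station_id: str
    index_nbr: str
    latitude: str
    longitude: str
    obs_rems: str


def row_to_station(row: Sequence[object]) -> Station:
    """Convert one row (six values in column order) into a Station."""
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
    # Normalize every value to a stripped string before storing it.
    return Station(*(str(value).strip() for value in row))
```

Rejecting rows of the wrong length early makes a changed response layout fail loudly instead of writing misaligned columns to the database.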
- Problems encountered
https://www.wmo.int/cpdb/volume_a_observing_stations/list_stations?sEcho=2&iColumns=6&sColumns=station_name%2Cstation_id%2Cindex_nbr%2Clatitude%2Clongitude%2Cobs_rems&iDisplayStart=25&iDisplayLength=25&mDataProp_0=0&sSearch_0=&bRegex_0=false&bSearchable_0=true&bSortable_0=true&mDataProp_1=1&sSearch_1=&bRegex_1=false&bSearchable_1=true&bSortable_1=true&mDataProp_2=2&sSearch_2=&bRegex_2=false&bSearchable_2=true&bSortable_2=true&mDataProp_3=3&sSearch_3=&bRegex_3=false&bSearchable_3=true&bSortable_3=true&mDataProp_4=4&sSearch_4=&bRegex_4=false&bSearchable_4=true&bSortable_4=true&mDataProp_5=5&sSearch_5=&bRegex_5=false&bSearchable_5=true&bSortable_5=true&sSearch=&bRegex=false&iSortCol_0=0&sSortDir_0=asc&iSortingCols=1&_=1506328442807
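# iDisplayStart: offset of the first row to return (with iDisplayLength=25, a value of 25 requests the second page)
# iDisplayLength: number of rows per page
# _: timestamp in milliseconds
When the service is requested on its own, or the URL is opened directly in a browser, the system automatically redirects to the home page or returns the home page's data.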
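The long query string above can be generated rather than pasted. The sketch below rebuilds the same parameter set with the standard library; the function name `build_list_url` and its defaults are hypothetical, and it assumes the service keeps accepting exactly the parameters seen in the captured URL.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.wmo.int/cpdb/volume_a_observing_stations/list_stations"
COLUMNS = ["station_name", "station_id", "index_nbr",
           "latitude", "longitude", "obs_rems"]


def build_list_url(start: int, length: int = 25, echo: int = 1) -> str:
    """Build the list URL for the rows start .. start + length - 1."""
    params = [
        ("sEcho", echo),
        ("iColumns", len(COLUMNS)),
        ("sColumns", ",".join(COLUMNS)),
        ("iDisplayStart", start),      # row offset, not a page number
        ("iDisplayLength", length),    # rows per page
    ]
    # Per-column settings, copied from the captured request.
    for i in range(len(COLUMNS)):
        params += [
            (f"mDataProp_{i}", i),
            (f"sSearch_{i}", ""),
            (f"bRegex_{i}", "false"),
            (f"bSearchable_{i}", "true"),
            (f"bSortable_{i}", "true"),
        ]
    params += [
        ("sSearch", ""),
        ("bRegex", "false"),
        ("iSortCol_0", 0),
        ("sSortDir_0", "asc"),
        ("iSortingCols", 1),
        ("_", int(time.time() * 1000)),  # current time in milliseconds
    ]
    return BASE_URL + "?" + urlencode(params)
```

Calling `build_list_url(0)`, `build_list_url(25)`, `build_list_url(50)` and so on would walk through the list 25 rows at a time.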
Implementation code
Imports
import requests  # HTTP requests for scraping
import time, os
import datetime
from MSSql_SqlHelp import MSSQL  # SQL Server helper class
import json
Investigating the cause of the automatic redirect
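def download_page(url):
    # cookies and main() are expected to be defined elsewhere in the script.
    try:
        return requests.get(url, cookies=cookies, headers={
            # Marks the request as AJAX; without it the service redirects to the home page.
            'X-Requested-With': 'XMLHttpRequest',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'
        }, timeout=120).json()
    except Exception as e:
        print("download_page fetch failed: " + url + " (" + str(e) + ")")
        time.sleep(30)  # wait 30 seconds before fetching again
        main()  # restart the whole scrape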
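The function above restarts `main()` after every failure, which repeats the whole scrape and nests one call inside another each time. One possible alternative, sketched here under assumptions rather than taken from the original script, is a bounded retry loop around a single URL. The name `fetch_with_retry` is hypothetical, and `fetch` stands for any callable that takes a URL and returns parsed data, such as `download_page` with the `main()` call removed.

```python
import time
from typing import Any, Callable, Optional


def fetch_with_retry(fetch: Callable[[str], Any],
                     url: str,
                     retries: int = 3,
                     delay: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[Any]:
    """Call fetch(url) up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print(f"attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                sleep(delay)  # wait before the next attempt
    return None
```

Passing `sleep` as a parameter keeps the default 30-second wait from the original code while letting the delay be replaced in tests.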
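Summary
The request must specify 'X-Requested-With':'XMLHttpRequest' to "tell" the service that it is an AJAX request; otherwise the service automatically redirects to the home page.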
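A scraper can guard against the redirect by checking that a response body really is JSON before using it. The following is a minimal sketch: `AJAX_HEADERS` and `parse_ajax_body` are hypothetical names, and it assumes that the home page comes back as HTML, so that a body which fails to decode as JSON signals the redirect.

```python
import json
from typing import Any, Optional

# Headers that mark the request as AJAX, as described in the summary.
AJAX_HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json, text/javascript, */*; q=0.01",
}


def parse_ajax_body(body: str) -> Optional[Any]:
    """Return the decoded JSON, or None when the body is not JSON (for example an HTML page)."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
```

With `requests`, the body would come from `requests.get(url, headers=AJAX_HEADERS).text`; a `None` result then suggests that the header was missing or ignored.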