Learning Python Web Crawlers (Part 1)


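Before this summer break I learned a bit of simple Python web crawling, but I have forgotten most of it. Over the past few days I decided to review it, and wrote up this introduction to Python crawlers along the way.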

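What is a crawler

Before we start learning Python crawlers, let's first understand what a crawler is.
A web crawler, also called a web spider, is an internet bot that automatically browses the World Wide Web.
Note: from the Wikipedia article on web crawlers.
In short, it is a robot that visits the World Wide Web automatically. It can also save the pages it visits, so that a search engine can later build an index for users to search, or so that users can run statistical analysis on the data.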

Libraries for Python crawlers

The urllib module

urllib is the web module that ships with Python, and it covers most of the functionality we normally need.
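To give a feel for urllib, here is a minimal sketch that builds a GET request with query parameters without sending it. The URL `https://example.com/search`, the parameter `q`, the helper name `build_search_request`, and the User-Agent value are all placeholders I made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url, params):
    """Build (but do not send) a GET request with query parameters."""
    url = base_url + "?" + urlencode(params)
    # The User-Agent value is a placeholder, not something a real site requires.
    return Request(url, headers={"User-Agent": "demo-crawler/0.1"})


req = build_search_request("https://example.com/search", {"q": "python"})
print(req.get_method())  # GET
print(req.full_url)      # https://example.com/search?q=python
```

To actually send the request you would pass `req` to `urllib.request.urlopen`, which returns a response object you can call `read()` on.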

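The requests module

requests is a module with more features than urllib. It supports HTTP features such as cookie-based sessions and file uploads. With requests you can conveniently use a session to stay logged in to an account, and it also decompresses gzip-compressed pages automatically. It is very powerful.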

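The selenium module

selenium was originally built for web testing. It can drive different browsers, so using selenium together with the matching browser driver is equivalent to opening the link in a real browser. You can then manipulate the DOM and load asynchronously fetched data.

I will mainly cover the requests module. Its documentation is at http://docs.python-requests.org/en/master/ . Because it does not ship with Python, you need to install it yourself with this command: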

pip install requests

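Note: a third-party library is a library written by other people, not part of Python's standard library.

Tools for working with Python

There are two kinds of tools for working with Python. One kind is text editors, such as IDLE, Notepad++, and Sublime Text. The other kind is integrated development environments, such as PyCharm, Wing, and Visual Studio.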

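The difference between request and response

When we open a browser and enter a URL, the browser sends a Request to the web server. The web server receives the request, processes it, generates the corresponding Response, and returns it to the browser.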
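The round trip described above can be tried out locally with only the standard library. This is a rough sketch, not how a real site works: `HelloHandler` and `fetch_once` are names I invented, and the server is a throwaway one bound to 127.0.0.1 on a port chosen by the operating system.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: receive a request and build a response."""

    def do_GET(self):
        body = "hello from the server".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_once():
    """Client side: send one GET request and read the response."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        resp = conn.getresponse()
        result = (resp.status, resp.getheader("Content-Type"),
                  resp.read().decode("utf-8"))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = fetch_once()
print(status, content_type, body)
```

The client only ever sees three things: a status code, headers, and a body. Those are the same pieces we will read from a requests response later.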

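The Requests third-party library

Requests is a widely used third-party HTTP library for Python, and sending a network request with it is very simple. For example:

import requests
# Send an HTTP GET request to the Baidu home page
res = requests.get('https://www.baidu.com')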

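That is all it takes to send a network request. Besides HTTP GET, you can also send PUT, DELETE, HEAD, OPTIONS, and POST requests. If you don't know what these methods mean, go read up on networking basics; I won't cover them here.

With a GET request we can obtain the content of the server's response as well as its header information.

The content we get is: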
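For the other methods, here is a small sketch that only prepares requests so they can be inspected, without sending anything. The endpoint `https://example.com/api/items`, the form field `name`, and the helper `prepare` are placeholders I made up.

```python
import requests

# Placeholder endpoint; nothing is sent because the requests are only prepared.
URL = "https://example.com/api/items"


def prepare(method, **kwargs):
    """Build a request for inspection without sending it over the network."""
    return requests.Request(method, URL, **kwargs).prepare()


post = prepare("POST", data={"name": "abc"})
print(post.method, post.url)          # POST https://example.com/api/items
print(post.body)                      # name=abc
print(post.headers["Content-Type"])   # application/x-www-form-urlencoded

get = prepare("GET", params={"page": 2})
print(get.url)                        # https://example.com/api/items?page=2

# The one-line helpers below do send the request:
#   requests.post(URL, data={"name": "abc"})
#   requests.put(URL, data={"name": "xyz"})
#   requests.delete(URL)
#   requests.head(URL)
#   requests.options(URL)
```

Each helper returns the same kind of response object as `requests.get`.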

<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head>
<meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer>
<link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css>
<title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title>
</head> 
<body link=#0000cc> 
............

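Why is the content we got garbled? Because Requests automatically decodes the content that comes back from the server. Once the request is sent and the server responds, Requests decodes the text according to the charset given in the response headers. If there is no charset, the default encoding is ISO-8859-1, which cannot represent Chinese characters.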

{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 22 Aug 2017 02:25:36 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:24:45 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

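Baidu's headers do not include a charset, so we need to set the encoding ourselves. We can use apparent_encoding, which is the encoding detected by analyzing the response content itself, so in principle it is more accurate.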

res.encoding = res.apparent_encoding 
print(res.text)

The content we get is:

    <!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge>
<meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css>
<title>百度一下，你就知道</title></head> <body link=#0000cc> 
<div id=wrapper> 
<div id=head> <div class=head_wrapper> 
<div class=s_form> 
<div class=s_form_wrapper> <div id=lg> 
<img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> 
<form id=form name=f action=//www.baidu.com/s class=fm> 
<input type=hidden name=bdorz_come value=1> 
</div>
......
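The garbled title from earlier can be reproduced without any network access. This is a small sketch using only built-in string methods, with the page title from the output above as test data; it does not involve requests at all.

```python
title = "百度一下，你就知道"
raw = title.encode("utf-8")        # the bytes as the server sends them

wrong = raw.decode("iso-8859-1")   # what the ISO-8859-1 fallback produces
right = raw.decode("utf-8")        # decoding with the real encoding

print(repr(wrong))  # starts with 'ç\x99¾', just like the garbled <title>
print(right)        # 百度一下，你就知道

# No bytes were lost, so the damage can be undone by reversing the wrong step.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == title)  # True
```

ISO-8859-1 maps every byte to exactly one character, so each three-byte Chinese character turns into three unrelated Latin characters. That is why the text looks broken but nothing is actually lost.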

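That covers the basics of requests. For other operations, see the official requests documentation.

A small example

Let's try to download an image from the National Geographic Chinese website.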


We can fetch the image through its link and get its content.

res = requests.get('http://image.nationalgeographic.com.cn/2015/0121/20150121033625957.jpg')

Since we want to save the image, I will save it as abc.jpg under the photo directory.

path = "../photo/abc.jpg"

Next, we open this file and write to it.