Using urllib

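urllib is Python's built-in HTTP request library. It consists of four modules:

request: the basic HTTP request module, used to simulate sending requests
error: the exception-handling module
parse: a utility module that provides URL-handling functions such as splitting, parsing, and joining
robotparser: parses a site's robots.txt file to determine which pages may be crawled; it is used less often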
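
As a small illustration of the least-used module, the sketch below feeds robotparser a made-up robots.txt and asks whether two pages may be crawled. The rules and the example.com URLs are assumptions for illustration only; no network access is needed because the rules are parsed from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a crawler with any user agent may fetch each URL
allowed = parser.can_fetch('*', 'http://example.com/index.html')
blocked = parser.can_fetch('*', 'http://example.com/private/data.html')
print(allowed, blocked)
```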

1.urlopen

The urllib.request module simulates how a browser initiates a request. It can also handle authentication, redirections, browser cookies, and more.

import urllib.request
# Send an HTTP request
response = urllib.request.urlopen('https://www.baidu.com')
# The returned object is an http.client.HTTPResponse
print(response.read().decode('utf-8'))

The response object provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed.

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
# Status code
print(response.status)
# All response headers
print(response.getheaders())
# Value of the Server response header
print(response.getheader('Server'))

The data parameter

An optional parameter whose content must be a byte stream (bytes). If it is passed, the request is sent as a POST request.

import urllib.parse
import urllib.request
# The form data must be converted to bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
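
The effect of data on the request method can be checked without sending anything. This is a minimal sketch that builds Request objects (covered in the Request section of this article) and inspects them; the variable names are illustrative.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')

without_data = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=data)

print(data)                       # the encoded request body
print(without_data.get_method())  # method used when no data is given
print(with_data.get_method())     # method used when data is given
```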

The timeout parameter

Sets a timeout in seconds. If no response arrives within that time, an exception is raised.

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())
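
In practice the timeout is usually caught rather than left to crash the program. The sketch below is one way to do that, under the assumption that a local socket which accepts connections but never answers is a fair stand-in for a slow server. Depending on where the timeout occurs, it may arrive wrapped in URLError or be raised directly, so both cases are handled.

```python
import socket
import urllib.error
import urllib.request

# A listening socket that never answers stands in for a slow server
silent_server = socket.socket()
silent_server.bind(('127.0.0.1', 0))   # port 0 lets the OS pick a free port
silent_server.listen(1)
port = silent_server.getsockname()[1]

timed_out = False
try:
    urllib.request.urlopen(f'http://127.0.0.1:{port}/', timeout=0.5)
except urllib.error.URLError as e:
    # A timeout while connecting or sending is wrapped in URLError
    timed_out = isinstance(e.reason, socket.timeout)
except socket.timeout:
    # A timeout while waiting for the response is raised directly
    timed_out = True
finally:
    silent_server.close()

print('timed out:', timed_out)
```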

Request

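Calling urlopen() with a bare URL is not enough to build a complete request. To add information such as headers, build a Request object and pass it to urlopen():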

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

The constructor is:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

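url: the only required parameter.
data: must be bytes. If you have a dictionary, encode it first with urlencode() from the urllib.parse module.
headers: a dictionary of request headers. Headers can also be added by calling add_header() on the Request instance.

To masquerade as Firefox, for example, set the User-Agent header to:

Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

origin_req_host: the host name or IP address of the requester.
unverifiable: whether the request is unverifiable; the default is False. A request is unverifiable when the user had no option to approve it, for example an image embedded in an HTML page that is fetched automatically.
method: the HTTP method, such as GET, POST, or PUT.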

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
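# Named form_data to avoid shadowing the built-in dict
form_data = {
    'name': 'Germey'
}

data = bytes(parse.urlencode(form_data), encoding='utf8')
# Build the request from the URL, body, headers, and method
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)

# Alternatively, add headers with add_header() instead of the headers argument:
# req = request.Request(url=url, data=data, method='POST')
# req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')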

print(response.read().decode('utf-8'))
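
A Request object can be inspected before it is sent, which helps when checking that headers and the method were set as intended. The following sketch uses add_header() and only looks at the object; nothing goes over the network. Note that Request stores header names in capitalized form, so User-Agent is looked up as User-agent.

```python
from urllib import request

req = request.Request('http://httpbin.org/post', data=b'name=Germey', method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

has_agent = req.has_header('User-agent')  # header names are stored capitalized
agent = req.get_header('User-agent')
print(req.get_method(), has_agent, agent)
```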

Handler

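Some pages prompt for a user name and password as soon as you request them and can be viewed only after authentication succeeds. A handler takes care of this kind of request.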

HTTPBasicAuthHandler

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Create an HTTPPasswordMgrWithDefaultRealm instance
p = HTTPPasswordMgrWithDefaultRealm()
# Register the user name and password for the URL
p.add_password(None, url, username, password)
# Instantiate HTTPBasicAuthHandler with the password manager
auth_handler = HTTPBasicAuthHandler(p)
# Build an opener from the handler
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
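
Passing None as the realm in add_password() registers the credentials as the default for every realm under that URL. The sketch below checks this by querying the password manager directly, without a server; the realm name and the example.com URL are made up for illustration.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, 'http://localhost:5000/', 'username', 'password')

# Any realm under the registered URL falls back to the default (None) entry
match = p.find_user_password('Some Realm', 'http://localhost:5000/admin')
# A different host has no credentials registered
no_match = p.find_user_password('Some Realm', 'http://example.com/')
print(match, no_match)
```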

Proxies

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

This example assumes a proxy running locally on port 9743.

ProxyHandler takes a dictionary whose keys are protocol types, such as http or https, and whose values are proxy URLs. Multiple proxies can be added.
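
To get a feel for what routing a request through a proxy means, the sketch below uses Request.set_proxy() on a single plain-HTTP request and inspects the result offline. It is a simplified illustration rather than a replacement for ProxyHandler: the connection target becomes the proxy, and the full original URL becomes what is requested from it. The proxy address is the same assumed local one as above.

```python
from urllib.request import Request

req = Request('http://www.baidu.com/')
req.set_proxy('127.0.0.1:9743', 'http')

print(req.host)      # the connection now targets the proxy
print(req.selector)  # the full original URL is requested from the proxy
```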

cookies

Handling cookies requires a cookie-related handler.

import http.cookiejar, urllib.request

# Create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build a handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

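Cookies are stored as text, so they can also be saved to a file.

import http.cookiejar, urllib.request

filename = 'cookies.txt'
# MozillaCookieJar is a CookieJar subclass that can load and save cookies in files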
cookie = http.cookiejar.MozillaCookieJar(filename)

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

It saves cookies in the format used by Mozilla-type browsers.

LWPCookieJar can also load and save cookies, but it uses a different format from MozillaCookieJar: the libwww-perl (LWP) cookie file format.

cookie = http.cookiejar.LWPCookieJar(filename)

Reading a file in LWPCookieJar format:

cookie = http.cookiejar.LWPCookieJar()
# Load the local cookie file; it must have been saved in LWPCookieJar format
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
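
The save and load steps can be tried end to end without touching a real site. The sketch below is one possible setup, not the only one: it assumes a throwaway local server (CookieHandler is a hypothetical helper) that sets a single session cookie, saves the jar to a temporary file, and loads it into a fresh LWPCookieJar. ProxyHandler({}) is added only so that a system proxy does not intercept the local request.

```python
import http.cookiejar
import os
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: answers every GET with one cookie."""
    def do_GET(self):
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123; Path=/')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example output quiet

server = HTTPServer(('127.0.0.1', 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_port}/'

# Collect the cookie and save it in LWP format
path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
jar = http.cookiejar.LWPCookieJar(path)
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),          # bypass any system proxy
    urllib.request.HTTPCookieProcessor(jar),
)
opener.open(url).close()
jar.save(ignore_discard=True, ignore_expires=True)
server.shutdown()
server.server_close()

# Load the saved file into a fresh jar
restored = http.cookiejar.LWPCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
restored_cookies = {c.name: c.value for c in restored}
print(restored_cookies)
```

ignore_discard=True matters here because a cookie without an expiry date is a session cookie, which would otherwise be skipped when saving and loading.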

Parsing URLs

1.urlparse()

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

urlparse() also accepts a scheme argument that supplies a default protocol, such as https, for URLs that do not specify one.

Output:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

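The result is a ParseResult object with six parts:

scheme: http, the protocol
netloc: www.baidu.com, the domain name
path: /index.html, the access path
params: user, the parameters
query: id=5, the query conditions
fragment: comment, the anchor that identifies a position inside index.html

So a standard URL has the following form: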

scheme://netloc/path;parameters?query#fragment
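
Because ParseResult is a named tuple, each of the six parts can be read by attribute name, by index, or through tuple unpacking. A brief sketch using the same URL:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

print(result.scheme, result[0])   # same value by attribute or by index
print(result.netloc, result[1])

# Tuple unpacking yields the six parts in order
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)
```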

Specifying a default scheme

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')

Result:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

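The scheme parameter takes effect only when the URL itself contains no scheme information. Here the URL already specifies http, so the result is still http.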
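
For comparison, here is a sketch with a URL that omits the scheme, so the default is used. Note that without a leading //, urlparse() does not recognize a host at all: netloc stays empty and www.baidu.com ends up at the start of path.

```python
from urllib.parse import urlparse

# No scheme in the URL, so the default supplied via scheme= is used
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
```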

allow_fragments: if it is set to False, the fragment is not parsed separately; it is treated as part of the path, parameters, or query instead.

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)
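
Where the fragment text ends up depends on what precedes it in the URL. The sketch below compares a URL that has a query with one that has only a path; the second URL is a shortened variant introduced here for illustration.

```python
from urllib.parse import urlparse

with_query = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
path_only = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)

# With a query present, '#comment' stays attached to the query
print(with_query.query, repr(with_query.fragment))
# With no params or query, '#comment' stays attached to the path
print(path_only.path, repr(path_only.fragment))
```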

urlunparse()

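urlunparse() is the counterpart of urlparse(). It accepts an iterable whose length must be exactly 6.

from urllib.parse import urlunparse
# data can also be another iterable, such as a tuple or a specific data structure
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Output: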

http://www.baidu.com/index.html;user?a=6#comment
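
A common use of the pair is to parse a URL, change one component, and reassemble it. The sketch below relies on ParseResult being a named tuple, so its _replace() method returns a modified copy.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Parsing and unparsing an unmodified result gives back the same URL
rebuilt = urlunparse(parts)

# Swap one component before rebuilding
https_url = urlunparse(parts._replace(scheme='https'))
print(rebuilt)
print(https_url)
```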

urlsplit()

Similar to urlparse(), but it does not parse the parameters part separately (it stays in path), so it returns only five components.

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

Output:

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

urlunsplit()

Like urlunparse(), it combines the parts of a URL into a complete URL. It also takes an iterable such as a list or tuple; the only difference is that the length must be 5.

from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

Output:

http://www.baidu.com/index.html?a=6#comment

urljoin()

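urljoin() takes a base_url (base link) as its first argument and a new link as its second. It analyzes the scheme, netloc, and path of base_url, fills in whatever the new link is missing, and returns the result.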

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Output:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

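As the output shows, base_url contributes three items: scheme, netloc, and path. If any of them is missing from the new link, it is filled in from base_url; if the new link already has it, the new link's value is used. The parameters, query, and fragment of base_url have no effect.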
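
In a crawler, urljoin() is typically used to turn the relative links found on a page into absolute URLs. A minimal sketch, with a hypothetical page URL and made-up links:

```python
from urllib.parse import urljoin

page = 'http://www.baidu.com/docs/guide/index.html'  # hypothetical page URL
links = ['FAQ.html', '../about.html', '/search?q=1', '//cdn.example.com/app.js']

# Resolve every link against the page it was found on
resolved = [urljoin(page, link) for link in links]
for link, full in zip(links, resolved):
    print(link, '->', full)
```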

urlencode

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output:

http://www.baidu.com?name=germey&age=22
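
urlencode() also escapes characters that would otherwise break the query string, and with doseq=True it expands a list value into repeated keys. A short sketch with made-up parameters:

```python
from urllib.parse import urlencode

# Spaces and '&' inside a value are escaped so they cannot break the query string
escaped = urlencode({'wd': 'python urllib', 'note': 'a&b'})
# doseq=True expands a list value into repeated keys
repeated = urlencode({'tag': ['http', 'parse']}, doseq=True)
print(escaped)
print(repeated)
```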

parse_qs

Deserialization: parse_qs() converts a query string back into a dictionary.

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))

Output:

{'name': ['germey'], 'age': ['22']}

parse_qsl

parse_qsl() converts the parameters into a list of tuples.

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))

Output:

[('name', 'germey'), ('age', '22')]
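
The reason parse_qs() wraps every value in a list is that a key may appear more than once in a query string. The sketch below uses a made-up query with a repeated key and shows that the parse_qsl() output can be passed back to urlencode().

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = 'name=germey&tag=a&tag=b'
as_dict = parse_qs(query)    # values are lists because keys may repeat
as_pairs = parse_qsl(query)  # keeps the order and every occurrence

# The list of pairs can be fed straight back into urlencode
round_trip = urlencode(as_pairs)
print(as_dict)
print(as_pairs)
print(round_trip)
```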

quote()

quote() converts content into URL-encoded form. A URL that contains Chinese parameters can end up garbled, so this method is used to convert the Chinese characters into URL encoding.

from urllib.parse import quote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Output:

https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

unquote()

unquote() decodes a URL-encoded string.

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))

Output:

https://www.baidu.com/s?wd=壁纸
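
By default quote() leaves / unescaped, which suits paths but not always query values. The sketch below compares the default with safe='' and with quote_plus(), using a made-up keyword that contains a space and a slash.

```python
from urllib.parse import quote, quote_plus, unquote

keyword = '壁纸 hd/4k'
default = quote(keyword)          # '/' is kept by default (safe='/')
strict = quote(keyword, safe='')  # escape '/' as well
plus = quote_plus(keyword)        # spaces become '+', '/' is escaped

print(default)
print(strict)
print(plus)
# Decoding restores the original text
print(unquote(strict) == keyword)
```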
