Python函数单元测试以及爬虫的基本实现

一.Python程序函数的测试

Python中有一个自带的单元测试框架是unittest模块，用它来做单元测试，它里面封装好了一些校验返回的结果方法和一些用例执行前的初始化操作。

在说unittest之前，先说几个概念：

TestCase 也就是测试用例

TestSuite 多个测试用例集合在一起，就是TestSuite

TestLoader是用来加载TestCase到TestSuite中的

TestRunner是来执行测试用例的,测试的结果会保存到TestResult实例中，包括运行了多少测试用例，成功了多少，失败了多少等信息

接下来我们就选出一部分代码来进行一个简单的测试

这是一个简单的比赛结束函数，下面我们来测试它是否可用

<pre class="prettyprint hljs python" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 1.5em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">def gameOver(a,b):
if a>=10 and b>=10:
if abs(a-b)==2:
return True
if a<10 or b<10:
if a==11 or b==11:
return True
else:
return False
</pre>

首先我们编写一个

Game.py

<pre class="hljs python" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 0.75em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">class Game():
def init(self,a,b):
self.a = a
self.b = b
</pre>

再编写一个测试代码

Testgame.py

<pre class="prettyprint hljs python" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 1.5em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import unittest
from game import Game
class GameTest(unittest.TestCase):
#一个测试用例
def test_gameOver(self):
self = Game('15','13')
#要测试的函数
def gameOver(a,b):
if a>=10 and b>=10:
if abs(a-b)==2:
return True
if a<10 or b<10:
if a==11 or b==11:
return True
else:
return False

运行测试代码

unittest.main()
</pre>

效果如下

image

学习Python中的小伙伴，需要学习资料的话，可以到我的微信公众号：Python学习知识圈，后台回复：“01”，即可拿Python学习资料

完美运行，此函数无问题。

二.requests库的运用

1.爬虫的基本框架是获取HTML页面信息，解析页面信息，保存结果，requests模块是用于第一步获取HTML页面信息； requests库用于爬取HTML页面，提交网络请求，基于urllib,但比urllib更方便。
[图片上传中...(image.png-571044-1558592667773-0)]

2.安装requests库

打开命令行，输入

<pre class="hljs nginx" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 0.75em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">pip install requests
</pre>

即可安装

3.下面介绍一些requests库的基本用法

image

get方法，它能够获得url的请求，并返回一个response对象作为响应。

常见的HTTP状态码：200 - 请求成功，301 - 资源（网页等）被永久转移到其它URL，404 - 请求的资源（网页等）不存在，500 - 内部服务器错误。

image

4.实例

我们现在用requests库访问搜狗主页20次，并获取一些信息

<pre class="prettyprint hljs go" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 1.5em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import requests
for i in range(20):
r = requests.get("https://www.sogou.com")#搜狗主页
print("网页返回状态:{}".format(r.status_code))
print("text内容为:{}".format(r.text))
print("\n")
print("text内容长度为:{}".format(len(r.text)))
print("content内容长度为:{}".format(len(r.content)))
</pre>

代码运行后结果如下

image

三.BeautifulSoup库的运用

安装方式：

打开命令行，输入

1.简介

beautifulsoup是一个非常强大的工具，爬虫利器。

beautifulSoup “美味的汤，绿色的浓汤”

一个灵活又方便的网页解析库，处理高效，支持多种解析器。

利用它就不用编写正则表达式也能方便的实现网页信息的抓取。

2.常用解析库

Beautiful Soup支持Python标准库中的HTML解析器,还支持一些第三方的解析器，

如果我们不安装它，则 Python 会使用 Python默认的解析器，lxml 解析器更加强大，速度更快，推荐安装。

image

3.基本使用

<pre class="prettyprint hljs vbnet" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 1.5em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># BeautifulSoup入门
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml') # 创建BeautifulSoup对象
print(soup.prettify()) # 格式化输出
print(soup.title) # 打印标签中的所有内容
print(soup.title.name) # 获取标签对象的名字
print(soup.title.string) # 获取标签中的文本内容 == soup.title.text
print(soup.title.parent.name) # 获取父级标签的名字
print(soup.p) # 获取第一个p标签的内容
print(soup.p["class"]) # 获取第一个p标签的class属性
print(soup.a) # 获取第一个a标签
print(soup.find_all('a')) # 获取所有的a标签
print(soup.find(id='link3')) # 获取id为link3的标签
print(soup.p.attrs) # 获取第一个p标签的所有属性
print(soup.p.attrs['class']) # 获取第一个p标签的class属性
print(soup.find_all('p',class_='title')) # 查找属性为title的p

通过下面代码可以分别获取所有的链接以及文字内容

for link in soup.find_all('a'):
print(link.get('href')) # 获取链接

print(soup.get_text())获取文本
</pre>

（1）：标签选择器

在快速使用中我们添加如下代码：

print(soup.title)

print(type(soup.title))

print(soup.head)

print(soup.p)

通过这种soup.标签名我们就可以获得这个标签的内容

这里有个问题需要注意，通过这种方式获取标签，如果文档中有多个这样的标签，返回的结果是第一个标签的内容，如我们通过soup.p获取p标签，而文档中有多个p标签，但是只返回了第一个p标签内容。

（2）：获取名称

当我们通过soup.title.name的时候就可以获得该title标签的名称，即title。

（3）：获取属性

print(soup.p.attrs['name'])

print(soup.p['name'])

上面两种方式都可以获取p标签的name属性值

（4）：获取内容

print(soup.p.string)

结果就可以获取第一个p标签的内容。

（5）：嵌套选择

我们直接可以通过下面嵌套的方式获取

print(soup.head.title.string)

（6）：标准选择器

         find_all

find_all(name,attrs,recursive,text,**kwargs)

可以根据标签名，属性，内容查找文档

1.name

<pre class="prettyprint hljs scala" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 1.5em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul')) # 找到所有ul标签
print(type(soup.find_all('ul')[0])) # 拿到第一个ul标签

find_all可以多次嵌套，如拿到ul中的所有li标签

for ul in soup.find_all('ul'):
print(ul.find_all('li'))
</pre>

2.attrs

<pre class="prettyprint hljs scala" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 1.5em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'})) # 找到id为ilist-1的标签
print(soup.find_all(attrs={'name': 'elements'})) # 找到name属性为elements的标签
</pre>

注意：attrs可以传入字典的方式来查找标签，但是这里有个特殊的就是class,因为class在python中是特殊的字段，所以如果想要查找class相关的可以更改attrs={'class_':'element'}或者soup.find_all('',{"class":"element})，特殊的标签属性可以不写attrs，例如id。

实例

代码如下

<pre class="prettyprint hljs dust" style="padding: 0.5em; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; color: rgb(68, 68, 68); border-radius: 4px; display: block; margin: 0px 0px 1.5em; font-size: 14px; line-height: 1.5em; word-break: break-all; overflow-wrap: break-word; white-space: pre; background-color: rgb(246, 246, 246); border: none; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
"""
Created on Mon May 20 19:33:22 2019

@author: moyulin
"""
import re
from bs4 import BeautifulSoup
html = """
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程(runoob.com)</title>
</head>
<body>

<h1>我的第一个标题</h1>
<p id="first">我的第一个段落。</p>
<table border="1">
<tr>
<td>row 1, cell 1<\td>
<td>row 1, cdll 2<\td>
<\tr>
<tr>
<td>row 2, cell 1<\td>
<td>row 2, cell 2<\td>
<\tr>
<\table>
<\body>
<\html>
"""
soup = BeautifulSoup(html,"html.parser")
print(soup.prettify())
print("(A).该html的head标签内容为\n{}\n03".format(soup.head.prettify()))
print("\n")
print("(B).该html的body标签内容为\n{}".format(soup.body.prettify()))
print("(C).该html中id为first的标签对象为:{}".format(soup.find(id="first")))
print("(D).该html中所有的中文字符为:{}".format(soup.find_all(string = re.compile('[^\x00-\xff]'))))#这里用到正则表达式来找出中文字符串 </pre>

运行结果为

image

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 204,732评论 6赞 478
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 87,496评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 151,264评论 0赞 338
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,807评论 1赞 277
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,806评论 5赞 368
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,675评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,029评论 3赞 399
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,683评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 41,704评论 1赞 299
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,666评论 2赞 321
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,773评论 1赞 332
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,413评论 4赞 321
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,016评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,978评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,204评论 1赞 260
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,083评论 2赞 350
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,503评论 2赞 343

Python函数单元测试以及爬虫的基本实现

运行测试代码

完美运行，此函数无问题。

get方法，它能够获得url的请求，并返回一个response对象作为响应。

常见的HTTP状态码：200 - 请求成功，301 - 资源（网页等）被永久转移到其它URL，404 - 请求的资源（网页等）不存在，500 - 内部服务器错误。

4.实例

我们现在用requests库访问搜狗主页20次，并获取一些信息

代码运行后结果如下

三.BeautifulSoup库的运用

3.基本使用

通过下面代码可以分别获取所有的链接以及文字内容

（1）：标签选择器

（2）：获取名称

（3）：获取属性

（4）：获取内容

（5）：嵌套选择

（6）：标准选择器

find_all可以多次嵌套，如拿到ul中的所有li标签

运行结果为

推荐阅读更多精彩内容