Beautiful Soup Summary

Reference: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id14

What is Beautiful Soup

Beautiful Soup is an HTML/XML parsing library written in Python. It handles malformed markup well and builds a parse tree from it.

Environment: CentOS 7 (Linux)

- Installation

pip install beautifulsoup4
easy_install beautifulsoup4

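- Main parsers and their pros and cons

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	1. Built into Python; 2. Moderate speed; 3. Lenient with malformed documents	Poor leniency in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	1. Fast; 2. Lenient with malformed documents	Requires a C library
lxml XML parser	1. BeautifulSoup(markup, "lxml-xml"); 2. BeautifulSoup(markup, "xml")	1. Fast; 2. The only parser that supports XML	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	1. Most lenient; 2. Parses documents the way a browser does; 3. Produces valid HTML5	1. Slow; 2. Requires an external Python package

lxml is recommended as the parser because it is faster.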
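
To make the differences in the table concrete, the same invalid fragment can be handed to each parser. This is only a sketch: the helper name parse_with is made up for illustration, and lxml and html5lib are treated as optional, so a parser that is not installed is simply reported as missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with(parser, markup="<a></p>"):
    """Return the repaired markup as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None

# Feed the same broken fragment to each parser named in the table.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, output in results.items():
    print(name, "->", output if output is not None else "(not installed)")
```

html.parser leaves the fragment as a bare <a></a>, while the other two parsers, when available, wrap the result in a full document, so the parser choice can change what later searches find.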

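Usage

Parsing a local document

from bs4 import BeautifulSoup
# Parse a file on disk ("XXX.html" is a placeholder path); the with block closes the file
with open("XXX.html") as f:
    soup = BeautifulSoup(f, 'lxml')
# Parse a string directly; naming the parser avoids the "no parser specified" warning
soup = BeautifulSoup("<html>data</html>", 'lxml')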
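
A self-contained variant of the two calls above, for trying them without an existing HTML file. It is a sketch under two assumptions: a throwaway file is created with tempfile, and html.parser is used so that only bs4 itself has to be installed.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a small throwaway HTML file so the example does not depend on local data.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as tmp:
    tmp.write("<html><head><title>demo</title></head><body><p>data</p></body></html>")
    path = tmp.name

# Passing an open file handle works the same way as passing a string.
with open(path, encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")
os.remove(path)

soup_from_string = BeautifulSoup("<p>data</p>", "html.parser")
print(soup_from_file.title.string)  # demo
print(soup_from_string.p.string)    # data
```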

Detailed parsing (matching) usage

Tag objects

Note: soup.b  # returns only the first b tag
soup.find_all('b')  # returns every b tag

Get a tag: soup.b

soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b  # only the first b tag is returned
print(tag)
# Output: <b class="boldest" id="bold">Extremely bold</b>
print(type(tag))
# Output: <class 'bs4.element.Tag'>

Get the tag name: tag.name

soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
print(tag.name)
# Output: b

Get an attribute value: tag['class']

soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
print(tag['class'])
# Output: ['boldest']

Get all attribute names and values: tag.attrs

Note: XML has no multi-valued attributes, so class is split into a list only when the document is parsed as HTML

soup = BeautifulSoup('<b class="boldest mybold" id="bold dd">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
print(tag['class'])
# Output: ['boldest', 'mybold']
print(tag['id'])
# Output: bold dd
print(tag.attrs)
# Output: {'class': ['boldest', 'mybold'], 'id': 'bold dd'}

Modify a tag attribute: tag['class'] = 'change' / tag['class'] = 'change muil' / or equivalently tag['class'] = ['change','muil']

soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
tag['class'] = 'change'
tag['id'] = '1'
print(soup)
# Output: <html><body><b class="change" id="1">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p></body></html>
tag['class'] = 'change muil'  # can also be written as tag['class'] = ['change','muil']
tag['id'] = '1'
print(soup)
# Output: <html><body><b class="change muil" id="1">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p></body></html>

Delete a tag attribute: del tag['class']

soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
del tag['class']
del tag['id']
print(soup)
# Output: <html><body><b>Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p></body></html>
# print(tag['class'])  # would raise KeyError: 'class' because the attribute is gone
print(tag.get('class'))
# Output: None

Text content

Get the text of a tag: tag.string

soup = BeautifulSoup('<b class="boldest mybold" id="bold dd">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
print(tag.string)
# Output: Extremely bold
print(soup.string)
# Output: None (the document contains several tags, so .string is ambiguous)

Replace the text: tag.string.replace_with("replaced")

soup = BeautifulSoup('<b class="boldest mybold" id="bold dd">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
tag.string.replace_with("replaced")
print(tag)
# Output: <b class="boldest mybold" id="bold dd">replaced</b>

Summary:
Attribute access such as soup.a finds only the first match;

soup.find_all('a') finds every match.

Get a tag's direct children as a list: head.contents

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head></html>",'lxml')
head = soup.head
print(head)
# Output: <head><title>The Dormouse's story</title></head>
print(head.contents)
# Output: [<title>The Dormouse's story</title>]
print(head.contents[0].contents)
# Output: ["The Dormouse's story"]

Iterate over a tag's direct children (a generator): for i in head.children

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head></html>",'lxml')
head = soup.head
print(head)
# Output: <head><title>The Dormouse's story</title></head>
for i in head.children:
    print(i)
# Output: <title>The Dormouse's story</title>

Iterate over all descendants (a generator): for i in head.descendants

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head></html>",'lxml')
head = soup.head
print(head)
# Output: <head><title>The Dormouse's story</title></head>
for i in head.descendants:
    print(i)
# Output:
# <title>The Dormouse's story</title>
# The Dormouse's story

Iterate over all text strings (a generator): for i in head.strings

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>",'lxml')
head = soup.html
print(head)
# Output: <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
for i in head.strings:
    print(i)
# Output:
# The Dormouse's story
# ppppp

Iterate over all text strings, stripping leading and trailing whitespace and skipping whitespace-only strings (whitespace inside a string is kept): for i in head.stripped_strings

soup = BeautifulSoup("<html><head><title>&nbsp The Dormouse's   story \n\r  </title></head><p>ppppp</p></html>",'lxml')
head = soup.html
print(head)
for i in head.stripped_strings:
    print(i)
# Output:
# &nbsp The Dormouse's   story
# ppppp

Get an element's direct parent: head.parent

soup.html.parent is the BeautifulSoup object, so printing it shows the whole document
soup.parent is None

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>",'lxml')
print(soup)
# Output: <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
head = soup.p
print(soup.p.string)
# Output: ppppp
print(soup.p.string.parent)
# Output: <p>ppppp</p>
print(head.parent)
# Output: <body><p>ppppp</p></body>
print(soup.html.parent)
# Output: <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>

Get all ancestors of a node (a generator): for i in head.parents

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>",'lxml')
print(soup)
# Output: <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
head = soup.p
print(head.parents)
# Output: <generator object parents at 0x7f6f4282ef68>
for i in head.parents:
    print(i)
# ancestor ---- output:
# body-----<body><p>ppppp</p></body>
# html-----<html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
# soup-----<html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>

Get sibling nodes (a sibling may be a string such as a newline or a comma rather than a tag):
Get the next sibling: soup.b.next_sibling
Get the previous sibling: soup.c.previous_sibling
Iterate over all siblings:
for sibling in soup.a.next_siblings
for sibling in soup.find(id="link3").previous_siblings

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",'lxml')
print(soup)
# Output: <html><body><a><b>text1</b><c>text2</c></a></body></html>
print(soup.b.next_sibling)
# Output: <c>text2</c>
print(soup.c.previous_sibling)
# Output: <b>text1</b>
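
The plural forms .next_siblings and .previous_siblings have no example above, so here is a small sketch. The markup is invented for illustration and html.parser is used so that nothing beyond bs4 is required. It also shows the warning from the start of this part: text between tags counts as a sibling too.

```python
from bs4 import BeautifulSoup

markup = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

# Every following sibling of the first <a>, text nodes included.
after_first = [str(s) for s in soup.a.next_siblings]
print(after_first)
# [', ', '<a id="link2">Lacie</a>', ' and ', '<a id="link3">Tillie</a>']

# Every preceding sibling of link3, nearest first.
before_last = [str(s) for s in soup.find(id="link3").previous_siblings]
print(before_last)
# [' and ', '<a id="link2">Lacie</a>', ', ', '<a id="link1">Elsie</a>']
```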

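Summary:
Direct children of an element: head.contents / head.children
Direct parent of an element: head.parent
All descendants of an element: for i in head.descendants
All ancestors of an element: for i in head.parents
Adjacent sibling: .next_sibling / .previous_sibling
All siblings: .next_siblings / .previous_siblings
First matching element: soup.a
All matching elements: soup('a')
Text of a single tag: tag.string
All text strings: for i in head.strings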

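Searching:

find_all

Note 1: find_all() called on a tag searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

Note 2: soup.find_all("a") is equivalent to soup("a")

Note 3: find() returns only the first match

Find all b tags: soup.find_all('b')
Find all tags whose name starts with b: soup.find_all(re.compile("^b"))
Find all tags whose name contains d: soup.find_all(re.compile("d"))
Find all a and b tags: soup.find_all(["a", "b"])
Find all b1 tags whose class is myclass: soup.find_all("b1", "myclass")
Find b1 tags whose id is myid: soup.find_all("b1", id="myid")
Find any tag whose id is myid: soup.find_all(id="myid")
Find the first string containing "text" (note: the result is the string itself, not the enclosing tag): soup.find(string=re.compile("text"))
Find strings that exactly match any item in a list: soup.find_all(string=["Tillie", "Elsie", "Lacie"])
Find all tags whose href contains elsie: soup.find_all(href=re.compile("elsie"))
Match any tag: soup.find_all(True)
Match any tag that has an id attribute, whatever its value: soup.find_all(id=True)
Filter on several arguments: soup.find_all(href=re.compile("elsie"), id='link1')
Limit the number of results: soup.find_all("a", limit=2)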

import re
soup = BeautifulSoup("<a><b><b1 class='myclass' id='myid'>text1</b1><b2>text2</b2></b></a>",'lxml')
print(soup.find_all('b'))
# Output: [<b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b>]
print(soup.find_all(re.compile("^b")))
# Output: [<body><a><b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b></a></body>, <b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b>, <b1 class="myclass" id="myid">text1</b1>, <b2>text2</b2>]
print(soup.find_all(re.compile('d')))
# Output: [<body><a><b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b></a></body>]
print(soup.find_all(["a", "b"]))
# Output: [<a><b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b></a>, <b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b>]

data_soup.find_all(data-foo="value") raises a SyntaxError, because data-foo is not a valid Python keyword argument name
data_soup.find_all(attrs={"data-foo": "value"}) works
soup.find_all("a", attrs={"class": "sister"})
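
Several of the find_all() arguments listed in this section (class, id, limit, recursive, string, attrs) can be tried on one small document. The markup below is invented for illustration, and html.parser is assumed so that only bs4 is needed.

```python
import re
from bs4 import BeautifulSoup

markup = """
<div id="box" data-foo="value">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <p><a class="other" href="http://example.com/tillie" id="link3">Tillie</a></p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

all_links = [a["id"] for a in soup.find_all("a")]
by_class = [a["id"] for a in soup.find_all("a", "sister")]
limited = [a["id"] for a in soup.find_all("a", limit=2)]
by_id = [t.name for t in soup.find_all(id="link3")]
combined = [a["id"] for a in soup.find_all(href=re.compile("elsie"), id="link1")]
strings = [str(s) for s in soup.find_all(string=["Tillie", "Elsie", "Lacie"])]
# recursive=False looks only at direct children, so the nested link3 is skipped.
direct = [a["id"] for a in soup.div.find_all("a", recursive=False)]
# data-* attributes are not valid keyword names, so they go through attrs.
by_data = [t.name for t in soup.find_all(attrs={"data-foo": "value"})]

print(all_links)  # ['link1', 'link2', 'link3']
print(by_class)   # ['link1', 'link2']
print(limited)    # ['link1', 'link2']
print(by_id)      # ['a']
print(combined)   # ['link1']
print(strings)    # ['Elsie', 'Lacie', 'Tillie']
print(direct)     # ['link1', 'link2']
print(by_data)    # ['div']
```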

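find_parents() and find_parent()

Note: find_parents() returns a list of every matching ancestor, while find_parent() returns only the nearest matching ancestor.

find_next_siblings() and find_next_sibling()
Note: find_next_siblings() returns every following sibling that matches, while find_next_sibling() returns only the first following sibling that matches.

find_previous_siblings() and find_previous_sibling()
Note: find_previous_siblings() returns every preceding sibling that matches, while find_previous_sibling() returns only the first preceding sibling that matches.

find_all_next() and find_next()
Note: find_all_next() returns every matching node that comes after the current one in the document, while find_next() returns only the first.

find_all_previous() and find_previous()
Note 1: find_all_previous() returns every matching node that comes before the current one in the document, while find_previous() returns only the first.
Note 2: In the example below, find_all_previous("p") returns the first paragraph (the one with class="title"), but it also returns the second paragraph, the <p> tag that contains the <a> tag the search started from. This is expected: the call finds every <p> tag that appears before the given <a> tag, and a <p> that contains the <a> necessarily starts before it.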

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister1" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
<p href="http://example.com/laci" class="sister" id="link4">Laci</p> and
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

a_string = soup.find(string="Lacie")
print(a_string)
# Lacie

a = a_string.find_parents("a")
print(a)
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

p = a_string.find_parent("p")
print(p)
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister1" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# <p class="sister" href="http://example.com/laci" id="link4">Laci</p> and
#  and they lived at the bottom of a well.</p>

p1 = a_string.find_parents("p", class_="title")
print(p1)
# []

first_link = soup.a
first_link
# <a class="sister1" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]

first_link.find_previous("title")
# <title>The Dormouse's story</title>
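
The sibling and "next" variants described earlier in this section are not exercised by the code above, so here is a short sketch of them. The markup is a trimmed, invented version of the three-sisters document, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

markup = ('<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
          '<a id="link3">Tillie</a></p><p class="end">...</p>')
soup = BeautifulSoup(markup, "html.parser")
first = soup.find(id="link1")
last = soup.find(id="link3")

next_sibs = [a["id"] for a in first.find_next_siblings("a")]
next_sib = first.find_next_sibling("a")["id"]
prev_sibs = [a["id"] for a in last.find_previous_siblings("a")]
prev_sib = last.find_previous_sibling("a")["id"]
# find_all_next / find_next walk the rest of the document, not just siblings.
later_paragraphs = [p["class"] for p in first.find_all_next("p")]
next_paragraph_text = str(first.find_next("p").string)
# find_parent returns one element (or None); find_parents returns a list.
parent_class = first.find_parent("p")["class"]
div_parents = list(first.find_parents("div"))

print(next_sibs)            # ['link2', 'link3']
print(next_sib)             # link2
print(prev_sibs)            # ['link2', 'link1']
print(prev_sib)             # link2
print(later_paragraphs)     # [['end']]
print(next_paragraph_text)  # ...
print(parent_class)         # ['story']
print(div_parents)          # []
```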

select

Descend through tags level by level: soup.select("body a") / soup.select("html head title")
Find direct children of a tag: soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
Find sibling tags: soup.select("#link1 ~ .sister") matches every following sibling
soup.select("#link1 + .sister") matches only the immediately following sibling
Find by CSS class: soup.select(".sister") / soup.select("[class~=sister]")
Find by id: soup.select("#link1") / soup.select("a#link2") / soup.select("#link1,#link2")
Find by the presence of an attribute: soup.select('a[href]')
Find by attribute value: soup.select('a[href="http://example.com/elsie"]') exact match
soup.select('a[href^="http://example.com/"]') prefix match
soup.select('a[href$="tillie"]') suffix match
soup.select('a[href*=".com/el"]') substring match
Find only the first match: soup.select_one(".sister")
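
A runnable sketch of the selectors listed above. The helper ids() is hypothetical and exists only to keep the output short; the markup is a simplified three-sisters document, html.parser is assumed, and a recent Beautiful Soup install (which bundles CSS selector support) is assumed as well.

```python
from bs4 import BeautifulSoup

markup = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p></body></html>
"""
soup = BeautifulSoup(markup, "html.parser")

def ids(selector):
    """Hypothetical helper: the id of every tag matched by a CSS selector."""
    return [tag.get("id") for tag in soup.select(selector)]

title_text = str(soup.select("html head title")[0].string)
first_sister = soup.select_one(".sister")["id"]

print(title_text)                    # The Dormouse's story
print(ids("p > a"))                  # ['link1', 'link2', 'link3']
print(ids("p > a:nth-of-type(2)"))   # ['link2']
print(ids("#link1 ~ .sister"))       # ['link2', 'link3']
print(ids("#link1 + .sister"))       # ['link2']
print(ids('a[href$="tillie"]'))      # ['link3']
print(ids('a[href*=".com/el"]'))     # ['link1']
print(first_sister)                  # link1
```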

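Getting text
Get all text: soup.get_text()
Join the text pieces with a separator: soup.get_text("|")
Join with a separator and strip whitespace from each piece: soup.get_text("|", strip=True)
Use the .stripped_strings generator to collect the pieces in a list and process them yourself: [text for text in soup.stripped_strings]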
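
The four text-extraction calls above, applied to one small fragment. The markup is an assumption chosen so that the whitespace handling is visible, and html.parser is used.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()
joined = soup.get_text("|")
joined_stripped = soup.get_text("|", strip=True)
pieces = list(soup.stripped_strings)

print(repr(plain))            # '\nI linked to example.com\n'
print(repr(joined))           # '\nI linked to |example.com|\n'
print(repr(joined_stripped))  # 'I linked to|example.com'
print(pieces)                 # ['I linked to', 'example.com']
```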
Filter functions
Check whether an element has a class attribute but no id attribute, and find all tags for which the function returns True

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))

Find all tags whose href attribute does not match a given regular expression

import re
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
print(soup.find_all(href=not_lacie))

Find tags that have a string both immediately before and immediately after them

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))
for tag in soup.find_all(surrounded_by_strings):
    print(tag)

Match on the length of the class value

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
print(soup.find_all(class_=has_six_characters))
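
The filter functions above print nothing meaningful without a document to run against. The sketch below applies three of them to a small invented document (html.parser assumed) so that the results can be checked.

```python
import re
from bs4 import BeautifulSoup

markup = ('<p class="title"><b>The Dormouse\'s story</b></p>'
          '<p class="story">'
          '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> and '
          '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
          '</p>')
soup = BeautifulSoup(markup, "html.parser")

def has_class_but_no_id(tag):
    # Receives a whole Tag when passed as the first argument to find_all().
    return tag.has_attr("class") and not tag.has_attr("id")

def not_lacie(href):
    # Receives the attribute value (or None) when passed as href=...
    return href and not re.compile("lacie").search(href)

def has_six_characters(css_class):
    # Receives each class name when passed as class_=...
    return css_class is not None and len(css_class) == 6

no_id = [tag["class"] for tag in soup.find_all(has_class_but_no_id)]
not_lacie_ids = [tag["id"] for tag in soup.find_all(href=not_lacie)]
six_ids = [tag["id"] for tag in soup.find_all(class_=has_six_characters)]

print(no_id)          # [['title'], ['story']]
print(not_lacie_ids)  # ['link1']
print(six_ids)        # ['link1', 'link2']
```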

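Modifying the tree
Delete an attribute: del tag['class'] (del tag only unbinds the Python name; to remove a tag from the tree use extract() or decompose())
Modify: assign with =
Append: append("Bar")

soup = BeautifulSoup("<a>Foo</a>", 'lxml')
soup.a.append("Bar")

soup
# <html><body><a>FooBar</a></body></html>
soup.a.contents
# ['Foo', 'Bar']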

Add a string: append(new_string), where new_string is a plain str or a NavigableString(" there")

from bs4 import NavigableString
soup = BeautifulSoup("<b></b>", 'lxml')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# ['Hello', ' there']

Add a comment: soup.new_string("Nice to see you.", Comment)

from bs4 import Comment
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# ['Hello', ' there', 'Nice to see you.']

Create a tag: soup.new_tag("a", href="http://www.example.com")

soup = BeautifulSoup("<b></b>", 'lxml')
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>

Append at the end: append()
Insert at a given position: insert()
Insert before the current tag or string: soup.b.string.insert_before(tag) inserts tag before b.string
Insert after the current tag or string: soup.b.i.insert_after(soup.new_string(" ever "))

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
tag = soup.a

tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]
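
insert_before() and insert_after() are listed above without a worked example. A minimal sketch, assuming html.parser and a one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

# Build a new <i> tag and place it before the existing text of <b>.
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
after_insert_before = str(soup.b)
print(after_insert_before)  # <b><i>Don't</i>stop</b>

# Place a new string right after the <i> tag.
soup.b.i.insert_after(soup.new_string(" ever "))
after_insert_after = str(soup.b)
print(after_insert_after)   # <b><i>Don't</i> ever stop</b>
print(soup.b.contents)      # [<i>Don't</i>, ' ever ', 'stop']
```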

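Remove a tag's contents: tag.clear()
Remove a tag from the tree and return it (the return value is the extracted tag, not the remaining tree): x = soup.i.extract()
Remove a node from the tree and destroy it completely: soup.i.decompose()
Replace a piece of the tree with a new tag or string: a_tag.i.replace_with(new_tag)
x = a_tag.i.replace_with(new_tag) returns the node that was replaced
Wrap an element in a tag and return the wrapper: soup.p.string.wrap(soup.new_tag("b"))
Replace a tag with its own contents, dropping the tag but keeping its text and children (commonly used to strip markup): a_tag.i.unwrap()
x = a_tag.i.unwrap() returns the tag that was removed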
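
The removal and replacement methods above can be compared side by side. This sketch reparses the same invented markup before each step so that the steps do not affect one another; html.parser is assumed.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# extract() removes the tag from the tree and returns it.
soup = BeautifulSoup(markup, "html.parser")
removed = soup.i.extract()
after_extract = str(soup.a)
print(after_extract)   # <a href="http://example.com/">I linked to </a>
print(removed)         # <i>example.com</i>

# replace_with() swaps in a new node and returns the old one.
soup = BeautifulSoup(markup, "html.parser")
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
old = soup.a.i.replace_with(new_tag)
after_replace = str(soup.a)
print(after_replace)   # <a href="http://example.com/">I linked to <b>example.net</b></a>

# unwrap() keeps the text but drops the surrounding tag.
soup = BeautifulSoup(markup, "html.parser")
soup.a.i.unwrap()
after_unwrap = str(soup.a)
print(after_unwrap)    # <a href="http://example.com/">I linked to example.com</a>

# wrap() puts a node inside a new tag; clear() empties a tag; decompose() destroys it.
soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
after_wrap = str(soup.p)
print(after_wrap)      # <p><b>I wish I was bold.</b></p>
soup.p.clear()
after_clear = str(soup.p)
print(after_clear)     # <p></p>
soup.p.decompose()
after_decompose = str(soup)
```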

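Output
Pretty-print: soup.prettify()
Encoding detection
Beautiful Soup detects (guesses) the document's encoding automatically
The detected encoding: soup.original_encoding
Specify the encoding: soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
Exclude encodings from the guess: soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
Pass an encoding to prettify(): soup.prettify("latin-1")
Encode a single node: soup.p.encode("utf-8")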
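
A small round trip through the encoding options above. It is a sketch under these assumptions: the input bytes are built in the example itself from a Hebrew word, the parser is html.parser, and the exact spelling reported by original_encoding may vary between versions, so it is printed without an expected value.

```python
from bs4 import BeautifulSoup

text = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"
markup = ("<h1>" + text + "</h1>").encode("iso-8859-8")

# Tell Beautiful Soup the encoding instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
print(soup.original_encoding)   # the encoding Beautiful Soup settled on
decoded_ok = soup.h1.string == text
print(decoded_ok)               # True

# encode() serializes a single node in the requested encoding.
utf8_bytes = soup.h1.encode("utf-8")
print(utf8_bytes == ("<h1>" + text + "</h1>").encode("utf-8"))  # True
```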
