BeautifulSoup Summary
Reference: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id14
What is BeautifulSoup
BeautifulSoup is an HTML/XML parser written in Python. It handles malformed markup well and builds a parse tree from it.
Environment: CentOS 7 (Linux)
- Installation
pip install beautifulsoup4
easy_install beautifulsoup4
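- Main parsers and their pros and cons
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | 1. Built into Python; 2. Moderate speed; 3. Tolerant of malformed markup | Poor tolerance of malformed markup in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | 1. Fast; 2. Tolerant of malformed markup | Requires an external C library |
lxml XML parser | 1. BeautifulSoup(markup, "lxml-xml"); 2. BeautifulSoup(markup, "xml") | 1. Fast; 2. The only supported XML parser | Requires an external C library |
html5lib | BeautifulSoup(markup, "html5lib") | 1. Best tolerance of malformed markup; 2. Parses the document the way a browser does; 3. Produces valid HTML5 | 1. Slow; 2. Requires an external Python package |
lxml is the recommended parser because it is the fastest.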
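Since lxml is an optional install, a script may want to fall back to the built-in parser when it is missing. The following is a minimal sketch of that idea; the helper name `make_soup` and its `preferred` parameter are hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    `make_soup` is a hypothetical helper written for this example.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


# Both lxml and html.parser close the unclosed tags in this fragment.
soup = make_soup("<p>Unclosed paragraph<b>bold")
print(soup.b.string)  # bold
```

Because different parsers can build different trees from the same malformed input, pinning a single parser is usually the safer choice when results must be reproducible.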
Usage
Parsing a local document
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("XXX.html"), 'lxml')  # parse from an open file handle
soup = BeautifulSoup("<html>data</html>", 'lxml')  # or parse a markup string directly
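When parsing from a file, a `with` block makes sure the handle is closed afterwards. The sketch below writes a throwaway file first so that it runs on its own; the file contents are made up for illustration, and `html.parser` is used only to avoid an external dependency.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a temporary HTML file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><head><title>demo</title></head><body><p>data</p></body></html>")
    path = tmp.name

# Parse from a file handle; the with-block closes the file when done.
with open(path, encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")
os.remove(path)

print(soup.title.string)  # demo
print(soup.p.string)      # data
```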
Detailed parsing (matching) usage
- Tags
Note: soup.b  # returns only the first b tag
soup.find_all('b')  # returns all b tags
- Get a tag: soup.b
soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b  # returns only the first b tag
print(tag)
# Output: <b class="boldest" id="bold">Extremely bold</b>
print(type(tag))
# Output: <class 'bs4.element.Tag'>
- Get the tag name: tag.name
soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
print(tag.name)
# Output: b
- Get an attribute value: tag['class']
soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
print(tag['class'])
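# Output: ['boldest']
- Get all attribute names and values: tag.attrs
Note: class is a multi-valued attribute in HTML, so it is returned as a list; documents parsed as XML have no multi-valued attributes.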
soup = BeautifulSoup('<b class="boldest mybold" id="bold dd">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
print(tag['class'])
# Output: ['boldest', 'mybold']
print(tag['id'])
# Output: bold dd
print(tag.attrs)
# Output: {'class': ['boldest', 'mybold'], 'id': 'bold dd'}
- Modify attributes: tag['class'] = 'change' or tag['class'] = 'change muil' (which can also be written as tag['class'] = ['change', 'muil'])
soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
tag['class'] = 'change'
tag['id'] = '1'
print(soup)
# Output: <html><body><b class="change" id="1">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p></body></html>
tag['class'] = 'change muil'  # can also be written as tag['class'] = ['change', 'muil']
tag['id'] = '1'
print(soup)
# Output: <html><body><b class="change muil" id="1">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p></body></html>
- Delete attributes: del tag['class']
soup = BeautifulSoup('<b class="boldest" id="bold">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
del tag['class']
del tag['id']
print(soup)
# Output: <html><body><b>Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p></body></html>
print(tag['class'])
# Raises KeyError: 'class'
print(tag.get('class'))
# Output: None
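Because indexing a missing attribute raises KeyError, code that handles arbitrary tags often uses `get()` with a default or checks `has_attr()` first. A small sketch with made-up markup, parsed with `html.parser` to avoid an external dependency:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<b class="boldest mybold" id="bold">text</b><i>plain</i>', "html.parser"
)

# get() returns the default instead of raising KeyError.
b_classes = soup.b.get("class", [])
i_classes = soup.i.get("class", [])
print(b_classes)  # ['boldest', 'mybold']
print(i_classes)  # []

# has_attr() tests for presence without reading the value.
print(soup.b.has_attr("id"), soup.i.has_attr("id"))  # True False
```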
- Text content
- Get the text value: tag.string
soup = BeautifulSoup('<b class="boldest mybold" id="bold dd">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
print(tag.string)
# Output: Extremely bold
print(soup.string)
# Output: None (body has several children, so there is no single string)
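The None result above follows from how `.string` works: it is only defined when a tag has exactly one child, and it follows a chain of single children downwards. A short sketch with made-up markup (parsed with `html.parser`) contrasts it with `get_text()`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>one</p><p>two</p></div><span><em>only</em></span>", "html.parser"
)

print(soup.div.string)      # None: <div> has two children
print(soup.div.get_text())  # onetwo
print(soup.span.string)     # only: <span> -> <em> -> string is a single-child chain
```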
- Replace the text content: tag.string.replace_with("replaced")
soup = BeautifulSoup('<b class="boldest mybold" id="bold dd">Extremely bold</b><b class="boldest1" id="bold1">Extremely bold1</b><p>test</p>','lxml')
tag = soup.b
tag.string.replace_with("replaced")
print(tag)
# Output: <b class="boldest mybold" id="bold dd">replaced</b>
Summary:
soup.a and similar attribute access finds only the first match;
soup.find_all('a') finds all matches.
- List a tag's direct children: head.contents
soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head></html>",'lxml')
head = soup.head
print(head)
# Output: <head><title>The Dormouse's story</title></head>
print(head.contents)
# Output: [<title>The Dormouse's story</title>]
print(head.contents[0].contents)
# Output: ["The Dormouse's story"]
- Iterate over a tag's direct children (an iterator): for i in head.children
soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head></html>",'lxml')
head = soup.head
print(head)
# Output: <head><title>The Dormouse's story</title></head>
for i in head.children:
    print(i)
# Output: <title>The Dormouse's story</title>
- Iterate over all descendants (a generator): for i in head.descendants
soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head></html>",'lxml')
head = soup.head
print(head)
# Output: <head><title>The Dormouse's story</title></head>
for i in head.descendants:
    print(i)
# Output:
# <title>The Dormouse's story</title>
# The Dormouse's story
- Iterate over all text content (a generator): for i in head.strings
soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>",'lxml')
head = soup.html
print(head)
# Output: <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
for i in head.strings:
    print(i)
# Output:
# The Dormouse's story
# ppppp
- Iterate over all text content with surrounding whitespace stripped and whitespace-only strings skipped: for i in head.stripped_strings
soup = BeautifulSoup("<html><head><title>  The Dormouse's story \n\r </title></head><p>ppppp</p></html>",'lxml')
head = soup.html
print(head)
for i in head.stripped_strings:
    print(i)
# Output:
# The Dormouse's story
# ppppp
- Get an element's direct parent: head.parent
soup.html.parent is the BeautifulSoup object itself, so printing it outputs the whole document.
soup.parent is None.
soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>",'lxml')
print(soup)
# Output: <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
head = soup.p
print(soup.p.string)
# Output: ppppp
print(soup.p.string.parent)
# Output: <p>ppppp</p>
print(head.parent)
# Output: <body><p>ppppp</p></body>
print(soup.html.parent)
# Output: <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
- Get all ancestors of a node (a generator): for i in head.parents
soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head><p>ppppp</p></html>",'lxml')
print(soup)
# Output: <html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
head = soup.p
print(head.parents)
# Output: <generator object parents at 0x7f6f4282ef68> (the address varies from run to run)
for i in head.parents:
    print(i)
# Ancestor-----Output:
# body-----<body><p>ppppp</p></body>
# html-----<html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
# soup-----<html><head><title>The Dormouse's story</title></head><body><p>ppppp</p></body></html>
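Printing each ancestor in full gets noisy on real pages. Printing only the `.name` of each ancestor shows the same chain more compactly; the BeautifulSoup object at the top reports its name as `[document]`. A sketch with made-up markup, parsed with `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><div><p>ppppp</p></div></body></html>", "html.parser"
)

# Walk upwards from <p> to the document root, collecting tag names.
names = [parent.name for parent in soup.p.parents]
print(names)  # ['div', 'body', 'html', '[document]']
```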
- Get sibling nodes (a sibling may be a text node such as a newline or a comma):
Next sibling: soup.b.next_sibling
Previous sibling: soup.c.previous_sibling
All siblings:
for sibling in soup.a.next_siblings
for sibling in soup.find(id="link3").previous_siblings
soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>",'lxml')
print(soup)
# Output: <html><body><a><b>text1</b><c>text2</c></a></body></html>
print(soup.b.next_sibling)
# Output: <c>text2</c>
print(soup.c.previous_sibling)
# Output: <b>text1</b>
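In real documents the next sibling is frequently a text node rather than a tag, which is the caveat noted at the top of this item. The sketch below uses made-up markup (parsed with `html.parser`) to show the difference between `.next_sibling` and `find_next_sibling()`, which skips text nodes:

```python
from bs4 import BeautifulSoup

markup = "<p><a id='link1'>Elsie</a>,\n<a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

# The node right after the first <a> is the string ",\n", not the second <a>.
print(repr(first.next_sibling))      # ',\n'

# find_next_sibling() only considers tags that match the filter.
print(first.find_next_sibling("a"))  # <a id="link2">Lacie</a>
```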
Summary:
Direct children of an element: head.contents / head.children
Direct parent of an element: head.parent
All descendants of an element: for i in head.descendants
All ancestors of an element: for i in head.parents
Adjacent sibling: .next_sibling / .previous_sibling
All siblings: .next_siblings / .previous_siblings
First matching element: soup.a
All matching elements: soup('a')
A single text value: soup.string
All text values: for i in head.strings
- Searching:
find_all
Note 1: calling find_all() on a tag searches all of that tag's descendants. To search only its direct children, pass recursive=False.
Note 2: soup.find_all("a") is equivalent to soup("a").
Note 3: find() returns only the first match.
Find all b tags: soup.find_all('b')
Find all tags whose name starts with b: soup.find_all(re.compile("^b"))
Find all tags whose name contains d: soup.find_all(re.compile("d"))
Find all a and b tags: soup.find_all(["a", "b"])
Find all b1 tags whose class is myclass: soup.find_all("b1", "myclass")
Find b1 tags whose id is myid: soup.find_all("b1", id="myid")
Find tags whose id is myid: soup.find_all(id="myid")
Find the first string containing "text" (note: the result is the string itself, not the enclosing tag): soup.find(string=re.compile("text"))
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
Find all tags whose href contains elsie: soup.find_all(href=re.compile("elsie"))
Match any value: soup.find_all(True)
soup.find_all(id=True)
Filter on several arguments: soup.find_all(href=re.compile("elsie"), id='link1')
Limit the number of results: soup.find_all("a", limit=2)
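import re
soup = BeautifulSoup("<a><b><b1 class='myclass' id='myid'>text1</b1><b2>text2</b2></b></a>",'lxml')
print(soup.find_all('b'))
# Output: [<b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b>]
print(soup.find_all(re.compile("^b")))
# Output: [<body><a><b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b></a></body>, <b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b>, <b1 class="myclass" id="myid">text1</b1>, <b2>text2</b2>]
print(soup.find_all(re.compile('d')))
# Output: [<body><a><b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b></a></body>]
print(soup.find_all(["a", "b"]))
# Output: [<a><b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b></a>, <b><b1 class="myclass" id="myid">text1</b1><b2>text2</b2></b>]
An attribute name such as data-foo is not a valid Python keyword argument, so data_soup.find_all(data-foo="value") raises a SyntaxError.
Pass such attributes through the attrs dictionary instead: data_soup.find_all(attrs={"data-foo": "value"})
soup.find_all("a", attrs={"class": "sister"})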
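The effect of `recursive=False`, `attrs` and `limit` is easiest to see side by side. The following sketch uses a small made-up document, parsed with `html.parser` to avoid an external dependency:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>t</title></head>"
    "<body><div data-foo='value'>foo!</div><a>1</a><a>2</a><a>3</a></body></html>",
    "html.parser",
)

# <title> is a grandchild of <html>, so a non-recursive search finds nothing.
print(soup.html.find_all("title", recursive=False))  # []
print(soup.html.find_all("title"))                   # [<title>t</title>]

# Attribute names that are not valid identifiers go through attrs.
print(soup.find_all(attrs={"data-foo": "value"}))    # [<div data-foo="value">foo!</div>]

# limit stops the search after the given number of matches.
print(len(soup.find_all("a", limit=2)))              # 2
```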
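find_parents() and find_parent()
Note: find_parents() returns a list of all matching ancestors, while find_parent() returns only the nearest matching ancestor (a single element, not a list).
find_next_siblings() and find_next_sibling()
Note: find_next_siblings() returns all matching siblings that follow the element; find_next_sibling() returns only the first matching one.
find_previous_siblings() and find_previous_sibling()
Note: find_previous_siblings() returns all matching siblings that precede the element; find_previous_sibling() returns only the first matching one.
find_all_next() and find_next()
Note: find_all_next() returns all matching nodes that come after the element in the document; find_next() returns the first matching one.
find_all_previous() and find_previous()
Note 1: find_all_previous() returns all matching nodes that come before the element in the document; find_previous() returns the first matching one.
Note 2: in the example below, first_link.find_all_previous("p") returns the first paragraph of the document (the one with class="title"), but it also returns the second paragraph, the <p> tag that contains the <a> tag we started from. This is expected: the call looks for every <p> tag that appears before the given <a> tag, and a <p> tag that contains that <a> tag necessarily starts before it.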
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister1" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
<p href="http://example.com/laci" class="sister" id="link4">Laci</p> and
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
a_string = soup.find(string="Lacie")
print(a_string)
# Lacie
a = a_string.find_parents("a")
print(a)
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
p = a_string.find_parent("p")
print(p)
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister1" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# <p class="sister" href="http://example.com/laci" id="link4">Laci</p> and
# and they lived at the bottom of a well.</p>
p1 = a_string.find_parents("p", class_="title")
print(p1)
# []
first_link = soup.a
first_link
# <a class="sister1" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
# <p class="title"><b>The Dormouse's story</b></p>]
first_link.find_previous("title")
# <title>The Dormouse's story</title>
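The sibling and forward-search variants listed above are not exercised by the example, so here is a short sketch of them. The markup is a cut-down, made-up version of the story document, parsed with `html.parser`:

```python
from bs4 import BeautifulSoup

html = (
    "<p class='story'><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a> and "
    "<a id='link3'>Tillie</a></p><p class='end'>...</p>"
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")
last = soup.find(id="link3")

# Sibling searches stay inside the same parent.
print([a["id"] for a in first.find_next_siblings("a")])     # ['link2', 'link3']
print(first.find_next_sibling("a")["id"])                   # link2
print([a["id"] for a in last.find_previous_siblings("a")])  # ['link2', 'link1']

# find_next / find_all_next walk forward through the whole document.
print(last.find_next("p"))                                  # <p class="end">...</p>
print([a["id"] for a in first.find_all_next("a")])          # ['link2', 'link3']
```

Note that the `previous` variants return results nearest first, which is why link2 comes before link1.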
select
Descend through tags level by level: soup.select("body a") / soup.select("html head title")
Find direct children of a tag: soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
Find sibling tags: soup.select("#link1 ~ .sister")  matches all following siblings with class sister
soup.select("#link1 + .sister")  matches only the immediately following sibling, and only if it has class sister
Find by CSS class: soup.select(".sister") / soup.select("[class~=sister]")
Find by id: soup.select("#link1") / soup.select("a#link2") / soup.select("#link1,#link2")
Find by the presence of an attribute: soup.select('a[href]')
Find by attribute value: soup.select('a[href="http://example.com/elsie"]')  exact match
soup.select('a[href^="http://example.com/"]')  prefix match
soup.select('a[href$="tillie"]')  suffix match
soup.select('a[href*=".com/el"]')  substring match
Find the first match: soup.select_one(".sister")
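A few of these selectors can be tried on a cut-down, made-up version of the story document. This sketch assumes a Beautiful Soup release that bundles the soupsieve selector engine (4.7 or later); `ids` is a hypothetical helper that reduces each result to its id values for readability.

```python
from bs4 import BeautifulSoup

html = (
    "<p class='story'>"
    "<a href='http://example.com/elsie' class='sister' id='link1'>Elsie</a>, "
    "<a href='http://example.com/lacie' class='sister' id='link2'>Lacie</a> and "
    "<a href='http://example.com/tillie' class='sister' id='link3'>Tillie</a>"
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Hypothetical helper: ids of all tags matching a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]


print(ids("p > a"))                    # ['link1', 'link2', 'link3']
print(ids("p > a:nth-of-type(2)"))     # ['link2']
print(ids("#link1 ~ .sister"))         # ['link2', 'link3'] (all later siblings)
print(ids("#link1 + .sister"))         # ['link2'] (adjacent sibling only)
print(ids('a[href$="tillie"]'))        # ['link3']
print(soup.select_one(".sister")["id"])  # link1
```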
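Getting text
Get the text: soup.get_text()
Get the text joined with a separator: soup.get_text("|")
Get the text joined with a separator, with whitespace stripped from each piece: soup.get_text("|", strip=True)
Use the .stripped_strings generator to collect the text into a list and process it yourself: [text for text in soup.stripped_strings]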
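The difference between these four calls is mostly about whitespace, which `repr()` makes visible. A sketch with made-up markup, parsed with `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>\n I linked to <i>example.com</i>\n</p>", "html.parser")

plain = soup.get_text()
joined = soup.get_text("|")
stripped = soup.get_text("|", strip=True)
pieces = [text for text in soup.stripped_strings]

print(repr(plain))     # '\n I linked to example.com\n'
print(repr(joined))    # '\n I linked to |example.com|\n'
print(repr(stripped))  # 'I linked to|example.com'
print(pieces)          # ['I linked to', 'example.com']
```

With `strip=True`, pieces that consist only of whitespace are dropped before joining, which is why the trailing newline produces no extra separator.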
Test whether an element has a class attribute but no id attribute, and find all tags that pass the test
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))
Find all tags whose href attribute does not match a given regular expression
import re
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
print(soup.find_all(href=not_lacie))
Find tags that have a string immediately before and immediately after them
from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))
for tag in soup.find_all(surrounded_by_strings):
    print(tag)
Match on the length of the class value
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
print(soup.find_all(class_=has_six_characters))
- Modifying the tree
Delete an attribute: del tag['class'] (del tag on its own only unbinds the Python name; to remove a tag from the tree use extract() or decompose())
Modify: assign with =
Append: append("Bar")
soup = BeautifulSoup("<a>Foo</a>", 'lxml')
soup.a.append("Bar")
soup
# <html><body><a>FooBar</a></body></html>
soup.a.contents
# ['Foo', 'Bar']
Add a string: append(new_string), where new_string can be a plain str or a NavigableString(" there")
from bs4 import NavigableString
soup = BeautifulSoup("<b></b>", 'lxml')
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# ['Hello', ' there']
Add a comment: soup.new_string("Nice to see you.", Comment)
from bs4 import Comment
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# ['Hello', ' there', 'Nice to see you.']
Create a tag: soup.new_tag("a", href="http://www.example.com")
soup = BeautifulSoup("<b></b>", 'lxml')
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
Append to the end: append()
Insert at a given position: insert()
Insert content before the current tag or text node: soup.b.string.insert_before(tag)  inserts tag before b.string
Insert content after the current tag or text node: soup.b.i.insert_after(soup.new_string(" ever "))
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
tag = soup.a
tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
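# ['I linked to ', 'but did not endorse ', <i>example.com</i>]
Remove a tag's contents: tag.clear()
Remove the current tag from the tree and return it (the return value is the removed tag, not the remaining tree): x = soup.i.extract()
Remove the current node from the tree and destroy it completely: soup.i.decompose()
Replace a piece of the tree with a new tag or text node: a_tag.i.replace_with(new_tag)
x = a_tag.i.replace_with(new_tag)  returns the node that was replaced
Wrap the given element in a tag and return the new wrapper: soup.p.string.wrap(soup.new_tag("b"))
Replace a tag with its own contents (the tag is removed, its text and children are kept); this is often used to strip markup: a_tag.i.unwrap()
x = a_tag.i.unwrap()  returns the tag that was removed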
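The methods listed above can be chained on one small tree to see what each returns and what it leaves behind. This is a sketch with made-up markup, parsed with `html.parser` so that no html/body wrapper is added to the output:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/">I linked to <i>example.com</i></a>',
    "html.parser",
)
a_tag = soup.a

# replace_with() swaps a node for another and returns the node it removed.
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
old = a_tag.i.replace_with(new_tag)
print(old)    # <i>example.com</i>
print(a_tag)  # <a href="http://example.com/">I linked to <b>example.net</b></a>

# unwrap() removes the tag itself but keeps its contents in place.
a_tag.b.unwrap()
print(a_tag)  # <a href="http://example.com/">I linked to example.net</a>

# wrap() encloses the element in a new tag and returns the wrapper.
wrapper = a_tag.wrap(soup.new_tag("div"))
print(wrapper)  # <div><a href="http://example.com/">I linked to example.net</a></div>

# extract() detaches a node from the tree and returns it.
detached = soup.a.extract()
print(soup)     # <div></div>

# clear() empties a tag but keeps the tag and its attributes.
detached.clear()
print(detached)  # <a href="http://example.com/"></a>
```

`decompose()` is not shown because, unlike `extract()`, it leaves nothing usable to inspect afterwards.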
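- Output
Pretty-printed output: soup.prettify()
- Encoding detection
- BeautifulSoup automatically detects the document's encoding by guessing.
Detected encoding: soup.original_encoding
Specify the encoding: soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
Exclude an encoding from the guesses: soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
Pass an encoding to prettify(): soup.prettify("latin-1")
Encode a single node: soup.p.encode("utf-8")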
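A short byte string is hard to guess correctly, which is when `from_encoding` helps. The sketch below passes ISO-8859-8 bytes (a made-up four-letter Hebrew sample) and checks that they were decoded as intended; `html.parser` is used to avoid an external dependency.

```python
from bs4 import BeautifulSoup

# Four Hebrew letters encoded as ISO-8859-8; too short to detect reliably.
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

# The encoding that was actually used to decode the input.
print(soup.original_encoding)

# The parsed string matches a manual decode of the same bytes.
expected = markup[4:8].decode("iso-8859-8")
print(soup.h1.string == expected)  # True

# encode() serializes a single node to bytes in the requested encoding.
encoded = soup.h1.encode("utf-8")
print(encoded == ("<h1>" + expected + "</h1>").encode("utf-8"))  # True
```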