Beautiful Soup4学习笔记（三）：遍历文档树

还是之前的字符串作为栗子：

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

通过这段例子来演示怎样从文档的一段内容找到另一段内容

子节点

一个Tag可能包含多个字符串或其他的Tag，这些都是这个Tag的子节点。Beautiful Soup提供了许多操作和遍历子节点的属性.

注意: Beautiful Soup中字符串节点不支持这些属性,因为字符串没有子节点

tag的名字

操作文档树最简单的方法就是告诉它你想获取的tag的name.如果想获取 <head> 标签,只要用 soup.head :

>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.title
<title>The Dormouse's story</title>

这是个获取tag的小窍门,可以在文档树的tag中多次调用这个方法.下面的代码可以获取<body>标签中的第一个标签:

>>> soup.body.b
<b>The Dormouse's story</b>

通过点取属性的方式只能获得当前名字的第一个tag：

>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

如果想要得到所有的<a>标签,或是通过名字得到比一个tag更多的内容的时候,就需要用到 Searching the tree 中描述的方法,比如: find_all()

>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/l
acie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents 和 .children

tag的.contents属可以将tag的子节点以列表的方式输出：

>>> head_tag = soup.head  
>>> head_tag
<head><title>The Dormouse's story</title></head>
>>> head_tag.contents
[<title>The Dormouse's story</title>]
>>> title_tag = head_tag.contents[0]  
>>> title_tag 
<title>The Dormouse's story</title>
>>> title_tag .contents
["The Dormouse's story"]

BeautifulSoup 对象本身一定会包含子节点,也就是说<html>标签也是 BeautifulSoup 对象的子节点:

len(soup.contents)
# 1
soup.contents[0].name
# u'html'

这里，我以字符串的形式操作，是有空格存在的。

字符串没有 .contents属性，因为字符串没有子节点：

>>> text = title_tag.contents[0]  
>>> text.contents  
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    text.contents
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 730, in __getattr__
    self.__class__.__name__, attr))
AttributeError: 'NavigableString' object has no attribute 'contents'

通过tag的.children生成器，可以对tag的子节点进行循环：

>>> for child in title_tag.children: 
...     print(child)  
...      
...    
... 
The Dormouse's story

.descendants

.contents 和 .children 属性仅包含tag的直接子节点.例如,<head>标签只有一个直接子节点<title>

>>> head_tag.contents
[<title>The Dormouse's story</title>]

但是<title>标签也包含一个子节点:字符串 “The Dormouse’s story”,这种情况下字符串 “The Dormouse’s story”也属于<head>标签的子孙节点. .descendants 属性可以对所有tag的子孙节点进行递归循环 :

>>> for child in head_tag.descendants:
...     print(child)  
...       
<title>The Dormouse's story</title>
The Dormouse's story

上面的例子中, <head>标签只有一个子节点,但是有2个子孙节点:<head>节点和<head>的子节点, BeautifulSoup 有一个直接子节点(<html>节点),却有很多子孙节点:

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25

这里的还是和用字符串的不一样，还是复制了文档。

.string

如果tag只有一个 NavigableString 类型子节点,那么这个tag可以使用 .string 得到子节点:

>>> title_tag.string 
"The Dormouse's story"

如果一个tag仅有一个子节点,那么这个tag也可以使用 .string 方法,输出结果与当前唯一子节点的 .string 结果相同:

>>> head_tag.contents
[<title>The Dormouse's story</title>]
>>> head_tag.string 
"The Dormouse's story"

如果tag包含了多个子节点,tag就无法确定 .string 方法应该调用哪个子节点的内容, .string 的输出结果是 None :

>>> print(soup.html.string) 
None

.strings 和 stripped_strings

如果tag中包含多个字符串，可以使用 .strings来循环获取：

>>> for string in soup.strings: 
...     print(repr(string)) 
...      

'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'

注意：看到了吧，多了两个换行符和文档相比

输出的字符串中可能包含了很多空格或空行,使用 .stripped_strings 可以去除多余空白内容:

>>> for string in soup.stripped_strings: 
...     print(repr(string))  
...     
... 
"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'

全部是空格的行会被忽略掉,段首和段末的空白会被删除

父节点

在文档树中每个tag或字符串都有父节点：即被包含在某个tag中

.parent

通过 .parent 属性来获取某个元素的父节点.在文档中,<head>标签是<title>标签的父节点:

>>> title_tag = soup.title 
>>> title_tag 
<title>The Dormouse's story</title>
>>> title_tag .parent 
<head><title>The Dormouse's story</title></head>

文档title的字符串也有父节点:<title>标签

>>> title_tag.string.parent 
<title>The Dormouse's story</title>

文档的顶层节点比如<html>的父节点是 BeautifulSoup 对象:

>>> html_tag = soup.html 
>>> type(html_tag.parent) 
<class 'bs4.BeautifulSoup'>

BeautifulSoup 对象的 .parent 是None:

>>> print(soup.parent) 
None

.parents

通过元素的 .parents 属性可以递归得到元素的所有父辈节点,下面的例子使用了 .parents 方法遍历了<a>标签到根节点的所有节点.

>>> link = soup.a 
>>> link 
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> for parent in link.parents: 
...     if parent is None:
...         print(parent) 
...     else:
...         print(parent.name) 
... 
p
body
html
[document]

兄弟节点

看一段简单的栗子：

>>> sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>") 
>>> print(sibling_soup.prettify())
<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>

因为标签和<c>标签是同一层:他们是同一个元素的子节点,所以和<c>可以被称为兄弟节点.一段文档以标准格式输出时,兄弟节点有相同的缩进级别.在代码中也可以使用这种关系.

.next_sibling 和 .previous_sibling

在文档树中,使用 .next_sibling 和 .previous_sibling 属性来查询兄弟节点:

>>> sibling_soup.b.next_sibling
<c>text2</c>
>>> sibling_soup.c.previous_sibling
<b>text1</b>

标签有 .next_sibling 属性,但是没有 .previous_sibling 属性,因为标签在同级节点中是第一个.同理,<c>标签有 .previous_sibling 属性,却没有 .next_sibling 属性:

>>> print(sibling_soup.b.previous_sibling) 
None
>>> print(sibling_soup.c.next_sibling) 
None

例子中的字符串“text1”和“text2”不是兄弟节点,因为它们的父节点不同:

>>> sibling_soup.b.string
'text1'
>>> print(sibling_soup.b.string.next_sibling) 
None

实际文档中的tag的 .next_sibling 和 .previous_sibling 属性通常是字符串或空白. 看看看文档:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

如果以为第一个<a>标签的 .next_sibling 结果是第二个<a>标签,那就错了,真实结果是第一个<a>标签和第二个<a>标签之间的顿号和换行符:

>>> link = soup.a  
>>> link 
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> link.next_sibling 
',\n'

第二个<a>标签是顿号的 .next_sibling 属性:

>>> link.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.next_siblings 和 .previous_siblings

通过 .next_siblings 和 .previous_siblings 属性可以对当前节点的兄弟节点迭代输出:

>>> for sibling in soup.a.next_siblings:
...     print(repr(sibling))
...      
...  
... 
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
>>> for sibling in soup.find(id="link3").previous_siblings:
...     print(repr(sibling))
...     
... 
' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'

回退和前进

看一下文档：

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

HTML解析器把这段字符串转换成一连串的事件: “打开<html>标签”,”打开一个<head>标签”,”打开一个<title>标签”,”添加一段字符串”,”关闭<title>标签”,”打开标签”,等等.Beautiful Soup提供了重现解析器初始化过程的方法.

.next_element 和 .previous_element

.next_element 属性指向解析过程中下一个被解析的对象(字符串或tag),结果可能与 .next_sibling 相同,但通常是不一样的.

这是“爱丽丝”文档中最后一个<a>标签,它的 .next_sibling
结果是一个字符串，当前的解析过程因为遇到了<a>标签而中断了:

>>> last_a_tag = soup.find("a",id="link3") 
>>> last_a_tag
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>> last_a_tag.next_sibling
';\nand they lived at the bottom of a well.'

但这个<a>标签的 .next_element 属性结果是在<a>标签被解析之后的解析内容,不是<a>标签后的句子部分,应该是字符串”Tillie”:

>>> last_a_tag.next_element
'Tillie'

这是因为在原始文档中,字符串“Tillie” 在分号前出现,解析器先进入<a>标签,然后是字符串“Tillie”,然后关闭</a>标签,然后是分号和剩余部分.分号与<a>标签在同一层级,但是字符串“Tillie”会被先解析.
.previous_element 属性刚好与 .next_element 相反,它指向当前被解析的对象的前一个解析对象:

last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements 和 .previous_elements

通过 .next_elements 和 .previous_elements 的迭代器就可以向前或向后访问文档的解析内容,就好像文档正在被解析一样:

>>> for element in last_a_tag.next_elements:
...     print(repr(element)) 
...     
... 
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'

最后编辑于：2017.12.06 00:26:37

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 202,802评论 5赞 476
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 85,109评论 2赞 379
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 149,683评论 0赞 335
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,458评论 1赞 273
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,452评论 5赞 364
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,505评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,901评论 3赞 395
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,550评论 0赞 256
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,763评论 1赞 296
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,556评论 2赞 319
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,629评论 1赞 329
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,330评论 4赞 318
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,898评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,897评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,140评论 1赞 259
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 42,807评论 2赞 349
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,339评论 2赞 342