Tools
Using BeautifulSoup
from bs4 import BeautifulSoup
html_simple = '\
<html>\
<body>\
<h1 id="title">Hello World</h1>\
<a href="#" class="link">This is link1</a>\
<a href="#link2" class="link">This is link2</a>\
</body>\
</html>'
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(html_simple, "html.parser")
print(soup.text)
Output:
Hello WorldThis is link1This is link2
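In the output above the three text fragments run together because the markup has no whitespace between tags. A minimal sketch of two ways to keep them apart, assuming the same sample markup (written here with `href` attributes) and the built-in `html.parser`; the variable names `spaced` and `pieces` are only illustrative:

```python
from bs4 import BeautifulSoup

# Same sample document as above, built by implicit string concatenation
html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)

soup = BeautifulSoup(html_simple, "html.parser")

# get_text() accepts a separator that is placed between text fragments
spaced = soup.get_text(separator=" ")
print(spaced)

# stripped_strings yields each fragment with surrounding whitespace removed
pieces = list(soup.stripped_strings)
print(pieces)
```

`soup.text` is shorthand for `get_text()` with no separator, which is why the plain version concatenates everything.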
Selecting elements by tag name
select
soup = BeautifulSoup(html_simple, "html.parser")
header = soup.select("h1")  # select() always returns a list
print(header)
print(header[0])
print(header[0].text)
alink = soup.select("a")
print(alink)
for link in alink:
    print(link)
    print(link.text)
Output:
[<h1 id="title">Hello World</h1>] // a list
<h1 id="title">Hello World</h1> // the first element
Hello World // the text
[<a class="link" href="#">This is link1</a>, <a class="link" href="#link2">This is link2</a>]
<a class="link" href="#">This is link1</a>
This is link1
<a class="link" href="#link2">This is link2</a>
This is link2
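Because `select` always returns a list, a single match still has to be indexed with `[0]`. As a hedged alternative, the sketch below uses `select_one`, `find`, and `find_all`, which are also part of bs4; it assumes the same sample markup and the `html.parser` backend, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_simple, "html.parser")

# select_one() returns the first match directly, or None if nothing matches
header = soup.select_one("h1")
missing = soup.select_one("p")
print(header.text, missing)

# find() / find_all() look elements up by tag name without CSS syntax
first_h1 = soup.find("h1")
links = soup.find_all("a")
link_texts = [a.text for a in links]
print(link_texts)
```

Checking for `None` before touching `.text` avoids an `AttributeError` when the page does not contain the tag.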
Selecting elements by CSS id and class
header = soup.select("#title")  # prefix an id with #
print(header)
print(header[0])
print(header[0].text)
alink = soup.select(".link")  # prefix a class with .
print(alink)
for link in alink:
    print(link)
    print(link.text)
Output:
[<h1 id="title">Hello World</h1>]
<h1 id="title">Hello World</h1>
Hello World
[<a class="link" href="#">This is link1</a>, <a class="link" href="#link2">This is link2</a>]
<a class="link" href="#">This is link1</a>
This is link1
<a class="link" href="#link2">This is link2</a>
This is link2
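Tag, id, and class selectors can also be combined in one expression. The following is a small sketch, assuming a reasonably recent bs4 whose `select` supports standard CSS combinators and attribute selectors, and using the same sample markup; the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_simple, "html.parser")

# Tag plus class: only <a> elements that carry class="link"
tagged = soup.select("a.link")

# Attribute selector: the <a> whose href is exactly "#link2"
exact = soup.select('a[href="#link2"]')

# Child combinator: <a> elements that are direct children of <body>
children = soup.select("body > a")

print(len(tagged), exact[0].text, len(children))
```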
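Difference between id and class
id: a unique identifier; it should appear only once in a document
class: a reusable identifier; any number of elements can share it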
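The distinction also shows up in how bs4 stores the two attributes. A minimal sketch, assuming the same sample markup and bs4's default settings (under which `class` is treated as a multi-valued attribute); the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_simple, "html.parser")

# An id is expected to be unique, so one lookup is enough
title = soup.select_one("#title")
title_id = title["id"]          # a plain string

# A class may repeat, so the selector returns every element that carries it
links = soup.select(".link")
link_classes = [a["class"] for a in links]  # each value is a list of class names

print(title_id, len(links), link_classes)
```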
Getting the link from a tag
alink = soup.select(".link")  # prefix a class with .
print(alink)
for link in alink:
    print(link["href"])
Output:
#
#link2
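Indexing with `link["href"]` raises `KeyError` when a tag lacks that attribute. A hedged sketch of more forgiving access through `get` and `attrs`, assuming the same sample markup; `data-id` is a made-up attribute name used only to show the missing case:

```python
from bs4 import BeautifulSoup

html_simple = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_simple, "html.parser")

links = soup.select(".link")

# get() returns None (or a supplied default) instead of raising KeyError
hrefs = [a.get("href") for a in links]
absent = links[0].get("data-id")            # hypothetical attribute, not in the markup
fallback = links[0].get("data-id", "n/a")

# attrs exposes all attributes of a tag as a dict
first_attrs = links[0].attrs

print(hrefs, absent, fallback, first_attrs)
```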
Attributes are stored in dictionary form, so you can