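Notes
Basic steps for scraping a web page:
- Parse the page with BeautifulSoup
  - Soup = BeautifulSoup(html, 'lxml')
- Describe where the content you want is located
  - CSS Selector (who, where, which one, what it looks like)
  - XPath (who, where, which one)
- Extract the needed information from the tags and pack it into a data container
  - xxx = Soup.select('???')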
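The three steps above can be sketched roughly as follows. The HTML snippet, the `div.item` class, and the `records` name are made up for illustration. The sketch uses the built-in 'html.parser' so that it only needs bs4; 'lxml' from the notes should behave the same way when it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
html = """
<html><body>
  <div class="item"><h4><a href="/a">First</a></h4></div>
  <div class="item"><h4><a href="/b">Second</a></h4></div>
</body></html>
"""

# Step 1: parse the page.
soup = BeautifulSoup(html, 'html.parser')

# Step 2: describe where the targets are with a CSS selector.
# select() returns a list of every matching tag.
links = soup.select('div.item > h4 > a')

# Step 3: pull the needed information out of each tag and
# pack it into a data container (a list of dicts here).
records = [{'title': a.get_text(), 'url': a.get('href')} for a in links]

for record in records:
    print(record)
```

Because select() returns a list, the per-tag methods are called on each element of the list rather than on the list itself.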
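Ways to get information out of a tag (xxx here is a single tag, e.g. one item of the list returned by select()):
- Get the text: xxx.get_text()
- Get an attribute of the tag: xxx.get('attributeName')
- Get the text of all child tags as a list: list(xxx.stripped_strings)
- Find tags by their characteristics: find_all()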
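A small sketch of the four methods listed above, run against a made-up snippet. The markup and the class names (`card`, `name`, `tag`) are hypothetical, and 'html.parser' is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
html = """
<div class="card">
  <a class="name" href="/item/1">Item One</a>
  <p><span class="tag">new</span> <span class="tag">sale</span></p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list, so take one tag out of it first.
card = soup.select('div.card')[0]

# find_all(): look up tags by name and class.
link = card.find_all('a', class_='name')[0]
tags = card.find_all('span', class_='tag')

# get_text(): the text inside one tag.
text = link.get_text()

# get(): the value of one attribute.
href = link.get('href')

# stripped_strings: every text fragment under the tag,
# with surrounding whitespace removed and blank fragments skipped.
strings = list(card.stripped_strings)

print(text, href, strings, len(tags))
```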
Homework
Code:
from bs4 import BeautifulSoup

# Parse the local page with the lxml parser
with open('index.html', 'r') as web_file:
    soup = BeautifulSoup(web_file, 'lxml')

# Each selector returns a list with one entry per product
titles = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
images = soup.select('body > div > div > div.col-md-9 > div > div > div > img')
reviews = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
prices = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')

# Count the filled star icons in each rating block
starCnts = []
for star in stars:
    starCnt = len(star.find_all(class_="glyphicon glyphicon-star"))
    # append keeps the counts in page order so zip() pairs them correctly
    starCnts.append(starCnt)

# Combine the parallel lists into one dict per product
for title, image, review, price, starCnt in zip(titles, images, reviews, prices, starCnts):
    obj = {
        'title': title.get_text(),
        'image': image.get('src'),
        'review': review.get_text(),
        'price': price.get_text(),
        'starCnt': starCnt,
    }
    print(obj)
Output:
{'image': 'img/pic_0000_073a9256d9624c92a05dc680fc28865f.jpg', 'title': 'EarPod', 'starCnt': 4, 'price': '$24.99', 'review': '65 reviews'}
{'image': 'img/pic_0005_828148335519990171_c234285520ff.jpg', 'title': 'New Pocket', 'starCnt': 4, 'price': '$64.99', 'review': '12 reviews'}
{'image': 'img/pic_0006_949802399717918904_339a16e02268.jpg', 'title': 'New sunglasses', 'starCnt': 3, 'price': '$74.99', 'review': '31 reviews'}
{'image': 'img/pic_0008_975641865984412951_ade7a767cfc8.jpg', 'title': 'Art Cup', 'starCnt': 4, 'price': '$84.99', 'review': '6 reviews'}
{'image': 'img/pic_0001_160243060888837960_1c3bcd26f5fe.jpg', 'title': 'iphone gamepad', 'starCnt': 4, 'price': '$94.99', 'review': '18 reviews'}
{'image': 'img/pic_0002_556261037783915561_bf22b24b9e4e.jpg', 'title': 'Best Bed', 'starCnt': 4, 'price': '$214.5', 'review': '18 reviews'}
{'image': 'img/pic_0011_1032030741401174813_4e43d182fce7.jpg', 'title': 'iWatch', 'starCnt': 4, 'price': '$500', 'review': '35 reviews'}
{'image': 'img/pic_0010_1027323963916688311_09cc2d7648d9.jpg', 'title': 'Park tickets', 'starCnt': 5, 'price': '$15.5', 'review': '8 reviews'}
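The homework code builds five parallel lists and zips them together, which only works if every list has exactly one entry per product, in the same order. One possible alternative, sketched here under assumptions, is to loop over each product block once and read all fields from inside it. The markup below is hypothetical and only loosely modeled on the class names in the selectors above (`caption`, `ratings`, `pull-right`, `glyphicon-star`); the real index.html may be structured differently. 'html.parser' is used so the sketch needs only bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch.
html = """
<div class="col-md-9">
  <div>
    <img src="img/a.jpg"/>
    <div class="caption">
      <h4 class="pull-right">$1.00</h4>
      <h4><a href="#">Item A</a></h4>
    </div>
    <div class="ratings">
      <p class="pull-right">2 reviews</p>
      <p>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star-empty"></span>
      </p>
    </div>
  </div>
  <div>
    <img src="img/b.jpg"/>
    <div class="caption">
      <h4 class="pull-right">$2.50</h4>
      <h4><a href="#">Item B</a></h4>
    </div>
    <div class="ratings">
      <p class="pull-right">7 reviews</p>
      <p>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
      </p>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

products = []
for caption in soup.select('div.caption'):
    # Assumption: the parent of each caption wraps one whole product.
    card = caption.parent
    products.append({
        'title': card.select_one('div.caption > h4 > a').get_text(),
        'image': card.select_one('img').get('src'),
        'review': card.select_one('div.ratings > p.pull-right').get_text(),
        'price': card.select_one('div.caption > h4.pull-right').get_text(),
        # .glyphicon-star matches only the filled icon class,
        # not glyphicon-star-empty.
        'starCnt': len(card.select('div.ratings .glyphicon-star')),
    })

for product in products:
    print(product)
```

Reading every field from the same block means a product with a missing image or rating cannot shift the other fields out of alignment, although select_one() returns None for a missing tag, so real pages may need a check before calling get_text().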