Scrape the first three pages of short-term rental listings, store the data in MongoDB, and print the listings priced at 500 yuan or more.
Code:
import requests
from bs4 import BeautifulSoup
import pymongo

# Connect to the local MongoDB server, then select the database and collection
client = pymongo.MongoClient('localhost')
duanzufang = client['dzf']
listings = duanzufang['list']  # named `listings` so the built-in `list` is not shadowed

# URLs of search result pages 1 to 3
urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 4)]
head = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'}

for url in urls:
    wb_data = requests.get(url, headers=head)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.find_all('span', {'class': 'result_title hiddenTxt'})
    infos = soup.find_all('em', {'class': 'hiddenTxt'})
    prices = soup.find_all('span', {'class': 'result_price'})
    for title, info, price in zip(titles, infos, prices):
        # Strip newlines and spaces once, then split into type, comment count, address
        parts = info.get_text().replace('\n', '').replace(' ', '').split('-')
        data = {
            'title': title.get_text(),
            'typo': parts[0],
            'comment_num': parts[1],
            'address': parts[2],
            'price': int(price.i.get_text())
        }
        listings.insert_one(data)

# Print every stored listing priced at 500 yuan or more
for item in listings.find({'price': {'$gte': 500}}):
    print(item)
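The scraping loop above splits each listing's info text on '-' to get the type, comment count, and address. If an address itself contains a '-', a plain split would cut it short. The helper below is a minimal sketch of one way to guard against that; `parse_info` is a hypothetical name, and the sample string is made up rather than taken from the site.

```python
def parse_info(text):
    """Split a listing's info text into type, comment count, and address."""
    cleaned = text.replace('\n', '').replace(' ', '')
    # maxsplit=2 keeps any further '-' characters inside the address field
    parts = cleaned.split('-', 2)
    if len(parts) != 3:
        raise ValueError('unexpected info text: {!r}'.format(text))
    typo, comment_num, address = parts
    return {'typo': typo, 'comment_num': comment_num, 'address': address}


# Made-up sample text (an assumption about the format, not real site data)
sample = 'Entire home - 12 reviews - Chaoyang District, Building 3-2\n'
print(parse_info(sample))
```

Raising a ValueError on malformed text makes a layout change on the site show up immediately instead of as an IndexError deep inside the loop.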
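Summary:
1. Understood the structure of the web page.
2. Learned how to use the find family of functions by studying the Bs4 documentation.
3. Learned how to set up a database and insert data into it.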
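Question:
I would like to crawl every page rather than just the first three, but the page numbers at the bottom of the page change dynamically. Teacher, how should I crawl the site in this situation?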
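One possible approach, sketched here under assumptions that have not been checked against the site: keep increasing the page number in the URL and stop as soon as a page yields no listings. This assumes that requesting a page past the last one returns an empty result. `crawl_all_pages` and `fetch_listings` are hypothetical names, and `fake_fetch` is a stand-in so the sketch runs without network access.

```python
def crawl_all_pages(fetch_listings, max_pages=100):
    """Collect listings page by page until a page comes back empty.

    fetch_listings is a hypothetical callable: it takes a page number and
    returns the list of listings parsed from that page.
    """
    results = []
    for page in range(1, max_pages + 1):
        page_listings = fetch_listings(page)
        if not page_listings:
            # Assumption: an empty page means we are past the last page
            break
        results.extend(page_listings)
    return results


# Stand-in fetcher: pretends the site has 3 pages with 2 listings each
def fake_fetch(page):
    if page > 3:
        return []
    return ['listing {}-{}'.format(page, n) for n in (1, 2)]


all_listings = crawl_all_pages(fake_fetch)
print(len(all_listings))
```

In a real crawler, `fetch_listings` would wrap the `requests.get` and `find_all` calls from the code above. The `max_pages` cap is there so the loop still ends if the site keeps returning results for out-of-range pages.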