Scraped the detailed information of 300 listings. Determining the host's gender was solved quickly, but getting the link to the listing photo took some effort; for now I use string slicing on the inline style attribute. When grabbing the 300 detail-page links I couldn't locate them at first, but eventually found the right elements and appended them in a loop, stopping once the count exceeds 300.
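Below is a minimal, standalone sketch of that slicing idea, cutting from the position of "url(" instead of a fixed index (the full script is in the code section further down). The sample style string here is invented, so the real attribute on the listing page may look different.

style = 'background:url(/images/house/12345/1.jpg) no-repeat;'   # invented example value
# keep only what sits between "url(" and the closing ")"
pic_path = style[style.find('url(') + 4 : style.find(')')]
full_url = 'http://bj.xiaozhu.com' + pic_path
print(full_url)   # http://bj.xiaozhu.com/images/house/12345/1.jpg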
My results
My code
from bs4 import BeautifulSoup
import requests
headers={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36'}

def get_links():
    # Collect detail-page links page by page and stop once more than 300 are stored.
    link_list = []
    urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 20)]
    for url in urls:
        # Re-check the length on every page; a count taken once before the loop never updates.
        if len(link_list) > 300:
            break
        wb_data = requests.get(url, headers=headers)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        links = soup.select('#page_list > ul > li > div.result_btm_con.lodgeunitname')
        for link in links:
            link_list.append(link.get('detailurl'))
    return link_list

def get_info(url):
    # Test URLs:
    # url = 'http://bj.xiaozhu.com/fangzi/4131080529.html'
    # url = 'http://bj.xiaozhu.com/fangzi/3828318529.html'
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4 > em')
    areas = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > p > span')
    prices = soup.select('#pricePart > div.day_l > span')
    housepics = soup.select('#imgMouseCusor')
    hostimgs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
    hostnames = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
    hostsexes = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')
    for title, area, price, housepic, hostimg, hostname, hostsex in zip(titles, areas, prices, housepics, hostimgs, hostnames, hostsexes):
        # .get('class') returns a list of class names, so test membership rather than compare to a string.
        sex = 'male' if 'member_ico1' in hostsex.get('class') else 'female'
        style = housepic.get('style')
        data = {
            'title': title.get_text(),
            'area': area.get_text(),
            'price': price.get_text(),
            # String-slice the image path out of the inline style attribute (characters 16 to -2).
            'housepic': 'http://bj.xiaozhu.com' + style[16:-2],
            'hostimg': hostimg.get('src'),
            'hostname': hostname.get_text(),
            'hostsex': sex
        }
        print(data)

link_list = get_links()
for link in link_list[:300]:
    get_info(link)
Summary
- I still need to check the reference answer for how to get the listing photo from the detail page (a regex-based sketch follows this list).
- For level2 I split the work into two functions: one that collects the detail links, stores them in a list, and returns it, and one that crawls each detail page in turn.
- Grabbing the 300 links takes two nested loops: the outer one turns the pages and the inner one stores the links found on each page.
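For the first point above, one alternative to fixed-index slicing is a regular expression over the inline style. This is only a sketch with an invented input string, not the course's reference answer.

import re

def pic_from_style(style):
    # Pull out whatever sits inside url(...) in an inline style string.
    match = re.search(r'url\((.*?)\)', style)
    return match.group(1) if match else None

# Invented example input; the real style attribute on the detail page may differ.
print(pic_from_style('background:url(/images/house/12345/1.jpg) no-repeat;'))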