Recently, to strengthen my data collection and analysis skills, I started down the path of learning web scraping. After working through some online tutorials and a few reference books (I recommend 《Python网络数据采集》, i.e. Web Scraping with Python), I managed to write a reasonably robust Python crawler. I am writing this series of posts on Python crawlers to share the pitfalls I ran into and to put what I learned down in writing. Since there are already plenty of beginner tutorials online, I will not repeat the basics and will start from the first pitfall I hit.
Getting and testing proxy IPs
The first problem many crawler writers run into is this: after scraping a site continuously for a while, the site's anti-crawler mechanism starts returning bad responses, either a 503 error or a forced redirect to the login page, which breaks your element extraction. Proxy IPs can help us get around this. Below I use scraping all the job listings on Lagou (lagou.com) as an example of how to solve it.
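For reference, both failure modes are easy to detect in code. Here is a rough sketch (the check on the final URL is only an assumption about how the forced login redirect shows up; adjust it to whatever the site actually does):

import requests

response = requests.get('https://www.lagou.com/zhaopin/Python/', timeout=10)

# two common symptoms of being rate-limited or blocked:
if response.status_code == 503:
    print('got a 503, time to back off or switch to a proxy')
elif 'login' in response.url:
    # requests follows redirects by default, so a forced jump to the
    # login page shows up in the final URL
    print('redirected to the login page, the crawler has been flagged')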
Paid proxy IPs are generally expensive, so to save money we can use sites that hand out free proxy IPs; the one used below is Xici proxy (xicidaili.com). Its proxies are free but very unstable, so every one of them has to be tested before use.
import requests
from lxml import etree
from bs4 import BeautifulSoup
from get_headers import GetHeaders
class GetProxy():
    def getproxy(self):
        urls=['http://www.xicidaili.com/nn/',
              'http://www.xicidaili.com/nt/',
              'http://www.xicidaili.com/wn/',
              'http://www.xicidaili.com/wt/']
        header=GetHeaders().getHeaders()
        proxies=[]
        for url in urls:
            print(url)
            s = requests.get(url,headers=header)
            html=etree.HTML(s.text)
            # the IP and port sit in the two td cells right after the country cell
            ips=html.xpath('//*[@class="country"][1]/following-sibling::td[1]/text()')
            ports=html.xpath('//*[@class="country"][1]/following-sibling::td[2]/text()')
            for i in range(0,int(len(ips)/2)):
                proxies.append(ips[i]+':'+ports[i])
        print('ok!')
        # test which of the collected proxies actually work
        proxies_useful=[]
        for proxy in proxies:
            proxy_http={
                'http':"http://"+proxy,
                'https':"http://"+proxy,
            }
            title=''
            try:
                s=requests.get('http://music.163.com/',headers=header,proxies=proxy_http,timeout=2)
                title=BeautifulSoup(s.text,'lxml').h1.text
                if title.strip()=='网易云音乐':
                    print('correct:'+proxy)
                    proxies_useful.append(proxy)
            except Exception as e:
                print('error:'+proxy)
                continue
        print('Test Done!!')
        # wrap the working proxies in the dict format that requests expects
        proxy_list=[]
        for proxy in proxies_useful:
            proxy_http={
                'http':"http://"+proxy,
                'https':"http://"+proxy,
            }
            proxy_list.append(proxy_http)
        return proxy_list
header=GetHeaders().getHeaders()
This is a helper function I wrote to build request headers; it is covered in the next section.
urls=['http://www.xicidaili.com/nn/',
      'http://www.xicidaili.com/nt/',
      'http://www.xicidaili.com/wn/',
      'http://www.xicidaili.com/wt/']
urls covers the four proxy categories on Xici. After repeated testing I found that, for each category, basically only the first fifty proxies on the first page have a decent success rate; the rest fail too often to be worth the time, which is why the collection loop only keeps the first half of the entries on each page.
for url in urls:
    print(url)
    s = requests.get(url,headers=header)
    html=etree.HTML(s.text)
    ips=html.xpath('//*[@class="country"][1]/following-sibling::td[1]/text()')
    ports=html.xpath('//*[@class="country"][1]/following-sibling::td[2]/text()')
    for i in range(0,int(len(ips)/2)):
        proxies.append(ips[i]+':'+ports[i])
This snippet scrapes the proxy IPs from the four pages. etree (from lxml) locates elements with XPath expressions; it is extremely handy and I strongly recommend it.
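To make that XPath concrete, here is a tiny self-contained sketch. The table row below is a made-up fragment in the same shape as Xici's listing (a td with class "country" holding the flag, followed by the IP and port cells), not the live page:

from lxml import etree

sample_html = '''
<table>
  <tr>
    <td class="country"><img alt="Cn"/></td>
    <td>123.45.67.89</td>
    <td>8080</td>
  </tr>
</table>
'''

html = etree.HTML(sample_html)
# following-sibling::td[1] is the cell right after the country cell (the IP),
# following-sibling::td[2] is the one after that (the port)
ips = html.xpath('//*[@class="country"][1]/following-sibling::td[1]/text()')
ports = html.xpath('//*[@class="country"][1]/following-sibling::td[2]/text()')
print(ips, ports)   # ['123.45.67.89'] ['8080']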
proxies_useful=[]
for proxy in proxies:
    proxy_http={
        'http':"http://"+proxy,
        'https':"http://"+proxy,
    }
    title=''
    try:
        s=requests.get('http://music.163.com/',headers=header,proxies=proxy_http,timeout=2)
        title=BeautifulSoup(s.text,'lxml').h1.text
        if title.strip()=='网易云音乐':
            print('correct:'+proxy)
            proxies_useful.append(proxy)
    except Exception as e:
        print('error:'+proxy)
        continue
This block tests whether each proxy actually works. My strategy is simply to fetch NetEase Cloud Music through the proxy; if the expected page title comes back, the proxy is usable.
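If you would rather not tie the check to one site's page title, an alternative is to hit an endpoint that simply echoes your IP and only look at the status code. A rough sketch using httpbin.org (any stable URL you trust works just as well):

import requests

def check_proxy(proxy, timeout=2):
    # returns True if the proxy can fetch a known-good URL within the timeout
    proxy_http = {
        'http': 'http://' + proxy,
        'https': 'http://' + proxy,
    }
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxy_http, timeout=timeout)
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

# usage: keep only the proxies that pass the check
# proxies_useful = [p for p in proxies if check_proxy(p)]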
Faking the request headers
import random

class Urls():
    lagouwang_urls=[
        'https://www.lagou.com/zhaopin/.NET/',
        'https://www.lagou.com/zhaopin/WP/',
        'https://www.lagou.com/zhaopin/Java/',
        'https://www.lagou.com/zhaopin/C%2B%2B/',
        'https://www.lagou.com/zhaopin/PHP/',
        'https://www.lagou.com/zhaopin/shujuwajue/',
        'https://www.lagou.com/zhaopin/sousuosuanfa/',
        'https://www.lagou.com/zhaopin/jingzhuntuijian/',
        'https://www.lagou.com/zhaopin/C/',
        'https://www.lagou.com/zhaopin/C%23/',
        'https://www.lagou.com/zhaopin/quanzhangongchengshi/',
        'https://www.lagou.com/zhaopin/Hadoop/',
        'https://www.lagou.com/zhaopin/Python/',
        'https://www.lagou.com/zhaopin/Delphi/',
        'https://www.lagou.com/zhaopin/VB/',
        'https://www.lagou.com/zhaopin/Perl/',
        'https://www.lagou.com/zhaopin/Ruby/',
        'https://www.lagou.com/zhaopin/Node.js/',
    ]

class GetHeaders():
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
    ]
    def getHeaders(self):
        headers={
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.8",
            "Connection": "keep-alive",
            "Content-Type":"application/x-www-form-urlencoded; charset=UTF-8",
            'User-Agent':random.choice(self.user_agent_list),
            'Referer':"http://www.lagou.com/"
        }
        return headers
The user_agent_list above is copied and pasted from the web. For headers, just open any website, pull up the browser developer tools, pick a request under the Network tab, and fill the fields in from there. lagouwang_urls holds the URL of every job category, which I scraped with Python beforehand; adding yet another layer of crawling for it felt like too much trouble, so I pasted the list straight in here. For brevity not all of them are shown; the real list is much longer.
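For completeness, this is how the class above is meant to be consumed. It is just a usage sketch, with example.com standing in for whatever page you are requesting:

import requests
from get_headers import GetHeaders, Urls

# each call picks a random User-Agent, so repeated requests do not all
# present the same browser fingerprint
header = GetHeaders().getHeaders()
response = requests.get('http://example.com/', headers=header, timeout=5)
print(response.status_code)

# the pre-scraped category URLs are available the same way
print(len(Urls.lagouwang_urls))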
Start crawling
The following code is the main program that scrapes the job listings from Lagou.
import requests
from getPorxy import GetProxy
from get_headers import GetHeaders,Urls
from lxml import etree
import re
import json
import os
import time
import random
import settings
from requests.exceptions import RequestException
class LaGou():
    headers=GetHeaders().getHeaders()
    path=settings.path
    urls=Urls.lagouwang_urls

    def getCatUrls(self):
        proxy_list_useful=GetProxy().getproxy()
        proxy_list=[]
        proxy_len=len(proxy_list_useful)
        print(proxy_len)
        # duplicate the working proxies to enlarge the pool
        proxy_list=proxy_list_useful*10
        print('proxies is ready!')
        for url in self.urls:
            # flag marking an empty result page, defaults to False
            is_null_page=False
            cat_name=url.split('/')[-2].replace('.','')
            item_urls=[]
            next_page=url
            while True:
                proxy=random.choice(proxy_list)
                try:
                    this_page=next_page
                    response=requests.get(this_page,headers=GetHeaders().getHeaders(),timeout=10,proxies=proxy)
                    html=etree.HTML(response.text)
                    each_jos_urls=html.xpath('//*[@id="s_position_list"]/ul/li/div/div[1]/div/a/@href')
                    next_page=html.xpath('//*[@id="s_position_list"]/div[2]/div/a')[-1].xpath('./@href')[0]
                except RequestException as e:
                    print(e)
                    proxy_list.remove(proxy)
                    if len(proxy_list)<proxy_len:
                        proxy_list_useful=GetProxy().getproxy()
                        proxy_list=list(proxy_list_useful*10)
                        print('proxies is ready!')
                    continue
                except IndexError as e:
                    try:
                        # check whether this page is an empty result page
                        is_null_page=html.xpath('//*[@id="s_position_list"]/ul/div/div[2]/div/text()')==['暂时没有符合该搜索条件的职位']
                        if is_null_page:
                            break
                    except:
                        pass
                    print(e)
                    proxy_list.remove(proxy)
                    if len(proxy_list)<proxy_len:
                        proxy_list_useful=GetProxy().getproxy()
                        proxy_list=list(proxy_list_useful*3)
                        print('proxies is ready!')
                    continue
                except Exception as e:
                    print(e)
                    continue
                if is_null_page:
                    continue
                for url in each_jos_urls:
                    item_urls.append(url)
                    print(url)
                print(next_page)
                # time.sleep(1)
                # when the next-page link no longer points at lagou.com we are on
                # the last page of the category, so save it and move on
                if 'www.lagou.com' not in next_page:
                    # with open('/home/lys/project/requests project/item_urls.txt','a') as f:
                    #     for url in item_urls:
                    #         f.writelines(url+'\n')
                    try:
                        os.mkdir(path=self.path+cat_name)
                    except:
                        pass
                    os.chdir(path=self.path+cat_name)
                    LaGou().processItem(item_urls)
                    break

    def processItem(self,urls):
        i=0
        proxy_list_useful=GetProxy().getproxy()
        proxy_len=len(proxy_list_useful)
        proxy_list=proxy_list_useful*10
        print('proxies is ready!')
        # duplicate the working proxies to enlarge the pool
        while i<len(urls):
            proxy=random.choice(proxy_list)
            try:
                print(urls[i])
                response=requests.get(urls[i],headers=GetHeaders().getHeaders(),proxies=proxy,timeout=5)
                html=etree.HTML(response.text)
                data_dict={}
                #company
                data_dict['company']=html.xpath('/html/body/div[2]/div/div[1]/div/div[1]/text()')[0].strip()
                #job_name
                try:
                    data_dict['job_name']=html.xpath('/html/body/div[2]/div/div[1]/div/span/text()')[0].strip()
                except:
                    data_dict['job_name']=' '
                #salary
                try:
                    data_dict['salary']=html.xpath('/html/body/div[2]/div/div[1]/dd/p[1]/span[1]/text()')[0].strip()
                except:
                    data_dict['salary']=' '
                #city
                try:
                    data_dict['city']=html.xpath('/html/body/div[2]/div/div[1]/dd/p[1]/span[2]/text()')[0].replace('/','').strip()
                except:
                    data_dict['city']=' '
                #experience
                try:
                    data_dict['experience']=html.xpath('/html/body/div[2]/div/div[1]/dd/p[1]/span[3]/text()')[0].replace('/','').strip()
                except:
                    data_dict['experience']=' '
                #educetion
                try:
                    data_dict['educetion']=html.xpath('/html/body/div[2]/div/div[1]/dd/p[1]/span[4]/text()')[0].replace('/','').strip()
                except:
                    data_dict['educetion']=' '
                #job_type
                try:
                    data_dict['job_type']=html.xpath('/html/body/div[2]/div/div[1]/dd/p[1]/span[5]/text()')[0].replace('/','').strip()
                except:
                    data_dict['job_type']=' '
                #attractive_title
                try:
                    data_dict['attractive_title']=html.xpath('//*[@id="job_detail"]/dd[1]/span/text()')[0].replace(':','').strip()
                except:
                    data_dict['attractive_title']=' '
                #attractive_content
                try:
                    data_dict['attractive_content']=html.xpath('//*[@id="job_detail"]/dd[1]/p/text()')[0].strip()
                except:
                    data_dict['attractive_content']=' '
                #description_title
                try:
                    data_dict['description_title']=html.xpath('//*[@id="job_detail"]/dd[2]/h3/text()')[0].replace(':','').strip()
                except:
                    data_dict['description_title']=' '
                #description_content
                try:
                    data_dict['description_content']=re.sub(r'\\xa\d','',str(html.xpath('//*[@id="job_detail"]/dd[2]/div/p/text()')).replace('\', \'','\n').replace('[\'','').replace('\']','')).strip()
                except:
                    data_dict['description_content']=' '
                #city_distinct
                try:
                    data_dict['city_distinct']=str(html.xpath('//*[@id="job_detail"]/dd[3]/div[1]/a/text()')[0:-1]).replace('\', \'','\n').replace('[\'','').replace('\']','').strip()
                except:
                    data_dict['city_distinct']=' '
                #city_detail
                try:
                    data_dict['city_detail']=html.xpath('//*[@id="job_detail"]/dd[3]/input[3]/@value')[0].strip()
                except:
                    data_dict['city_detail']=' '
                #city_longitude
                try:
                    data_dict['city_longitude']=html.xpath('//*[@id="job_detail"]/dd[3]/input[1]/@value')[0].strip()
                except:
                    data_dict['city_longitude']=' '
                #city_latitude
                try:
                    data_dict['city_latitude']=html.xpath('//*[@id="job_detail"]/dd[3]/input[2]/@value')[0].strip()
                except:
                    data_dict['city_latitude']=' '
                #company_name
                try:
                    data_dict['company_name']=html.xpath('//*[@id="job_company"]/dt/a/div/h2/text()')[0].strip()
                except:
                    data_dict['company_name']=' '
                #field
                try:
                    data_dict['field']=str(html.xpath('//*[@id="job_company"]/dd/ul/li[1]/text()')).replace('\', \'','\n').replace('[\'','').replace('\']','').replace('\\n','').strip()
                except:
                    data_dict['field']=' '
                #development_stage
                try:
                    data_dict['development_stage']=html.xpath('//*[@id="job_company"]/dd/ul/li[2]/text()[2]')[0].strip()
                except:
                    data_dict['development_stage']=' '
                #company_scale
                try:
                    data_dict['company_scale']=html.xpath('//*[@id="job_company"]/dd/ul/li[3]/text()[2]')[0].strip()
                except:
                    data_dict['company_scale']=' '
                #company_page
                try:
                    data_dict['company_page']=html.xpath('//*[@id="job_company"]/dd/ul/li[4]/a/@href')[0].strip()
                except:
                    data_dict['company_page']=' '
                # only advance to the next url after the page was fetched and parsed without errors
                data_json=json.dumps(data_dict,ensure_ascii=False)
                with open(str(i)+'.json','w') as f:
                    f.write(data_json)
                i=i+1
                time.sleep(1)
            except RequestException as e:
                print(e)
                proxy_list.remove(proxy)
                if len(proxy_list)<proxy_len:
                    proxy_list_useful=GetProxy().getproxy()
                    proxy_list=list(proxy_list_useful*10)
                    print('proxies is ready!')
                continue
            except IndexError as e:
                print(e)
                proxy_list.remove(proxy)
                if len(proxy_list)<proxy_len:
                    proxy_list_useful=GetProxy().getproxy()
                    proxy_list=list(proxy_list_useful*10)
                    print('proxies is ready!')
                continue
            except Exception as e:
                print(e)
                continue
if __name__ == '__main__':
    LaGou().getCatUrls()
The idea is: after collecting all the job links for one category, fetch a fresh batch of proxy IPs, and then crawl every position under that category.
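The proxy bookkeeping inside both methods follows one small pattern that is easy to miss in the long listing: duplicate each working proxy ten times, drop one copy whenever a request through it fails, and fetch a fresh batch once the pool shrinks below the size of the original working set. Pulled out into a standalone helper it would look roughly like this (a sketch of the same idea, not a drop-in replacement; ProxyPool is a name I made up):

import random
from getPorxy import GetProxy

class ProxyPool():
    def __init__(self, copies=10):
        self.copies = copies
        self.refresh()

    def refresh(self):
        # fetch a fresh batch of working proxies and duplicate them
        self.useful = GetProxy().getproxy()
        self.pool = list(self.useful * self.copies)
        print('proxies is ready!')

    def pick(self):
        return random.choice(self.pool)

    def report_failure(self, proxy):
        # drop one copy of the failing proxy; refresh once the pool
        # has shrunk below the size of the original working set
        self.pool.remove(proxy)
        if len(self.pool) < len(self.useful):
            self.refresh()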
Summary
The proxy IP problem itself is fairly simple, but validating proxies on the fly right before crawling, the way I do here, is very time-consuming. So my next step is to improve this crawler by building a highly available proxy IP pool backed by a database and multithreading. Stay tuned; the source code is on my GitHub if you want to dig in.
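As a small teaser of that direction, validation is the part that parallelizes most naturally, since each check mostly waits on the network. A minimal sketch with concurrent.futures (check_proxy is the same kind of tester sketched earlier, and the thread count is arbitrary):

from concurrent.futures import ThreadPoolExecutor
import requests

def check_proxy(proxy, timeout=2):
    proxy_http = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxy_http, timeout=timeout)
        return r.status_code == 200
    except requests.exceptions.RequestException:
        return False

def filter_proxies(proxies, workers=20):
    # run the checks in parallel threads instead of one by one
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_proxy, proxies))
    return [p for p, ok in zip(proxies, results) if ok]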