Scraping job listings from 58.com with Python

Crawler study notes

Fetching job listings from 58.com (58同城)

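Why write a crawler

We write crawlers to pull the key information out of web pages and then analyze it. This is the age of data, so data is an important resource, and a crawler helps us obtain it.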

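Purpose of this article

There are many crawling techniques today, most of them built on Python. As a beginner, I suggest not leaning on too many ready-made tools, because then you never learn the techniques underneath; with scrapy, for example, it is hard to see what it calls internally. This article therefore uses urllib2 + BeautifulSoup + MySQL to collect job listings from 58.com. The most important part is analyzing the page source to find the information you need.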

Fetching the page source

    # `start` is the current page number; `headers` carries the User-Agent
    url = "http://hz.58.com/tech/" + "pn" + str(start) + "/"
    request = urllib2.Request(url=url, headers=headers)

    response = urllib2.urlopen(request, timeout=60)
    html = response.read().decode('utf-8')

    soup = BeautifulSoup(html, 'lxml')
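For readers on Python 3, where urllib2 no longer exists, the same step can be sketched with `urllib.request`. This is only an illustration: the helper names `build_page_url` and `build_request` are made up for this example, and nothing is sent over the network until the request is actually opened.

```python
from urllib.request import Request

BASE_URL = "http://hz.58.com/tech/"  # listing URL used in this article
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/45.0.2454.85 Safari/537.36")


def build_page_url(page):
    """Return the listing URL for a page number, e.g. .../pn3/."""
    return "%spn%d/" % (BASE_URL, page)


def build_request(page):
    """Create a Request with a browser-like User-Agent; nothing is sent yet."""
    return Request(build_page_url(page), headers={"User-Agent": USER_AGENT})


req = build_request(3)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen(req, timeout=60)` would then fetch the page, just as `urllib2.urlopen` does in the Python 2 code.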

Getting the listing entries from 58


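    # every <dl> under div#infolist is one job listing
    all_dl = soup.find('div', id='infolist').findAll('dl')
    for item in all_dl:
        job = item.find('dt').find('a')
        info = getdatas.getInfo(job['href'])
        if info != 0:
            count += insertmysql.insertMysql(info)
            print "Rows stored so far: %d" % (count)
        time.sleep(5)
    start = start + 1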

Each item is one job listing. We then follow its link to the second-level (detail) page and collect the job details there.
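To make the link-extraction step concrete without installing anything, here is a rough sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup. The sample HTML and the `example.com` URLs are invented for illustration, and unlike the article's code this version does not restrict itself to `div#infolist`; it simply takes the first link inside every `<dt>`.

```python
from html.parser import HTMLParser


class ListingLinkParser(HTMLParser):
    """Collect the href of the first <a> inside every <dt>."""

    def __init__(self):
        super().__init__()
        self.in_dt = False
        self.taken = False  # True once a link was captured for the current <dt>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.in_dt = True
            self.taken = False
        elif tag == "a" and self.in_dt and not self.taken:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.taken = True

    def handle_endtag(self, tag):
        if tag == "dt":
            self.in_dt = False


def extract_links(html):
    """Return the detail-page links found in a listing page."""
    parser = ListingLinkParser()
    parser.feed(html)
    return parser.links


# Invented sample that mimics the <dl><dt><a> layout described in the article
SAMPLE = """
<div id="infolist">
  <dl><dt><a href="http://example.com/job/1">Python developer</a></dt><dd>...</dd></dl>
  <dl><dt><a href="http://example.com/job/2">Java developer</a></dt></dl>
</div>
"""
print(extract_links(SAMPLE))
```

Each returned link is what the article's loop passes to `getInfo`.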

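Second-level URLs

This part also starts by fetching the page source, then uses BeautifulSoup to pick out the key information. BeautifulSoup usage is covered in its official documentation.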

def getInfo(url):
    headers = {}
    headers["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"

    try:
        # proxies = {'http': proxy_ip}
        request = urllib2.Request(url=url, headers=headers)
        # request.set_proxy(proxy_ip, 'http')
        response = urllib2.urlopen(request)
        html = response.read().decode('utf-8')
        # html = requests.get(url, headers=headers, proxies=proxies)

        html = BeautifulSoup(html, 'lxml')
        info = {}
        info['id'] = uuid.uuid4()  # random UUID used as the primary key
        info['title'] = html.find('div', class_='item_con pos_info').find('span', class_='pos_name').get_text()
        # salary text looks like "min-max"; take the number on each side of the dash
        temp = html.find('div', class_='pos_base_info').find('span', class_='pos_salary').get_text()
        info['salary_min'] = int(re.findall(r"(\d+)\-", temp)[0])
        info['salary_max'] = int(re.findall(r"\-(\d+)", temp)[0])
        info['company'] = html.find('div', class_='item_con company_baseInfo').find('p', class_='comp_baseInfo_title').find('a', class_='baseInfo_link').get_text()
        # company size uses the same "min-max" format
        temp = html.find('div', class_='item_con company_baseInfo').find('p', class_='comp_baseInfo_scale').get_text()
        info['scale_min'] = int(re.findall(r"(\d+)\-", temp)[0])
        info['scale_max'] = int(re.findall(r"\-(\d+)", temp)[0])
        info['address'] = html.find('div', class_='item_con work_adress').find('p', class_='detail_adress').get_text()
        return info
    except Exception as e:
        # pages with a different layout raise here and are skipped
        return 0

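I use a UUID as the primary key and scrape the main fields of each listing: salary, company size, company address, and so on. Some 58 listing pages do not follow this layout, though, so to get more complete data you need to handle those cases separately.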
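One way to start handling the non-standard cases is to make the range parsing tolerant instead of letting it raise. The sketch below is an assumption about what such text might look like: the sample strings are invented, and `parse_range` is a hypothetical helper, not part of the original code.

```python
import re


def parse_range(text):
    """Parse a "min-max" style string into a (low, high) tuple.

    Returns (None, None) when the text contains no number at all.
    A single number is kept as the lower bound only; whether it means
    "at least" or "at most" is left to the caller.
    """
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if not numbers:
        return None, None
    if len(numbers) == 1:
        return numbers[0], None
    return numbers[0], numbers[1]


# Invented examples of salary and company-size text
for sample in ["8000-12000元/月", "面议", "100-499人", "1000人以上"]:
    print(sample, parse_range(sample))
```

With a helper like this, a listing whose salary is "面议" (negotiable) still produces a record, with the salary fields left empty, instead of being dropped by the `except` branch.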

Storing the data in a database

The database chosen here is MySQL, and connecting to it from Python is easy:

    db = MySQLdb.connect(host='localhost', user='root', passwd='123', db='58city', port=3306, charset='utf8')

    cursor = db.cursor()

Then write the record into MySQL:


    cursor.execute(
        'insert into jobs(id,title,salary_min,salary_max,company,scale_min,scale_max,address) values(%s,%s,%s,%s,%s,%s,%s,%s)',
        (id, title, salary_min, salary_max, company, scale_min, scale_max, address))

    db.commit()
    # close the cursor before the connection it belongs to
    cursor.close()
    db.close()

Our code will certainly have bugs, so it is best to wrap these calls in try/except.

    except Exception, e:
        # e.args[0] may be a numeric error code, so format the exception instead of concatenating
        print "Database error: %s" % e
        return 0
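The insert-and-handle-errors pattern can be tried out without a MySQL server. The sketch below swaps in the standard library's `sqlite3` as a stand-in, so the placeholder is `?` rather than MySQLdb's `%s`. The column types in the `jobs` table are assumptions, since the article does not show its schema, and `open_db` and `insert_job` are names invented for this example.

```python
import sqlite3
import uuid


def open_db(path=":memory:"):
    """Open a SQLite database and create the jobs table (assumed schema)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id TEXT PRIMARY KEY, title TEXT, salary_min INTEGER, salary_max INTEGER, "
        "company TEXT, scale_min INTEGER, scale_max INTEGER, address TEXT)"
    )
    return db


COLUMNS = ("id", "title", "salary_min", "salary_max",
           "company", "scale_min", "scale_max", "address")


def insert_job(db, info):
    """Insert one record; return 1 on success and 0 on failure."""
    cursor = db.cursor()
    try:
        cursor.execute(
            "INSERT INTO jobs (id, title, salary_min, salary_max, "
            "company, scale_min, scale_max, address) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            tuple(info[c] for c in COLUMNS),
        )
        db.commit()
        return 1
    except (sqlite3.Error, KeyError) as exc:
        db.rollback()  # leave the database unchanged after a failed insert
        print("Database error: %s" % exc)
        return 0
    finally:
        cursor.close()  # runs on both the success and the failure path


db = open_db()
job = {"id": str(uuid.uuid4()), "title": "Python developer",
       "salary_min": 8000, "salary_max": 12000, "company": "Example Co.",
       "scale_min": 100, "scale_max": 499, "address": "Hangzhou"}
print(insert_job(db, job))
```

Returning 1 or 0 mirrors the article's `insertMysql`, so the caller can keep adding the result to its running count.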

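Countering anti-scraping measures

We can route requests through IP proxies so our address does not get banned, and sleep between requests so that we do not crawl fast enough for the site to notice.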
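Both ideas can be sketched in a few lines of Python 3. The proxy addresses below are placeholders, the helper names are invented for this example, and the random jitter on top of the fixed delay is my own addition rather than something the original code does.

```python
import random
import time
from urllib.request import ProxyHandler, build_opener

# Placeholder addresses; a real list would come from a proxy provider
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080"]


def pick_proxy(proxies, rng=random):
    """Choose one proxy at random for the next request."""
    return rng.choice(proxies)


def opener_for(proxy):
    """Build an opener that routes HTTP traffic through `proxy`; sends nothing yet."""
    return build_opener(ProxyHandler({"http": "http://" + proxy}))


def polite_delay(base=5.0, jitter=2.0, rng=random):
    """Return a delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0, jitter)


def polite_sleep(base=5.0, jitter=2.0):
    """Sleep for a slightly randomized interval between requests."""
    time.sleep(polite_delay(base, jitter))


opener = opener_for(pick_proxy(PROXIES))
print(round(polite_delay(), 2))
```

In a crawl loop, `opener.open(request, timeout=60)` would replace the plain `urlopen` call, followed by `polite_sleep()` before the next request.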

Full source code

# coding:utf8
import random
import urllib2
import time
from bs4 import BeautifulSoup
import getdatas
import insertmysql
import requests

ISOTIMEFORMAT = '%Y-%m-%d %X'
headers = {}
headers["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
import getip
# get tags


# start

print "********** START **********"
print time.strftime(ISOTIMEFORMAT, time.localtime())

try:
    start = 33
    count = 0
    # proxy_list = getip.get_ips()
    while True:
        try:
            
            # proxy_ip = random.choice(proxy_list)
            # proxies = {'http': proxy_ip}
            # 
            url = "http://hz.58.com/tech/" + "pn"+str(start)+"/"
            request = urllib2.Request(url=url,headers=headers)
            # request.set_proxy(proxy_ip,'http')
            response = urllib2.urlopen(request,timeout=60)
            html = response.read().decode('utf-8')
            # html = requests.get(url, headers=headers, proxies=proxies)

            soup = BeautifulSoup(html,'lxml')
            all_dl = soup.find('div',id='infolist').findAll('dl')
            
            if len(all_dl) == 0:
                break

            for item in all_dl:

                job =  item.find('dt').find('a')
                info = getdatas.getInfo(job['href'])
                if info != 0:
                    count += insertmysql.insertMysql(info)
                    print "Rows stored so far: %d"%(count)
                time.sleep(5)
            start = start + 1
            print start
            time.sleep(5)
            # print info_list['director']
        except Exception, e:
            # the same page is retried on the next pass; pause so a failing page is not hammered
            print e.message + "1"
            time.sleep(5)
        


except Exception, e:
    print e.message +'2'

# coding:utf8


import urllib2
import urllib
import json
import time
import re
import random
import uuid
import requests
from bs4 import BeautifulSoup


def getInfo(url):
    headers = {}
    headers["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"

    try:
        # proxies = {'http': proxy_ip}
        request = urllib2.Request(url=url, headers=headers)
        # request.set_proxy(proxy_ip, 'http')
        response = urllib2.urlopen(request)
        html = response.read().decode('utf-8')
        # html = requests.get(url, headers=headers, proxies=proxies)

        html = BeautifulSoup(html, 'lxml')
        info = {}
        info['id'] = uuid.uuid4()  # random UUID used as the primary key
        info['title'] = html.find('div', class_='item_con pos_info').find('span', class_='pos_name').get_text()
        # salary text looks like "min-max"; take the number on each side of the dash
        temp = html.find('div', class_='pos_base_info').find('span', class_='pos_salary').get_text()
        info['salary_min'] = int(re.findall(r"(\d+)\-", temp)[0])
        info['salary_max'] = int(re.findall(r"\-(\d+)", temp)[0])
        info['company'] = html.find('div', class_='item_con company_baseInfo').find('p', class_='comp_baseInfo_title').find('a', class_='baseInfo_link').get_text()
        # company size uses the same "min-max" format
        temp = html.find('div', class_='item_con company_baseInfo').find('p', class_='comp_baseInfo_scale').get_text()
        info['scale_min'] = int(re.findall(r"(\d+)\-", temp)[0])
        info['scale_max'] = int(re.findall(r"\-(\d+)", temp)[0])
        info['address'] = html.find('div', class_='item_con work_adress').find('p', class_='detail_adress').get_text()
        return info
    except Exception as e:
        # pages with a different layout raise here and are skipped
        return 0

# -*- coding:utf-8 -*-  
import MySQLdb
import MySQLdb.cursors
import getCity

def insertMysql(info):

    if info is None:
        print "there is no information"
        return 0
    else:
        try:
            db = MySQLdb.connect(host='localhost', user='root', passwd='123', db='58city', port=3306,charset='utf8')

            cursor = db.cursor()
            id = info['id']
            title = info['title'] 
            salary_min = info['salary_min']
            salary_max = info['salary_max']
            company = info['company']

            scale_min = info['scale_min']
            scale_max = info['scale_max']
            address = info['address']
            cursor.execute(
                'insert into jobs(id,title,salary_min,salary_max,company,scale_min,scale_max,address) values(%s,%s,%s,%s,%s,%s,%s,%s)',
                (id,title,salary_min,salary_max,company,scale_min,scale_max,address))


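            db.commit()
            # close the cursor before the connection it belongs to
            cursor.close()
            db.close()
            return 1
        except Exception, e:
            # e.args[0] may be a numeric error code, so format the exception instead of concatenating
            print "Database error: %s" % e
            return 0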

Last edited: 2017.12.06 15:01:42
