Dianping (大众点评) car channel spider: requests with multithreading

When requesting the shop detail pages of each list page, 15 child threads are started (each list page holds 15 shops).

This first version does not call join():

There is a problem with it: while the 15 detail-page child threads are still running, the main thread does not wait for them (it is non-blocking). It simply turns to the next list page and spawns more child threads to parse detail pages. On a weak machine, especially with a poor network or heavy anti-scraping on the site, this can leave over a thousand child threads running at the same time, which puts a lot of pressure on the server. It also makes the downloaded data much messier: the first row might be the first shop of the first list page, while the second row is the 15th shop of the 50th list page. So it needs optimizing.

It manages roughly 40.5 records per minute, or 58,320 records per day (24 hours), which still feels a bit slow; that rate works out to the equivalent of about 75 child threads running at once.
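
As an aside, here is a minimal sketch of one common way to bound the number of in-flight detail threads without waiting page by page: a threading.Semaphore caps how many workers can run at once while the main thread keeps paging. This is not the post's code; fetch_detail and MAX_WORKERS are hypothetical stand-ins.

# Sketch only: cap concurrent detail threads with a Semaphore.
import threading
import time

MAX_WORKERS = 30                          # assumed cap, tune to machine and site
slots = threading.Semaphore(MAX_WORKERS)

def fetch_detail(url):                    # hypothetical stand-in for parse_detail
    try:
        time.sleep(0.1)                   # stand-in for the real request + parse
        print('done', url)
    finally:
        slots.release()                   # free the slot when this worker exits

for i in range(100):                      # pretend these are 100 detail URLs
    slots.acquire()                       # blocks only when MAX_WORKERS are already running
    threading.Thread(target=fetch_detail, args=('shop-{}'.format(i),)).start()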

The version with join() added, shown further down, is even slower.

#coding:utf-8
import hashlib
import time
from fake_useragent import UserAgent
import requests
# from UA import data
import json
import scylla_test
import kuai_test
from lxml import etree
import re
import csv
from shantou_links import *
import random
from cookie_parse import GetComments
import threading
L = threading.Lock()


class Luoyang:
    def __init__(self):
        self.city_name = '洛阳'

    # Spider entry point: start crawling
    def get_first_page(self):
        ua = UserAgent()
        print('全部爬虫开始工作,从后往前')
        data = ['id从后往前','店铺名称','所属城市','行政区','地址','二级分类','三级分类','电话','点评数量','最新点评时间']
        with open('luoyang_dianping_car.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(data)
        field = '买车卖车'
        urls = self.get_target_urls()
        self.start(urls)
        print('完成')

    # Collect all of the starting list links
    def get_target_urls(self):
        # beau_urls = self.get_target_url(url='http://www.dianping.com/luoyang/ch65/g34072')
        # mend_urls = self.get_target_url(url='http://www.dianping.com/luoyang/ch65/g176')
        match_urls = self.get_target_url(url='http://www.dianping.com/luoyang/ch65/g177')
        # result = mend_urls + beau_urls + match_urls
        result = match_urls
        print(result)
        return result

    def get_target_url(self, url):
        field = '买车'
        while True:
            headers = {

            }
            content = self.url_start(url,headers,field)

            if content.status_code == 200 and field in content.text:
                tree = etree.HTML(content.text)
                urls = tree.xpath('//div[@id="region-nav"]/a/@href')
                return urls


    # Iterate over all starting links, send the requests and save the data
    def start(self,urls):
        for first_url in urls:
            field = '买车卖车'


            # Keep retrying until the page responds correctly
            while True:

                headers = {

                }
                response = self.url_start(first_url,headers,field)
                print(response.status_code)
                print(response.text)
                if response.status_code == 200 and field in response.text:
                    print('返回200,页面有需求的字段')
                    break
                else:
                    print('该请求失败,准备重试')
                    time.sleep(2)
            if 'g34084' in first_url:
                type = '4s'
            elif 'g34085' in first_url:
                type = '综合经销商'
            elif 'g34086' in first_url:
                type = '二手车'
            else:
                type = 0
            self.parse_firs_full_page(response, type)


    # Get the page count; save every shop on the first page, then loop through the remaining pages and save them too; if there is no type, pass type=0
    def parse_firs_full_page(self,response,type=0):
        tree = etree.HTML(response.text)
        try:
            pages = tree.xpath('//a[@class="PageLink"]/text()')[-1]
            pages = int(pages)
        except:
            pages = 0
        # Save the first page's data
        # url_t = 'http://www.dianping.com/luoyang/ch65/g34085'
        # if url_t not in response.url:
        #     print('经销商页面进来了,首页不保存')
        self.one_full_page(response,type)
        # Turn the page and keep saving
        self.turn_page(response.url,pages,type)

    # Turn the pages and save every shop on each page
    def turn_page(self, url, pages, type):
        if pages > 0:
            for i in range(2, pages + 1):
                time.sleep(5)
                start_url = url + 'p{}'
                action_url = start_url.format(i)
                if i == 2:
                    headers = {
                        'Referer': url,
                        'Host': 'www.dianping.com'
                    }
                else:
                    headers = {
                        'Referer': start_url.format(i - 1),
                        'Host': 'www.dianping.com'
                    }
                print(headers)
                field = '买车卖车'
                while True:
                    response = self.url_start(action_url, headers, field)
                    print(response.status_code)
                    # print(response.text)
                    if response.status_code == 200 and field in response.text:
                        print('该链接请求成功')
                        break
                    else:
                        print('请求失败')
                        time.sleep(2)
                self.one_full_page(response, type)

    # Send a new request and return the response
    def url_start(self,url,headers,field):
        while True:
            try:
                # Catch proxy timeout exceptions
                times = int(time.time())
                planText = "orderno=隐藏,secret=b5dd53126b3143fba00dda5fec6b9607,timestamp={}".format(times)
                md = hashlib.md5()
                md.update(planText.encode('utf-8'))
                content = md.hexdigest()
                ua = UserAgent()
                headers['User-Agent'] = ua.random
                headers['Proxy-Authorization'] = 'sign={}&orderno=ZF20186170227TPgMj4&timestamp={}'.format(content.upper(), times)

                proxies = {'http': 'forward.xdaili.cn:80'}
                response = requests.get(url, proxies=proxies, headers=headers)
                return response
            except:
                print ('代理超时,重试.....')



    # Download and save the details of every shop on one list page
    def one_full_page(self,response,type=0):
        tree = etree.HTML(response.text)
        business_li = tree.xpath('//div[@class="pic"]/a/@href')

        headers = {
            'Referer': response.url,
            'Host': 'www.dianping.com'
        }
        print(headers)
        if len(business_li) > 0:
            for business in business_li:
                id = re.findall(r'/shop/(\d+)', business)[0]
                t = threading.Thread(target=self.parse_detail,args=(business,id,headers,type))
                t.start()
        else:
            print('该页面没有店铺')

    # Parse a shop detail page and save the data; used by one_full_page
    def parse_detail(self,url,id,headers,type=0):
        field = '地址'
        while True:

            response = self.url_start(url,headers,field)
            print(headers)
            if response.status_code == 200 and len(response.text)>0 and field in response.text:
                print('请求成功200')
                break
            else:
                print('请求失败,重试')
                time.sleep(2)

        content = response.text
        # print(content)
        try:
            tree = etree.HTML(content)
            # print('详情页数据',content)
            name = tree.xpath('//h1[@class="shop-name"]/text()')[0]
            city = self.city_name

            district_list = tree.xpath('//div[@class="breadcrumb"]/a/text()')
            district = ''
            for i in district_list:
                if '区' in i:
                    district = i
            address = tree.xpath('//span[@itemprop="street-address"]/text()')[0].strip()
            second_type =tree.xpath('//div[@class="breadcrumb"]/a/text()')[1]
            if type == 0:
                third_type = ''
            else:
                third_type = type
            try:
                tel = tree.xpath('//p[@class="expand-info tel"]/span[@itemprop="tel"]/text()')[0]
            except:
                tel = ''
            comment_num = tree.xpath('//span[@id="reviewCount"]/text()')[0]
            latest_time = self.get_commets_time(id)
            info_list = [id,name,city,district,address,second_type,third_type,tel,comment_num,latest_time]
            # Save one shop's details
            L.acquire()
            with open('luoyang_dianping_car.csv', 'a', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow(info_list)
            L.release()
            print('单个店铺详情保存成功')
        except:
            print('详情页没有数据')


    # Get the date of the latest review; used by parse_detail
    def get_commets_time(self,id):
        getcomments = GetComments(id)
        lasttime = getcomments.get_lasttime()
        return lasttime


if __name__ == '__main__':
    luoyang = Luoyang()
    luoyang.get_first_page()

Adding join() to tidy up the multithreading:

This keeps the downloaded data in order and puts far less pressure on the server.
The commented-out links in the code are the categories that had already been crawled; this run only covers what was left.
Only this one block of the code was changed:

            threads = []
            for business in business_li:
                id = re.findall(r'/shop/(\d+)', business)[0]
                t = threading.Thread(target=self.parse_detail,args=(business,id,headers,type))
                threads.append(t)
                t.start()
            for thread in threads:
                thread.join()
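
As a toy illustration of what join() changes here (this is not the spider itself, just a minimal example): the main thread only moves on to the next list page once every worker it started for the current page has finished.

# Toy example: start a batch of workers per "page", then join them all
# before turning the page, mirroring the change to one_full_page above.
import threading
import time

def worker(page, shop):
    time.sleep(0.2)                       # stand-in for requesting one detail page
    print('page {} shop {} done'.format(page, shop))

for page in range(1, 4):                  # pretend there are 3 list pages
    threads = []
    for shop in range(1, 16):             # and 15 shops per page
        t = threading.Thread(target=worker, args=(page, shop))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()                          # block until this page's workers are done
    print('--- page {} fully saved, turning the page ---'.format(page))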

The full code follows:

#coding:utf-8
import hashlib
import time
from fake_useragent import UserAgent
import requests
# from UA import data
import json
import scylla_test
import kuai_test
from lxml import etree
import re
import csv
from shantou_links import *
import random
from cookie_parse import GetComments
import threading
L = threading.Lock()


class Luoyang:
    def __init__(self):
        self.city_name = '洛阳'

    # Spider entry point: start crawling
    def get_first_page(self):
        ua = UserAgent()
        print('全部爬虫开始工作,从后往前')
        data = ['id从配件厂老城区开始','店铺名称','所属城市','行政区','地址','二级分类','三级分类','电话','点评数量','最新点评时间']
        with open('luoyang_dianping_car.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(data)
        field = '买车卖车'
        urls = self.get_target_urls()
        self.start(urls)
        print('完成')

    # Collect all of the starting list links
    def get_target_urls(self):
        # beau_urls = self.get_target_url(url='http://www.dianping.com/luoyang/ch65/g34072')
        # mend_urls = self.get_target_url(url='http://www.dianping.com/luoyang/ch65/g176')
        match_urls = self.get_target_url(url='http://www.dianping.com/luoyang/ch65/g177')
        # result = mend_urls + beau_urls + match_urls
        result = match_urls
        print(result)
        return result

    def get_target_url(self, url):
        field = '买车'
        while True:
            headers = {

            }
            content = self.url_start(url,headers,field)

            if content.status_code == 200 and field in content.text:
                tree = etree.HTML(content.text)
                urls = tree.xpath('//div[@id="region-nav"]/a/@href')
                return urls


    # Iterate over all starting links, send the requests and save the data
    def start(self,urls):
        for first_url in urls[2:]:
            field = '买车卖车'
            # Keep retrying until the page responds correctly
            while True:

                headers = {

                }
                response = self.url_start(first_url,headers,field)
                print(response.status_code)
                print(response.text)
                if response.status_code == 200 and field in response.text:
                    print('返回200,页面有需求的字段')
                    break
                else:
                    print('该请求失败,准备重试')
                    time.sleep(2)
            if 'g34084' in first_url:
                type = '4s'
            elif 'g34085' in first_url:
                type = '综合经销商'
            elif 'g34086' in first_url:
                type = '二手车'
            else:
                type = 0
            self.parse_firs_full_page(response, type)

    # Get the page count; save every shop on the first page, then loop through the remaining pages and save them too; if there is no type, pass type=0
    def parse_firs_full_page(self,response,type=0):
        tree = etree.HTML(response.text)
        try:
            pages = tree.xpath('//a[@class="PageLink"]/text()')[-1]
            pages = int(pages)
        except:
            pages = 0
        # Save the first page's data
        # url_t = 'http://www.dianping.com/luoyang/ch65/g34085'
        # if url_t not in response.url:
        #     print('经销商页面进来了,首页不保存')
        self.one_full_page(response,type)
        # Turn the page and keep saving
        self.turn_page(response.url,pages,type)

    # Turn the pages and save every shop on each page
    def turn_page(self, url, pages, type):
        if pages > 0:
            for i in range(2, pages + 1):
                time.sleep(5)
                start_url = url + 'p{}'
                action_url = start_url.format(i)
                if i == 2:
                    headers = {
                        'Referer': url,
                        'Host': 'www.dianping.com'
                    }
                else:
                    headers = {
                        'Referer': start_url.format(i - 1),
                        'Host': 'www.dianping.com'
                    }
                print(headers)
                field = '买车卖车'
                while True:
                    response = self.url_start(action_url, headers, field)
                    print(response.status_code)
                    # print(response.text)
                    if response.status_code == 200 and field in response.text:
                        print('该链接请求成功')
                        break
                    else:
                        print('请求失败')
                        time.sleep(2)
                self.one_full_page(response, type)

    # Send a new request and return the response
    def url_start(self,url,headers,field):
        while True:
            try:
                # Catch proxy timeout exceptions
                times = int(time.time())
                planText = "orderno=ZF20186170227TPgMj4,secret=b5dd53126b3143fba00dda5fec6b9607,timestamp={}".format(times)
                md = hashlib.md5()
                md.update(planText.encode('utf-8'))
                content = md.hexdigest()
                ua = UserAgent()
                headers['User-Agent'] = ua.random
                headers['Proxy-Authorization'] = 'sign={}&orderno=ZF20186170227TPgMj4&timestamp={}'.format(content.upper(), times)

                proxies = {'http': 'forward.xdaili.cn:80'}
                response = requests.get(url, proxies=proxies, headers=headers)
                return response
            except:
                print ('代理超时,重试.....')

    # Download and save the details of every shop on one list page
    def one_full_page(self,response,type=0):
        tree = etree.HTML(response.text)
        business_li = tree.xpath('//div[@class="pic"]/a/@href')

        headers = {
            'Referer': response.url,
            'Host': 'www.dianping.com'
        }
        print(headers)
        if len(business_li) > 0:
            threads = []
            for business in business_li:
                id = re.findall(r'/shop/(\d+)', business)[0]
                t = threading.Thread(target=self.parse_detail,args=(business,id,headers,type))
                threads.append(t)
                t.start()
            for thread in threads:
                thread.join()
        else:
            print('该页面没有店铺')

    # Parse a shop detail page and save the data; used by one_full_page
    def parse_detail(self,url,id,headers,type=0):
        field = '地址'
        while True:

            response = self.url_start(url,headers,field)
            print(headers)
            if response.status_code == 200 and len(response.text)>0 and field in response.text:
                print('请求成功200')
                break
            else:
                print('请求失败,重试')
                time.sleep(2)
        content = response.text
        # print(content)
        try:
            tree = etree.HTML(content)
            # print('详情页数据',content)
            name = tree.xpath('//h1[@class="shop-name"]/text()')[0]
            city = self.city_name

            district_list = tree.xpath('//div[@class="breadcrumb"]/a/text()')
            district = ''
            for i in district_list:
                if '区' in i:
                    district = i
            address = tree.xpath('//span[@itemprop="street-address"]/text()')[0].strip()
            second_type =tree.xpath('//div[@class="breadcrumb"]/a/text()')[1]
            if type == 0:
                third_type = ''
            else:
                third_type = type
            try:
                tel = tree.xpath('//p[@class="expand-info tel"]/span[@itemprop="tel"]/text()')[0]
            except:
                tel = ''
            comment_num = tree.xpath('//span[@id="reviewCount"]/text()')[0]
            latest_time = self.get_commets_time(id)
            info_list = [id,name,city,district,address,second_type,third_type,tel,comment_num,latest_time]
            # Save one shop's details
            L.acquire()
            with open('luoyang_dianping_car.csv', 'a', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow(info_list)
            L.release()
            print('单个店铺详情保存成功')
        except:
            print('详情页没有数据')

    # Get the date of the latest review; used by parse_detail
    def get_commets_time(self,id):
        getcomments = GetComments(id)
        lasttime = getcomments.get_lasttime()
        return lasttime


if __name__ == '__main__':
    luoyang = Luoyang()
    luoyang.get_first_page()

8.4 records per minute, or 12,096 records per day (24 hours), which really is painfully slow, and that is with 15 child threads per list page.
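
A likely middle ground between the two versions, sketched below under the assumption that a parse_detail-style callable and a per-page URL extractor exist (both names here are hypothetical): a fixed-size concurrent.futures.ThreadPoolExecutor keeps a constant number of workers busy across page boundaries, so throughput is not limited by the slowest of each batch of 15, while the total thread count stays bounded.

# Sketch only: feed a fixed pool of workers page by page instead of
# start-15 / join-15 per list page.
from concurrent.futures import ThreadPoolExecutor

def crawl_with_pool(list_pages, parse_detail, detail_urls_of_page, workers=15):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for page in list_pages:
            for url in detail_urls_of_page(page):
                futures.append(pool.submit(parse_detail, url))
        for f in futures:
            f.result()                    # re-raise any worker exception

if __name__ == '__main__':
    def fake_parse(u):                    # hypothetical stand-in for parse_detail
        print('parsed', u)
    def fake_urls(p):                     # hypothetical stand-in for one page's shop links
        return ['page{}-shop{}'.format(p, i) for i in range(15)]
    crawl_with_pool([1, 2, 3], fake_parse, fake_urls)

Note that a pool does not restore row order by itself: workers still finish in whatever order the network allows, so if strictly ordered CSV rows matter, the join() version (or sorting the file afterwards) remains the simpler choice.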

One thing to note:

Whenever child threads modify or assign to a shared global variable, you must take a lock. Assignment and modification are just as asynchronous as everything else: the first child thread may have only just computed the new value of the global (before the assignment lands) when a second child thread comes along and modifies it, and the global then ends up with a value different from what you expected.
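
A minimal demonstration of that point (not taken from the spider): two threads do a read-modify-write on a shared global counter; without the lock some updates can be lost and the final count may come up short, while with the lock it is always exact.

# Minimal race-condition demo: read-modify-write on a shared global.
import threading

counter = 0
lock = threading.Lock()

def bump(times, use_lock):
    global counter
    for _ in range(times):
        if use_lock:
            with lock:
                counter += 1              # read, add and write back under the lock
        else:
            tmp = counter + 1             # read and compute the new value...
            counter = tmp                 # ...another thread may have written in between

for use_lock in (False, True):
    counter = 0
    threads = [threading.Thread(target=bump, args=(200000, use_lock)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('use_lock={} -> counter={}'.format(use_lock, counter))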
