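It has been a while since I last studied, so here is a short write-up.
The difference between multithreading and multiprocessing in Python
An excerpt from a book: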
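“
When a computer runs a program, it creates a process containing the code and its state. These processes are executed by the computer's one or more CPUs. At any given moment, however, each CPU executes only one process and switches rapidly between processes, which gives the impression that many programs are running at once. Likewise, within a process, execution switches between threads, with each thread running a different part of the program.
Here is a simple analogy. Picture a large factory that produces toys. The factory has several workshops, each with a different function and each producing different toy parts. Every workshop in turn has several workers, who cooperate and share resources to produce a given part. The factory corresponds to a web crawler, each workshop to a process, and each worker to a thread. With multiple threads and multiple processes, the web crawler can run quickly and efficiently.”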
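The relationship described in the excerpt can be observed directly. The following is a minimal sketch (the helper name `report` and the worker names are illustrative, not from the book): every thread started inside one program reports the same process ID and writes into the same shared list.

```python
import os
import threading

results = []                      # one list, shared by all threads in this process
lock = threading.Lock()

def report():
    # Each thread records the process it lives in and its own name
    with lock:
        results.append((os.getpid(), threading.current_thread().name))

threads = [threading.Thread(target=report, name='worker-%d' % i) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

pids = {pid for pid, _ in results}
print(len(results), 'threads ran in', len(pids), 'process')
```

In terms of the analogy, the three workers all belong to the same workshop and share its resources.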
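Let's again use crawling the Douban Top 250 movies as the example (see https://www.jianshu.com/p/c1f57ab65c60).
In that earlier test the crawler was single-threaded, so it worked serially and struggled when there was a lot of data to crawl.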
Optimization 1: multithreading
Use the threading and queue modules:
import threading
import queue
Create a pool of threads and use the producer-consumer pattern:
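thcounts = 1               # number of worker threads
threads = []
q = queue.Queue()
for url in urls:           # producer side: enqueue every page URL up front
    q.put(url)
for i in range(thcounts):
    # Db_moives is the threading.Thread subclass defined in the full code
    threads.append(Db_moives(q))
start1_time = time.time()
for t in threads:
    t.start()
for t in threads:          # wait for every worker to finish
    t.join()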
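A worker that drains a queue shared by several threads should avoid checking `empty()` and then calling `get()`: the two calls are not atomic, so another thread can take the last item in between and leave the worker blocked forever. Below is a minimal offline sketch of a non-blocking worker loop. `handle_page` is a hypothetical stand-in for the real download-and-parse step, and no network access is made.

```python
import queue
import threading

def handle_page(url):
    # Hypothetical stand-in for downloading and parsing a page
    return len(url)

def worker(q, results, lock):
    while True:
        try:
            url = q.get_nowait()      # raises queue.Empty instead of blocking
        except queue.Empty:
            return                    # nothing left: let the thread finish
        value = handle_page(url)
        with lock:
            results[url] = value

urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 100, 25)]
q = queue.Queue()
for url in urls:
    q.put(url)

results, lock = {}, threading.Lock()
threads = [threading.Thread(target=worker, args=(q, results, lock)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), 'pages handled')
```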
Full code for reference:
import queue
import re
import threading
import time

import requests
from lxml import etree

# Send a browser-like User-Agent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
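# Worker thread: takes list-page URLs from the queue and crawls every movie linked on each page
class Db_moives(threading.Thread):
    def __init__(self, q):
        threading.Thread.__init__(self)
        self.q = q

    def run(self):
        while True:
            try:
                # get_nowait() avoids the race between empty() and get()
                # when several threads share the same queue
                url = self.q.get_nowait()
            except queue.Empty:
                break
            print(url)
            html = requests.get(url, headers=headers)
            xdata = etree.HTML(html.text)
            # Detail-page URL of each movie on this list page
            moive_urls = xdata.xpath('//div[@class="item"]/div[@class="pic"]/a[1]/@href')
            for moive_url in moive_urls:
                self.get_info(moive_url)  # fetch and parse the detail page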
    # Fetch one movie's detail page and extract its fields
    def get_info(self, url):
        try:
            html = requests.get(url, headers=headers)
            xdata = etree.HTML(html.text)
            name = xdata.xpath('//div[@id="wrapper"]//h1/span/text()')[0]
            year = xdata.xpath('//div[@id="wrapper"]//h1/span/text()')[1][1:5]
            director = xdata.xpath('//div[@id="info"]/span[1]/span[2]/a/text()')[0]
            actor = xdata.xpath('//div[@id="info"]//span[@class="actor"]//a/text()')[0]  # first lead actor only
            styles = xdata.xpath('//div[@id="info"]//span[@property="v:genre"]/text()')
            style = '-'.join(styles)  # list -> str
            # Country/region of production
            country = re.findall('<span class="pl">制片国家/地区:</span> (.*?)<br/>', html.text, re.S)[0]
            # Language
            language = re.findall(' <span class="pl">语言:</span> (.*?)<br/>', html.text, re.S)[0].replace(' / ', '-')
            # Release date
            release_time = re.findall(
                '<span class="pl">上映日期:</span> <span property="v:initialReleaseDate" content=".*?">(.*?)</span>',
                html.text, re.S)[0]
            # Running time (named runtime so it does not shadow the time module)
            runtime = re.findall(
                '<span class="pl">片长:</span> <span property="v:runtime" content=".*?">(.*?)</span>.*?<br/>',
                html.text, re.S)[0]
            # Alternative titles
            other_name = re.findall('<span class="pl">又名:</span> (.*?)<br/>', html.text, re.S)[0]
            score = xdata.xpath('//div[@id="interest_sectl"]//strong/text()')[0]
            values = [name, year, director, actor, style, country, language,
                      release_time, runtime, other_name, score]
            # The statement is only printed here; quoting values by hand like this
            # is unsafe if the statement is ever executed against a database
            insert_sub = 'insert into dbmoives values(0,%s)' % ','.join('"' + str(v) + '"' for v in values)
            print(insert_sub)
        except Exception as e:
            # A missing field raises IndexError; print the error and move on
            print(e)
if __name__ == "__main__":
    # First 4 list pages, 25 movies per page
    urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 100, 25)]
    thcounts = 1
    threads = []
    q = queue.Queue()
    for url in urls:
        q.put(url)
    for i in range(thcounts):
        threads.append(Db_moives(q))
    start1_time = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    end1_time = time.time()
    print('{} thread(s):'.format(thcounts), end1_time - start1_time)
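First, test with a single thread on the first 4 pages.
Then change to 4 threads and test again. The result is decent: the run took a little over 21 s in total.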
thcounts=4
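Why threads help here: a crawler spends most of its time waiting on the network. The sketch below imitates that wait with `time.sleep` instead of real requests (the 0.2 s delay and the names `fake_fetch` and `crawl` are made up for illustration), so its timings are not comparable to the 21 s measured above, but the pattern is the same.

```python
import queue
import threading
import time

def fake_fetch(page):
    time.sleep(0.2)               # stands in for network latency

def crawl(pages, thcounts):
    q = queue.Queue()
    for page in pages:
        q.put(page)

    def worker():
        while True:
            try:
                page = q.get_nowait()
            except queue.Empty:
                return
            fake_fetch(page)

    threads = [threading.Thread(target=worker) for _ in range(thcounts)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

pages = list(range(4))
one_thread = crawl(pages, 1)
four_threads = crawl(pages, 4)
print('1 thread : %.2f s' % one_thread)
print('4 threads: %.2f s' % four_threads)
```

With one thread the four waits happen one after another; with four threads they overlap, so the total is close to a single wait.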
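Optimization 2: multiprocessing
Use the Pool class from the multiprocessing library:
from multiprocessing import Pool
Usage: create a process pool, specify the number of processes, then use map to apply the function to each item in the argument list:
p = Pool(processes=2)
p.map(get_url, urls)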
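Here is a minimal, network-free sketch of the same `Pool.map` call. `start_offset` is a hypothetical stand-in for `get_url`. The worker function is defined at module top level and the pool is created under the `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers in a fresh interpreter (for example Windows).

```python
from multiprocessing import Pool
from urllib.parse import urlparse, parse_qs

def start_offset(url):
    # Hypothetical worker: extract the "start" query parameter from a list-page URL
    return int(parse_qs(urlparse(url).query)['start'][0])

def main():
    urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 75, 25)]
    # The with-block shuts the worker processes down once the work is done
    with Pool(processes=2) as p:
        # map() distributes urls across the workers and returns results in input order
        offsets = p.map(start_offset, urls)
    print(offsets)

if __name__ == "__main__":
    main()
```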
Here I tested the first 3 pages; the improvement is fairly clear.
The test source code is as follows:
import re
import time
from multiprocessing import Pool

import requests
from lxml import etree

# Send a browser-like User-Agent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
# For one list page, collect each movie's detail-page URL and crawl it
def get_url(url):
    html = requests.get(url, headers=headers)
    xdata = etree.HTML(html.text)
    moive_urls = xdata.xpath('//div[@class="item"]/div[@class="pic"]/a[1]/@href')
    for moive_url in moive_urls:
        get_info(moive_url)  # fetch and parse the detail page
# Fetch one movie's detail page and extract its fields
# (the fields are extracted but not stored in this timing test)
def get_info(url):
    try:
        html = requests.get(url, headers=headers)
        xdata = etree.HTML(html.text)
        name = xdata.xpath('//div[@id="wrapper"]//h1/span/text()')[0]
        year = xdata.xpath('//div[@id="wrapper"]//h1/span/text()')[1][1:5]
        director = xdata.xpath('//div[@id="info"]/span[1]/span[2]/a/text()')[0]
        actor = xdata.xpath('//div[@id="info"]//span[@class="actor"]//a/text()')[0]  # first lead actor only
        styles = xdata.xpath('//div[@id="info"]//span[@property="v:genre"]/text()')
        style = '-'.join(styles)  # list -> str
        # Country/region of production
        country = re.findall('<span class="pl">制片国家/地区:</span> (.*?)<br/>', html.text, re.S)[0]
        # Language
        language = re.findall(' <span class="pl">语言:</span> (.*?)<br/>', html.text, re.S)[0].replace(' / ', '-')
        # Release date
        release_time = re.findall(
            '<span class="pl">上映日期:</span> <span property="v:initialReleaseDate" content=".*?">(.*?)</span>',
            html.text, re.S)[0]
        # Running time (named runtime so it does not shadow the time module)
        runtime = re.findall(
            '<span class="pl">片长:</span> <span property="v:runtime" content=".*?">(.*?)</span>.*?<br/>',
            html.text, re.S)[0]
        # Alternative titles
        other_name = re.findall('<span class="pl">又名:</span> (.*?)<br/>', html.text, re.S)[0]
        score = xdata.xpath('//div[@id="interest_sectl"]//strong/text()')[0]
    except Exception as e:
        # A missing field raises IndexError; print the error and move on
        print(e)
if __name__ == "__main__":
    # First 3 list pages, 25 movies per page
    urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 75, 25)]

    # Baseline: crawl the pages one after another
    start1_time = time.time()
    for url in urls:
        print(url)
        get_url(url)
    end1_time = time.time()
    print('Serial crawl', end1_time - start1_time)

    # 2 worker processes; the with-block shuts the pool down afterwards
    with Pool(processes=2) as p:
        start2_time = time.time()
        p.map(get_url, urls)
        end2_time = time.time()
    print('Parallel, 2 processes', end2_time - start2_time)

    # 5 worker processes
    with Pool(processes=5) as p:
        start3_time = time.time()
        p.map(get_url, urls)
        end3_time = time.time()
    print('Parallel, 5 processes', end3_time - start3_time)
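Summary: Python multithreading is often criticized as "fake" multithreading. In CPython the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, although threads still help with I/O-bound work such as crawling, because the lock is released while a thread waits on the network. You can search for other benchmarks on this point. More threads or processes is also not always better; it depends on what the program actually does. In the multithreading example above, if we crawl only one page, adding more threads has no effect, because the queue holds only that one page's URL.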
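The last point can be checked without any network access. In this sketch (the sleep and all names are illustrative assumptions) a single URL is queued for four worker threads, and only one of them ever receives work.

```python
import queue
import threading
import time

q = queue.Queue()
q.put('https://movie.douban.com/top250?start=0&filter=')   # only one page queued

busy_workers = set()
lock = threading.Lock()

def worker(name):
    while True:
        try:
            url = q.get_nowait()
        except queue.Empty:
            return                 # queue already drained: this thread does nothing
        time.sleep(0.1)            # stands in for downloading and parsing the page
        with lock:
            busy_workers.add(name)

threads = [threading.Thread(target=worker, args=('worker-%d' % i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(busy_workers), 'of', len(threads), 'threads did any work')
```

To benefit from more threads in a case like this, the unit of work placed on the queue would have to be smaller than a whole list page, for example the individual movie URLs.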