Python crawler: 简谱 (numbered-notation) scores from jianpuw.com
Scraping steps
- Fetch each page with the requests library, find the URL pattern, and loop over it
- Extract the sheet-music image link with a regular expression and assemble the full URL
- Download each 简谱 image and save it as a .jpg in a target folder
Notes:
- Sleep a random 1-4 s between requests to keep the site from banning the IP
- Wrap each page fetch in try/except with continue, so a single failure cannot interrupt the crawl (see the loop skeleton after this list)
- Pass re.S to the regular expression, since the HTML is full of newlines
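The first two precautions fit into a small loop skeleton; the page-id range here is the one worked out in the next section:

```python
import random
import time

for num in range(540432, 541728):  # page ids 540432--541727, found below
    url = 'http://www.jianpuw.com/htm/kr/{}.htm'.format(num)
    try:
        ...  # fetch the page and save its image (full code at the end)
    except Exception:
        continue  # a 404 or network error should not abort the whole crawl
    time.sleep(random.randint(1, 4))  # random 1-4 s pause against an IP ban
```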
The finished crawl produced 1293 sheet-music image files in total and took about 1 h.
Observing the pattern in the page URLs
Searching for 妆台秋思 shows the number 541108 in its URL. Every sheet-music page carries a different number (except the 404 pages), and the pattern is that the ids run from 540432 to 541727, 1295 pages in total.
The image URL sits in the src attribute of an img tag, so it can be pulled out with a regular expression:
<img.*?src="(.*?)".*?title="(.*?)"
The expression can be debugged first on the RegExr site:
https://regexr-cn.com/
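The same check can be run in Python directly. The fragment below is illustrative rather than copied from the real page, but it shows why re.S is needed:

```python
import re

# Illustrative fragment only -- the real page's attribute layout may differ
html = '''<img class="pu"
    src="../../jp/541108.jpg"
    title="妆台秋思">'''

pattern = '<img.*?src="(.*?)".*?title="(.*?)"'
print(re.findall(pattern, html))        # [] -- without re.S, '.' stops at '\n'
print(re.findall(pattern, html, re.S))  # [('../../jp/541108.jpg', '妆台秋思')]
```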
Scraping the sheet-music images
A plain requests.get() call is enough for the download.
For the filename, include the num from the URL and append 简谱 at the end, then write the bytes out as a .jpg. Opening the file in 'wb' mode writes binary data, creating the file if it is missing and overwriting it otherwise.
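In isolation, the download-and-save step looks like this; the image URL and title are hypothetical placeholders:

```python
import requests

num, title = 541108, '妆台秋思'                     # placeholder values
image_url = 'http://www.jianpuw.com/jp/541108.jpg'  # assumed final image URL

r = requests.get(image_url)
with open('{}-{}-简谱.jpg'.format(num, title), 'wb') as f:  # 'wb': create or overwrite
    f.write(r.content)  # raw bytes of the image
```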
The full code:
```python
import requests
import re
import time
import random

# Example page: http://www.jianpuw.com/htm/kr/541108.htm
def get_save_img(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"}
    req = requests.get(url=url, headers=headers)
    req.encoding = 'utf-8'
    html = req.text

    # Each match is a (src, title) tuple; re.S lets '.' cross newlines
    results = re.findall('<img.*?src="(.*?)".*?title="(.*?)"', html, re.S)
    for result in results:
        print(result)
    print(results[0][1])  # the piece title

    # src is a relative path like ../../jp/<num>.jpg: drop the leading
    # "../../" (6 characters) and prepend the site root
    url_root = "http://www.jianpuw.com/"
    get_url = results[0][0][6:]
    image_url = url_root + get_url  # the image link

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
    }
    r = requests.get(image_url, headers=headers)

    # Pull the numeric page id back out of the page URL
    url_num = re.findall(r'kr/(.*?)\.htm', url, re.S)[0]
    print(url_num)

    # Save as <num>-<title>-简谱.jpg
    img_name = './爬虫/Pictures/1/' + url_num + "-" + results[0][1] + '-简谱' + '.jpg'
    with open(img_name, mode="wb") as f:
        f.write(r.content)  # write the image bytes to the file

    x = random.randint(1, 4)  # random integer between 1 and 4 inclusive
    time.sleep(x)

if __name__ == '__main__':
    # Page ids run from 540432 to 541727 (1295 pages)
    num = 540432
    for i in range(1295):
        num = num + 1
        print(num)
        url = 'http://www.jianpuw.com/htm/kr/{}.htm'.format(num)
        try:
            get_save_img(url)
        except Exception:
            continue  # skip 404 pages and other failures
```
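Two spots are fragile: the [6:] slice assumes the src always begins with exactly "../../", and open() raises if the ./爬虫/Pictures/1/ folder does not exist yet. A small hardening sketch, assuming the relative src shape above:

```python
import os
from urllib.parse import urljoin

page_url = 'http://www.jianpuw.com/htm/kr/541108.htm'
src = '../../jp/541108.jpg'  # hypothetical relative src taken from the page

# urljoin resolves the relative path against the page URL -- no magic slicing
image_url = urljoin(page_url, src)  # -> 'http://www.jianpuw.com/jp/541108.jpg'

os.makedirs('./爬虫/Pictures/1/', exist_ok=True)  # create the folder tree once
```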