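Preface: batch crawling of music files from QQ Music. This involves parsing the site's JSON responses, disguising requests so that they look like browser traffic, and string manipulation.
Goal: download the music resources you want, including songs that normally require payment to download.
- Workflow
The workflow has two parts: site analysis and code implementation.
- Site analysis
Work backwards: start from the URL of a music file and trace back to where each of its parameters comes from.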
1. Music file URL
http://dl.stream.qqmusic.qq.com/C400003KExF60zMMGK.m4a?vkey=CB06A4F49AB76D6C336BEB5BF85B8B6694AE9CAFCA0FF8000C87984F69777F1AFA6A0159CFC497A7FB2CBB36833900A04C75ECE9FC8CE528&guid=9602668140&uin=0&fromtag=66
Analysis of the playback link:
Only the following parameters differ from song to song:
1. File name
C400003KExF60zMMGK.m4a
With the fixed "C400" prefix and ".m4a" extension removed:
003KExF60zMMGK
2. vkey
vkey=F3263444D4844C31F3525B2FBA94935BF0466ACCE675A21B2EC5F599E6A42A812615BF4D83335B5EFE6989ED2BA08D161A00A319598BA6EE
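The breakdown above can be checked mechanically. The following is a minimal sketch, assuming the file name is always "C400" + songmid + ".m4a" as observed in the captured link; `split_song_url` is a hypothetical helper and the vkey below is a placeholder, not a real key.

```python
from urllib.parse import parse_qs, urlsplit


def split_song_url(url):
    """Return (filename, songmid, query parameters) for a stream URL."""
    parts = urlsplit(url)
    filename = parts.path.lstrip("/")  # e.g. C400003KExF60zMMGK.m4a
    # Assumption: the name is "C400" + songmid + ".m4a", as seen above.
    songmid = filename[len("C400"):-len(".m4a")]
    # parse_qs returns lists of values; keep the first value of each key.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return filename, songmid, params


# Placeholder vkey for illustration only.
url = ("http://dl.stream.qqmusic.qq.com/C400003KExF60zMMGK.m4a"
       "?vkey=PLACEHOLDER_VKEY&guid=9602668140&uin=0&fromtag=66")
filename, songmid, params = split_song_url(url)
print(filename, songmid, params["vkey"])
```

Running this on two different song links and comparing the results is a quick way to confirm which parts actually vary.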
2. On the player page, find these varying parameters and the URL that carries them
https://c.y.qq.com/base/fcgi-bin/fcg_music_express_mobile3.fcg?g_tk=5381&jsonpCallback=MusicJsonCallback20480960151150063&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&cid=205361747&callback=MusicJsonCallback20480960151150063&uin=0&songmid=003KExF60zMMGK&filename=C400003KExF60zMMGK.m4a&guid=9602668140
Link analysis:
Parameters that vary:
songmid: "003KExF60zMMGK"
referer: https://y.qq.com/portal/player.html
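Rather than pasting the whole query string by hand, the request URL can be assembled from a dictionary. This is a sketch only: the parameter values are copied from the captured link above, `build_vkey_url` is a hypothetical helper, and whether the server still accepts these values is not something the sketch verifies.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

VKEY_ENDPOINT = "https://c.y.qq.com/base/fcgi-bin/fcg_music_express_mobile3.fcg"
CALLBACK = "MusicJsonCallback20480960151150063"


def build_vkey_url(songmid, guid="9602668140"):
    """Build the vkey request URL; only songmid and filename vary per song."""
    params = {
        "g_tk": "5381",
        "jsonpCallback": CALLBACK,
        "loginUin": "0",
        "hostUin": "0",
        "format": "json",
        "inCharset": "utf8",
        "outCharset": "utf-8",
        "notice": "0",
        "platform": "yqq",
        "needNewCode": "0",
        "cid": "205361747",
        "callback": CALLBACK,
        "uin": "0",
        "songmid": songmid,
        "filename": "C400{0}.m4a".format(songmid),
        "guid": guid,
    }
    return VKEY_ENDPOINT + "?" + urlencode(params)


vkey_url = build_vkey_url("003KExF60zMMGK")
query = parse_qs(urlsplit(vkey_url).query)
print(query["songmid"], query["filename"])
```

Keeping the parameters in a dictionary also avoids copy-paste damage to the query string, such as a lost "&" before "notice".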
3. From the song list, find the varying parameters and the URL
https://c.y.qq.com/qzone/fcg-bin/fcg_ucc_getcdinfo_byids_cp.fcg?type=1&json=1&utf8=1&onlysong=0&disstid=1480619034&format=jsonp&g_tk=5381&jsonpCallback=playlistinfoCallback&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0
Link analysis:
Parameters that vary:
disstid=1480619034
The disstid comes from the entry link (step 4).
referer: https://y.qq.com/n/yqq/playsquare/1480619034.html
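These endpoints are requested with format=jsonp, so the body is JSON wrapped in a callback call such as playlistinfoCallback({...}). Below is a minimal sketch of unwrapping it; `parse_jsonp` is a hypothetical helper, and the sample payloads are made up to mirror the keys used in this article. Note that `str.strip("playlistinfoCallback()")` removes a set of characters rather than a prefix, so it only works when nothing (such as a trailing semicolon) follows the closing parenthesis.

```python
import json


def parse_jsonp(text):
    """Return the JSON payload of a JSONP (or plain JSON) response body."""
    text = text.strip()
    # Plain JSON needs no unwrapping.
    if text.startswith(("{", "[")):
        return json.loads(text)
    start = text.find("(")    # end of the callback name
    end = text.rfind(")")     # closing parenthesis of the callback call
    if start == -1 or end < start:
        raise ValueError("not a JSONP response")
    return json.loads(text[start + 1:end])


# Made-up sample bodies for illustration.
wrapped = 'playlistinfoCallback({"cdlist": [{"songlist": []}]})'
with_semicolon = 'getPlaylist({"data": {"list": []}});'
print(parse_jsonp(wrapped))
print(parse_jsonp(with_semicolon))
```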
4. From the playlist index, find the varying parameters and the URL
https://c.y.qq.com/splcloud/fcgi-bin/fcg_get_diss_by_tag.fcg?picmid=1&rnd=0.7709971027608087&g_tk=5381&jsonpCallback=getPlaylist&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&categoryId=10000000&sortId=5&sin=0&ein=29
Parameters that vary:
sin=0
ein=29
sum=5260
referer: https://y.qq.com/portal/playlist.html
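The sin and ein values look like the first and last index of one page of playlists (0 to 29 is 30 entries), and sum looks like the total count. Under those assumptions, which the captured link suggests but does not prove, the page windows can be generated as below; `page_ranges` is a hypothetical helper.

```python
def page_ranges(total, page_size=30):
    """Yield (sin, ein) pairs covering indices 0 .. total - 1.

    Assumption: ein is inclusive, so the first page is (0, 29).
    """
    for sin in range(0, total, page_size):
        # Clamp the last page so that ein never passes the final index.
        yield sin, min(sin + page_size, total) - 1


# Small made-up total to show the shape of the output.
pages = list(page_ranges(65))
print(pages)
```

Each pair can then be substituted into the sin and ein parameters of the entry link.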
With both the entry point and the exit point identified, the code can now be written.
- Code
import json
import os
import time

import requests

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"


def load_jsonp(text):
    # Return the JSON payload, dropping the "callback(" ... ")" wrapper if present.
    text = text.strip()
    if not text.startswith("{"):
        text = text[text.index("(") + 1:text.rindex(")")]
    return json.loads(text)


def get_Disstid(url):
    # requests fills in the Host header from each URL, so it is not set by hand.
    headers = {
        "User-Agent": USER_AGENT,
        "Referer": "https://y.qq.com/portal/playlist.html",
    }
    # 1. Request the entry URL to get the dissid of every playlist on this page.
    playlists = load_jsonp(requests.get(url, headers=headers).text)
    for playlist in playlists["data"]["list"]:
        # Build the playlist-detail URL from the dissid.
        sub_url = "https://c.y.qq.com/qzone/fcg-bin/fcg_ucc_getcdinfo_byids_cp.fcg?type=1&json=1&utf8=1&onlysong=0&disstid={0}&format=jsonp&g_tk=5381&jsonpCallback=playlistinfoCallback&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0".format(playlist["dissid"])
        # 2. Request the playlist to get the songmid and songname of each song.
        headers["Referer"] = "https://y.qq.com/n/yqq/playsquare/{0}.html".format(playlist["dissid"])
        detail = load_jsonp(requests.get(sub_url, headers=headers).text)
        for track in detail["cdlist"][0]["songlist"]:
            songmid = track["songmid"]
            songname = "C400{0}.m4a".format(songmid)
            song = track["songname"]
            key_url = "https://c.y.qq.com/base/fcgi-bin/fcg_music_express_mobile3.fcg?g_tk=5381&jsonpCallback=MusicJsonCallback20480960151150063&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&cid=205361747&callback=MusicJsonCallback20480960151150063&uin=0&songmid={0}&filename={1}&guid=9602668140".format(songmid, songname)
            # 3. Request the player endpoint to get the vkey of the song.
            key_headers = {"User-Agent": USER_AGENT, "Referer": "https://y.qq.com/portal/player.html"}
            key_data = load_jsonp(requests.get(key_url, headers=key_headers).text)
            for item in key_data["data"]["items"]:
                vkey = item["vkey"]
                song_url = "http://dl.stream.qqmusic.qq.com/{0}?vkey={1}&guid=9602668140&uin=0&fromtag=66".format(songname, vkey)
                # 4. Download the music file; no Referer is sent for this request.
                res = requests.get(song_url, headers={"User-Agent": USER_AGENT}, stream=True)
                filename = "music/{0}.m4a".format(song)
                print(song)
                with open(filename, "wb") as f:
                    for chunk in res.iter_content(chunk_size=8192):
                        f.write(chunk)


if __name__ == '__main__':
    os.makedirs("music", exist_ok=True)  # the output directory must exist
    sin = 0
    ein = 29
    total = 5620  # total used by the original script (the analysis above lists sum=5260)
    while sin < total:
        url = "https://c.y.qq.com/splcloud/fcgi-bin/fcg_get_diss_by_tag.fcg?picmid=1&rnd=0.7709971027608087&g_tk=5381&jsonpCallback=getPlaylist&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&categoryId=10000000&sortId=5&sin={0}&ein={1}".format(sin, ein)
        get_Disstid(url)  # one call per page
        sin += 30
        ein += 30
        time.sleep(1)  # pause between pages
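One practical detail when saving files under the song title: titles can contain characters such as "/" or "?" that are not valid in file names on some systems. The following is a small sketch of one way to handle that; `safe_filename` is a hypothetical helper and not part of the script above, and the set of replaced characters is an assumption based on the characters Windows rejects.

```python
import re


def safe_filename(name, replacement="_"):
    """Replace characters that are not allowed in file names on Windows."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


# Made-up title for illustration.
print(safe_filename("AC/DC: Back?"))
```

The result can be used in place of the raw title when building the "music/{0}.m4a" path.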
The result is shown in the figure: