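Preface
I came across the free English course from Ivy Dad (常青藤爸爸) on WeChat and liked it, but content inside WeChat cannot be downloaded directly. I wanted the audio and the class notes so I could study more conveniently. I don't use Python much, and every time I want to write a small tool I end up looking up the APIs from scratch, which is tiresome, so I am writing these notes for future reference.
I analyzed the network traffic with Charles and picked out a few key requests:
- Course list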
https://m.ivydad.com/api/mobile/knowlege/course/getCourseLessonList?device_id=29b54d7b19da2144c2186dd11e56dcff6525&uid=76a1da0be2ffae4295eac41e63dd764d&token=token-wechat-e6b3dc325dc615465958a9959579409b&version=+27f6576&appid=wx946e48ec53acefad&openid=otXpZxCp-rLc0SeOXMDbJJiAzdDo&app_name=knowlege&sys_type=h5&id=52&limit=10&offset=20&orderBy=DESC&isNew=1
- Replace limit=10&offset=20 with the page size and offset you want
- Course details
https://m.ivydad.com/api/mobile/knowlege/course/getLessonDetailById2?device_id=29b54d7b19da2144c2186dd11e56dcff6525&uid=76a1da0be2ffae4295eac41e63dd764d&token=token-wechat-fa27a7cf67c3264a4d688cc62e0f75b5&version=+27f6576&appid=wx946e48ec53acefad&openid=otXpZxCp-rLc0SeOXMDbJJiAzdDo&app_name=knowlege&sys_type=h5&lessonId=1071&isAdminCheck=&includes=hasBought,playList,courseDetail&isNew=1
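Editing limit and offset by hand gets tedious once there are several pages. As a minimal sketch, the query string can be rewritten with the standard library's urllib.parse. The helper name with_paging and the shortened example URL are illustrative assumptions, not part of the captured requests.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_paging(url, limit, offset):
    """Return url with its limit/offset query parameters replaced.

    Hypothetical helper for illustration. All other parameters keep
    their original order; limit and offset are appended at the end.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    params = [(k, v) for k, v in params if k not in ("limit", "offset")]
    params += [("limit", str(limit)), ("offset", str(offset))]
    return urlunsplit(parts._replace(query=urlencode(params)))


# Shortened stand-in for the real course list URL (credentials omitted).
example = (
    "https://m.ivydad.com/api/mobile/knowlege/course/getCourseLessonList"
    "?id=52&limit=10&offset=20&orderBy=DESC&isNew=1"
)
print(with_paging(example, 50, 0))
```

Calling it with a larger limit and offset 0 would fetch more lessons in a single request, assuming the server accepts that page size.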
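Writing the Python script
The analysis showed that the course list request (1) returns the audio information as JSON, so the audio is easy to get. The class notes, however, sit in the HTML returned by the course details request (2). I looked at that HTML and could not find a pattern for extracting them directly, so I implemented the audio download first.
- Key Python techniques
- Parsing JSON with the json module
- Downloading with urllib3; I later learned that requests is built on top of urllib3
- The re module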
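The pieces in the list above fit together roughly like this: json parses the saved response, and re strips characters from the title so it can double as a folder name. This is only a sketch under assumptions: the sample record is made up (only the field names title and lesson_url come from the real response), and clean_title is a hypothetical helper name.

```python
import json
import re

# Made-up sample in the shape the script expects: a list of lessons,
# each with a "title" and a "lesson_url".
sample = '[{"title": "1、Hello. World", "lesson_url": "http://example.com/a.mp3"}]'


def clean_title(title):
    """Drop the separators '、', '|', '.' and spaces from a lesson title."""
    return re.sub(r'[、|. ]', '', title)


for lesson in json.loads(sample):
    # The cleaned title is what the script uses as the folder name.
    print(clean_title(lesson["title"]), lesson["lesson_url"])
```

Inside a character class, | and . are literal characters, so the pattern removes exactly those four characters and nothing else.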
Here is the code, for future use.
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import json
import os
import re
from pathlib import Path

import urllib3


def download_mp3(url, dir_name):
    """Download the audio at url to <dir_name>/<dir_name>.mp3."""
    # Example url: "http://service.ivydad.com/tmp/ivy/knowlege/audio/702268d1479cd44aaf49b224cb8fa277/ivy.mp3"
    name = os.path.basename(dir_name) + ".mp3"
    dir_name = Path(dir_name)
    # Create the target folder if it does not exist yet
    if not dir_name.exists():
        os.makedirs(dir_name)
    path = os.path.join(dir_name, name)
    http = urllib3.PoolManager()
    response = http.request('GET', url=url)
    with open(path, 'wb') as f:
        f.write(response.data)
    response.release_conn()


def handle_item(dic):
    """Download the audio for one lesson entry from the course list."""
    print(dic)
    title = dic["title"]
    # Strip '、', '|', '.' and spaces so the title can be used as a folder name
    title = re.sub(r'[、|. ]', '', title)
    download_mp3(dic["lesson_url"], title)


# json.json is the saved response of the course list request
with open('json.json', 'r', encoding='utf-8') as load_f:
    load_dict = json.load(load_f)

for item in load_dict:
    handle_item(item)
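json.json holds the result of the course list request; opening that URL directly in a browser returns the data.
Addendum: I stumbled on a way to get the notes from the course details
Change the course details request to: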
https://m.ivydad.com/api/mobile/knowlege/course/getLessonDetailById2?device_id=29b54d7b19da2144c2186dd11e56dcff6525&uid=76a1da0be2ffae4295eac41e63dd764d&token=token-wechat-fa27a7cf67c3264a4d688cc62e0f75b5&version=+27f6576&appid=wx946e48ec53acefad&openid=otXpZxCp-rLc0SeOXMDbJJiAzdDo&app_name=knowlege&lessonId=1071
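Substitute the lessonId you want. The response is small, and nearly everything in it is useful. It is JSON: fetch it, parse it, and you get the URLs of the notes. Loop over the lessons and you are done.
- Download the images with requests
- Parse the HTML with BeautifulSoup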
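To make the extraction step concrete: the lessonDetail.detail field of the response holds an HTML fragment, and the notes are the img tags inside it. The sketch below swaps in the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs. The sample payload and the names ImgCollector and image_urls are assumptions for illustration.

```python
import json
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(response_text):
    """Parse the JSON response and return the image URLs in the note HTML."""
    detail = json.loads(response_text)["lessonDetail"]["detail"]
    collector = ImgCollector()
    collector.feed(detail)
    collector.close()
    return collector.sources


# Made-up payload; only the lessonDetail/detail keys come from the real API.
payload = json.dumps({
    "lessonDetail": {
        "detail": '<p><img src="http://example.com/1.jpg">'
                  '<img src="http://example.com/2.jpg"/></p>'
    }
})
print(image_urls(payload))
```

Each returned URL can then be fetched and written to disk in order, which is what the full script does with requests.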
The complete code for downloading both the images and the audio:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import json
import os
import re
from pathlib import Path

import requests
import urllib3
from bs4 import BeautifulSoup


def create_dir(dir_name):
    """Create dir_name if it does not exist yet."""
    dir_name = Path(dir_name)
    if not dir_name.exists():
        os.makedirs(dir_name)


def download_mp3(url, dir_name):
    """Download the audio to <dir_name>/<dir_name>.mp3, skipping existing files."""
    # Example url: "http://service.ivydad.com/tmp/ivy/knowlege/audio/702268d1479cd44aaf49b224cb8fa277/ivy.mp3"
    name = os.path.basename(dir_name) + ".mp3"
    path = os.path.join(dir_name, name)
    if os.path.exists(path):
        return
    http = urllib3.PoolManager()
    response = http.request('GET', url=url)
    with open(path, 'wb') as f:
        f.write(response.data)
    response.release_conn()


def parse_lesson(lesson_id, dir_name):
    """Fetch the lesson details and download every image in the notes."""
    url = "https://m.ivydad.com/api/mobile/knowlege/course/getLessonDetailById2?device_id=29b54d7b19da2144c2186dd11e56dcff6525&uid=76a1da0be2ffae4295eac41e63dd764d&token=token-wechat-fa27a7cf67c3264a4d688cc62e0f75b5&version=+27f6576&appid=wx946e48ec53acefad&openid=otXpZxCp-rLc0SeOXMDbJJiAzdDo&app_name=knowlege&lessonId=" + lesson_id
    resp = requests.get(url)
    json_dic = json.loads(resp.text)
    # The notes are an HTML fragment stored in lessonDetail.detail
    detail = json_dic['lessonDetail']['detail']
    soup = BeautifulSoup(detail, features="html.parser")
    # Images are saved as <dir_name>_0.jpg, <dir_name>_1.jpg, ...
    for index, img in enumerate(soup.find_all('img')):
        src = img.attrs['src']
        path = os.path.join(dir_name, os.path.basename(dir_name) + '_' + str(index) + '.jpg')
        if os.path.exists(path):
            continue
        response = requests.get(src)
        with open(path, "wb") as f:
            f.write(response.content)


def handle_item(dic):
    """Download the audio and the note images for one lesson entry."""
    title = dic["title"]
    # Strip '、', '|', '.' and spaces so the title can be used as a folder name
    title = re.sub(r'[、|. ]', '', title)
    create_dir(title)
    download_mp3(dic["lesson_url"], title)
    lesson_id = dic['id']
    parse_lesson(str(lesson_id), title)


# json.json is the saved response of the course list request
with open('json.json', 'r', encoding='utf-8') as load_f:
    load_dict = json.load(load_f)

for item in load_dict:
    print("Processing " + item["title"])
    handle_item(item)