Python: Fetching Zhihu Answers and Converting Them to Markdown Files

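First, a disclaimer: this code is not original. I wrote it by following an article on Mr. Cui's blog, and most of the code is copied directly. The original article is here: https://cuiqingcai.com/4607.html
The original project was written in Python 2, though, so I ported it to Python 3.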

Inspecting the page requests and looking for a pattern

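Open any Zhihu question, for example 知乎-男生 25 岁了，应该明白哪些道理？ (roughly: "Guys who are 25: what should you have figured out by now?").
Then open the developer tools. Most of the text on the page turns out to come from this API: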
https://www.zhihu.com/api/v4/questions/37400041/answers?include=data%5B%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%5D.mark_infos%5B%5D.url%3Bdata%5B%5D.author.follower_count%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=3&limit=20&sort_by=default

In the original project's code, however, I found that the author used a much shorter API URL:
https://www.zhihu.com/api/v4/questions/37400041/answers?include=data[*].content,voteup_count,created_time&offset=0&limit=20&sort_by=default
I even asked the original author on GitHub how he had found this endpoint, but he said he had forgotten.

Fine, never mind. Let's just use this endpoint.
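
To fetch other questions or pages, the short URL above can be assembled from its parts. The following is only a sketch: the helper name build_answers_url is made up for illustration, and the parameter values are simply the ones visible in the URL above.

```python
from urllib.parse import urlencode

API_ROOT = "https://www.zhihu.com/api/v4/questions/{question_id}/answers"


def build_answers_url(question_id, offset=0, limit=20):
    """Build the short answers URL for one page (hypothetical helper)."""
    query = urlencode(
        {
            "include": "data[*].content,voteup_count,created_time",
            "offset": offset,
            "limit": limit,
            "sort_by": "default",
        },
        # Keep these characters unescaped so the result matches the URL above
        safe="[]*,",
    )
    return API_ROOT.format(question_id=question_id) + "?" + query


print(build_answers_url(37400041))
```

Changing question_id selects a different question, while offset and limit select which slice of answers is requested.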

Parsing the API data

Looking at the JSON returned by the endpoint above, the structure is as follows:

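In paging, previous is the URL of the previous AJAX request and next is the URL of the next one
is_end indicates whether this is the last request
is_start indicates whether this is the first request
data holds 20 entries, each of them an answer to this question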
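
As a rough picture of that structure, here is a hand-written, abbreviated page. The field names come from the description above; every value is a placeholder, not real API output, and the helper next_url_or_none is hypothetical.

```python
# Abbreviated, hand-written stand-in for one page of the response.
sample_page = {
    "paging": {
        "is_start": True,   # this is the first request
        "is_end": False,    # more pages follow
        "previous": "<URL of the previous page>",
        "next": "<URL of the next page>",
    },
    "data": [
        # One entry per answer; a real page carries up to `limit` of these
        {"id": 1, "voteup_count": 10, "created_time": 0, "content": "<p>answer HTML</p>"},
    ],
}


def next_url_or_none(page):
    """Return the URL of the following page, or None on the last page."""
    paging = page.get("paging", {})
    if paging.get("is_end", True):
        return None
    return paging.get("next")


print(next_url_or_none(sample_page))
```
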
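So the parsing function can work like this: parse the entries in data first, then check whether this is the last page (is_end); if it is not, call the function recursively to keep parsing.
Note also that is_start is true for this URL, so it really is the first request. Following next from here therefore retrieves all the data, and there seems to be no need to split the requests into two parts as the original article describes.
The code is as follows: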

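    # Assumed to appear at the top of the module:
    #   import json
    #   import requests
    #   from requests.exceptions import RequestException
    # `headers` is a module-level dict of request headers.

    def request(self, url):
        try:
            response = requests.get(url=url, headers=headers)
            if response.status_code == 200:
                # Parse this page first, whether or not it is the last one
                content = json.loads(response.text)
                self.parse_content(content)
                # If this is not the last page, request and parse the next one recursively
                paging = content.get('paging') or {}
                next_page_url = paging.get('next')
                if next_page_url and not paging.get('is_end'):
                    # Upgrade http:// to https:// without turning https:// into httpss://
                    if next_page_url.startswith('http://'):
                        next_page_url = 'https://' + next_page_url[len('http://'):]
                    self.request(next_page_url)

            return None
        except RequestException:
            print('Request failed for URL:', url)
            return None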
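
The same pagination can also be written as a loop instead of recursion, which avoids running into Python's recursion limit (1000 frames by default) on questions with a very large number of pages. This is only a sketch: iter_pages and fetch_json are hypothetical names, and fetch_json stands in for whatever performs the HTTP request and returns the decoded JSON.

```python
def iter_pages(first_url, fetch_json):
    """Yield each page of answers, following paging.next until is_end."""
    url = first_url
    while url:
        page = fetch_json(url)
        yield page
        paging = page.get("paging") or {}
        if paging.get("is_end", True):
            break
        url = paging.get("next")


# Fake two-page "API" so the sketch runs without network access
fake_pages = {
    "page-1": {"paging": {"is_end": False, "next": "page-2"}, "data": ["a", "b"]},
    "page-2": {"paging": {"is_end": True, "next": "page-3"}, "data": ["c"]},
}
answers = []
for page in iter_pages("page-1", fake_pages.__getitem__):
    answers.extend(page["data"])
print(answers)
```

In the real script, each yielded page would be handed to parse_content instead of being collected into a list.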

Converting the content to Markdown

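I copied this part almost verbatim and did not study it closely. From a quick read, the approach is: use the html2text function from the html2text module to convert the HTML into Markdown text, tidy up the formatting with regular expressions, then use another regular expression to find the image links and replace them with local image paths.
The code is a bit long: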

    def parse_content(self, content):
        if 'data' in content.keys():
            for data in content.get('data'):
                parsed_data = self.parse_data(data)
                self.transform_to_markdown(parsed_data)

    def parse_data(self, content):
        data = {}
        answer_content = content.get('content')
        # print('content =', content)

        author_name = content.get('author').get('name')
        print('author_name =', author_name)
        answer_id = content.get('id')
        question_id = content.get('question').get('id')
        question_title = content.get('question').get('title')
        vote_up_count = content.get('voteup_count')
        create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(content.get('created_time')))

        content = html_template(answer_content)
        soup = BeautifulSoup(content, 'lxml')
        answer = soup.find("body")

        soup.body.extract()
        soup.head.insert_after(soup.new_tag("body", **{'class': 'zhi'}))

        soup.body.append(answer)

        img_list = soup.find_all("img", class_="content_image lazy")
        for img in img_list:
            img["src"] = img["data-actualsrc"]
        img_list = soup.find_all("img", class_="origin_image zh-lightbox-thumb lazy")
        for img in img_list:
            img["src"] = img["data-actualsrc"]
        noscript_list = soup.find_all("noscript")
        for noscript in noscript_list:
            noscript.extract()

        data['content'] = soup
        data['author_name'] = author_name
        data['answer_id'] = answer_id
        data['question_id'] = question_id
        data['question_title'] = question_title
        data['vote_up_count'] = vote_up_count
        data['create_time'] = create_time
        return data

    def transform_to_markdown(self, data):
        content = data['content']
        author_name = data['author_name']
        answer_id = data['answer_id']
        question_id = data['question_id']
        question_title = data['question_title']

        vote_up_count = data['vote_up_count']
        create_time = data['create_time']

        file_name = 'vote[%d]_%s_answer.md' % (vote_up_count, author_name)

        folder_name = question_title

        # Create the question folder if it does not exist yet
        question_dir = os.path.join(os.getcwd(), folder_name)
        os.makedirs(question_dir, exist_ok=True)

        answer_path = os.path.join(os.getcwd(), folder_name, file_name)
        with open(answer_path, 'w+', encoding='utf-8') as f:
            # f.write("-" * 40 + "\n")
            origin_url = 'https://www.zhihu.com/question/{}/answer/{}'.format(question_id, answer_id)
            # print('origin_url =', origin_url)
            f.write("### Original answer URL: " + origin_url + "\n")
            f.write("### question_title: " + question_title + "\n")
            f.write("### Author_Name: " + author_name + "\n")
            f.write("### Answer_ID: %d" % answer_id + "\n")
            f.write("### Question_ID: %d" % question_id + "\n")
            f.write("### VoteCount: %s" % vote_up_count + "\n")
            f.write("### Create_Time: " + create_time + "\n")
            f.write("-" * 40 + "\n")

            # str() serializes the BeautifulSoup document back to an HTML string
            text = html2text.html2text(str(content))
            # Headings (bold text): strip stray whitespace inside **...**
            r = re.findall(r'\*\*(.*?)\*\*', text, re.S)
            for i in r:
                if i != " ":
                    text = text.replace(i, i.strip())

            # Strip stray whitespace inside _..._
            r = re.findall(r'_(.*)_', text)
            for i in r:
                if i != " ":
                    text = text.replace(i, i.strip())
            text = text.replace('_ _', '')
            # Rewrite the '_b.' suffix in image file names to '_r.'
            text = text.replace('_b.', '_r.')
            # Images: download each one and point the link at the local copy
            r = re.findall(r'!\[\]\((?:.*?)\)', text)
            for i in r:
                text = text.replace(i, i + "\n\n")
                folder_name = '%s/image' % os.getcwd()
                if not os.path.exists(folder_name):
                    os.mkdir(folder_name)
                img_url = re.findall(r'\((.*)\)', i)[0]
                save_name = img_url.split('/')[-1]
                file_path = '%s/%s' % (folder_name, save_name)

                try:
                    # Separate name so the `content` soup above is not overwritten
                    image_bytes = self.download_image(img_url)
                    if image_bytes:
                        self.save_image(image_bytes, file_path)
                except Exception as e:
                    print(e)
                else:  # no exception: rewrite the link to the local path
                    text = text.replace(img_url, file_path)

            # The with block closes the file, so no explicit f.close() is needed
            f.write(text)
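
The image-link step can be sketched in isolation with re.sub and a replacement function, which handles each link exactly once even when the same URL appears several times. This is a simplified variant, not the code from the original project: the name localize_images and the sample URL are made up, and the actual downloading is left to the caller.

```python
import os
import re

# Matches Markdown images with empty alt text: ![](url)
IMAGE_LINK = re.compile(r'!\[\]\((.*?)\)')


def localize_images(markdown_text, image_dir):
    """Point every ![](url) at image_dir/<file name>; return (text, downloads)."""
    downloads = []

    def replace(match):
        url = match.group(1)
        local_path = os.path.join(image_dir, url.split('/')[-1])
        downloads.append((url, local_path))
        # Blank line after each image, as in the code above
        return '![](%s)\n\n' % local_path

    return IMAGE_LINK.sub(replace, markdown_text), downloads


# Hypothetical sample input
sample = 'before ![](https://example.com/pics/v2-abc_r.jpg) after'
new_text, downloads = localize_images(sample, 'image')
print(new_text)
print(downloads)
```

The returned list of (url, local_path) pairs tells the caller which files still need to be downloaded and where to save them.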

Results

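In Finder I sorted the files by name in descending order, so the answers can be browsed from most to fewest upvotes.
By the way, what on earth is the top-voted answer? It is obviously an ad: comments are disabled, it quotes some sappy line, and the upvote count must have been inflated. I reported it right away.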

Summary

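My original goal was a better way to read the Zhihu questions I follow, because some of the answers are genuinely valuable. Markdown files are clearly not a great fit, though, since reading the answers means opening the files one by one. What I would really like is something that reads like the Zhihu web page but needs no pagination: a single HTML page with all the content already loaded, like my earlier post "Python-给简书收藏加一个搜索功能" (adding a search feature to my Jianshu bookmarks).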
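For now, saving the scraped data to a local database and then loading it into a web page in one go looks like the better approach. To get there I will have to learn some front-end development. I would also like the page to save my reading position automatically when I leave, and to jump back to it the next time I open the page. Hopefully I can make that work.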
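
As a starting point for the database idea, here is a minimal sketch using the standard-library sqlite3 module. Nothing like this exists in the project yet: the table layout and the save_answers helper are assumptions, with column names borrowed from the dictionary built in parse_data (content would be stored as an HTML string).

```python
import sqlite3


def save_answers(conn, answers):
    """Insert or update answers; each one is a dict keyed like parse_data's output."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS answers ("
            "answer_id INTEGER PRIMARY KEY, question_id INTEGER, "
            "author_name TEXT, vote_up_count INTEGER, "
            "create_time TEXT, content TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO answers VALUES ("
            ":answer_id, :question_id, :author_name, "
            ":vote_up_count, :create_time, :content)",
            answers,
        )


# Placeholder rows, not real scraped data
conn = sqlite3.connect(":memory:")
save_answers(conn, [
    {"answer_id": 1, "question_id": 9, "author_name": "a",
     "vote_up_count": 5, "create_time": "2018-01-01 00:00:00", "content": "<p>x</p>"},
    {"answer_id": 2, "question_id": 9, "author_name": "b",
     "vote_up_count": 50, "create_time": "2018-01-02 00:00:00", "content": "<p>y</p>"},
])
rows = conn.execute(
    "SELECT author_name, vote_up_count FROM answers ORDER BY vote_up_count DESC"
).fetchall()
print(rows)
```

A page that shows everything at once could then be generated from a single query ordered by vote_up_count.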
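As for the code: downloading the images locally takes up most of the run time, so linking to the images' online URLs instead would be much faster. I had also wanted to use a multiprocessing process pool, as in the 街拍美图 (street-snap photos) project, but it does not seem applicable here: the URL of the next request comes from the response to the current one, so each response has to be parsed before the next request can be made.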
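
One option I have not tried: even though the pages have to be requested one after another, the image downloads do not depend on each other, so they could probably be run on a thread pool. A minimal sketch, assuming a download callable that takes a URL (download_all is a made-up name, and fake_download only stands in for a real network call):

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download, max_workers=8):
    """Run download(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download, urls))


def fake_download(url):
    # Stand-in for a real download: just report the length of the URL
    return len(url)


print(download_all(["a", "bb", "ccc"], fake_download))
```

Threads rather than processes are used here because downloading is I/O-bound work.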

Code on GitHub

https://github.com/mundane799699/PythonProjects/tree/master/TransformZhihuAnswersToMarkdown