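Tasks for this lesson
By building an OSS (Open Source Software) subscription agent hands-on, we learn how MetaGPT can solve problems that come up in day-to-day work.
The main tasks are:
- Implement two Actions for OSS:
  - Action 1: crawl the GitHub Trending page and extract each project's name, URL, and description.
  - Action 2: crawl the Hugging Face Papers page on your own. First collect the link to each paper (the href attribute of the title element), then follow each link to the paper's detail page (for example https://huggingface.co/papers/2312.03818) and extract the paper's title and abstract.
- Have OSS generate a table of contents for the summary, split the content into chunks by second-level heading, summarize each chunk, and assemble the results into a news digest.
- Have OSS send the digest to a notification channel on a schedule (try implementing email delivery).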
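The "table of contents plus chunking by second-level heading" step can be sketched with the standard library alone. This is a minimal illustration, not part of MetaGPT: the function names `build_toc` and `split_by_h2` and the sample report are assumptions made up for this example. Each chunk would then be passed to the LLM for its own summary.

```python
import re


def build_toc(markdown: str) -> list[str]:
    """Return the text of every second-level ('## ') heading, in order."""
    return [m.group(1).strip() for m in re.finditer(r"^## (.+)$", markdown, flags=re.MULTILINE)]


def split_by_h2(markdown: str) -> dict[str, str]:
    """Map each second-level heading to the text that follows it."""
    chunks: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {title: "\n".join(body).strip() for title, body in chunks.items()}


# Hypothetical report used only to exercise the two helpers.
report = """# Daily digest
## Today's Trends
Python and Rust lead today.
## Highlights of the List
1. Project1
"""
toc = build_toc(report)
chunks = split_by_h2(report)
```

Text that appears before the first second-level heading (here the top-level title) is deliberately left out of the chunks.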
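Steps for building a subscription agent with MetaGPT
As the figure above shows, building a subscription agent with MetaGPT involves the following steps:
- Implement the OSS agent (based on Role), together with the crawling Action and the analysis Action it needs.
- Implement the trigger, which decides when the agent runs its Actions such as crawling and analysis.
- Implement the callback, which decides what happens with the result, for example pushing it to Discord or WeChat, or sending it by email.
- Finally, connect the OSS agent, the trigger, and the callback so they work together. That is what SubscriptionRunner does.
You can also skip SubscriptionRunner and write your own loop around role.run(). SubscriptionRunner is simply a reusable pattern.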
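To make the trigger, agent, and callback roles concrete, here is a minimal sketch of the same pattern in plain asyncio. It does not use MetaGPT at all: `demo_trigger`, `demo_agent`, `run_subscription`, and `collect` are hypothetical names, and `demo_agent` merely stands in for what role.run() would do.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def demo_trigger(count: int, interval: float) -> AsyncIterator[str]:
    """Yield a task description `count` times, waiting `interval` seconds after each."""
    for i in range(count):
        yield f"https://github.com/trending (run {i + 1})"
        await asyncio.sleep(interval)


async def demo_agent(task: str) -> str:
    """Stand-in for role.run(): a real agent would crawl and summarize here."""
    return f"report for {task}"


async def run_subscription(
    trigger: AsyncIterator[str],
    agent: Callable[[str], Awaitable[str]],
    callback: Callable[[str], Awaitable[None]],
) -> None:
    """Wire the pieces together: each trigger event runs the agent, then the callback."""
    async for task in trigger:
        result = await agent(task)
        await callback(result)


received: list[str] = []


async def collect(report: str) -> None:
    """Callback that just stores the report; a real one would post or mail it."""
    received.append(report)


asyncio.run(run_subscription(demo_trigger(2, 0.01), demo_agent, collect))
```

The loop ends when the trigger is exhausted; a production trigger would usually run forever.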
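Implementation
Configuration
- Discord needs a global proxy to be configured
Add the following to key.yaml:
GLOBAL_PROXY: http://127.0.0.1:8181  # change this to your own proxy server address
- Set the environment variables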
export DISCORD_TOKEN=MTE5NzE4OTU2NzQ3Mjc0NjU1Ng.GqWXK2.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export DISCORD_CHANNEL_ID=1197190827886xxxxxx
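Since the callbacks read these variables with os.environ[...], a missing one only surfaces as a KeyError when the first message is sent. A small pre-flight check can report the problem earlier. This is a sketch; the helper name `missing_env_vars` is an assumption, not something provided by MetaGPT or discord.py.

```python
import os

REQUIRED_VARS = ("DISCORD_TOKEN", "DISCORD_CHANNEL_ID")


def missing_env_vars(names=REQUIRED_VARS, environ=None) -> list[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Check the real environment and report anything that still has to be exported.
missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dict as `environ` makes the helper easy to try without touching the real environment.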
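If you run the script from PyCharm, set the same variables in the run configuration.
For DISCORD_TOKEN, follow step 7 of the "Creating a Bot Account" section in the official discord.py documentation on Read the Docs and obtain the token from the page shown below. Copy the token as soon as it is generated.
DISCORD_CHANNEL_ID is the ID of the channel that the bot should post messages to, as shown below:
Code
The code below (oss.py) crawls GitHub Trending, summarizes it, and publishes the result to Discord and by email.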
import asyncio
import os
import fire
import discord
import aiohttp
from bs4 import BeautifulSoup
from typing import Any
from metagpt.actions import Action
from metagpt.config import CONFIG
from metagpt.environment import Environment
from metagpt.logs import logger
from metagpt.roles import Role
from metagpt.roles.role import RoleReactMode
from metagpt.schema import Message
from metagpt.subscription import SubscriptionRunner
class CrawlOSSTrending(Action):
    async def run(self, url: str = "https://github.com/trending"):
        # Download the trending page through the configured proxy
        async with aiohttp.ClientSession() as client:
            async with client.get(url, proxy=CONFIG.global_proxy) as response:
                response.raise_for_status()
                html = await response.text()
        soup = BeautifulSoup(html, 'html.parser')
        repositories = []
        for article in soup.select('article.Box-row'):
            repo_info = {'name': article.select_one('h2 a').text.strip().replace("\n", "").replace(" ", ""),
                         'url': "https://github.com" + article.select_one('h2 a')['href'].strip()}
            # Description
            description_element = article.select_one('p')
            repo_info['description'] = description_element.text.strip() if description_element else None
            # Language
            language_element = article.select_one('span[itemprop="programmingLanguage"]')
            repo_info['language'] = language_element.text.strip() if language_element else None
            # Stars and Forks
            stars_element = article.select('a.Link--muted')[0]
            forks_element = article.select('a.Link--muted')[1]
            repo_info['stars'] = stars_element.text.strip()
            repo_info['forks'] = forks_element.text.strip()
            # Today's Stars
            today_stars_element = article.select_one('span.d-inline-block.float-sm-right')
            repo_info['today_stars'] = today_stars_element.text.strip() if today_stars_element else None
            repositories.append(repo_info)
        return repositories


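class CrawlOSSHuggingfacePapers(Action):
    async def run(self, msg: str) -> str:
        # Placeholder for now: OssWatcher._act passes the previous message's content (a str).
        # The real crawler is filled in later, in the Hugging Face Papers section.
        logger.info(f"{msg}")
        return msg

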
TRENDING_ANALYSIS_PROMPT = """# Requirements
You are a GitHub Trending Analyst, aiming to provide users with insightful and personalized recommendations based on the latest
GitHub Trends. Based on the context, fill in the following missing information, generate engaging and informative titles,
ensuring users discover repositories aligned with their interests.
# The title about Today's GitHub Trending
## Today's Trends: Uncover the Hottest GitHub Projects Today! Explore the trending programming languages and discover key domains capturing developers' attention. From ** to **, witness the top projects like never before.
## The Trends Categories: Dive into Today's GitHub Trending Domains! Explore featured projects in domains such as ** and **. Get a quick overview of each project, including programming languages, stars, and more.
## Highlights of the List: Spotlight noteworthy projects on GitHub Trending, including new tools, innovative projects, and rapidly gaining popularity, focusing on delivering distinctive and attention-grabbing content for users.
---
# Format Example
\```
# [Title]
## Today's Trends
Today, ** and ** continue to dominate as the most popular programming languages. Key areas of interest include **, ** and **.
The top popular projects are Project1 and Project2.
## The Trends Categories
1. Generative AI
- [Project1](https://github/xx/project1): [detail of the project, such as star total and today, language, ...]
- [Project2](https://github/xx/project2): ...
...
## Highlights of the List
1. [Project1](https://github/xx/project1): [provide specific reasons why this project is recommended].
...
\```
---
# Github Trending
{trending}
"""


class AnalysisOSSTrending(Action):
    async def run(self, trending: Any):
        # Fill the prompt template with the crawled data and ask the LLM
        return await self._aask(TRENDING_ANALYSIS_PROMPT.format(trending=trending))


class OssWatcher(Role):
    name: str = "XiaoGang"
    profile: str = "OssWatcher"
    goal: str = "Generate an insightful GitHub Trending and Huggingface papers analysis report."
    constraints: str = "Only analyze based on the provided GitHub Trending and Huggingface papers data."

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Run the crawl Action first, then the analysis Action, in this fixed order
        self._init_actions([CrawlOSSTrending, AnalysisOSSTrending])
        self._set_react_mode(RoleReactMode.BY_ORDER.value)

    async def _act(self) -> Message:
        logger.info(f"{self._setting}: to do {self.rc.todo}")
        todo = self.rc.todo
        msg = self.get_memories(k=1)[0]  # find the most recent message
        new_msg = await todo.run(msg.content)
        msg = Message(content=str(new_msg), role=self.profile, cause_by=type(todo))
        self.rc.memory.add(msg)  # add the new message to memory
        return msg


async def discord_callback(msg: Message):
    intents = discord.Intents.default()
    intents.message_content = True
    client = discord.Client(intents=intents, proxy=CONFIG.global_proxy)
    token = os.environ["DISCORD_TOKEN"]
    channel_id = int(os.environ["DISCORD_CHANNEL_ID"])
    async with client:
        await client.login(token)
        channel = await client.fetch_channel(channel_id)
        # Send the report in pieces, starting a new Discord message at every heading
        lines = []
        for i in msg.content.splitlines():
            if i.startswith(("# ", "## ", "### ")):
                if lines:
                    await channel.send("\n".join(lines))
                    lines = []
            lines.append(i)
        if lines:
            await channel.send("\n".join(lines))


async def mail_callback(msg: Message):
    # AsyncMailer is the helper class defined in the email section of this article
    async_mailer = AsyncMailer()
    await async_mailer.send(os.environ["MAIL_SENDER"], os.environ["MAIL_RECEIVER"], 'GitHub Trending Analysis', msg.content)


async def oss_callback(discord: bool = True, mail: bool = True):
    callbacks = []
    if discord:
        callbacks.append(discord_callback)
    if mail:
        callbacks.append(mail_callback)
    if not callbacks:
        # No channel selected: fall back to printing the report
        async def _print(msg: Message):
            print(msg.content)
        callbacks.append(_print)

    async def callback(msg: Message):
        await asyncio.gather(*[cb(msg) for cb in callbacks])

    return callback


async def oss_trigger():
    while True:
        yield Message(content="https://github.com/trending")
        await asyncio.sleep(3600 * 24)  # fire once every 24 hours


async def main(discord: bool = True, mail: bool = True):
    runner = SubscriptionRunner()
    callback = await oss_callback(discord, mail)
    runner.model_rebuild()
    await runner.subscribe(OssWatcher(), oss_trigger(), callback)
    await runner.run()


if __name__ == "__main__":
    fire.Fire(main)
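The trigger above fires immediately and then every 24 hours, so the send time depends on when the script was started. If the digest should arrive at a fixed time of day instead, the sleep duration can be computed from the clock. The helper below is a sketch under that assumption: the name `seconds_until` is hypothetical, and it works on naive local datetimes without any time zone handling.

```python
from datetime import datetime, time, timedelta


def seconds_until(target: time, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of the wall-clock time `target`."""
    candidate = datetime.combine(now.date(), target)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return (candidate - now).total_seconds()


# Example: at 08:30 the next 09:00 run is 30 minutes away.
delay = seconds_until(time(9, 0), datetime(2024, 1, 18, 8, 30))
```

Inside a trigger, one possible use is to call `await asyncio.sleep(seconds_until(time(9, 0), datetime.now()))` before each `yield`.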
Logs
2024-01-18 00:39:45.138 | INFO | metagpt.const:get_metagpt_package_root:32 - Package root set to D:\workspace\sourcecode\MetaGPT
2024-01-18 00:39:45.281 | INFO | metagpt.config:get_default_llm_provider_enum:126 - API: LLMProviderEnum.ZHIPUAI
2024-01-18 00:39:48.908 | INFO | metagpt.config:get_default_llm_provider_enum:126 - API: LLMProviderEnum.ZHIPUAI
2024-01-18 00:39:48.914 | INFO | __main__:_act:123 - XiaoGang(OssWatcher): to do CrawlOSSTrending
2024-01-18 00:39:50.138 | INFO | __main__:_act:123 - XiaoGang(OssWatcher): to do AnalysisOSSTrending
Here's a title for today's GitHub Trending based on the provided data:
**Today's Trends: Explore the Hottest GitHub Projects in Programming Languages and Domains**
---
## Today's Trends
Today, JavaScript and Python continue to dominate as the most popular programming languages. Key areas of interest include generative AI, personal finance, and scalability. Discover the top popular projects like never before, from **TencentARC/PhotoMaker** to **linexjlin/GPTs**.
## The Trends Categories
1. Generative AI
* [TencentARC/PhotoMaker](https://github.com/TencentARC/PhotoMaker): A powerful photo manipulation tool using AI.
* [linexjlin/GPTs](https://github.com/linexjlin/GPTs): A collection of leaked GPT-3 prompts.
2. Personal Finance
* [maybe-finance/maybe](<https://github.com/maybe-finance/maybe>: A comprehensive personal finance and wealth management app.
3. Scalability
* [binhnguyennus/awesome-scalability](<https://github.com/binhnguyennus/awesome-scalability>: A curated list of patterns for building scalable, reliable, and performant large-scale systems.
## Highlights of the List
1. **TencentARC/PhotoMaker**: This project offers a powerful photo manipulation tool that uses AI to create stunning images. With over 2,000 stars and 150 forks, it's a must-watch repository for AI-driven image processing.
2. **maybe-finance/maybe**: This comprehensive personal finance and wealth management app has earned over 10,000 stars and 741 forks. It's a great resource for anyone looking to manage their finances effectively.
3. **linexjlin/GPTs**: This repository contains a collection of leaked GPT-3 prompts, earning it 22,916 stars and 3,291 forks. It's an interesting resource for those interested in exploring AI-generated text.
Check out these projects and more in the full list above! Stay tuned for more insightful and personalized recommendations based on the latest GitHub Trends.
2024-01-18 00:40:14.373 | INFO | metagpt.utils.cost_manager:update_cost:48 - Total running cost: $0.000 | Max budget: $10.000 | Current cost: $0.000, prompt_tokens: 2858, completion_tokens: 526
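Result in Discord
Sending by email
This example uses a 163 mailbox, for which the SMTP service has to be enabled.
MAIL_PASSWORD is not the mailbox password. It is generated when the SMTP service is enabled; set it as the MAIL_PASSWORD environment variable.
MAIL_SENDER and MAIL_RECEIVER in the code are the sender and recipient addresses; they are also set through environment variables.
The mail-sending class: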
import asyncio
import os
from email.mime.text import MIMEText
from email.header import Header
from email.utils import formataddr  # formataddr lives in the standard library

import aiosmtplib

from metagpt.logs import logger


class AsyncMailer:
    def __init__(self, smtp_server="smtp.163.com", smtp_port=25):
        self.smtp_server = smtp_server
        self.smtp_port = smtp_port
        self.password = os.environ["MAIL_PASSWORD"]

    async def send(self, sender, receiver, title, content) -> None:
        # MIMEText already sets Content-Type, Content-Transfer-Encoding and MIME-Version,
        # so these headers are not added a second time here
        message = MIMEText(content, 'plain', 'utf-8')
        message['From'] = formataddr((sender.split('@')[0], sender))  # sender display name
        message['To'] = formataddr((receiver.split('@')[0], receiver))  # recipient display name
        message['X-Priority'] = Header('3', 'utf-8')  # mail priority
        message['X-Mailer'] = Header('Aiosmtplib', 'utf-8')  # mail client
        message['X-AntiAbuse'] = Header('1', 'utf-8')  # anti-spam header
        message['Subject'] = Header(title, 'utf-8')  # mail subject

        # Connect to the mail server asynchronously and log in
        smtp_connection = aiosmtplib.SMTP(hostname=self.smtp_server, port=self.smtp_port, local_hostname='localhost')
        await smtp_connection.connect()
        await smtp_connection.login(sender, self.password)
        # Send the mail asynchronously
        await smtp_connection.sendmail(sender, receiver, message.as_string())
        # Close the connection
        await smtp_connection.quit()
        logger.info("Mail sent successfully!")


async def main():
    async_mailer = AsyncMailer()
    await async_mailer.send(os.environ["MAIL_SENDER"], os.environ["MAIL_RECEIVER"], 'Mail Test', 'Hello World!')


if __name__ == '__main__':
    # Run the example
    asyncio.run(main())
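As an alternative way to build the message, the standard library's EmailMessage class fills in the MIME headers by itself, which avoids setting them by hand. The sketch below only constructs the message and sends nothing; `build_message` is a hypothetical helper name and the addresses are placeholders.

```python
from email.message import EmailMessage
from email.utils import formataddr


def build_message(sender: str, receiver: str, title: str, content: str) -> EmailMessage:
    """Build a plain-text email; the MIME headers are filled in by the library."""
    message = EmailMessage()
    message["From"] = formataddr((sender.split("@")[0], sender))
    message["To"] = formataddr((receiver.split("@")[0], receiver))
    message["Subject"] = title
    message.set_content(content)  # sets Content-Type and the transfer encoding
    return message


# Placeholder addresses, used only to show the resulting headers.
msg = build_message("alice@example.com", "bob@example.com", "Mail Test", "Hello World!")
```

The result can be serialized with `msg.as_string()` and handed to the same `sendmail` call used in AsyncMailer.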
Adding the email callback
async def mail_callback(msg: Message):
    async_mailer = AsyncMailer()
    await async_mailer.send(os.environ["MAIL_SENDER"], os.environ["MAIL_RECEIVER"], 'GitHub Trending Analysis', msg.content)


async def oss_callback(discord: bool = True, mail: bool = True):
    callbacks = []
    if discord:
        callbacks.append(discord_callback)
    if mail:
        callbacks.append(mail_callback)
    if not callbacks:
        # No channel selected: fall back to printing the report
        async def _print(msg: Message):
            print(msg.content)
        callbacks.append(_print)

    async def callback(msg: Message):
        # Deliver to all selected channels concurrently
        await asyncio.gather(*[cb(msg) for cb in callbacks])

    return callback
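With a plain asyncio.gather, an exception raised by one channel propagates to the caller. If a failing Discord call should not hide the outcome of the email call (or the other way round), passing `return_exceptions=True` collects each channel's result or exception instead. The following is a sketch with stub channels; `fan_out`, `ok_channel`, and `broken_channel` are hypothetical names.

```python
import asyncio


async def fan_out(message: str, callbacks) -> list:
    """Run all callbacks concurrently and return one result or exception per callback."""
    return await asyncio.gather(*(cb(message) for cb in callbacks), return_exceptions=True)


async def ok_channel(message: str) -> str:
    """Stub channel that always succeeds."""
    return f"sent: {message}"


async def broken_channel(message: str) -> str:
    """Stub channel that always fails, standing in for a network error."""
    raise ConnectionError("channel unavailable")


results = asyncio.run(fan_out("daily report", [ok_channel, broken_channel]))
```

The caller can then log every entry in `results` that is an exception while still treating the other deliveries as successful.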
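Result in the mailbox
Crawling and summarizing the Hugging Face Papers page
Next we crawl the Hugging Face Papers page. This page shares research papers, articles, and resources on NLP and related fields, with details on models, algorithms, experiments, and more. We collect the link to each paper from Hugging Face Papers, follow each link to the paper's detail page, and extract the paper's title and abstract. We then generate a table of contents for the summary automatically, summarize each chunk, and assemble a news digest.
Crawling the Hugging Face Papers page
Open the developer tools with F12 or via right-click | Inspect.
Then locate the following part:
First, use bs4 to get the link to each paper: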
def hg_article_urls(html_soup):
    _urls = []
    for article in html_soup.select('article.flex.flex-col.overflow-hidden.rounded-xl.border'):
        url = article.select_one('h3 a')['href']
        _urls.append('https://huggingface.co' + url)
    return _urls
Note that the link has to be located through <h3><a href> rather than a bare <a href>, that is, it should be written as above:
url = article.select_one('h3 a')['href']
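This yields URLs such as https://huggingface.co/papers/2401.10020. We then visit the paper's detail page through the URL to get its title and abstract.
Taking https://huggingface.co/papers/2401.10020 as an example:
The code below reads the data-props attribute shown in the figure above. Because its content is a JSON string, json.loads is used to parse it into a Python object.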
info = soup.select_one('section.pt-8.border-gray-100')
data_props = json.loads(info.select_one('div.SVELTE_HYDRATER.contents')['data-props'])
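As the figure above shows, data_props provides the paper's id, upvote count, publication time, title, and abstract.
The code above is saved as a utility module in hg_parse.py. The complete code is: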
import asyncio
import json

import aiohttp
from bs4 import BeautifulSoup

from metagpt.config import CONFIG
from metagpt.logs import logger


def get_local_html_soup(url, features='html.parser'):
    # Parse a locally saved HTML file (handy for offline debugging)
    with open(url, encoding="utf-8") as f:
        html = f.read()
    soup = BeautifulSoup(html, features)
    return soup


async def get_html_soup(url: str):
    # Download a page through the configured proxy and parse it
    async with aiohttp.ClientSession() as client:
        async with client.get(url, proxy=CONFIG.global_proxy) as response:
            response.raise_for_status()
            html = await response.text()
    soup = BeautifulSoup(html, 'html.parser')
    return url, soup


def hg_article_urls(html_soup):
    # Collect the detail-page URL of every paper card on the listing page
    _urls = []
    for article in html_soup.select('article.flex.flex-col.overflow-hidden.rounded-xl.border'):
        url = article.select_one('h3 a')['href']
        _urls.append('https://huggingface.co' + url)
    return _urls


def hg_article_infos(_url, html_soup):
    # Extract the paper metadata from the JSON stored in the data-props attribute
    logger.info(f'Parsing {_url}')
    _article = {}
    info = html_soup.select_one('section.pt-8.border-gray-100')
    data_props = json.loads(info.select_one('div.SVELTE_HYDRATER.contents')['data-props'])
    paper = data_props['paper']
    _article['url'] = _url
    _article['id'] = paper['id']
    _article['title'] = paper['title']
    _article['upvotes'] = paper['upvotes']
    _article['publishedAt'] = paper['publishedAt']
    _article['summary'] = paper['summary']
    return _article


async def get_hg_articles():
    _, _soup = await get_html_soup("https://huggingface.co/papers")
    hg_urls = hg_article_urls(_soup)
    # Fetch all detail pages concurrently
    _soups = await asyncio.gather(*[get_html_soup(url) for url in hg_urls])
    return [hg_article_infos(url, soup) for url, soup in _soups]


if __name__ == "__main__":
    for article in asyncio.run(get_hg_articles()):
        print(article)
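The data-props step in hg_article_infos can also be tried offline, without bs4 or network access. The sketch below uses only the standard library's HTMLParser, which decodes the HTML entities in attribute values before json.loads sees them. `DataPropsParser` is a hypothetical helper, and `SAMPLE_HTML` is a made-up, heavily shortened stand-in for a real paper page.

```python
import json
from html.parser import HTMLParser


class DataPropsParser(HTMLParser):
    """Collect the decoded JSON from every element that carries a data-props attribute."""

    def __init__(self):
        super().__init__()
        self.props = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-props" and value:
                # The parser has already unescaped &quot; to ", so this is plain JSON
                self.props.append(json.loads(value))


# Hypothetical, heavily shortened stand-in for a paper detail page.
SAMPLE_HTML = (
    '<section><div class="SVELTE_HYDRATER contents" '
    'data-props="{&quot;paper&quot;: {&quot;id&quot;: &quot;2401.10020&quot;, '
    '&quot;title&quot;: &quot;Example title&quot;, &quot;upvotes&quot;: 3}}"></div></section>'
)

parser = DataPropsParser()
parser.feed(SAMPLE_HTML)
paper = parser.props[0]["paper"]
```

A real page may contain several elements with data-props, so a full implementation would still need to pick the right one, as the CSS selector in hg_parse.py does.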
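Add the Action that crawls the Hugging Face Papers page to the earlier oss.py:
# requires: from hg_parse import get_hg_articles
class CrawlOSSHuggingfacePapers(Action):
    async def run(self, msg: str) -> list:
        # msg is the content of the trigger message; the papers are always read from the listing page
        logger.info(f"{msg}")
        return await get_hg_articles()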
Summarizing the Hugging Face Papers page
The summary Action is mostly a matter of writing the prompt. AnalysisOSSHuggingfacePapers is modeled on the GitHub Trending prompt:
HG_PAPERS_ANALYSIS_PROMPT = """# Requirements
You are a Hugging Face Papers Analyst, aiming to provide users with insightful and personalized consultation based on the latest
Hugging Face Papers abstract. Based on the context, fill in the following missing information, generate engaging and informative titles,
ensuring users discover articles aligned with their interests.
# The title about Today's Hugging Face Papers Consultation
## Today's Hugging Face Papers Consultation: Uncover the Hottest Hugging Face Papers Today! Explore the trending research areas and discover key domains capturing researchers' attention. From ** to **, witness the top papers like never before.
## The Papers Categories: Dive into Today's Hugging Face Papers Domains! Explore featured papers in domains such as ** and **. Get a quick overview of each paper, including upvotes, and more.
## Highlights of the List: Spotlight noteworthy papers on Hugging Face Papers, including new tools, new methods, innovative papers, and rapidly gaining popularity, focusing on delivering distinctive and attention-grabbing content for users.
---
# Format Example
\```
# [Title]
## Today's Hugging Face Papers Consultation
Today, ** and ** continue to dominate as the most popular research areas. Key areas of interest include **, ** and **.
The top popular papers are Paper1 and Paper2.
## The Papers Categories
1. Large Language Model
- [Paper1](https://huggingface.co/papers/paper1): [Abstract of the paper, such as upvotes total ...]
- [Paper2](https://huggingface.co/papers/paper2): ...
...
## Highlights of the List
1. [Paper1](https://huggingface.co/papers/paper1): [provide specific reasons why this paper is recommended].
...
\```
---
# Hugging Face Papers
{papers}
"""


class AnalysisOSSHuggingfacePapers(Action):
    async def run(self, papers: Any):
        # Fill the prompt template with the crawled papers and ask the LLM
        return await self._aask(HG_PAPERS_ANALYSIS_PROMPT.format(papers=papers))
The Hugging Face Papers digest finally arrives in Discord and in the mailbox as shown below: