Neural search tool
Specific syntax
Executor
Write your own processing logic to plug into a Flow:
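import numpy as np
import torch
from docarray import DocumentArray
from jina import Executor, requests

class MyExecutor(Executor):
    # No `on=` argument: this handler is not tied to a specific endpoint.
    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        docs[0].text = 'hello, world!'
        docs[1].text = 'goodbye, world!'

    # Only called for requests sent to the '/crunch-numbers' endpoint.
    @requests(on='/crunch-numbers')
    def bar(self, docs: DocumentArray, **kwargs):
        for doc in docs:
            doc.tensor = torch.tensor(np.random.random([10, 2]))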
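The way endpoint names select a handler can be pictured with a small plain-Python sketch. It is not Jina's implementation: `toy_requests`, `ToyExecutor`, `dispatch`, and the `DEFAULT` marker are hypothetical names, documents are plain dicts, and treating the handler without `on=` as a fallback for unmatched endpoints is an assumption made for this illustration.

```python
from typing import Callable, Dict, List, Optional

DEFAULT = "__default__"  # hypothetical marker for the fallback handler


def toy_requests(on: Optional[str] = None) -> Callable:
    """Tag a method with the endpoint it handles (None means fallback)."""
    def decorator(func: Callable) -> Callable:
        func._endpoint = on or DEFAULT
        return func
    return decorator


class ToyExecutor:
    """Collects tagged methods and routes each call by endpoint name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable] = {}
        for name in dir(self):
            member = getattr(self, name)
            endpoint = getattr(member, "_endpoint", None)
            if endpoint is not None:
                self._handlers[endpoint] = member

    def dispatch(self, endpoint: str, docs: List[dict]) -> List[dict]:
        # Use the dedicated handler if one exists, otherwise the fallback.
        handler = self._handlers.get(endpoint) or self._handlers[DEFAULT]
        handler(docs)
        return docs


class MyToyExecutor(ToyExecutor):
    @toy_requests()
    def foo(self, docs: List[dict]) -> None:
        for doc in docs:
            doc["text"] = "hello, world!"

    @toy_requests(on="/crunch-numbers")
    def bar(self, docs: List[dict]) -> None:
        for doc in docs:
            doc["length"] = len(doc.get("text", ""))


executor = MyToyExecutor()
indexed = executor.dispatch("/index", [{"text": ""}])               # falls back to foo
crunched = executor.dispatch("/crunch-numbers", [{"text": "abc"}])  # routed to bar
```

Here `/index` has no dedicated handler, so the call lands in `foo`, while `/crunch-numbers` goes to `bar` and leaves the text untouched.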
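Flow
Exposes an API endpoint; inputs and outputs are defined explicitly, so it is fairly flexible.
A project can be built from several Flows working together.
Finished Executors (the building blocks of a Flow) can be pushed to the Hub and loaded quickly from there.
Hub
JCloud
Examples: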
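01: Search system
Overall framework
- Input: movie title, description, genre
- Output: a list of movies
Workflow
- Download the dataset
- Load the dataset into a DocumentArray
- Preprocess the DocumentArray (for example tokenization and sentence splitting), then generate vector representations.
- Build the index
- Encode the input, find the best matches in the index, and return them through the API.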
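The workflow above (load, encode, index, query) can be sketched end to end in plain Python. This is a toy under stated assumptions: a bag-of-words count vector stands in for a neural encoder, the index is an in-memory list, and the `MOVIES` records plus the `encode`, `build_index`, and `search` helpers are hypothetical names invented for the illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical sample records standing in for the downloaded dataset.
MOVIES: List[Dict[str, str]] = [
    {"title": "Space Voyage", "description": "astronauts explore a distant planet", "genre": "sci-fi"},
    {"title": "Kitchen Tales", "description": "a chef opens a small restaurant", "genre": "drama"},
    {"title": "Robot Dawn", "description": "robots explore a ruined planet", "genre": "sci-fi"},
]


def encode(text: str) -> Counter:
    """Tokenize on word characters and return term counts as a sparse vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(records: List[Dict[str, str]]) -> List[Tuple[str, Counter]]:
    """Encode title, description, and genre of each record into one vector."""
    return [
        (r["title"], encode(" ".join([r["title"], r["description"], r["genre"]])))
        for r in records
    ]


def search(index: List[Tuple[str, Counter]], query: str, top_k: int = 2) -> List[Tuple[str, float]]:
    """Encode the query and return the top_k titles ranked by cosine similarity."""
    q = encode(query)
    scored = [(title, cosine(q, vec)) for title, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


index = build_index(MOVIES)
results = search(index, "sci-fi film where robots explore a planet")
for title, score in results:
    print(f"{title}: {score:.3f}")
```

Swapping `encode` for a real embedding model and the list for a vector index changes the quality and scale, not the shape of the pipeline.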
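02: Building a PDF search system
Workflow
- Prepare the PDF data
- Parse the PDFs; set up a Flow for PDF parsing
- Process the text: sentence splitting and tokenization
- Embedding
- Build the index
- Build a Flow for the query input; run the matching and return the nearest entries in the index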
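The steps above produce nested documents: each PDF is split into segments, and each segment into sentences. The sketch below pictures how a traversal path such as `"@c"` or `"@cc"` picks one level of that nesting. It is a simplified stand-in, not DocArray's own selector: `ToyDoc`, `select`, and `split_sentences` are hypothetical names, only the steps `r` and `c` are handled, and the sample text is invented.

```python
from typing import List, Optional


class ToyDoc:
    """Minimal nested document: some text plus a list of child chunks."""

    def __init__(self, text: str = "", chunks: Optional[List["ToyDoc"]] = None) -> None:
        self.text = text
        self.chunks = chunks if chunks is not None else []


def select(docs: List[ToyDoc], path: str) -> List[ToyDoc]:
    """Resolve a simplified path: '@r' keeps the roots, each 'c' descends one chunk level."""
    if not path.startswith("@"):
        raise ValueError("path must start with '@'")
    level = docs
    for step in path[1:]:
        if step == "r":
            continue
        if step != "c":
            raise ValueError(f"unsupported step: {step!r}")
        level = [chunk for doc in level for chunk in doc.chunks]
    return level


def split_sentences(text: str) -> List[str]:
    """Naive splitter on periods; a real sentencizer is more careful."""
    return [s.strip() for s in text.split(".") if s.strip()]


# One root per PDF, with two text segments as its chunks.
pdf = ToyDoc(text="report.pdf", chunks=[
    ToyDoc(text="Neural search uses vectors. Queries are encoded too."),
    ToyDoc(text="Indexes store embeddings."),
])

# Sentence splitting works on the segment level ('@c') and adds a deeper level.
for segment in select([pdf], "@c"):
    segment.chunks = [ToyDoc(text=s) for s in split_sentences(segment.text)]

# Encoding and indexing then work on the sentence level ('@cc').
sentences = [doc.text for doc in select([pdf], "@cc")]
print(sentences)
```

This is why the Flow that follows passes `"@c"` to the sentence splitter and `"@cc"` to the encoder and indexer: each step reads the level the previous step created.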
from docarray import DocumentArray
from jina import Flow

# Load the PDF files under pdf_data/ into a DocumentArray.
docs = DocumentArray.from_files("pdf_data/*.pdf", recursive=True)

# Indexing Flow: segment the PDFs, split into sentences, encode, then store.
flow = (
    Flow()
    .add(
        uses="jinahub://PDFSegmenter",
        install_requirements=True,
        name="segmenter",
    )
    .add(
        uses="jinahub://SpacySentencizer",
        uses_with={"traversal_paths": "@c"},
        install_requirements=True,
        name="sentencizer",
    )
    .add(
        uses="jinahub://TransformerTorchEncoder",
        uses_with={"traversal_paths": "@cc"},
        install_requirements=True,
        name="encoder",
    )
    .add(
        uses="jinahub://SimpleIndexer",
        uses_with={"traversal_right": "@cc"},
        install_requirements=True,
        name="indexer",
    )
)

flow.plot()

with flow:
    docs = flow.index(docs, show_progress=True)
# Build the search Flow
search_flow = (
    Flow()
    .add(
        uses="jinahub://TransformerTorchEncoder",
        name="encoder",
    )
    .add(
        uses="jinahub://SimpleIndexer",
        uses_with={"traversal_right": "@cc"},
        name="indexer",
    )
)

# Query text, kept in the original language of the indexed PDFs
# (roughly: "a word-vector-based HowNet representation method").
search_term = "一种基于词向量的hownet表示方法"

from docarray import Document

query_doc = Document(text=search_term)

with search_flow:
    results = search_flow.search(query_doc, show_progress=True, return_results=True)

# Print each matched sentence with its cosine score.
for match in results[0].matches:
    print(match.text)
    print(match.scores["cosine"].value)
    print("---")