Summary of Main Points
This article introduces the methods and steps the DB-GPT application development framework uses to optimize RAG in real-world deployment scenarios, covering knowledge processing, knowledge retrieval, RAG pipeline optimization, and case studies. It stresses the substantial engineering work needed to land RAG in real business scenarios, and walks through the source code of DB-GPT's key RAG stages, the details of knowledge processing and extraction, the available knowledge storage options, and the retrieval strategies. It also shares two concrete RAG case studies, one in the data-infrastructure domain and one in financial-report analysis, demonstrating what RAG can achieve in specialized domains.
Summary of Key Points
Key point 1: RAG optimization methods and steps
The article introduces the methods and steps by which the DB-GPT application development framework optimizes RAG in real deployments, including knowledge processing, knowledge retrieval, pipeline optimization, and case studies.
Key point 2: Source-code walkthrough of DB-GPT's key RAG stages
The article walks through the source code of DB-GPT's key RAG stages, including knowledge loading, chunking, extraction, storage, and retrieval.
Key point 3: Knowledge processing and extraction details
The article covers the details of knowledge processing and extraction, including unstructured-to-structured conversion, diversified information extraction, and knowledge completeness.
Key point 4: Knowledge storage options
The article describes the knowledge storage options, including vector databases, graph databases, and full-text indexes, along with their concrete implementations.
Key point 5: Knowledge retrieval strategies
The article presents the knowledge retrieval strategies, including query rewriting, metadata filtering, hybrid multi-strategy recall, and post-filtering.
Key point 6: RAG case studies
The article shares two concrete RAG case studies, one in the data-infrastructure domain and one in financial-report analysis, demonstrating RAG's application and effectiveness in specialized domains.
Main Text
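DB-GPT selects the embedding model through environment variables; for example, the following (commented-out) .env entries would route embedding through the Qianfan proxy: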
#EMBEDDING_MODEL=proxy_qianfan
#proxy_qianfan_proxy_backend=bge-large-zh
#proxy_qianfan_proxy_api_key={your-api-key}
#proxy_qianfan_proxy_api_secret={your-secret-key}
class TripletExtractor(LLMExtractor):
    """TripletExtractor class."""

    def __init__(self, llm_client: LLMClient, model_name: str):
        """Initialize the TripletExtractor."""
        super().__init__(llm_client, model_name, TRIPLET_EXTRACT_PT)
TRIPLET_EXTRACT_PT = (
    "Some text is provided below. Given the text, "
    "extract as many knowledge triplets as possible "
    "in the form of (subject, predicate, object).\n"
    "Avoid stopwords. The subject, predicate and object cannot be none.\n"
    "---------------------\n"
    "Example:\n"
    "Text: Alice is Bob's mother.\n"
    "Triplets:\n(Alice, is mother of, Bob)\n"
    "Text: Alice has 2 apples.\n"
    "Triplets:\n(Alice, has 2, apple)\n"
    "Text: Alice was given 1 apple by Bob.\n"
    "Triplets:\n(Bob, gives 1 apple, Alice)\n"
    "Text: Alice was pushed by Bob.\n"
    "Triplets:\n(Bob, pushes, Alice)\n"
    "Text: Bob's mother Alice has 2 apples.\n"
    "Triplets:\n(Alice, is mother of, Bob)\n(Alice, has 2, apple)\n"
    "Text: A Big monkey climbed up the tall fruit tree and picked 3 peaches.\n"
    "Triplets:\n(monkey, climbed up, fruit tree)\n(monkey, picked 3, peach)\n"
    "Text: Alice has 2 apples, she gives 1 to Bob.\n"
    "Triplets:\n"
    "(Alice, has 2, apple)\n(Alice, gives 1 apple, Bob)\n"
    "Text: Philz is a coffee shop founded in Berkeley in 1982.\n"
    "Triplets:\n"
    "(Philz, is, coffee shop)\n(Philz, founded in, Berkeley)\n"
    "(Philz, founded in, 1982)\n"
    "---------------------\n"
    "Text: {text}\n"
    "Triplets:\n"
)
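As a quick illustration, here is a hypothetical invocation, assuming the async extract(text) entry point inherited from LLMExtractor, which parses the model output into (subject, predicate, object) tuples:

import asyncio

async def demo(llm_client: LLMClient) -> None:
    # Model name is a placeholder; use whichever chat model your LLMClient serves.
    extractor = TripletExtractor(llm_client, model_name="glm-4")
    triplets = await extractor.extract(
        "Philz is a coffee shop founded in Berkeley in 1982."
    )
    for triplet in triplets:
        print(triplet)  # e.g. ("Philz", "is", "coffee shop")

# asyncio.run(demo(my_llm_client))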
Knowledge persistence is implemented behind the unified IndexStoreBase interface, which currently has three families of implementations: vector databases, graph databases, and full-text indexes.
- VectorStoreBase
- ChromaStore
- MilvusStore
- OceanbaseStore
- ElasticsearchStore
- PGVectorStore
class VectorStoreBase(IndexStoreBase, ABC):
    """Vector store base class."""

    @abstractmethod
    def load_document(self, chunks: List[Chunk]) -> List[str]:
        """Load document in index database."""

    @abstractmethod
    async def aload_document(self, chunks: List[Chunk]) -> List[str]:
        """Load document in index database."""

    @abstractmethod
    def similar_search_with_scores(
        self,
        text: str,
        topk: int,
        score_threshold: float,
        filters: Optional[MetadataFilters] = None,
    ) -> List[Chunk]:
        """Similar search with scores in index database."""

    def similar_search(
        self, text: str, topk: int, filters: Optional[MetadataFilters] = None
    ) -> List[Chunk]:
        """Similar search in index database (delegates with a 1.0 score threshold)."""
        return self.similar_search_with_scores(text, topk, 1.0, filters)
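To make the contract concrete, here is a minimal usage sketch against the abstract interface; the store instance, chunk list, and threshold value are placeholders:

def index_and_search(store: VectorStoreBase, chunks: List[Chunk], question: str) -> List[Chunk]:
    # Persist the chunks (and their embeddings) into the index.
    store.load_document(chunks)
    # Recall the top-4 chunks, dropping anything below the similarity threshold.
    return store.similar_search_with_scores(question, topk=4, score_threshold=0.3)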
- GraphStoreBase
- TuGraphStore
- Neo4jStore
def insert_triplet(self, subj: str, rel: str, obj: str) -> None:
    """Add triplet."""
    # ...TL;DR...
    subj_query = f"MERGE (n1:{self._node_label} {{id:'{subj}'}})"
    obj_query = f"MERGE (n2:{self._node_label} {{id:'{obj}'}})"
    rel_query = (
        f"MERGE (n1:{self._node_label} {{id:'{subj}'}})"
        f"-[r:{self._edge_label} {{id:'{rel}'}}]->"
        f"(n2:{self._node_label} {{id:'{obj}'}})"
    )
    self.conn.run(query=subj_query)
    self.conn.run(query=obj_query)
    self.conn.run(query=rel_query)
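Because each statement uses MERGE, re-inserting the same triplet is idempotent: nodes and edges are only created if they do not already exist. A hypothetical call, with store being an initialized graph store:

store.insert_triplet("Philz", "founded in", "Berkeley")
store.insert_triplet("Philz", "founded in", "Berkeley")  # no duplicate node or edge

On the full-text side, the Elasticsearch implementation builds its index with a custom, tunable BM25 similarity: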
{ "analysis": {"analyzer": {"default": {"type": "standard"}}},
"similarity": {
"custom_bm25": {
"type": "BM25",
"k1": self._k1,
"b": self._b,
}
},
}
self._es_mappings = {
"properties": {
"content": {
"type": "text",
"similarity": "custom_bm25",
},
"metadata": {
"type": "keyword",
},
}
}
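For reference, this is roughly how such settings and mappings are applied with the 8.x elasticsearch Python client; the index name and connection details are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local deployment
es.indices.create(
    index="dbgpt_fulltext",      # hypothetical index name
    settings=es_index_settings,  # the settings dict shown above
    mappings=es_mappings,        # the mappings dict shown above
)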
The full-text index interface currently supports Elasticsearch; an interface for OpenSearch is also defined.
- FullTextStoreBase
- ElasticDocumentStore
- OpenSearchStore
question -> rewrite -> similarity_search -> rerank -> context_candidates
Next comes knowledge retrieval. The community's current retrieval logic breaks down into a few steps. If the query-rewrite parameter is set, the large model first performs one round of question rewriting. The query is then routed to the retriever matching how the knowledge was processed: knowledge indexed as vectors is retrieved through EmbeddingRetriever, while knowledge built as a knowledge graph is retrieved through graph search. Finally, if a rerank model is configured, it re-scores the coarsely recalled candidates so that they align more closely with the user's question.
class EmbeddingRetriever(BaseRetriever):
    """Embedding retriever."""

    def __init__(
        self,
        index_store: IndexStoreBase,
        top_k: int = 4,
        query_rewrite: Optional[QueryRewrite] = None,
        rerank: Optional[Ranker] = None,
        retrieve_strategy: Optional[RetrieverStrategy] = RetrieverStrategy.EMBEDDING,
    ):
        ...

    async def _aretrieve_with_score(
        self,
        query: str,
        score_threshold: float,
        filters: Optional[MetadataFilters] = None,
    ) -> List[Chunk]:
        """Retrieve knowledge chunks with score.

        Args:
            query (str): query text
            score_threshold (float): score threshold
            filters: metadata filters.

        Return:
            List[Chunk]: list of chunks with score
        """
        queries = [query]
        if self._query_rewrite:
            # `context` is assembled from a first-pass retrieval (omitted here).
            new_queries = await self._query_rewrite.rewrite(
                origin_query=query, context=context, nums=1
            )
            queries.extend(new_queries)
        candidates_with_score = [
            self._similarity_search_with_score(
                query, score_threshold, filters, root_tracer.get_current_span_id()
            )
            for query in queries
        ]
        ...
        new_candidates_with_score = await self._rerank.arank(
            new_candidates_with_score, query
        )
        return new_candidates_with_score
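A hypothetical end-to-end call, assuming aretrieve_with_scores is the public entry point exposed by BaseRetriever and that vector_store is an initialized IndexStoreBase implementation:

retriever = EmbeddingRetriever(
    index_store=vector_store,  # e.g. a ChromaStore / MilvusStore instance
    top_k=4,
    query_rewrite=None,        # set a QueryRewrite to enable LLM question rewriting
    rerank=None,               # set a Ranker to re-score the recalled candidates
)
chunks = await retriever.aretrieve_with_scores("How does DB-GPT store knowledge?", 0.3)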
- score_threshold: similarity score threshold; by default, context whose similarity score falls below this value is filtered out.