A 10,000-Word Deep Dive: Optimizing RAG in Real-World Deployments

Alibaba Cloud Developer (阿里云开发者) · WeChat Official Account · Tech Company · 2025-03-12 08:30

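Summary of Main Points

This article describes how the DB-GPT application development framework optimizes RAG in real-world deployments, covering knowledge processing, knowledge retrieval, RAG pipeline optimization, and RAG case studies. It stresses how much work a real business deployment requires, and it walks through the source code of the key RAG flows in DB-GPT, the details of knowledge processing and extraction, the knowledge storage options, and the knowledge retrieval strategies. It also shares two concrete RAG case studies, one in data infrastructure and one in financial report analysis, to show how RAG is applied in specialized domains and what results it achieves.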

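Summary of Key Points

Key point 1: RAG optimization methods and steps

The article describes how DB-GPT optimizes RAG in real-world deployments: knowledge processing, knowledge retrieval, RAG pipeline optimization, and RAG case studies.

Key point 2: Source-code walkthrough of the key RAG flows in DB-GPT

The article reads through the source code of the key RAG flows in DB-GPT, including knowledge loading, chunking, extraction, storage, and retrieval.

Key point 3: Knowledge processing and extraction details

The article covers the details of knowledge processing and extraction, including converting unstructured content into structured content, extracting several kinds of information, and preserving the completeness of the knowledge.

Key point 4: Knowledge storage

The article covers the storage options (vector database, graph database, and full-text index) and shows the concrete storage implementations.

Key point 5: Knowledge retrieval strategies

The article covers the retrieval strategies, including query rewriting, metadata filtering, multi-strategy hybrid recall, and post-filtering.

Key point 6: RAG case studies

The article shares two concrete RAG case studies, one in data infrastructure and one in financial report analysis, to show how RAG is applied in specialized domains and what results it achieves.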

Main Text



   
#EMBEDDING_MODEL=proxy_qianfan
#proxy_qianfan_proxy_backend=bge-large-zh
#proxy_qianfan_proxy_api_key={your-api-key}
#proxy_qianfan_proxy_api_secret={your-secret-key}

Knowledge graph extraction -> knowledge graph: a large language model extracts (entity, relation, entity) triplets from the text.

# LLMClient and LLMExtractor are provided by DB-GPT.

TRIPLET_EXTRACT_PT = (
    "Some text is provided below. Given the text, "
    "extract up to knowledge triplets as more as possible "
    "in the form of (subject, predicate, object).\n"
    "Avoid stopwords. The subject, predicate, object can not be none.\n"
    "---------------------\n"
    "Example:\n"
    "Text: Alice is Bob's mother.\n"
    "Triplets:\n(Alice, is mother of, Bob)\n"
    "Text: Alice has 2 apples.\n"
    "Triplets:\n(Alice, has 2, apple)\n"
    "Text: Alice was given 1 apple by Bob.\n"
    "Triplets:(Bob, gives 1 apple, Alice)\n"
    "Text: Alice was pushed by Bob.\n"
    "Triplets:(Bob, pushes, Alice)\n"
    "Text: Bob's mother Alice has 2 apples.\n"
    "Triplets:\n(Alice, is mother of, Bob)\n(Alice, has 2, apple)\n"
    "Text: A Big monkey climbed up the tall fruit tree and picked 3 peaches.\n"
    "Triplets:\n(monkey, climbed up, fruit tree)\n(monkey, picked 3, peach)\n"
    "Text: Alice has 2 apples, she gives 1 to Bob.\n"
    "Triplets:\n"
    "(Alice, has 2, apple)\n(Alice, gives 1 apple, Bob)\n"
    "Text: Philz is a coffee shop founded in Berkeley in 1982.\n"
    "Triplets:\n"
    "(Philz, is, coffee shop)\n(Philz, founded in, Berkeley)\n"
    "(Philz, founded in, 1982)\n"
    "---------------------\n"
    "Text: {text}\n"
    "Triplets:\n"
)


class TripletExtractor(LLMExtractor):
    """TripletExtractor class."""

    def __init__(self, llm_client: LLMClient, model_name: str):
        """Initialize the TripletExtractor."""
        # The few-shot prompt above is passed to the base extractor as its template.
        super().__init__(llm_client, model_name, TRIPLET_EXTRACT_PT)
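The prompt asks the model to answer with one `(subject, predicate, object)` group per triplet, so the reply still has to be parsed before it can be written to a graph. The following is a minimal sketch of such a parser. `parse_triplets` is a hypothetical helper written for illustration, not DB-GPT's own parsing code, and it assumes no component contains a comma or parentheses.

```python
import re
from typing import List, Tuple

# Matches one "(...)" group that contains no nested parentheses.
_TRIPLET_RE = re.compile(r"\(([^()]*)\)")


def parse_triplets(llm_output: str) -> List[Tuple[str, str, str]]:
    """Parse '(subject, predicate, object)' groups into tuples.

    Hypothetical helper for illustration; not DB-GPT's own parser.
    """
    triplets: List[Tuple[str, str, str]] = []
    for match in _TRIPLET_RE.finditer(llm_output):
        parts = [p.strip() for p in match.group(1).split(",")]
        # Keep only well-formed triplets with no empty component,
        # mirroring the prompt's "can not be none" rule.
        if len(parts) == 3 and all(parts):
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets


reply = "(Alice, is mother of, Bob)\n(Alice, has 2, apple)\n(bad, pair)"
print(parse_triplets(reply))
```

Malformed groups are dropped rather than repaired, which keeps bad edges out of the graph at the cost of losing some recall.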

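Inverted-index extraction -> keyword tokenization

You can use Elasticsearch's default analyzer, or install an analyzer plugin to customize tokenization.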
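To make the keyword -> document mapping concrete, here is a small in-memory sketch of an inverted index. The `tokenize` function is a stand-in for a real analyzer and only lowercases and splits on non-word characters. It is an assumption for illustration and is not how Elasticsearch tokenizes text; in particular, Chinese text would need a proper analyzer.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (a stand-in analyzer)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each keyword to the set of doc ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


docs = {"d1": "RAG retrieves knowledge", "d2": "Knowledge graph storage"}
index = build_inverted_index(docs)
print(sorted(index["knowledge"]))  # ['d1', 'd2']
```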

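Knowledge storage

All knowledge persistence goes through the IndexStoreBase interface, which currently has three kinds of implementation: vector database, graph database, and full-text index.

VectorStore: for a vector database, most of the logic lives in load_document(), including creating the index schema and writing the vector data in batches.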

- VectorStoreBase
    - ChromaStore
    - MilvusStore
    - OceanbaseStore
    - ElasticsearchStore
    - PGVectorStore

from abc import ABC, abstractmethod
from typing import List, Optional

# Chunk, IndexStoreBase and MetadataFilters are provided by DB-GPT.


class VectorStoreBase(IndexStoreBase, ABC):
    """Vector store base class."""

    @abstractmethod
    def load_document(self, chunks: List[Chunk]) -> List[str]:
        """Load document in index database."""

    @abstractmethod
    async def aload_document(self, chunks: List[Chunk]) -> List[str]:
        """Load document in index database."""

    @abstractmethod
    def similar_search_with_scores(
        self,
        text,
        topk,
        score_threshold: float,
        filters: Optional[MetadataFilters] = None,
    ) -> List[Chunk]:
        """Similar search with scores in index database."""

    def similar_search(
        self, text: str, topk: int, filters: Optional[MetadataFilters] = None
    ) -> List[Chunk]:
        # Convenience wrapper: delegates with a fixed score threshold of 1.0.
        return self.similar_search_with_scores(text, topk, 1.0, filters)
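The batched write that load_document() performs can be sketched as follows. This is a simplified illustration under stated assumptions: `batched` and `load_in_batches` are hypothetical names, chunks are reduced to plain ids, and `write_batch` stands in for whatever bulk-insert call a concrete store makes.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def load_in_batches(
    chunk_ids: Sequence[str],
    write_batch: Callable[[List[str]], None],
    batch_size: int = 100,
) -> List[str]:
    """Send chunk ids to write_batch one batch at a time; return all ids written."""
    written: List[str] = []
    for batch in batched(chunk_ids, batch_size):
        write_batch(batch)  # stands in for one bulk-insert request per batch
        written.extend(batch)
    return written


calls: List[List[str]] = []
ids = load_in_batches(["a", "b", "c", "d", "e"], calls.append, batch_size=2)
print(calls)
```

Batching bounds the size of each request, so a large document does not turn into one oversized write.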

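GraphStore: a concrete graph store implements triplet insertion, usually by issuing statements in the query language of the underlying graph database. For example, TuGraphStore generates Cypher statements from each triplet and executes them.

The GraphStoreBase interface provides a unified abstraction for graph storage. MemoryGraphStore and TuGraphStore are built in, and a Neo4j interface is also provided for developers to integrate.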

- GraphStoreBase
    - TuGraphStore
    - Neo4jStore

def insert_triplet(self, subj: str, rel: str, obj: str) -> None:
    """Add triplet."""
    # ...TL;DR...
    # Upsert the subject and object nodes first, then the edge between them.
    subj_query = f"MERGE (n1:{self._node_label} {{id:'{subj}'}})"
    obj_query = f"MERGE (n1:{self._node_label} {{id:'{obj}'}})"
    rel_query = (
        f"MERGE (n1:{self._node_label} {{id:'{subj}'}})"
        f"-[r:{self._edge_label} {{id:'{rel}'}}]->"
        f"(n2:{self._node_label} {{id:'{obj}'}})"
    )
    self.conn.run(query=subj_query)
    self.conn.run(query=obj_query)
    self.conn.run(query=rel_query)
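Because the statements above are built by string interpolation, an entity such as `Bob's mother` would close the quoted literal early. The following sketch shows one way to guard against that. It assumes the database accepts backslash escapes inside single-quoted string literals, as openCypher does. `escape_cypher_string` and `build_triplet_queries` are hypothetical helpers, and the default labels `entity` and `rel` are placeholders rather than DB-GPT's values.

```python
from typing import List


def escape_cypher_string(value: str) -> str:
    """Escape backslashes and single quotes for a single-quoted Cypher literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def build_triplet_queries(
    subj: str,
    rel: str,
    obj: str,
    node_label: str = "entity",  # placeholder label, not DB-GPT's
    edge_label: str = "rel",  # placeholder label, not DB-GPT's
) -> List[str]:
    """Build the three MERGE statements: subject node, object node, edge."""
    s, r, o = (escape_cypher_string(v) for v in (subj, rel, obj))
    return [
        f"MERGE (n1:{node_label} {{id:'{s}'}})",
        f"MERGE (n1:{node_label} {{id:'{o}'}})",
        (
            f"MERGE (n1:{node_label} {{id:'{s}'}})"
            f"-[r:{edge_label} {{id:'{r}'}}]->"
            f"(n2:{node_label} {{id:'{o}'}})"
        ),
    ]


for statement in build_triplet_queries("Bob's mother", "is", "Alice"):
    print(statement)
```

Where the driver supports query parameters, passing the values as parameters is usually the safer choice than escaping by hand.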

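FullTextStore: builds an Elasticsearch index, tokenizes the content with Elasticsearch's built-in analyzer, and lets Elasticsearch build the keyword -> doc_id inverted index.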

{
    "analysis": {"analyzer": {"default": {"type": "standard"}}},
    "similarity": {
        "custom_bm25": {
            "type": "BM25",
            "k1": self._k1,
            "b": self._b,
        }
    },
}
self._es_mappings = {
    "properties": {
        "content": {
            "type": "text",
            "similarity": "custom_bm25",
        },
        "metadata": {
            "type": "keyword",
        },
    }
}
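The `k1` and `b` settings above tune BM25 ranking. As a rough illustration of what they control, here is the textbook BM25 formula in plain Python. This is a sketch, not Elasticsearch's implementation: Lucene's scoring differs in details, and the defaults of 1.2 and 0.75 are used here only as common starting values.

```python
import math
from typing import List


def bm25_score(
    query_terms: List[str],
    doc: List[str],
    corpus: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one tokenized doc against a query with the textbook BM25 formula."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # docs containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # b controls length normalization; k1 controls term-frequency saturation.
        norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


corpus = [["rag", "index"], ["graph", "store", "index"], ["rag", "rag", "rerank"]]
scores = [bm25_score(["rag"], doc, corpus) for doc in corpus]
print(scores)
```

Raising `k1` lets repeated terms keep adding to the score for longer, while setting `b` to 0 switches off the penalty for long documents.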

The full-text index interface currently supports Elasticsearch, and an OpenSearch interface is also defined.

- FullTextStoreBase
    - ElasticDocumentStore
    - OpenSearchStore

1.2 Knowledge retrieval

question -> rewrite -> similarity_search -> rerank -> context_candidates

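Next comes knowledge retrieval. The community's retrieval logic currently runs in the following steps. If the query-rewrite parameter is set, a large language model first rewrites the question. The query is then routed to the retriever that matches how the knowledge was processed: knowledge indexed as vectors is searched with EmbeddingRetriever, and knowledge built as a knowledge graph is searched through the knowledge graph. Finally, if a rerank model is configured, it refines the coarsely filtered candidates so that they are more relevant to the user's question.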
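The flow described above (optional rewrite, search with every query, optional rerank) can be sketched with plain callables. This is an illustration under stated assumptions, not DB-GPT's retriever: `retrieve`, `search`, `rewrite`, and `rerank` are hypothetical names, candidates are reduced to `(text, score)` pairs, and a higher score is assumed to mean a closer match.

```python
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Tuple[str, float]  # (chunk text, score)


def retrieve(
    question: str,
    search: Callable[[str], List[Candidate]],
    rewrite: Optional[Callable[[str], List[str]]] = None,
    rerank: Optional[Callable[[str, List[Candidate]], List[Candidate]]] = None,
    top_k: int = 4,
) -> List[Candidate]:
    """question -> rewrite -> similarity_search -> rerank -> candidates."""
    queries = [question]
    if rewrite is not None:
        queries.extend(rewrite(question))
    # Search with every query and keep the best score seen for each chunk.
    best: Dict[str, float] = {}
    for q in queries:
        for text, score in search(q):
            if score > best.get(text, float("-inf")):
                best[text] = score
    candidates = sorted(best.items(), key=lambda item: item[1], reverse=True)
    if rerank is not None:
        candidates = rerank(question, candidates)
    return candidates[:top_k]


def fake_search(query: str) -> List[Candidate]:
    """Stub search backend with canned results for two queries."""
    results = {
        "q1": [("a", 0.9), ("b", 0.4)],
        "q2": [("b", 0.8), ("c", 0.5)],
    }
    return results.get(query, [])


print(retrieve("q1", fake_search, rewrite=lambda q: ["q2"]))
```

Merging by best score keeps a chunk from appearing twice when the original and the rewritten query both find it.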

EmbeddingRetriever

class EmbeddingRetriever(BaseRetriever):
    """Embedding retriever."""

    def __init__(
        self,
        index_store: IndexStoreBase,
        top_k: int = 4,
        query_rewrite: Optional[QueryRewrite] = None,
        rerank: Optional[Ranker] = None,
        retrieve_strategy: Optional[RetrieverStrategy] = RetrieverStrategy.EMBEDDING,
    ):
        ...

    async def _aretrieve_with_score(
        self,
        query: str,
        score_threshold: float,
        filters: Optional[MetadataFilters] = None,
    ) -> List[Chunk]:
        """Retrieve knowledge chunks with score.

        Args:
            query (str): query text
            score_threshold (float): score threshold
            filters: metadata filters.
        Return:
            List[Chunk]: list of chunks with score
        """
        queries = [query]
        # `context` is defined in code omitted from this excerpt.
        new_queries = await self._query_rewrite.rewrite(
            origin_query=query, context=context, nums=1
        )
        queries.extend(new_queries)
        # Run one similarity search per query (the original plus the rewrites).
        candidates_with_score = [
            self._similarity_search_with_score(
                query, score_threshold, filters, root_tracer.get_current_span_id()
            )
            for query in queries
        ]
        ...
        # Rerank the merged candidates against the original query.
        new_candidates_with_score = await self._rerank.arank(
            new_candidates_with_score, query
        )
        return new_candidates_with_score

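index_store: the concrete vector database
top_k: the number of candidate chunks to return
query_rewrite: the query rewrite function
rerank: the rerank function
query: the original query
score_threshold: the similarity score cutoff; by default, context whose similarity score is below the threshold is filtered out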
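How `score_threshold` and `top_k` interact can be shown in a few lines. This is a minimal sketch, assuming that a higher score means a closer match and that candidates are plain `(text, score)` pairs. `filter_candidates` is a hypothetical helper, not a DB-GPT function.

```python
from typing import List, Tuple


def filter_candidates(
    scored_chunks: List[Tuple[str, float]],
    score_threshold: float,
    top_k: int = 4,
) -> List[Tuple[str, float]]:
    """Drop chunks scoring below the threshold, then keep the top_k best."""
    kept = [(text, s) for text, s in scored_chunks if s >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


chunks = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
print(filter_candidates(chunks, score_threshold=0.3, top_k=2))
```

The threshold is applied before the cut to `top_k`, so a strict threshold can return fewer than `top_k` chunks, or none at all.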