Building a Knowledge Graph from Scratch: An 11-Step Practical Guide to Processing Complex Data with Large Language Models

数据派THU · WeChat official account · Big Data · 2025-06-08 17:00



   
llm_ner_system_prompt = (
    f"You are an expert Named Entity Recognition system. "
    f"From the provided news article text, identify and extract entities. "
    f"The entity types to focus on are: {entity_types_string_for_prompt}. "
    f"For each identified entity, provide its exact text span from the article and its type (use one of the provided types). "
    f"Output ONLY a valid JSON object with a single key 'entities'. The value of 'entities' MUST be a list of JSON objects, "
    f"where each object has 'text' and 'type' keys. "
    f"Example: {{\"entities\": [{{\"text\": \"United Nations\", \"type\": \"ORG\"}}, {{\"text\": \"Barack Obama\", \"type\": \"PERSON\"}}]}} "
    f"If no entities of the specified types are found, the 'entities' list should be empty: {{\"entities\": []}}."
)
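Because the prompt is assembled from f-strings, the literal braces of the JSON example have to be doubled, while single braces are reserved for interpolated values. The short sketch below illustrates that behaviour; the `entity_types` list is a made-up stand-in for illustration, not the article's actual type list.

```python
# Hypothetical type list, used only to make the example self-contained.
entity_types = ["PERSON", "ORG", "GPE"]
entity_types_string_for_prompt = ", ".join(entity_types)

prompt_fragment = (
    # Single braces interpolate a variable.
    f"The entity types to focus on are: {entity_types_string_for_prompt}. "
    # Doubled braces produce literal '{' and '}' in the final string.
    f"Example: {{\"entities\": [{{\"text\": \"United Nations\", \"type\": \"ORG\"}}]}}"
)

print(prompt_fragment)
```

Printing the fragment is a quick way to confirm that the model will see well-formed JSON in the example rather than stray doubled braces.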

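This system prompt instructs the LLM to return entity data as valid JSON. Before writing the main processing loop, we need a JSON parser function that turns the model's text output into a Python list of entities: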

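import json

def parse_llm_entity_json_output(llm_output_str):
    """
    Parse the LLM's JSON string and return the list of entities.
    Expected format: {"entities": [{"text": "...", "type": "..."}]}

    Args:
        llm_output_str (str): JSON string returned by the LLM.
    Returns:
        list: Extracted entities, or an empty list if parsing fails.
    """
    if not llm_output_str:
        return []  # No output: return an empty list

    # Strip a surrounding markdown code fence if present.
    # Whitespace is trimmed first so a trailing newline does not hide the closing fence.
    cleaned = llm_output_str.strip()
    if cleaned.startswith("```json"):
        cleaned = cleaned[len("```json"):]
    if cleaned.endswith("```"):
        cleaned = cleaned[:-3]
    cleaned = cleaned.strip()

    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []  # Malformed JSON: return an empty list

    # Valid JSON that is not an object (for example a bare list) has no 'entities' key
    if not isinstance(data, dict):
        return []
    return data.get("entities", [])  # Entity list, or empty if the key is missing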

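Models sometimes wrap the JSON in extra prose as well as a code fence, which a prefix check alone does not handle. As a rough sketch of a more tolerant approach, the helper below parses whatever sits between the first opening brace and the last closing brace. The name `extract_json_object` is hypothetical and not part of the original article.

```python
import json

def extract_json_object(raw_text):
    """Parse the text between the first '{' and the last '}' as a JSON object.

    Returns the parsed dict, or None if no JSON object can be recovered.
    """
    if not raw_text:
        return None
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# Usage: a reply with a code fence and surrounding chatter.
fence = "`" * 3
reply = (
    "Here is the result:\n"
    + fence + "json\n"
    + '{"entities": [{"text": "CNN", "type": "ORG"}]}\n'
    + fence + "\nLet me know if you need more."
)
data = extract_json_object(reply)
entities = data.get("entities", []) if data else []
print(entities)
```

This trades strictness for tolerance: it will not recover from text that contains stray braces outside the JSON payload, so it is best treated as a fallback rather than a replacement for a strict parse.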
Now create a loop that applies this system prompt to every article in the dataset:

# Define the LLM used for entity extraction
TEXT_GEN_MODEL_NAME = "microsoft/phi-4"

# Results list: one record per article
articles_with_llm_entities = []

# Iterate over the cleaned articles and extract entities with the LLM
for i, article_data in enumerate(cleaned_articles):
    article_id = article_data['id']
    article_text = article_data['cleaned_text']

    # Call the LLM to extract entities
    llm_response_content = call_llm(
        llm_ner_system_prompt,
        article_text,
        TEXT_GEN_MODEL_NAME
    )

    # Parse the LLM response into a list of entities
    extracted_llm_entities = []
    if llm_response_content:
        extracted_llm_entities = parse_llm_entity_json_output(llm_response_content)

    # Store the result alongside the article
    articles_with_llm_entities.append({
        "id": article_id,
        "cleaned_text": article_text,
        "summary": article_data['summary'],
        "llm_extracted_entities": extracted_llm_entities
    })

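This loop processes our 65,000 news articles and extracts entities from each one. Once it finishes, we can inspect the entities extracted for a single article: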

# Print the entities of one sample article
print(articles_with_llm_entities[4212]['llm_extracted_entities'])

### OUTPUT ###
Extracted 20 entities for article ID 4cf51ce937a.
Sample entities: [
  {
    "text": "United Nations",
    "type": "ORG"
  },
  {
    "text": "Algiers",
    "type": "GPE"
  },
  {
    "text": "CNN",
    "type": "ORG"
  }
  ...
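The prompt asks for exact text spans and for one of the provided types, but the loop stores whatever the model returns. One possible safeguard, sketched here under the assumption that a plain substring check is acceptable, is to drop entities that do not occur verbatim in the article or that carry an unexpected type. The function name and the sample type set are hypothetical.

```python
def filter_valid_entities(entities, article_text, allowed_types):
    """Keep entities with an allowed type whose text occurs verbatim in the article.

    Duplicate (text, type) pairs are collapsed to a single entry.
    """
    valid = []
    seen = set()
    for entity in entities:
        if not isinstance(entity, dict):
            continue
        text = entity.get("text")
        ent_type = entity.get("type")
        if not isinstance(text, str) or not isinstance(ent_type, str):
            continue
        text = text.strip()
        if not text or ent_type not in allowed_types or text not in article_text:
            continue
        key = (text, ent_type)
        if key in seen:
            continue
        seen.add(key)
        valid.append({"text": text, "type": ent_type})
    return valid

# Usage with made-up data.
sample_text = "The United Nations met in Algiers, CNN reported."
sample_entities = [
    {"text": "United Nations", "type": "ORG"},
    {"text": "Algiers", "type": "GPE"},
    {"text": "Algiers", "type": "GPE"},      # duplicate
    {"text": "Barack Obama", "type": "PERSON"},  # not in the text
    {"text": "CNN", "type": "NETWORK"},      # type not in the allowed set
]
checked = filter_valid_entities(sample_entities, sample_text, {"ORG", "GPE", "PERSON"})
print(checked)
```

Filtering at this stage keeps hallucinated spans out of the graph before they become nodes.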

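At this point we have extracted entities from more than 65,000 news articles. These entities will serve as the nodes of the knowledge graph. To build a complete graph, however, we also need to define the relationships (edges) between those nodes, which is the subject of the next step.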

Step 3: Relationship (Edge) Extraction

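A complete knowledge graph needs more than entities (nodes); it also needs the relationships (edges) between them. These relationships form the connective structure of the graph and allow it to express richer semantics. For example, we want to determine:

Which company acquired which company
The price of the acquisition
The date the acquisition was announced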

We reuse the same LLM call function as for entity extraction, but define a new system prompt focused on relationship extraction:

# System prompt for relationship extraction
# We ask for a JSON object with a "relationships" key.
llm_re_system_prompt = (
    "You are an expert system for extracting relationships between entities from text, "
    "specifically focusing on **technology company acquisitions**. "
    "Given an article text and a list of pre-extracted named entities (each with 'text' and 'type'), "
    "your task is to identify and extract relationships. "
    "The 'subject_text' and 'object_text' in your output MUST be exact text spans of entities found in the provided 'Extracted Entities' list. "
    "The 'subject_type' and 'object_type' MUST correspond to the types of those entities from the provided list. "
    "Output ONLY a valid JSON object with a single key 'relationships'. The value of 'relationships' MUST be a list of JSON objects. "
    "Each relationship object must have these keys: 'subject_text', 'subject_type', 'predicate' (one of the types listed above), 'object_text', 'object_type'. "
    "Example: {\"relationships\": [{\"subject_text\": \"Innovatech Ltd.\", \"subject_type\": \"ORG\", \"predicate\": \"ACQUIRED\", \"object_text\": \"Global Solutions Inc.\", \"object_type\": \"ORG\"}]} "
    "If no relevant relationships of the specified types are found between the provided entities, the 'relationships' list should be empty: {\"relationships\": []}."
)

This system prompt directs the LLM to return relationship data in a specific JSON format, for example:

{
  "relationships": [
    {
      "subject_text": "Innovatech Ltd.",
      "subject_type": "ORG",
      "predicate": "ACQUIRED",
      "object_text": "Global Solutions Inc.",
      "object_type": "ORG"
    }
  ]
}

As with entity extraction, we need a parser function to handle the relationship JSON returned by the LLM:

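import json

def parse_llm_relationship_json_output(llm_output_str_rels):
    """
    Parse the LLM's JSON string and return the list of relationships.
    Expected format:
        {"relationships": [{"subject_text": ..., "predicate": ..., "object_text": ...}]}

    Args:
        llm_output_str_rels (str): JSON string returned by the LLM.
    Returns:
        list: Extracted relationships, or an empty list if parsing fails.
    """
    if not llm_output_str_rels:
        return []  # No output: return an empty list

    # Strip a surrounding markdown code fence if present.
    # Whitespace is trimmed first so a trailing newline does not hide the closing fence.
    cleaned = llm_output_str_rels.strip()
    if cleaned.startswith("```json"):
        cleaned = cleaned[len("```json"):]
    if cleaned.endswith("```"):
        cleaned = cleaned[:-3]
    cleaned = cleaned.strip()

    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []  # Malformed JSON: return an empty list

    # Valid JSON that is not an object has no 'relationships' key
    if not isinstance(data, dict):
        return []
    return data.get("relationships", [])  # Relationship list, or empty if the key is missing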

With the system prompt and JSON parser in place, we process each article to extract the relationships between its entities:

# Results list: one record per article
articles_with_llm_relations = []

# Iterate over the entity data of each article
for i, article_entity_data in enumerate(articles_with_llm_entities):
    # Pull the article id, cleaned text, and extracted entities from the record
    article_id_rels = article_entity_data['id']
    article_text_rels = article_entity_data['cleaned_text']
    current_entities = article_entity_data['llm_extracted_entities']

    # Serialize the entity list to a JSON string for inclusion in the prompt
    entities_json_for_prompt = json.dumps(current_entities)

    # Build the user prompt that asks the LLM to extract relationships
    user_prompt_for_re = (
        f"Article Text:\n```\n{article_text_rels}\n```\n\n"
        f"Extracted Entities (use these exact texts for subjects/objects of relationships):\n```json\n{entities_json_for_prompt}\n```\n\n"
        "Identify and extract relationships between these entities based on the system instructions."
    )

    # Call the LLM to extract relationships for this prompt
    llm_response_rels_content = call_llm(
        llm_re_system_prompt,
        user_prompt_for_re,
        TEXT_GEN_MODEL_NAME
    )

    # Start with an empty list of extracted relationships
    extracted_llm_rels = []

    # If the LLM returned something, parse the relationships from its JSON response
    if llm_response_rels_content:
        extracted_llm_rels = parse_llm_relationship_json_output(llm_response_rels_content)

    # Append the original article data together with the extracted relationships
    articles_with_llm_relations.append({
        **article_entity_data,  # keep the original article data (id, text, entities, etc.)
        "llm_extracted_relationships": extracted_llm_rels  # add the extracted relationships
    })

Once processing finishes, we can look at a sample of the relationships extracted from one article:

# Print the relationships of one sample article
print(f"Extracted {len(articles_with_llm_relations[1234]['llm_extracted_relationships'])} relationships using LLM.")
print("  Sample LLM relationships:", articles_with_llm_relations[1234]['llm_extracted_relationships'][:2])

### OUTPUT ###
Extracted 3 relationships using LLM.
  Sample LLM relationships: [
  {
    "subject_text": "Microsoft Corp.",
    "subject_type": "ORG",
    "predicate": "ACQUIRED",
    "object_text": "Nuance Communications Inc.",
    "object_type": "ORG"
  },
  {
    "subject_text": "Nuance Communications Inc.",
    "subject_type": "ORG",
    "predicate": "HAS_PRICE",
    "object_text": "$19.7 billion",
    "object_type": "MONEY"
  }
]
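The system prompt requires every subject and object to be an exact span from the supplied entity list, but nothing in the loop enforces that requirement. A minimal sketch of such a check follows, assuming entities are dicts with 'text' and 'type' keys as in the earlier output; the function name is hypothetical.

```python
def filter_grounded_relationships(relationships, entities):
    """Keep relationships whose subject and object both match a known (text, type) entity."""
    known = {(e["text"], e["type"]) for e in entities}
    required = ("subject_text", "subject_type", "predicate", "object_text", "object_type")
    grounded = []
    for rel in relationships:
        # Skip malformed items and items with missing or empty required keys.
        if not isinstance(rel, dict) or any(not rel.get(key) for key in required):
            continue
        if (rel["subject_text"], rel["subject_type"]) not in known:
            continue
        if (rel["object_text"], rel["object_type"]) not in known:
            continue
        grounded.append(rel)
    return grounded

# Usage with entities and relationships shaped like the sample output.
sample_entities = [
    {"text": "Microsoft Corp.", "type": "ORG"},
    {"text": "Nuance Communications Inc.", "type": "ORG"},
    {"text": "$19.7 billion", "type": "MONEY"},
]
sample_relationships = [
    {"subject_text": "Microsoft Corp.", "subject_type": "ORG", "predicate": "ACQUIRED",
     "object_text": "Nuance Communications Inc.", "object_type": "ORG"},
    # Object is not in the entity list, so this one is dropped.
    {"subject_text": "Microsoft Corp.", "subject_type": "ORG", "predicate": "ACQUIRED",
     "object_text": "Unknown Startup", "object_type": "ORG"},
    # Missing predicate, so this one is dropped.
    {"subject_text": "Nuance Communications Inc.", "subject_type": "ORG",
     "object_text": "$19.7 billion", "object_type": "MONEY"},
]
kept = filter_grounded_relationships(sample_relationships, sample_entities)
print(len(kept))
```

Dropping ungrounded relationships here means every edge added later is guaranteed to connect two nodes that already exist.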

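We have now extracted both entities (nodes) and relationships (edges) from the article dataset, which are the basic building blocks of a knowledge graph.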

Step 4: Entity Normalization

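In unstructured text, the same entity can appear in several different forms. For example, "Microsoft Corp.", "Microsoft", and "MSFT" all refer to the same company. Treating them as separate nodes in the knowledge graph would lose important connections.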

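Entity normalization (also called entity disambiguation or entity resolution) addresses this problem by ensuring that different surface forms of the same entity are recognized as one node. Linking entities to a large external knowledge base such as Wikidata is a complex task, so we take a simplified approach: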

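Text normalization: clean the entity text, for example by removing common suffixes from organization names such as "Inc.", "Ltd.", and "Corp.", so that "Microsoft Corp." becomes "Microsoft".
URI generation: create a unique identifier (URI) for each normalized entity and its type. For example, "Microsoft" with type "ORG" receives a specific URI, and later occurrences of the same entity reuse that URI.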

First, create a function that normalizes entity text:

import re

def normalize_entity_text_for_uri(entity_text, entity_type):
    """
    Normalize entity text, mainly by stripping common organization suffixes.
    """
    normalized_text = entity_text.strip()

    if entity_type == 'ORG':
        # Common suffixes to remove from organization names;
        # extend this list to suit your data
        suffixes_to_remove = [
            'Inc.', 'Incorporated', 'Ltd.', 'Limited', 'LLC', 'L.L.C.',
            'Corp.', 'Corporation', 'PLC', 'Co.', 'Company',
            'Group', 'Holdings', 'Solutions', 'Technologies', 'Systems'
        ]
        # Sort by length so longer matches are tried first (e.g. "Corp." before "Co.")
        suffixes_to_remove.sort(key=len, reverse=True)

        for suffix in suffixes_to_remove:
            # Case-insensitive check that the text ends with the suffix
            if normalized_text.lower().endswith(" " + suffix.lower()) or normalized_text.lower() == suffix.lower():
                # Find where the suffix starts in the original-case string
                suffix_start_index = normalized_text.lower().rfind(suffix.lower())
                # Slice the string to drop the suffix
                normalized_text = normalized_text[:suffix_start_index].strip()
                # Stop after one suffix to avoid over-stripping,
                # e.g. "The The Co." -> "The The" rather than "The"
                break

        # Remove any trailing commas or periods left behind
        normalized_text = re.sub(r'[,.]*$', '', normalized_text).strip()

    # Remove possessives sometimes captured by NER:
    # drop a trailing "'s", but only the apostrophe of a trailing "s'"
    if normalized_text.endswith("'s"):
        normalized_text = normalized_text[:-2].strip()
    elif normalized_text.endswith("s'"):
        normalized_text = normalized_text[:-1].strip()

    # If normalization produced an empty string, fall back to the original (should be rare)
    return normalized_text if normalized_text else entity_text

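Next, we process all entities, normalize their text, and create a URI for each unique entity. Each processed entity keeps its original text and type and gains normalized_text, simple_type, and uri fields.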

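The URI-generation step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the article's own code: the `http://example.org/kg/` namespace, the `make_entity_uri` and `add_uris` helpers, and the choice to copy the entity type into `simple_type` unchanged are all hypothetical. Only the output field names ('processed_entities', 'normalized_text', 'simple_type', 'uri') follow the article.

```python
import re
from urllib.parse import quote

BASE_NAMESPACE = "http://example.org/kg/"  # hypothetical namespace

def make_entity_uri(normalized_text, entity_type):
    """Build a deterministic URI from the normalized text and the entity type."""
    slug = re.sub(r"\s+", "_", normalized_text.strip())
    return f"{BASE_NAMESPACE}{quote(entity_type)}/{quote(slug)}"

def add_uris(articles, normalize=lambda text, entity_type: text.strip()):
    """Attach 'processed_entities' (normalized text plus URI) to each article."""
    uri_registry = {}  # (normalized_text, type) -> URI, one entry per unique entity
    results = []
    for article in articles:
        processed = []
        for ent in article.get("llm_extracted_entities", []):
            normalized = normalize(ent["text"], ent["type"])
            key = (normalized, ent["type"])
            if key not in uri_registry:
                uri_registry[key] = make_entity_uri(normalized, ent["type"])
            processed.append({
                "text": ent["text"],
                "type": ent["type"],
                "normalized_text": normalized,
                "simple_type": ent["type"],  # assumption: type passed through unchanged
                "uri": uri_registry[key],
            })
        results.append({**article, "processed_entities": processed})
    return results

def strip_corp(text, entity_type):
    # Tiny stand-in for normalize_entity_text_for_uri, for this example only.
    return text[:-len(" Corp.")] if text.endswith(" Corp.") else text

sample_articles = [
    {"id": "a1", "llm_extracted_entities": [{"text": "Microsoft Corp.", "type": "ORG"}]},
    {"id": "a2", "llm_extracted_entities": [{"text": "Microsoft", "type": "ORG"}]},
]
articles_with_normalized_entities_and_uris = add_uris(sample_articles, normalize=strip_corp)
for article in articles_with_normalized_entities_and_uris:
    print(article["processed_entities"][0]["uri"])
```

Because the URI is derived only from the normalized text and the type, both surface forms in the usage example resolve to the same node. In practice `normalize_entity_text_for_uri` would be passed as the `normalize` argument.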

After processing, we can look at a few sample normalized entities:

# Show the first 3 processed entities of one sample article
print("Example of processed entities from a sample article:")
for ent in articles_with_normalized_entities_and_uris[2222]['processed_entities'][:3]:
    print(f"  Original: '{ent['text']}' ({ent['type']})")  # original entity text and type
    print(f"  Normalized: '{ent['normalized_text']}' (Simple Type: {ent['simple_type']})")  # cleaned text and type
    print(f"  URI: <{ent['uri']}>")  # URI generated for the entity



### OUTPUT ###