llm_ner_system_prompt = (
    f"You are an expert Named Entity Recognition system. "
    f"From the provided news article text, identify and extract entities. "
    f"The entity types to focus on are: {entity_types_string_for_prompt}. "
    f"For each identified entity, provide its exact text span from the article and its type (use one of the provided types). "
    f"Output ONLY a valid JSON object with a single key 'entities'. The value of 'entities' MUST be a list of JSON objects, "
    f"where each object has 'text' and 'type' keys. "
    f"Example: {{\"entities\": [{{\"text\": \"United Nations\", \"type\": \"ORG\"}}, {{\"text\": \"Barack Obama\", \"type\": \"PERSON\"}}]}} "
    f"If no entities of the specified types are found, the 'entities' list should be empty: {{\"entities\": []}}."
)
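The prompt interpolates entity_types_string_for_prompt, which is defined outside this excerpt. A minimal sketch of how it could be built, assuming a hypothetical type list (ORG, PERSON, GPE, and MONEY all appear in the examples and outputs in this article):
# Hypothetical type list; the exact set used in the original pipeline is not
# shown here, so treat these values as an assumption.
ENTITY_TYPES = ["ORG", "PERSON", "GPE", "MONEY"]
entity_types_string_for_prompt = ", ".join(ENTITY_TYPES)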
This system prompt instructs the LLM to output entity data as valid JSON. Before building the main processing loop, we need a parser function that turns the LLM's text output into a usable list of entities:
import json

def parse_llm_entity_json_output(llm_output_str):
    """
    Parse the LLM's JSON string and return the list of entities.
    Expected format: {"entities": [{"text": "...", "type": "..."}]}
    Args:
        llm_output_str (str): JSON string returned by the LLM.
    Returns:
        list: Extracted entities, or an empty list if parsing fails.
    """
    if not llm_output_str:
        return []
    # Strip a Markdown ```json fence if the model wrapped its answer in one
    if llm_output_str.startswith("```json"):
        llm_output_str = llm_output_str[7:].rstrip("```").strip()
    try:
        data = json.loads(llm_output_str)
        return data.get("entities", [])
    except json.JSONDecodeError:
        return []
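As a quick sanity check (the reply string below is made up for illustration), the function handles a reply wrapped in a Markdown fence, which chat models often produce:
# Hypothetical LLM reply wrapped in a ```json fence
sample_reply = '```json\n{"entities": [{"text": "United Nations", "type": "ORG"}]}\n```'
print(parse_llm_entity_json_output(sample_reply))
# -> [{'text': 'United Nations', 'type': 'ORG'}]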
Now, create a loop that applies this system prompt to every article in the dataset:
# Define the LLM used for entity extraction
TEXT_GEN_MODEL_NAME = "microsoft/phi-4"

# Collect the per-article results here
articles_with_llm_entities = []

# Iterate over the cleaned articles and use the LLM to extract entities
for i, article_data in enumerate(cleaned_articles):
    article_id = article_data['id']
    article_text = article_data['cleaned_text']
    # Call the LLM to extract entities
    llm_response_content = call_llm(
        llm_ner_system_prompt,
        article_text,
        TEXT_GEN_MODEL_NAME
    )
    # Parse the LLM's response into a list of entities
    extracted_llm_entities = []
    if llm_response_content:
        extracted_llm_entities = parse_llm_entity_json_output(llm_response_content)
    # Store the results alongside the article
    articles_with_llm_entities.append({
        "id": article_id,
        "cleaned_text": article_text,
        "summary": article_data['summary'],
        "llm_extracted_entities": extracted_llm_entities
    })
This loop processes our 65,000 news articles and extracts entities from each one. Once it finishes, we can inspect the entities extracted for a single article:
# Print the extracted entities for one sample article
print(articles_with_llm_entities[4212]['llm_extracted_entities'])
### OUTPUT ###
Extracted 20 entities for article ID 4cf51ce937a.
Sample entities: [
{
"text": "United Nations",
"type": "ORG"
},
{
"text": "Algiers",
"type": "GPE"
},
{
"text": "CNN",
"type": "ORG"
}
...
At this point, we have successfully extracted entities from more than 65,000 news articles; these entities will become the nodes of the knowledge graph. To build a complete knowledge graph, however, we still need to define the relationships (edges) between these nodes, which is the focus of the next step.
Step 3: Relationship (Edge) Extraction
To build a complete knowledge graph, we must capture not only the entities (nodes) but also the relationships (edges) between them. These relationships form the connective structure of the graph and allow it to express richer semantics. For example, we need to determine:
- which company acquired which company
- the price of the acquisition deal
- when the acquisition was announced
We will reuse the same LLM call function as for entity extraction, but define a new system prompt focused on relationship extraction:
# System prompt for relationship extraction.
# We ask for a JSON object with a "relationships" key.
llm_re_system_prompt = (
    "You are an expert system for extracting relationships between entities from text, "
    "specifically focusing on **technology company acquisitions**. "
    "Given an article text and a list of pre-extracted named entities (each with 'text' and 'type'), "
    "your task is to identify and extract relationships. "
    "The 'subject_text' and 'object_text' in your output MUST be exact text spans of entities found in the provided 'Extracted Entities' list. "
    "The 'subject_type' and 'object_type' MUST correspond to the types of those entities from the provided list. "
    "Output ONLY a valid JSON object with a single key 'relationships'. The value of 'relationships' MUST be a list of JSON objects. "
    "Each relationship object must have these keys: 'subject_text', 'subject_type', 'predicate' (one of the types listed above), 'object_text', 'object_type'. "
    "Example: {\"relationships\": [{\"subject_text\": \"Innovatech Ltd.\", \"subject_type\": \"ORG\", \"predicate\": \"ACQUIRED\", \"object_text\": \"Global Solutions Inc.\", \"object_type\": \"ORG\"}]} "
    "If no relevant relationships of the specified types are found between the provided entities, the 'relationships' list should be empty: {\"relationships\": []}."
)
In this system prompt, we instruct the LLM to output relationship data in a specific JSON format, for example:
{
    "relationships": [
        {
            "subject_text": "Innovatech Ltd.",
            "subject_type": "ORG",
            "predicate": "ACQUIRED",
            "object_text": "Global Solutions Inc.",
            "object_type": "ORG"
        }
    ]
}
As with entity extraction, we need a parsing function to handle the relationship JSON returned by the LLM:
def parse_llm_relationship_json_output(llm_output_str_rels):
    """
    Parse the LLM's JSON string and extract relationships.
    Expected format:
    {"relationships": [{"subject_text": ..., "predicate": ..., "object_text": ...}]}
    Args:
        llm_output_str_rels (str): JSON string returned by the LLM.
    Returns:
        list: Extracted relationships, or an empty list if parsing fails.
    """
    if not llm_output_str_rels:
        return []
    # Strip a Markdown ```json fence if present
    if llm_output_str_rels.startswith("```json"):
        llm_output_str_rels = llm_output_str_rels[7:].rstrip("```").strip()
    try:
        data = json.loads(llm_output_str_rels)
        return data.get("relationships", [])
    except json.JSONDecodeError:
        return []
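The try/except branch matters in practice, because the model occasionally truncates or malforms its JSON. A small illustration (the broken string below is made up):
# Deliberately truncated JSON to exercise the error-handling path
broken_reply = '{"relationships": [{"subject_text": "Acme"'
print(parse_llm_relationship_json_output(broken_reply))   # -> []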
Now we apply this system prompt and JSON parser to every article to extract the relationships between its entities:
# Collect the per-article relationship results here
articles_with_llm_relations = []

for i, article_entity_data in enumerate(articles_with_llm_entities):
    article_id_rels = article_entity_data['id']
    article_text_rels = article_entity_data['cleaned_text']
    current_entities = article_entity_data['llm_extracted_entities']
    entities_json_for_prompt = json.dumps(current_entities)
    # Build the user prompt: the article text plus the entities extracted earlier
    user_prompt_for_re = (
        f"Article Text:\n```\n{article_text_rels}\n```\n\n"
        f"Extracted Entities (use these exact texts for subjects/objects of relationships):\n```json\n{entities_json_for_prompt}\n```\n\n"
        "Identify and extract relationships between these entities based on the system instructions."
    )
    llm_response_rels_content = call_llm(
        llm_re_system_prompt,
        user_prompt_for_re,
        TEXT_GEN_MODEL_NAME
    )
    # Parse the LLM's response into a list of relationships
    extracted_llm_rels = []
    if llm_response_rels_content:
        extracted_llm_rels = parse_llm_relationship_json_output(llm_response_rels_content)
    # Store the relationships alongside the article and its entities
    articles_with_llm_relations.append({
        **article_entity_data,
        "llm_extracted_relationships": extracted_llm_rels
    })
After processing, we can look at a sample of the relationships extracted from one article:
print(f"Extracted {len(articles_with_llm_relations[1234]['llm_extracted_relationships'])} relationships using LLM.")
print(" Sample LLM relationships:", articles_with_llm_relations[1234]['llm_extracted_relationships'][:2])
### OUTPUT ###
Extracted 3 relationships using LLM.
Sample LLM relationships: [
{
"subject_text": "Microsoft Corp.",
"subject_type": "ORG",
"predicate": "ACQUIRED",
"object_text": "Nuance Communications Inc.",
"object_type": "ORG"
},
{
"subject_text": "Nuance Communications Inc.",
"subject_type": "ORG",
"predicate": "HAS_PRICE",
"object_text": "$19.7 billion",
"object_type": "MONEY"
}
]
At this point, we have successfully extracted both entities (nodes) and relationships (edges) from the article dataset, which gives us the basic building blocks of the knowledge graph.
Step 4: Entity Normalization
When working with unstructured text, the same entity can appear in many different forms. For example, "Microsoft Corp.", "Microsoft", and "MSFT" all refer to the same company. Treating them as separate nodes in the knowledge graph would lose important connections.
Entity normalization (also called entity disambiguation or entity resolution) addresses this problem by ensuring that different surface forms of the same entity are resolved to a single node. Linking entities to a large external knowledge base such as Wikidata is a complex task, so we take a simplified approach:
- Text normalization: clean the entity text, for example by removing common organization suffixes such as "Inc.", "Ltd.", and "Corp.", so that "Microsoft Corp." becomes "Microsoft".
- URI generation: create a unique identifier (URI) for each normalized entity and its type. For example, "Microsoft" with type "ORG" gets a specific URI, and later occurrences of the same entity reuse that URI.
First, create a function to normalize entity text:
import re

def normalize_entity_text_for_uri(entity_text, entity_type):
    """
    Normalize entity text, mainly by removing common organization suffixes.
    """
    normalized_text = entity_text.strip()
    if entity_type == 'ORG':
        suffixes_to_remove = [
            'Inc.', 'Incorporated', 'Ltd.', 'Limited', 'LLC', 'L.L.C.',
            'Corp.', 'Corporation', 'PLC', 'Co.', 'Company',
            'Group', 'Holdings', 'Solutions', 'Technologies', 'Systems'
        ]
        # Check longer suffixes first so "Corporation" is matched before "Corp."
        suffixes_to_remove.sort(key=len, reverse=True)
        for suffix in suffixes_to_remove:
            if normalized_text.lower().endswith(" " + suffix.lower()) or normalized_text.lower() == suffix.lower():
                suffix_start_index = normalized_text.lower().rfind(suffix.lower())
                normalized_text = normalized_text[:suffix_start_index].strip()
                break
    # Drop trailing punctuation and possessive endings
    normalized_text = re.sub(r'[,.]*$', '', normalized_text).strip()
    if normalized_text.endswith("'s") or normalized_text.endswith("s'"):
        normalized_text = normalized_text[:-2].strip()
    return normalized_text if normalized_text else entity_text
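For example, calling the function on a few surface forms:
print(normalize_entity_text_for_uri("Microsoft Corp.", "ORG"))    # -> Microsoft
print(normalize_entity_text_for_uri("Innovatech Ltd.", "ORG"))    # -> Innovatech
print(normalize_entity_text_for_uri("Barack Obama's", "PERSON"))  # -> Barack Obama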
Next, we process all entities, normalizing their text and creating a URI for each unique entity:
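A minimal sketch of this step, assuming a base namespace of http://example.org/kg/ and a quote_plus-based slug for the identifiers (both illustrative choices); the output keys ('normalized_text', 'simple_type', 'uri') match the fields inspected in the sample below:
from urllib.parse import quote_plus

# Assumed base namespace for entity URIs (illustrative, not from the original pipeline)
EX_BASE_URI = "http://example.org/kg/"

articles_with_normalized_entities_and_uris = []
entity_uri_registry = {}  # (normalized_text, simple_type) -> URI, shared across articles

for article_data in articles_with_llm_relations:
    processed_entities = []
    for ent in article_data['llm_extracted_entities']:
        ent_text = ent.get('text', '')
        ent_type = ent.get('type', 'UNKNOWN')
        normalized_text = normalize_entity_text_for_uri(ent_text, ent_type)
        simple_type = ent_type.upper()
        key = (normalized_text.lower(), simple_type)
        if key not in entity_uri_registry:
            # Build a stable, URL-safe identifier from the normalized text and type
            entity_uri_registry[key] = f"{EX_BASE_URI}{simple_type}/{quote_plus(normalized_text)}"
        processed_entities.append({
            **ent,
            "normalized_text": normalized_text,
            "simple_type": simple_type,
            "uri": entity_uri_registry[key]
        })
    articles_with_normalized_entities_and_uris.append({
        **article_data,
        "processed_entities": processed_entities
    })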
Once processing is complete, we can inspect a few of the normalized entities:
print("Example of processed entities from the first article (sample):")
for ent in articles_with_normalized_entities_and_uris[2222]['processed_entities'][:3]:
    print(f"  Original: '{ent['text']}' ({ent['type']})")
    print(f"  Normalized: '{ent['normalized_text']}' (Simple Type: {ent['simple_type']})")
    print(f"  URI: <{ent['uri']}>")