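Teaching Recommendation to Think: Using Reinforcement Learning to Activate the Sequential Reasoning Ability of Large Language Models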

AI科技评论 (AI Technology Review) · WeChat official account · 2025-05-22 17:00


User Behavior Sequence Reasoning

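The original work, Rec_R1, ran experiments over different combinations of large language model (LLM), reinforcement learning algorithm, and sparse or dense retrieval method. Its focus is training the LLM to output the best possible search keywords, so that when the sparse or dense index is searched, the target article ranks as high as possible in the results. A higher rank indirectly indicates better keywords. In the limiting case where the target ranks first, the LLM has successfully predicted the next article the user will click.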

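In most cases we rely on the search evaluation metric NDCG[6] to objectively reflect where the target article sits in the search results: the higher the NDCG, the closer the target is to the top, which is the outcome we want. We therefore turn the NDCG score into the reward function of the reinforcement learning algorithm GRPO[7]. Following the basic principle of reinforcement learning, the LLM is left to evolve under the training objective we specify and to converge on its own, until the evaluation metrics we care about reach a reasonably good range.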
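With a single target article per sample, NDCG takes a simple closed form. The following is a sketch under two assumptions that the text does not state explicitly: relevance is binary (1 for the target, 0 for every other result), and the results are cut off at some depth k. Here r denotes the 1-based rank of the target article.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)} = \frac{1}{\log_2(r+1)} \quad (r \le k),
\qquad
\mathrm{IDCG@}k = \frac{1}{\log_2 2} = 1

\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k} =
\begin{cases}
\dfrac{1}{\log_2(r+1)} & \text{if } r \le k \\[2ex]
0 & \text{if the target is not among the top } k
\end{cases}
```

Under these assumptions a target at rank 1 scores 1, a target at rank 3 scores 1/log2(4) = 0.5, and the score decays logarithmically with rank.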

3.1 The Rec_E1 Training Framework

This work upgrades Rec_R1 into Rec_E1. The main changes are:

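Upgraded veRL from v0.1 to v0.3, which supports more reinforcement learning algorithms
Added support for the ElasticSearch and BM25S search backends, and explored how the BM25 algorithm behaves differently across backend implementations
Added support for the MIND-small news recommendation dataset
Fixed several installation problems caused by the Pyserini upgrade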

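Before starting, we need a baseline control group for the later experiments. The baseline combination in this work is: Qwen2.5-3B-instruct + the GRPO reinforcement learning algorithm + the BM25 search algorithm[9] (Pyserini sparse index[8]) + the small MIND-small dataset.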

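The GRPO reward combines the NDCG search score and a score for the LLM's output format. Because the NDCG score never exceeds 1 and is small in most cases, it is multiplied by 10, so the final reward score is Advantage = NDCG_score * 10 + response_format_score.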
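The reward composition can be sketched as follows. This is a minimal illustration, not the Rec_E1 code: the value of `FORMAT_SCORE`, the rule for what counts as well-formed output (a JSON object with a string `query` inside `<answer>` tags), and all function names are assumptions. The last function shows the group-relative normalization that GRPO applies to rewards of responses sampled for the same prompt; using the population standard deviation and a small epsilon is also an assumption.

```python
import json
import re
from statistics import mean, pstdev
from typing import List

NDCG_SCALE = 10.0    # from the text: the NDCG score is multiplied by 10
FORMAT_SCORE = 0.1   # hypothetical value; the source does not state it

# Assumed output convention: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def response_format_score(response: str) -> float:
    """Return FORMAT_SCORE if the answer is a JSON object with a string "query"."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return 0.0
    if isinstance(payload, dict) and isinstance(payload.get("query"), str):
        return FORMAT_SCORE
    return 0.0


def total_reward(ndcg_score: float, response: str) -> float:
    """Reward from the text: NDCG_score * 10 + response_format_score."""
    return ndcg_score * NDCG_SCALE + response_format_score(response)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Scaling NDCG by 10 keeps the retrieval signal dominant over the format term once the model has learned the output format, which matches the motivation given in the text.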

3.2 基础对照组实验

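User click sequences in MIND-small can be very long. To bound the input length and keep the required GPU memory stable, we keep only the user's 8 most recent clicked articles and take the 9th article as the target. To speed up training, the training set is the first 10,000 samples of the MIND-small train split and the test set is the first 2,000 samples of the dev split. Training runs for 3 epochs on two A800 GPUs.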
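The truncation step can be sketched as below. The text does not say exactly how the 9th article is chosen, so this sketch assumes the clicks are in chronological order, the most recent click is the target, and the 8 clicks before it form the input. Keeping users with shorter histories, rather than dropping them, is also an assumption, and `make_sample` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple

HISTORY_LEN = 8  # from the text: keep the 8 most recent clicks as input


def make_sample(
    clicks: List[str], history_len: int = HISTORY_LEN
) -> Optional[Tuple[List[str], str]]:
    """Split a chronologically ordered click list into (history, target).

    The last click is treated as the target (assumption); the up to
    `history_len` clicks immediately before it are the model input.
    """
    if len(clicks) < 2:
        return None  # need at least one history item and one target
    target = clicks[-1]
    history = clicks[-(history_len + 1):-1]
    return history, target


# Example with 12 clicks N1..N12: history is N4..N11, target is N12.
clicks = [f"N{i}" for i in range(1, 13)]
history, target = make_sample(clicks)
```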
In the baseline experiment, a typical LLM input looks like this:

You first think about the reasoning process in the mind and then provide the user with the answer. Current time is 2019-11-14.
You are very good at personalized news recommendation. Based on the user's clicked news sequence, your task is to generate a search query to search for the next news that the user is most likely to click on in the news retrieval system.
In other words, the user is likely to click on news related to the previous clicks.
Below are the titles and abstract of news that users have clicked on in sequence. The next one news is not in the list.

News-1
Title: Bruce Willis brought Demi Moore to tears after reading her book
Abstract: Demi wasn't sure how her ex would feel about her book until he called her.

News-2
Title: ...

Show your thought process in <think> </think> tags. Your final response must be in JSON format within <answer> </answer> tags. The generated query should use Boolean operators (AND, OR) to structure your query logically. For example:

<answer>
{"query": xxx}
</answer>

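To work well with the BM25 search algorithm, the LLM output is required to be a set of keywords joined by AND or OR. During GRPO training, a typical LLM output looks like this: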


<think>
First, I analyzed the topics of the given news articles. News 1 is about a wildlife rescue, News 2 is about economic issues, News 3 is about Trump's legal troubles, News 4 is about political matters, News 5 is about gift ideas, News 6 is about a suspected serial killer, News 7 is about animal welfare, and News 8 is about community service. To find the most likely next news, I'll look for topics that are related to these categories but not too similar to the previous ones. A good candidate would be another news article about community service or a related topic, such as a story about a senior citizen or a charity event. However, since the last news was about a senior shelter dog, a related topic could be another story about animals or community service. The query is constructed to include these related topics to ensure the next news is likely to be clicked on.
</think>


<answer>
{"query": "fishermen rescue deer OR Trump economic OR NY judge settlement OR impeachment inquiry OR holiday gift ideas OR serial killer OR senior shelter dog OR Waffle House employee"}
</answer>

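Here the BM25 search runs on a Pyserini index whose stored fields are the article id and the article body. Since articles in the MIND-small dataset have no body text, the title and abstract are concatenated and used as the "body".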
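A minimal sketch of preparing such documents is shown below. It relies on two facts: the MIND `news.tsv` columns are news ID, category, subcategory, title, abstract, URL, title entities, abstract entities, and Pyserini's JSON collection format expects an `id` and a `contents` field per document. The helper names and the choice of a single space as separator are assumptions, not the author's code.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator, List


def read_news_tsv(handle) -> Iterator[List[str]]:
    """Read tab-separated news rows; QUOTE_NONE keeps quotes in titles literal."""
    return csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def news_to_docs(rows: Iterable[List[str]]) -> Iterator[Dict[str, str]]:
    """Map each row to {"id", "contents"}, where contents = title + abstract."""
    for row in rows:
        news_id, title, abstract = row[0], row[3], row[4]
        # The abstract may be empty; strip() removes the dangling space.
        contents = f"{title} {abstract}".strip()
        yield {"id": news_id, "contents": contents}


def to_jsonl(docs: Iterable[Dict[str, str]]) -> str:
    """Serialize documents one JSON object per line, ready for indexing."""
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in docs)


# Tiny in-memory example standing in for news.tsv.
sample = (
    'N1\tmovies\tceleb\tTitle "A"\tSome abstract\thttp://x\t[]\t[]\n'
    "N2\tnews\tus\tOnly title\t\thttp://y\t[]\t[]\n"
)
docs = list(news_to_docs(read_news_tsv(io.StringIO(sample))))
jsonl = to_jsonl(docs)
```

The resulting JSONL file would then be passed to Pyserini's indexing tool to build the sparse index.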

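The query is parsed from the LLM output and run against the Pyserini index to obtain a recall set. The position of the target article within that recall set is then evaluated by computing NDCG. If the target article does not appear in the recall set, the NDCG score is 0.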
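The parse, search, and score loop can be sketched as follows. It assumes the answer is wrapped in `<answer>` tags, binary relevance with a single target, and a hypothetical recall depth `k`. The search backend is passed in as a plain callable so the sketch runs without an index; with Pyserini, that callable could wrap `LuceneSearcher(index_dir).search(query, k=k)` and return each hit's `docid`.

```python
import json
import math
import re
from typing import Callable, List, Optional

# Assumed output convention: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_query(response: str) -> Optional[str]:
    """Pull the "query" string out of the JSON answer, or None if malformed."""
    match = ANSWER_RE.search(response)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    query = payload.get("query") if isinstance(payload, dict) else None
    return query if isinstance(query, str) else None


def ndcg_for_target(ranked_ids: List[str], target_id: str) -> float:
    """NDCG with one relevant document: 1 / log2(rank + 1), or 0 if absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def score_response(
    response: str,
    target_id: str,
    search: Callable[[str, int], List[str]],
    k: int = 100,  # hypothetical recall depth; the source does not state it
) -> float:
    """Parse the query, retrieve the top-k ids, and score the target's rank."""
    query = extract_query(response)
    if query is None:
        return 0.0
    return ndcg_for_target(search(query, k), target_id)


def fake_search(query: str, k: int) -> List[str]:
    """Stand-in for a real index: always returns the same ranked ids."""
    return ["N7", "N3", "N9"][:k]


resp = '<answer>{"query": "senior shelter dog OR serial killer"}</answer>'
top_hit_score = score_response(resp, "N7", fake_search)
third_hit_score = score_response(resp, "N9", fake_search)
```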

wandb makes it easy to inspect the training details. The curves for the individual reward components in the figure below show that:

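Many of the queries produced by the LLM still fail to retrieve the target article. wrong_answer_ratio and all_correct_ratio are complementary, and all_correct_ratio shows that at best about 18% of the query results contain the target article.
Through reinforcement learning the LLM gradually learned to produce output in the required format: the format error rate format_error_ratio quickly drops to near 0.
Among the samples where the LLM's query hits the target article, the mean NDCG peaks at 0.3353/10 = 0.03353 (dividing by 10 undoes the reward scaling).
This provides initial evidence that training an LLM with reinforcement learning to reason over behavior sequences is feasible.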

3.3 The Deceptiveness of BM25

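The original work used the BM25 implementation in Pyserini and obtained good results on an Amazon e-commerce recommendation dataset. Because BM25 is so widely applicable, industry teams can put it to use right away. Having come this far, your next step is probably to adapt the approach to local data on the ElasticSearch cluster deployed inside your company. In practice, though, there are some pitfalls to watch out for.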