
DeepSeek Tackles Reward Modeling with Scalable Approach

DeepSeek CEO. Image credits: Getty Images

DeepSeek AI, the Chinese research lab known for building powerful open-source language models like DeepSeek-R1, has just unveiled a new approach that could push reward modeling in AI to a new level. Its latest innovation, Self-Principled Critique Tuning (SPCT), is designed to make reward models more generalist, flexible, and scalable across a wide range of complex tasks that current models struggle to handle. This could significantly improve how large language models understand and respond to open-ended or subjective queries, which are notoriously difficult to evaluate with precision.

In reinforcement learning, reward models act like internal judges that guide large language models (LLMs) toward better answers by assigning scores to outputs during training. While this system has worked well in clearly defined fields like math or coding, it is far less effective in broader, messier domains where there is no single correct answer. DeepSeek's team points out that existing reward models often fall short on open-ended questions, mainly because they are built for specific tasks and cannot scale well across different types of content or user intents.
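To make the "internal judge" idea concrete, here is a minimal sketch of how a conventional scalar reward model is used during training: it scores candidate outputs and the training loop favors the highest-scoring one. The function names and the toy scoring rule are illustrative assumptions, not DeepSeek's code.

```python
# Toy stand-in for a scalar reward model acting as a judge during RL-style training.
def reward_model(prompt: str, response: str) -> float:
    """Score a response for a prompt (higher is better). A real reward model
    would be a trained network; this toy just prefers longer answers."""
    return float(len(response.split()))

def pick_best(prompt: str, candidates: list[str]) -> str:
    """The training loop keeps the candidate the reward model scores highest."""
    return max(candidates, key=lambda r: reward_model(prompt, r))

print(pick_best("Explain overfitting", [
    "It is bad.",
    "Overfitting means the model memorizes training data instead of generalizing.",
]))
```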

To overcome this, the researchers designed SPCT to train models that can evaluate responses dynamically and with more nuance. Unlike older reward models that rely on static scoring rules, SPCT allows a model to generate principles and critiques on the fly, letting it adapt its evaluation process to the context of each query. By training the model to generate its own evaluation guidelines, the system becomes more flexible and better equipped to handle diverse and evolving scenarios.
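A hedged sketch of that idea, as described above: the model first writes evaluation principles for the query, then critiques each response against them and emits a score. The `llm` callable, prompt wording, and score format below are assumptions for illustration, not DeepSeek's actual interface.

```python
from typing import Callable

def spct_style_score(llm: Callable[[str], str], query: str, response: str) -> str:
    # Step 1: generate query-specific evaluation principles on the fly.
    principles = llm(f"List the key principles for judging an answer to: {query}")
    # Step 2: critique the response against those principles and emit a score.
    critique = llm(
        "Using these principles:\n" + principles +
        f"\n\nCritique this answer to '{query}' and end with 'Score: <1-10>'.\n{response}"
    )
    return critique  # the numeric score is parsed out of the critique text

# Dummy model so the sketch runs end to end:
dummy_llm = lambda p: "1. Be factual. 2. Be complete." if p.startswith("List") else "Reasonable answer. Score: 7"
print(spct_style_score(dummy_llm, "What causes tides?", "Mostly the Moon's gravity."))
```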

The process begins with a fine-tuning phase in which the model is taught to produce clear principles and critiques for a range of input types. It learns through trial and error, improving only when its generated rewards match the known correct answers. Once this baseline is established, a second phase uses rule-based reinforcement learning to refine the system further. The model continues generating principles and critiques, but this time it is rewarded based on how well its outputs align with established accuracy rules.
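The article does not spell out the accuracy rule, but one plausible form, sketched below, is that a critique pass earns a positive reward only when the response it rates highest matches the known-good reference. The data format here is invented for illustration.

```python
def rule_based_reward(predicted_scores: dict[str, float], ground_truth_best: str) -> float:
    """+1 if the generated critique ranks the reference answer on top, else -1."""
    predicted_best = max(predicted_scores, key=predicted_scores.get)
    return 1.0 if predicted_best == ground_truth_best else -1.0

print(rule_based_reward({"resp_a": 6.0, "resp_b": 8.5}, ground_truth_best="resp_b"))  # 1.0
```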

To address the challenge of inference-time scaling, that is, making better judgments when more computing power is available, SPCT enables the model to produce multiple responses for the same task. It then aggregates the best results through a voting mechanism, improving accuracy by taking a broader view. To avoid noise or bias in this process, DeepSeek also created a lightweight secondary model, or "meta RM," to evaluate which of the generated critiques are likely to be useful. This meta evaluator filters out low-quality outputs, further improving the overall result.
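A rough sketch of that inference-time loop: sample several principle-and-critique passes, let the meta RM filter out weak ones, then pool the surviving scores as a simple vote. The function names, the top-k cutoff, and the meta RM stand-in are illustrative assumptions.

```python
from collections import defaultdict

def aggregate_with_meta_rm(samples, meta_rm_score, keep_top_k=4):
    """samples: list of dicts mapping response_id -> score from one critique pass."""
    # Keep only the critique passes the meta RM trusts most.
    trusted = sorted(samples, key=meta_rm_score, reverse=True)[:keep_top_k]
    totals = defaultdict(float)
    for scores in trusted:
        for response_id, score in scores.items():
            totals[response_id] += score
    return max(totals, key=totals.get)  # response with the highest pooled score

# Toy usage: three sampled critique passes over two candidate responses.
passes = [{"a": 7, "b": 5}, {"a": 4, "b": 9}, {"a": 8, "b": 6}]
print(aggregate_with_meta_rm(passes, meta_rm_score=lambda s: max(s.values()), keep_top_k=2))
```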

The team applied SPCT to Google’s open-weight Gemma-2-27B model, creating DeepSeek-GRM-27B. In head-to-head comparisons, it outperformed many of the leading reward models, including those from OpenAI and other top labs. More impressively, the model showed strong performance gains when given more compute resources, a sign that it is well suited for enterprise-grade deployment. In fact, it even surpassed larger models in performance thanks to its more refined critique-generation process.

This approach doesn’t just promise better results—it could also help enterprises adapt AI more effectively. Whether it’s for creative work, customer service, or handling complex internal queries, the ability to evaluate outputs in real time and across diverse contexts is a game-changer. DeepSeek-GRM also demonstrated reduced bias, performing more consistently across different domains, unlike scalar reward models that tend to underperform in subjective or less structured tasks.

While SPCT still trails behind traditional models in simple, rule-based domains where quick answers are needed, its ability to scale across tasks and handle open-ended inputs is a major leap forward. Efficiency remains a challenge, but the potential applications are massive. The DeepSeek team plans to build on this with future updates aimed at boosting speed, integrating SPCT into real-time learning pipelines, and using it as a robust evaluation system for large-scale AI deployments.
