Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts
arXiv:2602.13367v1 Announce Type: new Abstract: We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.
Executive Summary
The article introduces Nanbeige4.1-3B, a compact yet versatile language model with 3 billion parameters, capable of agentic behavior, code generation, and general reasoning. The model employs a combination of point-wise and pair-wise reward modeling to enhance reasoning and preference alignment, while complexity-aware rewards in Reinforcement Learning optimize code generation for correctness and efficiency. Notably, it achieves stable long-horizon tool interactions, executing up to 600 tool-call turns for complex problem-solving. Experimental results demonstrate that Nanbeige4.1-3B outperforms similar and even larger models, highlighting the potential of small models to achieve both broad competence and strong specialization.
Key Points
- Nanbeige4.1-3B is a 3-billion-parameter model that achieves versatility in agentic behavior, code generation, and general reasoning.
- The model combines point-wise and pair-wise reward modeling for improved reasoning and preference alignment.
- Complexity-aware rewards in Reinforcement Learning optimize code generation for both correctness and efficiency.
- The model can reliably execute up to 600 tool-call turns for complex problem-solving.
- Experimental results show that Nanbeige4.1-3B outperforms models of similar and even larger scale.
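The abstract does not spell out how the point-wise and pair-wise reward signals are combined, so the sketch below is only illustrative: it assumes a squared-error point-wise term, a Bradley-Terry pair-wise preference term, and an `alpha` mixing weight, all of which are hypothetical choices rather than the paper's actual formulation.

```python
import math

def pointwise_loss(score: float, label: float) -> float:
    # Point-wise term: regress the scalar reward toward a human quality
    # label (squared error; the paper's actual loss is not specified).
    return (score - label) ** 2

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    # Pair-wise term: Bradley-Terry preference loss, pushing the chosen
    # response's score above the rejected one's.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def combined_reward_loss(score: float, label: float,
                         score_chosen: float, score_rejected: float,
                         alpha: float = 0.5) -> float:
    # Hypothetical linear mix of the two signals; alpha is an assumed
    # hyperparameter, not a value reported by the paper.
    return alpha * pointwise_loss(score, label) \
        + (1.0 - alpha) * pairwise_loss(score_chosen, score_rejected)
```

In this framing, the point-wise term anchors scores to absolute quality judgments while the pair-wise term enforces relative preferences, which is one plausible way a single reward model could serve both reasoning quality and alignment.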
Merits
Versatility
Nanbeige4.1-3B demonstrates a high degree of versatility, excelling in multiple domains such as agentic behavior, code generation, and general reasoning.
Performance
The model outperforms both similar-sized and larger models, indicating a significant advancement in the capabilities of small language models.
Innovative Techniques
The use of point-wise and pair-wise reward modeling, along with complexity-aware rewards in Reinforcement Learning, represents innovative approaches to improving model performance.
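A complexity-aware code reward of the kind described could be sketched as follows. The abstract gives no formula, so gating the efficiency bonus on test correctness and shrinking it linearly with runtime are assumptions made here for illustration only.

```python
def complexity_aware_reward(passed: bool, runtime_s: float,
                            time_limit_s: float) -> float:
    """Hypothetical reward combining correctness and efficiency.

    A solution earns the base reward only if it passes the tests; faster
    solutions (relative to the time limit) earn an extra bonus, so RL
    pressure is applied to efficiency as well as correctness.
    """
    if not passed:
        return 0.0
    base = 1.0
    # Efficiency bonus shrinks linearly as runtime approaches the limit.
    bonus = max(0.0, 1.0 - runtime_s / time_limit_s)
    return base + 0.5 * bonus
```

Under this scheme two correct solutions are no longer interchangeable: the one that runs in a fifth of the time limit is rewarded more than one that barely fits, which is the general idea behind optimizing "both correctness and efficiency."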
Demerits
Scalability
While the model shows impressive performance, the scalability of its techniques to even larger models remains untested.
Long-Term Stability
Although the model can execute up to 600 tool-call turns, the long-term stability and reliability of such interactions need further validation.
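Running hundreds of tool-call turns reliably implies an agent loop with an explicit turn budget and per-turn bookkeeping. The sketch below is a generic such loop, not the paper's implementation; the 600-turn cap is the only number taken from the abstract, and the `next_action`/`call_tool` callables are hypothetical interfaces.

```python
from typing import Callable, Optional

MAX_TURNS = 600  # cap reported in the abstract

def run_agent(next_action: Callable[[list], Optional[dict]],
              call_tool: Callable[[dict], str],
              max_turns: int = MAX_TURNS) -> list:
    """Generic capped tool-use loop (illustrative, not the paper's code).

    `next_action` maps the transcript so far to the next tool call (or
    None to stop); `call_tool` executes one call. Enforcing the turn
    budget keeps a runaway agent from looping forever.
    """
    transcript: list = []
    for turn in range(max_turns):
        action = next_action(transcript)
        if action is None:  # the model decides the task is solved
            break
        observation = call_tool(action)
        transcript.append({"turn": turn,
                           "action": action,
                           "observation": observation})
    return transcript
```

Turn-level supervision, as the abstract describes it, would attach a training signal to each entry of such a transcript rather than only to the final answer, which is what makes long-horizon stability trainable in the first place.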
Open-Source Limitations
As an open-source project, it may face tighter constraints on development resources and sustained maintenance than well-funded proprietary models, even as its openness invites community contributions.
Expert Commentary
The introduction of Nanbeige4.1-3B marks a significant milestone in the development of small language models. Its ability to achieve strong performance across multiple domains while maintaining a relatively small parameter count challenges the conventional wisdom that larger models are inherently superior. The innovative use of reward modeling techniques and complexity-aware rewards in Reinforcement Learning demonstrates a nuanced approach to optimizing model performance. However, the long-term stability and scalability of these techniques remain areas of concern. The model's open-source nature presents both opportunities and challenges, as it allows for community-driven improvements but may also face resource constraints. Overall, Nanbeige4.1-3B sets a new benchmark for small language models and highlights the potential for further advancements in the field.
Recommendations
- Further research should focus on validating the long-term stability and scalability of the techniques employed in Nanbeige4.1-3B.
- Policymakers should consider the implications of small, versatile models for resource allocation and AI development strategies.