Flash News

Hugging Face Co-Founder Warns of Agentic LLM Training Pitfalls

Clement Delangue, co-founder and CEO of Hugging Face, said that most teams currently training agentic LLMs with reinforcement learning (RL) are unaware of hidden damage in their training loops.

Single-turn RL works well, but once tool calls are added, teams see abnormal loss spikes and shape mismatch errors. The root cause is an inconsistent token round-trip: when model output is parsed into text and the conversation is then re-tokenized, the resulting token IDs can differ from the tokens the model actually sampled, so gradients are computed against the wrong sequence.
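
As a minimal illustration of the round-trip problem described above, the toy Python sketch below uses a hypothetical four-entry vocabulary and a greedy longest-match encoder; neither is taken from the Hugging Face article or from any real tokenizer. It shows how tokens a model sampled can decode to text that re-encodes to a different, shorter ID sequence, so the per-token log-probabilities no longer line up.

```python
# Toy vocabulary (hypothetical): "ab" exists as a single token,
# but a model is still free to sample "a" followed by "b".
VOCAB = {"a": 0, "b": 1, "ab": 2, "c": 3}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}
MAX_PIECE_LEN = max(len(text) for text in VOCAB)


def decode(ids):
    """Concatenate the text of each token ID."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_PIECE_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


# What the policy actually sampled, with one log-probability per token.
sampled_ids = [0, 1, 3]                 # "a", "b", "c"
sampled_logprobs = [-0.7, -1.2, -0.3]   # illustrative values only

# Decode to text (e.g. to parse a tool call), then re-tokenize.
retokenized_ids = encode(decode(sampled_ids))

print("sampled:    ", sampled_ids)
print("retokenized:", retokenized_ids)
print("lengths match:", len(retokenized_ids) == len(sampled_logprobs))
```

The text is identical in both cases, yet the ID sequences differ in content and length. A training loop that pairs the re-tokenized IDs with the log-probabilities recorded at sampling time either fails on the length mismatch or, when lengths happen to agree, silently scores the wrong tokens.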

On the market side, AI research teams and open-source model developers are accelerating adoption of the correct Token-In Token-Out training method. Event-driven funding is shifting away from problematic training frameworks toward stable multi-turn RL tooling. Hugging Face and frameworks that support correct multi-turn training stand to benefit, while projects with incorrect token handling face pressure.

Source: Public Information

ABAB AI Insight

Clement Delangue, co-founder and CEO of Hugging Face, has long advocated best practices in open-source LLM training and has previously highlighted engineering-detail risks in multimodal and agent frameworks. The in-depth article released by his team comprehensively audits the chat templates of mainstream open-source model families.

On the capital side, the Hugging Face team is investing engineering resources in a correct Token-In Token-Out implementation. By never re-encoding sampled tokens, it eliminates gradient pollution in multi-turn tool calls, turning hidden waste into reliable convergence and saving significant compute for large-scale agentic system development.

The situation resembles the reward hacking issues seen in early RLHF training, as well as the stability challenges that multi-agent frameworks faced during tool calls in 2024-2025. Agentic LLM training is now at a critical stage, transitioning from single-turn experiments to real multi-turn tool interactions.

Essentially, this is a technological replacement: strict Token-In Token-Out rules replace training loops that contain hidden errors with mathematically consistent processes. The mechanism is to eliminate the noise introduced by decode-encode round-trips so that gradient signals reflect what the model actually sampled, which improves the stability and convergence quality of multi-turn RL training.
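
One way to picture the Token-In Token-Out rule is a trajectory buffer that only ever appends token IDs and never decodes and re-encodes them. The Python sketch below is an illustration under stated assumptions, not the Hugging Face implementation: the Trajectory class, its method names, and all token IDs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical multi-turn buffer that stores token IDs exactly as seen."""
    token_ids: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)   # 1 = sampled by the policy
    logprobs: list = field(default_factory=list)    # 0.0 placeholder for context

    def add_context(self, ids):
        """Append prompt or tool-output tokens; they receive no gradient."""
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))
        self.logprobs.extend([0.0] * len(ids))

    def add_sampled(self, ids, logprobs):
        """Append the exact IDs the policy sampled, with their log-probs."""
        if len(ids) != len(logprobs):
            raise ValueError("need one log-probability per sampled token")
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))
        self.logprobs.extend(logprobs)


traj = Trajectory()
traj.add_context([10, 11])                 # prompt tokens (illustrative IDs)
traj.add_sampled([0, 1], [-0.7, -1.2])     # turn 1: tool call, kept as sampled
traj.add_context([20])                     # tool result, tokenized once
traj.add_sampled([3], [-0.3])              # turn 2: final answer

print(traj.token_ids)
print(traj.loss_mask)
```

Text is still decoded when a tool call has to be parsed, but that text is never tokenized again to rebuild the training sequence. The three lists stay the same length by construction, and the loss mask restricts the policy-gradient term to positions the model itself sampled.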

ABAB News · Cognitive Laws

The most dangerous bug is never a crash but one that quietly supplies incorrect gradients. Curves that look beautiful in single-turn training often show their true form during multi-turn tool calls. Truly top-notch engineering is not about creating new features but about eliminating invisible training contamination.

Source: ABAB News, 05/29/2026, 03:51 AM
