Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
sergiopaniego 
posted an update 3 days ago
Post
2241
most multi-turn RL loops have a silent bug: you decode the model's output to detect tool calls, then re-tokenize the conversation for the next turn. BPE isn't invertible, so decode then re-encode can land on different ids. gradient ends up on tokens the model never sampled. no crash, just quietly wrong math and broken training

@qgallouedec wrote a super educational blog on MITO (message-in, token-out) vs TITO (token-in, token-out) and how you might fix the problem above

go read it 🤓

https://qgallouedec-tito.hf.space/
In this post