Jinx: Unlimited LLMs for Probing Alignment Failures
Abstract
Jinx is a helpful-only variant of open-weight LLMs designed for researchers to assess alignment failures and study safety in language models.
Unlimited, or so-called helpful-only, language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to those of an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model's capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.
Community
Hi, may I ask what were the public datasets (if any) used for said post-training? If they were synthetically generated, could you share the general steps taken to generate such samples? Thank you!
Thanks for your interest. Yes, we used some public datasets on Hugging Face. You can search for them yourself. We have decided not to release our detailed recipe to the public.
Get this paper in your agent:
hf papers read 2508.08243
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 12
Jinx-org/Jinx-gpt-oss-20b-GGUF