distilgpt2-grok-coder-reasoning
Model Details
Model Description
This model is a full fine-tuned version of DistilGPT2, exposed to an aggressive, completely uncapped curriculum of Grok-4 level distillation traces, hyper-creative and logic datasets, comprehensive coding logic, and mature internet discourse. It is designed to act as a highly responsive, analytical engine capable of deep structural reasoning and complex logic emulation.
Trained natively at an accelerated maximum learning rate with a cosine decay schedule, the model synthesizes diverse programmatic and theoretical domains from a massive multi-repository corpus, processed at the model's absolute maximum context window of 1024 tokens.
- Developed by: GODsStrongestSoldier
- Model type: Causal Language Model (Transformer Decoder)
- Language: English
- License: Apache 2.0
- Finetuned from model:
distilgpt2
Datasets Used for Fine-Tuning
This model was trained comprehensively on the full, uncapped contents of the following datasets:
- WithinUsAI/Grok4.4_heavy_max_distill_god_seed_25k
- WithinUsAI/GOD_Coder_Complete_DataSet
- acheong08/nsfw_reddit
- TeichAI/grok-code-fast-1-1000x
- TeichAI/brainstorm-v3.1-grok-4-fast-200x
- Crownelius/Hyper-Creative-Grok-V1
- Crownelius/Hyper-UltraData-Grok-V1
Training Details
Training Procedure
The model underwent full fine-tuning without the use of adapters or LoRA layers. All native parameters of the base model were globally updated. The training harness dynamically parsed heavily nested dataset repositories, enforcing a strict shape constraint to generate mathematically perfect 1024-token continuous sequences for the GPU, maxing out the DistilGPT2 context window.
To maximize adaptation to the Grok-level reasoning data, an absolute peak learning rate (3e-4) was utilized alongside a 5% warmup phase and a cosine scheduler.
Hardware
- Environment: Kaggle
- Accelerators: Dual NVIDIA T4 GPUs (15GB VRAM each)
Hyperparameters
- Epochs: 1
- Context Window / Block Size: 1024
- Per-Device Batch Size: 4
- Gradient Accumulation Steps: 16
- Effective Global Batch Size: 128
- Peak Learning Rate: 3e-04
- Learning Rate Scheduler: Cosine
- Warmup Ratio: 0.05
- Optimizer: Fused AdamW (
adamw_torch_fused) - Mixed Precision: fp16
- Gradient Checkpointing: Enabled
- Downloads last month
- 25