Pretraining Data
opencsg/Fineweb-Edu-Chinese-V2.1
• Updated • 958M • 20.7k
• 75
• Updated • 58.7k
• 35
• Updated • 3.8B • 4.88k
• 127
allenai/dolma3_dolmino_pool
Updated • 20.6k
• 8
allenai/dolma3_longmino_pool
Updated • 22.8k
• 13
• Updated • 476M • 78.2k
• 880
• Updated • 4.48B • 90.6k
• 822
• Updated • 61.6M • 11.5k
• 304
• Updated • 819M • 2.87k
• 13
tokyotech-llm/swallow-code-v2
• Updated • 147M • 47k
• 38
ByteDance-Seed/Code-Contests-Plus
• Updated • 49.2k • 4.88k
• 66
• Updated • 7.09M • 5.15k
• 183
nvidia/Nemotron-Pretraining-Code-v2
• Updated • 836M • 53k
• 126
nvidia/Nemotron-Pretraining-Specialized-v1
• Updated • 60.7M • 5.44k
• 82
nvidia/Nemotron-CC-Math-v1
• Updated • 190M • 54k
• 87
nvidia/Nemotron-Pretraining-SFT-v1
• Updated • 299M • 1.18k
• 68
• Updated • 1.86M • 5.52k
• 245
EssentialAI/essential-web-v1.0
• Updated • 233k
• 226
EssentialAI/eai-taxonomy-stem-w-dclm
• Updated • 1.99k
• 6
EssentialAI/eai-taxonomy-med-w-dclm
• Updated • 81.2M • 526
• 8
EssentialAI/eai-taxonomy-code-w-dclm
• Updated • 274M • 3.65k
• 9
EssentialAI/eai-taxonomy-math-w-fm
• Updated • 21.6M • 1.69k
• 5
• Updated • 27.9B • 83
• 5
DataMuncher-Labs/UltiMath
• Updated • 32.9B • 7.66k
• 45
HuggingFaceFW/finetranslations
• Updated • 3.33B • 18.4k
• 294
• Updated • 69.9k • 52.6k
• 402
JetBrains-Research/commit-chronicle
• Updated • 10.9M • 5.61k
• 12
ASSERT-KTH/repairllama-datasets
• Updated • 460k • 192
• 3
Updated • 6.34k
• 78
• Updated • 778k • 18
• Updated • 3.5M • 194
nick007x/github-code-2025
• Updated • 148M • 1.24k
• 119
tokyotech-llm/swallow-code
• Updated • 129M • 959
• 66
loubnabnl/github-code-duplicate
• Updated • 115M • 707
• 1
macrocosm-os/code-parrot-github-code
• Updated • 115M • 214
• 13
nyuuzyou/google-code-archive
• Updated • 65.8M • 1.22k
• 73
ad6398/Deepmind-CodeContest-Unrolled
• Updated • 13.2M • 702
• 2
datablations/python-megatron
Updated • 4.62k
• 1
nomic-ai/cornstack-python-v1
• Updated • 23.6M • 870
• 27
• Updated • 22.6M • 1.31k
• Updated • 552k • 263
utter-project/github-code-2025-above-2-stars
• Updated • 103M • 76
• 3
tokyotech-llm/swallow-math-v2
• Updated • 17.4M • 15.1k
• 31
• Updated • 181M • 35.4k
• 318
• Updated • 513k • 39
• 8
Updated • 185k
• Updated • 212k
• 97
CodedotAI/code_clippy_github
• Updated • 2.4M • 3.4k
• 20
Lichess/standard-chess-games
• Updated • 7.14B • 8.87k
• 71
jablonkagroup/chempile-mlift
• Updated • 51.5M • 7.34k
• 14
• Updated • 49.8M • 389
jablonkagroup/chempile-code
• Updated • 2.27M • 369
• 5
jablonkagroup/chempile-paper
• Updated • 11.7M • 1.63k
• 7
jablonkagroup/chempile-education
• Updated • 66.9k • 538
• 7
• Updated • 164k • 58
• 2
institutional/institutional-books-1.0
• Updated • 983k • 17.5k
• 278
• Updated • 34.5k • 16
common-pile/youtube_filtered
• Updated • 986k • 101
• 5
common-pile/ubuntu_irc_filtered
• Updated • 216k • 126
• 1
common-pile/pubmed_filtered
• Updated • 4.77M • 624
• 3
common-pile/github_archive_filtered
• Updated • 23.3M • 160
• 1
common-pile/biodiversity_heritage_library_filtered
• Updated • 16.5M • 129
• 1
common-pile/uspto_filtered
• Updated • 14.4M • 1.16k
• 3
common-pile/usgpo_filtered
• Updated • 2.34M • 289
• 1
soda-research/emilia-mm-pretrain-fix
• Updated • 12.6M • 3.57k
fineinstructions/fineinstructions_nemotron
• Updated • 1.23B • 2.64k
• 26
• Updated • 1.38M • 669
• 1
• Updated • 23.5k • 5.71k
• 55
• Updated • 12.3M • 20.7k
• 94