Datasets for training AI models
Erik
Tralalabs
AI & ML interests
pretraining from scratch
Recent Activity
updated a dataset about 4 hours ago
SupraLarps/rizzaurapretraining published a dataset about 4 hours ago
SupraLarps/rizzaurapretraining new activity about 4 hours ago
SupraLarps/dumblabs-1234b: Yeah
Organizations
Claude (and Others) Generated Datasets
- nothingiisreal/Claude-3-Opus-Instruct-15K
  Viewer • Updated • 29.5k • 142 • 19
- mlfoundations-dev/oh-dcft-v3.1-claude-3-5-sonnet-20241022
  Viewer • Updated • 1.05M • 7
- mlfoundations-dev/oh-dcft-v3.1-claude-3-5-sonnet-20241022_test
  Viewer • Updated • 1k • 6
- mfielding92/claude-3.7-sonnet-reasoning
  Viewer • Updated • 179 • 68 • 35
Chinese Datasets
- Skywork/SkyPile-150B
  Viewer • Updated • 1.76M • 9.86k • 407
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
  Paper • 2308.10755 • Published • 1
- fjcanyue/wikipedia-zh-cn
  Viewer • Updated • 5.55M • 1.03k • 27
- 0xDing/wikipedia-cn-20230720-filtered
  Viewer • Updated • 255k • 2.28k • 170
My Stuff
Korean Machine Learning Datasets
Datasets for AI and neural networks for Korean generation.
TralalabsLM
Tralalabs's model family called "TralalabsLM". A collection.
BLOOM GGUF
GGUF files of BLOOM and BLOOMz up to 7B
Qwen 3
All the Qwen 3 family (including Qwen-Image models released in 2025-2026)
PII
PII stuff on Hugging Face
- kalyan-ks/ettin-68m-nemotron-pii
  Token Classification • 68.5M • Updated • 599 • 7
- fastino/gliner2-privacy-filter-PII-multi
  Token Classification • 0.3B • Updated • 38.7k • 48
- OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1
  Token Classification • 0.4B • Updated • 63.5k • 18
- bardsai/eu-pii-anonimization-multilang
  Token Classification • 0.3B • Updated • 3.2k • 15
Dutch AI DATASETS Collection
CHEETAH
Tralalabs's model family "CHEETAH" comes with different sizes finetuned on different datasets.
- Tralalabs/CHEETAH-350M-LoRA
  Text Generation • Updated • 55
- Tralalabs/CHEETAH-350M-Merged-FP16
  Text Generation • 0.4B • Updated • 146
- Tralalabs/CHEETAH-350M-Merged-FP16-Q6_K-GGUF
  Text Generation • 0.4B • Updated • 156
- mradermacher/CHEETAH-350M-Merged-FP16-i1-GGUF
  Text Generation • 0.4B • Updated • 776
My datasets
Spanish Pretraining and Finetuning Datasets
(Pretraining and fine-tuning datasets in Spanish)
Dataset Library
A dataset library.
PicoLM
The PicoLM family made by Tralalabs
AI Datasets
Datasets for training AI models.