Running the model with dense attention
Since the model is not yet supported in llama.cpp, I did an experiment and ran DeepSeek V3.2 with dense attention by disabling the lightning indexer. The model seems to perform just fine, at least based on my limited testing. Do you think the model's performance (by this I mean intelligence, not speed) will be affected by doing this?
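For readers unfamiliar with the distinction, here is a toy sketch of what "dense" versus top-k "sparse" attention means for a single query. It is not DeepSeek's implementation or the llama.cpp change: `attend`, `index_scores`, and `top_k` are hypothetical names, and the index scores are hand-picked numbers standing in for whatever a learned indexer would produce.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, index_scores=None, top_k=None):
    """Single-query scaled dot-product attention.

    With index_scores and top_k, only the top_k positions ranked by
    index_scores are attended to (toy stand-in for an indexer).
    Without them, every position is used (dense attention).
    """
    positions = list(range(len(keys)))
    if index_scores is not None and top_k is not None and top_k < len(keys):
        ranked = sorted(positions, key=lambda i: index_scores[i], reverse=True)
        positions = sorted(ranked[:top_k])
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in positions]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, positions)) for d in range(dim)]


# Hand-picked toy data (assumption: purely illustrative, no relation to the real model).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
index_scores = [0.9, 0.1, 0.8, 0.2]

dense = attend(query, keys, values)                          # all 4 positions
sparse_all = attend(query, keys, values, index_scores, 4)    # budget covers everything
sparse_2 = attend(query, keys, values, index_scores, 2)      # only positions 0 and 2
```

In this toy, when `top_k` covers every position the two paths give identical outputs; they only diverge once the selection actually drops tokens.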
Yes
@Fujikre
I tried quite hard to find a difference in model intelligence. I tested a Q4_K_M-quantized DeepSeek V3.2 GGUF with the lightning indexer disabled, in a regular llama.cpp build, by running my lineage-bench logical reasoning benchmark. The results (numbers are mean accuracy per difficulty level; there are 4 difficulty levels with 40 quizzes each):
| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|---|---|---|---|---|---|---|
| 1 | deepseek/deepseek-v3.2 | 0.988 | 1.000 | 1.000 | 1.000 | 0.950 |
So DeepSeek V3.2 with dense attention solved almost all of the quizzes correctly (there were 160 quizzes overall); it got only 2 of the most difficult quizzes wrong. When I tested the original model via the API, the result was:
| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|---|---|---|---|---|---|---|
| 1 | deepseek/deepseek-v3.2 | 0.956 | 1.000 | 1.000 | 0.975 | 0.850 |
So it looks like the model with dense attention performed even a bit better than the sparse attention one.
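As a rough sanity check on how much weight that gap can carry, the per-level accuracies can be turned back into quiz counts and compared with a two-sided Fisher exact test. This is only a sketch: the accuracies are copied from the two tables above, 40 quizzes per level is taken from the description, and `fisher_exact_two_sided` is a small hand-written helper, not part of lineage-bench.

```python
from math import comb

QUIZZES_PER_LEVEL = 40  # from the benchmark description above

# Per-level mean accuracy for lineage-8, -64, -128, -192 (copied from the tables).
dense_acc = [1.000, 1.000, 1.000, 0.950]
sparse_acc = [1.000, 1.000, 0.975, 0.850]


def correct_count(accuracies):
    """Convert per-level accuracies back into a total number of solved quizzes."""
    return sum(round(a * QUIZZES_PER_LEVEL) for a in accuracies)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are at most
    as likely as the observed one, with row and column totals held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


total_quizzes = QUIZZES_PER_LEVEL * len(dense_acc)
dense_correct = correct_count(dense_acc)
sparse_correct = correct_count(sparse_acc)

p_value = fisher_exact_two_sided(
    dense_correct, total_quizzes - dense_correct,
    sparse_correct, total_quizzes - sparse_correct,
)
print(dense_correct, sparse_correct, total_quizzes, p_value)
```

On these counts the p-value comes out above the conventional 0.05 threshold, so a single run of 160 quizzes per model does not clearly separate the two variants in either direction.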
Or perhaps you meant that the model intelligence will be affected, but positively?
I tested the experimental Q4_K_M quantized DeepSeek V3.2 GGUF (dense attention) on translating documents. The translations are similar to the ones I get from the MLX quants that use sparse attention, and similar to the API results.
I revisited this after some time and found a clear difference in reasoning performance on more complex tasks. In short, using dense attention makes the model a bit dumber. Details here: https://www.reddit.com/r/LocalLLaMA/comments/1rq8otd/running_deepseek_v32_with_dense_attention_like_in/