VietLegal-E5

A Vietnamese legal-domain embedding model fine-tuned from intfloat/multilingual-e5-large (560M parameters, 1024-dimensional embeddings).

VietLegal-E5 achieves NDCG@10 = 0.7229 on the Zalo AI Legal Text Retrieval benchmark (MTEB ZacLegalTextRetrieval), outperforming every baseline evaluated here, including microsoft/harrier-oss-v1-0.6b (0.7210).

Benchmark Results

Evaluated on MTEB ZacLegalTextRetrieval (61.4K corpus documents, 818 test queries).

Model	Params	Dim	NDCG@10
mainguyen9/vietlegal-e5	560M	1024	0.7229
mainguyen9/vietlegal-e5	560M	512	0.7208
mainguyen9/vietlegal-e5	560M	256	0.7058
mainguyen9/vietlegal-e5	560M	128	0.7073
microsoft/harrier-oss-v1-0.6b	600M	1024	0.7210
intfloat/multilingual-e5-large	560M	1024	0.6660
bkai-foundation-models/vietnamese-bi-encoder	135M	768	0.6160
intfloat/multilingual-e5-base	278M	768	0.6030
contextboxai/halong_embedding	278M	768	0.6009
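
For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@10 could be computed for a single query with binary relevance labels. The function name `ndcg_at_k` and the toy document ids are illustrative assumptions; the scores in the table come from the MTEB evaluation, not from this snippet.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k for one query with binary relevance labels."""
    relevant = set(relevant_ids)
    # Discounted cumulative gain of the top-k ranking.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Ideal DCG: every relevant document placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical ids): the only relevant article is ranked third.
score = ndcg_at_k(["d7", "d2", "d5", "d9"], ["d5"])
```

A benchmark score is the mean of this per-query value over all test queries.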

Key highlights:

+5.69 NDCG@10 points (on a 0-100 scale) over the mE5-large baseline (0.7229 vs 0.6660)
Outperforms microsoft/harrier-oss-v1-0.6b (600M params, 0.7210) with fewer parameters (560M)
Matryoshka support: 128-dim embeddings (0.7073) still beat mE5-large at 1024-dim (0.6660), enabling 8x smaller vectors
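
The arithmetic behind these highlights can be checked with a short sketch. The scores are copied from the benchmark table; the index-size estimate assumes float32 storage and roughly 61,400 corpus documents, and `index_size_mb` is a hypothetical helper used only for illustration.

```python
# Scores copied from the benchmark table.
vietlegal_1024 = 0.7229
vietlegal_128 = 0.7073
me5_large_1024 = 0.6660

# Gains expressed in NDCG@10 points on a 0-100 scale.
gain_points = round((vietlegal_1024 - me5_large_1024) * 100, 2)
gain_128_points = round((vietlegal_128 - me5_large_1024) * 100, 2)

# Compression factor when truncating 1024-dim vectors to 128-dim.
compression = 1024 // 128

def index_size_mb(num_docs, dim, bytes_per_value=4):
    """Raw size of a dense index in MB (assumption: float32, no overhead)."""
    return num_docs * dim * bytes_per_value / 1e6

full_mb = index_size_mb(61_400, 1024)   # about 251 MB
small_mb = index_size_mb(61_400, 128)   # about 31 MB
```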

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mainguyen9/vietlegal-e5")

# Important: E5 models require "query: " and "passage: " prefixes
queries = ["query: Thủ tục đăng ký kinh doanh gồm những bước nào?"]
passages = ["passage: Điều 27. Trình tự, thủ tục đăng ký doanh nghiệp..."]

q_emb = model.encode(queries)
p_emb = model.encode(passages)

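# Matryoshka: truncate to a smaller dimension (supported: 1024, 512, 256, 128)
model.truncate_dim = 256
# normalize_embeddings=True rescales the truncated vectors back to unit length
q_emb_256 = model.encode(queries, normalize_embeddings=True)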
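
To make the retrieval step concrete, here is a minimal sketch of ranking passages by cosine similarity and of truncating embeddings by hand. It uses random NumPy vectors as stand-ins for model outputs so that it runs without downloading the model; the helper names `truncate_and_normalize` and `rank_passages` are illustrative and not part of sentence-transformers.

```python
import numpy as np

def truncate_and_normalize(emb, dim):
    """Keep the first `dim` values and rescale each row to unit length."""
    cut = emb[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

def rank_passages(q_emb, p_emb):
    """Return passage indices sorted by cosine similarity, best first."""
    scores = q_emb @ p_emb.T  # dot product equals cosine for unit vectors
    return np.argsort(-scores, axis=1)

# Stand-in embeddings (assumption: real ones would come from model.encode).
rng = np.random.default_rng(0)
p_emb = truncate_and_normalize(rng.normal(size=(5, 1024)), 1024)
q_emb = p_emb[[3]] + 0.01 * rng.normal(size=(1, 1024))  # query close to passage 3

# Truncate both sides to the same dimension before comparing.
q_256 = truncate_and_normalize(q_emb, 256)
p_256 = truncate_and_normalize(p_emb, 256)
order = rank_passages(q_256, p_256)
```

Queries and passages must be truncated to the same dimension; mixing a 256-dim query with 1024-dim passages is not meaningful.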

Model Details

Model Type: Sentence Transformer
Base model: intfloat/multilingual-e5-large
Maximum Sequence Length: 512 tokens
Output Dimensionality: 1024 dimensions (Matryoshka: 1024, 512, 256, 128)
Similarity Function: Cosine Similarity
Language: Vietnamese
License: Apache 2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_mean_tokens': True})
  (2): Normalize()
)
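
The Pooling and Normalize modules can be illustrated with a small NumPy sketch: average the token vectors that are not padding, then scale the result to unit length. The function name `mean_pool_and_normalize` and the toy input are assumptions for illustration; the actual modules operate on PyTorch tensors inside sentence-transformers.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average non-padding token vectors, then L2-normalize the result.

    token_embeddings: (batch, seq_len, hidden) array
    attention_mask:   (batch, seq_len) array of 1 (token) or 0 (padding)
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy input: 1 sentence, 3 token slots (the last is padding), hidden size 2.
tokens = np.array([[[3.0, 0.0], [1.0, 0.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
sentence_emb = mean_pool_and_normalize(tokens, mask)
```

Because the output is unit-length, a plain dot product between two sentence embeddings equals their cosine similarity.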

Training

Training Data

Legal documents: 518K Vietnamese legal documents from th1nhng0/vietnamese-legal-documents
Query-passage pairs: 507K pairs from phamson02/large-vi-legal-queries

Training Pipeline

Stage 1: Data Preparation
│  518K docs → ~500K chunks (article-aware segmentation)
│
Stage 2: Contrastive Fine-tuning
│  MatryoshkaLoss(MultipleNegativesRankingLoss)
│
Stage 3: Hard Negative Mining
│  FAISS retrieval → mine rank 50-100 as hard negatives
│
Stage 4: Multi-task Blending
│  70% retrieval + 20% classification + 10% STS
│  → Final model (NDCG@10 = 0.7229)
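
Stage 3 can be sketched as follows. This is an illustration under stated assumptions, not the training code: it swaps FAISS for brute-force NumPy search so it stays self-contained, treats ranks 50 to 100 as inclusive, excludes the gold passage, and samples two negatives per query. The function name `mine_hard_negatives` and its parameters are hypothetical.

```python
import numpy as np

def mine_hard_negatives(q_emb, p_emb, positive_idx, lo=50, hi=100,
                        per_query=2, seed=0):
    """For each query, sample negatives from retrieval ranks lo..hi (1-based).

    q_emb, p_emb: unit-normalized arrays, shapes (n_queries, d), (n_passages, d)
    positive_idx: index of the gold passage for each query
    """
    rng = np.random.default_rng(seed)
    scores = q_emb @ p_emb.T
    ranking = np.argsort(-scores, axis=1)      # best match first
    negatives = []
    for qi, pos in enumerate(positive_idx):
        window = ranking[qi, lo - 1:hi]        # ranks lo..hi inclusive
        window = window[window != pos]         # never sample the gold passage
        negatives.append(rng.choice(window, size=per_query, replace=False))
    return np.array(negatives)

def unit(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Random stand-in embeddings (assumption: real ones come from the stage 2 model).
rng = np.random.default_rng(1)
p_emb = unit(rng.normal(size=(500, 32)))
q_emb = unit(p_emb[:4] + 0.1 * rng.normal(size=(4, 32)))
negs = mine_hard_negatives(q_emb, p_emb, positive_idx=[0, 1, 2, 3])
```

Sampling from a mid-rank window rather than the very top is a common way to reduce the chance of picking unlabeled positives as negatives.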

Training Hyperparameters

Learning rate: 5e-6
Batch size: 48 per device × 4 GPUs × 2 gradient accumulation = 384 effective
Epochs: 1 (multitask stage)
Warmup: 10%
Scheduler: Cosine
Precision: bf16
Hardware: 4x NVIDIA H100 80GB
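
The effective batch size and the shape of the learning-rate schedule can be sketched as below. This assumes linear warmup followed by cosine decay to zero, which is a common reading of "Warmup: 10%" and "Scheduler: Cosine"; the function `lr_at_step` and the step count are hypothetical, since the number of training steps is not stated here.

```python
import math

# Effective batch size from the listed hyperparameters.
per_device, num_gpus, grad_accum = 48, 4, 2
effective_batch = per_device * num_gpus * grad_accum  # 384

def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_ratio=0.10):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length: one pass over 507K pairs at the effective batch size.
total_steps = math.ceil(507_000 / effective_batch)
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
```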

Citation

@misc{vietlegal-e5,
  title={VietLegal-E5: Vietnamese Legal Domain Embedding Model},
  author={Nguyen, Mai},
  year={2026},
  url={https://huggingface.co/mainguyen9/vietlegal-e5}
}

Acknowledgements

Base model: intfloat/multilingual-e5-large
Training framework: sentence-transformers
Benchmark: Zalo AI Legal Text Retrieval
