FALCON dual-encoder — SNORT / `all-MiniLM-L6-v2`

A contrastive dual encoder fine-tuned to map cyber threat intelligence (CTI) text and SNORT rules into a shared embedding space. Backbone: sentence-transformers/all-MiniLM-L6-v2.

Test-set metrics

model	recall@1	F1	threshold	diag mean	off-diag mean
pretrained	0.7993	0.3346	0.7216	0.9425	0.8358
run_0	0.9539	0.8945	0.6762	0.8634	0.0394
run_1	0.9551	0.9132	0.6886	0.8966	0.0373
run_2	0.9576	0.9276	0.6931	0.9105	0.0176
run_3	0.9551	0.9202	0.6902	0.9246	0.0408
run_4	0.9539	0.9211	0.7031	0.9504	0.0050
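
The columns above can be read as statistics of a CTI-by-rule cosine similarity matrix in which row i and column i form a matched pair. The card does not spell out the exact evaluation procedure, so the sketch below is one plausible reading and not the authors' script: recall@1 checks whether each row's top-scoring rule is its own pair, F1 treats every entry at or above the threshold as a predicted match, and the two means average the matched and unmatched entries. The function name `retrieval_metrics` and its return keys are hypothetical.

```python
import numpy as np

def retrieval_metrics(sim, threshold):
    """Summarize a square similarity matrix whose diagonal holds the true pairs.

    Assumed definitions (not confirmed by the model card):
    recall@1 uses the row-wise argmax, F1 scores `sim >= threshold` as a match.
    """
    n = sim.shape[0]
    eye = np.eye(n, dtype=bool)          # True on matched (diagonal) pairs
    pred = sim >= threshold              # predicted matches at this threshold

    tp = int((pred & eye).sum())         # matched pairs above threshold
    fp = int((pred & ~eye).sum())        # unmatched pairs above threshold
    fn = int((~pred & eye).sum())        # matched pairs below threshold
    denom = 2 * tp + fp + fn

    return {
        "recall@1": float((sim.argmax(axis=1) == np.arange(n)).mean()),
        "f1": 2 * tp / denom if denom else 0.0,
        "diag_mean": float(sim[eye].mean()),
        "off_diag_mean": float(sim[~eye].mean()),
    }
```

Under this reading, a well-separated encoder shows a high diagonal mean and an off-diagonal mean near zero, which is the pattern the fine-tuned runs display relative to the pretrained backbone.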

Training

Symmetric InfoNCE / NT-Xent over in-batch negatives. The best checkpoint was selected by validation loss. In the run list, T is the softmax temperature.
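
For readers unfamiliar with the objective, a symmetric InfoNCE loss over in-batch negatives can be sketched as follows. This is a minimal NumPy illustration, not the training code: it assumes cosine similarity on L2-normalized embeddings and an equal-weight average of the CTI-to-rule and rule-to-CTI directions, neither of which the card states explicitly. The function names are hypothetical; the default temperature of 0.05 matches runs 0 to 3.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(cti_emb, rule_emb, temperature=0.05):
    """Symmetric InfoNCE where row i of each input is a matched pair.

    Every other row in the batch acts as a negative for row i.
    """
    # Assumption: similarities are cosine, so normalize both sides first.
    cti = cti_emb / np.linalg.norm(cti_emb, axis=1, keepdims=True)
    rule = rule_emb / np.linalg.norm(rule_emb, axis=1, keepdims=True)

    logits = cti @ rule.T / temperature   # shape: (batch, batch)
    idx = np.arange(logits.shape[0])      # positives sit on the diagonal

    # Cross-entropy in both retrieval directions.
    loss_cti_to_rule = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_rule_to_cti = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_cti_to_rule + loss_rule_to_cti)
```

Because negatives come from the same batch, a larger batch gives each positive pair more negatives to contrast against, which is one reason batch size is listed among the run settings.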

Run 0 — batch=16, epochs=5, lr=2e-05, schedule=constant, T=0.05
Run 1 — batch=50, epochs=10, lr=2e-05, schedule=constant, T=0.05
Run 2 — batch=70, epochs=30, lr=2e-05, schedule=constant, T=0.05
Run 3 — batch=128, epochs=30, lr=5e-05, schedule=warmup_cosine, T=0.05
Run 4 — batch=70, epochs=50, lr=2e-05, schedule=constant, T=0.07

Loading

from transformers import AutoModel, AutoTokenizer

REPO = "shaswatamitra/falcon-snort-dual-all-MiniLM-L6-v2"
# Loads the SNORT-rule encoder; pass subfolder="cti" for the CTI-text encoder.
tok = AutoTokenizer.from_pretrained(REPO, subfolder="rule")
model = AutoModel.from_pretrained(REPO, subfolder="rule")

Dual-encoder layout: this repo has two subfolders, rule/ (encodes SNORT rules) and cti/ (encodes CTI text). Load each encoder by passing the matching argument, subfolder="rule" or subfolder="cti".
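
To compare a CTI passage against candidate rules, both sides have to be embedded and scored. The sketch below shows one way to do that. It assumes masked mean pooling followed by L2 normalization, which is the pooling used by the all-MiniLM-L6-v2 backbone; the card does not confirm that the fine-tuned encoders use the same pooling, so treat that as an assumption. The helper names `load_encoder`, `mean_pool`, and `embed` are hypothetical.

```python
import numpy as np

REPO = "shaswatamitra/falcon-snort-dual-all-MiniLM-L6-v2"

def load_encoder(subfolder):
    """Load the tokenizer and model for one side: "rule" or "cti"."""
    from transformers import AutoModel, AutoTokenizer  # imported on first use
    tok = AutoTokenizer.from_pretrained(REPO, subfolder=subfolder)
    model = AutoModel.from_pretrained(REPO, subfolder=subfolder)
    return tok, model.eval()

def mean_pool(token_embeddings, attention_mask):
    """Masked mean over tokens, then L2 normalization (assumed pooling)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)    # ignore padding tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

def embed(texts, subfolder):
    """Encode a list of strings with the chosen encoder; returns unit vectors."""
    import torch
    tok, model = load_encoder(subfolder)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return mean_pool(hidden.numpy(), batch["attention_mask"].numpy())
```

With unit-length outputs, `embed(cti_texts, "cti") @ embed(rules, "rule").T` gives the cosine similarity matrix between every CTI passage and every rule. Calling `embed` downloads the weights on first use, so it needs network access and the `transformers` and `torch` packages.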

Citation

@article{mitra2025falcon,
  title={FALCON: Autonomous Cyber Threat Intelligence Mining with LLMs for IDS Rule Generation},
  author={Mitra, Shaswata and Bazarov, Azim and Duclos, Martin and Mittal, Sudip and Piplai, Aritran and Rahman, Md Rayhanur and Zieglar, Edward and Rahimi, Shahram},
  journal={arXiv preprint arXiv:2508.18684},
  year={2025}
}