Solon Embeddings 0.1 is an MIT-licensed, open-source French embedding model trained by Ordalie. Solon Large is the best open-source French embedding model known to date, achieving the highest scores on public benchmarks.
📍 Available on HuggingFace in two versions:
- Base: 268M parameters, 768 dimensions
- Large: 560M parameters, 1024 dimensions
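Below is a minimal usage sketch with sentence-transformers. The model IDs and the "query: " prefix are assumptions based on the HuggingFace model card and the E5 convention the model inherits; adjust them if the published IDs differ.

```python
# Minimal usage sketch (assumed model ID and E5-style "query: " prefix).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("OrdalieTech/Solon-embeddings-large-0.1")

# E5-style models typically expect a "query: " prefix on search queries.
queries = ["query: Quelle est la capitale de la France ?"]
passages = [
    "Paris est la capitale et la plus grande ville de France.",
    "Le Mont Blanc est le plus haut sommet des Alpes.",
]

q_emb = model.encode(queries, normalize_embeddings=True)   # shape (1, 1024) for Large
p_emb = model.encode(passages, normalize_embeddings=True)

scores = util.cos_sim(q_emb, p_emb)  # cosine similarity of query vs. each passage
print(scores)
```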
TL;DR
- Overall performance measured on 9 benchmarks: 6 from the MTEB collection, 1 from MIRACL, and 2 developed by Ordalie.
- Custom benchmarks:
  - Ordalie-FR-STS-benchmark (10k): evaluation of the similarity between two sentences (see the scoring sketch after this list).
  - Ordalie-FR-Reranking-benchmark (2k): matching short queries to long passages.
- Supervised training on French and English data, with advanced "hard negatives" strategies.
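As an illustration of how an STS-style benchmark such as Ordalie-FR-STS-benchmark is typically scored, the sketch below computes cosine similarity for each sentence pair and compares it to gold ratings with Spearman correlation. The sentence pairs, gold scores, and model ID are placeholders, not benchmark data.

```python
# Illustrative STS-style scoring: cosine similarity per sentence pair,
# then Spearman correlation against gold similarity labels.
# Pairs and gold scores below are placeholders, not benchmark data.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("OrdalieTech/Solon-embeddings-large-0.1")

pairs = [
    ("Le chat dort sur le canapé.", "Un chat fait la sieste sur le sofa."),
    ("Il pleut à Paris.", "La bourse a clôturé en hausse."),
]
gold = [4.5, 0.5]  # human similarity ratings (e.g. on a 0-5 scale)

emb1 = model.encode([a for a, _ in pairs], normalize_embeddings=True)
emb2 = model.encode([b for _, b in pairs], normalize_embeddings=True)
cosine = np.sum(emb1 * emb2, axis=1)  # row-wise cosine (embeddings are unit norm)

print(spearmanr(cosine, gold).correlation)
```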
Performance & Benchmarks
Methodology
Performance was evaluated on the 9 benchmarks listed above (6 from MTEB, 1 from MIRACL, 2 from Ordalie).
💡 The benchmarks are available on GitHub (a fork of MTEB) to reproduce the results.
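For reference, an MTEB-style evaluation is usually run as in the sketch below. The task name is a placeholder; the exact French tasks are those listed in the Ordalie fork, and the model ID is assumed from the HuggingFace model card.

```python
# Sketch of an MTEB-style evaluation run; "STS22" is a placeholder task name.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("OrdalieTech/Solon-embeddings-large-0.1")
evaluation = MTEB(tasks=["STS22"])
evaluation.run(model, output_folder="results/solon-large")
```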
Results
On average, Solon Large outperforms other open-source models on French-language similarity and retrieval tasks.
Technical details
Architecture
- Multilingual: 90% French, 10% English.
- Based on Microsoft's multilingual E5 (an XLM-RoBERTa backbone) with massive contrastive pre-training.
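For use without sentence-transformers, E5-family encoders are commonly pooled by averaging the last hidden states over non-padding tokens. The sketch below assumes that pooling strategy (inherited from the E5 lineage) and an assumed model ID; check the model card before relying on it.

```python
# Mean pooling over token embeddings, as commonly done for E5-family encoders.
# Pooling strategy assumed from the E5 lineage; model ID assumed from the model card.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OrdalieTech/Solon-embeddings-large-0.1")
model = AutoModel.from_pretrained("OrdalieTech/Solon-embeddings-large-0.1")

sentences = ["query: Où se trouve le musée du Louvre ?"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state              # (batch, seq_len, 1024)

mask = batch["attention_mask"].unsqueeze(-1).float()        # ignore padding tokens
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean pooling
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
print(embeddings.shape)  # torch.Size([1, 1024])
```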
Training
- Phase 1: Multilingual recalibration
  - Data: 36M French sentences, 4M English sentences.
  - Hardware: Nvidia A100 80GB.
  - Duration: ~24-26 hours.
- Phase 2: Targeted optimization
  - Data: 60k semi-synthetic pairs with "hard negatives" (see the loss sketch after this list).
  - Hardware: Nvidia A100 80GB.
  - Duration: 1-2 hours.
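To make the "hard negatives" strategy concrete, here is a generic InfoNCE-style contrastive loss in which each query is pulled toward its positive passage and pushed away from in-batch negatives plus an explicitly mined hard negative. This is an illustrative sketch of the general technique, not Ordalie's actual training code.

```python
# Generic InfoNCE-style contrastive loss with hard negatives (illustrative,
# not the exact Ordalie training recipe).
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, hard_neg, temperature=0.05):
    """q, pos, hard_neg: (batch, dim) embeddings from the encoder."""
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    # Candidates for each query: all positives in the batch + all hard negatives.
    candidates = torch.cat([pos, hard_neg], dim=0)   # (2 * batch, dim)
    logits = q @ candidates.T / temperature          # (batch, 2 * batch)

    # The correct candidate for query i is its own positive at index i.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 1024
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```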
Conclusions and outlook
- Objective achieved: best French textual similarity score with limited resources.
- Creation of new French benchmarks for evaluation.
🚀 Future improvements:
- Increase context size (currently limited to 512 tokens).
- Strengthen multilingual capabilities.
Join us
📧 Interested in the project or in joining Ordalie? Contact us at [email protected].
