Solon Embeddings 0.1 is an MIT-licensed, open-source French embedding model trained by Ordalie. Solon Large is the best open-source French embedding model known to date, achieving the highest scores on public benchmarks.
📍 Available on HuggingFace in two versions:
- Base: 268M parameters, 768 dimensions
- Large: 560M parameters, 1024 dimensions
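Below is a minimal usage sketch with sentence-transformers. The model IDs and the "query: " prefix are assumptions based on the HuggingFace model card and the E5 convention the model inherits; adjust them if the published IDs differ.

```python
# Minimal usage sketch (assumed model ID and E5-style "query: " prefix).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("OrdalieTech/Solon-embeddings-large-0.1")

# E5-style models typically expect a "query: " prefix on search queries.
queries = ["query: Quelle est la capitale de la France ?"]
passages = [
    "Paris est la capitale et la plus grande ville de France.",
    "Le Mont Blanc est le plus haut sommet des Alpes.",
]

q_emb = model.encode(queries, normalize_embeddings=True)   # shape (1, 1024) for Large
p_emb = model.encode(passages, normalize_embeddings=True)

scores = util.cos_sim(q_emb, p_emb)  # cosine similarity of query vs. each passage
print(scores)
```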
TL;DR
- Overall performance measured on 9 benchmarks: 6 from the MTEB collection, 1 from MIRACL, and 2 developed by Ordalie.
- Custom benchmarks:
  - Ordalie-FR-STS-benchmark (10k): evaluation of the similarity between two sentences (see the scoring sketch after this list).
  - Ordalie-FR-Reranking-benchmark (2k): matching short queries to long passages.
- Supervised training on French and English data, with advanced "hard negatives" strategies.
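As an illustration of how an STS-style benchmark such as Ordalie-FR-STS-benchmark is typically scored, the sketch below computes cosine similarity for each sentence pair and compares it to gold ratings with Spearman correlation. The sentence pairs, gold scores, and model ID are placeholders, not benchmark data.

```python
# Illustrative STS-style scoring: cosine similarity per sentence pair,
# then Spearman correlation against gold similarity labels.
# Pairs and gold scores below are placeholders, not benchmark data.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("OrdalieTech/Solon-embeddings-large-0.1")

pairs = [
    ("Le chat dort sur le canapé.", "Un chat fait la sieste sur le sofa."),
    ("Il pleut à Paris.", "La bourse a clôturé en hausse."),
]
gold = [4.5, 0.5]  # human similarity ratings (e.g. on a 0-5 scale)

emb1 = model.encode([a for a, _ in pairs], normalize_embeddings=True)
emb2 = model.encode([b for _, b in pairs], normalize_embeddings=True)
cosine = np.sum(emb1 * emb2, axis=1)  # row-wise cosine (embeddings are unit norm)

print(spearmanr(cosine, gold).correlation)
```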
Performance & Benchmarks
Methodology
Performance was evaluated on the 9 benchmarks listed above (6 from MTEB, 1 from MIRACL, 2 from Ordalie).
💡 The benchmarks are available on GitHub (a fork of MTEB) to reproduce the results.
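For reference, an MTEB-style evaluation is usually run as in the sketch below. The task name is a placeholder; the exact French tasks are those listed in the Ordalie fork, and the model ID is assumed from the HuggingFace model card.

```python
# Sketch of an MTEB-style evaluation run; "STS22" is a placeholder task name.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("OrdalieTech/Solon-embeddings-large-0.1")
evaluation = MTEB(tasks=["STS22"])
evaluation.run(model, output_folder="results/solon-large")
```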
Results
On average, Solon Large outperforms other open-source models on French-language similarity and retrieval tasks.
Technical details
Architecture
- Multilingual: 90% French, 10% English.
- Based on Microsoft's multilingual E5 (an XLM-RoBERTa backbone) with massive contrastive pre-training.
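For use without sentence-transformers, E5-family encoders are commonly pooled by averaging the last hidden states over non-padding tokens. The sketch below assumes that pooling strategy (inherited from the E5 lineage) and an assumed model ID; check the model card before relying on it.

```python
# Mean pooling over token embeddings, as commonly done for E5-family encoders.
# Pooling strategy assumed from the E5 lineage; model ID assumed from the model card.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OrdalieTech/Solon-embeddings-large-0.1")
model = AutoModel.from_pretrained("OrdalieTech/Solon-embeddings-large-0.1")

sentences = ["query: Où se trouve le musée du Louvre ?"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state              # (batch, seq_len, 1024)

mask = batch["attention_mask"].unsqueeze(-1).float()        # ignore padding tokens
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean pooling
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
print(embeddings.shape)  # torch.Size([1, 1024])
```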
Training
- Phase 1: Multilingual recalibration
  - Data: 36M French sentences, 4M English sentences.
  - Hardware: Nvidia A100 80GB.
  - Duration: ~24-26 hours.
- Phase 2: Targeted optimization
  - Data: 60k semi-synthetic pairs with "hard negatives" (see the loss sketch after this list).
  - Hardware: Nvidia A100 80GB.
  - Duration: 1-2 hours.
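To make the "hard negatives" strategy concrete, here is a generic InfoNCE-style contrastive loss in which each query is pulled toward its positive passage and pushed away from in-batch negatives plus an explicitly mined hard negative. This is an illustrative sketch of the general technique, not Ordalie's actual training code.

```python
# Generic InfoNCE-style contrastive loss with hard negatives (illustrative,
# not the exact Ordalie training recipe).
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, hard_neg, temperature=0.05):
    """q, pos, hard_neg: (batch, dim) embeddings from the encoder."""
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    # Candidates for each query: all positives in the batch + all hard negatives.
    candidates = torch.cat([pos, hard_neg], dim=0)   # (2 * batch, dim)
    logits = q @ candidates.T / temperature          # (batch, 2 * batch)

    # The correct candidate for query i is its own positive at index i.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 1024
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```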
Conclusions and outlook
- Objective achieved: best French textual similarity score with limited resources.
- Creation of new French benchmarks for evaluation.
🚀 Future improvements:
- Increase context size (currently limited to 512 tokens).
- Strengthen multilingual capabilities.
Join us
📧 Interested in the project or in joining Ordalie? Contact us at [email protected].
