Ordalie
Solon Embeddings 0.1
Baudouin Arbarétier

Solon Embeddings 0.1 is an MIT-licensed, open-source French embedding model trained by Ordalie. Solon Large is the best open-source French embedding model known to date, achieving the highest scores on public benchmarks.

📍 Available on HuggingFace in two versions:

  • Base: 268M parameters, 768 dimensions
  • Large: 560M parameters, 1024 dimensions

TL;DR

  • Overall performance measured on 9 benchmarks: 6 from the MTEB collection, 1 from MIRACL, and 2 developed by Ordalie.
  • Custom benchmarks:
    • Ordalie-FR-STS-benchmark (10k): evaluation of similarities between two sentences.
    • Ordalie-FR-Reranking-benchmark (2k): association of short queries with long passages.
  • Supervised training on French and English data, with advanced "hard negatives" strategies.
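
As a rough illustration of what a sentence-similarity (STS) benchmark measures, the sketch below scores a set of sentence pairs by comparing the cosine similarity of their embeddings with human similarity labels, using Spearman rank correlation (the usual main metric for STS tasks in MTEB). The function names and the toy vectors are illustrative assumptions, not Ordalie's evaluation code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def sts_score(embedding_pairs, gold_scores):
    """Spearman correlation between predicted cosine similarities and gold labels."""
    predicted = [cosine(a, b) for a, b in embedding_pairs]
    return pearson(ranks(predicted), ranks(gold_scores))

# Toy example (hypothetical 2-dimensional "embeddings").
pairs = [([1, 0], [1, 0]), ([1, 0], [1, 1]), ([1, 0], [0, 1])]
gold = [5.0, 3.0, 0.0]
score = sts_score(pairs, gold)
```

A score close to 1 means the model orders the pairs the same way the human annotators did; the absolute similarity values do not matter, only their ranking.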

Performance & Benchmarks

Methodology

Performance was evaluated using:

  1. MTEB (arXiv)
    1. 6 datasets in French.
  2. MIRACL (arXiv)
    1. French subset.
  3. Ordalie custom benchmarks:
    1. FR-STS Benchmark.
    2. FR-Reranking Benchmark.
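
To make the reranking task concrete, here is a minimal sketch of how a reranking benchmark can be scored: passages are sorted by their similarity to the query embedding, and the ordering is judged with mean average precision (MAP), a common reranking metric. The helper names and data layout are assumptions for illustration, and the embeddings are assumed to be L2-normalized so that a dot product equals cosine similarity.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

def rerank(query_vec, passages):
    """Sort (vector, is_relevant) passages by similarity to the query, best first.

    Returns only the relevance flags, in ranked order.
    """
    ranked = sorted(passages, key=lambda p: dot(query_vec, p[0]), reverse=True)
    return [is_relevant for _, is_relevant in ranked]

def average_precision(ranked_relevance):
    """Average of precision@k over the positions k that hold a relevant passage."""
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries):
    """queries: list of (query_vec, passages) tuples."""
    scores = [average_precision(rerank(q, passages)) for q, passages in queries]
    return sum(scores) / len(scores)

# Toy example with hypothetical 2-dimensional unit vectors.
query = [1.0, 0.0]
passages = [([0.0, 1.0], False), ([1.0, 0.0], True), ([0.6, 0.8], True)]
ranking = rerank(query, passages)
```

In this toy case both relevant passages are ranked above the irrelevant one, so the average precision is 1.0.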

💡 The benchmarks are available on GitHub (in a fork of MTEB) so that the results can be reproduced.

Results

Solon Large outperforms other open-source models on average on French-language similarity and retrieval tasks.

Technical details

Architecture

  • Multilingual, with a training mix of 90% French and 10% English.
  • Based on Microsoft's xlm-roberta/e5, which relies on large-scale contrastive pre-training.
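
For readers unfamiliar with how an encoder such as xlm-roberta/e5 turns a sequence of token vectors into a single sentence embedding, the sketch below shows masked mean pooling followed by L2 normalization, which is the pooling used by the E5 family. That Solon keeps this exact pooling is an assumption here, and the toy vectors stand in for real encoder outputs.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [x / count for x in total]

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy example: three token vectors, the last one being padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
```

With real models, the pooled vector would have 768 (Base) or 1024 (Large) dimensions instead of 2.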

Training

  1. Phase 1: Multilingual recalibration
    1. Data: 36M French sentences, 4M English sentences.
    2. Hardware: Nvidia A100 80GB.
    3. Duration: approximately 24 to 26 hours.
  2. Phase 2: Targeted optimization
    1. Data: 60k semi-synthetic pairs with "hard negatives".
    2. Hardware: Nvidia A100 80GB.
    3. Duration: 1 to 2 hours.
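
The role of "hard negatives" in the second phase can be illustrated with a contrastive (InfoNCE-style) loss, a standard objective for embedding models. This is a minimal sketch, not Ordalie's training code: the function name, the temperature of 0.05, and the toy vectors are all assumptions chosen for illustration.

```python
import math

def dot(u, v):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive among [positive] + negatives.

    All vectors are assumed to be L2-normalized. The temperature value is
    a hypothetical default, not a setting taken from the source.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    return log_denominator - logits[0]

# Toy example: an easy negative points away from the query,
# a hard negative is nearly as close to it as the positive.
query = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [-1.0, 0.0]
hard_negative = [0.9, math.sqrt(1 - 0.9 ** 2)]

easy_loss = info_nce_loss(query, positive, [easy_negative])
hard_loss = info_nce_loss(query, positive, [hard_negative])
```

The hard negative produces a much larger loss than the easy one, which is why such examples carry most of the training signal once the model already separates unrelated texts.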

Conclusions and outlook

  • Objective achieved: best French textual similarity score with limited resources.
  • Creation of new French benchmarks for evaluation.

🚀 Future improvements:

  • Increase context size (currently limited to 512 tokens).
  • Strengthen multilingual capabilities.
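
Until the context size grows, one common workaround for the 512-token limit is to split long inputs into overlapping windows and embed each window separately. The sketch below shows only the splitting step; the function name and the overlap of 64 tokens are illustrative assumptions, and with a real tokenizer a few positions may need to be reserved for special tokens.

```python
def split_into_windows(token_ids, max_tokens=512, overlap=64):
    """Split a token sequence into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some context in the next window.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break
        start += step
    return windows

# Toy example: 1000 placeholder token ids.
tokens = list(range(1000))
windows = split_into_windows(tokens)
```

Each window can then be embedded on its own, and the resulting vectors either stored separately or averaged, depending on the retrieval setup.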

Join us

📧 Interested in the project or recruiting at Ordalie? Contact us at [email protected].