1.30.0

@KennethEnevoldsen KennethEnevoldsen released this 25 Jan 04:05

1.30.0 (2025-01-25)

Feature

  • feat: Integrating ChemTEB (#1708)

  • Add SMILES, AI Paraphrase and Inter-Source Paragraphs PairClassification Tasks

  • Add chemical subsets of NQ and HotpotQA datasets as Retrieval tasks

  • Add PubChem Synonyms PairClassification task

  • Update task init for previously added tasks

  • Add nomic-bert loader

  • Add a script to run the evaluation pipeline for chemical-related tasks

  • Add 15 Wikipedia article classification tasks

  • Add PairClassification and BitextMining tasks for Coconut SMILES

  • Fix naming of some Classification and PairClassification tasks

  • Fix some classification tasks naming issues

  • Integrate WANDB with benchmarking script

  • Update .gitignore

  • Fix nomic_models.py issue with retrieval tasks, similar to issue #1115 in original repo

  • Add one chemical model and some SentenceTransformer models

  • Fix a naming issue for SentenceTransformer models

  • Add OpenAI, bge-m3 and matscibert models

  • Add PubChem SMILES Bitext Mining tasks

  • Rename metrics to be more descriptive

  • Add English e5 and bge v1 models in all sizes

  • Add two Wikipedia Clustering tasks

  • Add a try-except to the evaluation script to skip faulty models during the benchmark

  • Add bge v1.5 models and clustering score extraction to json parser

  • Add Amazon Titan embedding models

  • Add Cohere Bedrock models

  • Add two SDS Classification tasks

  • Add SDS Classification tasks to classification init and chem_eval

  • Add a retrieval dataset, update dataset names and revisions

  • Update revision for the CoconutRetrieval dataset: handle duplicate SMILES (documents)

  • Update CoconutSMILES2FormulaPC task

  • Change CoconutRetrieval dataset to a smaller one

  • Update some models

  • Integrate models added in ChemTEB (such as Amazon, Cohere Bedrock and nomic-bert) with the latest modeling format in mteb
  • Update the metadata for the mentioned models
  • Fix a typo: the open_weights argument was repeated

  • Update ChemTEB tasks

  • Rename some tasks for better readability.
  • Merge some BitextMining and PairClassification tasks into a single task with subsets (PubChemSMILESBitextMining and PubChemSMILESPC)
  • Add a new multilingual task (PubChemWikiPairClassification) consisting of 12 languages.
  • Update dataset paths, revisions and metadata for most tasks.
  • Add a Chemistry domain to TaskMetadata
  • Remove unnecessary files and tasks for MTEB

  • Update some ChemTEB tasks

  • Move PubChemSMILESBitextMining to eng folder
  • Add citations for tasks involving SDS, NQ, Hotpot, PubChem data
  • Update Clustering tasks category
  • Change main_score for PubChemAISentenceParaphrasePC
  • Create ChemTEB benchmark

  • Remove CoconutRetrieval

  • Update tasks and benchmarks tables with ChemTEB

  • Mention ChemTEB in readme

  • Fix some issues, update task metadata, lint

  • Fix eval_langs
  • Fix the dataset path for two datasets
  • Complete metadata for all tasks, mainly the following fields: date, task_subtypes, dialect, sample_creation
  • Run ruff lint
  • Rename nomic_bert_models.py to nomic_bert_model.py and update it
  • Remove nomic_bert_model.py as it is now compatible with SentenceTransformer.

  • Remove the WikipediaAIParagraphsParaphrasePC task because it is trivial

  • Merge amazon_models and cohere_bedrock_models.py into bedrock_models.py

  • Remove unnecessary load_data for some tasks.

  • Update bedrock_models.py, openai_models.py and two dataset revisions

  • Text should be truncated for Amazon text embedding models.
  • text-embedding-ada-002 returns null embeddings for some inputs with 8192 tokens.
  • Two datasets are updated, dropping very long samples (length above the 99th percentile).
  • Add a layer of dynamic truncation for Amazon models in bedrock_models.py
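
The two ideas in the bullets above (dropping samples longer than the 99th percentile, and truncating input dynamically when a backend rejects it) can be illustrated with a minimal sketch. This is not the code in bedrock_models.py. The function names, the character-based limit, the shrink factor, and the use of ValueError as the "input too long" signal are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def drop_longest(samples: Sequence[str], percentile: int = 99) -> List[str]:
    """Keep samples whose length is at or below the given length percentile.

    Uses the nearest-rank percentile with integer arithmetic. Hypothetical
    helper, not part of mteb.
    """
    if not samples:
        return []
    lengths = sorted(len(s) for s in samples)
    # Nearest-rank: ceil(percentile / 100 * n), computed without floats.
    rank = max(1, (percentile * len(lengths) + 99) // 100)
    cutoff = lengths[rank - 1]
    return [s for s in samples if len(s) <= cutoff]


def embed_with_dynamic_truncation(
    text: str,
    embed_fn: Callable[[str], List[float]],
    max_chars: int = 8000,   # assumed static limit, applied first
    shrink: float = 0.9,     # assumed factor applied after each rejection
    max_attempts: int = 10,
) -> List[float]:
    """Truncate once, then retry with shorter input while embed_fn rejects it.

    embed_fn is assumed to raise ValueError when the input is too long.
    """
    candidate = text[:max_chars]
    for _ in range(max_attempts):
        try:
            return embed_fn(candidate)
        except ValueError:
            new_len = int(len(candidate) * shrink)
            if new_len == 0:
                break
            candidate = candidate[:new_len]
    raise ValueError("input could not be shortened enough to embed")
```

A character count is only a rough proxy for a token limit, which is why the sketch shrinks the input on rejection instead of relying on a single fixed cut.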

  • Replace metadata_dict with self.metadata in PubChemSMILESPC.py

  • Fix model metadata for Bedrock models

  • Add reference comment to original Cohere API implementation (4d66434)
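
The try-except change listed above, which skips faulty models during a benchmark run, follows a common pattern that can be sketched as below. This is an illustration under stated assumptions, not the evaluation script from the release: `run_benchmark` and the `evaluate` callback are hypothetical names, and a real run would call the evaluation pipeline where `evaluate` is invoked.

```python
import logging
from typing import Callable, Dict, Iterable, Tuple

logger = logging.getLogger(__name__)


def run_benchmark(
    model_names: Iterable[str],
    evaluate: Callable[[str], float],
) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Evaluate each model in turn, recording failures instead of aborting.

    Returns (results, failures): scores for models that completed, and the
    error text for models that raised.
    """
    results: Dict[str, float] = {}
    failures: Dict[str, str] = {}
    for name in model_names:
        try:
            results[name] = evaluate(name)
        except Exception as exc:  # broad on purpose: one bad model must not stop the run
            logger.warning("Skipping model %s: %s", name, exc)
            failures[name] = repr(exc)
    return results, failures
```

Returning the failures alongside the results keeps skipped models visible, so a long benchmark run does not silently lose entries.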

Unknown