Add absolute confidence metric #413

pemistahl · 2024-12-06T10:22:52Z

Currently, the library only provides a relative confidence metric that tells you how likely a language is in comparison to the other languages. It is desirable to have an additional absolute confidence metric that works with a single language only, independently of any other language. With such an absolute confidence metric, a LanguageDetector instance could be built from a single language. This instance would then be able to provide binary decisions, i.e. tell whether some text is written in a specific language or not.

An absolute confidence metric could be based on the unique ngrams or the n most common ngrams of a language.
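To make the idea concrete, here is a minimal sketch of one possible absolute metric: the share of a text's trigrams that occur in a given set of a language's most common trigrams. This is not the library's implementation. The function names `trigrams`, `absolute_confidence` and `is_language`, the choice of trigrams, and the threshold are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Splits a text into lowercase character trigrams, word by word.
/// Non-alphabetic characters act as word separators.
fn trigrams(text: &str) -> Vec<String> {
    let lower = text.to_lowercase();
    let mut result = Vec::new();
    for word in lower.split(|c: char| !c.is_alphabetic()) {
        let chars: Vec<char> = word.chars().collect();
        for window in chars.windows(3) {
            result.push(window.iter().collect::<String>());
        }
    }
    result
}

/// Hypothetical absolute confidence: the fraction of the text's trigrams
/// that appear in `common`, a set of one language's most common trigrams.
/// No other language is involved in the computation.
fn absolute_confidence(text: &str, common: &HashSet<&str>) -> f64 {
    let grams = trigrams(text);
    if grams.is_empty() {
        return 0.0;
    }
    let hits = grams
        .iter()
        .filter(|gram| common.contains(gram.as_str()))
        .count();
    hits as f64 / grams.len() as f64
}

/// Binary decision for a single language. The threshold is an assumed
/// tuning parameter, not a value taken from the library.
fn is_language(text: &str, common: &HashSet<&str>, threshold: f64) -> bool {
    absolute_confidence(text, common) >= threshold
}
```

With a set of common trigrams for one language, `is_language(text, &common, 0.8)` would give the kind of yes/no answer described above. A real metric would likely weight ngrams by frequency and combine several ngram lengths.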

hemju · 2025-01-31T12:51:49Z

@pemistahl I just found this library, and it looks awesome. The new feature seems really useful. Basically, I just wanted to say thank you for the hard work.

pemistahl · 2025-01-31T13:45:33Z

Thank you very much @hemju for the kind words. Really appreciated. :) This new feature is nearly completed. I hope that I will be able to create a new release within the next two weeks at the latest.

hemju · 2025-01-31T14:44:35Z

"I hope that I will be able to create a new release within the next two weeks at the latest."

That sounds great 🥳 👏 Thank you again.

I have one question (I am not sure if this is the right place to ask): I am unsure about initialization. This is also unclear to me in the Java version, which is the version that brought me here, though I will stick with the Rust version. Is a LanguageDetector thread-safe? Should it be reused, or should it be initialized for each call? I understand that the underlying language models are kept in memory, but what about the detector?

Thanks in advance.

pemistahl · 2025-01-31T15:19:02Z

I will answer that once and for all in the documentation of the next release, but: Yes, the library is thread-safe. That includes instances of LanguageDetector as well. Just create a single instance and reuse it in multiple threads. Only create multiple instances if the settings for each instance shall be different, e.g. different language sets to recognize.
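The reuse pattern described above can be sketched with the standard library alone. The `Detector` type below is a hypothetical stand-in, not Lingua's LanguageDetector, and `detect_in_parallel` is an illustrative helper name. The point is only that a single instance with `&self` methods can be borrowed by several threads at once.

```rust
use std::thread;

/// Hypothetical stand-in for a detector. It is NOT the library's
/// LanguageDetector; it only mimics an immutable, shareable instance.
struct Detector {
    keywords: Vec<&'static str>,
}

impl Detector {
    fn new(keywords: Vec<&'static str>) -> Self {
        Detector { keywords }
    }

    /// Takes `&self`, so concurrent calls need no locking.
    fn matches(&self, text: &str) -> bool {
        let lower = text.to_lowercase();
        self.keywords.iter().any(|keyword| lower.contains(*keyword))
    }
}

/// Runs one worker thread per text, all borrowing the same detector.
/// Results are returned in the order of the input texts.
fn detect_in_parallel(detector: &Detector, texts: &[&str]) -> Vec<bool> {
    thread::scope(|scope| {
        let mut handles = Vec::new();
        for text in texts {
            handles.push(scope.spawn(move || detector.matches(text)));
        }
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Scoped threads let the workers borrow the detector directly. With long-lived threads, wrapping the single instance in an `Arc` would serve the same purpose.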

hemju · 2025-01-31T15:34:39Z

Thank you!

pemistahl added the new feature label Dec 6, 2024

pemistahl added this to the Lingua 1.7.0 milestone Dec 6, 2024

pemistahl linked a pull request Dec 31, 2024 that will close this issue

Add absolute confidence metric based on unique and most common ngrams #419

Draft
