- After some internal optimizations, language detection is now approximately 20% to 30% faster. The speed improvement is greater for long input texts than for short ones.
- For long input texts, an error occurred while computing the confidence values, due to numerical underflow when converting probabilities. This has been fixed. Thanks to @jordimas for reporting this bug. (#102)
- The min-max normalization method for the confidence values has been replaced by the softmax function, which yields more realistic probabilities. Big thanks to @Alex-Kopylov for proposing and implementing this change. (#99)
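  As a rough illustration, here is a minimal, numerically stable softmax over hypothetical log-likelihoods. This is not the library's internal code, and the values are made up:

  ```python
  import math

  def softmax(log_likelihoods):
      # Shifting by the maximum value keeps math.exp() in a representable
      # range, which also avoids the kind of underflow described in the
      # bug fix above.
      max_ll = max(log_likelihoods)
      exps = [math.exp(ll - max_ll) for ll in log_likelihoods]
      total = sum(exps)
      return [e / total for e in exps]

  # Hypothetical log-likelihoods for three candidate languages
  print(softmax([-12345.6, -12347.2, -12350.9]))
  # [0.828..., 0.167..., 0.004...]
  ```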
- Under certain circumstances, calling the method `LanguageDetector.detect_multiple_languages_of()` raised an `IndexError`. This has been fixed. Thanks to @Saninsusanin for reporting this bug. (#98)
- The new method `LanguageDetector.detect_multiple_languages_of()` has been introduced. It allows detection of multiple languages in mixed-language text. (#4)
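  A usage sketch, modeled on the project's README; the `start_index` and `end_index` attributes of the returned results are assumed from the documentation:

  ```python
  from lingua import Language, LanguageDetectorBuilder

  detector = LanguageDetectorBuilder.from_languages(
      Language.ENGLISH, Language.FRENCH, Language.GERMAN
  ).build()

  sentence = (
      "Parlez-vous français? "
      "Ich spreche Französisch nur ein bisschen. "
      "A little bit is better than nothing."
  )

  # Each result reports a language together with the substring it covers.
  for result in detector.detect_multiple_languages_of(sentence):
      print(f"{result.language.name}: "
            f"{sentence[result.start_index:result.end_index]!r}")
  ```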
- The new method `LanguageDetector.compute_language_confidence()` has been introduced. It allows retrieving the confidence value for one specific language only, given the input text. (#86)
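  For example (the printed value is illustrative only):

  ```python
  from lingua import Language, LanguageDetectorBuilder

  detector = LanguageDetectorBuilder.from_languages(
      Language.ENGLISH, Language.FRENCH, Language.GERMAN
  ).build()

  # Confidence for one specific language instead of the whole ranking.
  confidence = detector.compute_language_confidence(
      "languages are awesome", Language.FRENCH
  )
  print(confidence)  # some value between 0.0 and 1.0
  ```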
- The computation of the confidence values has been revised: the min-max normalization algorithm is now applied to the values, making them behave more like real probabilities and therefore easier to compare. (#78)
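  For comparison with the softmax entry further above, a minimal sketch of min-max normalization itself; again, illustrative only, not the library's code:

  ```python
  def min_max_normalize(values):
      # Rescale linearly so the largest value maps to 1.0 and the
      # smallest to 0.0; relative spacing is preserved, although the
      # results do not sum to 1 like true probabilities.
      lo, hi = min(values), max(values)
      return [(v - lo) / (hi - lo) for v in values]

  print(min_max_normalize([-12345.6, -12347.2, -12350.9]))
  # [1.0, 0.69..., 0.0]
  ```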
- The library now has a fresh and colorful new logo. Why? Well, why not? (-:
- An `__all__` variable has been added, indicating which types are exported by the library. This helps with type checking programs that use Lingua (see the short example below). Big thanks to @bscan for the pull request. (#76)
- The rule-based language filter has been improved for German texts. (#71)
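  As context for the `__all__` entry above, a hypothetical excerpt of such a declaration (the exact list in Lingua may differ):

  ```python
  # Hypothetical excerpt: __all__ enumerates the names that wildcard
  # imports and type checkers should treat as the public API.
  __all__ = [
      "Language",
      "LanguageDetector",
      "LanguageDetectorBuilder",
  ]
  ```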
- A further bottleneck in the code has been removed, making language detection approximately 30% faster compared to version 1.1.2.
- The language models are now stored on disk as serialized NumPy arrays instead of JSON. This reduces the preloading time of the language models significantly.
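  The general technique looks roughly like the following sketch; the library's actual file layout is not shown here:

  ```python
  import json
  import numpy as np

  # Toy stand-in for an ngram model's frequency values.
  frequencies = np.array([0.12, 0.07, 0.33], dtype=np.float32)

  # JSON must be parsed and converted back into objects on every load ...
  with open("model.json", "w") as f:
      json.dump(frequencies.tolist(), f)

  # ... whereas np.save() writes the raw binary buffer, so np.load() is
  # little more than a single read into a preallocated array.
  np.save("model.npy", frequencies)
  loaded = np.load("model.npy")
  ```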
- A bottleneck in the language detection code has been removed, making language detection approximately 40% faster.
- The `py.typed` file that activates static type checking was missing. Big thanks to @Vasniktel for reporting this problem. (#63)
- The ISO 639-3 code for Urdu was wrong. Big thanks to @pluiez for reporting this bug. (#64)
- For certain ngrams, wrong probabilities were returned. This has been fixed. Big thanks to @3a77 for reporting this bug. (#62)
- The new method `LanguageDetectorBuilder.with_low_accuracy_mode()` has been introduced. By activating it, detection accuracy for short text is reduced in favor of a smaller memory footprint and faster detection performance.
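  Usage sketch:

  ```python
  from lingua import Language, LanguageDetectorBuilder

  detector = (
      LanguageDetectorBuilder.from_languages(
          Language.ENGLISH, Language.FRENCH, Language.GERMAN
      )
      .with_low_accuracy_mode()  # trade short-text accuracy for speed/memory
      .build()
  )
  print(detector.detect_language_of("languages are awesome"))
  ```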
- The memory footprint has been reduced significantly by storing the language models in structured NumPy arrays instead of dictionaries. This reduces memory consumption from approximately 2600 MB to 800 MB.
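  An illustrative sketch of the idea (the library's actual dtype and layout may differ):

  ```python
  import numpy as np

  # A dict keeps every key and value as a separate Python object,
  # each with its own per-object overhead.
  model_dict = {"ab": 0.12, "ba": 0.07, "ac": 0.33}

  # A structured array packs the same records into one contiguous buffer.
  ngram_dtype = np.dtype([("ngram", "U2"), ("frequency", np.float32)])
  model_array = np.array(
      [("ab", 0.12), ("ba", 0.07), ("ac", 0.33)], dtype=ngram_dtype
  )
  print(model_array["frequency"].sum())  # column access without unboxing
  ```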
- Several language model files have become obsolete and could be deleted without decreasing detection accuracy. This results in a smaller memory footprint.
- The lowest supported Python version is now 3.8; Python 3.7 is no longer compatible with this library.
- This patch release makes the library compatible with Python >= 3.7.1. Previously, it could be installed from PyPI only with Python >= 3.9.
- The very first release of Lingua. Enjoy!