Commit
fix: better analysis of "oil (rapeseed, something unrecognized)" + separation of additive class + additive (#11251)

In the test set used to estimate ingredient percentages, 43 products out of 1000 have the ingredient "rapeseed". It turns out that many of these come from our inability to correctly parse things like "vegetable oil (rapeseed, something that we don't recognize as oil)".

We have an ingredient preparsing algorithm that tries to recognize patterns like [category of ingredients] ([enumeration of types of ingredients]). For example, the preparsing turns "vegetable oils (palm, rapeseed, soy)" into "vegetable oils (palm vegetable oils, rapeseed vegetable oils, soy vegetable oils)". This only works if we can identify every oil type in the enumeration and the taxonomy has a corresponding oil for each, so it fails a lot.

This PR introduces an alternative to the preparsing, with a more general approach: when an ingredient has a parent ingredient, we check whether there is a known ingredient "parent + child" in the taxonomy (e.g. for "oil (palm)", we check whether we have a known ingredient "palm oil"). In other languages, we reverse the order: "huiles (palme)" checks for "huile palme".

It would have to be tested, but we could potentially keep this and completely remove the equivalent function in the preparsing, which requires hardcoding all types of oils, flavours, etc. and fails when an item of the enumeration is missing from that list.

This PR also removes some entries from the ingredients taxonomy, like "emulsifier soy lecithin", which is incorrect but common in ingredient lists. This is because we don't want "emulsifier (soy lecithin)" to be converted to "emulsifier soy lecithin".

The preparsing has a function that separates an additive class from the additive that follows it, yielding "emulsifier : soy lecithin". It didn't work in this specific case because "soy lecithin" is in the ingredients taxonomy but not in the additives taxonomy. I changed the function to use the ingredients taxonomy instead.

There may be unwanted false positives, so it is worth reviewing all the test changes.
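The "parent + child" lookup described above can be sketched roughly as follows. This is an illustration only, not the code from this commit: the tiny `KNOWN` taxonomy, the `CHILD_FIRST` language set, the naive plural stripping, and the function names `resolve_child` and `parse_enumeration` are all assumptions made for the example.

```python
import re

# Hypothetical mini-taxonomy of known ingredient names per language
# (a stand-in for the real ingredients taxonomy).
KNOWN = {
    "en": {"oil", "palm oil", "rapeseed oil"},
    "fr": {"huile", "huile palme", "huile colza"},
}

# Assumption: languages where the child comes before the parent ("palm oil").
# Other languages use parent-then-child order ("huile palme").
CHILD_FIRST = {"en"}


def normalize_parent(parent: str) -> str:
    """Lowercase the parent and naively strip one trailing plural 's'."""
    parent = parent.strip().lower()
    return parent[:-1] if parent.endswith("s") else parent


def resolve_child(parent: str, child: str, lang: str) -> str:
    """Return the combined "parent + child" name if the taxonomy knows it,
    otherwise return the child unchanged."""
    parent = normalize_parent(parent)
    child = child.strip().lower()
    if lang in CHILD_FIRST:
        candidate = f"{child} {parent}"
    else:
        candidate = f"{parent} {child}"
    return candidate if candidate in KNOWN.get(lang, set()) else child


def parse_enumeration(text: str, lang: str) -> list[str]:
    """Parse "parent (child1, child2, ...)" and resolve each child on its own."""
    match = re.fullmatch(r"\s*([^()]+?)\s*\(([^()]*)\)\s*", text)
    if match is None:
        # No enumeration found: return the text as a single ingredient.
        return [text.strip().lower()]
    parent, children = match.group(1), match.group(2).split(",")
    return [resolve_child(parent, child, lang) for child in children]


print(parse_enumeration("oil (palm, rapeseed, mystery stuff)", "en"))
print(parse_enumeration("huiles (palme)", "fr"))
```

Because each child is resolved independently, one unrecognized item in the enumeration no longer prevents the recognized ones from being combined with the parent, which is the failure mode of the all-or-nothing preparsing expansion.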
1 parent cec7a08, commit fde3287
Showing 15 changed files with 548 additions and 143 deletions.