fix: better analysis of "oil (rapeseed, something unrecognized)" + separation of additive class + additive (#11251)

In the test set for estimating ingredient percentages, 43 products out
of 1000 contain the ingredient "rapeseed". Many of these come from our
inability to correctly parse things like
"vegetable oil (rapeseed, something that we don't recognize as oil)".

We have an ingredient preparsing algorithm that tries to recognize
patterns like [category of ingredients] ([enumeration of types of
ingredients]). For example, the preparsing turns "vegetable oils (palm,
rapeseed, soy)" into "vegetable oils (palm vegetable oils, rapeseed
vegetable oils, soy vegetable oils)".

This only works if we can identify all the oil types in the enumeration
and have corresponding oils in the taxonomy, so it fails a lot.
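
A minimal sketch of that preparsing behaviour, to show why one unknown
type defeats the whole expansion. The %known_types lookup and the
expand_category_enumeration name are hypothetical; the real code uses
per-language lists of categories and types and is more involved.

```perl
use strict;
use warnings;

# Hypothetical list of known oil types (an assumption for this sketch;
# the real lists are hardcoded per language in the preparsing).
my %known_types = map { $_ => 1 } qw(palm rapeseed soy);

# Expand "category (type1, type2)" into
# "category (type1 category, type2 category)".
# The text is returned unchanged if any type in the enumeration is unknown.
sub expand_category_enumeration {
    my ($text) = @_;

    my ($category, $enumeration) = $text =~ /^\s*([^()]+?)\s*\(([^()]*)\)\s*$/
        or return $text;

    # Split the enumeration on commas and trim each type
    my @types = map { s/^\s+|\s+$//gr } split /,/, $enumeration;

    # A single unrecognized type is enough to give up
    return $text if grep { not $known_types{$_} } @types;

    return "$category (" . join(", ", map { "$_ $category" } @types) . ")";
}

print expand_category_enumeration("vegetable oils (palm, rapeseed, soy)"), "\n";
print expand_category_enumeration("vegetable oils (rapeseed, shea butter)"), "\n";
```

The first call prints the expanded form quoted above; the second is left
untouched because "shea butter" is not in the hypothetical list.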

This PR introduces an alternative to the preparsing, with a more general
approach:

When an ingredient has a parent ingredient, we check whether there is a
known ingredient "parent + child" in the taxonomy (e.g. for "oil (palm)",
we check whether we have a known ingredient "palm oil"). In other
languages, we reverse the order: "huiles (palme)" -> check for "huile palme".
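
A simplified sketch of that lookup, assuming a hypothetical %taxonomy
hash of synonyms in place of the real taxonomy functions (the actual
change calls canonicalize_taxonomy_tag on the combined text). The
parent_plus_child_id name is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the ingredients taxonomy: synonym => canonical id
my %taxonomy = (
    en => {"palm oil"    => "en:palm-oil", "rapeseed oil" => "en:rapeseed-oil"},
    fr => {"huile palme" => "en:palm-oil", "huile colza"  => "en:rapeseed-oil"},
);

# Return the id of the "parent + child" ingredient if it is known,
# undef otherwise.
sub parent_plus_child_id {
    my ($lc, $parent, $child) = @_;

    # English puts the child first ("palm oil"),
    # other languages put the parent first ("huile palme")
    my $text = $lc eq "en" ? "$child $parent" : "$parent $child";

    my $synonyms = $taxonomy{$lc} || {};
    return $synonyms->{lc($text)};
}

for my $case (["en", "oil", "palm"], ["fr", "huile", "palme"], ["en", "oil", "mystery seed"]) {
    my $id = parent_plus_child_id(@$case);
    print "$case->[1] ($case->[2]) -> ", $id // "no match", "\n";
}
```

When the combined text is not found, the child keeps whatever id it would
have had on its own, so an unrecognized item in the enumeration no longer
blocks the recognized ones.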

It would have to be tested, but we could potentially keep this and
completely remove the equivalent function in the preparsing, which
requires hardcoding all types of oils, flavours etc. and fails when we
miss one in the enumeration.

This PR also removes some entries from the ingredients taxonomy, like
"emulsifier soy lecithin", which is incorrect but common in ingredient
lists. This is because we don't want "emulsifier (soy lecithin)" to be
converted to "emulsifier soy lecithin". The preparsing already has a
function that turns "emulsifier soy lecithin" into
"emulsifier : soy lecithin". It didn't work in this specific case because
"soy lecithin" is in the ingredients taxonomy but not the additives
taxonomy. I changed the function to use the ingredients taxonomy instead.
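
A rough sketch of the separation rule, assuming a hypothetical
%known_ingredients lookup and a made-up separate_class_from_additive
helper. The real separate_additive_class function takes more arguments
and also handles "additive 1 and additive 2".

```perl
use strict;
use warnings;

# Hypothetical stand-in for the ingredients taxonomy
my %known_ingredients = map { $_ => 1 } ("soy lecithin", "sunflower lecithin", "gum arabic");

# Turn "class additive" into "class : additive" when the combined text is
# not itself a known ingredient but the part after the class is.
sub separate_class_from_additive {
    my ($additive_class, $after) = @_;

    my $combined = "$additive_class $after";

    # A taxonomy entry for the combined text prevents the split
    return $combined if $known_ingredients{$combined};

    return "$additive_class : $after" if $known_ingredients{$after};

    return $combined;
}

print separate_class_from_additive("emulsifier", "soy lecithin"), "\n";
print separate_class_from_additive("emulsifier", "something unknown"), "\n";
```

In this sketch, adding "emulsifier soy lecithin" to the lookup would stop
the first call from being split, which mirrors the first condition in the
changed function.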

There may be unwanted false positives, so it is useful to look at all
the tests.
stephanegigandet authored Feb 4, 2025
1 parent cec7a08 commit fde3287
Showing 15 changed files with 548 additions and 143 deletions.
129 changes: 112 additions & 17 deletions lib/ProductOpener/Ingredients.pm
@@ -1854,7 +1854,7 @@ sub parse_ingredients_text_service ($product_ref, $updated_product_fields_ref, $

delete $product_ref->{ingredients};

# and indicate that the service is creating the "ingredients" structure
# indicate that the service is creating the "ingredients" structure
$updated_product_fields_ref->{ingredients} = 1;

my $ingredients_lc = get_or_select_ingredients_lc($product_ref);
@@ -1939,7 +1939,36 @@ sub parse_ingredients_text_service ($product_ref, $updated_product_fields_ref, $
# Extract phrases related to specific ingredients at the end of the ingredients list
$text = parse_specific_ingredients_from_text($product_ref, $text, $percent_or_quantity_regexp, $per_100g_regexp);

my $analyze_ingredients_function = sub ($analyze_ingredients_self, $ingredients_ref, $level, $s) {
=head2 analyze_ingredient_function($analyze_ingredients_self, $ingredients_ref, $parent_ref, $level, $s)

This function is used to analyze the ingredients text and extract individual ingredients.
It identifies one ingredient at a time, and calls itself recursively to identify other ingredients and sub ingredients

=head3 Arguments

=head4 $analyze_ingredients_self

Reference to itself in order to call itself recursively

=head4 $ingredients_ref

Reference to an array of ingredients that will be filled with the extracted ingredients

=head4 $parent_ref

Reference to the parent ingredient (if any)

=head4 $level

Level of depth of sub ingredients

=head4 $s

Text to analyze

=cut

my $analyze_ingredients_function = sub ($analyze_ingredients_self, $ingredients_ref, $parent_ref, $level, $s) {

# print STDERR "analyze_ingredients level $level: $s\n";

@@ -2460,7 +2489,7 @@ sub parse_ingredients_text_service ($product_ref, $updated_product_fields_ref, $
or $ingredients_ref->[$last_ingredient]{ingredients} = [];
$analyze_ingredients_self->(
$analyze_ingredients_self, $ingredients_ref->[$last_ingredient]{ingredients},
$between_level, $between
$parent_ref, $between_level, $between
);
}

@@ -2992,6 +3021,47 @@ sub parse_ingredients_text_service ($product_ref, $updated_product_fields_ref, $

if (not $skip_ingredient) {

# If we have a parent ingredient, check if "parent ingredient + child ingredient" is a known ingredient
# e.g. "vegetal oil (palm, rapeseed)" -> if we have "palm" as the child, try to transform it in "palm vegetal oil"

if (defined $parent_ref) {

# Generate the text for the canonicalized parent ingredient (so that we don't get percentages, labels etc. in it)
my $parent_ingredient_text
= display_taxonomy_tag($ingredients_lc, "ingredients", $parent_ref->{id});

my $parent_plus_child_ingredient_text;

if ($ingredients_lc eq "en") {
# oil (palm) -> palm oil
$parent_plus_child_ingredient_text = $ingredient . ' ' . $parent_ingredient_text;
}
else {
# huile (palme) -> huile palme
$parent_plus_child_ingredient_text = $parent_ingredient_text . ' ' . $ingredient;
}

# Check if the parent + child ingredient is a known ingredient
my $exists_in_taxonomy;
my $parent_plus_child_ingredient_id
= canonicalize_taxonomy_tag($ingredients_lc, "ingredients", $parent_plus_child_ingredient_text,
\$exists_in_taxonomy);

if ($exists_in_taxonomy) {
$ingredient_id = $parent_plus_child_ingredient_id;
$log->debug(
"parse_ingredient_text - parent + child ingredient recognized",
{
parent => $parent_ingredient_text,
child => $ingredient,
parent_plus_child_ingredient_text => $parent_plus_child_ingredient_text,
parent_plus_child_ingredient_id => $parent_plus_child_ingredient_id
}

) if $log->is_debug();
}
}

my %ingredient = (
id => get_taxonomyid($ingredients_lc, $ingredient_id),
text => $ingredient
@@ -3065,7 +3135,9 @@ sub parse_ingredients_text_service ($product_ref, $updated_product_fields_ref, $
if ((scalar @ingredients) == 0) {
$ingredient{ingredients} = [];
$analyze_ingredients_self->(
$analyze_ingredients_self, $ingredient{ingredients},
$analyze_ingredients_self,
$ingredient{ingredients},
$ingredients_ref->[-1],
$between_level, $between
);
}
@@ -3078,12 +3150,12 @@ sub parse_ingredients_text_service ($product_ref, $updated_product_fields_ref, $
}

if ($after ne '') {
$analyze_ingredients_self->($analyze_ingredients_self, $ingredients_ref, $level, $after);
$analyze_ingredients_self->($analyze_ingredients_self, $ingredients_ref, $parent_ref, $level, $after);
}

};

$analyze_ingredients_function->($analyze_ingredients_function, $product_ref->{ingredients}, 0, $text);
$analyze_ingredients_function->($analyze_ingredients_function, $product_ref->{ingredients}, undef, 0, $text);

$log->debug("ingredients: ", {ingredients => $product_ref->{ingredients}}) if $log->is_debug();

@@ -5783,8 +5855,9 @@ sub separate_additive_class ($ingredients_lc, $additive_class, $spaces, $colon,
#print STDERR "separate_additive_class - after 2 : $after\n";

# also look if we have additive 1 and additive 2
my $after2;
my ($after1, $after2);
if ($after =~ /$and/i) {
$after1 = $`;
$after2 = $`;
}

@@ -5793,14 +5866,25 @@ sub separate_additive_class ($ingredients_lc, $additive_class, $spaces, $colon,

if (
(
not exists_taxonomy_tag(
"additives", canonicalize_taxonomy_tag($ingredients_lc, "additives", $additive_class . " " . $after)
not(
exists_taxonomy_tag("ingredients",
canonicalize_taxonomy_tag($ingredients_lc, "ingredients", $additive_class . " " . $after))
or (
(defined $after1)
and exists_taxonomy_tag(
"ingredients",
canonicalize_taxonomy_tag($ingredients_lc, "ingredients", $additive_class . " " . $after1)
)
)
)
)
and (
exists_taxonomy_tag("additives", canonicalize_taxonomy_tag($ingredients_lc, "additives", $after))
# we use the ingredients taxonomy here as some additives like "soy lecithin" are currently in the ingredients taxonomy
# but not in the additives taxonomy
exists_taxonomy_tag("ingredients", canonicalize_taxonomy_tag($ingredients_lc, "ingredients", $after))
or ((defined $after2)
and exists_taxonomy_tag("additives", canonicalize_taxonomy_tag($ingredients_lc, "additives", $after2)))
and
exists_taxonomy_tag("ingredients", canonicalize_taxonomy_tag($ingredients_lc, "ingredients", $after2)))
)
)
{
@@ -5936,6 +6020,16 @@ my %ingredients_categories_and_types = (
},
],

es => [
# oils
{
categories => ["aceite", "aceite vegetal", "aceites vegetales"],
types =>
["aguacate", "coco", "colza", "girasol", "linaza", "nabina", "oliva", "palma", "palmiste", "soja",],
alternate_names => ["aceite de <type>", "aceite d'<type>"],
},
],

fr => [
# huiles
{
@@ -5945,12 +6039,13 @@ my %ingredients_categories_and_types = (
],
types => [
"arachide", "avocat", "carthame", "chanvre",
"coco", "colza", "coton", "illipe",
"karité", "lin", "mangue", "noisette",
"noix", "noyaux de mangue", "olive", "olive extra",
"olive vierge", "olive extra vierge", "olive vierge extra", "palme",
"palmiste", "pépins de raisin", "sal", "sésame",
"soja", "tournesol", "tournesol oléique",
"coco", "colza", "coprah", "coton",
"graines de colza", "illipe", "karité", "lin",
"mangue", "noisette", "noix", "noyaux de mangue",
"olive", "olive extra", "olive vierge", "olive extra vierge",
"olive vierge extra", "palme", "palmiste", "pépins de raisin",
"sal", "sésame", "soja", "tournesol",
"tournesol oléique",
],
alternate_names => [
"huile de <type>",
4 changes: 2 additions & 2 deletions taxonomies/allergens.txt
@@ -1,10 +1,10 @@
stopwords:en: and, byproducts, other, of, this, product, made, manufactured, in, a, factory, which, also, uses, trace, traces, possible, eventual, potential
stopwords:en: and, byproducts, other, of, this, product, made, manufactured, in, a, factory, which, also, uses, trace, traces, possible, eventual, potential, including
stopwords:bg: и, следи от, други, може да съдържа, може да съдържа следи от , може да има следи от, съдържа, възможни са следи от
stopwords:da: spor
stopwords:de: enthalten, von, und, kann, spuren, andere, anderen, weitere, weiteren
stopwords:es: traza, trazas, y, de, que, contiene, contienen, otros, producto, productos, derivado, derivados, incluido, incluidos, incluida, incluidas, la, una, fuente, contener, puede, con, elaborado, en, linea, donde, se, procesa, granos, posible, posibles, este
stopwords:fi: ja, muita, muuta, tehtaassa, valmistettu, saattaa, sisältää, pieniä, pienehköjä, määriä, jäämiä, ainesosia
stopwords:fr: d'autres, autre, autres, ce, produit, est, fabriqué, élaboré, transformé, emballé, dans, un, atelier, une, usine, qui, utilise, aussi, également, céréale, céréales, farine, farines, extrait, extraits, graine, graines, trace, traces, éventuelle, éventuelles, possible, possibles, potentielle, potentielles, peut, pourrait, pouvant, contenir, contenant, contient, de, des, du, d', l', la, le, les, et, dérivés, à, base, ce, ces, hybrides, ou, produits, y, compris, contiennent
stopwords:fr: d'autres, autre, autres, ce, produit, est, fabriqué, élaboré, transformé, emballé, dans, un, atelier, une, usine, qui, utilise, aussi, également, céréale, céréales, farine, farines, extrait, extraits, graine, graines, trace, traces, éventuelle, éventuelles, possible, possibles, potentielle, potentielles, peut, pourrait, pouvant, contenir, contenant, contient, de, des, du, d', l', la, le, les, et, dérivés, à, base, ce, ces, hybrides, ou, produits, y, compris, contiennent, dont
stopwords:hr: i, ili, može, sadržavati, sadrži, tragove, u tragovima, mogući tragovi
stopwords:is: í, snefilmagni
stopwords:it: derivanti, del, de, la, e, tracce, contiene, una, fonte, di, puo, contenere, altra, contenenti, a, base, semi
59 changes: 8 additions & 51 deletions taxonomies/food/ingredients.txt
@@ -502,7 +502,7 @@ el: λεκιθίνη σόγιας
es: lecitina de soja, lecitina de soya, lecitinas de soja, lecitinas de soya
et: sojaletsitiin
fi: soijalesitiini, soijalesitiiniä, soijalesitiinit, E322 soijasta
fr: lécithine de soja, lécithines de soja, émulsifiant lécithines de soja, lécithine de soja e322
fr: lécithine de soja, lécithines de soja, lécithine de soja e322
he: לציטין סויה
hr: emulgator soja lecitin, emulgator sojin lecitin, soja lecitin, emulgatori sojin lecitin, sojin lecithin, sojin lecitin, sojini lecitini
hu: szójalecitin, szójalecitinek
@@ -578,12 +578,6 @@ en: non-gmo sunflower lecithin
hr: ne-gmo suncokretov lecitin
it: lecitina di girasole non ogm

# <en:emulsifier
< en:sunflower lecithin
fr: émulsifiant lécithine de tournesol
fi: emulgointiaine auringonkukkalesitiini
# ingredient/fr:émulsifiant-lécithine-de-tournesol has 100 products in 6 languages @2019-02-22

< en:soya lecithin
< en:sunflower lecithin
en: soya and sunflower lecithin
@@ -624,18 +618,6 @@ sv: rapslecitin, rapslecithin
vegan:en: yes
vegetarian:en: yes

< en:E322(i)
fr: émulsifiant lécithines, émulsifiants lécithines
de: emulgator lecithine
es: emulsionante de lecitina, emulsionantes de lecitina
hr: emulgator lecitin iz uljane repice
it: emulsionante lecitine
nl: emulgator lecithinen
pt: emulsionante lecitinas
# fr:émulsifiant-lécithines has 88 products in 6 languages @2018-11-10
# fr:émulsifiants-lécithines has 21 products in 4 languages @2018-11-10



# Has NO E-number!!!

@@ -838,20 +820,6 @@ hr: limunov pektin
it: pectina di limone
lt: citrinų pektinas



< en:E471
en: emulsifier mono- and diglycerides of fatty acids
bg: емулгатор Моно- и диглицериди на мастни киселини, емулгатор e471
ca: emulsionant e471, emulsificant e471, emulgent e471
de: Emulgator Mono- und Diglyceride von Fettsäuren
es: emulsionante e471, emulsificante e471, emulgente e471
fi: emulgointiaine rasvahappojen mono- ja diglyseridit
fr: émulsifiant e471
hr: emulgator mono- i digliceridi masnih kiselina
it: emulsionante e471, emulsionante mono- e digliceridi degli acidi grassi
lt: riebalų rūgščių mono- ir digliceridai, emulsiklis riebalų rūgščių mono- ir digliceridai

< en:E471
en: mono and diglycerides of fatty acids of vegetable origin
bg: моно- и диглицериди на мастни киселини от растителен произход
@@ -15389,7 +15357,7 @@ el: έλαια και λίπη
es: aceites y grasas, materia grasa
et: õlid ja rasvad
fi: öljy ja rasva, öljyt ja rasvat
fr: huiles et graisses, matières grasses
fr: huiles et graisses, matières grasses, huile et graisse, matière grasse
he: שמן ושומן
hr: ulja i masti
hu: Olajok és zsírok, olaj és zsír
@@ -15633,7 +15601,7 @@ et: rasvad
eu: koipe
fa: چربی
fi: rasva, rasvat
fr: graisse, graisse alimentaire
fr: matière grasse, matières grasses, graisse, graisse alimentaire, graisses alimentaires
ga: saill
gl: greix
gu: ચરબી
@@ -16124,7 +16092,7 @@ el: φυτικά λίπη
es: grasas vegetales, grasa vegetal, grasa vegetal no hidrogenada, Manteca vegetal
et: taimne rasv, taimsed rasvad
fi: kasvirasva, kasvirasvat, kasvisrasva, kasvisrasvat, kasvisrasvoja, kasvirasvoja, kasvisrasvasta
fr: matière grasse végétale, matières grasses végétales, matière graisse végétale, graisse végétale, graisses végétales, grasse végétale
fr: matière grasse végétale, matières grasses végétales, matière grasse végétale, graisse végétale, graisses végétales, graisse végétale
he: שמנים צמחיים, שמן צמחי, שומן צמחי, שומן מן הצומח
hr: biljna mast, biljne masti, biljne mast, biljne masnoće, biljna masnoća
hu: növényi zsír, növényi zsírok
@@ -16165,7 +16133,7 @@ sv: vegetabilisk fettprodukt
# usage:fr:préparation de matières grasses végétales (graisse de palme, huile de colza, colorant : caroténoïdes)

< en:vegetable oil and fat
en: non hydrogenated vegetable oil and fat
en: non hydrogenated vegetable oil and fat, non hydrogenated vegetable oils and fats, vegetable oils and fats non hydrogenated
bg: нехидрогенирани растителни масла и мазнини
de: ungehärtete Pflanzenfette und Pflanzenöle, nicht hydrierte pflanzliche Fette und Öle
el: μη υδρογονωμένα φυτικά λίπη και έλαια
@@ -17389,7 +17357,7 @@ de: Kolzaöl, Kohlsaatöl
el: κραμβέλαιο
es: aceite de colza
fi: rypsiöljy
fr: huile de colza, huile végétale de colza, huile colza, graisses végétales de colza, graisses vegetales de colza, huiles végétales de colza
fr: huile de colza, huile végétale de colza, huile colza, graisses végétales de colza, graisses vegetales de colza, huiles végétales de colza, huile de graines de colza, huile de graine de colza
hr: uljana repica, uljane repice, ulje uljane repice, ulje repice
hu: Repceolaj
it: olio di colza, olio vegetale di colza
@@ -18049,7 +18017,7 @@ wikipedia:en: https://en.wikipedia.org/wiki/Illipe
en: sunflower fat
de: Sonnenblumenfett
fi: auringonkukkarasva
fr: Matière grasse issue du tournesol
fr: Matière grasse issue du tournesol, matière grasse de tournesol
hr: suncokretova mast
hu: napraforgózsír
pl: tłuszcz słonecznikowy, tłuszcz roślinny słonecznikowy, tłuszcze roślinne słonecznikowy
@@ -18194,7 +18162,7 @@ es: aceite de linaza
et: Linaõli
fa: روغن بزرک
fi: pellavaöljy, pellavansiemenöljy, liinaöljy
fr: huile de lin
fr: huile de lin, huile végétale de lin
fy: Lynoalje
hr: laneno ulje
hu: lenmagolaj, lenolaj
@@ -30784,14 +30752,6 @@ hr: ksantan guma i guar guma
hu: xantángumi és guargumi
it: Gomma xanthan e gomma di guar

< en:gum arabic
de: Verdickungsmittel Gummi arabicum
fr: épaississant gomme arabique
it: addensante gomma arabica
# ingredient/fr:épaississant-gomme-arabique has 47 products in 4 languages @2019-03-02
# stabilisant (gomme arabique)


##################################################################################
#
# SUGAR
@@ -37589,9 +37549,6 @@ zh: 黄原胶
wikidata:en: Q410768
wikipedia:en: https://en.wikipedia.org/wiki/Xanthan_gum

< en:E415
fr: épaississant e415

en: gellan gum, E418, E-418, E 418, INS418, INS-418, INS 418, gellan
ar: صمغ الجيلان
bg: Гума джелан