CSV export: add the main language of each product, and delete useless fields #9563

CharlesNepote · 2023-12-20T18:08:54Z

This issue continues the conversation from #2325 (which I think we could close).

Fields to add

`lang` to address the ingredients_text issue

The field ingredients_text is interesting because it corresponds to the main language of the product, so it is the most likely to be filled. It would not be useful to export ingredients_en, ingredients_fr, etc. because many of them would be empty. So in the past we chose:

to export the product main language's ingredients as ingredients_text
to also export the normalized version of ingredients_text: ingredients_tags, e.g. en:brown-sugar.

That said, there is no way to know what IS the main language for each product. Should we add either: lc, ingredients_lc or lang?
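
To make the problem concrete, here is a minimal Python sketch of what a reuser could do if a language column were exported next to ingredients_text. The column name `lang`, the tab delimiter and the sample rows are all assumptions for illustration; nothing here reflects the current export.

```python
import csv
import io

# Made-up sample rows; `lang` is the proposed (not yet exported) column.
SAMPLE = (
    "code\tlang\tingredients_text\n"
    "0001\tfr\tsucre roux, farine de blé\n"
    "0002\ten\tbrown sugar, wheat flour\n"
    "0003\t\twater\n"
)

def ingredients_by_language(tsv_text):
    """Group ingredient lists by the product's main language."""
    grouped = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Rows without a language are kept under "unknown" rather than dropped.
        language = row["lang"] or "unknown"
        grouped.setdefault(language, []).append(row["ingredients_text"])
    return grouped

groups = ingredients_by_language(SAMPLE)
```

Without such a column, the three ingredient lists above cannot be told apart by language without guessing from the text itself.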

`obsolete_since_date` field

Many producers tell us when products become obsolete. We should add this field to the CSV for several reasons:

reusers could filter those products
it would make it possible to monitor those products
etc.
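
As a rough illustration of the filtering use case, the sketch below splits rows on the proposed column. It assumes a tab-separated export, an `obsolete_since_date` column that is empty for non-obsolete products, and made-up sample rows.

```python
import csv
import io

# Made-up rows; `obsolete_since_date` is the column proposed above.
SAMPLE = (
    "code\tproduct_name\tobsolete_since_date\n"
    "0001\tOld biscuits\t2022-05-01\n"
    "0002\tCurrent biscuits\t\n"
)

def split_obsolete(tsv_text):
    """Return (active, obsolete) product rows based on the date column."""
    active, obsolete = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # An empty value is taken to mean "not reported as obsolete".
        (obsolete if row["obsolete_since_date"] else active).append(row)
    return active, obsolete

active, obsolete = split_obsolete(SAMPLE)
```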

`rev` field

This field represents the number of revisions of a product. As it is short, it's not very costly. It would allow us to:

monitor whether there is significant activity on some products
use rev as a proxy for popular products
identify products with only one rev
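
The three uses listed above could look like this in practice. This is only a sketch: the (code, rev) pairs are invented and stand in for two columns of the export.

```python
from collections import Counter

# Made-up (code, rev) pairs standing in for two CSV columns.
products = [("0001", 1), ("0002", 42), ("0003", 1), ("0004", 7)]

# Products never edited after creation.
single_rev = [code for code, rev in products if rev == 1]

# Most-edited products first, as a rough popularity proxy.
most_edited = sorted(products, key=lambda p: p[1], reverse=True)[:2]

# Distribution of revision counts, useful for spotting unusual activity.
rev_histogram = Counter(rev for _, rev in products)
```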

`unknown_ingredients_n`

This would allow us to better investigate how to improve and prioritize ingredient quality.

The number of photos

It is a good proxy for a product's popularity. It can also indicate whether the product has a good chance of being fixed.
It also makes it possible to monitor products with new photos, for example those whose photos have not been selected.
As the field is just a number, it isn't too costly.

Useless fields

On the other hand, we should try not to modify the CSV too often. So I would be in favor of deleting useless fields at the same time:

countries: this field can mix data in different languages, so it's better to rely on countries_tags (e.g. en:united-kingdom), which is a normalized version of countries.
countries_en (e.g. United Kingdom) is there for convenience, but we could also remove it.
The same questions apply to all the taxonomized fields:
- categories and categories_en
- labels and labels_en
- packaging and packaging_en
- origins and origins_en
- traces and traces_en
- additives and additives_en
- food_groups and food_groups_en: is food_groups always normalized?
- states and states_en: same remark as food_groups
  For food_groups and states, at least: states and states_tags are almost identical; the only difference is that states contains spaces.
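
One way to check that claim on a real export would be to compare the two columns after stripping whitespace. The sketch below uses hypothetical values, not ones taken from an actual export, and the helper name is mine.

```python
def same_up_to_spaces(raw_value, tags_value):
    """True when two field values differ only by whitespace."""
    return "".join(raw_value.split()) == "".join(tags_value.split())

# Hypothetical values, not taken from a real export.
states = "en:to-be-completed, en:photos-uploaded"
states_tags = "en:to-be-completed,en:photos-uploaded"

redundant = same_up_to_spaces(states, states_tags)
```

Running such a check over every row would show whether one of the two columns can be dropped safely.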

Exceptions

Some fields in the CSV don't have an _en equivalent.

manufacturing_places: we have manufacturing_places_tags but we don't have manufacturing_places_en
The same goes for emb_codes, cities and allergens.
Should we only keep the _tags fields?

Curiously, we have main_category and main_category_en, but not main_category_tag.

Redundant date fields

Should we also remove all the fields ending with a _t (unix epoch format), redundant with _datetime fields?
60 Mb are lost due to this redundancy.
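
To show why the two families of fields carry the same information, here is a small sketch that converts in both directions. It assumes the _datetime fields are ISO 8601 strings in UTC; the names `created_t` and `created_datetime` and the timestamp are only illustrative.

```python
from datetime import datetime, timezone

def epoch_to_iso(epoch_seconds):
    """Render a Unix timestamp as an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")

def iso_to_epoch(iso_text):
    """Recover the Unix timestamp from the ISO 8601 UTC string."""
    moment = datetime.strptime(iso_text, "%Y-%m-%dT%H:%M:%SZ")
    return int(moment.replace(tzinfo=timezone.utc).timestamp())

created_t = 1535370936
created_datetime = epoch_to_iso(created_t)  # "2018-08-27T11:55:36Z"
```

Since either value can be rebuilt from the other, reusers who need the epoch form could compute it themselves from the _datetime column.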


benbenben2 · 2024-01-03T16:42:32Z

I see that @export_fields in Config_off.pm contains the first fields you mentioned.

For the number of photos, in the API result we have something like this:

"images":{"1":{"sizes":{"100":{"h":40,"w":100},"400":{"h":161,"w":400},"full":{"h":1200,"w":2984}},"uploaded_t":1535370936,"uploader":"kiliweb"},"10":{"sizes":{"100":{"h":22,"w":100},"400":{"h":87,"w":400},"full":{"h":658,"w":3024}},"uploaded_t":1610373128,"uploader":"kiliweb"},"11":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335243,"uploader":"moon-rabbit"},"12":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335252,"uploader":"openfoodfacts-contributors"},"13":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335260,"uploader":"openfoodfacts-contributors"},"14":{"sizes":{"100":{"h":100,"w":56},"400":{"h":400,"w":225},"full":{"h":1280,"w":720}},"uploaded_t":1685811820,"uploader":"insectproductadd"},"15":{"sizes":{"100":{"h":17,"w":100},"400":{"h":68,"w":400},"full":{"h":330,"w":1949}},"uploaded_t":1693411520,"uploader":"mismer"},"16":{"sizes":{"100":{"h":100,"w":89},"400":{"h":400,"w":356},"full":{"h":671,"w":597}},"uploaded_t":1693411555,"uploader":"mismer"},"2":{"sizes":{"100":{"h":22,"w":100},"400":{"h":87,"w":400},"full":{"h":562,"w":2583}},"uploaded_t":1535370948,"uploader":"kiliweb"},"3":{"sizes":{"100":{"h":50,"w":100},"400":{"h":199,"w":400},"full":{"h":2431,"w":4896}},"uploaded_t":1538851611,"uploader":"anticultist"},"4":{"sizes":{"100":{"h":36,"w":100},"400":{"h":146,"w":400},"full":{"h":1582,"w":4338}},"uploaded_t":1538851798,"uploader":"anticultist"},"5":{"sizes":{"100":{"h":22,"w":100},"400":{"h":88,"w":400},"full":{"h":924,"w":4213}},"uploaded_t":1538851824,"uploader":"anticultist"},"6":{"sizes":{"100":{"h":40,"w":100},"400":{"h":160,"w":400},"full":{"h":481,"w":1200}},"uploaded_t":1547153415,"uploader":"twoflower"},"7":{"sizes":{"100":{"h":42,"w":100},"400":{"h":169,"w":400},"full":{"h":508,"w":1200}},"uploaded_t":1547153419,"uploader":"twoflower"},"8":{"sizes":{"100":{"h":22,"w":100},"400":{"h":88,"w":400},"full":{"h":264,"w":1200}},"uploaded_t":1547153424,"uploader":"twoflower"},"9":{"sizes":{"100":{"h":40,"w":100},"400":{"h":161,"w":400},"full":{"h":1200,"w":2984}},"uploaded_t":1610373126,"uploader":"kiliweb"},

Not sure if we can do a count of this
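
For what it's worth, a count looks feasible once the object is parsed. The Python sketch below works on a shortened copy of the excerpt above and counts only the numeric keys. Whether numeric keys are exactly the uploaded photos is my assumption, and the helper name `count_photos` is hypothetical; the real export code is Perl, so this only shows the idea.

```python
import json

# Shortened excerpt of the "images" object quoted above (three entries kept).
API_EXCERPT = """
{"images": {
  "1": {"uploaded_t": 1535370936, "uploader": "kiliweb"},
  "2": {"uploaded_t": 1535370948, "uploader": "kiliweb"},
  "10": {"uploaded_t": 1610373128, "uploader": "kiliweb"}
}}
"""

def count_photos(product):
    """Count entries of `images` whose key is a plain number."""
    images = product.get("images", {})
    # Assumption: numeric keys are uploaded photos; any other key is skipped.
    return sum(1 for key in images if key.isdigit())

photo_count = count_photos(json.loads(API_EXCERPT))
```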

Not everything is in this Config_off.pm variable. For example, if we export a CSV, we get some columns like "packaging_1_number_of_units" or "packaging_1_shape". It is not clear to me where these are defined.

Also, I do not see all these "countries_en", "categories_en", etc.

Maybe I am not looking at the same csv...

CharlesNepote added the labels Data export (We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data), 🧽 Data quality (https://wiki.openfoodfacts.org/Quality) and CSV exports Dec 21, 2023

CharlesNepote added this to 🧽 Ensuring Data Quality Dec 21, 2023

github-project-automation bot moved this to To do in 🧽 Ensuring Data Quality Dec 21, 2023

teolemon added this to 🍊 Open Food Facts Server issues Apr 4, 2024

teolemon moved this to To discuss and validate in 🍊 Open Food Facts Server issues Apr 23, 2024
