CSV export: add the main language of each product, and delete useless fields #9563

Open
CharlesNepote opened this issue Dec 20, 2023 · 1 comment
Labels
CSV exports Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data 🧽 Data quality https://wiki.openfoodfacts.org/Quality

CharlesNepote commented Dec 20, 2023

This issue is continuing the conversation of #2325 (I think we could close it).

Fields to add

lang to address the ingredients_text issue

The field ingredients_text is interesting because, as it corresponds to the main language of the product, it is the most likely to be filled. It would not be useful to export ingredients_en, ingredients_fr, etc. because many of them would be empty. So in the past we chose:

  • to export the product main language's ingredients as ingredients_text
  • to also export the normalized version of ingredients_text: ingredients_tags (e.g. en:brown-sugar).

That said, there is no way to know what the main language of each product is. Should we add one of lc, ingredients_lc or lang?
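To illustrate why a language column matters to reusers, here is a minimal Python sketch. It assumes a hypothetical `lang` column has been added to the export (one of the candidate names above, not an existing field) and that the file is tab-separated; the sample rows are made up.

```python
import csv
import io

# Made-up sample rows; `lang` is a hypothetical column, not part of the current export.
SAMPLE = (
    "code\tlang\tingredients_text\n"
    "0001\tfr\tsucre roux, farine de blé\n"
    "0002\ten\tbrown sugar, wheat flour\n"
    "0003\tfr\teau, sel\n"
)

def ingredients_by_language(tsv_text):
    """Group ingredients_text values by the product's main language."""
    groups = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        groups.setdefault(row["lang"], []).append(row["ingredients_text"])
    return groups

by_lang = ingredients_by_language(SAMPLE)
print(sorted(by_lang))      # languages present in the sample
print(len(by_lang["fr"]))   # number of French ingredient lists
```

Without such a column, a reuser has to guess the language of each ingredients_text value before processing it.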

obsolete_since_date field

Many producers send us information when products become obsolete. We should add it to the CSV for several reasons:

  • reusers could filter those products
  • it would be possible to monitor those products
  • etc.

rev field

This field represents the number of revisions of a product. As it is short, it's not very costly. It would allow us to:

  • monitor whether there is significant activity on some products
  • use rev as a proxy for popular products
  • identify products with only one rev

unknown_ingredients_n

This would allow us to better investigate how to improve and prioritize ingredient quality.

The number of photos

It is a good proxy for a product's popularity. It can also indicate whether the product has a good chance of being fixed.
It also allows monitoring products with new photos: for example, those where photos are not selected.
As the field is just a number, it isn't too costly.

Useless fields

On the other hand, we should try not to modify the CSV too often. So I would be in favor of deleting useless fields at the same time:

  • countries: this field can mix data in different languages; it's better to rely on countries_tags (e.g. en:united-kingdom), which is a normalized version of countries.
  • countries_en (e.g. United Kingdom) is here for comfort. But we could also remove it.
  • the same questions are relevant for all the taxonomized fields:
    • categories and categories_en
    • labels and labels_en
    • packaging and packaging_en
    • origins and origins_en
    • traces and traces_en
    • additives and additives_en
    • food_groups and food_groups_en: is food_groups always normalized?
    • states and states_en: same remark as food_groups
      For food_groups and states, at least: states and states_tags are almost identical; the only difference is that states contains spaces.
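As a rough illustration of why the normalized field is easier to reuse than the free-text one, here is a small Python sketch. It assumes the normalized country column is named `countries_tags` and holds comma-separated tags; the sample row is made up.

```python
def split_tags(field_value):
    """Split a comma-separated tags field into a list of tags."""
    return [tag.strip() for tag in field_value.split(",") if tag.strip()]

def sold_in(row, country_tag):
    """Return True if the row's normalized country tags include country_tag."""
    return country_tag in split_tags(row.get("countries_tags", ""))

# Made-up row: the free-text field mixes languages, the tags field does not.
row = {
    "countries": "UK, Frankreich",
    "countries_tags": "en:united-kingdom,en:france",
}
print(sold_in(row, "en:united-kingdom"))
```

Matching on the free-text countries value would require handling every spelling and language, whereas the tags can be compared directly.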

Exceptions

Some fields in the CSV don't have an _en equivalent.

  • manufacturing_places: we have manufacturing_places_tags but we don't have manufacturing_places_en
  • the same goes for emb_codes, cities, allergens.
    Should we only keep the _tags fields?

Curiously, we have main_category and main_category_en, but not main_category_tag.

Redundant date fields

Should we also remove all the fields ending in _t (Unix epoch format), which are redundant with the _datetime fields?
60 Mb are lost due to this redundancy.
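To show that one format can be derived from the other, here is a short Python sketch converting between a Unix epoch and an ISO 8601 UTC string. The exact string format used by the _datetime columns (`YYYY-MM-DDTHH:MM:SSZ`) is an assumption here; the epoch value is taken from the uploaded_t of an image quoted later in this thread.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed format of the _datetime columns

def epoch_to_iso(epoch_seconds):
    """Convert a Unix epoch (seconds) to an ISO 8601 UTC string."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime(ISO_FORMAT)

def iso_to_epoch(iso_string):
    """Convert an ISO 8601 UTC string back to a Unix epoch (seconds)."""
    dt = datetime.strptime(iso_string, ISO_FORMAT)
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

example_t = 1535370936
example_datetime = epoch_to_iso(example_t)
print(example_datetime)
print(iso_to_epoch(example_datetime) == example_t)
```

Since the conversion round-trips without loss, reusers who need the epoch value could recompute it from the _datetime column.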

@CharlesNepote CharlesNepote added Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data 🧽 Data quality https://wiki.openfoodfacts.org/Quality CSV exports labels Dec 21, 2023
@benbenben2
Collaborator

I see @export_fields in Config_off.pm, which contains the first fields that you mentioned.

For the number of photos, in the API result we have something like this:

"images":{"1":{"sizes":{"100":{"h":40,"w":100},"400":{"h":161,"w":400},"full":{"h":1200,"w":2984}},"uploaded_t":1535370936,"uploader":"kiliweb"},"10":{"sizes":{"100":{"h":22,"w":100},"400":{"h":87,"w":400},"full":{"h":658,"w":3024}},"uploaded_t":1610373128,"uploader":"kiliweb"},"11":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335243,"uploader":"moon-rabbit"},"12":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335252,"uploader":"openfoodfacts-contributors"},"13":{"sizes":{"100":{"h":75,"w":100},"400":{"h":299,"w":400},"full":{"h":2992,"w":4000}},"uploaded_t":1660335260,"uploader":"openfoodfacts-contributors"},"14":{"sizes":{"100":{"h":100,"w":56},"400":{"h":400,"w":225},"full":{"h":1280,"w":720}},"uploaded_t":1685811820,"uploader":"insectproductadd"},"15":{"sizes":{"100":{"h":17,"w":100},"400":{"h":68,"w":400},"full":{"h":330,"w":1949}},"uploaded_t":1693411520,"uploader":"mismer"},"16":{"sizes":{"100":{"h":100,"w":89},"400":{"h":400,"w":356},"full":{"h":671,"w":597}},"uploaded_t":1693411555,"uploader":"mismer"},"2":{"sizes":{"100":{"h":22,"w":100},"400":{"h":87,"w":400},"full":{"h":562,"w":2583}},"uploaded_t":1535370948,"uploader":"kiliweb"},"3":{"sizes":{"100":{"h":50,"w":100},"400":{"h":199,"w":400},"full":{"h":2431,"w":4896}},"uploaded_t":1538851611,"uploader":"anticultist"},"4":{"sizes":{"100":{"h":36,"w":100},"400":{"h":146,"w":400},"full":{"h":1582,"w":4338}},"uploaded_t":1538851798,"uploader":"anticultist"},"5":{"sizes":{"100":{"h":22,"w":100},"400":{"h":88,"w":400},"full":{"h":924,"w":4213}},"uploaded_t":1538851824,"uploader":"anticultist"},"6":{"sizes":{"100":{"h":40,"w":100},"400":{"h":160,"w":400},"full":{"h":481,"w":1200}},"uploaded_t":1547153415,"uploader":"twoflower"},"7":{"sizes":{"100":{"h":42,"w":100},"400":{"h":169,"w":400},"full":{"h":508,"w":1200}},"uploaded_t":1547153419,"uploader":"twoflower"},"8":{"sizes":{"100":{"h":22,"w":100},"400":{"h":88,"w":400},"full":{"h":264,"w":1200}},"uploaded_t":1547153424,"uploader":"twoflower"},"9":{"sizes":{"100":{"h":40,"w":100},"400":{"h":161,"w":400},"full":{"h":1200,"w":2984}},"uploaded_t":1610373126,"uploader":"kiliweb"},

I'm not sure whether we can count these entries.
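One possible way to count them, sketched in Python rather than in the export code itself: count the keys of the images object that are purely numeric. This assumes that uploaded photos are keyed by a number (as in the excerpt above) and that any other keys, such as the made-up `front_fr` entry below, refer to selected images and should not be counted.

```python
import json

# Shortened excerpt of the structure above; "front_fr" is an assumed
# non-numeric key added only to show that such entries are skipped.
product_json = """
{"images": {
  "1": {"uploaded_t": 1535370936, "uploader": "kiliweb"},
  "2": {"uploaded_t": 1535370948, "uploader": "kiliweb"},
  "10": {"uploaded_t": 1610373128, "uploader": "kiliweb"},
  "front_fr": {"imgid": "1"}
}}
"""

def count_uploaded_images(product):
    """Count entries of product["images"] whose key is a plain number."""
    images = product.get("images") or {}
    return sum(1 for key in images if key.isdigit())

product = json.loads(product_json)
print(count_uploaded_images(product))
```

A product without an images object simply yields a count of zero.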

Not everything is in this Config_off.pm variable. For example, if we export a CSV, we have some columns like "packaging_1_number_of_units" or "packaging_1_shape". It is not clear to me where these are defined.

Also, I do not see all these "countries_en", "categories_en", etc.

Maybe I am not looking at the same csv...

Projects
Status: To discuss and validate
Status: To do