CSV export: add the main language of each product, and delete useless fields #9563
Labels:
- CSV exports
- Data export: We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data
- 🧽 Data quality: https://wiki.openfoodfacts.org/Quality
This issue is continuing the conversation of #2325 (I think we could close it).
Fields to add
lang to address the ingredients_text issue
The field ingredients_text is interesting because, as it corresponds to the main language of the product, it is the most likely to be filled. It would not be useful to export ingredients_en, ingredients_fr, etc. because many of them would be empty. So in the past we have chosen:
- ingredients_text
- ingredients_tags. Eg. en:brown-sugar.
That said, there is no way to know what IS the main language for each product. Should we add either lc, ingredients_lc or lang?
obsolete_since_date field
Many producers are sending us information when products are obsolete. We should add it to the CSV for many reasons:
rev field
This field represents the number of revisions of a product. As it is short, it's not very costly. It would allow us to:
unknown_ingredients_n
This would allow us to better investigate how to improve/prioritize ingredients' quality.
The number of photos
It is a good proxy for a product's popularity. It can also be a way to know whether the product has a good chance of being fixed.
It also allows us to monitor products with new photos: for example the ones where photos are not selected.
As the field is just a number, it isn't too costly.
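
A minimal sketch of how the proposed columns could be used together to prioritize fixes, assuming a tab-separated export. The sample rows are hand-written, and `photos_n` is a hypothetical name for the photo count column (the issue does not name it); the other column names are the ones proposed above.

```python
import csv
import io

# Hand-written sample, NOT real export data.
# `photos_n` is a hypothetical column name for the number of photos.
SAMPLE = (
    "code\tlang\trev\tunknown_ingredients_n\tphotos_n\tobsolete_since_date\n"
    "001\tfr\t12\t3\t4\t\n"
    "002\ten\t2\t7\t0\t\n"
    "003\tde\t5\t7\t2\t\n"
    "004\ten\t9\t1\t6\t2023-05-01\n"
)


def prioritize(rows):
    """Drop obsolete products, then sort by most unknown ingredients,
    breaking ties by most photos (more photos: better chance of a fix)."""
    active = [row for row in rows if not row["obsolete_since_date"]]
    return sorted(
        active,
        key=lambda row: (-int(row["unknown_ingredients_n"]), -int(row["photos_n"])),
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
ranked = prioritize(rows)
for row in ranked:
    print(row["code"], row["lang"], row["unknown_ingredients_n"], row["photos_n"])
```

The sort key is only one possible choice; `rev` could be used in the same way as another popularity signal.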
Useless fields
On the other hand, we should try not to modify the CSV too often. So I would be in favor of deleting useless fields at the same time:
- countries: this field can mix data in different languages; it's better to rely on countries_tags (eg. en:united-kingdom), which is a normalized version of the countries. countries_en (eg. United Kingdom) is here for comfort. But we could also remove it.
- categories and categories_en
- labels and labels_en
- packaging and packaging_en
- origins and origins_en
- traces and traces_en
- additives and additives_en
- food_groups and food_groups_en: food_groups is always in a normalized way??
- states and states_en: same remark as food_groups
For food_groups and states, at least, states and states_tags are almost identical; the only difference is that states contains spaces.
Exceptions
Some fields in the CSV don't have an _en equivalent:
- manufacturing_places: we have manufacturing_places_tags but we don't have manufacturing_places_en
- emb_codes, cities, allergens.
Should we only keep the _tags fields?
Curiously, we have main_category and main_category_en, but not main_category_tag.
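
To audit which fields come with raw, _tags and _en variants, the CSV header can be grouped by base name. This is a minimal sketch: the header below is a hand-written sample built from the fields mentioned in this issue, not the actual export header, and `group_variants` is a hypothetical helper.

```python
from collections import defaultdict

SUFFIXES = ("_tags", "_en")
ALL_VARIANTS = {"raw", "_tags", "_en"}


def group_variants(columns):
    """Map each base field name to the set of variants present in the header."""
    groups = defaultdict(set)
    for name in columns:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                groups[name[: -len(suffix)]].add(suffix)
                break
        else:
            # No known suffix: treat the column as the raw field itself.
            groups[name].add("raw")
    return dict(groups)


# Hand-written sample header, NOT the real export header.
sample_header = [
    "countries", "countries_tags", "countries_en",
    "manufacturing_places", "manufacturing_places_tags",
    "main_category", "main_category_en",
]

variants = group_variants(sample_header)
for base, present in sorted(variants.items()):
    missing = sorted(ALL_VARIANTS - present)
    print(base, "is missing:", ", ".join(missing) if missing else "nothing")
```

Running the same grouping on the real header would give a complete list of exceptions before deciding which variant to keep.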
Redundant date fields
Should we also remove all the fields ending with _t (unix epoch format), which are redundant with the _datetime fields? 60 MB are lost due to this redundancy.
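
The redundancy can be illustrated with a short sketch: either field can be derived from the other. It assumes the _datetime fields hold ISO 8601 UTC strings such as 2012-03-21T14:30:00Z and uses created_t / created_datetime as an example pair; the sample row is hand-written.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed format of the _datetime fields


def epoch_to_iso(epoch_seconds):
    """Render a Unix timestamp (seconds) as an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return moment.strftime(ISO_FORMAT)


def iso_to_epoch(iso_string):
    """Parse an ISO 8601 UTC string back into a Unix timestamp (seconds)."""
    parsed = datetime.strptime(iso_string, ISO_FORMAT)
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# Hand-written sample row, NOT real export data.
row = {"created_t": "1332340200", "created_datetime": "2012-03-21T14:30:00Z"}

# If both checks hold, one of the two columns carries no extra information.
print(epoch_to_iso(row["created_t"]) == row["created_datetime"])
print(iso_to_epoch(row["created_datetime"]) == int(row["created_t"]))
```

Consumers who need the epoch value could recompute it this way if the _t columns were dropped.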