Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split es locale into several variations and migrate existing user translations to es-ES #22982

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

diox
Copy link
Member

@diox diox commented Jan 10, 2025

See mozilla/addons#15232

Context

As part of streamlining our locales we're splitting es into several variations. The Pontoon team have asked us to copy over the existing locale to the new ones and they'll sort them (either as actual translations or suggestions) in Pontoon. That makes the diff for this PR rather large and scary since I've copied the existing .po file into several new ones. You can diff them with the es one to make sure I haven't messed it up and then ignore them.

Testing

Locally, you need to either bypass addons-frontend in nginx.conf (by replacing the try_files $uri @frontendamo by try_files $uri @olympia) or build a custom addons-frontend image with mozilla/addons-frontend#13401 otherwise it won't let you use the new locales, as in local development we're proxying requests through addons-frontend instead of fully defining routes in nginx.

The es locale should no exist anymore, and be replaced by the new variants.

Migrations

This change contains 3 migrations that will affect hundreds of thousands of rows in dev/stage/prod. I've tested them locally with 700k+ rows and the biggest one took less than 10 seconds.

@diox diox force-pushed the create-es-locales-folders branch from b0db161 to 60ef7a2 Compare January 10, 2025 14:13
@diox diox marked this pull request as draft January 10, 2025 16:50
@diox diox changed the title Add new locales folders for the various es variants Split es locale into several variations and migrate existing user translations to es-ES Jan 20, 2025
@diox diox marked this pull request as ready for review January 21, 2025 12:13
@diox diox requested review from a team and KevinMind and removed request for a team January 21, 2025 12:41
@diox diox force-pushed the create-es-locales-folders branch from 35a6260 to cf0de0a Compare January 21, 2025 12:42
@@ -161,7 +161,7 @@ def test_multiple_objects_with_multiple_translations(self):
assert set(addon2.translations[addon2.name_id]) == (
{
('en-us', 'English 2 Name'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh god how I wish we had enums for locales...

@@ -279,6 +279,7 @@ def fetch_single_translation(self, obj, field, requested_language):
translations = self.fetch_all_translations(obj, field) or {}
locale = None
value = None
requested_language = to_language(requested_language)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear why this change is here.. could you add some context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug in the way we fetch translations coming from ES, it did not convert the requested language like we do when the translations are coming from the database.

It was revealed by the tests when I changed them from es to es-ES - it would only occur when we're fetching a language with a "regional" variant (xx-XX) that is not the default locale for the add-on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have an explicit test for this edge case then 1) to be damn sure it works and 2) to document the edge case itself.

Copy link
Member Author

@diox diox Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug a bit more into what was happening here:

The broken test was TestESTranslationSerializerField.test_attach_translations_target_name(), which is meant to verify how we attach the translation dict of data coming from Elasticsearch to an object using ESTranslationSerializerField.attach_translations(). The test attaches the translations then verifies that the dummy Addon object has both a correct <field>_translations and a <field> containing the
translation in the current locale.

The <field>_translations dict was correct, only that second part was broken. The reason we never noticed is that it's not actively used in our ESAddonSerializer right now: when we perform a search and call the serializer, we call attach_translations() but only really use the first part of what it does, setting <field>_translations. Even when a lang parameter is passed and we return a single translation, we still only use that <field>_translations dict.

Why does that second part exist then ? The way our ESAddonSerializer works, it creates fake Addon (and Version, Preview, License etc instances) but then shares most of its code with the AddonSerializer that deals with database instances. Setting the translated fields correctly ensures we don't accidentally trigger a database query or return bad data if some piece of code somewhere ends up directly using the field.

Ultimately the test was previously incorrect as it was setting up bad data - the translations dict keys were incorrect. I'll add some comments and make the test more explicit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

49ce11d adds even more tests/docs.

#
msgid ""
msgstr ""
"Project-Id-Version: messages\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the rationale that it is easier for translators to modify the original spanish translation to variants than to start from scratch?

Also.. it's weird.. it's like did we support the different spanish variants already or is this introducing them. On the one hand, you're adding the locale files ... so they weren't there.. but onb the other, there was already reference to those locales ... can you explain that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were no references to the variants before, I'm adding them. Translator teams have expressed different preferences regarding what to do with the files, but the conclusion we reached when talking to @mathjazz (Pontoon lead) was to "seed" the new locales by copying the old file over and they'd sort them out in pontoon later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Probably worth dumping that in the "context" on the PR for posterity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Probably worth dumping that in the "context" on the PR for posterity.

FYI there was a reference to the language but I guess not sufficient enough to trigger creating a translation file for that locale.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@diox diox requested a review from KevinMind January 23, 2025 13:40
@KevinMind
Copy link
Contributor

@diox how did you create 700k+ rows? Even better, you can zip your database and send it to me by running

make data_dump ARGS="--name pr-22982"

Then zip the directory and send it to me. I can then load that directly on my instance and boom!

@diox
Copy link
Member Author

diox commented Jan 24, 2025

I'll generate the dump, but for the record, I essentially did:

translations = []
for x in range(0, 700000):
    t = Translation.new('Lorem ipsum dolor sit amet, erat graece accusata eum te', 'es')
    translations.append(t)
Translation.objects.bulk_create(translations, batch_size=1000)

(Translation.new() is a custom hack in addons-server used behind the scenes when creating translations "the normal way", that is there to maintain the separate id sequence table. I've tried to reproduce that behavior here while still being efficient with a bulk_create() after, but I haven't checked if it completely works as expected in very aspect, I was only interested in creating garbage data)

@diox
Copy link
Member Author

diox commented Jan 24, 2025

Updated, and on top of providing the snippet I used to generate extra data locally I've shared my database dump privately on slack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants