
Add natural translation for DSL #574

Open
wants to merge 32 commits into
base: master

Conversation

BrentBlanckaert
Collaborator

@BrentBlanckaert BrentBlanckaert commented Dec 11, 2024

You can run the pre-processor by using

python -m tested.nat_translation ./exercise/simple-example/program_language_map/suite.yaml en # English translation

@pdawyndt
Contributor

Maybe we could support translations of a testplan (and rollout of templates?) as

python -m tested.translate <testplan>

with an extra option (or argument) to pass the natural language for the translation.

@BrentBlanckaert
Collaborator Author

BrentBlanckaert commented Dec 12, 2024

Maybe we could support translations of a testplan (and rollout of templates?) as

python -m tested.translate <testplan>

with an extra option (or argument) to pass the natural language for the translation.

This should work.

@BrentBlanckaert
Collaborator Author

@pdawyndt in #559 it also says that translations for files should be provided. In what sense?
I've got something like the following:

- files: !natural_language
    en:
      - name: "file.txt"
        url: "media/workdir/file.txt"
    nl:
      - name: "fileNL.txt"
        url: "media/workdir/fileNL.txt"

This seems pointless since I could also just do:

- files:
  - name: "file.txt"
    url: "media/workdir/file.txt"
  - name: "fileNL.txt"
    url: "media/workdir/fileNL.txt"

@pdawyndt
Contributor

Not really pointless as TESTed will show all "linked files" to the students. For each file, TESTed will try to find its name in the expression/statement and then turn that into a hyperlink. If it doesn't find the filename, it will add it to a list of files that is displayed for the testcase.

@BrentBlanckaert
Collaborator Author

BrentBlanckaert commented Dec 13, 2024

@pdawyndt, I started looking into adding a translation table like

translation:
  animal:
    en: "animal"
    nl: "dier"
  result:
    en: "result"
    nl: "resultaat"

Is it even useful to then also add support in a statement like the following:

- statement: !natural_language
   en: 'result = Trying(10, "{animal}")'
   nl: 'resultaat = Proberen(10, "{animal}")'

I would suggest not searching any deeper once a natural_language map has been found, and only using the translation map when the expected type (like a string) is given.

@pdawyndt
Contributor

pdawyndt commented Dec 13, 2024

I definitely have many exercises where this (the combination of translation and template strings) is useful. So I would say yes. If we use Python format strings, then we could even write your example as

- statement: !natural_language
   en: 'result = Trying(10, {animal!r})'
   nl: 'resultaat = Proberen(10, {animal!r})'

And not even bother about using single or double quotes or escaping any quotes in the thing you put in the placeholders (which otherwise adds a lot of complication on the side of the DSL-author).

If you have a variable statement pointing to the format string for the statement, a dictionary translation containing the merged translation from the DSL hierarchy and a dictionary data containing the testcase data, turning the template string into the actual string (by filling up the placeholders) would then come down to

statement = statement.format(**translation, **data)

If we also allow data to be a YAML array instead of a YAML map (positional instead of named placeholders), then formatting is done by

statement = statement.format(*data, **translation)

For example, if data = [3, 4, 7] then we could have

'{} +  {} = {}'

or with explicit positions (which would also allow reordering and reusing the array values)

'{0} +  {1} = {2}'
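
As a minimal sketch of the two formatting calls above, using only Python's built-in str.format: the helper name fill and the example values are made up for illustration and are not part of TESTed.

```python
def fill(template: str, translation: dict, data) -> str:
    """Fill the placeholders of a template string.

    `data` may be a mapping (named placeholders) or a sequence
    (positional placeholders); `translation` is always a mapping.
    """
    if isinstance(data, dict):
        return template.format(**translation, **data)
    return template.format(*data, **translation)


translation = {"animal": "dier"}

# {animal!r} inserts repr("dier"), so the quotes come for free.
print(fill("resultaat = Proberen(10, {animal!r})", translation, {}))
# Positional placeholders with implicit and explicit (reordered) positions.
print(fill("{} + {} = {}", translation, [3, 4, 7]))
print(fill("{2} = {0} + {1}", translation, [3, 4, 7]))
```

Note that the named variant raises a TypeError when the same key occurs in both translation and data, so the two maps would need distinct keys (or an explicit merge) in practice.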

@BrentBlanckaert
Collaborator Author

BrentBlanckaert commented Dec 15, 2024

Currently I've implemented support for !natural_language and a translation map you can define globally, in a tab and in a context. Here is a quick rundown of everything that is possible:

The translation map looks like the following:

translation:
  animal:
    en: "animals"
    nl: "dieren"
  result:
    en: "results"
    nl: "resultaten"

This can be defined

  • Next to the tabs (globally)
  • In a tab
  • In a context

The !natural_language map can be defined in the following ways:

In a tab

  • If tab (the name) is a dict, it means it's a !natural_language map where using !natural_language is not necessary.
    • After the translation of the !natural_language map, it is assumed that the name will be a string. This will then be formatted using the translation maps.

In a testcase

  • For a statement or expression using !natural_language is mandatory.
    • If it is there it'll first perform the translation.
    • After that, it'll check if it's a dict. If it is, then we do formatting based on the translation maps on each value.
    • If it's a string we just perform formatting on that.
  • When a stdin is a dict it is assumed that it's a !natural_language map. So using !natural_language is not necessary.
    • From this dict a translation is performed.
    • The result of that should be a string, which is always formatted even if stdin wasn't a dict.
  • For arguments the same holds as stdin except that the result will be a list and formatting is performed on each item.
  • stderr, exception and stdout follow the same structure:
    • The usage of !natural_language is mandatory. If it's there we'll do the translation. If the result is a string, it will be formatted.
    • If a dict remains, we'll look at the "data" key ("message" for exception)
      • Check if that is a dict:
      • If it is, perform translation (no !natural_language needed).
    • The value of "data" should be a string or should be one after the translation. That string is formatted.
  • For files I've only added support for usage of !natural_language. No formatting is done.
  • If the return is an Oracle:
    • We look at the arguments and do the exact same as specified before.
    • After that we look at the value. If translation is done, it's mandatory to use !natural_language. That translation will turn it into a list, dict, int or string. This will be parsed and correctly formatted.
  • If it's not an Oracle, we check if it's a !natural_language map. If it is, we parse the result of the translation for possible formatting.
  • Otherwise just parse the value for possible formatting.
  • For a description, using !natural_language is also mandatory. The result of that translation will be formatted if it's a string.
  • When it's a dict, check the "description" key. If that is a dict, then it's a translation. After that, the value of the "description" key should always be a string, which is then formatted.
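
The common pattern in the rundown above (first pick the language, then format every remaining string) can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: NaturalLanguageMap is a hypothetical stand-in for a mapping tagged with !natural_language, str.format is assumed as the formatting step, and the rules about where the tag is mandatory are not modelled.

```python
from typing import Any


class NaturalLanguageMap(dict):
    """Hypothetical stand-in for a mapping tagged with !natural_language."""


def translate(value: Any, lang: str, table: dict[str, str]) -> Any:
    """Pick the requested language first, then format every string."""
    if isinstance(value, NaturalLanguageMap):
        return translate(value[lang], lang, table)
    if isinstance(value, str):
        return value.format(**table)
    if isinstance(value, list):
        return [translate(item, lang, table) for item in value]
    if isinstance(value, dict):
        return {key: translate(item, lang, table) for key, item in value.items()}
    return value  # ints, floats, booleans and None stay as they are


# Translation map already reduced to a single language (Dutch).
table = {"result": "resultaat"}
stdout = {
    "data": NaturalLanguageMap(en="The {result} is 10", nl="Het {result} is 10"),
    "config": {"ignoreWhitespace": True},
}
print(translate(stdout, "nl", table))
```

One limitation of this sketch: strings that contain literal braces would have to be escaped, because str.format treats every brace as part of a placeholder.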



def create_enviroment() -> Environment:
    enviroment = Environment()

Check warning

Code scanning / CodeQL

Jinja2 templating with autoescape=False Medium test

Using jinja2 templates with autoescape=False can potentially allow XSS attacks.


Collaborator Author


This doesn't seem to be relevant for our case, since we're not passing any HTML or XML through this.
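
To illustrate with a small sketch (the template and values are made up, and the third-party jinja2 package is assumed to be installed): with autoescape enabled, quotes in substituted values would be turned into HTML entities, which would corrupt generated statements.

```python
from jinja2 import Environment

template = "{{ left }} < len({{ name }})"
values = {"left": 3, "name": "'dier'"}

# Default environment: autoescape is off, values are inserted verbatim.
plain = Environment().from_string(template).render(**values)
# With autoescape on, the quotes in the substituted value become entities.
escaped = Environment(autoescape=True).from_string(template).render(**values)

print(plain)
print(escaped)
```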

Comment on lines +388 to +389
# def represent_str(dumper, data):
# return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')

Check notice

Code scanning / CodeQL

Commented-out code Note test

This comment appears to contain commented-out code.
@BrentBlanckaert
Collaborator Author

Documentation Natural language translation

This documentation discusses two things:
the natural_language map and the translations map,
both of which can be defined in the test suite.

Globally

A test suite can start with a list of tabs or a dictionary.
In this dictionary you can define the tabs, but now you can
also define a translations map. In this map you can
specify for each key a corresponding translation in
different languages. For example, we could create the
following test suite:

translations:
  animals:
    en: "animals"
    nl: "dieren"
  humans:
    en: "humans"
    nl: "mensen"
tabs:
  - tab: "{{animals}}"
    ...
  - tab: "{{humans}}"
    ...

This defines a translations map with the keys animals
and humans. The values define the corresponding
translations.
Each of these keys can be used in the tabs map to perform
translations on nearly every string (for example: the tab titles).
This is done by using double braces around the key ({{animals}}).
If a translation to Dutch was performed, it would generate
the following test suite:

tabs:
  - tab: "dieren"
    ...
  - tab: "mensen"
    ...

TESTed would be able to understand this again.

  • Say you want the tab name to be "{dieren}" instead of "dieren".
    In this case, there are two things you could do:
    • use "{{ animals|braces }}", or
    • use "{{ '{' + animals + '}' }}".
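
A rough sketch of how these placeholders could be rendered with Jinja2 is shown below. It assumes the third-party jinja2 package is installed and that the braces filter is registered under that name; the render helper is hypothetical and only illustrates the idea.

```python
from jinja2 import Environment


def wrap_in_braces(value):
    # The doubled braces are literal braces inside an f-string.
    return f"{{{value}}}"


env = Environment()
env.filters["braces"] = wrap_in_braces  # assumed filter name

translations = {
    "animals": {"en": "animals", "nl": "dieren"},
    "humans": {"en": "humans", "nl": "mensen"},
}


def render(text: str, lang: str) -> str:
    """Render one string of the test suite for the given language."""
    table = {key: by_lang[lang] for key, by_lang in translations.items()}
    return env.from_string(text).render(**table)


print(render("{{animals}}", "nl"))
print(render("{{ animals|braces }}", "nl"))
print(render("{{ '{' + animals + '}' }}", "nl"))
```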

Inside a tab

As discussed above, a tab could have a title that can be
translated using the translations map, but you could also
use a natural_language map for the title:

- tab: !natural_language
    en: 'animal/{{animals}}'
    nl: 'dier/{{animals}}'
  ...

In this case, the natural_language map is used to generate the
tab-title. The key animals can also be used here.
So a combination of a natural_language map with
key placeholders for the translations map is possible.

In a tab you can also define a translations map that can
overwrite certain translations for that tab.

Inside a tab, context and testcase, you can also define files
that act as potential input files. At the top level you can define
a natural_language map for them, and you can use
placeholders in the url and name attributes of each file:

files: !natural_language
  en:
    - name: "file_{{animal}}.txt"
      url: "media/workdir/file_{{animal}}.txt"
  nl:
    - name: "bestand_{{animal}}.txt"
      url: "media/workdir/bestand_{{animal}}.txt"

Inside a context

A tab can contain a context object that contains all the
testcases. Just like a tab, you can define a translations map
inside of it.
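
As an illustration of how the translations maps at the different levels could be combined, here is a small sketch. It assumes that the most specific level wins (context over tab over global) and that an override replaces the whole entry for a key; the override values are made up.

```python
from collections import ChainMap

global_map = {
    "animals": {"en": "animals", "nl": "dieren"},
    "result": {"en": "result", "nl": "resultaat"},
}
tab_map = {"animals": {"en": "pets", "nl": "huisdieren"}}
context_map = {}


def effective_translations(*most_specific_first: dict) -> dict:
    """Combine translations maps; the first map that defines a key wins."""
    return dict(ChainMap(*most_specific_first))


merged = effective_translations(context_map, tab_map, global_map)
print(merged["animals"]["nl"])  # taken from the tab
print(merged["result"]["nl"])   # falls back to the global map
```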

Inside a testcase

A testcase can contain all kinds of different things, among them
statements and expressions. The following is an example of the
kind of translations you can do:

- statement: !natural_language
    en: '{{result}} = Trying(10)'
    nl: '{{result}} = Proberen(10)'
- expression: !natural_language
    en: 'count_words({{result}})'
    nl: 'tel_woorden({{result}})'

For statements and expressions you can also define a
programming-language-specific map. Normally you don't need to add
anything special for this, but for consistency reasons
you must now also add !programming_language. So you could have
the following expression:

- expression: !programming_language
    javascript: !natural_language
      en: "{{animal}}_javascript_en(1 + 1)"
      nl: "{{animal}}_javascript_nl(1 + 1)"
    typescript: !natural_language
      en: "{{animal}}_typescript_en(1 + 1)"
      nl: "{{animal}}_typescript_nl(1 + 1)"
    java: !natural_language
      en: "Submission.{{animal}}_java_en(1 + 1)"
      nl: "Submission.{{animal}}_java_nl(1 + 1)"
    python: !natural_language
      en: "{{animal}}_python_en(1 + 1)"
      nl: "{{animal}}_python_nl(1 + 1)"

An equivalent of this would be:

- expression: !natural_language
    en: !programming_language
      javascript: "{{animal}}_javascript_en(1 + 1)"
      typescript: "{{animal}}_typescript_en(1 + 1)"
      java: "Submission.{{animal}}_java_en(1 + 1)"
      python: "{{animal}}_python_en(1 + 1)"
    nl: !programming_language
      javascript: "{{animal}}_javascript_nl(1 + 1)"
      typescript: "{{animal}}_typescript_nl(1 + 1)"
      java: "Submission.{{animal}}_java_nl(1 + 1)"
      python: "{{animal}}_python_nl(1 + 1)"
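
The equivalence of the two nesting orders can be checked with a small sketch. The two classes below are hypothetical stand-ins for mappings tagged with !natural_language and !programming_language, and pick_language only selects the natural language without touching the placeholders.

```python
class NaturalLanguage(dict):
    """Hypothetical stand-in for a !natural_language mapping."""


class ProgrammingLanguage(dict):
    """Hypothetical stand-in for a !programming_language mapping."""


def pick_language(value, lang):
    """Replace every NaturalLanguage map by its entry for `lang`."""
    if isinstance(value, NaturalLanguage):
        return pick_language(value[lang], lang)
    if isinstance(value, dict):
        # Keep the mapping type (for example ProgrammingLanguage) intact.
        return type(value)((k, pick_language(v, lang)) for k, v in value.items())
    return value


programming_outer = ProgrammingLanguage(
    java=NaturalLanguage(en="Submission.{{animal}}_java_en(1 + 1)",
                         nl="Submission.{{animal}}_java_nl(1 + 1)"),
    python=NaturalLanguage(en="{{animal}}_python_en(1 + 1)",
                           nl="{{animal}}_python_nl(1 + 1)"),
)
natural_outer = NaturalLanguage(
    en=ProgrammingLanguage(java="Submission.{{animal}}_java_en(1 + 1)",
                           python="{{animal}}_python_en(1 + 1)"),
    nl=ProgrammingLanguage(java="Submission.{{animal}}_java_nl(1 + 1)",
                           python="{{animal}}_python_nl(1 + 1)"),
)

print(pick_language(programming_outer, "nl") == pick_language(natural_outer, "nl"))
```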

If no statements and expressions are specified,
there must be a stdin and arguments provided.

For both of these, you can specify a natural_language
map. For stdin the values should be strings that
also can contain formatting for translation
with the translations map.

For arguments, you can also specify a natural_language
map. The values should be lists that represent the arguments.
Formatting can be performed on all strings in those lists.

Next up, you can also specify stdout, stderr, and
exception. All three of these follow the
same format when it comes to translations.
At the top level, you can specify a natural_language
map. The values should either be a string or a
dictionary.

  • If it's a string, it can just be formatted with the
    translations map.
  • If it's a dictionary, it should contain the key data
    (message for exception). The value of that could
    be another natural_language map.
    The values of that natural_language map can be
    of any YAML type. Strings in those values can also be
    formatted with the translations map.

An example could be the following:

stderr:
  data: !natural_language
    en: "Nothing to see here {{User}}"
    nl: "Hier is niets te zien {{User}}"
  config:
    ignoreWhitespace: true

You can also add a file which corresponds to an output file.
This can be a natural_language map and the values should
be dictionaries. Those dictionaries should contain
the keys content and location, that can be formatted.

When specifying the return, it can be a lot of things
at the top level:

  • An oracle that is basically a dictionary where two
    keys can have translations:
    • arguments: works the exact same way as discussed above.
    • value: works the same way as data in stderr and stdout.
  • A natural_language map: the values are expected to be valid yaml types, where the strings can be formatted.
  • A valid yaml value, where the strings can be formatted.

An example of this would be the following:

return: !oracle
  value: !natural_language
    en: "The {{result}} 10 is OK!"
    nl: "Het {{result}} 10 is OK!"
  oracle: "custom_check"
  file: "test.py"
  name: "evaluate_test"
  arguments: !natural_language
    en: ["The value", "is OK!", "is not OK!"]
    nl: ["Het {{result}}", "is OK!", "is niet OK!"]

Lastly, there is a description that could be added.
At the top level, this can be a natural_language map.
The values of this are either a dictionary or a string.

  • If it's a string, it can simply be formatted.
  • If it's a dictionary, it should contain the key description.
    This can also be a natural_language map. The values
    should be strings, which can be formatted.

An example of this would be the following:

description:
  description: !natural_language
    en: "Eleven_{{elf}}"
    nl: "Elf_{{elf}}"
  format: "code"

@BrentBlanckaert BrentBlanckaert marked this pull request as ready for review January 7, 2025 17:07
Member

@niknetniko niknetniko left a comment


I have not looked at all the implementation code itself, I mostly read the "documentation".

  • I don't see any immediate conflicts in the DSL with other features, so I think the proposal is good.
  • I think the choice for explicit !natural_language is a good choice, even if it adds some verbosity to the test suites. It is always easier to add shorthands later if the experience of using the feature indicates that they are needed (the reverse is much more difficult).

So in summary, I think this is a good way of adding translation to the DSL, while keeping simple things simple but still providing a fairly flexible approach.

@BrentBlanckaert BrentBlanckaert self-assigned this Jan 11, 2025
@BrentBlanckaert BrentBlanckaert added the enhancement New feature or request label Jan 11, 2025
Contributor

@jorg-vr jorg-vr left a comment


First of all sorry for the late review.

As Niko already reviewed the docs, and I agree with his take, I have mostly looked at the code.

I have left more questions than actual comments, sorry for that, I am not very familiar with this part of the code.

I think it would be a big improvement should it be possible to write the code a bit more agnostic of the actual TESTed DSL. But it might be best to discuss this with @pdawyndt before you start a rewrite, as he might deem it more worthwhile for you to focus on other features instead of over-optimizing this one.

@@ -148,6 +184,8 @@ def _parse_yaml(yaml_stream: str) -> YamlObject:
yaml.add_constructor("!" + actual_type, _custom_type_constructors, loader)
yaml.add_constructor("!expression", _expression_string, loader)
yaml.add_constructor("!oracle", _return_oracle, loader)
yaml.add_constructor("!natural_language", _natural_language_map, loader)
yaml.add_constructor("!programming_language", _programming_language_map, loader)
Contributor


I am a bit confused why !programming_language is added in this pr, as I assumed this to already work before this pr. But I am not very familiar with this part of the code.

Could you explain it to me?

@@ -20,7 +20,14 @@
StringTypes,
Contributor


If this is easy to do, I would put these new tests in a separate file.
(But do keep them here if creating a second testfile requires a lot of code duplication)

@@ -0,0 +1,412 @@
import sys
Contributor


I must say I expected this file to be much simpler.

In principle the !natural_language should simply be replaced by the content of the specified language in the map.
So in my mind, this code should not know whether it is working within a tab, context, testcase,...

The fact that this preprocess script is so heavily linked to the precise TESTed DSL, will make it harder to maintain in the future. Any change to the TESTed DSL will also have to be verified here.

Do you think it is possible to write a more abstract solution, or have I missed some potential issues?
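
To make the idea concrete, a DSL-agnostic variant could look roughly like the sketch below. It assumes PyYAML is available and resolves !natural_language while loading; make_loader and the example suite are hypothetical, and the sketch does not cover the translations map or the places where an untagged dict is accepted.

```python
import yaml


def make_loader(lang: str):
    """Build a loader class that replaces !natural_language maps by one language."""

    class TranslatingLoader(yaml.SafeLoader):
        pass

    def natural_language(loader, node):
        # Construct the tagged mapping fully, then keep only one language.
        mapping = loader.construct_mapping(node, deep=True)
        return mapping[lang]

    TranslatingLoader.add_constructor("!natural_language", natural_language)
    return TranslatingLoader


suite = """
- tab: !natural_language
    en: "animals"
    nl: "dieren"
  testcases:
    - expression: !natural_language
        en: "count_words(result)"
        nl: "tel_woorden(resultaat)"
"""
print(yaml.load(suite, Loader=make_loader("nl")))
```

Because the constructor is registered on a fresh SafeLoader subclass, it knows nothing about tabs, contexts or testcases and leaves the global loader configuration untouched.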



def wrap_in_braces(value):
    return f"{{{value}}}"
Contributor


Is this a standard way of doing this in jinja2?
To me it felt a bit odd. But I assume this was an explicit feature request?
