Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Feat/expectations #307

Draft
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

thorehusfeldt
Copy link
Collaborator

@thorehusfeldt thorehusfeldt commented Sep 16, 2023

This is my first attempt at implementing the expectations framework into BAPC.

Draft documentation is in doc/expectations.md, and the file bin/expectations.py has extensive documentation (including doctest.)

This is a very early draft, and mainly an invitation for feedback.

Why care?

Correct semantics for required and permitted verdicts

Out-of-the box, an immediate improvement to the current situation is that it implements various verification conventions correctly, such as time_limit_exceeded and run_time_exception.

Here’s an example of a submission (in test/problems/spanishinquisition, where all the examples for the expectations framework reside) that fails for two different reasons. It lives in wrong_answer, but gets TLE. The tool now correctly reports this, by distinguishing between permitted verdicts (here, “all verdict must get AC and WA”), and required verdicts (here, “some verdict must get WA”). Both violations are reported.

image

Of course, the normal behavious is still supported. A submission in accepted that gets WA gets

image

Note that the tool is more helpful in line 2 and tells us where to find the violation (here, because the submission resides in accepted), and which verdicts would have been permitted (here, ACCEPTED).

You can specify your own conventions for what should reside in submissions/other or submissions/mixed; the framework or the tool doesn’t care.

mixed/: # correct submissions that might run out of time 
  sample: accepted
  secret:
    permitted: [AC, TLE]

More fine-grained test data specification:

Requirements can, by default, be specified for all of data/. But it is also possible to be more fine-grained. A contest that wants to ensure that the submissions in wrong_answer still get accepted on all the samples, can just write

wrong_answer/:
  sample: accepted
  secret: wrong answer

This rule now applies to all submissions in wrong_answer.

The specification of “which testdata to match” is just a regex, so you can go crazy:

mixed/superstitious.py:            # for this submission
  secret/thirteen:                 # ... on testcases that match this pattern
    permitted: [WA]                # ... permit exactly the verdict WA, nothing else
  secret/(?!thirteen): accepted    # everywhere else: accept

This can be useful for specifying specific testcases that are aimed at “killing” specific submissions (or vice versa).

More generally, this allows specifying the behaviour of submissions on subdirectories of data, useful for expressing things like “the greedy submissions are expected to pass on the data in this directory, but not that”, or problems with subtasks.

More fine-grained submission specification

Matching of submissions is by prefix; the standard prefixes accepted, wrong_answer, etc. just happen to match the path of the submissions that reside in the corresponding directories. As with superstitious.py above, you can specify a specific submission by giving its full name; or anything inbetween. If Thore wants to ensure that his submissions (say, th-greedy.py) in wrong_answer at least pass sample, he can add

wrong_answer/th:
  sample: accepted

Can read, say, `test/problems/hello/exectations.yaml` and
verifies the expectations set therein (rather than looking
at directory-defined or @EXPECTED_GRADES@.)
Also introduce shortform like AC for ACCEPTED
@thorehusfeldt
Copy link
Collaborator Author

thorehusfeldt commented Sep 16, 2023

Here’s the output for bt run on spanishinquisistion that shows a ton of the functionality at work at the same time:

image

Note for instance the feedback for bad-fourteen (a submission that should have failed on test case secret/thirteen but does fail on secret/fourteen instead), where the tool tells us which patterns where violated and what would have been expected there. Also note the use of the regex \d+ to pick up automatically numbered testcases at mixed/bad-eight.sh. That is probably the most important use of regex matching. (In fact, it’s the motivation for that decision, otherwise we could never target individual testcases whose numbering we don’t control.)

Copy link
Collaborator

@mpsijm mpsijm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick check, commenting the first things that popped into mind 🙂

pass # TODO (Thore): made this shut up!
#error(
# f'Submission {self.short_path} must have @EXPECTED_RESULTS@. Defaulting to ACCEPTED.'
#)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is probably still necessary. DOMjudge does not support an expectations.yaml, and reads the @EXPECTED_RESULTS@ tag to know which final verdicts are allowed. So if the submission is not in one of the four default directories, and it does not have such a tag, DOMjudge will default to ACCEPTED, which is not intended in most cases 😅

I think that the @EXPECTED_RESULTS@ tag converted to expectations.yaml syntax would look something like:

rejected/broken.py:  # contains `@EXPECTED_RESULTS@: TIME_LIMIT_EXCEEDED, WRONG_ANSWER`
  permitted: [AC, TLE, WA]  # only these verdicts may appear
  required: [TLE, WA]       # at least one of these verdicts must appear

It would be nice if the existence of an @EXPECTED_RESULTS@ could be checked for consistency with the expectations.yaml. Perhaps we could even infer the tag from the YAML on export, but that may result in too much seemingly magic behaviour?

Copy link
Collaborator Author

@thorehusfeldt thorehusfeldt Sep 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All correct, and all (in my mind) part of the tool more than of the expecations framework.

Note that the framework does allow file-based expectations; in principle a submission itself can supply “its own” expectations, which then compose with expectations in expectations.yaml in well-defined ways.

I’ve silenced @EXPECTED_RESULTS@ in my branch simply because there are too many decisions that I don’t feel strongly about involved with how to handle it. (A tiny method that translates @EXPECTED_RESULTS@ into specific expectations.BaseExpectations would be the quickest way to be complatible. Very easy to do.)

Copy link
Collaborator Author

@thorehusfeldt thorehusfeldt Sep 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, if

permitted: [AC, TLE, WA] 
required: [TLE, WA]  

is sufficiently popular, the cleanest way is to elevate it to a common abbreviation next to accepted, does not terminate, etc. In principle, there are 16 different possible permitted subsets (two of which make now sense), each of which can have a number of required subsets. I was just taking the standard set and add a few that seemed useful to me (such as not accepted).

If somebody has a large repository of @EXPECTED_RESULTS@-tagged files, we could run statistics on them! Please so.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the following command on my folder that contains all Git repos that I've worked with in the past years (FPC from 2018, BAPC+preliminaries from 2020, NWERC from 2022):

$ grep -r @EXPECTED_RESULTS@ 2> /dev/null | grep 20 | cut -d ':' -f 3 | sed 's/ //g' | sort | uniq -c
  • The grep 20 is to only select repositories that have a year number (2018–2023), because I want to exclude things like BAPCtools and domjudge that use this tag many times in their tests.
  • The sed command removes the optional space after the comma that separates the verdicts.
  • The verdicts may still have different orders (e.g. both ACCEPTED,WRONG_ANSWER and WRONG_ANSWER,ACCEPTED occur), but I've manually aggregated those in the results below.
      3 ACCEPTED,RUN_TIME_ERROR
      2 ACCEPTED,RUN_TIME_ERROR,TIME_LIMIT_EXCEEDED
      2 ACCEPTED,RUN_TIME_ERROR,WRONG_ANSWER
     47 ACCEPTED,TIME_LIMIT_EXCEEDED
     16 ACCEPTED,WRONG_ANSWER
     10 RUN_TIME_ERROR,TIME_LIMIT_EXCEEDED
      3 RUN_TIME_ERROR,TIME_LIMIT_EXCEEDED,WRONG_ANSWER
     22 RUN_TIME_ERROR,WRONG_ANSWER
     49 TIME_LIMIT_EXCEEDED,WRONG_ANSWER

After running this command, I realize that my earlier example only works if the tag only has non-AC verdicts. For example, a combined AC,TLE verdict may better be written without required:

mixed/sometimes-too-slow.py:  # contains `@EXPECTED_RESULTS@: ACCEPTED, TIME_LIMIT_EXCEEDED`
  permitted: [AC, TLE]  # only these verdicts may appear

Adding the required field is probably not even strictly necessary in any of these cases, but for the cases that do not include AC, it's probably helpful if the tool warns about a submission that we expect to give some non-AC verdict, but is in fact AC.


And as a final remark, I'm finding it difficult to switch my mental model from "the verdict that the submission receives" to "the set of verdicts that the test cases receive". The fact that DOMjudge (and @EXPECTED_RESULTS@) use the former and expectations.yaml uses the latter, probably doesn't help either. 😂

Copy link
Collaborator Author

@thorehusfeldt thorehusfeldt Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this. Also for this formulation:

I'm finding it difficult to switch my mental model from "the verdict that the submission receives" to "the set of verdicts that the test cases receive".

That is exactly why I’m doing this; and I’m trying to be terminologically stringent about it. To repeat this in my own words,

  1. The separation of these two concepts is crucial for even defining the semantics of, e.g., time_limit_exceeded, in accordance with the specification.
  2. Another separation is between the behaviour of the construction and validation tool (which is used by authors) and the electronic judge (which is used by contest organisers, teachers, and solvers.) The former (say, BAPCtools or problemtools) needs a way of saying “no testcase for the brute-force solution can ever be WA, but TLE and RTE are acceptable, and the final verdict shan’t be AC”, and construction tools should support this to ensure problem quality. On the other hand, the judging environment (Kattis, DomJudge, etc.) needs well-defined semantics for what the final verdict is (or maybe subtask scores of IOI), and when it is OK to abort prematurely to reduce server load. For instance a judge can report the lexicographically first (by testcase path) non-AC verdict.

By the way, I’m happy to add final to the expectations specification, and tried many ideas. Here’s how that could look:

final: TLE
permitted: [AC, TLE, WA]
required: [TLE]

This is then either a requirement about the lexicographic ordering of testcases, ensuring that the TLE verdict is encountered before the WA verdict (what the spec calls first_error), or that the judge has an ordering of rejection-verdicts that makes TLE worse than WA, continues grading after encountering WA, and then reports the worst error.

But I don’t really see a need for final, nor is the semantics clear to me. This is the kind of feedback I’m soliciting here. (In my current BAPCtools integration, this issues would need to be resolved to understand whether the final verdict should be red or green.) See, for instance, submissions bad-tle-wa.py in the large screenshot above, which fails (because it gets a WA), yet gets a green TLE (which, after all, is a permitted testcase verdict). I have no opinion on what the semantics or visualisation should be.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the elaboration! I don't think that a final expectation is needed, indeed 🙂

Copy link
Collaborator Author

@thorehusfeldt thorehusfeldt Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve updated (b4d4df9) the pydoc documentation of expectations.yaml to spell out the workflow that avoids expectation.yaml altogether and specifies expectations submission-by-submission by writing

registry['wrong_answer/greedy.py'] = Expectations({'sample': 'accepted'})

This would be my suggestion for translating submissions that contain @EXPECTED_RESULTS@. But I simply do not understand the semantics of the latter, because I can’t wrap my head around

Use this to annotate the possible outcomes that a submission can have, that will be accepted by the Judging verifier.

in the DOMjudge appendix.

As far as I can tell, the final verdict (“outcome”) in DOMjudge depends on results_prio, which is not part of the problem package, and in the default setting seems to be what the problem package specification calls first_error. Thus, the outcome depends on the lexicographic ordering of the testcases. I don’t know if this granularity is intended or used, nor don‘t I know how to translate it.

closed PR @ problem_package_format

Copy link
Collaborator

@mpsijm mpsijm Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For DOMjudge, the final verdict not only depends on result_prio, but also on whether lazy judging is enabled: https://www.domjudge.org/docs/manual/8.2/judging.html#lazy-judging-and-results-priority

When [lazy judging is] enabled, judging of a submission will stop when a highest priority result has been found for any testcase. You can find these priorities under the results_prio setting. In the default configuration, when enabling this, judging will stop with said verdict when a testcase results in e.g. run-error, timelimit or wrong-answer. When a testcase is correct (lower priority), judging will continue to the next test case. In other words, to arrive at a verdict of correct, all testcases will have been evaluated, while any of the ‘error’ verdicts will immediately return this answer for the submission and the other testcases will never be tested [...].

[...]

When not using lazy judging, all testcases will always be ran for each submission. The results_prio list will then determine which of the individual testcase results will be the overall submission result: the highest priority one. In case of a tie, the first occurring testcase result with highest priority is returned.

I think we can assume the default DOMjudge settings for BAPCtools. At least, we have never changed this setting in DOMjudge in the past years 🙂 Note that the lazy judging, especially when running multiple runners in parallel, may not necessarily return the result of the first non-accepted test case. In the cases where we use @EXPECTED_REULTS@, I think we usually don't care about the ordering, as long as the submission never receives any other verdict than a verdict from this list (i.e. we don't use first_error currently, but I also don't think we knew that it existed). Thus, at least for now, I don't think we have to translate first_error to expectations.

I would translate @EXPECTED_RESULTS@: <verdict_list> as follows (and it would be nice if BAPCtools does this automatically, when an expectations.yaml does not exist):

  • if AC in <verdict_list>:
    permitted: <verdict_list>
  • if AC not in <verdict_list>:
    permitted: [AC, *<verdict_list>]  # some test cases may pass, so also include `AC`
    required: verdict_list  # at least one of the specified verdicts must appear for at least one of the test cases

- Imports are usually ordered with system libraries first, followed by a
newline, followed by local includes. Both groups are sorted alphabetically,
and `import` comes before `from ... import`.
See [doc/expectations](doc/expectations.md)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably didn't mean to delete the entire readme.md? 😛

Comment on lines +445 to +446
# TODO Q from Thore: ok to use self here instead of problem?
def get_expectations_registry(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably best to stay consistent with existing code, so I would use problem here instead of self, unless we rename the first parameter of all class methods from problem to self. I don't mind either way, as long as it's consistent, so this decision is better for Ragnar to make 🙂

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a fan of self somehow. I think it's because it's not always clear in which class you're currently reading code with the big functions everywhere, so naming things after what they are seems clearer generally. (Especially given that self doesn't really mean anything special inside the member functions anyway.)

So yes, I'd prefer problem.

@RagnarGrootKoerkamp
Copy link
Owner

We just came across #313.

Currently the lazy-judging situatoin is kind of messy/ad-hoc:

  • WA has priority 90
  • TLE without being aborted (ie above time limit, below safety margin) has priority 99-eps.
  • TLE (aborted) and RTE have max priority 99
  • Lazy judging stops on the first max-priority verdict.

This means that submissions/wrong_answer are ensured to not give TLE/RTE anywhere, but submissions/{run_time_error,time_limit_exceeded} may give WA. Actually that's somewhat in line with the spec, but also pretty arbitrary.

With expectations this all becomes much nicer: we can simply stop on the first non-permitted verdict.

Then the only edge case that I would like to keep is to keep running on TLE (non-aborted) to determine if the time limit is good or not. Maybe this could/should be behind a flag, but I think it's also fine to do this unconditionally.

@@ -0,0 +1,279 @@
# Expectations

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice writeup :)

doc/expectations.md Show resolved Hide resolved
doc/expectations.md Show resolved Hide resolved
doc/expectations.md Show resolved Hide resolved
#common
#range
permitted?: #verdicts // only these verdicts may appear
required?: #verdicts // at least one of these verdicts must appear

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The at least one here is I think the most non-intuitive part of the spec, but also what is wanted/needed/useful.
Requiring at least one WA and at least one TLE (in a single selection of testcases) is weird/rare anyway.

Maybe require_one would be a better key? But I like the simplicity of required so probably this is just something to get used to.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that duplicate yaml dictionary keys are disallowed, there may not be a nice way at all to specify multiple required verdicts for a single testgroup. Probably that needs require_all: or so.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way around could be to additional allow here required: [[WA], [TLE]]: a list of lists, and at least inner-list behaves like the current required key where at least one of the verdicts must be present.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. For a while the keys were called all, any, only, some and so on, but it wasn’t clear even to myself. I agree that it feels as if require_all should be there, but (again) I haven’t seen a sufficient use case.

If it’s important to a problem author, the author can always be more precise:

thore.cpp:
  secret/\d+-makes_thore_wrong: wrong answer
  secret/\d+-makes_thore_slow: time limit exceeded

However, I’m more than willing to reintroduce only, all, some. I found that very neat. Other naming suggestions are welcome. (subset, superset, intersects). require_any and require_all are very clear, but make my eyes bleed.

Copy link
Owner

@RagnarGrootKoerkamp RagnarGrootKoerkamp Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, being more precise to the individual testcase level sounds good enough to me. This is probably a rare case anyway and this is a sufficiently good workaround IMO. So let's just keep require:.
(Maybe add this workaround as a comment/example somewhere.)


# Specification for subgroups works "all the way down to the tescase"
# though it's seldomly needed:
funky.cpp:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry to disappoint, but duplicate dictionary keys are not allowed by yaml, so this would have to be merged into the funky.cpp above.

doc/expectations.md Show resolved Hide resolved
doc/expectations.md Show resolved Hide resolved
* accepted: Every test case is `AC`.
* wrong answer: Every test case receives `AC` or `WA`; _some_ test case receives `WA`.
* time limit exceeded: Every test case receives `AC` or `TLE`; _some_ test case receives `TLE`.
* runtime exception: Every test case receives `AC` or `RTE`; _some_ test case receives `RTE`.
Copy link
Owner

@RagnarGrootKoerkamp RagnarGrootKoerkamp Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will trip over time <space> limit vs run <no space> time all the time. oh well. Let's make nice IDE integration to autocomplete all options here :)

For reference, currently time_limit_exceeded/ and run_time_error/ directories are used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, deliberate (and a pet peeve.)

  • runtime refers to a part of the softward life cycle. It appears in runtime error, or phrases like “at runtime”. It can be spelt runtime or run time.
  • running time is the execution time of a program, or the complexity of an algorithm

I spent an hour yesterday communicating with a student about why their Kattis submission failed, and it turned out that the student thought “run time error” meant “the running time is wrong”, so we discussed the wrong thing for a long time. So our terminological sloppiness has really bad consequences.

So I do what I can to separate runtime and running time, including a strong preference for the unspaced orthography. Also, runtime remains the more common spelling.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I only now realize you also renamed error to exception. Works for me.

(Maybe we should actually rename the directory as well then.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(These details are not really important to me, by the way and shouldn’t get in the way of anything.)

@RagnarGrootKoerkamp
Copy link
Owner

I'm super happy with where this is going :)

One question that remains to be answered (probably in the spec) is what will be the meaning of the existing directories.
Do you already generate default expectations for them?

Probably we want each directory to directly map to the equivalent shorthand you introduce, but in the current spec time_limit_exceeded/* submissions are allowed to WA on some cases... I think we should change that.

@thorehusfeldt
Copy link
Collaborator Author

Just to make sure:

the current spec time_limit_exceeded/* submissions are allowed to WA on some cases... I think we should change that.

I explicitly have no opinion about what time limit exceeded should mean; the problem package format specification will decide this. I merely want to provide a conceptual framework where we can express whatever we agree on.

@RagnarGrootKoerkamp
Copy link
Owner

I think my question was mostly:
It would be nice to include the default mapping for the default directory structure, in case you don't do this already, basically the following boilerplate:

accepted: accepted
wrong_answer: wrong answer
time_limit_exceeded: time limit exceeded
run_time_error: runtime exception

We could even extend that with

not_accepted: not accepted  (or rejected: not accepted is more our current convention but that can change)
does_not_terminate: does not terminate

@thorehusfeldt
Copy link
Collaborator Author

What you suggest is what I would do (and I deliberate chose terminology to make this suggestion tempting), but the decision belongs to the problem package format, not to the expectations framework.

@RagnarGrootKoerkamp
Copy link
Owner

I'll review the code as well. I suppose the next step is to get the last details worked out in the spec and then we can merge it here. (I'd also be happy to merge before that.)

@thorehusfeldt thorehusfeldt marked this pull request as draft January 15, 2024 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants