RFC: Expectations proposal #112
Harping on about the bespoke syntax: the suggestion is orthogonal to the rest of the proposal if we just view it as a shorthand.

mixed/th.py: not wrong

is just a shorthand for

mixed/th.py:
  allowed: [AC, TLE, RTE]

and

mixed/th.py: time out

is a shorthand for

mixed/th.py:
  allowed: [AC, TLE]
  required: TLE

This sounds useful to me (and easy to implement and describe). |
For scoring, I definitely like the idea of For Sidenotes:
|
After some more feedback, here's another iteration (0.3.0). First, some examples:

# Simple examples for some common cases
a.py: accepted # AC on all cases
b.py: wrong answer # at least one WA, otherwise AC
c.py: time limit exceeded # at least one TLE, otherwise AC
d.py: runtime exception # at least one RTE, otherwise AC
e.py: does not terminate # at least one RTE or TLE, but no WA
f.py: not accepted # at least one RTE, TLE, or WA
g.py: full score # gets max_score
# Other common cases can be easily added to the specification,
# tell me if one comes to mind.
#
# Abbreviations are just shorthands for richer maps
# of "required" and "allowed" keys.
#
# For instance, these are the same:
th.py: accepted
---
th.py:
allowed: AC
required: AC
---
# A submission failed by the output validator on some testcase
# These are the same:
wrong.py: wrong answer
---
wrong.py:
allowed: [WA, AC]
required: WA
---
wrong.py:
allowed: # alternative yaml syntax for list of strings
- WA
- AC
required: WA
---
# Specify that the submission fails, but passes the samples.
# These are the same, using the same abbreviations as
# above for "accepted" and "wrong answer"
wrong.py:
sample: accepted
secret: wrong answer
---
wrong.py:
sample:
allowed: AC
required: AC
secret:
allowed: [AC, WA]
required: WA
# Specification for subgroups works "all the way down"
# though it's seldom needed:
funky.cpp:
secret:
subgroups:
huge_instances:
subgroups:
disconnected_graph:
allowed: [RTE, TLE]
required: TLE
---
# Same as:
funky.cpp:
secret:
subgroups:
huge_instances/disconnected_graph:
allowed: [RTE, TLE]
required: TLE
# Can also specify a required judgemessage, not only verdicts
linear_search.py:
judge_message: "too many rounds" # matches judgemessage.txt as substring, case-insensitive
#########
# Scoring
#########
# simplest case:
th.py: full score
# Partial solutions can be given in various ways:
partial.py: [50, 60]
---
partial.py: 60
---
partial.py:
score: 60
---
partial.py:
score: [50, 60]
---
partial.py:
fractional_score: [.5, .6] # percentage of full score
---
# For subtasks, you probably want to specify the
# outcome per subgroup. You need to be more verbose:
partial.py:
secret:
subgroups:
subtask1: full score
subtask2: 0
subtask3: 0
---
# Can be even more verbose about scores
partial.py:
secret:
subgroups:
subtask1: full score
subtask2:
score: 13 # absolute score on this group is exactly 13
subtask3: # between 10% and 40% of (full score for subtask 3)
fractional_score: [.1, .4]
---
# Can still specify testcases
bruteforce.py:
secret:
subgroups:
subtask1: full score # subtask 1 has small instances
subtask2:
score: 0 # No points for this subtask
required: TLE # ... because some testcase timed out
allowed: [AC, TLE] # ... rather than any WAs
---
# The common abbreviations work here as well, you probably want to write this instead:
bruteforce.py:
secret:
subgroups:
subtask1: full score # could write "accepted" as well in this case
 
subtask2: time limit exceeded # this is more informative than "0"

And here's the schema:
#registry: close({ [string]: #root })
#test_case_verdict: "AC" | "WA" | "RTE" | "TLE"
#root: #common | #range | {
#expectations
sample?: #subgroup
secret?: #subgroup
}
#subgroup: #common | #range | {
#expectations
subgroups: close({ [string]: #subgroup })
}
#expectations: {
allowed?: // only these verdicts may appear
#test_case_verdict | [...#test_case_verdict] | *["AC", "WA", "RTE", "TLE"]
required?: // at least one of these verdicts must appear
#test_case_verdict | [...#test_case_verdict]
judge_message?: string // this judgemessage must appear
score?: #range
fractional_score?: #fractional_range
}
#common: "accepted" | // { allowed: AC; required: AC }
"wrong answer" | // { allowed: [AC, WA]; required: WA }
"time limit exceeded" | // { allowed: [AC, TLE]; required: TLE }
"runtime exception" | // { allowed: [AC, RTE]; required: RTE }
"does not terminate" | // { allowed: [AC, TLE, RTE]; required: [RTE, TLE] }
"not accepted" | // { required: [RTE, TLE, WA] }
"full score" // { fractional_score: 1.0 }
#range: number | [number, number]
#fractional_range: #fraction | [#fraction, #fraction]
#fraction: float & >=0.0 & <=1.0
|
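A sketch of how a checker might evaluate one submission's results against an #expectations map; the function name, defaults, and calling convention are my assumptions, not part of the schema:

```python
# Sketch: check a list of per-testcase verdicts (and optionally the judge
# messages seen) against an #expectations map as described above.
def check(expectations, verdicts, judge_messages=()):
    allowed = expectations.get("allowed", ["AC", "WA", "RTE", "TLE"])
    if isinstance(allowed, str):
        allowed = [allowed]          # string is shorthand for singleton list
    required = expectations.get("required", [])
    if isinstance(required, str):
        required = [required]
    if not set(verdicts) <= set(allowed):
        return False                 # only allowed verdicts may appear
    if required and not set(required) & set(verdicts):
        return False                 # at least one required verdict must appear
    msg = expectations.get("judge_message")
    if msg is not None:
        # substring match, case-insensitive, against the judgemessage texts
        if not any(msg.lower() in m.lower() for m in judge_messages):
            return False
    return True
```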
I don't have any strong opinions on the scoring part, but the verdict specification sounds fine to me. |
Thank you, Jaap! Clear documentation is exactly what I'm after. The organisation into directories is orthogonal to the expectations proposal, but consistent with it. In particular, I would like to allow globbing in submission pathnames. This means that the directory definitions in the legacy specification are the same as the following:

# Legacy specification expectations implicit in directory structure
accepted/: accepted
wrong_answer/: wrong answer
time_limit_exceeded/:
  allowed: [AC, WA, TLE]
  required: TLE
runtime_error/:
  required: RTE

(And as far as I understand, these expectations are not consistently implemented, nor do we agree that they are what we want. My proposal should help us be much clearer about that.) I'd be happy to include "default expectations for unlisted submissions", but I'm not sure it's part of the expectations proposal. (I'm willing to make it so, but as we observed, current judge systems already pursue different traditions about what the directories mean. My main ambition is to facilitate the specification of expectations, rather than mandate requirements on implementations.)

Note that the globbing idea also allows Thore to promise that all his submissions pass the samples:

**/*-th.py:
  sample: accepted

Similarly, a head of jury could even require that everybody's WA submissions pass all samples:

wrong_answer/*:
  sample: accepted

But globbing is science-fiction for now. |
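For concreteness, here is what the science-fiction globbing could look like with Python's fnmatch; the glob semantics and file names are purely assumptions for illustration:

```python
import fnmatch

# Hypothetical glob matching over submission pathnames (invented file names).
submissions = ["accepted/alice.py", "wrong_answer/bob-th.py", "mixed/greedy-th.py"]

def matching(pattern):
    """Return the submissions whose pathname matches the glob pattern."""
    return [s for s in submissions if fnmatch.fnmatch(s, pattern)]
```

Note that fnmatch's `*` also crosses `/` boundaries, so a pattern like `**/*-th.py` behaves as "any -th.py file in any directory".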
Version 0.4.0 Changes since 0.3.0: flattened the hierarchy of subgroups to be just strings. Reason: simpler to read, write, and specify. Far less verbose in 99.9% of cases. Also added regular expressions to allow for auto-numbered groups and testcases. First, some examples:

# Simple examples for some common cases
a.py: accepted # AC on all cases
b.py: wrong answer # at least one WA, otherwise AC
c.py: time limit exceeded # at least one TLE, otherwise AC
d.py: runtime exception # at least one RTE, otherwise AC
e.py: does not terminate # at least one RTE or TLE, but no WA
f.py: not accepted # at least one RTE, TLE, or WA
g.py: full score # gets max_score
# submissions are identified by prefix:
wrong_answer/: wrong answer # expectations "wrong answer" apply to "wrong_answer/th.py" etc.
# Abbreviations are just shorthands for richer maps
# of "required" and "allowed" keys.
#
# For instance, these are the same:
th.py: accepted
---
th.py:
allowed: AC
required: AC
---
# A submission failed by the output validator on some testcase
# These are the same:
wrong.py: wrong answer
---
wrong.py:
allowed: [WA, AC]
required: WA
---
wrong.py:
allowed: # alternative yaml syntax for list of strings
- WA
- AC
required: WA
---
# Specify that the submission fails, but passes the samples.
# These are the same, using the same abbreviations as
# above for "accepted" and "wrong answer"
wrong.py:
sample: accepted
secret: wrong answer
---
wrong.py:
sample:
allowed: AC
required: AC
secret:
allowed: [AC, WA]
required: WA
# Constraints apply to testcases in entire subtree of cases that match the string:
funky.cpp:
allowed: [AC, WA, RTE]
secret:
allowed: [AC, RTE, TLE] # TLE is forbidden at ancestor, so this makes no sense
secret/small: accepted # more restrictive than ancestor: this is fine
# Specification for subgroups works "all the way down to the testcase"
# though it's seldom needed:
funky.cpp:
secret/huge_instances/disconnected_graph:
allowed: [RTE, TLE]
# Can also specify a required judgemessage, not only verdicts
linear_search.py:
judge_message: "too many rounds" # matches judgemessage.txt as substring, case-insensitive
# Allow digit regex to catch auto-numbered groups: `\d+`
submission.py:
secret/\d+-group/: accepted # matches 02-group
#########
# Scoring
#########
# simplest case:
th.py: full score
# Partial solutions can be given in various ways:
partial.py: [50, 60]
---
partial.py: 60
---
partial.py:
score: 60
---
partial.py:
score: [50, 60]
---
partial.py:
fractional_score: [.5, .6] # percentage of full score
---
# For subtasks, you probably want to specify the
# outcome per subgroup. You need to be more verbose:
partial.py:
secret/subtask1: full score
secret/subtask2: 0
secret/subtask3: 0
---
# Can be even more verbose about scores
partial.py:
secret/subtask1: full score
secret/subtask2:
score: 13 # absolute score on this group is exactly 13
secret/subtask3: # between 10% and 40% of (full score for subtask 3)
fractional_score: [.1, .4]
---
# Can still specify testcases
bruteforce.py:
secret/subtask1: full score # subtask 1 has small instances
secret/subtask2:
score: 0 # No points for this subtask
required: TLE # ... because some testcase timed out
allowed: [AC, TLE] # ... rather than any WAs
---
# The common abbreviations work here as well, you probably want to write this instead:
bruteforce.py:
secret/subtask1: full score # could write "accepted" as well in this case
secret/subtask2: time limit exceeded # this is more informative than "0" And here’s the schema: #registry
#registry: close({ [string]: #root })
#test_case_verdict: "AC" | "WA" | "RTE" | "TLE"
#root: #common | #range | {
#expectations
[=~"^(sample|secret)"]: #expectations
}
#expectations: {
#common
#range
allowed?: // only these verdicts may appear
#test_case_verdict | [...#test_case_verdict]
required?: // at least one of these verdicts must appear
#test_case_verdict | [...#test_case_verdict]
judge_message?: string // this judgemessage must appear
score?: #range
fractional_score?: #fractional_range
}
#common: "accepted" | // { allowed: AC; required: AC }
"wrong answer" | // { allowed: [AC, WA]; required: WA }
"time limit exceeded" | // { allowed: [AC, TLE]; required: TLE }
"runtime exception" | // { allowed: [AC, RTE]; required: RTE }
"does not terminate" | // { allowed: [AC, TLE, RTE]; required: [RTE, TLE] }
"not accepted" | // { required: [RTE, TLE, WA] }
"full score" // { fractional_score: 1.0 }
#range: number | [number, number]
#fractional_range: #fraction | [#fraction, #fraction]
#fraction: float & >=0.0 & <=1.0
|
Draft implementation of 0.4.0 is now underway at https://github.com/thorehusfeldt/BAPCtools/blob/feat/expectations/bin/expectations.py , with quite thorough comments and python doctest. Eventually, I hope to merge this into https://github.com/RagnarGrootKoerkamp/BAPCtools … |
@thorehusfeldt I have a few questions... Could you explain what you mean by "the verdict of a test group is no longer a meaningful concept."? Is the meaning of
What happens if a test data directory is named "allowed", "required", "score", "judge_message", or "fractional_score"?
Are these the same?:

partial.py:
  secret/subtask2:
    score: 13
---
partial.py:
  secret/subtask2: 13

Are globs allowed for both file name paths and test data paths? Can you specify a top level expectation, and also a group expectation on the same file? What does that look like? Could you elaborate a bit on why
You say that this is orthogonal to the current directory structure, but what is your thinking about how that should work together? I can see 3 obvious options:
|
Thank you so much for your comments, @niemela ; good questions all. I’ll respond in detail in a moment here, but many of the issues are addressed in the documentation of the pull request at BAPC-tools: https://github.com/thorehusfeldt/BAPCtools/blob/feat/expectations/doc/expectations.md |
With the changes to “how grading works” suggested mainly by @jsannemo , there is now no longer “a verdict for a test group”, in particular for scoring problems. Scoring problems have a “score for a testgroup”. (In
Yes, exactly. Note that the framework does not offer a way of saying “all of TLE, RTE need to appear”. It’s trivial to add, but I don’t see a use case for it, so I haven’t added it. That’s exactly the kind of decision I want feedback on.
This can never become a problem. Test data directory paths start with sample or secret; the schema key is

[=~"^(sample|secret)"]: #expectations

So you could have a test data directory whose full path name is
Yes. But note that the syntax for scoring problems is at best temporary right now (because the specification for scoring problems is very much in flux.)
Great question! I distinguish between submission patterns and test data patterns. Here are some submission patterns:

accepted
accepted/
accepted/th.py
mixed/fredrik.cpp
mixed/greedy-

They match as prefixes, in python. Here are some typical test data patterns:

sample:              # match data/sample/*
secret/huge:         # match both case secret/huge-instance-934 and group secret/huge/023-graph/
secret/group1/:      # match subtask data/secret/group1
secret/\d+-monotone: # match secret/034-monotone
secret/(?!greedy):   # match all testcases not starting with secret/greedy

Test case matches are as regexen from the start of the string, using python's
In 99% of the cases for pass/fail problems, you'll only ever specify one-string submission patterns like
Yes, you can. The specification has no opinion on where the expectations are specified, but you can do it in

wrong_answer/: wrong answer # this is the default
wrong_answer/th:
  sample: accepted # Thore promises that all his submissions get accepted on sample
wrong_answer/th-greedy: # Also promises that th-greedy.py...
  secret/\d+greedy-killer: [WA] # ... gets exactly WA on that specific test case

The registry assembles all the matching prefixes; in my draft implementation this is the method
In the pydoc comments you can see a complete and concrete example for how that could work in a concrete implementation.
I just try to be consistent with the direction that the
All of your options work; I have no strong opinions either way and don't want it to get in the way of specifying the expectations framework. I would assume that a tool would at least warn about any unexpected behaviour, such as an author violating the meaning of those submission directories that are specified in the problem package specification. Or inconsistencies about "what the yaml says" and "what the @EXPECTED_RESULTS@ say" and "which directory the submission is in". If I were to decide I'd go with
That’s very easy and clean to specify and implement. But: no strong opinion; my ambition is to allow us to specify expectations, not to decide how various traditions or communities or tools should behave. |
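The prefix-assembly idea mentioned above can be sketched like this (the registry contents and the function name are invented; the draft implementation's method is not reproduced here):

```python
# Hypothetical registry: submission patterns mapping to expectations.
registry = {
    "wrong_answer/": "wrong answer",
    "wrong_answer/th": {"sample": "accepted"},
}

def expectations_for(submission_path):
    """Collect every expectation whose pattern is a prefix of the path."""
    return [exp for pattern, exp in registry.items()
            if submission_path.startswith(pattern)]
```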
(@thorehusfeldt Thanks for the numbering, I should have done that.)

1. I disagree somewhat with your claims/reasoning, but now I understand what you mean, and the disagreement is not relevant for this question. So, good enough.
2. Good. I also don't see a natural use case for "require all of these to appear". That said, the way I have understood the expectations is that they are completely "additive", so wouldn't the following (if it was legal YAML) actually mean "require both of WA and AC"?:

wrong.py:
  allowed: WA
  allowed: AC

That said.... this would be legal (but very ugly), and mean the same thing:

wrong.p:
  allowed: WA
wrong.py:
  allowed: AC

Right?

3. Ah, that makes sense, and solves the problem.
4. Ok. I don't think it is that much in flux anymore, and we really should nail this down as well.
5. So, if I understand you correctly; submission patterns are only (and always) prefixes, test data patterns are always prefixes, but also regexps? Follow up question is, why do we feel that test data patterns need to be more powerful? Why is it not enough to have them also be prefixes only?
6. Ok. My current feeling is that fractional_score is not that useful. I would lean towards leaving it out for now.
7. Ok. I believe your stated (slight) preference is my option 3. My slight preference would be option 2.
|
But fractional score reduces work in case you reassign subtask scores.
That's why I would use fractional score over score.
|
I went through @niemela's questions on the phone with him, but let me repeat point 5 here: I want to be able to match testcase
One could argue for a much more restrictive subset of regexen (such as
To summarise, the two different ways of matching are very easily expressed in Python:
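The comparison itself did not survive in this rendering; here it is reconstructed as a sketch in my own phrasing: prefix matching via str.startswith, regex matching via re.match anchored at the start. A negative lookahead expresses "everything except", which plain prefixes cannot.

```python
import re

# Prefix matching vs. anchored regex matching. The lookahead pattern below
# matches every secret testcase except those starting with secret/greedy.
def prefix_match(pattern, s):
    return s.startswith(pattern)

def regex_match(pattern, s):
    return re.match(pattern, s) is not None
```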
|
I'm not really a fan of the inconsistent matching rules for submissions (prefix) vs testcases (regex).
|
Thank you, Ragnar! I have no strong feelings about “how to match” and am more than willing to have this go in any direction.
One more thing: Unlike many other decisions, globbing vs. regexen are not compatible. This is in contrast to deciding that submission patterns match by prefix and later extending the framework to instead match by regex when we get some experience and feel there's a need for it. That extension would be backwards compatible. |
But it's prefix for submissions and prefix with regex for test cases. Right?
I agree that it would be sufficient for both. I don't see how it's necessary for submission.
I absolutely don't think this is needed. You very seldom have more than a handful of submissions anyway. I'm not even sure I think it would be a good thing. What rule were you imagining to apply to this hypothetical set of files? Allowing it to TLE? That wouldn't work with the assumed rule on
Yes, this is the example that convinced me that something more than only prefix matching is needed for test cases.
In the current suggestion, as I understand it, either would work.
So you would suggest
|
Indeed, I think the full string vs. prefix-only may depend a bit on whether we go with regex or globs. |
Thank you all for giving this serious thought! Wrt globbing, keep in mind common cases, like

accepted: accepted
wrong_answer: wrong answer
...

compared to

accepted/*: accepted
wrong_answer/*: wrong answer
...

or even

wrong_answer:
  sample: accepted

compared to

wrong_answer/*:
  sample/*: accepted

My ambition is to keep the common cases clean and unencumbered, while making advanced situations possible. Python's |
Closing, since newer iterations of this proposal are being discussed. |
Background: I presented an expectations proposal in Lund in July and received a lot of useful feedback. Moreover, the definitions of test groups, grading, and scoring have evolved a lot, so that much of the framework is now obsolete. In particular,
I’ll try to summarise a new proposal here, hopefully in line with the working group’s preferences.
To solidify terminology, testgroup here means a directory in data (including data itself). data is a testgroup, so is data/sample, the directory data/secret/huge_instances/overflow/ in a pass-fail task, and the subtask data/secret/group1 in an IOI-style problem.

Expectations are for testcases, not testgroups
The conceptually largest change is that expectations are specified for (sets of) testcases, not for testgroups. In particular, an expectation given for path pertains to the testcases matching path/**.

For instance, in a pass-fail problem, you can write
Required and allowed

As far as I can tell, we need to be able to specify both required and allowed testcase verdicts. The above syntax seems less verbose than the alternative. The difference becomes particularly striking in IOI-style problems with subtasks. (Try it.) I have no strong feelings about the names of keys, but required and allowed seem clear to me. The semantics is that if $V$ is the set of verdicts for the testcases below the specified testgroup, then $R\cap V\neq\emptyset$ and $V\subseteq A$. I played around with none, any, all, but it didn't become clearer or shorter. Suggestions are welcome (but try them out first by actually writing the resulting YAML expressions.)

[Update: better syntax, see two posts down]
# Useful shorthands:

is a shorthand for

which is the most important usecase. Also, string is a shorthand for the singleton [string].

Full schema
With subtasks
The most important use case for me is to specify expected behaviour on subtasks. This becomes less natural than in my original proposal (where the concept “testgroup verdict” existed.)
Now we’re at:
This is quite verbose, but I can’t find a way to make it shorter. Feel free to try.
Scoring
Currently I’m at
full
is important because I don't want to remember the values in testdata.yaml
when the score for subtask 1 changes; the value full
communicates more to the reader than 23.
.Q1: Should we instead have a fraction here, such as
score: 1.0
meaningfull
andscore: [.2, .45]
meaning "this gets between 20% and 45% of the full value for this subtask"? This sounds more useful to me.

Judgemessages
I want to allow judgemessages as well, which doesn’t change the schema (just add
| string
to#verdict
): I think this will make it much easier to construct custom validators (because you can check for full code coverage in your validator.)
Toplevel group name
Consider
This is (as far as I can tell) the best way of specifying “this is a WA submission that passes on sample”. But the role of this example is to highlight the fact that the toplevel directory doesn’t have a good name.
Q2: what should be done about this?
- "" or maybe "." are perfectly fine names for data when you actually need them. (Which is seldom, mostly it follows from the descendant verdicts anyway so you're just being sloppy.)
- Prepend data/ to all testgroup names, so it's data/sample etc. from now on.
- data means data and sample means data/sample. If authors have both data/secret/foo and data/secret/baz/foo then they have themselves to blame.

Bespoke Verdict Syntax Would Get Rid of Lists and Required / Allowed
An alternative would be to not have the required and allowed keys and instead bake the expected behaviour into the terminology. After all, there is only a constant number of pairs $R$ and $A$ with $R\subseteq A$ that can ever appear, since $|A|\leq 4$. For instance, accepted means "must get exactly AC on all test cases", timeout means "AC and TLE are allowed, and TLE is required", and not_wrong means that WA is disallowed (everything else is OK). I guess there are at best 10 different actually-existing cases that ever need to be defined.

This would allow some very useful shorthands.
Q3: Is this sufficiently tempting to try to come up with a list of those cases, and think about good names?
Please comment.