RFC: Expectations proposal #112

thorehusfeldt · 2023-08-07T11:24:38Z

Background: I presented an expectations proposal in Lund in July and received a lot of useful feedback. Moreover, the definition of test groups, grading, and scoring have evolved a lot, so that much of the framework is now obsolete. In particular,

the verdict of a test group is no longer a meaningful concept.

I’ll try to summarise a new proposal here, hopefully in line with the working group’s preferences.

To solidify terminology, testgroup here means a directory in data (including data itself). data is a testgroup, so is data/sample, the directory data/secret/huge_instances/overflow/ in a pass-fail task, and the subtask data/secret/group1 in an IOI-style problem.

Expectations are for testcases, not testgroups

The conceptually largest change is that expectations are specified for (sets of) testcases, not for testgroups. In particular, an expectation is given for

in graph-theoretic terms: the testcases that are descendants of a testgroup path.
in filesystem-terms: the testcases of the form path/**.

For instance, in a pass-fail problem, you can write

accepted/th.py: [AC, WA] # final verdict of th.py is either WA or AC
wrong_answer/js.py:
  sample:
    allowed: AC # js.py must accept on sample
  secret:
    required: WA # js.py must get WA on at least one secret test case
mixed/alice.py: # funky example:
  subgroups:
    secret/huge_instances/overflow:
      required: [AC, TLE] # deeply nested testcase uses same semantics

Required and allowed

~~As far as I can tell, we need to be able to specify both required and allowed testcase verdicts. The above syntax seems less verbose than the alternative:~~

~~wrong_answer/js.py: data: sample: allowed: AC # js.py must accept on sample secret: required: WA # js.py must get WA on at least one secret test case~~

~~The difference becomes particular striking in IOI-style problems with subtasks. (Try it.)~~

I have no strong feelings about the names of keys, but required and allowed seem clear to me. The semantics is that if $V$ is the set of verdicts for the testcases below the specified testgroup then ~~R\subseteq V\subseteq A~~ $R\cap V\neq\emptyset$ and $V\subseteq A$. I played around with none, any, all, but it didn’t become clearer or shorter. Suggestions are welcome (but try them out first by actually writing resulting YAML expressions.)

[Update: better syntax, see two posts down]

~~# Useful shorthands:~~

submission_name: string

is a shorthand for

submission_name: allowed: string

which is the most important usecase. Also, string is a shorthand for the singleton [string].

Full schema

The schema is something like this, if this makes sense to you

[string] : // mixed/th.py string | // AC [...string] | // [AC, WA] number | // 23 [number, number] | // [23, 37] { // full map allowed?: [string]: #verdict required?: [string]: #verdict score?: [string]: number | [number, number] | "full" }

With subtasks

The most important use case for me is to specify expected behaviour on subtasks. This becomes less natural than in my original proposal (where the concept “testgroup verdict” existed.)

Now we’re at:

mixed/greedy.py: allowed: sample: AC # must pass samples (we’re sneaky and haven’t included sample that needs DP) secret/group1: AC secret/group2: AC secret/group3: [WA, AC] # should not crash or TLE required: secret/group3: WA # at least one testcase must fail

This is quite verbose, but I can’t find a way to make it shorter. Feel free to try.

Scoring

Currently I’m at

mixed/baz.py: score: full mixed/bar.py: score: 54 mixed/baf.py: score: [12, 20]

full is important because I don’t want to remember on the values in testdata.yaml when the score for subsask 1 changes; the value full communicates more to the reader than 23.

Q1: Should we instead have a fraction here, such as score: 1.0 meaning full and score: [.2, .45] meaning “this gets between 20% and 45% of the full value for this subtask? This sounds more useful to me.

Judgemessages

I want to allow judgemessages as well, which doesn’t change the schema (just add | string to #verdict):

wrong_answer/th.py: required: secret: "too many rounds" # this submission must fail with "too many rounds" on some instance

I think this will make it much easier to construct custom validators (because you can check for full code coverage in your validator.)

Toplevel group name

Consider

mixed/th.py: "": [AC, WA] sample: [AC] secret: allowed: [AC, WA] required: WA

This is (as far as I can tell) the best way of specifying “this is a WA submission that passes on sample”. But the role of this example is to highlight the fact that the toplevel directory doesn’t have a good name.

Q2: what should be done about this?

Nothing. "" or maybe "." are perfectly fine names for data when you actually need them. (Which is seldom, mostly it follows from the descendant verdicts anyway so you’re just being sloppy.)

Add data/ to all testgroup names, so it’s data/sample etc. from now on

Identify testgroup names by their last part. data means data and sample means data/sample. If authors have both data/secret/foo and data/secret/baz/foo then they have themselves to blame

something else

Bespoke Verdict Syntax Would Get Rid of Lists and Required / Allowed

An alternative would be to not have the required and allowed keys and instead bake in the expected behaviour into the terminology. After all, there is only a constant number of $R$ and $E$ with $R\subseteq A$ that can every appear since $|A|\leq 4$. For instance accepted means “must get exactly on all test cases”, but timeout means “AC and TLE are allowed, and TLE is required”, not_wrong means that WA is disallowed (everything else is OK). I guess there are at best 10 different actually-existing cases that ever need to be defined.

This would allow some very useful shorthands.

~~Q3: Is this sufficiently tempting to try to come up with a list of those cases, and think about good names?~~

Please comment.

The text was updated successfully, but these errors were encountered:

thorehusfeldt · 2023-08-07T11:32:35Z

Harping on about the bespoke syntax, the suggestion is orthogonal to the rest of the proposal if we just view it as a shorthand.

mixed/th.py: not wrong

is just a shorthand for

mixed/th.py:
  allowed: [AC, TLE, RTE]

and

mixed/th.py: time out

is a shorthand for

mixed/th.py:
  allowed: [AC, TLE]
  required: TLE

This sounds useful to me (and easy to implement and describe).

Tagl · 2023-08-07T16:12:57Z

For scoring, I definitely like the idea of full, or more generally, ranges supporting percentages.
I think the cases where I would use something other than 100% and 0% are rare, but I would use it, for example, in tasks which score heuristic algorithms. The algorithms could even be non-deterministic which makes exact score expectations unreliable.

For judgemessage.txt we would have to support regex or something like it, not just a pure string matching. For example, the contents may include formatted messages describing the index which caused the wrong answer verdict.

Sidenotes:

There is a missing space after allowed in the section Required and allowed
In your comment about shorthands, accepted should be allowed?

thorehusfeldt · 2023-08-08T09:48:46Z

After some more feedback, here’s another iteration (0.3.0).

First, some examples:

# Simple examples for some common cases

a.py: accepted            # AC on all cases
b.py: wrong answer        # at least one WA, otherwise AC
c.py: time limit exceeded # at least one TLE, otherwise AC
d.py: runtime exception   # at least one RTE, otherwise AC
e.py: does not terminate  # at least one RTE or TLE, but no WA
f.py: not accepted        # at least one RTE, TLE, or WA
g.py: full score          # gets max_score

# Other common cases can be easily added to the specification, 
# tell me if one comes to mind.
#
# Abbreviations are just shorthands for richer maps 
# of "required" and "allowed" keys.
#
# For instance, these are the same:

th.py: accepted
---
th.py:
  allowed: AC
  required: AC
---

# A submission failed by the output validator on some testcase
# These are the same:

wrong.py: wrong answer
---
wrong.py:
  allowed: [WA, AC]
  required: WA
---
wrong.py:
  allowed: # altenative yaml syntax for list of strings
    - WA
    - AC
  required: WA
---

# Specify that the submission fails, but passes the samples.
# These are the same, using the same abbreviations as
# above for "accepted" and "wrong answer"

wrong.py:
  sample: accepted
  secret: wrong answer
---
wrong.py:
  sample: 
    allowed: AC
    required: AC
  secret:
    allowed: [AC, WA]
    required: WA

# Specification for subgroups works "all the way down"
# though it's seldomly needed:

funky.cpp:
  secret:
    subgroups:
      huge_instances:
        subgroups:
          disconnected_graph:
            allowed: [RTE, TLE]
            required: TLE
---
# Same as:

funky.cpp:
  secret:
    subgroups:
      huge_instances/disconnected_graph:
          allowed: [RTE, TLE]
          required: TLE

# Can also specify a required judgemessage, not only verdicts

linear_search.py:
  judge_message: "too many rounds" # matches judgemessage.txt as substring, case-insensitive

#########
# Scoring
#########

# simplest case:
th.py: full score

# Partial solutions can be given in various ways:
partial.py: [50, 60]
---
partial.py: 60
---
partial.py:
  score: 60
---
partial.py:
  score: [50, 60]
---
partial.py:
  fractional_score: [.5, .6] # percentage of full score
---
# For subtasks, you probably want to specify the
# outcome per subgroup. You need to be more verbose:
partial.py:
  secret:
    subgroups:
      subtask1: full score
      subtask2: 0
      subtask3: 0
---
# Can be even more verbose about scores
partial.py:
  secret:
    subgroups:
      subtask1: full score
      subtask2:
        score: 13   # absolute score on this group is exactly 13
      subtask3: # between 10% and 40% of (full score for subtask 3)
        fractional_score: [.1, .4] 
---
# Can still specify testcases
bruteforce.py:
  secret:
    subgroups:
      subtask1: full score # subtask 1 has small instances
      subtask2:
        score: 0      # No points for this subtask
        required: TLE # ... because some testcase timed out
        allowed: [AC, TLE] # ... rather than any WAs
---
# The common abbreviations work here as well, you probably want to write this instead:
bruteforce.py:
  secret:
    subgroups:
      subtask1: full score # could write "accepted" as well in this case
      subtask2: time limit exceeded # this is more informative than "0"

And here’s the schema:

#registry

#registry: close({ [string]: #root })

#test_case_verdict: "AC" | "WA" | "RTE" | "TLE" 

#root: #common | #range |  { 
    #expectations
    sample?: #subgroup
    secret?: #subgroup
}

#subgroup: #common | #range |  { 
    #expectations
    subgroups: close({ [string]: #subgroup })
}
    
#expectations: {
    allowed?:  // only these verdicts may appear
        #test_case_verdict | [...#test_case_verdict] | *["AC", "WA", "RTE", "TLE"] 
    required?: // at least one of these verdicts must appear
        #test_case_verdict | [...#test_case_verdict] 
    judge_message?: string // this judgemessage must appear
    score?: #range
    fractional_score?: #fractional_range
    }

#common: "accepted" |        // { allowed: AC; required: AC }
    "wrong answer" |         // { allowed: [AC, WA]; required: WA }
    "time limit exceeded" |  // { allowed: [AC, TLE]; required: TLE }
    "runtime exception" |    // { allowed: [AC, RTE]; required: RTE }
    "does not terminate" |   // { allowed: [AC, TLE, RTE}; required: [RTE, TLE]
    "not accepted" |         // { required: [RTE, TLE, WA] }
    "full score"             // { fractional_score: 1.0 }
    
#range: number | [number, number] 
#fractional_range: #fraction | [#fraction, #fraction]
#fraction: >= 0.0 | <= 1.0 | float

eldering · 2023-08-12T10:41:34Z

I don't have any strong opinions on the scoring part, but the verdict specification sounds fine to me.
I do think it's important to have the list of common expectations clearly documented and I'd strongly prefer that we then can use directories named after these to organize the submissions in; I realize that this is a not really part of this proposal, but good to keep in mind.

thorehusfeldt · 2023-08-13T06:21:42Z

Thank you, Jaap!

Clear documentation is exactly what I’m after.

The organisation into directories is orthogonal to the expectations proposal, but consistent with it.

In particular, I would like to allow globbing in submission pathnames. This means that the directory definitions in the legacy specification are the same as the expectations.yaml file

# Legacy specification expectations implicit in directory structure
accepted/: accepted
wrong_answer/: wrong answer
time_limit_exceeded/:
  allowed: [AC, WA, TLE]
  required: TLE
runtime_error/:
  required: RTE

(And as far as I understand, these expectations are not consistently implemented, nor do we agree that that they are what we want. My proposal should help us be much more clear about that.)

I’d be happy to include “default expectations for unlisted submissions”, but I’m not sure it’s part of the expectations proposal. (I’m willing to make it so, but as we observed, current judge systems already pursue different traditions about what the directories mean. My main ambitions is to facilitate the specification of expectations, rather than mandate requirements on implementations.)

Note that the globbing idea also allows Thore to promise that all his submissions pass the samples:

**/*-th.py:
  sample: accepted

Similarly, a head of jury could even require that everybody’s WA submissions pass all samples:

wrong_answer/*:
  sample: accepted

But globbing is science-fiction for now.

thorehusfeldt · 2023-09-07T12:51:46Z

Version 0.4.0

Changes since 0.3.0: flattened the hierarchy of subgroups to be just strings. Reason: simpler to read, write, and specify. Far less verbose in 99.9% of cases. Also added regular expression to allow for auto-numbered groups and testcases.

First, some examples:

# Simple examples for some common cases

a.py: accepted            # AC on all cases
b.py: wrong answer        # at least one WA, otherwise AC
c.py: time limit exceeded # at least one TLE, otherwise AC
d.py: runtime exception   # at least one RTE, otherwise AC
e.py: does not terminate  # at least one RTE or TLE, but no WA
f.py: not accepted        # at least one RTE, TLE, or WA
g.py: full score          # gets max_score

# submission are identified by prefix:

wrong_answer/: wrong answer # expectations "wrong answer" apply to "wrong_answer/th.py" etc.

# Abbreviations are just shorthands for richer maps 
# of "required" and "allowed" keys.
#
# For instance, these are the same:

th.py: accepted
---
th.py:
  allowed: AC
  required: AC
---

# A submission failed by the output validator on some testcase
# These are the same:

wrong.py: wrong answer
---
wrong.py:
  allowed: [WA, AC]
  required: WA
---
wrong.py:
  allowed: # altenative yaml syntax for list of strings
    - WA
    - AC
  required: WA
---

# Specify that the submission fails, but passes the samples.
# These are the same, using the same abbreviations as
# above for "accepted" and "wrong answer"

wrong.py:
  sample: accepted
  secret: wrong answer
---
wrong.py:
  sample: 
    allowed: AC
    required: AC
  secret:
    allowed: [AC, WA]
    required: WA

# Constraints apply to testcases in entire subtree of cases that match the string:
funky.cpp:
  allowed: [AC, WA, RTE]
  secret:
      allowed: [AC, RTE, TLE] # TLE is forbidden at ancestor, so this makes no sense
  secret/small: accepted # more restrictive than ancestor: this is fine
          
# Specification for subgroups works "all the way down to the tescase"
# though it's seldomly needed:
funky.cpp:
  secret/huge_instances/disconnected_graph:
          allowed: [RTE, TLE]
          
# Can also specify a required judgemessage, not only verdicts
linear_search.py:
  judge_message: "too many rounds" # matches judgemessage.txt as substring, case-insensitive
  
# Allow digit regex to catch auto-numbered groups: `\d+`

submission:py
  secret/\d+-group/: accepted # matches 02-group

#########
# Scoring
#########

# simplest case:
th.py: full score

# Partial solutions can be given in various ways:
partial.py: [50, 60]
---
partial.py: 60
---
partial.py:
  score: 60
---
partial.py:
  score: [50, 60]
---
partial.py:
  fractional_score: [.5, .6] # percentage of full score
---
# For subtasks, you probably want to specify the
# outcome per subgroup. You need to be more verbose:
partial.py:
  secret/subtask1: full score
  secret/subtask2: 0
  secret/subtask3: 0
---
# Can be even more verbose about scores
partial.py:
  secret/subtask1: full score
  secret/subtask2:
        score: 13   # absolute score on this group is exactly 13
  secret/subtask3: # between 10% and 40% of (full score for subtask 3)
        fractional_score: [.1, .4] 
---
# Can still specify testcases
bruteforce.py:
  secret/subtask1: full score # subtask 1 has small instances
  secret/subtask2:
        score: 0      # No points for this subtask
        required: TLE # ... because some testcase timed out
        allowed: [AC, TLE] # ... rather than any WAs
---
# The common abbreviations work here as well, you probably want to write this instead:
bruteforce.py:
  secret/subtask1: full score # could write "accepted" as well in this case
  secret/subtask2: time limit exceeded # this is more informative than "0"

And here’s the schema:

#registry

#registry: close({ [string]: #root })

#test_case_verdict: "AC" | "WA" | "RTE" | "TLE" 

#root: #common | #range |  { 
    #expectations
    [=~"^(sample|secret)"]: #expectations
}

#expectations: {
    #common
    #range
    allowed?:  // only these verdicts may appear
        #test_case_verdict | [...#test_case_verdict]
    required?: // at least one of these verdicts must appear
        #test_case_verdict | [...#test_case_verdict] 
    judge_message?: string // this judgemessage must appear
    score?: #range
    fractional_score?: #fractional_range
    }

#common: "accepted" |        // { allowed: AC; required: AC }
    "wrong answer" |         // { allowed: [AC, WA]; required: WA }
    "time limit exceeded" |  // { allowed: [AC, TLE]; required: TLE }
    "runtime exception" |    // { allowed: [AC, RTE]; required: RTE }
    "does not terminate" |   // { allowed: [AC, TLE, RTE}; required: [RTE, TLE]
    "not accepted" |         // { required: [RTE, TLE, WA] }
    "full score"             // { fractional_score: 1.0 }
    
#range: number | [number, number] 
#fractional_range: #fraction | [#fraction, #fraction]
#fraction: >= 0.0 | <= 1.0 | float

thorehusfeldt · 2023-09-08T08:08:22Z

Draft implementation of 0.4.0 is now underway at https://github.com/thorehusfeldt/BAPCtools/blob/feat/expectations/bin/expectations.py , with quite thorough comments and python doctest. Eventually, I hope to merge this into https://github.com/RagnarGrootKoerkamp/BAPCtools …

niemela · 2023-09-18T00:50:48Z

@thorehusfeldt I have a few questions...

Could you explain what you mean by "the verdict of a test group is no longer a meaningful concept."?

Is the meaning of allowed that all verdicts must be in the specified set, and required that at least some verdict must be in the specified set? (I'm pretty sure the answer is yes)

What happens if a test data directory is names "allowed", "required", "score", "judge_message", or "fractional_score"?

Are these the same?:

partial.py:
  secret/subtask2:
        score: 13
---
partial.py:
  secret/subtask2: 13

Are globs allowed for both file name paths and test data paths?

Can you specify a top level expectation, and also a group expectation on the same file? What does that look like?

Could you elaborate a bit on why fractional_score is worth inclusion? It seems to me that you could always do without, and it's not even that much simpler, but I'm sure I'm missing some use case.

You say that this is orthogonal to the current directory structure, but what is your thinking about how that should work together? I can see 3 obvious options:

directories have no special meaning, but you can choose to use directories and give them the semantics we are used to.
directories have special meaning, and we can express that meaning as implicit expectations, these can't be overridden (but additional requirements can be added to these directories or the files in them)
directories have special meaning, ..., and they can be overridden. I.e. the implicit expectations are only applied if there are no explicit expectations.
Which of these are you imagining, or is it something else?

thorehusfeldt · 2023-09-18T06:50:58Z

Thank you so much for your comments, @niemela ; good questions all. I’ll respond in detail in a moment here, but many of the issues are addressed in the documentation of the pull request at BAPC-tools: https://github.com/thorehusfeldt/BAPCtools/blob/feat/expectations/doc/expectations.md

thorehusfeldt · 2023-09-18T07:55:58Z

Could you explain what you mean by "the verdict of a test group is no longer a meaningful concept."?

With the changes to “how grading works” suggested mainly by @jsannemo , there is now no longer “a verdict for a test group”, in particular for scoring problems. Scoring problems have a “score for a testgroup”. (In legacy, there is such a concept, and September 2023 problemtools implements it.) Tools like DOMjudge apparently never had this concept; only the entire set of test data receives a “final verdict” (called the “outcome”.) I signal no opinion here on how this should be, I merely observe that “the verdict of a test group” is not a socially useful concept.

Is the meaning of allowed that all verdicts must be in the specified set, and required that at least some verdict must be in the specified set? (I'm pretty sure the answer is yes)

Yes, exactly. Note that the framework does not offer a way of saying “all of TLE, RTE need to appear”. It’s trivial to add, but I don’t see a use case for it, so I haven’t added it. That’s exactly the kind of decision I want feedback on.

What happens if a test data directory is names "allowed", "required", "score", "judge_message", or "fractional_score"?

This can never become a problem. Test data directory paths start with sample or secret, and therefore the test data pattern must start with those words. Ensured in the schema here:

 [=~"^(sample|secret)"]: #expectations

So you could have a test data directory whose full path name is <problempath>/data/secret/required/ if you really wanted to do something silly, but you still need to refer to it as, say, secret/required or a prefix of that; never as required.

Are these the same? […example…]

Yes. But note that the syntax for scoring problems is at best temporary right now (because the specification for scoring problems is very much in flux.)

Are globs allowed for both file name paths and test data paths?

Great question! I distinguish between submission patterns and test data patterns. Here are some submission patterns:

accepted
accepted/
accepted/th.py
mixed/fredrik.cpp
mixed/greedy-

They match as prefixes, in python str.startswith.

Here are some typical test data patterns:

sample: # match data/sample/*
secret/huge : # match both case secret/huge-instance-934 and group secret/huge/023-graph/
secret/group1/ : # match subtask data/sample/group1
secret/\d+-monotone: # match secret/034-monotone
secret/(?!greedy) : # match all testcase not starting with secret/greedy

Test case matches are as regexen from the start of the string, using python’s re.match.

In 99% of the cases for pass/fail problems, you’ll only ever specify one-string submission patterns like accepted or maybe accepted/, and you’ll only ever specify all testcases, maybe with particular attention to sample in some traditions. For scoring problems, you’d use secret/group1/ (or maybe secret/\d+-greedy/) a lot.

Can you specify a top level expectation, and also a group expectation on the same file? What does that look like?

Yes, you can. The specification has no opinion on where the expectations are specified, but you can do it in expectations.yaml:

wrong_answer/: wrong answer # this is the default, teh 
wrong_answer/th:
  sample: accepted # Thore promises that all his submissions get accepted on sample
wrong_answer/th-greedy: # Also promises that th-greedy.py...
  secret/\d+greedy-killer: [WA] # ... get exactly WA on that specific test case

The registry assembles all the matching prefixes; in my draft implementation this is the method Registy.for_path(submission_path): https://github.com/thorehusfeldt/BAPCtools/blob/b4d4df92e9b7bce2e1ccadff872168e6694a4b58/bin/expectations.py#L294

In the pydoc comments you can see a complete and concrete example for how that could work in a concrete implementation.

Could you elaborate a bit on why fractional_score is worth inclusion?

I just try to be consistent with the direction that the 2023-07 draft goes in, mainly following @jsannemo ’s ideas. Scoring problems are in flux right now; my current implementation at BAPCtools (which only does pass-fail anyway) does not touch this issue. These conversations are better had after the problem package specification converges.

You say that this is orthogonal to the current directory structure, but what is your thinking about how that should work together? I can see 3 obvious options:

All of your options work; I have no strong opinions either way and don’t want it to get in they way of specifying the expectations framework.

I would assume that a tool would at least warn about any unexpected behaviour, such as an author violating the meaning of those submission directories that are specified in the problem package specification. Or inconsistencies about “what the yaml says” and “what the @EXPECTED_RESULTS@ say” and “which directory the submission is in”. If I were to decide I’d go with

a default expectations.yaml file with minimal contents (exactly four lines, one for each directory in the problem package specification).
If no such file exists, then the tool is expected to fall back on that default file.
Problem authors can do what they want in their own expectations.yaml.

That’s very easy and clean to specify and implement. But: no strong opinion; my ambition is to allow us to specify expectations, not to decide how various traditions or communities or tools should behave.

niemela · 2023-09-18T12:52:43Z

(@thorehusfeldt Thanks for the numbering, I should have done that.)

I disagree somewhat with your claims/reasoning, but now I understand what you mean, and the disagreement is not relevant for this question. So, good enough.
Good. I also don't see a natural use case for "require all of these to appear". That said the way I have understood the expectations are that they are completely "additive", so wouldn't the following (if it was legal YAML) actually mean "require both of WA and AC"?:

 wrong.py:
  allowed: WA 
  allowed: AC

That said.... this would be legal (but very ugly), and mean the same thing:

 wrong.p:
  allowed: WA 
 wrong.py:
  allowed: AC

Right?

Ah, that makes sense, and solves the problem.
Ok. I don't think it that much in flux anymore, and we really should nail this down as well.
So, if I understand you correctly; submission patterns are only (and always) prefixes, test data patterns are always prefixes, but also regexps?

Follow up question is, why do we feel that test data patterns needs to be more powerful? Why is it not enough to have them also be prefixes only?

Ok. My current feeling is that fractional_score is not that useful. I would lean towards leaving it out for now.
Ok. I believe you stated (slight) preference is my option 3. My slight preference would be option 2.

Tagl · 2023-09-18T13:41:43Z

But fractional score reduces work in case you reassign subtask scores. That's why I would use fractional score over score.

…

On Mon, Sep 18, 2023, 12:52 Fredrik Niemelä ***@***.***> wrote: ***@***.*** <https://github.com/thorehusfeldt> Thanks for the numbering, I should have done that.) 1. I disagree somewhat with your claims/reasoning, but now I understand what you mean, and the disagreement is not relevant for this question. So, good enough. 2. Good. I also don't see a natural use case for "require *all* of these to appear". That said the way I have understood the expectations are that they are completely "additive", so wouldn't the following (if it was legal YAML) actually mean "require both of WA and AC"?: wrong.py: allowed: WA allowed: AC That said.... this would be legal (but very ugly), and mean the same thing: wrong.p: allowed: WA wrong.py: allowed: AC Right? 1. Ah, that makes sense, and solves the problem. 2. Ok. I don't think it that much in flux anymore, and we really should nail this down as well. 3. So, if I understand you correctly; submission patterns are only (and always) prefixes, test data patterns are always prefixes, but *also* regexps? Follow up question is, why do we feel that test data patterns needs to be more powerful? Why is it not enough to have them also be prefixes only? 1. Ok. My current feeling is that fractional_score is not *that* useful. I would lean towards leaving it out for now. 2. Ok. I believe you stated (slight) preference is my option 3. My slight preference would be option 2. — Reply to this email directly, view it on GitHub <#112 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB2ZBKT6VOUQ7GSUBZPEKCDX3A72PANCNFSM6AAAAAA3G273FM> . You are receiving this because you commented.Message ID: ***@***.***>

thorehusfeldt · 2023-09-18T16:21:18Z

I went through @niemela ’s question of the phone with him, but let me repeat point 5 here:

I want to be able to match testcase secret/053-all-points-on-a-line as well as the testgroup secret/1-tiny/, even after these two things have changed their numbering. This is relevant for those generation tools (like BAPCtools generators or Kodsport testdata tools) that automatically generate numbers for testcases or testdata groups. The problem author thinks of these as “the testcase I called all-points-on-a-line”, or “the testgroup with tiny inputs”, not testcase 053 etc.

One could argue for a much more restrictive subset of regexen (such as \d+ and -?, but not much else) but it’s harder to specify and implement. As a bonus, I get so refer to “all test cases not starting with foo” as secret/(?!foo), for which I can see use cases, like preparing a faulty submission that should work always, except for testcases with…

To summarise, the two different ways of matching are very easily expressed in Python:

submissionpath.startswith(submissionpattern)
re.match(testdatapath, testdatapattern)

RagnarGrootKoerkamp · 2023-09-18T22:03:05Z

I'm not really a fan of the inconsistent matching rules for submissions (prefix) vs testcases (regex).

Personally I think (an equivalent of) simple *-globs would be necessary and sufficient for me in both cases. This allows

accepted/*-cubic.cpp
secret/*-slow

I have mixed feelings about matching against prefixes for submissions:

You definitely want to specify directories like accepted/. [Should this end in a /?]
Specifying accepted/ragnar feels very implicit/incomplete. Doing accepted/ragnar* feels nicer and clearer to me.

So for submissions we match against the full file (/program) name, while for testcases we skip the .in. (I'm OK with that.)

thorehusfeldt · 2023-09-19T07:49:42Z

Thank you, Ragnar! I have no strong feelings about “how to match” and am more than willing to have this go in any direction.

I can see the conceptual attraction of having only one pattern matching convention, instead of treating submissions and test data differently. Whether that should be prefixes, globs, or regexen is less clear to me.
Regexen allow me to do secret/(?!huge), which I would find super useful. (See the my screenshot at BAPCtools pull request 307 for a concrete case.) They are both more expressive (I can target things that I can’t get at with globs) but also more restrictive (\d+ is less permissive than *). I remain mildly in favour of them. On the other hand, a strong argument for globs is that their semantics are easy, and that we can syntax-check them. If you dislike accepted/ragnar then you are free to write accepted/ragnar.* to make your intention clear to readers who share your aesthetics. But note that in 90% of the cases this is not an issue, whereas I expect to see many accepted and maybe accepted/ (nice and clean) that would have to be accepted/* (clear but busy) if we go with globbing.
I tried to be consistent about submission paths having a suffix (accepted/th.py is not accepted/th.cpp) but test case paths not having it. (A test case is not its input file; in particular there are interactive test cases that don’t even have input files.) So a test case is identifed by secret/huge/054-random.

One more thing: Unlike many other decisions, globbing vs. regexen are not compatible. This is in contrast to deciding that submission patterns match by prefix but later extend the framework to instead match by regex when we get some experience and feel there’s a need for it. That extension would be backwards compatible, accepted/ would still match accepted/ragnar.cpp. But if we go for globbing now and later decide for regexen then this would not be backwards compatible; accepted/* would no longer match accepted/ragnar.cpp. So glob vs. regex is more important to get right from the start than prefix vs. regex.

niemela · 2023-09-19T12:34:59Z

I'm not really a fan of the inconsistent matching rules for submissions (prefix) vs testcases (regex).

But it's prefix for submissions and prefix with regex for test cases. Right?

1. Personally I think (an equivalent of) simple `*`-globs would be necessary and sufficient for me in both cases. This allows

I agree that it would be sufficient for both. I don't see how it's necessary for submission.

* `accepted/*-cubic.cpp`

I absolutely don't think this is needed. You very seldom have more than a handful submissions anyway.

I'm not even sure I think would be a good thing. What rule were you imagining to apply this this hypothetical set of files? Allowing it to TLE? That wouldn't work with the assumed rule on accepted.

* `secret/*-slow`

Yes, this is the example that convinced me that something more than only prefix matching is needed for test cases.

2. I have mixed feelings about matching against prefixes for submissions:

* You definitely want to specify directories like `accepted/`. [Should this end in a `/`?]

In the current suggestion, as I understand it, either would work.

* Specifying `accepted/ragnar` feels very implicit/incomplete. Doing `accepted/ragnar*` feels nicer and clearer to me.

So you would suggest *-globs on the full string, and no prefix matching, right? I think that would work.

3. So for submissions we match against the full file (/program) name, while for testcases we skip the `.in`. (I'm OK with that.)

RagnarGrootKoerkamp · 2023-09-19T12:53:10Z

Indeed, accepted/*-cubic.cpp isn't very much needed. I think it's nice to have but won't insist on it.

I think the full string vs. prefix-only may depend a bit on whether we go with regex or globs. secret/slow* glob to match full strings is fine to me, but secret/slow.* as a regex is a bit ugly, and also confusing since this is different from secret/slow..* 🤔

thorehusfeldt · 2023-09-19T17:55:50Z

Thank you all for giving this serious thought!

Wrt globbing, keep in mind common cases, like

accepted: accepted
wrong_answer: wrong answer
...

compared to

accepted/*: accepted
wrong_answer/*: wrong answer
...

or even

wrong_answer:
  sample: accepted

compared to

wrong_answer/*:
  sample/*: accepted

My ambition is to keep the common cases clean and unencumbered, while making advanced situations possible. Python’s re.match(pattern, string) does exactly that: it’s simultaneously light-weight and powerful. Globbing is slighly worse on both counts.

niemela · 2023-12-01T07:59:05Z

Closing since there are newer proposals of this being discussed.

meisterT mentioned this issue Dec 1, 2023

Expectations 0.6: Framework for specifying expectations in terms of timings and verdicts #135

Closed

niemela closed this as completed Dec 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Expectations proposal #112

RFC: Expectations proposal #112

thorehusfeldt commented Aug 7, 2023 •

edited

Loading

Full schema

With subtasks

Scoring

Judgemessages

Toplevel group name

Bespoke Verdict Syntax Would Get Rid of Lists and Required / Allowed

thorehusfeldt commented Aug 7, 2023 •

edited

Loading

Tagl commented Aug 7, 2023

thorehusfeldt commented Aug 8, 2023 •

edited

Loading

eldering commented Aug 12, 2023

thorehusfeldt commented Aug 13, 2023 •

edited

Loading

thorehusfeldt commented Sep 7, 2023 •

edited

Loading

thorehusfeldt commented Sep 8, 2023

niemela commented Sep 18, 2023

thorehusfeldt commented Sep 18, 2023

thorehusfeldt commented Sep 18, 2023 •

edited

Loading

niemela commented Sep 18, 2023

Tagl commented Sep 18, 2023 via email

thorehusfeldt commented Sep 18, 2023 •

edited

Loading

RagnarGrootKoerkamp commented Sep 18, 2023

thorehusfeldt commented Sep 19, 2023 •

edited

Loading

niemela commented Sep 19, 2023

RagnarGrootKoerkamp commented Sep 19, 2023

thorehusfeldt commented Sep 19, 2023 •

edited

Loading

niemela commented Dec 1, 2023

RFC: Expectations proposal #112

RFC: Expectations proposal #112

Comments

thorehusfeldt commented Aug 7, 2023 • edited Loading

Expectations are for testcases, not testgroups

Required and allowed

Full schema

With subtasks

Scoring

Judgemessages

Toplevel group name

Bespoke Verdict Syntax Would Get Rid of Lists and Required / Allowed

thorehusfeldt commented Aug 7, 2023 • edited Loading

Tagl commented Aug 7, 2023

thorehusfeldt commented Aug 8, 2023 • edited Loading

eldering commented Aug 12, 2023

thorehusfeldt commented Aug 13, 2023 • edited Loading

thorehusfeldt commented Sep 7, 2023 • edited Loading

thorehusfeldt commented Sep 8, 2023

niemela commented Sep 18, 2023

thorehusfeldt commented Sep 18, 2023

thorehusfeldt commented Sep 18, 2023 • edited Loading

niemela commented Sep 18, 2023

Tagl commented Sep 18, 2023 via email

thorehusfeldt commented Sep 18, 2023 • edited Loading

RagnarGrootKoerkamp commented Sep 18, 2023

thorehusfeldt commented Sep 19, 2023 • edited Loading

niemela commented Sep 19, 2023

RagnarGrootKoerkamp commented Sep 19, 2023

thorehusfeldt commented Sep 19, 2023 • edited Loading

niemela commented Dec 1, 2023

thorehusfeldt commented Aug 7, 2023 •

edited

Loading

thorehusfeldt commented Aug 7, 2023 •

edited

Loading

thorehusfeldt commented Aug 8, 2023 •

edited

Loading

thorehusfeldt commented Aug 13, 2023 •

edited

Loading

thorehusfeldt commented Sep 7, 2023 •

edited

Loading

thorehusfeldt commented Sep 18, 2023 •

edited

Loading

thorehusfeldt commented Sep 18, 2023 •

edited

Loading

thorehusfeldt commented Sep 19, 2023 •

edited

Loading

thorehusfeldt commented Sep 19, 2023 •

edited

Loading