
[experimental] Run crosshair in CI #4034

Draft: Zac-HD wants to merge 4 commits into master from the crosshair-in-ci branch

Conversation

@Zac-HD (Member) commented Jul 7, 2024

See #3914

To reproduce this locally, you can run make check-crosshair-cover/nocover/niche, which runs the same command as CI, but I'd recommend pytest --hypothesis-profile=crosshair hypothesis-python/tests/{cover,nocover,datetime} -m xf_crosshair --runxfail to select and run only the xfailed tests.
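
For illustration, a minimal sketch of what such a crosshair settings profile could look like - the profile Hypothesis actually registers for CI may differ; only the backend switch and the two suppressed health checks (mentioned further down) are taken from this thread:

from hypothesis import HealthCheck, settings

# Sketch: register a "crosshair" profile that swaps the backend (this needs
# the hypothesis-crosshair plugin installed) and relaxes only the health
# checks that symbolic execution tends to trip.
settings.register_profile(
    "crosshair",
    backend="crosshair",
    suppress_health_check=[HealthCheck.too_slow, HealthCheck.filter_too_much],
)

# Selected with pytest --hypothesis-profile=crosshair, or programmatically:
settings.load_profile("crosshair")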

Hypothesis' problems

  • The vast majority of failures are Flaky: Inconsistent results from replaying a failing test... - mostly backend-specific failures; we've both
    • improved reporting in this case to show the crosshair-specific traceback
    • got most of the affected tests passing
  • Invalid internal boolean probability, e.g. "hypothesis/internal/conjecture/data.py", line 2277, in draw_boolean assert p > 2 ** (-64), fixed in 1f845e0 (#4049)
  • Many of our test helpers involved nested use of @given; fixed in 3315be6
  • Symbolic values escaping their tracing context ("symbolic outside context")
  • Avoid uninstalling typing_extensions when crosshair depends on it
  • Tests which are not really expected to pass on other backends. I'm slowly applying a backend-specific xfail decorator to them, @xfail_on_crosshair(...) - see the sketch after this list.
    • Tests which expect to raise a healthcheck, and fail because our crosshair profile disables healthchecks. Disable only .too_slow and .filter_too_much, and skip the remaining affected tests under crosshair.
    • Undo some over-broad skips (e.g. various xfail decorators, pytestmarks, -k 'not decimal') once we're closer
  • provide a special exception type for when running the test or realizing values would hit a PathTimeout; see Rare PathTimeout errors in provider.realize(...) pschanely/hypothesis-crosshair#21 and Stable support for symbolic execution #3914 (comment)
    • and something to signal that we've exhausted Crosshair's ability to explore the test. If this is sound, we've verified the function and can stop! (and should record that in the stop_reason). If unsound, we can continue testing with Hypothesis' default backend - so it's important to distinguish.
      Add BackendCannotProceed to improve integration #4092
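
For reference, a minimal sketch of what a backend-specific xfail decorator along the lines of @xfail_on_crosshair could look like; the Why categories and the environment-variable check are illustrative assumptions, not the helper actually added in this PR:

import enum
import os

import pytest

class Why(enum.Enum):
    # Illustrative triage categories; the PR's real enum may differ.
    symbolic_outside_context = "symbolic value escaped the tracing context"
    undiscovered = "crosshair did not find the failing example within budget"
    other = "not yet triaged"

def xfail_on_crosshair(why, strict=False):
    # Assumption: CI exposes the active profile name via HYPOTHESIS_PROFILE.
    running_crosshair = os.environ.get("HYPOTHESIS_PROFILE") == "crosshair"
    return pytest.mark.xfail(running_crosshair, reason=why.value, strict=strict)

A test marked with e.g. @xfail_on_crosshair(Why.undiscovered) then runs normally on the default backend and is treated as an expected failure under the crosshair profile.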

Probably Crosshair's problems

Error in operator.eq(Decimal('sNaN'), an_int)

____ test_rewriting_does_not_compare_decimal_snan ____
  File "hypothesis/strategies/_internal/strategies.py", line 1017, in do_filtered_draw
    if self.condition(value):
TypeError: argument must be an integer
while generating 's' from integers(min_value=1, max_value=5).filter(functools.partial(eq, Decimal('sNaN')))

Cases where crosshair doesn't find a failing example but Hypothesis does

Seems fine, there are plenty of cases in the other direction. Tracked with @xfail_on_crosshair(Why.undiscovered) in case we want to dig in later.

Nested use of the Hypothesis engine (e.g. given-inside-given)

This is just explicitly unsupported for now. Hypothesis should probably offer some way for backends to declare that they don't support this, and then raise a helpful error message if you try anyway.
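
For illustration only, such a capability flag plus helpful error might look roughly like this; supports_nested_given and the check below are hypothetical, not an existing Hypothesis API:

class ExampleSymbolicProvider:
    # Hypothetical capability flag a backend could expose.
    supports_nested_given = False

def check_nested_given(provider, already_running_test):
    # Raise a clear error instead of a confusing downstream failure when a
    # @given-decorated test calls another @given-decorated function.
    if already_running_test and not getattr(provider, "supports_nested_given", True):
        raise RuntimeError(
            f"The {type(provider).__name__} backend does not support nested "
            "use of @given; draw all inputs in the outer test instead."
        )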

@Zac-HD added the tests/build/CI (about testing or deployment *of* Hypothesis) and interop (how to play nicely with other packages) labels on Jul 7, 2024
@tybug

This comment was marked as outdated.

@Zac-HD

This comment was marked as outdated.

@Zac-HD force-pushed the crosshair-in-ci branch 3 times, most recently from 175b347 to 424943f on July 7, 2024 at 20:26
@Zac-HD force-pushed the crosshair-in-ci branch from 424943f to b2d11c7 on July 7, 2024 at 20:56
@pschanely (Contributor)

@Zac-HD your triage above is SO great. I am investigating.

@pschanely (Contributor) commented Jul 8, 2024

Knocked out a few of these in 0.0.60.
I think that means current status on my end is:

  • TypeError: conversion from SymbolicInt to Decimal is not supported
  • Unsupported operand type(s) for -: 'float' and 'SymbolicFloat' in test_float_clamper
  • TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object (or 'values' or 'items').
  • TypeError: _int() got an unexpected keyword argument 'base'
  • Symbolic not realized (in e.g. test_suppressing_filtering_health_check)
  • Error in operator.eq(Decimal('sNaN'), an_int)
  • Zac's cursed example below!

More soon.

@Zac-HD force-pushed the crosshair-in-ci branch from b2d11c7 to 98ccf44 on July 11, 2024 at 07:23
@Zac-HD (Member, Author) commented Jul 12, 2024

Ah - the Flaky failures are of course because we had some failure under the Crosshair backend, which did not reproduce under the Hypothesis backend. This is presumably going to point to a range of integration bugs, but is also something that we'll want to clearly explain to users because integration bugs are definitely going to happen in future and users will need to respond (by e.g. using a different backend, ignoring the problem, whatever).

  • improve the reporting around Flaky failures where the differing or missing errors are related to a change of backend while shrinking. See also Change Flaky to be an ExceptionGroup #4040.
  • triage all the current failures so we can fix them

@Zac-HD

This comment was marked as outdated.

@Zac-HD force-pushed the crosshair-in-ci branch from 98ccf44 to 4bd7e45 on July 12, 2024 at 07:48
@tybug (Member) commented Jul 12, 2024

Most/all of the "expected x, got symbolic" errors are symptoms of an underlying error in my experience (often an operation on a symbolic value while not tracing). In this case, running with export HYPOTHESIS_NO_TRACEBACK_TRIM=1 reveals that limited_category_index_cache in cm.query is at fault.

@Zac-HD (Member, Author) commented Jul 12, 2024

Ah-ha, seems like we might want some #4029-style 'don't cache on backends with avoid_realize=True' logic.
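
Roughly the shape of that logic, as a sketch: skip value caching whenever the active provider asks us not to realize values. The attribute name (avoid_realization, as used later in this thread) and the cache structure are illustrative rather than the exact internals touched by #4029:

def draw_maybe_cached(provider, cache, key, draw_fn):
    # Caching stores and compares concrete values, which would force a
    # symbolic backend to realize them - so bypass the cache entirely there.
    if getattr(provider, "avoid_realization", False):
        return draw_fn()
    if key not in cache:
        cache[key] = draw_fn()
    return cache[key]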

@Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from 1d2345d to 7bf8983 on July 12, 2024 at 20:15
@pschanely (Contributor)

Still here and excited about this! I am on a detour of doing a real symbolic implementation of the decimal module - should get that out this weekend.

@Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from cc07927 to 018ccab on July 13, 2024 at 07:23
@Zac-HD (Member, Author) commented Jul 13, 2024

Triaging a pile of the Flaky errors, most were due to getting a RecursionError under crosshair and then passing under Hypothesis - and it looks like most of those were in turn because of all our nested-@given() test helpers.

So I've tried de-nesting those, which seems to work nicely and even makes things a bit faster by default; and when CI finishes we'll see how much it helps on crosshair 🤞

@Zac-HD

This comment was marked as outdated.

@Zac-HD (Member, Author) commented Oct 10, 2024

@pschanely huge progress from recent updates! The BackendCannotProceed mechanism entirely fixed several classes of issues, the floats changes have been great (signed zero ftw!), from_type() generates instances more often, I'm no longer skipping categories of stuff, and overall we've dropped from about +350 to +250 lines of code in this PR 🎊

At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, causing some tests to start failing and some to start xpassing - it's net-good, but would be toil in our CI. I feel like we've crossed from an alpha-version which is a neat proof of concept, to a beta-version which is still early but already both useful and clearly on a path to stability and wider adoption. Incredibly excited about this ✨


If you want to pull out Crosshair issues,

  • this PR is probably useful as a pre-release test, to check whether there are any regressions you didn't expect
  • there's a commit marking some things that look like Crosshair bugs to me, and many more where Crosshair just doesn't find a failure that Hypothesis does (within the test budget, and which might or might not be a problem)
  • there's a commit full of tests skipped because they were very slow, if you want to look at performance issues. I haven't audited it lately, but would guess at least a third are still slow and also Crosshair's problem.
  • the last big commit is pretty messy, probably best to ignore that for now

@pschanely (Contributor)

@pschanely huge progress from recent updates! The BackendCannotProceed mechanism entirely fixed several classes of issues, the floats changes have been great (signed zero ftw!), from_type() generates instances more often, I'm no longer skipping categories of stuff, and overall we've dropped from about +350 to +250 lines of code in this PR 🎊

So great.

At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, causing some tests to start failing and some to start xpassing - it's net-good, but would be toil in our CI.

Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas on making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (Or submodules? But those scare me.)

If you want to pull out Crosshair issues,

Always. Thanks for the commit breakdown. More updates soon!

@Zac-HD (Member, Author) commented Oct 12, 2024

Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (or submodules? but those scare me)

For clarity, "blocking" would mean 'when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a != requirement for that version'. Similarly, if a Hypothesis PR doesn't work with Crosshair I'd prefer to learn that at the time so I can decide to either xfail the tests, or do some extra work to support it - and my guess is that the converse would be useful for you too.

In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the pr!)

@pschanely (Contributor)

For clarity, "blocking" would mean 'when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a != requirement for that version'. Similarly, if a Hypothesis PR doesn't work with Crosshair I'd prefer to learn that at the time so I can decide to either xfail the tests, or do some extra work to support it - and my guess is that the converse would be useful for you too.

Fair enough! I was concerned about how much churn in CrossHair pass/fails you'll see for unrelated hypothesis changes, but it's also true that I want to know about what you see. Current plan SGTM.

In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the pr!)

Yup! I've been doing this a little already; works for me.

@pschanely (Contributor)

@Zac-HD I've been looking into getting this rebased against master, and I think there are at least some mainline changes that are affecting the tests. I am able to do some early triage, but hoping that you or @tybug can assist with the resolution. Would that be ok? And, do we want to work through things here? Alternatively, I guess I could be opening actual hypothesis issues saying "hey, I think this test X should work under the crosshair profile and here's why..."

@tybug (Member) commented Jan 8, 2025

Confirmed that we regressed crosshair at some point:

from hypothesis import given, settings, strategies as st

@given(st.floats(min_value=0))
@settings(backend="crosshair")
def f(xs):
    pass

f()
...
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/internal/conjecture/engine.py", line 1540, in cached_test_function
    result = check_result(data.as_result())
                          ^^^^^^^^^^^^^^^^
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/internal/conjecture/data.py", line 2370, in as_result
    assert self.frozen
           ^^^^^^^^^^^
AssertionError

Will investigate (but not sure I'll have time today specifically). I think this is almost certainly our fault, not crosshair.

IMO initial triage in here is best, with the intent to only open an issue if we expect a fix to take longer than ~days.

@tybug (Member) commented Jan 8, 2025

#4230 will fix the above issue!

@Zac-HD (Member, Author) commented Jan 8, 2025

I'm now leaning towards merging this onto master - we've got it almost-entirely-working, and have (I think correctly) used the word "regression" to describe changes which made it work less well. So having CI to let us know about those as they happen seems pretty valuable to me!

(though who knows when I'll have a day free to get this back up to date again 😅)

@tybug (Member) commented Jan 9, 2025

Agree - though I think a notable prereq to merging is investigating the runtimes of very slow tests. This job suggests that regex is consistently slow, and clicking through the pytest output for other jobs shows quite a few tests at 15+ minutes:

  • test_resolves_forward_references_outside_annotations
  • test_does_not_print_reproduction_for_large_data_examples_by_default
  • test_efficient_lists_of_tuples_first_element_sampled_from
  • test_flags_finds_all_bits_set
  • several more

I think this is a mix of hypothesis testing weird stuff (we should skip those), and crosshair slowdowns on mostly-reasonable things (since these are more indicative of real code, it would be nice if crosshair could improve runtime on them, though we can of course still skip in the meantime).

@Zac-HD (Member, Author) commented Jan 9, 2025

Agree we should investigate, but I think we should skip-for-now rather than block merging on that - overall I feel like it's "usable with known issues / early-beta quality" and I don't want to miss regressions on the parts that do work in the interim.

@pschanely (Contributor)

RE performance, yes: there are some cases where I need or want to do some optimization CrossHair-side, and there are other cases where CrossHair will just not be a good fit for the problem. Skipping all the slow tests seems like a reasonable thing to do for now, and I'm happy to make un-skipping PRs when I can make improvements on my side. But before getting there, I'd like to deal with (inexplicable) test failures.

To that end, @tybug, thanks for the freeze() fix! After that, I was (arbitrarily) looking at the test_flatmap_retrieve_from_db test today, which continues to fail. Honestly, this test should probably be marked Why.symbolic_outside_context, but there's also something else going on - we don't even get the first assert to trigger. I think it's because we're prematurely realizing the symbolic float, seemingly when attempting to compute its "ir_size." My compressed traceback for the realization point looks like this:

(test_flatmap_retrieve_from_db test_flatmap.py:66) (record_and_test_size test_flatmap.py:59) (wrapped_test core.py:1765) (run_engine core.py:1219) (run engine.py:813) (_run engine.py:1307) (generate_new_examples engine.py:1058) (test_function engine.py:470) (__stoppable_test_function engine.py:338) (_execute_once_for_engine core.py:1064) (execute_once core.py:1003) (default_executor core.py:713) (run core.py:923) (prep_args_kwargs_from_strategies control.py:154) (draw data.py:2550) (do_draw flatmapped.py:30) (draw data.py:2544) (do_draw lazy.py:167) (draw data.py:2544) (do_draw numbers.py:183) (draw_float data.py:2307) (_draw data.py:2197) (ir_size data.py:1103) (ir_to_bytes database.py:744)

I noticed that the ir_size function changed here, which certainly looks like it could cause this problem, but I also strongly suspect my analysis is overly naive.

I'll continue to poke at other tests this week and report back here with additional findings. And, just to be explicit, I'm not expecting anyone to be rushing to investigate these - it's low priority stuff.

@tybug (Member) commented Jan 9, 2025

Ah, hm. Your analysis is spot-on. We place a limit on the entropy a test case can consume, and we increment the entropy consumed by a test case so far as a function of the particular values drawn (ir_size takes a value and returns its entropy "size"). This happens right after the value is drawn, before we execute the test with it, so I think we're prematurely realizing every crosshair choice. Maybe we shouldn't be tracking size/overruns for non-hypothesis providers at all?
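
Very roughly, that accounting looks like the sketch below; the names and byte encodings are illustrative, not Hypothesis' exact internals, but the key point is that measuring a drawn value's size needs a concrete value, which is what forces realization:

import struct

BUFFER_LIMIT = 8 * 1024  # illustrative cap on the "entropy" per test case

def ir_size(value):
    # Sizing a value requires a concrete representation - a symbolic float
    # has to be realized before it can be packed into bytes.
    if isinstance(value, bool):
        return 1
    if isinstance(value, int):
        return max(1, (value.bit_length() + 7) // 8)
    if isinstance(value, float):
        return len(struct.pack("!d", value))  # 8 bytes
    raise NotImplementedError(f"no size rule for {type(value)}")

def record_draw(size_so_far, value):
    # Called right after each draw, before the test body runs.
    size_so_far += ir_size(value)
    if size_so_far > BUFFER_LIMIT:
        raise OverflowError("test case overran its size budget")
    return size_so_far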

BTW, reports like this are extremely helpful <3 I wouldn't have found this problem on my own until much later if ever.

And, just to be explicit, I'm not expecting anyone to be rushing to investigate these - it's low priority stuff.

On the contrary - I'm really excited for crosshair to work well with hypothesis, and am happy to do what I can do help things along here!

@pschanely (Contributor)

I think we're prematurely realizing every crosshair choice. Maybe we shouldn't be tracking size/overruns for non-hypothesis providers at all?

Honestly, for most of the non-float data types, CrossHair can likely retain some degree of symbolics going through ir_to_bytes (which now reminds me that struct support is a big lift that's overdue). But either way, serialization effectively negates the advantage we get from operating in the IR layer, so I certainly want to avoid it!

BTW, reports like this are extremely helpful <3 I wouldn't have found this problem on my own until much later if ever.

Heh, I feel similarly about the hypothesis side; I should probably be sending partially-complete thoughts more liberally! More soon.

@pschanely (Contributor)

Haha, ok here's another thing to start mulling over in the background.

test_strategy_state.py is failing with a pluggy.PluggyTeardownRaisedWarning. The root cause is in the hypothesis pytest plugin teardown here, where we attempt to join a list that contains some symbolic strings. I think the ultimate issue is that symbolics can escape via Hypothesis' hypothesis.reporting.*report() calls.

Apologies in advance. I imagine most solutions to this are ugly.

@tybug (Member) commented Jan 10, 2025

Thanks, will look into that one. This made me realize that we unconditionally realize for debug prints in verbosity <= Verbosity.verbose as well - will PR a fix for that.

For debugging purposes, is there a way to check if a variable is symbolic? Below prints <int> (and mro is also normal), but crosshair finds the error so I think n really is symbolic.

from hypothesis import given, settings, strategies as st

@given(st.integers())
@settings(backend="crosshair", database=None)
def f(n):
    print(type(n))
    assert n != 999999

f()

@pschanely (Contributor) commented Jan 10, 2025

For debugging purposes, is there a way to check if a variable is symbolic? Below prints <int> (and mro is also normal), but crosshair finds the error so I think n really is symbolic.

Normally, the illusion needs to be this complete to account for user code that checks types 😄 , but yes, if you run things in a NoTracing() context, you'll get the real stuff:

from crosshair import NoTracing
from hypothesis import given, settings, strategies as st

@given(st.integers())
@settings(backend="crosshair", database=None)
def f(n):
    with NoTracing():
        print(type(n))
    assert n != 999999

f()

Edit: and, when not tracing, an isinstance(x, crosshair.core.CrossHairValue) should be sufficient to detect anything crosshair would throw at you. Shallowly, at least: I don't have something for you to check that a value is deeply concrete.
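
Putting those two tips together, a small debugging helper could look like this (a sketch; per the caveat above it is only a shallow check):

from crosshair import NoTracing
from crosshair.core import CrossHairValue

def is_symbolic(x):
    # Pause tracing so we see the real runtime type rather than the illusion.
    with NoTracing():
        return isinstance(x, CrossHairValue)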

@pschanely (Contributor) commented Jan 10, 2025

Your trivial example highlights another problem - I'm using BackendCannotProceed.verified when completing all paths, even if an exception was raised on a prior iteration, causing hypothesis to say backend='crosshair' claimed to verify this test passes. I'll detect this on my side and raise exhausted instead.
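
A sketch of the provider-side bookkeeping that fix implies; BackendCannotProceed and the "verified"/"exhausted" outcomes come from this thread and #4092, while the attribute names and exact construction here are assumptions:

from hypothesis.errors import BackendCannotProceed

class _ExplorationState:
    def __init__(self):
        self.saw_exception = False       # some iteration raised inside the test
        self.all_paths_explored = False  # the search space was exhausted

    def on_all_paths_explored(self):
        # Only claim verification if no earlier iteration blew up; otherwise
        # report "exhausted" so Hypothesis doesn't over-state the result.
        scope = "exhausted" if self.saw_exception else "verified"
        raise BackendCannotProceed(scope)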

@tybug (Member) commented Jan 10, 2025

#4234 should resolve the premature realization in Verbosity.{verbose, debug} - which also happened to be the root cause of test_strategy_state.

And huh, I actually don't get that! (logs from the crosshair provider indicate that I don't hit exhaustion). Though I'm sure it's still an issue.

Trace:
from hypothesis import given, settings, strategies as st

@given(st.integers())
@settings(backend="crosshair")
def f(n):
    assert n != 123456

f()

Traceback (most recent call last):
  File "/Users/tybug/Desktop/sandbox5.py", line 32, in <module>
    f()
  File "/Users/tybug/Desktop/sandbox5.py", line 29, in f
    @settings(backend="crosshair")
                   ^^^^^
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/core.py", line 1805, in wrapped_test
    raise the_error_hypothesis_found
  File "/Users/tybug/Desktop/sandbox5.py", line 31, in f
    assert n != 123456
           ^^^^^^^^^^^
AssertionError
Falsifying example: f(
    n=123456,
)
crosshair-tool                         0.0.81
hypothesis-crosshair                   0.0.18

@pschanely (Contributor)

#4234 should resolve the premature realization in Verbosity.{verbose, debug} - which also happened to be the root cause of test_strategy_state.

🎉

And huh, I actually don't get that! (logs from the crosshair provider indicate that I don't hit exhaustion). Though I'm sure it's still an issue.

Oh thanks! Indeed, it turns out that my hypothesis branch was not quite as up-to-date as it should have been. 😅

@tybug (Member) commented Jan 13, 2025

To follow up on the premature realization in ir_size: I'm waiting to complete #3921 (we are very close!) before fixing this. The notion of "size" and overruns is one of the last things to fully hash out on the typed choice sequence, which may change the resolution here.

@tybug (Member) commented Jan 21, 2025

@pschanely OK, after #4247 we should not be prematurely realizing crosshair values anywhere. Please let me know if that's not the case!

The semantics after that pull are that we do not overrun (abort too-large test cases) for avoid_realization backends, as this would require realizing the values to get their size. So CrosshairProvider should be somewhat careful that it's not generating extremely large inputs (see the sketch below). The size limit here is pretty generous and I think crosshair already has internal heuristics to this effect, so I don't expect this to be a problem. We've just removed the guardrails for now.
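
One way a backend could keep itself honest now that the guardrail is gone, as a sketch; the cap and helper below are illustrative, not part of CrosshairProvider:

MAX_COLLECTION_SIZE = 10_000  # illustrative provider-side cap

def clamp_collection_size(requested_max):
    # With overrun checks skipped for avoid_realization backends, bound the
    # size of generated collections on the provider side instead.
    if requested_max is None:
        return MAX_COLLECTION_SIZE
    return min(requested_max, MAX_COLLECTION_SIZE)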

@tybug (Member) commented Jan 22, 2025

Hmm, lots of nondeterministic errors in CI. We're now calling provider.draw_* from multiple distinct stacktraces. Is this something the crosshair provider could ignore? I think we'd like to only make guarantees that the (ir_type, kwargs) of each draw is deterministic across iterations, not necessarily the stacktrace at which it was drawn.
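
A sketch of that weaker guarantee: compare the draws of two iterations by (ir_type, kwargs) only, ignoring the stack each draw came from. The record layout is an assumption for illustration:

def same_choice_sequence(previous_draws, current_draws):
    # Each draw is recorded as (ir_type, kwargs, stacktrace); drop the
    # stacktrace before comparing, since only the draw's type and keyword
    # arguments need to be deterministic across iterations.
    def strip(draws):
        return [(ir_type, kwargs) for ir_type, kwargs, _stack in draws]
    return strip(previous_draws) == strip(current_draws)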
