
[experimental] Run crosshair in CI #4034

Draft: Zac-HD wants to merge 4 commits into master from the crosshair-in-ci branch

Conversation

@Zac-HD (Member) commented Jul 7, 2024

See #3914

To reproduce this locally, you can run make check-crosshair-cover/nocover/niche, which runs the same command as CI, but I'd recommend pytest --hypothesis-profile=crosshair hypothesis-python/tests/{cover,nocover,datetime} -m xf_crosshair --runxfail to select and run only the xfailed tests.
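
For illustration, a minimal sketch of what such a crosshair settings profile could look like - the profile Hypothesis actually registers for CI may differ; only the backend switch and the two suppressed health checks (mentioned further down) are taken from this thread:

from hypothesis import HealthCheck, settings

# Sketch: register a "crosshair" profile that swaps the backend (this needs
# the hypothesis-crosshair plugin installed) and relaxes only the health
# checks that symbolic execution tends to trip.
settings.register_profile(
    "crosshair",
    backend="crosshair",
    suppress_health_check=[HealthCheck.too_slow, HealthCheck.filter_too_much],
)

# Selected with pytest --hypothesis-profile=crosshair, or programmatically:
settings.load_profile("crosshair")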

Hypothesis' problems

  • The vast majority of failures are Flaky: Inconsistent results from replaying a failing test... - mostly backend-specific failures; we've both
    • improved reporting in this case to show the crosshair-specific traceback
    • got most of the affected tests passing
  • Invalid internal boolean probability, e.g. "hypothesis/internal/conjecture/data.py", line 2277, in draw_boolean assert p > 2 ** (-64), fixed in 1f845e0 (#4049)
  • Many of our test helpers involved nested use of @given; fixed in 3315be6
  • Symbolic values escaping their tracing context ("symbolic outside context")
  • Avoid uninstalling typing_extensions when crosshair depends on it
  • Tests which are not really expected to pass on other backends. I'm slowly applying a backend-specific xfail decorator to them, @xfail_on_crosshair(...) - see the sketch after this list.
    • Tests which expect to raise a healthcheck, and fail because our crosshair profile disables healthchecks. Disable only .too_slow and .filter_too_much, and skip the remaining affected tests under crosshair.
    • Undo some over-broad skips (e.g. various xfail decorators, pytestmarks, -k 'not decimal') once we're closer
  • provide a special exception type for when running the test or realizing values would hit a PathTimeout; see Rare PathTimeout errors in provider.realize(...) pschanely/hypothesis-crosshair#21 and Stable support for symbolic execution #3914 (comment)
    • and something to signal that we've exhausted Crosshair's ability to explore the test. If this is sound, we've verified the function and can stop! (and should record that in the stop_reason). If unsound, we can continue testing with Hypothesis' default backend - so it's important to distinguish.
      Add BackendCannotProceed to improve integration #4092
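
For reference, a minimal sketch of what a backend-specific xfail decorator along the lines of @xfail_on_crosshair could look like; the Why categories and the environment-variable check are illustrative assumptions, not the helper actually added in this PR:

import enum
import os

import pytest

class Why(enum.Enum):
    # Illustrative triage categories; the PR's real enum may differ.
    symbolic_outside_context = "symbolic value escaped the tracing context"
    undiscovered = "crosshair did not find the failing example within budget"
    other = "not yet triaged"

def xfail_on_crosshair(why, strict=False):
    # Assumption: CI exposes the active profile name via HYPOTHESIS_PROFILE.
    running_crosshair = os.environ.get("HYPOTHESIS_PROFILE") == "crosshair"
    return pytest.mark.xfail(running_crosshair, reason=why.value, strict=strict)

A test marked with e.g. @xfail_on_crosshair(Why.undiscovered) then runs normally on the default backend and is treated as an expected failure under the crosshair profile.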

Probably Crosshair's problems

Error in operator.eq(Decimal('sNaN'), an_int)

____ test_rewriting_does_not_compare_decimal_snan ____
  File "hypothesis/strategies/_internal/strategies.py", line 1017, in do_filtered_draw
    if self.condition(value):
TypeError: argument must be an integer
while generating 's' from integers(min_value=1, max_value=5).filter(functools.partial(eq, Decimal('sNaN')))

Cases where crosshair doesn't find a failing example but Hypothesis does

Seems fine, there are plenty of cases in the other direction. Tracked with @xfail_on_crosshair(Why.undiscovered) in case we want to dig in later.

Nested use of the Hypothesis engine (e.g. given-inside-given)

This is just explicitly unsupported for now. Hypothesis should probably offer some way for backends to declare that they don't support this, and then raise a helpful error message if you try anyway.
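
For illustration only, such a capability flag plus helpful error might look roughly like this; supports_nested_given and the check below are hypothetical, not an existing Hypothesis API:

class ExampleSymbolicProvider:
    # Hypothetical capability flag a backend could expose.
    supports_nested_given = False

def check_nested_given(provider, already_running_test):
    # Raise a clear error instead of a confusing downstream failure when a
    # @given-decorated test calls another @given-decorated function.
    if already_running_test and not getattr(provider, "supports_nested_given", True):
        raise RuntimeError(
            f"The {type(provider).__name__} backend does not support nested "
            "use of @given; draw all inputs in the outer test instead."
        )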

@Zac-HD added the tests/build/CI (about testing or deployment *of* Hypothesis) and interop (how to play nicely with other packages) labels on Jul 7, 2024
@tybug

This comment was marked as outdated.

@Zac-HD

This comment was marked as outdated.

@Zac-HD force-pushed the crosshair-in-ci branch 3 times, most recently from 175b347 to 424943f on July 7, 2024 at 20:26
@Zac-HD force-pushed the crosshair-in-ci branch from 424943f to b2d11c7 on July 7, 2024 at 20:56
@pschanely (Contributor)

@Zac-HD your triage above is SO great. I am investigating.

@pschanely (Contributor) commented Jul 8, 2024

Knocked out a few of these in 0.0.60.
I think that means current status on my end is:

  • TypeError: conversion from SymbolicInt to Decimal is not supported
  • Unsupported operand type(s) for -: 'float' and 'SymbolicFloat' in test_float_clamper
  • TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object (or 'values' or 'items').
  • TypeError: _int() got an unexpected keyword argument 'base'
  • Symbolic not realized (in e.g. test_suppressing_filtering_health_check)
  • Error in operator.eq(Decimal('sNaN'), an_int)
  • Zac's cursed example below!

More soon.

@Zac-HD force-pushed the crosshair-in-ci branch from b2d11c7 to 98ccf44 on July 11, 2024 at 07:23
@Zac-HD (Member, Author) commented Jul 12, 2024

Ah - the Flaky failures are of course because we had some failure under the Crosshair backend, which did not reproduce under the Hypothesis backend. This is presumably going to point to a range of integration bugs, but is also something that we'll want to clearly explain to users because integration bugs are definitely going to happen in future and users will need to respond (by e.g. using a different backend, ignoring the problem, whatever).

  • improve the reporting around Flaky failures where the differing or missing errors are related to a change of backend while shrinking. See also Change Flaky to be an ExceptionGroup #4040.
  • triage all the current failures so we can fix them

@Zac-HD

This comment was marked as outdated.

@Zac-HD force-pushed the crosshair-in-ci branch from 98ccf44 to 4bd7e45 on July 12, 2024 at 07:48
@tybug (Member) commented Jul 12, 2024

Most/all of the "expected x, got symbolic" errors are symptoms of an underlying error in my experience (often an operation on a symbolic value while not tracing). In this case, running with export HYPOTHESIS_NO_TRACEBACK_TRIM=1 reveals that limited_category_index_cache in cm.query is at fault.

@Zac-HD (Member, Author) commented Jul 12, 2024

Ah-ha, seems like we might want some #4029-style 'don't cache on backends with avoid_realize=True' logic.
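
Roughly the shape of that logic, as a sketch: skip value caching whenever the active provider asks us not to realize values. The attribute name (avoid_realization, as used later in this thread) and the cache structure are illustrative rather than the exact internals touched by #4029:

def draw_maybe_cached(provider, cache, key, draw_fn):
    # Caching stores and compares concrete values, which would force a
    # symbolic backend to realize them - so bypass the cache entirely there.
    if getattr(provider, "avoid_realization", False):
        return draw_fn()
    if key not in cache:
        cache[key] = draw_fn()
    return cache[key]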

@Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from 1d2345d to 7bf8983 on July 12, 2024 at 20:15
@pschanely (Contributor)

Still here and excited about this! I am on a detour of doing a real symbolic implementation of the decimal module - should get that out this weekend.

@Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from cc07927 to 018ccab on July 13, 2024 at 07:23
@Zac-HD (Member, Author) commented Jul 13, 2024

Triaging a pile of the Flaky errors, most were due to getting a RecursionError under crosshair and then passing under Hypothesis - and it looks like most of those were in turn because of all our nested-@given() test helpers.

So I've tried de-nesting those, which seems to work nicely and even makes things a bit faster by default; and when CI finishes we'll see how much it helps on crosshair 🤞

@Zac-HD

This comment was marked as outdated.

@Zac-HD (Member, Author) commented Oct 10, 2024

@pschanely huge progress from recent updates! The BackendCannotProceed mechanism entirely fixed several classes of issues, the floats changes have been great (signed zero ftw!), from_type() generates instances more often, I'm no longer skipping categories of stuff, and overall we've dropped from about +350 to +250 lines of code in this PR 🎊

At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, causing some tests to start failing and some to start xpassing - it's net-good, but would be toil in our CI. I feel like we've crossed from an alpha-version which is a neat proof of concept, to a beta-version which is still early but already both useful and clearly on a path to stability and wider adoption. Incredibly excited about this ✨


If you want to pull out Crosshair issues,

  • this PR is probably useful as a pre-release test, to check whether there are any regressions you didn't expect
  • there's a commit marking some things that look like Crosshair bugs to me, and many more where Crosshair just doesn't find a failure that Hypothesis does (within the test budget, and which might or might not be a problem)
  • there's a commit full of tests skipped because they were very slow, if you want to look at performance issues. I haven't audited it lately, but would guess at least a third are still slow and also Crosshair's problem.
  • the last big commit is pretty messy, probably best to ignore that for now

@pschanely (Contributor)

@pschanely huge progress from recent updates! The BackendCannotProceed mechanism entirely fixed several classes of issues, the floats changes have been great (signed zero ftw!), from_type() generates instances more often, I'm no longer skipping categories of stuff, and overall we've dropped from about +350 to +250 lines of code in this PR 🎊

So great.

At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, causing some tests to start failing and some to start xpassing - it's net-good, but would be toil in our CI.

Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas on making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (Or submodules? But those scare me.)

If you want to pull out Crosshair issues,

Always. Thanks for the commit breakdown. More updates soon!

@Zac-HD (Member, Author) commented Oct 12, 2024

Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (or submodules? but those scare me)

For clarity, "blocking" would mean 'when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a != requirement for that version'. Similarly, if a Hypothesis PR doesn't work with Crosshair I'd prefer to learn that at the time so I can decide to either xfail the tests, or do some extra work to support it - and my guess is that the converse would be useful for you too.

In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the pr!)

@pschanely (Contributor)

For clarity, "blocking" would mean 'when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a != requirement for that version'. Similarly, if a Hypothesis PR doesn't work with Crosshair I'd prefer to learn that at the time so I can decide to either xfail the tests, or do some extra work to support it - and my guess is that the converse would be useful for you too.

Fair enough! I was concerned about how much churn in CrossHair pass/fails you'll see for unrelated hypothesis changes, but it's also true that I want to know about what you see. Current plan SGTM.

In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the pr!)

Yup! I've been doing this a little already; works for me.

@pschanely (Contributor)

@Zac-HD I've been looking into getting this rebased against master, and I think there are at least some mainline changes that are affecting the tests. I am able to do some early triage, but hoping that you or @tybug can assist with the resolution. Would that be ok? And, do we want to work through things here? Alternatively, I guess I could be opening actual hypothesis issues saying "hey, I think this test X should work under the crosshair profile and here's why..."

@tybug (Member) commented Jan 8, 2025

Confirmed that we regressed crosshair at some point:

from hypothesis import given, settings, strategies as st

@given(st.floats(min_value=0))
@settings(backend="crosshair")
def f(xs):
    pass

f()
...
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/internal/conjecture/engine.py", line 1540, in cached_test_function
    result = check_result(data.as_result())
                          ^^^^^^^^^^^^^^^^
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/internal/conjecture/data.py", line 2370, in as_result
    assert self.frozen
           ^^^^^^^^^^^
AssertionError

Will investigate (but not sure I'll have time today specifically). I think this is almost certainly our fault, not crosshair.

IMO initial triage in here is best, with the intent to only open an issue if we expect a fix to take longer than ~days.

@tybug (Member) commented Jan 8, 2025

#4230 will fix the above issue!

@Zac-HD (Member, Author) commented Jan 8, 2025

I'm now leaning towards merging this onto master - we've got it almost-entirely-working, and have (I think correctly) used the word "regression" to describe changes which made it work less well. So having CI to let us know about those as they happen seems pretty valuable to me!

(though who knows when I'll have a day free to get this back up to date again 😅)

@tybug (Member) commented Jan 9, 2025

Agree - though I think a notable prereq to merging is investigating the runtimes of very slow tests. This job suggests that regex is consistently slow, and clicking through the pytest output for other jobs shows quite a few tests at 15+ minutes:

  • test_resolves_forward_references_outside_annotations
  • test_does_not_print_reproduction_for_large_data_examples_by_default
  • test_efficient_lists_of_tuples_first_element_sampled_from
  • test_flags_finds_all_bits_set
  • several more

I think this is a mix of hypothesis testing weird stuff (we should skip those), and crosshair slowdowns on mostly-reasonable things (since these are more indicative of real code, it would be nice if crosshair could improve runtime on them, though we can of course still skip in the meantime).

@Zac-HD (Member, Author) commented Jan 9, 2025

Agree we should investigate, but I think we should skip-for-now rather than block merging on that - overall I feel like it's "usable with known issues / early-beta quality" and I don't want to miss regressions on the parts that do work in the interim.

@pschanely (Contributor)

RE performance, yes: there are some cases where I need or want to do some optimization CrossHair-side, and there are other cases where CrossHair will just not be a good fit for the problem. Skipping all the slow tests seems like a reasonable thing to do for now, and I'm happy to make un-skipping PRs when I can make improvements on my side. But before getting there, I'd like to deal with (inexplicable) test failures.

To that end, @tybug, thanks for the freeze() fix! After that, I was (arbitrarily) looking at the test_flatmap_retrieve_from_db test today, which continues to fail. Honestly, this test should probably be marked Why.symbolic_outside_context, but there's also something else going on - we don't even get the first assert to trigger. I think it's because we're prematurely realizing the symbolic float, seemingly when attempting to compute its "ir_size." My compressed traceback for the realization point looks like this:

(test_flatmap_retrieve_from_db test_flatmap.py:66) (record_and_test_size test_flatmap.py:59) (wrapped_test core.py:1765) (run_engine core.py:1219) (run engine.py:813) (_run engine.py:1307) (generate_new_examples engine.py:1058) (test_function engine.py:470) (__stoppable_test_function engine.py:338) (_execute_once_for_engine core.py:1064) (execute_once core.py:1003) (default_executor core.py:713) (run core.py:923) (prep_args_kwargs_from_strategies control.py:154) (draw data.py:2550) (do_draw flatmapped.py:30) (draw data.py:2544) (do_draw lazy.py:167) (draw data.py:2544) (do_draw numbers.py:183) (draw_float data.py:2307) (_draw data.py:2197) (ir_size data.py:1103) (ir_to_bytes database.py:744)

I noticed that the ir_size function changed here, which certainly looks like it could cause this problem, but I also strongly suspect my analysis is overly naive.

I'll continue to poke at other tests this week and report back here with additional findings. And, just to be explicit, I'm not expecting anyone to be rushing to investigate these - it's low priority stuff.

@tybug (Member) commented Jan 9, 2025

Ah, hm. Your analysis is spot-on. We place a limit on the entropy a test case can consume, and we increment the entropy consumed by a test case so far as a function of the particular values drawn (ir_size takes a value and returns its entropy "size"). This happens right after the value is drawn, before we execute the test with it, so I think we're prematurely realizing every crosshair choice. Maybe we shouldn't be tracking size/overruns for non-hypothesis providers at all?
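
Very roughly, that accounting looks like the sketch below; the names and byte encodings are illustrative, not Hypothesis' exact internals, but the key point is that measuring a drawn value's size needs a concrete value, which is what forces realization:

import struct

BUFFER_LIMIT = 8 * 1024  # illustrative cap on the "entropy" per test case

def ir_size(value):
    # Sizing a value requires a concrete representation - a symbolic float
    # has to be realized before it can be packed into bytes.
    if isinstance(value, bool):
        return 1
    if isinstance(value, int):
        return max(1, (value.bit_length() + 7) // 8)
    if isinstance(value, float):
        return len(struct.pack("!d", value))  # 8 bytes
    raise NotImplementedError(f"no size rule for {type(value)}")

def record_draw(size_so_far, value):
    # Called right after each draw, before the test body runs.
    size_so_far += ir_size(value)
    if size_so_far > BUFFER_LIMIT:
        raise OverflowError("test case overran its size budget")
    return size_so_far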

BTW, reports like this are extremely helpful <3 I wouldn't have found this problem on my own until much later if ever.

And, just to be explicit, I'm not expecting anyone to be rushing to investigate these - it's low priority stuff.

On the contrary - I'm really excited for crosshair to work well with hypothesis, and am happy to do what I can do help things along here!

@pschanely (Contributor)

I think we're prematurely realizing every crosshair choice. Maybe we shouldn't be tracking size/overruns for non-hypothesis providers at all?

Honestly, for most of the non-float data types, CrossHair can likely retain some degree of symbolics going through ir_to_bytes (which now reminds me that struct support is a big lift that's overdue). But either way, serialization effectively negates the advantage we get from operating in the IR layer, so I certainly want to avoid it!

BTW, reports like this are extremely helpful <3 I wouldn't have found this problem on my own until much later if ever.

Heh, I feel similarly about the hypothesis side; I should probably be sending partially-complete thoughts more liberally! More soon.

@pschanely (Contributor)

Haha, ok here's another thing to start mulling over in the background.

test_strategy_state.py is failing with a pluggy.PluggyTeardownRaisedWarning. The root cause is in the hypothesis pytest plugin teardown here, where we attempt to join a list that contains some symbolic strings. I think the ultimate issue is that symbolics can escape via Hypothesis' hypothesis.reporting.*report() calls.

Apologies in advance. I imagine most solutions to this are ugly.

@tybug (Member) commented Jan 10, 2025

Thanks, will look into that one. This made me realize that we unconditionally realize for debug prints in verbosity <= Verbosity.verbose as well - will PR a fix for that.

For debugging purposes, is there a way to check if a variable is symbolic? Below prints <int> (and mro is also normal), but crosshair finds the error so I think n really is symbolic.

from hypothesis import given, settings, strategies as st

@given(st.integers())
@settings(backend="crosshair", database=None)
def f(n):
    print(type(n))
    assert n != 999999

f()

@pschanely (Contributor) commented Jan 10, 2025

For debugging purposes, is there a way to check if a variable is symbolic? Below prints <int> (and mro is also normal), but crosshair finds the error so I think n really is symbolic.

Normally, the illusion needs to be this complete to account for user code that checks types 😄 , but yes, if you run things in a NoTracing() context, you'll get the real stuff:

from crosshair import NoTracing
from hypothesis import given, settings, strategies as st

@given(st.integers())
@settings(backend="crosshair", database=None)
def f(n):
    with NoTracing():
        print(type(n))
    assert n != 999999

f()

Edit: and, when not tracing, an isinstance(x, crosshair.core.CrossHairValue) should be sufficient to detect anything crosshair would throw at you. Shallowly, at least: I don't have something for you to check that a value is deeply concrete.
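
Putting those two tips together, a small debugging helper could look like this (a sketch; per the caveat above it is only a shallow check):

from crosshair import NoTracing
from crosshair.core import CrossHairValue

def is_symbolic(x):
    # Pause tracing so we see the real runtime type rather than the illusion.
    with NoTracing():
        return isinstance(x, CrossHairValue)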

@pschanely (Contributor) commented Jan 10, 2025

Your trivial example highlights another problem - I'm using BackendCannotProceed.verified when completing all paths, even if an exception was raised on a prior iteration, causing hypothesis to say backend='crosshair' claimed to verify this test passes. I'll detect this on my side and raise exhausted instead.
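
A sketch of the provider-side bookkeeping that fix implies; BackendCannotProceed and the "verified"/"exhausted" outcomes come from this thread and #4092, while the attribute names and exact construction here are assumptions:

from hypothesis.errors import BackendCannotProceed

class _ExplorationState:
    def __init__(self):
        self.saw_exception = False       # some iteration raised inside the test
        self.all_paths_explored = False  # the search space was exhausted

    def on_all_paths_explored(self):
        # Only claim verification if no earlier iteration blew up; otherwise
        # report "exhausted" so Hypothesis doesn't over-state the result.
        scope = "exhausted" if self.saw_exception else "verified"
        raise BackendCannotProceed(scope)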

@tybug (Member) commented Jan 10, 2025

#4234 should resolve the premature realization in Verbosity.{verbose, debug} - which also happened to be the root cause of test_strategy_state.

And huh, I actually don't get that! (logs from the crosshair provider indicate that I don't hit exhaustion). Though I'm sure it's still an issue.

Trace:
from hypothesis import given, settings, strategies as st

@given(st.integers())
@settings(backend="crosshair")
def f(n):
    assert n != 123456

f()

Traceback (most recent call last):
  File "/Users/tybug/Desktop/sandbox5.py", line 32, in <module>
    f()
  File "/Users/tybug/Desktop/sandbox5.py", line 29, in f
    @settings(backend="crosshair")
                   ^^^^^
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/core.py", line 1805, in wrapped_test
    raise the_error_hypothesis_found
  File "/Users/tybug/Desktop/sandbox5.py", line 31, in f
    assert n != 123456
           ^^^^^^^^^^^
AssertionError
Falsifying example: f(
    n=123456,
)
crosshair-tool                         0.0.81
hypothesis-crosshair                   0.0.18

@pschanely (Contributor)

#4234 should resolve the premature realization in Verbosity.{verbose, debug} - which also happened to be the root cause of test_strategy_state.

🎉

And huh, I actually don't get that! (logs from the crosshair provider indicate that I don't hit exhaustion). Though I'm sure it's still an issue.

Oh thanks! Indeed, it turns out that my hypothesis branch was not quite as up-to-date as it should have been. 😅

@tybug (Member) commented Jan 13, 2025

To follow up on the premature realization in ir_size: I'm waiting to complete #3921 (we are very close!) before fixing this. The notion of "size" and overruns is one of the last things to fully hash out on the typed choice sequence, which may change the resolution here.

@tybug (Member) commented Jan 21, 2025

@pschanely OK, after #4247 we should not be prematurely realizing crosshair values anywhere. Please let me know if that's not the case!

The semantics after that pull are that we do not overrun (abort too-large test cases) for avoid_realization backends, as this would require realizing the values to get their size. So CrosshairProvider should be somewhat careful that it's not generating extremely large inputs (see the sketch below). The size limit here is pretty generous and I think crosshair already has internal heuristics to this effect, so I don't expect this to be a problem. We've just removed the guardrails for now.
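
One way a backend could keep itself honest now that the guardrail is gone, as a sketch; the cap and helper below are illustrative, not part of CrosshairProvider:

MAX_COLLECTION_SIZE = 10_000  # illustrative provider-side cap

def clamp_collection_size(requested_max):
    # With overrun checks skipped for avoid_realization backends, bound the
    # size of generated collections on the provider side instead.
    if requested_max is None:
        return MAX_COLLECTION_SIZE
    return min(requested_max, MAX_COLLECTION_SIZE)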

@tybug (Member) commented Jan 22, 2025

Hmm, lots of nondeterministic errors in CI. We're now calling provider.draw_* from multiple distinct stacktraces. Is this something the crosshair provider could ignore? I think we'd like to only make guarantees that the (ir_type, kwargs) of each draw is deterministic across iterations, not necessarily the stacktrace at which it was drawn.
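
A sketch of that weaker guarantee: compare the draws of two iterations by (ir_type, kwargs) only, ignoring the stack each draw came from. The record layout is an assumption for illustration:

def same_choice_sequence(previous_draws, current_draws):
    # Each draw is recorded as (ir_type, kwargs, stacktrace); drop the
    # stacktrace before comparing, since only the draw's type and keyword
    # arguments need to be deterministic across iterations.
    def strip(draws):
        return [(ir_type, kwargs) for ir_type, kwargs, _stack in draws]
    return strip(previous_draws) == strip(current_draws)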
