[experimental] Run crosshair in CI #4034
base: master
Conversation
(force-pushed from 175b347 to 424943f)
@Zac-HD your triage above is SO great. I am investigating.

Knocked out a few of these in 0.0.60. More soon.

Ah - the …
Most/all of the "expected x, got symbolic" errors are symptoms of an underlying error, in my experience (often an operation on a symbolic value while not tracing). In this case, running with …
Ah-ha - seems like we might want some #4029-style "don't cache on backends with avoid_realize=True" logic.
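As a rough sketch of what that logic might look like (the `avoid_realize` attribute name is taken from the comment above; the `maybe_cache` helper and its cache are hypothetical, not Hypothesis's real caching layer):

```python
# Hypothetical sketch: skip caching for backends that ask us not to realize
# values, since storing a concrete result would force realization.
def maybe_cache(provider, key, value, cache: dict) -> None:
    if getattr(provider, "avoid_realize", False):
        return  # don't cache values backed by symbolics at all
    cache[key] = value
```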
(force-pushed from 1d2345d to 7bf8983)
Still here and excited about this! I am on a detour of doing a real symbolic implementation of the …
(force-pushed from cc07927 to 018ccab)
Triaging a pile of the … So I've tried de-nesting those, which seems to work nicely and even makes things a bit faster by default; and when CI finishes we'll see how much it helps on crosshair 🤞
@pschanely huge progress from recent updates! The …

At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, with some tests starting to fail and others starting to xpass - it's net-good, but would be toil in our CI. I feel like we've crossed from an alpha version which is a neat proof of concept, to a beta version which is still early but already both useful and clearly on a path to stability and wider adoption. Incredibly excited about this ✨

If you want to pull out Crosshair issues, …
So great.

Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas on making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (or submodules? but those scare me)

Always. Thanks for the commit breakdown. More updates soon!
For clarity, "blocking" would mean: when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a …

In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the PR!)
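For a sense of what "updating the xfail markers" looks like in practice, here is a minimal self-contained sketch. The real `@xfail_on_crosshair(...)` helper and `Why` enum referenced in this PR live in Hypothesis's test suite and may differ; the stand-ins below (including the environment-variable check) are illustrative only.

```python
# Minimal stand-ins for the @xfail_on_crosshair / Why helpers referenced in
# this PR; not the actual implementations.
import enum
import os

import pytest
from hypothesis import given, strategies as st


class Why(enum.Enum):
    undiscovered = "crosshair doesn't find the failing example (yet)"


def xfail_on_crosshair(why: Why, *, strict: bool = False):
    # Expect failure only when the crosshair profile is active; keying off an
    # environment variable here purely for illustration.
    active = os.environ.get("HYPOTHESIS_PROFILE") == "crosshair"
    return pytest.mark.xfail(condition=active, reason=why.value, strict=strict)


@xfail_on_crosshair(Why.undiscovered)
@given(st.integers())
def test_finds_the_bad_value(n):
    assert n != 123456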
Fair enough! I was concerned about how much churn in CrossHair pass/fails you'll see for unrelated hypothesis changes, but it's also true that I want to know about what you see. Current plan SGTM.
Yup! I've been doing this a little already; works for me.
@Zac-HD I've been looking into getting this rebased against master, and I think there are at least some mainline changes that are affecting the tests. I am able to do some early triage, but I'm hoping that you or @tybug can assist with the resolution. Would that be ok? And do we want to work through things here? Alternatively, I guess I could open actual hypothesis issues saying "hey, I think this test X should work under the crosshair profile and here's why..."
Confirmed that we regressed crosshair at some point:

```python
@given(st.floats(min_value=0))
@settings(backend="crosshair")
def f(xs):
    pass

f()
```

Will investigate (but not sure I'll have time today specifically). I think this is almost certainly our fault, not crosshair. IMO initial triage in here is best, with the intent to only open an issue if we expect a fix to take longer than ~days.
#4230 will fix the above issue!
I'm now leaning towards merging this onto master - we've got it almost-entirely-working, and have (I think correctly) used the word "regression" to describe changes which made it work less well. So having CI to let us know about those as they happen seems pretty valuable to me! (though who knows when I'll have a day free to get this back up to date again 😅)
Agree - though I think a notable prereq to merging is investigating the runtimes of very slow tests. This job suggests that regex is consistently slow, and clicking through pytest output for other jobs shows quite a few tests at 15+ minutes.

I think this is a mix of hypothesis testing weird stuff (we should skip those), and crosshair slowdowns on mostly-reasonable things (since these are more indicative of real code, it would be nice if crosshair could improve runtime on them, though we can of course still skip in the meantime).
Agree we should investigate, but I think we should skip for now rather than block merging on that - overall I feel like it's "usable with known issues / early-beta quality", and I don't want to miss regressions on the parts that do work in the interim.
RE performance, yes: there are some cases where I need or want to do some optimization CrossHair-side, and there are other cases where CrossHair will just not be a good fit for the problem. Skipping all the slow tests seems like a reasonable thing to do for now, and I'm happy to make un-skipping PRs when I can make improvements on my side.

But before getting there, I'd like to deal with (inexplicable) test failures. To that end, @tybug, thanks for the … The call stack:

```
(test_flatmap_retrieve_from_db  test_flatmap.py:66)
(record_and_test_size  test_flatmap.py:59)
(wrapped_test  core.py:1765)
(run_engine  core.py:1219)
(run  engine.py:813)
(_run  engine.py:1307)
(generate_new_examples  engine.py:1058)
(test_function  engine.py:470)
(__stoppable_test_function  engine.py:338)
(_execute_once_for_engine  core.py:1064)
(execute_once  core.py:1003)
(default_executor  core.py:713)
(run  core.py:923)
(prep_args_kwargs_from_strategies  control.py:154)
(draw  data.py:2550)
(do_draw  flatmapped.py:30)
(draw  data.py:2544)
(do_draw  lazy.py:167)
(draw  data.py:2544)
(do_draw  numbers.py:183)
(draw_float  data.py:2307)
(_draw  data.py:2197)
(ir_size  data.py:1103)
(ir_to_bytes  database.py:744)
```

I noticed that the ir_size function changed here, which certainly looks like it could cause this problem, but I also strongly suspect my analysis is overly naive. I'll continue to poke at other tests this week and report back here with additional findings. And, just to be explicit, I'm not expecting anyone to be rushing to investigate these - it's low-priority stuff.
Ah, hm. Your analysis is spot-on. We place a limit on the entropy a test case can consume, and we increment the entropy consumed by a test case so far as a function of the particular values drawn (…).

BTW, reports like this are extremely helpful <3 I wouldn't have found this problem on my own until much later, if ever.
On the contrary - I'm really excited for crosshair to work well with hypothesis, and am happy to do what I can to help things along here!
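To make the budget mechanism described above concrete, here is a minimal illustrative sketch; every name in it (`EntropyBudget`, `MAX_TEST_CASE_SIZE`, `charge`) is a hypothetical stand-in rather than Hypothesis's actual internals, and the size function is deliberately naive.

```python
# Hypothetical sketch of a per-test-case entropy budget (not real Hypothesis code).
MAX_TEST_CASE_SIZE = 8 * 1024  # arbitrary cap, for illustration only


class Overrun(Exception):
    """Raised when a test case 'draws' more than the cap allows."""


class EntropyBudget:
    def __init__(self, cap: int = MAX_TEST_CASE_SIZE) -> None:
        self.cap = cap
        self.used = 0

    def charge(self, value: object) -> None:
        # Charge a size that depends on the concrete value drawn.  Note that
        # computing a size from the concrete value forces a symbolic value to
        # be realized, which is the premature-realization problem above.
        self.used += len(repr(value).encode())
        if self.used > self.cap:
            raise Overrun(f"used {self.used} bytes, cap is {self.cap}")
```

Under a model like this, changing how the per-value size is computed can make previously-passing sequences of draws exceed the cap, which matches the regression traced through ir_size in the call stack above.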
Honestly, for most of the non-float data types, CrossHair can likely retain some degree of symbolics going through …

Heh, I feel similarly about the hypothesis side; I should probably be sending partially-complete thoughts more liberally! More soon.
Haha, ok here's another thing to start mulling over in the background: test_strategy_state.py is failing with a …

Apologies in advance. I imagine most solutions to this are ugly.
Thanks, will look into that one. This made me realize that we unconditionally realize for debug prints in verbosity <= Verbosity.verbose as well - will PR a fix for that.

For debugging purposes, is there a way to check if a variable is symbolic? Below prints `<class 'int'>`:

```python
@given(st.integers())
@settings(backend="crosshair", database=None)
def f(n):
    print(type(n))
    assert n != 999999

f()
```
Normally, the illusion needs to be this complete to account for user code that checks types 😄, but yes, if you run things in a NoTracing() context, you'll get the real stuff:

```python
from crosshair import NoTracing

@given(st.integers())
@settings(backend="crosshair", database=None)
def f(n):
    with NoTracing():
        print(type(n))
    assert n != 999999

f()
```

Edit: and, when not tracing, an …
Your trivial example highlights another problem - I'm using …
#4234 should resolve the premature realization in …

And huh, I actually don't get that! (logs from the crosshair provider indicate that I don't hit exhaustion). Though I'm sure it's still an issue.

```python
@given(st.integers())
@settings(backend="crosshair")
def f(n):
    assert n != 123456

f()
```

```
Traceback (most recent call last):
  File "/Users/tybug/Desktop/sandbox5.py", line 32, in <module>
    f()
  File "/Users/tybug/Desktop/sandbox5.py", line 29, in f
    @settings(backend="crosshair")
   ^^^^^
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/core.py", line 1805, in wrapped_test
    raise the_error_hypothesis_found
  File "/Users/tybug/Desktop/sandbox5.py", line 31, in f
    assert n != 123456
           ^^^^^^^^^^^
AssertionError
Falsifying example: f(
    n=123456,
)
```
🎉

Oh thanks! Indeed, it turns out that my hypothesis branch was not quite as up-to-date as it should have been. 😅

To follow up on the premature realization in …
@pschanely OK, after #4247 we should not be prematurely realizing crosshair values anywhere. Please let me know if that's not the case! The semantics after that pull are that we do not overrun (abort too-large test cases) from …
(force-pushed from 765cc55 to 648bbe9)
Hmm, lots of nondeterministic errors in CI. We're now calling …
See #3914

To reproduce this locally, you can run `make check-crosshair-cover/nocover/niche` for the same command as in CI, but I'd recommend `pytest --hypothesis-profile=crosshair hypothesis-python/tests/{cover,nocover,datetime} -m xf_crosshair --runxfail` to select and run only the xfailed tests.

Hypothesis' problems
- `Flaky: Inconsistent results from replaying a failing test...` - mostly backend-specific failures; we've both …
- `"hypothesis/internal/conjecture/data.py", line 2277, in draw_boolean` / `assert p > 2 ** (-64)`, fixed in 1f845e0 (#4049)
- `@given`, fixed in 3315be6
- `target()`, fixed in 85712ad (#4049)
- `typing_extensions` when crosshair depends on it
- `@xfail_on_crosshair(...)` …
- `.too_slow` and `.filter_too_much`, and skip remaining affected tests under crosshair
- `-k 'not decimal'` once we're closer
- `PathTimeout`; see RarePathTimeout errors in `provider.realize(...)` pschanely/hypothesis-crosshair#21 and Stable support for symbolic execution #3914 (comment)
- Add `BackendCannotProceed` to improve integration #4092

Probably Crosshair's problems
- `Duplicate type "<class 'array.array'>" registered` from repeated imports? pschanely/hypothesis-crosshair#17
- `RecursionError`, see RecursionError in `_issubclass` pschanely/CrossHair#294
- `unsupported operand type(s) for -: 'float' and 'SymbolicFloat'` in `test_float_clamper`
- `TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object` (or `'values'` or `'items'`). Fixed in Implement various fixes for hypothesis integration pschanely/CrossHair#269
- `TypeError: _int() got an unexpected keyword argument 'base'`
- `hashlib` requires the buffer protocol, which symbolic bytes don't provide pschanely/CrossHair#272
- `typing.get_type_hints()` raises `ValueError`, see typing.get_type_hints() raises ValueError when used inside Crosshair pschanely/CrossHair#275
- `TypeError` in bytes regex, see TypeError in bytes regex pschanely/CrossHair#276
- `provider.draw_boolean()` inside `FeatureStrategy`, see Invalid combination of arguments to `draw_boolean(...)` pschanely/hypothesis-crosshair#18
- `dict(name=value)`, see Support named `dict` init syntax pschanely/CrossHair#279
- `PurePath` constructor, see `PurePath(LazyIntSymbolicStr)` error pschanely/CrossHair#280
- `zlib.compress()` not symbolic, see "a bytes-like object is required, not `SymbolicBytes`" when calling `zlib.compress(b'')` pschanely/CrossHair#286
- `int.from_bytes(map(...), ...)`, see Accept `map()` object - or any iterable - in `int.from_bytes()` pschanely/CrossHair#291
- `base64.b64encode()` and friends pschanely/CrossHair#293
- `TypeError: conversion from SymbolicInt to Decimal is not supported`; see also sNaN below
- `TypeVar` problem, see `z3.z3types.Z3Exception: b'parser error'` from interaction with `TypeVar` pschanely/CrossHair#292
- `RecursionError` inside Lark, see Weird failures using sets pschanely/CrossHair#297
- Error in `operator.eq(Decimal('sNaN'), an_int)`

Cases where crosshair doesn't find a failing example but Hypothesis does
Seems fine; there are plenty of cases in the other direction. Tracked with `@xfail_on_crosshair(Why.undiscovered)` in case we want to dig in later.

Nested use of the Hypothesis engine (e.g. given-inside-given)
This is just explicitly unsupported for now. Hypothesis should probably offer some way for backends to declare that they don't support this, and then raise a helpful error message if you try anyway.
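As a speculative sketch of what such a declaration could look like - every name below is invented for illustration, and Hypothesis does not currently expose this API - the idea is simply "backends declare a capability, and the engine raises a clear error when it's missing":

```python
# Speculative sketch only: invented names, not part of Hypothesis's API.
from dataclasses import dataclass


@dataclass
class BackendCapabilities:
    supports_nested_given: bool = False


def enter_given(capabilities: BackendCapabilities, already_inside_test: bool) -> None:
    # Hypothetically called when a @given-wrapped test starts executing.
    if already_inside_test and not capabilities.supports_nested_given:
        raise RuntimeError(
            "This backend does not support nested use of the Hypothesis engine "
            "(e.g. @given inside @given); run the inner property separately or "
            "switch to a backend that supports it."
        )
```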