
Optimize SQL expression comparison by caching hash codes (currently probably not needed) #34149

Open
ranma42 opened this issue Jul 3, 2024 · 17 comments


@ranma42
Contributor

ranma42 commented Jul 3, 2024

In the query pipeline, expressions are sometimes compared for (deep) equality.
This is currently based on a recursive visit of the subtree, which can be very costly if done multiple times while visiting the expression (it is quite easy to construct expression trees that require O(n^2) operations when visited).

This could be improved by computing a hash that filters out most of the inequalities, as suggested in #34133 (comment).
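The idea can be sketched outside of EF Core as follows (illustrative Python, not EF Core code; `SqlBinary` and its fields are hypothetical stand-ins for immutable SqlExpression nodes): each node caches a structural hash at construction, so an equality check on two unequal trees can usually bail out without recursing.

```python
# Illustrative sketch (not EF Core code): SqlBinary is a hypothetical
# stand-in for an immutable SqlExpression node. Caching a structural hash
# at construction lets deep equality bail out in O(1) for most unequal trees.

class SqlBinary:
    """Immutable binary expression node with an eagerly cached hash."""
    __slots__ = ("op", "left", "right", "_hash")

    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right
        # Eager: children have already cached their hashes, so this is O(1)
        # per node rather than O(subtree) per comparison.
        self._hash = hash((op, hash(left), hash(right)))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if self is other:
            return True
        if not isinstance(other, SqlBinary):
            return NotImplemented
        # Cheap filter: different hashes imply structural inequality,
        # skipping the recursive walk in the common "not equal" case.
        if self._hash != other._hash:
            return False
        return (self.op == other.op
                and self.left == other.left
                and self.right == other.right)
```

Note that this is only sound if nodes are truly immutable once constructed, which is exactly the constraint discussed below.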

@ranma42
Contributor Author

ranma42 commented Jul 3, 2024

Side note: another case that could be interesting is the negation of an expression. Having it "ready to use" would make some predicate-related optimizations faster/cheaper.

@roji roji added this to the Backlog milestone Jul 3, 2024
@roji
Member

roji commented Jul 3, 2024

Blocked on making our entire SqlExpression tree immutable (#32927); SelectExpression is currently mutable, and since it can be contained inside most SqlExpressions (thanks @ranma42), we can't cache the hashcode.

@roji
Member

roji commented Jul 4, 2024

> Blocked on making our entire SqlExpression tree immutable (#32927); SelectExpression is currently mutable, and since it can be contained inside most SqlExpressions (#34133 (comment)), we can't cache the hashcode.

Though on second thought, SelectExpression should never be mutable when it's already composed upon...

@roji roji changed the title Comparing expressions can be slow Optimize SQL expression comparison by caching hash codes Aug 22, 2024
@roji
Member

roji commented Aug 22, 2024

@ranma42 just to continue on my comment just above, I think this should be safe to implement even in the current bits - any case in which a mutable SelectExpression is contained within another expression (except ShapedQueryExpression) should be a violation of our current invariants: only the top-level select in the tree may be mutable.

I still absolutely want to make SelectExpression fully immutable (#32927), but that's quite a big task, and I don't think it needs to block this optimization here.

@roji
Member

roji commented Aug 22, 2024

Note also #19859, which is about not exhaustively calculating hash codes for the entire tree, but just some shallower subset; this would further improve our performance here.

@ranma42
Contributor Author

ranma42 commented Aug 22, 2024

> @ranma42 just to continue on my comment just above, I think this should be safe to implement even in the current bits - any case in which a mutable SelectExpression is contained within another expression (except ShapedQueryExpression) should be a violation of our current invariants: only the top-level select in the tree may be mutable.

> I still absolutely want to make SelectExpression fully immutable (#32927), but that's quite a big task, and I don't think it needs to block this optimization here.

If that is the case, I believe I can try and implement this. What would be the best way to evaluate the performance? (would a micro-benchmark make sense?)

@roji
Member

roji commented Aug 22, 2024

Sure, an ad-hoc BenchmarkDotNet benchmark could work - not sure we need to commit it etc. Though this is one of the cases where the benefits depend on the tree depth/complexity which you choose to benchmark, which is a bit arbitrary... I think it's OK to do this improvement in any case?

@ranma42
Contributor Author

ranma42 commented Dec 22, 2024

SqlConstantExpressions can contain any Value. Is it fine to assume that they are immutable? (once they are being processed by the EFCore query pipeline)

For example, would it be legitimate to capture enumerable values and "clone" them into immutable lists during the processing? (I would assume that it should be fine for constants, as non-constants would instead be handled as parameters)
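A tiny sketch of what that "clone into immutable" step could look like (hypothetical Python illustration; the real change would apply to SqlConstantExpression values in C#):

```python
# Hypothetical illustration: snapshot an enumerable constant value into an
# immutable, hashable form when it enters the query pipeline, so later
# mutation of the caller's collection cannot invalidate a cached hash.

def freeze_constant(value):
    """Return an immutable, hashable snapshot of a constant's value."""
    if isinstance(value, list):
        return tuple(freeze_constant(v) for v in value)
    return value

ids = [1, 2, 3]
frozen = freeze_constant(ids)
ids.append(4)                  # mutating the original list...
assert frozen == (1, 2, 3)     # ...does not affect the captured constant
```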

@roji
Member

roji commented Dec 22, 2024

@ranma42 yes, that makes sense, and unless I'm mistaken we already make that assumption in various places (i.e. SqlConstantExpression already calculates the value's hash code - though I'm indeed not sure how/where that's currently used).

In regular compiler-generated incoming expression trees, ConstantExpression is only given for inline literals inside the query, which of course can't be mutated. We indeed generate SqlConstantExpressions later inside the query pipeline itself, but I think that if we then mutated the value referenced by SqlConstantExpressions, that would be a bug...

@ranma42
Contributor Author

ranma42 commented Dec 23, 2024

I did a couple of experiments in https://github.com/ranma42/efcore/tree/wip/hash-lazy and https://github.com/ranma42/efcore/tree/wip/hash-eager to evaluate the impact; the results are in benchmarks.zip

The CompareEquals benchmark (not in the attachment) compares different expressions (of varying sizes) and shows that both the lazy and the eager approach are very effective in avoiding needless deep comparison, effectively making the check constant-time when expressions are not equal.

In addition to that, I implemented some benchmarks that compare other affected operations:

  • Hash measures the time spent to GetHashCode on an expression that has been already created and whose hash has already been computed
  • Create measures the time spent when creating expressions (and the related memory usage)
  • CreateAndHash measures the time spent when creating and hashing an expression (this is useful to know the time taken to hash in the lazy case, by difference from the Create benchmark)
  • CompareSameEquals compares two expressions that are structurally equal, but which share no objects

Some notes:

  • both lazy and eager result in negligible time being spent on already-computed hashes (as expected)
  • lazy requires an additional 8 bytes per object vs 4 bytes in the eager case
  • eager hashing makes creation about 40-45% slower
  • lazy hashing is slower than eager hashing: the create+hash scenario requires more than twice the time of create-only; this is probably caused by the need to visit the expression tree twice (in fact, lazy hashing and no cache have matching times for this)
  • the overhead of the nullable hash seems to cause a significant regression (about 70%) in the time needed to compare two equal structures

Unsurprisingly, each implementation has its own strengths:

  • no caching is best when expressions are unlikely to be compared/hashed
  • eager is best when expressions are likely to be compared/hashed (even just once)
  • lazy is best when expressions are likely to fall into one of the two categories:
    • expressions are compared/hashed multiple times against unequal expressions
    • expressions are never compared/hashed
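For contrast with the eager sketch, the lazy variant can be illustrated like this (Python for illustration only, not the actual branch code): the hash field starts unset and the subtree is visited only on first use, so trees that are never hashed pay only one extra field per node (that nullable field is the 8-vs-4-bytes difference noted earlier).

```python
# Illustrative sketch of lazy hash caching: the hash is memoized on first
# use. Trees that are never compared/hashed pay nothing beyond one unset
# field per node; repeated hashing is O(1) after the first walk.

class Node:
    __slots__ = ("op", "children", "_hash")

    def __init__(self, op, *children):
        self.op = op
        self.children = children
        self._hash = None              # lazy: not computed yet

    def __hash__(self):
        if self._hash is None:
            # First call visits the whole subtree once; later calls are O(1).
            self._hash = hash((self.op, *map(hash, self.children)))
        return self._hash
```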

@roji
Member

roji commented Dec 24, 2024

@ranma42 do you mind posting the benchmark results here in the issue (possibly inside <details>)? It's always nice to just see things directly here etc.

Thanks for doing this work... Yeah, everything here makes sense of course. What's your opinion on where we should go?

I don't have a super clear preference - but my leaning here is maybe slightly towards lazy: at least in the current query pipeline, I think most nodes are unlikely to actually be compared/hashed, since we don't compare for equality that often, and even when we do, most cases will fail early as the trees aren't identical. However, as we introduce more optimizations requiring node comparisons, this may shift the balance towards eager calculation.

One possible interesting thing to measure would be to just run the full EF test suite for SQL Server with both systems, and compare the running time (but that requires fully implementing both, which seems like some lost effort).

@ranma42
Contributor Author

ranma42 commented Dec 24, 2024

I ran the benchmarks on my laptop:

```
BenchmarkDotNet v0.14.0, Kali GNU/Linux 2024.4
13th Gen Intel Core i7-13850HX, 1 CPU, 28 logical and 14 physical cores
.NET SDK 10.0.100-alpha.1.24573.1
  [Host]     : .NET 10.0.0 (10.0.24.57009), X64 RyuJIT AVX2
  DefaultJob : .NET 10.0.0 (10.0.24.57009), X64 RyuJIT AVX2
```

This is a WSL2 on a Windows 11, so scheduling might be kind of messy.
In order to reduce the variance, I set the affinity to only use the P cores.

**main**

| Method | Size | Mean | Error | StdDev | Op/s | Gen0 | Gen1 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Hash | 1 | 22.87 ns | 0.243 ns | 0.228 ns | 43,718,312.6 | - | - | - |
| Create | 1 | 41.38 ns | 0.404 ns | 0.338 ns | 24,163,894.1 | 0.0117 | - | 184 B |
| CreateAndHash | 1 | 60.49 ns | 0.974 ns | 0.863 ns | 16,531,922.8 | 0.0117 | - | 184 B |
| CompareSameEquals | 1 | 24.63 ns | 0.325 ns | 0.304 ns | 40,602,219.0 | - | - | - |
| Hash | 10 | 198.37 ns | 2.955 ns | 2.619 ns | 5,041,006.8 | - | - | - |
| Create | 10 | 266.70 ns | 4.667 ns | 4.138 ns | 3,749,472.6 | 0.0806 | - | 1264 B |
| CreateAndHash | 10 | 455.03 ns | 8.318 ns | 7.780 ns | 2,197,660.2 | 0.0806 | - | 1264 B |
| CompareSameEquals | 10 | 207.53 ns | 2.867 ns | 2.682 ns | 4,818,677.6 | - | - | - |
| Hash | 100 | 2,507.36 ns | 32.763 ns | 30.646 ns | 398,826.1 | - | - | - |
| Create | 100 | 2,481.69 ns | 35.566 ns | 29.699 ns | 402,950.5 | 0.7668 | 0.0153 | 12064 B |
| CreateAndHash | 100 | 4,982.31 ns | 55.039 ns | 51.483 ns | 200,709.9 | 0.7629 | 0.0153 | 12064 B |
| CompareSameEquals | 100 | 1,391.11 ns | 14.889 ns | 13.198 ns | 718,850.9 | - | - | - |
| Hash | 1000 | 27,762.86 ns | 481.691 ns | 427.007 ns | 36,019.3 | - | - | - |
| Create | 1000 | 24,970.55 ns | 367.987 ns | 344.215 ns | 40,047.2 | 7.6294 | 1.4954 | 120064 B |
| CreateAndHash | 1000 | 52,963.58 ns | 788.848 ns | 699.293 ns | 18,880.9 | 7.6294 | 1.4648 | 120064 B |
| CompareSameEquals | 1000 | 15,234.68 ns | 226.824 ns | 212.171 ns | 65,639.7 | - | - | - |

**lazy**

| Method | Size | Mean | Error | StdDev | Op/s | Gen0 | Gen1 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Hash | 1 | 0.8337 ns | 0.0182 ns | 0.0152 ns | 1,199,415,629.7 | - | - | - |
| Create | 1 | 41.9007 ns | 0.6305 ns | 0.5898 ns | 23,865,930.9 | 0.0132 | - | 208 B |
| CreateAndHash | 1 | 61.3182 ns | 0.9792 ns | 0.8680 ns | 16,308,364.7 | 0.0132 | - | 208 B |
| CompareSameEquals | 1 | 21.9323 ns | 0.3565 ns | 0.3334 ns | 45,594,911.1 | - | - | - |
| Hash | 10 | 0.8320 ns | 0.0470 ns | 0.0440 ns | 1,201,898,864.0 | - | - | - |
| Create | 10 | 273.2060 ns | 5.1626 ns | 4.5765 ns | 3,660,241.8 | 0.0911 | - | 1432 B |
| CreateAndHash | 10 | 460.5862 ns | 5.8491 ns | 5.1850 ns | 2,171,146.1 | 0.0911 | - | 1432 B |
| CompareSameEquals | 10 | 196.0948 ns | 3.6676 ns | 3.4307 ns | 5,099,575.4 | - | - | - |
| Hash | 100 | 0.8618 ns | 0.0371 ns | 0.0329 ns | 1,160,410,233.5 | - | - | - |
| Create | 100 | 2,575.5453 ns | 29.0094 ns | 27.1354 ns | 388,267.3 | 0.8698 | 0.0191 | 13672 B |
| CreateAndHash | 100 | 5,131.1822 ns | 48.4635 ns | 40.4692 ns | 194,886.9 | 0.8698 | 0.0153 | 13672 B |
| CompareSameEquals | 100 | 2,433.2913 ns | 38.0084 ns | 35.5530 ns | 410,966.0 | - | - | - |
| Hash | 1000 | 0.5273 ns | 0.0316 ns | 0.0296 ns | 1,896,286,776.3 | - | - | - |
| Create | 1000 | 25,080.6950 ns | 306.3839 ns | 286.5917 ns | 39,871.3 | 8.6670 | 1.9531 | 136072 B |
| CreateAndHash | 1000 | 54,738.8424 ns | 695.1416 ns | 650.2359 ns | 18,268.6 | 8.6670 | 1.8921 | 136072 B |
| CompareSameEquals | 1000 | 25,852.6325 ns | 396.2357 ns | 351.2526 ns | 38,680.8 | - | - | - |

**eager**

| Method | Size | Mean | Error | StdDev | Median | Op/s | Gen0 | Gen1 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Hash | 1 | 0.0013 ns | 0.0053 ns | 0.0047 ns | 0.0000 ns | 799,449,543,540.5 | - | - | - |
| Create | 1 | 62.0325 ns | 1.2414 ns | 1.2748 ns | 61.9899 ns | 16,120,569.7 | 0.0126 | - | 200 B |
| CreateAndHash | 1 | 55.9104 ns | 1.1136 ns | 1.0937 ns | 55.6630 ns | 17,885,749.5 | 0.0127 | - | 200 B |
| CompareSameEquals | 1 | 23.7113 ns | 0.3387 ns | 0.3168 ns | 23.7238 ns | 42,174,041.6 | - | - | - |
| Hash | 10 | 0.0045 ns | 0.0109 ns | 0.0107 ns | 0.0000 ns | 224,473,417,467.3 | - | - | - |
| Create | 10 | 396.1307 ns | 7.8899 ns | 7.3802 ns | 394.0148 ns | 2,524,419.5 | 0.0858 | - | 1352 B |
| CreateAndHash | 10 | 381.1490 ns | 4.5019 ns | 3.9908 ns | 381.5535 ns | 2,623,645.8 | 0.0858 | - | 1352 B |
| CompareSameEquals | 10 | 192.6524 ns | 2.3442 ns | 2.1928 ns | 192.6032 ns | 5,190,694.5 | - | - | - |
| Hash | 100 | 0.0029 ns | 0.0086 ns | 0.0081 ns | 0.0000 ns | 345,063,648,378.4 | - | - | - |
| Create | 100 | 3,605.7217 ns | 44.5567 ns | 41.6784 ns | 3,590.7824 ns | 277,337.0 | 0.8202 | 0.0191 | 12872 B |
| CreateAndHash | 100 | 3,576.9620 ns | 35.8328 ns | 33.5181 ns | 3,573.0936 ns | 279,566.8 | 0.8202 | 0.0191 | 12872 B |
| CompareSameEquals | 100 | 1,504.7692 ns | 22.5100 ns | 19.9545 ns | 1,504.5868 ns | 664,553.7 | - | - | - |
| Hash | 1000 | 0.0106 ns | 0.0180 ns | 0.0168 ns | 0.0000 ns | 94,335,649,749.4 | - | - | - |
| Create | 1000 | 36,023.6415 ns | 581.7447 ns | 544.1643 ns | 36,260.6855 ns | 27,759.5 | 8.1177 | 1.6479 | 128072 B |
| CreateAndHash | 1000 | 34,980.4683 ns | 484.8567 ns | 453.5353 ns | 34,912.6100 ns | 28,587.4 | 8.1177 | 1.6479 | 128072 B |
| CompareSameEquals | 1000 | 14,012.0293 ns | 114.7950 ns | 101.7627 ns | 13,998.1726 ns | 71,367.3 | - | - | - |

Some further details for those inspecting the values:

  • the expression "trees" the benchmark builds are basically lists with the structure (true && (true && (true && ...)))
  • N = 1 is (true && true)
  • N = 2 is (true && (true && true))
  • in general, the expression is made of N operators and N+1 constants (keep that in mind to understand my 4/8 bytes per object)
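The shape described above can be reproduced with a tiny generator (a hypothetical reconstruction, not the actual benchmark code, using nested tuples in place of expression nodes):

```python
# Hypothetical reconstruction of the benchmark tree shape:
# (true && (true && (true && ...))) with n '&&' operators over n+1 constants,
# nested on the right-hand side.

def make_tree(n):
    """Build a right-nested chain of n '&&' nodes over n+1 'true' constants."""
    expr = True
    for _ in range(n):
        expr = ("&&", True, expr)
    return expr
```

Right-nested chains like this are exactly the kind of tree where repeated deep comparison degenerates to O(n^2).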

@roji implementing both is quite trivial (in fact, it's little more than a find & replace 😄). I can try to run them against SqlServer, but for benchmarking I believe Sqlite might be more effective (in-process, faster).

@roji
Member

roji commented Dec 24, 2024

Thanks. Yeah, if it's trivial for you to run, then it would be nice to see how the functional tests fare on all three scenarios (FWIW both SQL Server and SQLite could be interesting, with the former a bit more real-world). Whatever you feel like doing and have time for :)

@ranma42
Contributor Author

ranma42 commented Dec 26, 2024

I tried running the EFCore.Sqlite.FunctionalTests and EFCore.SqlServer.FunctionalTests multiple times to collect some data.

|  | Sqlite: Main | Sqlite: Lazy | Sqlite: Eager | SqlServer: Main | SqlServer: Lazy | SqlServer: Eager |
|---|---:|---:|---:|---:|---:|---:|
|  | 228.87 | 258.05 | 265.12 | 637.81 | 737.05 | 778.36 |
|  | 234.41 | 270.49 | 267.56 | 688.09 | 774.12 | 783.14 |
|  | 234.71 | 274.75 | 268.68 | 770.71 | 785.96 | 788.48 |
|  | 249.32 | 275.30 | 270.73 | 773.66 | 822.89 | 816.26 |
|  | 260.91 | 277.97 | 274.67 | 783.01 | 861.23 | 831.61 |
|  | 272.13 | 287.54 | 275.55 | 835.76 | 891.60 | 906.10 |
|  | 284.14 | 289.38 | 276.52 | 932.33 | 916.53 | 910.58 |
|  | 292.41 | 292.49 | 278.70 | 1012.65 | 956.55 | 968.55 |
| median | 255.12 | 276.64 | 272.70 | 778.34 | 842.06 | 823.94 |
| average | 257.11 | 278.25 | 272.19 | 804.25 | 843.24 | 847.89 |
| stddev | 24.18 | 11.35 | 4.85 | 122.20 | 76.18 | 71.41 |

All measurements are in seconds.
The "lazy" column contains measurements from https://github.com/ranma42/efcore/tree/compute-hash-lazy
The "eager" column contains measurements from https://github.com/ranma42/efcore/tree/compute-hash-eager
The "main" column contains measurements from their merge base, i.e. ranma42@c4d6aa0 (it is not the real main branch: it picks up a couple of fixes to the SqlServer baselines, aligns a couple of Equals implementations with the usual pattern, and fixes a broken GetHashCode)

As I mentioned before, my environment is probably not ideal for benchmarking and SqlServer is in general harder to measure reliably (see the standard deviation 😅 ).
Given these results I would say that:

  • no caching is actually quite cheap (the main branch is outperforming both eager and lazy caching)
  • eager and lazy are barely distinguishable (comparing median or average)
  • if I had to choose between any of the 3 implementations, the current behavior is probably good enough or even the best for the test suite (with the current set of optimizations)

Note that this "benchmark" is neither representative of the real-world nor of worst-case behavior (for that, my previous benchmarks show a major improvement in comparison of huge expressions).

I am not sure what to make of this; I am somewhat wary of introducing such a change without seeing any improvement.
Maybe an acceptable outcome of this is the knowledge that such a change is reasonably easy to perform when it is deemed convenient?
OTOH, when implementing query optimizations, in some cases Equals was frowned upon (and IIRC in a couple of cases I implemented incomplete optimizations in order to avoid a deep comparison), so I wonder what the best tradeoff is here 🤔

@roji
Member

roji commented Dec 28, 2024

Thanks for doing this and posting the results @ranma42! Yeah, I'm slightly surprised as well, but I think this mainly shows that we're simply not doing a lot of optimizations requiring deep comparisons. I suspect that as we start to add more stuff, this will become more relevant. Stuff like #12634 and #34507 specifically comes to mind: these would be very broad optimizations that go over lots of nodes in the tree (i.e. are not subject to a very specific pattern-match that happens beforehand, such as the recent NULLIF and related work in #35327, which only does the deep comparison for CASE/WHEN over equality/inequality).

I'd say this... Let's keep this in the backlog for now, and as we implement more equality-based optimizations, we should keep an eye on this and rerun the benchmarks to see where we are. You can keep the branches (linking to them from here would be great) - I think it's likely we'd find ourselves implementing this later...

How does that sound?

@ranma42
Contributor Author

ranma42 commented Dec 29, 2024

> Thanks for doing this and posting the results @ranma42! Yeah, I'm slightly surprised as well, but I think this mainly shows that we're simply not doing a lot of optimizations requiring deep comparisons. I suspect that as we start to add more stuff, this will become more relevant.

I am not sure this will actually be the case; note that

  1. as soon as the recursive comparison finds a difference, it immediately stops
  2. most expression trees in EFCore are smallish (at least in the test suite; it is possible that it is not representative of actual use as it is mostly aimed at exercising different optimizations/cases/...)

> Stuff like #12634 and #34507 specifically comes to mind: these would be very broad optimizations that go over lots of nodes in the tree (i.e. are not subject to a very specific pattern-match that happens beforehand, such as the recent NULLIF and related work in #35327, which only does the deep comparison for CASE/WHEN over equality/inequality).

Yes, additional optimizations might make it more relevant, but it is also likely that the smallish expressions used in the test suite will never show a significant improvement (on average) from caching the hash codes.

> I'd say this... Let's keep this in the backlog for now, and as we implement more equality-based optimizations, we should keep an eye on this and rerun the benchmarks to see where we are. You can keep the branches (linking to them from here would be great) - I think it's likely we'd find ourselves implementing this later...

> How does that sound?

LGTM 👍 The branches are mentioned above (https://github.com/ranma42/efcore/tree/compute-hash-lazy and https://github.com/ranma42/efcore/tree/compute-hash-eager).
I will try to periodically rebase them and run them (especially if we add a considerable amount of optimizations that would benefit from them).
On the other hand, I will worry less about optimizations that risk being slow because we don't have this ;)

@roji
Member

roji commented Dec 30, 2024

> I am not sure this will actually be the case

You may very well be right... Though it may still be better to do caching to limit the worst-case compilation performance, even if it regresses the average case; I'm thinking about those cases where the expression trees are deep (this is indeed rare), and multiple fragments are equal except for one node somewhere at the bottom. In this sort of worst-case scenario (which is probably pretty contrived), lots of deep equality checks could become problematic...

But anyway, I think we both agree that for now, we should continue with optimizations leveraging deep comparison without worrying about caching hash codes - but that we should keep an eye on it and revisit as needed. I'll keep this issue open in the backlog for now.

@roji roji changed the title Optimize SQL expression comparison by caching hash codes Optimize SQL expression comparison by caching hash codes (currently probably not needed) Dec 30, 2024