Proof-of-Concept: Fast path for search #9
TL;DR: In practice, this just reduces CPU usage slightly for typical datasets.
The main idea in this PR is that most files can be rejected by looking at just the first few bytes (it's in the server's own interest not to break this pattern).
For the sake of keeping this PR small, I assumed `p1team` appears after `p1` and `p2`. This could be done in a less hacky manner by searching for `"p1":` and `"p2":`, then parsing out the string values.

The overwhelming bottleneck is how the battle logs are stored: many small JSON files. Storage devices are just very bad at small random reads; even SSDs much prefer long sequential reads.
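To make the fast path concrete, here is a minimal sketch of the idea (hypothetical names and a hypothetical 256-byte prefix size, not the actual code in this PR): read only the start of the file and reject it when the player id can't appear there.

```rust
use std::io::Read;

/// Assumed size of the prefix that covers the "p1"/"p2" fields.
/// (Assumption: the server writes these fields near the start of the log.)
const PREFIX_LEN: usize = 256;

/// Naive substring scan; fine for a few hundred bytes.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}

/// Fast path: read at most PREFIX_LEN bytes and report whether the
/// player id could appear in this file. Hypothetical helper.
fn prefix_may_match(mut reader: impl Read, player: &str) -> std::io::Result<bool> {
    let mut buf = [0u8; PREFIX_LEN];
    let mut filled = 0;
    // `read` may return short reads, so loop until the buffer is full or EOF.
    loop {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
        if filled == PREFIX_LEN {
            break;
        }
    }
    Ok(contains(&buf[..filled], player.as_bytes()))
}
```

Since `&[u8]` implements `Read`, this is easy to try on an in-memory log; in the common case the prefix rejects the file and the rest of it is never read.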
Anyway, this was a quick thing I thought I'd share, maybe it gives some ideas for future problems.
`cargo bench` stats

This is slower on files with matching player IDs, and faster on files that don't match (the common case).
I added a bench case to help demonstrate that.
Most of the overhead added to the former case could be avoided by reorganizing the code so that the file is only opened once, but that's an invasive change and I just wanted to demonstrate the core idea here.
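For reference, the "open the file only once" reorganization could look something like this sketch (hypothetical helper name, 256-byte prefix assumed): check the prefix first, and only rewind and scan the whole file through the same handle when the prefix didn't reject it.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Naive substring scan.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}

/// Open the file once: fast-reject on the prefix, otherwise fall back
/// to a full scan with the same handle (no second `open`).
fn search_one_open(path: &str, player: &[u8]) -> io::Result<bool> {
    let mut f = File::open(path)?;
    let mut prefix = [0u8; 256];
    let n = f.read(&mut prefix)?; // one short read is fine for a sketch
    if !contains(&prefix[..n], player) {
        return Ok(false); // fast reject: the file is never read in full
    }
    // Slow path: rewind and scan the whole file through the same handle,
    // so a match straddling the prefix boundary isn't missed.
    f.seek(SeekFrom::Start(0))?;
    let mut all = Vec::new();
    f.read_to_end(&mut all)?;
    Ok(contains(&all, player))
}
```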
Before
After
`perf` stats

Constructed a directory with 500K files, 8 GB total.
Uncached
The target use case.
Dominated by FS and disk seeks.
I ran `echo 3 | sudo tee /proc/sys/vm/drop_caches` to clear the OS's file cache between each run. My drive doesn't have its own cache.

Before
After
Cached
Not a realistic scenario, but it better isolates the CPU-intensive portion.
Before
After