Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Construct Tree Sitter queries only once #18

Merged
merged 2 commits into from
Jan 16, 2025
Merged

Conversation

oyarsa
Copy link
Contributor

@oyarsa oyarsa commented Jan 15, 2025

This is the speed-up mentioned in #14 (comment). I figured it was easy enough to just do it. I can post more rigorous results, but here's why I did this:

I was benchmarking the tool to see if filtering afterwards made any difference. I noticed that when parsing multiple files, the same query was being re-created, and it wasn't free. My test case was taking https://github.com/rust-lang/rust, adding tags to the comments with sed [1], and running anot . in the repository root. Before this PR, it took ~50s. Now, it takes ~12s. It doesn't make any difference if it runs on a single file.

[1] git ls-files '*.rs' | xargs sed -i '/[^\/]\/\/ /s/\/\/ /\/\/ @tag: /'

@oyarsa oyarsa changed the title Ts query once Construct Tree Sitter queries only once Jan 15, 2025
@rorybyrne
Copy link
Collaborator

rorybyrne commented Jan 15, 2025

50s -> 12s is a big improvement, thank you. But even 12s still seems quite long. How much faster is ripgrep?

@rorybyrne
Copy link
Collaborator

Nvm I'll just run it and see myself

@rorybyrne
Copy link
Collaborator

rg "println" --type rust 0.23s user 1.31s system 436% cpu 0.353 total

Running at the base of rust-lang/rust.

Wow!

@oyarsa
Copy link
Contributor Author

oyarsa commented Jan 15, 2025

ripgrep takes 1s and regular grep -r 2.5s. The latter is likely because it doesn't ignore gitignore. Does anot do that?

@rorybyrne
Copy link
Collaborator

No we just naively walk the directory at the moment. Should open some issues for .gitignore and possibly filetype filtering.

@oyarsa
Copy link
Contributor Author

oyarsa commented Jan 15, 2025

Regarding why it's so much slower, it's because tree-sitter parsing is slow. Here's a flamegraph of the latest version:

rust4

Since it's CPU-bound, we might get some speed-up from parallelising this. It should be straightforward with rayon.

Maybe we can figure out a way to exclude files without annotations before they're parsed. Straight regex testing for //.*@.*: would be a lot faster. That might work well in real codebases, but it wouldn't change the benchmark I'm running.

Edit: it seems that GitHub removes the interactive parts of the flamegraph. The stuff on the left that you can't read is input::read_file with only 8% of the samples.

@rorybyrne
Copy link
Collaborator

rorybyrne commented Jan 15, 2025

I was wondering if rg can do this, and it seems it can. With a bit of coaxing, Claude was able to come up with this equivalent rg command:

❯ rg -n --json "@(\w+):\s*(.+)" tests | jq -c 'select(.type == "match") | {
 kind: (.data.submatches[0].match.text | capture("@(?<tag>\\w+):.*").tag),
 content: (.data.submatches[0].match.text | capture("@\\w+:\\s*(?<content>.+)").content),
 location: {
   file: .data.path.text,
   line: .data.line_number,
   inline: (
     .data.lines.text |
     (test("^\\s*//") or test("^\\s*#")) | not
   )
 }
}'

{"kind":"note","content":"this experiment will be re-written later","location":{"file":"tests/data/test.py","line":2,"inline":false}}
{"kind":"hypothesis","content":"5 is better than 4","location":{"file":"tests/data/test.py","line":5,"inline":true}}
{"kind":"todo","content":"Add initialization","location":{"file":"tests/data/test.js","line":3,"inline":true}}
{"kind":"bug","content":"Sometimes fails on Safari","location":{"file":"tests/data/test.js","line":7,"inline":false}}
{"kind":"todo","content":"Add more fields","location":{"file":"tests/data/test.rs","line":2,"inline":false}}
{"kind":"fixme","content":"This needs better error handling","location":{"file":"tests/data/test.rs","line":9,"inline":true}}

Firstly, this is one of those moments where LLMs amaze me. That took <5 minutes...

Secondly, it makes me question what value our tool adds. Tree sitter integration, potentially? Some other yet-unknown feature we could build?

This makes it easier to profile the program with a profiler (e.g. cargo
flamegraph), even if the Rust main won't be distributed.
Creating the queries isn't free, and when we're parsing lots of files in
directoy, it becomes a hotspot.
@rorybyrne rorybyrne merged commit e4f3793 into flywhl:main Jan 16, 2025
1 check passed
@oyarsa oyarsa deleted the ts-query-once branch January 16, 2025 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants