Ability to report back slices of printable characters #127

Open
epage opened this issue Jan 21, 2025 · 8 comments

Comments

@epage
Contributor

epage commented Jan 21, 2025

The trait currently reports one char at a time. While terminal emulators might need to operate on chars, other tools, such as those that strip or convert ANSI escape codes, need to work on slices:

  • &str -> char conversion is expensive, relatively speaking
  • doing IO per char would be expensive
  • allocating a String to aggregate the chars would also be expensive

To work around this, anstream exposed the state changes from its fork, anstyle-parse.
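To make the cost difference in the list above concrete, here is a minimal sketch that uses only the standard library and no vte or anstyle-parse types. The helper names `write_per_char` and `write_slice` are hypothetical and exist only for illustration.

```rust
use std::io::{self, Write};

/// Writes text one `char` at a time, the way a char-based callback would:
/// each char is decoded from the `&str`, re-encoded to UTF-8, then written.
/// (Hypothetical helper, not part of any library.)
fn write_per_char<W: Write>(out: &mut W, text: &str) -> io::Result<()> {
    let mut utf8 = [0u8; 4];
    for c in text.chars() {
        out.write_all(c.encode_utf8(&mut utf8).as_bytes())?;
    }
    Ok(())
}

/// Writes the same text as a single slice: no decoding and one write call.
/// (Hypothetical helper, not part of any library.)
fn write_slice<W: Write>(out: &mut W, text: &str) -> io::Result<()> {
    out.write_all(text.as_bytes())
}
```

Both functions produce identical bytes; the difference is only in how much decoding, re-encoding, and per-call overhead happens along the way.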

@kchibisov
Member

> &str -> char conversion is expensive, relatively speaking

I'm not sure you can do that either: you don't have stages where you clearly have a string, unless it's some OSC, etc., especially given that some characters end up in execute and some in print.

What if you just carry a Vec around and dispatch when you feel like it? You can get the string with str::from_utf8.

The main issue is that it would kill performance: because the parser moves between different states, buffering won't really work most of the time. In your case, you can work around that by carrying a single buffer around and building from it, or by creating a String on your end to collect into, since that's what you'll end up with anyway.

I have an ANSI stripper myself and never felt a need for anything like that, mostly because there is a buffered writer on the other end, so writing a single char at a time doesn't matter since you don't flush that often.
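The buffered-writer argument can be sketched as follows, assuming a standard `std::io::BufWriter` in front of the real sink. `CountingSink` and `print_chars_buffered` are hypothetical names introduced here only to count how often the underlying writer is actually hit.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical sink that records every call that reaches the underlying writer.
struct CountingSink {
    writes: usize,
    bytes: Vec<u8>,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.writes += 1;
        self.bytes.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Pushes `text` through a `BufWriter` one char at a time, then hands back
/// the sink so the number of underlying writes can be inspected.
fn print_chars_buffered(text: &str) -> io::Result<CountingSink> {
    let sink = CountingSink { writes: 0, bytes: Vec::new() };
    let mut out = BufWriter::new(sink);
    let mut utf8 = [0u8; 4];
    for c in text.chars() {
        // Each char lands in BufWriter's internal buffer, not in the sink.
        out.write_all(c.encode_utf8(&mut utf8).as_bytes())?;
    }
    // `into_inner` flushes the buffered bytes to the sink before returning it.
    out.into_inner().map_err(|e| e.into_error())
}
```

For short inputs the sink sees a single write even though the caller wrote char by char. This addresses the IO cost only; the per-char decode and re-encode work still happens.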

@kchibisov
Member

I'd also say that any slice API would work by collecting into an internal buffer until some other state change dispatches the collected buffer, which would just kill performance and can easily be done on the consumer side.

@epage
Contributor Author

epage commented Jan 21, 2025

> I'm not sure you can do that either, you don't have stages where you clearly have a string, unless it's some osc/etc. Especially given that some characters end up in execute and some in print.

I have it implemented without any buffering at https://github.com/rust-cli/anstyle/blob/59252439de0945cb93985c74858b7addae591d62/crates/anstream/src/adapter/strip.rs#L115-L144

Yes, this leverages lower-level information than the Print trait to include some control characters, but even having those separate would be a big help.

@kchibisov
Member

kchibisov commented Jan 21, 2025

Could you use terminated and advance_until_terminated and stop the parser yourself, if the goal is to work on the original slice? I'm pretty sure you can slice it yourself that way.

@kchibisov
Member

That is, you should track the print/execute -> escape transition (and vice versa) and stop the parser there. The buffering could be done entirely by letting the parser run and then using the index diff to create a &str with str::from_utf8_unchecked, if the goal is to have an iterator that yields strings without escape sequences.
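The index-diff idea can be sketched without any parser crate. This is a deliberately simplified stand-in for the real state machine: it only recognises CSI sequences (`ESC [` followed by parameter bytes and a final byte in `0x40..=0x7e`), and `printable_slices` is a hypothetical name. It uses checked `&str` indexing rather than `str::from_utf8_unchecked`, which costs only a char-boundary check.

```rust
/// Splits `input` into the printable slices that sit between CSI escape
/// sequences. Simplification: only `ESC [ ... <final byte>` is recognised;
/// OSC, DCS and C0 controls are treated as printable here.
fn printable_slices(input: &str) -> Vec<&str> {
    let bytes = input.as_bytes();
    let mut slices = Vec::new();
    let mut start = 0; // index where the current printable run began
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == 0x1b && bytes.get(i + 1) == Some(&b'[') {
            // print -> escape transition: emit the run using the index diff.
            if start < i {
                slices.push(&input[start..i]);
            }
            // Skip parameter and intermediate bytes up to the final byte.
            i += 2;
            while i < bytes.len() && !(0x40..=0x7e).contains(&bytes[i]) {
                i += 1;
            }
            // Step past the final byte, if the sequence was complete.
            i = (i + 1).min(bytes.len());
            // escape -> print transition: the next run starts here.
            start = i;
        } else {
            i += 1;
        }
    }
    if start < bytes.len() {
        slices.push(&input[start..]);
    }
    slices
}
```

Every slice borrows from the original input, so nothing is copied or collected into a String. The slice boundaries always follow an ASCII byte or the end of input, so the indexing never lands inside a multi-byte character.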

State introspection is not really needed for you: you have a Handler trait with all the osc_dispatch, etc. callbacks, so you just ignore them and break the transition until print or execute.

Or is there something else you need?

@epage
Contributor Author

epage commented Jan 21, 2025

Depending on how the optimizer works, I'd still have the &str -> char work going on.

terminated is called after each byte is processed, but it can only react based on calls to Perform. I could have a PrintPerform that is meant to stop on the transition from a &str, but it would only do that after another Perform call is made, by which point the bytes-read count has already been advanced. I would then need to track what state we are now in, figure out how many bytes that used, and roll back the advanced bytes.

@kchibisov
Member

Hm, I guess one option would be to make ground_dispatch a part of Perform, and then with terminated you should be able to do what you want? And provide a default implementation of the ground_dispatch that does what it does right now? Other than that I'm not sure you can do that much.
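The "default implementation that does what it does right now" pattern can be sketched like this. Everything here is hypothetical: `PerformSketch`, `print_str`, and both collectors are illustration names, not vte's actual trait or the real signature of `ground_dispatch`. The sketch only shows how a provided method lets existing implementors keep per-char behaviour while others override it to receive slices.

```rust
/// Hypothetical stand-in for a `Perform`-like trait; not vte's actual API.
trait PerformSketch {
    /// Existing-style callback: one char at a time.
    fn print(&mut self, c: char);

    /// Hypothetical slice hook. The default preserves current behaviour by
    /// forwarding every char to `print`, so existing implementors compile
    /// and behave as before.
    fn print_str(&mut self, s: &str) {
        for c in s.chars() {
            self.print(c);
        }
    }
}

/// Implementor that relies on the default and only sees chars.
struct CharCollector(String);

impl PerformSketch for CharCollector {
    fn print(&mut self, c: char) {
        self.0.push(c);
    }
}

/// Implementor that overrides the hook to take whole slices.
struct SliceCollector {
    out: String,
    slice_calls: usize,
}

impl PerformSketch for SliceCollector {
    fn print(&mut self, c: char) {
        self.out.push(c);
    }

    fn print_str(&mut self, s: &str) {
        self.slice_calls += 1;
        self.out.push_str(s);
    }
}
```

With this shape, a caller that already holds a `&str` run can hand it over in one call, and only implementors that care about slices need to change.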

@kchibisov
Member

kchibisov commented Jan 21, 2025

With terminated, it'll stop once you try to change state, which basically means after every byte, except in the ground state. There are also cases with partial escapes in the buffer, but I don't think they apply to you, since you generally already have a valid UTF-8 string and all you want is to remove the extras from it, so you are pretty much working on already validated input.

Though, the issue still would be that we call str::from_utf8 and check for errors/validate the buffer, which is already valid for you.
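The validation cost mentioned here comes from the difference between the checked and unchecked conversions in the standard library. A minimal sketch, with hypothetical helper names `as_str_checked` and `reslice`:

```rust
use std::str;

/// Checked conversion: `str::from_utf8` scans every byte to validate UTF-8,
/// even when the bytes originally came from a `&str`.
fn as_str_checked(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Re-slices text that is already known to be valid UTF-8. Only the two
/// boundaries are checked, which is constant-time, instead of rescanning
/// the whole range.
fn reslice(original: &str, start: usize, end: usize) -> &str {
    assert!(original.is_char_boundary(start) && original.is_char_boundary(end));
    let bytes = &original.as_bytes()[start..end];
    // SAFETY: `original` is valid UTF-8 and both ends are char boundaries
    // (asserted above), so the sub-slice is valid UTF-8 as well.
    unsafe { str::from_utf8_unchecked(bytes) }
}
```

The body of `reslice` is equivalent to the safe expression `&original[start..end]`; it is spelled out only to show where the unchecked conversion would sit.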
