Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Regex dependency #201

Merged
merged 6 commits into from
Jul 28, 2024
Merged

Remove Regex dependency #201

merged 6 commits into from
Jul 28, 2024

Conversation

RossSmyth
Copy link
Contributor

Regex being the slowest dependency for my entire project just to do a couple matches has been really bugging me. So I just recreated it.

The main draw back is the Unicode table. I just tried to recreate the semantics of the regex, and this is what it specified (any word character).

This is even lighter weight than regex-lite so I would say this succeeds #168, plus there is no MSRV junk to worry about (wrt dependencies).

I don't personally believe there is any maintenance burden to it since the matching is pretty straightforward.

I also added a test to test that we select the second part of serial numbers that are split by "&" as specified in one issue. That test seems to have been updated so that it no longer did that after parent IDs were added.

@parasyte
Copy link
Contributor

💯

I have a pet peeve with static RegExs. They add a lot of overhead compared to a simple FSM.

@sirhcel
Copy link
Contributor

sirhcel commented Jul 19, 2024

Thank you for keeping on working on this issue @RossSmyth! The call for regex support is also imminent with #170 but in looks like we can get away there without regex or regex-lite there as well.

After a quick look at HWIDs and the MODALIAS strings from #170, I'm wondering if we actually need to deal with word boundaries according to Perl regex or if we could use a simpler definition in our context? I'm not opposed to a solution based on PERL_WORD but with this big step already made, isn't there a even more compact solution within reach?

@RossSmyth
Copy link
Contributor Author

I'm wondering if we actually need to deal with word boundaries according to Perl regex or if we could use a simpler definition in our context?

Yeah I was thinking the same, but I personally do not know what that subset is. The Windows documentation on HWID isn't really that great.

@sirhcel
Copy link
Contributor

sirhcel commented Jul 22, 2024

Yeah I was thinking the same, but I personally do not know what that subset is. The Windows documentation on HWID isn't really that great.

I found the note

Device identification strings should not be parsed. They are meant only for string comparisons and should be treated as opaque strings.

in the documentation for device identification strings. Looks like what we are doing here is not spec'd. 😄 But something done by other libraries as well (like for example pySerial, nusb). The common pattern seen is:

  1. If we are dealing with a composite device, use the parent device's identification string
  2. Interpret the entire last "path" element as serial number

pySerial uses a word regex on the last path element for checking for a composite device and not MI_ like we are doing here (and which is documented as being present for composite devices). But once the device identification string to use is determined, the entire last path element is taken as serial number.

This procedure works with the example device identification strings from pyserial/pyserial#283 (comment).

After looking into this a bit more into the history of our code, regex-matching for word characters seems to be a relic from #47. It looks like it survived subsequent changes to this mechanism when implementing/cleaning up support for retrieving serial numbers for composite devices like #141 recently.

From my my perspective, we good to go with just using the entire last path element of the actual USB device as serial number and without the check for word characters.

@RossSmyth
Copy link
Contributor Author

Okie dokie. I'll modify it so.

@RossSmyth RossSmyth force-pushed the HwidMatches branch 2 times, most recently from 648bb98 to 529e7eb Compare July 22, 2024 20:04
src/windows/enumerate.rs Outdated Show resolved Hide resolved
@RossSmyth
Copy link
Contributor Author

RossSmyth commented Jul 22, 2024

Thinking about how the serial number is matched, we currently make the serial number an optional field, even from the parent device if part of a composite device. We do not test this behavior though. So I wonder what the semantics are supposed to be.

  1. Serial numbers are always optional (this is what is currently implemented)
  2. Serial numbers of child devices of composite devices are optional, but the parent will always have a serial
  3. Something weirder?

I'll stare at pyserial to see if I can divine what they do.

@RossSmyth
Copy link
Contributor Author

RossSmyth commented Jul 22, 2024

Ok, my Python skills are quite rusty but I think it's staring right in the face in your link.
https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/tools/list_ports_windows.py#L337-L343

I believe it is supposed to be that if the child device doesn't have a serial number, the parent must, or else pyserial will throw an exception OR stack overflow because it seems to just infinitely recurse
https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/tools/list_ports_windows.py#L230-L231

I'll change this in a later PR though as it is a bit orthogonal.

@sirhcel
Copy link
Contributor

sirhcel commented Jul 22, 2024

Thinking about how the serial number is matched, we currently make the serial number an optional field, even from the parent device if part of a composite device. We do not test this behavior though. So I wonder what the semantics are supposed to be.

1. Serial numbers are always optional (this is what is currently implemented)

As far as I've overseeing things here: This is is the way. Serial numbers for USB devices are provided by the device and are optional. In terms of USB descriptors the serial number is a property of the USB device and not the interface. In case of a composite device there is one serial number per device and the interfaces are numbered.

This is also related to #178 and we are maybe using the wrong serial number for composite FTDI devices: The suffix seems to be generated by the FTDI driver and according to pyserial/pyserial#283 (comment) (COM96 and COM134), the parent device provides the right serial number.

@RossSmyth
Copy link
Contributor Author

the parent device provides the right serial number.

If you look at get_parent_serial_number() that does not seem to be the guaranteed case.
https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/tools/list_ports_windows.py#L162-L164

So yeah, I agree. Always optional.

Remove regex as the capture that are done are not too complex.
Just implement our own capturing.
@sirhcel
Copy link
Contributor

sirhcel commented Jul 22, 2024

I believe it is supposed to be that if the child device doesn't have a serial number, the parent must, or else pyserial will throw an exception OR stack overflow because it seems to just infinitely recurse https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/tools/list_ports_windows.py#L230-L231

Recursion depth is limited here.

I'll change this in a later PR though as it is a bit orthogonal.

Yes. No need to change/fix everything here. We just should end up at least en par with the beforehand state.

@sirhcel
Copy link
Contributor

sirhcel commented Jul 22, 2024

If you look at get_parent_serial_number() that does not seem to be the guaranteed case. https://github.com/pyserial/pyserial/blob/7aeea35429d15f3eefed10bbb659674638903e3a/serial/tools/list_ports_windows.py#L162-L164

Like in: No serial if the actual device is nested too deep down in the device stack or there is none? The latter is the case for older FTDI devices like the FT2232D which requires an external EEPROM for storing serial number and other configuration data.

So yeah, I agree. Always optional.

Great to be on the same page! 😄

@sirhcel
Copy link
Contributor

sirhcel commented Jul 25, 2024

I would like to keep the actual wrong serial numbers returned in #203 and not return a longer number which is also not the devices serial number.

Switching from char::is_ascii_alphanumeric to char::is_alphanumeric should bring us a bit more toward PERL_WORD with still just code from the standard library. Do you see an issue with that? Did you use is_ascii_alphanumeric for a certain reason?

Locally testing the changes shows no difference in the output for my set of test hardware. From my perspective, we are good to go. Thank you very much for your perseverance with this PR!

@RossSmyth
Copy link
Contributor Author

I used it because it is semantically identical to the pyserial regex. But we can preserve the PERL_WORD semantics instead, that works for me.

Rust str are indexed by byte, so doing .take().count()
is incorrect since Unicode values are often multiple
bytes. This fixes it and adds tests for the edge
cases.
@RossSmyth
Copy link
Contributor Author

RossSmyth commented Jul 26, 2024

Oops forgot to actually add the changes 😄

It did not handle actual Unicode in the serial number properly if we are to accept that. The previous implementation would index it by byte. For example the test serial I added, 3854356β03432, would be matched as 3854356β0343 since it would index it by byte and have the end chopped off. That is fixed with tests that poke it with the edge cases I found while implementing.

sirhcel added 2 commits July 26, 2024 23:31
This includes throwing some random data at HwidMatches::new with the
help of quickcheck.
@sirhcel
Copy link
Contributor

sirhcel commented Jul 28, 2024

Thank you for finally remembering and adding the last fix!

@sirhcel sirhcel merged commit b0d046b into serialport:main Jul 28, 2024
30 checks passed
@RossSmyth
Copy link
Contributor Author

Thanks!

@RossSmyth RossSmyth deleted the HwidMatches branch July 30, 2024 01:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants