Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace the blib2to3 tokenizer with pytokens #4536

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

tusharsadhwani
Copy link
Collaborator

@tusharsadhwani tusharsadhwani commented Dec 22, 2024

Description

Replaces black's tokenizer with a from-scratch rewrite done by me. We could vendor the code into black itself, but either pinning it or keeping it as-is would be my recommendation, the tokenizer can be used by multiple tools for perfect compatibility.

Resolves #4520
Resolves #970
Resolves #3700

Tests passing so far: 381/381 (!)

@tusharsadhwani
Copy link
Collaborator Author

@JelleZijlstra with this, the test suite is fully passing. Primer is failing (mostly just because some file in hypothesis failed to parse), I'll be working on that, but the code should be good for a first review.

@MeGaGiGaGon
Copy link
Collaborator

I don't think we currently have any tests for this, but I just linked the above two issues here because they are the same bug in the parser where \rs cause issues. The most minimal reproduction is {\r}, and since this is a parser rewrite hopefully it can be solved. Note that this is currently only observable by directly calling internal methods due to how black reads input.

@tusharsadhwani
Copy link
Collaborator Author

Thanks for linking this, I'll make sure these parse identically to how CPython does it.

@tusharsadhwani
Copy link
Collaborator Author

Okay, primer is fixed, and all tests are green.

@JelleZijlstra
Copy link
Collaborator

Might this fuzzer failure indicate a bug?

Falsifying example: test_idempotent_any_syntatically_valid_python(
    src_contents='\n#\r#',
    mode=Mode(target_versions=set(),
     line_length=88,
     string_normalization=False,
     is_pyi=False,
     is_ipynb=False,
     skip_source_first_line=False,
     magic_trailing_comma=False,
     python_cell_magics=set(),
     preview=False,
     unstable=False,
     enabled_features=set()),  # or any other generated value
)

token: pytokens.Token, source: str, prev_token: Optional[pytokens.Token]
) -> pytokens.Token:
r"""
Black treats `\\\n` at the end of a line as a 'NL' token, while it
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound particularly intentional, I'd be open to changing Black to remove this divergence.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to! I can give you a test case with the expected behaviour.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is enough as a test case actually:

a \
  b

But, the reason this probably exists is to support formatting this file:

class Plotter:
\
    pass

class AnotherCase:
  \
    """Some
    \
    Docstring
    """

@tusharsadhwani
Copy link
Collaborator Author

Might this fuzzer failure indicate a bug?

Yeah, but I'm pretty sure it is a bug in CPython. For now we can work it out in the tokenizer though. I'll add a flag in pytokens to fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants