License header formatting #983

natibek · 2024-07-08T17:38:40Z

There are inconsistencies in whether license headers are added to source files, and in whether the added headers match each other. This PR adds a checker that verifies whether each source file has a license header and, if it does, whether it is the correct one. It accounts for shebang lines, comments at the beginning of files that are unrelated to license headers, and pylint and mypy disable lines. The check has been added to all_.py.

Like the other checks, it can be run from the root directory with ./checks/license_header_format_.py.
-i performs an incremental check. This is not enabled by default.
A specific file can be passed as an argument, in which case only that file is checked. By default, all files matching *.py are checked.
The --no-header flag restricts handling to cases where no header is found. By default, it is False.
The --bad-header flag restricts handling to cases where an incorrect header is found. By default, it is False.
The --apply flag fixes the header problems. By default, the script only checks whether the headers are correct and makes no changes.
- It will remove bad headers and replace them with the correct ones.
- If no license header is found, it will add one.
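
For illustration, a minimal sketch of how the options listed above could be declared with argparse. The flag names come from the description; the long form `--incremental`, the positional name `files`, and the help strings are assumptions, not the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Builds a parser mirroring the options described in this PR (illustrative only)."""
    parser = argparse.ArgumentParser(description="Check license headers in source files.")
    # Positional files; falls back to the *.py glob when none are given.
    parser.add_argument("files", nargs="*", default=["*.py"], help="Files to check.")
    # "--incremental" is a hypothetical long name for the -i flag.
    parser.add_argument("-i", "--incremental", action="store_true", help="Incremental check.")
    parser.add_argument("--no-header", action="store_true", help="Only handle missing headers.")
    parser.add_argument("--bad-header", action="store_true", help="Only handle incorrect headers.")
    parser.add_argument("--apply", action="store_true", help="Fix the header problems.")
    return parser


args = build_parser().parse_args(["--apply", "some_module.py"])
print(args.apply, args.no_header, args.files)
```

With `action="store_true"`, each flag is `False` unless it is passed, so no explicit default is needed.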

richrines1

thank you for this!! added some inline suggestions. a few comments more generally:

there should be no references to "server" here. Any dynamic/repo-specific behavior should be handled via arguments or the config file - we would like checks in checks-superstaq to generalize outside of our repos as much as possible (within reason)
once we correct outdated headers once, it seems like we shouldn't need to keep checking them? in which case maybe the "outdated" functionality doesn't need to live in this script - we can save the code you use to make these initial corrections somewhere internally, and then use this script to check headers from here on out
if we still need to explicitly check for the string "Infleqtion" in places, maybe we could add a "licensee" value to the config file in addition to the header, instead of hard coding it?
similarly, i can maybe see why it's unavoidable but i feel like the hard-coded "apache" checks somewhat defeat the purposes of saving the header in the config. do you think there's an easy way to check if the headers are ~the same, up to licensee/year? maybe we could allow the header in pyproject.toml to include {YEAR} and {LICENSEE} tags, which we could convert to wildcards when comparing against existing licenses

also fwiw it's also ok if this script doesn't handle every possible case perfectly - if it gets confused it can always just throw an error saying to fix the headers manually :)

richrines1 · 2024-07-16T22:29:30Z

checks-superstaq/checks_superstaq/license_header_format_.py

+        """
+    )
+    parser.add_argument(
+        "--apply", action="store_true", help="Add the license header to files.", default=False


no need to set default=False with action="store_true"

Suggested change

-        "--apply", action="store_true", help="Add the license header to files.", default=False
+        "--apply", action="store_true", help="Add the license header to files."

(ditto below)

richrines1 · 2024-07-18T16:48:37Z

checks-superstaq/checks_superstaq/license_header_format_.py

+    license_header = ""
+    exceptions = ["# pylint:", "#!/", "# mypy:"]
+
+    with open(file, "r+") as f:


do we need the "+" here? if not:

Suggested change

-    with open(file, "r+") as f:
+    with open(file, "r") as f:

richrines1 · 2024-07-18T18:43:15Z

checks-superstaq/checks_superstaq/license_header_format_.py

+try:
+    data: dict[str, Any] = tomlkit.parse(Path("pyproject.toml").read_text())
+    expected_license_header = str(data["tool"]["license_header_format"]["license_header"])
+    in_server = "Apache" not in expected_license_header


we should put this in a function instead executing it globally

richrines1 · 2024-07-18T18:44:42Z

checks-superstaq/checks_superstaq/license_header_format_.py

+    raise KeyError(
+        "Under [tool.license_header_format] add a license_header field with the license\
+ heder that should be added to source code files in the repository."
+    )


no need to raise an error in this case, we can just have an info message saying that no license header was found and then return as if it succeeded
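
A rough sketch of what this comment and the earlier one about module-level execution might look like together: the config lookup moves into a function, and a missing header produces an info message and a successful return. The function names and the already-parsed `config` mapping are hypothetical; the real script reads pyproject.toml with tomlkit.

```python
from __future__ import annotations

from typing import Any, Mapping


def get_expected_header(config: Mapping[str, Any]) -> str | None:
    """Returns the configured license header, or None if it is not set."""
    section = config.get("tool", {}).get("license_header_format", {})
    header = section.get("license_header")
    return str(header) if header is not None else None


def run(config: Mapping[str, Any]) -> int:
    """Runs the check; a missing header in the config is reported but not treated as a failure."""
    expected_header = get_expected_header(config)
    if expected_header is None:
        print("No license header found under [tool.license_header_format]; skipping check.")
        return 0
    # The actual header comparison would go here.
    return 0
```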

richrines1 · 2024-07-18T18:49:32Z

checks-superstaq/checks_superstaq/license_header_format_.py

+        return f"""
+    Beginning at line: {self.start_line_num}
+    Ending at line   : {self.end_line_num}\n
+{self.license_header}\n"""


nit

Suggested change

-        return f"""
-    Beginning at line: {self.start_line_num}
-    Ending at line   : {self.end_line_num}\n
-{self.license_header}\n"""
+        return (
+            f"Beginning at line: {self.start_line_num}\n"
+            f"Ending at line   : {self.end_line_num}\n\n"
+            f"{self.license_header}\n"
+        )

natibek · 2024-08-05T15:57:16Z

once we correct outdated headers once, it seems like we shouldn't need to keep checking them? in which case maybe the "outdated" functionality doesn't need to live in this script - we can save the code you use to make these initial corrections somewhere internally, and then use this script to check headers from here on out

That makes sense. We can also keep it but change the logic a bit. After the initial fix, instead of checking for ColdQuanta in the license header, we can check if it belongs to the licensee but is a different license. This can catch cases of changing the license provider.

similarly, i can maybe see why it's unavoidable but i feel like the hard-coded "apache" checks somewhat defeat the purposes of saving the header in the config. do you think there's an easy way to check if the headers are ~the same, up to licensee/year? maybe we could allow the header in pyproject.toml to include {YEAR} and {LICENSEE} tags, which we could convert to wildcards when comparing against existing licenses

I added a few more fields to replace the hard-coded variables. The cirq license header check pylint plugin does something similar. However, from what I have seen, Apache 2.0 license headers come in two different formattings, and that would interfere with the matching if we use the wildcard approach.

richrines1 · 2024-08-06T16:44:01Z

checks-superstaq/checks_superstaq/license_header_format_.py

+                and license_header.start_line_num <= line_num + 1 < license_header.end_line_num
+            ):
+                if line[-2] == ",":
+                    prepend += line[:-1] + f" 2024 {licensee}.\n"


silly legal q: will we need to update the year in every file on 1/1/2025? or do they stay the same until if/when we update the file?

if the former we might want to replace 2024 with e.g. datetime.datetime.now().year. if the latter maybe we want to make the last two digits wildcards?

vtomole · 2024-08-06

will we need to update the year in every file on 1/1/2025?

No. Given that, we can pick whichever path is easier.

natibek · 2024-08-08T21:52:43Z

@richrines1 can you please take a look? The biggest change is that I am using difflib.SequenceMatcher to check if the body of the license header matches the header specified in the pyproject.toml file. Instead of checking if the license name (e.g. Apache) is in the header, we check if the header body matches the provided header body. If the match is above a threshold, the licensee is not included in the copyright line, and the header is editable, the licensee is appended to the header.

richrines1

(partial review)

richrines1 · 2024-08-12T19:08:37Z

checks-superstaq/checks_superstaq/license_header_format_.py

+        if re.search(copyright_pattern, line):
+            copyright_line += line
+            body = "\n".join(header_as_lst[idx + 1 :]).strip("#")
+            break
+        else:
+            copyright_line += line


Suggested change

-        if re.search(copyright_pattern, line):
-            copyright_line += line
-            body = "\n".join(header_as_lst[idx + 1 :]).strip("#")
-            break
-        else:
-            copyright_line += line
+        copyright_line += line
+        if re.search(copyright_pattern, line):
+            body = "\n".join(header_as_lst[idx + 1 :]).strip("#")
+            break

richrines1 · 2024-08-12T19:31:49Z

checks-superstaq/checks_superstaq/license_header_format_.py

+
+    for license_header in license_header_lst:
+        similar_body = (
+            difflib.SequenceMatcher(None, body, license_header.license_header).ratio() > 0.94


it looks like this is comparing the existing header to the part of the existing header below the copyright line. should it be comparing to the expected header instead?

we also might want to generalize this a bit to determine similarity, e.g. by comparing

"".join(line.lstrip("#").strip().lower() for line in header.splitlines()), "".join(line.lstrip("#").strip().lower() for line in expected_header.splitlines()),

so that licenses will always get marked as similar if they only differ by cases/whitespace/comment style/etc
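
A small sketch of the normalized comparison suggested here, assuming the 0.94 threshold from the diff is kept. The helper names `normalize_header` and `headers_similar` are hypothetical.

```python
import difflib


def normalize_header(header: str) -> str:
    """Strips comment markers and surrounding whitespace, lowercases, and joins the lines."""
    return "".join(line.lstrip("#").strip().lower() for line in header.splitlines())


def headers_similar(header: str, expected_header: str, threshold: float = 0.94) -> bool:
    """Returns True if the normalized headers are close according to SequenceMatcher."""
    matcher = difflib.SequenceMatcher(
        None, normalize_header(header), normalize_header(expected_header)
    )
    return matcher.ratio() > threshold


existing = "#Licensed Under The Apache License\n#   version 2.0"
expected = "# Licensed under the Apache License\n# Version 2.0"
print(headers_similar(existing, expected))
```

Because both sides are normalized before comparison, headers that differ only in case, leading comment markers, or surrounding whitespace compare as identical.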

richrines1 · 2024-08-12T20:20:33Z

checks-superstaq/checks_superstaq/license_header_format_.py

+    target = (
+        expected_license_header.replace("{YEAR}", r"20\d{2}")
+        .replace("{LICENSEE}", licensee)
+        .replace("\n", "")


why are we removing the newlines?

richrines1 · 2024-08-12T20:21:03Z

checks-superstaq/checks_superstaq/license_header_format_.py

+        .replace("(", r"\(")
+        .replace(")", r"\)")
+        .replace(".", r"\.")
+        .replace("'", r"\'")
+        .replace('"', r"\"")


can we use re.escape() to do this?
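
One way this could look, as a sketch: split the template on the {YEAR} and {LICENSEE} tags first, then pass only the literal pieces through re.escape(). Escaping the whole template in one call would also escape the braces of the tags. The function name `build_header_pattern` and the example header are hypothetical.

```python
from __future__ import annotations

import re


def build_header_pattern(template: str, licensee: str) -> re.Pattern[str]:
    """Turns a header template with {YEAR}/{LICENSEE} tags into a compiled regex."""
    # The capturing group makes re.split keep the tags in the result list.
    parts = re.split(r"(\{YEAR\}|\{LICENSEE\})", template)
    pieces = []
    for part in parts:
        if part == "{YEAR}":
            pieces.append(r"20\d{2}")  # any year from 2000 to 2099
        elif part == "{LICENSEE}":
            pieces.append(re.escape(licensee))
        else:
            pieces.append(re.escape(part))  # literal text, including newlines
    return re.compile("".join(pieces))


template = '# Copyright {YEAR} {LICENSEE}.\n# Licensed under the "License" (v2.0).'
pattern = build_header_pattern(template, "Example Corp")
print(bool(pattern.fullmatch('# Copyright 2024 Example Corp.\n# Licensed under the "License" (v2.0).')))
```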

richrines1 · 2024-08-12T20:31:14Z

checks-superstaq/checks_superstaq/license_header_format_.py

+            license_header.header_type = HeaderType.VALID
+            valid = True
+        elif similar_body and re.search(appended_pattern, license_header.license_header):
+            license_header.header_type = HeaderType.VALID


we might want to treat this differently - if the licenses are similar but not exactly the same we should probably still rewrite them to match the expected formatting
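
To make the distinction concrete, a hedged sketch of a three-way classification: exact matches are valid, similar-but-not-exact headers get rewritten, and everything else is left for other handling. `HeaderStatus` and `classify_header` are hypothetical names, not the `HeaderType` enum used in the PR.

```python
from __future__ import annotations

import enum
import re


class HeaderStatus(enum.Enum):
    VALID = "valid"  # matches the expected header exactly
    NEEDS_REWRITE = "needs_rewrite"  # same license, different formatting
    UNRELATED = "unrelated"  # not recognizably the expected license


def classify_header(
    header: str, expected_pattern: re.Pattern[str], is_similar: bool
) -> HeaderStatus:
    """Classifies a header given the expected-header regex and a similarity verdict."""
    if expected_pattern.fullmatch(header.strip()):
        return HeaderStatus.VALID
    if is_similar:
        return HeaderStatus.NEEDS_REWRITE
    return HeaderStatus.UNRELATED
```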

richrines1 · 2024-08-12T20:47:51Z

checks-superstaq/checks_superstaq/license_header_format_.py

+    """
+    copyright_line = ""
+    body = ""
+    copyright_pattern = re.compile(r"Copyright .*")


should we make this one case insensitive? e.g.

Suggested change

-    copyright_pattern = re.compile(r"Copyright .*")
+    copyright_pattern = re.compile(r"Copyright .*", flags=re.IGNORECASE)

natibek · 2024-10-10T22:09:40Z

@richrines1 can you take one last look at this please? I have responded to the comments.

natibek requested review from vtomole and richrines1 July 8, 2024 17:38

natibek changed the title ~~License header checker and adder~~ License header formatting Jul 11, 2024

Working license header formatter

1737f51

natibek force-pushed the license-check branch from 3d0caf6 to 1737f51 Compare July 11, 2024 22:28

natibek and others added 12 commits July 11, 2024 17:36

Fix formatting

c274c39

Remove match case to if statements

8fde62b

Reduce complexity

06cc52a

Toml reading fixed for older python versions, improved printing

d8a052c

Add license_header_format_ to all_.py

4dbc5c4

Report number of incorrect license headers

350fbdb

Fix print for incorrect header case

74db2da

Fix mypy issues

92e7900

Fix license headers for repo

50750a5

Fix bug with appending Infleqtion to apache license

3fa1d1c

Fix bug with appending Infleqtion to apache license

5727998

Merge branch 'main' into license-check

23495e7

richrines1 self-assigned this Aug 2, 2024

richrines1 reviewed Aug 2, 2024

natibek added 2 commits August 5, 2024 10:40

Implement suggested changes

6645aa8

Complete merge

05b5f61

natibek added 5 commits August 5, 2024 11:00

Add license header

939ff01

Edit docstrings and minor fixes

47b3771

fix bad merge

ac45240

Remove check for outdated license

3e17822

wildcard approach

d2207c7

richrines1 reviewed Aug 6, 2024

natibek and others added 2 commits August 6, 2024 14:25

Use copyright body similarity license header in file to check if they are from the same license, wildcards in pyproject toml input

89372a3

Merge branch 'main' into license-check

893e546

natibek and others added 6 commits August 8, 2024 16:20

Replace checking for licensee in string with regex

ae9285e

Merge branch 'license-check' of github.com:Infleqtion/client-superstaq into license-check

eddf930

Merge branch 'main' into license-check

549e073

fix indent

8d79196

add new line

e5ac00e

add license headers and minor fix for check

9534d2c

richrines1 mentioned this pull request Aug 9, 2024

Feature/base qcvv framework #992

Merged

natibek added 3 commits August 9, 2024 11:35

add similarity check for exact match case, fix header

0be382c

Merge branch 'main' of github.com:Infleqtion/client-superstaq into license-check

3b9aa34

remove license_name field and improve similarity check accounting for headers with no body

0330f29

richrines1 reviewed Aug 12, 2024

Merge from main and make suggested changes

18fdef5

vtomole marked this pull request as draft August 29, 2024 16:28
