Working on a large git repo #711

Open

last-partizan opened this issue Oct 10, 2024 · 8 comments

@last-partizan
Contributor

I tried this on a relatively large, old repo, and I'm getting this error (after waiting for a few minutes):

2024-10-10 10:30:07,184 Checking rwarning: exhaustive rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 77823 and retry the command.
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/home/serg/.local/pipx/venvs/seagoat/lib/python3.12/site-packages/seagoat/queue/base_queue.py", line 78, in _worker_function
    task = self._task_queue.get(timeout=0.1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:
...
  File "/home/serg/.local/pipx/venvs/seagoat/lib/python3.12/site-packages/seagoat/repository.py", line 119, in top_files
    (self.get_file(filename), score)
     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serg/.local/pipx/venvs/seagoat/lib/python3.12/site-packages/seagoat/repository.py", line 134, in get_file
    self.get_file_object_id(filename),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serg/.local/pipx/venvs/seagoat/lib/python3.12/site-packages/seagoat/repository.py", line 45, in get_file_object_id
    subprocess.check_output(

After raising renameLimit, it fails again on the same line, but with an IndexError. Upon investigation, it tries to access src/app/node_modules/abbrev/README.md, which should be ignored, because node_modules is in my .gitignore. But that .gitignore is not in the root, so maybe it treats ignore patterns differently ...

The output is '' (an empty string), and it fails when indexing the result of split().

I can make a patch for it if you suggest a proper way to do it. My first guess is to return None for such cases and filter them out later...
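
Roughly something like this, using the names from the traceback (a sketch only; everything apart from the git call is guessed, not the real repository.py):

    import subprocess
    from typing import Optional

    def get_file_object_id(self, filename: str) -> Optional[str]:
        output = subprocess.check_output(
            ["git", "-C", str(self.path), "ls-tree", "HEAD", str(filename)],
            text=True,
        )
        parts = output.split()
        if len(parts) < 3:
            # Path is not present in HEAD (deleted long ago, like the
            # node_modules files here): report "no object" instead of
            # raising IndexError on parts[2].
            return None
        return parts[2]

top_files() would then skip entries whose object id is None.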

@last-partizan
Contributor Author

Adding **/node_modules/* to ignorePatterns didn't fix this issue.

@kantord
Owner

kantord commented Oct 16, 2024

Adding **/node_modules/* to ignorePatterns didn't fix this issue.

But did it at least stop it from attempting to process those files?

In any case, it sounds like a bug: SeaGOAT should never attempt to process gitignored files, regardless of the ignore configuration. (It's debatable whether that's a good decision or not, but the sorting mechanism partially depends on git history, so that's how it's done at the moment.)

So if you are certain that those files are not managed by git, then this is some nasty bug in SeaGOAT.

I'm surprised to see that you get a warning about rename detection. I've run SeaGOAT on fairly large repositories such as Linux and React, and I don't remember getting such an error. Perhaps it also depends on the git version or on config the repo might already have. In any case, it seems like an exception is thrown because git returns a non-zero exit code, right? I think exact rename detection is not essential for SeaGOAT to function properly on large repositories; it would just make the frecency-based sorting less accurate. So I think we should probably ignore this warning. What do you think?

last-partizan added a commit to last-partizan/SeaGOAT that referenced this issue Oct 16, 2024
@last-partizan
Contributor Author

I tried to dig deeper, and... there are actually two problems.

  1. git log ... takes a really long time on this repo (around 5 minutes). Do we need all the history here? Can we limit it to the last year, or maybe the last 1000 entries?
  2. Someone (me) committed the entire node_modules somewhere back in 2016, then removed it and added it to .gitignore.

To fix the second issue, we can use git ls-files and process only the files it returns. Take a look, please: #726
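
The idea, roughly (a sketch of the approach, not the actual patch; the helper name is made up):

    import subprocess

    def get_tracked_files(repo_path: str) -> set[str]:
        # `git ls-files -z` lists only the paths tracked in the current
        # checkout, NUL-separated; files that exist only in old history
        # (like the once-committed node_modules) are not listed at all.
        output = subprocess.check_output(
            ["git", "-C", repo_path, "ls-files", "-z"],
            text=True,
        )
        return {path for path in output.split("\0") if path}

Frecency candidates coming out of git log would then be intersected with this set before any ls-tree lookups happen.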

@last-partizan
Contributor Author

last-partizan commented Oct 16, 2024

In any case, it seems like an exception is thrown because git gives a non-zero exit code, right?

Upon further investigation, it turned out that the renameLimit warning (coming from git log ...) wasn't crashing the process. It was just a warning printed to stderr by git, and it was already being ignored.

The crash was occurring in this part of the code:

        # repository.py, get_file_object_id(): git ls-tree prints
        # "<mode> blob <object-id>\t<path>", and we take the third field.
        object_id = (
            subprocess.check_output(
                [
                    "git",
                    "-C",
                    str(self.path),
                    "ls-tree",
                    "HEAD",
                    str(file_path),
                ],
                text=True,
            )
            .split()[2]  # IndexError: git prints nothing for untracked paths
            .strip()
        )
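
(For reference: git ls-tree HEAD <path> exits 0 but prints an empty string when the path is not in HEAD, so the crash happens in the parsing, not in git itself:)

    >>> "".split()
    []
    >>> "".split()[2]
    Traceback (most recent call last):
      ...
    IndexError: list index out of range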

I had tried to make the exception log shorter and cut the line with the IndexError pointing at split()[2]. My bad!

Anyway, if we use git ls-files to keep only the files that currently exist in git, this error disappears.

@kantord
Owner

kantord commented Oct 26, 2024

I tried to dig deeper, and... there are actually two problems.

  1. git log ... takes a really long time on this repo (around 5 minutes). Do we need all the history here? Can we limit it to the last year, or maybe the last 1000 entries?
  2. Someone (me) committed the entire node_modules somewhere back in 2016, then removed it and added it to .gitignore.

To fix the second issue, we can use git ls-files and process only the files it returns. Take a look, please: #726

Yes, I think it should be possible to limit the amount of history we read, at least by time. One of the major reasons for reading git history is to calculate frecency scores, which are used for different things. One of them is prioritizing frequently and/or recently edited files for indexing, so that you can get decent results even before the indexing has finished. A limitation such as reading only the last year could be a decent proxy for this. This could still mean that in a huge repo with many contributors you would get a massive number of commits; however, in that case you are likely dealing with a monorepo or some sort of monolith, so maybe being able to index a subset of a repo would be a more interesting feature?
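
A sketch of the time-based variant (--since is a standard git log option; the function and the other flags are illustrative, since the thread doesn't show SeaGOAT's actual git log invocation):

    import subprocess

    def recent_history(repo_path: str, since: str = "1 year ago") -> str:
        # --since accepts human-readable dates ("1 year ago", "2024-01-01")
        # and bounds the history walk, so frecency scoring stays fast even
        # on repos with a decade of commits.
        return subprocess.check_output(
            ["git", "-C", repo_path, "log", f"--since={since}",
             "--format=%H %ct", "--name-only"],
            text=True,
        )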

@last-partizan
Contributor Author

This is a single repo, just with a lot of history, so a limit like --since 1y would be ideal.

I can create a PR with the changes; the only thing I want to ask first is: should we make this configurable, or is a default of 1 year enough?

@kantord
Owner

kantord commented Oct 27, 2024

This is a single repo, just with a lot of history, so a limit like --since 1y would be ideal.

I can create a PR with the changes; the only thing I want to ask first is: should we make this configurable, or is a default of 1 year enough?

This is a good question. I would make it configurable, because I think the minimum amount of history you'd want kind of varies 🤔 For instance, if you have a small long-term project in maintenance mode, with 2-3 maintainers averaging like 30 commits a year, you probably want a lot more. Whereas a project actively adding new features, with 200 maintainers working on it full time, will probably not need even 1 year...

I guess we could make it default to "infinite", add a small section to the docs explaining how to work with large repos, and mention this configuration option there.

last-partizan added a commit to last-partizan/SeaGOAT that referenced this issue Oct 29, 2024
@last-partizan
Contributor Author

I'm not making a PR before the other one gets merged (because this also requires self.config in the Repository class).

But take a look at this: last-partizan@ab184c1

Instead of --since, I'm using --max-count to limit the number of commits, and it works the same for both small and large repos. So I hope we can set a reasonable default value. I'm starting at 1000: it's super fast on the Linux repo, and under 5 seconds on my large repo. We could probably set it to 5_000 and it would still be a pretty good value. I want to avoid setting it to null, because on the Linux repo that means a few minutes of reading git history on my laptop.
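
In sketch form, the count-based variant looks the same with one flag swapped (illustrative again; the real change is in the commit linked above):

    import subprocess

    def limited_history(repo_path: str, max_commits: int = 1000) -> str:
        # --max-count stops the walk after a fixed number of commits
        # regardless of the repo's age, so the cost is predictable for
        # both small and huge repos.
        return subprocess.check_output(
            ["git", "-C", repo_path, "log", f"--max-count={max_commits}",
             "--format=%H %ct", "--name-only"],
            text=True,
        )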

Also, I need your advice on how to name this variable in the config.

kantord pushed a commit that referenced this issue Nov 17, 2024
* fix: Fix processing files from .gitignore

Refs #711

* refactor: Move filtering ignored files to Repository class

* ruff format

* chore: Add a test

* chore: Cleanup

* chore: Smallest possible working sleep time

* refactor: Add separate add_file_delete_commit function

* refactor: Use rg instead of git ls-files

* chore: Fix test name
last-partizan added a commit to last-partizan/SeaGOAT that referenced this issue Nov 18, 2024
kantord pushed a commit that referenced this issue Nov 23, 2024
* feat: Add an option to limit maximum commits used for frecency

Refs #711

* chore: Add debug logging

* chore: Add tests/refactor