Working on a large git repo #711

Open

last-partizan opened this issue Oct 10, 2024 · 8 comments

@last-partizan
Contributor

I tried this on a relatively large, old repo, and I'm getting this error (after waiting for a few minutes):

2024-10-10 10:30:07,184 Checking rwarning: exhaustive rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 77823 and retry the command.
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/home/serg/.local/pipx/venvs/seagoat/lib/python3.12/site-packages/seagoat/queue/base_queue.py", line 78, in _worker_function
    task = self._task_queue.get(timeout=0.1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:
...
  File "/home/serg/.local/pipx/venvs/seagoat/lib/python3.12/site-packages/seagoat/repository.py", line 119, in top_files
    (self.get_file(filename), score)
     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serg/.local/pipx/venvs/seagoat/lib/python3.12/site-packages/seagoat/repository.py", line 134, in get_file
    self.get_file_object_id(filename),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serg/.local/pipx/venvs/seagoat/lib/python3.12/site-packages/seagoat/repository.py", line 45, in get_file_object_id
    subprocess.check_output(

After raising renameLimit, it fails again on the same line, but with an IndexError. Upon investigation, it tries to access src/app/node_modules/abbrev/README.md, which should be ignored, because node_modules is in my .gitignore. But that .gitignore is not in the root, so maybe it treats ignore patterns differently ...

The output is '' (an empty string), and it fails when indexing the result of split().

I can make a patch for it if you suggest a proper way to do it. My first guess is to return None for such cases and filter them out later...
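
Roughly something like this, using the names from the traceback (a sketch only; everything apart from the git call is guessed, not the real repository.py):

    import subprocess
    from typing import Optional

    def get_file_object_id(self, filename: str) -> Optional[str]:
        output = subprocess.check_output(
            ["git", "-C", str(self.path), "ls-tree", "HEAD", str(filename)],
            text=True,
        )
        parts = output.split()
        if len(parts) < 3:
            # Path is not present in HEAD (deleted long ago, like the
            # node_modules files here): report "no object" instead of
            # raising IndexError on parts[2].
            return None
        return parts[2]

top_files() would then skip entries whose object id is None.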

@last-partizan
Contributor Author

Adding **/node_modules/* to ignorePatterns didn't fix this issue.

@kantord
Owner

kantord commented Oct 16, 2024

Adding **/node_modules/* to ignorePatterns didn't fix this issue.

But did it at least stop it from attempting to process those files?

In any case, it sounds like a bug: SeaGOAT should never attempt to process gitignored files, regardless of the ignore configuration. (It's debatable whether that's a good decision or not, but the sorting mechanism partially depends on git history, so that's how it's done at the moment.)

So if you are certain that those files are not managed by git, then this is some nasty bug in SeaGOAT.

I'm surprised to see that you get a warning about rename detection. I've run SeaGOAT on fairly large repositories such as Linux and React, and I don't remember getting such an error. Perhaps it also depends on the git version or on config the repo might already have. In any case, it seems like an exception is thrown because git returns a non-zero exit code, right? I think exact rename detection is not essential for SeaGOAT to function properly on large repositories; it would just make the frecency-based sorting less accurate. So I think we should probably ignore this warning. What do you think?

last-partizan added a commit to last-partizan/SeaGOAT that referenced this issue Oct 16, 2024
@last-partizan
Contributor Author

I tried to dig deeper, and... there are actually two problems.

  1. git log ... takes a really long time on this repo (around 5 minutes). Do we need all the history here? Can we limit it to the last year, or maybe the last 1000 entries?
  2. Someone (me) committed the entire node_modules somewhere back in 2016, then removed it and added it to .gitignore.

To fix the second issue, we can use git ls-files and process only the files it returns. Take a look, please: #726
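
The idea, roughly (a sketch of the approach, not the actual patch; the helper name is made up):

    import subprocess

    def get_tracked_files(repo_path: str) -> set[str]:
        # `git ls-files -z` lists only the paths tracked in the current
        # checkout, NUL-separated; files that exist only in old history
        # (like the once-committed node_modules) are not listed at all.
        output = subprocess.check_output(
            ["git", "-C", repo_path, "ls-files", "-z"],
            text=True,
        )
        return {path for path in output.split("\0") if path}

Frecency candidates coming out of git log would then be intersected with this set before any ls-tree lookups happen.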

@last-partizan
Contributor Author

last-partizan commented Oct 16, 2024

In any case, it seems like an exception is thrown because git gives a non-zero exit code, right?

Upon further investigation, it turned out that the renameLimit warning (coming from git log ...) wasn't crashing the process. It was just a warning printed to stderr by git, and it was already being ignored.

The crash was occurring in this part of the code:

        # repository.py, get_file_object_id(): git ls-tree prints
        # "<mode> blob <object-id>\t<path>", and we take the third field.
        object_id = (
            subprocess.check_output(
                [
                    "git",
                    "-C",
                    str(self.path),
                    "ls-tree",
                    "HEAD",
                    str(file_path),
                ],
                text=True,
            )
            .split()[2]  # IndexError: git prints nothing for untracked paths
            .strip()
        )
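
(For reference: git ls-tree HEAD <path> exits 0 but prints an empty string when the path is not in HEAD, so the crash happens in the parsing, not in git itself:)

    >>> "".split()
    []
    >>> "".split()[2]
    Traceback (most recent call last):
      ...
    IndexError: list index out of range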

I had tried to make the exception log shorter and cut the line with the IndexError pointing at split()[2]. My bad!

Anyway, if we use git ls-files to keep only the files that currently exist in git, this error disappears.

@kantord
Owner

kantord commented Oct 26, 2024

I tried to dig deeper, and... there are actually two problems.

  1. git log ... takes a really long time on this repo (around 5 minutes). Do we need all the history here? Can we limit it to the last year, or maybe the last 1000 entries?
  2. Someone (me) committed the entire node_modules somewhere back in 2016, then removed it and added it to .gitignore.

To fix the second issue, we can use git ls-files and process only the files it returns. Take a look, please: #726

Yes, I think it should be possible to limit the amount of history we read, at least by time. One of the major reasons for reading git history is to calculate frecency scores, which are used for different things. One of them is prioritizing frequently and/or recently edited files for indexing, so that you can get decent results even before the indexing has finished. A limitation such as reading only the last year could be a decent proxy for this. This could still mean that in a huge repo with many contributors you would get a massive number of commits; however, in that case you are likely dealing with a monorepo or some sort of monolith, so maybe being able to index a subset of a repo would be a more interesting feature?
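
A sketch of the time-based variant (--since is a standard git log option; the function and the other flags are illustrative, since the thread doesn't show SeaGOAT's actual git log invocation):

    import subprocess

    def recent_history(repo_path: str, since: str = "1 year ago") -> str:
        # --since accepts human-readable dates ("1 year ago", "2024-01-01")
        # and bounds the history walk, so frecency scoring stays fast even
        # on repos with a decade of commits.
        return subprocess.check_output(
            ["git", "-C", repo_path, "log", f"--since={since}",
             "--format=%H %ct", "--name-only"],
            text=True,
        )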

@last-partizan
Contributor Author

This is a single repo, just with a lot of history, so a limit like --since 1y would be ideal.

I can create a PR with the changes; the only thing I want to ask first is: should we make this configurable, or is a default of 1 year enough?

@kantord
Owner

kantord commented Oct 27, 2024

This is a single repo, just with a lot of history, so a limit like --since 1y would be ideal.

I can create a PR with the changes; the only thing I want to ask first is: should we make this configurable, or is a default of 1 year enough?

This is a good question. I would make it configurable, because I think the minimum amount of history you'd want kind of varies 🤔 For instance, if you have a small long-term project in maintenance mode, with 2-3 maintainers averaging like 30 commits a year, you probably want a lot more. Whereas a project actively adding new features, with 200 maintainers working on it full time, will probably not need even 1 year...

I guess we could make it default to "infinite", add a small section to the docs explaining how to work with large repos, and mention this configuration option there.

last-partizan added a commit to last-partizan/SeaGOAT that referenced this issue Oct 29, 2024
@last-partizan
Contributor Author

I'm not making a PR before the other one gets merged (because this also requires self.config in the Repository class).

But take a look at this: last-partizan@ab184c1

Instead of --since, I'm using --max-count to limit the number of commits, and it works the same for both small and large repos. So I hope we can set a reasonable default value. I'm starting at 1000: it's super fast on the Linux repo, and under 5 seconds on my large repo. We could probably set it to 5_000 and it would still be a pretty good value. I want to avoid setting it to null, because on the Linux repo that means a few minutes of reading git history on my laptop.
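
In sketch form, the count-based variant looks the same with one flag swapped (illustrative again; the real change is in the commit linked above):

    import subprocess

    def limited_history(repo_path: str, max_commits: int = 1000) -> str:
        # --max-count stops the walk after a fixed number of commits
        # regardless of the repo's age, so the cost is predictable for
        # both small and huge repos.
        return subprocess.check_output(
            ["git", "-C", repo_path, "log", f"--max-count={max_commits}",
             "--format=%H %ct", "--name-only"],
            text=True,
        )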

Also, I need your advice on how to name this variable in the config.

kantord pushed a commit that referenced this issue Nov 17, 2024
* fix: Fix processing files from .gitignore

Refs #711

* refactor: Move filtering ignored files to Repository class

* ruff format

* chore: Add a test

* chore: Cleanup

* chore: Smallest possible working sleep time

* refactor: Add separate add_file_delete_commit function

* refactor: Use rg instead of git ls-files

* chore: Fix test name
last-partizan added a commit to last-partizan/SeaGOAT that referenced this issue Nov 18, 2024
kantord pushed a commit that referenced this issue Nov 23, 2024
* feat: Add an option to limit maximum commits used for frecency

Refs #711

* chore: Add debug logging

* chore: Add tests/refactor