Working on a large git repo #711
Comments
Adding |
but did it at least stop it from attempting to process those files? In any case it sounds like a bug: SeaGOAT should never attempt to process gitignored files, regardless of the ignore configuration (it's debatable whether that is a good decision, but the sorting mechanism partially depends on git history, so that's how it's done at the moment). So if you are certain that those files are not managed by git, then it is some nasty bug in SeaGOAT. I'm surprised to see that you get a warning based on rename detection. I ran SeaGOAT on fairly large repositories such as Linux and React, and I don't remember getting such an error. Perhaps it also depends on the git version or on config that the repo might already have. In any case, it seems like an exception is thrown because git gives a non-zero exit code, right? I think that having exact rename detection for large repositories is not essential for SeaGOAT to function properly; it will just make the frecency-based sorting less accurate. So I think we should probably ignore this warning, what do you think? |
I tried to dig deeper, and ... there are actually two problems.
To fix second issue, we can use |
Upon further investigation, it turned out that the crash was occurring in this part of the code:
object_id = (
    subprocess.check_output(
        [
            "git",
            "-C",
            str(self.path),
            "ls-tree",
            "HEAD",
            str(file_path),
        ],
        text=True,
    )
    .split()[2]
    .strip()
)
I tried to make the exception log shorter and excluded the line with Anyway, if we use |
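A minimal sketch of one way to guard this call. `git ls-tree` prints nothing for a path that is not in the HEAD tree (for example an ignored, untracked file), so `.split()[2]` raises IndexError on the empty output. The helper names `parse_object_id` and `get_object_id` are hypothetical and not part of SeaGOAT; the sketch returns None for such paths so that callers can filter them out.

```python
import subprocess
from pathlib import Path
from typing import Optional


def parse_object_id(ls_tree_output: str) -> Optional[str]:
    """Extract the object id from one line of `git ls-tree` output.

    A line has the fields: mode, type, object id, path.
    Returns None when the output is empty or has fewer than three fields.
    """
    fields = ls_tree_output.split()
    if len(fields) < 3:
        return None
    return fields[2]


def get_object_id(repo_path: Path, file_path: Path) -> Optional[str]:
    """Return the object id of file_path at HEAD, or None if it is not in the tree."""
    output = subprocess.check_output(
        ["git", "-C", str(repo_path), "ls-tree", "HEAD", str(file_path)],
        text=True,
    )
    return parse_object_id(output)
```

Callers would then skip any file for which `get_object_id` returns None instead of crashing on it.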
yes, I think that it should be possible to limit the amount of history we read, at least by time. One of the major reasons git history is read is to calculate frecency scores, which are used for different things. One of them is prioritizing frequently and/or recently edited files for indexing, so that you can get decent results even before the indexing has finished. A limitation such as reading only the last 1 year could be a decent proxy for this. In a huge repo with many contributors you could still get a massive amount of commits; however, in that case you are likely dealing with a monorepo or some sort of monolith, so maybe being able to index a subset of a repo would be a more interesting feature? |
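To illustrate why a time limit is a reasonable proxy, here is a sketch of a frecency-style score. The formula (exponential decay with a 30-day half-life) and the name `frecency_score` are assumptions made for illustration; the thread does not show SeaGOAT's actual scoring.

```python
from typing import Iterable


def frecency_score(commit_ages_days: Iterable[float], half_life_days: float = 30.0) -> float:
    """Sum one decayed contribution per commit, so recent commits count more.

    Hypothetical formula: a commit loses half its weight every half_life_days.
    """
    return sum(0.5 ** (age / half_life_days) for age in commit_ages_days)


# Ages (in days) of the commits that touched each file; made-up sample data.
files = {"a.py": [1, 2, 400], "b.py": [300]}

# Files edited often and recently sort first.
ranked = sorted(files, key=lambda path: frecency_score(files[path]), reverse=True)
```

Under this assumed decay, a commit older than a year contributes almost nothing to the score, which is why dropping old history would barely change the ranking.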
This is a single repo, just with a lot of history, so limitation by I can create a PR with the changes; the only thing I want to ask first: should we make this configurable, or is a default of 1 year enough? |
this is a good question. I would make it configurable because I think that the minimum amount of history that you would prefer kinda varies 🤔 for instance, if you have a small long term project in maintenance with 2-3 maintainers that averages like 30 commits a year, probably you want a lot more. Whereas a project actively adding new features with 200 maintainers working on it full time will probably not need even 1 year... I guess that we could make it default to "infinite" and make a small section in the docs explaining how to work with large repos and mention this configuration option |
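A sketch of how such an option could map onto git, assuming the history is read with `git log`. The function name `build_git_log_args` and both parameter names are hypothetical; the thread does not show SeaGOAT's real option name or git invocation. `--max-count` and `--since` are standard `git log` options, and leaving both unset keeps the "infinite" default.

```python
from typing import List, Optional


def build_git_log_args(
    repo_path: str,
    max_commits: Optional[int] = None,
    since: Optional[str] = None,
) -> List[str]:
    """Build a `git log` command line with optional history limits.

    With both limits left as None, the full history is read.
    """
    args = ["git", "-C", repo_path, "log", "--format=%H"]
    if max_commits is not None:
        # Stop after this many commits.
        args.append(f"--max-count={max_commits}")
    if since is not None:
        # Only commits newer than this date, e.g. "1 year ago".
        args.append(f"--since={since}")
    return args
```

A small project in maintenance would leave both unset, while a very active one could pass something like `since="1 year ago"` or a commit cap.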
Not making a PR before the other one gets merged (because this also requires self.config in the repository). But take a look at this: last-partizan@ab184c1 Instead of Also, I need your advice on how to name this variable in the config. |
* fix: Fix processing files from .gitignore Refs #711
* refactor: Move filtering ignored files to Repository class
* ruff format
* chore: Add a test
* chore: Cleanup
* chore: Smallest possible working sleep time
* refactor: Add separate add_file_delete_commit function
* refactor: Use rg instead of git ls-files
* chore: Fix test name

* feat: Add an option to limit maximum commits used for frecency Refs #711
* chore: Add debug logging
* chore: Add tests/refactor
I tried this on a relatively large, old repo, and I'm getting this error (after waiting for a few minutes):
After raising renameLimit, it again fails on the same line, but with IndexError. Upon investigation, it tries to access src/app/node_modules/abbrev/README.md, which should be ignored, because node_modules is in my .gitignore. But not in the root, so maybe it treats ignore patterns differently ...output is '' (an empty string), and it fails to access the element at index 2 after split().
I can make a patch for it, if you suggest a proper way to do it. My first guess is to return None for such cases and filter them later...