large repository makes index page very slow #309
Comments
How many refs in the repo? |
Not a single one as a regular file in the local testing repository, the one on the server has the master ref as a regular file, the rest are packed refs (207329 lines in that file). |
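For what it's worth, the ref count is easy to check with plain git. A self-contained sketch using a throwaway repository (the real numbers of course come from the mirrored repo, not this toy one):

```shell
set -e
# Throwaway repo with one branch and one tag, to demonstrate counting refs.
tmp=$(mktemp -d)
git init -q "$tmp"
git -C "$tmp" -c user.email=a@b -c user.name=t commit -q --allow-empty -m init
git -C "$tmp" tag v1
# for-each-ref lists every ref, loose or packed alike; on a real mirror
# this is where numbers like 200k show up.
git -C "$tmp" for-each-ref | wc -l
rm -rf "$tmp"
```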
The sheer number of refs is the problem here; it means 200k random object accesses. |
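To make that cost concrete, here is a toy model of the access pattern: finding the newest commit time means resolving every ref, so the number of object-store lookups grows linearly with the ref count. All data and names here are invented for illustration; this is not dulwich's or klaus's actual API.

```python
# Toy "object store": sha -> (type, payload). Resolving the newest commit
# time costs one store lookup per ref, plus one extra lookup per annotated
# tag to peel it down to the commit it points at.
store = {
    "c1": ("commit", 100),   # payload: commit timestamp
    "c2": ("commit", 200),
    "t1": ("tag", "c1"),     # annotated tag pointing at commit c1
}
refs = {"refs/heads/master": "c2", "refs/tags/v1": "t1"}

def last_updated_at(refs, store):
    newest, lookups = None, 0
    for sha in refs.values():
        kind, payload = store[sha]
        lookups += 1
        while kind == "tag":                 # peel annotated tags
            kind, payload = store[payload]
            lookups += 1
        if kind == "commit":
            newest = payload if newest is None else max(newest, payload)
    return newest, lookups

# With ~200k refs the same loop needs at least 200k random accesses.
print(last_updated_at(refs, store))  # → (200, 3)
```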
What can we do to make it faster? |
When I initially built Klaus my use case was to have an easy way to browse commits locally, essentially a better UI for |
The thing is, the site for the project itself loads reasonably well (or at least as well as I'd expect with that repository). It's really just that difference in load times that makes me think the code for the index does more work than it needs to. |
I started scrolling through the code just now and it struck me that "index" is used in a different manner in the code than in this issue; the page that is very slow is specifically the repo_list (line 79 in a6d322a).
The loading times of the repo list can be mitigated entirely for my use case by not querying the code at line 64 in a6d322a. |
I wonder if this is related to determining the timestamp of the latest change to the repo. Maybe related to #248 |
If that’s indeed the case the page should load much faster if you order by name instead of last update. |
I tested that; however, it then hangs when rendering the template for the repo list, because that one still displays the timestamp, so it just moves the "lazily parse all refs of the entire repository" step into the template. |
Do you want to try to hotfix the code so that it doesn’t look up any timestamps? |
In the process of debugging this I made the following patch:
From 10f646fb1e38eb1e4469398915a8e3010ddb07c6 Mon Sep 17 00:00:00 2001
From: benaryorg <[email protected]>
Date: Sun, 2 Apr 2023 10:33:45 +0000
Subject: [PATCH] retrieve only HEAD for last updated of repo
Signed-off-by: benaryorg <[email protected]>
---
klaus/repo.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/klaus/repo.py b/klaus/repo.py
index 5033607..590dc6d 100644
--- a/klaus/repo.py
+++ b/klaus/repo.py
@@ -61,7 +61,7 @@ class FancyRepo(dulwich.repo.Repo):
"""Get datetime of last commit to this repository."""
# Cache result to speed up repo_list.html template.
# If self.get_refs() has changed, we should invalidate the cache.
- all_refs = self.get_refs()
+ all_refs = [ b'HEAD', ]
return cached_call(
key=(id(self), "get_last_updated_at"),
validator=all_refs,
--
2.39.2
The patch is far from production quality and I'm not sure about the implications. |
Maybe we can have a rule that stops looking up the timestamps and just uses HEAD for the timestamp if a repo has more than N refs. |
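Such a rule could be a small wrapper around the existing ref lookup. A sketch under the assumption of a hypothetical `MAX_REFS` threshold; none of these names exist in klaus:

```python
# Hypothetical cut-off rule (MAX_REFS and refs_to_scan are made up, not
# klaus code): exact answer for small repos, HEAD-only above N refs.
MAX_REFS = 1000

def refs_to_scan(all_refs):
    """Choose which refs feed the "last updated" timestamp."""
    if len(all_refs) <= MAX_REFS:
        return sorted(all_refs)        # small repo: scan everything
    return [b"HEAD"]                   # huge repo: approximate via HEAD

small = {b"HEAD": b"s1", b"refs/heads/master": b"s1"}
huge = {b"refs/pull/%d/head" % i: b"s1" for i in range(200_000)}
print(refs_to_scan(small))   # [b'HEAD', b'refs/heads/master']
print(refs_to_scan(huge))    # [b'HEAD']
```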
Or something like: we stop looking at all refs and instead look at a hardcoded list of typical master branch names and the top K tags determined by sorting by name. |
FWIW GitHub also seems to just give up beyond a certain number of refs; it just displays a handful for that repository. |
Ah, so one caveat I've discovered so far happens when the default branch isn't set up 100% correctly. So personally the combination of the two approaches would be great: check HEAD first, and if that doesn't resolve to any commits, fall back to a list of usual default branch names. Edit: the |
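That combination could look roughly like this; `pick_timestamp_ref` and `FALLBACK_REFS` are hypothetical names for illustration, not klaus code:

```python
# Hypothetical combination of both approaches: use HEAD when it resolves,
# otherwise fall back to a short list of common default-branch names.
# refs maps ref name -> sha; names and logic here are illustrative only.
FALLBACK_REFS = [b"refs/heads/master", b"refs/heads/main", b"refs/heads/trunk"]

def pick_timestamp_ref(refs):
    if refs.get(b"HEAD"):          # HEAD present and pointing at something
        return b"HEAD"
    for name in FALLBACK_REFS:     # misconfigured HEAD: try the usual names
        if name in refs:
            return name
    return None                    # give up rather than scan 200k refs

print(pick_timestamp_ref({b"HEAD": b"abc"}))             # b'HEAD'
print(pick_timestamp_ref({b"refs/heads/main": b"abc"}))  # b'refs/heads/main'
print(pick_timestamp_ref({}))                            # None
```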
One thing I realised is that |
Since I am specifically mirroring the repository, yes. Edit: I realise that's a little short. What I am doing is keeping the history of the repository in case something upstream changes, whether that is a compromised ecosystem or just GitHub having issues, so I can make use of everything, including the PRs. |
Issue
Similar to #230 I too have cloned the nixpkgs repository and am now experiencing long wait times.
The corresponding path is quite slow, but it doesn't run into the default gunicorn timeout of 30s (though it's almost there with its load time of ~20s).
However when I access the index page – the one listing all repositories – I get a 5xx error from nginx due to a gateway timeout.
I opened this one against klaus rather than dulwich because it only seems to be triggered by the index page, not the project-specific page (and the only other project there has around 100 commits and is maybe a megabyte in size).
Details
Commands to reproduce:
I can reproduce this on both Gentoo and NixOS (klaus 2.0.2, dulwich 0.21.3 and 0.20.50 respectively).
I can also reproduce it with the Docker image (dd9eaa5d40c7) using this:
podman run --rm -p 7777:80 -v $PWD/:/repos/ -it jonashaag/klaus:latest klaus --host 0.0.0.0 --port 80 /repos/nixpkgs.git
Note that sometimes I also get a traceback like the following after a minute or so of dulwich reading the pack files; the error seems spurious (no idea what it's about) and usually goes away after one or two occurrences:
zlib.error: Error -3 while decompressing data: incorrect header check