
large repository makes index page very slow #309

Closed
benaryorg opened this issue Apr 1, 2023 · 18 comments · Fixed by #310

Comments


benaryorg commented Apr 1, 2023

Issue

Similar to #230 I too have cloned the nixpkgs repository and am now experiencing long wait times.
The corresponding path is quite slow, but it doesn't run into the default gunicorn timeout of 30s (though it's almost there with its load time of ~20s).
However when I access the index page – the one listing all repositories – I get a 5xx error from nginx due to a gateway timeout.

I opened this one with klaus rather than dulwich because it only seems to get triggered by the index page, not the project-specific page (and the only other project there has maybe 100 commits and about a megabyte of data, I guess).

Details

Commands to reproduce:

# go somewhere with ~5GiB of diskspace
cd "$(mktemp -dp /tmp)"
# this takes ages
git clone --mirror https://github.com/NixOS/nixpkgs
# it did happen with other command lines but this is the last one in my history that I know worked
klaus * --port 8888 --debug

# this should run pretty fast (0.395s)
curl http://localhost:8888/nixpkgs/
# this will take very long (35.289s)
curl http://localhost:8888/

I can reproduce this on both Gentoo and NixOS (klaus 2.0.2, dulwich 0.21.3 and 0.20.50 respectively).
I can also reproduce it with the Docker image (dd9eaa5d40c7) using this:

podman run --rm -p 7777:80 -v $PWD/:/repos/ -it jonashaag/klaus:latest klaus --host 0.0.0.0 --port 80 /repos/nixpkgs.git

Note that sometimes, after a minute or so of dulwich reading the pack files, I also get a traceback like the following. The error seems spurious; I have no idea what it's about, and it usually goes away after one or two occurrences:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/flask/app.py", line 2551, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/lib/python3.10/site-packages/klaus/utils.py", line 62, in __call__
    return super(ProxyFix, self).__call__(environ, start_response)
  File "/usr/lib/python3.10/site-packages/werkzeug/middleware/proxy_fix.py", line 187, in __call__
    return self.app(environ, start_response)
  File "/usr/lib/python3.10/site-packages/flask/app.py", line 2531, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/lib/python3.10/site-packages/flask/app.py", line 2528, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/lib/python3.10/site-packages/klaus/views.py", line 79, in repo_list
    repos = sorted(repos, key=sort_key)
  File "/usr/lib/python3.10/site-packages/klaus/views.py", line 75, in <lambda>
    -(repo.fast_get_last_updated_at() or -1),
  File "/usr/lib/python3.10/site-packages/klaus/repo.py", line 363, in fast_get_last_updated_at
    self.__last_updated_at = self.__repo.get_last_updated_at()
  File "/usr/lib/python3.10/site-packages/klaus/repo.py", line 65, in get_last_updated_at
    return cached_call(
  File "/usr/lib/python3.10/site-packages/klaus/repo.py", line 36, in cached_call
    data = producer()
  File "/usr/lib/python3.10/site-packages/klaus/repo.py", line 68, in <lambda>
    producer=lambda: self._get_last_updated_at(all_refs),
  File "/usr/lib/python3.10/site-packages/klaus/repo.py", line 75, in _get_last_updated_at
    resolveable_refs.append(self[ref_hash])
  File "/usr/lib/python3.10/site-packages/dulwich/repo.py", line 771, in __getitem__
    return self.object_store[self.refs[name]]
  File "/usr/lib/python3.10/site-packages/dulwich/object_store.py", line 123, in __getitem__
    type_num, uncomp = self.get_raw(sha1)
  File "/usr/lib/python3.10/site-packages/dulwich/object_store.py", line 541, in get_raw
    return pack.get_raw(sha)
  File "/usr/lib/python3.10/site-packages/dulwich/pack.py", line 2377, in get_raw
    obj_type, obj = self.data.get_object_at(offset)
  File "/usr/lib/python3.10/site-packages/dulwich/pack.py", line 1328, in get_object_at
    unpacked = self.get_unpacked_object_at(offset, include_comp=False)
  File "/usr/lib/python3.10/site-packages/dulwich/pack.py", line 1313, in get_unpacked_object_at
    unpacked, _ = unpack_object(self._file.read, include_comp=include_comp)
  File "/usr/lib/python3.10/site-packages/dulwich/pack.py", line 871, in unpack_object
    unused = read_zlib_chunks(
  File "/usr/lib/python3.10/site-packages/dulwich/pack.py", line 297, in read_zlib_chunks
    decomp = decomp_obj.decompress(add)
zlib.error: Error -3 while decompressing data: incorrect header check
jelmer (Contributor) commented Apr 1, 2023

How many refs in the repo?

benaryorg (Author)

> How many refs in the repo?

Not a single one as a regular file in the local testing repository. The one on the server has the master ref as a regular file; the rest are packed refs (207,329 lines in that file).
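
For reference, a minimal dulwich sketch to count the refs klaus will iterate over; the repository path is a hypothetical example, not a path from this issue:

# Count the refs dulwich reports for a bare repository.
from dulwich.repo import Repo

repo = Repo("nixpkgs.git")  # hypothetical path to the mirror
print(len(repo.get_refs()))  # ~200k for a full nixpkgs mirror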

jelmer (Contributor) commented Apr 1, 2023

> How many refs in the repo?
>
> Not a single one as a regular file in the local testing repository. The one on the server has the master ref as a regular file; the rest are packed refs (207,329 lines in that file).

The sheer number of refs is the problem here; it means 200k random object accesses.
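
Roughly, the per-repo work the index page triggers looks like the following dulwich sketch (illustrative only, not klaus's exact code; the path is hypothetical, and per the traceback the real code lives in klaus/repo.py):

# Resolve every ref to an object to find the newest commit time.
from dulwich.repo import Repo

repo = Repo("nixpkgs.git")  # hypothetical path
newest = 0
for ref, sha in repo.get_refs().items():  # ~200k entries here
    obj = repo[sha]  # one random pack access per ref (the hot spot)
    newest = max(newest, getattr(obj, "commit_time", 0))  # tag objects lack commit_time
print(newest)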

jonashaag (Owner)

What can we do to make it faster?

jonashaag (Owner)

When I initially built Klaus, my use case was to have an easy way to browse commits locally, essentially a better UI for git log + git diff. I didn’t expect people to do crazy stuff like hosting thousands of repos (@jelmer :)) or repos with hundreds of thousands of commits.

benaryorg (Author)

> How many refs in the repo?
>
> Not a single one as a regular file in the local testing repository. The one on the server has the master ref as a regular file; the rest are packed refs (207,329 lines in that file).
>
> The sheer number of refs is the problem here; it means 200k random object accesses.

The thing is, the site for the project itself loads reasonably well (or at least as well as I'd expect with that repository); it's only the index page that has abysmal load times for me. Without knowing the code, I'd assume the project page itself does a little more lifting than the index page, which effectively just retrieves the time of the last commit on the default branch(?) and the description.

It's really just that difference in load times that makes me think the code for the index is doing more work than it needs to.

benaryorg (Author)

I started scrolling through the code just now, and it struck me that "index" is used in a different manner than I used it in this issue; the page that is very slow is specifically the repo_list. To be precise, I've pulled the whole thing into a venv and am fiddling with it by now; this line is the culprit (when ordering is set to last_updated):

repos = sorted(repos, key=sort_key)

The loading times of the repo list can be mitigated entirely for my use case by not querying last_updated for all refs, just the default one (which IMHO is good enough for the repo list). I.e. replacing the following code with [ b'HEAD' ], although the default_branch method that I've seen in some other issues should probably be used (though PR № 308 would make them equal anyways):

all_refs = self.get_refs()
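
As a rough dulwich timing sketch (hypothetical path): listing the refs is cheap; it's resolving each of them to an object that costs ~200k pack accesses, whereas the HEAD-only variant is a single lookup:

import time
from dulwich.repo import Repo

repo = Repo("nixpkgs.git")  # hypothetical path

start = time.perf_counter()
repo[b"HEAD"]  # the single lookup the proposed change performs
print("HEAD only:", time.perf_counter() - start)

start = time.perf_counter()
refs = repo.get_refs()  # listing is cheap; resolving each ref is not
print(len(refs), "refs listed in", time.perf_counter() - start)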

jonashaag (Owner)

I wonder if this is related to determining the timestamp of the latest change to the repo. Maybe related to #248.

jonashaag (Owner)

If that’s indeed the case, the page should load much faster if you order by name instead of update.

benaryorg (Author)

> If that’s indeed the case, the page should load much faster if you order by name instead of update.

I tested that; however, it then hangs when rendering the template for the repo list, because that one still displays the timestamp. It just moves the "lazily parse all refs of the entire repository" work into the template.

jonashaag (Owner)

Do you want to try to hotfix the code so that it doesn’t look up any timestamps?

benaryorg (Author)

In the process of debugging this I made get_last_updated_at() (or whatever it's called) return 0, which worked nicely. Currently I'm running this patch, which only looks up the HEAD ref, and it works, as noted earlier:

From 10f646fb1e38eb1e4469398915a8e3010ddb07c6 Mon Sep 17 00:00:00 2001
From: benaryorg <[email protected]>
Date: Sun, 2 Apr 2023 10:33:45 +0000
Subject: [PATCH] retrieve only HEAD for last updated of repo

Signed-off-by: benaryorg <[email protected]>
---
 klaus/repo.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/klaus/repo.py b/klaus/repo.py
index 5033607..590dc6d 100644
--- a/klaus/repo.py
+++ b/klaus/repo.py
@@ -61,7 +61,7 @@ class FancyRepo(dulwich.repo.Repo):
         """Get datetime of last commit to this repository."""
         # Cache result to speed up repo_list.html template.
         # If self.get_refs() has changed, we should invalidate the cache.
-        all_refs = self.get_refs()
+        all_refs = [ b'HEAD', ]
         return cached_call(
             key=(id(self), "get_last_updated_at"),
             validator=all_refs,
-- 
2.39.2

The patch is far from production quality and I'm not sure about the implications.

jonashaag (Owner) commented Apr 2, 2023

Maybe we can have a rule that stops looking up the timestamps and just uses HEAD for the timestamp if a repo has more than N refs.
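
A minimal sketch of that rule; REF_LIMIT ("N") is an assumed cutoff, not an existing klaus setting:

REF_LIMIT = 1000  # hypothetical cutoff "N"

def refs_for_last_updated(repo):
    all_refs = repo.get_refs()
    if len(all_refs) > REF_LIMIT:
        return [b"HEAD"]  # too many refs: fall back to HEAD only
    return list(all_refs)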

jonashaag (Owner)

Or something like: we stop looking at all refs and instead look at a hardcoded list of typical master branch names and the top K tags determined by sorting by name.
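
Sketched below with assumed branch names and an assumed K; none of this is existing klaus code:

TYPICAL_HEADS = [b"HEAD", b"refs/heads/master", b"refs/heads/main"]
TOP_K_TAGS = 10  # hypothetical "K"

def candidate_refs(repo):
    refs = repo.get_refs()  # dulwich includes HEAD in this dict
    heads = [name for name in TYPICAL_HEADS if name in refs]
    tags = sorted(name for name in refs if name.startswith(b"refs/tags/"))
    return heads + tags[-TOP_K_TAGS:]  # "top" K tags, sorted by name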

jelmer (Contributor) commented Apr 2, 2023

FWIW GitHub also seems to just give up beyond a certain number of refs; it just displays a handful for that repository.

benaryorg (Author) commented Apr 3, 2023

Ah, so one caveat I've discovered so far happens when the default branch isn't set up 100% correctly. I use gitolite under the hood, and with both the create-on-clone and the create-on-push methods the repository on the server side gets initialised with the git defaults (or configured settings), making HEAD point to refs/heads/master in this case. Since all of my default branches are main, that means klaus, when retrieving HEAD, will fail to resolve it and instead display the repository as "no commits yet", making the link unclickable (opening it manually works just fine though). I probably should be using gitolite's symbolic-ref non-core thingy for that, but it might affect existing users, so I'd just like to throw that in here.
(This is something to be aware of in #308 at least.)
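
Repointing HEAD on the server-side bare repo fixes the resolution; as a dulwich sketch (equivalent to git symbolic-ref HEAD refs/heads/main, with a hypothetical path):

from dulwich.repo import Repo

repo = Repo("nixpkgs.git")  # hypothetical path to the bare repo
repo.refs.set_symbolic_ref(b"HEAD", b"refs/heads/main")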

So personally, a combination of the two approaches would be great: check HEAD first, and if that doesn't resolve to any commits, fall back to a list of usual default branch names. That's just my opinion though; see the sketch below.
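
A sketch of that combination (the fallback list is an assumption):

FALLBACK_HEADS = [b"refs/heads/main", b"refs/heads/master"]  # assumed names

def last_updated_ref(repo):
    for ref in [b"HEAD"] + FALLBACK_HEADS:
        try:
            repo[ref]  # raises KeyError if the ref doesn't resolve
            return ref
        except KeyError:
            continue
    return None  # nothing resolvable, e.g. an empty repository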

Edit: the symbolic-ref solution is ~fine~, except for the caching of the last_updated timestamp, presumably because it parses all refs. If this issue is at some point satisfactorily fixed, maybe the caching can be removed (if it only has to look at like ten files and parse part of a pack, instead of a whole lot more).

jonashaag (Owner)

One thing I realised is that --mirror seems to download all the refs/pull/ refs as well. Are they useful for your use case?

benaryorg (Author) commented Apr 5, 2023

> One thing I realised is that --mirror seems to download all the refs/pull/ refs as well. Are they useful for your use case?

Since I am specifically mirroring the repository, yes.

Edit: I realised that's a little short. What I am doing is keeping the history of the repository in case something upstream changes, whether that is a compromised ecosystem or just GitHub having issues, so I want to keep everything, including the PR refs.
