Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch wheel metadata by async range requests on the remote wheel #301

Merged
merged 15 commits into from
Nov 6, 2023

Conversation

konstin
Copy link
Member

@konstin konstin commented Nov 2, 2023

Use range requests and async zip to extract the METADATA file from a remote wheel.

We currently only cache when the remote says the remote declares the resource as immutable, see 06chaynes/http-cache#57 and prefix-dev/async_http_range_reader#1 . The cache is stored as json with the description omitted, this improve cache deserialization performance.

@konstin konstin force-pushed the async-range-requests branch 2 times, most recently from 7c09721 to 788b535 Compare November 3, 2023 12:11
@konstin konstin changed the title Async range requests Fetch wheel metadata by async range requests on the remote wheel Nov 3, 2023
@charliermarsh
Copy link
Member

@konstin - Is this ready for review?

@konstin konstin force-pushed the async-range-requests branch from 40d2e4f to 9c8cfad Compare November 5, 2023 20:59
/// The metadata name may be uppercase, while the wheel and dist info names are lowercase, or
/// the metadata name and the dist info name are lowercase, while the wheel name is uppercase.
/// Either way, we just search the wheel for the name
pub fn find_dist_info_metadata<'a, T: Copy>(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect we'll have to reuse this, at least for async zip and unpacked to the fs

}
})?;
Metadata21::parse(text.as_bytes()).map_err(std::convert::Into::into)
debug!("Downloading whole wheel to extract metadata from {url}");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch exists but has not been given care since i much rather hope that registries/file hosts support range requests.

@konstin konstin marked this pull request as ready for review November 5, 2023 22:23
@konstin
Copy link
Member Author

konstin commented Nov 5, 2023

It had a branch and tests missing, now it's ready

};
// otherwise, try using HTTP range requests to read only the `.dist-info/METADATA` file
// from the zip, and if that also fails, download the whole wheel into the cache and read
// from there
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put this comment within the else?

let mut metadata = Metadata21::parse(text.as_bytes())?;
if metadata.description.is_some() {
// Perf (and cache size) improvement
metadata.description = Some("[omitted]".to_string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this for now -- we should revisit this holistically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g., I think it could be confusing that the metadata is sometimes complete and sometimes incomplete depending on where it comes from. I'd rather store the whole thing, and then change all representations at once.

#[error(transparent)]
AsyncHttpRangeReader(#[from] AsyncHttpRangeReaderError),

/// Invalid dist-info dir
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you either remove this comment or fold it into the error message? As a reader I don't totally understand what it's saying or why it's relevant.

Cargo.toml Outdated
@@ -14,6 +14,8 @@ license = "MIT OR Apache-2.0"

[workspace.dependencies]
anyhow = { version = "1.0.75" }
async_http_range_reader = { version = "0.3.0", git = "https://github.com/baszalmstra/async_http_range_reader", ref = "4cafe5afda889d53060e0565c949d4ffd6ef3786" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is published, can we remove the Git dep now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not published, i've removed the wrong version

&self,
filename: &WheelFilename,
url: &Url,
) -> Result<Metadata21, Error> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of wish we could just pass down our cached HTTP client to the async request crate, then we wouldn't need a custom cache implementation (nor would we need to expose the raw client). But I guess the thinking is that we'll need to do this anyway at some point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't cache those range requests, just the metadata we need eventually (which is even less than the Metadata21 we have atm). I found deserialization being the major performance bottleneck in the other prototype (even though the deserialization was in rust and the resolver was in python), so i try to cache as minimal data as possible

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but I'd rather have a consistent caching strategy, and then change over to a better consistent caching strategy, rather than use separate strategies for separate requests. But it's not worth blocking on.

Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good! A few small comments, and I'd prefer to move the license attribution directly to the source.

@konstin konstin force-pushed the async-range-requests branch from c78d910 to df5d3ea Compare November 6, 2023 09:52
@konstin konstin force-pushed the async-range-requests branch from df5d3ea to c58f4ce Compare November 6, 2023 13:58
@konstin konstin merged commit b2439b2 into main Nov 6, 2023
@konstin konstin deleted the async-range-requests branch November 6, 2023 14:06
@astral-sh astral-sh deleted a comment from konstin Feb 14, 2024
@astral-sh astral-sh deleted a comment from konstin Feb 14, 2024
@astral-sh astral-sh deleted a comment from konstin Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants