-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fetch wheel metadata by async range requests on the remote wheel #301
Conversation
7c09721
to
788b535
Compare
@konstin - Is this ready for review? |
40d2e4f
to
9c8cfad
Compare
/// The metadata name may be uppercase, while the wheel and dist info names are lowercase, or | ||
/// the metadata name and the dist info name are lowercase, while the wheel name is uppercase. | ||
/// Either way, we just search the wheel for the name | ||
pub fn find_dist_info_metadata<'a, T: Copy>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect we'll have to reuse this, at least for async zip and unpacked to the fs
} | ||
})?; | ||
Metadata21::parse(text.as_bytes()).map_err(std::convert::Into::into) | ||
debug!("Downloading whole wheel to extract metadata from {url}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch exists but has not been given care since i much rather hope that registries/file hosts support range requests.
It had a branch and tests missing, now it's ready |
crates/puffin-client/src/client.rs
Outdated
}; | ||
// otherwise, try using HTTP range requests to read only the `.dist-info/METADATA` file | ||
// from the zip, and if that also fails, download the whole wheel into the cache and read | ||
// from there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put this comment within the else
?
crates/puffin-client/src/client.rs
Outdated
let mut metadata = Metadata21::parse(text.as_bytes())?; | ||
if metadata.description.is_some() { | ||
// Perf (and cache size) improvement | ||
metadata.description = Some("[omitted]".to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this for now -- we should revisit this holistically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E.g., I think it could be confusing that the metadata is sometimes complete and sometimes incomplete depending on where it comes from. I'd rather store the whole thing, and then change all representations at once.
crates/puffin-client/src/error.rs
Outdated
#[error(transparent)] | ||
AsyncHttpRangeReader(#[from] AsyncHttpRangeReaderError), | ||
|
||
/// Invalid dist-info dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you either remove this comment or fold it into the error message? As a reader I don't totally understand what it's saying or why it's relevant.
Cargo.toml
Outdated
@@ -14,6 +14,8 @@ license = "MIT OR Apache-2.0" | |||
|
|||
[workspace.dependencies] | |||
anyhow = { version = "1.0.75" } | |||
async_http_range_reader = { version = "0.3.0", git = "https://github.com/baszalmstra/async_http_range_reader", ref = "4cafe5afda889d53060e0565c949d4ffd6ef3786" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is published, can we remove the Git dep now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not published, i've removed the wrong version
&self, | ||
filename: &WheelFilename, | ||
url: &Url, | ||
) -> Result<Metadata21, Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort of wish we could just pass down our cached HTTP client to the async request crate, then we wouldn't need a custom cache implementation (nor would we need to expose the raw client). But I guess the thinking is that we'll need to do this anyway at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't cache those range requests, just the metadata we need eventually (which is even less than the Metadata21
we have atm). I found deserialization being the major performance bottleneck in the other prototype (even though the deserialization was in rust and the resolver was in python), so i try to cache as minimal data as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but I'd rather have a consistent caching strategy, and then change over to a better consistent caching strategy, rather than use separate strategies for separate requests. But it's not worth blocking on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good! A few small comments, and I'd prefer to move the license attribution directly to the source.
c78d910
to
df5d3ea
Compare
df5d3ea
to
c58f4ce
Compare
Use range requests and async zip to extract the METADATA file from a remote wheel.
We currently only cache when the remote says the remote declares the resource as immutable, see 06chaynes/http-cache#57 and prefix-dev/async_http_range_reader#1 . The cache is stored as json with the description omitted, this improve cache deserialization performance.