Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconsider HttpClient interface #874

Open
janbuchar opened this issue Jan 6, 2025 · 3 comments
Open

Reconsider HttpClient interface #874

janbuchar opened this issue Jan 6, 2025 · 3 comments
Assignees
Labels
debt Code quality improvement or decrease of technical debt. solutioning The issue is not being implemented but only analyzed and planned. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@janbuchar
Copy link
Collaborator

As of now, the python BaseHttpClient look like this:

class BaseHttpClient(ABC):

It has two methods, send_request and crawl. This is the first iteration of decoupled HTTP clients.

Later on, we refactored the JS version to use this one:

https://github.com/apify/crawlee/blob/f912b8b06da2bc4f3f3db508cc39c936a5c87f23/packages/core/src/http_clients/base-http-client.ts#L179

It also has two methods, sendRequest and stream. Unlike the python version, the signatures of those methods match quite well. It is worth noting that the two serious attempts at implementing this interface (so far) both couldn't manage to implement stream correctly. Although we could probably live without it in the most common case, streaming is paramount for downloading files (potentially large ones), which is a use case that we want to support.

We should simplify this interface and make it look the same in both versions. Any thoughts on how to achieve that? @vdusek @B4nan @barjin... and whoever else wants to chat 🙂

@janbuchar janbuchar added debt Code quality improvement or decrease of technical debt. solutioning The issue is not being implemented but only analyzed and planned. labels Jan 6, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 6, 2025
@B4nan B4nan self-assigned this Jan 13, 2025
@B4nan
Copy link
Member

B4nan commented Jan 15, 2025

I like the JS version, but given many HTTP clients won't support streaming or it will be challenging to implement, I think we could have a default implementation of the stream method as part of the base class. So in JS instead of having an interface, it would be an abstract class with the stream method implemented with a fallback like we do with the curl-impersonate client.

I am not sure why there is that crawl method in python, if we don't support streaming there at all, ideally we would drop it (and implement a proxy at the call site if we need a different signature). Feels weird that crawler.run and send_request would use something else. And we should surely think of how to support streaming there too, but that's another story.

@vdusek
Copy link
Collaborator

vdusek commented Jan 16, 2025

I am not sure why there is that crawl method in python, if we don't support streaming there at all, ideally we would drop it (and implement a proxy at the call site if we need a different signature). Feels weird that crawler.run and send_request would use something else. And we should surely think of how to support streaming there too, but that's another story.

Mostly historical reasons - the first attempt of implementation of HTTP clients. The crawl method is being called from the crawler.run, and send_request is being called from the send_request context helper. They have a slightly different APIs, but I believe they could be unified.

I like the JS version, but given many HTTP clients won't support streaming or it will be challenging to implement, I think we could have a default implementation of the stream method as part of the base class. So in JS instead of having an interface, it would be an abstract class with the stream method implemented with a fallback like we do with the curl-impersonate client.

I like the refactored version in JS as well and agree with the streaming. IMO it's acceptable if not all clients support streaming. A default implementation of the stream method in the base class, with a warning when it's not fully supported, would be okay fallback.

@janbuchar
Copy link
Collaborator Author

I like the JS version, but given many HTTP clients won't support streaming or it will be challenging to implement, I think we could have a default implementation of the stream method as part of the base class. So in JS instead of having an interface, it would be an abstract class with the stream method implemented with a fallback like we do with the curl-impersonate client.

I like the refactored version in JS as well and agree with the streaming. IMO it's acceptable if not all clients support streaming. A default implementation of the stream method in the base class, with a warning when it's not fully supported, would be okay fallback.

This is preferrable in the arguably common case when streaming cannot be implemented. I can't shake the feeling that stream-based implementations should be the default because it's easy to read a stream into memory when needed, but doing it the other way around nullifies the benefit of streams. Plus it's not nice to require people to implement both a streaming and non-streaming version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
debt Code quality improvement or decrease of technical debt. solutioning The issue is not being implemented but only analyzed and planned. t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

No branches or pull requests

3 participants