Streaming parsing of JSON array in Spring WebClient #24951

Open
HaloFour opened this issue Apr 21, 2020 · 10 comments
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) status: blocked An issue that's blocked on an external project change type: enhancement A general enhancement

Comments

@HaloFour
Contributor

HaloFour commented Apr 21, 2020

I found #21862 which is pretty close to my request but closed.

I am currently using Spring WebClient with Spring Boot 2.2.6 and Spring Framework 5.2.5 writing a service that sits in front of a number of other upstream services and transforms their response for public consumption. Some of these services respond with very large JSON payloads that are little more than an array of entities wrapped in a JSON document, usually with no other properties:

{
    "responseRoot": {
        "entities": [
            { "id": "1" },
            { "id": "2" },
            { "id": "n" }
        ]
    }
}

There could be many thousands of entities in this nested array and the entire payload can be tens of MBs. I want to be able to read in these entities through a Flux<T> so that I can transform them individually and write them out to the client without having to deserialize all of them into memory. This doesn't appear to be something that Spring WebFlux supports out of the box.

I'm currently exploring writing my own BodyExtractor which reuses some of the code in Jackson2Tokenizer to try to support this. My plan is to accept a JsonPointer to the location of the array and then parse asynchronously until I find that array, then to buffer the tokens for each array element to deserialize them.

var flux = client.get()
    .uri(uri)
    .exchange()
    .flatMapMany(r ->
        r.body(new StreamingBodyExtractor(JsonPointer.compile("/responseRoot/entities")))
    );
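A minimal sketch of the plan described above (walk the token stream until the array at the JSON Pointer is reached, then deserialize one element at a time). It makes two simplifying assumptions: it uses Jackson's blocking streaming parser rather than the non-blocking one that WebFlux and Jackson2Tokenizer rely on, and it assumes Jackson 2.9 or later for `pathAsPointer()`. `PointerArrayReader` and `readArray` are hypothetical names, not part of Spring or Jackson.

```java
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonPointer;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper for illustration only; blocking, so not suitable
// for use on a WebFlux event loop as-is.
public class PointerArrayReader {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /**
     * Hands each element of the array located at {@code pointer} to
     * {@code sink}, one at a time, and returns how many were read.
     */
    static <T> int readArray(InputStream in, JsonPointer pointer,
                             Class<T> elementType, Consumer<? super T> sink)
            throws IOException {
        int count = 0;
        try (JsonParser parser = MAPPER.getFactory().createParser(in)) {
            JsonToken token;
            while ((token = parser.nextToken()) != null) {
                // Right after START_ARRAY the context path is the array's own location.
                boolean atTarget = token == JsonToken.START_ARRAY
                        && pointer.equals(parser.getParsingContext().pathAsPointer());
                if (!atTarget) {
                    continue;
                }
                // Only the tokens of the current element are bound per iteration.
                while (parser.nextToken() != JsonToken.END_ARRAY) {
                    sink.accept(MAPPER.readValue(parser, elementType));
                    count++;
                }
                break; // the rest of the document is not needed
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"responseRoot\":{\"entities\":[{\"id\":\"1\"},{\"id\":\"2\"},{\"id\":\"n\"}]}}";
        InputStream in = new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
        readArray(in, JsonPointer.compile("/responseRoot/entities"), JsonNode.class,
                node -> System.out.println(node.get("id").asText()));
    }
}
```

A reactive version would have to feed byte buffers into a non-blocking parser and emit each completed element downstream instead of calling a `Consumer`, which is the part the proposed BodyExtractor takes on.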

Before I go too far down this path I was curious if this was functionality that Spring would be interested in supporting out of the box.

Similarly, I was curious about being able to stream out a response from a WebFlux controller via a Flux<T>, where the streamed response would be wrapped in a JSON array and possibly in a root JSON document as well.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Apr 21, 2020
@HaloFour
Contributor Author

Here's a very quick-and-dirty implementation of the BodyExtractor:

https://gist.github.com/HaloFour/ce3063d4e693b495e3c194cbb2f66686

The actual token parsing could certainly be cleaned up but it gets the job done at least to the extent that existing integration tests in the project are passing.

@HaloFour
Contributor Author

Also, not to pile up additional requests in a single issue, but I didn't see a way to use a BodyExtractor with retrieve(), which forces me to interpret the HTTP error status codes manually. Is there a reason WebClient.ResponseSpec doesn't include a method that accepts a BodyExtractor?

@rstoyanchev
Contributor

@HaloFour thanks for the proposal. This looks feasible and probably worth doing, but mainly I'm wondering what a more general solution looks like and how much more general it needs to be.

Take, for example, the case of multiple arrays, as in #21862. We could accept multiple JSON pointers, but it's less obvious how to represent the output, which logically is Flux<T1>, Flux<T2>, etc. but needs to be exposed sequentially, i.e. as Flux<Flux<?>>, which is not great for generics. It might as well be Flux<Object>, where the application has to check each element's type and downcast accordingly. An even more challenging question: what if you want to extract the surrounding Object structure, as in #25472?
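To make the Flux<Object> option concrete, here is a rough sketch of what the consuming side would have to do. A plain `java.util.List` stands in for the Flux so the example needs no dependencies; `User`, `Order`, and `describe` are hypothetical names invented for illustration, not anything from Spring.

```java
import java.util.Arrays;
import java.util.List;

public class MixedElementsDemo {

    // Hypothetical element types, one per array in the response.
    static final class User {
        final String id;
        User(String id) { this.id = id; }
    }

    static final class Order {
        final String id;
        Order(String id) { this.id = id; }
    }

    // With an Object-typed stream the caller must test and downcast each element.
    static String describe(Object element) {
        if (element instanceof User) {
            return "user:" + ((User) element).id;
        }
        if (element instanceof Order) {
            return "order:" + ((Order) element).id;
        }
        throw new IllegalArgumentException("Unexpected element type: " + element.getClass());
    }

    public static void main(String[] args) {
        // Elements of both arrays arrive interleaved in document order.
        List<Object> emitted = Arrays.asList(new User("1"), new User("2"), new Order("a"));
        for (Object element : emitted) {
            System.out.println(describe(element));
        }
    }
}
```

The compiler cannot verify that every element type is handled, which is the ergonomic cost being weighed here.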

@fransflippo

Thanks for this, @HaloFour! Looks like something I was looking for (hence #25472). I'll give your Gist a try.

@rstoyanchev (just reiterating from #25472) I think it makes sense to focus on the most common case: a single array of a single type of object in the JSON response. The semantics of anything else, as you explain, become very hairy very quickly, and the applicability seems low for most real-world scenarios (imho).

@HaloFour
Contributor Author

HaloFour commented Oct 5, 2020

Thanks for taking a look! Here's a newer Gist based on the code that we're currently using in production.

@rstoyanchev
Contributor

rstoyanchev commented Oct 6, 2020

Yes, it makes sense to do something that would solve many cases. That said, other possible cases are not hard to find. Take for example #21862, or even Elasticsearch: isn't it sometimes necessary to access something besides the hits, like "search_after"?

@rstoyanchev rstoyanchev added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Nov 8, 2021
@joedevgee

Going back to the original question: with the new API, exactly how do we extract the entities under responseRoot?

@nilsga

nilsga commented Nov 14, 2023

toEntityFlux(streamingBodyExtractor.toFlux(MyClass.class, JsonPointer.compile("/pathToArray"))) worked for me. This seems very useful. Any chance this BodyExtractor can be added to Spring?

@simonbasle
Contributor

simonbasle commented Dec 5, 2023

For the original use case of pointing at an array with a JSON Pointer in order to stream-parse it, I think it would be better to delegate that responsibility to Jackson and probably just offer a lightweight BodyExtractor adapter in Framework.

Unfortunately, even though in Jackson-Core there is a FilteringParserDelegate which can accept a JsonPointerBasedFilter, this doesn't work for async parsers for now (see FasterXML/jackson-core#1144)...
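For reference, this is roughly what the filter combination mentioned above looks like with a regular blocking parser, which is the case that does work. It is a sketch only: it assumes Jackson 2.12 or later (for `TokenFilter.Inclusion`), and `FilteredArrayDemo` and `stringValuesAt` are hypothetical names used for illustration.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.core.filter.FilteringParserDelegate;
import com.fasterxml.jackson.core.filter.JsonPointerBasedFilter;
import com.fasterxml.jackson.core.filter.TokenFilter;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; blocking parser only.
public class FilteredArrayDemo {

    /** Collects every string value found under the given JSON Pointer. */
    static List<String> stringValuesAt(String json, String pointer) throws IOException {
        List<String> values = new ArrayList<>();
        JsonFactory factory = new JsonFactory();
        try (JsonParser raw = factory.createParser(json);
             JsonParser filtered = new FilteringParserDelegate(
                     raw,
                     new JsonPointerBasedFilter(pointer),
                     TokenFilter.Inclusion.ONLY_INCLUDE_ALL, // expose only the matched subtree
                     false)) {                               // stop after the first match
            JsonToken token;
            while ((token = filtered.nextToken()) != null) {
                // Tokens outside the pointer's subtree are never exposed here.
                if (token == JsonToken.VALUE_STRING) {
                    values.add(filtered.getText());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"responseRoot\":{\"note\":\"ignored\","
                + "\"entities\":[{\"id\":\"1\"},{\"id\":\"2\"},{\"id\":\"n\"}]}}";
        System.out.println(stringValuesAt(json, "/responseRoot/entities"));
    }
}
```

An adapter in Framework would need the same behavior from a parser created for non-blocking input, which is what the linked Jackson issue tracks.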

@HaloFour maybe there's an opportunity to contribute something there?

@HaloFour
Contributor Author

HaloFour commented Dec 5, 2023

Sure, I can take a look at that.

@simonbasle simonbasle added status: blocked An issue that's blocked on an external project change and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Dec 12, 2023
@sdeleuze sdeleuze added this to the 6.x Backlog milestone Dec 14, 2023
@jhoeller jhoeller added the type: enhancement A general enhancement label Jan 11, 2024
@jhoeller jhoeller modified the milestones: 6.x Backlog, General Backlog Oct 1, 2024
9 participants