Fetch all multi-query properties in parallel instead of sequentially #3823

porunov · 2023-06-14T20:18:12Z

porunov
Jun 14, 2023
Maintainer

As noted here, when JanusGraph fetches properties it currently uses a dedicated Slice query per property (or a single Slice query to fetch all properties). Because KeyColumnValueStore has no way to execute several Slice queries in parallel, MultiVertexCentricQueryBuilder fetches properties one by one for a multi-query.
In other words, even though multi-query (i.e. query.batch.enabled=true) executes multi-key queries (i.e. the same slice query for multiple vertices at once), we still execute ALL slice queries one by one.
For example,
g.V(v1,v2,v3).has("foo", "bar").has("bar", "foo") - in this situation the has steps are folded together and a single multi-query is executed for vertices v1, v2, and v3 with keys foo and bar. As there are 2 keys, that means 2 slice queries.
Thus, multi-query executes the first slice query for v1, v2, v3 to fetch property foo, waits for the result (a blocking operation), and then executes the second slice query for v1, v2, v3 again, this time to fetch property bar.
As you can see, there is no reason to wait for the foo property before fetching the bar property. We can (and I believe we must) fetch those properties at least in parallel (preferably in a single backend query, as described in #3816).
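To make the difference concrete, here is a minimal sketch of the two execution strategies. It is not JanusGraph code: `fetchSlice`, `fetchSequentially`, and `fetchInParallel` are hypothetical names, property keys and vertex ids are plain strings, and the "backend" just fabricates values in memory. The only point is that the parallel variant submits every slice query before it waits on any of them.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the storage layer; none of these names exist in JanusGraph.
final class ParallelSliceSketch {

    // Simulates one multi-key slice query: one property key, many vertices.
    static Map<String, String> fetchSlice(String propertyKey, List<String> vertexIds) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String vertexId : vertexIds) {
            result.put(vertexId, propertyKey + "@" + vertexId);
        }
        return result;
    }

    // Behaviour described above: each slice query finishes before the next one starts.
    static Map<String, Map<String, String>> fetchSequentially(
            List<String> propertyKeys, List<String> vertexIds) {
        Map<String, Map<String, String>> results = new LinkedHashMap<>();
        for (String key : propertyKeys) {
            results.put(key, fetchSlice(key, vertexIds));
        }
        return results;
    }

    // Alternative: submit all slice queries first, then wait for each result.
    static Map<String, Map<String, String>> fetchInParallel(
            List<String> propertyKeys, List<String> vertexIds, ExecutorService executor) {
        Map<String, CompletableFuture<Map<String, String>>> pending = new LinkedHashMap<>();
        for (String key : propertyKeys) {
            pending.put(key, CompletableFuture.supplyAsync(() -> fetchSlice(key, vertexIds), executor));
        }
        Map<String, Map<String, String>> results = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<Map<String, String>>> entry : pending.entrySet()) {
            results.put(entry.getKey(), entry.getValue().join());
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            List<String> keys = List.of("foo", "bar");
            List<String> vertices = List.of("v1", "v2", "v3");
            System.out.println(fetchInParallel(keys, vertices, executor));
        } finally {
            executor.shutdown();
        }
    }
}
```

Both variants return the same data; only the waiting pattern differs. With a real backend, the parallel variant's latency would be closer to that of the slowest slice query than to the sum of all of them.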

What I'm thinking is that MultiVertexCentricQueryBuilder should not decide how those queries are executed. Instead, it should ask the storage backend to execute all of those queries however it likes.
If a storage backend (like the in-memory storage backend) wants to execute all those queries sequentially (as the current MultiVertexCentricQueryBuilder implementation does), then that's OK.
If a storage backend has the ability to execute those queries in parallel, asynchronously, or grouped into a single query, it should be free to choose.
Thus, my proposal is to add the ability to group all Slice queries with the keys they should be executed for, and then send those groups of slice queries with their groups of keys (vertexIds) to the storage implementations for evaluation.
The interface I imagine would look something like this:

Map<SliceQuery, Map<StaticBuffer, EntryList>> getSlice(Collection<Pair<? extends Collection<SliceQuery>, ? extends Collection<StaticBuffer>>> multiRangeSliceQueriesForKeys, StoreTransaction txh) throws BackendException;
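As a rough illustration of how a backend that cannot parallelize might satisfy such a method, below is a sketch with a sequential default. Everything in it is an assumption made for illustration: the type parameters `Q`, `K`, and `E` stand in for SliceQuery, StaticBuffer, and EntryList, `Map.Entry` stands in for `Pair`, `getMultiSlices` is a hypothetical name, and StoreTransaction and BackendException are left out to keep the example self-contained.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; Q, K and E stand in for SliceQuery, StaticBuffer and EntryList.
interface MultiSliceStoreSketch<Q, K, E> {

    // Stand-in for a multi-key call: one slice query, many keys.
    Map<K, E> getSlice(List<K> keys, Q query);

    // Stand-in for the proposed call. The default runs every query of every group
    // one after another; a backend could override it to run them in parallel,
    // asynchronously, or as one grouped backend query.
    default Map<Q, Map<K, E>> getMultiSlices(
            Collection<? extends Map.Entry<? extends Collection<Q>, ? extends Collection<K>>> groups) {
        Map<Q, Map<K, E>> results = new LinkedHashMap<>();
        for (Map.Entry<? extends Collection<Q>, ? extends Collection<K>> group : groups) {
            List<K> keys = new ArrayList<>(group.getValue());
            for (Q query : group.getKey()) {
                results.put(query, getSlice(keys, query));
            }
        }
        return results;
    }
}

final class MultiSliceStoreDemo {

    // Toy in-memory backend: a "query" is a property name, a "key" is a vertex id.
    static MultiSliceStoreSketch<String, String, String> inMemoryStore() {
        return (keys, query) -> {
            Map<String, String> perKey = new LinkedHashMap<>();
            for (String key : keys) {
                perKey.put(key, query + "@" + key);
            }
            return perKey;
        };
    }

    public static void main(String[] args) {
        List<Map.Entry<List<String>, List<String>>> groups = new ArrayList<>();
        // One group: both slice queries apply to the same three vertices.
        groups.add(new SimpleImmutableEntry<>(List.of("foo", "bar"), List.of("v1", "v2", "v3")));
        System.out.println(inMemoryStore().getMultiSlices(groups));
    }
}
```

The design point is that the caller only hands over groups, and the choice between sequential, parallel, or batched execution lives entirely behind the store interface.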

Collection<Pair<? extends Collection<SliceQuery>, ? extends Collection<StaticBuffer>>> multiRangeSliceQueriesForKeys - why such a strange data structure to pass multiple slice queries with their keys? Why not use a simple Map<SliceQuery, Collection<StaticBuffer>>?
This data structure is debatable, and I'm OK with using any other structure (including Map<SliceQuery, Collection<StaticBuffer>>). That said, we need to take #3816 into consideration as well. The thing is that most of the time we will execute all SliceQuery queries for the same keys. The exception is when some of the keys already have data cached, which results in skipping the backend query for those keys. In that case, if storage backend implementations want to group queries for the same sets of keys, they will need to scan all keys of all queries and compare the key collections with each other to find identical ones (i.e. compare each Collection<StaticBuffer> with every other Collection<StaticBuffer>).
If two collections are equal, their slice queries should be grouped together to make a single backend call for both slice queries.
Of course, storage backend implementations can do that job, but then they may do unnecessary comparison work, because grouping queries by key set from Collection<InternalVertex> vertices, List<BackendQueryHolder<SliceQuery>> queries should be slightly more efficient than doing so from Map<SliceQuery, Collection<StaticBuffer>> (I'm not 100% sure here, but it feels that way).
Thus, it feels to me that we should group slice queries by their sets of keys and pass the grouped slice queries, together with the set of keys for each group, to the storage backend.
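For reference, the grouping step itself could look roughly like the sketch below. It deliberately starts from the simplest input, a map from each query to the keys it still needs after cache hits were filtered out, which is the Map-based variant rather than grouping directly from vertices and query holders. The names `SliceQueryGroupingSketch` and `groupByKeySet` are hypothetical, queries and keys are plain strings, and the key collections are assumed to be Sets so that equality ignores order.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of JanusGraph.
final class SliceQueryGroupingSketch {

    // Groups queries that must be executed for exactly the same set of keys.
    // Each returned entry is (queries in the group, the key set they share).
    static <Q, K> List<Map.Entry<List<Q>, Set<K>>> groupByKeySet(Map<Q, Set<K>> keysPerQuery) {
        // Set.equals/hashCode make equal key sets land in the same bucket.
        Map<Set<K>, List<Q>> queriesPerKeySet = new LinkedHashMap<>();
        for (Map.Entry<Q, Set<K>> entry : keysPerQuery.entrySet()) {
            queriesPerKeySet
                    .computeIfAbsent(entry.getValue(), ignored -> new ArrayList<>())
                    .add(entry.getKey());
        }
        List<Map.Entry<List<Q>, Set<K>>> groups = new ArrayList<>();
        for (Map.Entry<Set<K>, List<Q>> entry : queriesPerKeySet.entrySet()) {
            groups.add(new SimpleImmutableEntry<>(entry.getValue(), entry.getKey()));
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> keysPerQuery = new LinkedHashMap<>();
        keysPerQuery.put("foo", Set.of("v1", "v2", "v3"));
        keysPerQuery.put("bar", Set.of("v1", "v2", "v3"));
        // Assume v2 already had "baz" cached, so its key set differs.
        keysPerQuery.put("baz", Set.of("v1", "v3"));
        // foo and bar end up in one group; baz forms its own group.
        System.out.println(groupByKeySet(keysPerQuery));
    }
}
```

Hashing a whole key set costs time proportional to its size, which is the kind of extra work a backend would have to repeat if it only received a Map<SliceQuery, Collection<StaticBuffer>> and wanted to group on its own.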
Again, I have no strong preference for which exact data structure we pass to the getSlice method, but the proposed data structure seems to make sense.
The one downside of Collection<Pair<? extends Collection<SliceQuery>, ? extends Collection<StaticBuffer>>> is that it doesn't show that each SliceQuery is unique not only within its own containing Collection but across the collections of all Pairs, i.e. no SliceQuery appears twice. I guess that's not critical, but if anyone can propose a better data structure, that would be great.
Nevertheless, it shouldn't be a problem to migrate from Collection<Pair<? extends Collection<SliceQuery>, ? extends Collection<StaticBuffer>>> to any other data structure later. Thus, I will try to use it for now.

Replies: 0 comments