As noted here, JanusGraph currently fetches properties with a dedicated slice query per property (or with a single slice query that fetches all properties). Because KeyColumnValueStore cannot execute several slice queries in parallel, MultiVertexCentricQueryBuilder fetches properties one by one for a multi-query.
In other words, even though we execute multi-key queries when the user enables multi-query (query.batch.enabled=true), i.e. the same slice query is executed for multiple vertices at once, we still execute ALL slice queries one by one.
For example, take g.V(v1,v2,v3).has("foo", "bar").has("bar", "foo"). Here the has steps are folded together and a single multi-query is executed for vertices v1, v2, and v3 with keys foo and bar. Two keys mean two slice queries.
Thus, the multi-query executes the first slice query for v1, v2, v3 to fetch property foo, waits for the result (a blocking operation), and then executes the second slice query for v1, v2, v3 again, this time to fetch property bar.
As you can see, there is no reason to wait for property foo before fetching property bar. We can (and I believe we must) fetch those properties at least in parallel, and preferably in a single backend query as described in #3816.
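To make the difference concrete, here is a toy sketch of the two execution strategies. It is not JanusGraph code: fetchSlice is a hypothetical stand-in for one multi-key slice query, and plain strings stand in for vertex ids and property keys. The parallel variant uses CompletableFuture only as one possible way to avoid blocking between queries.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class SliceFetchSketch {

    // Hypothetical stand-in for one multi-key slice query:
    // returns the value of a single property for every given vertex.
    public static Map<String, String> fetchSlice(String property, List<String> vertexIds) {
        return vertexIds.stream()
                .collect(Collectors.toMap(v -> v, v -> v + "." + property));
    }

    // Current behaviour: each slice query blocks before the next one starts.
    public static List<Map<String, String>> fetchSequentially(List<String> properties,
                                                              List<String> vertexIds) {
        return properties.stream()
                .map(p -> fetchSlice(p, vertexIds))
                .collect(Collectors.toList());
    }

    // Alternative: start every slice query first, then wait for all of them.
    public static List<Map<String, String>> fetchInParallel(List<String> properties,
                                                            List<String> vertexIds) {
        List<CompletableFuture<Map<String, String>>> pending = properties.stream()
                .map(p -> CompletableFuture.supplyAsync(() -> fetchSlice(p, vertexIds)))
                .collect(Collectors.toList());
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> vertices = Arrays.asList("v1", "v2", "v3");
        List<String> properties = Arrays.asList("foo", "bar");
        System.out.println(fetchSequentially(properties, vertices));
        System.out.println(fetchInParallel(properties, vertices));
    }
}
```

Both methods return the same results in the same order; only the waiting pattern differs. With a real backend the sequential version pays one round trip per property, while the parallel version overlaps them.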
What I'm thinking is that MultiVertexCentricQueryBuilder should not decide how those queries are executed. Instead, it should hand all of them to the storage backend and let the backend execute them however it likes.
If a storage backend (like the in-memory storage backend) wants to execute all those queries sequentially (as the current MultiVertexCentricQueryBuilder implementation does), that's OK.
If a storage backend is able to execute those queries in parallel, asynchronously, or grouped into a single query, it should be free to do so.
Thus, my proposal is to make it possible to group all slice queries together with the keys they should be executed for, and then send those groups of slice queries with their groups of keys (vertexIds) to the storage implementations for evaluation.
The interface I imagine should look something like below:
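As a rough illustration of the shape such a method could take, here is a hypothetical sketch. It is not the actual proposed signature: the Store interface, its type parameters (Q standing in for SliceQuery, K for StaticBuffer, V for the per-key result), the return type, and the use of Map.Entry in place of Pair are all assumptions made so the example is self-contained. The default implementation simply runs every query sequentially, which is the fallback a backend like in-memory could keep.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiSliceSketch {

    // Q stands in for SliceQuery, K for StaticBuffer, V for the per-key result.
    @FunctionalInterface
    public interface Store<Q, K, V> {

        // Stand-in for the existing multi-key call: one slice query, many keys.
        Map<K, V> getSlice(Collection<K> keys, Q query);

        // Hypothetical new call: groups of slice queries, each with its keys.
        // This default runs the queries one by one; a backend that can do
        // better (parallel, async, one combined request) would override it.
        default Map<Q, Map<K, V>> getSlice(
                Collection<Map.Entry<? extends Collection<Q>, ? extends Collection<K>>>
                        multiRangeSliceQueriesForKeys) {
            Map<Q, Map<K, V>> results = new LinkedHashMap<>();
            for (Map.Entry<? extends Collection<Q>, ? extends Collection<K>> group
                    : multiRangeSliceQueriesForKeys) {
                for (Q query : group.getKey()) {
                    results.put(query, getSlice(group.getValue(), query));
                }
            }
            return results;
        }
    }

    // Runs the g.V(v1,v2,v3) example with keys foo and bar against a toy store.
    public static Map<String, Map<String, String>> demo() {
        Store<String, String, String> store = (keys, query) -> {
            Map<String, String> perKey = new LinkedHashMap<>();
            for (String key : keys) {
                perKey.put(key, key + "." + query);
            }
            return perKey;
        };
        Map.Entry<List<String>, List<String>> group = new SimpleEntry<>(
                Arrays.asList("foo", "bar"), Arrays.asList("v1", "v2", "v3"));
        List<Map.Entry<? extends Collection<String>, ? extends Collection<String>>> groups =
                new ArrayList<>();
        groups.add(group);
        return store.getSlice(groups);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the sketch is only that the caller passes everything at once and the store decides the execution strategy; the exact types are open to discussion.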
Collection<Pair<? extends Collection<SliceQuery>, ? extends Collection<StaticBuffer>>> multiRangeSliceQueriesForKeys - why such a strange data structure to pass multiple slice queries with their keys? Why not use a simple Map<SliceQuery, Collection<StaticBuffer>>?
This data structure is debatable, and I'm OK with using any other structure (including Map<SliceQuery, Collection<StaticBuffer>>). That said, we need to take #3816 into consideration as well. Most of the time we will execute all SliceQuery queries for the same keys. The exception is when some keys already have data cached, which results in skipping the backend query for those keys. In that case, if storage backend implementations want to group queries by identical sets of keys, they will need to scan the keys of all queries and compare the key collections with each other to find identical ones (i.e. compare each Collection<StaticBuffer> with every other Collection<StaticBuffer>).
If two collections are equal, their slice queries should be grouped together so that a single backend call serves both slice queries.
Of course, storage backend implementations can do that job, but they may end up doing unnecessary comparisons: grouping queries by key set starting from Collection<InternalVertex> vertices, List<BackendQueryHolder<SliceQuery>> queries should be slightly more efficient than doing so from Map<SliceQuery, Collection<StaticBuffer>> (I'm not 100% sure here, but it feels that way).
Thus, it feels to me that we should group slice queries by key set and pass each group of slice queries, together with its set of keys, to the storage backend.
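For reference, this is roughly the regrouping work a backend would have to do if it only received the map form. It is a minimal sketch with hypothetical names (groupByKeySet, and strings standing in for SliceQuery and StaticBuffer); it hashes each key collection into a Set so that queries sharing the same keys land in the same group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KeySetGroupingSketch {

    // Groups queries whose key collections contain exactly the same keys.
    // Every key of every query is visited once to build the comparable sets.
    public static <Q, K> Map<Set<K>, List<Q>> groupByKeySet(Map<Q, Collection<K>> keysPerQuery) {
        Map<Set<K>, List<Q>> groups = new LinkedHashMap<>();
        for (Map.Entry<Q, Collection<K>> entry : keysPerQuery.entrySet()) {
            Set<K> keySet = new HashSet<>(entry.getValue());
            groups.computeIfAbsent(keySet, ignored -> new ArrayList<>()).add(entry.getKey());
        }
        return groups;
    }

    // foo and bar need all three vertices; baz skips v2 because, in this
    // made-up scenario, its value for v2 is already cached.
    public static Map<Set<String>, List<String>> demo() {
        Map<String, Collection<String>> keysPerQuery = new LinkedHashMap<>();
        keysPerQuery.put("foo", Arrays.asList("v1", "v2", "v3"));
        keysPerQuery.put("bar", Arrays.asList("v1", "v2", "v3"));
        keysPerQuery.put("baz", Arrays.asList("v1", "v3"));
        return groupByKeySet(keysPerQuery);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, foo and bar end up in one group and baz in another, so two backend calls would suffice instead of three. Doing this grouping before the call, where the cache state per vertex is already known, avoids rebuilding and hashing these sets in every storage backend.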
Again, I have no strong preference for the exact data structure we pass to the getSlice method, but the proposed data structure seems to make sense.
The one downside of Collection<Pair<? extends Collection<SliceQuery>, ? extends Collection<StaticBuffer>>> is that it doesn't show that each SliceQuery is unique not only within its own Collection but across all Pairs, i.e. no SliceQuery appears twice. That isn't obvious from the type, but I guess it's not that critical. If anyone can propose a better data structure, that would be great.
Nevertheless, it shouldn't be a problem to migrate from Collection<Pair<? extends Collection<SliceQuery>, ? extends Collection<StaticBuffer>>> to any other data structure later. Thus, I will use it for now.