Unable to scale Trino queries #18720
-
cc: @lukasz-stec since IIRC you were looking into some skew-related optimisations
-
Thanks a lot for your answer!!!!
-
Thanks for the details.
The small amount of data and the small number of splits also mean that any skew in split processing time will have a big impact on the entire query.
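To make the point about skew concrete, here is a toy model (not Trino's actual scheduler). It assumes each split goes to whichever worker frees up first and that the query finishes when the last worker does. The function name and the split durations are made up for illustration.

```python
import heapq


def query_wall_time(split_times, workers):
    """Toy model: assign each split to the worker that becomes free first.

    Returns the time at which the last worker finishes. This is an
    illustrative assumption, not how Trino schedules splits internally.
    """
    finish = [0.0] * workers          # time at which each worker becomes free
    heapq.heapify(finish)
    for t in split_times:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + t)
    return max(finish)


# Hypothetical workload: 8 splits, one of them skewed (10s instead of 1s).
few_splits = [10.0] + [1.0] * 7

for n in (1, 2, 4, 8):
    print(n, "workers ->", query_wall_time(few_splits, n), "s")
```

In this model the wall time drops from 17s to 10s when going from 1 to 2 workers and then stays at 10s no matter how many workers are added, because the single slow split sets the floor.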
-
That depends on a connector, for
First, buckets or partitions should be set up so that they do not block scaling or create skew. E.g. if the number of buckets is smaller than the node count, partial aggregation won't scale well. If the number of buckets is the number of workers + 1, one worker will have to process 2 buckets, which may slow down the entire query. You can try without any partitioning or bucketing, but keep in mind that for some queries partitioning/bucketing is very helpful. You also need enough splits. This may be achieved by making splits smaller, but the smaller the split, the less efficient its processing is. For example, with
That depends on the connector but usually no, splits are not necessarily different for every query. Splits are a way of dividing the table into smaller pieces, so if two queries read the same data, they will usually have the same splits.
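The bucket arithmetic above can be sketched in a few lines. This is a simplified model that assumes every bucket is the same size and is processed by exactly one worker; the helper names are hypothetical and real behaviour depends on the connector and configuration.

```python
import math


def busiest_worker_buckets(buckets, workers):
    """Buckets handled by the most loaded worker under an even spread."""
    return math.ceil(buckets / workers)


def ideal_speedup(buckets, workers):
    """Best-case speedup versus one worker processing all buckets.

    Assumes equal-size buckets and one worker per bucket at a time.
    """
    return buckets / busiest_worker_buckets(buckets, workers)


# Bucket and worker counts taken from the setup described in this thread,
# plus the "workers + 1" case mentioned above.
for buckets, workers in [(4, 8), (4, 16), (8, 8), (8, 16), (9, 8)]:
    print(f"{buckets} buckets, {workers} workers -> "
          f"busiest worker: {busiest_worker_buckets(buckets, workers)} bucket(s), "
          f"ideal speedup: {ideal_speedup(buckets, workers)}x")
```

Under these assumptions, 4 buckets cap the speedup at 4x whether there are 8 or 16 workers, and 9 buckets on 8 workers give only 4.5x because one worker has to process 2 buckets.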
-
Hey, I was sick for a week, sorry for not replying... Thanks for your answers. We've decided to postpone handling this for later; we'll use Dask in the meantime to scale.
-
Hi everyone
We are trying to scale up Trino queries, and are currently failing.
We use Trino to query Iceberg data into Dask, in a JupyterLab notebook, and we're running on GKE Kubernetes.
We are using Dask to check Trino performance, because SQL client apps return the first X results and then cancel the query. The Dask configuration stayed the same throughout testing.
We tried two tables:
We are trying to query three columns: id, parentId and date. We are trying three types of queries:
(1) grouping by id with min of parent id; (2) same but w/o min of parent id; and (3) grouping by parentId with min of parent id
Queries are attached below
We used 4, 8, and 16 Trino workers, and different Iceberg partitioning strategies (by the id column and by the parent id column, not together), with 4 and 8 buckets.
We see improvement in the fake table query from using more Trino workers, but not from partitioning.
We don't see improvement in the skewed (real) data.
We are trying to understand why
Any help would be appreciated