Unable to scale Trino queries #18720

michael-f-cognyte · 2023-08-17T16:37:54Z

michael-f-cognyte
Aug 17, 2023

Hi everyone,
We are trying to scale up Trino queries and are currently failing.
We use Trino to query Iceberg data into Dask from a JupyterLab notebook, running on GKE (Kubernetes).
We measure Trino performance through Dask because SQL client apps return only the first X results and then cancel the query. The Dask configuration stayed the same throughout testing.

We used two tables:

a fake one with a 1:1 ratio of parent ids to ids, and
a real one with anywhere from 3 to 5000 ids per parent id (to see how much data skew affects query time)

We query three columns: id, parentId and date (update time). We are trying three types of queries:
(1) grouping by id with min of date; (2) grouping by parentId with min of id and min of date; and (3) grouping by id with min of parent id and min of date.
The queries are attached below.

We used 4, 8, 16 Trino workers, and different iceberg partitioning strategies (by id and by parent id columns (not together)); in 4, 8 buckets

We see an improvement on the fake table when using more Trino workers, but not from partitioning.
We don't see any improvement on the skewed (real) data.

We are trying to understand:

why partitioning does not help the query time,
why querying only the id column on the skewed data does not improve with more Trino workers, and
whether it is correct that more Trino workers will not help with highly skewed data that is not partitioned.

Any help would be appreciated

## Query 1
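import dask.dataframe as dd
from dask.distributed import wait
from sqlalchemy import DateTime, cast, func, select

# This builds the SQL queries (one per Dask partition) but does not run them yet.
df_merch = dd.read_sql_query(select([
    func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
    generated_table.c.ifc_core_id_long
]).group_by(generated_table.c.ifc_core_id_long), "trino://[email protected]:8080/iceberg", index_col="ifc_core_id_long", npartitions=16)
df_merch_persisted = df_merch.persist()  # starts executing the queries
wait(df_merch_persisted)  # blocks until all partitions are loaded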


## Query 2
df_merch = dd.read_sql_query(select([
    func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
    func.min(generated_table.c.ifc_core_id_long).label('ifc_core_id_long'),
    generated_table.c.ifc_core_parentid_long
]).group_by(generated_table.c.ifc_core_parentid_long), "trino://[email protected]:8080/iceberg", index_col="ifc_core_parentid_long", npartitions=16) # this will create the sql queries, but not run them. 
df_merch_persisted = df_merch.persist()
wait(df_merch_persisted)

## Query 3
df_merch = dd.read_sql_query(select([
    func.min(cast(generated_table.c.ifc_core_updatetime, DateTime)).label('update_time'),
    func.min(generated_table.c.ifc_core_parentid_long).label('ifc_core_parentid_long'),
    generated_table.c.ifc_core_id_long
]).group_by(generated_table.c.ifc_core_id_long), "trino://[email protected]:8080/iceberg", index_col="ifc_core_id_long", npartitions=16) # this will create the sql queries, but not run them. 
df_merch_persisted = df_merch.persist()
wait(df_merch_persisted)

hashhar · 2023-08-18T07:38:01Z

hashhar
Aug 18, 2023
Collaborator

cc: @lukasz-stec since IIRC you were looking into some skew related optimisations

1 reply

lukasz-stec Aug 18, 2023
Collaborator

Query 1 does not use ifc_core_parentid_long at all, so it is surprising that it behaves differently with different ifc_core_parentid_long values. Are you sure the tables you are comparing differ only in their ifc_core_parentid_long values?

Generally, for scaling group-by the most important thing is the cardinality of the grouping keys (so ifc_core_parentid_long and ifc_core_id_long in your case) and how those values are partitioned by a hash function. The worst case is where most of the values land on the same worker and on the same driver/thread. This can happen if most of the rows in a table share one value or a small number of values. Another possibility is that the hash function behaves badly and hashes different values to the same bucket, but this has a very low probability.
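
To make the worst case above concrete, here is a minimal Python sketch of hash-partitioning rows across workers. It is only an illustration: `rows_per_worker` is a hypothetical helper, the key values are made up, and Python's built-in `hash` modulo the worker count stands in for Trino's real hash function, which is different.

```python
def rows_per_worker(keys, num_workers):
    """Count how many rows land on each worker when rows are routed by hash(key)."""
    counts = [0] * num_workers
    for key in keys:
        counts[hash(key) % num_workers] += 1
    return counts

# Uniform keys: every grouping key appears exactly once.
uniform_keys = list(range(16_000))
# Skewed keys: one "hot" key accounts for 75% of the rows.
skewed_keys = [7] * 12_000 + list(range(4_000))

uniform = rows_per_worker(uniform_keys, 4)
skewed = rows_per_worker(skewed_keys, 4)
print(uniform)  # [4000, 4000, 4000, 4000]
print(skewed)   # [1000, 1000, 1000, 13000]
```

With uniform keys every worker gets the same share. With one hot key a single worker receives most of the rows, so adding workers does not shorten the slowest one.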

> We used 4, 8, 16 Trino workers, and different iceberg partitioning strategies (by id and by parent id columns (not together)); in 4, 8 buckets

Does that mean you used a bucketed table, as opposed to a partitioned one? A bucketed table has the benefit that the aggregation does not need to repartition the data over the network, but this comes at the cost of limiting worker node parallelism to the number of buckets. For example, with 4 buckets only 4 workers will be used to read and aggregate the data.

michael-f-cognyte · 2023-08-20T14:02:45Z

michael-f-cognyte
Aug 20, 2023
Author

Thanks a lot for your answer!!!!

I agree the query 1 results are surprising. The tables have many more columns, but the query only involves ifc_core_id_long and ifc_core_updatetime. This is especially puzzling since we did see an improvement on the fake data when adding Trino workers.
We also tried partitioning by ifc_core_updatetime, with no difference in results.
We checked the table structure further: the id to parent_id skew is bigger on the real data, but the fake data has many repeated parent_ids (see below).

| my_table | total_id | distinct_parentid_long | distinct_id | distinct_updatetime |
|---|---|---|---|---|
| real | 30,531,892 | 1,267,100 | 29,165,320 | 12,517,187 |
| generated | 12,157,186 | 999,995 | 999,995 | 12,157,180 |

is there a recommended way to query Trino directly? perhaps we should eliminate the Dask variable?
We use the following code to partition the data - I believe this is bucketing the data?:

 CREATE TABLE {catalog}.{schema}.{partitioned_table_name}
         STORED AS PARQUET
         PARTITIONED BY ({','.join(partition_columns)})
        AS
         SELECT * FROM {catalog}.{schema}.{table_name}
         limit 0"""

2 replies

lukasz-stec Aug 21, 2023
Collaborator

> is there a recommended way to query Trino directly? perhaps we should eliminate the Dask variable?

Trino has a CLI client (https://trino.io/docs/current/client/cli.html) that you can use to query it directly.

> We use the following code to partition the data - I believe this is bucketing the data?
>
>     STORED AS PARQUET
>     PARTITIONED BY ({','.join(partition_columns)})
>     AS
>     SELECT * FROM {catalog}.{schema}.{table_name}
>     limit 0"""

This looks like partitioning, but I'm not familiar with this syntax.
Could you share the output of SHOW CREATE TABLE {catalog}.{schema}.{table_name} from the Trino CLI?
Partitioning by id, parentid or updatetime is not useful here: you don't have many rows per partition value, and it does not help with aggregations either.
In case it is unclear, by partitioning I mean splitting data into partitions based on some column value, which means two rows with different values will be in separate partitions. Bucketing on the other hand splits data into a limited number of buckets, also based on some column value, but here rows with different values can be put into one bucket e.g. if you have two buckets half of the values will be in one bucket and the other half in the other bucket.
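
The difference between the two can be sketched in a few lines of Python. This is a toy model: `partition_by_value` and `bucket_by_hash` are hypothetical helpers, the ids are made up, and `value % num_buckets` stands in for the real bucket hash that Iceberg applies.

```python
from collections import defaultdict

def partition_by_value(values):
    """Partitioning: one partition per distinct value."""
    parts = defaultdict(list)
    for v in values:
        parts[v].append(v)
    return dict(parts)

def bucket_by_hash(values, num_buckets):
    """Bucketing: a fixed number of buckets; different values may share a bucket."""
    buckets = defaultdict(list)
    for v in values:
        buckets[v % num_buckets].append(v)  # toy stand-in for a real hash function
    return dict(buckets)

ids = [101, 102, 103, 104, 105, 106]
partitions = partition_by_value(ids)
buckets = bucket_by_hash(ids, 2)
print(len(partitions))  # 6 partitions, one per distinct id
print(buckets)          # {1: [101, 103, 105], 0: [102, 104, 106]}
```

With an almost unique column such as an id, value partitioning produces roughly one tiny partition per row, while bucketing keeps the number of pieces fixed.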

Can you try to run queries on unpartitioned data?
If you can, please share also EXPLAIN ANALYZE VERBOSE {query} for the queries on 4 and 16 workers that do not scale. This would help analyze what is wrong.
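
If it is easier to collect the plans from a notebook than from the CLI, something like the following sketch may work. It assumes the `trino` Python client package is installed; the host, user and the example query are placeholders rather than values from this thread, and `explain_analyze_verbose` and `run_explain` are hypothetical helper names.

```python
def explain_analyze_verbose(query: str) -> str:
    """Prefix a query with EXPLAIN ANALYZE VERBOSE, dropping a trailing semicolon."""
    return "EXPLAIN ANALYZE VERBOSE " + query.strip().rstrip(";")

def run_explain(query, host, port=8080, user="analyst", catalog="iceberg"):
    """Execute the explain statement and return the plan as text."""
    # Imported lazily so the helper above can be used without the client installed.
    from trino.dbapi import connect
    conn = connect(host=host, port=port, user=user, catalog=catalog)
    cur = conn.cursor()
    cur.execute(explain_analyze_verbose(query))
    return "\n".join(str(row[0]) for row in cur.fetchall())

# Placeholder query; replace the schema and table with your own.
example_query = """
SELECT ifc_core_id_long, min(ifc_core_updatetime) AS update_time
FROM my_schema.my_table
GROUP BY ifc_core_id_long;
"""
statement = explain_analyze_verbose(example_query)
print(statement)
```

Note that EXPLAIN ANALYZE actually executes the query, so it takes about as long as the query itself.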

michael-f-cognyte Aug 23, 2023
Author

Thank you for your answers!!
We are now using the Trino client, and getting the query execution time from the Trino UI.

The code for bucketing was Spark code. We have now also tried a different bucketing statement (added 'using iceberg').
But according to SHOW CREATE TABLE there is no difference between the two (see below).

For now we only tested query #1: group by id, min on update time.

On both the real and the fake data, we saw no change in query performance from changing the partitioning or from adding Trino workers. The queries on the real data were slow. The fake data queries are very fast, which could be why they showed no improvement.
Using Dask to run the queries on the real data was faster than running them in the Trino client.
We used 0, 8 and 16 buckets. On the fake data we used 4 and 16 Trino workers; on the real data we used 16 and 32 Trino workers.
We also noted that when using Dask to query, with Dask partitions, the Trino UI shows one query per partition, each covering a specific range.
When we query using the Trino client, we see a single query.
If we run the same query with a string id in the Trino client, initially all Trino workers are busy, but in the second stage only the coordinator works, and it gets stuck for a very long time. I guess that with Dask, Dask merges the results of the queries, while with the Trino client the Trino coordinator does that?
Also, one last question:
How can we check the parallelism of the workers? We're using Grafana, but the Trino UI query details also show some information. What should we look at?
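
On the "one query per Dask partition" observation: as far as I understand, `dd.read_sql_query` splits the value range of `index_col` into `npartitions` intervals and sends one range-filtered query per interval. The sketch below imitates that in plain Python. `index_ranges` and `range_predicates` are hypothetical helpers, not Dask APIs, and the min/max ids are made up.

```python
def index_ranges(min_id, max_id, npartitions):
    """Cut [min_id, max_id] into npartitions contiguous ranges."""
    step = (max_id - min_id) / npartitions
    bounds = [min_id + round(i * step) for i in range(npartitions)] + [max_id]
    return list(zip(bounds[:-1], bounds[1:]))

def range_predicates(column, min_id, max_id, npartitions):
    """Build one WHERE predicate per range; only the last range includes its upper bound."""
    ranges = index_ranges(min_id, max_id, npartitions)
    predicates = []
    for i, (lower, upper) in enumerate(ranges):
        op = "<=" if i == len(ranges) - 1 else "<"
        predicates.append(f"{column} >= {lower} AND {column} {op} {upper}")
    return predicates

predicates = range_predicates("ifc_core_id_long", 0, 1000, 4)
for p in predicates:
    print(p)
```

Because the index column is also the grouping key here, each range covers a disjoint set of groups, so every per-partition query is an independent, smaller aggregation that Trino can run concurrently with the others.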

partitioning code

`Spark.sql(f"""CREATE TABLE iceberg.marquisefull3.p_merchandise_by_id_8_iceberg
USING ICEBERG
PARTITIONED BY (bucket(8, ifc_core_id_long))
AS
SELECT * FROM iceberg.marquisefull3.merchandise
limit 0""")

Spark.sql(f"""
CREATE TABLE iceberg.marquisefull3.p_merchandise_by_id_8_iceberg
STORED AS PARQUET
PARTITIONED BY (bucket(8, ifc_core_id_long))
AS
SELECT * FROM iceberg.marquisefull3.merchandise
limit 0""")`

show create table

`WITH (
format = 'PARQUET',
format_version = 1,
location = 's3a://iceberg/marquisefull3/merchandise'
)

.
.
.

WITH (
format = 'PARQUET',
format_version = 1,
location = 's3a://iceberg/marquisefull3/p_merchandise_by_id_8_iceberg',
partitioning = ARRAY['bucket(ifc_core_id_long, 8)']
)

.
.
.

WITH (
format = 'PARQUET',
format_version = 1,
location = 's3a://iceberg/marquisefull3/p_merchandise_by_ifc_core_id_8',
partitioning = ARRAY['bucket(ifc_core_id_long, 8)']
)`

groupby_id_min_updatetime_bucketed_id.txt
groupby_id_min_updatetime_fake.txt
groupby_id_min_updatetime_bucketed_id_updatetime_fake.txt
groupby_id_min_updatetime_bucketed_id_fake.txt
groupby_id_min_updatetime.txt

lukasz-stec · 2023-08-23T21:09:03Z

lukasz-stec
Aug 23, 2023
Collaborator

Thanks for the details.
It looks like the issue is with data volume. Regardless of bucketing, when I look at groupby_id_min_updatetime.txt I see that the merchandise table is pretty small (271.16MB) and there are only 10 splits (you can see it in the Input rows distribution.count metric). For the fake data, even though the table is smaller, I see 96 splits. In Trino, a split is the basic unit of parallelization in the source stage, so with 10 splits you can use at most 10 threads at a time to read the data and do partial aggregation.
96 splits is still not much, but it explains why the fake query can scale better.

       └─ TableScan[table = iceberg:marquisefull3.merchandise$data@4878487135489896904]
              Layout: [ifc_core_id_long:bigint, ifc_core_updatetime:timestamp(6) with time zone]
              Estimates: {rows: 30531894 (640.58MB), cpu: 640.58M, memory: 0B, network: 0B}
              CPU: 4.22s (16.29%), Scheduled: 12.00s (28.92%), Blocked: 0.00ns (0.00%), Output: 30531894 rows (640.58MB)
              connector metrics:
                'ParquetReaderCompressionFormat_GZIP' = LongCount{total=275395739}
                'ParquetReaderCompressionFormat_ZSTD' = LongCount{total=7598863}
              metrics:
                'CPU time distribution (s)' = {count=10.00, p01=0.03, p05=0.03, p10=0.09, p25=0.11, p50=0.55, p75=0.61, p90=0.93, p95=0.93, p99=0.93, min=0.03, max=0.93}
                'Input rows distribution' = {count=10.00, p01=179682.00, p05=179682.00, p10=661243.00, p25=682665.00, p50=4492502.00, p75=4710521.00, p90=4859008.00, p95=4859008.00, p99=4859008.00, min=179682.00, max=4859008.00}
                'Scheduled time distribution (s)' = {count=10.00, p01=0.24, p05=0.24, p10=0.46, p25=0.51, p50=1.65, p75=1.72, p90=1.89, p95=1.89, p99=1.89, min=0.24, max=1.89}
              Input avg.: 3053189.40 rows, Input std.dev.: 64.47%
              ifc_core_id_long := 8:ifc_core_id_long:bigint
              ifc_core_updatetime := 26:ifc_core_updatetime:timestamp(6) with time zone
              Input: 30531894 rows (640.58MB), Physical input: 271.16MB, Physical input time: 7982.00ms
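
A rough way to see why 10 splits cannot scale while 96 can: the number of busy source-stage threads is capped by both the split count and the total thread count. In this sketch `max_scan_parallelism` is a hypothetical helper and the 16 threads per worker is an assumption, not a value taken from the plans.

```python
def max_scan_parallelism(num_splits, num_workers, threads_per_worker):
    """Upper bound on source-stage threads that can be busy at the same time."""
    return min(num_splits, num_workers * threads_per_worker)

THREADS_PER_WORKER = 16  # assumption for illustration only

for label, splits in [("real", 10), ("fake", 96)]:
    for workers in (4, 16):
        busy = max_scan_parallelism(splits, workers, THREADS_PER_WORKER)
        print(f"{label}: {splits} splits, {workers} workers -> at most {busy} busy threads")
```

Under these assumptions the real table stays at 10 busy threads whether there are 4 or 16 workers, while the fake table goes from 64 to 96.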

The small amount of data and the small number of splits also mean that any skew in split processing time has a big impact on the entire query.
For example, in groupby_id_min_updatetime_bucketed_id.txt, looking at the p_merchandise_by_id_16_iceberg table scan:
'Scheduled time distribution (s)' = {count=17.00, p01=0.07, p05=0.07, p10=1.02, p25=1.04, p50=1.16, p75=1.31, p90=6.35, p95=7.98, p99=7.98, min=0.07, max=7.98}
You can see that one split took 7.98s to process, while the entire query took 11.83s. This means that, no matter what, this query will not take less than 8s.
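
If you want to check this for other operators, a small Python sketch like the one below can pull the numbers out of a metric line from the EXPLAIN ANALYZE VERBOSE output. `parse_distribution` is a hypothetical helper; the metric line and the 11.83s total are the ones quoted above.

```python
import re

def parse_distribution(text):
    """Parse the 'name=value' pairs of a distribution metric into a dict of floats."""
    return {name: float(value) for name, value in re.findall(r"(\w+)=([\d.]+)", text)}

line = ("{count=17.00, p01=0.07, p05=0.07, p10=1.02, p25=1.04, p50=1.16, "
        "p75=1.31, p90=6.35, p95=7.98, p99=7.98, min=0.07, max=7.98}")
dist = parse_distribution(line)
query_wall_s = 11.83  # total query time quoted above

slowest_vs_median = dist["max"] / dist["p50"]
slowest_share_of_query = dist["max"] / query_wall_s
print(f"splits: {int(dist['count'])}")
print(f"slowest split is {slowest_vs_median:.1f}x the median split")
print(f"slowest split alone is {slowest_share_of_query:.0%} of the query time")
```

A large gap between `max` and `p50` is a quick sign that a single split, not the number of workers, is setting the floor for the query time.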

1 reply

michael-f-cognyte Aug 27, 2023
Author

what determines the splits? is there any way to control it?
what can we do to enable scaling query time?
and the splits are different for every query, is that correct? so if I query id, parent-id, update time, or other columns the splits will be different every time

lukasz-stec · 2023-08-28T07:50:54Z

lukasz-stec
Aug 28, 2023
Collaborator

> what determines the splits? is there any way to control it?

That depends on the connector. For Iceberg there is the table property read.split.target-size (see https://iceberg.apache.org/docs/0.13.1/configuration/). Splits do not span multiple files, so file size also influences split size.
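
As a rough mental model of how the target size and the file sizes interact, here is a simplified Python estimate. It ignores details such as Parquet row-group boundaries; `estimate_splits` is a hypothetical helper, and the ten 27 MB files are an assumed layout, not the actual files of the table.

```python
import math

MB = 1024 * 1024
DEFAULT_TARGET = 128 * MB  # Iceberg's documented default for read.split.target-size

def estimate_splits(file_sizes, target_size=DEFAULT_TARGET):
    """Each file is cut into ceil(size / target) pieces; a split never spans two files."""
    return sum(max(1, math.ceil(size / target_size)) for size in file_sizes)

files = [27 * MB] * 10  # hypothetical layout: ten data files of about 27 MB each
print(estimate_splits(files))                      # 10
print(estimate_splits(files, target_size=8 * MB))  # 40
```

In this model, files smaller than the target size yield one split each, so the split count only grows if the target size is lowered below the file size or the data is written as more files.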

> what can we do to enable scaling query time?

First, buckets or partitions should be set up so that they do not block scaling or create skew. E.g. if the number of buckets is smaller than the node count, partial aggregation won't scale too well. If the number of buckets is the number of workers + 1, one worker will have to process 2 buckets, which may slow down the entire query. For example, you can try without any partitioning or bucketing, but keep in mind that for some queries partitioning/bucketing is very helpful.
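
The bucket arithmetic can be sketched as follows, assuming equally sized buckets and that each bucket is processed by exactly one worker. Both helper names are hypothetical.

```python
import math

def busy_workers(num_buckets, num_workers):
    """Workers that get at least one bucket."""
    return min(num_buckets, num_workers)

def buckets_on_slowest_worker(num_buckets, num_workers):
    """Buckets handled by the most loaded worker."""
    return math.ceil(num_buckets / num_workers)

for buckets, workers in [(4, 16), (16, 16), (17, 16)]:
    print(f"{buckets} buckets on {workers} workers: "
          f"{busy_workers(buckets, workers)} busy, "
          f"slowest handles {buckets_on_slowest_worker(buckets, workers)}")
```

With 4 buckets only 4 of the 16 workers do any work, and with 17 buckets one worker handles 2, so that stage takes about twice as long as with 16 buckets.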

You also have to have enough splits. This may be achieved by making splits smaller, but the smaller the split, the less efficient its processing is. For example, with a 300MB table, if you want to saturate 16 workers with 16 cores each, you will end up with at most around 1MB per split, which is pretty small, and even at that size the per-split overhead can make scaling not worth it.
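
The arithmetic behind the 1MB figure, as a small Python sketch (the function name is hypothetical; the numbers are the ones from the example above):

```python
def split_size_to_saturate_mb(table_mb, num_workers, cores_per_worker):
    """Largest split size in MB that still gives every core at least one split."""
    return table_mb / (num_workers * cores_per_worker)

size_mb = split_size_to_saturate_mb(300, 16, 16)
print(16 * 16)            # 256 cores to keep busy
print(round(size_mb, 2))  # 1.17
```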

> and the splits are different for every query, is that correct? so if I query id, parent-id, update time, or other columns the splits will be different every time

That depends on the connector but usually no, splits are not necessarily different for every query. Splits are a way of dividing the table into smaller pieces, so if two queries read the same data, they will usually have the same splits.

0 replies

michael-f-cognyte · 2023-09-05T12:26:20Z

michael-f-cognyte
Sep 5, 2023
Author

hey

I was sick for a week, sorry for not replying...

Thanks for your answers. We've decided to postpone handling this; in the meantime we'll use Dask to scale.
It seems the issue is with our data; we'll see what we can do.
I will update here when we have some results.

0 replies
