Performance degradation using SymlinkTextInputFormat vs MapredParquetInputFormat input #16581
-
Alright, I have investigated this issue on my laptop, and it seems to come from the very inefficient file listing that is performed for every symlink in a manifest file. There is an option to enable optimized symlink listing, here. Moreover, we could improve this feature by allowing more than one parent location in the optimizer code. I might raise a pull request with these suggestions later on. In the meantime, we are unblocked by enabling the optimizer and making sure all our Parquet files are in the same directory.
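To illustrate the "same directory" condition mentioned above, here is a minimal Python sketch that checks whether every path in a symlink manifest shares one parent location. It assumes the manifest is plain text with one file path per line; the helper names and the example bucket are hypothetical and not part of Trino.

```python
from posixpath import dirname


def manifest_parent_dirs(manifest_text):
    """Return the distinct parent directories of the paths listed in a manifest.

    Assumption: the manifest is plain text with one file path per line.
    """
    parents = set()
    for line in manifest_text.splitlines():
        path = line.strip()
        if path:  # ignore blank lines
            parents.add(dirname(path))
    return parents


def has_single_parent(manifest_text):
    """True when every listed file lives in the same directory."""
    return len(manifest_parent_dirs(manifest_text)) == 1


# Hypothetical manifest content, for illustration only.
example_manifest = (
    "s3://example-bucket/table/part=1/file-a.parquet\n"
    "s3://example-bucket/table/part=1/file-b.parquet\n"
)
```

Running `has_single_parent` on the real manifest contents before pointing the table at it should show whether the files meet the single-parent condition.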
-
Hello,
We have noticed recently that the performance of simple queries counting the number of records of two tables pointing to the same Parquet data was drastically different.
We are running Trino in AWS and accessing data on S3.
The tables differ only in their input format: one accesses the Parquet data directly on S3 (MapredParquetInputFormat), one uses a manifest file that points to the same set of files in the same location (SymlinkTextInputFormat).
The following query was used during these tests:
where <table> could be either direct_parquet_table (for the MapredParquetInputFormat input) or symlink_parquet_table (for the SymlinkTextInputFormat).
The query against direct_parquet_table takes around 2 seconds.
The query against symlink_parquet_table takes around 25 seconds.
Note that exactly the same Parquet files are used for both queries.
For direct_parquet_table, the Parquet files are retrieved from the target partition using an S3 List operation.
For symlink_parquet_table, the Parquet files are retrieved from the manifest file itself on S3.
Why do we have such a drastic difference in performance?
I have also attached the EXPLAIN ANALYZE VERBOSE plans for both queries, and we can see that, in the case of the Symlink table, the LocalExchange/RemoteSource operators of the last stage are blocked most of the time.
parquet-plan.txt
symlink-plan.txt
As a note, we have also seen exactly the same performance degradation using Athena v2 and v3.
Please also find below the table definitions for both tables.
Many thanks!