
[BUG]: Can't Read Parquet Files Created with parquet-mr 1.14.0+ #583

Open
mukunku opened this issue Dec 24, 2024 · 0 comments
mukunku commented Dec 24, 2024

Library Version

5.0.2

OS

Windows

OS Architecture

64 bit

How to reproduce?

Summary

Parquet files created in Apache Spark using Parquet-Java (formerly parquet-mr) 1.14.0+ cannot be opened with Parquet.NET.

Sample files: Samples.zip

  • parquet-mr-1.13.1.parquet -> Works
  • parquet-mr-1.14.3.parquet -> Fails with System.InvalidOperationException: don't know how to skip type 14
  • parquet-mr-1.15.0.parquet -> Fails with System.InvalidOperationException: don't know how to skip type Double
    • I tried implementing the skip for Double, but it still fails with another error.

All provided files are openable using https://www.parquet-viewer.com/, so they appear to be valid.
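Independently of the viewer, the Parquet footer records which writer produced a file in its created_by metadata string, so you can confirm the parquet-mr version of each sample from the raw bytes. A minimal sketch (shown against a stand-in file, since the samples live in Samples.zip; point the path at e.g. parquet-mr-1.14.3.parquet instead):

```shell
# Stand-in file containing a created_by-style string, for illustration only.
f=$(mktemp)
printf 'PAR1...parquet-mr version 1.14.3 (build xxxx)...PAR1' > "$f"

# -a treats the (normally binary) file as text, -o prints only the match.
grep -ao 'parquet-mr version [0-9.]*' "$f"
# → parquet-mr version 1.14.3
```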


How do I generate these files myself?

Below are the steps I followed on my ARM MacBook to generate these files:

  • Download the latest Apache Spark with Hadoop build: spark-3.5.4-bin-hadoop3.tgz
    • Version 3.5.4 ships with parquet-mr version 1.13.1
  • Make sure you have a JRE installed
  • Unzip Spark and run ./bin/spark-shell
    • This starts a one-node local Spark cluster and opens a Scala shell
  • Paste the following Scala code into the shell:
    import java.time.LocalDate
    import java.time.LocalDateTime
    
    case class SampleData(
      id: Int,
      name: String,
      age: Int,
      height: Double,
      isStudent: Boolean,
      enrollmentDate: LocalDate,
      lastLogin: LocalDateTime
    )
    
    val df = Seq(
      SampleData(1, "John Doe", 30, 175.5, true, LocalDate.now(), LocalDateTime.now()),
      SampleData(2, "Jane Smith", 28, 165.7, false, LocalDate.of(2023, 1, 1), LocalDateTime.now()),
      SampleData(3, "Mike Johnson", 35, 180.9, true, LocalDate.now(), LocalDateTime.now()),
      SampleData(4, "Emily Brown", 32, 170.2, false, LocalDate.of(2023, 6, 15), LocalDateTime.now()),
      SampleData(5, "David Lee", 29, 185.1, true, LocalDate.now(), LocalDateTime.now())
    ).toDF()
    
    // Write to Parquet file
    df.coalesce(1).write.mode("overwrite").parquet("sample_parquet_file.parquet")
    
  • You'll find the generated file sample_parquet_file.parquet in the directory.
  • Since this version of Spark uses parquet-mr 1.13.1, this file should be readable with Parquet.NET
  • To switch to a different parquet-mr version, delete the following JARs from the jars folder:
    • parquet-column-1.13.1.jar
    • parquet-common-1.13.1.jar
    • parquet-encoding-1.13.1.jar
    • parquet-format-structures-1.13.1.jar
    • parquet-hadoop-1.13.1.jar
    • parquet-jackson-1.13.1.jar
  • Then add versions 1.14.1 or 1.15.0 of these libraries into the jars folder for Spark to use
  • Once you have the later versions of the six JARs in place, restart the spark-shell and re-create the file using the same code above.
  • This new file won't be openable with Parquet.NET
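The JAR-swap steps above can be sketched as a shell loop. This is illustrative only: it assumes a standard Spark layout where these JARs live under $SPARK_HOME/jars, and it runs against a throwaway directory tree with empty stand-in files (downloading the real 1.14.3 artifacts from Maven Central is left out):

```shell
# Throwaway layout standing in for a real Spark install and a downloads dir.
BASE=$(mktemp -d)
SPARK_JARS="$BASE/spark-3.5.4-bin-hadoop3/jars"
DL="$BASE/downloads"
mkdir -p "$SPARK_JARS" "$DL"

LIBS="parquet-column parquet-common parquet-encoding parquet-format-structures parquet-hadoop parquet-jackson"
for lib in $LIBS; do
  touch "$SPARK_JARS/$lib-1.13.1.jar"   # stand-in for the shipped JAR
  touch "$DL/$lib-1.14.3.jar"           # stand-in for the replacement JAR
done

# The swap itself: remove the 1.13.1 JARs, drop in the 1.14.3 ones.
for lib in $LIBS; do
  rm -f "$SPARK_JARS/$lib-1.13.1.jar"
  cp "$DL/$lib-1.14.3.jar" "$SPARK_JARS/"
done

ls "$SPARK_JARS"
```

After the swap, restarting spark-shell picks up the new JARs automatically, since Spark puts everything in that directory on its classpath.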

Failing test

// The following code fails with the provided test files
using Stream s = System.IO.File.OpenRead(@"C:\Users\Sal\source\repos\parquet-dotnet\src\Parquet.Test\data\parquet-mr-1.14.3.parquet");
using ParquetReader r = await ParquetReader.CreateAsync(s);