
[BUG]: Can't Read Parquet Files Created with parquet-mr 1.14.0+ #583

Open
mukunku opened this issue Dec 24, 2024 · 0 comments
mukunku commented Dec 24, 2024

Library Version

5.0.2

OS

Windows

OS Architecture

64 bit

How to reproduce?

Summary

Parquet files created in Apache Spark using Parquet-Java (formerly parquet-mr) 1.14.0+ cannot be opened with Parquet.NET.

Sample files: Samples.zip

  • parquet-mr-1.13.1.parquet -> Works
  • parquet-mr-1.14.3.parquet -> Fails with System.InvalidOperationException: don't know how to skip type 14
  • parquet-mr-1.15.0.parquet -> Fails with System.InvalidOperationException: don't know how to skip type Double
    • I tried implementing the skip for Double, but it still fails with another error.

All provided files are openable using https://www.parquet-viewer.com/, so they appear to be valid.
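Independently of the viewer, the Parquet footer records which writer produced a file in its created_by metadata string, so you can confirm the parquet-mr version of each sample from the raw bytes. A minimal sketch (shown against a stand-in file, since the samples live in Samples.zip; point the path at e.g. parquet-mr-1.14.3.parquet instead):

```shell
# Stand-in file containing a created_by-style string, for illustration only.
f=$(mktemp)
printf 'PAR1...parquet-mr version 1.14.3 (build xxxx)...PAR1' > "$f"

# -a treats the (normally binary) file as text, -o prints only the match.
grep -ao 'parquet-mr version [0-9.]*' "$f"
# → parquet-mr version 1.14.3
```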


How do I generate these files myself?

Below are the steps I followed on my ARM MacBook to generate these files:

  • Download the latest Apache Spark with Hadoop build: spark-3.5.4-bin-hadoop3.tgz
    • Version 3.5.4 ships with parquet-mr version 1.13.1
  • Make sure you have a JRE installed
  • Unzip Spark and run ./bin/spark-shell
    • This starts a one-node local Spark cluster and opens a Scala shell
  • Paste the following Scala code into the shell:
    import java.time.LocalDate
    import java.time.LocalDateTime
    
    case class SampleData(
      id: Int,
      name: String,
      age: Int,
      height: Double,
      isStudent: Boolean,
      enrollmentDate: LocalDate,
      lastLogin: LocalDateTime
    )
    
    val df = Seq(
      SampleData(1, "John Doe", 30, 175.5, true, LocalDate.now(), LocalDateTime.now()),
      SampleData(2, "Jane Smith", 28, 165.7, false, LocalDate.of(2023, 1, 1), LocalDateTime.now()),
      SampleData(3, "Mike Johnson", 35, 180.9, true, LocalDate.now(), LocalDateTime.now()),
      SampleData(4, "Emily Brown", 32, 170.2, false, LocalDate.of(2023, 6, 15), LocalDateTime.now()),
      SampleData(5, "David Lee", 29, 185.1, true, LocalDate.now(), LocalDateTime.now())
    ).toDF()
    
    // Write to Parquet file
    df.coalesce(1).write.mode("overwrite").parquet("sample_parquet_file.parquet")
    
  • You'll find the generated file sample_parquet_file.parquet in the directory.
  • Since this version of Spark uses parquet-mr 1.13.1, this file should be readable with Parquet.NET
  • To switch to a different parquet-mr version, delete the following JARs from the jars folder:
    • parquet-column-1.13.1.jar
    • parquet-common-1.13.1.jar
    • parquet-encoding-1.13.1.jar
    • parquet-format-structures-1.13.1.jar
    • parquet-hadoop-1.13.1.jar
    • parquet-jackson-1.13.1.jar
  • Then add versions 1.14.1 or 1.15.0 of these libraries into the jars folder for Spark to use
  • Once you have the later versions of the six JARs in place, restart the spark-shell and re-create the file using the same code above.
  • This new file won't be openable with Parquet.NET
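The JAR-swap steps above can be sketched as a shell loop. This is illustrative only: it assumes a standard Spark layout where these JARs live under $SPARK_HOME/jars, and it runs against a throwaway directory tree with empty stand-in files (downloading the real 1.14.3 artifacts from Maven Central is left out):

```shell
# Throwaway layout standing in for a real Spark install and a downloads dir.
BASE=$(mktemp -d)
SPARK_JARS="$BASE/spark-3.5.4-bin-hadoop3/jars"
DL="$BASE/downloads"
mkdir -p "$SPARK_JARS" "$DL"

LIBS="parquet-column parquet-common parquet-encoding parquet-format-structures parquet-hadoop parquet-jackson"
for lib in $LIBS; do
  touch "$SPARK_JARS/$lib-1.13.1.jar"   # stand-in for the shipped JAR
  touch "$DL/$lib-1.14.3.jar"           # stand-in for the replacement JAR
done

# The swap itself: remove the 1.13.1 JARs, drop in the 1.14.3 ones.
for lib in $LIBS; do
  rm -f "$SPARK_JARS/$lib-1.13.1.jar"
  cp "$DL/$lib-1.14.3.jar" "$SPARK_JARS/"
done

ls "$SPARK_JARS"
```

After the swap, restarting spark-shell picks up the new JARs automatically, since Spark puts everything in that directory on its classpath.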

Failing test

// The following code fails with the provided test files
using Stream s = System.IO.File.OpenRead(@"C:\Users\Sal\source\repos\parquet-dotnet\src\Parquet.Test\data\parquet-mr-1.14.3.parquet");
using ParquetReader r = await ParquetReader.CreateAsync(s);