Fix IndexOutOfBounds exception in FileFormat#fromFileName #12301

rshkv · 2025-02-17T18:56:01Z

FileFormat#fromFileName should account for file names that are shorter than a file format extension.

api/src/test/java/org/apache/iceberg/FileFormatTest.java

RussellSpitzer · 2025-02-17T19:17:37Z

api/src/main/java/org/apache/iceberg/FileFormat.java

-      if (Comparators.charSequences()
-              .compare(format.ext, filename.subSequence(extStart, filename.length()))
-          == 0) {
+      if (extStart > 0


Do we want to fail if the file name is just an extension? I think this is a reasonable choice because the only weird case I can imagine is setting your file name to metadata.json. All the other extensions are pretty much useless without another part to the name.

Yeah, let's follow the common definition of an extension being "the bit after the last dot" (e.g., commons-io).

I think my test cases with parquet as the filename were silly. Removed those.

Actually, I would consider changing this part to:

public static FileFormat fromFileName(CharSequence filename) {
  for (FileFormat format : VALUES) {
    if (null != filename && filename.toString().endsWith(format.ext)) {
      return format;
    }
  }
  return null;
}

Then we wouldn't need to deal with negative indexes and such. All places that call this method already pass an actual String, so calling toString() is cheap.

I think if we did that, we should just make a new version of the method. I don't think it's great to change the underlying perf semantics of a method in the API module. If it takes a CharSequence, we shouldn't convert it to a String.
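For illustration, a suffix check can stay on CharSequence and still avoid negative indexes. The following is a minimal sketch under stated assumptions, not the code merged in this PR: `SuffixCheck` and `hasExtension` are hypothetical names, and requiring a dot immediately before the extension is one reading of the "bit after the last dot" definition discussed above.

```java
final class SuffixCheck {
  // Hypothetical helper: true when filename ends with "." + ext.
  // Works on any CharSequence without calling toString().
  static boolean hasExtension(CharSequence filename, String ext) {
    if (filename == null) {
      return false;
    }
    // Index where the dot must sit; negative when the name is too short.
    int dotIndex = filename.length() - ext.length() - 1;
    if (dotIndex < 0 || filename.charAt(dotIndex) != '.') {
      return false;
    }
    // Compare the tail of the name with the extension, character by character.
    for (int i = 0; i < ext.length(); i++) {
      if (filename.charAt(dotIndex + 1 + i) != ext.charAt(i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(hasExtension("x.orc", "orc"));
    System.out.println(hasExtension("parquet", "parquet"));
    System.out.println(hasExtension("file.csv", "metadata.json"));
  }
}
```

With this guard, a name that is shorter than the extension, a bare extension without a dot, null, and the empty string all yield false instead of throwing.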

RussellSpitzer

Looks good to me, left a tiny nit on the test

api/src/test/java/org/apache/iceberg/FileFormatTest.java

nastra · 2025-02-18T14:22:25Z

api/src/test/java/org/apache/iceberg/TestFileFormat.java

+  private static final Object[][] FILE_NAMES =
+      new Object[][] {
+        // Files with format
+        {"file.puffin", FileFormat.PUFFIN},


#12300 mentions that there's an issue with file names where the prefix is shorter than the file format suffix, but those things aren't tested here. The params should be updated to include x.puffin / x.orc / x.parquet / x.avro to cover that

@nastra our longest "extension" is metadata.json so a lot of these are too short for that :)

I've checked locally, and the test cases that expect a valid file format all pass without the fix (meaning that they don't reproduce the original issue that was reported).
The only test cases that fail are those where we're expecting a null file format.

That's why I'm suggesting to also test x.orc and so on. The existing cases that expect a valid file format match an earlier format during iteration and return before they ever reach the metadata.json format, so they don't reproduce the original issue.

Can you double-check? I've undone the fix locally and get these test failures, all with out-of-bounds indexes:

E.g., for the file.csv case:

It goes through the list of file formats in enum order, the last is metadata.json with 13 characters

From length of file.csv (8 characters), subtract 13 characters for .metadata.json and 1 for the dot

Your start index is -6 and you get an IndexOutOfBoundsException

It does the checks in order, so if you get a hit on "orc" you won't fail at "metadata.json"
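The arithmetic in the steps above can be reproduced in a few lines. This is a minimal sketch, not the Iceberg source: `ExtStartDemo`, `extStart`, and `unguardedSliceThrows` are hypothetical names, and the start index follows the comment's own arithmetic (name length minus extension length minus one for the dot).

```java
final class ExtStartDemo {
  // Start index of the suffix, as computed in the comment above:
  // name length - extension length - 1 for the dot.
  static int extStart(CharSequence filename, String ext) {
    return filename.length() - ext.length() - 1;
  }

  // Slices without any bounds guard and reports whether that throws.
  static boolean unguardedSliceThrows(CharSequence filename, String ext) {
    try {
      filename.subSequence(extStart(filename, ext), filename.length());
      return false;
    } catch (IndexOutOfBoundsException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(extStart("file.csv", "metadata.json")); // prints -6
    System.out.println(unguardedSliceThrows("file.csv", "metadata.json")); // prints true
  }
}
```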

Yes I'm aware of that. My point is that we should just add some test cases where we're expecting a valid FileFormat to show that the same issue happens e.g. with x.orc too and not just unsupported file formats like csv.

@rshkv can you please add

{"x.parquet", FileFormat.PARQUET},
{"x.puffin", FileFormat.PUFFIN},
{"x.orc", FileFormat.ORC},
{"x.avro", FileFormat.AVRO},

in the right order to the argument list?

Ah, I get you now. We'd hit out of bounds if x.avro was compared against .parquet which happens because PARQUET is before the AVRO enum value.

Updated
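The effect of iteration order on short file names can be sketched as follows. This is an illustration only: `EnumOrderDemo`, `EXTS`, and `firstThrowingExt` are hypothetical, and the extension list is a made-up subset that reflects just the two ordering facts stated in this thread (parquet is checked before avro, and metadata.json comes last). It is not the actual enum.

```java
import java.util.List;

final class EnumOrderDemo {
  // Hypothetical subset and order; only "parquet before avro" and
  // "metadata.json last" are taken from the discussion.
  static final List<String> EXTS = List.of("orc", "parquet", "avro", "metadata.json");

  // Walks the extensions in order with an unguarded slice. Returns the
  // extension at which the slice throws, or null if a match is found first
  // or the loop finishes without throwing.
  static String firstThrowingExt(CharSequence filename) {
    for (String ext : EXTS) {
      String suffix = "." + ext;
      int start = filename.length() - suffix.length();
      try {
        if (suffix.contentEquals(filename.subSequence(start, filename.length()))) {
          return null; // matched before any out-of-bounds access
        }
      } catch (IndexOutOfBoundsException e) {
        return ext;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(firstThrowingExt("x.avro"));
    System.out.println(firstThrowingExt("x.parquet"));
    System.out.println(firstThrowingExt("file.csv"));
  }
}
```

Under these assumptions, x.avro fails at the longer parquet extension before avro is ever compared, x.parquet matches and returns early, and file.csv only fails once the loop reaches metadata.json.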

nastra · 2025-02-18T14:26:48Z

api/src/test/java/org/apache/iceberg/TestFileFormat.java

+        {"dir/file.csv", null},
+        // No format
+        {"file", null},
+        {"dir", null},


we should also test where the filename is null or an empty string

pvary · 2025-02-18T16:20:51Z

I understand that this PR is just a fix for an existing method, but I have concerns about the original intention of the method. We are relying on the filename to deduce the actual file format. This seems brittle to me. For example, many of our tests generate parquet files without extensions.

I have faced a similar issue here: #11216 (comment)

The Iceberg specification has a file_format field for data files that specifies the actual file format. Shouldn't we rely on this field instead of trying to find out the format from the location of the file? If we want to allow metadata files to use different file formats, we might want to add a file_format field to the metadata descriptors too.

rshkv · 2025-02-18T21:36:21Z

Yeah, I guess there's an argument to not use this method when you can rely on a better source for the file type. For what it's worth, we use FileFormat#fromFileName in remote signing where we just have the URL.

pvary · 2025-02-19T08:37:59Z

Why do you need the file format to sign a URL?

rshkv · 2025-02-19T12:20:31Z

I think that doesn't quite matter for this PR, but it helps us make granular access decisions based on file type (e.g., treating metadata files differently from data files). We could obviously get the extension without this method, but we appreciate that there's a utility that extracts "Iceberg types".

rshkv · 2025-02-19T12:22:01Z

@nastra let's merge?

nastra · 2025-02-19T13:48:23Z

api/src/test/java/org/apache/iceberg/TestFileFormat.java

+
+class TestFileFormat {
+
+  private static final Object[][] FILE_NAMES =


can you please add @SuppressWarnings("unused") right above this line

rshkv added 2 commits February 17, 2025 18:54

Add (failing) test

9f30d8e

Fix failing test

0a201d9

github-actions bot added the API label Feb 17, 2025

RussellSpitzer reviewed Feb 17, 2025

RussellSpitzer reviewed Feb 17, 2025

RussellSpitzer reviewed Feb 17, 2025

RussellSpitzer approved these changes Feb 17, 2025

Field source

574c40f

ebyhr reviewed Feb 18, 2025

nastra reviewed Feb 18, 2025

Remove #fromFileName and rename TestFileFormat

0b17e9d

nastra reviewed Feb 18, 2025

pvary mentioned this pull request Feb 18, 2025

Data: Add partition stats writer and reader #11216

Open

Add tests for null and empty strings

5962859

rshkv force-pushed the wr/file-format-substring branch from 9486452 to 5962859 Compare February 18, 2025 18:20

Try fix CI

c2a8d3f

nastra reviewed Feb 19, 2025

Add tests with short file names

c87f18e

nastra approved these changes Feb 19, 2025

Retrigger CI

6b0c8ca
