Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix IndexOutOfBounds exception in FileFormat#fromFileName #12301

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

rshkv
Copy link
Contributor

@rshkv rshkv commented Feb 17, 2025

Closes #12300.

FileFormat#fromFileName should account for file names that are shorter than a file format extension.

@github-actions github-actions bot added the API label Feb 17, 2025
if (Comparators.charSequences()
.compare(format.ext, filename.subSequence(extStart, filename.length()))
== 0) {
if (extStart > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to fail if the file name is just an extension? I think this is a reasonable choice because the only weird case I can imagine is setting your file name to metadata.json. All the other extensions are pretty much useless without another part to the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's follow the common definition of an extension being "the bit after the last dot" (e.g., commons-io).

I think my test cases with parquet as filename were silly. Removed those

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I would consider changing this part to

public static FileFormat fromFileName(CharSequence filename) {
    for (FileFormat format : VALUES) {
      if (null != filename && filename.toString().endsWith(format.ext)) {
        return format;
      }
    }

    return null;
  }

then we wouldn't need to deal with negative indexes and such and all places that call this method already pass an actual String, so calling toString() is cheap

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we did that we should just make a new version of the method, I don't think it's great to change the underlying perf semantics of the method in the API module. If it takes CharSequence we shouldn't change it to a string.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, left a tiny nit on the test

private static final Object[][] FILE_NAMES =
new Object[][] {
// Files with format
{"file.puffin", FileFormat.PUFFIN},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#12300 mentions that there's an issue with file names where the prefix is shorter than the file format suffix, but those things aren't tested here. The params should be updated to include x.puffin / x.orc / x.parquet / x.avro to cover that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra our longest "extension" is metadata.json so a lot of these are too short for that :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked locally but the test cases that expect a valid file format all pass without the fix (meaning that they don't reproduce the original issue that was reported).
The only test cases that fail is where we're expecting a null file format.

That's why I'm suggesting to also test x.orc and so on, since all of those use cases that expect a valid file format won't fail and reproduce the original issue because they never actually reach the metadata.json file format during iteration, since they are matching a different file format earlier and thus the original issue isn't reproduced

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you double-check? I've undone the fix locally and get these test failures, all with out-of-bounds indexes:

Screenshot 2025-02-18 at 18 00 49

E.g., for the file.csv case:

  • It goes through the list of file formats in enum order, the last is metadata.json with 13 characters
  • From length of file.csv (8 characters), subtract 13 characters for .metadata.json and 1 for the dot
  • Your start index is -6 and you get an IndexOutOfBoundsException

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does the checks in order, so if you get a hit on "orc" you won't fail at "metadata.json"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you double-check? I've undone the fix locally and get these test failures, all with out-of-bounds indexes:

Screenshot 2025-02-18 at 18 00 49 E.g., for the `file.csv` case:
  • It goes through the list of file formats in enum order, the last is metadata.json with 13 characters
  • From length of file.csv (8 characters), subtract 13 characters for .metadata.json and 1 for the dot
  • Your start index is -6 and you get an IndexOutOfBoundsException

Yes I'm aware of that. My point is that we should just add some test cases where we're expecting a valid FileFormat to show that the same issue happens e.g. with x.orc too and not just unsupported file formats like csv.

@rshkv can you please add

        {"x.parquet", FileFormat.PARQUET},
        {"x.puffin", FileFormat.PUFFIN},
        {"x.orc", FileFormat.ORC},
        {"x.avro", FileFormat.AVRO},

in the right order to the argument list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I get you now. We'd hit out of bounds if x.avro was compared against .parquet which happens because PARQUET is before the AVRO enum value.

Updated

{"dir/file.csv", null},
// No format
{"file", null},
{"dir", null},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also test where the filename is null or an empty string

@pvary
Copy link
Contributor

pvary commented Feb 18, 2025

I understand that this PR is just a fix for an existing method, but I have concerns about the original intention of the method. We are relying on the filename to deduce the actual file format. This seems brittle to me. For example many of our test are generating parquet files without extensions.

I have faced a similar issue here: #11216 (comment)

The Iceberg specification have a file_format field for data files specifying the actual file format. Shouldn't we rely on these fields instead of trying to find out the format from the location of the file? If we want to allow metadata files to use different file formats, we might want to add a file_format field to the metadata descriptors too.

@rshkv rshkv force-pushed the wr/file-format-substring branch from 9486452 to 5962859 Compare February 18, 2025 18:20
@rshkv
Copy link
Contributor Author

rshkv commented Feb 18, 2025

I understand that this PR is just a fix for an existing method, but I have concerns about the original intention of the method. We are relying on the filename to deduce the actual file format. This seems brittle to me. For example many of our test are generating parquet files without extensions.

Yeah, I guess there's an argument to not use this method when you can rely on a better source for the file type. For what it's worth, we use FileFormat#fromFileName in remote signing where we just have the URL.

@pvary
Copy link
Contributor

pvary commented Feb 19, 2025

Yeah, I guess there's an argument to not use this method when you can rely on a better source for the file type. For what it's worth, we use FileFormat#fromFileName in remote signing where we just have the URL.

Why do you need the file format to sign an URL?

@rshkv
Copy link
Contributor Author

rshkv commented Feb 19, 2025

Why do you need the file format to sign an URL?

I think that doesn't quite matter for this PR. But it helps us make granular access decisions based on file type (e.g., treating metadata differently than data files). We could obviously get the extension without this. But we appreciate there's some utility that extracts "Iceberg types".

@rshkv
Copy link
Contributor Author

rshkv commented Feb 19, 2025

@nastra let's merge?


class TestFileFormat {

private static final Object[][] FILE_NAMES =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add @SuppressWarnings("unused") right above this line

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IndexOutOfBounds in FileFormat#fromFileName
5 participants