Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Type and Array for Decimal32 and Decimal64 #7098

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

CurtHagenlocher
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher commented Feb 8, 2025

Which issue does this PR close?

Part of addressing #6661 but does not close it.

What changes are included in this PR?

This change implements Decimal32Array, Decimal32Type, Decimal64Array and Decimal64Type in a relatively minimal fashion. It doesn't implement the full range of casts or support for Parquet or CSV reader. These will come in the next change.

Are there any user-facing changes?

Adds several public types. Adds new entries for the DataType enum, which is a breaking change.

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Feb 8, 2025
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes make sense to me (though is a bit harder to identify where things might not have been added like tests, etc.)

However, I think the addition of new values to DataType enum would make this a breaking change so would need to wait for next major release?


/// The default scale for [DataType::Decimal64] values
pub const DECIMAL64_DEFAULT_SCALE: i8 = 6;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a reference for these constants (both max precision/scale & defaults)?

Copy link
Contributor Author

@CurtHagenlocher CurtHagenlocher Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'd picked the default scales mostly arbitrarily. The DecimalType contract requires one, and there isn't really any obviously-correct choice. PostgreSQL has a money type which is roughly DECIMAL(~18, 2). MSSQL has a smallmoney that's ~9, 4 and a money that's ~18, 4. Four seems a little big for a 32-bit decimal and a little small for a 64-bit decimal. I'd love to get feedback about these, but also don't know how important they are.

As for precision, it's based on the maximum number of decimal digits that can be stored without running into the space limits, just like the existing decimals. The largest i32 is 2,147,483,647 and that can reliably store 999,999,999 but not 9,999,999,999 -- so the max precision is 9.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the default scales should get wider discussion? Where would be the right place for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense for precision. For default scale, do we have a reference from the C++ implementation possibly, to align with that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found defaults in either the C++ or the Go implementation, and I know there's none in the C# code.

Comment on lines +278 to +281
ParquetStatistics::Int64(s) => s
.$func()
.map(|x| $stat_value_type::try_from(*x).ok())
.flatten(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes significant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these changes are required for the code to compile, so I'm somewhat leaning on the compiler here. My naive take was that there shouldn't be a performance loss from the change (and that this wouldn't be a performance-critical path anyway) but I'm willing to believe that I might be wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine, was just curious as these changes specifically didn't seem related to the rest here, so wondering what else was changed that made the compiler require this change.

@@ -730,7 +783,7 @@ macro_rules! get_decimal_page_stats_iterator {
native_index
.indexes
.iter()
.map(|x| x.$func.and_then(|x| Some($stat_value_type::from(x))))
.map(|x| x.$func.and_then(|x| $stat_value_type::try_from(x).ok()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here

@CurtHagenlocher
Copy link
Contributor Author

Changes make sense to me (though is a bit harder to identify where things might not have been added like tests, etc.)

I'd gone through and searched the codebase for all instances of Decimal128 and Decimal256 and amended nearly all of them. For this subset of the changes, I think I got most of the non-benchmark tests. I was thinking there would be two more PRs, one to cover casts, changes to Parquet and CSV reader/writers, and benchmarks, and then one that does a final pass through the code to account for things that I originally missed due to the volume of changes and things that were added after I originally did this work in the end of December.

However, I think the addition of new values to DataType enum would make this a breaking change so would need to wait for next major release?

Yes, it's definitely a breaking change. I have to admit that I'm not familiar with the arrow-rs release schedule. I'm used to the main branch of the primary Arrow repo which is always "Vnext". How does that work here?

@alamb
Copy link
Contributor

alamb commented Feb 8, 2025

However, I think the addition of new values to DataType enum would make this a breaking change so would need to wait for next major release?

Yes I think that is the case

Per the schedule this means we could merge this PR once we have branched for 54.2 (early march) https://github.com/apache/arrow-rs?tab=readme-ov-file#arrow-and-parquet-crates

@CurtHagenlocher
Copy link
Contributor Author

Per the schedule this means we could merge this PR once we have branched for 54.2 (early march) https://github.com/apache/arrow-rs?tab=readme-ov-file#arrow-and-parquet-crates

Got it. In light of this, I'll likely withdraw the PR for now or move to Draft status. It would be good but not critical to figure out what we want as the default scale for the new types.

@Jefffrey Jefffrey added the next-major-release the PR has API changes and it waiting on the next major version label Feb 9, 2025
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave the PR up; I've tagged with the next-major-release tag for visibility

Comment on lines +278 to +281
ParquetStatistics::Int64(s) => s
.$func()
.map(|x| $stat_value_type::try_from(*x).ok())
.flatten(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine, was just curious as these changes specifically didn't seem related to the rest here, so wondering what else was changed that made the compiler require this change.


/// The default scale for [DataType::Decimal64] values
pub const DECIMAL64_DEFAULT_SCALE: i8 = 6;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense for precision. For default scale, do we have a reference from the C++ implementation possibly, to align with that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants