Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datetime): enhance datetime parsing and validation #2129

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

CoreyWinkelmannPP
Copy link
Contributor

@CoreyWinkelmannPP CoreyWinkelmannPP commented Nov 8, 2024

Add validation and parsing for DateColumnDef and DateTimeColumnDef to handle invalid date formats and zero dates as null. Introduce a DateValidator with regex matching for date formats. Extend tests to cover invalid dates and malformed datetime strings.

MySQL and MariaDB allow invalid dates by default unless strict SQL mode is on, enabling entries like 2020-00-01 to be stored without error, which can lead to data processing issues. In older systems, data may be in poor condition, making it difficult to switch to strict mode, so these cases often need handling when using CDC.

Add validation and parsing for DateColumnDef and DateTimeColumnDef to handle invalid date formats and zero dates as null. Introduce a DateValidator with regex matching for date formats. Extend tests to cover invalid dates and malformed datetime strings.
@osheroff
Copy link
Collaborator

what was the behavior of the bootstrapper wrt to "0000-00-00" style dates (if the as-null flag was off) before this change?

I know, it absolutely sucks to output invalid datetimes into the stream but all things equal the 0000-00-00 case is so common I'd prefer not to change that behavior without changing a flag. I'm good with your other changes that nullify stuff like zero months tho.

@CoreyWinkelmannPP
Copy link
Contributor Author

That's a good question. I will try and run through those scenarios when I get a chance today to get an idea of what it did and how the change will affect it. Might be necessary to call out the all zero path and allow it to run the normal path through without validating it from there.

@CoreyWinkelmannPP
Copy link
Contributor Author

I isolated the date validation logic outside of the all zero case and I believe kept the logic of the all zero case the same as what existed before. From my testing locally it seems to work in the same way as before.

else
return appendFractionalSeconds("0000-00-00 00:00:00", 0, getColumnLength());
} else {
if ( !DateValidator.isValidDateTime(dateString) )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay, I'm coming back to this PR since I'm prepping a release of your mariadb stuff and some other stuff too. I'm a little worried about running a regex in a very hot code-path...

But I guess this only runs on bootstrapping so it might be ok? can you confirm that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just making it back to looking at this to see where it was at. From my testing, the value instanceof String is only triggered through the bootstrapping and not during normal processing. It goes down the Long path at least from my debugging I was doing. If it is just bootstrapping are you good with it or should I look for a more efficient way to verify that Date/DateTime instead of using Regex?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm yeah I don't know. I had a performance benchmark setup at some point, let me see if I can maybe measure the impact of the regex on bootstrapping.

if (value == null) {
return null;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mildly, same concerns here -- is this just for bootstrapping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am only seeing bootstrapping taking the String path. Otherwise, during normal processing it takes the Long path.

@osheroff
Copy link
Collaborator

would be nice to slap some integration tests around this code (esp checking behavior in bootstrapping, which I believe is where this change targets).

@osheroff
Copy link
Collaborator

sorry to say it's a big perf hit.

Bootstrapping a table with 2 timestamps, 2 datetimes:

before your patch:

Producer received 1000076 rows in 8.22 seconds
121738 rows per second

after your patch:

Producer received 1000082 rows in 39.52 seconds
25307 rows per second

I tried to profile it in visualVM for more proof but it turns out that profiler is lousy. I pushed the bootstrap benchmarker into your branch, LMK if you need pointers in running it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants