Cannot use ISO 8601 date format #1696

Open
SimonScholler opened this issue Oct 23, 2024 · 6 comments
Labels
wontfix This will not be worked on

Comments

@SimonScholler

Hi,
I have a schema that contains a field 'date' which also specifies an ISO 8601 date format:

{
  "fields": [
    {
      "name": "date",
      "type": "date",
      "description": "Week in ISO-8106 Format",
      "format": "%G-W%V"
    },
...

When doing validation on a data file, I get the following error.

{'type': 'type-error',
'title': 'Type Error',
'description': 'The value does not match the schema '
               'type and format for this field.',
'message': 'Type error in the cell "2024-W40" in row '
           '"250" and field "date" at position "1": '
           'type is "date/%G-W%V"',
'tags': ['#table', '#row', '#cell'],
'note': 'type is "date/%G-W%V"',
'cells': ['2024-W40',
...

It works when I use a non-ISO date format such as "%Y-W%W", but using this would be semantically wrong.

Am I missing something here or is this a known issue? Is there a workaround?

Thank you and kind regards
Simon

@pierrecamilleri
Collaborator

Thanks for the report. I can reproduce it. Under the hood, each cell is parsed with: datetime.strptime(cell, self.format).date()

This call fails with your inputs:

from datetime import datetime
datetime.strptime("2024-W40", "%G-W%V").date()

ValueError: ISO year directive '%G' must be used with the ISO week directive '%V' and a weekday directive ('%A', '%a', '%w', or '%u')
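To make the difference concrete, here is a minimal sketch comparing the week-only format with a variant that carries a weekday directive. The helper name `try_parse` is hypothetical and only wraps the standard-library call quoted above; it is not part of frictionless.

```python
from datetime import datetime


def try_parse(cell, fmt):
    """Return the parsed date, or the error text if strptime rejects the input.

    Hypothetical helper for illustration only.
    """
    try:
        return datetime.strptime(cell, fmt).date()
    except ValueError as exc:
        return f"ValueError: {exc}"


# Week-only format: strptime refuses it because no weekday directive is present.
week_only = try_parse("2024-W40", "%G-W%V")

# Same week with an explicit ISO weekday (1 = Monday): this parses.
with_weekday = try_parse("2024-W40-1", "%G-W%V-%u")

print(week_only)
print(with_weekday)
```

The second call only succeeds because both the data and the format string carry a weekday, which is what the error message asks for.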

This seems to be a limitation of datetime, as mentioned in this SO question.

Excerpt:

The parsing in datetime is limited. Look at module dateutil

dateutil appears to be a dependency of frictionless already, so I don't see any drawback to using it instead.

Just to be sure: for your use case, would there be any drawback to storing "2024-W40" internally as the first day of that week? I guess that if it is only for validation, it should not matter.
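As a rough illustration of what "first day of the week" would mean here, the sketch below maps an ISO year and week to its Monday and back again using only the standard library (`date.fromisocalendar` needs Python 3.8 or later). The function name `week_start` is an assumption made for this example, not anything frictionless provides.

```python
from datetime import date


def week_start(iso_year, iso_week):
    """Return the Monday (ISO weekday 1) of the given ISO week.

    Hypothetical helper for illustration only.
    """
    return date.fromisocalendar(iso_year, iso_week, 1)


monday = week_start(2024, 40)

# Going back from the stored date to the week label loses nothing,
# as long as the stored day is always the Monday.
iso_year, iso_week, _ = monday.isocalendar()
round_trip = f"{iso_year}-W{iso_week:02d}"

print(monday, round_trip)
```

Under that convention the week label can always be recovered from the stored date, so the internal representation would not change the published data.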

@SimonScholler
Author

Hi,
thanks for looking into it!
Some more background: this dataset is made available as Open Data, which is why I would like to stick with the existing representation. We also want to use Frictionless to describe its structure and make the schema publicly available as well. In the description, we write that the date is in ISO format, but ideally I would encode that information in the format property as well so that it is semantically correct.
Wouldn't it be possible for frictionless to use dateutil.parser.isoparse() when an ISO date format is used?
Thank you
Simon

@pierrecamilleri
Collaborator

You mean, if the format is set to "default"?

I'm going to need a little time to think before replying, while I parse the XML documentation linked in the table schema specification myself!

The current code already uses dateutil in this case, with additional asserts and a comment, but I do not know whether something has changed between v1 and v2:

if self.format == "default":
    # Guard against shorter formats supported by dateutil
    assert cell[16] == ":"
    assert len(cell) >= 19
    cell = platform.dateutil_parser.isoparse(cell)

@SimonScholler
Author

Hi Pierre,
I mean not only if it's set to 'default' but if the pattern is an ISO-compliant date pattern ("%G-W%V" is ISO-compliant in my opinion as it represents this ISO format: https://en.wikipedia.org/wiki/ISO_week_date).

@pierrecamilleri
Collaborator

I looked a little closer into this issue:

  • dateutil will not be useful here. What you suggest is not possible, as you would not want to validate a different ISO format than the one you specify in your table schema. From what I understand, dateutil is not based on format strings but on sophisticated heuristics; its aim is to be "forgiving with regards to unlikely input formats" (source), and we seek quite the opposite.
  • The SO question linked above suggests using another third-party library. However, replacing the stdlib with an unknown new dependency is really not an option here.
  • So what would be left is a little hack: detect this specific format string, append "-1" to the data and "-%u" to the format to validate it (but it would not work for any variation).

However, all things considered, I'd say this isn't a bug but a feature: "%G-W%V" is actually not a valid date format, as it designates an entire week rather than a single day. I have no access to the ISO 8601 standard to check whether it makes such a distinction. However, strptime does, and the error message is explicit about it: the format should contain a weekday directive.

Additionally, the Table Schema specification is very explicit about this:

follow the syntax of standard Python / C strptime.

So if you need to validate this format, I would suggest preprocessing your data before validation (appending -1 to each week date) and validating your column with %G-W%V-%u. Would this work for you?
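The preprocessing step could look roughly like the sketch below. It works on a plain Python list of cell values rather than on any frictionless object, and the names `WEEK_FORMAT`, `add_monday` and `is_valid_week` are hypothetical, chosen only for this example.

```python
from datetime import datetime

# Format from the suggestion above: ISO year, ISO week, ISO weekday.
WEEK_FORMAT = "%G-W%V-%u"


def add_monday(cell):
    """Append '-1' so a week-only value points at the Monday of that week."""
    return f"{cell}-1"


def is_valid_week(cell):
    """Check a week-only cell by validating its preprocessed form with strptime."""
    try:
        datetime.strptime(add_monday(cell), WEEK_FORMAT)
    except ValueError:
        return False
    return True


cells = ["2024-W40", "2024-W01", "2024-W54", "2024-10-01"]
results = {cell: is_valid_week(cell) for cell in cells}
print(results)
```

In this sketch the suffix is added only for the check, so the published file can keep its week-only values; whether that fits a given pipeline depends on where validation runs.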

@pierrecamilleri added the wontfix label Nov 28, 2024
@dangotbanned

dangotbanned commented Dec 17, 2024

@pierrecamilleri, @SimonScholler

For an ISO 8601 date/datetime, you can use (date|datetime).fromisoformat(...), which accepts week dates on Python 3.11 and later:

>>> import datetime as dt

>>> dt.date.fromisoformat("2024-W40")
datetime.date(2024, 9, 30)

>>> dt.datetime.fromisoformat("2024-W40")
datetime.datetime(2024, 9, 30, 0, 0)

As mentioned in the docs, the inverse of this is date.isocalendar():

>>> import datetime as dt

>>> dt.date.fromisoformat("2024-W40").isocalendar()
datetime.IsoCalendarDate(year=2024, week=40, weekday=1)

Finally, bringing it all back together in datetime.strptime(...):

>>> import datetime as dt

>>> year, week, weekday = dt.date.fromisoformat("2024-W40").isocalendar()
>>> dt.datetime.strptime(f"{year}-W{week:02d}-{weekday}", "%G-W%V-%u").date()
datetime.date(2024, 9, 30)
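Since fromisoformat also accepts other ISO 8601 shapes (for example plain calendar dates), a stricter check for the week-only form could be sketched as below. This is only one possible approach under stated assumptions: the pattern `ISO_WEEK` and the function `parse_iso_week` are hypothetical names, and the sketch relies on `date.fromisocalendar`, available from Python 3.8.

```python
import re
from datetime import date

# Accept exactly "YYYY-Www", nothing shorter or longer (assumed shape for this sketch).
ISO_WEEK = re.compile(r"([0-9]{4})-W([0-9]{2})")


def parse_iso_week(cell):
    """Parse a week-only ISO string and return the Monday of that week.

    Raises ValueError for any other shape or for a week number that
    does not exist in the given ISO year.
    """
    match = ISO_WEEK.fullmatch(cell)
    if match is None:
        raise ValueError(f"not an ISO week string: {cell!r}")
    iso_year, iso_week = (int(part) for part in match.groups())
    # fromisocalendar itself rejects out-of-range week numbers.
    return date.fromisocalendar(iso_year, iso_week, 1)


print(parse_iso_week("2024-W40"))
```

The regular expression keeps the check tied to a single format, which is closer to the strictness asked for earlier in the thread than a general ISO parser would be.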


3 participants