Dates before 1970-01-01 cause crash #10

Closed
DrAndiLowe opened this issue Jun 8, 2018 · 4 comments

Comments

@DrAndiLowe

In DateTimeAttribute.py, line 65:

timestamps = self.data_dropna.map(lambda x: parse(x).timestamp())

timestamp() results in a crash for dates earlier than 1970:

Traceback (most recent call last):
  File "C:\Users\ANDREW~1\AppData\Local\Temp\RtmpuMDjSt\chunk-code-2b143b894699.txt", line 131, in <module>
    describer.describe_dataset_in_correlated_attribute_mode(input_data, epsilon = epsilon, k = degree_of_bayesian_network, attribute_to_is_categorical = categorical_attributes, attribute_to_is_candidate_key = candidate_keys)
  File ".\DataSynthesizer\DataDescriber.py", line 123, in describe_dataset_in_correlated_attribute_mode
    seed)
  File ".\DataSynthesizer\DataDescriber.py", line 88, in describe_dataset_in_independent_attribute_mode
    self.infer_domains()
  File ".\DataSynthesizer\DataDescriber.py", line 242, in infer_domains
    column.infer_domain(self.input_dataset[column.name])
  File ".\DataSynthesizer\datatypes\DateTimeAttribute.py", line 56, in infer_domain
    timestamps = self.data_dropna.map(lambda x: parse(x).timestamp())
  File "C:\Users\Andrew_Lowe\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py", line 2354, in map
    new_values = map_f(values, arg)
  File "pandas/_libs/src/inference.pyx", line 1521, in pandas._libs.lib.map_infer
  File ".\DataSynthesizer\datatypes\DateTimeAttribute.py", line 56, in <lambda>
    timestamps = self.data_dropna.map(lambda x: parse(x).timestamp())
OSError: [Errno 22] Invalid argument

This is apparently a known Python bug; see this Stack Overflow post:

If the timestamp is out of the range of values supported by the platform C localtime() or gmtime() functions, datetime.fromtimestamp() may raise an exception like you're seeing. On Windows platform, this range can sometimes be restricted to years in 1970 through 2038. I have never seen this problem on a Linux system.

The same problem seems to occur with timestamp(); I tried this from a Python command prompt:

>>> from dateutil.parser import parse
>>> parse('19/04/1979').timestamp()
293320800.0
>>> parse('19/04/1969').timestamp()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument

If you're not seeing this behaviour, that may be down to the platform: the SO post hints that Windows systems are affected, but not Linux.

Is there a way to replace the translation from dates to timestamps, and vice versa, with code that works for dates earlier than 1970-01-01?
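
One possible direction, as a minimal sketch rather than a tested patch for DataSynthesizer: attach an explicit UTC timezone before calling timestamp(). For timezone-aware datetimes, Python computes the value by subtracting the epoch instead of calling the platform's C time functions, so negative results should be possible on Windows too. The helper name utc_timestamp is hypothetical, and treating naive values as UTC is an assumption; it shifts results by the local UTC offset compared with the naive call shown above.

```python
from datetime import datetime, timezone


def utc_timestamp(dt):
    """Seconds since 1970-01-01 UTC; naive inputs are assumed to be UTC."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes represent UTC, not local time.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


print(utc_timestamp(datetime(1979, 4, 19)))  # 293328000.0
print(utc_timestamp(datetime(1969, 4, 19)))  # -22204800.0
```

The datetime objects returned by dateutil's parse() could be passed to this helper in the same way.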

@DrAndiLowe
Author

Also, see here for info on this bug: https://bugs.python.org/issue29097

@DrAndiLowe
Author

I'm not too sure how datetime values are treated. I tried a workaround of converting all dates to Unix timestamps, negative values included, but the results were terrible: converting back to dates gives me a much, much smaller range of dates than in the input data. So that's not a solution. Somehow I need dates before 1970 to be treated properly. Any ideas?

@haoyueping
Collaborator

As a workaround, if the datetime values are dates, you can first convert them to integers. After generating the synthetic dataset, convert the integers back to dates.

>>> from dateutil.parser import parse
>>> date0 = parse('01/01/1970')
>>> date1 = parse('19/04/1979')
>>> date2 = parse('19/04/1969')
>>> date3 = parse('06/08/2018')
>>> (date1-date0).days
3395
>>> (date2-date0).days
-257
>>> (date3-date0).days
17690
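
The round trip could be wrapped up roughly as follows. This is only a sketch using the standard library; the helper names date_to_days and days_to_date are hypothetical, and rounding synthetic values to whole days is an assumption about how the generated integers should be interpreted.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)  # same reference date as date0 above


def date_to_days(d, epoch=EPOCH):
    """Whole days from epoch to d; negative for earlier dates."""
    return (d - epoch).days


def days_to_date(n, epoch=EPOCH):
    """Inverse of date_to_days; rounds n to the nearest whole day."""
    return epoch + timedelta(days=int(round(n)))


print(date_to_days(date(1969, 4, 19)))  # -257
print(days_to_date(-257))               # 1969-04-19
```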

@DrAndiLowe
Author

With regard to my previous comment about the distributions of datetime attributes being synthesised poorly: implementing #11 resolves this after converting to integers as you suggested in your reply. That is, converting to integers wasn't the source of the behaviour I saw. Replacing timestamp() with a simple count of seconds from a user-defined epoch start will probably be sufficient to close this issue.
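
A count of seconds from a user-defined epoch might look something like the sketch below. It is not the actual DataSynthesizer code: the names EPOCH, to_seconds and from_seconds are hypothetical, the 1900-01-01 epoch is an arbitrary choice for illustration, and it assumes naive datetimes throughout (mixing naive and timezone-aware values in the subtraction raises TypeError).

```python
from datetime import datetime, timedelta

# Hypothetical user-defined epoch; any datetime earlier than the data would do.
EPOCH = datetime(1900, 1, 1)


def to_seconds(dt, epoch=EPOCH):
    """Seconds elapsed from epoch to dt, computed by plain subtraction."""
    return (dt - epoch).total_seconds()


def from_seconds(seconds, epoch=EPOCH):
    """Inverse of to_seconds."""
    return epoch + timedelta(seconds=seconds)


d = datetime(1969, 4, 19)
s = to_seconds(d)
print(s, from_seconds(s) == d)
```

Because this is pure timedelta arithmetic, it does not go through the platform time functions that timestamp() relies on for naive datetimes.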

2 participants