Add ability to read CSV without header row #82

Merged
merged 9 commits into from
Dec 16, 2020

Conversation

djalova
Collaborator

@djalova djalova commented Dec 15, 2020


Checklist:

For the following questions, only check the boxes that are applicable.

  • Ensure that the pull request text above the horizontal line is descriptive.

  • Add/update relevant tests.

  • Add/update relevant documents.

  • Do you have triage permission of this repository?

    • No
      • You are good.
    • Yes
      • Does this pull request close an issue?
        • No, this pull request also acts as an issue by itself.
          • Assign yourself.
          • Label this pull request.
          • Put a milestone.
          • Put this pull request under "In Progress" or "Under Review" pipeline on ZenHub.
          • Put this pull request under the appropriate epic on ZenHub.
          • Put an estimate on ZenHub.
          • Add dependency relationship with other issues/pull requests on ZenHub.
        • Yes, this pull request addresses an issue and does not act as an issue.
          • Connect this pull request to the underlying issue on ZenHub.
          • DON'T:
            • DON'T assign yourself or anyone else.
            • DON'T label.
            • DON'T put a milestone.
            • DON'T put this pull request under any epic on ZenHub.
            • DON'T put an estimate on ZenHub.
      • When the pull request is ready, request review.

@djalova
Collaborator Author

djalova commented Dec 15, 2020

Recreating PR due to #81

@@ -37,6 +37,8 @@ def load(self, path: Union[_typing.PathLike, Dict[str, str]], options: SchemaDic
- ``columns`` key specifies the data type of each column. Each data type corresponds to a Pandas'
supported dtype. If unspecified, then it is default.
- ``delimiter`` key specifies the delimiter of the input CSV file.
- ``header`` key specifies if the first row of the CSV file contains the headers. Defaults to True
Collaborator

Suggested change
- ``header`` key specifies if the first row of the CSV file contains the headers. Defaults to True
- ``header`` key specifies if the first row of the CSV file contains the headers. Defaults to True.

noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['header'] = False
with pytest.raises(ValueError) as exinfo:  # Pandas should error from trying to read string as another dtype
    Dataset(noaa_jfk_schema, tmp_path, mode=Dataset.InitializationMode.DOWNLOAD_AND_LOAD)
    assert('could not convert string to float' in exinfo.value)
Collaborator

The exception is raised on the previous line, so this assertion would never have been executed:

Suggested change
assert('could not convert string to float' in exinfo.value)
assert 'could not convert string to float' in str(exinfo.value)

Collaborator Author

Whoops, thanks for catching that.
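For readers following this thread, here is a minimal sketch of why the original assertion was dead code. To stay within the standard library it uses `unittest`'s `assertRaises` context manager as a stand-in for `pytest.raises` (an assumption for illustration; it exposes the exception as `.exception`, where pytest uses `.value`). It is not the project's test.

```python
import unittest

# Stand-in for pytest.raises: unittest stores the caught exception on
# `.exception`, pytest stores it on `.value`.
case = unittest.TestCase()
reached = []

with case.assertRaises(ValueError) as ctx:
    float('not a number')      # raises ValueError here
    reached.append('inside')   # skipped: the block has already exited

# Checks belong after the block, against the message text.
message = str(ctx.exception)
assert 'could not convert string to float' in message
assert reached == []

# A membership test on the exception object itself is a TypeError,
# which is why the suggestion wraps the value in str().
try:
    'could not convert' in ctx.exception
except TypeError:
    membership_on_exception_fails = True
else:
    membership_on_exception_fails = False
```

The same two points apply to the pytest version: place the assertion after the `with` block, and compare against `str(exinfo.value)`.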

@@ -55,9 +57,18 @@ def load(self, path: Union[_typing.PathLike, Dict[str, str]], options: SchemaDic
else:
    dtypes[column] = type_

names = None
header = None
if options.get('header', True):
Collaborator

Based on the documentation, do you mean:

Suggested change
if options.get('header', True):
if options.get('header', True) is not False:

Alternatively, you could base the behavior on Python's truthiness evaluation of the value rather than on whether it is exactly False.

Collaborator Author

I think it might be confusing if we base the function on Python's truthiness evaluation, for example if we accidentally set header to ''. Maybe we could rename the key to no_header and then use Python's evaluation?

Collaborator

Sounds good.
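The three options discussed in this thread can be compared side by side. This is a sketch only: the helper names are hypothetical and are not the loader's actual code. It shows how each approach treats an accidental `''` or `None`.

```python
def has_header_exact(options):
    # Documented behavior taken literally: only the value False
    # disables the header row.
    return options.get('header', True) is not False


def has_header_truthy(options):
    # Truthiness on `header`: '', None and 0 disable the header too,
    # which is the surprise raised in the thread.
    return bool(options.get('header', True))


def has_header_renamed(options):
    # Renamed key with truthiness: a missing or falsy `no_header`
    # keeps the default of "the file has a header row".
    return not options.get('no_header')


for value in (True, False, '', None):
    print(
        repr(value),
        has_header_exact({'header': value}),
        has_header_truthy({'header': value}),
        has_header_renamed({'no_header': value}),
    )
```

With the renamed key, an accidental empty value falls back to the safe default instead of silently switching the header off.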

noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['header'] = True
self.test_csv_pandas_loader(tmp_path, noaa_jfk_schema)

noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['header'] = False
Collaborator

Perhaps also add tests for the value being an empty string and None?

Comment on lines 255 to 262
noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['no_header'] = False
self.test_csv_pandas_loader(tmp_path, noaa_jfk_schema)

noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['no_header'] = ''
self.test_csv_pandas_loader(tmp_path, noaa_jfk_schema)

noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['no_header'] = None
self.test_csv_pandas_loader(tmp_path, noaa_jfk_schema)
Collaborator

@xuhdev xuhdev Dec 16, 2020

How about using a for loop for these? Everything else LGTM now.
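A rough sketch of the loop form follows. `read_rows` and `SAMPLE` are hypothetical stand-ins built on the standard `csv` module so that the example runs on its own; the real test would set the option on `noaa_jfk_schema` and call `self.test_csv_pandas_loader(tmp_path, noaa_jfk_schema)` inside the loop instead.

```python
import csv
import io


def read_rows(text, options):
    """Hypothetical stand-in for the loader: returns (column_names, rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    if options.get('no_header'):
        # No header row: every line is data and columns are numbered.
        names = list(range(len(rows[0]))) if rows else []
        return names, rows
    # Header row present: the first line holds the column names.
    return rows[0], rows[1:]


SAMPLE = "date,temp\n2020-12-15,3.5\n"

# One loop replaces three copy-pasted blocks; every falsy value must
# behave as if the option were unset.
for value in (False, '', None):
    names, rows = read_rows(SAMPLE, {'no_header': value})
    assert names == ['date', 'temp'], value
    assert rows == [['2020-12-15', '3.5']], value
```

Passing the loop value as the assertion message keeps failures attributable to a specific input, which is the main thing lost when separate blocks are merged.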

Collaborator

@xuhdev xuhdev left a comment

LGTM

@xuhdev xuhdev merged commit 9728b9f into master Dec 16, 2020
@xuhdev xuhdev deleted the header branch December 16, 2020 01:15

Successfully merging this pull request may close these issues.

CSV loader should support those without headers
2 participants