to_td() fails with null only column #16

Open
chezou opened this issue May 9, 2019 · 2 comments

chezou commented May 9, 2019

If there is a column that contains only null values, to_td() fails to upload the DataFrame. This should be avoidable by setting spark.sql.execution.arrow.fallback.enabled=true in the PySpark config, but the upload still fails with the fallback enabled:

>>> import pytd.pandas_td as td
>>> engine = td.create_engine('presto:sample_datasets')
>>> df = td.read_td('select * from www_access limit 100', engine)
>>> df.isnull().sum()
user       100
host         0
path         0
referer      0
code         0
agent        0
size         0
method       0
dtype: int64
>>> con = td.connect()
>>> df.drop(columns='time', inplace=True)
>>> td.to_td(df, 'aki.test_pytd', con, if_exists='replace', index=False)
19/05/09 17:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion from Arrow: null
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pytd/pandas_td/__init__.py", line 349, in to_td
    writer.write_dataframe(frame, con.database, name, mode)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pytd/writer.py", line 88, in write_dataframe
    sdf = self.td_spark.createDataFrame(df)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 416, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 350, in _inferSchemaFromList
    raise ValueError("Some of types cannot be determined after inferring")
ValueError: Some of types cannot be determined after inferring

takuti commented May 9, 2019

Looks like it is not an Arrow issue. Everything happens in the Spark code here: https://github.com/apache/spark/blob/d36cce18e262dc9cbd687ef42f8b67a62f0a3e22/python/pyspark/sql/session.py#L619-L787

createDataFrame
-> _createFromLocal
-> _inferSchemaFromList
-> _has_nulltype returns True, so the ValueError is raised
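
The same failure can be reproduced outside pytd. A minimal sketch (local SparkSession and a toy pandas DataFrame, purely illustrative and not taken from this report) that also shows how an explicit schema sidesteps the inference:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.master("local[1]").getOrCreate()

# 'user' contains only nulls, mirroring the column from the report above
pdf = pd.DataFrame({"user": [None, None], "code": [200, 404]})

# Without a schema, _inferSchemaFromList ends up with NullType for 'user' and raises
# ValueError: Some of types cannot be determined after inferring
# spark.createDataFrame(pdf)

# Supplying an explicit schema skips the inference, so the same data loads fine
schema = StructType([
    StructField("user", StringType(), True),
    StructField("code", LongType(), True),
])
spark.createDataFrame(pdf, schema=schema).show()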


takuti commented May 9, 2019

pytd may need to validate column values before calling createDataFrame.
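
One possible shape for that validation, as an illustrative sketch only (this helper is hypothetical, not existing pytd code), would be to warn about and drop columns that contain only nulls before handing the frame to Spark:

import warnings
import pandas as pd

def drop_null_only_columns(df):
    """Hypothetical helper: remove columns whose values are all null,
    since Spark cannot infer a type for them."""
    null_only = [c for c in df.columns if df[c].isnull().all()]
    if null_only:
        warnings.warn("Dropping columns with only null values: %s" % null_only)
        df = df.drop(columns=null_only)
    return df

Alternatively, such columns could be cast to a concrete dtype, or an explicit schema could be passed to createDataFrame as sketched above, if they need to be kept.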
