to_td() fails with null only column #16

Open
chezou opened this issue May 9, 2019 · 2 comments

chezou commented May 9, 2019

If there is a column that contains only null values, to_td() fails to upload the DataFrame. This should be avoidable by setting spark.sql.execution.arrow.fallback.enabled=true in the PySpark config, but the upload still fails with the fallback enabled:

>>> import pytd.pandas_td as td
>>> engine = td.create_engine('presto:sample_datasets')
>>> df = td.read_td('select * from www_access limit 100', engine)
>>> df.isnull().sum()
user       100
host         0
path         0
referer      0
code         0
agent        0
size         0
method       0
dtype: int64
>>> con = td.connect()
>>> df.drop(columns='time', inplace=True)
>>> td.to_td(df, 'aki.test_pytd', con, if_exists='replace', index=False)
19/05/09 17:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion from Arrow: null
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pytd/pandas_td/__init__.py", line 349, in to_td
    writer.write_dataframe(frame, con.database, name, mode)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pytd/writer.py", line 88, in write_dataframe
    sdf = self.td_spark.createDataFrame(df)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 416, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/Users/ariga/src/pytd-test/.venv/lib/python3.6/site-packages/pyspark/sql/session.py", line 350, in _inferSchemaFromList
    raise ValueError("Some of types cannot be determined after inferring")
ValueError: Some of types cannot be determined after inferring

takuti commented May 9, 2019

Looks like it is not an Arrow issue. Everything happens in the Spark code here: https://github.com/apache/spark/blob/d36cce18e262dc9cbd687ef42f8b67a62f0a3e22/python/pyspark/sql/session.py#L619-L787

createDataFrame
-> _createFromLocal
-> _inferSchemaFromList
-> _has_nulltype returns True, so the ValueError is raised
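
The same failure can be reproduced outside pytd. A minimal sketch (local SparkSession and a toy pandas DataFrame, purely illustrative and not taken from this report) that also shows how an explicit schema sidesteps the inference:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.master("local[1]").getOrCreate()

# 'user' contains only nulls, mirroring the column from the report above
pdf = pd.DataFrame({"user": [None, None], "code": [200, 404]})

# Without a schema, _inferSchemaFromList ends up with NullType for 'user' and raises
# ValueError: Some of types cannot be determined after inferring
# spark.createDataFrame(pdf)

# Supplying an explicit schema skips the inference, so the same data loads fine
schema = StructType([
    StructField("user", StringType(), True),
    StructField("code", LongType(), True),
])
spark.createDataFrame(pdf, schema=schema).show()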


takuti commented May 9, 2019

pytd may need to validate column values before calling createDataFrame.
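
One possible shape for that validation, as an illustrative sketch only (this helper is hypothetical, not existing pytd code), would be to warn about and drop columns that contain only nulls before handing the frame to Spark:

import warnings
import pandas as pd

def drop_null_only_columns(df):
    """Hypothetical helper: remove columns whose values are all null,
    since Spark cannot infer a type for them."""
    null_only = [c for c in df.columns if df[c].isnull().all()]
    if null_only:
        warnings.warn("Dropping columns with only null values: %s" % null_only)
        df = df.drop(columns=null_only)
    return df

Alternatively, such columns could be cast to a concrete dtype, or an explicit schema could be passed to createDataFrame as sketched above, if they need to be kept.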
