An AWS Athena library for tf.data.Dataset
If you don't know tf.data, take a look at the documentation and this example.
Installation is as simple as a pip install:
```
pip install tf-data-athena
```
Using it is almost as simple as any other tf.data.Dataset implementation. You just need to create a dataset with the function create_athena_dataset; no explicit authentication setup is required (it follows the AWS authentication chain in boto3).
```python
# imports
from tf_data_athena import create_athena_dataset

# connector parameters
s3_output_location = "s3://my-bucket/my-folder/athena-outputs"  # Athena output bucket folder
waiting_interval = 0.1  # time (in seconds) to wait between query state checks

# query
query = "select * from my_namespace.my_table"

# create dataset
dataset = create_athena_dataset(query, s3_output_location, waiting_interval=waiting_interval)
```
Now, dataset is an instance of tf.data.Dataset containing the query results.
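A minimal sketch of consuming it, assuming the standard tf.data iteration protocol (the exact element structure per row depends on how the library parses the result file):

```python
# Peek at the first few rows of the query result.
for row in dataset.take(5):
    print(row)
```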
The factory function create_athena_dataset has the following parameters:

- query: The query to be run in Athena.
- s3_output_location: An S3 path, writable by the current account, where the query results file will be saved.
- waiting_interval: A float, the number of seconds to wait between query status requests to Athena.
- num_parallel_calls: Argument for tf.data.Dataset.map (see the documentation) used while parsing result rows.
- Other named arguments: Any other named argument is passed to the tf.data.TextLineDataset constructor; please see the documentation.
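A call that exercises every parameter might look like the sketch below; the query, bucket, and numeric values are illustrative, and buffer_size is just one example of a keyword argument forwarded to tf.data.TextLineDataset as described above:

```python
import tensorflow as tf
from tf_data_athena import create_athena_dataset

dataset = create_athena_dataset(
    "select * from my_namespace.my_table",
    "s3://my-bucket/my-folder/athena-outputs",
    waiting_interval=0.5,                 # poll Athena every half second
    num_parallel_calls=tf.data.AUTOTUNE,  # let TensorFlow tune row-parsing parallelism
    buffer_size=8 * 1024 * 1024,          # forwarded to tf.data.TextLineDataset
)
```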
This library uses boto3 behind the scenes, so it follows the same authentication/authorization chain. The authorized user or service needs permission to create and execute Athena queries, and to create and read S3 objects in the folder defined by s3_output_location.
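As a quick pre-flight check (hypothetical, not part of this library's API), you can verify that boto3 resolves credentials and can reach Athena before building a dataset:

```python
import boto3

# Verify the default boto3 credential chain resolves and Athena is reachable.
session = boto3.session.Session()
assert session.get_credentials() is not None, "no AWS credentials found"

athena = session.client("athena")
workgroups = athena.list_work_groups()["WorkGroups"]
print(f"Athena reachable; {len(workgroups)} workgroup(s) visible")
```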