SparkJDBCDataset not working when specifying a query instead of a table #639
Comments
@DavidRetana-TomTom Did you mean that you expect there is a …
Yes, exactly.
@DavidRetana-TomTom this is a great push - this dataset is quite old, so this may be newer functionality. I think it's a good idea to add this to our implementation. There are two steps at this point:
How do you feel about raising a PR to make this work? We can coach you through the process.
I took what is described here, and hopefully this can be a starting point or workaround; I only implemented the … The diff: https://github.com/kedro-org/kedro-plugins/compare/noklam/sparkjdbcdataset-not-working-639?expand=1

```python
"""SparkJDBCDataset to load and save a PySpark DataFrame via JDBC."""
from copy import deepcopy

from pyspark.sql import DataFrame

from kedro.io.core import AbstractDataset, DatasetError
from kedro_datasets.spark.spark_dataset import _get_spark


class SparkJDBCDataset(AbstractDataset[DataFrame, DataFrame]):
    ...
```
That should be enough for my use case. I can't open a pull request because I am not a collaborator of this project.
@DavidRetana-TomTom you can open one via the Forking workflow! We'd really appreciate it if you have a chance.
@noklam could you restore your branch https://github.com/kedro-org/kedro-plugins/tree/noklam/sparkjdbcdataset-not-working-639 ? Or was this already merged?
Description
When using SparkJDBCDataset you need to specify a table name as a mandatory parameter. However, using the Spark JDBC connector directly, you can specify a query to retrieve data from the database instead of hardcoding a single table. Check out this link.
According to the official Spark documentation:
The specified query will be parenthesized and used as a subquery in the FROM clause.
Below are a couple of restrictions while using this option.
Example:
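The example itself didn't survive the copy. A minimal sketch of the direct Spark usage the issue refers to (the connection details and query below are made up for illustration):

```python
# Hypothetical connection details, for illustration only.
url = "jdbc:postgresql://localhost:5432/shop"
query = "SELECT country, COUNT(*) AS n_orders FROM orders GROUP BY country"

# With the "query" option, Spark parenthesizes the statement and uses it
# as a subquery in the FROM clause; "query" and "dbtable" are mutually
# exclusive, so only one of them may appear in the options.
options = {"url": url, "query": query}

# With a live SparkSession `spark`, this would run the query in the database:
# df = spark.read.format("jdbc").options(**options).load()
```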
Context
This is especially important if you want to read data from multiple tables in the database, or if you want to run complex or spatial queries in the database instead of retrieving all the data and performing the computations in the cluster.
Steps to Reproduce
Source code right now (https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/spark/spark_jdbc_dataset.py):
Expected Result
I would like to have something like the following:
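The expected snippet was also lost in the copy. As one possible shape (the `query` key is the proposed addition, not an existing option, and the connection details are made up), the catalog entry could look like:

```yaml
weather:
  type: spark.SparkJDBCDataset
  url: jdbc:postgresql://localhost:5432/mydb   # hypothetical connection
  query: >-
    SELECT station, AVG(temp) AS avg_temp
    FROM weather_readings
    GROUP BY station
  credentials: db_credentials
```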
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): 0.19.3
- Kedro plugin version used (`pip show kedro-airflow`):
- Python version used (`python -V`): 3.10