feat: Implement option 'delete_rows' of argument 'if_exists' in 'DataFrame.to_sql' API. #60376
base: main
@WillAyd I chose the name @erfannariman tagging you due to your help/interest during the lifecycle of this issue.
I'm not sure that the test failures are related. Restarted so let's see...
My remaining feedback is rather minor; overall I think the implementation looks good.
@mroeschke care to take a look?
@@ -974,11 +975,13 @@ def create(self) -> None:
         if self.exists():
             if self.if_exists == "fail":
                 raise ValueError(f"Table '{self.name}' already exists.")
-            if self.if_exists == "replace":
+            elif self.if_exists == "replace":
unrelated
pandas/io/sql.py
Outdated
@@ -750,6 +750,11 @@ def to_sql(
    """
    Write records stored in a DataFrame to a SQL database.

    .. warning::

        This method can run arbitrary code which can make you vulnerable to code
Thanks for adding. I know this is a verbatim copy from the other issue, but I think it's a little too alarmist. .query can definitely be unsafe, since it dispatches to exec in Python which is documented as unsafe in kind.
Here, pandas just isn't doing anything extra, but the underlying drivers may take care of this. I think a better warning would be:
.. warning::

    The pandas library does not attempt to sanitize inputs provided via a to_sql call. Please refer to the documentation for the underlying database driver to see if it will properly prevent injection, or alternatively be advised of a security risk when executing arbitrary commands in a to_sql call.
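To make the distinction in the proposed warning concrete, here is a minimal sketch of what "the driver may take care of this" can look like. It uses the standard-library sqlite3 driver purely for illustration; the table `users` and the hostile value are made up and are not part of this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

# Hypothetical attacker-controlled value.
hostile = "x' OR '1'='1"

# Unsafe: the value is spliced into the SQL text, so it changes the query itself.
unsafe_rows = conn.execute(
    f"SELECT name FROM users WHERE name = '{hostile}'"
).fetchall()

# Safer: the value is bound as a parameter and the driver treats it as data only.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(unsafe_rows)  # every row matches because of the injected OR clause
print(safe_rows)    # no row is literally named "x' OR '1'='1"
```

Identifiers such as table or column names generally cannot be bound this way, which is one reason the warning points readers to the driver documentation rather than promising protection.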
Sure thing. Makes sense
NVM. CI is fine now.
pandas/tests/io/test_sql.py
Outdated
    with pandasSQL.run_transaction() as cur:
        cur.execute(table_stmt)

    if conn_name != "sqlite_buildin" and "adbc" not in conn_name:
I think to make this test more manageable we can create a generic pd.SQLError and use that whenever these are raised instead. I don't think there's a lot of value to exposing the error messages from the underlying libraries to end users, and it would help simplify this test.
Yes. I totally agree on that 💯.
pandas should definitely provide a common interface for SQL-related errors. But I believe this is out of the current scope?
I actually think we already have one in place - check pandas.errors.DatabaseError. If you catch where these problems are in the implementation and just do raise DatabaseError from exc then you can also just catch the DatabaseError in the test.
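The "raise DatabaseError from exc" suggestion can be sketched roughly as follows. This is only an illustration under assumptions: the `DatabaseError` class below is a local stand-in for pandas.errors.DatabaseError, the `execute` helper is hypothetical, and sqlite3 stands in for whichever driver is in use.

```python
import sqlite3


class DatabaseError(OSError):
    """Local stand-in for pandas.errors.DatabaseError (assumption for this sketch)."""


def execute(conn, sql, params=()):
    """Hypothetical wrapper: run a statement and re-raise driver errors generically."""
    try:
        return conn.execute(sql, params)
    except sqlite3.Error as exc:
        # Chain the driver exception so the original cause stays inspectable.
        raise DatabaseError(f"Execution failed on sql '{sql}': {exc}") from exc


conn = sqlite3.connect(":memory:")
caught = None
try:
    execute(conn, "DELETE FROM missing_table")
except DatabaseError as err:
    caught = err

print(type(caught).__name__, "caused by", type(caught.__cause__).__name__)
```

With a wrapper like this, a test only needs to expect the generic error type, and the driver-specific exception remains available through `__cause__`.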
Yes, sure.
So as of now the only place raising pd.errors.DatabaseError is the method execute.
The new implementation of delete_rows is using cursor.execute instead - which is at the driver level.
So the options we have to comply with what you suggested are:
1. Raise pd.errors.DatabaseError from delete_rows. The issue being that other methods (append, replace) would not raise it.
2. Start using self.execute, but I tried that in the past and it only works for SQLite because run_transaction closes the connection while ADBC and sqlalchemy don't.
Do you see another option or have a preference?
Where within the class hierarchies does that execute get called? It seems like there is where you can catch the exception from the library and re-raise a more generic one
self.execute is not called at all as of now; we use cursor.execute instead (see 2 ☝️).
SQLiteDatabase is the only implementation calling self.execute.
I tried to replace cursor.execute with self.execute and transaction management didn't go well for ADBC and SQLAlchemy.
Hmm OK. We might need to fix that up as a precursor to this then - seems like something is off
Ok...so let's try to unravel it:
Here is the SQLAlchemy's implementation for execute.
Unlike in the SQLite and ADBC cases, there's no exception handling in this method.
A problem with the current implementation is that the code that inserts is separated from the code that deletes records.
So even if delete_rows is implemented using self.execute, the insertion procedure may not use it - which is the case for SQLDatabase. I truly see the pain and wish we could implement the entire operation in a single method/function. But the codebase is not ready for that and would require some refactoring.
I really believe this is out of scope...please let me know what you think.
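For reference, the "entire operation in a single method/function" idea could look roughly like the sketch below. It is not the pandas implementation: it uses the standard-library sqlite3 driver, and the function name `delete_rows_then_insert`, the table `t`, and its columns are all hypothetical.

```python
import sqlite3


def delete_rows_then_insert(conn, table, rows):
    """Hypothetical helper: empty `table` and insert `rows` as one transaction."""
    # `table` is assumed to be a trusted identifier; it cannot be bound as a parameter.
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(f"INSERT INTO {table} (id, val) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'old')")
conn.commit()

# The second row violates the primary key, so the whole operation must be undone.
try:
    delete_rows_then_insert(conn, "t", [(1, "new"), (1, "duplicate key")])
except sqlite3.IntegrityError:
    pass
after_failure = conn.execute("SELECT id, val FROM t").fetchall()

# A valid batch replaces the old contents.
delete_rows_then_insert(conn, "t", [(2, "new")])
after_success = conn.execute("SELECT id, val FROM t").fetchall()

print(after_failure, after_success)
```

Because the delete and the insert share one transaction, a failed insert leaves the original rows in place, which is the property a test like test_delete_rows_is_atomic would check.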
I appreciate wanting to get this in but I don't think the issue is out of scope - we would just be choosing to ignore it, which delays the problem for a later date.
Do you see a reasonable path to a precursor that can fix some of the implementation issues with execute and get us back to a place where we consistently use the internal class methods, rather than cherry-picking calls to a third-party execute?
> I appreciate wanting to get this in but I don't think the issue is out of scope - we would just be choosing to ignore it, which delays the problem for a later date.

And just to clarify, the issue we're discussing is the lack of a common interface for database-related errors?
I have sent a patch. It touches parts of the code that are out of my league but that are required for the new test test_delete_rows_is_atomic to work using self.execute.
You will see that the implementation is still a work in progress and is not meant to be the final one (for example, raise DataBaseError("foo") is just used to represent the high-level solution). It is not meant to be bulletproof or to solve previously existing problems (though it definitely changes existing APIs), but it might be a middle ground toward what is to come, and it solves the problem for the newly introduced delete_rows.
Please let me know if that is feasible.
BTW, thanks a ton for the hard work reviewing it multiple times.
pandas/tests/io/test_sql.py
Outdated
@pytest.mark.parametrize("conn_name", all_connectable)
def test_delete_rows_is_atomic(conn_name, request):
    adbc_driver_manager = pytest.importorskip("adbc_driver_manager")
Hmm I thought we fixed this before but we don't want to import both of these in the same test. It is certainly possible to have one without the other, and I think we have it set up in CI that way.
Sorry, I didn't follow. Why's that?
I think you're referring to this comment, maybe?
This makes it so that this test only runs when both ADBC and sqlalchemy are installed. I'd have to double check, but I think that means we aren't running this test in CI at all. A lot of our environments only have SQLAlchemy and some others only have ADBC, so these are getting skipped whenever that is the case.
Humm...ok.
I'm going to conclude this discussion before taking any action.
The imports exist only because we need the exception class.
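One way to get the exception class without importing both libraries in every test run is to resolve it per connection name. The sketch below is an assumption-laden illustration: the helper `driver_error_type` is hypothetical, and the attribute names `adbc_driver_manager.Error` and `sqlalchemy.exc.SQLAlchemyError` are assumed to be the relevant base exceptions.

```python
import importlib
import sqlite3


def driver_error_type(conn_name):
    """Hypothetical helper: exception class expected for `conn_name`.

    Returns None when the driver needed for that connection is not installed,
    so the caller can skip only that parametrization.
    """
    if conn_name == "sqlite_buildin":
        return sqlite3.Error
    if "adbc" in conn_name:
        module_name, attr = "adbc_driver_manager", "Error"
    else:
        module_name, attr = "sqlalchemy.exc", "SQLAlchemyError"
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr)


sqlite_error = driver_error_type("sqlite_buildin")
adbc_error = driver_error_type("postgresql_adbc_conn")  # None if ADBC is absent
print(sqlite_error, adbc_error)
```

Inside a pytest test, the same effect can be had by calling pytest.importorskip only in the branch that matches conn_name, so an environment with just SQLAlchemy or just ADBC still runs its own cases.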
…Frame.to_sql' API.
Co-authored-by: Matthew Roeschke <[email protected]>
doc/source/whatsnew/v3.0.0.rst
file if fixing a bug or adding a new feature.