feat: Implement option 'delete_rows' of argument 'if_exists' in 'DataFrame.to_sql' API. #60376

gmcrocetti · 2024-11-20T13:31:48Z

closes ENH: DataFrame.to_sql with if_exists='replace' should do truncate table instead of drop table #37210
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.

gmcrocetti · 2024-11-20T13:37:13Z

@WillAyd I chose the name delete_rows instead of delete_replace because replace currently recreates the table, as you mentioned, whereas delete_rows describes exactly what happens behind the scenes.

@erfannariman tagging you due to your help/interest during the lifecycle of this issue.
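
For readers following the naming discussion, here is a minimal sketch of the difference between the two behaviors, written against the standard-library sqlite3 module rather than pandas. The table name `t`, its columns, and the recreated schema are assumptions made up for illustration; the SQL that pandas actually emits is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO t VALUES (1, 'a')")


def table_sql(connection):
    # sqlite_master keeps the CREATE statement the table was defined with.
    query = "SELECT sql FROM sqlite_master WHERE name = 't'"
    return connection.execute(query).fetchone()[0]


# 'delete_rows'-style: the rows are removed, the table definition is untouched.
conn.execute("DELETE FROM t")
schema_after_delete = table_sql(conn)
rows_after_delete = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# 'replace'-style: the table is dropped and recreated. The new definition is a
# hypothetical stand-in for a schema inferred from the data being written.
conn.execute("DROP TABLE t")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
schema_after_replace = table_sql(conn)
```

In this sketch the primary key and NOT NULL constraints survive the DELETE but are lost after the drop and recreate, which is the distinction the name delete_rows is meant to convey.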

WillAyd

I'm not sure that the test failures are related. Restarted so let's see...

My remaining feedback is rather minor; overall I think the implementation looks good.

@mroeschke care to take a look?

gmcrocetti · 2025-01-03T14:37:04Z

pandas/io/sql.py

@@ -974,11 +975,13 @@ def create(self) -> None:
        if self.exists():
            if self.if_exists == "fail":
                raise ValueError(f"Table '{self.name}' already exists.")
-            if self.if_exists == "replace":
+            elif self.if_exists == "replace":


WillAyd · 2025-01-03T16:39:44Z

pandas/io/sql.py

@@ -750,6 +750,11 @@ def to_sql(
    """
    Write records stored in a DataFrame to a SQL database.

+    .. warning::
+
+        This method can run arbitrary code which can make you vulnerable to code


Thanks for adding. I know this is a verbatim copy from the other issue, but I think it's a little too alarmist. .query can definitely be unsafe, since it dispatches to exec in Python, which is likewise documented as unsafe.

Here, pandas just isn't doing anything extra, but the underlying drivers may take care of this. I think a better warning would be:

.. warning::

    The pandas library does not attempt to sanitize inputs provided via a to_sql call. Please refer to the documentation for the underlying database driver to see if it will properly prevent injection, or alternatively be advised of a security risk when executing arbitrary commands in a to_sql call.

Sure thing. Makes sense

I also added this warning at generic.py. LMK in case it's wrong.
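
To illustrate what "the underlying drivers may take care of this" can look like, the following is a small sketch using the standard-library sqlite3 driver directly. The `users` table and the input string are invented for the example, and it only demonstrates value binding in one driver; it makes no claim about what to_sql or any other driver does with table names, column names, or raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

# A value that would be dangerous if pasted into the SQL text itself.
malicious = "x'); DROP TABLE users; --"

# With a bound parameter the driver treats the value purely as data.
conn.execute("INSERT INTO users (name) VALUES (?)", (malicious,))

names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY rowid")]
```

The table still exists afterwards and the suspicious string is stored verbatim as a value. Identifiers and hand-built SQL strings get no such protection, which is why the warning points readers to the driver documentation.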

gmcrocetti · 2025-01-03T20:38:52Z

~~@WillAyd it seems new tests have introduced some sort of "lock" on pytest. It is affecting 3.11, 3.12 and "Future infer strings" using a single CPU (it works using multiple). Do you have any hunch ?~~

NVM. CI is fine now.

WillAyd · 2025-01-06T15:04:23Z

pandas/tests/io/test_sql.py

+    with pandasSQL.run_transaction() as cur:
+        cur.execute(table_stmt)
+
+    if conn_name != "sqlite_buildin" and "adbc" not in conn_name:


I think to make this test more manageable we can create a generic pd.SQLError and use that whenever these are raised instead. I don't think there's a lot of value to exposing the error messages from the underlying libraries to end users, and it would help simplify this test

Yes. I totally agree on that 💯 .
pandas should definitely provide a common interface for SQL-related errors. But I believe this is out of the current scope ?

I actually think we already have one in place - check pandas.errors.DatabaseError. If you catch where these problems are in the implementation and just do raise DatabaseError from exc then you can also just catch the DatabaseError in the test
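
A minimal sketch of the "raise DatabaseError from exc" pattern suggested above, using sqlite3 as the driver. The `DatabaseError` class below is a local stand-in defined only so the example is self-contained; it is not the pandas class, and the `execute` helper and its message format are assumptions for illustration.

```python
import sqlite3


class DatabaseError(Exception):
    """Local stand-in for a library-level error such as pandas.errors.DatabaseError."""


def execute(conn, sql, params=()):
    # Translate any driver-specific failure into one generic error type,
    # keeping the original exception available as __cause__.
    try:
        return conn.execute(sql, params)
    except sqlite3.Error as exc:
        raise DatabaseError(f"Execution failed on sql '{sql}': {exc}") from exc


conn = sqlite3.connect(":memory:")

try:
    execute(conn, "SELECT * FROM missing_table")
except DatabaseError as err:
    caught = err
```

A test written against this helper only needs to expect `DatabaseError`, regardless of which driver raised the underlying exception.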

yes, sure.
So as of now the only place raising pd.errors.DatabaseError is the method execute.
The new implementation delete_rows is using cursor.execute instead - which is at the driver level.
So the options we have to comply with what you suggested are:

1. Raise pd.errors.DatabaseError from delete_rows. The issue is that the other methods (append, replace) would not raise it.

2. Start using self.execute. I tried this in the past, and it only works for SQLite because run_transaction closes the connection while ADBC and sqlalchemy don't.

Do you see another option or have a preference ?

Where within the class hierarchies does that execute get called? It seems like there is where you can catch the exception from the library and re-raise a more generic one

self.execute is not called at all as of now, we use cursor.execute instead (see 2 ☝️ ).
SQLiteDatabase is the only implementation calling self.execute.
I tried to replace cursor.execute with self.execute, and transaction management didn't go well for ADBC and SQLAlchemy.

Hmm OK. We might need to fix that up as a precursor to this then - seems like something is off

Ok...so let's try to unravel it:
Here is SQLAlchemy's implementation of execute.

Unlike the SQLite and ADBC implementations, this method has no exception handling.

A problem with the current implementation is that the code that inserts records is separate from the code that deletes them.
So even if delete_rows is implemented using self.execute, the insertion procedure may not use it, which is the case for SQLDatabase. I truly see the pain and wish we could implement the entire operation in a single method/function, but the codebase is not ready for that and would require some refactoring.
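
The concern about deletion and insertion living in separate code paths can be made concrete with a short sketch of what an atomic "delete rows, then insert" looks like when both steps share one transaction. This uses sqlite3 only; the function name `replace_rows_atomically` and the table layout are hypothetical and do not reflect how pandas structures the operation.

```python
import sqlite3


def replace_rows_atomically(conn, rows):
    # 'with conn' commits if the block succeeds and rolls back on any
    # exception, so the DELETE is undone when the INSERT fails.
    with conn:
        conn.execute("DELETE FROM t")
        conn.executemany("INSERT INTO t (id, name) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO t VALUES (1, 'old')")
conn.commit()

try:
    # The second row violates NOT NULL, so the whole operation must fail.
    replace_rows_atomically(conn, [(2, "new"), (3, None)])
except sqlite3.IntegrityError:
    pass

remaining = conn.execute("SELECT id, name FROM t").fetchall()
```

Because both statements run inside the same transaction, the original row is still present after the failed insert. If the delete were committed separately, the table would be left empty.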

I really believe this is out of scope...please let me know what you think.

I appreciate wanting to get this in but I don't think the issue is out of scope - we would just be choosing to ignore it, which delays the problem for a later date.

Do you see a reasonable path to a precursor that can fix some of the implementation issues with execute and get us back to a place where we consistently use the internal class methods, rather than cherry-picking calls to a third party execute?

> I appreciate wanting to get this in but I don't think the issue is out of scope - we would just be choosing to ignore it, which delays the problem for a later date.

And just to clarify, the issue we're discussing is the lack of a common interface for database related errors ?

I have sent a patch. It touches parts of the code that are out of my league but that are required for the new test test_delete_rows_is_atomic to work using self.execute.
You will see that the implementation is still a work in progress and is not meant to be final (for example, raise DatabaseError("foo") is just a placeholder for the high-level solution). It is not meant to be bulletproof or to solve pre-existing problems (though it definitely changes existing APIs), but it might be a middle ground for what is to come, and it solves the problem for the newly introduced delete_rows.

Please, let me know if that is feasible.

BTW, thanks a ton for the hard work reviewing it multiple times.

WillAyd · 2025-01-06T15:04:35Z

pandas/tests/io/test_sql.py

+
+@pytest.mark.parametrize("conn_name", all_connectable)
+def test_delete_rows_is_atomic(conn_name, request):
+    adbc_driver_manager = pytest.importorskip("adbc_driver_manager")


Hmm I thought we fixed this before but we don't want to import both of these in the same test. It is certainly possible to have one without the other, and I think we have it set up in CI that way.

Sorry, I didn't follow. Why's that ?
I think you're referring to this comment maybe ?

This makes it so that this test only runs when both ADBC and sqlalchemy are installed. I'd have to double check, but I think that means we aren't running this test in CI at all. A lot of our environments only have SQLAlchemy and some others only have ADBC, so these are getting skipped whenever that is the case.

Humm...ok.
I'm going to conclude this discussion before taking any action.
The imports exist only because we need the exception class.
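
One way to avoid requiring both optional libraries at once, sketched here under the assumption that the test only needs the exception classes, is to collect them from whichever modules happen to be installed. The helper name `optional_exception_types` and the placeholder module `hypothetical_missing_driver` are made up for illustration; this is not the approach taken in the PR.

```python
import importlib
import sqlite3


def optional_exception_types(candidates):
    # Return the exception classes from every candidate module that can be
    # imported, silently skipping modules that are not installed.
    found = []
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        found.append(getattr(module, attr))
    return tuple(found)


expected_errors = optional_exception_types(
    [
        ("sqlite3", "Error"),
        ("sqlalchemy.exc", "SQLAlchemyError"),
        ("adbc_driver_manager", "Error"),
        ("hypothetical_missing_driver", "Error"),  # never installed; skipped
    ]
)
```

The resulting tuple can be passed to something like `pytest.raises(expected_errors)`, so the test still runs in environments that have only one of the two libraries.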

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from e443d8d to 33cd0d6 Compare November 20, 2024 14:04

gmcrocetti marked this pull request as draft November 20, 2024 14:04

WillAyd requested changes Nov 20, 2024

WillAyd added the IO SQL to_sql, read_sql, read_sql_query label Nov 20, 2024

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch 3 times, most recently from 3c33249 to 1ef5a87 Compare November 22, 2024 01:10

gmcrocetti requested a review from WillAyd November 22, 2024 12:26

gmcrocetti marked this pull request as ready for review November 22, 2024 12:26

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch 3 times, most recently from b71c0d9 to 1843040 Compare December 17, 2024 13:45

WillAyd requested changes Dec 26, 2024

WillAyd requested changes Dec 27, 2024

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 15bda94 to 2eb19e7 Compare December 27, 2024 18:03

gmcrocetti requested a review from WillAyd December 27, 2024 18:04

WillAyd requested changes Dec 30, 2024

mroeschke reviewed Dec 30, 2024

mroeschke reviewed Dec 30, 2024

gmcrocetti commented Jan 3, 2025

gmcrocetti requested review from WillAyd and mroeschke January 3, 2025 14:38

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 5f6ab41 to d1b01d2 Compare January 3, 2025 14:42

gmcrocetti requested review from MarcoGorelli, Dr-Irv and datapythonista as code owners January 3, 2025 14:42

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from d1b01d2 to 3e8813f Compare January 3, 2025 14:45

WillAyd reviewed Jan 3, 2025

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 77dc01c to 4c8fcda Compare January 3, 2025 17:43

gmcrocetti requested a review from WillAyd January 3, 2025 17:45

WillAyd requested changes Jan 6, 2025

gmcrocetti requested a review from WillAyd January 8, 2025 19:03

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch 3 times, most recently from e967648 to 0fccb84 Compare January 9, 2025 01:15

gmcrocetti and others added 8 commits January 9, 2025 13:43

feat: implement option 'delete_rows' of argument 'if_exists' in 'DataFrame.to_sql' API.

9ffc102

Apply suggestions from code review

435a6e2

Co-authored-by: Matthew Roeschke <[email protected]>

reworked tests to include sqlite cases

404f339

merged readme

3810ce0

docs: add warning in 'to_sql'

fabde42

refactor: rewrite test

90bfb8c

wip - trying out new solution

49e0bcc

gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 0fccb84 to 49e0bcc Compare January 9, 2025 16:43

chore: add a new row to 'replacing_df' so we don't get biased

d2ef193
