
add SQL adapter #779

Draft · wants to merge 37 commits into main
Conversation

skarakuzu (Contributor)

Preliminary start of the SQL adapter. To be continued ...

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

@@ -44,6 +44,9 @@ tiled = "tiled.commandline.main:main"

# This is the union of all optional dependencies.
all = [
"adbc_driver_manager",
Member:
This section is used when tiled is installed like pip install "tiled[all]". These three should also be added to the server section below (the one marked # server only), so that they are included when tiled is installed like pip install "tiled[server]".
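For illustration, the suggested pyproject.toml layout would look roughly like this (only adbc_driver_manager is visible in this hunk; the other added dependencies are elided):

# This is the union of all optional dependencies.
all = [
    "adbc_driver_manager",
    # ... the other dependencies added in this diff ...
]
server = [  # server only
    "adbc_driver_manager",
    # ... likewise ...
]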

@danielballan (Member)

Lifecycle:

  1. The client declares that it wants to create a new tabular dataset, via a request POST /api/v1/metadata/my_table.
  2. In the "catalog" SQL database, the server adds a row to the nodes table with any metadata about this table. This is how the new table is connected to any overall dataset, like a Bluesky scan and its Scan ID.
  3. Also in the "catalog" SQL database, the server adds one row each to the data_sources table and the assets table. Together, they describe how to locate where the new data will be saved. The Asset part is very locked down: it has room for the URI of the tabular SQL database (postgresql://...) and some boilerplate. The DataSource has a freeform area called parameters, which can hold any JSON. We can use this for dataset-specific details, like the name of the SQL table (table_name), derived from the Arrow schema in this case, and a means of selecting the rows of interest for this new dataset (dataset_id). (See the sketch after this list.)
  4. When data is written or read, a SQLAdapter object is instantiated inside the server. It is passed information extracted from this DataSource and Asset, so it knows the table_name and the dataset_id.
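A rough sketch, as Python dicts, of what those records could carry. The exact internal layout is an assumption here; table_name, dataset_id, the parameters area, and the mimetype are taken from this thread:

# Hypothetical shapes of the catalog records described above.
asset = {
    "data_uri": "postgresql://...",  # URI of the tabular SQL database
}
data_source = {
    "mimetype": "application/x-tiled-sql-table",
    "parameters": {  # freeform JSON area
        "table_name": "table_blahblahblah",  # derived from the Arrow schema
        "dataset_id": 12345,  # selects this dataset's rows within the table
    },
}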

@danielballan (Member)

Test script:

import pandas
from tiled.client import from_uri
from tiled.structures.core import StructureFamily
from tiled.structures.data_source import Asset, DataSource, Management
from tiled.structures.table import TableStructure

client = from_uri("http://localhost:8000", api_key="secret")

df = pandas.DataFrame({"a": [1, 2, 3], "b": [1., 2., 3.]})
structure = TableStructure.from_pandas(df)

x = client.new(
    structure_family=StructureFamily.table,
    data_sources=[
        DataSource(
            management=Management.writable,
            mimetype="application/x-tiled-sql-table",
            structure_family=StructureFamily.table,
            structure=structure,
            assets=[],
        ),
    ],
    metadata={},
    specs=[],
    key="x",
)
x.write(df)
x.append_partition(df, 0)

# This does not work yet
# x.read()  # calls /table/partition/x?partition=0 adapter.read_partition()

@danielballan (Member) commented Jan 16, 2025

For this PR

  • Add a dataset_id column and filter by it.
  • Create the table eagerly, if ADBC APIs allow it. Seems not to be possible.
  • In Adapter, remove write. Write would mean "overwrite" or "replace", and we are not sure we want to expose this. (We can add it later if we want it.)
  • In the client, replace write_appendable_dataframe with create_appendable_dataframe. This will run the self.new(...) call, which runs init_storage on the server side, but it will not take any data. Data will be appended in later calls.
  • In Adapter, I removed append and used append_partition. (For now it's stuck at partition=0, but this constraint will be temporary.) Tests need to be updated.
  • Execute CREATE INDEX IF NOT EXISTS ... on the dataset_id column.
  • Pandas indexes should round-trip. (Dan)
  • Protect against SQL injection. In init_storage, table_name should match some restrictive regex pattern. Maybe lowercase letters, numbers, and underscores? (A sketch follows this list.)
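A minimal sketch of that last item, assuming the pattern floated above (lowercase letters, numbers, and underscores); the helper name is hypothetical:

import re

# Assumed pattern: a lowercase letter, then lowercase letters, digits, or underscores.
TABLE_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")

def validate_table_name(table_name):
    # Hypothetical guard for init_storage, before table_name reaches any SQL.
    if TABLE_NAME_PATTERN.match(table_name) is None:
        raise ValueError(f"Invalid table name: {table_name!r}")
    return table_name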

Intended usage now looks like...

The following prompts the server to:

  1. Generate a table_name from the schema hash (sketched below). (The table might or might not already exist, containing rows from other dataset_ids.)
  2. Generate a new unique dataset_id for this dataset.
  3. Store the table_name, dataset_id, and any metadata passed here in the catalog database.
# This uploads no data.
x = client.create_appendable_table(schema, key="x")
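A sketch of step 1, deriving a stable table name from the Arrow schema; the hashing scheme here is an assumption, not the implementation:

import hashlib
import pyarrow

# Hypothetical: hash the serialized Arrow schema so that datasets with the
# same schema land in the same table.
schema = pyarrow.schema([("a", pyarrow.int64()), ("b", pyarrow.float64())])
digest = hashlib.md5(schema.serialize().to_pybytes()).hexdigest()
table_name = f"table_{digest}"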

The following prompts the server to:

  1. Create the table {table_name} if it does not yet exist.
  2. Ingest the rows into that table, with an additional dataset_id column. (A rough SQL sketch follows the code below.)
# Now data can be added, potentially in parallel.
x.append_partition(df, 0)
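For illustration, the server-side effect of that call might be roughly the following SQL; the table name and data columns are taken from the examples in this thread, not from the implementation:

CREATE TABLE IF NOT EXISTS table_blahblahblah (
    dataset_id INTEGER,
    a BIGINT,
    b DOUBLE PRECISION
);
CREATE INDEX IF NOT EXISTS table_blahblahblah_dataset_id
    ON table_blahblahblah (dataset_id);
INSERT INTO table_blahblahblah (dataset_id, a, b)
    VALUES (12345, 1, 1.0), (12345, 2, 2.0), (12345, 3, 3.0);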

In a separate process, this would also work. We can access an existing table and keep appending.

x = client["x"]
x.append_partition(df, 0)

In follow-up PRs...

  • Support a PG database with credentials.
  • Connection pooling.
  • Supporting more than one partition. SQL will scale fine to a large table, but current Tiled does not let the client request less than a full partition. We either need to change that and let users request row ranges (which seems complicated, especially with Parquet, so it might be something to wait on) or mark up the data in the SQL table as belonging to reasonably-sized partitions. Similar to how arrays are chunked by the client, table rows should be partitioned.

Maybe in the future partitions are added like this? Not sure whether PostgreSQL native "table partitioning" fits our use case.

# table_blahblahblah
dataset_id  partition_id  ...
12345       1
12345       1
12345       2
12345       3
12345       3
12345       3
24323
def read_partition(self, partition):
    query = f"SELECT * FROM {self.table_name} WHERE dataset_id={self.dataset_id} AND partition={partition}"
    ...
