Initial support for files table in Dexie and DuckDB #789

Merged: 4 commits, Jan 23, 2025

humphd
Collaborator

@humphd humphd commented Jan 20, 2025

This begins the journey to implement #788. I've done the minimum to get a workable thing. There is lots more to do, but it's already amazing!

  • Added a new files table, which tracks files by their hashed content id and includes metadata
  • Added a ChatCraftFile class to make it easier to work with files and convert to/from various things
  • Added fileIds Array to the chats table to track files associated with a chat
  • Connected ChatCraftFile and files to the use-file-import.ts hook so attaching, pasting, dragging files into a chat all add them to the files table and associate with the current chat's files.
  • Extended duckdb-chatcraft.ts to know about ChatCraftFile, bridging the filesystem from Dexie -> DuckDB (we still need to go the other way).
  • Similar to what we do with chatcraft.* tables, files are auto-injected into DuckDB when they are referenced in a SQL query (by name or id). If you go to Options > Attach Files... and add a CSV file, doing something like select * from 'some-file-name.csv' just works.
  • Enhanced /duck command to be able to show all tables and all files, including virtual files in the chat we could inject.
  • Added /ls slash command to list files in a chat
  • Always store files in Dexie files table when imported. Currently I'm not changing how the imageUrls work at all.
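
The auto-injection step in the list above could be sketched roughly as follows. This is only an illustration, not the code in duckdb-chatcraft.ts: the `FileRef` shape and the `quotedLiterals` and `filesReferencedBy` helpers are hypothetical names, and the sketch assumes files are referenced as single-quoted literals, as in the `select * from 'some-file-name.csv'` example.

```typescript
// Hypothetical, simplified stand-in for a file record (the real ChatCraftFile has more fields).
interface FileRef {
  id: string; // hashed content id
  name: string; // original file name
}

// Collect every single-quoted literal in a SQL statement.
// Assumption: escaped quotes ('') inside literals are not handled in this sketch.
function quotedLiterals(sql: string): string[] {
  const pattern = /'([^']+)'/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sql)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Return the chat files whose name or id appears as a quoted literal in the query.
function filesReferencedBy(sql: string, files: FileRef[]): FileRef[] {
  const literals = new Set(quotedLiterals(sql));
  return files.filter((f) => literals.has(f.name) || literals.has(f.id));
}
```

Only the files returned by `filesReferencedBy` would then need to be made available to DuckDB before the query runs, which keeps unreferenced files out of the database.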

For follow-ups--I don't want to bother with this stuff now, but so I don't forget:

  • I haven't got the runCode working with file injection yet, since it requires the chat to be passed all the way through to the inner component. We should probably make a React Context and Hook for this, so you can do const chat = useChat()
  • File expiry. Currently I only have an optional field.
  • Start moving imageUrls out of messages and into files
  • Figure out when to embed a file's content in the chat (e.g., based on type, size, etc).
  • We need some preference UI for signalling how to handle things with files (e.g., expiry, sizes, when to inject)
  • Think about how to handle sharing. Currently I don't try to serialize files into shared chats.
  • Add support for using partial versions of file hashes (like short SHAs in GitHub) so you don't have to paste the whole thing into a SQL statement
  • Think about file deduping (e.g., we store by hash id) and how that affects file names. I could add the same file more than once, and the original filename might differ from the new one. Do we care?
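
For the partial-hash follow-up above, one possible approach is plain prefix matching with an explicit ambiguous case. This is a sketch under assumptions: `resolveHashPrefix`, the `PrefixMatch` type, and the default minimum length of 7 are all hypothetical and not part of this PR.

```typescript
// Possible outcomes when resolving a short hash prefix against the known file ids.
type PrefixMatch =
  | { kind: "found"; id: string }
  | { kind: "ambiguous"; candidates: string[] }
  | { kind: "tooShort" }
  | { kind: "missing" };

// Resolve a hash prefix to a full id, refusing prefixes that are too short or not unique.
// Assumption: ids are hex strings, so comparison is case-insensitive.
function resolveHashPrefix(prefix: string, ids: string[], minLength = 7): PrefixMatch {
  const needle = prefix.toLowerCase();
  if (needle.length < minLength) {
    return { kind: "tooShort" };
  }
  const candidates = ids.filter((id) => id.toLowerCase().startsWith(needle));
  if (candidates.length === 1) {
    return { kind: "found", id: candidates[0] };
  }
  if (candidates.length > 1) {
    return { kind: "ambiguous", candidates };
  }
  return { kind: "missing" };
}
```

Returning the candidates in the ambiguous case lets a caller show them to the user instead of silently picking one.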

@humphd humphd requested a review from tarasglek January 20, 2025 01:30
@tarasglek
Owner

> Think about file deduping (e.g., we store by hash id) and how that affects file names. I could add the same file more than once, and the original filename might differ from the new one. Do we care?

Everything is solvable with more layers of indirection :)

This is solved by adding another 'file-content' table
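
A minimal in-memory sketch of that extra layer of indirection, assuming one table of name entries that point at a separate table of content keyed by hash. The `FileStore` class, `FileEntry` shape, and method names are hypothetical, the content id is supplied by the caller rather than computed, and content is a string only to keep the example short.

```typescript
// One row per (name, content) pairing; several names may point at the same content.
interface FileEntry {
  name: string;
  contentId: string; // hash of the content, computed elsewhere
}

class FileStore {
  // Stand-in for a "file-content" table: contentId -> content.
  private contents = new Map<string, string>();
  // Stand-in for the "files" table: names that reference stored content.
  private entries: FileEntry[] = [];

  add(name: string, contentId: string, content: string): void {
    // Store the content only once per hash.
    if (!this.contents.has(contentId)) {
      this.contents.set(contentId, content);
    }
    // Record the name unless this exact pairing already exists.
    const exists = this.entries.some((e) => e.name === name && e.contentId === contentId);
    if (!exists) {
      this.entries.push({ name, contentId });
    }
  }

  namesFor(contentId: string): string[] {
    return this.entries.filter((e) => e.contentId === contentId).map((e) => e.name);
  }

  read(name: string): string | undefined {
    const entry = this.entries.find((e) => e.name === name);
    return entry ? this.contents.get(entry.contentId) : undefined;
  }

  contentCount(): number {
    return this.contents.size;
  }
}
```

With this split, adding the same bytes under two names stores the content once while keeping both file names.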

src/lib/ChatCraftFile.ts (outdated)
): Promise<ChatCraftFile> {
let options: ChatCraftFileOptions;

if (input instanceof File) {
Owner

@tarasglek tarasglek Jan 20, 2025

findOrCreate only makes sense to do via sha

Collaborator Author

That's exactly what it will do. Below this, I calculate the sha based on the file.

This lets us receive a File object in the app and use it to do a search, handling the hashing internally.

* @param options Listing options for sorting and filtering
* @returns Array of matching ChatCraftFile instances
*/
export async function ls(
Owner

@tarasglek tarasglek Jan 20, 2025

Why not use duckdb glob here? Not a fan of making a non-standard glob

Collaborator Author

This is not part of duckdb, it's the files in Dexie. We also need it for duckdb, but not in this file.


// Include files if there are any, including files in the chat
const files = await chat.files();
if (files.length) {
Owner

love it

* from the chat's files.
* @param chat the ChatCraftChat, potentially with files
*/
export async function getFiles(chat: ChatCraftChat) {
Owner

How about we use queryToMarkdown() here, and add a filter predicate to queryToMarkdown so items can be filtered? It might be useful elsewhere.

Collaborator Author

Let's enhance this in a follow up. I wonder if we should be returning an object like this:

const result = query('select ...');

// result can now be used in various ways:
const markdown = result.toMarkdown();
const json = result.toJSON();
...

We could add some way to do further sub-queries on it too, by passing the Arrow Table result back into DuckDB and running another sql query.
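
To make the idea concrete, here is a rough sketch of such a result object. It wraps plain row objects rather than an Arrow Table, and the `QueryResult` class, `Row` type, and `filter` method are hypothetical; sub-queries through DuckDB are not shown.

```typescript
// Simplified row type; a real result would come from DuckDB as an Arrow Table.
type Row = Record<string, string | number | boolean | null>;

class QueryResult {
  private readonly rows: Row[];

  constructor(rows: Row[]) {
    this.rows = rows;
  }

  // JSON projection: a shallow copy of the rows (also used by JSON.stringify).
  toJSON(): Row[] {
    return this.rows.map((row) => ({ ...row }));
  }

  // Markdown projection: a pipe table using the first row's keys as columns.
  toMarkdown(): string {
    if (this.rows.length === 0) {
      return "";
    }
    const columns = Object.keys(this.rows[0]);
    const header = `| ${columns.join(" | ")} |`;
    const divider = `| ${columns.map(() => "---").join(" | ")} |`;
    const body = this.rows.map(
      (row) => `| ${columns.map((c) => String(row[c])).join(" | ")} |`
    );
    return [header, divider, ...body].join("\n");
  }

  // Filter predicate, returning a new result that supports the same projections.
  filter(predicate: (row: Row) => boolean): QueryResult {
    return new QueryResult(this.rows.filter(predicate));
  }
}
```

Because `filter` returns another `QueryResult`, a filtered view can be rendered with `toMarkdown()` or `toJSON()` without a second code path.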

Owner

I like your proposal of an object with multiple "projections"...ok with followup

@tarasglek
Owner

Overall, nice stuff. What's left to land this? Just my review?

cloudflare-workers-and-pages bot commented Jan 23, 2025

Deploying chatcraft-org with Cloudflare Pages

Latest commit: cb4d691
Status: ✅  Deploy successful!
Preview URL: https://ce30ec4e.console-overthinker-dev.pages.dev
Branch Preview URL: https://humphd-files-table.console-overthinker-dev.pages.dev

@humphd
Collaborator Author

humphd commented Jan 23, 2025

@tarasglek I think this is good to go. If you're ok with it, let's merge and keep improving in follow-ups.

@humphd
Collaborator Author

humphd commented Jan 23, 2025

Made one more fix: files created in DuckDB now also show up when you run /duck:

[Screenshot 2025-01-22 at 8 37 20 PM]

We can do file download, using as attachments in the chat, etc. in a follow-up.

@tarasglek tarasglek merged commit ba92118 into main Jan 23, 2025
4 checks passed
@tarasglek
Owner

Fantastic work here
