RFC: Files Table and DuckDB #788

Open
humphd opened this issue Jan 18, 2025 · 4 comments
Labels: db (Issues related to database), enhancement (New feature or request)

Comments

@humphd
Collaborator

humphd commented Jan 18, 2025

Add Files Table and DuckDB File Integration

We've been adding DuckDB support to ChatCraft and it has some powerful features we'd like to use with files. For example, being able to use SQL to query CSV, JSON, or Excel files, do full-text search of large text files, etc. DuckDB-wasm also has its own concept of a file system, which we'd like to connect to our current files feature.

Current Behaviour

Our current file feature mostly lives in src/hooks/use-file-import.tsx. Users can copy/paste, drag-and-drop, or click Options > Attach Files... to include file content in messages. In other words, file contents (text or extracted text, images as base64 encoded URLs) are included in messages to be sent to the LLM. This works well for small files.

Now that we have DuckDB and the concept of Tables, we want to add Files as well. Files would be separate from messages, and would be stored in binary form in Dexie.

Proposed Changes

1. New Dexie Table

Add a new Table to Dexie, files, with the following layout:

interface ChatCraftFile {
  id: string; // unique hash of the file contents, for deduping
  name: string; // original file name
  type: string; // mime-type of the file (e.g., "text/csv")
  size: number; // size of file in bytes
  content: Blob; // binary content of file
  text?: string; // extracted text of file, base64 encoded version, etc.
  created: Date; // when the file was created
  expires: Date; // when the file can be deleted
  metadata?: Record<string, unknown>; // extra metadata
}
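
As a rough sketch of how the content-hash id could be produced for deduping, the following uses the Web Crypto API's SHA-256 digest. The helper names hashBlob and createFileRecord, the choice of SHA-256, and the 30-day default lifetime are assumptions for illustration; the RFC does not specify any of them.

```typescript
// Sketch only: hashBlob, createFileRecord, and DEFAULT_TTL_MS are hypothetical.
interface ChatCraftFile {
  id: string;
  name: string;
  type: string;
  size: number;
  content: Blob;
  text?: string;
  created: Date;
  expires: Date;
  metadata?: Record<string, unknown>;
}

// Hex-encoded SHA-256 of the blob's bytes, computed with the Web Crypto API.
async function hashBlob(blob: Blob): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", await blob.arrayBuffer());
  return Array.from(new Uint8Array(digest))
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join("");
}

// Assumption: files live for 30 days unless preferences say otherwise.
const DEFAULT_TTL_MS = 30 * 24 * 60 * 60 * 1000;

// Build a record whose id depends only on the file's bytes, not its name.
async function createFileRecord(
  name: string,
  content: Blob,
  now: Date = new Date()
): Promise<ChatCraftFile> {
  return {
    id: await hashBlob(content),
    name,
    type: content.type || "application/octet-stream",
    size: content.size,
    content,
    created: now,
    expires: new Date(now.getTime() + DEFAULT_TTL_MS),
  };
}
```

One consequence of hashing only the contents: two uploads with identical bytes but different names get the same id, so storing the second would replace the first record's name unless that case is handled explicitly.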

2. File Processing Logic

Modify src/hooks/use-file-import.tsx as follows:

  • always store imported files in the files table
  • add logic to decide when to include a file's contents in the message (e.g., based on file type, size). For example, a small PDF would get added as a message but a large one would not
  • change how images are included: add them to human messages as Markdown, rather than as image URLs as we do now
  • larger files can become available for RAG, DuckDB queries
  • periodically expire files (e.g., a timer that runs hourly and compares each file's expires field to the current date/time?)
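
The include-or-not decision in the list above could be sketched as a small pure function. The InlinePrefs shape, the 100 KB threshold, and the list of type prefixes are assumptions for illustration, not agreed defaults.

```typescript
// Hypothetical preference shape; field names and defaults are assumptions.
interface InlinePrefs {
  maxInlineBytes: number; // files larger than this are never inlined
  inlineTypePrefixes: string[]; // mime-type prefixes eligible for inlining
}

const defaultInlinePrefs: InlinePrefs = {
  maxInlineBytes: 100 * 1024,
  inlineTypePrefixes: ["text/", "application/json", "application/pdf", "image/"],
};

// True when the file's contents should be added to the message itself;
// false when it should only be stored (and left for RAG or DuckDB queries).
function shouldInlineFile(
  file: { type: string; size: number },
  prefs: InlinePrefs = defaultInlinePrefs
): boolean {
  const typeAllowed = prefs.inlineTypePrefixes.some((prefix) =>
    file.type.startsWith(prefix)
  );
  return typeAllowed && file.size <= prefs.maxInlineBytes;
}
```

With these defaults, a 20 KB PDF would be inlined while a 5 MB PDF would only be stored, which matches the small vs. large PDF example above.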

3. DuckDB Integration

We need to expose files in the current chat (and possibly all files?) to the DuckDB virtual filesystem. This would allow us to do SQL queries against the file data, for example:

  • full-text search (RAG) which gets included in chats
  • SQL queries to obtain info from CSV, Excel, etc.
  • SQL queries to join across multiple files

It would be nice if DuckDB could also write to the Dexie files table, so you can save the results of queries back into the chat.
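
A minimal sketch of the read direction, assuming files are copied into DuckDB's virtual filesystem by name. To the best of my knowledge duckdb-wasm's AsyncDuckDB exposes a registerFileBuffer(name, buffer) method, but the sketch only depends on the small FileRegistry interface declared here, so treat that as something to verify. The helper names are hypothetical.

```typescript
// Structural type for the single method this sketch needs. The assumption is
// that the duckdb-wasm database object satisfies it.
interface FileRegistry {
  registerFileBuffer(name: string, buffer: Uint8Array): Promise<void>;
}

// Copy each file's bytes into the virtual filesystem under its own name.
async function registerChatFiles(
  db: FileRegistry,
  files: { name: string; content: Blob }[]
): Promise<string[]> {
  const registered: string[] = [];
  for (const file of files) {
    const bytes = new Uint8Array(await file.content.arrayBuffer());
    await db.registerFileBuffer(file.name, bytes);
    registered.push(file.name);
  }
  return registered;
}

// Quote a value as a SQL string literal by doubling embedded single quotes.
function sqlStringLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

// Build a preview query for a registered CSV file using DuckDB's read_csv_auto.
function csvPreviewQuery(fileName: string, limit = 10): string {
  const safeLimit = Math.max(0, Math.trunc(limit));
  return `SELECT * FROM read_csv_auto(${sqlStringLiteral(fileName)}) LIMIT ${safeLimit}`;
}
```

File names come from users, so quoting them before they reach SQL matters. Two files with the same name in one chat would also collide in the virtual filesystem; registering under the file id instead of the name is one way around that.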

4. UI Changes

  • Add a files panel/section to show the attached files. This might live above the prompt area, or be available by clicking an icon (e.g., on mobile)
  • Allow deleting files
  • Allow downloading files (e.g., results of queries, artifacts created by LLM)
  • Maybe show all code blocks or other generated source as files in the files panel, so you can use or download them
  • Add preferences for when to include files in messages (e.g., file types or file sizes), with good defaults
  • Add preferences/controls for expiring files so they don't take up too much space
  • Figure out how to indicate that we want RAG to happen on attached files
  • Maybe add some new /slash commands for simple ways to interact with the files in the current chat (e.g., /embed <filename>, /delete <filename>, /download <filename>)
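
For the slash commands in the last item, parsing could be as simple as the following sketch. The FileCommand type and parseFileCommand name are hypothetical, and it covers only the three example commands listed above.

```typescript
// Hypothetical command shape covering the three example commands.
type FileCommand = {
  action: "embed" | "delete" | "download";
  filename: string;
};

// Parse "/embed <filename>", "/delete <filename>", or "/download <filename>".
// Returns null for anything else, including a command with no filename.
function parseFileCommand(input: string): FileCommand | null {
  const match = /^\/(embed|delete|download)\s+(.+)$/.exec(input.trim());
  if (!match) return null;
  return {
    action: match[1] as FileCommand["action"],
    filename: match[2].trim(),
  };
}
```

Everything after the command is treated as the filename, so names containing spaces work without quoting.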

Migration Strategy

We need to decide how to deal with the existing imageUrls[] (i.e., base64 encoded image strings) in the messages table when we do the next db migration. The table looks like this:

export type ChatCraftMessageTable = {
  id: string;
  date: Date;
  chatId: string;
  type: MessageType;
  model?: string;
  user?: User;
  func?: FunctionCallParams | FunctionCallResult;
  text: string;
  imageUrls?: string[]; // <-- these are big
  versions?: { id: string; date: Date; model: string; text: string; imageUrls?: string[] }[];
};

We have a few options:

  1. Delete Option

    • Pro: Simplest, cleanest approach
    • Pro: Reduces database size immediately
    • Con: Destroys user data without consent
    • Con: No way to recover if needed
  2. Archive Option

    • Pro: Preserves user data
    • Pro: Gives users control over their data
    • Con: More complex implementation
    • Con: Requires additional UI work
    • Con: Delays cleanup of large data
    • Con: Not clear if users would know how to recover images from this table
  3. Migrate to Files Option

    • Pro: Preserves recent images in new format
    • Con: Complex async migration
    • Con: We don't have filenames
    • Con: Risk of migration failure (e.g., user refreshes "hanging" tab)
    • Con: Performance impact during migration
    • Con: No clear way to handle partial failures
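
For the third option, each stored string would need to be decoded back into bytes before it can become a file. A minimal sketch, assuming the strings are base64 data URLs of the form data:<mime>;base64,<payload>. The function names are hypothetical, and the placeholder naming scheme is only one possible answer to the missing-filenames problem.

```typescript
interface DecodedImage {
  type: string; // mime-type taken from the data URL
  bytes: Uint8Array; // decoded binary content
}

// Decode a base64 data URL. Returns null for anything that is not one,
// so callers can skip entries that are already ordinary URLs.
function decodeBase64DataUrl(url: string): DecodedImage | null {
  const match = /^data:([^;,]+);base64,([A-Za-z0-9+/=]*)$/.exec(url);
  if (!match) return null;
  try {
    const binary = atob(match[2]);
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) {
      bytes[i] = binary.charCodeAt(i);
    }
    return { type: match[1], bytes };
  } catch {
    return null; // malformed base64 payload
  }
}

// Assumption: since no filename was stored, derive one from the file id
// and the mime subtype (e.g., "image/svg+xml" gives "svg").
function placeholderName(id: string, type: string): string {
  const ext = type.split("/")[1]?.split("+")[0] ?? "bin";
  return `image-${id.slice(0, 8)}.${ext}`;
}
```

Returning null instead of throwing lets a migration skip a bad entry and carry on, which helps with the partial-failure concern above.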

Proposed Implementation Phases

This is a complex change that needs to get done in pieces over many smaller issues/PRs.

  1. Phase 1: Basic Files Support

    • Add files table
    • Implement basic file storage
    • Add simple UI for file management
    • Choose and implement migration strategy
  2. Phase 2: DuckDB Integration

    • Connect files to DuckDB filesystem
    • Implement query support
    • Add basic file querying UI
  3. Phase 3: Enhanced Features

    • Add RAG support
    • Implement file expiry
    • Add advanced UI features
    • Add slash commands
  4. Phase 4: Optimization

    • Performance improvements
    • Storage optimization
    • UI/UX refinements
    • Additional file type support
@tarasglek
Owner

Re "imageUrls?: string[]; // <-- these are big": we don't have to migrate them in one go. We can keep em as URLs in the new schema and change them to dexie://files/id incrementally, e.g., take the 5 biggest ones on every startup until it's done. It doesn't have to be part of the Dexie migration.
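
A sketch of how the per-startup batch could be selected, assuming un-migrated images are still data: URLs and migrated ones already use the dexie://files/id form. The type and function names are hypothetical.

```typescript
// Only the fields this sketch needs from a message row.
interface MessageImages {
  id: string;
  imageUrls?: string[];
}

// Points at one un-migrated image inside a message.
interface ImageRef {
  messageId: string;
  index: number; // position within the message's imageUrls array
  length: number; // string length of the data URL, used as a size proxy
}

// Pick the largest images that are still stored inline as data URLs.
function pickLargestInlineImages(
  messages: MessageImages[],
  batchSize = 5
): ImageRef[] {
  const pending: ImageRef[] = [];
  for (const message of messages) {
    (message.imageUrls ?? []).forEach((url, index) => {
      if (url.startsWith("data:")) {
        pending.push({ messageId: message.id, index, length: url.length });
      }
    });
  }
  return pending.sort((a, b) => b.length - a.length).slice(0, batchSize);
}

// The replacement URL form suggested above.
function fileUrl(fileId: string): string {
  return `dexie://files/${fileId}`;
}
```

An empty result means there is nothing left to migrate. The versions[] entries in the messages table carry their own imageUrls and would need the same treatment, which this sketch leaves out.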

@tarasglek
Owner

Organizing files into projects like Claude does could be useful, e.g., you could assume that every chat with uploaded files is an anonymous project; once you name em, you can easily attach em to other chats. See https://www.anthropic.com/news/projects

@tarasglek
Owner

Re expiry, we can trigger removal of old files when we add new ones
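
Pruning on add could look roughly like this. The FileStore interface is a hypothetical stand-in for the Dexie files table so the logic can be shown in isolation, and the function names are assumptions.

```typescript
interface Expirable {
  id: string;
  expires: Date;
}

// Hypothetical storage interface standing in for the Dexie files table.
interface FileStore<T extends Expirable> {
  list(): Promise<T[]>;
  put(file: T): Promise<void>;
  deleteMany(ids: string[]): Promise<void>;
}

// Ids of files whose expiry time is at or before `now`.
function expiredFileIds(files: Expirable[], now: Date = new Date()): string[] {
  return files
    .filter((file) => file.expires.getTime() <= now.getTime())
    .map((file) => file.id);
}

// Remove expired files, then store the new one. Returns the removed ids.
async function addFileAndPrune<T extends Expirable>(
  store: FileStore<T>,
  file: T,
  now: Date = new Date()
): Promise<string[]> {
  // Never delete the id being (re)added; it is about to be refreshed.
  const expired = expiredFileIds(await store.list(), now).filter(
    (id) => id !== file.id
  );
  if (expired.length > 0) {
    await store.deleteMany(expired);
  }
  await store.put(file);
  return expired;
}
```

Listing every file on each add is fine for a sketch. With an index on expires, a ranged query would presumably avoid loading the whole table.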

@tarasglek
Owner

add logic to decide when to include a file's contents in the message (e.g., based on file type, size). For example, a small PDF would get added as a message but a large one would not

We could provide a little toggle next to each file for full or RAG mode. Then the place where the file was inserted could have either the text content inlined or search results.

Overall lgtm, let's do it!

@mulla028 added the db and enhancement labels on Jan 21, 2025