Document Loader Options #196

Open
alexanderjophus opened this issue Aug 15, 2024 · 3 comments

Comments

@alexanderjophus

Is your feature request related to a problem? Please describe.
My use case is a cron job that trains on a git repo. I'm using the git commit loader rather than the file loader (I'm not entirely sure this is the best choice for me).

The main thing for me is adding documents to a vector store iteratively, rather than loading everything once and being done.

Describe the solution you'd like
I'd love to filter which commits are loaded, similar to how I can filter for only files with certain extensions. In my specific scenario, I'd like to load only commits I haven't seen before.

Describe alternatives you've considered
A generic filter (a lambda-style function) that lets developers plug in their own conditions for whether a commit should be loaded.
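To make the request concrete, here is a minimal sketch of what such a filter could look like. `CommitInfo`, `load_filtered`, and `unseen_only` are hypothetical names invented for illustration; they are not part of langchain-rust or gix, and the commit type is heavily simplified.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a commit (not a langchain-rust or gix type).
#[derive(Debug, Clone, PartialEq)]
pub struct CommitInfo {
    pub id: String,
    pub message: String,
}

/// Keep only the commits accepted by a caller-supplied predicate.
pub fn load_filtered<F>(commits: Vec<CommitInfo>, mut keep: F) -> Vec<CommitInfo>
where
    F: FnMut(&CommitInfo) -> bool,
{
    commits.into_iter().filter(|c| keep(c)).collect()
}

/// Build a predicate that accepts a commit id the first time it is seen
/// and records it in `seen`, so later runs skip it.
pub fn unseen_only(seen: &mut HashSet<String>) -> impl FnMut(&CommitInfo) -> bool + '_ {
    // `HashSet::insert` returns true only when the value was not already present.
    move |c: &CommitInfo| seen.insert(c.id.clone())
}
```

For the cron job case, the `seen` set would presumably be persisted between runs (a file or a database table); that part is left out here.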

@alexanderjophus
Author

I think being able to provide our own filter, which we could plug in here, would be ideal: https://github.com/Abraxas-365/langchain-rust/blob/main/src/document_loaders/git_commit_loader/git_commit_loader.rs#L45

@alexanderjophus
Author

Also, letting users do their own mapping might be beneficial. I'm working on a PR that does both.

@alexanderjophus
Author

I've updated the PR (it's still a work in progress as I struggle my way through Rust's compiler).

The PR/ticket now aims to:

  • Update the document loader so that a user can pass in their own filter and map functions
  • This gives the user much more flexibility in how they upload documents to the vector store.
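As a rough illustration of the filter-and-map idea in the list above, the sketch below uses a builder-style loader. All names here (`RawCommit`, `Doc`, `CommitLoader`, `with_filter`, `with_map`) are hypothetical and chosen for the example; the actual loader and document types in langchain-rust differ, and the real loader reads from a repository rather than an in-memory `Vec`.

```rust
/// Hypothetical stand-in for a commit (not a gix or langchain-rust type).
#[derive(Debug, Clone)]
pub struct RawCommit {
    pub id: String,
    pub message: String,
}

/// Hypothetical stand-in for a document destined for a vector store.
#[derive(Debug, Clone, PartialEq)]
pub struct Doc {
    pub page_content: String,
}

// Boxed closures with `Send + Sync` bounds so the loader itself can cross threads.
type Filter = Box<dyn Fn(&RawCommit) -> bool + Send + Sync>;
type Mapper = Box<dyn Fn(&RawCommit) -> Doc + Send + Sync>;

pub struct CommitLoader {
    commits: Vec<RawCommit>,
    filter: Filter,
    mapper: Mapper,
}

impl CommitLoader {
    /// Defaults: keep every commit, use the commit message as the content.
    pub fn new(commits: Vec<RawCommit>) -> Self {
        Self {
            commits,
            filter: Box::new(|_: &RawCommit| true),
            mapper: Box::new(|c: &RawCommit| Doc { page_content: c.message.clone() }),
        }
    }

    /// Replace the default filter with a user-supplied predicate.
    pub fn with_filter(mut self, f: impl Fn(&RawCommit) -> bool + Send + Sync + 'static) -> Self {
        self.filter = Box::new(f);
        self
    }

    /// Replace the default mapping with a user-supplied conversion.
    pub fn with_map(mut self, f: impl Fn(&RawCommit) -> Doc + Send + Sync + 'static) -> Self {
        self.mapper = Box::new(f);
        self
    }

    /// Apply the filter first, then map the surviving commits to documents.
    pub fn load(&self) -> Vec<Doc> {
        self.commits
            .iter()
            .filter(|c| (self.filter)(*c))
            .map(|c| (self.mapper)(c))
            .collect()
    }
}
```

A caller might chain `CommitLoader::new(commits).with_filter(|c| c.message.starts_with("feat")).with_map(...)` and then call `load()`. Keeping defaults for both closures means existing callers would not need to change.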

The current issue I'm facing is the error below (along with many similar ones, all stemming from the Repo's fields being neither Send nor Sync):

`RefCell<Vec<Vec<u8>>>` cannot be shared between threads safely
within `gix::Repository`, the trait `Sync` is not implemented for `RefCell<Vec<Vec<u8>>>`, which is required by `gix::revision::walk::Info<'_>: std::marker::Send`
if you want to do aliasing and mutation between multiple threads, use `std::sync::RwLock` instead
required for `&gix::Repository` to implement `std::marker::Send`
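For context on this class of error: a type containing a `RefCell` is not `Sync`, so a shared reference to it is not `Send`. One general way around that, sketched below with the standard library only, is to create and use the non-`Sync` handle entirely inside one thread and let only owned, `Send` data cross the boundary. `LocalRepo` and `load_on_worker` are hypothetical stand-ins invented for this example; this is not `gix::Repository`, and it is an assumption that the same shape applies to the loader in the PR.

```rust
use std::cell::RefCell;
use std::thread;

/// Hypothetical stand-in for a repository handle with interior mutability.
/// The `RefCell` field makes this type `!Sync`, so `&LocalRepo` is `!Send`.
pub struct LocalRepo {
    path: String,
    buffers: RefCell<Vec<Vec<u8>>>,
}

impl LocalRepo {
    pub fn open(path: &str) -> Self {
        Self { path: path.to_string(), buffers: RefCell::new(Vec::new()) }
    }

    /// Pretend to walk commits, using a scratch buffer, and return owned strings.
    pub fn commit_messages(&self) -> Vec<String> {
        self.buffers.borrow_mut().push(Vec::new());
        vec![format!("commit from {}", self.path)]
    }
}

/// Open the handle inside the worker thread so that only a `String` goes in
/// and a `Vec<String>` comes out; the `!Sync` handle never crosses threads.
pub fn load_on_worker(path: String) -> Vec<String> {
    thread::spawn(move || {
        let repo = LocalRepo::open(&path);
        repo.commit_messages()
    })
    .join()
    .expect("worker thread panicked")
}
```

In async code the analogous move would be to finish the repository walk and collect owned values before any `.await`, or to run it on a blocking task. gix also provides a `ThreadSafeRepository` type that can be converted into a thread-local `Repository`, which may be worth a look, though whether it fits this loader is untested here.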

@alexanderjophus alexanderjophus changed the title Git Commit Loader Options Document Loader Options Aug 16, 2024