Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated Chunking Methods #45

Closed
wants to merge 0 commits into from
Closed

Conversation

PalmPalm7
Copy link
Contributor

@PalmPalm7 PalmPalm7 commented Jun 25, 2024

Demonstrated new chunking methods in replace of RecursiveCharacterTextSplitter()

Updated PR link: #65

Notable updates:

  1. Used a efficient library to detect file type.
  2. Applied document-specific test splitter from Langchain in replace of original naive version.
  3. Made heuristics changes to markdown file, especially using regex to trim markdown tables in attempt to fit in the whole table with limited context window.
  4. For updated chunk_document() function, see Chunking_Demo.ipynb on chunking with server_ctx_size=4096, chunk_word_count=1024). Granite 7b has 4k context windows.

@PalmPalm7
Copy link
Contributor Author

Demonstrated new chunking methods in replace of RecursiveCharacterTextSplitter()

Notable updates:

  1. Used a efficient library to detect file type.
  2. Applied document-specific test splitter from Langchain in replace of original naive version.
  3. Made heuristics changes to markdown file, especially using regex to trim markdown tables in attempt to fit in the whole table with limited context window.
  4. For updated chunk_document() function, see Chunking_Demo.ipynb on chunking with server_ctx_size=4096, chunk_word_count=1024). Granite 7b has 4k context windows.

@russellb @abhi1092 Hi Russel and Abhi, I have made final changes to the chunk_document() method and it passes all CI-Tests. Could you review my PR? Thanks

@mergify mergify bot added the needs-rebase label Jun 28, 2024
Copy link
Contributor

mergify bot commented Jun 28, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @PalmPalm7 please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@russellb
Copy link
Member

russellb commented Jun 29, 2024

The code moved to src/instructlab/sdg/utils/chunking.py, sorry for the trouble

@abhi1092
Copy link
Member

abhi1092 commented Jul 1, 2024

A couple of things @PalmPalm7 :

  • Is it assumed that if the file is not any of code file it always markdown?
  • How do you handle chunks that are larger than the provided server_ctx_size ?

@PalmPalm7
Copy link
Contributor Author

A couple of things @PalmPalm7 :

  • Is it assumed that if the file is not any of code file it always markdown?
  • How do you handle chunks that are larger than the provided server_ctx_size ?

Thank you for the review and the comments @abhi1092 !

My Justification is there is I assume our most common use cases, as we discussed, is PDF into markdown formats, therefore default case should be markdown. Furthermore, by specifying the language param in RecursiveCharacterTextSplitter, it uses these following separators:

RecursiveCharacterTextSplitter.get_separators_for_language(Language.MARKDOWN)

Output:
['\n#{1,6} ',
 '```\n',
 '\n\\*\\*\\*+\n',
 '\n---+\n',
 '\n___+\n',
 '\n\n',
 '\n',
 ' ',
 '']

Instead of these separators.

["\n\n", "\n", " "]

This is valid concern. I have used the original chunk_document()'s logic, but it doesn't properly handle over context length chunk either.

@abhi1092
Copy link
Member

abhi1092 commented Jul 1, 2024

Gotcha. Let's create an issue for 2. so this can be addressed in the future. Everything looks good.

Let me test this and we can merge it.

@PalmPalm7 could you resolve the the conflicts?

But could you resolve the merge conflicts?

@mergify mergify bot added the ci-failure label Jul 2, 2024
@PalmPalm7 PalmPalm7 closed this Jul 2, 2024
@mergify mergify bot removed the ci-failure label Jul 2, 2024
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024
Add policy document for using GitHub actions in workflows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD Affects CI/CD configuration needs-rebase testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants