A Python project that parses Markdown files into a tree structure, then processes them into semantically meaningful text chunks. This can be used to restructure or summarize large bodies of text while respecting a maximum token limit.
This repository provides a robust parser that:
- Converts a Markdown file into a hierarchical tree using custom classes (`TreeElement` and `SemanticChunk`).
- Splits oversized text sections into smaller parts based on sentence boundaries.
- Combines smaller chunks where possible, ensuring you stay under a predefined token limit.
- Produces plain-text output (or any format you choose) that lists these semantic chunks.
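The splitting and combining steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `count_tokens`, `split_by_sentences`, and `merge_small_chunks` are hypothetical names, the token counter is a whitespace-based stand-in for a real tokenizer, and the sentence splitter is a simple regex.

```python
import re
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_by_sentences(
    text: str,
    max_tokens: int,
    token_len: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split text at sentence boundaries, aiming for parts of at most max_tokens.

    A single sentence longer than max_tokens is kept whole as its own part.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    parts: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and token_len(candidate) > max_tokens:
            # Adding this sentence would exceed the limit: close the current part.
            parts.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts


def merge_small_chunks(
    chunks: List[str],
    max_tokens: int,
    token_len: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily join each chunk onto the previous one while the result fits the limit."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and token_len(merged[-1] + "\n\n" + chunk) <= max_tokens:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged


parts = split_by_sentences("One two three. Four five six. Seven eight nine.", max_tokens=6)
combined = merge_small_chunks(["a b", "c d", "e f g h"], max_tokens=4)
```

In this sketch the first call yields two parts (the first two sentences together, then the third), and the second call joins the two short chunks while leaving the longer one alone. Swapping in a real tokenizer only requires passing a different `token_len` callable.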
The project uses Poetry for dependency management and includes sample input and output files (`input.md` and `output.txt`) to demonstrate how the code works.
- Markdown to Tree: Uses `MarkdownNodeParser` from `llama_index.core.node_parser` (and custom logic) to convert Markdown into a hierarchical structure.
- Token-Aware Splitting: Splits or combines chunks based on token length, using a customizable token limit.
- Post-Order Traversal: Ensures children are processed before the parent, giving a logical structure to the output.
- Configurable Headers: Preserves header hierarchy in `SemanticChunk` objects.
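To illustrate the post-order traversal and header hierarchy mentioned above, here is a small sketch. The `Node` class and `post_order` function are hypothetical stand-ins invented for this example; the project's `TreeElement` and `SemanticChunk` classes may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node (assumption); not the project's TreeElement."""
    header: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def post_order(
    node: Node, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (header_path, text) pairs, visiting all children before their parent."""
    current_path = path + (node.header,)
    for child in node.children:
        yield from post_order(child, current_path)
    # The parent is emitted only after every descendant has been emitted.
    yield current_path, node.text


doc = Node("Guide", "Intro text.", [
    Node("Install", "Run the installer."),
    Node("Usage", "Call the parser.", [Node("Options", "Set max_tokens.")]),
])

# Each entry keeps the full chain of headers leading to the section.
visited = [" > ".join(path) for path, _ in post_order(doc)]
```

Here `visited` lists the leaf sections first and the root last, and each entry carries its full header path, which is the kind of context a chunk needs to stay meaningful once it is separated from its parent section.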
- Python 3.12.6+ (Recommended)
- Poetry (to manage dependencies)
- Clone the repository:

      git clone https://github.com/tsensei/Semantic-Markdown-Parser/
      cd Semantic-Markdown-Parser

- Install dependencies with Poetry:

      poetry install

  This will create a virtual environment (if needed) and install all the required libraries.
- Prepare your input Markdown file (e.g., `input.md`) with the content you want to parse.
- Run your parser code. You can modify or create a script that uses `SemanticMarkdownParser` to parse `input.md` and write the result to `output.txt` (or just print the results).

  Example Python snippet (assuming you have a `main.py` or similar entry point):

      from pathlib import Path

      from markdown_parser import SemanticMarkdownParser
      from token_encoder.encode import get_token_length

      if __name__ == "__main__":
          parser = SemanticMarkdownParser()
          input_text = Path("input.md").read_text(encoding="utf-8")

          # Parse the Markdown text into a tree
          root = parser.parse_markdown_to_tree(input_text)

          # Process the tree into chunks of at most 500 tokens
          chunks = parser.get_semantic_chunks(root, max_tokens=500)

          # Write each chunk and its token length to output.txt
          with open("output.txt", "w", encoding="utf-8") as file:
              for i, chunk in enumerate(chunks, 1):
                  file.write(f"\nChunk {i}:\n")
                  file.write("-" * 80 + "\n")
                  file.write(chunk + "\n")
                  file.write("-" * 80 + "\n")
                  file.write(f"Token length: {get_token_length(chunk)}\n")

          print("Output saved to output.txt")
- Run it with Poetry:

      poetry run python main.py

  Your parsed output should appear in `output.txt`.
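As an optional sanity check, the report can be scanned to confirm that no chunk exceeds the limit. The sketch below assumes the report contains one `Token length: N` line per chunk, as written by the example script; the helper names `reported_token_lengths` and `chunks_over_limit` are hypothetical and not part of the project.

```python
import re
from typing import List


def reported_token_lengths(report: str) -> List[int]:
    """Collect the numbers from every 'Token length: N' line, in order."""
    return [
        int(n)
        for n in re.findall(r"^Token length: (\d+)$", report, flags=re.MULTILINE)
    ]


def chunks_over_limit(report: str, max_tokens: int) -> List[int]:
    """Return the 1-based positions of chunks whose reported length exceeds max_tokens."""
    lengths = reported_token_lengths(report)
    return [i for i, n in enumerate(lengths, 1) if n > max_tokens]


# A small in-memory report in the same layout the example script writes.
rule = "-" * 80
sample_report = (
    f"\nChunk 1:\n{rule}\nHello world.\n{rule}\nToken length: 3\n"
    f"\nChunk 2:\n{rule}\nA much longer chunk.\n{rule}\nToken length: 512\n"
)

too_long = chunks_over_limit(sample_report, max_tokens=500)
```

To check a real run, read the file with `Path("output.txt").read_text(encoding="utf-8")` and pass the text to `chunks_over_limit` with the same `max_tokens` value used when chunking. An empty result means every reported length is within the limit.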
Contributions are welcome! Feel free to open issues or submit pull requests if you have suggestions or bug reports.
- Fork the repo on GitHub
- Clone your fork locally
- Create a feature branch (`git checkout -b feature/my-new-feature`)
- Commit and push your changes
- Open a Pull Request describing your changes
This project is open source and distributed under the MIT License. See the LICENSE file for details.
For any questions, feel free to open an issue on GitHub.