Semantic Markdown Parser

A Python project that parses Markdown files into a tree structure, then processes them into semantically meaningful text chunks. This can be used to restructure or summarize large bodies of text while respecting a maximum token limit.

Description

This repository provides a robust parser that:

Converts a Markdown file into a hierarchical tree using custom classes (TreeElement and SemanticChunk).
Splits oversized text sections into smaller parts based on sentence boundaries.
Combines smaller chunks where possible, ensuring you stay under a predefined token limit.
Produces text output (or any other format you choose) that presents these semantic chunks.
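
The splitting and combining steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `count_tokens`, `split_by_sentences`, and `merge_chunks` are hypothetical names, tokens are approximated by whitespace-separated words rather than a real tokenizer, and the sentence-boundary rule is deliberately naive.

```python
import re
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Assumption: approximate tokens as whitespace-separated words."""
    return len(text.split())


def split_by_sentences(
    text: str,
    max_tokens: int,
    token_len: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Split text at sentence boundaries so each part fits max_tokens when possible.

    A single sentence longer than max_tokens is kept whole in this sketch.
    """
    # Naive boundary rule: split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    parts: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and token_len(candidate) > max_tokens:
            # Adding this sentence would exceed the limit: close the current part.
            parts.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts


def merge_chunks(
    chunks: List[str],
    max_tokens: int,
    token_len: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily join adjacent chunks while the result stays within max_tokens."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and token_len(merged[-1] + "\n\n" + chunk) <= max_tokens:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged
```

Both functions accept a `token_len` callable, so a real tokenizer-based counter can be swapped in without changing the splitting or merging logic.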

The project uses Poetry for dependency management and includes sample input and output files (input.md and output.txt) to demonstrate how the code works.

Features

Markdown to Tree: Uses MarkdownNodeParser from llama_index.core.node_parser (and custom logic) to convert Markdown into a hierarchical structure.
Token-Aware Splitting: Splits or combines chunks based on token length, using a customizable token limit.
Post-Order Traversal: Ensures children are processed before the parent, giving a logical structure to the output.
Configurable Headers: Preserves header hierarchy in SemanticChunk objects.
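
To illustrate the tree, traversal, and header features listed above, here is a minimal sketch that uses only the standard library. It does not call `MarkdownNodeParser`, and `Node`, `build_tree`, and `post_order` are hypothetical names; the repository's `TreeElement` and `SemanticChunk` classes may be structured differently. As a simplification, `#` lines inside fenced code blocks are treated as headers.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node: one Markdown section and its subsections."""
    title: str
    level: int
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def build_tree(markdown: str) -> Node:
    """Nest sections by ATX header level ('#' to '######') under a synthetic root."""
    root = Node(title="", level=0)
    stack = [root]  # path from the root to the section currently being filled
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            node = Node(title=match.group(2).strip(), level=len(match.group(1)))
            # Close sections at the same or a deeper level than the new header.
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            # Non-empty body line: attach it to the innermost open section.
            stack[-1].text = f"{stack[-1].text}\n{line}".strip()
    return root


def post_order(
    node: Node, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (header_path, text) pairs, visiting children before their parent."""
    here = path + (node.title,) if node.title else path
    for child in node.children:
        yield from post_order(child, here)
    if node.text:
        yield here, node.text


if __name__ == "__main__":
    doc = "# Guide\nIntro.\n## Setup\nInstall it.\n## Usage\nRun it."
    for headers, text in post_order(build_tree(doc)):
        print(" > ".join(headers), "|", text)
```

Carrying the header path alongside each piece of text is one way to keep the hierarchy available when chunks are later split or merged.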

Getting Started

Prerequisites

Python 3.12.6+ (Recommended)
Poetry (to manage dependencies)

Installation

Clone the repository:

```
git clone https://github.com/tsensei/Semantic-Markdown-Parser/
cd Semantic-Markdown-Parser
```

Install dependencies with Poetry:
```
poetry install
```

This will create a virtual environment (if needed) and install all the required libraries.

Usage

Prepare your input Markdown file (e.g., input.md) with the content you want to parse.

Run your parser code. You can modify or create a script that uses SemanticMarkdownParser to parse input.md and produce an output.txt (or just print results).

Example Python snippet (assuming you have a main.py or similar entry point):

```python
from pathlib import Path

from markdown_parser import SemanticMarkdownParser
from token_encoder.encode import get_token_length

if __name__ == "__main__":
    parser = SemanticMarkdownParser()
    input_text = Path("input.md").read_text(encoding="utf-8")

    # Parse the Markdown text into a tree
    root = parser.parse_markdown_to_tree(input_text)

    # Process the tree into chunks, using a 500-token limit
    chunks = parser.get_semantic_chunks(root, max_tokens=500)

    # Write each chunk and its token length to output.txt
    with open("output.txt", "w", encoding="utf-8") as file:
        for i, chunk in enumerate(chunks, 1):
            file.write(f"\nChunk {i}:\n")
            file.write("-" * 80 + "\n")
            file.write(chunk + "\n")
            file.write("-" * 80 + "\n")
            file.write(f"Token length: {get_token_length(chunk)}\n")

    print("Output saved to output.txt")
```

Run it with Poetry:
```
poetry run python main.py
```
Your parsed output should appear in output.txt.
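
As a quick sanity check, the report written by the snippet above can be read back to confirm that no chunk exceeds the limit. The sketch below relies only on the `Token length: N` lines that the snippet writes; `read_token_lengths` and `oversized_chunks` are hypothetical helpers, not part of the repository, and a chunk whose own text contains a line of that form would be miscounted.

```python
import re
from typing import List


def read_token_lengths(report: str) -> List[int]:
    """Collect the numbers from every 'Token length: N' line in the report."""
    pattern = r"^Token length: (\d+)$"
    return [int(n) for n in re.findall(pattern, report, flags=re.MULTILINE)]


def oversized_chunks(report: str, max_tokens: int) -> List[int]:
    """Return 1-based positions of chunks whose recorded length exceeds max_tokens."""
    lengths = read_token_lengths(report)
    return [i for i, n in enumerate(lengths, 1) if n > max_tokens]


# Small in-memory report in the same layout as output.txt (sample data only).
rule = "-" * 80
sample_report = (
    f"\nChunk 1:\n{rule}\nFirst chunk text\n{rule}\nToken length: 3\n"
    f"\nChunk 2:\n{rule}\nSecond chunk text\n{rule}\nToken length: 620\n"
)
print(oversized_chunks(sample_report, max_tokens=500))
```

To check a real run, pass `Path("output.txt").read_text(encoding="utf-8")` as `report` instead of the sample string.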

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests if you have suggestions or bug reports.

Fork the repo on GitHub
Clone your fork locally
Create a feature branch (git checkout -b feature/my-new-feature)
Commit and push your changes
Open a Pull Request describing your changes

License

This project is open source and distributed under the MIT License. See the LICENSE.txt file for details.

Contact

For any questions, feel free to open an issue on GitHub.
