Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✏️ Better HTML conversion options #98

Merged
merged 10 commits into from
Dec 6, 2024
Merged

✏️ Better HTML conversion options #98

merged 10 commits into from
Dec 6, 2024

Conversation

asim-shrestha
Copy link
Contributor

@asim-shrestha asim-shrestha commented Dec 5, 2024

Important

Adds HTML conversion feature with markdown and text options, including error handling and tests.

  • Behavior:
    • Adds html_converter_type parameter to capture_html() in core.py for specifying conversion type.
    • Raises UnknownHTMLConverter error for unsupported types in get_html_converter().
  • HTML Conversion:
    • Introduces HTMLToMarkdownConverter and HTMLToTextConverter for HTML to markdown and text conversion.
    • Implements get_html_converter() in html_converter/__init__.py to select converter based on type.
  • Errors:
    • Adds UnknownHTMLConverter exception in errors.py for handling unknown types.
  • Tests:
    • Adds test_capture_html_conversion_types() in test_e2e.py to verify conversion to markdown and text.
    • Includes mock HTML files heading.html and table.html for testing.
  • Misc:
    • Updates version to 0.53.0 in pyproject.toml and uv.lock.

This description was created by Ellipsis for ed3ae6b. It will automatically update as commits are pushed.

@asim-shrestha asim-shrestha changed the title ✨ Better parsing ✏️ Better HTML conversion options Dec 5, 2024
@@ -4,6 +4,13 @@ class HarambeException(Exception):
pass


class UnknownHTMLConverter(HarambeException):
def __init__(self, converter_type: any) -> None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The converter_type parameter should be typed as Any instead of any.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ellipsis is right, it should be Any from the typing module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call


def get_html_converter(
html_converter_type: HTMLConverterType | None,
) -> MarkdownConverter:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type of get_html_converter should be Union[HTMLToMarkdownConverter, HTMLToTextConverter] instead of MarkdownConverter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no

)

assert len(observer.data) == 1
assert observer.data[0]["text"].strip() == (
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected output in test_capture_html_table should not contain markdown syntax when using html_converter_type="text". Adjust the expected output accordingly.

@asim-shrestha asim-shrestha merged commit bd4e61b into main Dec 6, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants