✏️ Better HTML conversion options #98
Merged
Showing changes from 4 of 10 commits.
Commits:
295bc7a ✨ Converters (asim-shrestha)
ebb26ee ✨ Converters (asim-shrestha)
3f4652c ✨ Converters (asim-shrestha)
eb35ecb ✨ Converters (asim-shrestha)
32e558c ✨ Converters (asim-shrestha)
8ec7d92 ✨ Converters (asim-shrestha)
79c89af ✨ Converters (asim-shrestha)
a3662c9 ✨ Converters (asim-shrestha)
9e2ec2e ✨ Typing (asim-shrestha)
ed3ae6b ✨ Typing (asim-shrestha)
@@ -0,0 +1,20 @@
from typing import Literal

from harambe_core.errors import UnknownHTMLConverter
from markdownify import MarkdownConverter

from sdk.harambe.html_converter.html_to_markdown import HTMLToMarkdownConverter
from sdk.harambe.html_converter.html_to_text import HTMLToTextConverter

HTMLConverterType = Literal["markdown", "text"]


def get_html_converter(
    html_converter_type: HTMLConverterType | None,
) -> MarkdownConverter:
    if html_converter_type == "markdown":
        return HTMLToMarkdownConverter()
    if html_converter_type == "text":
        return HTMLToTextConverter()
    else:
        raise UnknownHTMLConverter(html_converter_type)

Review comment at `) -> MarkdownConverter:`: The return type of
Reply: no
asim-shrestha marked this conversation as resolved.
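The factory in the diff above maps a `Literal` string to a converter instance; passing `None` or any other value falls through to the `else` branch and raises. A minimal, standard-library-only sketch of the same dispatch follows. The classes `BaseConverter`, `MarkdownStandIn`, and `TextStandIn`, the `_CONVERTERS` registry, and the local `UnknownHTMLConverter` are hypothetical stand-ins for illustration, not the project's or markdownify's real classes.

```python
from typing import Literal, Optional

HTMLConverterType = Literal["markdown", "text"]


class UnknownHTMLConverter(ValueError):
    """Hypothetical stand-in for harambe_core.errors.UnknownHTMLConverter."""

    def __init__(self, converter_type: object) -> None:
        super().__init__(f"Unknown HTML converter type: {converter_type!r}")


class BaseConverter:
    """Hypothetical stand-in for markdownify.MarkdownConverter."""


class MarkdownStandIn(BaseConverter):
    """Plays the role of HTMLToMarkdownConverter."""


class TextStandIn(BaseConverter):
    """Plays the role of HTMLToTextConverter."""


# Registry keyed by the same strings the Literal type allows.
_CONVERTERS: dict[str, type[BaseConverter]] = {
    "markdown": MarkdownStandIn,
    "text": TextStandIn,
}


def get_html_converter(
    html_converter_type: Optional[HTMLConverterType],
) -> BaseConverter:
    # None is not a registry key, so it raises just like an unknown string.
    converter_cls = _CONVERTERS.get(html_converter_type)
    if converter_cls is None:
        raise UnknownHTMLConverter(html_converter_type)
    return converter_cls()
```

A registry is only one way to write this; the `if` chain in the diff behaves the same for the two supported values. Annotating the return type with the shared base class is valid for both branches because each returned object is an instance of a subclass.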
@@ -0,0 +1,5 @@
from markdownify import MarkdownConverter


class HTMLToMarkdownConverter(MarkdownConverter):
    pass
@@ -0,0 +1,52 @@
from bs4.element import Tag
from markdownify import MarkdownConverter


class HTMLToTextConverter(MarkdownConverter):
    """
    Custom converter to convert data from HTML to text

    Strip out standard markdown syntax like headings, em, strong, a, etc.
    Include footnotes in brackets
    """

    def convert_sup(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return f"[{text}]"

    def convert_sub(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return f"[{text}]"

    def convert_span(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        if el.get("class") and "sup" in el.get("class"):
            return f"[{text}]"
        if el.get("class") and "sub" in el.get("class"):
            return f"[{text}]"
        return text

    def convert_h1(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_p(el, text, convert_as_inline)

    def convert_h2(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_p(el, text, convert_as_inline)

    def convert_h3(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_p(el, text, convert_as_inline)

    def convert_h4(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_p(el, text, convert_as_inline)

    def convert_h5(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_p(el, text, convert_as_inline)

    def convert_h6(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_p(el, text, convert_as_inline)

    # Treat inline elements as spans
    def convert_strong(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_span(el, text, convert_as_inline)

    def convert_em(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_span(el, text, convert_as_inline)

    def convert_a(self, el: Tag, text: str, convert_as_inline: bool) -> str:
        return self.convert_span(el, text, convert_as_inline)
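The text converter above works by overriding per-tag `convert_<tag>` methods, so headings are handled like paragraphs and superscripts or subscripts are wrapped in brackets. The sketch below illustrates that name-based dispatch idea using only the standard library. `TinyTextConverter` is a hypothetical toy, not markdownify's implementation: it takes a simplified one-argument handler signature and assumes well-formed input in which every tag is explicitly closed.

```python
from html.parser import HTMLParser


class TinyTextConverter(HTMLParser):
    """Hypothetical toy: looks up a convert_<tag> method for each closed tag."""

    def __init__(self) -> None:
        super().__init__()
        # Stack of (tag name, collected text pieces); the root frame holds output.
        self._stack: list[tuple[str, list[str]]] = [("root", [])]

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, []))

    def handle_data(self, data):
        self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        # Ignore stray closing tags that do not match the open element.
        if len(self._stack) == 1 or self._stack[-1][0] != tag:
            return
        name, pieces = self._stack.pop()
        text = "".join(pieces)
        # Name-based dispatch: use convert_<tag> if defined, else pass text through.
        handler = getattr(self, f"convert_{name}", None)
        converted = handler(text) if handler else text
        self._stack[-1][1].append(converted)

    def convert_sup(self, text: str) -> str:
        return f"[{text}]"

    def convert_sub(self, text: str) -> str:
        return f"[{text}]"

    def convert_h3(self, text: str) -> str:
        # Plain paragraph-style output instead of a "### " markdown heading.
        return f"{text}\n\n"

    def convert(self, html: str) -> str:
        self.feed(html)
        self.close()
        return "".join(self._stack[0][1])


result = TinyTextConverter().convert("<h3>Heading</h3><p>E = mc<sup>2</sup></p>")
```

Here `result` holds the heading as plain text followed by a blank line, then `E = mc[2]`. Tags without a matching method, such as `p` in this toy, simply pass their text through.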
@@ -0,0 +1,9 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
</head>
<body>
<h3>Heading</h3>
</body>
</html>
Review comment: The `converter_type` parameter should be typed as `Any` instead of `any`.
Reply: ellipsis is right, it should be `Any` from the typing module
Reply: good call
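The distinction raised in this thread can be shown in a few lines. The builtin `any` is a function over iterables, while `typing.Any` is the annotation meaning "any value is acceptable". The function `describe` below is hypothetical and exists only to carry the annotation; the code that declares `converter_type` is not shown in this diff.

```python
from typing import Any, get_type_hints


def describe(converter_type: Any) -> str:
    """Hypothetical function; only the annotation matters here."""
    return f"converter_type={converter_type!r}"


# The builtin `any` is a function that tests iterables, not a type.
builtin_any_is_a_type = isinstance(any, type)

# `typing.Any` is what the annotation resolves to when inspected.
hints = get_type_hints(describe)
```

Python does not reject `converter_type: any` at runtime, which is why the mistake is easy to miss; static type checkers generally flag it because a function is not valid as a type.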