Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH adding write_html to TableReport #1190

Merged
merged 15 commits into from
Jan 6, 2025

Conversation

mrastgoo
Copy link
Contributor

@mrastgoo mrastgoo commented Dec 7, 2024

closes #1174

Adding write_html to TableReport.
The added test only checks for the raise error and if the file exist
The following code for checking the content failed and needs further investigation

    with open(f_name, 'r') as file:
        saved_content = file.read()  
    assert saved_content == report.html()

skrub/_reporting/tests/test_table_report.py Outdated Show resolved Hide resolved
skrub/_reporting/tests/test_table_report.py Outdated Show resolved Hide resolved
report = TableReport(df)

with TemporaryDirectory() as td:
f_name = Path(td) / Path("report.html")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it if filename

skrub/_reporting/tests/test_table_report.py Outdated Show resolved Hide resolved
f_name = Path(td) / Path("report.html")

if filename_path == "str":
report.write_html(f_name.absolute())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to pass .absolute here?

Comment on lines 138 to 146
if filename_path == "str":
report.write_html(f_name.absolute())

if filename_path == "Path":
report.write_html(f_name)

if filename_path == "file_object":
file_object = open(f_name, "w", encoding="utf-8")
report.write_html(file_object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer something like:

tmp_file_path = ...
if filename_type == "str":
    filename = str(tmp_file_path)
elif filename_type == "file_object":
     filename = open(tmp_file_path, "w", encoding="utf-8")
else:
     filename = tmp_file_path

report.write_html(filename)
assert tmp_file_path.exists()

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mrastgoo , thanks a lot for working on this! It's a great addition 🎉 🚀 . Even in the skrub tests and documentation scripts there are a few places we could use it :)

also the fact that the reports are self-contained and can be seen after the python process finishes is one of their main advantages (which required some compromises) and by adding this method you'll make that feature more visible and help users take advantage of it :)

@@ -197,6 +198,30 @@ def _repr_mimebundle_(self, include=None, exclude=None):
def _repr_html_(self):
return self._repr_mimebundle_()["text/html"]

def write_html(self, filename):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! In the issue I forgot to mention a couple of things:

  • Sometimes generating the report can fail (unfortunately!). We want to make sure we close the file (if we opened it) in this case; also we may want to avoid truncating or creating it before we know we have something to write.
  • The report will always set the charset to utf-8 in the page's metadata. So if the user passes an open file we may optionally check the file's encoding if we want to be extra cautious. Also note unless the doctring we write forbids it the user could pass a binary stream so we may want to handle that case, too.
  • I'm not sure we need to enforce the filename suffix. I could imagine someone wanting to save to "report.htm" or "report.html.tmp" or something so I'd rather let the user name the file however they want.
  • If we allow both a path and a file object maybe the parameter could be just "file" instead of "filename"?

So taking those details into account I guess the function could be modified slightly to look similar to this:

    def write_html(self, file):
        html = self.html()
        if isinstance(file, (str, Path)):
            with open(file, "w", encoding="utf8") as stream:
                stream.write(html)
            return
        try:
            file.write(html.encode("utf-8"))
            return
        except TypeError:
            pass
        if (encoding := getattr(file, "encoding", None)) is not None:
            try:
                assert codecs.lookup(encoding).name == "utf-8"
            except (AssertionError, LookupError):
                raise ValueError(
                    f"If `file` is a text file it should use utf-8 encoding; got: {encoding!r}"
                )
        file.write(html)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you comments @jeromedockes , I have couple of questions.

  1. What are the reasons behind TypeError
  2. if the encoding is None it seems the locale.encoding will be used, so we want to check for that too and raise an error if it is not utf-8 ?

Copy link
Member

@jeromedockes jeromedockes Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mrastgoo ,

  1. What are the reasons behind TypeError

Trying to write bytes and catching the TypeError is an easy way to detect if we are dealing with a text stream or a binary stream.

A file object (a.k.a. stream) can be:

  • In binary mode, in which case it expects bytes that can be written to a file without transformation. If we want to write text we must encode it into bytes ourselves. Passing a str to write raises a TypeError.
  • In text mode, in which case it expects strings and it will transparently encode them into bytes (and optionally translate newline characters) before writing the resulting bytes to the underlying binary buffer. Passing a bytes to write raises a TypeError (the one we catch).
>>> 'é'
'é'
>>> 'é'.encode('utf-8')
b'\xc3\xa9'

Binary mode:

>>> binary_stream = open('thefile', 'wb') # 'wb' = 'write + binary'
>>> binary_stream.write('é'.encode('utf-8')) # we can write bytes
2
>>> binary_stream.write(b'\xc3\xa9')
2
>>> binary_stream.write('é') # trying to write a string directly is a TypeError
Traceback (most recent call last):
    ...
TypeError: a bytes-like object is required, not 'str'
>>> binary_stream.close()

Conversely, if we use text mode (the default):

>>> text_stream = open('thefile', 'wt', encoding='utf-8') # 'write + text', or 'w' as text is the default
>>> text_stream.write('é') # we can write strings
1
>>> text_stream.write(b'\xc3\xa9'.decode('utf-8'))
1

and this is the error we catch in the current version of TableReport.write_html:

>>> text_stream.write(b'\xc3\xa9') # trying to write bytes to a text stream is a TypeError
Traceback (most recent call last):
    ...
TypeError: write() argument must be str, not bytes
>>> text_stream.close()

So we need to know if the file object the user gave us is a text or binary stream. If we knew it was created with open we could inspect its mode:

>>> with open('thefile', 'w') as stream:
...    print(stream.mode)
w

>>> with open('thefile', 'wb') as stream:
...    print(stream.mode)
wb

But that may not always work with other kinds of file objects:

>>> import io
>>> stream = io.BytesIO()
>>> stream.write(b'\xc3\xa9')
2
>>> stream.mode
Traceback (most recent call last):
    ...
AttributeError: '_io.BytesIO' object has no attribute 'mode'

All well-behaved file objects however will respect the rule of raising a TypeError if we give them the wrong type of data. So what we can do is first suppose it is in binary mode and try to write bytes, and if we get a TypeError try again with a string (or we could do it the other way around, too).

  1. if the encoding is None it seems the locale.encoding will be used, so we want to check for that too and raise an error if it is not utf-8 ?

If we pass encoding=None to open it will default to using locale.encoding, but AFAIK the resulting file object will always have an explicit encoding (not None).

>>> with open('thefile', 'w', encoding=None) as f:
...     print(f.encoding)
UTF-8

So I think it should be ok to inspect the encoding of the file object in this case.

Some other kinds of text stream may not have an explicit encoding though. For example io.StringIO writes to an in-memory python string, so it accepts strings but does not need to encode them (it doesn't need to convert to bytes because the backing storage is a unicode string)

>>> stream = io.StringIO()
>>> stream.encoding is None
True
>>> stream.write('é') # ok
1

In the case of StringIO it's fine to copy the report string to the stream's buffer even if we couldn't figure out what encoding it uses (because it doesn't use any). In some other cases I guess looking at the encoding attribute may not give a result even if a (possibly wrong) encoding is being used. But I guess at this point we have to trust the user to know what they are doing and at least we have made a reasonable effort to prevent the most likely issue, which is doing this on windows:

>>> with open('thefile', 'w') as f:
...     report.write_html(f)

(Even that will go away in python 3.15 when utf-8 becomes the default).
So my suggestion is that if we easily detect a wrong encoding is being used by checking the encoding attribute, we raise an error because we are sure that writing will result in a file with an incorrect <meta charset='utf-8'> tag which is likely to cause it to display incorrectly in a browser. If we cannot figure out for sure that a wrong encoding was used we just go ahead and write the report and hope for the best.

sorry for the long answer, hope this makes sense!

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to see the use case for allowing the user to pass a file instead of a filename. According to some path, this feature should address the simple use case of writing an HTML file to disk. Can you motivate this a little bit? It looks like plain over-engineering to me TBH.

Copy link
Member

@jeromedockes jeromedockes Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it was a suggestion from @glemaitre .

but if you look at functions offering to save some object as text such as matplotlib.savefig, pandas or polars to_csv etc, xml.etree.write, plotly write_html and others it's hard to find an example that does not allow passing a file object.

one example use case would be using a temporary file, testing, situations in which you may need to write either to an actual file or an in-memory buffer, ...

from personal experience a while ago in a completely unrelated context NiftiImages from the nibabel library could only be saved to a filesystem path and I can't remember what were the exact situation where it was a problem for me but I remember it was annoying

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jeromedockes for the explanation. it helped a lot.

@@ -123,6 +127,42 @@ def test_duration(df_module):
assert re.search(r"2(\.0)?\s+days", TableReport(df).html())


@pytest.mark.parametrize("filename_type", ["str", "Path", "file_object"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to be super exhaustive maybe we could test files both in text and binary modes?

df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
report = TableReport(df)

with TemporaryDirectory() as td:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could use the pytest tmp_path fixture?

filename = tmp_file_path

report.write_html(filename)
assert tmp_file_path.exists()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to check the full content but maybe we could see that the file contains "</html>" to see that the full report has been written

@jeromedockes
Copy link
Member

Users will be interested in this feature so we also need a changelog entry :) thanks!

@mrastgoo
Copy link
Contributor Author

mrastgoo commented Dec 9, 2024

I am block to make the test for platform encoding. The following test doesn't creates the error message for the moment.

def test_write_html_platform_encoding_not_utf8(tmp_path, pd_module, monkeypatch):
    df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
    report = TableReport(df)
    monkeypatch.setattr(locale, "getencoding", lambda: "ascii")

    filename = open(tmp_path / Path("report.html"), "w", encoding=None)

    err_msg = f"Platform encoding is not utf-8; got {locale.getencoding()}"
    with pytest.raises(ValueError, match=err_msg):
        report.write_html(filename)

In addition the CI fails for the binary mode. Any suggestion @jeromedockes

@jeromedockes
Copy link
Member

I am block to make the test for platform encoding. The following test doesn't creates the error message for the moment.

I don't think we can set the default encoding just by patching locale.getencoding:

>>> import locale
>>> locale.getencoding = lambda: 'ascii'

>>> with open("thefile", "w", encoding=None) as f:
...     print(f.encoding)
UTF-8

I think it's fine not to test this, just testing when we open a file for which we explicitly set a wrong encoding is enough IMO.

In addition the CI fails for the binary mode. Any suggestion @jeromedockes

from the log it seems it fails when trying to read back the report to check the content, right? I think it is because we saved the report in utf-8 (as we should) but when we open it for reading we don't set an explicit encoding so on windows it ends up using the incorrect default encoding. I think if we replace

    with open(tmp_file_path, "r") as file:

with

    with open(tmp_file_path, "r", encoding="utf-8") as file:

it should fix the issue

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the tests are passing 🎉 a couple more details but we are almost there :)

skrub/_reporting/_table_report.py Outdated Show resolved Hide resolved
CHANGES.rst Outdated Show resolved Hide resolved
"filename_type",
["str", "Path", "file_object", "binary_mode"],
)
def test_write_html(tmp_path, pd_module, filename_type):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test is looking great! the last thing we need to take care of is to make sure we close the file if we opened it (otherwise we "leak" a resource: the opened file handle. that could be a problem for example to clean up the temp directory on windows as it refuses to remove files that have an open file handle). Usually we ensure that with a simple context manager like:

with open(tmp_file_path, 'w', encoding='utf-8') as file:
    file.write('hello')

but here we have a tricky situation because in some cases we have a string or path (which require no closing) and sometimes we have a file object which does require closing.

The standard library module contextlib provides 2 ways to deal with that situation easily. The first is ExitStack: it creates a context and we can push as many context managers as we want to its stack; when it exits it unwinds the stack, calling each manager's __exit__ when it is popped. So we could use it like:

with contextlib.ExitStack() as stack:
    if file_type == 'str':
        file = str(tmp_file_path)
    elif file_type == 'text_file_object':
        file = stack.enter_context(open(tmp_file_path, 'w', encoding='utf-8'))
    # ...

    report.write_html(file)

# if we opened it the file is closed here when we exit the `with` block

This option using ExitStack is my favorite because the file is being managed by a context manager as soon as it is opened.

Another way is to use nullcontext in the cases where we do not open the file, so that later we can treat all options as if they were open files that implement the context manager protocol. nullcontext returns an object that implements the context manager protocol but whose __enter__ just returns the object we gave it and __exit__ does nothing:

if file_type == 'str':
    file = contextlib.nullcontext(str(tmp_file_path))
elif file_type == 'text_file_object':
    file = open(tmp_file_path, 'w', encoding='utf-8')

with file:
    report.write_html(file)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow contextlib.ExitStack() is very nice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could add a comment on L139 to explain why it's a good idea to use contextlib here?

df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
report = TableReport(df)

filename = open(tmp_path / Path("report.html"), "w", encoding="latin-1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also we want to use a context manager to close the file

Comment on lines 168 to 169
report.write_html(filename)
assert not filename.exists()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
report.write_html(filename)
assert not filename.exists()
report.write_html(filename)
assert not filename.exists()

otherwise the assert is never executed because of the exception raised on the line above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually change this to check if the file doesn't contain html to make sure we don't modify it.

@jeromedockes
Copy link
Member

The following code for checking the content failed and needs further investigation

sorry I forgot to answer that. The reason is that every time we generate the report with report.html() or report.html_snippet(), the report generates a new unique id attribute for the html div it creates, so that we don't have problems with duplicate IDs in the same html page if we insert the same report several times in a page (for example by having it as the output of several notebook cells). so the html will not be exactly the same because an id attribute changes but just checking the </html> tag as you do now in the text is enough

actually I'm thinking now that we don't need to vary the id in the report.html() case, only for html_snippet() but in any case if we decide to change that it will be in a different pull request

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thanks so much for this very nice addition @mrastgoo 🚀

@glemaitre and @Vincent-Maladiere LMK if you want to have another look otherwise I'll merge it

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a few lines of comment to explain the rational behind the steps of the function, from what has been discussed here

f" {encoding!r}"
)

file.write(html)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment explaining what html is expected (or not expected) to be at line 230?
Additionally, light inline documentation/comments on the steps above would help readability :)

"filename_type",
["str", "Path", "file_object", "binary_mode"],
)
def test_write_html(tmp_path, pd_module, filename_type):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could add a comment on L139 to explain why it's a good idea to use contextlib here?

@mrastgoo
Copy link
Contributor Author

Let's add a few lines of comment to explain the rational behind the steps of the function, from what has been discussed here

I added some comments in the function, Let me know if this is ok

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mrastgoo!

@Vincent-Maladiere Vincent-Maladiere merged commit d539079 into skrub-data:main Jan 6, 2025
24 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a method to save TableReports to a file
4 participants