Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH adding write_html to TableReport #1190

Merged
merged 15 commits into from
Jan 6, 2025
21 changes: 21 additions & 0 deletions skrub/_reporting/_table_report.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
import functools
import json
from pathlib import Path

from ._html import to_html
from ._serve import open_in_browser
Expand Down Expand Up @@ -197,6 +198,26 @@ def _repr_mimebundle_(self, include=None, exclude=None):
def _repr_html_(self):
return self._repr_mimebundle_()["text/html"]

def write_html(self, filename):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! In the issue I forgot to mention a couple of things:

  • Sometimes generating the report can fail (unfortunately!). We want to make sure we close the file (if we opened it) in this case; also we may want to avoid truncating or creating it before we know we have something to write.
  • The report will always set the charset to utf-8 in the page's metadata. So if the user passes an open file we may optionally check the file's encoding if we want to be extra cautious. Also note unless the doctring we write forbids it the user could pass a binary stream so we may want to handle that case, too.
  • I'm not sure we need to enforce the filename suffix. I could imagine someone wanting to save to "report.htm" or "report.html.tmp" or something so I'd rather let the user name the file however they want.
  • If we allow both a path and a file object maybe the parameter could be just "file" instead of "filename"?

So taking those details into account I guess the function could be modified slightly to look similar to this:

    def write_html(self, file):
        html = self.html()
        if isinstance(file, (str, Path)):
            with open(file, "w", encoding="utf8") as stream:
                stream.write(html)
            return
        try:
            file.write(html.encode("utf-8"))
            return
        except TypeError:
            pass
        if (encoding := getattr(file, "encoding", None)) is not None:
            try:
                assert codecs.lookup(encoding).name == "utf-8"
            except (AssertionError, LookupError):
                raise ValueError(
                    f"If `file` is a text file it should use utf-8 encoding; got: {encoding!r}"
                )
        file.write(html)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you comments @jeromedockes , I have couple of questions.

  1. What are the reasons behind TypeError
  2. if the encoding is None it seems the locale.encoding will be used, so we want to check for that too and raise an error if it is not utf-8 ?

Copy link
Member

@jeromedockes jeromedockes Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mrastgoo ,

  1. What are the reasons behind TypeError

Trying to write bytes and catching the TypeError is an easy way to detect if we are dealing with a text stream or a binary stream.

A file object (a.k.a. stream) can be:

  • In binary mode, in which case it expects bytes that can be written to a file without transformation. If we want to write text we must encode it into bytes ourselves. Passing a str to write raises a TypeError.
  • In text mode, in which case it expects strings and it will transparently encode them into bytes (and optionally translate newline characters) before writing the resulting bytes to the underlying binary buffer. Passing a bytes to write raises a TypeError (the one we catch).
>>> 'é'
'é'
>>> 'é'.encode('utf-8')
b'\xc3\xa9'

Binary mode:

>>> binary_stream = open('thefile', 'wb') # 'wb' = 'write + binary'
>>> binary_stream.write('é'.encode('utf-8')) # we can write bytes
2
>>> binary_stream.write(b'\xc3\xa9')
2
>>> binary_stream.write('é') # trying to write a string directly is a TypeError
Traceback (most recent call last):
    ...
TypeError: a bytes-like object is required, not 'str'
>>> binary_stream.close()

Conversely, if we use text mode (the default):

>>> text_stream = open('thefile', 'wt', encoding='utf-8') # 'write + text', or 'w' as text is the default
>>> text_stream.write('é') # we can write strings
1
>>> text_stream.write(b'\xc3\xa9'.decode('utf-8'))
1

and this is the error we catch in the current version of TableReport.write_html:

>>> text_stream.write(b'\xc3\xa9') # trying to write bytes to a text stream is a TypeError
Traceback (most recent call last):
    ...
TypeError: write() argument must be str, not bytes
>>> text_stream.close()

So we need to know if the file object the user gave us is a text or binary stream. If we knew it was created with open we could inspect its mode:

>>> with open('thefile', 'w') as stream:
...    print(stream.mode)
w

>>> with open('thefile', 'wb') as stream:
...    print(stream.mode)
wb

But that may not always work with other kinds of file objects:

>>> import io
>>> stream = io.BytesIO()
>>> stream.write(b'\xc3\xa9')
2
>>> stream.mode
Traceback (most recent call last):
    ...
AttributeError: '_io.BytesIO' object has no attribute 'mode'

All well-behaved file objects however will respect the rule of raising a TypeError if we give them the wrong type of data. So what we can do is first suppose it is in binary mode and try to write bytes, and if we get a TypeError try again with a string (or we could do it the other way around, too).

  1. if the encoding is None it seems the locale.encoding will be used, so we want to check for that too and raise an error if it is not utf-8 ?

If we pass encoding=None to open it will default to using locale.encoding, but AFAIK the resulting file object will always have an explicit encoding (not None).

>>> with open('thefile', 'w', encoding=None) as f:
...     print(f.encoding)
UTF-8

So I think it should be ok to inspect the encoding of the file object in this case.

Some other kinds of text stream may not have an explicit encoding though. For example io.StringIO writes to an in-memory python string, so it accepts strings but does not need to encode them (it doesn't need to convert to bytes because the backing storage is a unicode string)

>>> stream = io.StringIO()
>>> stream.encoding is None
True
>>> stream.write('é') # ok
1

In the case of StringIO it's fine to copy the report string to the stream's buffer even if we couldn't figure out what encoding it uses (because it doesn't use any). In some other cases I guess looking at the encoding attribute may not give a result even if a (possibly wrong) encoding is being used. But I guess at this point we have to trust the user to know what they are doing and at least we have made a reasonable effort to prevent the most likely issue, which is doing this on windows:

>>> with open('thefile', 'w') as f:
...     report.write_html(f)

(Even that will go away in python 3.15 when utf-8 becomes the default).
So my suggestion is that if we easily detect a wrong encoding is being used by checking the encoding attribute, we raise an error because we are sure that writing will result in a file with an incorrect <meta charset='utf-8'> tag which is likely to cause it to display incorrectly in a browser. If we cannot figure out for sure that a wrong encoding was used we just go ahead and write the report and hope for the best.

sorry for the long answer, hope this makes sense!

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to see the use case for allowing the user to pass a file instead of a filename. According to some path, this feature should address the simple use case of writing an HTML file to disk. Can you motivate this a little bit? It looks like plain over-engineering to me TBH.

Copy link
Member

@jeromedockes jeromedockes Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it was a suggestion from @glemaitre .

but if you look at functions offering to save some object as text such as matplotlib.savefig, pandas or polars to_csv etc, xml.etree.write, plotly write_html and others it's hard to find an example that does not allow passing a file object.

one example use case would be using a temporary file, testing, situations in which you may need to write either to an actual file or an in-memory buffer, ...

from personal experience a while ago in a completely unrelated context NiftiImages from the nibabel library could only be saved to a filesystem path and I can't remember what were the exact situation where it was a problem for me but I remember it was annoying

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jeromedockes for the explanation. it helped a lot.

"""saving an html report
mrastgoo marked this conversation as resolved.
Show resolved Hide resolved

Parameters
----------
filename : str, pathlib.Path or file object.
mrastgoo marked this conversation as resolved.
Show resolved Hide resolved
"""

if isinstance(filename, (str, Path)):
if isinstance(filename, str):
filename = Path(filename)
if filename.suffix != ".html":
raise ValueError("Not ending with .html")
mrastgoo marked this conversation as resolved.
Show resolved Hide resolved
file_object = open(filename, "w", encoding="utf-8")
else:
# already a file object
file_object = filename
file_object.write(self.html())
file_object.close()

def open(self):
"""Open the HTML report in a web browser."""
open_in_browser(self.html())
38 changes: 38 additions & 0 deletions skrub/_reporting/tests/test_table_report.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,10 @@
import json
import re
import warnings
from pathlib import Path
from tempfile import TemporaryDirectory

import pytest

from skrub import TableReport, ToDatetime
from skrub import _dataframe as sbd
Expand Down Expand Up @@ -123,6 +127,40 @@ def test_duration(df_module):
assert re.search(r"2(\.0)?\s+days", TableReport(df).html())


@pytest.mark.parametrize("filename_path", ["str", "Path", "file_object"])
mrastgoo marked this conversation as resolved.
Show resolved Hide resolved
def test_write_html(pd_module, filename_path):
df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
report = TableReport(df)

with TemporaryDirectory() as td:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could use the pytest tmp_path fixture?

f_name = Path(td) / Path("report.html")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it if filename


if filename_path == "str":
report.write_html(f_name.absolute())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to pass .absolute here?


if filename_path == "Path":
report.write_html(f_name)

if filename_path == "file_object":
file_object = open(f_name, "w", encoding="utf-8")
report.write_html(file_object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer something like:

tmp_file_path = ...
if filename_type == "str":
    filename = str(tmp_file_path)
elif filename_type == "file_object":
     filename = open(tmp_file_path, "w", encoding="utf-8")
else:
     filename = tmp_file_path

report.write_html(filename)
assert tmp_file_path.exists()


# Check if the file exists
mrastgoo marked this conversation as resolved.
Show resolved Hide resolved
assert f_name.exists()


def test_write_html_with_no_suffix(pd_module):
df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
report = TableReport(df)
with TemporaryDirectory() as td:
f_name = Path(td) / Path("report")
with pytest.raises(ValueError, match="Not ending with .html"):
report.write_html(f_name)

# Check if the file exists
mrastgoo marked this conversation as resolved.
Show resolved Hide resolved
assert not f_name.exists()


def test_verbosity_parameter(df_module, capsys):
df = df_module.make_dataframe(
dict(
Expand Down
Loading