ENH adding write_html to TableReport #1190
@@ -2,6 +2,10 @@
import json
import re
import warnings
from pathlib import Path
from tempfile import TemporaryDirectory

import pytest

from skrub import TableReport, ToDatetime
from skrub import _dataframe as sbd

@@ -123,6 +127,40 @@ def test_duration(df_module):
    assert re.search(r"2(\.0)?\s+days", TableReport(df).html())


@pytest.mark.parametrize("filename_path", ["str", "Path", "file_object"])
mrastgoo marked this conversation as resolved.
def test_write_html(pd_module, filename_path):
    df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
    report = TableReport(df)

    with TemporaryDirectory() as td:
Review comment: maybe we could use the pytest
        f_name = Path(td) / Path("report.html")
Review comment: Let's call it if

        if filename_path == "str":
            report.write_html(f_name.absolute())
Review comment: Why do we need to pass

        if filename_path == "Path":
            report.write_html(f_name)

        if filename_path == "file_object":
            file_object = open(f_name, "w", encoding="utf-8")
            report.write_html(file_object)
Review comment: I would prefer something like:

    tmp_file_path = ...
    if filename_type == "str":
        filename = str(tmp_file_path)
    elif filename_type == "file_object":
        filename = open(tmp_file_path, "w", encoding="utf-8")
    else:
        filename = tmp_file_path
    report.write_html(filename)
    assert tmp_file_path.exists()


        # Check if the file exists
mrastgoo marked this conversation as resolved.
        assert f_name.exists()


def test_write_html_with_no_suffix(pd_module):
    df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
    report = TableReport(df)
    with TemporaryDirectory() as td:
        f_name = Path(td) / Path("report")
        with pytest.raises(ValueError, match="Not ending with .html"):
            report.write_html(f_name)

        # Check if the file exists
mrastgoo marked this conversation as resolved.
        assert not f_name.exists()


def test_verbosity_parameter(df_module, capsys):
    df = df_module.make_dataframe(
        dict(

Review comment:
This looks great! In the issue I forgot to mention a couple of things:
So taking those details into account I guess the function could be modified slightly to look similar to this:
Review comment:
Thank you for the comments, @jeromedockes. I have a couple of questions.
TypeError
Review comment:
Hi @mrastgoo,
Trying to write bytes and catching the `TypeError` is an easy way to detect whether we are dealing with a text stream or a binary stream. A file object (a.k.a. stream) can be in:
- binary mode, where passing a `str` to `write` raises a `TypeError`;
- text mode (the default), where passing `bytes` to `write` raises a `TypeError` (the one we catch in the current version of `TableReport.write_html`).
So we need to know if the file object the user gave us is a text or binary stream. If we knew it was created with `open` we could inspect its mode. But that may not always work with other kinds of file objects: `io.StringIO`, for example, has no `mode` attribute.
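The behaviour described above can be checked with a short, standard-library-only sketch. The helper name `wrong_type_errors` is made up for this illustration and is not part of skrub:

```python
import io
import tempfile
from pathlib import Path


def wrong_type_errors(path):
    """Write the wrong data type in each mode and record the exception name."""
    errors = {}
    # Binary mode only accepts bytes-like objects, so writing a str fails.
    with open(path, "wb") as stream:
        try:
            stream.write("<html></html>")
        except TypeError as exc:
            errors["binary"] = type(exc).__name__
    # Text mode (the default) only accepts str, so writing bytes fails.
    with open(path, "w", encoding="utf-8") as stream:
        try:
            stream.write(b"<html></html>")
        except TypeError as exc:
            errors["text"] = type(exc).__name__
    return errors


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "report.html"
    errors = wrong_type_errors(target)
    # A file created with open() exposes its mode...
    with open(target, "wb") as stream:
        file_mode = stream.mode

# ...but an in-memory text stream has no `mode` attribute to inspect.
stringio_has_mode = hasattr(io.StringIO(), "mode")
```

Both modes raise the same exception type, which is why catching `TypeError` works regardless of how the stream was created.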
All well-behaved file objects however will respect the rule of raising a TypeError if we give them the wrong type of data. So what we can do is first suppose it is in binary mode and try to write bytes, and if we get a TypeError try again with a string (or we could do it the other way around, too).
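As a minimal sketch of that fallback, assuming the report is available as a `str`: the function name `write_text_or_bytes` and its parameters are hypothetical and do not reflect the actual `TableReport.write_html` implementation.

```python
import io


def write_text_or_bytes(html, stream, encoding="utf-8"):
    """Write `html` (a str) to a stream that may be binary or text."""
    try:
        # First suppose the stream is binary and write encoded bytes.
        stream.write(html.encode(encoding))
    except TypeError:
        # Text streams reject bytes, so try again with the str itself.
        stream.write(html)


binary_buffer = io.BytesIO()
text_buffer = io.StringIO()
write_text_or_bytes("<p>report</p>", binary_buffer)
write_text_or_bytes("<p>report</p>", text_buffer)
```

The same call works for both kinds of stream without looking at a `mode` attribute.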
If we pass `encoding=None` to `open` it will default to the locale encoding (what `locale.getencoding()` returns on Python 3.11+), but AFAIK the resulting file object will always have an explicit encoding (not `None`). So I think it should be ok to inspect the encoding of the file object in this case.
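A quick way to see this, as a sketch (the helper `encoding_after_open` is a name invented here):

```python
import tempfile
from pathlib import Path


def encoding_after_open(path, encoding=None):
    """Open `path` for writing in text mode and report the stream's encoding."""
    with open(path, "w", encoding=encoding) as stream:
        return stream.encoding


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "report.html"
    # With encoding=None the value is platform dependent, but it is a string.
    default_encoding = encoding_after_open(target)
    explicit_encoding = encoding_after_open(target, "utf-8")
```

The value of `default_encoding` depends on the platform and locale, so it is not shown here.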
Some other kinds of text stream may not have an explicit encoding though. For example, `io.StringIO` writes to an in-memory Python string, so it accepts strings but does not need to encode them (it doesn't need to convert to bytes because the backing storage is a unicode string).
In the case of `StringIO` it's fine to copy the report string to the stream's buffer even if we couldn't figure out what encoding it uses (because it doesn't use any). In some other cases I guess looking at the `encoding` attribute may not give a result even if a (possibly wrong) encoding is being used. But I guess at this point we have to trust the user to know what they are doing, and at least we have made a reasonable effort to prevent the most likely issue, which is opening the file in text mode without an explicit encoding on Windows. (Even that will go away in Python 3.15, when UTF-8 becomes the default.)
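To make both points concrete, here is a small sketch. It assumes `cp1252` as an example of a legacy Windows locale encoding; the actual default depends on the machine.

```python
import io

# An in-memory text stream stores str directly and declares no encoding.
buffer = io.StringIO()
buffer.write("<meta charset='utf-8'><p>café</p>")
stringio_encoding = buffer.encoding

# Assumption: cp1252 stands in for a non-UTF-8 Windows locale encoding.
# Bytes written with it are not valid UTF-8 once non-ASCII text is involved,
# so a page declaring charset utf-8 would be decoded incorrectly.
legacy_bytes = "café".encode("cp1252")
try:
    legacy_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

Pure ASCII content would not reveal the problem, which is part of why the mismatch is easy to miss.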
So my suggestion is that if we can easily detect that a wrong encoding is being used by checking the `encoding` attribute, we raise an error, because we are sure that writing will result in a file with an incorrect `<meta charset='utf-8'>` tag, which is likely to cause it to display incorrectly in a browser. If we cannot figure out for sure that a wrong encoding was used, we just go ahead and write the report and hope for the best.
Sorry for the long answer, hope this makes sense!
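One possible shape for that check, as a hedged sketch only: `check_stream_encoding` is a hypothetical helper, and the error message and the use of `codecs.lookup` to normalise encoding names are choices made for this example rather than what the PR does.

```python
import codecs
import io


def check_stream_encoding(stream):
    """Raise ValueError only when the stream clearly declares a non-UTF-8 encoding."""
    encoding = getattr(stream, "encoding", None)
    if encoding is None:
        # No declared encoding (binary stream, io.StringIO, ...): nothing to verify.
        return
    try:
        # Normalise aliases such as "UTF8" or "utf_8" to "utf-8".
        normalized = codecs.lookup(encoding).name
    except (LookupError, TypeError):
        # Unknown or odd value: we cannot be sure, so do not block the write.
        return
    if normalized != "utf-8":
        raise ValueError(
            f"The stream uses encoding {encoding!r}, but the report declares utf-8."
        )


check_stream_encoding(io.StringIO())
check_stream_encoding(io.TextIOWrapper(io.BytesIO(), encoding="UTF8"))
try:
    check_stream_encoding(io.TextIOWrapper(io.BytesIO(), encoding="cp1252"))
    rejected_cp1252 = False
except ValueError:
    rejected_cp1252 = True
```

Streams whose encoding cannot be determined pass through unchanged, matching the "hope for the best" branch of the suggestion.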
Review comment:
I fail to see the use case for allowing the user to pass a file instead of a filename. According to some path, this feature should address the simple use case of writing an HTML file to disk. Can you motivate this a little bit? It looks like plain over-engineering to me TBH.
Review comment:
I believe it was a suggestion from @glemaitre.
But if you look at functions that save an object as text, such as matplotlib `savefig`, pandas `to_csv`, polars `write_csv`, `xml.etree` `ElementTree.write`, plotly `write_html` and others, it's hard to find an example that does not allow passing a file object.
Example use cases would be using a temporary file, testing, situations in which you may need to write either to an actual file or an in-memory buffer, ...
From personal experience, a while ago in a completely unrelated context, NiftiImages from the nibabel library could only be saved to a filesystem path. I can't remember the exact situation where it was a problem for me, but I remember it was annoying.
Review comment:
Thank you @jeromedockes for the explanation. It helped a lot.