docs(example): Adds Confidence Interval Ellipses #3747

Draft
wants to merge 21 commits into
base: main

Conversation

dangotbanned
Member

@dangotbanned dangotbanned commented Jan 4, 2025

Will close #3715

Description

Adds an example inspired by (ggplot2|plotnine).stat_ellipse().

As can be seen in the first commit, this PR began by rebasing a closed PR from almost 7 years ago.

Relevant info from (#3715 (comment)):

I believe plotnine.stat_ellipse would be an example of implementing this with numpy, scipy.
Source code

I also found an old closed PR (#514 by @essicolo) that would have added an example for this.
The blocker at the time is no longer an issue as (#3202 by @joelostblom) added scipy as a docs dependency.

Example

image

Tasks

Future Work

I think a more generalized version of this would be a good fit for https://github.com/vega/altair_ally.
An issue might be the scipy dependency, which I really was hoping to be able to avoid here.
The dendrogram example shows some kind of inlining from scipy - but I have no idea if that is possible for:

@dangotbanned
Member Author

Cannot express how relieved I am to see the CI finally green 😅
be087d2

Comment on lines +57 to +69
def pd_ellipse(
    df: pd.DataFrame, col_x: str, col_y: str, col_group: str
) -> pd.DataFrame:
    cols = col_x, col_y
    groups = []
    # TODO: Rewrite in a more readable way
    categories = df[col_group].unique()
    for category in categories:
        sliced = df.loc[df[col_group] == category, cols]
        ell_df = pd.DataFrame(np_ellipse(sliced.to_numpy()), columns=cols)  # type: ignore
        ell_df[col_group] = category
        groups.append(ell_df)
    return pd.concat(groups).reset_index()
Member Author

@dangotbanned dangotbanned Jan 4, 2025

TODO

  • Figure out a more ergonomic way of applying the function to each group

Member Author

@MarcoGorelli not an urgent one.

Do you know of a more idiomatic way to write this pandas code?

Based on this from 7 years ago:

columns = ['petalLength', 'petalWidth']
petal_ellipse = []
for species in iris.species.unique():
    ell_df = pd.DataFrame(ellipse(X=iris.loc[iris.species == species, columns].as_matrix()),
                          columns = columns)
    ell_df['species'] = species
    petal_ellipse.append(ell_df)
petal_ellipse = pd.concat(petal_ellipse, axis=0).reset_index()


Personally I'd rather use polars, but the pandas dependency is already there due to https://github.com/altair-viz/vega_datasets

Member Author

polars version

Could probably be reduced a bit further.

  • Using pl.DataFrame.partition_by works
    • But needing to handle dict[tuple[str, ...], pl.DataFrame] for a single key seems like a code smell
def pl_ellipse(
    df: pl.DataFrame, col_x: str, col_y: str, col_group: str
) -> pl.DataFrame:
    parts = df.select(col_x, col_y, col_group).partition_by(
        col_group, as_dict=True, include_key=False
    )
    return pl.concat(
        pl.DataFrame(np_ellipse(group.to_numpy()), [col_x, col_y])
        .with_columns(pl.lit(k[0]).alias(col_group))
        .with_row_index()
        for k, group in parts.items()
    )

Contributor

hey - i haven't looked into ellipse, but the pattern of creating a list of dataframes and then concatenating is what pandas recommends (as opposed to continuously concatenating in the for loop)
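
For illustration only, here is a minimal sketch of how the list-then-concat pattern could be kept while letting groupby do the slicing. It is not code from the PR: pd_ellipse_groupby is a hypothetical name, and np_ellipse_stub is a made-up stand-in for np_ellipse (it just draws a circle around the column means) so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def np_ellipse_stub(arr: np.ndarray, segments: int = 50) -> np.ndarray:
    # Hypothetical stand-in for the PR's np_ellipse: a unit circle around
    # the column means, only so that this sketch is self-contained.
    angles = np.linspace(0.0, 2.0 * np.pi, segments)
    return arr.mean(axis=0) + np.column_stack([np.cos(angles), np.sin(angles)])


def pd_ellipse_groupby(
    df: pd.DataFrame, col_x: str, col_y: str, col_group: str
) -> pd.DataFrame:
    cols = [col_x, col_y]
    groups = []
    # groupby yields (key, sub-frame) pairs, replacing unique() + boolean masks.
    # sort=False keeps groups in order of first appearance, like unique().
    for category, sliced in df.groupby(col_group, sort=False):
        ell_df = pd.DataFrame(np_ellipse_stub(sliced[cols].to_numpy()), columns=cols)
        ell_df[col_group] = category
        groups.append(ell_df)
    # Concatenate once, after the loop, rather than inside it.
    return pd.concat(groups).reset_index()
```

The loop body is unchanged from the original apart from the slicing, so whether this reads as more ergonomic is a judgment call.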

Member Author

> hey - i haven't looked into ellipse, but the pattern of creating a list of dataframes and then concatenating is what pandas recommends (as opposed to continuously concatenating in the for loop)

Thanks @MarcoGorelli
I mean - if nothing jumped out at you as a pandas anti-pattern - then that's a good sign at least 🙂

Re: (ellipse|np_ellipse) the only relevant parts would be the signature:

from typing import TypeAlias
import numpy as np

_2DArray: TypeAlias = np.ndarray[tuple[int, int], np.dtype[np.float64]]

def np_ellipse(arr: _2DArray, segments: int = 50) -> _2DArray: ...

The segments parameter controls the number of rows (elements) returned.
So there's a potential for changing shape before/after numpy
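
To make the shape change concrete, here is a minimal sketch of a function with a similar signature. It is not the PR's np_ellipse: the name np_ellipse_sketch, the level parameter, and the assumption that the two columns are roughly bivariate normal are all illustrative. Under that assumption the chi-square quantile with 2 degrees of freedom has a closed form, so this version needs only numpy. Returning segments + 1 rows (so the path closes) is a choice made here, not something stated in the thread.

```python
import numpy as np


def np_ellipse_sketch(
    arr: np.ndarray, segments: int = 50, level: float = 0.95
) -> np.ndarray:
    """Return (segments + 1, 2) points tracing a confidence ellipse for `arr`.

    Assumes `arr` has shape (n, 2) and is roughly bivariate normal.
    """
    center = arr.mean(axis=0)
    cov = np.cov(arr, rowvar=False)
    # For 2 degrees of freedom the chi-square quantile is -2 * ln(1 - level),
    # so no scipy call is needed in this special case.
    radius = np.sqrt(-2.0 * np.log(1.0 - level))
    # The Cholesky factor maps the unit circle onto the covariance ellipse.
    chol = np.linalg.cholesky(cov)
    angles = np.linspace(0.0, 2.0 * np.pi, segments + 1)
    unit_circle = np.column_stack([np.cos(angles), np.sin(angles)])
    return center + radius * unit_circle @ chol.T
```

Whatever the number of input rows, the output row count depends only on segments, which is the before/after shape difference mentioned above.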

@dangotbanned dangotbanned changed the title docs(DRAFT): Add Confidence Interval Ellipse example docs: Add Confidence Interval Ellipse example Jan 5, 2025
@dangotbanned dangotbanned changed the title docs: Add Confidence Interval Ellipse example docs(example): Adds Confidence Interval Ellipses Jan 5, 2025
Comment on lines +63 to +65
    groups = []
    # TODO: Rewrite in a more readable way
    categories = df[col_group].unique()
Member Author

Suggested change
-    groups = []
-    # TODO: Rewrite in a more readable way
-    categories = df[col_group].unique()
+    groups = []
+    categories = df[col_group].unique()

Development

Successfully merging this pull request may close these issues.

Support/document Confidence Interval Ellipse
2 participants