Refactor fetching #669

Closed
19 commits
42b5d5b
Merge branch 'main' of https://github.com/skrub-data/skrub
LilianBoulard Jul 20, 2023
c306c09
Merge branch 'main' of https://github.com/skrub-data/skrub
LilianBoulard Jul 20, 2023
80aff5f
[WIP] Move public API and simplify/improve private API
LilianBoulard Jul 20, 2023
3e353d7
Merge branch 'main' of https://github.com/skrub-data/skrub into refac…
LilianBoulard Jul 20, 2023
4d54b3d
Merge branch 'main' of https://github.com/skrub-data/skrub into refac…
LilianBoulard Jul 20, 2023
a20c692
Merge branch 'main' of https://github.com/skrub-data/skrub into refac…
LilianBoulard Jul 28, 2023
0d889c9
Improve implementation
LilianBoulard Jul 28, 2023
b0c1f38
Adapt tests
LilianBoulard Jul 28, 2023
e30843c
Add changelog entry
LilianBoulard Jul 28, 2023
a4760fc
Merge branch 'main' of https://github.com/skrub-data/skrub into refac…
LilianBoulard Aug 2, 2023
cfc1b5e
Decrement columns as `RespondentID` is interpreted as index
LilianBoulard Aug 2, 2023
2db4c00
Rename files and clean tests
LilianBoulard Aug 3, 2023
b8e8e9e
Merge branch 'main' of https://github.com/skrub-data/skrub into refac…
LilianBoulard Aug 18, 2023
4e592cf
Merge with main
LilianBoulard Aug 18, 2023
0d8ae25
Clarify logic
LilianBoulard Aug 18, 2023
bb8f104
Use right function
LilianBoulard Aug 18, 2023
de24a11
Add `download_if_missing` support in figshare fetcher
LilianBoulard Aug 18, 2023
a6ff29f
Fix test error
LilianBoulard Aug 18, 2023
57ecd01
[WIP] Improve KEN embeddings fetching + various improvements
LilianBoulard Aug 18, 2023
6 changes: 6 additions & 0 deletions CHANGES.rst
@@ -60,6 +60,12 @@ Major changes
   compliance with the scikit-learn API.
   :pr:`647` by :user:`Guillaume Lemaitre <glemaitre>`

+* Fetching functions now have a unified and simpler API: a :class:`dataset.Dataset`
+  object is returned by all functions. Lazy loading (parameters `load_dataframe`)
+  has been removed. Parameter `download_if_missing` added to world bank and
+  figshare fetchers.
+  :pr:`669` by :user:`Lilian Boulard <LilianBoulard>`.
+
 * Fixes a bug in :class:`TableVectorizer` with `remainder`: it is now cloned if it's
   a transformer so that the same instance is not shared between different
   transformers.
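To make the changelog entry above concrete, here is a minimal sketch of what "one return type for every fetcher" plus a `download_if_missing` switch can look like. It does not use skrub: the `Dataset` fields (`name`, `description`, `path`), the function `fetch_example`, and the file name `example.csv` are all hypothetical and chosen only for illustration. The actual class in this PR may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class Dataset:
    """Hypothetical stand-in for a unified return type (fields are assumed)."""

    name: str
    description: str
    path: Path


def fetch_example(data_dir: Path, download_if_missing: bool = True) -> Dataset:
    """Hypothetical fetcher: always returns a Dataset, never a lazy variant."""
    path = data_dir / "example.csv"
    if not path.exists():
        if not download_if_missing:
            # The caller asked us not to touch the network.
            raise FileNotFoundError(
                f"{path} not found locally and download_if_missing=False"
            )
        # Placeholder for a real download: write a tiny CSV instead.
        data_dir.mkdir(parents=True, exist_ok=True)
        path.write_text("a,b\n1,2\n")
    return Dataset(name="example", description="Toy dataset", path=path)
```

With this shape, a caller that must stay offline passes `download_if_missing=False` and handles `FileNotFoundError`. Every other caller gets the same `Dataset` object whether the file was already cached or not.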
6 changes: 3 additions & 3 deletions benchmarks/utils/_various.py
@@ -11,7 +11,7 @@
     fetch_road_safety,
     fetch_traffic_violations,
 )
-from skrub.datasets import DatasetAll
+from skrub.datasets import Dataset


 def find_result(bench_name: str) -> Path:
@@ -66,7 +66,7 @@ def choose_file(results: list[Path]) -> Path:
     return results[int(choice) - 1]


-def get_classification_datasets() -> dict[str, DatasetAll]:
+def get_classification_datasets() -> dict[str, Dataset]:
     return {
         "open_payments": fetch_open_payments(),
         "drug_directory": fetch_drug_directory(),
@@ -76,7 +76,7 @@ def get_classification_datasets() -> dict[str, DatasetAll]:
     }


-def get_regression_datasets() -> dict[str, DatasetAll]:
+def get_regression_datasets() -> dict[str, Dataset]:
     return {
         "medical_charge": fetch_medical_charge(),
         "employee_salaries": fetch_employee_salaries(),
3 changes: 1 addition & 2 deletions doc/conf.py
@@ -502,8 +502,7 @@ def notebook_modification_function(notebook_content, notebook_filename):
     "DatetimeEncoder": "skrub.DatetimeEncoder",
     "deduplicate": "skrub.deduplicate",
     "TableVectorizer": "skrub.TableVectorizer",
-    "DatasetInfoOnly": "skrub.datasets._fetching.DatasetInfoOnly",
-    "DatasetAll": "skrub.datasets._fetching.DatasetAll",
+    "Dataset": "skrub.datasets.Dataset",
     "_replace_false_missing": "skrub._table_vectorizer._replace_false_missing",
 }

10 changes: 4 additions & 6 deletions skrub/_utils.py
@@ -84,15 +84,13 @@ def import_optional_dependency(name: str, extra: str = ""):
     maybe_module : Optional[ModuleType]
         The imported module when found.
     """
-
-    msg = (
-        f"Missing optional dependency '{name}'. {extra} "
-        f"Use pip or conda to install {name}."
-    )
     try:
         module = importlib.import_module(name)
     except ImportError as exc:
-        raise ImportError(msg) from exc
+        raise ImportError(
+            f"Missing optional dependency '{name}'. {extra} "
+            f"Use pip or conda to install {name}. "
+        ) from exc

     return module

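For readers who want to try the refactored helper in isolation, the sketch below reproduces the post-change body of `import_optional_dependency` as shown in the hunk above (docstring omitted) and calls it twice. The module name `surely_not_an_installed_module_xyz` is a made-up placeholder assumed not to exist on the system.

```python
import importlib


def import_optional_dependency(name: str, extra: str = ""):
    # Build the error message only on the failure path, as in the new version.
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"Missing optional dependency '{name}'. {extra} "
            f"Use pip or conda to install {name}. "
        ) from exc

    return module


# A module from the standard library imports normally.
json_module = import_optional_dependency("json")

# A missing module raises ImportError, chained to the original error.
try:
    import_optional_dependency("surely_not_an_installed_module_xyz")
except ImportError as err:
    message = str(err)
    cause = err.__cause__
```

The behavior is the same as before the change. The only difference is that the message is no longer formatted on every successful import.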
10 changes: 4 additions & 6 deletions skrub/datasets/__init__.py
@@ -1,6 +1,5 @@
-from ._fetching import (
-    DatasetAll,
-    DatasetInfoOnly,
+from ._fetching_functions import (
+    Dataset,
     fetch_drug_directory,
     fetch_employee_salaries,
     fetch_figshare,
@@ -10,18 +9,17 @@
     fetch_road_safety,
     fetch_traffic_violations,
     fetch_world_bank_indicator,
-    get_data_dir,
 )
 from ._generating import make_deduplication_data
 from ._ken_embeddings import (
     fetch_ken_embeddings,
     fetch_ken_table_aliases,
     fetch_ken_types,
 )
+from ._utils import get_data_dir

 __all__ = [
-    "DatasetAll",
-    "DatasetInfoOnly",
+    "Dataset",
     "fetch_drug_directory",
     "fetch_employee_salaries",
     "fetch_medical_charge",