AutoML v2 (WIP) #81 (Draft)

gfournier wants to merge 8 commits into societe-generale:master from gfournier:wip_automl_v2.
Commits (all by gfournier):

- ff364fa add automl registries
- b137c12 reimported code to finish get_default_pipeline (excl. models, ...)
- dde8d22 add random_model_generator tests
- a26f758 add test for default model
- e6c3d07 add JobConfig + Guider
- 5d4a820 working automl module with sequential/inmemory backend
- 62b7a74 add local dask backend, add result command
- 3495d60 upgrade versions in github action build file
The first hunk appends to what is evidently .gitignore, excluding the Excel files that the new `result` command writes:

```diff
@@ -121,3 +121,5 @@ pytest_report.html
 pytest_report_not_long.html
 .DS_Store
 /docs/*.bat
+
+/*.xlsx
```
Next, a new two-line module; judging from the surrounding code it is likely the registry package's `__init__.py`, present only to trigger class registration as an import side effect:

```python
from . import _class_registration
```
The class registration module itself (98 new lines). It populates `CLASS_REGISTRY` with every pipeline component the AutoML search can draw from, skipping components whose optional dependencies (nltk, gensim, lightgbm) are absent:

```python
from sklearn.pipeline import Pipeline
from aikit.pipeline import GraphPipeline

from aikit.transformers import ColumnsSelector
from aikit.models import OutSamplerTransformer, StackerClassifier, StackerRegressor, KMeansWrapper, DBSCANWrapper, \
    AgglomerativeClusteringWrapper

from aikit.transformers import FeaturesSelectorClassifier, FeaturesSelectorRegressor, TruncatedSVDWrapper, PassThrough
from aikit.transformers import PCAWrapper
from aikit.transformers import BoxCoxTargetTransformer, NumImputer, KMeansTransformer, CdfScaler
from aikit.transformers import Word2VecVectorizer, CountVectorizerWrapper, Char2VecVectorizer
from aikit.transformers import TextNltkProcessing, TextDefaultProcessing, TextDigitAnonymizer
from aikit.transformers import TargetEncoderClassifier, TargetEncoderEntropyClassifier, TargetEncoderRegressor
from aikit.transformers import NumericalEncoder

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor
from sklearn.linear_model import LogisticRegression, Ridge, Lasso

from .util import CLASS_REGISTRY

try:
    import nltk
except ImportError:
    nltk = None
    print("NLTK not available, AutoML won't run with NLTK transformers")

try:
    import gensim
except ImportError:
    gensim = None
    print("Gensim not available, AutoML won't run with Gensim models")

try:
    import lightgbm
except ImportError:
    lightgbm = None
    print("LightGBM not available, AutoML won't run with LightGBM models")

# Pipelines
CLASS_REGISTRY.add_klass(PassThrough)
CLASS_REGISTRY.add_klass(Pipeline)
CLASS_REGISTRY.add_klass(GraphPipeline)
CLASS_REGISTRY.add_klass(ColumnsSelector)

# Stacking tools
CLASS_REGISTRY.add_klass(OutSamplerTransformer)
CLASS_REGISTRY.add_klass(StackerRegressor)
CLASS_REGISTRY.add_klass(StackerClassifier)

# Feature selection
CLASS_REGISTRY.add_klass(FeaturesSelectorClassifier)
CLASS_REGISTRY.add_klass(FeaturesSelectorRegressor)

# Text vectorizers
CLASS_REGISTRY.add_klass(CountVectorizerWrapper)
if gensim is not None:
    CLASS_REGISTRY.add_klass(Word2VecVectorizer)
    CLASS_REGISTRY.add_klass(Char2VecVectorizer)

# Text preprocessors
if nltk is not None:
    CLASS_REGISTRY.add_klass(TextNltkProcessing)
CLASS_REGISTRY.add_klass(TextDefaultProcessing)
CLASS_REGISTRY.add_klass(TextDigitAnonymizer)

# Transformers
CLASS_REGISTRY.add_klass(TruncatedSVDWrapper)
CLASS_REGISTRY.add_klass(PCAWrapper)
CLASS_REGISTRY.add_klass(BoxCoxTargetTransformer)
CLASS_REGISTRY.add_klass(NumImputer)
CLASS_REGISTRY.add_klass(CdfScaler)
CLASS_REGISTRY.add_klass(KMeansTransformer)

# Category encoders
CLASS_REGISTRY.add_klass(NumericalEncoder)
CLASS_REGISTRY.add_klass(TargetEncoderClassifier)
CLASS_REGISTRY.add_klass(TargetEncoderEntropyClassifier)
CLASS_REGISTRY.add_klass(TargetEncoderRegressor)

# Classifiers
CLASS_REGISTRY.add_klass(RandomForestClassifier)
CLASS_REGISTRY.add_klass(ExtraTreesClassifier)
CLASS_REGISTRY.add_klass(LogisticRegression)
CLASS_REGISTRY.add_klass(Lasso)
if lightgbm is not None:
    CLASS_REGISTRY.add_klass(lightgbm.LGBMClassifier)

# Regressors
CLASS_REGISTRY.add_klass(RandomForestRegressor)
CLASS_REGISTRY.add_klass(ExtraTreesRegressor)
CLASS_REGISTRY.add_klass(Ridge)
if lightgbm is not None:
    CLASS_REGISTRY.add_klass(lightgbm.LGBMRegressor)

# Clustering
CLASS_REGISTRY.add_klass(KMeansWrapper)
CLASS_REGISTRY.add_klass(DBSCANWrapper)
CLASS_REGISTRY.add_klass(AgglomerativeClusteringWrapper)
```
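The diff only shows `add_klass` being called; the registry object lives in the package's `util` module, which is not part of this PR view. A minimal sketch of what such a registry could look like (the class name, `get` method, and name-keyed storage are assumptions, not the PR's actual implementation):

```python
# Hypothetical minimal class registry; the real CLASS_REGISTRY in .util may
# differ. It maps a class name to the class object so that the AutoML search
# can turn serialized model descriptions back into instantiable classes.
class ClassRegistry:
    def __init__(self):
        self._klasses = {}

    def add_klass(self, klass):
        # Register under the class name; re-registering simply overwrites.
        self._klasses[klass.__name__] = klass

    def get(self, name):
        # Raises KeyError for unregistered names.
        return self._klasses[name]


CLASS_REGISTRY = ClassRegistry()
```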
Then a 15-line package `__init__` which, judging by the launcher's imports below, is `aikit/future/automl/__init__.py`; it defines the public AutoML API:

```python
from ._config import AutoMlConfig
from ._job import JobConfig, load_job_config_from_json
from ._automl import AutoMl, TimeBudget, AutoMlBudget
from . import registry
from ._registry import MODEL_REGISTRY

__all__ = [
    "AutoMlConfig",
    "AutoMl",
    "AutoMlBudget",
    "TimeBudget",
    "JobConfig",
    "load_job_config_from_json",
    "MODEL_REGISTRY"
]
```
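Given that `__all__`, downstream code (including the launcher below) can pull in the whole public surface from one place; a quick import sanity check, assuming this branch is installed:

```python
# Import smoke test mirroring __all__ above.
from aikit.future.automl import (
    AutoMl,
    AutoMlBudget,
    AutoMlConfig,
    JobConfig,
    TimeBudget,
    load_job_config_from_json,
    MODEL_REGISTRY,
)
```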
Finally, the 165-line CLI launcher, a typer application with two commands: `run` starts a search session and `result` merges and exports what the session produced (the `target` option is accepted but not yet used, consistent with the WIP status):

```python
import logging
import os
import uuid
# Remove some warning categories for debugging purposes
from warnings import simplefilter

import pandas as pd
import typer
from sklearn.exceptions import ConvergenceWarning

# Import scorers to add custom aikit scorers to the scikit-learn SCORERS list
import aikit.scorer  # noqa
from aikit.datasets import load_dataset, DatasetEnum
from aikit.future.automl import AutoMl, TimeBudget, AutoMlConfig, load_job_config_from_json, JobConfig
from aikit.future.automl._automl import ModelCountBudget
from aikit.future.automl.backends import get_backend, filter_backend_kwargs
from aikit.future.automl.guider import AutoMlModelGuider
from aikit.future.automl.result import AutoMlResultReader
from aikit.future.automl.serialization import Format

simplefilter(action='ignore', category=FutureWarning)
simplefilter(action='ignore', category=ConvergenceWarning)
simplefilter(action='ignore', category=UserWarning)


app = typer.Typer()

# Configure logging
logging.basicConfig(level=logging.INFO)
# Quieter custom levels for third parties, for debugging purposes
logging.getLogger("gensim").setLevel(logging.WARNING)

_logger = logging.getLogger(__name__)


@app.command()
def run(data: str,
        config_path: str = None,
        target: str = "target",
        backend: str = "sequential",
        session: str = None,
        cv: int = None,
        baseline: float = None,
        budget_model_count: int = None,
        budget_time: int = None,
        dask_storage_path: str = os.path.join(os.path.expanduser("~"), ".aikit", "working_dir"),
        dask_cluster: str = "local",
        dask_num_workers: int = 1):
    if session is None:
        session = str(uuid.uuid4())
    _logger.info(f"Start AutoML, session: {session}")

    # Register in this dictionary all arguments that must be passed to the backend
    backend_kwargs = {
        "dask_storage_path": dask_storage_path,
        "dask_cluster": dask_cluster,
        "dask_num_workers": dask_num_workers,
    }
    backend_kwargs = filter_backend_kwargs(backend, **backend_kwargs)

    if data in DatasetEnum.alls:
        df_train, y_train, _, _, _ = load_dataset(data)
    else:
        # TODO: load data from filesystem
        raise NotImplementedError(f"Unknown dataset: {data}")
    automl_config = AutoMlConfig(X=df_train, y=y_train)
    automl_config.guess_everything()

    if config_path is not None:
        job_config = load_job_config_from_json(config_path)
    else:
        job_config = JobConfig()
    if cv is not None:
        job_config.cv = cv
    if baseline is not None:
        job_config.baseline_score = baseline
    if job_config.cv is None:
        job_config.guess_cv(automl_config)
    if job_config.scoring is None:
        job_config.guess_scoring(automl_config)

    if budget_time is not None:
        budget = TimeBudget(budget_time)
    elif budget_model_count is not None:
        budget = ModelCountBudget(budget_model_count)
    else:
        raise ValueError("'budget_time' or 'budget_model_count' must be set")

    # TODO: force seed of workers in the backend
    with get_backend(backend, session=session, **backend_kwargs) as backend:
        # TODO: add dedicated methods in the backend to write common data
        backend.get_data_loader().write(key="X", path="data", data=df_train, serialization_format=Format.PICKLE)
        backend.get_data_loader().write(key="y", path="data", data=y_train, serialization_format=Format.PICKLE)
        backend.get_data_loader().write(key="groups", path="data", data=None, serialization_format=Format.PICKLE)
        backend.get_data_loader().write(key="automl_config", path="data", data=automl_config,
                                        serialization_format=Format.PICKLE)
        backend.get_data_loader().write(key="job_config", path="data", data=job_config,
                                        serialization_format=Format.PICKLE)

        result_reader = AutoMlResultReader(backend.get_data_loader())

        automl_guider = AutoMlModelGuider(result_reader=result_reader,
                                          job_config=job_config)

        automl = AutoMl(automl_config=automl_config,
                        job_config=job_config,
                        backend=backend,
                        automl_guider=automl_guider,
                        budget=budget,
                        random_state=123)

        automl.search_models()

        df_result = result_reader.load_all_results(aggregate=True)
        print(df_result)

    _logger.info(f"Finished searching models, session: {session}")


@app.command()
def result(session: str,
           output_path: str = ".",
           backend: str = "sequential",
           dask_storage_path: str = os.path.join(os.path.expanduser("~"), ".aikit", "working_dir")):
    # Register in this dictionary all arguments that must be passed to the backend
    backend_kwargs = {
        "dask_storage_path": dask_storage_path,
    }
    backend_kwargs = filter_backend_kwargs(backend, **backend_kwargs)

    with get_backend(backend, session=session, **backend_kwargs) as backend:
        result_reader = AutoMlResultReader(backend.get_data_loader())

        df_results = result_reader.load_all_results()
        df_additional_results = result_reader.load_additional_results()
        df_params = result_reader.load_all_params()
        df_errors = result_reader.load_all_errors()
        df_params_other = result_reader.load_other_params()

        df_merged_result = pd.merge(df_params, df_results, how="inner", on="job_id")
        df_merged_result = pd.merge(df_merged_result, df_params_other, how="inner", on="job_id")
        if df_additional_results.shape[0] > 0:
            df_merged_result = pd.merge(df_merged_result, df_additional_results, how="inner", on="job_id")

        df_merged_error = pd.merge(df_params, df_errors, how="inner", on="job_id")

        result_filename = os.path.join(output_path, "result.xlsx")
        try:
            df_merged_result.to_excel(result_filename, index=False)
            _logger.info(f"Result file saved: {result_filename}")
        except:  # noqa
            _logger.warning(f"Error saving result file ({result_filename})", exc_info=True)

        error_filename = os.path.join(output_path, "error.xlsx")
        try:
            df_merged_error.to_excel(error_filename, index=False)
            _logger.info(f"Error file saved: {error_filename}")
        except:  # noqa
            _logger.warning(f"Error saving error file ({error_filename})", exc_info=True)


if __name__ == '__main__':
    app()
```
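Since typer derives option names from the parameter names (`budget_model_count` becomes `--budget-model-count`), the two commands can be exercised with typer's test runner. A hedged smoke test: the module name `launcher` and the dataset name `titanic` are assumptions, as neither the file path nor the contents of `DatasetEnum.alls` appear in this diff view:

```python
# Hypothetical smoke test for the CLI above; 'launcher' and 'titanic'
# are placeholders, not names taken from the PR.
from typer.testing import CliRunner

from launcher import app

runner = CliRunner()

# Run a short search bounded by model count rather than wall-clock time.
result = runner.invoke(app, ["run", "titanic", "--budget-model-count", "5"])
assert result.exit_code == 0, result.output

# Export result.xlsx / error.xlsx for a known session id (printed by `run`).
result = runner.invoke(app, ["result", "some-session-id", "--output-path", "."])
print(result.output)
```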
Review comment (from the @sonatype-lift bot): Incompatible variable type: `config_path` is declared to have type `str` but is used as type `None`. 7 similar findings were reported in this PR.
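A conventional remedy for this class of finding, sketched here rather than taken from the PR, is to annotate the nullable options with `Optional`; typer treats `Optional[str] = None` exactly like `str = None` at the command line, so behavior is unchanged:

```python
# Sketch of a type-correct signature for the finding above; illustrative
# only, not a commit in this PR.
from typing import Optional

import typer

app = typer.Typer()


@app.command()
def run(data: str,
        config_path: Optional[str] = None,
        cv: Optional[int] = None,
        baseline: Optional[float] = None):
    # Same CLI behavior; the annotations now admit None, which satisfies
    # type checkers such as the one behind this finding.
    typer.echo(f"data={data}, config_path={config_path}")
```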