All other setup related parameters
There are only two non-optional parameters in the setup function.
- data: pandas.DataFrame
Shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.
- target: str
Name of the target column, passed as a string.
PyCaret can automatically log entire experiments, including setup parameters, model hyperparameters, performance metrics, and pipeline artifacts. MLflow is the default logging backend; wandb is also available. Enabling a single parameter in the setup automatically tracks all the metrics, hyperparameters, and other important information about your machine learning model.
- log_experiment: bool or str or BaseLogger or list, default = False
A (list of) PyCaret `BaseLogger` or str (one of `mlflow`, `wandb`) determining which experiment loggers to use. Setting it to `True` will use the MLflow backend by default.
- experiment_name: str, default = None
Name of the experiment for logging. When set to `None`, a default name is used.
- experiment_custom_tags: dict, default = None
Dictionary of tag_name: str -> value pairs (values are converted to strings if needed) passed to `mlflow.set_tags` to add custom tags to the experiment.
- log_plots: bool, default = False
When set to `True`, applicable analysis plots are logged as image files.
- log_profile: bool, default = False
When set to `True`, the data profile is logged as an HTML file.
- log_data: bool, default = False
When set to `True`, the train and test datasets are logged as CSV files.
# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', log_experiment = True, experiment_name = 'diabetes1')
# model training
best_model = compare_models()
To initialize the MLflow server you must run the following command from within the notebook or from the command line. Once the server is running, you can track your experiment at http://localhost:5000.
# init server
!mlflow ui
When no backend is configured, data is stored locally at ./mlruns. To configure a different backend, call `mlflow.set_tracking_uri` before executing the setup function. The tracking URI can be one of:
- An empty string, or a local file path, prefixed with file:/. Data is stored locally at the provided file (or ./mlruns if empty).
- An HTTP URI like https://my-tracking-server:5000.
- A Databricks workspace, provided as the string “databricks” or, to use a Databricks CLI profile, “databricks://<profileName>”.
# set tracking uri
import mlflow
mlflow.set_tracking_uri('file:/c:/users/mlflow-server')
# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', log_experiment = True, experiment_name = 'diabetes1')
When using PyCaret on Databricks, the `experiment_name` parameter in the setup must include the complete path to storage. See the example below on how to log experiments when using Databricks:
# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', log_experiment = True, experiment_name = '/Users/[email protected]/experiment-name-here')
The following parameters in the setup can be used to configure the model selection process. They are not related to data preprocessing but can influence your model selection process.
- train_size: float, default = 0.7
The proportion of the dataset to be used for training and validation.
- test_data: pandas.DataFrame, default = None
If not `None`, `test_data` is used as a hold-out set and `train_size` is ignored. `test_data` must be labeled, and the shape of `data` and `test_data` must match.
- data_split_shuffle: bool, default = True
When set to `False`, prevents shuffling of rows during `train_test_split`.
- data_split_stratify: bool or list, default = False
Controls stratification during `train_test_split`. When set to `True`, it will stratify by the target column. To stratify on any other columns, pass a list of column names. Ignored when `data_split_shuffle` is `False`.
- fold_strategy: str or scikit-learn CV generator object, default = 'stratifiedkfold'
Choice of cross-validation strategy. Possible values are:
  - 'kfold'
  - 'stratifiedkfold'
  - 'groupkfold'
  - 'timeseries'
  - a custom CV generator object compatible with scikit-learn.
- fold: int, default = 10
The number of folds to be used in cross-validation. Must be at least 2. This is a global setting that can be overridden at the function level by using the `fold` parameter. Ignored when `fold_strategy` is a custom object.
- fold_shuffle: bool, default = False
Controls the shuffle parameter of CV. Only applicable when `fold_strategy` is `kfold` or `stratifiedkfold`. Ignored when `fold_strategy` is a custom object.
- fold_groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when 'GroupKFold' is used for cross-validation. It takes an array with shape (n_samples,), where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the name of the column in the dataset containing the group labels.
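The guarantee behind `fold_groups` is that rows sharing a group label never end up in both the training and validation folds of the same split. The sketch below illustrates this with scikit-learn's `GroupKFold` on a hypothetical `patient_id` column (the dataset and column names are illustrative, not from the source):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical dataset: 12 rows belonging to 4 patient groups.
df = pd.DataFrame({
    'feature': np.arange(12, dtype=float),
    'patient_id': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'target': [0, 1] * 6,
})

# GroupKFold keeps all rows of a group on the same side of each split.
gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(df[['feature']], df['target'],
                                    groups=df['patient_id']):
    train_groups = set(df['patient_id'].iloc[train_idx])
    val_groups = set(df['patient_id'].iloc[val_idx])
    assert train_groups.isdisjoint(val_groups)  # no group leakage

# In PyCaret, the equivalent behavior is requested with (sketch, not run here):
# clf1 = setup(df, target='target', fold_strategy='groupkfold',
#              fold_groups='patient_id')
```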
The following parameters in the setup can be used to control other experiment settings, such as using a GPU for training or setting the verbosity of the experiment. They do not affect the data in any way.
- n_jobs: int, default = -1
The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.
- use_gpu: bool or str, default = False
When set to `True`, it will use the GPU for training with algorithms that support it, and fall back to the CPU if they are unavailable. When set to `'force'`, it will only use GPU-enabled algorithms and raise an exception when they are unavailable. When `False`, all algorithms are trained using the CPU only.
- html: bool, default = True
When set to `False`, prevents the runtime display of the monitor. This must be set to `False` when the environment does not support IPython, for example a command-line terminal, Databricks, PyCharm, Spyder, and other similar IDEs.
- session_id: int, default = None
Controls the randomness of the experiment. It is equivalent to `random_state` in scikit-learn. When `None`, a pseudo-random number is generated. This can be used for later reproducibility of the entire experiment.
- silent: bool, default = False
Controls the confirmation input of data types when `setup` is executed. When executing in completely automated mode or on a remote kernel, this must be `True`.
- verbose: bool, default = True
When set to `False`, the information grid is not printed.
- profile: bool, default = False
When set to `True`, an interactive EDA report is displayed.
- profile_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the ProfileReport method used to create the EDA report. Ignored if `profile` is `False`.
- custom_pipeline: (str, transformer) or list of (str, transformer), default = None
When passed, appends the custom transformers to the preprocessing pipeline; they are applied on each CV fold separately and on the final fit. All the custom transformations are applied after `train_test_split` and before PyCaret's internal transformations.
- preprocess: bool, default = True
When set to `False`, no transformations are applied except `train_test_split` and the custom transformations passed in the `custom_pipeline` parameter. The data must be ready for modeling (no missing values, no dates, categorical data encoded) when preprocess is set to `False`.
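A `custom_pipeline` entry is just a (name, transformer) tuple where the transformer follows the scikit-learn fit/transform contract. The sketch below builds two illustrative steps and verifies, with plain scikit-learn, that they compose; the step names and transformers are assumptions for the example, not prescribed by PyCaret:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.decomposition import PCA

# Hypothetical custom steps to append to PyCaret's preprocessing pipeline.
custom_steps = [
    ('yeo_johnson', PowerTransformer(method='yeo-johnson')),
    ('pca_2', PCA(n_components=2)),
]

# Standalone check that the steps chain correctly on random features.
X = np.random.default_rng(0).normal(size=(50, 6))
Xt = Pipeline(custom_steps).fit_transform(X)
print(Xt.shape)  # (50, 2)

# In PyCaret, the same list is passed to setup (sketch, not run here):
# clf1 = setup(data, target='Class variable', custom_pipeline=custom_steps)
```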