Skip to content

Commit

Permalink
Merge pull request #8 from autonomio/initial_release
Browse files Browse the repository at this point in the history
#2 Initial Release Loose-Ends Tracker
mikkokotila authored Apr 17, 2022
2 parents 1a51961 + fa2a606 commit c63c87d
Showing 15 changed files with 856 additions and 436 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -35,6 +35,10 @@ jobs:
with:
name: "config.json"
json: ${{ secrets.CONFIG_JSON }}
- name: Add Key
run: |
echo "${{ secrets.AUTONOMIO_DEV_PEM }}" > autonomio-dev.pem
chmod 0400 autonomio-dev.pem
- name: Tests
run: |
pip install tensorflow>=2.0
316 changes: 316 additions & 0 deletions docs/DistributedScan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,316 @@
# DistributedScan

The experiment is configured and started through the `DistributedScan()` command. All of the options effecting the experiment, other than the hyperparameters themselves, are configured through the Scan arguments. The most common use-case is where ~10 arguments are invoked.

## Minimal Example

```python
jako.DistributedScan(x='x', y='y', params=p, model=input_model,config='config.json')

```

## DistributedScan Arguments

`x`, `y`, `params`, `model` and `config` are the only required arguments to start the experiment, all other are optional.</aside>

Argument | Input | Description
--------- | ------- | -----------
`x` | array or list of arrays | prediction features
`y` | array or list of arrays | prediction outcome variable
`params` | dict or ParamSpace object | the parameter dictionary or the ParamSpace object after splitting
`model` | function | the Keras model as a function
`experiment_name` | str | Used for creating the experiment logging folder
`x_val` | array or list of arrays | validation data for x
`y_val` | array or list of arrays | validation data for y
`val_split` | float | validation data split ratio
`random_method` | str | the random method to be used
`seed` | float | Seed for random states
`performance_target` | list | A result at which point to end experiment
`fraction_limit` | float | The fraction of permutations to be processed
`round_limit` | int | Maximum number of permutations in the experiment
`time_limit` | datetime | Time limit for experiment in format `%Y-%m-%d %H:%M`
`boolean_limit` | function | Limit permutations based on a lambda function
`reduction_method` | str | Type of reduction optimizer to be used used
`reduction_interval` | int | Number of permutations after which reduction is applied
`reduction_window` | int | the lookback window for reduction process
`reduction_threshold` | float | The threshold at which reduction is applied
`reduction_metric` | str | The metric to be used for reduction
`minimize_loss` | bool | `reduction_metric` is a loss
`disable_progress_bar` | bool | Disable live updating progress bar
`print_params` | bool | Print each permutation hyperparameters
`clear_session` | bool | Clear backend session between permutations
`save_weights` | bool | Keep model weights (increases memory pressure for large models)
`config` | str or dict | Configuration containing information about machines to distribute and database to upload the data.


## DistributedScan Object Properties

Once the `DistributedScan()` procedures are completed, an object with several useful properties is returned.The namespace is strictly kept clean, so all the properties consist of meaningful contents.

In the case conducted the following experiment, we can access the properties in `distributed_scan_object` which is a python class object.

```python
distributed_scan_object = jako.DistributedScan(x, y, model=iris_model, params=p, fraction_limit=0.1, config='config.json')
```
<hr>

**`best_model`** picks the best model based on a given metric and returns the index number for the model.

```python
distributed_scan_object.best_model(metric='f1score', asc=False)
```
NOTE: `metric` has to be one of the metrics used in the experiment, and `asc` has to be True for the case where the metric is something to be minimized.

<hr>

**`data`** returns a pandas DataFrame with the results for the experiment together with the hyperparameter permutation details.

```python
distributed_scan_object.data
```

<hr>

**`details`** returns a pandas Series with various meta-information about the experiment.

```python
distributed_scan_object.details
```

<hr>

**`evaluate_models`** creates a new column in `distributed_scan_object.data` with result from kfold cross-evaluation.

```python
distributed_scan_object.evaluate_models(x_val=x_val,
y_val=y_val,
n_models=10,
metric='f1score',
folds=5,
shuffle=True,
average='binary',
asc=False)
```

Argument | Description
-------- | -----------
`distributed_scan_object` | The class object returned by DistributedScan() upon completion of the experiment.
`x_val` | Input data (features) in the same format as used in DistributedScan(), but should not be the same data (or it will not be much of validation).
`y_val` | Input data (labels) in the same format as used in DistributedScan(), but should not be the same data (or it will not be much of validation).
`n_models` | The number of models to be evaluated. If set to 10, then 10 models with the highest metric value are evaluated. See below.
`metric` | The metric to be used for picking the models to be evaluated.
`folds` | The number of folds to be used in the evaluation.
`shuffle` | If the data is to be shuffled or not. Set always to False for timeseries but keep in mind that you might get periodical/seasonal bias.
`average` |One of the supported averaging methods: 'binary', 'micro', or 'macro'
`asc` |Set to True if the metric is to be minimized.
`saved` | bool | if a model saved on local machine should be used
`custom_objects` | dict | if the model has a custom object, pass it here

<hr>

**`learning_entropy`** returns a pandas DataFrame with entropy measure for each permutation in terms of how much there is variation between results of each epoch in the permutation.

```python
distributed_scan_object.learning_entropy
```

<hr>

**`params`** returns a dictionary with the original input parameter ranges for the experiment.

```python
distributed_scan_object.params
```

<hr>

**`round_times`** returns a pandas DataFrame with the time when each permutation started, ended, and how many seconds it took.

```python
distributed_scan_object.round_times
```

<hr>

<hr>

**`round_history`** returns epoch-by-epoch data for each model in a dictionary.

```python
distributed_scan_object.round_history
```

<hr>

**`saved_models`** returns the JSON (dictionary) for each model.

```python
distributed_scan_object.saved_models
```

<hr>

**`saved_weights`** returns the weights for each model.

```python
distributed_scan_object.saved_weights
```

<hr>

**`x`** returns the input data (features).

```python
distributed_scan_object.x
```

<hr>

**`y`** returns the input data (labels).

```python
distributed_scan_object.y
```

## Input Model

The input model is any Keras or tf.keras model. It's the model that Jako will use as the basis for the hyperparameter experiment.

#### A minimal example

```python
def input_model(x_train, y_train, x_val, y_val, params):

model.add(Dense(12, input_dim=8, activation=params['activation']))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', params['optimizer'])
out = model.fit(x=x_train,
y=y_train,
validation_data=[x_val, y_val])

return out, model
```
See specific details about defining the model [here](Examples_Typical?id=defining-the-model).

#### Models with multiple inputs or outputs (list of arrays)

For both cases, `DistributedScan(... x_val, y_val ...)` must be explicitly set i.e. you split the data yourself before passing it into Jako. Using the above minimal example as a reference.

For **multi-input** change `model.fit()` as highlighted below:

```python
out = model.fit(x=[x_train_a, x_train_b],
y=y_train,
validation_data=[[x_val_a, x_val_b], y_val])
```

For **multi-output** the same structure is expected but instead of changing the `x` argument values, now change `y`:

```python
out = model.fit(x=x_train,
y=[y_train_a, y_train_b],
validation_data=[x_val, [y_val_a, y_val_b]])
```
For the case where its both **multi-input** and **multi-output** now both `x` and `y` argument values follow the same structure:

```python
out = model.fit(x=[x_train_a, x_train_b],
y=[y_train_a, y_train_b],
validation_data=[[x_val_a, x_val_b], [y_val_a, y_val_b]])
```


## Parameter Dictionary

The first step in an experiment is to decide the hyperparameters you want to use in the optimization process.

#### A minimal example

```python
p = {
'first_neuron': [12, 24, 48],
'activation': ['relu', 'elu'],
'batch_size': [10, 20, 30]
}
```

#### Supported Input Formats

Parameters may be inputted either in a list or tuple.

As a set of discreet values in a list:

```python
p = {'first_neuron': [12, 24, 48]}
```
As a range of values `(min, max, steps)`:

```python
p = {'first_neuron': (12, 48, 2)}
```

For the case where a static value is preferred, but it's still useful to include it in in the parameters dictionary, use list:

```python
p = {'first_neuron': [48]}
```

## DistributedScan Config file

A config file has all the information regarding connection to remote machines. The config file will also use one of the remote machines as the central datastore. A sample config file will look like the following:

```
{
"run_central_node": true,
"machines": [
{
"machine_id": 1,
"JAKO_IP_ADDRESS": "machine_1_ip_address",
"JAKO_PORT": machine_1_port,
"JAKO_USER": "machine_1_username",
"JAKO_PASSWORD": "machine_1_password"
},
{
"machine_id": 2,
"JAKO_IP_ADDRESS": "machine_2_ip_address",
"JAKO_PORT": machine_2_port,
"JAKO_USER": "machine_2_username",
"JAKO_KEY_FILENAME": "machine_2_key_file_path"
}
],
"database": {
"DB_HOST_MACHINE_ID": 1, #the id for machine which is the datastore
"DB_USERNAME": "database_username",
"DB_PASSWORD": "database_password",
"DB_TYPE": "database_type",
"DATABASE_NAME": "database_name",
"DB_PORT": database_port,
"DB_ENCODING": "LATIN1",
"DB_UPDATE_INTERVAL": 5
}
}
```

### DistributeScan config arguments

Argument | Input | Description
--------- | ------- | -----------
`run_central_node` | bool | if set to true, the central machine where the script runs will also be included in distributed run.
`machines` | list of dict | list of machine configurations
`machine_id` | int | id for each machine in ascending order.
`JAKO_IP_ADDRESS` | str | ip address for the remote machine
`JAKO_PORT` | int | port number for the remote machine
`JAKO_USER` | str | username for the remote machine
`JAKO_PASSWORD` | str | password for the remote machine
`JAKO_KEY_FILENAME` | str | if password not available, the path to RSA private key of the machine could be supplied to this argument.
`database` | dict | configuration parameters for central datastore
`DB_HOST_MACHINE_ID` | int | The machine id to one of the remote machines that can be used as the host where the database resides.
`DB_USERNAME` | str | database username
`DB_PASSWORD` | str | database_password
`DB_TYPE` | str | database_type. Default is `postgres`.The available options are `postgres`, `mysql` and `sqlite`
`DATABASE_NAME` | str | database name
`DB_PORT` | int | database_port
`DB_ENCODING` | str | Defaults to `LATIN1`
`DB_UPDATE_INTERVAL` | int | The frequency with which database update happens. The value is in seconds.Defaults to `5`.


57 changes: 57 additions & 0 deletions docs/Install_Options.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# Install Options

Before installing jako, it is recommended to first setup and start the following:
* A python or conda environment.
* A postgresql database setup in one of the machines. This will be used as the central datastore.


## Installing Jako

#### Creating a python virtual environment
```python
virtualenv -p python3 jako_env
source jako_env/bin/activate
```

#### Creating a conda virtual environment
```python
conda create --name jako_env
conda activate jako_env
```

#### Install latest from PyPi
```python
pip install jako
```

#### Install a specific version from PyPi
```python
pip install jako==0.1
```

#### Upgrade installation from PyPi
```python
pip install -U --no-deps jako
```

#### Install from monthly
```python
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako
```

#### Install from weekly
```python
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako@dev
```

#### Install from daily
```python
pip install --upgrade --no-deps --force-reinstall git+https://github.com/autonomio/jako@daily-dev
```

## Installing a postgres database

To enable postgres in your central datastore, follow the steps in this
* Postgres for ubuntu machine: [link](https://blog.logrocket.com/setting-up-a-remote-postgres-database-server-on-ubuntu-18-04/)
* Postgres for Mac Machines: [link](https://www.sqlshack.com/setting-up-a-postgresql-database-on-mac/)

Loading

0 comments on commit c63c87d

Please sign in to comment.