Commit

Expand documentation with more examples

jlgarridol committed Nov 12, 2024
1 parent 19b4bef commit 5f8a9b0
Showing 5 changed files with 363 additions and 37 deletions.
15 changes: 0 additions & 15 deletions README.md
@@ -27,21 +27,6 @@ It can be installed using *Pypi*:

pip install sslearn

## Code example


```python
from sslearn.wrapper import TriTraining
from sslearn.model_selection import artificial_ssl_dataset
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X, y, X_unlabel, true_label = artificial_ssl_dataset(X, y, label_rate=0.1)

model = TriTraining().fit(X, y)
model.score(X_unlabel, true_label)
```

## Citing

```bibtex
170 changes: 170 additions & 0 deletions docs/examples.md
@@ -0,0 +1,170 @@
---

---

Examples
=============

This page provides some code examples that show how to use the library. A [Jupyter Notebook](https://colab.research.google.com/drive/1wKSz-f7N4elqQwz_phrWXDrf3lRqaD6s#scrollTo=KS-6GNxCayxf) with the same examples is also available to run in Google Colab.


Datasets manipulation
---------------------

SSLearn includes tools for loading CSV and DAT (KEEL) datasets. It also provides functions to generate a semi-supervised dataset from a labeled one.

### Load a CSV or a DAT (KEEL) dataset

```python
from sslearn.datasets import read_csv, read_keel

# The CSV file must have the target column in the last position by default.
X_iris, y_iris = read_csv('iris.csv', format='numpy') # Format can be 'numpy' or 'pandas'

# The DAT file must have the target column in the first position by default. Also, instances with class "unlabeled" will be changed to -1.
X_iris, y_iris = read_keel('iris.dat', format='numpy')
```

### Generate a semi-supervised dataset from a labeled dataset

To test the semi-supervised algorithms, we need to generate a dataset in which some instances have no labels. The library provides the tools to do this.

There are two ways to generate the dataset:
1. Randomly select instances to be unlabeled, given a percentage of labeled instances.
2. Generate folds where one fold is the labeled dataset and the others are unlabeled.

```python
from sslearn.model_selection import artificial_ssl_dataset

# Using the previous dataset "X_iris" and "y_iris"
X_train, y_train, X_u, y_u = artificial_ssl_dataset(X_iris, y_iris, label_rate=0.2, random_state=42) # 80% of the instances are unlabeled
# or, if you also want the indexes of the instances
X_train, y_train, X_u, y_u, idx_label, idx_u = artificial_ssl_dataset(X_iris, y_iris, label_rate=0.2, random_state=42, indexes=True)
```
`artificial_ssl_dataset` supports `stratify` and `shuffle`, working as they do in `train_test_split` from `sklearn.model_selection`. A minimum number of labeled instances per class can also be enforced with the `force_minimum` parameter.
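For instance, a minimal sketch combining these parameters (the exact accepted values are assumptions based on the description above; check the API reference):

```python
# A sketch assuming `stratify` and `shuffle` accept booleans;
# the exact accepted values may differ, see the API reference.
X_train, y_train, X_u, y_u = artificial_ssl_dataset(
    X_iris, y_iris,
    label_rate=0.2,
    shuffle=True,      # shuffle the instances before splitting
    stratify=True,     # keep the class proportions in the labeled subset
    force_minimum=2,   # keep at least 2 labeled instances per class
    random_state=42,
)
```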

The other way to generate the dataset is to use the `StratifiedKFoldSS` class.

```python
from sslearn.model_selection import StratifiedKFoldSS

skss = StratifiedKFoldSS(n_splits=10, shuffle=True, random_state=42) # 10 folds, which implies that 10% of the instances are labeled

# It works like StratifiedKFold from sklearn.model_selection
for X_train, y_train, label_indices, unlabel_indices in skss.split(X_iris, y_iris):
    pass  # Here you can use the labeled and unlabeled instances
```


`X_train` and `y_train` form a set ready to be used by a semi-supervised algorithm, with -1 in the target column for the unlabeled instances.
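Since unlabeled instances are marked with -1, the labeled part can be recovered with a simple mask (a minimal sketch using numpy):

```python
import numpy as np

# Unlabeled instances carry -1 in the target column.
labeled_mask = y_train != -1
X_labeled, y_labeled = X_train[labeled_mask], y_train[labeled_mask]
X_unlabeled = X_train[~labeled_mask]
```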

Wrappers
--------

The wrappers are the most widely used algorithms in the semi-supervised learning field, and the library includes the most popular ones:
* `SelfTraining`: Self-training algorithm for one classifier.
* `Setred`: Self-training with data amendment.
* `CoTraining`: The co-training algorithm; requires two views and supports multi-learning.
* `Rasco`: Random subspace co-training, supports multi-learning.
* `RelRasco`: Relevant random subspace co-training, supports multi-learning.
* `DemocraticCoLearning`: Democratic co-learning algorithm, only for multi-learning.
* `CoForest`: Co-training version of Random Forest.
* `CoTrainingByCommittee`: Co-training by committee algorithm, known in the literature as CoBagging or CoBoosting depending on the base classifier.
* `TriTraining`: Tri-training algorithm, supports multi-learning.
* `DeTriTraining`: Tri-training with data amendment, supports multi-learning.

The wrappers follow the same structure as the classifiers in scikit-learn: use `fit` to train the model and `predict` to predict labels. The `score` method is also available to get the accuracy of the model.

```python
# Using the previous dataset "X_iris" and "y_iris"

from sslearn.wrapper import CoForest
from sslearn.model_selection import artificial_ssl_dataset
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X_iris_train, y_iris_train, X_iris_U, y_iris_U = artificial_ssl_dataset(X_iris, y_iris, label_rate=0.1, random_state=42)

# Create the wrapper
coforest = CoForest(base_estimator=DecisionTreeClassifier(random_state=10), random_state=10)
# Train
coforest.fit(X_iris_train, y_iris_train)
# Predict
y_pred = coforest.predict(X_iris_U)
# Score
accuracy = accuracy_score(y_iris_U, y_pred)
```
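The same accuracy can be obtained directly with the `score` method mentioned above:

```python
accuracy = coforest.score(X_iris_U, y_iris_U)
```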
Each wrapper has its own parameters: some need different views, some have specific parameters, and some require a particular kind of base classifier. The documentation of each wrapper explains its parameters and requirements, and provides examples of how to use it.

The wrappers that support multi-learning can receive different algorithms as base classifiers. The base classifiers must be passed as a list of classifiers.

```python
from sslearn.wrapper import DemocraticCoLearning
from sslearn.model_selection import artificial_ssl_dataset
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Load dataset
X_iris_train, y_iris_train, X_iris_U, y_iris_U = artificial_ssl_dataset(X_iris, y_iris, label_rate=0.1, random_state=42)

# Create the wrapper
democratic = DemocraticCoLearning(base_estimator=[DecisionTreeClassifier(random_state=10), KNeighborsClassifier(), GaussianNB()], random_state=10)
# Train
democratic.fit(X_iris_train, y_iris_train)
```
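Once trained, the wrapper is used like any other scikit-learn classifier:

```python
y_pred = democratic.predict(X_iris_U)
accuracy = democratic.score(X_iris_U, y_iris_U)
```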


### The multiview adapters
`CoTraining` is the only wrapper that supports different views directly, but the library provides adapters that supply a specific view to the other wrappers. The adapters are in the `sslearn.subview` module and support both classification and regression problems.

The adapter creates a subview of the dataset, and when the wrapper trains the base classifiers, each one is trained with the view defined in its adapter. Three modes of creating subviews are supported:
- `regex`: the columns whose names match the regular expression will be in the subview.
- `columns`: the columns in the given list will be in the subview.
- `include`: the columns whose names include the given string will be in the subview.

```python

from sslearn.subview import SubViewClassifier
from sslearn.wrapper import TriTraining
from sslearn.datasets import read_csv
from sslearn.model_selection import artificial_ssl_dataset
from sklearn.tree import DecisionTreeClassifier

# Load dataset as a DataFrame, so the subviews can select columns by name
X_breast, y_breast = read_csv('breast-cancer.csv', format='pandas')
X_breast_train, y_breast_train, X_breast_U, y_breast_U = artificial_ssl_dataset(X_breast, y_breast, label_rate=0.1, random_state=42)
# Create the adapters
subview1 = SubViewClassifier(DecisionTreeClassifier(random_state=10), "breast", mode='include') # The columns `breast` and `breast-quad`
subview2 = SubViewClassifier(DecisionTreeClassifier(random_state=10), ["age", "menopause", "irradiant"], mode='columns') # The columns `age`, `menopause`, and `irradiant`
subview3 = SubViewClassifier(DecisionTreeClassifier(random_state=10), ".*", mode='regex') # All the columns
# Create a TriTraining wrapper with the subviews
tritraining = TriTraining(base_estimator=[subview1, subview2, subview3], random_state=10)
```
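The wrapper is then trained and evaluated as usual:

```python
tritraining.fit(X_breast_train, y_breast_train)
accuracy = tritraining.score(X_breast_U, y_breast_U)
```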

Any wrapper that supports multi-learning can be used with the subview adapters.


### Restricted set classification

The RSC algorithms support datasets with pairwise constraints. The library provides the `WhoIsWhoClassifier` wrapper and the `feature_fusion` and `probability_fusion` methods.
`WhoIsWhoClassifier` uses the RSC algorithms and supports only cannot-link constraints. The `feature_fusion` and `probability_fusion` methods support both kinds of constraints, but only at prediction time. All the algorithms are in the `sslearn.restricted` module.

The complete example is available in the [Jupyter Notebook](https://colab.research.google.com/drive/1wKSz-f7N4elqQwz_phrWXDrf3lRqaD6s#scrollTo=mF32Wps3pIR5). Here is a snippet of the code, with the dataset and the constraints already loaded.

```python
from sslearn.restricted import WhoIsWhoClassifier
from sklearn.tree import DecisionTreeClassifier

wiw = WhoIsWhoClassifier(base_estimator=DecisionTreeClassifier(random_state=10),
                         method="hungarian") # The method can be `hungarian` or `greedy`
wiw.fit(X_train, y_train, cannot_link_train)
y_predicted = wiw.predict(X_test, cannot_link_test)
```
For `WhoIsWhoClassifier`, the cannot-link constraints must be an array whose size is the number of instances in the dataset, with each position mapped to one instance. If two or more instances cannot-link with each other, they must share the same value in the array, as illustrated below.
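For example, a sketch of such an array (the indices and values are illustrative):

```python
import numpy as np

# Five instances: instances 1 and 3 share the value 1, so they
# cannot-link with each other; the rest form singleton groups.
cannot_link_train = np.array([0, 1, 2, 1, 3])
```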

For the `feature_fusion` and `probability_fusion` methods, the constraints must be in dictionary format: the values are the instances that cannot-link or must-link, and each key identifies a group of instances, normally the first instance of the group (to ensure that different groups do not share a key); see the sketch below. These methods require a pre-trained model.
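A sketch of the expected dictionary format, with each group keyed by its first instance (the indices are illustrative assumptions):

```python
# Values are instance indices; each key is the first instance of its group.
must_link_test_dict = {0: [0, 4], 2: [2, 5, 7]}    # groups that must share a label
cannot_link_test_dict = {1: [1, 3], 6: [6, 8]}     # groups that cannot share a label
```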

```python
from sslearn.restricted import feature_fusion, probability_fusion

estimator = DecisionTreeClassifier(random_state=10)
estimator.fit(X_train, y_train)
y_predicted_f = feature_fusion(estimator, X_test, must_link_test_dict, cannot_link_test_dict)
y_predicted_p = probability_fusion(estimator, X_test, must_link_test_dict, cannot_link_test_dict)
```

`feature_fusion` combines the features of the instances that must-link, replacing them with the average of the features in the group. With `probability_fusion`, the posterior probabilities of the instances that must-link are averaged instead. In both cases, the Hungarian algorithm is used to solve the cannot-link constraints.
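Conceptually, the must-link fusion amounts to averaging within each group (a toy illustration with numpy, not a library call):

```python
import numpy as np

# Two must-linked instances: `feature_fusion` would replace both
# feature vectors with the group average before predicting.
group_features = np.array([[1.0, 2.0],
                           [3.0, 4.0]])
fused = group_features.mean(axis=0)  # -> array([2., 3.])
```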