API Design

Low-Level Operations

These operations are performed directly on the pandas DataFrame attribute by methods of the Dataset object.

High-Level Operations / Transformations

FeatureOperation is an abstract base class with attributes such as “original_column”, “derived_column”, .... The concrete Operation classes are all the operations that can be performed on a Dataset object. Their init takes the name of the column to apply the operation to and, optionally, the name of the column where the result is stored (otherwise the result is applied in place), plus any other arguments the operation needs. Instances of these Operations are callable objects: they take a Dataset object as input and return a Dataset object.

There can be an Apply class which takes a callable object and the axis along which to apply it.
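
A hypothetical sketch of such an Apply wrapper (the df attribute and the add_operation call are assumptions based on the rest of this design):

```python
class Apply:
    """Lift a plain callable into an operation on a Dataset."""

    def __init__(self, func, axis=0):
        self.func = func
        self.axis = axis

    def __call__(self, dataset):
        # Assumption: Dataset exposes its pandas DataFrame as `dataset.df`
        # and records operations via add_operation, as described below.
        dataset.df = dataset.df.apply(self.func, axis=self.axis)
        dataset.add_operation(self)
        return dataset
```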

Similarly to histolab’s filters, operations can be chained together by means of a special Compose object, which takes a list of all the operations to perform and applies them one after another.
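
A minimal sketch of Compose, assuming histolab-style composition of callables:

```python
from typing import Callable, List


class Compose:
    """Chain operations: each one takes a Dataset and returns a Dataset."""

    def __init__(self, operations: List[Callable]):
        self.operations = operations

    def __call__(self, dataset):
        # Feed the output of each operation into the next one.
        for operation in self.operations:
            dataset = operation(dataset)
        return dataset
```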

Operations not yet implemented (they require direct access to the df attribute):

  • pd.to_timedelta()
  • pd.to_datetime()
  • pd.DataFrame().astype()
  • pd.merge()
  • pd.DataFrame().groupby
  • pd.DataFrame().apply
  • Copy
  • to_csv()
  • to_file() [shelve]
  • drop
  • pd.to_numeric() ???
  • pd.DataFrame().set_index() ???

Dataset module

Module functions:

  • read_csv(path, sep, metadata_cols, feature_cols) [1a] The nan_percentage_threshold attribute used in many_nan_columns → as parameter.
  • read_dataset(path, metadata_cols, feature_cols) [1c] Reads path and reconstructs the Dataset with its Operations history. Pay attention to metadata_cols.
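
A hedged usage sketch of these module functions (file paths and column names are purely illustrative):

```python
# Load a raw CSV, declaring which columns are metadata and which are features.
dataset = read_csv(
    "data/samples.csv",
    sep=",",
    metadata_cols=["sample_id", "hospital"],
    feature_cols=["age", "bmi", "smoker"],
)

# Later, reload a previously saved Dataset along with its operations history.
dataset = read_dataset(
    "data/preprocessed_dataset",
    metadata_cols=["sample_id"],
    feature_cols=["age", "bmi", "smoker"],
)
```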

Dataset methods:

  • save_dataset(path) [1b] Saves the preprocessed CSV together with its operations (and the parameters used) in a human-readable format.
  • nan_columns(tolerance) [2c] Tolerance is an optional float between 0 and 1 (default=1) representing the “nan samples”/“total samples” ratio at which a column is considered a “nan column”.
  • add_operation(feat_op: FeatureOperation) [3a] Acts on the private _operations_history attribute, appending feat_op to it.
  • find_operations(feat_op: FeatureOperation) [3b] It can return zero, one, or more operations in a list/OperationsList. It requires a method (like is_similar, not __eq__) to verify whether two FeatureOperation instances are similar; each subclass of FeatureOperation should implement this in its own method (see the sketch after this list).
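
A sketch of how find_operations could lean on is_similar (assuming the operations history is iterable, like the OperationsList shown below):

```python
from typing import List


def find_operations(self, feat_op: "FeatureOperation") -> List["FeatureOperation"]:
    # Each FeatureOperation subclass defines its own is_similar, so this
    # match is looser than __eq__ (e.g. it may ignore parameter values).
    return [op for op in self._operations_history if op.is_similar(feat_op)]
```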

Dataset Properties:

  • metadata_cols [2a]
  • feature_cols [2a]
  • numerical_cols [2b]
  • categorical_cols [2b]
  • boolean_cols [2b]
  • string_cols [2b]
  • mixed_cols [2b]
  • constant_columns [2d]
  • operations_history [3a]

Example

```python
def add_operation(self, feat_op: FeatureOperation):
    self._operations_history += feat_op
```

feature_operation module

  • OrdinalEncoder [3c iii]
  • OneHotEncoder [3c ii]
  • BinSplitting [3c i]
  • FillNA
  • ReplaceSubStrings (single chars or substrings) [3d i]
  • ReplaceStrings (whole values) [3d i]
  • ReplaceOutOfScale (cases of out-of-scale values such as “>80”) [3d ii]
  • Apply (to apply a function, as in pandas)
  • AnonymizeDataset [3e] Takes the list of private columns (which will be removed from the Dataset and used to compute the unique ID), the path where the private info file is stored, and “private_cols_to_keep” (columns to keep in both the private and the public df). Returns the anonymized Dataset and saves the private info dataset (see the sketch after this list).
  • ...
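
A rough sketch of the AnonymizeDataset flow described above (the hashing scheme, the df attribute, and all names beyond those in the bullet are assumptions):

```python
import hashlib


class AnonymizeDataset:
    """Split a Dataset into an anonymized public part and a private info file."""

    def __init__(self, private_cols, private_info_path, private_cols_to_keep=()):
        self.private_cols = private_cols
        self.private_info_path = private_info_path
        self.private_cols_to_keep = list(private_cols_to_keep)

    def __call__(self, dataset):
        df = dataset.df  # Assumption: Dataset exposes its DataFrame as `df`.
        # Compute a unique ID from the private columns (hashing is assumed).
        unique_id = (
            df[self.private_cols]
            .astype(str)
            .apply("|".join, axis=1)
            .map(lambda s: hashlib.sha256(s.encode()).hexdigest())
        )
        # Save the private info dataset, keyed by the unique ID.
        private_df = df[self.private_cols].assign(unique_id=unique_id)
        private_df.to_csv(self.private_info_path, index=False)
        # Drop private columns from the public Dataset, except those to keep.
        to_drop = [c for c in self.private_cols if c not in self.private_cols_to_keep]
        dataset.df = df.drop(columns=to_drop).assign(unique_id=unique_id)
        dataset.add_operation(self)
        return dataset
```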

Example

```python
from abc import abstractmethod
from typing import Optional, Protocol


class FeatureOperation(Protocol):
    column: str
    result_column: Optional[str]

    @abstractmethod
    def is_similar(self, other: "FeatureOperation") -> bool:
        raise NotImplementedError


class FillNA(FeatureOperation):
    def __init__(self, column, result_column=None, fill_value="--"):
        self.column = column
        self.result_column = result_column
        self.fill_value = fill_value

    def is_similar(self, other: FeatureOperation) -> bool:
        # Looser than __eq__: similar if the other operation is also a
        # FillNA acting on the same column (assumed criterion).
        return isinstance(other, FillNA) and other.column == self.column

    def __call__(self, dfinfo):
        filled_df = dfinfo.fillna(self.column)
        filled_df.add_operation(self)
        return filled_df
```

```python
class OperationsList:
    def __init__(self):
        self.operations = []

    def __contains__(self, feat_op: FeatureOperation) -> bool:
        # Assumption: containment is based on is_similar, consistent with
        # find_operations (the original design left this body open).
        return any(op.is_similar(feat_op) for op in self.operations)

    # An __add__() returning a new OperationsList could also be defined.

    def __iadd__(self, feat_op: FeatureOperation) -> "OperationsList":
        self.operations.append(feat_op)
        return self  # must return self so `history += op` rebinds correctly
```

Scripts for use cases

```python
transformations = Compose([FillNA(...), Operation(...)])
dataset = read_csv(...)

encoded_dataset = OneHotEncoder(...)(dataset)

preprocessed_dataset = transformations(encoded_dataset)
```
