API Design
These operations are performed directly on the pandas DataFrame attribute, through methods of the Dataset object.
FeatureOperation is an abstract base class, with attributes "original_column", "derived_column", .... The concrete Operation classes are all the operations that can be performed on a Dataset object. Their __init__ takes the name of the column the operation is applied to and, optionally, the name of the column where the result is stored (otherwise the result is applied in place), plus all the other arguments the operation may need. Instances of these Operations are callable objects: they take a Dataset object as input and return a Dataset object.
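A minimal sketch of this interface, assuming an ABC-based design (the page later sketches a Protocol-based variant with slightly different attribute names; all names here are illustrative, not final):

```python
from abc import ABC, abstractmethod


class FeatureOperation(ABC):
    """Base class for operations applied to a column of a Dataset (sketch)."""

    def __init__(self, original_column, derived_column=None):
        self.original_column = original_column
        # If derived_column is None, the result replaces the original column in place.
        self.derived_column = derived_column

    @abstractmethod
    def __call__(self, dataset):
        """Take a Dataset object as input and return a Dataset object."""
```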
There can be an Apply class, which takes a callable object and the axis to apply it along.
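A possible sketch of Apply, mirroring pandas.DataFrame.apply; the `data` attribute used to reach the wrapped DataFrame and the `Dataset` constructor are assumptions:

```python
class Apply(FeatureOperation):
    """Apply an arbitrary callable along a given axis of the Dataset (sketch)."""

    def __init__(self, func, axis=0):
        self.func = func
        self.axis = axis

    def __call__(self, dataset):
        # `dataset.data` (the wrapped DataFrame) and the Dataset constructor are assumed names.
        new_df = dataset.data.apply(self.func, axis=self.axis)
        new_dataset = Dataset(new_df)
        new_dataset.add_operation(self)
        return new_dataset
```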
Similarly to histolab’s filters, operations can be chained together by means of a special object Compose, which takes all the operations to perform in a list and then applies all of them one after another.
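A minimal Compose consistent with this description:

```python
class Compose:
    """Chain operations: each one receives the Dataset produced by the previous one."""

    def __init__(self, operations):
        self.operations = operations

    def __call__(self, dataset):
        for operation in self.operations:
            dataset = operation(dataset)
        return dataset
```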
Pandas functionality to cover with operations:
- pd.to_timedelta()
- pd.to_datetime()
- pd.DataFrame().astype()
- pd.merge()
- pd.DataFrame().groupby
- pd.DataFrame().apply
- to_csv()
- to_file() [shelve]
- drop
- pd.to_numeric() ???
- pd.DataFrame().set_index() ???
- read_csv(path, sep, metadata_cols, feature_cols) [1a]
  nan_percentage_threshold attribute used in many_nan_columns → as parameter
- read_dataset(path, metadata_cols, feature_cols) [1c]
  Read the path and reconstruct the Dataset with its operations history. Pay attention to metadata_cols.
- save_dataset(path) [1b] Save the preprocessed CSV together with its operations (and the parameters used) in a human-readable format.
- nan_columns(tolerance) [2c] tolerance is an optional float between 0 and 1 (default=1) representing the minimum ratio "nan samples"/"total samples" for a column to be considered a "nan column" (see the sketch after this list).
- add_operation(feat_op: FeatureOperation) [3a]
  Acts on the private _operations_history to add feat_op to it.
- find_operations(feat_op: FeatureOperation) [3b]
  It can return zero, one or more operations in a list/OperationsList (see the sketch after this list).
  It requires a method (like `is_similar`, not `__eq__`) to verify whether two FeatureOperation instances are similar: each subclass of FeatureOperation should implement it in its own way.
- metadata_cols [2a]
- feature_cols [2a]
- numerical_cols [2b]
- categorical_cols [2b]
- boolean_cols [2b]
- string_cols [2b]
- mixed_cols [2b]
- constant_columns [2d]
- operations_history [3a]
```python
def add_operation(self, feat_op: FeatureOperation):
    self._operations_history += feat_op
```
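Hedged sketches of a few more of the methods and properties listed above, as they might look inside the Dataset class; the `_data` attribute holding the DataFrame is an assumption, and OperationsList is assumed to be iterable:

```python
@property
def constant_columns(self) -> set:
    """Columns that contain a single repeated value."""
    return {col for col in self._data.columns if self._data[col].nunique(dropna=False) <= 1}

def nan_columns(self, tolerance: float = 1.0) -> set:
    """Columns whose ratio "nan samples"/"total samples" reaches `tolerance`."""
    nan_ratio = self._data.isna().mean()
    return set(nan_ratio[nan_ratio >= tolerance].index)

def find_operations(self, feat_op: FeatureOperation) -> list:
    """Return the recorded operations similar to `feat_op` (zero, one or more)."""
    return [op for op in self._operations_history if op.is_similar(feat_op)]
```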
- OrdinalEncoder [3c iii]
- OneHotEncoder [3c ii]
- BinSplitting [3c i]
- FillNA
- ReplaceSubStrings (single chars or substrings) [3d i]
- ReplaceStrings (whole values) [3d i]
- ReplaceOutOfScale (cases where “>80”) [3d ii]
- Apply (to apply a function, as in pandas)
- AnonymizeDataset [3e] Takes the list of private columns (which will be removed from the Dataset and used to compute the unique ID), the path where the private-info file is stored, and "private_cols_to_keep" (columns to keep in both the private and the public df). Returns the anonymized Dataset and saves the private-info dataset (a sketch follows this list).
- ...
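A hypothetical sketch of AnonymizeDataset following the description above; the hashing scheme, the `data` attribute, the `Dataset` constructor and the CSV output format are all assumptions:

```python
import hashlib


class AnonymizeDataset(FeatureOperation):
    """Move private columns to a separate file and key both frames by a hashed unique ID (sketch)."""

    def __init__(self, private_cols, private_df_path, private_cols_to_keep=()):
        self.private_cols = list(private_cols)
        self.private_df_path = private_df_path
        self.private_cols_to_keep = list(private_cols_to_keep)

    def __call__(self, dataset):
        df = dataset.data.copy()  # `dataset.data` is an assumed attribute
        # Unique ID derived from the private columns, shared by the private and public frames
        raw_id = df[self.private_cols].astype(str).agg("-".join, axis=1)
        df["unique_id"] = raw_id.map(lambda value: hashlib.sha256(value.encode()).hexdigest())
        # Private frame: private columns (plus those kept in both) and the ID, saved separately
        private_cols = list(dict.fromkeys(self.private_cols + self.private_cols_to_keep))
        df[private_cols + ["unique_id"]].to_csv(self.private_df_path, index=False)
        # Public frame: drop the private columns, except those explicitly kept in both frames
        to_drop = [col for col in self.private_cols if col not in self.private_cols_to_keep]
        anonymized = Dataset(df.drop(columns=to_drop))  # hypothetical constructor
        anonymized.add_operation(self)
        return anonymized
```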
Sketch of the FeatureOperation protocol and a concrete FillNA operation:

```python
from abc import abstractmethod
from typing import Protocol


class FeatureOperation(Protocol):
    column: type
    result_column: type

    @abstractmethod
    def is_similar(self, other):
        raise NotImplementedError


class FillNA(FeatureOperation):
    def __init__(self, column, result_column=None, fill_value="--"):
        self.column = column
        self.result_column = result_column
        self.fill_value = fill_value

    def __call__(self, dfinfo):
        # `dfinfo` is the Dataset-like object the operation is applied to.
        filled_df = dfinfo.fillna(self.column)
        filled_df.add_operation(self)
        return filled_df
```
```python
class OperationsList:
    def __init__(self):
        self.operations = []

    def __contains__(self, feat_op: FeatureOperation):
        ...  # something, e.g. a similarity-based membership check

    def __add__(self, feat_op: FeatureOperation):
        ...

    def __iadd__(self, feat_op: FeatureOperation):
        self.operations.append(feat_op)
        return self  # needed so that `history += feat_op` keeps the same object
```
Example usage:

```python
transformations = Compose([FillNA(...), Operation(...)])
dataset = read_csv(...)
encoded_dataset = OneHotEncoder(...)(dataset)
preprocessed_dataset = transformations(encoded_dataset)
```