BELIEF is a Feature Weighting (FW) algorithm implemented on Spark for Big Data problems. This repository contains an improved implementation of the RELIEF-F algorithm [1], extended with a cheap but effective technique for eliminating redundant features. BELIEF reuses the distance computations from earlier steps to estimate inter-feature redundancy relationships at virtually no extra cost. BELIEF also scales well across sample sizes, from hundreds of samples to thousands.
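For readers unfamiliar with RELIEF-F, the weighting idea that BELIEF builds on can be sketched on a single machine in plain Scala. This is only an illustration of the classical algorithm, not the distributed implementation in this repository. The object and method names (`ReliefSketch`, `distance`, `weights`) are hypothetical, features are assumed to be already scaled to [0, 1], and the sketch averages over the neighbors actually found rather than over a fixed k.

```scala
// Minimal single-machine sketch of classical RELIEF-F feature weighting.
// NOT the BELIEF/Spark implementation; all names here are hypothetical.
object ReliefSketch {
  // Manhattan distance between two feature vectors of equal length.
  def distance(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => math.abs(a(i) - b(i))).sum

  // Returns one weight per feature; a larger weight means a more relevant feature.
  // Assumes every feature is already scaled to [0, 1].
  def weights(x: Array[Array[Double]], y: Array[Int], k: Int): Array[Double] = {
    val nFeatures = x.head.length
    val w = Array.fill(nFeatures)(0.0)
    // Prior probability of each class, used to weight the misses.
    val priors = y.groupBy(identity).map { case (c, v) => c -> v.length.toDouble / y.length }
    for (i <- x.indices; (c, pc) <- priors) {
      // Up to k nearest neighbors of instance i within class c (excluding i itself).
      val neigh = x.indices
        .filter(j => j != i && y(j) == c)
        .sortBy(j => distance(x(i), x(j)))
        .take(k)
      if (neigh.nonEmpty) {
        // Hits (same class) lower the weight; misses raise it, scaled by the class prior.
        val sign = if (c == y(i)) -1.0 else pc / (1.0 - priors(y(i)))
        for (f <- 0 until nFeatures) {
          val diff = neigh.map(j => math.abs(x(i)(f) - x(j)(f))).sum / neigh.length
          w(f) += sign * diff / x.length
        }
      }
    }
    w
  }
}
```

Calling `ReliefSketch.weights(x, y, 5)` on a small in-memory dataset returns one weight per feature. A feature whose values differ between classes but stay close within a class ends up with a higher weight than a constant feature.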
Spark package:
- Compatible with Spark 2.2.0 and the ml API.
- Supports sparse data and high-dimensional datasets (millions of features).
- Includes a new heuristic that removes redundant features from the final selection set.
- Scales to large sample sets.
This software has been tested on several large-scale datasets, such as:
- Oversampled ECBDL14 dataset (64M instances, 631 features): a dataset selected for the GECCO-2014 competition (Vancouver, July 13th, 2014), which comes from the Protein Structure Prediction field.
- kddb dataset (20M instances, nearly 30M features).
val selector = new ReliefFRSelector()
    .setSeed(123456789L) // for sampling
    .setNumNeighbors(5) // k-NN used in RELIEF
    .setInputCol("features") // this must be a feature vector
val result =
Input data should be normalized before running RELIEF, which improves comparisons among feature ranks and nearest-neighbor searches. Additionally, continuous data should have zero mean and unit standard deviation so that redundancy estimation performs better. We recommend using the MLlib StandardScaler to standardize the data.
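The standardization step described above might look like the following sketch. It assumes a DataFrame with a Vector column named `features`; the object name `ScalingSketch`, the helper `standardize`, and the output column `scaledFeatures` are hypothetical. Note that centering with `setWithMean(true)` produces dense vectors, so for very sparse, high-dimensional data you may prefer to scale by standard deviation only.

```scala
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of this package.
object ScalingSketch {
  // Assumes `df` has a Vector column called "features".
  def standardize(df: DataFrame): DataFrame = {
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(true) // center to zero mean (output becomes dense)
      .setWithStd(true)  // scale to unit standard deviation
    // fit() computes per-feature statistics; transform() applies them.
    scaler.fit(df).transform(df)
  }
}
```

The resulting `scaledFeatures` column can then be passed to the selector through `setInputCol`.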
Likewise, a one-hot encoder is recommended for nominal features (unordered discrete data).
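As a rough sketch of that encoding step, a nominal string column can be indexed and then one-hot encoded with the Spark ml feature transformers. The column names (`category`, `categoryIndex`, `categoryVec`) and the names `EncodingSketch` and `encode` are assumptions made for illustration. Wrapping both stages in a `Pipeline` is a design choice here: it lets the same code work whether `OneHotEncoder` is a plain transformer (as in Spark 2.2) or an estimator (as in later releases).

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of this package.
object EncodingSketch {
  // Assumes `df` has a string column called "category".
  def encode(df: DataFrame): DataFrame = {
    // Map each distinct string value to a numeric index.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    // Turn each index into a sparse binary vector.
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
    val pipeline = new Pipeline().setStages(Array[PipelineStage](indexer, encoder))
    pipeline.fit(df).transform(df)
  }
}
```

The encoded vector column would still need to be combined with the remaining features into a single vector column before feature selection.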
- Sergio Ramírez-Gallego ([email protected]) (main contributor and maintainer).
[1] I. Kononenko, E. Simec, M. Robnik-Sikonja, Overcoming the myopia of inductive learning algorithms with RELIEFF, Applied Intelligence 7 (1) (1997) 39–55.