<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Smile - Feature Engineering</title>
<meta name="description" content="Statistical Machine Intelligence and Learning Engine">
<!-- prettify js and CSS -->
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js?lang=scala&lang=kotlin&lang=clj"></script>
<style>
.prettyprint ol.linenums > li { list-style-type: decimal; }
</style>
<!-- Bootstrap core CSS -->
<link href="css/cerulean.min.css" rel="stylesheet">
<link href="css/custom.css" rel="stylesheet">
<script src="https://code.jquery.com/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
<!-- slider -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.carousel.min.js"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.carousel.css" type="text/css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.transitions.css" type="text/css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.theme.min.css" type="text/css" />
<!-- table of contents auto generator -->
<script src="js/toc.js" type="text/javascript"></script>
<!-- styles for pager and table of contents -->
<link rel="stylesheet" href="css/pager.css" type="text/css" />
<link rel="stylesheet" href="css/toc.css" type="text/css" />
<!-- Vega-Lite Embed -->
<script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-57GD08QCML"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-57GD08QCML');
</script>
<!-- Sidebar and testimonial-slider -->
<script type="text/javascript">
$(document).ready(function(){
// scroll/follow sidebar
// #sidebar is defined in the content snippet
// This script has to be executed after the snippet loaded.
// $.getScript("js/follow-sidebar.js");
$("#testimonial-slider").owlCarousel({
items: 1,
singleItem: true,
pagination: true,
navigation: false,
loop: true,
autoPlay: 10000,
stopOnHover: true,
transitionStyle: "backSlide",
touchDrag: true
});
});
</script>
</head>
<body>
<div class="container" style="max-width: 1200px;">
<header>
<div class="masthead">
<p class="lead">
<a href="index.html">
<img src="images/smile.jpg" style="height:100px; width:auto; vertical-align: bottom; margin-top: 20px; margin-right: 20px;">
<span class="tagline">Smile — Statistical Machine Intelligence and Learning Engine</span>
</a>
</p>
</div>
<nav class="navbar navbar-default" role="navigation">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="navbar-collapse">
<ul class="nav navbar-nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Overview <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quickstart.html">Quick Start</a></li>
<li><a href="overview.html">What's Machine Learning</a></li>
<li><a href="data.html">Data Processing</a></li>
<li><a href="visualization.html">Data Visualization</a></li>
<li><a href="vegalite.html">Declarative Visualization</a></li>
<li><a href="gallery.html">Gallery</a></li>
<li><a href="faq.html">FAQ</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Supervised Learning <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="classification.html">Classification</a></li>
<li><a href="regression.html">Regression</a></li>
<li><a href="deep-learning.html">Deep Learning</a></li>
<li><a href="feature.html">Feature Engineering</a></li>
<li><a href="validation.html">Model Validation</a></li>
<li><a href="missing-value-imputation.html">Missing Value Imputation</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Unsupervised Learning <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="clustering.html">Clustering</a></li>
<li><a href="vector-quantization.html">Vector Quantization</a></li>
<li><a href="association-rule.html">Association Rule Mining</a></li>
<li><a href="mds.html">Multi-Dimensional Scaling</a></li>
<li><a href="manifold.html">Manifold Learning</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">LLM & NLP <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="llm.html">Large Language Model (LLM)</a></li>
<li><a href="nlp.html">Natural Language Processing (NLP)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Math <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="linear-algebra.html">Linear Algebra</a></li>
<li><a href="statistics.html">Statistics</a></li>
<li><a href="wavelet.html">Wavelet</a></li>
<li><a href="interpolation.html">Interpolation</a></li>
<li><a href="graph.html">Graph Data Structure</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/java/index.html" target="_blank">Java</a></li>
<li><a href="api/scala/index.html" target="_blank">Scala</a></li>
<li><a href="api/kotlin/index.html" target="_blank">Kotlin</a></li>
<li><a href="api/clojure/index.html" target="_blank">Clojure</a></li>
<li><a href="api/json/index.html" target="_blank">JSON</a></li>
</ul>
</li>
<li><a href="https://mybinder.org/v2/gh/haifengl/smile/notebook?urlpath=lab%2Ftree%2Fshell%2Fsrc%2Funiversal%2Fnotebooks%2Findex.ipynb" target="_blank">Try It Online</a></li>
</ul>
</div>
<!-- /.navbar-collapse -->
</nav>
</header>
<div id="content" class="row">
<div class="col-md-3 col-md-push-9 hidden-xs hidden-sm">
<div id="sidebar">
<div class="sidebar-toc" style="margin-bottom: 20px;">
<p class="toc-header">Contents</p>
<div id="toc"></div>
</div>
<div id="search">
<script>
(function() {
var cx = '010264411143030149390:ajvee_ckdzs';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
<gcse:searchbox-only></gcse:searchbox-only>
</div>
</div>
</div>
<div class="col-md-9 col-md-pull-3">
<h1 id="feature-top" class="title">Feature Engineering</h1>
                    <p>We have gone through the classification and regression algorithms. In the examples,
                        we assume that the data is ready and the features are carefully prepared, which many
                        machine learning practitioners take for granted. However, this is generally not the case
                        in practice. Practitioners should pay a lot of attention to feature generation,
                        selection, and dimension reduction, which often have a bigger impact on the final
                        results than the choice of a particular learning algorithm.</p>
                    <p>Understanding the problem (and the business) is the most important part of preparing
                        the features. Besides the attributes supplied with the training instances, researchers
                        should modify or enhance the representation of the data objects in search of new features
                        that describe the objects better.</p>
<p>The accuracy of the inferred function depends strongly on how the input
object is represented. Typically, the input object is transformed into
a feature vector, which contains a number of features that are descriptive
                        of the object. The number of features should not be too large, because of
                        the curse of dimensionality, but the features should contain enough information to
                        accurately predict the output. Beyond feature vectors, a few algorithms such as
                        support vector machines can process complex data objects such as sequences, trees,
and even graphs with properly engineered kernels.</p>
<p>If each of the features makes an independent contribution to the output,
then algorithms based on linear functions (e.g., linear regression,
logistic regression, linear support vector machines, naive Bayes) generally
perform well. However, if there are complex interactions among features,
then algorithms such as nonlinear support vector machines, decision trees
and neural networks work better. Linear methods can also be applied, but
engineers must manually specify the interactions when using them.</p>
<p>If the input features contain redundant information (e.g., highly correlated
features), some learning algorithms (e.g., linear regression, logistic
regression, and distance based methods) will perform poorly because of
numerical instabilities. These problems can often be solved by imposing
some form of regularization.</p>
<h2 id="preprocessing">Preprocessing</h2>
                    <p>Many machine learning methods such as neural networks and SVMs with a Gaussian
                        kernel also require the features to be properly scaled/standardized. For example,
                        each variable may be scaled into the interval [0, 1] or transformed to have mean 0 and standard
                        deviation 1. Methods that employ a distance function are particularly sensitive
                        to this. In the package <code>smile.feature</code>, we provide several classes
                        to preprocess the features. These classes generally learn the transform from
                        training data and can then be applied to new feature vectors.</p>
                    <p>The class <code>Scaler</code> scales all numeric variables into the range [0, 1].
                        If the dataset has outliers, normalization will certainly scale
                        the "normal" data to a very small interval. In this case, the
                        Winsorization procedure should be applied: values greater than the
                        specified upper limit are replaced with the upper limit, and those
                        below the lower limit are replaced with the lower limit. Often, the
                        specified range is indicated in terms of percentiles of the original
                        distribution (such as the 5th and 95th percentiles). The class <code>WinsorScaler</code>
                        implements this algorithm. In addition, the class <code>MaxAbsScaler</code>
                        scales each feature by its maximum absolute value so that the maximal absolute value
                        of each feature in the training set will be 1.0. It does not shift/center
                        the data, and thus does not destroy any sparsity.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_1" data-toggle="tab">Java</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="java_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
import smile.base.mlp.Layer;
var pendigits = Read.csv("data/classification/pendigits.txt", CSVFormat.DEFAULT.withDelimiter('\t'));
var df = pendigits.drop(16);
var y = pendigits.column(16).toIntArray();
var scaler = WinsorScaler.fit(df, 0.01, 0.99);
var x = scaler.apply(df).toArray();
CrossValidation.classification(10, x, y, (x, y) -> {
var model = new smile.classification.MLP(Layer.input(16),
Layer.sigmoid(50),
Layer.mle(10, OutputFunction.SIGMOID)
);
for (int epoch = 0; epoch < 10; epoch++) {
for (int i : MathEx.permutate(x.length)) {
model.update(x[i], y[i]);
}
}
return model;
});
</code></pre>
</div>
</div>
</div>
                    <p>The class <code>Standardizer</code> transforms numeric features to zero mean and unit variance.
                        Standardization assumes that the data follow
                        a Gaussian distribution and is not robust when outliers are present.
                        A robust alternative is to subtract the median and divide by the IQR, which
                        is implemented by <code>RobustStandardizer</code>.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_2" data-toggle="tab">Java</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="java_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var zip = Read.csv("data/usps/zip.train", CSVFormat.DEFAULT.withDelimiter(' '));
var df = zip.drop(0);
var y = zip.column(0).toIntArray();
var scaler = Standardizer.fit(df);
var x = scaler.apply(df).toArray();
var model = new smile.classification.MLP(Layer.input(256),
Layer.sigmoid(768),
Layer.sigmoid(192),
Layer.sigmoid(30),
Layer.mle(10, OutputFunction.SIGMOID)
);
model.setLearningRate(TimeFunction.constant(0.1));
model.setMomentum(TimeFunction.constant(0.0));
for (int epoch = 0; epoch < 10; epoch++) {
for (int i : MathEx.permutate(x.length)) {
model.update(x[i], y[i]);
}
}
</code></pre>
</div>
</div>
</div>
                    <p>The class <code>Normalizer</code> transforms samples to unit norm. This class
                        is stateless and thus doesn't need to learn transformation parameters from data.
                        Each sample (i.e. each row of the data matrix) with at least one non-zero
                        component is rescaled independently of the other samples so that its norm
                        (L1 or L2) equals one. Scaling inputs to unit norm is a common operation for text
                        classification or clustering, for instance.</p>
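                    <p>As a minimal illustration of what unit-norm scaling does (a hand-rolled sketch in plain Java,
                        not the <code>Normalizer</code> API itself), the snippet below rescales one sample so that its
                        L2 norm equals one:</p>
                    <ul class="nav nav-tabs">
                        <li class="active"><a href="#java_norm" data-toggle="tab">Java</a></li>
                    </ul>
                    <div class="tab-content">
                        <div class="tab-pane active" id="java_norm">
                            <div class="code" style="text-align: left;">
                                <pre class="prettyprint lang-java"><code>
    // A hand-rolled sketch of L2 normalization; Normalizer applies the same
    // idea to each row of a data matrix independently.
    double[] sample = {3.0, 4.0, 0.0};
    double norm = 0.0;
    for (double v : sample) norm += v * v;
    norm = Math.sqrt(norm);                    // ||sample|| = 5.0
    double[] normalized = new double[sample.length];
    for (int i = 0; i < sample.length; i++) {
        normalized[i] = sample[i] / norm;      // {0.6, 0.8, 0.0}
    }
                                </code></pre>
                            </div>
                        </div>
                    </div>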
                    <p>The class <code>smile.math.MathEx</code> also provides several functions for
                        a similar purpose, such as <code>standardize()</code>, <code>normalize()</code>,
                        and <code>scale()</code>, which apply to a matrix.</p>
                    <p>Although some methods such as decision trees can handle nominal variables directly,
                        other methods generally require nominal variables to be converted to multiple binary
                        dummy variables to indicate the presence or absence of a characteristic.
                        The class <code>OneHotEncoder</code> encodes categorical integer features
                        using the one-of-K scheme.</p>
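                    <p>As a small illustration of the one-of-K idea (plain Java, not the <code>OneHotEncoder</code>
                        API itself), a categorical value with K levels becomes a length-K binary vector with a single 1:</p>
                    <ul class="nav nav-tabs">
                        <li class="active"><a href="#java_onehot" data-toggle="tab">Java</a></li>
                    </ul>
                    <div class="tab-content">
                        <div class="tab-pane active" id="java_onehot">
                            <div class="code" style="text-align: left;">
                                <pre class="prettyprint lang-java"><code>
    // Hypothetical sketch of one-of-K (one-hot) encoding.
    // A nominal variable with levels {red, green, blue} maps to 3 dummy variables.
    int k = 3;            // number of levels
    int level = 1;        // e.g. "green" encoded as the integer 1
    double[] dummy = new double[k];
    dummy[level] = 1.0;   // {0.0, 1.0, 0.0}
                                </code></pre>
                            </div>
                        </div>
                    </div>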
                    <p>There are other feature generation classes in the package. For example,
                        <code>DateFeature</code> generates attributes from a <code>Date</code> object.
                        <code>Bag</code> is a generic implementation of bag-of-words features, which
                        may be applied to generic objects other than <code>String</code>.</p>
<h2 id="Hughes-effect">Hughes Effect</h2>
                    <p>In practice, we often start by generating a lot of features in the hope that
                        more relevant information will improve the accuracy. However, it is not always
                        good to include a large number of features in the learning system.
                        It is well known that the optimal rate of convergence to fit the
                        <code>m</code>-th derivative of θ (a <code>p</code>-times differentiable regression
                        function, especially a nonparametric one) to the data is at best proportional
                        to <code>n<sup>-(p-m)/(2p+d)</sup></code>, where <code>n</code> is the number
                        of data points, and <code>d</code> is the dimensionality of the data.
                        In a nutshell, the rate of convergence becomes exponentially slower
                        as we increase the dimensionality of our inputs. In other words,
                        with a fixed number of training samples, the predictive power reduces
                        as the dimensionality increases, a phenomenon known as the Hughes effect.</p>
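                    <p>To make the formula concrete: for a twice-differentiable regression function
                        (<code>p</code> = 2) and <code>m</code> = 0, the best rate is
                        <code>n<sup>-2/5</sup></code> with a single input variable (<code>d</code> = 1),
                        but only <code>n<sup>-2/14</sup> ≈ n<sup>-0.14</sup></code> with <code>d</code> = 10 inputs,
                        so the sample size must grow dramatically with the dimensionality just to keep the same
                        estimation error.</p>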
                    <p>Therefore, feature selection, which identifies
                        the relevant features and discards the irrelevant ones, and dimension reduction,
                        which maps the input data into a lower
                        dimensional space, are generally required prior to running the supervised learning algorithm.</p>
<h2 id="feature-selection">Feature Selection</h2>
<p>Feature selection is the technique of selecting a subset of relevant
features for building robust learning models. By removing most irrelevant
and redundant features from the data, feature selection helps improve the
performance of learning models by alleviating the effect of the curse of
dimensionality, enhancing generalization capability, speeding up learning
process, etc. More importantly, feature selection also helps researchers
                        to acquire a better understanding of the data.</p>
<p>Feature selection algorithms typically fall into two categories: feature
ranking and subset selection. Feature ranking methods rank the features by a
metric and eliminate all features that do not achieve an adequate score.
Subset selection searches the set of possible features for the optimal subset.
                        Clearly, an exhaustive search for the optimal subset is impractical if large
numbers of features are available. Commonly, heuristic methods such as
genetic algorithms are employed for subset selection.</p>
<h3 id="signal-noise-ratio">Signal Noise Ratio</h3>
                    <p>The signal-to-noise ratio (S2N) is a univariate feature ranking metric,
                        which can be used as a feature selection criterion for binary classification
                        problems. S2N is defined as</p>
<pre class="prettyprint lang-html"><code>
|μ<sub>1</sub> - μ<sub>2</sub>| / (σ<sub>1</sub> + σ<sub>2</sub>)
</code></pre>
                    <p>where μ<sub>1</sub> and μ<sub>2</sub> are the mean values of the variable
in classes 1 and 2, respectively, and σ<sub>1</sub> and σ<sub>2</sub>
are the standard deviations of the variable in classes 1 and 2, respectively.
Clearly, features with larger S2N ratios are better for classification.</p>
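                    <p>The ratio is straightforward to compute directly from the definition. The sketch below
                        (plain Java following the formula above, not a particular Smile class) ranks each column of a
                        two-class dataset by its S2N ratio:</p>
                    <ul class="nav nav-tabs">
                        <li class="active"><a href="#java_s2n" data-toggle="tab">Java</a></li>
                    </ul>
                    <div class="tab-content">
                        <div class="tab-pane active" id="java_s2n">
                            <div class="code" style="text-align: left;">
                                <pre class="prettyprint lang-java"><code>
    // A hand-rolled sketch: S2N ratio of each column of x for binary labels y (0/1).
    double[] s2n(double[][] x, int[] y) {
        int p = x[0].length;
        double[] ratio = new double[p];
        for (int j = 0; j < p; j++) {
            // class means of variable j
            double mu1 = 0, mu2 = 0;
            int n1 = 0, n2 = 0;
            for (int i = 0; i < x.length; i++) {
                if (y[i] == 0) { mu1 += x[i][j]; n1++; }
                else           { mu2 += x[i][j]; n2++; }
            }
            mu1 /= n1;
            mu2 /= n2;
            // class standard deviations of variable j
            double s1 = 0, s2 = 0;
            for (int i = 0; i < x.length; i++) {
                double d = x[i][j] - (y[i] == 0 ? mu1 : mu2);
                if (y[i] == 0) s1 += d * d; else s2 += d * d;
            }
            double sd1 = Math.sqrt(s1 / (n1 - 1));
            double sd2 = Math.sqrt(s2 / (n2 - 1));
            ratio[j] = Math.abs(mu1 - mu2) / (sd1 + sd2);
        }
        return ratio;
    }
                                </code></pre>
                            </div>
                        </div>
                    </div>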
<p>The output is the S2N ratios for each variable.</p>
<h3 id="sum-squares-ratio">Sum Squares Ratio</h3>
<p>The ratio of between-groups to within-groups sum of squares is a univariate
feature ranking metric, which can be used as a feature selection criterion
for multi-class classification problems. For each variable j, this ratio is</p>
<pre class="prettyprint lang-html"><code>
    BSS(j) / WSS(j) = Σ<sub>k</sub>Σ<sub>i</sub> I(y<sub>i</sub> = k)(x<sub>kj</sub> - x<sub>·j</sub>)<sup>2</sup> / Σ<sub>k</sub>Σ<sub>i</sub> I(y<sub>i</sub> = k)(x<sub>ij</sub> - x<sub>kj</sub>)<sup>2</sup>
</code></pre>
<p>where x<sub>·j</sub> denotes the average of variable j across all
samples, x<sub>kj</sub> denotes the average of variable j across samples
belonging to class k, and x<sub>ij</sub> is the value of variable j of sample i.
Clearly, features with larger sum squares ratios are better for classification.</p>
                    <p>Applying it to the Iris data, we find that the last two variables have much higher
                        sum squares ratios (about 16 and 13, in contrast to 1.6 and 0.6 for the first two variables).</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_3" data-toggle="tab">Java</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="java_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var iris = Read.arff("data/weka/iris.arff");
SumSquaresRatio.fit(iris, "class");
var df = iris.select("petallength", "petalwidth", "class");
var canvas = ScatterPlot.of(df, "petallength", "petalwidth", "class", '*').canvas();
canvas.setAxisLabels(iris.names()[2], iris.names()[3]);
canvas.window();
</code></pre>
</div>
</div>
</div>
<p>The scatter plot of the last two variables shows clearly that they capture the difference
among classes.</p>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/iris-petal-width-length.png" class="enlarge" style="width: 480px;" />
<div class="caption" style="min-width: 480px;">Iris Petal Width vs Length</div>
</div>
<h3 id="ensemble-learning-feature-selection">Ensemble Learning Based Feature Selection</h3>
                    <p>Ensemble learning methods (e.g. random forest, gradient boosted trees, and AdaBoost) can
                        estimate which variables are important in the classification.
                        Every time a node is split on a variable,
                        the impurity criterion (Gini, information gain, etc.) for the two
                        descendant nodes is less than that of the parent node. Adding up the decreases
                        for each individual variable over all trees in the forest gives a fast
                        measure of variable importance that is often very consistent with the permutation
                        importance measure.</p>
<p>For these algorithms, there is a method <code>importance</code> returning
the scores for feature selection (higher is better).</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_4" data-toggle="tab">Java</a></li>
<li><a href="#scala_4" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
val iris = read.arff("data/weka/iris.arff")
val model = smile.classification.randomForest("class" ~ ".", iris)
println(iris.names.slice(0, 4).zip(model.importance).mkString("\n"))
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var model = smile.classification.RandomForest.fit(Formula.lhs("class"), iris);
for (int i = 0; i < 4; i++) {
System.out.format("%s\t%.2f\n", iris.names()[i], model.importance()[i]);
}
</code></pre>
</div>
</div>
</div>
<p>For Iris data, random forest also gives much higher importance scores for
the last two variables.</p>
                    <p>For data including categorical variables with different numbers of
                        levels, random forests are biased in favor of those attributes with more
                        levels. Therefore, the variable importance scores from random forests are
                        not reliable for this type of data.</p>
<h3 id="tree-shap">TreeSHAP</h3>
<p>SHAP (SHapley Additive exPlanations) is a game theoretic approach to
explain the output of any machine learning model. It connects optimal
credit allocation with local explanations using the classic Shapley
values from game theory. In game theory, the Shapley value is the
average expected marginal contribution of one player after all
possible combinations have been considered.</p>
<p>SHAP leverages local methods designed to explain a prediction <code>f(x)</code>
based on a single input <code>x</code>. The local methods are defined
as any interpretable approximation of the original model. In particular,
SHAP employs additive feature attribution methods.
SHAP values attribute to each feature the change in the expected
model prediction when conditioning on that feature. They explain
how to get from the base value <code>E[f(z)]</code> that would be
predicted if we did not know any features to the current output
<code>f(x)</code>.</p>
<p>Tree SHAP is a fast and exact method to estimate SHAP values for
tree models and ensembles of trees, under several different possible
assumptions about feature dependence. </p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_14" data-toggle="tab">Java</a></li>
<li><a href="#scala_14" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_14">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
smile> val housing = read.arff("data/weka/regression/housing.arff")
[main] INFO smile.io.Arff - Read ARFF relation housing
housing: DataFrame = [CRIM: float, ZN: float, INDUS: float, CHAS: byte nominal[0, 1], NOX: float, RM: float, AGE: float, DIS: float, RAD: float, TAX: float, PTRATIO: float, B: float, LSTAT: float, class: float]
+------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
| CRIM| ZN|INDUS|CHAS| NOX| RM| AGE| DIS|RAD|TAX|PTRATIO| B|LSTAT|class|
+------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
|0.0063| 18| 2.31| 0|0.538|6.575|65.2| 4.09| 1|296| 15.3| 396.9| 4.98| 24|
|0.0273| 0| 7.07| 0|0.469|6.421|78.9|4.9671| 2|242| 17.8| 396.9| 9.14| 21.6|
|0.0273| 0| 7.07| 0|0.469|7.185|61.1|4.9671| 2|242| 17.8|392.83| 4.03| 34.7|
|0.0324| 0| 2.18| 0|0.458|6.998|45.8|6.0622| 3|222| 18.7|394.63| 2.94| 33.4|
| 0.069| 0| 2.18| 0|0.458|7.147|54.2|6.0622| 3|222| 18.7| 396.9| 5.33| 36.2|
|0.0299| 0| 2.18| 0|0.458| 6.43|58.7|6.0622| 3|222| 18.7|394.12| 5.21| 28.7|
|0.0883|12.5| 7.87| 0|0.524|6.012|66.6|5.5605| 5|311| 15.2| 395.6|12.43| 22.9|
|0.1445|12.5| 7.87| 0|0.524|6.172|96.1|5.9505| 5|311| 15.2| 396.9|19.15| 27.1|
|0.2112|12.5| 7.87| 0|0.524|5.631| 100|6.0821| 5|311| 15.2|386.63|29.93| 16.5|
| 0.17|12.5| 7.87| 0|0.524|6.004|85.9|6.5921| 5|311| 15.2|386.71| 17.1| 18.9|
+------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
496 more rows...
smile> val model = smile.regression.randomForest("class" ~ ".", housing)
[main] INFO smile.util.package$ - Random Forest runtime: 0:00:00.495102
model: regression.RandomForest = smile.regression.RandomForest@52d63b7e
smile> println(housing.names.slice(0, 13).zip(model.importance).sortBy(_._2).map(t => "%-15s %12.4f".format(t._1, t._2)).mkString("\n"))
CHAS 87390.1358
ZN 139917.2980
RAD 142194.2660
B 290043.4817
AGE 442444.4313
TAX 466142.1004
DIS 1006704.7537
CRIM 1067521.2371
INDUS 1255337.3748
NOX 1265518.8700
PTRATIO 1402827.4712
LSTAT 6001941.3822
RM 6182463.1457
smile> println(housing.names.slice(0, 13).zip(model.shap(housing.stream.parallel)).sortBy(_._2).map(t => "%-15s %12.4f".format(t._1, t._2)).mkString("\n"))
CHAS 0.0483
ZN 0.0514
RAD 0.0601
B 0.1647
AGE 0.2303
TAX 0.2530
DIS 0.3403
CRIM 0.5217
INDUS 0.5734
NOX 0.6423
PTRATIO 0.7819
RM 2.3272
LSTAT 2.6838
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_14">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
smile> import smile.io.*
smile> import smile.data.formula.*
smile> import smile.regression.*
smile> var housing = Read.arff("data/weka/regression/housing.arff")
[main] INFO smile.io.Arff - Read ARFF relation housing
housing ==> [CRIM: float, ZN: float, INDUS: float, CHAS: byte ... -+-----+
496 more rows...
smile> var model = smile.regression.RandomForest.fit(Formula.lhs("class"), housing)
model ==> smile.regression.RandomForest@6b419da
smile> var importance = model.importance()
importance ==> double[13] { 1055236.1223483568, 140436.296249959 ... 12613, 5995848.431410795 }
smile> var shap = model.shap(housing.stream().parallel())
shap ==> double[13] { 0.512298718958897, 0.053544797051652 ... 24552, 2.677155911786813 }
smile> var fields = java.util.Arrays.copyOf(housing.names(), 13)
fields ==> String[13] { "CRIM", "ZN", "INDUS", "CHAS", "NOX" ... "PTRATIO", "B", "LSTAT" }
smile> smile.sort.QuickSort.sort(importance, fields);
smile> for (int i = 0; i < importance.length; i++) {
...> System.out.format("%-15s %12.4f%n", fields[i], importance[i]);
...> }
CHAS 86870.7977
ZN 140436.2962
RAD 150921.6370
B 288337.2989
AGE 451251.3294
TAX 463902.6619
DIS 1015965.9019
CRIM 1055236.1223
INDUS 1225992.4930
NOX 1267797.6737
PTRATIO 1406441.0265
LSTAT 5995848.4314
RM 6202612.7762
smile> var fields = java.util.Arrays.copyOf(housing.names(), 13)
fields ==> String[13] { "CRIM", "ZN", "INDUS", "CHAS", "NOX" ... "PTRATIO", "B", "LSTAT" }
smile> smile.sort.QuickSort.sort(shap, fields);
smile> for (int i = 0; i < shap.length; i++) {
...> System.out.format("%-15s %12.4f%n", fields[i], shap[i]);
...> }
CHAS 0.0486
ZN 0.0535
RAD 0.0622
B 0.1621
AGE 0.2362
TAX 0.2591
DIS 0.3433
CRIM 0.5123
INDUS 0.5597
NOX 0.6409
PTRATIO 0.7861
RM 2.3312
LSTAT 2.6772
</code></pre>
</div>
</div>
</div>
<h3 id="genetic-algorithm-feature-selection">Genetic Algorithm Based Feature Selection</h3>
<p>Genetic algorithm (GA) is a search heuristic that mimics the process of
natural evolution. This heuristic is routinely used to generate useful
solutions to optimization and search problems. Genetic algorithms belong
to the larger class of evolutionary algorithms (EA), which generate solutions
to optimization problems using techniques inspired by natural evolution,
such as inheritance, mutation, selection, and crossover.</p>
<p>In a genetic algorithm, a population of strings (called chromosomes), which
encode candidate solutions (called individuals) to an optimization problem,
evolves toward better solutions. Traditionally, solutions are represented in binary as strings
of 0s and 1s, but other encodings are also possible. The evolution usually
starts from a population of randomly generated individuals and happens in
generations. In each generation, the fitness of every individual in the
population is evaluated, multiple individuals are stochastically selected
from the current population (based on their fitness), and modified
(recombined and possibly randomly mutated) to form a new population. The
new population is then used in the next iteration of the algorithm. Commonly,
the algorithm terminates when either a maximum number of generations has been
produced, or a satisfactory fitness level has been reached for the population.
If the algorithm has terminated due to a maximum number of generations, a
satisfactory solution may or may not have been reached.</p>
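                    <p>To make the loop above concrete, the self-contained toy below (plain Java, not the Smile GA API)
                        evolves bit strings toward the trivial fitness "number of 1-bits", using tournament selection,
                        single-point crossover, and bit-flip mutation:</p>
                    <ul class="nav nav-tabs">
                        <li class="active"><a href="#java_ga" data-toggle="tab">Java</a></li>
                    </ul>
                    <div class="tab-content">
                        <div class="tab-pane active" id="java_ga">
                            <div class="code" style="text-align: left;">
                                <pre class="prettyprint lang-java"><code>
    // A toy, self-contained GA maximizing the number of 1-bits in a chromosome.
    var random = new java.util.Random(42);
    int size = 20, length = 16, generations = 50;
    boolean[][] population = new boolean[size][length];
    for (boolean[] c : population)
        for (int j = 0; j < length; j++) c[j] = random.nextBoolean();

    for (int g = 0; g < generations; g++) {
        // fitness = count of 1-bits in each chromosome
        int[] fitness = new int[size];
        for (int i = 0; i < size; i++)
            for (boolean bit : population[i]) if (bit) fitness[i]++;

        boolean[][] next = new boolean[size][];
        for (int i = 0; i < size; i++) {
            // tournament selection of two parents
            int a = random.nextInt(size), b = random.nextInt(size);
            boolean[] p1 = fitness[a] >= fitness[b] ? population[a] : population[b];
            a = random.nextInt(size); b = random.nextInt(size);
            boolean[] p2 = fitness[a] >= fitness[b] ? population[a] : population[b];
            // single-point crossover
            int cut = random.nextInt(length);
            boolean[] child = new boolean[length];
            for (int j = 0; j < length; j++) child[j] = j < cut ? p1[j] : p2[j];
            // mutation: flip each bit with small probability
            for (int j = 0; j < length; j++)
                if (random.nextDouble() < 0.01) child[j] = !child[j];
            next[i] = child;
        }
        population = next;
    }
                                </code></pre>
                            </div>
                        </div>
                    </div>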
<p>Genetic algorithm based feature selection finds many (random)
subsets of variables of expected classification power.
The "fitness" of each subset of variables is determined by its
ability to classify the samples according to a given classification
method.</p>
                    <p>In the example below, we use LDA as the classifier and the accuracy
                        on a held-out test set as the fitness measure.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_5" data-toggle="tab">Java</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="java_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var train = Read.csv("data/usps/zip.train", CSVFormat.DEFAULT.withDelimiter(' '));
var test = Read.csv("data/usps/zip.test", CSVFormat.DEFAULT.withDelimiter(' '));
var x = train.drop("V1").toArray();
var y = train.column("V1").toIntArray();
var testx = test.drop("V1").toArray();
var testy = test.column("V1").toIntArray();
var selection = new GAFE();
var result = selection.apply(100, 20, 256,
GAFE.fitness(x, y, testx, testy, new Accuracy(),
(x, y) -> LDA.fit(x, y)));
for (BitString bits : result) {
System.out.format("%.2f%% %s%n", 100*bits.fitness(), bits);
}
</code></pre>
</div>
</div>
</div>
<!-- GAFE deadlock on scala
<ul class="nav nav-tabs">
<li class="active"><a href="#java_5" data-toggle="tab">Java</a></li>
<li><a href="#scala_5" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
val train = read.csv("data/usps/zip.train", header=false, delimiter=' ')
val test = read.csv("data/usps/zip.test", header=false, delimiter=' ')
val x = train.drop("V1").toArray
val y = train("V1").toIntArray
val testx = test.drop("V1").toArray
val testy = test("V1").toIntArray
val selection = new GAFE
val result = selection.apply(100, 20, 256,
GAFE.fitness(x, y, testx, testy, new Accuracy,
(x: Array[Array[Double]], y: Array[Int]) => LDA.fit(x, y)))
result.foreach { bits =>
print(f"${100*bits.fitness}%.2f%% ")
println(bits.bits.mkString(" "))
}
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var train = Read.csv("data/usps/zip.train", CSVFormat.DEFAULT.withDelimiter(' '));
var test = Read.csv("data/usps/zip.test", CSVFormat.DEFAULT.withDelimiter(' '));
var x = train.drop("V1").toArray();
var y = train.column("V1").toIntArray();
var testx = test.drop("V1").toArray();
var testy = test.column("V1").toIntArray();
var selection = new GAFE();
var result = selection.apply(100, 20, 256,
GAFE.fitness(x, y, testx, testy, new Accuracy(),
(x, y) -> LDA.fit(x, y)));
for (BitString bits : result) {
System.out.format("%.2f%% %s%n", 100*bits.fitness(), bits);
}
</code></pre>
</div>
</div>
</div>
-->
<p>When many such subsets of variables are obtained, the one with the best
performance may be used as selected features. Alternatively, the frequencies
with which variables are selected may be analyzed further. The most
frequently selected variables may be presumed to be the most relevant to
sample distinction and are finally used for prediction.</p>
<p>Although GA avoids brute-force search, it is still much slower than univariate feature selection.</p>
<h2 id="dimension-reduction">Dimension Reduction</h2>
<p>Dimension Reduction, also called Feature Extraction, transforms the data in the
high-dimensional space to a space of fewer dimensions. The data
transformation may be linear, as in principal component analysis (PCA),
but many nonlinear dimensionality reduction techniques also exist.</p>
<h3 id="pca">Principal Component Analysis</h3>
<p>The main linear technique for dimensionality reduction, principal component
analysis, performs a linear mapping of the data to a lower dimensional
space in such a way that the variance of the data in the low-dimensional
representation is maximized. In practice, the correlation matrix of the
                        data is constructed and the eigenvectors of this matrix are computed.
The eigenvectors that correspond to the largest eigenvalues (the principal
components) can now be used to reconstruct a large fraction of the variance
of the original data. Moreover, the first few eigenvectors can often be
interpreted in terms of the large-scale physical behavior of the system.
The original space has been reduced (with data loss, but hopefully
retaining the most important variance) to the space spanned by a few
eigenvectors.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_6" data-toggle="tab">Java</a></li>
<li><a href="#scala_6" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def pca(data: Array[Array[Double]], cor: Boolean = false): PCA
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class PCA {
public static PCA fit(double[][] data);
public static PCA cor(double[][] data);
}
</code></pre>
</div>
</div>
</div>
                    <p>If the parameter <code>cor</code> is true, PCA uses the correlation matrix instead of the covariance matrix.
                        The correlation matrix is simply the covariance matrix, standardized by setting all variances equal
                        to one. When the scales of the variables are similar, the covariance matrix is always preferred,
                        as the correlation matrix will lose information when standardizing the variance.
                        The correlation matrix is recommended when the variables are measured on different scales.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_7" data-toggle="tab">Java</a></li>
<li><a href="#scala_7" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_7">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
val pendigits = read.csv("data/classification/pendigits.txt", delimiter="\t", header = false)
val formula: Formula = "V17" ~ "."
val df = formula.x(pendigits)
val x = df.toArray()
val y = formula.y(pendigits).toIntArray()
val pc = pca(df)
show(screeplot(pc.varianceProportion()))
val x2 = pc.getProjection(3)(x)
show(plot(x2, y, '.'))
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_7">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var pendigits = Read.csv("data/classification/pendigits.txt", CSVFormat.DEFAULT.withDelimiter('\t'));
var formula = Formula.lhs("V17");
var x = formula.x(pendigits).toArray();
var y = formula.y(pendigits).toIntArray();
var pca = PCA.fit(x);
var canvas = new ScreePlot(pca.varianceProportion()).canvas();
canvas.window();
var p = pca.getProjection(3);
var x2 = p.apply(x);
ScatterPlot.of(x2, y, '.').canvas().window();
</code></pre>
</div>
</div>
</div>
<p>In the above example, we apply PCA to the pen-digits data, which includes 16 variables.</p>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/scree.png" class="enlarge" style="width: 480px;" />
<div class="caption" style="min-width: 480px;">Scree Plot</div>
</div>
<p>The scree plot is a useful visual aid for determining an appropriate number of principal components.
The scree plot graphs the eigenvalue against the component number. To determine the appropriate
number of components, we look for an "elbow" in the scree plot. The component number is taken to
be the point at which the remaining eigenvalues are relatively small and all about the same size.</p>
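                    <p>Besides looking for the elbow by eye, one can also pick the number of components numerically
                        from the cumulative variance. The sketch below assumes a <code>cumulativeVarianceProportion()</code>
                        accessor (the running sum of the proportions plotted above) on the <code>pca</code> object fit earlier,
                        and keeps just enough components to explain 90% of the variance:</p>
                    <ul class="nav nav-tabs">
                        <li class="active"><a href="#java_scree" data-toggle="tab">Java</a></li>
                    </ul>
                    <div class="tab-content">
                        <div class="tab-pane active" id="java_scree">
                            <div class="code" style="text-align: left;">
                                <pre class="prettyprint lang-java"><code>
    // A sketch: choose the smallest k whose leading components explain >= 90% of the variance.
    // cumulativeVarianceProportion() is assumed to return the running sum of varianceProportion().
    double[] cumulative = pca.cumulativeVarianceProportion();
    int k = 0;
    while (k < cumulative.length && cumulative[k] < 0.9) k++;
    System.out.println("Components needed for 90% of the variance: " + (k + 1));
                                </code></pre>
                            </div>
                        </div>
                    </div>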
<p>Finally, we plot the data projected into 3-dimensional space.</p>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/pca.png" class="enlarge" style="width: 480px;" />
<div class="caption" style="min-width: 480px;">PCA</div>
</div>
<h3 id="ppca">Probabilistic Principal Component Analysis</h3>
                    <p>Probabilistic principal component analysis is a simplified factor analysis
                        that employs a latent variable model with a linear relationship:</p>
                    <pre class="prettyprint lang-html"><code>
    y ∼ W * x + μ + ε
                    </code></pre>
                    <p>where the latent variables x ∼ N(0, I), the error (or noise) ε ∼ N(0, Ψ),
                        and μ is the location term (mean). In probabilistic PCA, an isotropic noise model is used,
                        i.e., the noise variances are constrained to be equal (Ψ<sub>i</sub> = σ<sup>2</sup>).
                        A closed-form estimate of the above parameters can be obtained
                        by the maximum likelihood method.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_8" data-toggle="tab">Java</a></li>
<li><a href="#scala_8" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_8">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def ppca(data: Array[Array[Double]], k: Int): PPCA
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_8">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class PPCA {
public static PPCA fit(double[][] data, int k);
}
</code></pre>
</div>
</div>
</div>
<p>Similar to PCA example, we apply probabilistic PCA to the pen-digits data.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_9" data-toggle="tab">Java</a></li>
<li><a href="#scala_9" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_9">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
val pc = ppca(df, 3)
val x2 = pc(x)
show(plot(x2, y, '.'))
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_9">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var pca = ProbabilisticPCA.fit(x, 3);
var x2 = pca.apply(x);
ScatterPlot.of(x2, y, '.').canvas().window();
</code></pre>
</div>
</div>
</div>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/ppca.png" class="enlarge" style="width: 480px;" />
<div class="caption" style="min-width: 480px;">Probabilistic PCA</div>
</div>
<h3 id="kpca">Kernel Principal Component Analysis</h3>
<p>Principal component analysis can be employed in a nonlinear way by means
                        of the kernel trick. The resulting technique, called Kernel PCA, is capable of constructing
                        nonlinear mappings that maximize the variance in the data. Other prominent nonlinear techniques
include manifold learning techniques such as locally linear embedding
(LLE), Hessian LLE, Laplacian eigenmaps, and LTSA. These techniques
construct a low-dimensional data representation using a cost function
that retains local properties of the data, and can be viewed as defining
a graph-based kernel for Kernel PCA.</p>
<p>In practice, a large data set leads to a large Kernel/Gram matrix <code>K</code>, and
storing K may become a problem. One way to deal with this is to perform
clustering on your large dataset, and populate the kernel with the means
of those clusters. Since even this method may yield a relatively large K,
it is common to compute only the top P eigenvalues and eigenvectors of <code>K</code>.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_10" data-toggle="tab">Java</a></li>
<li><a href="#scala_10" data-toggle="tab">Scala</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_10">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def kpca[T](data: Array[T], kernel: MercerKernel[T], k: Int, threshold: Double = 0.0001): KPCA[T]
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_10">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class KPCA {
public static KPCA<T> fit(T[] data, MercerKernel<T> kernel, int k, double threshold);
}
</code></pre>
</div>
</div>
</div>
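                    <p>As a usage sketch following the signature above (the kernel bandwidth and the projection call
                        are assumptions for illustration and may differ across Smile versions), Kernel PCA can be applied
                        to the pen-digits features <code>x</code> from the earlier PCA example with a Gaussian kernel:</p>
                    <ul class="nav nav-tabs">
                        <li class="active"><a href="#java_kpca" data-toggle="tab">Java</a></li>
                    </ul>
                    <div class="tab-content">
                        <div class="tab-pane active" id="java_kpca">
                            <div class="code" style="text-align: left;">
                                <pre class="prettyprint lang-java"><code>
    // A hedged sketch: fit Kernel PCA with a Gaussian (RBF) kernel on the
    // pen-digits features x from the PCA example above.
    var kernel = new smile.math.kernel.GaussianKernel(8.0);  // bandwidth chosen for illustration
    var kpca = KPCA.fit(x, kernel, 3, 0.0001);
    // Project the data onto the first 3 kernel principal components and plot.
    // NOTE: the projection method name is an assumption here.
    var x2 = kpca.apply(x);
    ScatterPlot.of(x2, y, '.').canvas().window();
                                </code></pre>
                            </div>
                        </div>
                    </div>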