Commit
Adding changes to notebook for session1
jordancaraballo committed Apr 12, 2023
1 parent a95b21f commit ae24487
Showing 1 changed file with 106 additions and 13 deletions.
119 changes: 106 additions & 13 deletions session1/1-ML-Algorithms-Introduction-Session1.ipynb
@@ -39,7 +39,7 @@
"outputId": "42f3e87c-85a0-4f38-eeef-b55dd2f3ca00"
},
"id": "fhjuYF52Q8ot",
"execution_count": 2,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -73,7 +73,7 @@
"outputId": "e2666868-7292-4c23-bc30-7095cfea6cac"
},
"id": "a_HQqsZTRIIO",
"execution_count": 3,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -176,7 +176,7 @@
},
{
"cell_type": "code",
"execution_count": 30,
"execution_count": null,
"id": "0126240c",
"metadata": {
"id": "0126240c"
@@ -240,7 +240,7 @@
"id": "cBVcWxlaYiiD"
},
"id": "cBVcWxlaYiiD",
"execution_count": 6,
"execution_count": null,
"outputs": []
},
{
@@ -445,7 +445,7 @@
"outputId": "c97eaff7-bbf7-4029-aa23-5f1500d91126"
},
"id": "zMm1avQyXkvy",
"execution_count": 7,
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
@@ -577,7 +577,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"id": "fce50aa4",
"metadata": {
"id": "fce50aa4"
@@ -606,7 +606,7 @@
"outputId": "11e15c1e-bffc-4de5-980a-299a482a1dbe"
},
"id": "NNfgje2gX_1C",
"execution_count": 24,
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
@@ -1162,7 +1162,7 @@
"outputId": "553724f0-ab7a-4200-d994-98e1dd6eb5af"
},
"id": "S9SpCCA2aZnZ",
"execution_count": 29,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -1246,7 +1246,7 @@
"outputId": "be06e59b-9338-47e7-b717-9c29d653db69"
},
"id": "e0UtkODubN_8",
"execution_count": 48,
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
@@ -1293,7 +1293,7 @@
"outputId": "0fff9047-70de-445f-f36e-4cf8f5628029"
},
"id": "sZpqweJSdH0g",
"execution_count": 49,
"execution_count": null,
"outputs": [
{
"output_type": "stream",
@@ -1333,7 +1333,7 @@
"outputId": "417a47bd-3ad2-44a7-f16a-fd200fbed4b0"
},
"id": "c-kG1Dtnh03e",
"execution_count": 50,
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
@@ -1370,9 +1370,102 @@
"\n",
"Spatially aware models like Convolutional Neural Networks often require contiguous pixels in order to perform additional spatial pattern discovery. Algorithms like Random Forest and XGBoost, which rely only on the values of each feature, can be trained from simple point observations, which are often cheaper to obtain.\n",
"\n",
"Let's introduce the basics of a Random Forest here. We basically have a set of independent trees looking at folds of the data that then take a decision by majority voting.\n",
"### Random Forest\n",
"\n",
"![image](https://1.bp.blogspot.com/-Ax59WK4DE8w/YK6o9bt_9jI/AAAAAAAAEQA/9KbBf9cdL6kOFkJnU39aUn4m8ydThPenwCLcBGAsYHQ/s0/Random%2BForest%2B03.gif)"
"Let's introduce the basics of a Random Forest here. We have a set of N observations where our target variable is a pixel whose possible outcomes are water or not-water. From here we create a set of independent trees, each trained on a random sample of the data drawn with replacement, that then make a decision by majority voting for classification, or by averaging when dealing with regression problems. This random resampling of the dataset is called bootstrapping.\n",
"\n",
"![image](https://1.bp.blogspot.com/-Ax59WK4DE8w/YK6o9bt_9jI/AAAAAAAAEQA/9KbBf9cdL6kOFkJnU39aUn4m8ydThPenwCLcBGAsYHQ/s0/Random%2BForest%2B03.gif)\n",
"\n",
"The Random Forest can often work on small datasets, but it will be highly dependent on the representativeness of the training data, with high variance on unseen samples.\n",
"\n",
"```python\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"# create model instance\n",
"rf = RandomForestClassifier()\n",
"# fit model\n",
"rf.fit(X_train, y_train)\n",
"# make predictions\n",
"preds = rf.predict(X_test)\n",
"```\n",
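"\n",
"The bootstrapping step described above can be sketched directly. This is a minimal illustration with NumPy (the array names are hypothetical), separate from what scikit-learn does internally:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"X = np.arange(10)  # ten observations\n",
"rng = np.random.default_rng(0)\n",
"# draw N indices with replacement - some observations repeat, others are left out\n",
"idx = rng.integers(0, len(X), size=len(X))\n",
"bootstrap_sample = X[idx]\n",
"# each tree in the forest is trained on a different bootstrap sample like this one\n",
"```\n",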
"\n",
"### XGBoost\n",
"\n",
"To improve upon the high variance of the Random Forest, the Extreme Gradient Boosting (XGBoost) algorithm was developed. Instead of training independent trees in parallel, it builds trees sequentially, prioritizing the training of each new tree toward improving on the samples misclassified by the previous ones.\n",
"\n",
"![image](https://dz2cdn1.dzone.com/storage/temp/13069527-boosting-algo.png)\n",
"\n",
"The use of the Gradient Descent algorithm can be explained simply as a strategy for minimizing the cost function, in this case the error between the predicted and actual y, through iterative optimization steps. XGBoost also uses regularization techniques that can help avoid overfitting when compared to the Random Forest.\n",
"\n",
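"The gradient descent strategy above can be sketched with a one-parameter example. This is a minimal illustration of the idea, not XGBoost's actual implementation:\n",
"\n",
"```python\n",
"# minimize the cost function f(w) = (w - 3)**2 with gradient descent\n",
"w = 0.0\n",
"lr = 0.1  # step size, analogous to XGBoost's learning_rate\n",
"for _ in range(100):\n",
"    grad = 2 * (w - 3)  # derivative of the cost function at w\n",
"    w -= lr * grad      # take an optimization step against the gradient\n",
"# w has now converged toward the minimum at w = 3\n",
"```\n",
"\n",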
"![image](https://miro.medium.com/v2/resize:fit:720/format:webp/1*QJZ6W-Pck_W7RlIDwUIN9Q.jpeg)\n",
"\n",
"```python\n",
"from xgboost import XGBClassifier\n",
"# create model instance\n",
"bst = XGBClassifier(\n",
" n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')\n",
"# fit model\n",
"bst.fit(X_train, y_train)\n",
"# make predictions\n",
"preds = bst.predict(X_test)\n",
"```\n",
"\n",
"### Neural Networks\n",
"\n",
"Further up in the complexity realm we have Neural Networks. These are the basis for deep learning algorithms, and their design is loosely inspired by the human brain. The main idea behind Neural Networks is the use of connected neurons that transmit information and are activated based on a threshold, normally defined by an activation function.\n",
"\n",
"As opposed to the tree ensembles we described above, neural networks use forward and backward propagation to learn. No trees are built, which makes them capable of extracting patterns out of the data without strictly requiring the kind of feature engineering needed by trees.\n",
"\n",
"The downside is that given this feature extraction process, neural networks need substantially more training data when compared to the tree ensemble algorithms.\n",
"\n",
"![image](https://cdn-images-1.medium.com/max/1024/1*CniSdF4zewDrajSHwCekSQ.gif)\n",
"\n",
"```python\n",
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"\n",
"# create model instance\n",
"nn = keras.Sequential([\n",
"    keras.layers.Reshape(target_shape=(28 * 28,), input_shape=(28, 28)),\n",
"    keras.layers.Dense(units=256, activation='relu'),\n",
"    keras.layers.Dense(units=192, activation='relu'),\n",
"    keras.layers.Dense(units=128, activation='relu'),\n",
"    keras.layers.Dense(units=10, activation='softmax')\n",
"])\n",
"\n",
"# compile model instance - extra step when compared to trees\n",
"nn.compile(optimizer='adam',\n",
"           loss=tf.losses.CategoricalCrossentropy(),\n",
"           metrics=['accuracy'])\n",
"\n",
"# fit model\n",
"nn.fit(\n",
"    X_train, y_train,\n",
"    epochs=10,\n",
"    validation_data=(X_val, y_val)\n",
")\n",
"\n",
"# make predictions\n",
"preds = nn.predict(X_test)\n",
"```\n",
"\n",
"### Convolutional Neural Networks\n",
"\n",
"All of the algorithms previously discussed work really well with structured data (tabular format), and Neural Networks normally handle both structured and unstructured data. However, none of them take spatial features into account when trying to learn from the data.\n",
"\n",
"Convolutional Neural Networks (CNNs) take the learning a step further by learning features not only from the individual pixels, but from their neighboring pixels as well. Using the same forward and backward propagation methods, CNNs use sliding windows to extract information from the local neighborhoods around pixels.\n",
"\n",
"These have proven to be extremely powerful in segmentation tasks, and are also capable of outputting continuous values with additional modifications to their networks.\n",
"\n",
"![image](https://www.mobiquity.com/hs-fs/hubfs/CNN03.gif?width=640&name=CNN03.gif)\n",
"\n",
"The challenge behind CNNs and Neural Networks is their need for large amounts of training data and their higher computational cost when compared to other algorithms like tree ensembles.\n",
"\n",
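"A minimal CNN sketch in the same Keras style as the dense network above (the layer sizes and input shape are illustrative assumptions):\n",
"\n",
"```python\n",
"from tensorflow import keras\n",
"\n",
"# convolutions learn from pixel neighborhoods via sliding windows,\n",
"# pooling reduces the spatial size, dense layers produce the final classes\n",
"cnn = keras.Sequential([\n",
"    keras.layers.Input(shape=(28, 28, 1)),\n",
"    keras.layers.Conv2D(16, kernel_size=3, activation='relu'),\n",
"    keras.layers.MaxPooling2D(pool_size=2),\n",
"    keras.layers.Conv2D(32, kernel_size=3, activation='relu'),\n",
"    keras.layers.MaxPooling2D(pool_size=2),\n",
"    keras.layers.Flatten(),\n",
"    keras.layers.Dense(units=10, activation='softmax')\n",
"])\n",
"\n",
"cnn.compile(optimizer='adam',\n",
"            loss='categorical_crossentropy',\n",
"            metrics=['accuracy'])\n",
"```\n",
"\n",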
"## Closing Thoughts\n",
"\n",
"- We have discussed ways of generating additional training data that might save some time in the overall process\n",
"- We have discussed the basics of several algorithms commonly used in Earth Science\n",
"- We have provided the base to understand which algorithm might be useful depending on the problem\n",
"- We have discussed techniques to better choose the initial algorithm to test, making it clear that you can always change the algorithm without major changes in your code"
]
}
],
