Commit
Merge sharelatex-2018-11-13-1208 into master
hbertrand authored Nov 13, 2018
2 parents fdbafb3 + 52c2c73 commit 355d6c8
Showing 2 changed files with 6 additions and 4 deletions.
8 changes: 5 additions & 3 deletions latex/chap_hyperopt.tex
@@ -757,7 +757,7 @@ \subsection{Hyperband}

The principle is simple: randomly pick a group of configurations from a uniform distribution, train the corresponding networks partially, evaluate them, resume training of the best-performing ones, and continue until a handful of them have been trained to completion. Then pick a new group and repeat the cycle until exhaustion of the available resources.
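A minimal sketch of one such cycle in Python (a round of successive halving); sample_config and train_and_evaluate are assumed helper functions, not part of the thesis:

def successive_halving(sample_config, train_and_evaluate,
                       n_configs=27, min_epochs=1, eta=3):
    # One Hyperband-style cycle: train many configurations briefly,
    # keep the best fraction 1/eta, and train the survivors longer.
    configs = [sample_config() for _ in range(n_configs)]
    epochs = min_epochs
    while len(configs) > 1:
        # Evaluate every surviving configuration at the current budget
        # (lower validation loss is better).
        scored = sorted(configs, key=lambda c: train_and_evaluate(c, epochs))
        # Keep the best 1/eta fraction and increase the budget.
        configs = scored[:max(1, len(configs) // eta)]
        epochs *= eta
    return configs[0]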

But a problem appears: at which point in the training can we start evaluating the models? Too soon and they will not have started to converge, making the evaluation meaningless, too late and we have wasted precious resources training under-performing models. Moreover, we do not know how to find that point and it changes from one task to the other. Hyperband's answer is to divide a cycle into brackets. Each bracket has the same quantity of resource at its disposal. The difference between brackets is the point at which they start evaluating the models. The first bracket will start evaluating and discarding models very early, allowing it to test a bigger number of configurations, while the last bracket will test only a small number of configurations but will train them until the end.
But a problem appears: at which point in the training can we start evaluating the models? Too soon and they will not have started to converge, making their evaluation meaningless; too late and we have wasted precious resources training under-performing models. Moreover, this minimum time before evaluation is hard to establish and changes from one task to the other. Hyperband's answer is to divide a cycle into brackets. Each bracket has the same quantity of resources at its disposal. The difference between brackets is the point at which they start evaluating the models. The first bracket starts evaluating and discarding models very early, allowing it to test a larger number of configurations, while the last bracket tests only a small number of configurations but trains them up to the maximum amount of resources allowed per model.

The algorithm is controlled by two hyper-parameters: the maximal quantity of resources $R$ that can be allocated to a given model, and the rate $\eta$ controlling the proportion of configurations kept at each evaluation. $R$ is typically specified in number of epochs or in minutes. At each evaluation, a fraction $1 / \eta$ of the models is kept while the rest are discarded.
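As a hedged worked example (the values $R = 81$ epochs and $\eta = 3$ are assumed here for illustration, not taken from the experiments): the most aggressive bracket starts $81$ configurations at one epoch each and keeps a third of them at every evaluation,
\begin{equation*}
(n_i, r_i): \; (81, 1) \to (27, 3) \to (9, 9) \to (3, 27) \to (1, 81),
\end{equation*}
where $n_i$ is the number of surviving configurations and $r_i$ the number of epochs each has received, while the most conservative bracket trains only a handful of configurations for the full $R = 81$ epochs from the start.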

@@ -998,7 +998,7 @@ \subsection{Hyper-parameter optimization}
An alternative way to handle un-trainable models would have been to assign an arbitrarily high loss to those models. This violates the smoothness assumption of the Gaussian process and would have strongly affected the predictions for nearby models, discouraging the acquisition function from choosing them. It would also have disqualified nearby models that do fit in memory, which is why we did not adopt this strategy in the first place.

It is not clear which strategy is better. Choosing a high loss sacrifices some valid models, but saves time by testing fewer un-trainable models. On the other hand, an un-trainable model fails immediately at model creation, so the gain in time is minor.
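A hedged sketch of the two strategies (the helper names and the MemoryError failure mode are assumptions for illustration, not from the thesis):

PENALTY_LOSS = 10.0  # arbitrary high loss, far above any real validation loss

def observe(config, build_model, train_model):
    # Returns an observation for the Gaussian process, or None to
    # discard the configuration without recording anything.
    try:
        model = build_model(config)  # un-trainable models fail here
    except MemoryError:
        return PENALTY_LOSS          # strategy 1: arbitrarily high loss
        # return None                # strategy 2: drop the point instead
    return train_model(model)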
But independently of the chosen strategy, the fact that some combinations lead to model too big to fit in memory highlights a flaw in the design of the hyper-parameters space and the resulting networks architectures. The un-trainable models are models with a low number of blocks but a high number of filters per layer. With one block and 64 filters per layer, the feature maps of the last convolutional layer are of size $64 \times 48 \times 48$, meaning the following fully-connected layer have $64 \times 48 \times 48 \times 4096 = 603,979,776$ weights. Since each block ends with a max-pooling layer, the number of weights quickly become manageable with additional blocks.
But independently of the chosen strategy, the fact that some combinations lead to models too big to fit in memory highlights a flaw in the design of the hyper-parameter space and the resulting network architectures. The un-trainable models are those with a low number of blocks but a high number of filters per layer. With one block and 64 filters per layer, the feature maps of the last convolutional layer are of size $64 \times 48 \times 48$, meaning the following fully-connected layer has $64 \times 48 \times 48 \times 4096 = 603,979,776$ weights. Since each block ends with a max-pooling layer, the number of weights quickly becomes manageable with additional blocks.
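To make the effect concrete (assuming each block's max-pooling halves the spatial dimensions and the filter count stays at 64, an assumption made here for illustration), with two and three blocks the fully-connected layer shrinks to
\begin{equation*}
64 \times 24 \times 24 \times 4096 \approx 1.5 \times 10^8
\quad \text{and} \quad
64 \times 12 \times 12 \times 4096 \approx 3.8 \times 10^7
\end{equation*}
weights respectively, a factor of four gained per additional block.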

\begin{table}
\centering
@@ -1045,7 +1045,8 @@ \subsection{From probabilities to a decision}

To answer this question, we processed a full body volume by classifying each of its slices with our best model.
%For each slice, the predicted class is the one with a probability higher than $0.7$, and if no class meets this criterion, then we do not choose any.
As we can see in Figure~\ref{fig:full_body}, the network is able to identify all body parts, despite the slices being processed independently. Nevertheless the network tends to misclassify the boundaries between regions, notably legs/pelvis and pelvis/abdomen. It also mistakenly identifies the empty slices above the head as being pelvis with a high confidence. This might be indicative of overfitting as those slices contain only noise. Interestingly, the straps used to hold the patient down, visible as the bands with high intensity on the extremities and low intensities on the body, trigger the network every time into the pelvis class. This is likely due to the rarity of those straps in the dataset.
As we can see in Figure~\ref{fig:full_body}, the network is able to identify all body parts, despite the slices being processed independently. Nevertheless the network tends to misclassify the boundaries between regions, notably legs/pelvis and pelvis/abdomen. It also mistakenly identifies the empty slices above the head as pelvis with high confidence. %This might be indicative of overfitting as those slices contain only noise.
Interestingly, the straps used to hold the patient down, visible as bands of high intensity at the extremities and low intensity on the body, systematically push the network towards the pelvis class. This is likely due to the rarity of those straps in the dataset.

But we are not interested in the mere probabilities; we want to make an actual decision, and therefore we need a decision scheme. A very simple one would be a threshold, say $0.7$, above which we consider the slice to be part of the class. However the predictions from the network are too noisy (see Figure~\ref{fig:full_body}-b), and this approach gives regions broken into multiple parts with messy boundaries.
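A minimal sketch of this naive decision scheme, which makes the fragmentation visible (the array layout is an assumption; the thesis does not give this code):

import numpy as np

def threshold_decision(P, tau=0.7):
    # P: (n_slices, n_classes) array of per-slice class probabilities.
    # Assign each slice the class whose probability exceeds tau,
    # or -1 (no class) when none does.
    labels = np.where(P.max(axis=1) > tau, P.argmax(axis=1), -1)
    # Collapse consecutive identical labels into (label, first, last) runs;
    # with noisy predictions a single anatomical region yields several runs.
    runs, start = [], 0
    for z in range(1, len(labels) + 1):
        if z == len(labels) or labels[z] != labels[start]:
            runs.append((int(labels[start]), start, z - 1))
            start = z
    return runs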

@@ -1055,6 +1056,7 @@ \subsection{From probabilities to a decision}
%$\forall i \in [1; N], Z^m_i < Z^M_i$ (the ending slice is after the starting slice) and
$\forall i \in [1; N], Z^m_{i+1} \ge Z^M_i$ (a region cannot start before the end of the previous region). The network outputs the probability $P_i \left( Z \right)$ that slice $Z$ belongs to region $i$.
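The linear program below is truncated in this diff, so as a hedged illustration of the same boundary search we give a dynamic-programming sketch under an assumed objective: maximize the probability mass captured inside the chosen regions, offset by $0.5$ per slice so that a region is not extended into low-confidence slices. Both this objective and the code are assumptions for illustration, not the thesis's exact program:

import numpy as np

def optimal_boundaries(P):
    # P: (n_slices, n_regions) array, P[z, i] = probability that slice z
    # belongs to region i. Returns ordered half-open (start, end) bounds,
    # one per region, respecting start_{i+1} >= end_i as in the text.
    S, N = P.shape
    # Prefix sums of the per-slice scores; the 0.5 offset is the
    # assumed design choice described above.
    C = np.vstack([np.zeros(N), np.cumsum(P - 0.5, axis=0)])
    f_prev = np.zeros(S + 1)  # best score with regions < i ending <= z
    trace = []
    for i in range(N):
        open_at = np.zeros(S + 1, dtype=int)
        close_at = np.zeros(S + 1, dtype=int)
        f_cur = np.empty(S + 1)
        best_open, best_open_z = -np.inf, 0
        best_f, best_f_z = -np.inf, 0
        for z in range(S + 1):
            if f_prev[z] - C[z, i] > best_open:  # candidate start of region i
                best_open, best_open_z = f_prev[z] - C[z, i], z
            open_at[z] = best_open_z
            if best_open + C[z, i] > best_f:     # candidate end of region i
                best_f, best_f_z = best_open + C[z, i], z
            f_cur[z] = best_f
            close_at[z] = best_f_z
        trace.append((open_at, close_at))
        f_prev = f_cur
    bounds, z = [], S
    for open_at, close_at in reversed(trace):  # backtrack the chosen bounds
        end = close_at[z]
        start = open_at[end]
        bounds.append((start, end))
        z = start  # the previous region must end at or before this start
    return bounds[::-1]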

We translate these boundary constraints into a linear program whose solutions are the optimal boundary slices:
\begin{equation}
\begin{array}{ll@{}ll}
2 changes: 1 addition & 1 deletion latex/main.tex
@@ -143,7 +143,7 @@
\addbibresource{references.bib}

% \includeonly{chap_introduction, chap_segmentation, chap_transfer_learning, chap_hyperopt, chap_conclusion}
\includeonly{chap_introduction}
\includeonly{chap_hyperopt}

\pagenumbering{roman}

