-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathrnnproblem.html
180 lines (136 loc) · 7.77 KB
/
rnnproblem.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>A question about RNNs</title>
<link href="https://cdn.jsdelivr.net/npm/tailwindcss/dist/tailwind.min.css" rel="stylesheet">
<style type="text/css">
/* body {
<link href="https://fonts.googleapis.com/css?family=Rokkitt" rel="stylesheet">
font-family: 'Rokkitt', serif;
font-size: 150%;
} */
.purple {
color: rgb(124, 25, 124);
}
.purple a:hover {
color: rgb(124, 25, 124);
background-color: rgb(203, 153, 203);
text-decoration: underline;
}
.purple a:visited {
color: rgb(70, 29, 70);
text-decoration: underline;
}
.purple a {
color: rgb(203, 153, 203);
text-decoration: underline;
}
.blue {
color: rgb(22, 56, 101);
}
.blue a:hover {
color: rgb(22, 56, 101);
background-color: rgb(108, 149, 203);
text-decoration: underline;
}
.blue a:visited {
color: rgb(30, 39, 52);
text-decoration: underline;
}
.blue a {
color: rgb(62, 96, 141);
text-decoration: underline;
}
.red {
color: rgb(100, 31, 34);
}
.red a:hover {
color: rgb(100, 31, 34);
background-color: rgb(151, 85, 87);
text-decoration: underline;
}
.red a:visited {
color: rgb(60, 22, 23);
text-decoration: underline;
}
.red a {
color: rgb(191, 71, 75);
text-decoration: underline;
}
.orange {
color: rgb(147, 94, 15);
}
.orange a:hover {
color: rgb(147, 94, 15);
background-color: rgb(213, 171, 108);
text-decoration: underline;
}
.orange a:visited {
color: rgb(94, 69, 30);
text-decoration: underline;
}
.orange a {
color: rgb(208, 147, 57);
text-decoration: underline;
}
</style>
<script type="text/javascript" id="MathJax-script" async
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>
</head>
<!--<body class="font-serif">-->
<!--<body class="bg-yellow-lighter">-->
<body class="bg-grey-lighter">
<div class="mx-auto px-2 min-h-screen min-h-full w-full md:w-3/4 lg:w-1/2 xl:w-1/2 bg-white shadow-md">
<div class="flex purple text-xl">
<div class="w-64 h-auto border-b-2 border-black pt-8 px-2 font-bold">
A Problem about RNNs
</div>
</div>
<div class="flex purple">
<div class="flex-no-shrink w-4 h-auto border-r-2 border-black"></div>
<div class="w-auto h-auto border-l-8 border-purple-lighter pl-3 py-1 ">
<p class="border-l-4 border-purple-light pl-2 my-2">
I want to write here about a problem I became interested in after meeting <a href="https://hanin.princeton.edu/">Boris Hanin</a> at the <a href="https://simons.berkeley.edu/programs/dl2019">Simons Institute Foundations of Deep Learning program</a> a few summers ago. I talked briefly with Boris about his work on the number of <a href="http://proceedings.mlr.press/v97/hanin19a/hanin19a.pdf">affine pieces of random neural networks</a>.
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
Boris's work showed that the number of affine pieces in a vanilla feedforward ReLU network was limited. I began to wonder if this was true for RNNs as well. It seems like RNNs have recursive computation that makes their outputs more complicated. They can even compute very weird functions like the devil's staircase if the weights are set carefully.
I ended up with the question: Is there some input data you can feed a ReLU RNN so that it has a chance of learning an infinite-affine-piece function?
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
Here is the conjecture I had, in formal language, and my thoughts:
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
Define a \(k\)-depth \(w\)-width 1-hidden-layer ReLU RNN, parametrized by the weight \(W \in \mathbb{R}^{w \times w}\), to be the function
$$ f_k(W, x) = (\text{ReLU} \circ W \circ \cdots \circ \text{ReLU} \circ W)(x)_1 $$
With \(k\) applications of \(W\). The subscript 1 denotes that the output of the network is the first component of the output layer vector.
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
<b>Conjecture</b>: There exists some finite data set \(\{(X_i, y_i)\}_{i=1}^n\) such that, if \(W\) is initialized with standard normal values, and the network is trained by gradient flow on the squared loss, then the expected number of affine pieces in the trained network is exponential in \(k\).
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
<b>Proof attempt (wrong)</b>. Consider <a href="https://en.wikipedia.org/wiki/Cantor_function">the devil's staircase function</a>. Note there is a choice of parameters \(W^*\) so that \(f_k\) computes the \(k\)th function in the devil's staircase iteration. Choose \(n >> w^2\) datapoints \((x, y)\) that exist in the limiting DS function, not on any of its discontinuities. Consider the loss landscape in the vicinity of \(W^*\). The landscape is smooth here - since we are not on a discontinuity the loss is a polynomial in the elements of \(W\) - and the loss has a minimum at \(W^*\). Furthermore, since there are more data points than parameters, presumably the losses associated with each of these data points mean all but a rank \(w^2-1\) subspace intersecting \(W^*\) has nonzero loss. We make sure that the data points are chosen such that the intersection of all of these subspaces is rank 0 on a sufficiently small neighborhood of \(W^*\).
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
There are obviously a few holes in this proof. The rank \(w^2-1\) zero loss subspaces for each datapoint are presumably curved, so these individual point losses are not convex, and I don't know if or how to show their sum is convex. Furthermore, some of the parameters may not affect the loss on any datapoint in the vicinity of \(W^*\), since they may only control the widths of the steps rather than their height or slope, so the zero loss subspaces for all datapoints may intersect on a line. Trying to move the datapoints to the corners of the staircase function opens up a host of new problems, such as pathological behavior in the loss gradient.
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
Does the devil's staircase work, or is a piecewise affine approximations of a quadratic function better? The function \(x \mapsto x(1-x)\) has the representation [see Yarotsky's 2016 paper].
$$ \frac14 t_1(x) + \frac{1}{16} t_2(x) + \frac{1}{64} t_4(x) + \dots $$
where \(t_i\) is the triangle wave with \(i\) cycles on \([0,1]\).
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
When I was studying this problem two years ago, I remember thinking a good candidate function would be.
$$ \frac14 t_1(x) + \frac{1}{32} t_2(x) + \frac{1}{128} t_4(x) + \dots $$
Because (I think) this would admit a \(W\) with functional norm \(< 1\), which was important somehow.
</p>
<p class="border-l-4 border-purple-light pl-2 my-2">
Anyway, I have stopped doing research on the theory of deep learning, so it is unlikely I will ever solve this problem. If you ever manage to prove this conjecture, or disprove it, <a href="mailto:[email protected]">I would be happy to hear about it</a>!
</p>
</div>
</div>
</div>
</body>
</html>