-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathrust_parser.html
402 lines (398 loc) · 36.1 KB
/
rust_parser.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="https://adriann.github.io/feed.rss" rel="alternate" type="application/rss+xml" title="What's new on adriann.github.io" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Adrian Neumann ([email protected])" />
<title>Writing a Simple Parser in Rust</title>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<style>
.caption{font-size:66%;text-align:right;}
.figure{float:right;padding-bottom:1em;padding-left:1em;}
.figure>img{display:block;margin:0 auto;}
.footnotes{font-size:80%;}
.block{border-left:1ex solid gray;padding-left:2em;}
li{padding:0.25em;}
a:hover{text-shadow: 0 0 5px;}
body{font-family:sans-serif;max-width:100ex;padding-left:3em;padding-right:2em;}
code{font-family:Consolas, Inconsolata, Monaco, monospace;}
p{text-align:justify;}
</style>
</head>
<body>
<div id="header">
<h1 class="title">Writing a Simple Parser in Rust</h1>
</div>
<p><em>Update May 2020</em> The code as written on the <a href="rust_parser_old.html">old version of this page</a> didn’t compile anymore as an astute reader noticed. I’ve updated this site so that it compiles with Rust 1.43.</p>
<p><em>Erratum</em> Boris Berger pointed out that I made a mistake in the grammar that allows parsing 3 * 4 + 5 as 3 * (4 + 5) instead of (3 * 4) + 5. This is now corrected.</p>
<p>In an effort to learn <a href="https://www.rust-lang.org">Rust</a> I wrote a parser for simple arithmetic expressions. I want to parse expressions of the form <code>1234 + 43* (34 +[2])</code> using a simple recursive descent parser. Maybe I’ll try one of the libraries for writing parsers next. <a href="https://github.com/Geal/nom">Nom</a> looks good.</p>
<p>First I define a grammar for my language. To refresh my memory about how grammars for arithmetic expressions should look like, I consult <a href="http://pages.cs.wisc.edu/~fischer/cs536.s08/course.hold/html/NOTES/3.CFG.html#exp">this site</a>. I want <code>*</code> to have higher precedence than <code>+</code> and of course expressions in parentheses should have higher precedence still.</p>
<p>The grammar I came up with is as follows:</p>
<pre><code> expr -> summand + expr | summand
summand -> term * summand | term
term -> NUMBER | ( expr )</code></pre>
<h2 id="types">Types</h2>
<p>Next I want define a type for items in this grammar. Normally I’d use inheritance, but Rust doesn’t have inheritance, so instead I use an <code>enum</code>. Enums in Rust are very useful because unlike in C I can add information to an enum value. For my grammar items, I add the value of the <code>NUMBER</code> terminal to the corresponding enum value.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="at">#[</span>derive<span class="at">(</span><span class="bu">Debug</span><span class="at">,</span> <span class="bu">Clone</span><span class="at">)]</span>
<span class="kw">pub</span> <span class="kw">enum</span> GrammarItem {
Product,
Sum,
Number(<span class="dt">u64</span>),
Paren
}</code></pre></div>
<p>The nodes of my parse tree are structs that contain a <code>GrammarItem</code> and children in a vector like so</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="at">#[</span>derive<span class="at">(</span><span class="bu">Debug</span><span class="at">,</span> <span class="bu">Clone</span><span class="at">)]</span>
<span class="kw">pub</span> <span class="kw">struct</span> ParseNode {
<span class="kw">pub</span> children: <span class="dt">Vec</span><ParseNode>,
<span class="kw">pub</span> entry: GrammarItem,
}
<span class="kw">impl</span> ParseNode {
<span class="kw">pub</span> <span class="kw">fn</span> new() -> ParseNode {
ParseNode {
children: <span class="dt">Vec</span>::new(),
entry: GrammarItem::Paren,
}
}
}</code></pre></div>
<p>I know that each node can have at most two children, so a vector of children is probably overkill. But by using a vector I don’t have to worry about using <code>Box</code> to avoid <a href="https://stackoverflow.com/q/25296195">recursive types</a>.</p>
<p>I later on noticed that I could have saved a lot of <code>mut</code> and a couple of lines if I had made it posible to pass in the <code>entry</code> into the <code>new</code>. As it is right now I have to create the node and change it afterwards to set the entry to a value that I want. I also have to rely on the compiler to optimize the dead store away, or I waste some cycles. (I waste lots of cycles in this toy program anyway, so I don’t really care.)</p>
<h2 id="lexing">Lexing</h2>
<p>Usually one parses by first lexing the input and then constructing the parse tree. The <code>lex</code> function gets a <code>String</code> and turns it into a vector of tokens. So first I define another type for tokens. Again I use an enum.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="at">#[</span>derive<span class="at">(</span><span class="bu">Debug</span><span class="at">,</span> <span class="bu">Clone</span><span class="at">)]</span>
<span class="kw">pub</span> <span class="kw">enum</span> LexItem {
Paren(<span class="dt">char</span>),
Op(<span class="dt">char</span>),
Num(<span class="dt">u64</span>),
}</code></pre></div>
<p>I could have used more enum values to distinguish between <code>+</code> and <code>*</code> and the different types of parentheses, but instead I just store the character. It probably would have been a good idea to add another integer to each <code>LexItem</code> that stores the location in the input at which the token starts. That would make error reporting more useful. Instead I will just use the position in the token stream for my errors. Since I’m the only user of this program, I will only be angry at myself, so it’s okay.</p>
<p>The language I want to parse is very simple to lex. Except numbers, all tokens are just a single character long. So instead of complicated things with regular expressions. Instead I iterate over the characters of my input <code>String</code> and use a <code>match</code> do create a <code>LexItem</code>. The <code>match</code> statement is really handy here, since I can specify multiple alternatives for the same case with <code>|</code> and ranges of characters are also supported.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">fn</span> lex(input: &<span class="dt">String</span>) -> <span class="dt">Result</span><<span class="dt">Vec</span><LexItem>, <span class="dt">String</span>> {
<span class="kw">let</span> <span class="kw">mut</span> result = <span class="dt">Vec</span>::new();
<span class="kw">let</span> <span class="kw">mut</span> it = input.chars().peekable();
<span class="kw">while</span> <span class="kw">let</span> <span class="cn">Some</span>(&c) = it.peek() {
<span class="kw">match</span> c {
<span class="ch">'0'</span>..=<span class="ch">'9'</span> => {
it.next();
<span class="kw">let</span> n = get_number(c, &<span class="kw">mut</span> it);
result.push(LexItem::Num(n));
}
<span class="ch">'+'</span> | <span class="ch">'*'</span> => {
result.push(LexItem::Op(c));
it.next();
}
<span class="ch">'('</span> | <span class="ch">')'</span> | <span class="ch">'['</span> | <span class="ch">']'</span> | <span class="ch">'{'</span> | <span class="ch">'}'</span> => {
result.push(LexItem::Paren(c));
it.next();
}
<span class="ch">' '</span> => {
it.next();
}
_ => {
<span class="kw">return</span> <span class="cn">Err</span>(<span class="pp">format!</span>(<span class="st">"unexpected character {}"</span>, c));
}
}
}
<span class="cn">Ok</span>(result)
}</code></pre></div>
<p>I have the feeling that this method could be improved so that the vector doesn’t have to be mutable and is instead constructed directly using something like <code>collect</code>. I Python I would have written a generator from the loop and collected all <code>yield</code>-ed items in a list. If it were sufficient to consume only single characters, I could use <code>map</code> and <code>collect</code> to build my vector. But since <code>get_number</code> eats a whole number, I’m not sure how to do it.</p>
<p>I’m not particularly happy about this function because I need a <code>peekable</code> iterator and I call <code>next()</code> so often. I would prefer to call <code>next</code> only in the <code>while</code>, but I couldn’t figure out a way that lets <code>get_number</code> consume a whole number without consuming the first character <em>after</em> the number as well. If I were to call <code>next</code> in the beginner of the loop, I wouldn’t see that character.</p>
<p>For the same reason I can’t use <code>take_while</code> to get only the part of the iterator that contains digits into <code>get_number</code>. That function would hide the character after the number from my lexer.</p>
<p>To extract a number from the input I use the following function.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">fn</span> get_number<T: <span class="bu">Iterator</span><Item = <span class="dt">char</span>>>(c: <span class="dt">char</span>, iter: &<span class="kw">mut</span> Peekable<T>) -> <span class="dt">u64</span> {
<span class="kw">let</span> <span class="kw">mut</span> number = c.to_string().parse::<<span class="dt">u64</span>>().expect(<span class="st">"The caller should have passed a digit."</span>);
<span class="kw">while</span> <span class="kw">let</span> <span class="cn">Some</span>(<span class="cn">Ok</span>(digit)) = iter.peek().map(|c| c.to_string().parse::<<span class="dt">u64</span>>()) {
number = number * <span class="dv">10</span> + digit;
iter.next();
}
number
}</code></pre></div>
<p>Here again I can’t use <code>next</code> in the <code>while</code> because I don’t want to consume the first non-digit, like <code>take_while</code> would. Instead I only <code>peek</code> at the next character. I use <code>map</code> to attempt a parsing of the digit into an int only in the <code>Some</code> case without having to do another <code>if let</code>. In case of <code>None</code> from the <code>peek</code>, <code>map</code> is a no-op. Maybe I should do some calculations involving the ASCII value of <code>c</code>, or alternatively extract the whole number as a slice and <code>parse</code> that to be more efficient.</p>
<h2 id="parsing">Parsing</h2>
<p>The next step is to actually start constructing the parse tree. My parse function looks like this:</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">pub</span> <span class="kw">fn</span> parse(input: &<span class="dt">String</span>) -> <span class="dt">Result</span><ParseNode, <span class="dt">String</span>> {
<span class="kw">let</span> tokens = lex(input)?;
parse_expr(&tokens, <span class="dv">0</span>).and_then(|(n, i)| <span class="kw">if</span> i == tokens.len() {
<span class="cn">Ok</span>(n)
} <span class="kw">else</span> {
<span class="cn">Err</span>(<span class="pp">format!</span>(<span class="st">"Expected end of input, found {:?} at {}"</span>, tokens[i], i))
})
}</code></pre></div>
<p>Parsing can fail, so I return a <code>Result</code>. I first attempt to lex the input using the <code>lex</code> function, which I will show you in a moment. Lexing can also fail, so <code>lex</code> also returns a <code>Result</code>. At first I had a <code>match</code> on the return value of <code>lex</code>, so that I could call the parsing function only upon success, but then I learned about the <code>?</code> operator. That’s a neat helper function that takes a <code>Result</code> and unwraps it if it’s ok and otherwise bails from the function and returns the error.</p>
<p>If the lexing succeeds I stuff the tokens into the <code>parse_expr</code> function. The second parameter tells the function that it should start at the beginning. Since parsing can fail, <code>parse_expr</code> also returns a Result. In the success case, it returns a parse tree and an index one-past the last token it consumed. It can happen that the <code>parse_expr</code> function manages to construct a parse tree, but doesn’t consume all input. For example for the input string <code>(1+2)(3+4)</code> we manage to parse the <code>(1+2)</code> prefix, but then get stuck. In the success case, I want to check that the index returned indicates that we consumed all tokens. This doesn’t happen in the <code>parse_expr</code> function itself, because I want to use it recursively to parse the <code>( expr )</code> production in the grammar. In that case it is expected to stop parsing before consuming the closing <code>)</code>.</p>
<p>This time I use the <code>and_then</code> function of the <code>Result</code> type to continue the computation if parsing was successful (I think I could have used another <code>?</code> if I wanted to. I don’t know enough Rust to say which is more idiomatic). The closure that I put into <code>and_then</code> gets the <code>ParseNode</code> <code>n</code> and the index <code>i</code>. If the index is too small, I error out. The error message is constructed using the <code>format!</code> macro.</p>
<p>I use the <code>parse_expr</code> function for a simple recursive descent. It looks like this:</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">fn</span> parse_expr(tokens: &<span class="dt">Vec</span><LexItem>, pos: <span class="dt">usize</span>) -> <span class="dt">Result</span><(ParseNode, <span class="dt">usize</span>), <span class="dt">String</span>> {
<span class="kw">let</span> (node_summand, next_pos) = parse_summand(tokens, pos)?;
<span class="kw">let</span> c = tokens.get(next_pos);
<span class="kw">match</span> c {
<span class="cn">Some</span>(&LexItem::Op(<span class="ch">'+'</span>)) => {
<span class="co">// recurse on the expr</span>
<span class="kw">let</span> <span class="kw">mut</span> sum = ParseNode::new();
sum.entry = GrammarItem::Sum;
sum.children.push(node_summand);
<span class="kw">let</span> (rhs, i) = parse_expr(tokens, next_pos + <span class="dv">1</span>)?;
sum.children.push(rhs);
<span class="cn">Ok</span>((sum, i))
}
_ => {
<span class="co">// we have just the summand production, nothing more.</span>
<span class="cn">Ok</span>((node_summand, next_pos))
}
}
}</code></pre></div>
<p>Instead of an iterator I use an index into the vector. I found this easier because <code>usize</code> can be copied around implicitly and I have no trouble with the borrow-checker. I first attempt to parse a summand with <code>parse_summand</code>. This returns a <code>Result</code>. I use <code>?</code> to continue the computation if it succeeds. If I successfully parsed a summand, I check whether the next token is a <code>+</code>. I that case I need to parse the RHS of the <code>+</code> recursively. The function to parse a summand looks very similar.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">fn</span> parse_summand(tokens: &<span class="dt">Vec</span><LexItem>, pos: <span class="dt">usize</span>) -> <span class="dt">Result</span><(ParseNode, <span class="dt">usize</span>), <span class="dt">String</span>> {
<span class="kw">let</span> (node_term, next_pos) = parse_term(tokens, pos)?;
<span class="kw">let</span> c = tokens.get(next_pos);
<span class="kw">match</span> c {
<span class="cn">Some</span>(&LexItem::Op(<span class="ch">'*'</span>)) => {
<span class="co">// recurse on the summand</span>
<span class="kw">let</span> <span class="kw">mut</span> product = ParseNode::new();
product.entry = GrammarItem::Product;
product.children.push(node_term);
<span class="kw">let</span> (rhs, i) = parse_summand(tokens, next_pos + <span class="dv">1</span>)?;
product.children.push(rhs);
<span class="cn">Ok</span>((product, i))
}
_ => {
<span class="co">// we have just the term production, nothing more.</span>
<span class="cn">Ok</span>((node_term, next_pos))
}
}
}</code></pre></div>
<p>I suppose I could have abstracted a bit, but I don’t know whether the reduced code duplication would have been worth the increased complexity.</p>
<p>The function to parse a term looks most complicated, because this is where I generate all my more or less helpful error messages.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">fn</span> parse_term(tokens: &<span class="dt">Vec</span><LexItem>, pos: <span class="dt">usize</span>) -> <span class="dt">Result</span><(ParseNode, <span class="dt">usize</span>), <span class="dt">String</span>> {
<span class="kw">let</span> c: &LexItem = tokens.get(pos)
.ok_or(<span class="dt">String</span>::from(<span class="st">"Unexpected end of input, expected paren or number"</span>))?;
<span class="kw">match</span> c {
&LexItem::Num(n) => {
<span class="kw">let</span> <span class="kw">mut</span> node = ParseNode::new();
node.entry = GrammarItem::Number(n);
<span class="cn">Ok</span>((node, pos + <span class="dv">1</span>))
}
&LexItem::Paren(c) => {
<span class="kw">match</span> c {
<span class="ch">'('</span> | <span class="ch">'['</span> | <span class="ch">'{'</span> => {
parse_expr(tokens, pos + <span class="dv">1</span>).and_then(|(node, next_pos)| {
<span class="kw">if</span> <span class="kw">let</span> <span class="cn">Some</span>(&LexItem::Paren(c2)) = tokens.get(next_pos) {
<span class="kw">if</span> c2 == matching(c) {
<span class="co">// okay!</span>
<span class="kw">let</span> <span class="kw">mut</span> paren = ParseNode::new();
paren.children.push(node);
<span class="cn">Ok</span>((paren, next_pos + <span class="dv">1</span>))
} <span class="kw">else</span> {
<span class="cn">Err</span>(<span class="pp">format!</span>(<span class="st">"Expected {} but found {} at {}"</span>,
matching(c),
c2,
next_pos))
}
} <span class="kw">else</span> {
<span class="cn">Err</span>(<span class="pp">format!</span>(<span class="st">"Expected closing paren at {} but found {:?}"</span>,
next_pos,
tokens.get(next_pos)))
}
})
}
_ => <span class="cn">Err</span>(<span class="pp">format!</span>(<span class="st">"Expected paren at {} but found {:?}"</span>, pos, c)),
}
}
_ => {
<span class="cn">Err</span>(<span class="pp">format!</span>(<span class="st">"Unexpected token {:?}, expected paren or number"</span>, {
c
}))
}
}
}</code></pre></div>
<p>The function is a lot simpler than the deep nesting makes it seem. We just check whether the next token is a number or a parenthesis. If it’s a number we simply return it. If it’s a parenthesis we parse the contained expression recursively and then check that the next token is a matching closing parenthesis. The function <code>matching</code> returns the matching parenthesis using a simple <code>match</code></p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">fn</span> matching(c: <span class="dt">char</span>) -> <span class="dt">char</span> {
<span class="kw">match</span> c {
<span class="ch">')'</span> => <span class="ch">'('</span>,
<span class="ch">']'</span> => <span class="ch">'['</span>,
<span class="ch">'}'</span> => <span class="ch">'{'</span>,
<span class="ch">'('</span> => <span class="ch">')'</span>,
<span class="ch">'['</span> => <span class="ch">']'</span>,
<span class="ch">'{'</span> => <span class="ch">'}'</span>,
_ => <span class="pp">panic!</span>(<span class="st">"should have been a parenthesis!"</span>),
}
}</code></pre></div>
<p>The last thing we need for this simple parser is a way to supply it with input. I use a the command line argument as input. The <code>main</code> function looks like this.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">fn</span> main() {
<span class="kw">let</span> args: <span class="dt">Vec</span><_> = env::args().collect();
<span class="kw">if</span> args.len() > <span class="dv">1</span> {
<span class="pp">println!</span>(<span class="st">"The first argument is {}"</span>, args[<span class="dv">1</span>]);
<span class="pp">println!</span>(<span class="st">"{:?}"</span>, parse(&args[<span class="dv">1</span>]));
}
}</code></pre></div>
<p>The output is not terribly pretty:</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust">$ ./main <span class="st">"1234 + 43* (34 +[2])"</span>
The first argument is <span class="dv">1234</span> + <span class="dv">43</span>* (<span class="dv">34</span> +[<span class="dv">2</span>])
<span class="cn">Ok</span>(ParseNode { children: [ParseNode { children: [], entry: Number(<span class="dv">1234</span>) }, ParseNode {
children: [ParseNode { children: [], entry: Number(<span class="dv">43</span>) }, ParseNode { children: [ParseNode {
children: [ParseNode { children: [], entry: Number(<span class="dv">34</span>) }, ParseNode { children: [ParseNode {
children: [], entry: Number(<span class="dv">2</span>) }], entry: Paren }], entry: Sum }], entry: Paren }], entry:
Product }], entry: Sum })</code></pre></div>
<p>If you properly indent it, it looks like this:</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="cn">Ok</span>(
ParseNode {
children: [
ParseNode { children: [], entry: Number(<span class="dv">1234</span>) },
ParseNode { children: [
ParseNode { children: [], entry: Number(<span class="dv">43</span>) },
ParseNode { children: [
ParseNode { children: [
ParseNode { children: [], entry: Number(<span class="dv">34</span>) },
ParseNode { children: [
ParseNode { children: [], entry: Number(<span class="dv">2</span>) }
],
entry: Paren
}],
entry: Sum
}],
entry: Paren
}],
entry: Product
}],
entry: Sum })</code></pre></div>
<h2 id="pretty-printing">Pretty Printing</h2>
<p>It is very simple to write a pretty-printer for <code>ParseNode</code>. It’s just a combination of <code>match</code> and <code>format!</code>. I should have done this first, it would have made debugging easier.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">pub</span> <span class="kw">fn</span> print(tree: &ParseNode) -> <span class="dt">String</span> {
<span class="kw">match</span> tree.entry {
GrammarItem::Paren => {
<span class="pp">format!</span>(<span class="st">"({})"</span>,
print(tree.children.get(<span class="dv">0</span>).expect(<span class="st">"parens need one child"</span>)))
}
GrammarItem::Sum => {
<span class="kw">let</span> lhs = print(tree.children.get(<span class="dv">0</span>).expect(<span class="st">"sums need two children"</span>));
<span class="kw">let</span> rhs = print(tree.children.get(<span class="dv">1</span>).expect(<span class="st">"sums need two children"</span>));
<span class="pp">format!</span>(<span class="st">"{} + {}"</span>, lhs, rhs)
}
GrammarItem::Product => {
<span class="kw">let</span> lhs = print(tree.children.get(<span class="dv">0</span>).expect(<span class="st">"products need two children"</span>));
<span class="kw">let</span> rhs = print(tree.children.get(<span class="dv">1</span>).expect(<span class="st">"products need two children"</span>));
<span class="pp">format!</span>(<span class="st">"{} * {}"</span>, lhs, rhs)
}
GrammarItem::Number(n) => <span class="pp">format!</span>(<span class="st">"{}"</span>, n),
}
}</code></pre></div>
<p>After integrating that function into <code>main</code>, the output is readable again:</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust">$ ./main <span class="st">"1234 + 43* (34 +[2])"</span>
The first argument is <span class="dv">1234</span> + <span class="dv">43</span>* (<span class="dv">34</span> +[<span class="dv">2</span>])
<span class="cn">Ok</span>(ParseNode { children: [ParseNode { children: [], entry: Number(<span class="dv">1234</span>) }, ParseNode {
children: [ParseNode { children: [], entry: Number(<span class="dv">43</span>) }, ParseNode { children: [ParseNode {
children: [ParseNode { children: [], entry: Number(<span class="dv">34</span>) }, ParseNode { children: [ParseNode {
children: [], entry: Number(<span class="dv">2</span>) }], entry: Paren }], entry: Sum }], entry: Paren }], entry:
Product }], entry: Sum })
<span class="dv">1234</span> + <span class="dv">43</span> * (<span class="dv">34</span> + (<span class="dv">2</span>))</code></pre></div>
<h2 id="testing">Testing</h2>
<p>The next step is testing the parser properly using QuickCheck. This is another thing that I should have done earlier. I actually found a small “bug” using QuickCheck. In the types I used to use <code>i64</code>, but I can only handle positive numbers. QuickCheck produced inputs with negative numbers which I failed to parse.</p>
<p>But let’s not get ahead of ourselves. To test with QuickCheck, I have to find a suitable invariant for my program that can be tested by throwing random inputs at it. The simplest thing for a deterministic parser like this is that pretty printing a parse tree and parsing the result again yields back the original tree.</p>
<p>In QuickCheck you have to implement a generator for test inputs. The trait is called <code>Arbitrary</code> and has two methods: <code>arbitrary</code> and <code>shrink</code>. The point of <code>shrink</code> is to provide a way for QuickCheck to reduce counterexamples to your invariant to something manageable. I only bothered to implement the <code>arbitrary</code> function and use the <code>empty_shrinker</code> to satisfy the interface.</p>
<p>I generate <code>ParseNode</code> instances recursively. The base case is a simple number. Otherwise I generate one of <code>Paren</code>, <code>Sum</code>, and <code>Product</code> and fill the children recursively. I use the random generator <code>Gen</code> as provided by QuickCheck. To generate a <code>GrammarItem</code> using the generator, I implemented the <code>Distribution<GrammarItem> for Standard</code> trait for the enum, which is how the <code>rand</code> crate likes to things. I found it a little strange that I had to do that manually, I would expect enums to magically work. Anyway, the implementation is not difficult. I just generate a random int and match over it.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">impl</span> Distribution<GrammarItem> <span class="kw">for</span> Standard {
<span class="kw">fn</span> sample<R: Rng + ?<span class="bu">Sized</span>>(&<span class="kw">self</span>, rng: &<span class="kw">mut</span> R) -> GrammarItem {
<span class="kw">let</span> n = rng.gen_range(<span class="dv">0</span>, <span class="dv">4</span>);
<span class="kw">match</span> n {
<span class="dv">0</span> => GrammarItem::Product,
<span class="dv">1</span> => GrammarItem::Sum,
<span class="dv">2</span> => GrammarItem::Paren,
<span class="dv">3</span> => GrammarItem::Number(rng.gen()),
_ => <span class="pp">panic!</span>(<span class="st">"unexpected number"</span>),
}
}
}
<span class="kw">fn</span> rec_node<G: Gen>(i: <span class="dt">u32</span>, g: &<span class="kw">mut</span> G) -> ParseNode {
<span class="kw">let</span> <span class="kw">mut</span> node = ParseNode::new();
<span class="kw">if</span> i >= <span class="dv">1</span> {
<span class="kw">loop</span> {
node.entry = g.gen();
<span class="kw">match</span> node.entry {
GrammarItem::Paren => {
node.children.push(rec_node(i - <span class="dv">1</span>, g));
<span class="kw">break</span>;
}
GrammarItem::Sum | GrammarItem::Product => {
node.children.push(rec_node(i - <span class="dv">1</span>, g));
node.children.push(rec_node(i - <span class="dv">1</span>, g));
<span class="kw">break</span>;
}
GrammarItem::Number(_) => {}
}
}
} <span class="kw">else</span> {
node.entry = GrammarItem::Number(g.gen());
}
node
}</code></pre></div>
<p>Then to implement <code>Arbitrary</code> I just call that function with a small value for <code>i</code>.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="kw">impl</span> Arbitrary <span class="kw">for</span> ParseNode {
<span class="kw">fn</span> arbitrary<G: Gen>(g: &<span class="kw">mut</span> G) -> ParseNode {
rec_node(<span class="dv">5</span>, g)
}
<span class="kw">fn</span> shrink(&<span class="kw">self</span>) -> <span class="dt">Box</span><dyn <span class="bu">Iterator</span><Item = <span class="kw">Self</span>>> {
empty_shrinker()
}
}</code></pre></div>
<p>The property I want to test is very simple to state. I use the <code>quickcheck!</code> macro to turn a function returning a <code>bool</code> into a QuickCheck test.</p>
<div class="sourceCode"><pre class="sourceCode rust"><code class="sourceCode rust"><span class="pp">quickcheck!</span> {
<span class="kw">fn</span> prop(xs: ParseNode) -> <span class="dt">bool</span> {
<span class="kw">let</span> pp = print(&xs);
<span class="pp">println!</span>(<span class="st">"instance {}"</span>, pp);
<span class="kw">let</span> parsed = parse(&pp).unwrap();
<span class="pp">assert!</span>(pp == print(&parsed));
<span class="cn">true</span>
}
}</code></pre></div>
<p>The output is only shown by QuickCheck in case the property is not satisfied, so it’s okay to just spam the pretty printed tree there, it won’t clutter up my test runs unless I need the information.</p>
<p>As I said above, I actually found inputs where my property doesn’t hold. Namely, I can pretty print parse trees containing negative numbers, but I don’t parse them back. I fixed this by replacing <code>i64</code> by <code>u64</code> wherever appropriate. After this change the test runs green.</p>
<p>I could have tested more thoroughly. Pretty printing always produces nice whitespace and only uses ( and ) for parentheses. But I don’t think that it’s worth the effort to change this. I also don’t test error cases. Writing a generator that produces expression strings that trigger a particular error seems difficult, I think more traditional testing is better for this. But since writing normal tests is really easy in Rust (just write <code>[#test]</code> in front of a function), I won’t bother for this toy project.</p>
<p>So that about wraps it up. I learned a bit. The borrow checker is pretty helpful. Most of the time when it nagged me it was right and I was wrong. Enums and match are very useful for this kind of project. Testing is a breeze.</p>
<h2 id="get-the-code">Get the code</h2>
<p>The whole code for this article, included the ommitted lines that define modules and include crates and stuff is available here: https://github.com/adrianN/simple_rust_parser</p>
<hr/>
<div style="display:inline-flex;flex-wrap:wrap;justify-content:space-between;font-size:80%">
<p style="margin-right:2ex">CC-BY-SA <a href="mailto:[email protected]">Adrian Neumann</a> (PGP Key <a href="https://adriann.github.io/ressources/pub.asc">A0A8BC98</a>)</p>
<p style="margin-left:1ex;margin-right:1ex"><a href="http://adriann.github.io">adriann.github.io</a></p>
<p style="margin-left:2ex"><a href="https://adriann.github.io/feed.rss">RSS</a></p>
</div>
</body>
</html>