---
title: "Proto Challenge"
output:
html_notebook: default
html_document:
df_print: paged
pdf_document: default
---
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook to address the coding challenge presented to Peter Edstrom on January 5th, 2018.
## Challenge Goals
The full challenge text can be found in [README.md](byte-reader/README.md). We aim to parse [data.dat](byte-reader/data.dat) and answer the five questions:
* What is the total amount in dollars of debits?
* What is the total amount in dollars of credits?
* How many autopays were started?
* How many autopays were ended?
* What is the balance of user ID 2456938384156277127?
## Log Specification
MPS7 transaction log specification:
```
Header:
| 4 byte magic string "MPS7" | 1 byte version | 4 byte (uint32) # of records |

Record:
| 1 byte record type enum | 4 byte (uint32) Unix timestamp | 8 byte (uint64) user ID |
```
Record type enum:
* 0x00: Debit
* 0x01: Credit
* 0x02: StartAutopay
* 0x03: EndAutopay
## Setup and Helper Functions
I started experimenting with the `readBin()` function. However, I found that the obvious built-in modes, such as `character`, would return too much of the file. For example:
```{r}
to.read = file("byte-reader/data.dat", "rb")
readBin(to.read, character(), n=1, size=4)
close(to.read)
```
Notice the trailing `\001`. My understanding is that `character` mode depends on a zero-terminated string, which we clearly cannot count on here. The `size=4` argument appears to be ignored in `character` mode.
Using `raw` mode and converting to a character string afterwards seems like a decent fall-back. However, I ran into a number of issues in my experiments:
* `size` is always 1 in `raw` mode
* `n`, used for retrieving multiple records, did not seem to help me here either
* so `raw` appeared to leave no option other than retrieving 1 byte at a time.
```{r}
to.read = file("byte-reader/data.dat", "rb")
raw_data <- readBin(to.read, raw())
raw_data
rawToChar(raw_data)
close(to.read)
```
As you can see in these results, `4d` converts to our very first character, `M`.
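We can double-check that mapping directly in R (a quick aside; these calls just restate the conversion above):
```{r}
# 0x4d is 77 in decimal, which is the ASCII code for "M"
strtoi("4d", base = 16L)
rawToChar(as.raw(0x4d))
```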
Having not found a sufficient way to read a specific number of bytes in one go, I decided to write a short function. If someone can find a better way to do this upon code review, I absolutely welcome a refactor.
```{r}
retrieveNbytes <- function(file_name, number_of_bytes) {
  # Read number_of_bytes bytes from the connection, one raw byte at a time
  count <- number_of_bytes
  raw_bytes <- c()
  while (count > 0) {
    count <- count - 1
    raw_bytes <- c(raw_bytes, readBin(file_name, raw()))
  }
  return(raw_bytes)
}
```
## Header Parsing
Putting the new `retrieveNbytes` function to use in extracting the first 4 characters:
```{r}
to.read = file("byte-reader/data.dat", "rb")
rawToChar(retrieveNbytes(to.read,4))
close(to.read)
```
We have found the expected magic string!
Reading the rest of the header:
```{r}
to.read = file("byte-reader/data.dat", "rb")
magic_string <- rawToChar(retrieveNbytes(to.read,4))
magic_string
version <- readBin(to.read, integer(), size=1)
version
records <- as.integer(retrieveNbytes(to.read,4))[4]
records
close(to.read)
```
Note that this is not quite right for the record count. I'm reading 4 bytes, but each byte is stored in an array, and for simplicity I'm just jumping to the 4th item and using it. There are 71 records reported, but for any record count greater than 255 this approach will fail.
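To illustrate with a hypothetical count (not taken from the data file): a big-endian value of 300 is stored as the bytes `00 00 01 2c`, so grabbing only the 4th byte yields 44 instead of 300.
```{r}
# Hypothetical 4-byte big-endian count of 300: 0x00 0x00 0x01 0x2c
example_bytes <- as.raw(c(0x00, 0x00, 0x01, 0x2c))
as.integer(example_bytes)[4]                                        # 44 -- the low byte only
as.integer(example_bytes)[3] * 256 + as.integer(example_bytes)[4]   # 300 -- the real value
```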
Let's fix this with a quick function:
```{r}
bytesToInteger <- function(file_name, number_of_bytes) {
  # Read number_of_bytes bytes and combine them as a big-endian unsigned integer
  raw_bytes <- retrieveNbytes(file_name, number_of_bytes)
  count <- number_of_bytes
  total <- 0
  while (count > 0) {
    # Byte `count` (1-indexed from the left) contributes its value shifted
    # left by (number_of_bytes - count) bytes
    t <- as.integer(raw_bytes)[count] * 2^((number_of_bytes - count) * 8)
    total <- total + t
    count <- count - 1
  }
  return(total)
}
```
Notice the similarities already emerging between `retrieveNbytes` and `bytesToInteger`. Perhaps a single helper in the spirit of Ruby's `unpack` would make sense. Curiously, I spent some time researching R alternatives to `unpack` and found very few; even in Python, I did not find anything that felt as convenient as Ruby's implementation.
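As a rough sketch of what such a helper might look like, built on `retrieveNbytes` and `bytesToInteger` above (the name `unpackField` and its interface are just a placeholder of mine, not part of the challenge code):
```{r}
# Sketch of an unpack-style helper; "uint", "char", and "double" cover the
# three field types in the MPS7 log
unpackField <- function(con, what = c("uint", "char", "double"), size) {
  what <- match.arg(what)
  switch(what,
    uint   = bytesToInteger(con, size),
    char   = rawToChar(retrieveNbytes(con, size)),
    double = readBin(con, double(), n = 1, size = size, endian = "big")
  )
}
```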
Regardless, we have removed the 255 record limit.
Let's extract the header processing into a function:
```{r}
processHeader <- function() {
  # Assign globally (<<-) so the header fields are visible outside the function
  magic_string <<- rawToChar(retrieveNbytes(to.read, 4))
  version <<- readBin(to.read, integer(), size = 1)
  records <<- bytesToInteger(to.read, 4)
}
```
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
magic_string
version
records
close(to.read)
```
## Record Parsing
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
type_enum <- readBin(to.read, integer(), size=1)
type_enum
timestamp <- bytesToInteger(to.read,4)
timestamp
user_id <- bytesToInteger(to.read,8)
user_id
close(to.read)
```
For the type_enum, `0` is a Debit.
For the timestamp, I validated it with https://www.unixtimestamp.com, which converts the UNIX timestamp into a more readable representation, `02/22/2014 @ 10:42pm (UTC)`. This date-time is neither wildly in the past nor in the far future.
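The same conversion can be done in R as a quick check (`timestamp` here is the value read in the chunk above):
```{r}
# Interpret the raw UNIX seconds as a readable UTC date-time
as.POSIXct(timestamp, origin = "1970-01-01", tz = "UTC")
```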
For the user_id, `4.136354e+18` seems like an unusually long user ID; however, when compared to the user ID supplied in the challenge, `2456938384156277127`, we find that they are both 19 digits long.
So I'd say these results seem reasonable! Let's carry on.
As a debit, this record comes with an additional field, an 8 byte float for the amount.
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
type_enum <- readBin(to.read, integer(), size=1)
timestamp <- bytesToInteger(to.read,4)
user_id <- bytesToInteger(to.read,8)
amount <- readBin(to.read, double(), n=1, size=8, endian="big")
amount
close(to.read)
```
`604.2743` *feels* like a reasonable amount; however, I should note that at this point in the exercise there is very little in the data to provide feedback on the validity of the results. The data is numerical, and there is little guidance on what valid minimums or maximums would be.
I see three assumptions we might be able to make:
* Timestamps will fall after the UNIX epoch.
* A transaction log would not include records from the future.
* Record types are always and *only* the 4 types listed.
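Based on these assumptions, a minimal sanity check on the record we just read might look like this (a sketch only, reusing the variables from the chunk above):
```{r}
# Fail loudly if the record violates any of the assumptions above
stopifnot(
  timestamp > 0,                        # after the UNIX epoch
  timestamp <= as.numeric(Sys.time()),  # not from the future
  type_enum %in% 0:3                    # one of the four known record types
)
```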
Let's put a function together that will read a single record:
```{r}
processRecord <- function() {
  type_enum <- readBin(to.read, integer(), size = 1)
  if (length(type_enum) == 0) { return(FALSE) }  # end of file
  timestamp <- bytesToInteger(to.read, 4)
  user_id <- bytesToInteger(to.read, 8)
  if (type_enum == 0 | type_enum == 1) {
    # Debits and Credits carry an additional 8-byte big-endian float amount
    amount <- readBin(to.read, double(), n = 1, size = 8, endian = "big")
  } else {
    amount <- NA
  }
  print(paste("Type=", type_enum, " Timestamp=", timestamp, " User ID=", user_id, " Amount=", amount, sep=""))
  return(TRUE)
}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
p <- processRecord()
p <- processRecord()
p <- processRecord()
p <- processRecord()
p <- processRecord()
close(to.read)
```
Reading the first 5 records seems to work!
*As an aside: I feel like I'm getting sloppy with the global variables - these functions are not quite as atomic as I would prefer. Making a note of this to-do for future consideration.*
Next up: Loop through all of the records and stop successfully at the end.
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
while (processRecord()) {
  # nothing to do in the loop body; processRecord() prints each record
}
close(to.read)
```
## Refactor the Data into a Data Frame
Let's refactor our functions and store the values into a data frame.
In the following revised `processRecord` function we now return a vector of the record values:
```{r}
processRecord <- function() {
  type_enum <- readBin(to.read, integer(), size = 1)
  if (length(type_enum) == 0) { return(c()) }  # end of file: return an empty vector
  timestamp <- bytesToInteger(to.read, 4)
  user_id <- bytesToInteger(to.read, 8)
  if (type_enum == 0 | type_enum == 1) {
    # Debits and Credits carry an additional 8-byte big-endian float amount
    amount <- readBin(to.read, double(), n = 1, size = 8, endian = "big")
  } else {
    amount <- NA
  }
  return(c(type_enum, timestamp, user_id, amount))
}
```
In the main loop we capture the vector returned from `processRecord` and append each result to a data frame, `df`.
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
df <- data.frame(matrix(ncol = 4, nrow = 0))
while (length(r <- processRecord()) > 1) {
  df <- rbind(df, r)
}
colnames(df) <- c("type_enum", "timestamp", "user_id", "amount")
close(to.read)
df
```
A summary of the data frame below reveals a fair amount of consistency. _Amounts_ range between $98.83 and $998.12. _Timestamps_ are closely clustered, and most _user_ids_ are 19 digits long, though there is at least one user_id that is only 17 digits long.
```{r}
summary(df)
```
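A quick tally of record types and the range of amounts (a convenience check on the same data frame) also helps set expectations for the questions below:
```{r}
# Count of records per type, and the spread of transaction amounts
table(df$type_enum)
range(df$amount, na.rm = TRUE)
```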
## Answering the Questions
### What is the total amount in dollars of debits?
```{r}
debit_records <- df[df$type_enum==0,]
sum_of_debits <- sum(debit_records$amount)
sum_of_debits
```
The sum of the debit amounts is **$18,203.70**
### What is the total amount in dollars of credits?
```{r}
credit_records <- df[df$type_enum==1,]
sum_of_credits <- sum(credit_records$amount)
sum_of_credits
```
The sum of the credit amounts is **$10,073.36**
### How many autopays were started?
```{r}
start_autopay_records <- df[df$type_enum==2,]
count_start_autopay <- nrow(start_autopay_records)
count_start_autopay
```
There are **10** StartAutopay records.
### How many autopays were ended?
```{r}
end_autopay_records <- df[df$type_enum==3,]
count_end_autopay <- nrow(end_autopay_records)
count_end_autopay
```
There are **8** EndAutopay records.
### What is the balance of user ID 2456938384156277127?
```{r}
user_records <- df[df$user_id==2456938384156277127,]
user_records
```
As you can see, there are two records for the user in question. It is easy to tell that the two records will cancel each other out (one is a debit, one is a credit, and both have the same amount). However, if this were a larger set of data, I'd calculate it something like the following:
```{r}
sum_of_user_credits <- sum(user_records[user_records$type_enum==1,]$amount)
sum_of_user_debits <- sum(user_records[user_records$type_enum==0,]$amount)
user_balance <- sum_of_user_credits - sum_of_user_debits
user_balance
```
And there it is. The balance for user ID 2456938384156277127 is **$0.00**.
## Final Code
```{r}
# Functions
processHeader <- function() {
  # Assign globally (<<-) so the header fields are visible outside the function
  magic_string <<- rawToChar(retrieveNbytes(to.read, 4))
  version <<- readBin(to.read, integer(), size = 1)
  records <<- bytesToInteger(to.read, 4)
}
retrieveNbytes <- function(file_name, number_of_bytes) {
  # Read number_of_bytes bytes from the connection, one raw byte at a time
  count <- number_of_bytes
  raw_bytes <- c()
  while (count > 0) {
    count <- count - 1
    raw_bytes <- c(raw_bytes, readBin(file_name, raw()))
  }
  return(raw_bytes)
}
bytesToInteger <- function(file_name, number_of_bytes) {
  # Read number_of_bytes bytes and combine them as a big-endian unsigned integer
  raw_bytes <- retrieveNbytes(file_name, number_of_bytes)
  count <- number_of_bytes
  total <- 0
  while (count > 0) {
    t <- as.integer(raw_bytes)[count] * 2^((number_of_bytes - count) * 8)
    total <- total + t
    count <- count - 1
  }
  return(total)
}
processRecord <- function() {
  type_enum <- readBin(to.read, integer(), size = 1)
  if (length(type_enum) == 0) { return(c()) }  # end of file: return an empty vector
  timestamp <- bytesToInteger(to.read, 4)
  user_id <- bytesToInteger(to.read, 8)
  if (type_enum == 0 | type_enum == 1) {
    # Debits and Credits carry an additional 8-byte big-endian float amount
    amount <- readBin(to.read, double(), n = 1, size = 8, endian = "big")
  } else {
    amount <- NA
  }
  return(c(type_enum, timestamp, user_id, amount))
}
# Read the file and create a data frame with the information
to.read = file("byte-reader/data.dat", "rb")
processHeader()
df <- data.frame(matrix(ncol = 4, nrow = 0))
while (length(r <- processRecord()) > 1) {
  df <- rbind(df, r)
}
colnames(df) <- c("type_enum", "timestamp", "user_id", "amount")
close(to.read)
# Answer the Questions
debit_records <- df[df$type_enum==0,]
sum_of_debits <- sum(debit_records$amount)
credit_records <- df[df$type_enum==1,]
sum_of_credits <- sum(credit_records$amount)
start_autopay_records <- df[df$type_enum==2,]
count_start_autopay <- nrow(start_autopay_records)
end_autopay_records <- df[df$type_enum==3,]
count_end_autopay <- nrow(end_autopay_records)
user_records <- df[df$user_id==2456938384156277127,]
sum_of_user_credits <- sum(user_records[user_records$type_enum==1,]$amount)
sum_of_user_debits <- sum(user_records[user_records$type_enum==0,]$amount)
user_balance <- sum_of_user_credits - sum_of_user_debits
```
## Areas for further consideration and refinement
There are many areas that could be addressed next and prioritizing any one of these would be an exercise in quantifying trade-offs. I advise estimating the work with the engineering team and collaborating with the business/product team to determine which items represent the greatest value to the business.
#### Logic considerations
* Is there a need to address the discrepancy between the record count reported in the header (71) and the actual number of records found? I chose to read all 72 records, but perhaps I should have ignored the last one. This may be a gap in the documentation of the log specification.
* Consider whether fractional pennies are meaningful anywhere, and round at the appropriate point.
#### Code quality
* Potentially consolidate the similarities in the `retrieveNbytes` and `bytesToInteger` functions, similar to Ruby's `unpack` function.
* Create constants for the 4 record types so that the code is easier to read.
* Reduce the use of global variables and make the functions more atomic (one direction is sketched after this list).
* Build in error handling for the case where the data file or records are malformed.
* Build tests that demonstrate and validate the functions are working properly.
* There might be value in utilizing Docker or Vagrant to wrap this code in a consistent environment.
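On the global-variables point above, one possible direction (a sketch only; the calling code would need to change accordingly, and `processHeaderList` is just a placeholder name) is to have the header parser return a named list instead of assigning into the global environment:
```{r}
# Sketch: return a named list rather than assigning globals with <<-
processHeaderList <- function(con) {
  list(
    magic_string = rawToChar(retrieveNbytes(con, 4)),
    version      = readBin(con, integer(), size = 1),
    records      = bytesToInteger(con, 4)
  )
}
# Usage sketch: header <- processHeaderList(to.read); header$records
```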
#### Future-proofing
* Process the UNIX timestamp data into a more usable R time object.
* Depending on the size of the production data set, refactor this code for performance. For example, this code will run out of memory for sufficiently large data sets (one small step is sketched below).
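On that last point, one small first step (a sketch, untested against the real file, and it still holds everything in memory) would be to collect rows in a list and bind them once, rather than growing the data frame row by row:
```{r}
# Sketch: assumes the connection is open and processHeader()/processRecord()
# from the Final Code chunk have already been defined and run
rows <- list()
while (length(r <- processRecord()) > 1) {
  rows[[length(rows) + 1]] <- r
}
df <- as.data.frame(do.call(rbind, rows))
colnames(df) <- c("type_enum", "timestamp", "user_id", "amount")
```
A truly large file would also call for incremental aggregation while reading, rather than keeping every record in memory.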