---
title: "Proto Challenge"
output:
html_notebook: default
html_document:
df_print: paged
pdf_document: default
---
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook to address the coding challenge presented to Peter Edstrom on January 5th, 2018.
## Challenge Goals
The full challenge text can be found in [README.md](byte-reader/README.md). We aim to parse [data.dat](byte-reader/data.dat) and answer the five questions:
* What is the total amount in dollars of debits?
* What is the total amount in dollars of credits?
* How many autopays were started?
* How many autopays were ended?
* What is the balance of user ID 2456938384156277127?
## Log Specification
MPS7 transaction log specification:
```
Header:
| 4 byte magic string "MPS7" | 1 byte version | 4 byte (uint32) # of records |

Record:
| 1 byte record type enum | 4 byte (uint32) Unix timestamp | 8 byte (uint64) user ID |
```
Record type enum:
* 0x00: Debit
* 0x01: Credit
* 0x02: StartAutopay
* 0x03: EndAutopay
## Setup and Helper Functions
I started experimenting with the `readBin()` function. However, I found that the obvious built-in modes, such as `character`, would return too much of the file. For example:
```{r}
to.read = file("byte-reader/data.dat", "rb")
readBin(to.read, character(), n=1, size=4)
close(to.read)
```
Notice the trailing `\001`. My understanding is that `character` mode depends on a zero-terminated string, which we clearly cannot count on here. The `size=4` argument appears to be ignored in `character` mode.
Using `raw` mode and converting to a character string afterwards seems like a decent fall-back. However, I ran into a number of issues in my experiments:
* `size` is always 1 in `raw` mode
* `n`, used for retrieving multiple records, did not seem to help me here either
* so `raw` appeared to leave no option other than retrieving 1 byte at a time.
```{r}
to.read = file("byte-reader/data.dat", "rb")
raw_data <- readBin(to.read, raw())
raw_data
rawToChar(raw_data)
close(to.read)
```
As you can see in these results, `4d` converts to our very first character, `M`.
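We can double-check that mapping directly in R (a quick aside; these calls just restate the conversion above):
```{r}
# 0x4d is 77 in decimal, which is the ASCII code for "M"
strtoi("4d", base = 16L)
rawToChar(as.raw(0x4d))
```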
Having not found a sufficient way to read a specific number of bytes in one go, I decided to write a short function. If someone can find a better way to do this upon code review, I absolutely welcome a refactor.
```{r}
retrieveNbytes <- function(file_name, number_of_bytes) {
  # Read number_of_bytes bytes from the connection, one raw byte at a time
  count <- number_of_bytes
  raw_bytes <- c()
  while (count > 0) {
    count <- count - 1
    raw_bytes <- c(raw_bytes, readBin(file_name, raw()))
  }
  return(raw_bytes)
}
```
## Header Parsing
Putting the new `retrieveNbytes` function to use in extracting the first 4 characters:
```{r}
to.read = file("byte-reader/data.dat", "rb")
rawToChar(retrieveNbytes(to.read,4))
close(to.read)
```
We have found the expected magic string!
Reading the rest of the header:
```{r}
to.read = file("byte-reader/data.dat", "rb")
magic_string <- rawToChar(retrieveNbytes(to.read,4))
magic_string
version <- readBin(to.read, integer(), size=1)
version
records <- as.integer(retrieveNbytes(to.read,4))[4]
records
close(to.read)
```
Note that this is not quite right for the record count. I'm reading 4 bytes, but each byte is stored in an array, and for simplicity I'm just jumping to the 4th item and using it. There are 71 records reported, but for any record count greater than 255 this approach will fail.
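To illustrate with a hypothetical count (not taken from the data file): a big-endian value of 300 is stored as the bytes `00 00 01 2c`, so grabbing only the 4th byte yields 44 instead of 300.
```{r}
# Hypothetical 4-byte big-endian count of 300: 0x00 0x00 0x01 0x2c
example_bytes <- as.raw(c(0x00, 0x00, 0x01, 0x2c))
as.integer(example_bytes)[4]                                        # 44 -- the low byte only
as.integer(example_bytes)[3] * 256 + as.integer(example_bytes)[4]   # 300 -- the real value
```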
Let's fix this with a quick function:
```{r}
bytesToInteger <- function(file_name, number_of_bytes) {
  # Read number_of_bytes bytes and combine them as a big-endian unsigned integer
  raw_bytes <- retrieveNbytes(file_name, number_of_bytes)
  count <- number_of_bytes
  total <- 0
  while (count > 0) {
    # Byte `count` (1-indexed from the left) contributes its value shifted
    # left by (number_of_bytes - count) bytes
    t <- as.integer(raw_bytes)[count] * 2^((number_of_bytes - count) * 8)
    total <- total + t
    count <- count - 1
  }
  return(total)
}
```
Notice the similarities already emerging between `retrieveNbytes` and `bytesToInteger`. Perhaps a single helper in the spirit of Ruby's `unpack` would make sense. Curiously, I spent some time researching R alternatives to `unpack` and found very few; even in Python, I did not find anything that felt as convenient as Ruby's implementation.
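As a rough sketch of what such a helper might look like, built on `retrieveNbytes` and `bytesToInteger` above (the name `unpackField` and its interface are just a placeholder of mine, not part of the challenge code):
```{r}
# Sketch of an unpack-style helper; "uint", "char", and "double" cover the
# three field types in the MPS7 log
unpackField <- function(con, what = c("uint", "char", "double"), size) {
  what <- match.arg(what)
  switch(what,
    uint   = bytesToInteger(con, size),
    char   = rawToChar(retrieveNbytes(con, size)),
    double = readBin(con, double(), n = 1, size = size, endian = "big")
  )
}
```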
Regardless, we have removed the 255 record limit.
Let's extract the header processing into a function:
```{r}
processHeader <- function() {
  # Assign globally (<<-) so the header fields are visible outside the function
  magic_string <<- rawToChar(retrieveNbytes(to.read, 4))
  version <<- readBin(to.read, integer(), size = 1)
  records <<- bytesToInteger(to.read, 4)
}
```
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
magic_string
version
records
close(to.read)
```
## Record Parsing
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
type_enum <- readBin(to.read, integer(), size=1)
type_enum
timestamp <- bytesToInteger(to.read,4)
timestamp
user_id <- bytesToInteger(to.read,8)
user_id
close(to.read)
```
For the type_enum, `0` is a Debit.
For the timestamp, I validated it with https://www.unixtimestamp.com, which converts the UNIX timestamp into a more readable representation, `02/22/2014 @ 10:42pm (UTC)`. This date-time is neither wildly in the past nor in the far future.
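The same conversion can be done in R as a quick check (`timestamp` here is the value read in the chunk above):
```{r}
# Interpret the raw UNIX seconds as a readable UTC date-time
as.POSIXct(timestamp, origin = "1970-01-01", tz = "UTC")
```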
For the user_id, `4.136354e+18` seems like an unusually long user ID; however, when compared to the user ID supplied in the challenge, `2456938384156277127`, we find that they are both 19 digits long.
So I'd say these results seem reasonable! Let's carry on.
As a debit, this record comes with an additional field, an 8 byte float for the amount.
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
type_enum <- readBin(to.read, integer(), size=1)
timestamp <- bytesToInteger(to.read,4)
user_id <- bytesToInteger(to.read,8)
amount <- readBin(to.read, double(), n=1, size=8, endian="big")
amount
close(to.read)
```
`604.2743` *feels* like a reasonable amount; however, I should note that at this point in the exercise there is very little in the data to provide feedback on the validity of the results. The data is numerical, and there is little guidance on what valid minimums or maximums would be.
I see three assumptions we might be able to make:
* Timestamps will fall after the UNIX epoch.
* A transaction log would not include records from the future.
* Record types are always and *only* the 4 types listed.
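Based on these assumptions, a minimal sanity check on the record we just read might look like this (a sketch only, reusing the variables from the chunk above):
```{r}
# Fail loudly if the record violates any of the assumptions above
stopifnot(
  timestamp > 0,                        # after the UNIX epoch
  timestamp <= as.numeric(Sys.time()),  # not from the future
  type_enum %in% 0:3                    # one of the four known record types
)
```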
Let's put a function together that will read a single record:
```{r}
processRecord <- function() {
  type_enum <- readBin(to.read, integer(), size = 1)
  if (length(type_enum) == 0) { return(FALSE) }  # end of file
  timestamp <- bytesToInteger(to.read, 4)
  user_id <- bytesToInteger(to.read, 8)
  if (type_enum == 0 | type_enum == 1) {
    # Debits and Credits carry an additional 8-byte big-endian float amount
    amount <- readBin(to.read, double(), n = 1, size = 8, endian = "big")
  } else {
    amount <- NA
  }
  print(paste("Type=", type_enum, " Timestamp=", timestamp, " User ID=", user_id, " Amount=", amount, sep=""))
  return(TRUE)
}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
p <- processRecord()
p <- processRecord()
p <- processRecord()
p <- processRecord()
p <- processRecord()
close(to.read)
```
Reading the first 5 records seems to work!
*As an aside: I feel like I'm getting sloppy with the global variables - these functions are not quite as atomic as I would prefer. Making a note of this to-do for future consideration.*
Next up: Loop through all of the records and stop successfully at the end.
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
while (processRecord()) {
  # nothing to do in the loop body; processRecord() prints each record
}
close(to.read)
```
## Refactor the Data into a Data Frame
Let's refactor our functions and store the values into a data frame.
In the following revised `processRecord` function we now return a vector of the record values:
```{r}
processRecord <- function() {
  type_enum <- readBin(to.read, integer(), size = 1)
  if (length(type_enum) == 0) { return(c()) }  # end of file: return an empty vector
  timestamp <- bytesToInteger(to.read, 4)
  user_id <- bytesToInteger(to.read, 8)
  if (type_enum == 0 | type_enum == 1) {
    # Debits and Credits carry an additional 8-byte big-endian float amount
    amount <- readBin(to.read, double(), n = 1, size = 8, endian = "big")
  } else {
    amount <- NA
  }
  return(c(type_enum, timestamp, user_id, amount))
}
```
In the main loop we capture the vector returned from `processRecord` and append each result to a data frame, `df`.
```{r}
to.read = file("byte-reader/data.dat", "rb")
processHeader()
df <- data.frame(matrix(ncol = 4, nrow = 0))
while (length(r <- processRecord()) > 1) {
  df <- rbind(df, r)
}
colnames(df) <- c("type_enum", "timestamp", "user_id", "amount")
close(to.read)
df
```
A summary of the data frame below reveals a fair amount of consistency. _Amounts_ range between $98.83 and $998.12. _Timestamps_ are closely clustered, and most _user_ids_ are 19 digits long, though there is at least one user_id that is only 17 digits long.
```{r}
summary(df)
```
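A quick tally of record types and the range of amounts (a convenience check on the same data frame) also helps set expectations for the questions below:
```{r}
# Count of records per type, and the spread of transaction amounts
table(df$type_enum)
range(df$amount, na.rm = TRUE)
```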
## Answering the Questions
### What is the total amount in dollars of debits?
```{r}
debit_records <- df[df$type_enum==0,]
sum_of_debits <- sum(debit_records$amount)
sum_of_debits
```
The sum of the debit amounts is **$18,203.70**
### What is the total amount in dollars of credits?
```{r}
credit_records <- df[df$type_enum==1,]
sum_of_credits <- sum(credit_records$amount)
sum_of_credits
```
The sum of the credit amounts is **$10,073.36**
### How many autopays were started?
```{r}
start_autopay_records <- df[df$type_enum==2,]
count_start_autopay <- nrow(start_autopay_records)
count_start_autopay
```
There are **10** StartAutopay records.
### How many autopays were ended?
```{r}
end_autopay_records <- df[df$type_enum==3,]
count_end_autopay <- nrow(end_autopay_records)
count_end_autopay
```
There are **8** EndAutopay records.
### What is the balance of user ID 2456938384156277127?
```{r}
user_records <- df[df$user_id==2456938384156277127,]
user_records
```
As you can see, there are two records for the user in question. It is easy to tell that the two records will cancel each other out (one is a debit, one is a credit, and both have the same amount). However, if this were a larger set of data, I'd calculate it something like the following:
```{r}
sum_of_user_credits <- sum(user_records[user_records$type_enum==1,]$amount)
sum_of_user_debits <- sum(user_records[user_records$type_enum==0,]$amount)
user_balance <- sum_of_user_credits - sum_of_user_debits
user_balance
```
And there it is. The balance for user ID 2456938384156277127 is **$0.00**.
## Final Code
```{r}
# Functions
processHeader <- function() {
  # Assign globally (<<-) so the header fields are visible outside the function
  magic_string <<- rawToChar(retrieveNbytes(to.read, 4))
  version <<- readBin(to.read, integer(), size = 1)
  records <<- bytesToInteger(to.read, 4)
}
retrieveNbytes <- function(file_name, number_of_bytes) {
  # Read number_of_bytes bytes from the connection, one raw byte at a time
  count <- number_of_bytes
  raw_bytes <- c()
  while (count > 0) {
    count <- count - 1
    raw_bytes <- c(raw_bytes, readBin(file_name, raw()))
  }
  return(raw_bytes)
}
bytesToInteger <- function(file_name, number_of_bytes) {
  # Read number_of_bytes bytes and combine them as a big-endian unsigned integer
  raw_bytes <- retrieveNbytes(file_name, number_of_bytes)
  count <- number_of_bytes
  total <- 0
  while (count > 0) {
    t <- as.integer(raw_bytes)[count] * 2^((number_of_bytes - count) * 8)
    total <- total + t
    count <- count - 1
  }
  return(total)
}
processRecord <- function() {
  type_enum <- readBin(to.read, integer(), size = 1)
  if (length(type_enum) == 0) { return(c()) }  # end of file: return an empty vector
  timestamp <- bytesToInteger(to.read, 4)
  user_id <- bytesToInteger(to.read, 8)
  if (type_enum == 0 | type_enum == 1) {
    # Debits and Credits carry an additional 8-byte big-endian float amount
    amount <- readBin(to.read, double(), n = 1, size = 8, endian = "big")
  } else {
    amount <- NA
  }
  return(c(type_enum, timestamp, user_id, amount))
}
# Read the file and create a data frame with the information
to.read = file("byte-reader/data.dat", "rb")
processHeader()
df <- data.frame(matrix(ncol = 4, nrow = 0))
while (length(r <- processRecord()) > 1) {
  df <- rbind(df, r)
}
colnames(df) <- c("type_enum", "timestamp", "user_id", "amount")
close(to.read)
# Answer the Questions
debit_records <- df[df$type_enum==0,]
sum_of_debits <- sum(debit_records$amount)
credit_records <- df[df$type_enum==1,]
sum_of_credits <- sum(credit_records$amount)
start_autopay_records <- df[df$type_enum==2,]
count_start_autopay <- nrow(start_autopay_records)
end_autopay_records <- df[df$type_enum==3,]
count_end_autopay <- nrow(end_autopay_records)
user_records <- df[df$user_id==2456938384156277127,]
sum_of_user_credits <- sum(user_records[user_records$type_enum==1,]$amount)
sum_of_user_debits <- sum(user_records[user_records$type_enum==0,]$amount)
user_balance <- sum_of_user_credits - sum_of_user_debits
```
## Areas for further consideration and refinement
There are many areas that could be addressed next and prioritizing any one of these would be an exercise in quantifying trade-offs. I advise estimating the work with the engineering team and collaborating with the business/product team to determine which items represent the greatest value to the business.
#### Logic considerations
* Is there a need to address the discrepancy between the record count reported in the header (71) and the actual number of records found? I chose to read all 72 records, but perhaps I should have ignored the last one. This may be a gap in the documentation of the log specification.
* Consider whether fractional pennies are meaningful anywhere, and round at the appropriate point.
#### Code quality
* Potentially consolidate the similarities in the `retrieveNbytes` and `bytesToInteger` functions, similar to Ruby's `unpack` function.
* Create constants for the 4 record types so that the code is easier to read.
* Reduce the use of global variables and make the functions more atomic (one direction is sketched after this list).
* Build in error handling for the case where the data file or records are malformed.
* Build tests that demonstrate and validate the functions are working properly.
* There might be value in utilizing Docker or Vagrant to wrap this code in a consistent environment.
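On the global-variables point above, one possible direction (a sketch only; the calling code would need to change accordingly, and `processHeaderList` is just a placeholder name) is to have the header parser return a named list instead of assigning into the global environment:
```{r}
# Sketch: return a named list rather than assigning globals with <<-
processHeaderList <- function(con) {
  list(
    magic_string = rawToChar(retrieveNbytes(con, 4)),
    version      = readBin(con, integer(), size = 1),
    records      = bytesToInteger(con, 4)
  )
}
# Usage sketch: header <- processHeaderList(to.read); header$records
```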
#### Future-proofing
* Process the UNIX timestamp data into a more usable R time object.
* Depending on the size of the production data set, refactor this code for performance. For example, this code will run out of memory for sufficiently large data sets (one small step is sketched below).
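On that last point, one small first step (a sketch, untested against the real file, and it still holds everything in memory) would be to collect rows in a list and bind them once, rather than growing the data frame row by row:
```{r}
# Sketch: assumes the connection is open and processHeader()/processRecord()
# from the Final Code chunk have already been defined and run
rows <- list()
while (length(r <- processRecord()) > 1) {
  rows[[length(rows) + 1]] <- r
}
df <- as.data.frame(do.call(rbind, rows))
colnames(df) <- c("type_enum", "timestamp", "user_id", "amount")
```
A truly large file would also call for incremental aggregation while reading, rather than keeping every record in memory.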