Day0.1_IntroToR.Rmd

---
title: "Introducing R, Preparing for humdrumR"
subtitle: "Georgia Tech, humdrumR Workshop"
author: "Nat Condit-Schultz"
date: "May 10, 2023"
output:
  html_document:
    df_print: paged
    toc: true
    toc_float: true
    theme: flatly
editor_options: 
  chunk_output_type: inline
---


In this notebook, we'll introduce some of the fundamental concepts of the R programming language, preparing us for the usage of humdrumR itself.
You might also be interested in our "[R primer](https://computational-cognitive-musicology-lab.github.io/humdrumR/articles/RPrimer.html)", on the humdrumR website.

Open this notebook in Rstudio.
Throughout the notebook there are "code blocks", marked off with backticks and `{r}`.
You can run these code blocks by clicking the green arrow at the right side of the window.
You can also manipulate or add to the blocks, then click the arrow again!

# Fundamentals

The basics of R syntax are similar to many other programming languages.

We can do arithmetic and logic


```{r, results='hold'}
4 + 5

100 * -3

12 %% 7 # twelve modulo 7

2^8 # exponents

# These comments are ignored

TRUE | FALSE

TRUE & FALSE

!TRUE


```

and "call" functions with zero or more "arguments" inside of parentheses:


```{r, results = 'hold'}
Sys.time()

sqrt(2)

log(8, base = 2)

```

In R, we can also use functions in "pipes," which often saves typing.
The R pipe looks like `|>`, and you still need those parentheses!

```{r, results='hold'}

16.7 |> round() |> log(base = 2) |> sqrt()

# is the same as:
sqrt(log(round(16.7), base = 2))

```

Most importantly, we can look up how to use functions with `?`:

```{r}

?log

```

## Assignment

Like other programming languages, we can give things names by "assigning" them to a variable.
In R, we have leftward `<-` and rightward `->` assignment; rightward assignment is useful when combined with pipes!

```{r}

x <- 4

x -> 4

2^x |> log(base = 2) -> y

2^(x + y)

```


# Data Vectors

 
There are three types of basic, "atomic" data we work with in R:

```{r, results = 'hold'}

# Numbers:

145
3.2
2.2e6 # scientific notaion, e6 -> * 10^6

# Character strings:

"C Major"
"X"
"" # empty string

# Logical values:

TRUE
FALSE
NA # invalid data
NULL # empty, no data
```

In R is that these basic data types (except `NULL`) are **always** vectors: ordered collections.
Thus, a single value like `3.2` or `"Major"` is actually a vector of length one, etc.

--------------------------------------------------------------------------------------------------------

We can make vectors directly using the the `c()` function:

```{r, results = 'hold'}

c(3, 2, 1.6)

c('I', 'IV', 'V', 'I')

c(TRUE, FALSE, TRUE)

```


We often create vectors using other functions; for example, making sequences of numbers using `:` or `seq()`, or using `rep()`.

```{r}
1:100

seq(0, 1, by = .01)

rep(c('A', 'B', 'C', 'D'), 5)

```

---------------------------------------------------------------------------------------------------------------

## Vectorization


Many R operations/functions are "vectorized," meaning that they take in vectors and output vectors that are the same length.
This means that we, as programmers, don't need to worry about each element of the vector;
We can treat a vector like a single object, and R will oblige us.
For example, we can do math like:

```{r}

2^(0:10) - 1

(1:10) - (10:1)

sqrt(c(5, 10, 16))

```

Or work with strings like:

```{r}
paste(1:26, letters, sep = ': ')

paste('Chord', 1:10)

# Regular expressions:
grepl('[aeiou]', letters)


```

Or get logical values:

```{r}

2^(0:100) > 50

1:10 %% 2 == 0

1:20 %in% 2^(0:4)


```


---------------------------------------------------------------------------------------------------------------

Of course, other R functions take in vectors and return totally new vectors (or just scalars).
Examples:

```{r, results='hold'}

length(seq(50, 90, by = .2))

length(letters) # letters is a built-in vector which is always there!

sum(c(1, 5, 9))
mean(c(1, 5, 9))
max(c(1, 5, 9))

range(c(1, 100, 2, -4))

which(c(TRUE, FALSE, TRUE, TRUE))


```

Vectorization works very well when you are working with vectors that are either 1) all the same length or 2) length 1 (scalar).
If vectors are different lengths, the shorter one will be "recycled" (repeated) to match the longer one.

```{r}

c(0, 5) * 1:10

```


#### Practice

> Make a vector representing the line-of-fifths from Fb to B#

```{r}
# hint: rep() and paste()

LN <- rep(c('F', 'C', 'G', 'D', 'A', 'E', 'B'), 3)
ACC <- rep(c('-', '', '#'), each = 7)

paste(LN, ACC, sep = '')

paste(rep(c('F', 'C', 'G', 'D', 'A', 'E', 'B'), 3), 
      rep(c('-', '', '#'), each = 7), 
      sep = '')

# outer(c('F', 'C', 'G', 'D', 'A', 'E', 'B'), c('-', '', '#'), paste)

```

> What is the sum of the first 1000 integers (1 to 1000)?

```{r}
#
sum(1:1000)


```


> How many numbers between 1 and 1000 are divisible by 7?

```{r}
#
length(which((1:1000) %% 7 == 0))

sum((1:1000) %% 7 == 0)

```


> Calculate the average squared deviation between vector x and the mean of x.

```{r}
#
x <- rnorm(10, 0, 3)

sum((x - mean(x)) ^ 2)

```

## Indexing 

"Indexing" means to select subsets of a vector.
We use the `[]` operators, which can be provided numbers or logical values.

```{r, results = 'hold'}

letters[1:10]

x <- c(1, 0, 2, 0, 3, 4, 0)

x > 0
x[x > 0]

```

Negative numbers can be used to *remove* those indices:

```{r}
letters[-1]

letters[-1:-10]

letters[length(letters)]
letters[26]

tail(letters, 1)
head(letters, 1)
```

Indexing doesn't just reorder or subset a vector---you can use the same index any number of times!

```{r}

letters[c(3, 3, 7, 7, 1, 1, 7)]

```

#### Practice

> Get every other letter in the alphabet.

```{r}
# %% 

```

> Get all the numbers between 1:1000 that are divisible by 9.

```{r}
# %%

```


# Data frames

To work with real data, we'll want to put our vectors in to "data frames": collections of vectors that are all the same length.

```{r}


scale <- data.frame(Letter =  c("C", "D", "E", "F", "G", "A", "B", "C"),
                    PitchClass =c(0, 2, 4, 5, 7, 9, 11, 12),
                    Interval = c("P1", "M2", "M3", "P4", "P5", "M6", "M7", "P8"))
scale
```

Notice that we gave each vector a name *inside* the data.frame.
We can access these named columns using `$`:

```{r}

scale$Letter
```

We can add to the data.frame by assigning to that same `$`:

```{r}

scale$Tonic <- scale$Letter == 'C'
scale

```


We can index data.frames using `[i, j]`, where `i` indexes rows and `j` indexes columns:

```{r}

scale[1:5, ]

scale[ , 1:2]


```

#### Practice

> Extract the rows scale with perfect intervals.


```{r}
#  %in%
# grepl()

scale[scale$Interval == 'P1' | scale$Interval == 'P4' | scale$Interval == 'P5', ]
scale[scale$Interval %in% c('P1', 'P4', 'P5'), ]
scale[grepl('P', scale$Interval), ]


```

## Within the Data Frame


The real magic to R comes from a lot of "syntactic sugar" R has to make working with data.frames easier.
We use functions that "look inside" the namespace of the data.frame, allowing us to use the vectors (columns) of the data.frame freely.
The `with()` function is the basic approach.

```{r, results = 'hold'}
with(scale, diff(PitchClass))

with(scale, Letter[PitchClass %% 2 == 0])

# multi-line expressions need to use { }
with(scale, 
     {
       n <- 12
       n - PitchClass
       }
     )
```

`PitchClass` is not a variable we have defined! if we just type `PitchClass` we will get an error!
But if we go "within" the `scale` data.frame, we can then "see" `PitchClass` like any other variable.


-------------------------------------------------------------------------------------------------------------------


We can also use `within()` if we want to add columns to the data.frame:

```{r}
 x <- 3

within(scale, {
  StepSize <- c(diff(PitchClass), NA)
  
  }) -> scale

```


The `subset()` command is another command that can see "inside" the data.frame:

```{r}

subset(scale, {
  (PitchClass %% 2) == 0
  },
  )

```

#### Practice

> Find the subset of the scale with whole steps above the scale degree.

```{r}


```


# Real Data

Let's get to some real data!

When working in R, the first thing you'll always want to do is decide what "working directory" (on your computer) you want to work in.
(You can avoid this by typing out the complete file-path of every file you work with, but that can get tedious!)
Use `getwd()` to see the current working directory, and `setwd()` to change it.
For most of our workshop, we'll want our working directory set to the basic directory of our `humdrumRworkshop` repo, which you cloned (downloaded).
This *notebook* file will automatically run chunks within this directory (which is where the the notebook is located!), but for other work, you'll want to set your working directory to where ever the repo is on your machine:

```{r, eval=FALSE}
setwd("~/Bridge/Professional/Grants/humdrumR/Workshop_2023/humdrumRworkshop/")
```

-----------------------------------------------------------------------------------------------------


In the repo, you'll find some data I've prepared in a file called `LennonAndMcCartney.Rd`.
Load it with `load()`, which will load a data.frame called `LandM`

```{r}
load('LennonAndMcCartney.Rd')

LandM

```


This is a data.frame, consisting of 3,605 rows, with eight columns (vectors) of data.
We can use `with`, `within`, and `subset()` to drop inside the data and do some analysis.

```{r}

with(LandM, hist(MIDI))

with(LandM, table(Chord))

with(LandM, table(Chord, Tonic))

with(subset(LandM, Tonic == 2 & Mode == 'Major'), 
     table(Chord)
     ) |> barplot()

```

--------

The way this data is formatted, sometimes called "tall" format (as opposed to "wide"), is ideal for analysis.
Each "data observation" (in this case, notes) is represented in rows, with each row containing information about all variables (columns) associated with the data point.
This may seem unintuitive, as we see a lot data duplication (`John`, `John`, `John`, etc.), but it is actually ideal, because we end up with a set of vectors that are all the same length!

## Group by

The final fundamental concept in R data is called *split-apply-combine*, or "group by."
The idea is to split the data into subsets, apply the same analyses/processing to each subset, then (usually) recombine the output.


R (and various R packages) have defined a bewildering variety of different functions which allow us to do split-apply-combine routines.
The common idea in all these functions, is that the unique values in one vector is used to represent "groups" in another vector (or data.frame).
For example, in our `LandM` data, the `Composer` column is a `character` vector with just two unique values: `"John"` and `"Paul"`.

```{r}
with(LandM, table(Composer))
```

These two values could be used to *divide* the data into two subsets.
We can do this the manual way, like

```{r, echo = FALSE}
with(subset(LandM, Composer == 'Paul'), table(Chord))
subset(LandM, Composer == 'John')
```

but when there are more than two groups, this quickly gets tedious!


We usually pass grouping vectors to an argument called `by`, or `groupby`, but names like `INDEX` and `INDICES` and even just `f` are used as well.
For example, 

```{r}

split(LandM, f = LandM$Composer)

```

We can then, apply various functions to each sub-group of the data, and get the result.

```{r}
by(LandM, INDICES =  LandM$Song, nrow)

with(LandM, aggregate(MIDI, by = list(Song), mean, na.rm = TRUE))

within(LandM, tapply(MIDI, Song, \(x) x - mean(x, na.rm = TRUE)) |> unlist() -> SongOffset)

```

#### Practice


> Who sings higher, on average, John or Paul?

```{r}
#

with(LandM, data.frame(aggregate(MIDI, list(Composer), mean, na.rm = TRUE),
                       aggregate(MIDI, list(Composer), sd, na.rm = TRUE)))

```


> Whose songs have wider ranges, John or Paul?

```{r}
#
with(LandM, aggregate(MIDI, list(Composer), \(x) diff(range(x, na.rm = TRUE))))

```


> Are longer durations more likely on high notes, relative to low notes?

```{r}
#

with(subset(LandM, !is.na(MIDI)), cor(MIDI, Duration))


```

> Which scale degrees are sung on the longest durations?

```{r}
#

with(subset(LandM, Duration >= .5), table((MIDI - Tonic) %% 12)) |> barplot()


```


> Are non-chord tones less likely to resolve by step than chord-tones?

```{r}
#


```


> Do John and Paul differ in their use of chord qualities?

```{r}
#


```


# More R 


## Defining Functions

In R, you can define new functions using the syntax `\(arguments) expression`.
For example,

```{r}
square <- \(x) x^2

square(12)

```

More complex functions, with multiple lines of code, will need the code surrounded by `{}`.

```{r}
sumOfSquares <- \(x) {
  residuals <- x - mean(x)
  residuals^2 |> sum()
}
sumOfSquares(1:100)


```

#### Practice

> Write a function that converts MIDI (pitch) numbers to 12-tone pitch classes.

```{r}


```


> Write a function that changes compound steps (like 9 or 11) to their simple version (2 and 4)

```{r}


```


> Write a function which checks whether a number is a "whole number" (integer). I.e., 3.2 = FALSE, 15.0 = TRUE.

```{r}


```


> Write a function which "standardizes" a numeric input vector; i.e., centers it at zero and makes the standard deviation 1.
> You can get the standard deviation with the sd() function.


## Lists

The last major data type in R is the `list`.
Lists are flexible objects: lists can hold *any* number of *any* type of data, including other lists!

```{r, results = 'hold'}

list(1, 2, 3)

list('A', 1:100, list(TRUE, TRUE, TRUE))

chords <- list(CM = c(1, 4, 7), GMm = c(2, 7, 11), FMM = c(0, 4, 5, 9))

```

Lists can be index just like vectors---and you can also index using the names of the list (this works for vectors too, actually).

```{r, results = 'hold'}
chords[2]

chords[c('CM', 'FMM', 'GMm','CM')]
```

However, you'll see that what you get from doing this is just another list!
If you actually want to access the value *inside* the list you'll need to use `$` (just like a data.frame)
or use double `[[]]`.

```{r, results = 'hold'}

chords$DM <- chords$CM + 2 |> sort()

chords[['CMM']] <- ((chords[['FMM']] - 5) %% 12) |> sort()

chords$DM

```


(Data frames are actually special lists, where all list elements are the same length.)

### Lapply

We will often end up with lists of values---lists of vectors or lists of data.frames, for example.
If we want to work with lists of data, we can't rely on "plain" vectorization.
However, we can use `lapply` to apply functions to each element of a list (`Map` can work across multiple lists).

Let's say we have a list of useful values, like the chords in each of our Lennon/McCartney songs:

```{r}

chordspersong <- split(LandM$Chord, LandM$Song)

```

Maybe we want to know how many 'mm' chords appear in each song.
We can use `lapply` to apply a function to each element in the list---we'll make up a function on the fly!


```{r}

lapply(chordspersong, \(chords) sum(grepl('mm', chords)))

```

#### Practice

> Which is the most common chord in each song?

```{r}
#

```

## Missing values

R incorporates "missing values" into its core (atomic) vectors.
Any index of a vector (or all of them) can be `NA`.
For example,

```{r}
c(1, 2, NA, 3, 4)

c('Guitar', 'Bass', NA)


```

How `NA` values are treated depends on the function.
Vectorized functions will generally "pass through" `NA` values:

```{r, results = 'hold'}
c(1, 2, NA, 3, 4) + 2

sqrt(c(1, 2, NA, 3))

nchar(c('a', 'b', NA, 'c'))

```

For other functions, the presence of a single `NA` will (by default) result in `NA` output:


```{r, results = 'hold'}

k<- 1:100

k[45] <- NA

mean(k)

```

This is often annoying, but is actually a good thing, because it forces us to be aware of missing data.
Typically, there are options (arguments) to allow us to ignore `NA` values, if we want to---the most common
is `na.rm`:

```{r}

mean(k, na.rm = TRUE)

```

In other cases, we might need to manually remove the `NA`'s using indexing.
The `is.na` function will come in handy!

```{r}

mean(k[!is.na(k)])

```

------------------------------------------------------------------------------

Another missing value you might encounter is `NULL`.
`NULL` is not a vector; `NULL` is a single "empty" value, with `length(NULL) == 0`.
You can check if something is null using `is.null()`.

#### Practice

> How many rests occur in each Paul McCartney song?

```{r}
#


```


## If and Else

In R we can create pretty standard conditional blocks using `if` and `else`:

```{r}
x <- 714

if (x %% 3 == 0) {
        print("714 is divisible by three")
} else {
        print("714 is not divisibly by three")
}

```

However, we can also do *vectorized* conditions using `ifelse()`.
For example, recall that in our Lennon & McCartney data, rests in the vocal part show up as `NA` in the MIDI column.
What if we wanted these `NA` values to be something else?

```{r}

within(LandM, MIDI <- ifelse(is.na(MIDI), "Rest", MIDI))

```

Wherever MIDI is `NA`, `ifelse` places a "Rest"; wherever MIDI is not `NA`, the original value is returned.


#### Practice

> Write a function that translates pich class intervals (0-11) to their "classes" (0-6).
> I.e., 7 -> 5, 11 -> 1, etc.

```{r}
#


```