forked from jmledford3115/datascibiol
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlab1_1.Rmd
118 lines (79 loc) · 9.51 KB
/
lab1_1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
---
title: "Getting Started"
author: "Joel Ledford"
date: "Winter 2019"
output:
html_document:
theme: spacelab
toc: yes
toc_float: yes
---
## Background
In nearly every field of science, our ability to generate data has exceeded our capacity for analysis. For me, this means that there is the potential for loss to science; many important discoveries may go unnoticed because we are unable to efficiently analyze data.
There is also a growing problem of curation and reproducibility. As the analyses and data become more complex, research labs have a problem of managing data and analysis such that data can be used in future analyses or shared with colleagues. As students, postdocs, and visiting scientists come and go in labs, or even as an individual moves between projects, there is often a mess left behind where data are inefficiently stored, analysis was not documented, or in the worse case scenario, sloppy copy/paste type errors cause wasted days to months. Effectively stored data along with a documented, reproducible, work flow will allow for high-quality analysis, error prevention and time savings, and smooth and efficient collaboration with others or even with yourself months to years in the future.
Lastly, with the widespread availability of data online there is the potential for new scientific discovery or insight through exploratory data analysis. In many cases a clean, organized workflow will yield new perspectives and promising results.
## Objectives
The goal for this course is to help get you started learning to manage, transform, and visualize data using the R programming language. You will learn to clearly and neatly organize messy data, transform it in ways that address your questions, and communicate results in a variety of formats. The course is designed for people with **no prior programming experience**. There is a substantial learning curve but, working together, we should be able to make learning R easier, interesting, and fun.
Why spend the time? If you are pursuing any research path in science, this class will give you a competitive advantage and save you countless hours as most people struggle to [learn R on their own](https://www.wired.com/2017/03/biologists-teaching-code-survive/). Even if you are not pursuing a research career, companies routinely hire data scientists (or people with skills in data science) to analyze a variety of data.
## Resources
Given that R is open source, many resources are available online. We will use a combination of resources in the class, but key items are listed below.
- [R for data science](https://r4ds.had.co.nz/)
- [R cheatsheets](https://www.rstudio.com/resources/cheatsheets/)
- [RStudio keyboard shortcuts](https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts)
## Data Science at UC Davis
Our campus has a wealth of expertise in data science, from working groups to full initiatives. Should your interests progress, here are some links. They offer regular workshops and maintain archives.
- [Data Science Initiative, UC Davis](http://dsi.ucdavis.edu/)
- [Davis R Users Group](http://d-rug.github.io/)
## Lab 1
The goal of lab 1 is to get everyone started using R, Rstudio, and GitHub. All of our work will be done in RStudio and uploaded to the class GitHub repository.
At the start of each lab session, I will randomly pull someone's code from the class GitHub repository and use it as an example. There are many, many ways to solve the same problems in R and seeing other people's code is a great way to learn. This is not about shaming people who make mistakes!
This first lab is tedious, so please be patient. It is important that everyone is set-up correctly before they leave today. In the spirit of the R universe, our class is a community. If you see someone struggling, please give them some help.
## Setup your own computer
You are welcome to use your own computer, but in order to do so you need to install R, RStudio, and Git. If you would like to set-up your own computer, please follow the directions [here](https://jmledford3115.github.io/datascibiol/setup.html). If you are unsure about troubleshooting your own computer, **please** use the lab computers.
## GitHub and Git
[**GitHub**](www.github.com) is a file storage and management site used mostly by programmers. Programmers upload code to repositories (folders) and make it publicly available.
**Git** is software that is used for version control. It tracks changes and is especially helpful when multiple people are collaborating on a project; if a mistake is made it can be easily tracked back and retrieved. Both can be integrated with RStudio.
Most programmers use both **GitHub** and **Git** to manage their projects. However, **Git** is glitchy and hard to work with- especially in a classroom setting. If you want to try and use **Git**, then follow the directions [here](https://jmledford3115.github.io/datascibiol/setup.html).
## GitHub Desktop
The easiest way to manage your GitHub repositories is to use [GitHub desktop](https://desktop.github.com/). This is a helpful tool that will allow you to manage all of your files. Highly recommended!
## Create a GitHub account
Since we will use GitHub as a repository for all assignments, we need to make an account for each student and build a new project through RStudio. Navigate to www.github.com and create a free account. [Email](mailto:[email protected]) your username to me. You will receive an email to join the repository for our course.
I have made a repository for our class titled: [FRS417-DataScienceBiologists](https://github.com/FRS417-DataScienceBiologists). Once you have accepted the invitation to join the repository, you should be able to login to GitHub and see this organization or click on the link above. You should see a directory associated with your login name. This where you will upload all of your work during the quarter.
## Download `class_files`
In our class GitHub, you will see a directory called `class_files`. Click on the **Clone or download** button and download the files as a `.zip`. You should now have a folder called `class_files` inside of which are all of the files that we will use for each class. Prior to the start of each lab, you will need to download the `class files` folder.
## RStudio
Inside the `class_files` folder you will see a file titled `lab1_1.Rmd`. Open this file by double clicking and you should now be in RStudio.
RStudio is an interactive working environment for R. When you first open it, you should see your screen divided into four quadrants. I will demonstrate each of these in class, but for now it is enough to know that they display different information and are helpful as you work on code.
You need to make a folder on your computer where you will keep all of your work. You will need frequent access to the folder, so I recommend putting it in an easy to find space. I put mine on the desktop.
Please run the following code to check your current working directory.
```{r}
getwd()
```
If you find that you are not in the folder that you created, please navigate to it: Session>Set Working Directory>Choose Directory.
Re-run the code below to confirm.
```{r}
getwd()
```
## RMarkdown
RMarkdown is one of many types of documents that can be created in RStudio. It is an extremely helpful way to build code because it will display the code, its output, and any text that you wish to add.
In RStudio, open a new markdown document: **File>New File>R Markdown**
There is some generic information at the start of the file that we don't need. Delete lines 12 through 31 to clean things up.
In this file insert a new R code chunk by clicking on the green `insert` tab and selecting R.
- The mac shortcut is option + command + i
- The PC shortcut is Ctrl + Alt + i.
### Install the tidyverse
One strength of R is that there are thousands of add-on packages that perform specialized functions. The packages are referred to as libraries.
In this class, we will routinely use the library called the **tidyverse**. Libraries need to be installed in order to work, and whenever you update R you also need to update all of the packages that you use.
```{r eval=FALSE, include=TRUE}
install.packages("tidyverse")
```
## Push files to GitHub
You can use your repository to store all of your notes and work. All you need to do is go back to your GitHub directory in your web browser and click `Upload files`. Once there, you can drag and drop any files directly into your repository. But, in order for the upload to complete you need to add a short message in the box beneath `Commit changes`. Once the message is added, click `Commit changes` and your upload will be complete.
## Practice
Since you will use your GitHub repository for your homework, we need to practice.
1. Navigate to the `class_files` folder that you downloaded and find the file `dummy_push`. Open this file and save a copy to your working directory using: File>Save As. Change the name of the file to `dummy_push_*` where the `*` is your last name. You can also remove my name and add yours in the `author:` field at the top of the document.
2. Once this is finished, `push` this file to your GitHub directory using the instructions above. After a few moments, you should be able to see this file in GitHub.
## GitHub Issues
We may encounter issues over the next few weeks with GitHub. These will be caused mostly by user error, but sometimes weird things happen. If you are stuck, get my attention and we will sort things out.
## That's it, let's take a break!
--> On to [part 2](https://jmledford3115.github.io/datascibiol/lab1_2.html)