The efforts and work of this quarter are dedicated to the memory of:
- Fernando Regino (1993-2013)
- Bernardino De Jesus (1993-2016)
- Ivan Garcia Vergara (1991-2018)
- Erik Alonso (1991-2009)
- Jorge Zarate (1990-2008)
"When the lights shut off
And it's my turn to settle down
My main concern
Promise that you will sing about me" - Kendrick Lamar
Thank you to everyone who participated this quarter
This repository serves as an itinerary for the Winter Quarter Project Groups of the Data Science at UCSB organization. It provides a weekly overview as well as the resources used within the weekly meetings.
Contributors:
- Raul Eulogio -> rauleulogio3 [at] gmail.com
- GitHub: https://github.com/raviolli77/
- David Campos -> dcampos.liz [at] gmail.com
- GitHub: https://github.com/dcamposliz
- Personal Site: http://davidacampos.com/
- Jason Freeberg -> freeberg [at] umail.ucsb.edu
- GitHub: https://github.com/JasonFreeberg
- Personal Site: JasonFreeberg.github.io
- Nathan Fritter -> nathan.fritter [at] gmail.com
- GitHub: https://github.com/Njfritter
- Who are you?
- Name
- Major
- Year
- Where are you from?
- Why are you here?
- What are you trying to accomplish in life?
- What are you trying to accomplish here?
- What are you trying to learn?
- What project(s) are you working on today?
- What recent failure have you had?
- Strengths & weaknesses as they relate to data science or in general?
Storm:
Goal of this group is to ultimately get projects finished and published
- WHY
- We found that it is by working on projects that you actually get to learn and begin to understand how to do data science
- Brainstorm on data science ideas
- Write them on a piece of paper
- Go to the front of the group and present it
- Have people walk up to you/you walk up to people, persuade people to be in your group
Collide:
- Form teams
- Mix up grade levels/experience
- Discuss weaknesses, technologies, expertise, talent
- Pick R or Python
- Establish Communication channels
- GroupMe
- Slack
- GitHub
- Phone
- Gmail/Email
Homework:
- Find an interesting project online/from inertia7.com
- Read through contents
- Catch up on your R/Python skills with DataCamp
- Get to know each other
- Become familiar with GitHub and create an account (for beginners and those who weren't here, we'll go into more detail in a later meeting)
Links to resources discussed in meeting:
- R/RStudio: https://www.rstudio.com/
- Python: https://www.python.org/
- Inertia7: http://www.inertia7.com/
- GroupMe: https://groupme.com/en+US/
- GitHub: https://github.com/
- Slack: https://slack.com/
- DataCamp: https://www.datacamp.com/
Some preliminaries
-
Does everyone in your team have:
- Slack account/channel within the dsprojectgroup Slack?
- GitHub account?
- R, Python, SQL set up on their machine? (Whatever y'all plan on using)
- Speak about versions for language and packages/modules, especially in Python:
- Speak to me after if you need more clarification
- If you can answer these questions then you're fine: Do you know what a virtual environment is? And do you know its use?
- If you don't know, have your team speak to me after.
- Which interface will your team be using, e.g. RStudio or Jupyter Notebook for R?
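For teams unsure what a virtual environment is: it is an isolated folder holding its own Python interpreter settings and packages, so each project can pin its own versions. Below is a minimal sketch using Python's built-in `venv` module. The helper name `create_project_env` and the `.venv` folder name are assumptions for illustration, and the demo runs in a throwaway directory.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir: Path, env_name: str = ".venv") -> Path:
    """Create an isolated virtual environment inside project_dir."""
    env_dir = project_dir / env_name
    # with_pip=False keeps this sketch fast and offline; use with_pip=True
    # in a real project so you can install packages into the environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Demonstrate in a temporary directory so nothing on your machine changes.
with tempfile.TemporaryDirectory() as tmp:
    env_path = create_project_env(Path(tmp))
    # Every virtual environment has a pyvenv.cfg file at its root.
    env_created = (env_path / "pyvenv.cfg").exists()
    print("environment created:", env_created)
```

From a terminal, `python3 -m venv .venv` does the same thing; you then activate the environment before installing the packages your team agreed on.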
-
Introduce the concepts of Stand Ups
- Structure of an effective Stand Up:
- What did I accomplish last meeting?
- What will I do today?
- What obstacles are impeding my progress? (Blockers)
-
Document everything in your Slack channel
- If you used a site to review R, Python, html, etc. post it within your group's channel
- Read a cool article relating to your project; document it on Slack
- This will become important when citing sources and creating documentation for your project, and it is a good habit to develop since people deserve credit for helping you!
-
Trello
- Nathan will introduce the interface and how to integrate it into your workforce
- We might create a markdown file explaining it in more detail if people do not understand how to use it right away (but it is pretty easy to use).
- Resources:
-
How to do a Data Science Project?
- Steps of a Data Science project:
- Getting Data
- UCI Machine Learning Repository
- Kaggle datasets
- Cleaning data/sanity checks
- Exploratory Analysis
- Trends in response and predictor variables
- Modeling (Choosing Supervised Vs. Unsupervised Learning)
- Model Validation
- Sharing Results
- Inertia7.com
- GitHub repo with nice README.md
- Jupyter/RMarkdown Notebook
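The steps above can be sketched end to end in a few lines. This is a minimal sketch using only Python's standard library, with a made-up table of home sizes and prices standing in for a real Kaggle or UCI data set. The column names, the numbers, and the `fit_line` helper are all assumptions for illustration; a real project would more likely use pandas and scikit-learn.

```python
import csv
import io
import statistics

# 1. Getting data: a tiny made-up table standing in for a real download.
RAW_CSV = """sqft,price
800,160
1000,200
1200,
1500,300
2000,400
2400,480
"""
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# 2. Cleaning / sanity checks: drop rows with a missing value.
clean = [
    (float(r["sqft"]), float(r["price"]))
    for r in rows
    if r["sqft"] and r["price"]
]

# 3. Exploratory analysis: look at simple summaries before modeling.
mean_price = statistics.mean(price for _, price in clean)


# 4. Modeling (supervised): fit a least-squares line on the training rows.
def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    x_bar = statistics.mean(x for x, _ in points)
    y_bar = statistics.mean(y for _, y in points)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x, _ in points
    )
    return slope, y_bar - slope * x_bar


train, holdout = clean[:-1], clean[-1:]
slope, intercept = fit_line(train)

# 5. Model validation: measure error on rows the model never saw.
mean_abs_error = statistics.mean(
    abs(price - (slope * sqft + intercept)) for sqft, price in holdout
)

# 6. Sharing results: print a summary you could paste into a README.
print(f"rows kept: {len(clean)}, mean price: {mean_price:.1f}")
print(f"price = {slope:.3f} * sqft + {intercept:.3f}, holdout MAE: {mean_abs_error:.3f}")
```

Holding out a single row is only enough to show the idea; with real data you would hold out a larger share or use cross-validation.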
If you don't think you can do a project on your own right off the bat, try doing a project from Inertia7!
- Scrape a Webpage - Python
- Iris Flower Classification
- Modeling Home Prices
- Forecasting the Stock Market
- Sentiment Analysis on Twitter
Here are some of my own repos where I have projects that aren't published on Inertia7:
- https://github.com/raviolli77/pythonTutorialsVinceLa
- https://github.com/raviolli77/machineLearning_Flags_Python
- https://github.com/raviolli77/classification_iris
- https://github.com/raviolli77/machineLearning_breastCancer_Python
- https://github.com/raviolli77/ggplot2_Tutorial_R
Discuss what their project can look like given the structure of what they just hacked
- Fill in the Steps of a Data Science Project
Homework: For this section, we can be lenient as to when this gets done. For more advanced groups, we expect you to be able to do this on your own. Newer groups can wait until the next meeting to have me or other members help with the process.
- Build a proposal for your own project
- Get comfortable using Markdown notation
- Create a repo in the Data Science Project Groups GitHub Account including these steps:
- Abstracts
- Finish filling the Steps of a Data Science Project
- Data Sources? Examples include, but are not limited to:
- Kaggle
- UCI
- Data sets found in R
- Quandl
- API calls:
- Wikipedia
- Google Maps
- Saint Louis Federal Reserve
- Google Analytics
- If not, then select a project from the suggested list or talk to me for project ideas
Links to resources discussed in meeting:
- R/RStudio: https://www.rstudio.com/
- Python: https://www.python.org/
- Inertia7: http://www.inertia7.com/
- GitHub: https://github.com/raviolli77
- Trello: https://trello.com/
- UCI ML Database: https://archive.ics.uci.edu/ml/datasets.html
- Kaggle Datasets: https://www.kaggle.com/datasets
- R Data sets: http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
- Quandl: https://www.quandl.com/
- Wikipedia API: https://www.mediawiki.org/wiki/API:Main_page
- Twitter API: https://dev.twitter.com/docs
- Saint Louis Federal Reserve: https://fred.stlouisfed.org/
- Google Analytics: https://www.google.com/analytics/#?modal_active=none
- Jupyter Notebook: http://jupyter.org/
- R Markdown: http://rmarkdown.rstudio.com/
Some Preliminaries:
-
Are people interested in a Python Hackathon?
- If so when and where works best
-
Has your team created a GitHub Repo for your project within the organizational GitHub (Source: https://github.com/UCSB-dataScience-ProjectGroup)?
- Does it have a ReadMe explaining the Steps of a Data Science Project?
- Did you all agree which versions/interface for the language you will be using?
- Did you reach a conclusion of what models/approach you will take?
- If not, give us an overview of what you plan to do; by the end of this meeting the project should be more or less decided
Team Resources
- Has your team...
- Been in contact through Slack?
- Been doing Stand Ups?
- Been addressing issues in going about your project or any preliminary practice for your project
- Asked for help?
Here we're giving a quick overview of how GitHub works. It is meant as a rudimentary guide for those of you who are new to GitHub. We can spend an entire day going over the workflow of GitHub, but for now we're concerned with just getting your feet wet, and soon creating a repo for your project if you haven't already.
NOTE: One can spend an entire day learning git, so we'll leave that out for this iteration. We will provide resources for git below!
-
Step 1:
- Create a GitHub account (Should go without saying, but you'd be surprised.)
-
Step 2:
- You should create a myProject folder where you keep all your projects. This will help with organization for later on when you'll be doing a shit load of projects and prior when publishing projects!
- Create a folder for your project where you will include things like, but not limited to:
- README file - This file will be other people's introduction to your project, so make it pretty and easy to follow (in .md format)! I use Sublime Text to create and edit README files (there's a plethora of text editors like Notepad++, Atom, etc.; really it's all personal preference)
- Script files - These files will be in the format of the language you are doing your project in, so either an R file or a Python file (.R, .py, or .sql)
- Data file (not sure what the proper name for this is; will edit later) - This file is where your data is stored if you are using a static data source. Typically it can be a:
- .csv file
- .txt file
- .JSON file
- .db file
- Image folder - For organizational purposes we usually create an image folder which is where we store all images produced in the project if we plan on hosting them or making them viewable without having to run/save the code. Inside this folder you will find static image files like:
- .png files (favored for producing statistical images)
- .jpeg
- .gif
- Once you get more acquainted with GitHub there will be more files that you will add, but for this example these will do
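The folder layout described in Step 2 can be sketched as a small script. This is only an illustration: the folder names `data` and `images`, the file name `script.py`, the project name, and the helper `scaffold_project` are assumptions, so rename them to fit your team's project. The demo runs in a temporary directory.

```python
import tempfile
from pathlib import Path


def scaffold_project(root: Path, name: str) -> Path:
    """Create a bare project folder with README, script, data, and images."""
    project = root / name
    # Folders for static data files (.csv, .txt, ...) and produced plots.
    (project / "data").mkdir(parents=True, exist_ok=True)
    (project / "images").mkdir(exist_ok=True)
    # The README is other people's introduction to the project.
    readme = project / "README.md"
    if not readme.exists():
        readme.write_text(
            f"# {name}\n\nDescribe the data set, the models, and how to run the code.\n"
        )
    # Empty placeholder for the main script (.py here; .R or .sql also work).
    (project / "script.py").touch()
    return project


# Demonstrate in a throwaway directory; the project name is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold_project(Path(tmp), "classification_IrisFlowersPython")
    created = sorted(p.name for p in project.iterdir())
    print(created)
```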
-
Step 3:
- Once you have the folder for your project and all the respective files you wish to include in the repo, go to the main page of GitHub and click the green button that says New repository
- Add the Repo name: we usually name our repos as such
- statisticalModel_DataSetDescription
Ex.
- classification_IrisFlowersR
- regression_bostonHousingR
- Add a description: give a brief overview of what your project will be about to help give people context.
Ex.
- A collection of alternate R markdown templates
- Repo for a quick ggplot2 tutorial for Exploratory Analysis using Jupyter Notebook and R script
- Leave it as public: Make it accessible to everyone
- Initialize with a README - ALWAYS initialize with a README: this acts as an instructional overview for your project
- You typically include steps that were required that you can't express in your code (i.e. Creating a plotly account, steps needed if there are multiple scripts in your project)
- A brief overview of your data set and statistical models used in the project
- This will help later on if you plan to publish on inertia7!
- Updates made to your project since its last iteration
- Look at the inertia7 README's for some concrete examples
-
Step 4: Since you will be working in a team, you have to be familiar with branches. Branches are different versions of the project, so they are a good way for your group to work on the project without fucking up the master branch.
-
(Master Branch: This is the version the world will see and use, so make sure that this branch is the best iteration/is deployable)
- Create a branch and call it something like ravi_branch
- You and each person in your team should have a branch that shows your iteration of the project, in case you go ahead or test something out that you haven't discussed with your teammates yet.
-
Step 5: Say you and your group are in agreement that your branch is the version you want on the master branch, the next step is creating a Pull Request.
-
(Pull Request: Allows people to review any changes made in a project, make modifications before the master branch changes, and overall help a team work efficiently)
- Go into the branch you want to merge, so ravi_branch in our example
- Click New Pull Request
- Here you will see the two branches being compared: the base will typically be the master branch and the compare branch will be ravi_branch in our example.
- Add a description of some of the changes you made!
- GitHub will give you an overview of the changes made in files
- Once you have reviewed everything click Create pull request
- This is where other teammates will be notified that you want to merge your branch into the master branch
- If everyone is in agreement you click Merge pull request
- Then, click Confirm merge and the master branch will now have the same contents as ravi_branch
That's a quick and rough tutorial on working in GitHub. It doesn't go over everything, but it should give context as to how to work as a team using GitHub and branches. I have provided sources that go into more detail and definitely explain it better, so I would suggest reading up on them!
Homework:
- Will depend on conversations we have on Wednesday to see where your team is at
- Have a repo within the organizational repo by the end of today!
- Create branches for each teammate
- Set up a meeting time outside of Wednesday
Links to resources discussed in meeting (NOTE (2/14): Moved GitHub-related resources to Recommended Resources for entire quarter):
Some Preliminaries:
-
Python Hackathon (Workshop)
- Steps needed to be taken before we can start/set up the hackathon:
- Install Python3.X
- Use a Virtual Environment for your project if it will be in Python
- Fill out the Google survey sent last night:
- We need to gauge date, time, and funds to make sure it will run smoothly
-
Rewards!!!
- HG Data Hackathon
- Date proposition: April 21st from 2pm to 10pm
- Most likely broken into 5-6 teams, with an HG Data Engineer paired with each team
- Spoke with Jason
- Informal presentation of projects with congratulatory refreshments
- Reward for Best Data Visualization
- Reward for Best insight/best modeling
- Reward for Best presentation
- Jun Seo can speak of presentation of projects for library staff!
-
Major issues to address for today:
- Does every team have a requirements.txt for their project?
- Some README's need more detail (I will go about doing informal interviews with each group today)
- By today your team should have decided on algorithms, methods, and Python versioning.
- Branches for team members
Depending on attendance, we want today to really show us the early iteration of your project, so:
-
Have a script with modules you will be using
-
Data set attached to your repo
-
Algorithms you will use
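On the requirements.txt question above: the file is just a list of `name==version` lines, one per package, and running `pip freeze > requirements.txt` inside your virtual environment is the usual way to produce it. The sketch below shows the same idea for a hand-picked list of packages using the standard library's `importlib.metadata`. The helper `requirements_lines` and the package list are assumptions for illustration, not a list your project must use.

```python
from importlib import metadata


def requirements_lines(package_names):
    """Return 'name==version' pins for the named packages."""
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Not installed here: flag it rather than guess a version.
            lines.append(f"# {name} is not installed in this environment")
    return lines


# Hypothetical packages a Python project in this group might rely on.
pins = requirements_lines(["numpy", "pandas", "scikit-learn"])
print("\n".join(pins))
```

Saving that output as requirements.txt in the repo lets teammates recreate the same setup with `pip install -r requirements.txt`.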
Some Preliminaries:
- Python Hackathon (Workshop)
- Confirmed Date: 2/25/2017 at 10 a.m.
- Buy shirts to rep!
- Contact me after to get them from other officer. I can take Venmo!
- Rewards (Reiterate because a lot of people were MIA)!!!
- HG Data Hackathon
- Date proposition: April 21st from 2pm to 10pm
- Most likely broken into 5-6 teams, with an HG Data Engineer paired with each team
- Informal presentation of projects with congratulatory refreshments near end of this quarter
- Reward for Best Data Visualization
- Reward for Best insight/best modeling
- Reward for Best presentation
- The informal presentation can be a prep for the presentation to the Library faculty
- Most likely scheduled at the start of next quarter (Ask Jun-Seo if you have any questions)
- Project will be posted in the newest iteration of int7x (inertia7)!
- Team Management
- Word from me regarding team
- We need teams to start applying Stand Ups now (Mandatory)
- Must be done before starting your sessions and immediately when your team finishes the meet-up.
- Will demonstrate again with more feedback given to teams
Today will be an important catch-up day for many teams since midterm season was (is) around
- I will go around to teams and ask about each project's:
- repository
- code
- README
Today will be focused mostly on iterating projects.
Carry on. Nothing to see here.
For this week I decided we are going to do a surprise project presentation.
Announcements: Thank you to everyone who participated in the Python Workshop
I will need every team to do the following:
- Update all scripts on their GitHub repo in the ProjectGroupWinter2017.
- README.md
- script.py
- All appropriate data files (i.e. csv files, txt files, etc.)
- Images (inside images folder) that were produced for this project
- Be prepared to pitch your idea to me.
- Sell that shit.
- Why is your project relevant to Data Science and the data community as a whole?
- (Not 100%) I would like to see some scripts/notebooks being run during the presentation, but due to time constraints we might only use what's on GitHub.
Each group presentation should be no longer than 15 minutes
- Thank You's
- Dedications
- Food for thought for next quarter
Some Preliminaries:
Only $1 apiece! Go show some support to our friends at the Female Actuarial Association. Find event link Here
- Location: SRB
- Date: March 14, 2017
- Time: 11AM - 3PM
The Org. wants a packed house for the Farmer's Insurance Data Talk so let's all make it out! Facebook event link Here
- Location: UCen SB Harbor Room
- Date: March 9, 2017 (So tomorrow)
- Time: 6PM - 8PM
- Will NOT BE FOCUSED on actuary based stuff (Will focus on Natural Language Processing so highly relevant to our group)
- Location: HG Data Offices
- Time: April 21st
- Will most likely work on a tutorial with Calvin during Spring Break to help prep
- Location: Chapman University
- Time: April 21st as well
- Team of 5 to attend
- NOTE: Jason wants the people attending the Chapman Data Fest to be of different class levels (i.e. Freshman, Sophomore, Junior, Senior and Super Senior)
- Let me know if you're interested in this event! Link for Event Here
We have a confirmed date!
- Location: Same location so here
- Time: April 26th at 7pm
- Need y'all to use today to prep and keep track of progress!
- Make GitHub repos pretty
- Code readable
- Write nice docs
- Make plots pretty with titles, axis labels, and legends
Let's really flex for this. Everyone worked hard!
We would like your team to use inertia7 to present your projects so this is a good segue for the next section
We know dead week and finals are fast approaching but we were wondering if anyone would be interested in User-testing the new iteration of inertia7 to give constructive criticism.
- Doesn't have to be publishing a project. Can just play with the app
- If interested, talk to me or David
- Follow Link to apply for credentials
Things needed by the end of this meeting:
- Updated Scripts
- Updated README's
- Add any appropriate images
- Create plotly account to publish plotly graphs (if applicable)
- To-do list detailing what is still needed for your project
- Keep in contact with partners over break.
- If you're bored during break work on the project!
IMPORTANT TO NOTE: Since finals are approaching, your group needs to set this up in its repo, since there will be a gap period of 3 weeks. I need to know where your team is at and the context around it. You CAN'T leave until your team shows me the repo and the outline of what is done and what isn't done.
Three weeks is a long time, and if there's no structure as to where you're at, you will forget and the project will be hard to pick back up.
For those of you who feel you are ready to iterate on the presentation part of your project talk to me by the end of today's meeting.
Again thank you for a wonderful quarter and hope to see you all again next quarter!
-
README Resources:
-
GitHub Resources:
-
Git Resources:
- Set Up Git Article
- Create a Repo Article
- Fork A Repo (not discussed in this meeting, but an important part of the GitHub workflow)
- Be social (Great place to discover cool shit on GitHub)
- David's Git Repo
-
Text Editors Resources:
-
Python Resources:
- Python for Data Analysis (Brush up on NumPy and learn Pandas from the man who created it!)
- Vincent La's Personal Website (Raul's Note: Great place to review/learn Python if you're really rusty)
- Python Documentation (For more advanced users, the documentation for the programming language is a clutch resource)
- Learn Python the Hard Way (Haven't gone through it yet but will soon; dank resource for learning Python)
- Yhat (Great resource for machine learning application with Python)
- David's Repo: learnPython
- Hitchhiker's Guide to Python
- Sklearn Docs
- Plotly examples in Python
-
R Resources:
- R-bloggers (Great place to see people contributing projects and tutorials by real R users)
- ggplot2 docs
- ggplot2 Cheat Sheet (For visualizations)
- Quick-R
- Plotly examples in R
- R for Data Science (Learn from some of the R greats including Hadley Wickham, creator of many famous R packages)
- An Introduction to Statistical Learning with R (Great book used in many UCSB PSTAT Classes)
-
Misc.
- Kaggle (Great resource for all things data science)
- DataCamp
- Analytics Vidhya (Lot of great tutorials relating to machine learning)
- Stack Overflow (Stack Overflow is love, Stack Overflow is life)
- w3schools tutorials (Great place to learn other important tools like, but not limited to: HTML, SQL (I used this one a lot), website development)