Important links:
- Data Mining vs Machine Learning vs Artificial Intelligence vs Statistics
- What do data scientists get paid?
Name | Mike Izbicki (call me Mike) |
Email | [email protected] |
Office | Adams 216 |
Office Hours | See Issue #69 |
Zoom | See Issue #70 |
Webpage | https://izbicki.me |
Research | Machine Learning (see izbicki.me/research.html for some past projects) |
Fun facts:
- grew up in San Clemente (~1 hr south of Claremont)
- 7 years in the Navy
- nuclear submarine officer, personally converted >10g of uranium into pure energy
- worked at National Security Agency (NSA)
- left Navy as a conscientious objector
- PhD/postdoc at UC Riverside
- taught in DPRK (i.e. North Korea)
General Information:
- This is the theory course for CMC's Data Science major
- Prepares you for industry or graduate school
- Especially for machine learning technical interviews
- No SQL in this course => that's CSCI143 Big Data
Learning Objectives:
- See the Jupyter notebook
- Exposure to research-level data mining
- Understand the latest algorithms... but algorithms get outdated fast.
- The real goal is to teach you how to read research-level papers and math so that you can understand future techniques by yourself
- Major concepts
- Techniques
- Eigen-methods for data mining
- Logistic regression
- Kernel methods
- Neural networks
- word2vec
- Small amount of deep learning (transformers, CNNs, etc.)
- Math
- Bias/variance trade-off
- VC Dimension theorem (fundamental theorem of statistical learning)
- Regularization (L1, L2, elastic net, weight decay, early stopping, etc.)
- Optimization algorithms (gradient descent, stochastic gradient descent, Adam, etc.)
- Programming:
- Writing code that is easy to deploy
- Focus on text/web/social media examples
- Ethical implications of data mining (pet peeve: you can't fully understand the ethics if you don't understand the technical details)
- Apply data mining libraries (PyTorch, scikit-learn, Gensim, spaCy, etc.)
- Teaching you how to use these libraries is NOT the primary goal of the course
- In-person class time will focus on the math, and I'm expecting you can figure out how to use the libraries on your own
Prerequisite knowledge:
- linear algebra
- eigenvectors
- computation
- big-O analysis
- git
- download/use Python libraries
- statistics
- super basic probability
- exposure to linear/logistic regression helpful but not required
Textbook:
I will provide all the reference material for this class. You don't have to buy anything.
- Learning from Data by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin
I am providing you all a free copy. It is yours to keep forever if you'd like (or you can return it to me at the end of the semester and I'll pass it on to future students). Feel free to highlight/take notes/etc in it as if it were your own book, because it is.
- Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
Freely available from Shalev-Shwartz's website
- Lots of research papers / lecture notes
Grades:
Category | Percent | Approximate Date |
---|---|---|
Projects | 30 | Every 2-3 weeks |
Quizzes | 0 | |
Midterm 1 (PageRank) | 15 | Week 03 |
Midterm 2 (Learning from Data) | 15 | Week 08 |
Midterm 3 (Text mining) | 15 | Week 13 |
Final | 25 | |
Projects:
- 4-7 projects
- All of them must be completed on the lambda server (i.e. using ssh+bash+vim). The lambda server has 80 CPUs + 8 GPUs.
- I'm expecting almost everyone will get full credit, and these will act as a "grade boost"
Quizzes:
- There will be 1 quiz per midterm testing definition memorization.
- I will give you the quiz before you take it.
- They are not worth any points, but you must get 100% on the quiz or you will fail the class.
- Unlimited retakes, but each retake takes 1% off your final grade.
Midterms:
- No programming, only math
- Take home, unlimited time, open note
- Very hard exams. (Historically, average in the 70s. No curve.)
Final:
- Oral exam
- The purpose is to help prepare you for interviews.
- The last week of class will be dedicated to prep.
- The final grade can replace your lowest midterm grade, if that would improve your overall grade in the class.
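The grading rules above can be sketched as a small calculation. This is only an illustration, not an official grade calculator: the function names are hypothetical, and it assumes that each quiz retake subtracts one percentage point from the overall course grade and that the final replaces the lowest midterm only when it is higher.

```python
# Weights taken from the grade table (percent of the course grade).
WEIGHTS = {"projects": 30, "midterm": 15, "final": 25}


def weighted_total(projects, midterms, final):
    """Combine percentage scores (0-100) using the table weights."""
    points = (
        WEIGHTS["projects"] * projects
        + WEIGHTS["midterm"] * sum(midterms)
        + WEIGHTS["final"] * final
    )
    return points / 100


def course_grade(projects, midterms, final, quiz_retakes=0):
    """Hypothetical helper: overall course percentage.

    midterms is a list of exactly three percentage scores.
    """
    if len(midterms) != 3:
        raise ValueError("expected exactly three midterm scores")
    # The final may stand in for the lowest midterm, but only if that helps.
    kept = sorted(midterms)[1:]
    replaced = kept + [max(min(midterms), final)]
    total = weighted_total(projects, replaced, final)
    # Assumption: each quiz retake costs one percentage point overall.
    return max(0.0, total - quiz_retakes)


if __name__ == "__main__":
    # Midterm scores of 60, 80, 90 with an 85 on the final and 2 quiz retakes.
    print(course_grade(100, [60, 80, 90], 85, quiz_retakes=2))
```

In the usage example, the 60 on the lowest midterm is replaced by the 85 from the final before the two retake points are subtracted.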
This is a hard class.
- The material is intrinsically hard
- Very few people find linear algebra, statistics and programming to ALL be easy subjects, and this class combines them all
- There's a reason people who understand this material get paid big salaries at FAANG
- You will have to read the required references. Not all the material will be covered in lectures, and that's intentional: it forces you to practice reading research-level data mining text.
- Comments from previous students:
- "Holy fucking shit this was a hard class. I had no idea there was so much god damned fucking math involved in a CS class. You should warn students about that."
- "I spent 20+ hours per week on this class, and still only got a B. The class is too hard and you should make it easier."
Unfortunately, I can't remove the math from this class, and I can't make the class easier. Otherwise, you wouldn't be learning the material needed to pass a technical interview / get a good job / go to grad school.
- NOTE: In all of my other courses, I include required reading/watching tasks to learn about CS/DS culture. This course doesn't have these tasks because there is already a LOT of textbook reading that you will have to complete.
Late Work Policy:
You lose 20% on projects for each day late. It is still typically better to submit a correct assignment late than an incorrect one on time.
If you collaborate with other students, you get an automatic 2-day extension on any project.
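As a rough illustration of how the late penalty and the collaboration extension interact, here is a minimal sketch. The function name is hypothetical, and it makes three assumptions the policy does not spell out: the penalty is 20 percentage points of the project's total per day, a partial day counts as a full day, and a score never drops below zero.

```python
import math


def late_project_score(raw_score, days_late, collaborated=False):
    """Hypothetical helper: project percentage after the late penalty.

    raw_score is the percentage earned before penalties (0-100);
    days_late may be fractional.
    """
    # Collaborating with other students grants an automatic 2-day extension.
    extension = 2 if collaborated else 0
    # Assumption: any part of a day counts as a whole late day.
    penalized_days = max(0, math.ceil(days_late) - extension)
    # Assumption: 20 percentage points per late day, floored at zero.
    return max(0.0, raw_score - 20 * penalized_days)


if __name__ == "__main__":
    print(late_project_score(100, 3, collaborated=True))  # one penalized day
    print(late_project_score(90, 1))                      # one penalized day
```

Under these assumptions, a fully correct project submitted one day late still beats an on-time submission scoring below 80%, which matches the advice that a correct late assignment is typically better than an incorrect one on time.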
Collaboration Policy:
You are encouraged to discuss all labs and projects with other students, subject to the following constraints:
- you must be the person typing in all code for your assignments, and
- you must not copy another student's code.
You may use any online resources you like as references.
Basically, I'm trusting you all to be adults. You are ultimately responsible for ensuring you learn the material! So do what will help you learn best.
WARNING: All material in this class is cumulative. If you work "too closely" with another student on an assignment, you won't understand how to complete subsequent assignments, and you will quickly fall behind. You should view collaboration as a way to improve your understanding, not as a way to do less work.
I've tried to design the course to be as accessible as possible for all students. If you need any further accommodations---even if you don't have an officially recognized disability---please ask.
I want you to succeed and I'll make every effort to ensure that you can.