Data Mining
CS 6140/5140/4140, Spring 2023
Mo/We 3:00-4:20 PM, L101 WEB; streamed on the UofUDataScience YouTube channel
Instructor: Ana Marasović
Lead TA: Zhichao Xu
Through this website we will share only the schedule, details of project components, and lecture notes and videos. Everything else—syllabus, announcements, discussion, turning in assignments/project components, communicating grades, policies—will be done through Canvas.
Please use the Canvas discussion forum as the preferred medium for interacting with the instructor and the teaching assistants rather than emailing/messaging directly. Take advantage of the instructor and TA office hours. We will work hard to be accessible to students. Don’t be shy if you don’t understand something: come to office hours, post in the forum, send emails, or speak up in class!
Description
Data mining is the study of efficiently finding structures and patterns in large data sets. We will focus on several aspects of this: (1) converting from a messy and noisy raw data set to a structured and abstract one, (2) applying scalable and probabilistic algorithms to these well-structured abstract data sets, and (3) formally modeling and understanding the error and other consequences of parts (1) and (2), including choice of data representation and trade-offs between accuracy and scalability. These steps are essential for training as a data scientist.
Algorithms, programming, probability, and linear algebra are required tools for understanding these approaches.
Topics will include: similarity search, clustering, regression/dimensionality reduction, graph analysis, PageRank, and small space summaries. We will also cover several recent developments, and the application of these topics to modern applications, often relating to large internet-based companies.
Upon completion, students should be able to read, understand, and implement ideas from many data mining research papers.
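To give a concrete, if tiny, flavor of steps (1) and (2) above, here is a minimal Python sketch: it converts two raw strings into sets of character 3-grams and then estimates their Jaccard similarity with min hashing, a scalable probabilistic technique covered early in the schedule below. This sketch is purely illustrative and not taken from the course materials; the example strings, the choice of k=3, and the salted use of Python's built-in hash are our own assumptions.

```python
# Illustrative sketch of the pipeline described above:
# (1) turn raw text into an abstract set representation (character 3-grams),
# (2) apply a scalable probabilistic algorithm (MinHash) to estimate
#     the Jaccard similarity between the resulting sets.
import random

def kgrams(text, k=3):
    """Step (1): represent a raw string as the set of its character k-grams."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A n B| / |A u B| between two sets."""
    return len(a & b) / len(a | b)

def minhash_estimate(a, b, t=200, seed=0):
    """Step (2): estimate Jaccard similarity with t cheap random hash functions.
    Each hash function picks the same minimizing element from a and b with
    probability equal to their Jaccard similarity, so the fraction of
    agreements over t trials estimates that similarity."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(t):
        salt = str(rng.random())  # a fresh salt simulates a new random hash function

        def h(x, _salt=salt):
            return hash(_salt + x)

        if min(a, key=h) == min(b, key=h):  # do a and b share the minimizing k-gram?
            agree += 1
    return agree / t

doc1 = kgrams("data mining finds structure in large data sets")
doc2 = kgrams("data mining finds patterns in massive data sets")
print("exact Jaccard:   ", round(jaccard(doc1, doc2), 3))
print("MinHash estimate:", round(minhash_estimate(doc1, doc2), 3))
```

Lectures and the M4D/MMDS readings develop these ideas (k-grams, Jaccard similarity, min hashing, LSH) much more carefully than this sketch does.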
Materials
The main text for this course is the nearly-complete book Mathematical Foundation for Data Analysis (M4D), version v0.6. However, the lectures will follow Prof. Jeff M. Phillips' related Data Mining course notes more closely, and in several cases that material has not made it into the book (yet?).
We will also often link to two other online resources that cover similar material, either with a more applied or theoretical focus:
- MMDS (v1.3): Mining Massive Data Sets by Anand Rajaraman, Jure Leskovec, and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
- FoDS: Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravindran Kannan. It provides some proofs and formalisms not explicitly covered in lecture.
Calendar
The calendar and readings are tentative and subject to change.
Date | Theme | Topic | Lecture Materials | Readings | Work due |
---|---|---|---|---|---|
1/9 | Class Overview | | [recording] [slides] | | |
1/11 | Statistics Principles | | [recording] [slides] [notes] | M4D 2.2-2.3; MMDS 1.2; FoDS 12.4 | |
1/16 | Martin Luther King Jr. Day holiday | ||||
1/18 | Similarity | Jaccard + k-Grams | [recording] [slides] [notes] | M4D 4.3-4.4; MMDS 3.1-3.2; FoDS 7.3 | HW Statistical Principles due |
1/23 | cont. | Min Hashing | [recording] [slides] [notes] | M4D 4.6.6; MMDS 3.3 | |
1/25 | cont. | LSH | [recording] [slides] [notes] | M4D 4.6; MMDS 3.4 | Project proposal due |
1/30 | cont. | Distances | [recording] [slides] [notes] | M4D 4-4.3; MMDS 3.5 + 7.1; FoDS 8.1 | |
2/1 | cont. | Embeddings | [recording] [slides] [notes] | [Speech & Language Processing Sec 6]; [Bolukbasi et al., 2016]; see Piazza for more | HW Document Hash due |
2/6 | Clustering | Hierarchical | [recording] [slides] [notes] | M4D 8.5 + 8.2; MMDS 7.2; FoDS 7.7 | |
2/8 | cont. | K-Means | [recording] [slides] [notes] | M4D 8-8.3; MMDS 7.3; FoDS 7.2-7.3 | HW LSH due |
2/13 | cont. | Spectral | [recording] [slides] [notes] | [von Luxburg, 2007]; M4D 10.3; MMDS 10.4; FoDS 7.5 | |
2/15 | Streaming | Model and Misra-Greis | [recording] [slides] [notes] | [Cormode and Hadjieleftheriou, 2010]; M4D 11.1-11.2.2; FoDS 6.2.3 | Project data collection report due |
2/20 | Presidents Day holiday | ||||
2/22 | cont. | Count-Min Sketch | [recording] [slides] [notes] | M4D 11.2.3-4; Stanford CS166 L10 | HW Clustering due |
2/27 | cont. | Count Sketch | [recording] [slides] [notes] | [Cormode and Hadjieleftheriou, 2010]; [Goyal et al., 2012]; M4D 11.2.3-4; Stanford CS166 L11; see Piazza for more | |
3/1 | MIDTERM EXAM | | | | HW Frequent due |
3/6 | Spring break | ||||
3/8 | Spring break | ||||
3/13 | Regression | Linear Regression | [recording] [slides] [notes] | M4D 5-5.2; ESL 3.2 + 3.4 | |
3/15 | cont. | Nonlinear Regression; Regularization | [recording] [slides] [notes] | M4D 5.3-5.5; Extra FoDS 10.2; [Tropp and Gilbert, 2007] | |
3/20 | Dim Reduce | SVD + PCA | [recording] [slides] [notes] | M4D 7-7.3 + 7.5; FoDS 3-3.6 + 3.8-3.9.2 | Project intermediate report due |
3/22 | cont. | Projects peer feedback; Random Projection | [recording] [slides] [notes] | M4D 7.10; FoDS 2.7 | Read 2 peer intermediate reports due |
3/27 | cont. | Frequent Directions | [recording] [slides] [notes] | M4D 11.3; [Ghashami et al., 2015] | |
3/29 | cont. | MDS, LDA, Metric Learning | [recording] [slides] [notes] | M4D 7.6-7.8; LDA | HW Regression due |
4/3 | Noise | Privacy | [notes] | [prev year lecture]; [blog] | |
4/5 | cont. | Metric learning (cont); Noise in Data | [recording] [slides] [notes] | M4D 8.6; [tutorial] | HW Dim Reduce due |
4/10 | Graphs | Matrix completion; PageRank | [recording] [slides] [notes] | M4D 10.2; MMDS 5.1 + 5.4 | |
4/12 | cont. | PageRank | [recording] [slides] [notes] | MMDS 5 | Project final report due |
4/17 | cont. | MapReduce | [recording] [slides] [notes] | MMDS 2 | Project poster outline due |
4/19 | Wrap-up | Looking back & Semester course feedback | [recording] [slides] | | Project poster final version due |
4/24 | Poster presentations | ||||
4/26 | No class (reading day) | | | | HW Graphs due |
5/1 | No class (exam week) | ||||
5/3 | FINAL EXAM | ||||
Acknowledgements
The content of this webpage is almost entirely copied from a previous instance of this course that was designed and taught by Prof. Jeff M. Phillips.