Lectures: theory classes and hands-on sessions. Use of a computer will be required during some lectures. Students can use their own laptops. Instructions on the required software will be provided during the first class.
Topics and tentative calendar:
- Class 1: Test to check the prerequisites for the course. Introduction to the course. Introduction to data mining and the knowledge discovery process. Examples of application domains. Data types and formats.
- Class 2: Types of learning (e.g., supervised, unsupervised, semi-supervised, reinforcement learning). Data mining tasks (e.g., classification, regression, probability estimation, clustering). Exploratory data analysis and data understanding. Explanation vs. prediction.
- Class 3: Basic machine learning models: K-Nearest Neighbors. Decision Trees. Naïve Bayes classifiers.
- Class 4: Generalization, overfitting and underfitting. Cross-validation. Model evaluation and comparison (e.g., metrics for classification, metrics for regression, confusion matrix, precision-recall curves, ROC curves).
- Class 5: Hands-on session: application of concepts on data and real-world situations.
- Class 6: Alternative machine learning models: Support Vector Machines, Linear Discriminant Analysis, Ensemble methods.
- Class 7: Preprocessing and feature engineering (e.g., imputation, scaling, dealing with categorical variables). Feature selection. Dimensionality reduction. Learning from imbalanced data.
- Class 8: Clustering. Taxonomy of clustering concepts: distance-based (separation, centroids, contiguity), density-based, partitional vs. hierarchical. Methods for centroid-based clustering (k-means), hierarchical clustering (single, complete and average linkage), density-based clustering (DBSCAN).
- Class 9: Introduction to frequent itemset mining. Applications for finding association rules. Level-wise algorithms, Apriori. Introduction to recommender systems.
- Class 10: Final project presentation.
In short: don't cheat! You may work with friends to help guide problem solving, or consult Stack Overflow (or similar) to work out a solution, but copying from friends, previous students, or the Internet is strictly prohibited. NEVER blindly copy blocks of code; we can tell immediately.
If caught cheating, you will fail this course. Ask questions in recitation and at office hours. If you're really stuck and can't get help, write as much code as you can and write comments within your code explaining where you're stuck.
Textbooks and reading
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2006.
- Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman: Mining of Massive Datasets, Cambridge University Press.
- A list of papers and online resources will be provided during classes
Further information, such as the course website, assessment deadlines, office hours, and contact details, will be given during the course.
The instructor reserves the right to modify this syllabus as deemed necessary at any time during the term. Any modifications to the syllabus will be discussed with students during a class period. Students are responsible for information given in class.