Introduction to Data Mining and Big Data Analytics
Office: N11 609
Office hours: TBA or by appointment
Course Description
Data mining and big data analytics is the process of examining data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. This course is an introduction to the concepts of data mining, machine learning and big data analytics. We will cover the key data mining methods of clustering, classification and pattern mining, together with practical tools for their execution. We will apply these tools to a number of datasets, showing how theory and digital traces of human activities at societal scale can help us understand and forecast many complex socio-economic phenomena. The course takes a practical approach, with homework assignments, hands-on classes and the development of a project. Students are free to work in any computer language/network software they feel most comfortable with. However, all examples and sample code in class will be provided in Python and Jupyter notebooks, and use of Python is strongly encouraged.
Lectures: theory classes and hands-on sessions. Use of a computer will be required during some lectures. Students can use their own laptops. Instructions on the required software will be provided during the first class.
- Class 1: Test to check the prerequisites for the course. Introduction to the course.
- Class 2: Motivations for data mining. Examples of application domains. Data types and formats. Exploratory data analysis and data understanding
- Class 3: Clustering. Taxonomy of clustering concepts: distance-based (separation, centroids, contiguity), density-based, partitional vs. hierarchical. Methods for centroid-based clustering (k-means), hierarchical clustering (single, complete and average linkage), density-based clustering (DBSCAN).
- Class 4: Model learning and model validation. Explanation vs. prediction. Rule-based classifiers and decision trees. Naïve Bayes classifiers.
- Class 5: Hands-on session (application of concepts to data and real-world situations)
- Class 6: More on decision trees
- Class 7: Basic machine learning models (K-nearest neighbors, linear discriminant analysis, support vector machines, ensemble methods)
- Class 8: Feature selection
- Class 9: Recommendation algorithms
- Class 10: Introduction to frequent itemset mining. Applications for finding association rules. Level-wise algorithms, Apriori.
- Class 11: Hands-on session
- Class 12: Final project presentation
Textbooks and reading
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2006.
- Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman: Mining of Massive Datasets, Cambridge University Press.
- A list of papers and online resources will be provided during classes
Further information, such as the course website, assessment deadlines, office hours and contact details, will be given during the course.
The instructor reserves the right to modify this syllabus as deemed necessary at any time during the term. Any modifications to the syllabus will be discussed with students during a class period. Students are responsible for information given in class.
The aim of the course is to provide a basic but comprehensive introduction to data mining. By the end of the course students will be able to:
- Choose the right algorithms for data science problems
- Demonstrate knowledge of statistical data analysis techniques used in decision making
- Apply principles of Data Science to the analysis of large-scale problems
- Implement and use data mining software to solve real-world problems
What you will NOT learn in this course: Coding and data visualization. This course is about the methods and algorithms to find information in the data. However, we will not discuss details of implementation or data handling. For learning to code, consider attending “Scientific Python MATH5016”. For learning to visualize data, consider attending “Data and Network Visualization CNSC6012”. Both courses are offered during the fall term.
Assessment
- Attendance at classes and hands-on sessions: 30% of the final grade
- Assignments: 30% of the final grade
- Final project: 40% of the final grade
Prerequisites: basic programming skills in a language such as Python, R, Matlab, Mathematica or similar, and basic skills in statistics and linear algebra.