Data Mining and Big Data Analytics

Course Description: 
Course code: CNSC 6006
Course Instructor: Prof. Roberta Sinatra, sinatrar@ceu.edu

Office: N11 609

Office hours: TBA or by appointment

IMPORTANT: During the first class, we will hand out a test to check that students meet the prerequisites. Students who do not reach the minimum threshold will not be able to take the course, even if they are regularly registered. Both students who are currently registered and those on the waiting list need to take the test during the first class. The test does not count toward the final grade of the course.

Course Description

Data mining and big data analytics is the process of examining data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. This course is an introduction to the concepts of data mining, machine learning and big data analytics. We will cover the key data mining methods of clustering, classification and pattern mining, together with practical tools for their execution. We will apply these tools to a number of datasets, showing how theory and digital traces of human activities at societal scale can help us understand and forecast many complex socio-economic phenomena. The course takes a practical approach, with homework assignments, hands-on classes and the development of a project. Students are free to work in any programming language or network software they feel most comfortable with. However, all examples and sample code in class will be provided in Python and Jupyter notebooks, and use of Python is strongly encouraged.

Course Organization

Lectures: theory classes and hands-on sessions. Use of a computer will be required during some lectures. Students can use their own laptops. Instructions on the required software will be provided during the first class.

 Topics:

- Class 1: Test to check the prerequisites for the course. Introduction to the course.

-  Class 2: Motivations for data mining. Examples of application domains. Data types and formats. Exploratory data analysis and data understanding

-  Class 3: Clustering. Taxonomy of clustering concepts: distance-based (separation, centroids, contiguity), density-based, partitional vs. hierarchical. Methods for centroid-based clustering (k-means), hierarchical clustering (single, complete and average linkage), density-based clustering (DBSCAN).

-  Class 4: Model learning and model validation. Explanation vs. prediction. Rule-based classifiers and decision trees. Naïve Bayes classifiers.

-  Class 5: Hands-on session (application of the concepts to data and real-world situations)

-  Class 6: More on decision trees

-  Class 7: Basic machine learning models (K-nearest neighbors, linear discriminant analysis, support vector machines, ensemble methods)

-  Class 8: Feature selection

-  Class 9: Recommendation algorithms

-  Class 10: Introduction to frequent itemset mining. Applications to finding association rules. Level-wise algorithms (Apriori).

-  Class 11: Hands-on session

-  Class 12: Final project presentation
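
To give a flavor of the level-wise (Apriori-style) approach listed under Class 10, here is a minimal sketch in plain Python. It is an illustration only, not the course's sample code: the function name `frequent_itemsets`, the `min_support` parameter (an absolute count of transactions) and the toy baskets are all assumptions made for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets contained in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return result


# Toy data, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
found = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step relies on the fact that every subset of a frequent itemset must itself be frequent, which is what lets the search stop as soon as one level produces no frequent itemsets.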

Textbooks and reading

-  Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2006.

-  Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman: Mining of Massive Datasets, Cambridge University Press.

-  A list of papers and online resources will be provided during classes

Further information, such as the course website, assessment deadlines, office hours and contact details, will be given during the course.

The instructor reserves the right to modify this syllabus as deemed necessary at any time during the term. Any modifications to the syllabus will be discussed with students during a class period. Students are responsible for information given in class.

Learning Outcomes: 

The aim of the course is to provide a basic but comprehensive introduction to data mining. By the end of the course, students will be able to:

- Choose the right algorithms for data science problems

- Demonstrate knowledge of statistical data analysis techniques used in decision making

- Apply principles of Data Science to the analysis of large-scale problems

- Implement and use data mining software to solve real-world problems.

What you will NOT learn in this course: Coding and data visualization. This course is about the methods and algorithms to find information in the data. However, we will not discuss details of implementation or data handling. For learning to code, consider attending “Scientific Python MATH5016”. For learning to visualize data, consider attending “Data and Network Visualization CNSC6012”. Both courses are offered during the fall term.

Assessment: 

Grading:

- Attendance of the classes and hands-on sessions: 30% of the final grade

- Assignments: 30% of the final grade

- Final project: 40% of the final grade

Prerequisites: 

Basic programming skills in a language like Python, R, Matlab, Mathematica or similar. Basic skills in statistics and linear algebra.