Data Mining and Big Data Analytics
Graduate Program (& Advanced Certificate) Status
Course Description

Data mining and big data analytics are core subjects in data science that aim to develop methods for examining large, multivariate datasets. Their common purpose is to uncover hidden patterns, unknown correlations and other information that supports better decisions. In this course we will introduce methods of data acquisition and the concepts of data mining, machine learning and big data analytics. We will cover the key data mining methods of clustering, classification and pattern mining, together with practical tools for their execution. We will also demonstrate these tools on real datasets, to show how they can help us analyse the digital traces of human activities at societal scale and understand and forecast many complex socio-economic phenomena. The course takes a hands-on approach, with homework assignments, practical classes and the development of a project. Students are free to work in any programming language or network software they feel most comfortable with. However, all in-class examples and sample code will be provided in Python and Jupyter notebooks, so the use of Python is strongly encouraged.

 

Course schedule:

DATA MANAGEMENT AND ENGINEERING
1. Introduction to data mining tasks and data types
2. Preprocessing and feature engineering: data curation and filtering, imputation, scaling, dealing with categorical variables. Feature selection.
DATA MINING METHODS
3. Basic classification methods: decision trees, k-nearest neighbors, support vector machines, naïve Bayes classifier
4. Model evaluation: generalization, overfitting and underfitting. Cross-validation. Model evaluation and comparison (e.g., metrics for classification, metrics for regression, confusion matrix, precision-recall curves, ROC curves).
5. Basic clustering methods: distance-based (separation, centroids, contiguity), density-based, partitional vs. hierarchical. Methods for centroid-based clustering (k-means), hierarchical clustering (single, complete and average linkage), density-based clustering (DBSCAN).
6. Hands-on session: application of concepts to data and real-world situations.
7. Dimensionality reduction: Singular Value Decomposition, Principal Component Analysis, embeddings
8. Outlier analysis: extreme value analysis, probabilistic methods, distance- and density-based methods for outlier detection
SPECIFIC DATA MINING AREAS
9. Spatial data mining: location inference, spatial demography inference, spatial trajectory reconstruction, learning from remotely sensed data
10. Text data / web data mining: text mining, text embedding, large language models
11. Graph data mining: network embedding, community detection methods
12. Final project presentation
Learning Outcomes

The aim of the course is to provide a basic but comprehensive introduction to data mining. By the end of the course students will be able to:
• Design basic data collection strategies and obtain data from a number of open data sources;
• Choose the right algorithms for data science problems;
• Demonstrate knowledge of statistical data analysis techniques used in decision making;
• Apply principles of Data Science to the analysis of large-scale problems;
• Implement and use data mining software to solve real-world problems.

Assessment

Students are expected to attend lectures and hands-on sessions, to hand in 1 to 3 assignments during the course and to develop a project during the entire term.

Grading:
• Attendance at classes and hands-on sessions: 30% of the final grade
• Assignments: 30% of the final grade
• Final project: 40% of the final grade

Prerequisites

DNDS 6288 Scientific Python.

Course Level: Doctoral
Academic Year: 2023-2024
Term: Winter
US Credits: 2
ECTS Credits: 4
Course Code: DNDS 6005