Data mining and big data analytics are core subjects in data science that aim to develop methods for examining large, multivariate datasets. Their common purpose is to uncover hidden patterns, unknown correlations and other information that supports better decisions. In this course we will introduce methods of data acquisition and the concepts of data mining, machine learning and big data analytics. We will cover the key data mining methods of clustering, classification and pattern mining, together with practical tools for applying them. We will also demonstrate these tools on real datasets, to show how they can help us analyse the digital traces of human activities at societal scale and understand and forecast complex socio-economic phenomena. The course takes a hands-on approach, with homework, practical classes and the development of a project. Students are free to work in whichever programming language or network software they feel most comfortable with. However, all in-class examples and sample code will be provided in Python and Jupyter notebooks, so the use of Python is strongly encouraged.
|DATA MANAGEMENT AND ENGINEERING
|Introduction to data mining tasks and data types
|Preprocessing and feature engineering: Data curation and filtering, imputation, scaling, dealing with categorical variables. Feature selection.
|DATA MINING METHODS
|Basic classification methods: decision trees, k-nearest neighbors, Support Vector Machine, Naïve Bayes Classifier
|Model evaluation: Generalization, overfitting and underfitting. Cross-validation. Model evaluation and comparison (e.g., metrics for classification, metrics for regression, confusion matrix, precision-recall curves, ROC curves).
|Basic clustering methods: distance-based (separation, centroids, contiguity), density-based, partitional vs. hierarchical. Methods for centroid-based clustering (k-means), hierarchical clustering (single, complete and average linkage), density-based clustering (DBSCAN).
|Hands-on session: Application of concepts on data and real-world situations.
|Dimensionality reduction: Singular Value Decomposition, Principal Component Analysis, Embedding
|Outlier analysis: Extreme value analysis, Probabilistic methods, distance and density-based methods for outlier detection
|SPECIFIC DATA MINING AREAS
|Spatial data mining: location inference, spatial demography inference, spatial trajectory reconstruction, learning from remotely sensed data
|Text data / web data mining: text mining, text embedding, large language models
|Graph data mining: network embedding, community detection methods
|Final project presentation
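To give a flavour of the hands-on sessions, here is a minimal sketch of one method from the outline above, centroid-based clustering with k-means (Lloyd's algorithm), written in plain Python. It is not taken from the course materials: the function name `kmeans`, its parameters and the toy data are illustrative assumptions. In practice a library implementation would normally be used instead.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Illustrative k-means (Lloyd's algorithm) on a list of numeric tuples.

    Hypothetical helper for demonstration only; not from the course materials.
    """
    rng = random.Random(seed)
    # Initialise centroids with k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data (made up): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On this toy data the first three points should share one cluster label and the last three the other. Which group receives label 0 depends on the random initialisation.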
The aim of the course is to provide a basic but comprehensive introduction to data mining. By the end of the course students will be able to:
• Design basic data collection strategies and obtain data from a number of open data sources;
• Choose the right algorithms for data science problems;
• Demonstrate knowledge of statistical data analysis techniques used in decision making;
• Apply principles of Data Science to the analysis of large-scale problems;
• Implement and use data mining software to solve real-world problems.
Students are expected to attend lectures and hands-on sessions, to hand in 1 to 3 assignments during the course and to develop a project during the entire term.
• Attendance at classes and hands-on sessions: 30% of the final grade
• Assignments: 30% of the final grade
• Final project: 40% of the final grade
DNDS 6288 Scientific Python.