Statistical methods in network science and data analysis
Course Status: Elective
The increasing volume and nature of big datasets in business, economics, social and political sciences call for more complex and sophisticated data mining tools. The complex systems monitored by big databases are successfully described in terms of networks. In the present course we will present and discuss data mining tools used to characterize large empirical or model networks. Large data sets will be computationally investigated and the limits of the used algorithms will be discussed. The assessment of the statistical validity of the observed results will be analyzed and, when possible, quantitatively evaluated.
Lecture Session title.
Overview of the course. Basic concepts that will be treated in the course. Large data sets. Data mining approaches. Network science approach to large set of data. Computational tools.
2 Background on networks
Networks. Basic indicators of a network. Degree, betweennes, correlation, diameter. Random networks. Scale free networks. Small world. Core periphery networks.
3 Background on probability and statistics
Basic concepts of probability. Basic concepts of statistical inference. Methods of statistical inference.
4 Data mining basic concepts
Types of data. Different types of attributes. Asymmetric attributes. Dimensionality, sparsity and resolution. Data pre-processing. Measures of similarity and dissimilarity.
5 Network analysis with numerical tools
Basic introduction to R and RStudio. Assignment. Variables (objects). Vectors, matrices, arrays. List and data frames. Subsetting, indexing. Branching conditions and loops. Functions. Graphics. Packages. RStudio environment.
6 Classification in data mining
Classification and regression for predictive analysis. Classification model. Target function. Training set and test set. Decision trees. Sensitivity and specificity. Model overfitting.
7 Principal Component Analysis
Properties of the PCA. Selection of the most relevant principal components. PCA with covariance and correlation matrices. Selection of the principal components.
8 Random matrix theory and data mining
The concept of random matrix. Density function of eigenvalues. The Semi-circle law. Statistical uncertainty of a correlation matrix. The Marcenko-Pastur probability density function.
9 Cluster analysis
Similarity measures. Agglomerative and divisive clustering techniques. k-means clustering algorithm. Single linkage, average linkage and Complete linkage. Comparison of outputs of different methods.
10 Proximity based networks
Similarity based networks. Correlation networks. Partial correlation networks. Planar maximally graph networks.
11 Sampling and estimating in networks
Basic elements of statistical sampling theory. Induced and incident subgraph sampling. Star and snowball sampling. Estimation of totals network graphs
12 Estimating the degree distribution in sampled graphs
Mapping to the problems of species detection or population detection. The case of random and scale free graphs. The case of Internet. Estimating in subgraphs sampled by induced and incident subgraph sampling. Star and snowball sampling. Estimation of original network graphs.
13 Midterm test
14 Information networks
The World Wide Web. Information networks and associative memory. The Bow-Tie structure of the web.
15 Link analysis and web search
Searching the web. The ranking of web pages. Link analysis using hubs and authorities. PageRank. Link analysis in modern web search. Applications beyond the web.
16 Link prediction
Problem description and evaluation metrics. Local similarity indices. Global similarity indices. Applications of the problem of link prediction. Classification of partially labeled nodes.
17 Estimation of the weights of a network
Networks with information about marginal distributions of link weights. Maximum entropy estimation. Minimum density estimation.
18 Multiple hypothesis test correction and information filtering procedures
Total information awareness. Multiple hypothesis test corrections. The Bonferroni correction. The False Discovery Rate corrections. False positive, False Negative, True positive and True negative outcomes.
19 Association networks and statistically validated networks
Networks based on statistical tests. Statistically validated networks. Precision and accuracy of a statistically validated network. Statistically validated network in bipartite networks. Statistically validated networks in weighted networks.
20 Statistical aspects of community detection algorithms
Overview of community detection algorithms. Stochastic and deterministic algorithms. Indicators for the comparison of different partitions. Landscape of the fitness of a partition.
21 Motifs definition and detection. Comparison with a null hypothesis
Motifs in a directed graph. Basic motifs. Algorithms to find motifs. Optimality of the triangle-finding algorithm. Shuffling ,methods in directed networks.
22 Time dependent network models
Network evolution models. Nodal attribute models. Exponential random graph models. Stylized models of social network formation. Hysteresis in exponential random graphs and in other time dependent networks.
23 Network flow analysis: an example
Internet traffic analysis: a case study on two large backbone networks. Diagnosing volume anomalies. Traffic matrix estimation. Flow of funds data in economics. Input-output analysis and estimation. Transfer of fluctuations.
24 Projects' presentation
Eric D. Kolaczyk, Statistical Analysis of Network Data. Springer, 2009.
Anad Rajaraman, Jure Leskovec and Jeffrey D. Ullman, Mining of Massive Datasets, Book available online.
David Easley and Jon Kleinberg, Networks crowds and markets, Cambridge University Press 2010.
Jiawhei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques. Third edition (2012).
Course E-Learning site
To be assigned
Office hours: Upon agreement
By successfully completing the course the students will be able to:
- Learn how to perform data analysis and use statistical methods in the investigation of networks.
- Evaluate the statistical reliability of empirical estimations against an appropriate null hypothesis.
- Learn how to make use of large sets of data for investigating networks observed in the fields of social sciences.
- Perform empirical analyses and statistical validation of large datasets obtained from the Internet or from other business and scientific sources.
(1) Assessment type 1 (30 % of the final grade). Students will get home work consisting of statistical analyses, simple problems or data processing, which they will have to complete individually and submit electronically.
(2) Assessment type 2 (30% of the final grade). The midterm test will be after week 10, consisting of questions related to the course.
(3) Assessment type 3 (40% of the final grade). In the final project work students will have to perform a research project involving the investigation of a large network individually. The investigation and characterization of the network might include one or more of the following aspects: (i) developing of a network model, (ii) running network simulations, (iii) comparing network metrics with null hypotheses, and (iii) critically analyzing results of a large scale empirical investigations. Students will have to prepare a project presentation and a written report.
Knowledge of a basic programming language.