The increasing volume and changing nature of big data sets in business, economics, and the social and political sciences call for more sophisticated mathematical and statistical tools. The complex systems tracked by large databases are often successfully described as networks. In this course we will present the fundamentals of probability theory and statistics, and apply these tools to characterize large empirical or model networks. We will analyze the statistical validity of the observed results and, when possible, evaluate it quantitatively. Special attention will be given to statistically validated measures widely used in network analysis, such as backbones and mesoscale structures. Besides the mathematical theory, the course takes a practical approach, with homework and hands-on classes. All examples and sample code used in class will be provided in Python as Jupyter notebooks.
Lectures: 12 classes of 100 min. Around 80% of the classes will be theory only. The other 20% will include programming exercises or evaluation of data sets. Therefore, use of a computer will be required during some lectures. Students can form groups and use their own laptops. Instructions on the required software will be provided during the first class.
Overview of the course, prerequisites, related classes and grading. Basic concepts that will be treated in the course. Large data sets, the network science approach to large data sets, the need for new statistical methods for big networked data, black swans and the statistical motivation for focusing on distribution tails.
2 Fundamentals of probability theory and statistics I
Sample space, random variables, conditional probability and independence, Bernoulli random variable, binomial distribution, Poisson distribution, probability density function, cumulative distribution function, joint and marginal probability, expected value and moments.
3 Fundamentals of probability theory and statistics II
Normal distribution, uniform distribution, exponential distribution, thin-tailed vs. heavy- and fat-tailed distributions, moments of heavy-tailed distributions, power laws and scale invariance, Pareto and Zipf distributions, fat tails and distortion of risk estimates, distributions and networks, law of large numbers and central limit theorem.
4 Introduction to statistical inference
Patterns and intuition, Occam’s razor, correlation vs causation, ceteris paribus, process of scientific understanding, statistical inference, parametric and non-parametric models, hypothesis testing and null models, p-value and p-hacking, z-score, regression, maximum likelihood estimation, especially in networks.
5 Statistical limits and data
Total information awareness. Multiple hypothesis test corrections. The Bonferroni correction. False discovery rate (FDR) corrections. False positive, false negative, true positive and true negative outcomes.
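As an illustration of the two corrections named above, here is a minimal sketch in plain Python. The function names and the list of p-values are made up for this example, and the Benjamini-Hochberg step-up procedure is used as one common FDR correction; the class may cover other variants.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    # Indices of the tests, sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values from eight tests (illustrative numbers only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

With these numbers Bonferroni rejects only the smallest p-value, while the FDR procedure rejects the two smallest, which shows why the FDR correction is regarded as the less conservative of the two.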
6 Causal inference and matched sample
Randomized experiment vs observational data, biases and confounding factors, strongly ignorable treatment assignment. Covariate closeness, Mahalanobis distance, propensity score. Matching methods, diagnosis of matches, estimation of causal effects. Case studies.
7 Distribution fitting and generation, degree-degree correlations
Fitting power laws and other distributions. Kolmogorov-Smirnov test. Assortative networks, joint degree distributions.
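A minimal sketch of power-law fitting, assuming a continuous power law with a known lower cutoff: the exponent is estimated by maximum likelihood and the fit is summarized by the Kolmogorov-Smirnov distance. The function names, the synthetic sample and the choice of cutoff are assumptions made for this example only.

```python
import math
import random


def fit_power_law(samples, x_min):
    """MLE of alpha for p(x) ~ x**(-alpha), x >= x_min (continuous case)."""
    tail = [x for x in samples if x >= x_min]
    alpha = 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)
    return alpha, tail


def ks_distance(tail, alpha, x_min):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    xs = sorted(tail)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return d


# Synthetic data drawn by inverse-transform sampling (illustrative only).
rng = random.Random(0)
true_alpha, x_min = 2.5, 1.0
samples = [x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
           for _ in range(5000)]

alpha_hat, tail = fit_power_law(samples, x_min)
d = ks_distance(tail, alpha_hat, x_min)
print(f"alpha_hat = {alpha_hat:.3f}, KS distance = {d:.4f}")
```

The sketch stops at the distance itself. Turning it into a p-value needs an additional step, for example a parametric bootstrap, which is not shown here.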
8 Statistical analysis with Python
Using scientific Python for numerical network analysis: network models, null models, layouts. Using scientific Python for statistics and plotting: scikit-learn, histograms, distributions, central limit theorem.
9 Network filtering and backbones
Similarity-based networks. Correlation networks. Partial correlation networks. Planar maximally filtered graphs. Disparity filter. Pólya-urn-based filter.
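To give a flavour of backbone extraction, here is a minimal sketch of a disparity filter for an undirected weighted graph. It assumes the usual null model in which a node's strength is split uniformly at random among its k edges, so an edge carrying a fraction p of the strength has p-value (1 - p)^(k - 1). The function name, the edge-dictionary layout and the toy graph are assumptions made for this example.

```python
def disparity_filter(edges, alpha=0.05):
    """Keep edges that are significant for at least one of their endpoints.

    edges: dict mapping (u, v) -> positive weight, undirected graph.
    """
    strength, degree = {}, {}
    for (u, v), w in edges.items():
        for node in (u, v):
            strength[node] = strength.get(node, 0.0) + w
            degree[node] = degree.get(node, 0) + 1

    backbone = {}
    for (u, v), w in edges.items():
        p_values = []
        for node in (u, v):
            k = degree[node]
            share = w / strength[node]
            # Null-model p-value; equals 1.0 for degree-1 nodes, so a
            # leaf can never validate its only edge on its own.
            p_values.append((1.0 - share) ** (k - 1))
        if min(p_values) < alpha:
            backbone[(u, v)] = w
    return backbone


# Toy star graph: one heavy edge and nine light ones (illustrative only).
edges = {("hub", f"leaf{i}"): 1.0 for i in range(10)}
edges[("hub", "leaf0")] = 100.0

backbone = disparity_filter(edges, alpha=0.05)
print(backbone)
```

In this toy graph only the heavy edge survives, because it is the only one that carries a disproportionate share of the hub's strength.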
10 Cluster analysis
Similarity measures. Agglomerative and divisive clustering techniques. k-means clustering algorithm. Single linkage, average linkage and complete linkage. Comparison of outputs of different methods.
11 Statistical aspects of rich club and core-periphery organization
Rich-club ordering, normalisation for rich-club, weighted rich-club, Borgatti-Everett core-periphery organization, core-periphery and degree distribution.
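As a small illustration of rich-club ordering, the unnormalised rich-club coefficient phi(k) is the density of links among nodes with degree larger than k. The sketch below computes it in plain Python; the function name and the toy graph are assumptions for this example, and the normalisation against a randomised null model is not shown.

```python
def rich_club_coefficient(edges, k):
    """phi(k) = 2 * E_k / (N_k * (N_k - 1)) for nodes with degree > k.

    edges: list of (u, v) pairs of a simple undirected graph.
    Returns None when fewer than two nodes have degree > k.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    rich = {node for node, d in degree.items() if d > k}
    if len(rich) < 2:
        return None
    # Count the edges with both endpoints inside the rich set.
    e_rich = sum(1 for u, v in edges if u in rich and v in rich)
    return 2.0 * e_rich / (len(rich) * (len(rich) - 1))


# Toy graph: a 4-clique whose nodes each carry two leaves (illustrative).
clique = [(i, j) for i in range(4) for j in range(i + 1, 4)]
leaves = [(i, f"leaf{i}_{j}") for i in range(4) for j in range(2)]
edges = clique + leaves

print(rich_club_coefficient(edges, 0))  # all nodes
print(rich_club_coefficient(edges, 1))  # clique nodes only
```

A value close to 1 at large k means the high-degree nodes are densely connected to each other. Since high-degree nodes are more likely to be linked even at random, the raw value is usually compared with that of a degree-preserving randomised network.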
12 Final test
Textbooks and reading
- Latora, Nicosia, Russo. Complex Networks. Cambridge University Press, 2017
- Kolaczyk. Statistical Analysis of Network Data. Springer, 2010
- Easley, Kleinberg. Networks, Crowds, and Markets. Cambridge University Press, 2010
- A list of papers and online resources will be provided during classes
In short: don't do it. You may work with others to help guide problem solving, or consult Stack Overflow (or similar) to work out a solution, but copying from friends, previous students, or the Internet is strictly prohibited. NEVER blindly copy blocks of code; we can tell immediately.
If caught cheating, you will fail this course. Ask questions in recitation and at office hours. If you are stuck with a programming task and cannot get help, write code as far as you can and explain in the code comments where and why you are stuck.