The growing volume and the nature of big data sets in business, economics, and the social and political sciences call for more sophisticated mathematical and statistical tools. The complex systems recorded in large databases are often successfully described in terms of networks. In this course we will present the fundamentals of probability theory and statistics and apply these tools to characterize large empirical or model networks. The statistical validity of the observed results will be assessed and, when possible, quantitatively evaluated. Special attention will be dedicated to statistically validated network measures widely used in network analysis, such as backbones and structures at the mesoscale. Besides the mathematical theory, the course will take a practical approach, with homework and hands-on classes. All examples and sample code used in class will be provided in Python and Jupyter notebooks.

**Course Organization**

Lectures: 12 classes of 100 min. Around 80% of the classes will be theory only. The other 20% will include programming exercises or evaluation of data sets. Therefore, use of a computer will be required during some lectures. Students can form groups and use their own laptops. Instructions on the required software will be provided during the first class.

**Tentative classes**

1 Introduction

Overview of the course, prerequisites, related classes and grading. Basic concepts that will be treated in the course. Large data sets, the network science approach to large sets of data, the need for new statistical methods for big networked data, black swans and the statistical motivation for focusing on distribution tails.

2 Fundamentals of probability theory and statistics I

Sample space, random variables, conditional probability and independence, Bernoulli random variable, binomial distribution, Poisson distribution, probability density function, cumulative distribution function, joint and marginal probability, expected value and moments.

3 Fundamentals of probability theory and statistics II

Normal distribution, uniform distribution, exponential distribution, thin-tailed vs heavy- and fat-tailed distributions, moments of heavy-tailed distributions, power laws and scale invariance, Pareto and Zipf distributions, fat tails and risk estimate distortion, distributions and networks, law of large numbers and central limit theorem.

4 Introduction to statistical inference

Patterns and intuition, Occam’s razor, correlation vs causation, ceteris paribus, process of scientific understanding, statistical inference, parametric and non-parametric models, hypothesis testing and null models, p-value and p-hacking, z-score, regression, maximum likelihood estimation, especially in networks.

5 Statistical limits and data

Total information awareness. Multiple hypothesis test corrections. The Bonferroni correction. False discovery rate corrections. False positive, false negative, true positive and true negative outcomes.
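As a rough illustration of the two families of corrections listed above, the following sketch applies a Bonferroni threshold and the Benjamini-Hochberg step-up procedure to a list of p-values. The function names and the p-values are made up for illustration and are not part of the course material.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank
    # Reject every hypothesis whose rank is at most max_rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from eight independent tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bonferroni(p_values))          # only the first test is rejected
print(benjamini_hochberg(p_values))  # the first two tests are rejected
```

On this toy input Bonferroni is the more conservative of the two, which is the usual trade-off between controlling the family-wise error rate and controlling the false discovery rate.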

6 Causal inference and matched sample

Randomized experiment vs observational data, biases and confounding factors, strongly ignorable treatment assignment. Covariate closeness, Mahalanobis distance, propensity score. Matching methods, diagnosis of matches, estimation of causal effects. Case studies.

7 Distribution fitting and generation, degree-degree correlations

Fitting power laws and other distributions. Kolmogorov-Smirnov test. Assortative networks, joint degree distributions.

8 Statistical analysis with Python

Using scientific Python for numerical network analysis: network models, null models, layouts. Using scientific Python for statistics and plotting: scikit-learn, histograms, distributions, central limit theorem.

9 Network filtering and backbones

Similarity-based networks. Correlation networks. Partial correlation networks. Planar maximally filtered graphs. Disparity filter. Pólya-urn-based filter.
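To give a flavour of what a backbone filter does, here is a minimal sketch of the disparity filter for an undirected weighted network. It assumes the standard null model in which the strength of a node of degree k is split uniformly at random among its k edges, so that an edge carrying a fraction p of the strength has p-value (1 - p)^(k - 1). The function name, the toy edge list, and the choice to treat degree-1 endpoints as never significant are assumptions of this sketch.

```python
def disparity_backbone(edges, alpha=0.05):
    """Keep the edges that are significant for at least one endpoint.

    edges: dict mapping an undirected pair (u, v) to a positive weight.
    """
    strength = {}
    degree = {}
    for (u, v), w in edges.items():
        for node in (u, v):
            strength[node] = strength.get(node, 0.0) + w
            degree[node] = degree.get(node, 0) + 1

    def p_value(node, w):
        k = degree[node]
        if k < 2:
            # Assumption: the null model is uninformative for degree-1 nodes.
            return 1.0
        return (1.0 - w / strength[node]) ** (k - 1)

    return {
        (u, v): w
        for (u, v), w in edges.items()
        if min(p_value(u, w), p_value(v, w)) < alpha
    }


# Hypothetical toy network: one dominant edge and four weak ones.
edges = {
    ("hub", "a"): 100.0,
    ("hub", "b"): 1.0,
    ("hub", "c"): 1.0,
    ("hub", "d"): 1.0,
    ("a", "b"): 1.0,
}
backbone = disparity_backbone(edges)
print(backbone)  # {('hub', 'a'): 100.0}
```

Only the heavy edge survives: it concentrates almost all the strength of both of its endpoints, while the weak edges are compatible with the null model.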

10 Cluster analysis

Similarity measures. Agglomerative and divisive clustering techniques. k-means clustering algorithm. Single linkage, average linkage and complete linkage. Comparison of outputs of different methods.

11 Statistical aspects of rich club and core-periphery organization

Rich-club ordering, normalisation for rich-club, weighted rich-club, Borgatti-Everett core-periphery organization, core-periphery and degree distribution.
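For orientation, the unnormalised rich-club coefficient of an unweighted network can be sketched as below: phi(k) is the density of the subgraph induced by the nodes with degree larger than k. The function name and the toy graph are illustrative assumptions, and the sketch assumes a simple graph (no self-loops, no repeated edges). The normalisation step, which compares phi(k) with its value in a degree-preserving randomised network, is not included here.

```python
def rich_club_coefficient(edges, k):
    """phi(k) = 2 * E_{>k} / (N_{>k} * (N_{>k} - 1)) for a simple undirected graph."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Nodes whose degree is strictly larger than k form the "rich club".
    rich = {node for node, d in degree.items() if d > k}
    n_rich = len(rich)
    if n_rich < 2:
        return None  # undefined with fewer than two rich nodes
    # Count the edges with both endpoints inside the rich club.
    e_rich = sum(1 for u, v in edges if u in rich and v in rich)
    return 2 * e_rich / (n_rich * (n_rich - 1))


# Hypothetical toy graph: a triangle of hubs, each with one leaf attached.
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("A", "a1"), ("B", "b1"), ("C", "c1"),
]
print(rich_club_coefficient(edges, 0))  # 0.4
print(rich_club_coefficient(edges, 1))  # 1.0
```

The three hubs are fully connected among themselves, so phi(1) equals 1, while the density of the whole graph, phi(0), is much lower.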

12 Final test

**Textbooks and reading**

- Latora, Nicosia, Russo. Complex Networks. Cambridge University Press, 2017

- Kolaczyk. Statistical Analysis of Network Data. Springer, 2010

- Easley, Kleinberg. Networks, crowds, and markets, Cambridge University Press, 2010

- A list of papers and online resources will be provided during classes

**Cheating**

In short: don't do it. You may work with others to help guide problem solving or consult Stack Overflow (or similar) to work out a solution, but copying (from friends, previous students, or the Internet) is strictly prohibited. NEVER blindly copy blocks of code; we can tell immediately.

If caught cheating, you will fail this course. Ask questions in recitation and at office hours. If you are stuck with a programming task and cannot get help, write code as far as you can and explain in the code comments where and why you are stuck.

By successfully completing the course, students will be able to:

- Apply the fundamentals of data analysis and statistical methods to the investigation of networks.
- Evaluate the statistical reliability of empirical estimates against an appropriate null hypothesis.
- Use large data sets to investigate networks observed in the social sciences.
- Perform empirical analyses and statistical validation of large datasets obtained from the Internet or from other business and scientific sources.

**What you will NOT learn in this course**

Fundamentals of coding and data visualization. This course assumes that you already have basic proficiency with Python and can develop and apply your skills towards data analysis and statistical visualization. To learn the basics of coding (for-loops, lists, functions, reading and writing data from/to files, etc.), consider attending DNDS 6013 Scientific Python.

Similarly, basic concepts in network science are not covered in this course. To learn the basics of network science (degree distributions, clustering, centrality measures, small-world networks, scale-free networks, etc.), consider attending DNDS 6000 Fundamental Ideas in Network Science.

(1) Assessment type 1 (50% of the final grade). Attendance in at least 80% of classes, active cooperation, homework: Students will get home assignments consisting of statistical analyses, simple problems or data processing, which they will have to complete individually and submit electronically.

(2) Assessment type 2 (50% of the final grade). The final test will be lecture 12, consisting of questions related to the course that can be answered or solved by hand. The use of materials, including calculators or computers, is not permitted during the final test.

**Requirements for audit**

Attendance in at least 80% of classes, active cooperation, and completing the home assignments.

- Proven proficiency with Python
- Knowledge of fundamental network concepts
- Basic skills in statistics and linear algebra

**To satisfy the prerequisites**

Part of this course focuses on applying scientific programming with Python for research. We make no use of programs with a graphical user interface, such as spreadsheets. Since we need to pick one programming language for the course, we require students to prove proficiency with Python before the course starts, in one of the following ways:

a) Having taken for grade or audit the course DNDS 6013 Scientific Python.

b) Having taken a MOOC on programming with Python and showing the certificate.

c) Showing and discussing with the instructor a project you developed in Python. Projects from someone else (web, friend, previous students) are not considered.

Moreover, familiarity with the fundamentals of network science is required. We require students to prove knowledge of the basic concepts of network science by having taken in the past, or by taking this semester, the course DNDS 6000 Fundamental Ideas in Network Science, for grade or audit.

The instructor bears no responsibility if you do not satisfy the prerequisites and need to drop the course.