- Proven proficiency with Python – read the “To satisfy the prerequisites” section at the bottom
- Basic skills in statistics and linear algebra
The increasing volume and nature of big data sets in business, economics, social and political sciences call for more complex and sophisticated mathematical and data mining tools. The complex systems monitored by big data bases are successfully described in terms of networks. In this course we will present and discuss mathematical and data mining tools used to characterize large empirical or model networks. Large data sets will be computationally investigated and the limits of the used algorithms will be discussed. The assessment of the statistical validity of the observed results will be analyzed and, when possible, quantitatively evaluated. Besides the mathematical theory, the course will have a practical approach with homeworks, hands-on classes and with the development of a project. During the class all examples and sample codes will be provided in Python and Jupyter notebooks.
Lectures: 24 classes of 100 min. Around half of the classes will be theory only. The other half will include programming exercises or evaluation of data sets. Therefore, use of a computer will be required during some lectures. Students can form groups and use their own laptops. Instructions on the required software will be provided during the first class.
Overview of the course. Basic concepts that will be treated in the course. Large data sets. Data mining approaches. Network science approach to large sets of data. Computational tools.
2 Networks and the need for new statistics
Fundamental concepts and indicators of a network: degree, diameter, walk. Network organizing principles: sparsity, scale-free networks, small world networks, clustering. Statistical motivation for focusing on distribution tails.
3 Computational complexity, probability and statistics
Basic concepts of algorithm run-time analysis. Basic concepts of probability. Heavy-tailed distributions and generative processes. Central limit theorem.
4 Statistical inference
Hypothesis testing and null models, regression, maximum likelihood estimation, especially in networks.
5 Distribution fitting and generation, degree-degree correlations
Fitting power laws and other distributions. Kolmogorov Smirnov test. Assortative networks, joint degree distributions.
6 Python: Statistics and plotting
Using scientific Python for statistics and plotting: histograms, distributions, central limit theorem.
7 Python: NetworkX
Using scientific Python for numerical network analysis: network models, null models, layouts.
8 Fundamental concepts in data mining
Types of data. Different types of attributes. Asymmetric attributes. Dimensionality, sparsity and resolution. Data pre-processing. Measures of similarity and dissimilarity.
9 Classification in data mining
Classification and regression for predictive analysis. Classification model. Target function. Training set and test set. Decision trees. Sensitivity and specificity. Model overfitting.
10 Principal Component Analysis
Properties of the PCA. Selection of the most relevant principal components. PCA with covariance and correlation matrices. Selection of the principal components.
11 Cluster analysis
Similarity measures. Agglomerative and divisive clustering techniques. k-means clustering algorithm. Single linkage, average linkage and complete linkage. Comparison of outputs of different methods.
12 Proximity based networks
Similarity based networks. Correlation networks. Partial correlation networks. Planar maximally graph networks.
13 Midterm test
14 Sampling in networks
Basic elements of statistical sampling theory. Induced and incident subgraph sampling. Star and snowball sampling. Estimation of totals network graphs.
15 Estimating in networks
Mapping to the problems of species detection or population detection. The case of random and scale free graphs. The case of Internet. Estimating in subgraphs sampled by induced and incident subgraph sampling. Estimation of original network graphs.
16 Statistical paradoxes on networks
Friendship paradox, Simpon’s paradox, Braess’s paradox.
17 Motif detection and triad census
Motifs in a directed graph. Basic motifs. Efficient algorithms to find motifs. Optimality of the triangle-finding algorithm. Shuffling, methods in directed networks.
18 Signed networks and social balance
Signed networks, structural balance and triadic closure, shuffling methods, null models.
19 Statistical aspects of community detection algorithms
Overview of community detection algorithms. Stochastic and deterministic algorithms. Indicators for the comparison of different partitions. Landscape of the fitness of a partition.
20 Link prediction
Problem description and evaluation metrics. Local similarity indices. Global similarity indices. Applications of the problem of link prediction. Classification of partially labeled nodes.
21 Statistical limits in data mining
Total information awareness. Multiple hypothesis test corrections. The Bonferroni correction. The False Discovery Rate corrections. False positive, False Negative, True positive and True negative outcomes.
22 Network flows
Internet traffic analysis. Diagnosing volume anomalies. Traffic matrix estimation. Gravity and radiation model.
23 Applications of maximum matching
Shareability networks, network control theory, kidney exchange matching.
24 Project presentations
Textbooks and reading
- Latora, Nicosia, Russo. Complex Networks. Cambridge University Press, 2017
- Kolaczyk. Statistical Analysis of Network Data. Springer, 2010
- Easley, Kleinberg. Networks, crowds, and markets, Cambridge University Press, 2010
- Han, Kamber, Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2012
- Leskovec, Rajaraman, Ullman. Mining of Massive Datasets. Cambridge University Press
- A list of papers and online resources will be provided during classes
In short: don't do it. You may work with others to help guide problem solving or consult stack overflow (or similar) to work out a solution, but copying—from friends, previous students, or the Internet—is strictly prohibited. NEVER copy blindly blocks of code – we can tell immediately.
If caught cheating, you will fail this course. Ask questions in recitation and at office hours. If you are stuck with a programming task and cannot get help, write code as far as you can and explain in the code comments where and why you are stuck.
Further information, such as the course website, assessment deadlines, office hours, contact details etc. will be given during the course.
The instructor reserves the right to modify this syllabus as deemed necessary any time during the term. Any modifications to the syllabus will be discussed with students during a class period. Students are responsible for information given in class.