Digital Data Collection Methods: Opportunities and Risks
Course Description

This is a shared course organized by the Department of Network and Data Science - CEU Vienna and the Médialab - Sciences Po Paris as part of the CIVICA European University program. The course will be taught online by two instructors for students at both universities.

Short Bio of the Instructors

Márton Karsai, PhD, Habil., is an Associate Professor in the Department of Network and Data Science at the Central European University and director of the PhD Program in Network Science. He is a researcher at the Rényi Institute of Mathematics, a fellow of the ISI Foundation and a scientific advisor of the UNICEF Office of Innovation. He is a network scientist trained as a physicist, with research interests in human dynamics, computational social science, and data science, focusing especially on systems with heterogeneous dynamics, spatial and temporal networks, socioeconomic systems, and social and biological contagion phenomena.

Jean-Philippe Cointet works at the Sciences Po médialab, where he designs innovative computational sociology methods. He specializes in text analysis and works on various kinds of corpora and sources, questioning their socio-political dynamics. His research areas are diverse, ranging from social media analysis (Facebook public posts and comments) to science of science (data turn in oncology), political process mapping (political discourses, international negotiations) and frame analysis (press coverage of migration processes). He also participates in developing the CorText platform. He holds a PhD in Complex Systems and was trained as an engineer at Ecole Polytechnique. He is also affiliated with the INCITE research center at Columbia University.


Course description: 

The course invites students to collect, model and visualize data from social media platforms. Data built from the individual behaviors of users on Twitter, Facebook or YouTube play an increasing role in marketing, political targeting and even epidemic forecasting. In this joint course between CEU and Sciences Po, we teach students the basics of data science applied to social media platforms and invite them to imagine alternative uses of traditional AI-powered data-analysis algorithms.

In this class, participants will get their hands on data and code and test existing state-of-the-art data science methods on their own data to investigate a research question related to social and political dynamics at large: linguistic trends, social mobilizations, systematic discrimination, etc.

The pedagogical format is strongly oriented toward a workshop-style class. Typically, a short theoretical introduction will be given first. A discussion of the readings will follow before the class turns to lecturing mode, which will include more practical parts. One hands-on session (week 5), in which students will practice coding with data and algorithms, is planned. The scheduling of the course deviates slightly from the usual format so that it can respect the academic calendars of both universities.

The classes will follow the schedule below:

Session 1 [Jan 11 - M. Karsai] Introduction to Digital Data Collection Methods

In this introductory session we will see the pros and cons of the digital data revolution, and will walk through the many facets of digital data: spatial, temporal, social, behavioral, health, bank, mobility, urban, transportation, satellite, etc.


Session 2 [Jan 18 - M. Karsai] Tracking data collection

We will discuss ways of collecting personal and tracking data, from the individual to the societal level, including reality mining and social experiments. We will talk about how to collect precise information about individuals' behavior using LifeLog, RFID, digital health tracking, contact tracing, self-tracking or other methods.

Reading: Gonçalves B, Perra N, Vespignani A (2011) Modeling Users' Activity on Twitter Networks: Validation of Dunbar's Number. PLoS ONE 6(8): e22656.


Session 3 [Jan 25 - M. Karsai] Transactional data collection

This class will cover data collection from transactional logs such as call detail records, emails, bank transactions, shopping records, and other automatically collected individual or relational data.

Reading: Onnela JP, Saramäki J, Hyvönen J, Szabó G, Lazer D, Kaski K, Kertész J, Barabási AL. Structure and tie strengths in mobile communication networks. Proceedings of the national academy of sciences. 2007 May 1;104(18):7332-6.


Session 4 [Feb 1 - JP. Cointet] Data Science for social good, from promises to reality

In this introductory session, the outline and the philosophy of the class will be presented. We will discuss the opportunities and limitations of working with digital data and AI-powered algorithms.

Reading: Boyd D, Crawford K. Six provocations for big data. In a decade in internet time: Symposium on the dynamics of the internet and society 2011 Sep 21.


Session 5 [Feb 8 - M. Karsai] Tracking data collection

This session is dedicated to the extraction of text from online sources. Beyond technicalities, we will discuss the opportunities and limitations of working with online platforms and contrast them with scraping strategies.

Reading: McCormick TH, Lee H, Cesare N, Shojaie A, Spiro ES. Using Twitter for Demographic and Social Science Research: Tools for Data Collection and Processing. Sociological Methods & Research. 2017;46(3):390-421. doi:10.1177/0049124115605339


Session 6 [Feb 15 - JP. Cointet and M. Karsai] API data collection

We will explore how social media platform APIs can be leveraged to capture relational and demographic data from their users. In addition, we will run a hands-on session to help students acquire practical skills in data collection methods.

Reading: Gonçalves B, Perra N, Vespignani A (2011) Modeling Users' Activity on Twitter Networks: Validation of Dunbar's Number. PLoS ONE 6(8): e22656.


Session 7 [Feb 22 - M. Karsai] Monitoring data collection

This class will cover data collection methods that rely on remote sensing, transit monitoring and location data collection. It will also cover how to combine different data sources based on location and other information.

Reading: Blumenstock, J. E. (2016) Fighting poverty with data. Science 353, 753–754.


Session 8 [Feb 29]. Workshop on collective projects

Reading: no reading required


Session 9. [Mar 7 - JP. Cointet] Text analysis 1 - Words as data

In this session, traditional NLP tasks are introduced in order to perform lexicometric analysis.

Reading: Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, Google Books Team, Pickett JP, Hoiberg D, Clancy D, Norvig P, Orwant J. Quantitative analysis of culture using millions of digitized books. Science. 2011 Jan 14;331(6014):176-82.
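
As a rough illustration of what treating words as data can look like, the sketch below counts word frequencies in a tiny corpus using only the Python standard library. The helper name `word_frequencies`, the naive tokenizer and the two sample sentences are invented for this example and are not course material.

```python
import re
from collections import Counter


def word_frequencies(documents):
    """Count lowercase word tokens across a list of strings."""
    counts = Counter()
    for doc in documents:
        # Deliberately naive tokenizer: runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(tokens)
    return counts


# Hypothetical toy corpus, made up for this sketch.
corpus = [
    "Data about people is never neutral.",
    "People produce data; platforms collect data.",
]
freqs = word_frequencies(corpus)
print(freqs.most_common(2))
```

A real lexicometric analysis would typically add steps such as stop-word removal or lemmatization, which this sketch leaves out on purpose.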


Session 10. [Mar 14 - JP. Cointet] Text analysis 2 - Context as data

In this session, word embedding techniques will be introduced and illustrated on a medium-scale dataset.

Reading: Garg, Nikhil, et al. "Word embeddings quantify 100 years of gender and ethnic stereotypes." Proceedings of the National Academy of Sciences 115.16 (2018): E3635-E3644.


Session 11. [Mar 21] Invited lecture


Session 12. [Mar 28] Collective projects presentation


------------ END OF CEU TERM -------------


Session 13. [Apr 4 - JP. Cointet] Text analysis 3 - Document as data

In this session, we will learn how documents can be embedded and clustered to build topics.

Reading: no reading required


Session 14. [Apr 11 - JP. Cointet] Social Network Analysis (20th century)

We will explore the classical concepts, metrics and tools of social network analysis from a social science perspective.

Reading: Travers J, Milgram S. An Experimental Study of the Small World Problem. Sociometry. 1969;32(4):425-443. doi:10.2307/2786545
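
To make the idea of classical network metrics concrete, here is a minimal sketch that computes node degree and graph density from an edge list. It assumes an undirected graph without self-loops. The function name `degree_and_density` and the toy edge list are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def degree_and_density(edges):
    """Return (degree per node, density) for an undirected simple graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        # Store each tie in both directions; sets ignore duplicate edges.
        neighbors[u].add(v)
        neighbors[v].add(u)
    degrees = {node: len(adj) for node, adj in neighbors.items()}
    n = len(neighbors)
    m = sum(degrees.values()) // 2  # every edge is counted from both ends
    # Density: observed edges divided by the n * (n - 1) / 2 possible ones.
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return degrees, density


# Hypothetical friendship ties among four people.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
degrees, density = degree_and_density(edges)
print(degrees, round(density, 3))
```

In practice a dedicated graph library would usually be preferable; the point here is only to show how little is needed to define these two measures.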


Session 15. [Apr 18 - JP. Cointet] Social Network Analysis (21st century)

We will extend the concepts introduced in the previous class to the current digital era, where datasets are massive and heterogeneous. Among other things, we will explore how social network topology can be exploited to embed users in an ideological latent space.

Reading: Barberá P. Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political analysis. 2015;23(1):76-91.


Session 16. [Apr 25 - JP. Cointet] Using LLMs in computational social sciences

This last session introduces innovative research methods that use the capabilities of ChatGPT and other foundation models to advance social science research.


This is a joint class between CEU and Sciences Po. As such, classes will take place on Zoom and collective work will also need to be organized remotely.

Learning Outcomes

The objectives of the class are threefold:

(i) learn and practice large data collection, manipulation and critical text and network analysis tools,

(ii) create your own research plan to investigate some original dataset or question in a collective project,

(iii) develop reflexivity and gain critical distance with the data-science & AI literature (and promises) through a series of readings.


There will be two main assessments during the semester. The most important one (60% of the grade) is a collective project that will be presented during the last session of the semester. In this original project, students will be required to identify a research question, collect an online dataset and design an experimental plan to analyze it accordingly. The final deliverable will take the form of a notebook published on GitHub, summarizing the data workflow and discussing interpretations. An individual take-home paper will also be graded (30%). Participation during the class will also be evaluated (10% of the final grade).


  • In-class presence: 2 hours a week (except for two special sessions that will last 3 hours) / 20 hours a semester
  • Online learning activities: 20 minutes a week / 3 hours a semester
  • Reading and preparation for class: 45 minutes a week / 7 hours a semester
  • Research and preparation for group work: 2 hours a week / 18 hours a semester
  • Research and writing for individual assessment: 20 minutes a week / 3 hours a semester

Basic Python skills are required (building custom functions, importing and using libraries, manipulating lists and dictionaries); prior experience with pandas is appreciated. We also expect students to be curious about computer science, data and algorithms, and their social and political roles.
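
As a rough indication of the expected level, a student should be comfortable reading and writing something like the sketch below, which defines a custom function, imports from a library, and manipulates lists and dictionaries. The function name `mean_likes_per_user` and the sample records are invented for illustration and do not come from the course.

```python
from statistics import mean


def mean_likes_per_user(posts):
    """Group a list of post dictionaries by user and average their likes."""
    likes_by_user = {}
    for post in posts:
        # Create the user's list on first sight, then append the like count.
        likes_by_user.setdefault(post["user"], []).append(post["likes"])
    return {user: mean(likes) for user, likes in likes_by_user.items()}


# Hypothetical records, made up for this sketch.
posts = [
    {"user": "alice", "likes": 4},
    {"user": "bob", "likes": 1},
    {"user": "alice", "likes": 6},
]
averages = mean_likes_per_user(posts)
print(averages)
```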

Course Code: DNDS 6007