
The DSS Centre advances machine learning, big data, microsimulation and geoexperimentation to drive research and to advise policymakers and stakeholders.

About
Centre of Competence for Data Science and Simulation

The Centre of Competence for Data Science and Simulation (DSS) is a cross-departmental initiative. Our ambition is to develop infrastructure, tools and competences in machine learning, big data, microsimulation and geoexperimentation to conduct excellent research and advise policymakers as well as stakeholders.

ABOUT DSS

The main goal of this Centre is to shape research activities (topics, tools and teams) to anticipate the broad diffusion of AI in the economy and society through:

◉ expanding the use of, and knowledge about, AI tools in the LISER community
◉ including AI tools in project proposals
◉ developing methods to help increase efficiency (e.g. AI assistance for qualitative interviews)
◉ making use of Large Language Models (LLMs) for economic research, including aiding literature reviews, identifying research gaps and supporting policy work
◉ making use of OpenAI models for data queries (e.g. the team working on online job vacancies)
◉ networking and building bridges to other local institutions
◉ staying in contact with the EU agency responsible for AI
◉ offering expertise for the UPSKILL Project

EVENTS
A key component of rigorous and novel applied and academic research is a centre of excellence that advances the discipline.

For LISER, this means developing best practices and fostering a data-driven culture for executing reliable research studies. This is part of the DSS's remit, and the wide range of seminars offered in this context is tangible proof of it. Discover some of the topics covered over recent months below.


Artificial Intelligence and Machine Learning Applications


  • The impact of LLMs and GenAI on labor, our physiology and everything: preliminary results, thoughts, and speculations
    Chair: Nikos Askitas (IZA), 11 March 2025.
  • Validation of Machine Learning Estimated Conditional Average Treatment Effects
    Chair: Michael Knaus (University of Tübingen), 12 February 2025.
  • Causal AI for Behavior Learning from Trajectory Data
    Chair: Zhenliang Ma (KTH), 23 October 2024.
  • AI and Research: Exploiting the Possibilities of OpenAI models for Empirical Research
    Chair: Terry Gregory and Julio Garbers (LISER), 14 March 2024.
  • AI-driven analysis of big data in web analytics and media monitoring
    Chair: Joscha Krause (Cure Intelligence), 14 February 2024.
  • Innovations in Public Finance: Applying AI to Municipal Budget Analysis
    Chair: Sergio Galletta (ETH Zurich), 24 January 2024.
  • Web-based Indicators with webAI
    Chair: Jan Kinne (ZEW and ISTARI), 22 November 2023.
  • Text Mining / Natural Language Processing (NLP)
    Chair: Jongoh Kim, 26 January 2023.
  • Natural Language Processing: Topic modeling with Latent Dirichlet Allocation (LDA)
    Chair: Thiago Brant, 9 December 2022.

Data Science and Computational Tools

  • Patterns of Knowledge Worker Productivity: Insights from GitHub Activity Data
    Chair: Ingo Isphording (IZA), 15 January 2025.
  • Handling large data with Python and R
    Chair: Etienne Bacher (LM), 25 September 2024.
  • Unlocking Research Potential: MeluXina, LuxProvide's Supercomputer Infrastructure
    Chair: Francesco Bongiovanni & Alban Rousset (LuxProvide), 22 May 2024.
  • Papyrus – A Visual Analytics Tool for Scientific Literature Review
    Chair: Mohammad Ghoniem & Nicolas Médoc (LIST), 24 April 2024.
  • Streamlining Collaboration with GitHub: Simplifying Coding Projects for Researchers
    Chair: Thiago Brant, Etienne Bacher, 25 May 2023.
  • Some good practices for research with R
    Chair: Etienne Bacher, 16 March 2023.
  • An introduction to UL HPC supercomputers: performances, functionalities and accessibility
    Chair: Carlotta Montorsi, 1 February 2023.

Labor Economics, Public Preferences, Media, and Historical Data

  • The Value of Early-Career Skills
    Chair: Christina Langer (Stanford Digital Economy Lab), 11 December 2024.
  • Measuring Public Preferences and Exposure Effects Using Synthetic Visual Conjoins
    Chair: Simon Munzert (Hertie School), 6 November 2024.
  • impresso: Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio
    Chair: Marten During (University of Luxembourg), 7 December 2023.
  • Potentials and Pitfalls of Online Job Vacancy Data for Economic Research
    Chair: Eduard Storm (RWI-Essen), 25 October 2023.
  • Gender disparities in Science News
    Chair: Nikos Askitas (IZA Coordinator of Data and Technology), 14 December 2022.

Emerging Fields and Specialised Topics

  • The use of mobile phone data in economics with a focus on developing economies
    Chairs: Rana Cömertpay, Narcisse Cha'ngom, 20 September 2023.
  • Introduction to genoeconomics: nature, nurture, and everything in between
    Chair: Giorgia Menta, 28 June 2023.
  • An open discussion round on p-hacking in quantitative empirical research
    Chairs: Felix Stips and Michela Bia, 10 November 2022.
Infrastructure
One of the DSS's goals is to develop infrastructure that can be used across departments

The infrastructure helps build up skills in machine learning, big data methods, microsimulation and related areas, and develops a high-standard capacity to work with data for both research and policy work. As part of this, the DSS is actively developing its GitHub infrastructure.

For LISER, GitHub proves useful as a tool for facilitating collaboration among researchers and optimising workflows. It offers several advantages to researchers and policymakers: it serves as a repository for code, a library for research material, and an organisational tool for version control. It can also be used to centralise information (e.g. links) pointing to relevant data.

Furthermore, GitHub serves as a Learning Management System (LMS) and encourages both intra-institutional and inter-institutional collaboration, creating a community of shared knowledge and progress. The platform is free and easily accessible, with low entry barriers: in our experience, setting up an account takes just a few clicks. Importantly, putting a project on GitHub does not make it available to everyone unless one decides so.

Microsimulation (MSM) infrastructure
One of the DSS's goals is to develop infrastructure that can be used across departments


Spatial Data Infrastructure (SDI)
One of the DSS's goals is to develop infrastructure that can be used across departments

SDI is a strategic project to facilitate efficient geographic research data management and sharing within and outside LISER, while complying with well-established Open Data and Open Research principles, including FAIR (Findable, Accessible, Interoperable, Reusable) and TRUST (Transparency, Responsibility, User focus, Sustainability, Technology). The goal is to facilitate the discovery, access, management, distribution, reuse and preservation of digital geospatial resources, while centralising the management of spatial data and information related to multilateral projects for optimal sharing and exchange between researchers.

PROJECTS | SkiLMeeT
SkiLMeeT studies the determinants of skill mismatches and shortages in European labour markets, with a particular focus on the role of the digital and green transitions. SkiLMeeT relies, among other sources, on big data to generate new measures, using NLP and ML techniques on online job vacancies, online platforms such as LinkedIn, training curricula and patents.

Patent data on advances in digital and green technologies

This task generates proxies for advances in digital and green technologies using NLP methods and big data. The data come from the European Patent Office (EPO), the USPTO patent website following Kelly et al. (2021), and Google Patents. We transform the documents into text data, split the text into the relevant sections of the patents, and make sure that all patent texts are machine readable. We next compile a list of keywords describing the patents linked to digital and green technologies and use semantic matching to identify these technologies; for advanced digital technologies, for example, we draw on Alekseeva et al. (2021). To match the patent information to the industries that potentially make use of digital and green technologies, we use probabilistic matching procedures.

We differentiate between automation and augmenting innovations and classify breakthrough patents, which serve as an instrumental variable to identify causal effects of new technologies.
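For illustration, the keyword-based semantic matching described above could be sketched as follows: patent texts and curated keyword lists are embedded in a shared vector space and compared by cosine similarity. The keywords and library choices below are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal sketch: flag patents as digital or green via semantic similarity between
# patent text and keyword lists (keywords here are illustrative placeholders).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

digital_keywords = "machine learning neural network cloud computing robotics automation"
green_keywords = "solar photovoltaic wind turbine carbon capture electric vehicle"

patent_texts = [
    "A method for training a neural network on distributed cloud infrastructure.",
    "An improved blade geometry for offshore wind turbines.",
]

# Embed the keyword lists and the patent texts in the same TF-IDF space.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([digital_keywords, green_keywords] + patent_texts)
keyword_vecs, patent_vecs = matrix[:2], matrix[2:]

# Rows: patents, columns: (digital, green) similarity scores.
scores = cosine_similarity(patent_vecs, keyword_vecs)
for text, (dig, grn) in zip(patent_texts, scores):
    label = "digital" if dig > grn else "green"
    print(f"{label:>7}  digital={dig:.2f}  green={grn:.2f}  | {text[:55]}")
```

In the actual project, richer embeddings and the published keyword lists would replace the toy inputs above.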

Preparing big data from online platforms to analyse the supply and demand of skills in different EU countries

This task makes use of big data from online platforms such as LinkedIn, PayScale, Glassdoor and Indeed, as well as online job vacancy (OJV) data, to analyse the supply and demand of skills (using the keywords of Task 2.1) across European countries.

LinkedIn has over 100 million users in the EU and is therefore considered a particularly reliable data source for the analyses in this project. The platform provides data on both skills supply and skills demand: supply can be derived from LinkedIn user profiles, while demand can be extracted from job vacancies posted on LinkedIn. Using job histories, we can also extract information on job transitions in different countries.

Exploiting the time dimension, we can track skills supply, skills demand and job mobility for EU countries and for different sectors over time. We will supplement this information with data from PayScale, Glassdoor, Indeed and OJVs to obtain information on skills and salaries at the occupation and company level in EU countries. Various data analysis techniques are used to extract and combine data from these diverse sources; we use statistical models and NLP for the analysis.
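As a simple illustration of how such sources can be combined over time, the sketch below aggregates skill mentions from vacancy texts into country-year demand shares with pandas. Column names, skills and data are invented for the example and do not reflect the project's actual schema.

```python
# Minimal sketch: tracking skill demand over time from job-vacancy records.
import pandas as pd

vacancies = pd.DataFrame({
    "country": ["LU", "LU", "DE", "DE"],
    "year":    [2022, 2023, 2022, 2023],
    "text":    [
        "Data analyst with Python and SQL skills",
        "Machine learning engineer, Python, cloud",
        "Accountant with SAP experience",
        "Data engineer: SQL, Python, Spark",
    ],
})

# Placeholder for the keyword list developed in Task 2.1.
skills = ["python", "sql", "machine learning"]

# One indicator column per skill, then aggregate to country-year demand shares.
for skill in skills:
    vacancies[skill] = vacancies["text"].str.lower().str.contains(skill).astype(int)

demand = vacancies.groupby(["country", "year"])[skills].mean()
print(demand)
```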

PROJECTS | AIJOB
The PI, David Marguerit, relies on data science tools in various aspects of his PhD research, enabling him to perform tasks faster and achieve superior results compared to alternative methods.

An illustration of this enhanced efficiency is the project Artificial intelligence, automation and job content: the implication for wages, job mobility and training, in which the PI assesses the effect of Artificial Intelligence (AI) exposure at work on wages and employment. For this project, he relies on generative AI to meticulously identify keywords related to AI within a substantial list of 12,000 concepts frequently employed by developers; research has shown that generative AI performs well in classifying objects. Furthermore, he uses sentence similarity to compare texts with one another in terms of meaning. Sentence similarity, a prominent component of Natural Language Processing, provides a score measuring the extent to which two texts have similar meanings. In total, 61,000 pairs of texts are compared.
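A minimal sketch of such a sentence-similarity score, using a pretrained sentence-embedding model, is shown below. The specific model and the example text pairs are assumptions for illustration; the project's own setup may differ.

```python
# Minimal sketch: scoring how similar two texts are in meaning.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

pairs = [
    ("image recognition with convolutional networks",
     "classifying pictures using deep learning"),
    ("image recognition with convolutional networks",
     "double-entry bookkeeping procedures"),
]

for left, right in pairs:
    emb = model.encode([left, right])
    score = util.cos_sim(emb[0], emb[1]).item()  # cosine similarity, higher = closer meaning
    print(f"{score:.2f}  {left!r} vs {right!r}")
```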

David’s project with Andrea Albanese on cross-border workers from Belgium to Luxembourg is another example of how data science techniques can be used. In this project, they assess the effect of becoming a cross-border worker on wages after workers return to work in their country of residence (i.e. Belgium). First, they use machine learning algorithms, namely Random Forest, to identify the most influential predictors of becoming a cross-border worker. Subsequently, they employ Double Machine Learning to assess the effect of becoming a cross-border worker on wages.
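The two-step logic can be sketched as follows: a Random Forest ranks the predictors of becoming a cross-border worker, and a simple cross-fitted partialling-out estimator (one common flavour of double machine learning) recovers the wage effect. Data, variable names and model choices are simulated placeholders, not the study's actual implementation.

```python
# Minimal sketch: (1) predictor importance, (2) partialling-out double ML estimate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                          # worker characteristics
d = (X[:, 0] + rng.normal(size=n) > 0).astype(int)   # treatment: cross-border spell
y = 0.3 * d + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # log wage after return

# Step 1: which characteristics best predict becoming a cross-border worker?
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, d)
print("feature importances:", rf.feature_importances_.round(2))

# Step 2: cross-fitted residuals of outcome and treatment, then regress residual on residual.
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, y, cv=5)
d_hat = cross_val_predict(RandomForestClassifier(n_estimators=200, random_state=0), X, d,
                          cv=5, method="predict_proba")[:, 1]
y_res, d_res = y - y_hat, d - d_hat
theta = (d_res @ y_res) / (d_res @ d_res)
print(f"estimated wage effect: {theta:.3f}")
```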

PROJECTS | ROBOTS
Opportunistic collaboration in a network of robot agents with heterogeneous sensory capabilities

Problem description

This study focuses on the collaborative possibilities that the IoT offers to modern agents (including robots). Agents have no a priori information about the sensory capabilities of other existing nodes. An agent with a specific task could gain time, performance and precision by using the resources and capabilities present in the network. There are many challenges to achieving this opportunistic collaboration. These include defining a universal ontology for communicating available computational and algorithmic resources and sensory capabilities in a comprehensible language. For example, a security camera can be described as having a processor, a hard disk, algorithms for person detection and facial recognition, and a given image resolution. There is also the challenge of representing the world and its objects in order to carry out tasks: for example, how a robot looking for an object can describe it, and how a security camera with a wide field of view can help guide the robot.

We considered a specific case with two agents

The setup consists of a fixed camera mounted at height and a mobile robot that has to carry out a task. The robot has to search for an object that lies in the camera's wide field of view, and this study seeks to explore and demonstrate the possibilities of the IoT for opportunistic collaboration between two heterogeneous agents with no prior knowledge of each other's capabilities. This case study was implemented in the ROS (Robot Operating System) environment and produced promising results. The next step is to use real robots (such as Turtlebot, Cherokey, Xgo and Spotdog) to quantify environmental parameters, for example.
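As a rough illustration of what such a ROS implementation involves, the sketch below shows a rospy node in which the camera agent publishes an estimated object position that the robot agent can subscribe to. The topic name, frame and message type are assumptions for illustration; in the actual system the two agents run as separate nodes and first negotiate their capabilities.

```python
#!/usr/bin/env python
# Minimal ROS 1 (rospy) sketch: the camera agent shares a detected object position.
import rospy
from geometry_msgs.msg import PointStamped


def robot_callback(msg):
    # On the robot agent's side, the received position can seed the search behaviour.
    rospy.loginfo("object seen at x=%.2f y=%.2f (frame %s)",
                  msg.point.x, msg.point.y, msg.header.frame_id)


def main():
    rospy.init_node("camera_agent")
    pub = rospy.Publisher("/shared/object_position", PointStamped, queue_size=10)
    # In the real setup this subscriber lives in the robot's own node.
    rospy.Subscriber("/shared/object_position", PointStamped, robot_callback)

    rate = rospy.Rate(1)  # publish at 1 Hz
    while not rospy.is_shutdown():
        msg = PointStamped()
        msg.header.stamp = rospy.Time.now()
        msg.header.frame_id = "camera_frame"
        msg.point.x, msg.point.y = 2.0, 1.5  # placeholder detection in camera coordinates
        pub.publish(msg)
        rate.sleep()


if __name__ == "__main__":
    main()
```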

Team

Hichem Omrani (PI, research scientist, UDM dept, LISER - Luxembourg), Judy Akl (PhD student, University of Lorraine, France), Fahed Abdallah (Full Prof at the University of Lorraine, France), Amadou Gning (Robotics expert, Createc-UK)

SEED GRANT TESTIMONIALS

Thiago Brant | Generative AI for education

The seed grant we secured from the DSS for the integration of a vector database has been instrumental in moving our research forward on the use of Generative AI for education. This financial support enabled us to leverage state-of-the-art AI methodologies, notably within the sphere of Large Language Models. These tools have transformed how we store, retrieve, and dissect extensive data sets, enhancing both efficiency and semantic depth. The incorporation of the vector database has not only optimized our project's responsiveness and accuracy but has also facilitated richer, more nuanced analysis. This innovative approach to data management empowers our team to tackle intricate research challenges, surpassing previous limitations found in the realm of Natural Language Processing.

DSS's mission is to encourage the development of knowledge in data science across diverse fields. In this light, our endeavor at LISER to craft bespoke AI solutions that bolster research resonates perfectly with DSS's focus. As our project continues to evolve, this grant will facilitate its expansion, which in turn will allow LISER to build up in-house know-how.

SEED GRANT TESTIMONIALS

Terry Gregory | AI and Research: Exploiting the Possibilities of OpenAI Models for Empirical Research

The project's objective is to leverage OpenAI's large language models, such as GPT, to develop innovative methodologies for analyzing and processing large-scale data in the context of the Centre of Competence for Data Science and Simulation (DSS) and the Luxembourg Institute of Socio-Economic Research (LISER). The project focuses on three key applications:

◉ Word Processing and Classification in Survey Designs: The first application aims to process open-ended responses in surveys, which produce unstructured data. By using GPT, the project seeks to standardize descriptions of technologies mentioned in survey responses and categorize them efficiently (see the sketch after this list). This can enable better analysis of survey data and provide valuable insights, especially when dealing with large volumes of responses.

◉ Text Recognition and Extraction of Historical Data: The second application explores the use of GPT for text recognition, including both printed and handwritten content. This involves converting historical documents and records into organized datasets, reducing the need for manual transcription and coding. This approach can be applied to various historical sources, making valuable information more accessible for research purposes.

◉ Text Analysis and Data Extraction: The third application involves automating data extraction from textual sources, such as online seller profiles or customer reviews. GPT is employed to answer specific questions about the text, facilitating the creation of datasets for various research inquiries. This automated process accelerates data analysis, improves replicability, and offers flexibility in tailoring questions to different texts and languages.
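For the first application above, a minimal sketch of how open-ended survey answers could be categorised with an OpenAI model is shown below. The model name, prompt and category list are illustrative assumptions, not the project's actual configuration.

```python
# Minimal sketch: standardising and categorising technologies named in survey answers.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CATEGORIES = ["industrial robot", "office software", "AI / machine learning", "other"]

def classify_response(answer: str) -> str:
    prompt = (
        "Assign the technology mentioned in this survey answer to exactly one of "
        f"the following categories: {', '.join(CATEGORIES)}.\n"
        f"Answer: {answer!r}\n"
        "Return only the category."
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",            # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()

print(classify_response("We started using a welding robot on the assembly line."))
```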

Team: Salvatore Di Salvo (LISER), Julio Garbers (LISER), and Terry Gregory (LISER and ZEW)

SEED GRANT TESTIMONIALS

Eva Sierminska | Specialization in Economics: JEL codes and beyond

Economics, as a discipline, encompasses a broad spectrum of specialization areas, and there is increasing evidence of disparities among underrepresented groups in these fields. Our research addresses questions related to field specialization among early-career economists and its correlation with researchers' gender, country of origin, ethnicity, and other characteristics, as well as its impact on academic outcomes like tenure and promotion. In this emerging but expanding research area, the conventional means of identifying fields have relied on Journal of Economic Literature (JEL) classification codes or self-attribution by researchers. Our work takes advantage of text analysis methods to delve deeper into these complexities.

The DSS grant facilitated the application of text analysis, specifically Latent Dirichlet Allocation (LDA), an unsupervised machine learning method, for topic modeling among dissertation abstracts in Economics. We collected data by scraping approximately 30,000 abstracts from ProQuest and merging them with other sources such as EconLit and Academic Analytics. LDA, much like factor analysis for text, uncovers latent 'topics' by recognizing word co-occurrences in documents. It generates a probabilistic word distribution for each topic and a distribution of topics for each abstract, thereby enabling systematic thematic analysis. This approach offers greater stability in the identification of topics and specializations over time, allowing us to identify primary research fields more accurately. The grant provided dedicated time for us to explore the implementation of LDA and the relevant literature, thereby enhancing the robustness of our field specialization identification, a pivotal aspect of our research.
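For illustration, the kind of LDA pipeline described above can be sketched with scikit-learn as follows; the toy abstracts and the number of topics are placeholders, not the study's actual corpus or settings.

```python
# Minimal sketch: LDA topic modeling on dissertation abstracts.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "We estimate the effect of minimum wages on youth employment using panel data.",
    "A dynamic stochastic model of monetary policy under sticky prices.",
    "A field experiment on microcredit take-up among rural households.",
]

# Bag-of-words representation of each abstract.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)

# Each topic is a distribution over words; each abstract gets a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
print("abstract-topic shares:\n", doc_topics.round(2))
```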

Committee
Christina Gathmann
Department Director
Eugenio Peluso
Department Director
Olivier Klein
Research Scientist
Coordinators
Anna Dober
Office Manager & Research Data Steward
Terry Gregory
Research Associate