Materials Science with Advanced Data Management and Data Science Techniques

We invite you to our NHR4CES Community Workshop 2025! This year’s topic is Materials Science with Advanced Data Management and Data Science Techniques. The online workshop is organized by SDL Materials Design, CSG Data Management, and CSG Data Science and Machine Learning. It will take place on May 7, 1 pm to 5 pm, and May 8, 9 am to 1 pm.

With increasing computational power, an abundance of data has become available in many research fields. To handle this ever-growing amount of data, tools such as data management platforms, machine learning, and workflow managers are receiving increasing attention. In materials science, these approaches make it possible to investigate a wide range of materials and their properties systematically. Workflow managers are used in high-throughput calculations, which generate huge amounts of data. However, high-throughput calculations are limited by structure size: structures with more than 10^3 atoms are constrained by the available computational resources. Researchers therefore use the generated data and/or intelligent data mining as input to machine learning techniques to investigate material properties beyond the limitations imposed by computational resources.

Additionally, through the collection of rich metadata in all steps of the workflow, results can be easily reproduced and reused by other researchers. This workshop will cover major parts of the data life cycle, from the generation of data via automated workflows, through their collection using data management solutions, to their processing and/or reuse via machine learning, all while following the FAIR data principles.

Registration

All about our speakers and their talks

Jan Janssen

is the group leader for materials informatics in the computational materials design department of Prof. Neugebauer at the Max Planck Institute for Sustainable Materials. In his work, he aims to combine simulation and experiment in the same machine-learning-enabled workflows to accelerate the discovery of novel sustainable materials. On a technical level, Jan’s interests range from training large language models to build workflows to benchmarking these workflows on the world’s largest supercomputers. He is a maintainer of over 900 materials-informatics-related software packages for the conda-forge community and a frequent contributor to open-source projects on GitHub.

Jan Janssen’s talk: Pyiron – Workflows for data-driven Materials Science

Given the hierarchical nature of materials, simulation workflows in Materials Science commonly combine simulation codes and utilities developed in different communities to address different time and length scales. The simulation codes and utilities vary in the physical units they use, the variable names and file formats. This hinders the coupling of simulation codes and utilities as well as the development of workflows in general. To address this challenge the pyiron workflow framework was developed, introducing a generic format to couple simulation codes and utilities.

The presentation covers the application of the pyiron workflow manager to calculate the temperature-concentration phase diagram starting from just the interaction of the electrons. This includes workflows for fitting machine-learned interatomic potentials and highlights how data-streaming workflows enable upscaling to the Frontier exascale computer at Oak Ridge National Laboratory. The talk concludes with an outlook on recent developments in the pyiron workflow framework, including the integration of large language models and visual programming interfaces.
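The coupling problem described in this abstract (different units, variable names, and file formats per simulation code) can be illustrated with a minimal sketch. It is not pyiron's API. The `Result` record, the `normalize` helper, and the two example outputs are hypothetical names chosen for illustration, and the choice of eV and angstrom as common units is an assumption.

```python
from dataclasses import dataclass

# Conversion factors into the common units assumed for this sketch.
ENERGY_TO_EV = {"eV": 1.0, "Ry": 13.605693122994, "Ha": 27.211386245988}
LENGTH_TO_ANGSTROM = {"angstrom": 1.0, "bohr": 0.529177210903, "nm": 10.0}


@dataclass(frozen=True)
class Result:
    """Code-independent record (hypothetical): energy in eV, length in angstrom."""
    energy_ev: float
    cell_length_angstrom: float


def normalize(raw: dict, energy_key: str, energy_unit: str,
              length_key: str, length_unit: str) -> Result:
    """Map one code's variable names and units onto the common record."""
    return Result(
        energy_ev=raw[energy_key] * ENERGY_TO_EV[energy_unit],
        cell_length_angstrom=raw[length_key] * LENGTH_TO_ANGSTROM[length_unit],
    )


# Two made-up outputs describing the same state with different names and units.
code_a = normalize({"etot": -1.0, "alat": 2.0}, "etot", "Ha", "alat", "bohr")
code_b = normalize({"E": -27.211386245988, "a": 1.058354421806},
                   "E", "eV", "a", "angstrom")

# After normalization, downstream workflow steps can treat both alike.
print(code_a)
print(code_b)
```

Once every code's output is mapped onto one record type, later workflow steps no longer need to know which code produced the data.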

Michael Selzer

is a computer scientist and research data specialist based at the Karlsruhe Institute of Technology (KIT). After earning his Diplom and Master’s degrees in Computer Science from Hochschule Karlsruhe, he went on to complete his PhD in Mechanical Engineering at KIT in 2014. His early work focused on numerical simulations using high-performance computing, but over time, his interests evolved toward research data management and infrastructure. Today, he leads the development of Kadi4Mat, an open-source platform that supports the management, sharing, and automation of research data across experimental and simulation-based workflows.

Michael Selzer’s talk: Combining high-performance computing, workflows and machine learning to get new insights in material sciences

The increasing availability of computational power has transformed materials science into a data-intensive discipline. Efficient data management, automated workflows, and machine learning are essential for handling vast amounts of simulation and experimental data, ensuring systematic exploration of material properties, and enhancing research reproducibility. A key challenge in computational materials science is optimizing model parameters, particularly when the available literature data is incomplete or inconsistent, as is the case for solid oxide fuel cells (SOFCs). For the sustainable operation of SOFCs, understanding component degradation processes is critical for ensuring long-term performance. Microstructural simulations provide valuable insights into these processes, but accurately parameterizing the models remains complex.

To address this, an active learning framework combined with Bayesian optimization is employed to navigate the complex, high-dimensional parameter space effectively, identifying optimal model parameters for both single- and multi-objective optimization tasks while ensuring efficient resource utilization. Furthermore, automated communication with an HPC cluster has been integrated into our data stream to ensure efficient processing of resource-intensive tasks. To streamline the process while adhering to the FAIR data principles, the entire data process is managed within the Kadi4Mat ecosystem. The study has been carried out using automated workflows, which ensure seamless data storage, metadata management, and reproducibility while enhancing collaborative research.

Chiheb Ben Mahmoud

is a Swiss National Science Foundation postdoctoral fellow in the group of Prof. Volker Deringer at the University of Oxford. He earned his PhD in Materials Science and Engineering in 2023 from the Swiss Federal Institute of Technology (EPFL). His research focuses on leveraging machine learning techniques to model the macroscopic properties of materials.

Chiheb Ben Mahmoud’s talk: Data-driven approaches to atomistic modeling beyond interatomic potentials

Machine learning (ML) has become essential for atomistic modeling. In this talk, I discuss perspectives for modeling electronic, structural, and spectroscopic fingerprints of materials and molecules using ML-driven methods. First, I introduce an ML approach to learn and predict the electronic density of states, demonstrating its utility in simulating materials under extreme temperature and pressure conditions. Next, I explore how graph neural networks can be used to predict solid-state NMR parameters, effectively bridging theoretical predictions with experimental measurements.

Finally, I discuss the transferability of ML-based interatomic potentials—trained on layered materials as a middle ground between bulk materials and molecules—and their accuracy in predicting both static and reactive properties across various chemical systems. These approaches pave the way for more efficient and predictive materials modeling.

Leonid Gerdt

is a postdoctoral researcher at the Fraunhofer Institute in Dresden, where he also serves as a project assistant in the Task Area Community Interaction within NFDI MatWerk. His expertise lies in surface engineering, coating technologies, and material characterization.

Ebrahim Norouzi

is a junior researcher at FIZ Karlsruhe with Prof. Dr. Harald Sack. He holds an MSc in Materials Science from Ruhr University Bochum and has experience in data science, Python programming, and material characterization. His research focuses on machine learning, the Semantic Web, and modelling in materials science.

Leonid Gerdt and Ebrahim Norouzi will present the project NFDI MatWerk

Marcel Nellesen

holds a Bachelor’s degree in Scientific Programming from FH Aachen, Germany, and a Master’s degree in Computer Science from RWTH Aachen University, Germany. Since 2019, he has worked in the Research Process & Data Management department of the IT Center of RWTH Aachen University as a scientific employee with a focus on research data management. He worked on the Collaborative Scientific Integration Environment (Coscine), a research data management platform developed at RWTH Aachen University. Currently, he is developing JARDS (Joint Application, Review, and Dispatch Service), a platform for the creation and scientific review of applications for computation time in NHR. In 2021, Marcel joined the Cross-Sectional Group Data Management at the National High Performance Computing Center for Computational Engineering Science (NHR4CES). Since 2022, Marcel has led the group Research Data Processes for High Performance Computing Systems at RWTH Aachen University.

Marcel Nellesen’s talk: Putting data into context: metadata schemata and automated extraction of metadata

Effective research data management is inherently dependent on robust metadata management. Metadata contextualizes data and provides valuable information that enhances its usability. The more precisely data is described, the easier it becomes to reuse; this highlights the necessity for domain-specific metadata profiles. However, managing large volumes of data and extensive schemas necessitates the implementation of automation methods.

In this talk, we will provide a brief introduction to research data management, showcase the benefits of metadata, explain how to create metadata profiles, and demonstrate how to use extractors to automatically collect metadata from research data.
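As a rough illustration of what automated metadata extraction can look like, the following sketch collects basic file-level metadata with the Python standard library and checks it against a small list of required fields. It is not the extractor or profile format presented in the talk. The function names, the field names, and the example "profile" are assumptions made for this example.

```python
import hashlib
import mimetypes
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def extract_metadata(path: Path) -> dict:
    """Collect basic, file-level metadata without any manual input."""
    stat = path.stat()
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "file_name": path.name,
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "mime_type": mime_type or "application/octet-stream",
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }


def missing_fields(metadata: dict, required: list[str]) -> list[str]:
    """Return the required fields (hypothetical profile) absent from the record."""
    return [field for field in required if field not in metadata]


# Demonstration on a throwaway file with made-up content.
content = b"temperature_K,energy_eV\n300,-3.74\n"
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "measurement.csv"
    sample.write_bytes(content)
    record = extract_metadata(sample)

# Fields such as "creator" cannot be derived from the file itself,
# so they show up as gaps that a person or another extractor must fill.
gaps = missing_fields(record, ["file_name", "sha256", "creator"])
print(record)
print(gaps)
```

The split between fields that can be derived automatically and fields that remain open is where domain-specific profiles and more specialized extractors would come in.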

Jonas Seng

is a PhD candidate at TU Darmstadt and joined NHR4CES in the CSG Data Science & Machine Learning group in 2021. He holds a Bachelor’s degree in Computer Science from DHBW Mannheim and a Master’s degree from TU Darmstadt. His research focuses on AutoML, including interactive hyperparameter optimization, neural architecture search, and federated AutoML. He is also interested in causal modeling to improve the robustness of machine learning systems.

Jonas Seng’s talk: AutoML Meets Materials Science: Automating Insights and Innovation

Machine Learning (ML) is transforming industries, from text generation to predictive maintenance. In materials science, ML is revolutionizing research by making it more efficient and data-driven. However, developing high-performing ML models demands expertise and effort, limiting its accessibility.

AutoML—automated machine learning—offers a solution by streamlining model design and training, unlocking ML’s full potential. This talk will introduce AutoML and explore the synergies between AutoML and materials science, showcasing how they can accelerate discoveries and innovation.

Jan Micha Bodensohn

is a doctoral student supervised by Prof. Carsten Binnig at the Data and AI Systems Lab of the Technical University of Darmstadt and works as a researcher for the German Research Center for Artificial Intelligence (DFKI). His research centers on the automation of data engineering tasks with foundation models. As part of his work, he evaluates how Large Language Models (LLMs) can solve classical data engineering tasks on real-world tabular data such as enterprise databases.

Micha joined the Data and AI Systems Lab as a doctoral student in 2023 after completing his Bachelor’s and Master’s degrees in Computer Science at TU Darmstadt. From 2020 to 2022, he worked as a student research assistant at the Data Management Lab at TU Darmstadt, focusing on data exploration and information extraction from text.

Liane Vogel

is a doctoral student supervised by Prof. Carsten Binnig at the Data and AI Systems Lab of the TU Darmstadt. Her research focuses on building foundation models for tabular data and relational databases to automate data engineering tasks. She thereby aims to build models that can take the full relational structure into account and find suitable representations for multi-table data.

She started at the Data and AI Systems Lab in 2021 after obtaining a Bachelor’s and Master’s degree in Computer Science from TU Darmstadt. From 2018 to 2020 she worked as a student research assistant at the UKP Lab at TU Darmstadt in the area of argument search and crowdsourced annotation studies.

Jan Micha Bodensohn’s and Liane Vogel’s talk: Wrangling Tabular Data with LLMs: What’s Possible and What’s Not

Data-driven tasks like empirical analysis and machine learning often come with substantial data preparation overheads. This is especially true for tabular data, which plays a central role in many domains like science and business. The required tables must first be found, integrated, cleaned, and formatted for each specific use case.

Recent research on Large Language Models (LLMs) for data engineering shows that LLMs can solve many of these tasks out-of-the-box without requiring expensive, specialized solutions. Yet these improvements on scientific benchmarks often fail to translate into practice. In our talk, we demystify what LLMs can and cannot do when it comes to wrangling tabular data.

Contact persons

Marcel Nellesen

RWTH Aachen University

Jonas Seng

TU Darmstadt

Dr. Ganesh Kumar Nayak

RWTH Aachen University

Vasilios Karanikolas

TU Darmstadt

17. February 2025
