23 September 2024
CSG Data Management
and RWTH Aachen University
Cross-Sectional Group
The CSG Data Management develops automated tools to reduce the manual overhead of data engineering tasks such as metadata annotation, data cleaning, and data transformation. Furthermore, we aim to integrate research data management (RDM) more closely with HPC so that the systems work together and unnecessary data copies are avoided.
We pursue these goals with the following methods:
- AI to automate Data Engineering Tasks (e.g., Automated Data Cleaning)
- Lineage over complete data engineering workflows (i.e., track which data came from where)
- System integration (e.g., integrate different storage systems)
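To make the lineage idea concrete, the following is a minimal sketch of tracking which data came from where across workflow steps. It is not one of the group's tools: the `LineageGraph` class, its methods, and the file names are hypothetical and chosen purely for illustration, and the sketch assumes the workflow contains no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Hypothetical helper: records which inputs and which step produced each dataset."""
    parents: dict = field(default_factory=dict)  # dataset name -> list of input names
    steps: dict = field(default_factory=dict)    # dataset name -> step description

    def record(self, output, inputs, step):
        """Register that `step` produced `output` from `inputs`."""
        self.parents[output] = list(inputs)
        self.steps[output] = step

    def sources(self, dataset):
        """Return the original (underived) datasets that `dataset` depends on."""
        inputs = self.parents.get(dataset)
        if not inputs:  # never recorded as an output, so treat it as a raw source
            return {dataset}
        found = set()
        for name in inputs:
            found |= self.sources(name)  # assumes an acyclic workflow
        return found


# Illustrative file names only
graph = LineageGraph()
graph.record("cleaned.csv", ["raw_sensor.csv"], "data cleaning")
graph.record("joined.csv", ["cleaned.csv", "sample_metadata.json"], "join with metadata")
print(sorted(graph.sources("joined.csv")))
```

Recording lineage at the level of whole datasets keeps the bookkeeping small; finer-grained tracking (per column or per row) would need a richer structure than this sketch provides.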
We work on new tools and provide them as open-source code that can support researchers in their tasks.
Consulting on the use of these tools is provided through workshops (e.g., hackathons). Additionally, we created a web platform to increase data literacy, where researchers can find links to other useful tools and to educational resources such as online courses: https://data-ai-literacy.ml/.
We also provide a bridge from NHR4CES to the NFDI4x initiatives by supporting users in the use of emerging solutions for automating research data management.
If you have questions for other groups or general questions like access to the HPC infrastructure, have a look at our support website.
Current research topics:
- WikiDBs: A corpus of 100,000 real-world databases (Liane Vogel and Jan-Micha Bodensohn)
– Based on data from Wikidata, we created a large-scale corpus of relational databases from various domains to support the development of foundation models for tabular data
- Foundation Models for Automating Data Engineering on Structured Data (Liane Vogel)
– The project aims to explore the usage of pre-trained deep neural networks on structured data from databases in order to reduce the manual overhead on data engineering tasks like data cleaning.
- Table Retrieval From Data Lakes (Jan-Micha Bodensohn)
- Linking Research Data Management and HPC (Marcel Nellesen)
– Provide easy ways to apply for storage space similar to HPC applications
– Data Management, Access Workflows and Knowledge Graph based Metadata Management
– Data Transfers between S3 solutions and HPC nodes
– Efficient metadata management with support for automated extraction of metadata
– Connection to many other federal and national RDM initiatives (NFDI, fdm.nrw…)
Training offers 2025:
- Introduction to git and FDM with GitLab
- Community Workshop: Materials Science with Advanced Data Management and Data Science Techniques
- RDM in NHR: Efficient Data Exchange and User-Friendly HPC
- RDM in NHR: Getting started with RDM
- Large Language Models for Data Wrangling
Support activities:
- JARDS as a source of project-specific metadata and a common entry point for requesting resources (computing time and storage, together with NFDI4Ing)
- Coscine, an integration platform that allows services such as the archive, research data storage (RDS.NRW), and GitLab, as well as external storage systems, to be linked with one another at project level and stored with metadata
Teaching activities:
- “Data Literacy for All” (hub with links to good existing online courses, best practices, online exercises, …): https://data-ai-literacy.ml/
- Hackathons with tools for data engineering
- Research Data Management with GitLab
Video
Benjamin Hättasch: WannaDB: Ad-hoc Structured Exploration of Text Collections Using Queries
Project partners
Publications
2025
- Towards Complex Table Question Answering Over Tabular Data Lakes (Daniela Risis, Jan-Micha Bodensohn, Matthias Urban, Carsten Binnig), DE4DS@BTW’25
2024
- Automating Enterprise Data Engineering with LLMs. (Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Anupam Sanghi, Carsten Binnig), TRL@NeurIPS’24
- WikiDBs: A Large-Scale Corpus Of Relational Databases From Wikidata. (Liane Vogel, Jan-Micha Bodensohn, Carsten Binnig), NeurIPS’24 Datasets&Benchmarks Track
- CAESURA: Language Models as Multi-Modal Query Planners. (Matthias Urban, Carsten Binnig), CIDR’24
- Demonstrating CAESURA: Language Models as Multi-Modal Query Planners. (Matthias Urban, Carsten Binnig), SIGMOD’24 Demo Track
- Rethinking Table Retrieval from Data Lakes. (Jan-Micha Bodensohn, Carsten Binnig), aiDM’24@SIGMOD’24
- LLMs for Data Engineering on Enterprise Data. (Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Matthias Urban, Anupam Sanghi, Carsten Binnig), TaDA@VLDB’24
2023
- WannaDB: Ad-hoc SQL Queries over Text Collections. (Benjamin Hättasch, Jan-Micha Bodensohn, Liane Vogel, Matthias Urban, Carsten Binnig), Datenbanksysteme für Business, Technologie und Web (BTW 2023)
- Carrots and Sticks: Motivating with Storage for Good RDM (Ilona Lang, Marcel Nellesen, Lukas Bossert, Marius Politze)
- RDM Platform Coscine – FAIR play integrated right from the start (Ilona Lang, Marcel Nellesen, Marius Politze)
- OmniscientDB: A Large Language Model-Augmented DBMS That Knows What Other DBMSs Do Not Know (Matthias Urban, Duc Dat Nguyen and Carsten Binnig), AIDM@SIGMOD
- WikiDBs: A corpus of relational databases from Wikidata (Liane Vogel, Carsten Binnig), TaDA@VLDB 2023
2022
- Towards Foundation Models for Relational Databases. (Liane Vogel, Benjamin Hilprecht, Carsten Binnig), TRL@NeurIPS 2022
- ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. (Tobias Ziegler, Carsten Binnig, Viktor Leis), SIGMOD 2022
2021
- It’s AI Match: A Two-Step Approach for Schema Matching Using Embeddings. (Benjamin Hättasch, Michael Truong-Ngoc, Andreas Schmidt, Carsten Binnig), AIDB@VLDB 2020
- ASET: Ad-hoc Structured Exploration of Text Collections. (Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig), AIDB@VLDB 2021
CSG Data Management News
Best Short Paper Award for our CSG Data Management
13 March 2024