23. September 2024
CSG Data Management
and RWTH Aachen University
Cross-Sectional Group
The CSG Data Management develops automated tools to reduce the manual overhead of data engineering tasks such as metadata annotation, data cleaning, and data transformation. We also aim to integrate research data management (RDM) more closely with HPC systems and to avoid unnecessary data copies.
We pursue these goals with the following methods:
- AI to automate Data Engineering Tasks (e.g., Automated Data Cleaning)
- Lineage over complete data engineering workflows (i.e., track which data came from where)
- System integration (e.g., integrate different storage systems)
We work on new tools and provide them as open-source code that can support researchers in their tasks.
Consulting on the use of these tools will be offered through workshops (e.g., hackathons). Additionally, we will create a web platform to increase data literacy, where researchers can find links to other useful tools and to educational resources such as online courses.
Overall, we will also provide a bridge from NHR4CES to the NFDI4x initiatives by supporting users in the use of emerging solutions for automating research data management.
If you have questions for other groups or general questions, for example about access to the HPC infrastructure, have a look at our support website.
Current research topics:
- Foundation Models for Automating Data Engineering on Structured Data (Liane Vogel)
– The project aims to explore the usage of pre-trained deep neural networks on structured data from databases in order to reduce the manual overhead on data engineering tasks like data cleaning.
- Table Retrieval From Data Lakes (Jan-Micha Bodensohn)
- Multi-modal Databases (Matthias Urban)
- Linking Research Data Management and HPC (Marcel Nellesen)
– Provide easy ways to apply for storage space similar to HPC applications
– Data Transfers between S3 solutions and HPC nodes
– Efficient metadata management with support for automated extraction of metadata
– Connection to many other federal and national RDM initiatives (NFDI, fdm.nrw…)
- Data Management, Access Workflows and Knowledge Graph based Metadata Management
Training offers 2024:
- Introduction to git and FDM with GitLab
- Large Language Models for Data Wrangling
- RDM in NHR: Efficient Data Exchange and User-Friendly HPC on 27.08.2024
- RDM in NHR: Getting started with RDM (End of 2024)
Support activities:
- “Data Literacy for All” (hub with links to existing good online courses, best practices, online exercises, …): https://data-ai-literacy.ml/
- JARDS as a source of project-specific metadata and a common entry point for requesting resources (computing time and storage together with NFDI4Ing)
- Coscine, an integration platform that allows services such as the archive, research data storage (RDS.NRW), and GitLab, as well as external storage systems, to be linked with one another at project level and stored with metadata
Teaching activities:
- “Data Literacy for All” (hub with links to existing good online courses, best practices, online exercises, …): https://data-ai-literacy.ml/
- Hackathons with tools for data engineering
- Research Data Management with GitLab
Video
Benjamin Hättasch: WannaDB: Ad-hoc Structured Exploration of Text Collections Using Queries
Project partners
- NFDI4Ing
- NFDI-MatWerk
- AIMS
- fdm.nrw
- hpc.nrw
- coscine.nrw
- NHR
- datastorage.nrw
Publications
2024
- Automating Enterprise Data Engineering with LLMs. (Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Anupam Sanghi, Carsten Binnig), TRL@NeurIPS’24
- WikiDBs: A Large-Scale Corpus Of Relational Databases From Wikidata. (Liane Vogel, Jan-Micha Bodensohn, Carsten Binnig), NeurIPS’24 Datasets&Benchmarks Track
- CAESURA: Language Models as Multi-Modal Query Planners. (Matthias Urban, Carsten Binnig), CIDR’24
- Demonstrating CAESURA: Language Models as Multi-Modal Query Planners. (Matthias Urban, Carsten Binnig), SIGMOD’24 Demo Track
- Rethinking Table Retrieval from Data Lakes. (Jan-Micha Bodensohn, Carsten Binnig), aiDM’24@SIGMOD’24
- LLMs for Data Engineering on Enterprise Data. (Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Matthias Urban, Anupam Sanghi, Carsten Binnig), TaDA@VLDB’24
2023
- WannaDB: Ad-hoc SQL Queries over Text Collections. (Benjamin Hättasch, Jan-Micha Bodensohn, Liane Vogel, Matthias Urban, Carsten Binnig), Datenbanksysteme für Business, Technologie und Web (BTW 2023)
- Carrots and Sticks: Motivating with Storage for Good RDM (Ilona Lang, Marcel Nellesen, Lukas Bossert, Marius Politze)
- RDM Platform Coscine – FAIR play integrated right from the start (Ilona Lang, Marcel Nellesen, Marius Politze)
- OmniscientDB: A Large Language Model-Augmented DBMS That Knows What Other DBMSs Do Not Know (Matthias Urban, Duc Dat Nguyen and Carsten Binnig), AIDM@SIGMOD
- WikiDBs: A corpus of relational databases from wikidata (Liane Vogel, Carsten Binnig), TADA@VLDB 2023
2022
- Towards Foundation Models for Relational Databases. (Liane Vogel, Benjamin Hilprecht, Carsten Binnig), TRL@NeurIPS 2022
- ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. (Tobias Ziegler, Carsten Binnig, Viktor Leis), SIGMOD 2022
2021
- It’s AI Match: A Two-Step Approach for Schema Matching Using Embeddings. (Benjamin Hättasch, Michael Truong-Ngoc, Andreas Schmidt, Carsten Binnig), AIDB@VLDB 2020
- ASET: Ad-hoc Structured Exploration of Text Collections. (Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig), AIDB@VLDB 2021
CSG Data Management News
Best Short Paper Award for our CSG Data Management
13. March 2024