Harvesting AI-Ready Data: Intelligent Import and Semantic Enrichment of Open Data Catalogs
- Research topic/area
- Human-Computer Interaction, Data Management, Open Data Platforms
- Type of thesis
- Bachelor / Master
- Start date
- 31.05.2025
- Application deadline
- 31.05.2025
- Duration
- 3-6 months
Description
Abstract / Project Description
The exponential growth of open data platforms such as data.europa.eu, Mobilitek, and Hugging Face has resulted in a vast volume of publicly available datasets. However, the majority of these datasets are not directly usable for AI applications due to inconsistent formats, poor metadata quality, and a lack of semantic structure. This hinders discoverability and reusability, particularly in research domains that depend on diverse and well-annotated data.
The goal of this thesis is to design and develop a modular prototype of an intelligent data importer that harvests, processes, and semantically enriches metadata from various open data sources. The enriched datasets will be integrated into a DCAT-compliant system such as Piveau.io, creating a foundation for AI-ready datasets.
Research Objectives
• Develop a robust pipeline for automated dataset extraction from platforms like Hugging Face, Mobilitek, OpenML, and Radar4KIT.
• Apply AI techniques (e.g., using LangChain or LlamaIndex) to enrich and normalize metadata.
• Implement semantic lifting and linking using tools such as LinkML and RDFLib to improve dataset discoverability.
• Integrate enriched data into Piveau through a standardized MCP (Metadata Catalog Publishing) endpoint.
• Evaluate the usability and discoverability of the resulting data catalog from an AI researcher’s perspective.
Methodology
• Requirements analysis based on current challenges in AI dataset discovery
• Implementation of a prototype using Python and semantic web technologies
• Integration with the Piveau platform, adhering to DCAT-AP standards
• Performance and quality evaluation of metadata enrichment and retrieval
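To give a feel for the normalization and evaluation steps named above, here is a minimal sketch using only the Python standard library. It maps a raw metadata record onto a DCAT-style JSON-LD dictionary and computes a simple completeness score. The field mapping, the list of expected terms, the function names, and the example URI are all illustrative assumptions. They are not part of Piveau or any source platform's API, and the output is not validated against DCAT-AP.

```python
import json

# Hypothetical mapping from source-specific field names to DCAT / Dublin Core terms.
FIELD_MAP = {
    "id": "dct:identifier",
    "title": "dct:title",
    "name": "dct:title",
    "description": "dct:description",
    "summary": "dct:description",
    "tags": "dcat:keyword",
    "keywords": "dcat:keyword",
}

# Standard namespace IRIs for the two vocabularies used here.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

# Terms counted by the completeness score (an illustrative choice).
EXPECTED = ("dct:identifier", "dct:title", "dct:description", "dcat:keyword")


def to_dcat(record: dict, dataset_uri: str) -> dict:
    """Map a raw metadata record onto a DCAT-style JSON-LD dictionary."""
    dataset = {"@context": CONTEXT, "@id": dataset_uri, "@type": "dcat:Dataset"}
    for key, value in record.items():
        term = FIELD_MAP.get(key.strip().lower())
        if term is None or value in (None, "", []):
            continue  # skip unmapped or empty fields
        if term == "dcat:keyword":
            # Accept a list or a comma-separated string; lowercase and deduplicate.
            items = value.split(",") if isinstance(value, str) else value
            value = sorted({str(v).strip().lower() for v in items if str(v).strip()})
        elif isinstance(value, str):
            value = value.strip()
        dataset.setdefault(term, value)  # the first mapped field wins
    return dataset


def completeness(dataset: dict) -> float:
    """Fraction of expected terms that are present and non-empty."""
    present = sum(1 for term in EXPECTED if dataset.get(term))
    return present / len(EXPECTED)


if __name__ == "__main__":
    raw = {"Name": " Traffic counts ", "tags": "Mobility, traffic,mobility", "summary": ""}
    ds = to_dcat(raw, "https://example.org/dataset/traffic-counts")
    print(json.dumps(ds, indent=2))
    print(completeness(ds))  # 0.5: title and keywords present, identifier and description missing
```

In an actual prototype, a library such as RDFLib would typically replace the plain dictionary, and the completeness score could serve as one baseline metric for comparing metadata before and after enrichment.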
Requirements
- Requirements for students
-
- A functional prototype of an intelligent importer for AI-ready data
- Integration of semantically enriched datasets into a DCAT-compliant catalog
- Evaluation of the approach through use case testing and documentation
- A written thesis outlining the design, implementation, and analysis of the solution
- Fields of study
-
- Engineering Sciences
Electrical Engineering & Information Technology
Computer Science
Mechatronics & Information Technology
Energy Engineering and Management
Mobility Systems Engineering and Management
Information System Engineering and Management
Supervision
- Title, first name, last name
- Sarah Makarem
- Organizational unit
- TM
- Email address
- makarem@teco.edu
- Link to personal homepage/profile page
- Website
Application by email
- Application documents
-
- CV
- Transcript of records
Email address for applications
Please send the application documents listed above by email to makarem@teco.edu