Harvesting AI-Ready Data: Intelligent Import and Semantic Enrichment of Open Data Catalogs

Research Topic/Area
Human-Computer Interaction, Data Management, Open Data Platforms
Thesis Type
Bachelor / Master
Start Date
31.05.2025
Application Deadline
31.05.2025
Duration
3-6 months

Description

Abstract / Project Description
The rapid growth of open data platforms such as data.europa.eu, Mobilithek, and Hugging Face has made a large volume of datasets publicly available. However, most of these datasets cannot be used directly for AI applications because of inconsistent formats, poor metadata quality, and a lack of semantic structure. This hinders discoverability and reusability, particularly in research domains that depend on diverse and well-annotated data.
The goal of this thesis is to design and develop a modular prototype of an intelligent data importer that harvests, processes, and semantically enriches metadata from various open data sources. The enriched datasets will be integrated into a DCAT-compliant system such as Piveau.io, creating a foundation for AI-ready datasets.

Research Objectives
• Develop a robust pipeline for automated dataset extraction from platforms like Hugging Face, Mobilithek, OpenML, and RADAR4KIT.
• Apply AI techniques (e.g., using LangChain or LlamaIndex) to enrich and normalize metadata.
• Implement semantic lifting and linking using tools such as LinkML and RDFLib to improve dataset discoverability.
• Integrate enriched data into Piveau through a standardized MCP (Metadata Catalog Publishing) endpoint.
• Evaluate the usability and discoverability of the resulting data catalog from an AI researcher’s perspective.
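
To make the harvesting and lifting objectives more concrete, the following is a minimal sketch of how a harvested record could be normalized and mapped to a DCAT-style JSON-LD description. It is an illustration under stated assumptions, not the project's design: the input record layout (`id`, `url`, `title`, `description`, `tags`) and the function names are hypothetical and do not reflect any platform's actual API, and the AI-based enrichment step is replaced here by simple rule-based keyword cleanup. Only the DCAT and Dublin Core namespace IRIs and property names are taken from the published vocabularies.

```python
import json

# JSON-LD context with the two vocabularies used below (real namespace IRIs).
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def normalize_keywords(raw_tags):
    """Lower-case, strip, and de-duplicate tags while keeping their order."""
    seen = []
    for tag in raw_tags:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in seen:
            seen.append(cleaned)
    return seen


def to_dcat(record):
    """Map a hypothetical harvested record (plain dict) to a DCAT-style JSON-LD dict."""
    dataset = {
        "@context": CONTEXT,
        "@id": record["url"],
        "@type": "dcat:Dataset",
        "dct:identifier": record["id"],
        # Fall back to the identifier when the source has no usable title.
        "dct:title": record.get("title") or record["id"],
        "dcat:keyword": normalize_keywords(record.get("tags", [])),
        "dcat:distribution": [
            {"@type": "dcat:Distribution", "dcat:accessURL": {"@id": record["url"]}}
        ],
    }
    # The description is only emitted when the source actually provides one.
    if record.get("description"):
        dataset["dct:description"] = record["description"].strip()
    return dataset


if __name__ == "__main__":
    # Made-up example record; the URL and values are placeholders.
    example = {
        "id": "example-org/traffic-counts",
        "url": "https://example.org/datasets/traffic-counts",
        "title": "",
        "description": "  Hourly traffic counts.  ",
        "tags": ["Mobility", "mobility ", "Time-Series"],
    }
    print(json.dumps(to_dcat(example), indent=2))
```

In a real importer, the plain dictionary would likely be replaced by an RDF graph (for example built with RDFLib) and validated against DCAT-AP before publishing; the dictionary form is used here only to keep the sketch free of third-party dependencies.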

Methodology
• Requirements analysis based on current challenges in AI dataset discovery
• Implementation of a prototype using Python and semantic web technologies
• Integration with the Piveau platform, adhering to DCAT-AP standards
• Performance and quality evaluation of metadata enrichment and retrieval
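
One simple way to approach the quality evaluation is a metadata completeness score computed before and after enrichment. The sketch below is only an illustration: the list of expected fields is an assumption chosen for the example, not a set prescribed by DCAT-AP or by this project, and the function names are hypothetical.

```python
# Illustrative field list (an assumption, not a DCAT-AP requirement set).
EXPECTED_FIELDS = ("dct:title", "dct:description", "dcat:keyword", "dcat:distribution")


def completeness(record, fields=EXPECTED_FIELDS):
    """Return the fraction of expected fields that are present and non-empty."""
    filled = sum(1 for field in fields if record.get(field))
    return filled / len(fields)


def mean_completeness(records, fields=EXPECTED_FIELDS):
    """Average the completeness score over a catalog; 0.0 for an empty catalog."""
    if not records:
        return 0.0
    return sum(completeness(r, fields) for r in records) / len(records)


if __name__ == "__main__":
    # Made-up records standing in for a catalog before and after enrichment.
    before = [{"dct:title": "Traffic counts", "dcat:keyword": []}]
    after = [{
        "dct:title": "Traffic counts",
        "dct:description": "Hourly traffic counts.",
        "dcat:keyword": ["mobility"],
        "dcat:distribution": [{"dcat:accessURL": "https://example.org/data.csv"}],
    }]
    print(mean_completeness(before), mean_completeness(after))
```

A score like this only captures whether fields are filled in, not whether their content is correct, so it would need to be complemented by retrieval tests or manual review.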

Requirements

Requirements for Students
  • A functional prototype of an intelligent importer for AI-ready data
  • Integration of semantically enriched datasets into a DCAT-compliant catalog
  • Evaluation of the approach through use case testing and documentation
  • A written thesis outlining the design, implementation, and analysis of the solution

Fields of Study
  • Engineering
    Electrical Engineering & Information Technology
    Computer Science
    Mechatronics & Information Technology
    Energy Engineering and Management
    Mobility Systems Engineering and Management
    Information System Engineering and Management


Supervision

Title, First Name, Last Name
Sarah Makarem
Organizational Unit
TM
Email Address
makarem@teco.edu
Link to Personal Homepage/Profile Page
Website

Application by Email

Application Documents
  • CV
  • Transcript of records

Email Address for Applications
Please send the application documents listed above by email to makarem@teco.edu

