Harvesting AI-Ready Data: Intelligent Import and Semantic Enrichment of Open Data Catalogs

Research topic/area
Human-Computer Interaction, Data Management, Open Data Platforms
Type of thesis
Bachelor / Master
Start time
31.05.2025
Application deadline
31.05.2025
Duration of the thesis
3-6 months

Description

Abstract / Project Description
The rapid growth of open data platforms such as data.europa.eu, Mobilithek, and Hugging Face has made a large volume of datasets publicly available. However, most of these datasets are not directly usable for AI applications because of inconsistent formats, poor metadata quality, and a lack of semantic structure. This limits discoverability and reusability, particularly in research domains that depend on diverse and well-annotated data.
The goal of this thesis is to design and develop a modular prototype of an intelligent data importer that harvests, processes, and semantically enriches metadata from various open data sources. The enriched datasets will be integrated into a DCAT-compliant system, such as Piveau.io, creating a foundation for AI-ready datasets.

Research Objectives
• Develop a robust pipeline for automated dataset extraction from platforms like Hugging Face, Mobilithek, OpenML, and Radar4KIT.
• Apply AI techniques (e.g., using LangChain or LlamaIndex) to enrich and normalize metadata.
• Implement semantic lifting and linking using tools such as LinkML and RDFLib to improve dataset discoverability.
• Integrate enriched data into Piveau through a standardized MCP (Metadata Catalog Publishing) endpoint.
• Evaluate the usability and discoverability of the resulting data catalog from an AI researcher’s perspective.
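
To illustrate the harvesting and normalization objectives, the following is a minimal sketch of how a harvested record could be mapped onto DCAT terms as JSON-LD. It uses only the Python standard library. The source field names in FIELD_ALIASES, the function names, and the example URL are assumptions made for illustration; they do not describe the actual response format of any platform or the Piveau API. Only the DCAT and Dublin Core Terms namespace URIs are taken from the published vocabularies.

```python
import json
from typing import Any, Optional

# JSON-LD context with the DCAT and Dublin Core Terms namespaces.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

# Hypothetical source field names per target property (assumption:
# real platform responses will use different and richer schemas).
FIELD_ALIASES = {
    "dct:title": ("title", "name", "id"),
    "dct:description": ("description", "summary"),
    "dcat:keyword": ("tags", "keywords"),
    "dct:license": ("license", "licence"),
}


def first_present(record: dict, names: tuple) -> Optional[Any]:
    """Return the first non-empty value found under any of the given keys."""
    for name in names:
        value = record.get(name)
        if value not in (None, "", []):
            return value
    return None


def normalize_keywords(value: Any) -> list:
    """Lowercase, trim, and de-duplicate keywords given as a list or a comma-separated string."""
    items = value.split(",") if isinstance(value, str) else list(value)
    return sorted({str(item).strip().lower() for item in items if str(item).strip()})


def to_dcat(record: dict, landing_page: str) -> dict:
    """Map a harvested record onto a small DCAT dataset description (JSON-LD)."""
    dataset = {
        "@context": CONTEXT,
        "@id": landing_page,
        "@type": "dcat:Dataset",
        "dcat:landingPage": {"@id": landing_page},
    }
    for target, aliases in FIELD_ALIASES.items():
        value = first_present(record, aliases)
        if value is None:
            continue  # leave missing fields out instead of inventing values
        if target == "dcat:keyword":
            value = normalize_keywords(value)
        dataset[target] = value
    return dataset


if __name__ == "__main__":
    # Made-up record used only to exercise the mapping.
    harvested = {"id": "demo/dataset", "tags": ["NLP", "nlp", " Text "], "summary": "A demo."}
    print(json.dumps(to_dcat(harvested, "https://example.org/demo"), indent=2))
```

Keeping the alias table separate from the mapping logic means a new source can be supported by adding field names rather than code. In the thesis itself, a library such as RDFLib or a LinkML schema would likely replace the hand-written dictionary.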

Methodology
• Requirements analysis based on current challenges in AI dataset discovery
• Implementation of a prototype using Python and semantic web technologies
• Integration with the Piveau platform, adhering to DCAT-AP standards
• Performance and quality evaluation of metadata enrichment and retrieval
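
One possible way to quantify the quality of metadata enrichment is a simple completeness score compared before and after enrichment. The sketch below is an assumption about how such a measure could look, not a prescribed evaluation method; the list of checked fields and the function names are illustrative only.

```python
# Illustrative set of fields to check (assumption: the thesis would
# derive the actual list from the DCAT-AP profile in use).
CHECKED_FIELDS = ("dct:title", "dct:description", "dcat:keyword", "dct:license")


def completeness(record: dict, fields: tuple = CHECKED_FIELDS) -> float:
    """Fraction of the checked fields that carry a non-empty value."""
    filled = sum(1 for field in fields if record.get(field) not in (None, "", [], {}))
    return filled / len(fields)


def mean_completeness(records: list, fields: tuple = CHECKED_FIELDS) -> float:
    """Average completeness over a catalog; 0.0 for an empty catalog."""
    if not records:
        return 0.0
    return sum(completeness(record, fields) for record in records) / len(records)


if __name__ == "__main__":
    # Made-up records standing in for a catalog before and after enrichment.
    before = [{"dct:title": "a"}]
    after = [{"dct:title": "a", "dct:description": "b", "dcat:keyword": ["x"], "dct:license": ""}]
    print(mean_completeness(before), mean_completeness(after))
```

Completeness only measures whether fields are filled, not whether their values are correct, so it would need to be complemented by the use case testing described in the requirements.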

Requirement

Requirements for students
  • A functional prototype of an intelligent importer for AI-ready data
  • Integration of semantically enriched datasets into a DCAT-compliant catalog
  • Evaluation of the approach through use case testing and documentation
  • A written thesis outlining the design, implementation, and analysis of the solution

Faculty departments
  • Engineering sciences
    Electrical engineering & information technologies
    Informatics
    Mechatronics & information technologies
    Energy Engineering and Management
    Mobility Systems Engineering and Management
    Information System Engineering and Management


Supervision

Title, first name, last name
Sarah Makarem
Organizational unit
TM
Email address
makarem@teco.edu
Link to personal homepage/personal page
Website

Application via email

Application documents
  • Curriculum vitae
  • Grade transcript

E-Mail Address for application
Please send the application documents listed above by email to makarem@teco.edu

