Harvesting AI-Ready Data: Intelligent Import and Semantic Enrichment of Open Data Catalogs
- Research topic/area
- Human-Computer Interaction, Data Management, Open Data Platforms
- Type of thesis
- Bachelor / Master
- Start time
- 31.05.2025
- Application deadline
- 31.05.2025
- Duration of the thesis
- 3-6 months
Description
Abstract / Project Description
The rapid growth of open data platforms such as data.europa.eu, Mobilithek, and Hugging Face has produced a large volume of publicly available datasets. However, most of these datasets are not directly usable for AI applications because of inconsistent formats, poor metadata quality, and a lack of semantic structure. This hinders discoverability and reusability, particularly in research domains that depend on diverse and well-annotated data.
The goal of this thesis is to design and develop a modular prototype for an intelligent data importer that harvests, processes, and semantically enriches metadata from various open data sources. The enriched datasets will be integrated into a DCAT-compliant system, such as Piveau.io, creating a foundation for AI-ready datasets.
Research Objectives
• Develop a robust pipeline for automated dataset extraction from platforms like Hugging Face, Mobilithek, OpenML, and Radar4KIT.
• Apply AI techniques (e.g., using LangChain or LlamaIndex) to enrich and normalize metadata.
• Implement semantic lifting and linking using tools such as LinkML and RDFlib to improve dataset discoverability.
• Integrate enriched data into Piveau through a standardized MCP (Metadata Catalog Publishing) endpoint.
• Evaluate the usability and discoverability of the resulting data catalog from an AI researcher’s perspective.
Methodology
• Requirements analysis based on current challenges in AI dataset discovery
• Implementation of a prototype using Python and semantic web technologies
• Integration with the Piveau platform, adhering to DCAT-AP standards
• Performance and quality evaluation of metadata enrichment and retrieval
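The normalization and semantic-lifting steps listed above can be illustrated with a minimal sketch. It uses only the Python standard library and emits JSON-LD; an actual prototype would more likely build the graph with RDFlib, as named in the objectives. The input record shape (`name`, `summary`, `tags`, `url`) and the helper names `normalize_keywords` and `to_dcat` are assumptions made for illustration and do not correspond to any specific platform API. The property names come from the DCAT and Dublin Core Terms vocabularies.

```python
import json

# Published namespace IRIs for DCAT and Dublin Core Terms.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def normalize_keywords(raw_tags):
    """Lowercase, trim, and de-duplicate tags, keeping first-seen order."""
    seen = []
    for tag in raw_tags:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in seen:
            seen.append(cleaned)
    return seen


def to_dcat(record):
    """Map a harvested record (hypothetical shape) to a DCAT dataset in JSON-LD."""
    return {
        "@context": CONTEXT,
        "@id": record["url"],
        "@type": "dcat:Dataset",
        "dct:title": record["name"].strip(),
        "dct:description": record.get("summary", "").strip(),
        "dcat:keyword": normalize_keywords(record.get("tags", [])),
        "dcat:landingPage": {"@id": record["url"]},
    }


# Hypothetical harvested record; not real catalog data.
example = {
    "name": "  Example Traffic Counts ",
    "summary": "Hourly counts from an illustrative sensor network.",
    "tags": ["Traffic", "traffic ", "Mobility", ""],
    "url": "https://example.org/datasets/traffic-counts",
}

dataset = to_dcat(example)
print(json.dumps(dataset, indent=2))
```

Keeping the mapping in one small function per source would let each platform get its own adapter while the DCAT output stays uniform; the AI-based enrichment step could then operate on the normalized dictionary before it is published.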
Requirement
- Requirements for students
- A functional prototype of an intelligent importer for AI-ready data
- Integration of semantically enriched datasets into a DCAT-compliant catalog
- Evaluation of the approach through use case testing and documentation
- A written thesis outlining the design, implementation, and analysis of the solution
- Faculty departments
- Engineering sciences
  - Electrical engineering & information technologies
  - Informatics
  - Mechatronics & information technologies
  - Energy Engineering and Management
  - Mobility Systems Engineering and Management
  - Information System Engineering and Management
Supervision
- Title, first name, last name
- Sarah Makarem
- Organizational unit
- TM
- Email address
- makarem@teco.edu
Application via email
- Application documents
- Curriculum vitae
- Grade transcript
Email address for application
Please send the application documents listed above by email to makarem@teco.edu