Performance and Scalability of Learned Index Structures in the Wild

A. Wirth
Master Thesis
MT2401 (March, 2024)
Supervised by
o. Univ.-Prof. DI Dr. Michael Schrefl
Instructed by
Simon Staudinger, MSc
Accomplished at
University Linz, Institute of Business Informatics - Data & Knowledge Engineering

Abstract (English)

This thesis aims to investigate the potential of Learned Index structures using the example of different datasets. These index structures surfaced in recent years, as a credible alternative for traditional index approaches, with several papers dedicated to the purpose of evaluating the possibilities enhancing indexes with machine learning models would bring. In this thesis, the previous work will be extended with the aspect of the performance comparison in different real-world datasets, more specifically on datasets from Dynatrace.

The comparison to traditional indexes and indexes commonly found in the industry is done through experiments examining metrics like build- and lookup time as well as other, model specific, parameters. Furthermore, the scalability of Learned Index structures is evaluated. These experiments are conducted on various datasets, both real world datasets from Dynatrace and synthetic ones. The main aim of the experiments was to investigate whether a performance gain would be possible, when implementing Learned Index structures on real world datasets that can be found in "the wild", meaning the industry aside from synthetic datasets. The results of the experiments conducted in this thesis lean to the side that a performance gain is in fact possible, as that Learned indexes can offer very fast lookups. These fast lookup time are due to the fact that Learned Index structures can utilize the Distributive Data Function to effectively find the wanted lookup keys. What must be considered is that the build times of Learned Index structures tend to be higher due to their complexity, which is a fact that has to be considered when using them in industrial use cases.