
Benchmarking Machine Learning Pipelines in PostgreSQL with TPCx-AI

EasyChair Preprint 13586, version 1

Versions: 1, 2

17 pages • Date: June 7, 2024

Abstract

Driven by advancements in model capabilities and ease of access, machine learning (ML) and artificial intelligence (AI) are increasingly applied across industry and government sectors. Traditionally, ML training and serving either rely on large external service providers such as AWS or MS Azure, or require data to be transferred from databases or data lakes to local or cloud environments. Apart from creating dependencies on external ML frameworks, such transfers not only introduce significant overhead but also pose risks to data security and data integrity. Integrating these technologies directly within database systems promises significant advantages, particularly for production environments. However, the performance and capability of database systems for various ML scenarios remain unclear. To address these uncertainties, this paper proposes porting the TPCx-AI benchmark toolkit to PostgreSQL using the MADlib extension, so that the entire ML pipeline, from data loading and preprocessing to training, scoring, and serving, runs within the database system. We present the implementation details and compare its performance with the traditional Python-based approach from the toolkit. Our evaluation, leveraging the synthetic data generator PDGF and the use cases provided by TPCx-AI, offers a comprehensive analysis of the benefits and shortcomings of in-database ML training with PostgreSQL and MADlib.

Keyphrases: Apache MADlib, Benchmarking, Database Management Systems, PostgreSQL, TPCx-AI, machine learning, performance evaluation

Links:

https://easychair.org/publications/preprint/Xj6m

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:13586,
  author    = {Leonhard Liu and Patrick Erdelt},
  title     = {Benchmarking Machine Learning Pipelines in PostgreSQL with TPCx-AI},
  howpublished = {EasyChair Preprint 13586},
  year      = {EasyChair, 2024}}
