Benchmarking Machine Learning Pipelines in PostgreSQL with TPCx-AI

EasyChair Preprint 13586, version 1
17 pages • Date: June 7, 2024

Abstract

Driven by advancements in model capabilities and ease of access, machine learning (ML) and artificial intelligence (AI) are increasingly applied across industry and government sectors. Traditionally, ML training and serving either relies on large external service providers such as AWS or MS Azure, or requires data to be transferred from databases or data lakes to local or cloud environments. Apart from creating dependencies on external ML frameworks, this type of transfer not only introduces significant overhead but also poses risks to data security and data integrity. Integrating these technologies directly within database systems promises significant advantages, particularly for production environments. However, the performance and capability of database systems for various ML scenarios remain unclear. To address these uncertainties, this paper proposes porting the TPCx-AI benchmark toolkit to PostgreSQL using the MADlib extension, so that the entire ML pipeline (data loading, preprocessing, training, scoring, and serving) runs within the database system. We present the implementation details and compare its performance with the traditional Python-based approach from the toolkit. Our evaluation, which uses the synthetic data generator PDGF and the use cases provided by TPCx-AI, offers a comprehensive analysis of the benefits and shortcomings of in-database ML training with PostgreSQL and MADlib.

Keyphrases: Apache MADlib, Benchmarking, Database Management Systems, PostgreSQL, TPCx-AI, machine learning, performance evaluation