Meet TQP: The First Query Processor to Run On Tensor Computation Runtimes Delivers up to 20x Speedups Over CPU-Only Systems

If data is AI’s fuel, then compute is its engine. The ever-growing computational requirements of contemporary AI systems have driven investment and R&D in specialized hardware and in the runtimes and compilers that support it, with leading industry players and open-source communities alike devoting enormous effort to software that targets AI workloads.

In the new paper Query Processing on Tensor Computation Runtimes, a research team from the University of Washington, UC San Diego and Microsoft prototypes Tensor Query Processor (TQP), a query processor that runs atop tensor computation runtimes (TCRs) such as PyTorch, TVM, and ONNX Runtime. The researchers say TQP is the first query processor to run on TCRs, and demonstrate its ability to improve query execution time by up to 20x over CPU-only systems and up to 5x over specialized GPU solutions.

The team summarizes their main contributions as:

  1. We show that the tensor interface of TCRs is expressive enough to support all common relational operators.
  2. We propose a collection of algorithms and a compiler stack for translating relational operators into tensor computation.
  3. We evaluate the Tensor Query Processor approaches extensively against state-of-the-art baselines on the TPC-H benchmark.
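To make the first contribution concrete, here is a minimal sketch of how a filter followed by a grouped sum might be written purely with tensor operations in PyTorch. It is an illustration only, not TQP's actual algorithm: the toy columns, the function name `filter_groupby_sum`, and the assumption that the grouping key is already integer-encoded are all hypothetical.

```python
import torch

# Hypothetical toy table: each column is a 1-D tensor of equal length.
region = torch.tensor([0, 1, 0, 2, 1, 0])  # integer-encoded grouping key
price = torch.tensor([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
qty = torch.tensor([1, 5, 3, 2, 4, 6])


def filter_groupby_sum(keys, values, predicate_mask):
    """Rough tensor equivalent of:
    SELECT key, SUM(value) FROM t WHERE <predicate> GROUP BY key
    """
    # Selection: a boolean mask keeps only the qualifying rows.
    keys = keys[predicate_mask]
    values = values[predicate_mask]
    # Grouping: map each surviving row to the index of its distinct key.
    groups, inverse = torch.unique(keys, return_inverse=True)
    # Aggregation: accumulate each row's value into its group's slot.
    sums = torch.zeros(groups.shape[0], dtype=values.dtype)
    sums.index_add_(0, inverse, values)
    return groups, sums


groups, sums = filter_groupby_sum(region, price, qty >= 3)
print(groups.tolist(), sums.tolist())
```

Because every step is an ordinary tensor operation, the same function should run unchanged on any device the runtime supports once the input tensors are placed there.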

TCRs such as PyTorch and TensorFlow enable data scientists to efficiently exploit the capabilities offered by new hardware and to develop and implement deep neural networks (DNNs) with ease. The growing demand for TCRs indicates that hardware solutions specifically targeting data-hungry ML are on the rise, raising the question of how databases might also benefit from these innovations.

The team says their proposed TQP was designed to satisfy three objectives:

  1. Performance. The query processor should have performance on par with specialized engines (e.g., it should be as performant as GPU databases on GPU devices).
  2. Portability. We strive to have a query processor that is able to run on different hardware devices, from custom ASICs to CPUs and GPUs, across different generations and vendors.
  3. Minimal Engineering Effort. Building high-performance custom operators for each different hardware backend is a herculean task. We should strive to have an approach that is 𝑂(1) over the number of supported hardware, instead of 𝑂(𝑛).

Relational operators and ML models in TQP are compiled into tensor programs using a unified infrastructure. The workflow comprises two phases: 1) In the compilation phase, input queries are transformed into an executable tensor program; 2) In the execution phase, input data is first transformed into tensors and then fed into the compiled program to generate the final query result.

The compilation phase includes four main layers: 1) The parsing layer converts an input SQL statement into an internal intermediate representation (IR) graph depicting the query’s physical plan; 2) The canonicalization and optimization layer does IR-to-IR transformations; 3) The planning layer translates the IR graph generated in the previous layer into an operator plan; and 4) The execution layer generates an executor from the operator plan.

In the execution phase, the program generated by the compilation phase manages data conversion into the tensor format by calling the feeder operator, while also managing data movement to and from device memory and the scheduling of operators on the selected device.
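The two phases and four compilation layers described above can be pictured with a toy pipeline. The sketch below is a loose illustration under heavy assumptions, not TQP's code: every name in it (`IRNode`, `parse`, `optimize`, `plan`, `make_executor`, `feed`) is hypothetical, the parser accepts only one fixed query shape, and plain Python lists stand in for tensors so the example has no dependencies.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

Table = Dict[str, List[float]]  # column name -> column (stand-in for a tensor)


@dataclass
class IRNode:
    op: str    # "scan", "filter", or "sum"
    args: dict


def parse(sql: str) -> List[IRNode]:
    """Parsing layer (toy): only SELECT SUM(a) FROM t WHERE b > const."""
    pattern = r"SELECT SUM\((\w+)\) FROM (\w+) WHERE (\w+) > ([\d.]+)"
    m = re.fullmatch(pattern, sql.strip())
    if m is None:
        raise ValueError("unsupported query in this toy parser")
    agg_col, table, pred_col, const = m.groups()
    return [
        IRNode("scan", {"table": table}),
        IRNode("filter", {"column": pred_col, "gt": float(const)}),
        IRNode("sum", {"column": agg_col}),
    ]


def optimize(ir: List[IRNode]) -> List[IRNode]:
    """Canonicalization/optimization layer: IR-to-IR (identity placeholder)."""
    return list(ir)


def plan(ir: List[IRNode]) -> List[Callable]:
    """Planning layer: map each IR node to an operator over columns."""
    ops: List[Callable] = []
    for node in ir:
        if node.op == "scan":
            ops.append(lambda t: t)  # the toy scan ignores the table name
        elif node.op == "filter":
            def filt(t, col=node.args["column"], bound=node.args["gt"]):
                keep = [v > bound for v in t[col]]
                return {name: [x for x, k in zip(vals, keep) if k]
                        for name, vals in t.items()}
            ops.append(filt)
        elif node.op == "sum":
            ops.append(lambda t, col=node.args["column"]: sum(t[col]))
    return ops


def make_executor(ops: List[Callable]) -> Callable:
    """Execution layer: chain the operators into one callable program."""
    def run(table: Table):
        result = table
        for op in ops:
            result = op(result)
        return result
    return run


def feed(rows, columns) -> Table:
    """Feeder (toy): convert row-oriented input into columns."""
    return {name: [float(r[i]) for r in rows] for i, name in enumerate(columns)}


# Compilation phase: query -> executable program.
program = make_executor(plan(optimize(parse(
    "SELECT SUM(price) FROM sales WHERE qty > 2"))))
# Execution phase: convert the input data, then run the compiled program.
total = program(feed([(10, 1), (30, 3), (60, 6)], ["price", "qty"]))
print(total)
```

Keeping compilation separate from execution means the compiled program can be reused across inputs, with only the feeder step touching the raw data.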

In their empirical study, the team compared TQP with state-of-the-art query processing systems on different hardware settings. For CPU execution, they compared TQP against Apache Spark and DuckDB; for GPU execution, they compared it against popular open-source GPU databases BlazingSQL and OmniSciDB.

The results show that TQP achieves query execution time speedups of up to 20x over CPU-only systems and up to 5x over specialized GPU solutions. TQP also accelerates queries mixing ML predictions and SQL end-to-end, delivering up to 5x speedups over CPU baselines.

Overall, this work shows that the proposed TQP can take advantage of the innovations applied to TCRs and run efficiently on all supported hardware devices.

The paper Query Processing on Tensor Computation Runtimes is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
