Run a TPC-H 100 GB Performance Test in Three Minutes for the Price of a Bottle of Water

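TPC-H is a decision-support benchmark developed by the Transaction Processing Performance Council (TPC). It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. TPC-H models a real production environment by simulating the data warehouse of a sales system. This test uses 8 tables with a data size of 100 GB. A total of 22 queries were run, and the main performance metric is the response time of each query, that is, the time between submitting the query and receiving the result.

A TPC-H performance test is usually slow and expensive: you have to provision machines, prepare the data, and generate reports, which often takes several days and a considerable budget. ClickZetta Lakehouse (云器Lakehouse) offers virtual clusters that scale elastically within seconds, plus a shared TPC-H dataset, so both the resource preparation and the data preparation are eliminated. With the method described here, you can run a complete performance test on ClickZetta Lakehouse and view the comparison report in about 3 minutes. The code in this article runs in Zeppelin; to run it yourself, install Zeppelin by following the documentation.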
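The response-time metric described above (time from submitting a query to receiving its result) can be measured with a small wrapper. The following is a minimal sketch, not the harness used for this article: `time_queries` and the `run_query` callable are hypothetical names, and `run_query` stands for whatever function executes SQL against your cluster and blocks until the result is returned.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def time_queries(
    queries: Iterable[Tuple[str, str]],
    run_query: Callable[[str], object],
) -> Dict[str, float]:
    """Return the wall-clock response time in seconds for each named query.

    `run_query` is an assumed, caller-supplied function that submits one SQL
    statement and returns only after the full result is available.
    """
    timings: Dict[str, float] = {}
    for name, sql in queries:
        start = time.perf_counter()   # moment the query is submitted
        run_query(sql)                # blocks until the result is returned
        timings[name] = time.perf_counter() - start
    return timings


# Stand-in executor so the sketch runs without a database connection.
def fake_run(sql: str) -> None:
    time.sleep(0.01)


results = time_queries([("Q1", "select 1"), ("Q2", "select 2")], fake_run)
for name, seconds in results.items():
    print(f"{name}: {seconds:.3f} s")
print(f"total: {sum(results.values()):.3f} s")
```

Summing the per-query values gives the total query time, which is the kind of figure reported later in the test procedure.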

Create an Analytics (AP) Virtual Cluster for the Test


-- Create an analytics-type virtual compute cluster
create vcluster if not exists VC_TPCH_100GB vcluster_size='Large' vcluster_type='Analytics'  AUTO_RESUME=TRUE AUTO_SUSPEND_IN_SECOND=300 min_replicas=1 max_replicas=1 comment 'TPCH 100GB TEST';

use vcluster VC_TPCH_100GB;

Understand the Test Data

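This test uses the TPC-H 100 GB dataset. Because ClickZetta Lakehouse is built on an architecture that separates storage from compute, the dataset is shared as clickzetta_sample_data.tpch_100g and users can query it directly, which removes the data preparation step.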


-- Count the rows of each of the 8 TPC-H tables in the shared dataset
select 'customer' as tablename, count(*) as row_count from clickzetta_sample_data.tpch_100g.customer
union all select 'lineitem' as tablename, count(*) as row_count from clickzetta_sample_data.tpch_100g.lineitem
union all select 'nation' as tablename, count(*) as row_count from clickzetta_sample_data.tpch_100g.nation
union all select 'orders' as tablename, count(*) as row_count from clickzetta_sample_data.tpch_100g.orders
union all select 'part' as tablename, count(*) as row_count from clickzetta_sample_data.tpch_100g.part
union all select 'partsupp' as tablename, count(*) as row_count from clickzetta_sample_data.tpch_100g.partsupp
union all select 'region' as tablename, count(*) as row_count from clickzetta_sample_data.tpch_100g.region
union all select 'supplier' as tablename, count(*) as row_count from clickzetta_sample_data.tpch_100g.supplier;

Running the code above returns the row count of each table in the test dataset.
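As a sanity check on those row counts, the TPC-H specification defines each table's nominal cardinality as a function of the scale factor (SF), with 100 GB corresponding to SF = 100. The sketch below computes the nominal values; it assumes the shared dataset was generated at the standard SF 100, which the article does not state explicitly. `expected_row_counts` is a hypothetical helper name. The lineitem figure is only approximate, because the standard generator produces slightly more than 6,000,000 rows per unit of scale (600,037,902 at SF 100).

```python
# Nominal rows per unit of scale factor (SF), per the TPC-H specification.
ROWS_PER_SF = {
    "customer": 150_000,
    "lineitem": 6_000_000,  # approximate; the generated count differs slightly
    "orders": 1_500_000,
    "part": 200_000,
    "partsupp": 800_000,
    "supplier": 10_000,
}

# These two tables have a fixed size regardless of scale factor.
FIXED_ROWS = {"nation": 25, "region": 5}


def expected_row_counts(scale_factor: int) -> dict:
    """Return nominal TPC-H row counts per table, sorted by table name."""
    counts = {table: rows * scale_factor for table, rows in ROWS_PER_SF.items()}
    counts.update(FIXED_ROWS)
    return dict(sorted(counts.items()))


for table, rows in expected_row_counts(100).items():
    print(f"{table:<10}{rows:>15,}")
```

If the counts returned by the query differ substantially from these nominal values, the dataset was probably generated at a different scale factor.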

Test Procedure

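The whole process took less than 3 minutes, of which the 22 SQL queries ran for a total of 16.718 seconds. As for cost, this test used a Large virtual compute cluster, and 3 minutes on it costs roughly 1.38 yuan, so the test was completed for about the price of a bottle of water.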
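The figures above imply an average query time and an hourly rate, which the following sketch derives. It uses only the numbers quoted in this article (16.718 seconds, 22 queries, 1.38 yuan for 3 minutes). The hourly figure is simple extrapolation under the assumption of linear, time-based billing, not an official price, and the function names are illustrative.

```python
def average_query_seconds(total_seconds: float, query_count: int) -> float:
    """Mean response time per query."""
    return total_seconds / query_count


def implied_hourly_cost(cost: float, minutes: float) -> float:
    """Extrapolate a cost over `minutes` to one hour, assuming linear billing."""
    return cost / minutes * 60


print(f"average per query: {average_query_seconds(16.718, 22):.3f} s")
print(f"implied hourly cost: {implied_hourly_cost(1.38, 3):.2f} yuan")
```

That works out to roughly 0.76 seconds per query and about 27.6 yuan per hour for the Large cluster under these assumptions.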

Congratulations, it's done.

Please enjoy and learn more!

Appendix

Download the Zeppelin Notebook Source Files

02.Quick Start ClickZetta Lakehouse Benchmark with TPCH Sample Data..ipynb
02.Quick Start ClickZetta Lakehouse Benchmark with TPCH Sample Data_2JH4XMBUG.zpln
