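Overview

TPC-DS (Transaction Processing Performance Council - Decision Support) is a benchmark standard published by the Transaction Processing Performance Council (TPC) to evaluate the performance of decision support systems (DSS). Whereas TPC-H is better suited to measuring traditional query and reporting performance, TPC-DS also covers complex workloads such as analytical reporting on data sets, interactive queries, and data mining, which brings it closer to real-world data warehouse analytics.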

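This report presents test results for Yunqi Lakehouse (云器Lakehouse) and Spark SQL on the TPC-DS test set at the 10 TB scale. The conclusions are as follows:

  • In the comparison on the TPC-DS 10 TB data set, Yunqi Lakehouse shows a significant performance advantage over Spark: measured by total execution time across all queries (1,869.187 s versus 17,779.636 s), it is about 9.51 times as fast.
  • Yunqi Lakehouse delivers clear speedups on the queries that run longest on Spark; for example, query23b drops from 1,831.948 s to 95.845 s.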
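
The 9.51x figure can be reproduced from the two totals in the "sum" row of the results table. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not part of the original test harness.

```python
# Totals copied from the "sum" row of the results table, in seconds.
lakehouse_total_s = 1869.187
spark_total_s = 17779.636

# The speedup is defined here as Spark total time divided by Lakehouse total time.
speedup = spark_total_s / lakehouse_total_s

print(f"Overall speedup: {speedup:.2f}x")
```

Note that this is a ratio of summed run times, so it is dominated by the longest-running queries rather than being an average of per-query ratios.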

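Test Environment

  • Spark test environment
Item               Configuration
Servers            Hadoop cluster service. Master node: 1 Alibaba Cloud ECS server (ecs.g8i.xlarge, 4 vCPU, 16 GiB). Core nodes: 4 Alibaba Cloud ECS servers (ecs.g7.8xlarge, 32 vCPU, 128 GiB), each with 4 × 300 GiB ESSD cloud disks
Network bandwidth  16 Gbps
Software           Spark 3.4.2
Storage service    Alibaba Cloud OSS object storage
Data format        Parquet (default), Snappy compression
  • Yunqi Lakehouse test environment
Item               Configuration
Compute resources  Virtual Cluster: XLarge (equivalent to 128 vCores of compute)
Software           Yunqi Lakehouse service, Alibaba Cloud Shanghai region
Storage service    Managed storage, Alibaba Cloud OSS object storage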

Test Data

Table                   Rows
call_center             54
catalog_page            40,000
catalog_returns         1,440,033,112
catalog_sales           14,399,964,710
customer                65,000,000
customer_address        32,500,000
customer_demographics   1,920,800
date_dim                73,049
household_demographics  7,200
income_band             20
inventory               1,311,525,000
item                    402,000
promotion               2,000
reason                  70
ship_mode               20
store                   1,500
store_returns           2,880,015,149
store_sales             28,799,944,153
time_dim                86,400
warehouse               25
web_page                4,002
web_returns             720,020,485
web_sales               7,199,963,324
web_site                78
  • Statistics are collected on all tables with ANALYZE.

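Test Procedure

The test runs 103 complex SQL queries from the TPC-DS benchmark (queries 1 to 99, with queries 14, 23, 24, and 39 each split into an a and a b variant) against the 10 TB data set. The results report the execution time of every query on Yunqi Lakehouse and on Spark SQL, together with the ratio between the two.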

Spark

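The TPC-DS tables are created in the metastore service using the Parquet file format and the same partitioning as in the Lakehouse.

The TPC-DS 10 TB test data is exported from Yunqi Lakehouse and saved as data files on the object storage service, so that both systems are tested on identical data. Spark then reads the data files and writes them into the Spark-defined tables using INSERT INTO.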
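
The report does not show the load statements themselves. The sketch below illustrates one way the INSERT INTO step could be scripted for the unpartitioned dimension tables, using Spark SQL's ability to query Parquet files directly with the parquet.`path` form. The bucket path, the one-directory-per-table layout, and the helper name load_statement are assumptions made for illustration only. Because SELECT * maps columns by position, the partitioned fact tables may need an explicit column list instead.

```python
# Hypothetical location of the Parquet files exported from the Lakehouse.
EXPORT_ROOT = "oss://example-bucket/tpcds_10t"

# The unpartitioned dimension tables defined in this report.
DIMENSION_TABLES = [
    "call_center", "catalog_page", "customer", "customer_address",
    "customer_demographics", "date_dim", "household_demographics",
    "income_band", "item", "promotion", "reason", "ship_mode", "store",
    "time_dim", "warehouse", "web_page", "web_site",
]


def load_statement(table: str, root: str = EXPORT_ROOT) -> str:
    """Build a Spark SQL statement that copies one exported directory into a table.

    Assumes the exported files for each table sit under <root>/<table>.
    """
    return f"INSERT INTO {table} SELECT * FROM parquet.`{root}/{table}`;"


for table in DIMENSION_TABLES:
    print(load_statement(table))
```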

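  • When running the 103 TPC-DS queries, the following setting was added for Spark:
-- One of the parameters that must be tuned for large Spark production jobs. At the TPC-DS 10 TB scale, the default of 200 shuffle partitions is too few: each task holds too much data in memory and shuffle spill is triggered very easily, which slows Spark down. In testing, raising the value to 2000 reduced spill substantially, so 2000 was adopted to optimize Spark's performance.
set spark.sql.shuffle.partitions = 2000; -- Spark's default is 200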
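
To see why the partition count matters, it helps to look at how much shuffle data each partition has to hold on average. The sketch below uses a made-up shuffle volume of 1024 GiB for a single heavy stage and assumes an even distribution; neither number was measured in this test.

```python
def gib_per_partition(shuffle_gib: float, partitions: int) -> float:
    """Average shuffle data per partition, assuming an even distribution."""
    return shuffle_gib / partitions


# Hypothetical shuffle volume for one heavy stage; not a measured value.
ASSUMED_SHUFFLE_GIB = 1024.0

for partitions in (200, 2000):
    size = gib_per_partition(ASSUMED_SHUFFLE_GIB, partitions)
    print(f"{partitions} partitions -> {size:.2f} GiB per partition")
```

Under these assumptions each task handles about 5.12 GiB with 200 partitions but only about 0.51 GiB with 2000, which is consistent with the reduced spill reported above.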

Yunqi Lakehouse

Creating the Cluster and Tables

The test uses a Yunqi Lakehouse XLARGE VCluster on Alibaba Cloud OSS, and all tables use the default storage format.

create vcluster if not exists XLARGE_CLUSTER vcluster_size='XLARGE' vcluster_type='Analytics'  AUTO_RESUME=TRUE AUTO_SUSPEND_IN_SECOND=300 min_replicas=1 max_replicas=1;

Table Creation Statements

drop table if exists call_center;
drop table if exists catalog_page;
drop table if exists catalog_returns;
drop table if exists catalog_sales;
drop table if exists customer;
drop table if exists customer_address;
drop table if exists customer_demographics;
drop table if exists date_dim;
drop table if exists household_demographics;
drop table if exists income_band;
drop table if exists inventory;
drop table if exists item;
drop table if exists promotion;
drop table if exists reason;
drop table if exists ship_mode;
drop table if exists store;
drop table if exists store_returns;
drop table if exists store_sales;
drop table if exists time_dim;
drop table if exists warehouse;
drop table if exists web_page;
drop table if exists web_returns;
drop table if exists web_sales;
drop table if exists web_site;

create table if not exists catalog_sales
(
      cs_sold_date_sk          int,
      cs_sold_time_sk          int,
      cs_ship_date_sk          int,
      cs_bill_customer_sk      int,
      cs_bill_cdemo_sk         int,
      cs_bill_hdemo_sk         int,
      cs_bill_addr_sk          int,
      cs_ship_customer_sk      int,
      cs_ship_cdemo_sk         int,
      cs_ship_hdemo_sk         int,
      cs_ship_addr_sk          int,
      cs_call_center_sk        int,
      cs_catalog_page_sk       int,
      cs_ship_mode_sk          int,
      cs_warehouse_sk          int,
      cs_item_sk               int,
      cs_promo_sk              int,
      cs_order_number          long,
      cs_quantity              int,
      cs_wholesale_cost        decimal(7,2),
      cs_list_price            decimal(7,2),
      cs_sales_price           decimal(7,2),
      cs_ext_discount_amt      decimal(7,2),
      cs_ext_sales_price       decimal(7,2),
      cs_ext_wholesale_cost    decimal(7,2),
      cs_ext_list_price        decimal(7,2),
      cs_ext_tax               decimal(7,2),
      cs_coupon_amt            decimal(7,2),
      cs_ext_ship_cost         decimal(7,2),
      cs_net_paid              decimal(7,2),
      cs_net_paid_inc_tax      decimal(7,2),
      cs_net_paid_inc_ship     decimal(7,2),
      cs_net_paid_inc_ship_tax decimal(7,2),
      cs_net_profit            decimal(7,2)
)  partitioned by (cs_sold_date_sk);

create table if not exists catalog_returns
(
      cr_returned_date_sk      int,
      cr_returned_time_sk      int,
      cr_item_sk               int,
      cr_refunded_customer_sk  int,
      cr_refunded_cdemo_sk     int,
      cr_refunded_hdemo_sk     int,
      cr_refunded_addr_sk      int,
      cr_returning_customer_sk int,
      cr_returning_cdemo_sk    int,
      cr_returning_hdemo_sk    int,
      cr_returning_addr_sk     int,
      cr_call_center_sk        int,
      cr_catalog_page_sk       int,
      cr_ship_mode_sk          int,
      cr_warehouse_sk          int,
      cr_reason_sk             int,
      cr_order_number          long,
      cr_return_quantity       int,
      cr_return_amount         decimal(7,2),
      cr_return_tax            decimal(7,2),
      cr_return_amt_inc_tax    decimal(7,2),
      cr_fee                   decimal(7,2),
      cr_return_ship_cost      decimal(7,2),
      cr_refunded_cash         decimal(7,2),
      cr_reversed_charge       decimal(7,2),
      cr_store_credit          decimal(7,2),
      cr_net_loss              decimal(7,2)
)  partitioned by (cr_returned_date_sk);

create table if not exists inventory
(
  inv_date_sk          int,
  inv_item_sk          int,
  inv_warehouse_sk     int,
  inv_quantity_on_hand int
)  partitioned by (inv_date_sk);

create table if not exists store_sales
(
  ss_sold_date_sk        int,
  ss_sold_time_sk        int,
  ss_item_sk             int,
  ss_customer_sk         int,
  ss_cdemo_sk            int,
  ss_hdemo_sk            int,
  ss_addr_sk             int,
  ss_store_sk            int,
  ss_promo_sk            int,
  ss_ticket_number       long,
  ss_quantity            int,
  ss_wholesale_cost      decimal(7,2),
  ss_list_price          decimal(7,2),
  ss_sales_price         decimal(7,2),
  ss_ext_discount_amt    decimal(7,2),
  ss_ext_sales_price     decimal(7,2),
  ss_ext_wholesale_cost  decimal(7,2),
  ss_ext_list_price      decimal(7,2),
  ss_ext_tax             decimal(7,2),
  ss_coupon_amt          decimal(7,2),
  ss_net_paid            decimal(7,2),
  ss_net_paid_inc_tax    decimal(7,2),
  ss_net_profit          decimal(7,2)
)  partitioned by (ss_sold_date_sk);

create table if not exists store_returns
(
  sr_returned_date_sk    int,
  sr_return_time_sk      int,
  sr_item_sk             int,
  sr_customer_sk         int,
  sr_cdemo_sk            int,
  sr_hdemo_sk            int,
  sr_addr_sk             int,
  sr_store_sk            int,
  sr_reason_sk           int,
  sr_ticket_number       long,
  sr_return_quantity     int,
  sr_return_amt          decimal(7,2),
  sr_return_tax          decimal(7,2),
  sr_return_amt_inc_tax  decimal(7,2),
  sr_fee                 decimal(7,2),
  sr_return_ship_cost    decimal(7,2),
  sr_refunded_cash       decimal(7,2),
  sr_reversed_charge     decimal(7,2),
  sr_store_credit        decimal(7,2),
  sr_net_loss            decimal(7,2)
)  partitioned by (sr_returned_date_sk);

create table if not exists web_sales
(
  ws_sold_date_sk          int,
  ws_sold_time_sk          int,
  ws_ship_date_sk          int,
  ws_item_sk               int,
  ws_bill_customer_sk      int,
  ws_bill_cdemo_sk         int,
  ws_bill_hdemo_sk         int,
  ws_bill_addr_sk          int,
  ws_ship_customer_sk      int,
  ws_ship_cdemo_sk         int,
  ws_ship_hdemo_sk         int,
  ws_ship_addr_sk          int,
  ws_web_page_sk           int,
  ws_web_site_sk           int,
  ws_ship_mode_sk          int,
  ws_warehouse_sk          int,
  ws_promo_sk              int,
  ws_order_number          long,
  ws_quantity              int,
  ws_wholesale_cost        decimal(7,2),
  ws_list_price            decimal(7,2),
  ws_sales_price           decimal(7,2),
  ws_ext_discount_amt      decimal(7,2),
  ws_ext_sales_price       decimal(7,2),
  ws_ext_wholesale_cost    decimal(7,2),
  ws_ext_list_price        decimal(7,2),
  ws_ext_tax               decimal(7,2),
  ws_coupon_amt            decimal(7,2),
  ws_ext_ship_cost         decimal(7,2),
  ws_net_paid              decimal(7,2),
  ws_net_paid_inc_tax      decimal(7,2),
  ws_net_paid_inc_ship     decimal(7,2),
  ws_net_paid_inc_ship_tax decimal(7,2),
  ws_net_profit            decimal(7,2)
) partitioned by (ws_sold_date_sk);

create table if not exists web_returns
(
  wr_returned_date_sk      int,
  wr_returned_time_sk      int,
  wr_item_sk               int,
  wr_refunded_customer_sk  int,
  wr_refunded_cdemo_sk     int,
  wr_refunded_hdemo_sk     int,
  wr_refunded_addr_sk      int,
  wr_returning_customer_sk int,
  wr_returning_cdemo_sk    int,
  wr_returning_hdemo_sk    int,
  wr_returning_addr_sk     int,
  wr_web_page_sk           int,
  wr_reason_sk             int,
  wr_order_number          long,
  wr_return_quantity       int,
  wr_return_amt            decimal(7,2),
  wr_return_tax            decimal(7,2),
  wr_return_amt_inc_tax    decimal(7,2),
  wr_fee                   decimal(7,2),
  wr_return_ship_cost      decimal(7,2),
  wr_refunded_cash         decimal(7,2),
  wr_reversed_charge       decimal(7,2),
  wr_account_credit        decimal(7,2),
  wr_net_loss              decimal(7,2)
)  partitioned by (wr_returned_date_sk);

create table if not exists call_center
(
  cc_call_center_sk        int,
  cc_call_center_id        string,
  cc_rec_start_date        date,
  cc_rec_end_date          date,
  cc_closed_date_sk        int,
  cc_open_date_sk          int,
  cc_name                  string,
  cc_class                 string,
  cc_employees             int,
  cc_sq_ft                 int,
  cc_hours                 string,
  cc_manager               string,
  cc_mkt_id                int,
  cc_mkt_class             string,
  cc_mkt_desc              string,
  cc_market_manager        string,
  cc_division              int,
  cc_division_name         string,
  cc_company               int,
  cc_company_name          string,
  cc_street_number         string,
  cc_street_name           string,
  cc_street_type           string,
  cc_suite_number          string,
  cc_city                  string,
  cc_county                string,
  cc_state                 string,
  cc_zip                   string,
  cc_country               string,
  cc_gmt_offset            decimal(5,2),
  cc_tax_percentage        decimal(5,2)
);

create table if not exists catalog_page (
  cp_catalog_page_sk       int,
  cp_catalog_page_id       string,
  cp_start_date_sk         int,
  cp_end_date_sk           int,
  cp_department            string,
  cp_catalog_number        int,
  cp_catalog_page_number   int,
  cp_description           string,
  cp_type                  string) ;

create table if not exists customer (
  c_customer_sk             int,
  c_customer_id             string,
  c_current_cdemo_sk        int,
  c_current_hdemo_sk        int,
  c_current_addr_sk         int,
  c_first_shipto_date_sk    int,
  c_first_sales_date_sk     int,
  c_salutation              string,
  c_first_name              string,
  c_last_name               string,
  c_preferred_cust_flag     string,
  c_birth_day               int,
  c_birth_month             int,
  c_birth_year              int,
  c_birth_country           string,
  c_login                   string,
  c_email_address           string,
  c_last_review_date        string) ;

create table if not exists customer_address (
  ca_address_sk             int,
  ca_address_id             string,
  ca_street_number          string,
  ca_street_name            string,
  ca_street_type            string,
  ca_suite_number           string,
  ca_city                   string,
  ca_county                 string,
  ca_state                  string,
  ca_zip                    string,
  ca_country                string,
  ca_gmt_offset             decimal(5,2),
  ca_location_type          string) ;

create table if not exists customer_demographics (
  cd_demo_sk                int,
  cd_gender                 string,
  cd_marital_status         string,
  cd_education_status       string,
  cd_purchase_estimate      int,
  cd_credit_rating          string,
  cd_dep_count              int,
  cd_dep_employed_count     int,
  cd_dep_college_count      int) ;

create table if not exists date_dim (
  d_date_sk                 int,
  d_date_id                 string,
  d_date                    date,
  d_month_seq               int,
  d_week_seq                int,
  d_quarter_seq             int,
  d_year                    int,
  d_dow                     int,
  d_moy                     int,
  d_dom                     int,
  d_qoy                     int,
  d_fy_year                 int,
  d_fy_quarter_seq          int,
  d_fy_week_seq             int,
  d_day_name                string,
  d_quarter_name            string,
  d_holiday                 string,
  d_weekend                 string,
  d_following_holiday       string,
  d_first_dom               int,
  d_last_dom                int,
  d_same_day_ly             int,
  d_same_day_lq             int,
  d_current_day             string,
  d_current_week            string,
  d_current_month           string,
  d_current_quarter         string,
  d_current_year            string) ;

create table if not exists household_demographics (
  hd_demo_sk                int,
  hd_income_band_sk         int,
  hd_buy_potential          string,
  hd_dep_count              int,
  hd_vehicle_count          int) ;

create table if not exists income_band (
  ib_income_band_sk         int,
  ib_lower_bound            int,
  ib_upper_bound            int) using parquet ;

create table if not exists item (
  i_item_sk                 int,
  i_item_id                 string,
  i_rec_start_date          date,
  i_rec_end_date            date,
  i_item_desc               string,
  i_current_price           decimal(7,2),
  i_wholesale_cost          decimal(7,2),
  i_brand_id                int,
  i_brand                   string,
  i_class_id                int,
  i_class                   string,
  i_category_id             int,
  i_category                string,
  i_manufact_id             int,
  i_manufact                string,
  i_size                    string,
  i_formulation             string,
  i_color                   string,
  i_units                   string,
  i_container               string,
  i_manager_id              int,
  i_product_name            string) ;

create table if not exists promotion (
  p_promo_sk                int,
  p_promo_id                string,
  p_start_date_sk           int,
  p_end_date_sk             int,
  p_item_sk                 int,
  p_cost                    decimal(15,2),
  p_response_target         int,
  p_promo_name              string,
  p_channel_dmail           string,
  p_channel_email           string,
  p_channel_catalog         string,
  p_channel_tv              string,
  p_channel_radio           string,
  p_channel_press           string,
  p_channel_event           string,
  p_channel_demo            string,
  p_channel_details         string,
  p_purpose                 string,
  p_discount_active         string) ;

create table if not exists reason (
  r_reason_sk               int,
  r_reason_id               string,
  r_reason_desc             string) ;

create table if not exists ship_mode (
  sm_ship_mode_sk           int,
  sm_ship_mode_id           string,
  sm_type                   string,
  sm_code                   string,
  sm_carrier                string,
  sm_contract               string) ;

create table if not exists store (
  s_store_sk                int,
  s_store_id                string,
  s_rec_start_date          date,
  s_rec_end_date            date,
  s_closed_date_sk          int,
  s_store_name              string,
  s_number_employees        int,
  s_floor_space             int,
  s_hours                   string,
  s_manager                 string,
  s_market_id               int,
  s_geography_class         string,
  s_market_desc             string,
  s_market_manager          string,
  s_division_id             int,
  s_division_name           string,
  s_company_id              int,
  s_company_name            string,
  s_street_number           string,
  s_street_name             string,
  s_street_type             string,
  s_suite_number            string,
  s_city                    string,
  s_county                  string,
  s_state                   string,
  s_zip                     string,
  s_country                 string,
  s_gmt_offset              decimal(5,2),
  s_tax_precentage          decimal(5,2)) ;

create table if not exists time_dim (
  t_time_sk                 int,
  t_time_id                 string,
  t_time                    int,
  t_hour                    int,
  t_minute                  int,
  t_second                  int,
  t_am_pm                   string,
  t_shift                   string,
  t_sub_shift               string,
  t_meal_time               string) ;

create table if not exists warehouse (
  w_warehouse_sk           int,
  w_warehouse_id           string,
  w_warehouse_name         string,
  w_warehouse_sq_ft        int,
  w_street_number          string,
  w_street_name            string,
  w_street_type            string,
  w_suite_number           string,
  w_city                   string,
  w_county                 string,
  w_state                  string,
  w_zip                    string,
  w_country                string,
  w_gmt_offset             decimal(5,2)) ;

create table if not exists web_page (
  wp_web_page_sk           int,
  wp_web_page_id           string,
  wp_rec_start_date        date,
  wp_rec_end_date          date,
  wp_creation_date_sk      int,
  wp_access_date_sk        int,
  wp_autogen_flag          string,
  wp_customer_sk           int,
  wp_url                   string,
  wp_type                  string,
  wp_char_count            int,
  wp_link_count            int,
  wp_image_count           int,
  wp_max_ad_count          int) ;

create table if not exists web_site (
  web_site_sk              int,
  web_site_id              string,
  web_rec_start_date       date,
  web_rec_end_date         date,
  web_name                 string,
  web_open_date_sk         int,
  web_close_date_sk        int,
  web_class                string,
  web_manager              string,
  web_mkt_id               int,
  web_mkt_class            string,
  web_mkt_desc             string,
  web_market_manager       string,
  web_company_id           int,
  web_company_name         string,
  web_street_number        string,
  web_street_name          string,
  web_street_type          string,
  web_suite_number         string,
  web_city                 string,
  web_county               string,
  web_state                string,
  web_zip                  string,
  web_country              string,
  web_gmt_offset           decimal(5,2),
  web_tax_percentage       decimal(5,2)) ;

analyze table call_center compute statistics for all columns;
analyze table catalog_page compute statistics for all columns;
analyze table catalog_returns compute statistics for all columns;
analyze table catalog_sales compute statistics for all columns;
analyze table customer compute statistics for all columns;
analyze table customer_address compute statistics for all columns;
analyze table customer_demographics compute statistics for all columns;
analyze table date_dim compute statistics for all columns;
analyze table household_demographics compute statistics for all columns;
analyze table income_band compute statistics for all columns;
analyze table inventory compute statistics for all columns;
analyze table item compute statistics for all columns;
analyze table promotion compute statistics for all columns;
analyze table reason compute statistics for all columns;
analyze table ship_mode compute statistics for all columns;
analyze table store compute statistics for all columns;
analyze table store_returns compute statistics for all columns;
analyze table store_sales compute statistics for all columns;
analyze table time_dim compute statistics for all columns;
analyze table warehouse compute statistics for all columns;
analyze table web_page compute statistics for all columns;
analyze table web_returns compute statistics for all columns;
analyze table web_sales compute statistics for all columns;
analyze table web_site compute statistics for all columns;
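
The 24 ANALYZE statements above follow a single pattern, so they can be generated from the table list rather than maintained by hand. This is a small sketch of that idea; the helper name analyze_statement is illustrative.

```python
# The 24 TPC-DS tables, in the same order as the statements above.
TABLES = [
    "call_center", "catalog_page", "catalog_returns", "catalog_sales",
    "customer", "customer_address", "customer_demographics", "date_dim",
    "household_demographics", "income_band", "inventory", "item",
    "promotion", "reason", "ship_mode", "store", "store_returns",
    "store_sales", "time_dim", "warehouse", "web_page", "web_returns",
    "web_sales", "web_site",
]


def analyze_statement(table: str) -> str:
    """Return the statistics-collection statement for one table."""
    return f"analyze table {table} compute statistics for all columns;"


statements = [analyze_statement(table) for table in TABLES]
print("\n".join(statements))
```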

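Running the Queries

The 103 TPC-DS test queries: TPC-DS-Query-SQL

Test Results

The following are the performance results for Yunqi Lakehouse and Spark SQL on the 103 queries. Times are in seconds (s); lower values indicate better performance. The last column is the Spark SQL time divided by the Yunqi Lakehouse time.

  • For every query, the time of the first run is reported.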
Query      Yunqi Lakehouse (s)  Spark SQL (s)  Spark / Lakehouse
query1     4.443                19.862         4.470402881
query2     36.636               150.416        4.105688394
query3     11.734               23.39          1.99335265
query4     92.902               642.398        6.914791931
query5     14.756               163.489        11.07949309
query6     1.892                6.562          3.468287526
query7     22.481               58.778         2.614563409
query8     4.55                 16.04          3.525274725
query9     44.262               643.991        14.54952329
query10    1.999                50.347         25.18609305
query11    38.772               238.735        6.157407407
query12    1.253                5.334          4.25698324
query13    11.418               67.102         5.876861097
query14a   88.878               490.051        5.513749184
query14b   68.34                477.127        6.981665203
query15    2.96                 11.923         4.028040541
query16    5.515                288.996        52.40181324
query17    8.452                66.575         7.876833885
query18    6.262                55.001         8.783296072
query19    2.704                10.693         3.954511834
query20    1.394                5.021          3.601865136
query21    0.68                 3.179          4.675
query22    11.084               9.145          0.825063154
query23a   98.722               1393.112       14.11146452
query23b   95.845               1831.948       19.11365225
query24a   34.641               925.881        26.72789469
query24b   30.553               943.611        30.8843976
query25    19.821               56.483         2.849654407
query26    2.931                33.09          11.28966223
query27    6.185                54.93          8.881164107
query28    30.606               802.205        26.21071032
query29    13.697               186.704        13.63101409
query30    3.153                19.232         6.099587694
query31    5.057                56.973         11.26616571
query32    1.577                7.139          4.526949905
query33    2.874                11.521         4.008698678
query34    4.012                27.371         6.822283151
query35    6.341                79.325         12.50985649
query36    12.823               72.549         5.657724401
query37    3.868                105.236        27.20682523
query38    15.221               152.692        10.03166678
query39a   1.321                10.08          7.630582892
query39b   0.967                8.013          8.286452947
query40    6.619                45.539         6.880042302
query41    0.117                1.111          9.495726496
query42    1.329                6.668          5.017306245
query43    2.993                21.24          7.096558637
query44    13.759               15.644         1.137001236
query45    1.987                12.449         6.265223956
query46    5.432                43.515         8.010861561
query47    21.831               133.503        6.115294764
query48    5.025                51.675         10.28358209
query49    13.327               74.877         5.618443761
query50    28.519               789.925        27.6982012
query51    17.596               64.181         3.647476699
query52    1.685                8.68           5.151335312
query53    2.149                32.213         14.98976268
query54    11.418               20.148         1.764582239
query55    0.485                7.111          14.66185567
query56    1.81                 12.856         7.102762431
query57    13.724               74.714         5.444039639
query58    1.425                7.478          5.247719298
query59    16.064               158.025        9.837213645
query60    2.959                21.837         7.37985806
query61    4.576                15.982         3.49256993
query62    5.258                33.378         6.34804108
query63    3.055                29.13          9.535188216
query64    32.014               663.722        20.73224214
query65    33.916               185.219        5.461109801
query66    7.19                 48.583         6.757023644
query67    186.6                451.875        2.421623794
query68    2.945                17.482         5.936162988
query69    2.083                22.042         10.5818531
query70    22.752               59.801         2.628384318
query71    3.034                21.498         7.085695452
query72    11.45                212.094        18.52349345
query73    1.146                9.69           8.455497382
query74    27.236               224.447        8.240820972
query75    46.678               385.93         8.267920648
query76    17.695               321.027        18.14224357
query77    1.804                13.863         7.6845898
query78    181.223              669.242        3.692919773
query79    4.042                28.374         7.019792182
query80    11.991               163.789        13.65932783
query81    3.2                  26.454         8.266875
query82    4.572                208.444        45.59142607
query83    0.795                6.253          7.865408805
query84    4.277                23.8           5.564648118
query85    6.826                46.928         6.874890126
query86    7.527                32.703         4.344758868
query87    15.751               159.444        10.12278585
query88    52.074               801.005        15.38205246
query89    3.389                34.134         10.07199764
query90    4.48                 66.709         14.89040179
query91    0.527                6.78           12.86527514
query92    1.621                5.989          3.694632943
query93    0.03                 0.866          28.86666667
query94    10.33                164.88         15.96127783
query95    49.464               381.528        7.713245997
query96    11.947               99.414         8.321252197
query97    30.497               178.751        5.861265042
query98    2.005                9.795          4.885286783
query99    9.352                62.952         6.731394354
sum        1869.187             17779.636      9.511962153
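
The ratio in the last column is the Spark SQL time divided by the Lakehouse time, so values below 1 mark queries where Spark was faster. The sketch below recomputes the ratio for a handful of rows copied from the table and picks out such cases; it deliberately uses only a sample of rows, and the names SAMPLE_ROWS and speedup are illustrative.

```python
# (query, lakehouse_seconds, spark_seconds) for a few rows copied from the table.
SAMPLE_ROWS = [
    ("query16", 5.515, 288.996),
    ("query22", 11.084, 9.145),
    ("query44", 13.759, 15.644),
    ("query67", 186.6, 451.875),
    ("query82", 4.572, 208.444),
]


def speedup(lakehouse_s: float, spark_s: float) -> float:
    """Spark time divided by Lakehouse time; above 1 means the Lakehouse was faster."""
    return spark_s / lakehouse_s


ratios = {name: speedup(lh, sp) for name, lh, sp in SAMPLE_ROWS}

# Queries in the sample where Spark finished first.
spark_faster = [name for name, ratio in ratios.items() if ratio < 1.0]

# Query in the sample with the largest Lakehouse advantage.
best = max(ratios, key=ratios.get)

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.2f}x")
print("Spark faster on:", spark_faster)
print("Largest speedup in sample:", best)
```

In the full table, query22 is the only query with a ratio below 1, and query16 has the largest ratio.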
