使用information_schema作业历史视图分析实践

概述

本指南帮助您使用sys.information_schema.job_history表分析ClickZetta系统的使用情况,了解资源消耗模式、性能瓶颈和优化机会。所有分析都基于SQL查询,无需额外工具。

数据源介绍

主要分析表

  • 表名: sys.information_schema.job_history
  • 用途: 记录系统中所有作业的执行历史
  • 权限: 需要对sys.information_schema的查询权限

关键字段说明

字段名数据类型说明
workspace_nameString工作空间名称
virtual_clusterString虚拟集群名称
job_idString作业唯一标识
execution_timeFloat作业执行时间(秒)
start_timeTimestamp作业开始时间
input_tablesString输入表信息(JSON格式)
input_bytesString读取的字节数
cache_hitString缓存命中的字节数
statusString作业执行状态

分析目标与方法

分析目标

  1. 资源使用分析: 识别最繁忙的workspace和virtual cluster
  2. 数据访问分析: 找出访问最频繁的表和数据读取模式
  3. 性能优化分析: 评估缓存命中率和查询效率
  4. 容量规划分析: 为资源扩容提供数据支撑

分析时间范围建议

  • 日常监控: 近7天数据
  • 周期性分析: 近30天数据
  • 长期趋势: 近90天数据

一、Workspace和Virtual Cluster忙闲程度分析

分析目的

识别系统中最繁忙的工作空间和虚拟集群,为资源分配和容量规划提供依据。

1.1 Workspace忙闲程度分析

查询目标: 按总执行时间排序,找出最忙碌的workspace

-- Workspace忙闲程度统计(近30天)
SELECT 
    workspace_name,
    COUNT(*) as job_count,                    -- 作业数量
    SUM(execution_time) as total_execution_time,  -- 总执行时间
    AVG(execution_time) as avg_execution_time,    -- 平均执行时间
    SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) as success_jobs,  -- 成功作业数
    SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) as failed_jobs,    -- 失败作业数
    ROUND(SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as success_rate -- 成功率
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
GROUP BY workspace_name
ORDER BY total_execution_time DESC;

1.2 Virtual Cluster忙闲程度分析

查询目标: 分析各虚拟集群的工作负载分布

-- Virtual Cluster忙闲程度统计(近30天)
SELECT 
    virtual_cluster,
    COUNT(*) as job_count,
    SUM(execution_time) as total_execution_time,
    AVG(execution_time) as avg_execution_time,
    MIN(execution_time) as min_execution_time,
    MAX(execution_time) as max_execution_time
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
  AND virtual_cluster IS NOT NULL
GROUP BY virtual_cluster
ORDER BY total_execution_time DESC;

结果示例

Virtual Cluster名称作业数量总执行时间(秒)平均执行时间(秒)最小执行时间(秒)最大执行时间(秒)
MET***_ETL_GP36,695996,551.8927.160.005745.531
DEFAULT338,797558,213.831.650.0063,825.289
CUS***_BILLING531,01445,493.620.090.003165.597
BI_ANALYSE49,1281,725.920.040.003104.061
VC_***_CAL80373.294.670.00760.184
MY_FIRST_VC140.650.050.0110.097
MY_SECOND_VC40.120.030.0150.072

1.3 按时间段分析工作负载

查询目标: 了解不同时间段的系统负载情况

-- 按小时统计作业分布
SELECT 
    HOUR(start_time) as hour_of_day,
    COUNT(*) as job_count,
    SUM(execution_time) as total_execution_time,
    AVG(execution_time) as avg_execution_time
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 7 DAY
GROUP BY HOUR(start_time)
ORDER BY hour_of_day;

结果示例

小时作业数量总执行时间(秒)平均执行时间(秒)
024,18918,479.990.76
123,82311,243.610.47
217,72112,227.460.69
319,74628,425.321.44
424,53512,300.860.50
828,22418,066.540.64
920,44327,761.991.36
1525,00429,525.281.18
1820,34329,472.921.45
2317,46111,217.910.64
-- 按工作日统计作业分布
SELECT 
    DAYOFWEEK(start_time) as day_of_week,
    CASE DAYOFWEEK(start_time)
        WHEN 1 THEN '周日'
        WHEN 2 THEN '周一'
        WHEN 3 THEN '周二'
        WHEN 4 THEN '周三'
        WHEN 5 THEN '周四'
        WHEN 6 THEN '周五'
        WHEN 7 THEN '周六'
    END as day_name,
    COUNT(*) as job_count,
    SUM(execution_time) as total_execution_time
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
GROUP BY DAYOFWEEK(start_time)
ORDER BY day_of_week;

结果示例

星期数星期名称作业数量总执行时间(秒)
1周日86,383162,597.21
2周一103,041172,924.33
3周二158,431276,514.79
4周三208,982322,615.64
5周四174,951278,444.13
6周五143,648238,794.32
7周六80,380150,478.52

二、表使用统计分析

分析目的

识别最常访问的表,分析数据读取模式,为表优化和索引策略提供指导。

2.1 最常访问的表统计

查询目标: 找出访问频率最高的表

-- 解析input_tables JSON并统计表访问情况
SELECT 
    GET_JSON_OBJECT(input_tables, '$.table[0].tableName') as table_name,
    CONCAT(
        GET_JSON_OBJECT(input_tables, '$.table[0].namespace[0]'), 
        '.', 
        GET_JSON_OBJECT(input_tables, '$.table[0].namespace[1]')
    ) as schema_name,
    COUNT(*) as access_count,
    SUM(CAST(input_bytes AS BIGINT)) as total_bytes_read,
    AVG(CAST(input_bytes AS BIGINT)) as avg_bytes_per_access,
    SUM(CAST(GET_JSON_OBJECT(input_tables, '$.table[0].record') AS BIGINT)) as total_records_read
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
  AND input_tables IS NOT NULL 
  AND input_tables != ''
  AND input_tables != '{"table":[]}'
  AND input_bytes > 0
GROUP BY 
    GET_JSON_OBJECT(input_tables, '$.table[0].tableName'),
    CONCAT(
        GET_JSON_OBJECT(input_tables, '$.table[0].namespace[0]'), 
        '.', 
        GET_JSON_OBJECT(input_tables, '$.table[0].namespace[1]')
    )
HAVING table_name IS NOT NULL
ORDER BY access_count DESC
LIMIT 20;

结果示例

表名Schema名称访问次数总读取字节数平均每次读取字节数总读取记录数
bil***_summary_mvmet_bill.bil_mv662,7147,815,536,374,23111,793,230521,718,965,089
vc_***_calculatemet***_bill.public65,837164,257,938,0612,494,9186,127,770,647
met***_events_allmet***_bill.raw8,78711,177,614,832,7141,272,063,000527,117,351,038
cli**gateway**_log_beginsto***_metering.public8,779110,104,760,84212,541,830198,025,739
sku_categorymet***_bill.sku_meta3,8531,734,507,214,974450,170,6001,029,852
bil***_compute_detail_mvmet_bill.bil_mv2,92897,644,902,29633,348,6706,685,089,232
vc_bil***_without_zd_detail_mvmet_bill.bil_mv1,473227,399,306,618154,378,3008,328,596,693
met***_details_allmet***_bill.raw1,4054,312,047,296,0073,069,073,000339,604,011,874
mv_vc_met***_detailsmet***_bill.public1,1858,165,515,4646,890,730856,688,041
sto_oss_bil_detail_mvmet_bill.bil_mv7484,350,578,5515,816,281945,398,941

2.2 数据读取量TOP表

查询目标: 找出读取数据量最大的表

-- 按数据读取量排序的表统计
SELECT 
    GET_JSON_OBJECT(input_tables, '$.table[0].tableName') as table_name,
    CONCAT(
        GET_JSON_OBJECT(input_tables, '$.table[0].namespace[0]'), 
        '.', 
        GET_JSON_OBJECT(input_tables, '$.table[0].namespace[1]')
    ) as schema_name,
    COUNT(*) as access_count,
    SUM(CAST(input_bytes AS BIGINT)) as total_bytes_read,
    SUM(CAST(input_bytes AS BIGINT)) / 1024 / 1024 / 1024 as total_gb_read,
    AVG(CAST(input_bytes AS BIGINT)) / 1024 / 1024 as avg_mb_per_access
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
  AND input_tables IS NOT NULL 
  AND input_tables != ''
  AND input_tables != '{"table":[]}'
  AND input_bytes > 0
GROUP BY 1, 2
HAVING table_name IS NOT NULL
ORDER BY total_bytes_read DESC
LIMIT 20;

结果示例

表名Schema名称访问次数总读取字节数总读取量(GB)平均每次读取(MB)
met***_events_allmet***_bill.raw8,78711,177,614,832,71410,409.971,213.13
bil***_summary_mvmet_bill.bil_mv662,7147,815,536,374,2317,278.7911.25
met***_details_allmet***_bill.raw1,4054,312,047,296,0074,015.912,926.90
sku_categorymet***_bill.sku_meta3,8531,734,507,214,9741,615.39429.32
dwd_cz_jobssys_meta_warehouse.inf_schema35387,223,640,942360.6310,551.01
vc_met***_detailsmet***_bill.public743371,186,266,727345.69476.43
vc_bil***_without_zd_detail_mvmet_bill.bil_mv1,473227,399,306,618211.78147.23
vc_***_calculatemet***_bill.public65,837164,257,938,061152.982.38
dim_stu***_instance_dmin_fmet_bill.stu_dw_tenant405130,022,636,178121.09306.17
ins***_account_mappingmet***_bill.public730118,682,911,967110.53155.05

2.3 表访问时间分布分析

查询目标: 分析表访问的时间模式

-- 分析主要表的访问时间分布
WITH top_tables AS (
    SELECT GET_JSON_OBJECT(input_tables, '$.table[0].tableName') as table_name
    FROM sys.information_schema.job_history 
    WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
      AND input_tables IS NOT NULL 
      AND input_tables != '{"table":[]}'
    GROUP BY 1
    ORDER BY COUNT(*) DESC
    LIMIT 5
)
SELECT 
    GET_JSON_OBJECT(h.input_tables, '$.table[0].tableName') as table_name,
    HOUR(h.start_time) as hour_of_day,
    COUNT(*) as access_count,
    SUM(CAST(h.input_bytes AS BIGINT)) / 1024 / 1024 as total_mb_read
FROM sys.information_schema.job_history h
JOIN top_tables t ON GET_JSON_OBJECT(h.input_tables, '$.table[0].tableName') = t.table_name
WHERE h.start_time >= CURRENT_DATE() - INTERVAL 7 DAY
GROUP BY 1, 2
ORDER BY table_name, hour_of_day;

结果示例

表名小时访问次数总读取量(MB)
bil***_summary_mv021,826238,651.21
bil***_summary_mv122,557251,626.92
bil***_summary_mv215,100173,747.51
bil***_summary_mv318,436216,057.70
bil***_summary_mv422,117249,271.86
bil***_summary_mv824,900286,801.29
bil***_summary_mv917,682207,026.35
bil***_summary_mv1519,234225,847.45
bil***_summary_mv1816,891198,234.12
bil***_summary_mv2314,567167,432.89

三、Cache命中率分析

分析目的

评估系统缓存效率,识别缓存优化机会,提升查询性能。

3.1 总体缓存命中率

查询目标: 计算系统整体的缓存命中率

-- 系统总体缓存命中率统计
SELECT 
    CASE 
        WHEN cache_hit = '0' OR cache_hit IS NULL THEN 'Cache Miss'
        ELSE 'Cache Hit'
    END as cache_status,
    COUNT(*) as job_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage,
    SUM(execution_time) as total_execution_time,
    AVG(execution_time) as avg_execution_time
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
GROUP BY 1
ORDER BY job_count DESC;

结果示例

缓存状态作业数量百分比(%)总执行时间(秒)平均执行时间(秒)
Cache Hit738,78477.29883,488.521.20
Cache Miss217,03222.71718,880.423.31

3.2 各Workspace缓存命中率

查询目标: 比较不同workspace的缓存使用效果

-- 各Workspace的缓存命中率分析
SELECT 
    workspace_name,
    SUM(CASE WHEN cache_hit != '0' AND cache_hit IS NOT NULL THEN 1 ELSE 0 END) as cache_hit_jobs,
    SUM(CASE WHEN cache_hit = '0' OR cache_hit IS NULL THEN 1 ELSE 0 END) as cache_miss_jobs,
    COUNT(*) as total_jobs,
    ROUND(SUM(CASE WHEN cache_hit != '0' AND cache_hit IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as cache_hit_rate,
    SUM(CAST(cache_hit AS BIGINT)) / 1024 / 1024 / 1024 as total_cache_gb
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
GROUP BY workspace_name
ORDER BY cache_hit_rate DESC;

结果示例

Workspace名称缓存命中作业数缓存未命中作业数总作业数缓存命中率(%)总缓存量(GB)
met***_n_bill732,157136,263868,42084.3112,336.17
sto***_metering6,29029,08235,37217.7836.20
cos***_analyse33751,66452,0010.6598.62
qui***_ws018180.000.00
cli***_sample_data0110.000.00
dev_envirment0440.000.00

3.3 缓存命中率趋势分析

查询目标: 观察缓存命中率的时间趋势

-- 按天统计缓存命中率趋势
SELECT 
    DATE(start_time) as date,
    SUM(CASE WHEN cache_hit != '0' AND cache_hit IS NOT NULL THEN 1 ELSE 0 END) as cache_hit_jobs,
    COUNT(*) as total_jobs,
    ROUND(SUM(CASE WHEN cache_hit != '0' AND cache_hit IS NOT NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as cache_hit_rate
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
GROUP BY DATE(start_time)
ORDER BY date;

结果示例

日期缓存命中作业数总作业数缓存命中率(%)
2025-04-2320,14526,83475.08
2025-04-2422,56728,90178.09
2025-04-2524,12331,24577.21
2025-04-2625,89033,12778.15
2025-04-2723,45630,23477.57
2025-04-2821,78928,56776.27
2025-04-2926,23434,12376.88
2025-04-3024,56731,89077.04
2025-05-0122,89029,56777.42
2025-05-0225,12332,45677.40

四、性能问题诊断查询

4.1 长时间运行的作业

查询目标: 识别执行时间异常长的作业

-- 查找长时间运行的作业
SELECT 
    job_id,
    workspace_name,
    virtual_cluster,
    job_type,
    execution_time,
    start_time,
    end_time,
    status,
    LEFT(job_text, 100) as job_text_preview
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 7 DAY
  AND execution_time > 300  -- 大于5分钟的作业
ORDER BY execution_time DESC
LIMIT 50;

结果示例

作业IDWorkspace名称Virtual Cluster作业类型执行时间(秒)开始时间状态作业预览
202505***96423met***_n_billMET***_ETL_GPSELECT3,825.292025-05-22 03:15:23SUCCESSSELECT SUM(CAST(record_count AS BIGINT)) as total_records, SUM(CAST(data_size AS BIGINT))...
202505***84521met***_n_billMET***_ETL_GPSELECT2,456.782025-05-21 15:42:11SUCCESSWITH billing_data AS (SELECT workspace_id, SUM(compute_time) FROM billing_summary...
202505***73941met***_n_billDEFAULTINSERT1,923.452025-05-20 09:33:47SUCCESSINSERT INTO meter SELECT event_id, workspace_id, timestamp, event_type...
202505***62847sto***_meteringDEFAULTSELECT1,567.232025-05-19 14:28:36FAILEDSELECT storage_type, bucket_name, SUM(storage_size) FROM sto***_usage WHERE date...
202505***51238met***_n_billBI_ANALYSESELECT1,234.562025-05-18 11:17:29SUCCESSSELECT DATE_TRUNC('hour', start_time) as hour, COUNT(*) as job_count FROM job_his...

4.2 失败作业分析

查询目标: 分析作业失败的模式和原因

-- 失败作业统计和分析
SELECT 
    workspace_name,
    virtual_cluster,
    job_type,
    COUNT(*) as failed_count,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as failure_percentage,
    LEFT(error_message, 100) as common_error
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 7 DAY
  AND status = 'FAILED'
GROUP BY workspace_name, virtual_cluster, job_type, LEFT(error_message, 100)
ORDER BY failed_count DESC
LIMIT 20;

结果示例

Workspace名称Virtual Cluster作业类型失败次数失败占比(%)常见错误信息
cos***_analyseBI_ANALYSESELECT1,24545.67CZLH-40000 Table 'cost_data.billing_temp' doesn't exist
met***_n_billDEFAULTINSERT56720.82CZLH-42000 Duplicate key error: PRIMARY KEY constraint violated
sto***_meteringDEFAULTSELECT2348.59CZLH-42000 Semantic analysis exception - cannot resolve column
met***_n_billMET***_ETL_GPUPDATE1565.73CZLH-41000 Lock timeout: Table locked by another transaction
cos***_analyseBI_ANALYSEDELETE893.27CZLH-43000 Syntax error: Invalid column reference 'unknown_column'

4.3 资源消耗TOP作业

查询目标: 找出资源消耗最大的作业类型

-- 高资源消耗作业分析
SELECT 
    job_type,
    workspace_name,
    COUNT(*) as job_count,
    SUM(execution_time) as total_execution_time,
    AVG(execution_time) as avg_execution_time,
    SUM(CAST(input_bytes AS BIGINT)) / 1024 / 1024 / 1024 as total_input_gb,
    AVG(CAST(input_bytes AS BIGINT)) / 1024 / 1024 as avg_input_mb
FROM sys.information_schema.job_history 
WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
  AND input_bytes > 0
GROUP BY job_type, workspace_name
ORDER BY total_execution_time DESC
LIMIT 20;

结果示例

作业类型Workspace名称作业数量总执行时间(秒)平均执行时间(秒)总输入量(GB)平均输入量(MB)
SELECTmet***_n_bill345,6781,234,567.893.5715,234.5645.67
INSERTmet***_n_bill67,890456,789.126.738,901.23135.45
UPDATEmet***_n_bill12,345234,567.8919.013,456.78289.34
DELETEcos***_analyse8,901123,456.7813.871,234.56142.78
CREATEsto***_metering2,34556,789.1224.21567.89249.12

五、实用分析模板

5.1 每日监控报告

-- 每日系统运行状况报告
SELECT 
    '总体概览' as metric_category,
    'Jobs Total' as metric_name,
    CAST(COUNT(*) AS STRING) as metric_value
FROM sys.information_schema.job_history 
WHERE DATE(start_time) = CURRENT_DATE() - INTERVAL 1 DAY

UNION ALL

SELECT 
    '总体概览',
    'Execution Time (Hours)',
    CAST(ROUND(SUM(execution_time) / 3600, 2) AS STRING)
FROM sys.information_schema.job_history 
WHERE DATE(start_time) = CURRENT_DATE() - INTERVAL 1 DAY

UNION ALL

SELECT 
    '总体概览',
    'Success Rate (%)',
    CAST(ROUND(SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS STRING)
FROM sys.information_schema.job_history 
WHERE DATE(start_time) = CURRENT_DATE() - INTERVAL 1 DAY

ORDER BY metric_category, metric_name;

结果示例

指标类别指标名称指标值
总体概览Execution Time (Hours)427.35
总体概览Jobs Total34,567
总体概览Success Rate (%)97.85

5.2 资源使用评估

-- 资源使用评估查询
WITH resource_summary AS (
    SELECT 
        workspace_name,
        COUNT(*) as jobs,
        SUM(execution_time) as total_time,
        SUM(CAST(input_bytes AS BIGINT)) as total_bytes
    FROM sys.information_schema.job_history 
    WHERE start_time >= CURRENT_DATE() - INTERVAL 30 DAY
    GROUP BY workspace_name
)
SELECT 
    workspace_name,
    jobs,
    ROUND(total_time / 3600, 2) as total_hours,
    ROUND(total_bytes / 1024 / 1024 / 1024, 2) as total_gb,
    ROUND(jobs * 100.0 / SUM(jobs) OVER(), 2) as job_percentage,
    ROUND(total_time * 100.0 / SUM(total_time) OVER(), 2) as time_percentage
FROM resource_summary
ORDER BY total_time DESC;

结果示例

Workspace名称作业数量总小时数总数据量(GB)作业占比(%)时间占比(%)
met***_n_bill868,420430.5424,567.8990.8596.74
sto***_metering35,37213.011,234.563.702.93
cos***_analyse52,0011.55567.235.440.35
qui***_ws180.000.010.000.00
dev_envirment40.000.020.000.00
cli***_sample_data10.000.000.000.00

分析频率建议

  • 每日监控: 执行总体概览和失败作业分析
  • 每周分析: 运行完整的忙闲程度和表使用分析
  • 每月评估: 进行缓存效率和资源规划分析

建议将常用查询保存为视图,便于重复使用。

优化行动指南

  • 高执行时间作业: 检查SQL优化机会,尝试用Dynamic Table的增量计算链路降低计算量,减少执行时间;
  • 低缓存命中率: 调整分析型计算集群的自动关机时间,在查询高峰期尽量不关机,避免丢失缓存;
  • 高频访问表: 考虑分区、索引优化;
  • 资源不均衡: 重新分配workspace内的计算集群资源规格,对频繁使用的计算集群,如果希望降低作业执行时间,可以适当扩容。

联系我们
预约咨询
微信咨询
电话咨询