SQL窗口函数(Window Function)

概述

窗口函数(Window Function)是云器Lakehouse SQL中一种强大的分析功能,它允许在一组相关行上执行计算,而不仅仅是针对单个行。通过使用WINDOW子句,可以定义一个或多个命名的窗口,然后在窗口函数中引用这些窗口,避免重复编写相同的窗口规范,使查询更加简洁和易维护。

语法

WINDOW <window_name> AS (<window_specification>, ...)

参数说明

  • window_name: 窗口的名称,必须是有效的标识符,不能与表名或列名冲突
  • window_specification: 窗口规范,定义窗口的范围和顺序,包含以下部分:
    • PARTITION BY: 指定数据分组方式,每个组是一个分区
    • ORDER BY: 指定每个分区内的数据排序方式
    • frame_clause: 指定窗口框架,限制窗口函数的计算范围

窗口框架类型

  1. ROWS Frame: 基于行数的窗口框架
  2. RANGE Frame: 基于值范围的窗口框架

环境准备

创建测试表

CREATE TABLE IF NOT EXISTS sales(
    month INT,
    sales INT,
    profit DOUBLE
);

插入测试数据

INSERT INTO sales (month, sales, profit) VALUES
(1, 100, 0.1),
(2, 120, 0.15),
(3, 80, 0.05),
(4, 150, 0.2),
(5, 90, 0.1),
(6, 110, 0.12);

使用示例

示例1:移动窗口聚合 - 计算移动和与移动平均

业务场景: 计算当前月及前两个月的销售总和与平均值

SELECT month, sales, 
       SUM(sales) OVER w AS sum_sales, 
       AVG(sales) OVER w AS avg_sales
FROM sales
WINDOW w AS (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);

执行结果:

+-------+-------+-----------+--------------------+
| month | sales | sum_sales | avg_sales          |
+-------+-------+-----------+--------------------+
| 1     | 100   | 100       | 100.0              |
| 2     | 120   | 220       | 110.0              |
| 3     | 80    | 300       | 100.0              |
| 4     | 150   | 350       | 116.67             |
| 5     | 90    | 320       | 106.67             |
| 6     | 110   | 350       | 116.67             |
+-------+-------+-----------+--------------------+

说明: 窗口w定义了一个滑动窗口,包含当前行和前2行,因此每行都会计算最近3个月的销售总和与平均值。


示例2:多窗口排名 - 季度排名与年度排名

业务场景: 同时计算销售额在季度内和全年的排名

SELECT month, sales, 
       RANK() OVER w1 AS rank_quarter, 
       ROW_NUMBER() OVER w2 AS rank_year
FROM sales
WINDOW w1 AS (PARTITION BY CEIL(month / 3) ORDER BY sales DESC),
       w2 AS (ORDER BY sales DESC);

执行结果:

+-------+-------+--------------+-----------+
| month | sales | rank_quarter | rank_year |
+-------+-------+--------------+-----------+
| 4     | 150   | 1            | 1         |
| 2     | 120   | 1            | 2         |
| 6     | 110   | 2            | 3         |
| 1     | 100   | 2            | 4         |
| 5     | 90    | 3            | 5         |
| 3     | 80    | 3            | 6         |
+-------+-------+--------------+-----------+

说明:

  • w1窗口按季度(每3个月)分区,计算各季度内的销售排名
  • w2窗口不分区,计算全年的销售排名
  • 使用CEIL(month/3)将月份分为Q1(1-3月)、Q2(4-6月)等

示例3:累积聚合 - 从年初至今的统计

业务场景: 计算从年初到当前月的累积销售额、平均利润率等

SELECT month, sales, profit,
       SUM(sales) OVER w AS cumulative_sales,
       AVG(profit) OVER w AS avg_profit,
       MAX(sales) OVER w AS max_sales,
       MIN(sales) OVER w AS min_sales
FROM sales
WINDOW w AS (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);

执行结果:

+-------+-------+--------+------------------+------------+-----------+-----------+
| month | sales | profit | cumulative_sales | avg_profit | max_sales | min_sales |
+-------+-------+--------+------------------+------------+-----------+-----------+
| 1     | 100   | 0.1    | 100              | 0.1        | 100       | 100       |
| 2     | 120   | 0.15   | 220              | 0.125      | 120       | 100       |
| 3     | 80    | 0.05   | 300              | 0.1        | 120       | 80        |
| 4     | 150   | 0.2    | 450              | 0.125      | 150       | 80        |
| 5     | 90    | 0.1    | 540              | 0.12       | 150       | 80        |
| 6     | 110   | 0.12   | 650              | 0.12       | 150       | 80        |
+-------+-------+--------+------------------+------------+-----------+-----------+

说明: UNBOUNDED PRECEDING表示从分区的第一行开始,CURRENT ROW表示到当前行结束,实现累积计算。


示例4:时间序列分析 - 环比数据对比

业务场景: 查看每月销售额与上月、下月的对比及环比变化

SELECT month, sales,
       LAG(sales, 1) OVER w AS prev_sales,
       LEAD(sales, 1) OVER w AS next_sales,
       sales - LAG(sales, 1) OVER w AS sales_change
FROM sales
WINDOW w AS (ORDER BY month);

执行结果:

+-------+-------+------------+------------+--------------+
| month | sales | prev_sales | next_sales | sales_change |
+-------+-------+------------+------------+--------------+
| 1     | 100   | NULL       | 120        | NULL         |
| 2     | 120   | 100        | 80         | 20           |
| 3     | 80    | 120        | 150        | -40          |
| 4     | 150   | 80         | 90         | 70           |
| 5     | 90    | 150        | 110        | -60          |
| 6     | 110   | 90         | NULL       | 20           |
+-------+-------+------------+------------+--------------+

说明:

  • LAG(sales, 1): 获取上一行的销售额
  • LEAD(sales, 1): 获取下一行的销售额
  • 第一行和最后一行的prev_sales和next_sales为NULL

示例5:多种排名函数对比

业务场景: 理解不同排名函数的差异

SELECT month, sales,
       DENSE_RANK() OVER w AS dense_rank,
       RANK() OVER w AS rank,
       ROW_NUMBER() OVER w AS row_num,
       PERCENT_RANK() OVER w AS percent_rank
FROM sales
WINDOW w AS (ORDER BY sales DESC);

执行结果:

+-------+-------+------------+------+---------+--------------+
| month | sales | dense_rank | rank | row_num | percent_rank |
+-------+-------+------------+------+---------+--------------+
| 4     | 150   | 1          | 1    | 1       | 0.0          |
| 2     | 120   | 2          | 2    | 2       | 0.2          |
| 6     | 110   | 3          | 3    | 3       | 0.4          |
| 1     | 100   | 4          | 4    | 4       | 0.6          |
| 5     | 90    | 5          | 5    | 5       | 0.8          |
| 3     | 80    | 6          | 6    | 6       | 1.0          |
+-------+-------+------------+------+---------+--------------+

排名函数说明:

  • ROW_NUMBER(): 连续唯一的行号,即使值相同也不会并列
  • RANK(): 相同值得到相同排名,下一个排名会跳过
  • DENSE_RANK(): 相同值得到相同排名,下一个排名连续
  • PERCENT_RANK(): 返回0到1之间的百分比排名

示例6:位置函数 - 访问特定位置的值

业务场景: 获取窗口内第一个、最后一个和指定位置的值

SELECT month, sales, profit,
       FIRST_VALUE(sales) OVER w AS first_sales,
       LAST_VALUE(sales) OVER w AS last_sales,
       NTH_VALUE(sales, 2) OVER w AS second_sales
FROM sales
WINDOW w AS (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

执行结果:

+-------+-------+--------+-------------+------------+--------------+
| month | sales | profit | first_sales | last_sales | second_sales |
+-------+-------+--------+-------------+------------+--------------+
| 1     | 100   | 0.1    | 100         | 110        | 120          |
| 2     | 120   | 0.15   | 100         | 110        | 120          |
| 3     | 80    | 0.05   | 100         | 110        | 120          |
| 4     | 150   | 0.2    | 100         | 110        | 120          |
| 5     | 90    | 0.1    | 100         | 110        | 120          |
| 6     | 110   | 0.12   | 100         | 110        | 120          |
+-------+-------+--------+-------------+------------+--------------+

说明:

  • FIRST_VALUE(): 返回窗口中第一行的值
  • LAST_VALUE(): 返回窗口中最后一行的值(需要UNBOUNDED FOLLOWING才能看到真正的最后一行)
  • NTH_VALUE(expr, n): 返回窗口中第n行的值

示例7:RANGE窗口框架 - 基于值范围的窗口

业务场景: 找出销售额在±10范围内的所有记录并统计

SELECT month, sales,
       SUM(sales) OVER w AS range_sum,
       COUNT(*) OVER w AS range_count
FROM sales
WINDOW w AS (ORDER BY sales RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING);

执行结果:

+-------+-------+-----------+-------------+
| month | sales | range_sum | range_count |
+-------+-------+-----------+-------------+
| 3     | 80    | 170       | 2           |
| 5     | 90    | 270       | 3           |
| 1     | 100   | 300       | 3           |
| 6     | 110   | 330       | 3           |
| 2     | 120   | 230       | 2           |
| 4     | 150   | 150       | 1           |
+-------+-------+-----------+-------------+

说明:

  • RANGE窗口基于值的范围而非行数
  • 对于sales=100的行,窗口包含销售额在[90, 110]范围内的所有行
  • 对于sales=150的行,只有它自己在[140, 160]范围内

示例8:综合分析 - 多维度业务报表(高级)

业务场景: 生成包含排名、累积、环比和占比的综合分析报表

SELECT 
    month,
    sales,
    -- 排名相关
    ROW_NUMBER() OVER w1 AS row_num,
    RANK() OVER w1 AS sales_rank,
    -- 累积统计
    SUM(sales) OVER w2 AS ytd_sales,
    AVG(sales) OVER w2 AS ytd_avg,
    -- 环比对比
    LAG(sales) OVER w1 AS prev_month_sales,
    sales - LAG(sales) OVER w1 AS mom_change,
    -- 占比分析
    ROUND(sales * 100.0 / SUM(sales) OVER w3, 2) AS pct_of_total
FROM sales
WINDOW 
    w1 AS (ORDER BY month),
    w2 AS (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    w3 AS (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

执行结果:

+-------+-------+---------+------------+-----------+---------+------------------+------------+---------------+
| month | sales | row_num | sales_rank | ytd_sales | ytd_avg | prev_month_sales | mom_change | pct_of_total  |
+-------+-------+---------+------------+-----------+---------+------------------+------------+---------------+
| 1     | 100   | 1       | 1          | 100       | 100.0   | NULL             | NULL       | 15.38         |
| 2     | 120   | 2       | 2          | 220       | 110.0   | 100              | 20         | 18.46         |
| 3     | 80    | 3       | 3          | 300       | 100.0   | 120              | -40        | 12.31         |
| 4     | 150   | 4       | 4          | 450       | 112.5   | 80               | 70         | 23.08         |
| 5     | 90    | 5       | 5          | 540       | 108.0   | 150              | -60        | 13.85         |
| 6     | 110   | 6       | 6          | 650       | 108.33  | 90               | 20         | 16.92         |
+-------+-------+---------+------------+-----------+---------+------------------+------------+---------------+

说明:

  • 此示例同时使用了3个不同的窗口定义
  • w1: 按月份排序,用于排名和环比
  • w2: 累积窗口,用于YTD(Year-To-Date)计算
  • w3: 完整窗口,用于计算总和并计算占比
  • 展示了如何在一个查询中实现复杂的多维度分析

示例9:数据分桶 - NTILE函数应用

业务场景: 将销售数据按业绩分为高中低三档

SELECT 
    month,
    sales,
    NTILE(3) OVER w AS sales_tier,
    CASE 
        WHEN NTILE(3) OVER w = 1 THEN 'High'
        WHEN NTILE(3) OVER w = 2 THEN 'Medium'  
        ELSE 'Low'
    END AS tier_label
FROM sales
WINDOW w AS (ORDER BY sales DESC);

执行结果:

+-------+-------+------------+------------+
| month | sales | sales_tier | tier_label |
+-------+-------+------------+------------+
| 4     | 150   | 1          | High       |
| 2     | 120   | 1          | High       |
| 6     | 110   | 2          | Medium     |
| 1     | 100   | 2          | Medium     |
| 5     | 90    | 3          | Low        |
| 3     | 80    | 3          | Low        |
+-------+-------+------------+------------+

说明: NTILE(n)将排序后的数据尽可能平均地分成n个桶,常用于分层分析、ABC分类等场景。


示例10:累积分布 - CUME_DIST函数应用

业务场景: 计算每个销售额在整体中的累积分布百分位

SELECT 
    month,
    sales,
    CUME_DIST() OVER w AS cumulative_dist,
    ROUND(CUME_DIST() OVER w * 100, 2) AS percentile
FROM sales
WINDOW w AS (ORDER BY sales);

执行结果:

+-------+-------+------------------+------------+
| month | sales | cumulative_dist  | percentile |
+-------+-------+------------------+------------+
| 3     | 80    | 0.167            | 16.67      |
| 5     | 90    | 0.333            | 33.33      |
| 1     | 100   | 0.500            | 50.00      |
| 6     | 110   | 0.667            | 66.67      |
| 2     | 120   | 0.833            | 83.33      |
| 4     | 150   | 1.000            | 100.00     |
+-------+-------+------------------+------------+

说明: CUME_DIST()返回小于或等于当前值的行数占总行数的比例,表示当前值在数据集中的累积分布位置。


窗口框架详解

ROWS vs RANGE

框架类型说明示例
ROWS基于物理行数ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
RANGE基于值的逻辑范围RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING

边界关键字

  • UNBOUNDED PRECEDING: 分区的第一行
  • UNBOUNDED FOLLOWING: 分区的最后一行
  • CURRENT ROW: 当前行
  • n PRECEDING: 当前行之前的第n行(ROWS)或值-n的行(RANGE)
  • n FOLLOWING: 当前行之后的第n行(ROWS)或值+n的行(RANGE)

常用窗口函数分类

聚合函数

  • SUM(), AVG(), COUNT(), MAX(), MIN()

排名函数

  • ROW_NUMBER(): 连续唯一行号
  • RANK(): 相同值同排名,有间隔
  • DENSE_RANK(): 相同值同排名,无间隔
  • PERCENT_RANK(): 百分比排名
  • NTILE(n): 将数据分成n个桶
  • CUME_DIST(): 累积分布值

取值函数

  • LAG(): 访问前面的行
  • LEAD(): 访问后面的行
  • FIRST_VALUE(): 窗口第一个值
  • LAST_VALUE(): 窗口最后一个值
  • NTH_VALUE(): 窗口第n个值

最佳实践

1. 使用WINDOW子句提高可读性

不推荐(重复定义窗口):

SELECT month,
       SUM(sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       AVG(sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM sales;

推荐(使用WINDOW子句):

SELECT month,
       SUM(sales) OVER w,
       AVG(sales) OVER w
FROM sales
WINDOW w AS (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW);

2. 合理使用PARTITION BY

对于分组分析,使用PARTITION BY可以在不破坏数据结构的情况下进行组内计算:

-- 示例:按类别分别计算排名
SELECT category, month, sales,
       RANK() OVER w AS rank_in_category
FROM sales_by_category
WINDOW w AS (PARTITION BY category ORDER BY sales DESC);

执行结果:

+-------------+-------+-------+------------------+
| category    | month | sales | rank_in_category |
+-------------+-------+-------+------------------+
| Clothing    | 2     | 600   | 1                |
| Clothing    | 3     | 550   | 2                |
| Clothing    | 1     | 500   | 3                |
| Electronics | 2     | 1200  | 1                |
| Electronics | 1     | 1000  | 2                |
| Electronics | 3     | 900   | 3                |
+-------------+-------+-------+------------------+

说明:每个类别内独立计算排名,互不影响。

3. 注意LAST_VALUE的窗口范围

如果要获取真正的最后一个值,必须使用UNBOUNDED FOLLOWING

WINDOW w AS (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

4. 特定注意事项

在Lakehouse中,建议在查询末尾添加LIMIT子句以提高性能:

SELECT ... FROM ... WINDOW ... LIMIT 50;

性能优化建议

  1. 索引优化: 在ORDER BY和PARTITION BY使用的列上创建适当的索引
  2. 窗口范围: 尽量使用较小的窗口范围以减少计算量
  3. 复用窗口定义: 使用WINDOW子句避免重复定义相同窗口
  4. 分区策略: 合理使用PARTITION BY减少每个窗口的数据量

常见问题

Q1: WINDOW子句必须放在哪里?

A: WINDOW子句必须放在FROM和WHERE子句之后,ORDER BY子句之前。

Q2: 能否在同一个查询中定义多个窗口?

A: 可以,使用逗号分隔多个窗口定义。

Q3: ROWS和RANGE的主要区别是什么?

A: ROWS基于物理行数,RANGE基于值的逻辑范围。当有重复值时,RANGE会包含所有具有相同值的行。

Q4: 为什么LAST_VALUE没有返回预期的最后一个值?

A: 需要显式指定ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING来包含整个分区。


联系我们
预约咨询
微信咨询
电话咨询