用CTE方式对客户进行同类群组分析
CTE
CTE(Common Table Expression) ,公用表表达式,它是在单个语句的执行范围内定义的临时结果集,只在查询期间有效。它可以自引用,也可在同一查询中多次引用,实现了代码段的重复利用。
同类群组(Revenue Cohort)
同类群组(Revenue Cohort)分析是对几个不同同类群组(即客户群)的分析,以更好地了解行为、模式和趋势。
最常见的同类群组分析类型之一着眼于基于时间的同类群组,这些群组按特定时间范围对用户/客户进行分组。例如,一家公司可能希望了解其在 1 月份开始使用该产品或开始付款的客户与 2 月份的客户相比如何。
基于细分的群组代表使用或购买特定产品或服务的客户群。例如,您可以根据用户每周登录您的应用的时间来细分用户。
另一种类型的群组是基于规模的群组,它按货币价值对客户进行细分。这是游戏行业(免费用户与鲸鱼用户)或 SaaS 世界中的常见做法,通过他们的 LTV 或计划对客户进行细分。
对于本文的其余部分,我们将只关注实施基于时间的收入群组分析。
队列分析所需的数据,在开始队列分析之前,需要以下数据:
* 与购买数据相关联的收入数据
* 用户的唯一标识符,例如客户 ID 或帐户 ID
* 每个用户的初始开始日期,无论是注册日期还是第一次付款。
数据说明
本文所使用的数据来自云器Lakehouse的共享样例数据,可以直接按照如下方式直接使用:
select * from clickzetta_sample_data.ecommerce_events_history.ecommerce_events_multicategorystore limit 10;

分析步骤
- 临时结果集user_cohorts:首先,我们想将用户分到群组中——在这种情况下,我们想按他们的Order Week 。
- 临时结果集order_Week:接下来我们要创建一个order_month变量。例如,客户在首次付款后一个月进行的付款的值为order_month2。创建一个order_Week变量。例如,客户在首次付款后一周进行的付款的order_Week值为 2。
- 临时结果集cohort_size:进一步,我们现在可以汇总cohortMonth第一步中创造的收入。这将使我们能够创建我们的rentention_table.汇总cohortMonth第一步中创造的收入。这将使我们能够创建我们的rentention_table.
构建分析模型如下:
WITH
eCommerce_LifeCycle_Order_Sequence AS (
SELECT event_date,
user_id,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date ASC) AS customer_order_sequence,
LAG(event_date) OVER (PARTITION BY user_id ORDER BY user_id ASC) AS previous_order_date,
MIN(event_date) AS first_order_date,
MAX(event_date) AS last_order_date
FROM clickzetta_sample_data.ecommerce_events_history.ecommerce_events_multicategorystore
WHERE event_type = 'purchase'
GROUP BY event_date,
user_id)
,
eCommerce_LifeCycle_Time_Between_Orders AS (
SELECT event_date,
user_id,
customer_order_sequence,
CASE
WHEN previous_order_date IS NULL THEN event_date
ELSE previous_order_date
END AS previous_order_date,
DATEDIFF(event_date,previous_order_date) AS days_between_orders,
first_order_date,
last_order_date
FROM eCommerce_LifeCycle_Order_Sequence)
,
eCommerce_LifeCycle AS(
SELECT event_date,
user_id,
CASE
WHEN customer_order_sequence = 1 THEN '新客户'
WHEN days_between_orders > 0 AND days_between_orders < 30 THEN '活跃客户'
WHEN days_between_orders > 30 THEN '沉睡客户'
ELSE '未知'
END AS customer_life_cycle,
customer_order_sequence,
previous_order_date,
CASE
WHEN days_between_orders IS NULL THEN 0
ELSE days_between_orders
END AS days_between_orders,
first_order_date,
last_order_date
FROM eCommerce_LifeCycle_Time_Between_Orders)
select * from eCommerce_LifeCycle;
运行上述SQL代码,结果如下:

也可以通过以下方式,可以将结果直接写入到目标表中:
DROP TABLE if exists eCommerce_User_Cohort;
CREATE TABLE if not exists eCommerce_User_Cohort
AS
--将用户分到群组中。在本例中,我们想按他们的Order Date
with user_cohorts as (
SELECT user_id
, MIN(weekofyear(event_date)) as cohortWeek
FROM clickzetta_sample_data.ecommerce_events_history.ecommerce_events_multicategorystore
WHERE CAST(event_type as string) = 'purchase'
GROUP BY 1
--LIMIT 100
),
--创建一个order_Week变量。例如,客户在首次付款后一周进行的付款的order_Week值为 2。
order_Week as (
SELECT eed.user_id
, (weekofyear(event_date)-cohortWeek+1) as Week_number
, SUM(price) as revenue
FROM clickzetta_sample_data.ecommerce_events_history.ecommerce_events_multicategorystore eed
LEFT JOIN user_cohorts u on eed.user_id=u.user_id --USING(user_id)
WHERE CAST(event_type as string) = 'purchase'
GROUP BY 1, 2
--LIMIT 100
),
--汇总cohortMonth第一步中创造的收入。这将使我们能够创建我们的rentention_table.
cohort_size as (
SELECT sum(price) as revenue
, cohortWeek
FROM clickzetta_sample_data.ecommerce_events_history.ecommerce_events_multicategorystore eed
LEFT JOIN user_cohorts u on eed.user_id=u.user_id --USING (user_id)
WHERE CAST(event_type as string) = 'purchase'
GROUP BY 2
ORDER BY 2
--LIMIT 100
),
retention_table as (
SELECT c.cohortWeek
, o.Week_number
, sum(revenue) as revenue
FROM order_Week o
LEFT JOIN user_cohorts c on c.user_id=o.user_id
GROUP BY 1, 2
)
SELECT r.cohortWeek
, s.revenue as totalRevenue
, r.Week_number
, r.revenue / s.revenue as percentage
FROM retention_table r
LEFT JOIN cohort_size s on r.cohortWeek=s.cohortWeek
WHERE r.cohortWeek IS NOT NULL
ORDER BY 1, 3;
回顾:物理表的方式
不使用CTE,而是将每一个子查询的结果都落成物理表,则整个处理过程如下:
- 根据每个客户的升序订单日期创建特定于每个客户 ID 的订单序列列。这就是ROW_NUMBER 分析函数在下面的查询中所做的。
- 创建一个新列,为客户 ID 插入先前的订单日期,以便在以后的代码块中用于计算订单之间的时间段。这就是LAG 分析函数在以下查询中所做的。
–请注意语句末尾的“Group By”order_date 和 customer_id 列。这很重要,因为客户可以在同一天拥有多个具有不同订单 ID 的订单。
DROP TABLE if exists eCommerce_LifeCycle_Order_Sequence;
CREATE TABLE eCommerce_LifeCycle_Order_Sequence
AS
SELECT event_date,
user_id,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date ASC) AS customer_order_sequence,
LAG(event_date) OVER (PARTITION BY user_id ORDER BY user_id ASC) AS previous_order_date,
MIN(event_date) AS first_order_date,
MAX(event_date) AS last_order_date
FROM clickzetta_sample_data.ecommerce_events_history.ecommerce_events_multicategorystore
WHERE event_type = 'purchase'
GROUP BY event_date,
user_id;
针对上表运行子查询以计算 order_date 和 previous_order_date 列之间的天数,您将看到创建新列 days_between_orders 的 DATE_DIFF 函数会发生这种情况。
DROP TABLE if exists eCommerce_LifeCycle_Time_Between_Orders;
CREATE TABLE eCommerce_LifeCycle_Time_Between_Orders
AS
SELECT event_date,
user_id,
customer_order_sequence,
CASE
WHEN previous_order_date IS NULL THEN event_date
ELSE previous_order_date
END AS previous_order_date,
DATEDIFF(event_date,previous_order_date) AS days_between_orders,
first_order_date,
last_order_date
FROM eCommerce_LifeCycle_Order_Sequence;
下一个查询使用CASE 语句创建一个附加的 customer_life_cycle 列,以根据 days_between_orders 列指示订单是来自新客户、活跃客户还是已失效客户。在此示例中,如果订单发生在距上一个订单 1 到 365 天之间的任何时间,则客户被视为活跃客户;如果先前订单超过 365 天,则客户被视为已失效。这具有高度的业务特定性,因此您的可能会有所不同。
DROP TABLE if exists eCommerce_LifeCycle;
CREATE TABLE eCommerce_LifeCycle
AS
SELECT event_date,
user_id,
CASE
WHEN customer_order_sequence = 1 THEN '新客户'
WHEN days_between_orders > 0 AND days_between_orders < 30 THEN '活跃客户'
WHEN days_between_orders > 30 THEN '沉睡客户'
ELSE '未知'
END AS customer_life_cycle,
customer_order_sequence,
previous_order_date,
CASE
WHEN days_between_orders IS NULL THEN 0
ELSE days_between_orders
END AS days_between_orders,
first_order_date,
last_order_date
FROM eCommerce_LifeCycle_Time_Between_Orders;
select * from eCommerce_LifeCycle limit 100;
Congratulations, it's done.
Please enojoy and learn more!
附录
下载Zeppelin Notebook源文件
本文代码也提供运行在Zeppelin的版本,你如果想直接运行本文代码,请按照文档说明安装Zeppelin。
03.CTE(Common Table Expression)..ipynb
03.CTE(Common Table Expression)_2JHUJ5BP8.zpln