产品
解决方案
客户案例
资源中心
活动中心
关于我们
在云器Lakehouse上玩Kaggle数据之US Funds dataset from Yahoo Finance(二)
数据见闻
2024年9月14日
本文讨论了在云器 Lakehouse 上处理 Kaggle 数据集中的 SQL Mutual Fund Performance 的过程

关于Kaggle上的SQL Mutual Fund Performance代码

上文提到,Kaggle 是一个数据建模和数据分析竞赛平台,US Funds dataset from Yahoo Finance是Kaggle上的一个数据集。

Kaggle 上不仅提供了数据,也提供了很多数据科学工作者的代码。在Kaggle上,US Funds dataset from Yahoo Finance数据集对应的分析代码有19个:

image.png

其中“SQL Mutual Fund Performance”立即引起了笔者的关注。

代码地址:https://www.kaggle.com/code/sophianguyn/sql-mutual-fund-performance

先看一下该代码的总结:

  1. 问题陈述-本项目的目标是使用 SQL 分析共同基金数据,以确定关键业绩趋势、风险与回报矩阵以及费用预测。随着投资机会的增加,了解这些方面对于投资者做出明智的决策至关重要。此分析旨在对基金进行分类、比较业绩并了解风险回报权衡,为投资者和基金经理提供宝贵的见解。
  2. 结论-分析揭示了共同基金业绩、风险评级和回报评级的重要趋势。通过根据业绩和风险回报矩阵对基金进行分类,我们确定了高绩效和低绩效基金: 高绩效基金:持续表现出高评级和有利风险回报率的基金。 低绩效基金:评级低且风险回报率较差的基金。 此外,不同时间范围(3、5 和 10 年)的费用预测提供了对不同基金相关成本的全面了解。例如,“13D Activist Fund”和“361 Funds”等基金多年来的费用趋势高于其他基金。这种整体视角有助于做出更好的投资决策,因为它可以突出各种共同基金的业绩潜力和成本影响。
  3. 技术-使用 SQL 对数据进行分类和聚合。 基金业绩、风险评级和回报评级的比较分析。 3 年、5 年和 10 年期间的费用预测分析。 可视化关键发现以增强可解释性。
  4. 如果我有更多时间,我可能会做 集成更复杂的统计模型,根据历史数据预测未来表现。 对市场条件对共同基金业绩的影响进行更深入的分析。 实施高级可视化工具以创建更具交互性和更详细的数据可视化表示。

问题引入:这些SQL代码可以跑在云器Lakehouse上么?

云器Lakehouse支持SQL和Python等语言进行数据分析,这就激发了笔者对其中“SQL Mutual Fund Performance”SQL代码的好奇心:

❓ 这些SQL代码可以跑在云器Lakehouse上,以进一步分析上文导入到Lakehouse的数据么?

SQL Mutual Fund Performance包括了如下查询:

查询 1:四年内(2018-2021)最高和最低平均评级 查询 2:每只基金的平均同比评级 查询 3:关键评级 KPI 同比比较表 查询 4:根据基金表现和风险回报矩阵对基金进行分类 查询 5:表现和风险与回报矩阵分析 查询 6:业务审查的风险与回报矩阵分析 查询 7:最高营业额分析 查询 8:费用项目趋势

SQL特点和复杂度分析

凡事先谋而后动,在决定正式代码迁移之前,先仔细阅读一下“SQL Mutual Fund Performance”代码,分析其特点和复杂度。快速看下查询 2(每只基金的平均同比评级)的SQL代码(Pandas SQL语法):

SELECT * 
FROM (
    SELECT
        *,
        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_rating
    FROM (
        SELECT 
            b.fund_long_name,
            MAX(b.yr_2018) AS yr_18,
            MAX(b.yr_2019) AS yr_19,
            MAX(b.yr_2020) AS yr_20,
            MAX(b.yr_2021) AS yr_21
        FROM (
            SELECT
                a.fund_long_name,
                CASE WHEN a.year = '2018' THEN a.avg_rating ELSE 0 END AS yr_2018,
                CASE WHEN a.year = '2019' THEN a.avg_rating ELSE 0 END AS yr_2019,
                CASE WHEN a.year = '2020' THEN a.avg_rating ELSE 0 END AS yr_2020,
                CASE WHEN a.year = '2021' THEN a.avg_rating ELSE 0 END AS yr_2021
            FROM (
                SELECT 
                    strftime('%Y', inception_date) AS Year,
                    fund_long_name,
                    ROUND(AVG(morningstar_overall_rating), 2) AS avg_rating
                FROM mutualfunds m
                GROUP BY fund_long_name, strftime('%Y', inception_date)
                ORDER BY fund_long_name, strftime('%Y', inception_date) DESC
            ) a
        ) b
        GROUP BY b.fund_long_name
    ) c
    GROUP BY c.fund_long_name
) d
ORDER BY d.avg_rating DESC;

特点

  1. 使用公共表表达式(CTE):

    1. 通过WITH子句定义了一个名为expense的CTE,用于计算每个基金家族的费用趋势。这使得查询结构更加清晰和模块化。
  2. 条件聚合:

    1. 使用SUMCASE WHEN语句来计算不同时间段内的费用趋势。这种方法允许在单个查询中进行复杂的条件聚合。
  3. 联合查询(UNION ALL):

    1. 使用UNION ALL将多个子查询的结果合并在一起,每个子查询代表不同的时间段(3年、5年、10年)的费用趋势。
  4. 分组和聚合:

    1. 在最终的查询中,使用GROUP BYMAX函数对不同基金家族的费用趋势进行分组和聚合。

复杂度

  1. 查询嵌套:

    1. 查询包含多个嵌套的子查询和CTE。
  2. 条件逻辑:

    1. 使用了多个CASE WHEN语句来处理不同的条件逻辑。
  3. 性能考虑:

    1. 由于查询涉及多个聚合操作和条件计算,对数据平台对大数据集的计算性能要求高。

可以看到,这些SQL查询通过使用CTE、条件聚合和联合查询来处理复杂的数据分析任务,因此增加了查询的复杂度和执行时的性能挑战。

迁移评估

通过以上分析可以看到,为了对共同基金进行专业分析,“SQL Mutual Fund Performance”代码确实具有复杂性。而云器Lakehouse的SQL引擎对Spark SQL、Pandas SQL具有高度兼容性,肉眼可见的是一个非常微小的兼容性改动:

💡 将strftime('%Y', inception_date) 改为year(inception_date)

经过上述分析,大幅降低了SQL作业迁移难度的心理顾虑,开搞!

SQL作业迁移步骤计划

云器Lakehouse内置有SQL IDE开发工具,包括代码智能提示、版本管理、测试和发布等,愉快的迁移过程就此开始:

image.png

在上文数据加载的基础上规划本次迁移的四个步骤:

  1. 用动态表(Dynamic table)来存放计算结果,考虑到底表变化和结果表都是统计结果,非常适合用动态表来简化数据加工逻辑。
  2. 数据标准:对数据进行详细注释,方便团队理解和大模型理解数据结构。
  3. 刷新动态表:不同于传统的大数据INSERT INTO/OVERWRITE的结果刷新方式,云器Lakehouse的动态表采用增量计算技术,通过REFESHESH 动态表的方式只对变化数据进行计算,从而大幅提升了数据刷新速度并降级计算成本,从而提高性价比,达到降本增效的目的。
  4. 查看结果,通过SELECT语句查看计算结果。

实施SQL作业迁移

迁移过程非常顺利,基本上是个ICP(Information Copy&Paste)的过程。每一个查询的迁移过程如下:

1.从“SQL Mutual Fund Performance”Copy代码。

2.粘贴到云器Lakehouse的SQL IDE。

image.png

3.先跑通SELECT语句,跑通后检查结果正确。

a.语法修改:使用IDE的查找替换功能,将strftime('%Y', inception_date) 改为year(inception_date)

b.检查结果:

image.png

4.根据SELECT语句创建动态表。

CREATE DYNAMIC TABLE IF NOT EXISTS the_highest_and_lowest_average_ratings_over_years 
AS
SELECT ***

image.png

5.迁移后完整的代码如下:

❤️完整代码都放这里了,数据在上文也都有了,你会动手玩起来么?

set cz.optimizer.mv.auto.unique.key.enabled = false;
-- 查询 1:四年(2018-2021 年)最高和最低平均评级 此查询的目的是计算每个基金类别的平均同比 (YoY) 评级。它确定了四年(2018-2021 年)平均评级最高和最低的类别。这可以提供有价值的见解,了解哪些类别表现最佳,哪些类别表现不佳,从而有助于战略决策和投资分析。

CREATE DYNAMIC TABLE IF NOT EXISTS the_highest_and_lowest_average_ratings_over_years AS
SELECT * FROM (
    SELECT 'Category with maximum Rating :- ' || c.fund_category AS category, 
           ROUND(SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4, 2) AS avg_YoY
    FROM (
        SELECT b.fund_category,
               MAX(b.yr_2018) AS yr_18,
               MAX(b.yr_2019) AS yr_19,
               MAX(b.yr_2020) AS yr_20,
               MAX(b.yr_2021) AS yr_21
        FROM (
            SELECT a.fund_category,
                   CASE WHEN a.year = '2018' THEN a.avg_rating ELSE 0 END AS yr_2018,
                   CASE WHEN a.year = '2019' THEN a.avg_rating ELSE 0 END AS yr_2019,
                   CASE WHEN a.year = '2020' THEN a.avg_rating ELSE 0 END AS yr_2020,
                   CASE WHEN a.year = '2021' THEN a.avg_rating ELSE 0 END AS yr_2021
            FROM (
                SELECT year(inception_date) AS year,
                       fund_category,
                       ROUND(AVG(morningstar_overall_rating), 2) AS avg_rating
                FROM mutualfunds
                GROUP BY fund_category, year(inception_date)
                ORDER BY fund_category, year(inception_date) DESC
            ) a
        ) b
        GROUP BY b.fund_category
    ) c
    GROUP BY c.fund_category
    ORDER BY avg_YoY DESC
    LIMIT 1
)

UNION ALL

SELECT * FROM (
    SELECT 'Category with minimum Rating :- ' || c.fund_category AS category, 
           ROUND(SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4, 2) AS avg_YoY
    FROM (
        SELECT b.fund_category,
               MAX(b.yr_2018) AS yr_18,
               MAX(b.yr_2019) AS yr_19,
               MAX(b.yr_2020) AS yr_20,
               MAX(b.yr_2021) AS yr_21
        FROM (
            SELECT a.fund_category,
                   CASE WHEN a.year = '2018' THEN a.avg_rating ELSE 0 END AS yr_2018,
                   CASE WHEN a.year = '2019' THEN a.avg_rating ELSE 0 END AS yr_2019,
                   CASE WHEN a.year = '2020' THEN a.avg_rating ELSE 0 END AS yr_2020,
                   CASE WHEN a.year = '2021' THEN a.avg_rating ELSE 0 END AS yr_2021
            FROM (
                SELECT year(inception_date) AS year,
                       fund_category,
                       ROUND(AVG(morningstar_overall_rating), 2) AS avg_rating
                FROM mutualfunds
                GROUP BY fund_category, year(inception_date)
                ORDER BY fund_category, year(inception_date) DESC
            ) a
        ) b
        GROUP BY b.fund_category
    ) c
    GROUP BY c.fund_category
    ORDER BY avg_YoY ASC
    LIMIT 1
);

-- 查询 2:每个基金的平均同比 (YoY) 评分 此查询根据基金的 fund_long_name 计算每个基金的平均同比 (YoY) 评分。它汇总了每年(2018、2019、2020 和 2021)的评分,并计算这些年的平均评分。从此分析得出的业务推论包括确定平均同比评分最高和最低的基金类别、突出显示不一致之处和趋势以及确定评分为 0 的类别。此信息对于了解各种基金类别随时间推移的业绩动态至关重要。

CREATE DYNAMIC TABLE IF NOT EXISTS average_yoy_rating_for_funds AS
SELECT * FROM (
    SELECT
    c.fund_long_name,
    AVG(c.yr_18) as yr_18, AVG(c.yr_19) as yr_19, AVG(c.yr_20)  as yr_20 ,AVG(c.yr_21)  as yr_21,
    SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_rating
    FROM (
        SELECT 
        b.fund_long_name,
        MAX(b.yr_2018) AS yr_18,
        MAX(b.yr_2019) AS yr_19,
        MAX(b.yr_2020) AS yr_20,
        MAX(b.yr_2021) AS yr_21

        FROM (
            SELECT
            a.fund_long_name,
            CASE WHEN a.year='2018' THEN a.avg_rating ELSE 0 END AS yr_2018,
            CASE WHEN a.year='2019' THEN a.avg_rating ELSE 0 END AS yr_2019,
            CASE WHEN a.year='2020' THEN a.avg_rating ELSE 0 END AS yr_2020,
            CASE WHEN a.year='2021' THEN a.avg_rating ELSE 0 END AS yr_2021

            FROM

            (SELECT 
            year(inception_date) AS Year,
            fund_long_name,
            ROUND(AVG(morningstar_overall_rating), 2) AS avg_rating
            FROM mutualfunds m
            GROUP BY fund_long_name, year(inception_date)
            ORDER BY fund_long_name, year(inception_date) DESC) a
        ) b

        GROUP BY b.fund_long_name
    ) c

    GROUP BY c.fund_long_name
) d
ORDER BY d.avg_rating DESC;

-- 查询 3:关键评级 KPI 同比 (YoY) 比较表 此查询生成关键评级 KPI 同比 (YoY) 比较表。KPI 包括:平均评级 (avg_rating):衡量基金组合的整体表现。回报评级 (avg_return_rating):评估基金产生的产出或回报程度。风险评级 (avg_risk_rating):评估与基金相关的风险程度。查询根据基金的 fund_long_name 汇总每个基金的这些 KPI,并计算 2018、2019、2020 和 2021 年的平均值。
CREATE DYNAMIC TABLE IF NOT EXISTS comparison_table_for_key_rating_kpis_yoy AS
SELECT
rating.fund_long_name,
rating.avg_rating,
return.avg_return_rating,
risk.avg_risk_rating
FROM (
    SELECT * FROM (
        SELECT
        c.fund_long_name,
        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_rating
        FROM (
            SELECT 
            b.fund_long_name,
            MAX(b.yr_2018) AS yr_18,
            MAX(b.yr_2019) AS yr_19,
            MAX(b.yr_2020) AS yr_20,
            MAX(b.yr_2021) AS yr_21
            FROM (
                SELECT
                a.fund_long_name,
                CASE WHEN a.year='2018' THEN a.avg_rating ELSE 0 END AS yr_2018,
                CASE WHEN a.year='2019' THEN a.avg_rating ELSE 0 END AS yr_2019,
                CASE WHEN a.year='2020' THEN a.avg_rating ELSE 0 END AS yr_2020,
                CASE WHEN a.year='2021' THEN a.avg_rating ELSE 0 END AS yr_2021
                FROM (
                    SELECT 
                    year(inception_date) AS Year,
                    fund_long_name,
                    ROUND(AVG(morningstar_overall_rating), 2) AS avg_rating
                    FROM mutualfunds m
                    GROUP BY fund_long_name, year(inception_date)
                    ORDER BY fund_long_name, year(inception_date) DESC
                ) a
            ) b
            GROUP BY b.fund_long_name
        ) c
        GROUP BY c.fund_long_name
    ) d
    ORDER BY d.avg_rating DESC
) rating
JOIN (
    SELECT * FROM (
        SELECT
        c.fund_long_name,
        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_return_rating
        FROM (
            SELECT 
            b.fund_long_name,
            MAX(b.yr_2018) AS yr_18,
            MAX(b.yr_2019) AS yr_19,
            MAX(b.yr_2020) AS yr_20,
            MAX(b.yr_2021) AS yr_21
            FROM (
                SELECT
                a.fund_long_name,
                CASE WHEN a.year='2018' THEN a.avg_return_rating ELSE 0 END AS yr_2018,
                CASE WHEN a.year='2019' THEN a.avg_return_rating ELSE 0 END AS yr_2019,
                CASE WHEN a.year='2020' THEN a.avg_return_rating ELSE 0 END AS yr_2020,
                CASE WHEN a.year='2021' THEN a.avg_return_rating ELSE 0 END AS yr_2021
                FROM (
                    SELECT 
                    year(inception_date) AS Year,
                    fund_long_name,
                    ROUND(AVG(morningstar_return_rating), 2) AS avg_return_rating
                    FROM mutualfunds m
                    GROUP BY fund_long_name, year(inception_date)
                    ORDER BY fund_long_name, year(inception_date) DESC
                ) a
            ) b
            GROUP BY b.fund_long_name
        ) c
        GROUP BY c.fund_long_name
    ) d
    ORDER BY d.avg_return_rating DESC
) return ON rating.fund_long_name = return.fund_long_name
JOIN (
    SELECT * FROM (
        SELECT
        c.fund_long_name,
        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_risk_rating
        FROM (
            SELECT 
            b.fund_long_name,
            MAX(b.yr_2018) AS yr_18,
            MAX(b.yr_2019) AS yr_19,
            MAX(b.yr_2020) AS yr_20,
            MAX(b.yr_2021) AS yr_21
            FROM (
                SELECT
                a.fund_long_name,
                CASE WHEN a.year='2018' THEN a.avg_risk_rating ELSE 0 END AS yr_2018,
                CASE WHEN a.year='2019' THEN a.avg_risk_rating ELSE 0 END AS yr_2019,
                CASE WHEN a.year='2020' THEN a.avg_risk_rating ELSE 0 END AS yr_2020,
                CASE WHEN a.year='2021' THEN a.avg_risk_rating ELSE 0 END AS yr_2021
                FROM (
                    SELECT 
                    year(inception_date) AS Year,
                    fund_long_name,
                    ROUND(AVG(morningstar_risk_rating), 2) AS avg_risk_rating
                    FROM mutualfunds m
                    GROUP BY fund_long_name, year(inception_date)
                    ORDER BY fund_long_name, year(inception_date) DESC
                ) a
            ) b
            GROUP BY b.fund_long_name
        ) c
        GROUP BY c.fund_long_name
    ) d
    ORDER BY d.avg_risk_rating DESC
) risk ON rating.fund_long_name = risk.fund_long_name
ORDER BY risk.avg_risk_rating DESC;

-- 查询 4:根据基金的表现和风险回报矩阵对基金进行分类 此查询旨在通过根据基金的表现和风险回报矩阵对基金进行分类来确定绩效趋势。它有助于业务团队了解不同基金的表现及其相关的风险和回报特征。 评级:反映基金的表现。 风险评级:表示与基金相关的风险程度。 回报评级:表示基金的产出或回报程度。 该查询根据基金的平均评级将基金分为“表现良好”和“表现不佳”。它还将基金分为四个风险回报类别:高风险高回报、高风险低回报、低风险低回报和低风险高回报。
CREATE DYNAMIC TABLE IF NOT EXISTS categorizing_funds_based_on_their_performance_and_risk_return_matrix AS
SELECT
    f.fund_long_name,
    f.Performance,
    f.RiskvsReturnMatrix
FROM (
    SELECT
        e.row_number,
        e.fund_long_name,
        CASE 
            WHEN e.avg_rating > AVG(e.avg_rating) OVER (ORDER BY e.row_number ASC) THEN 'Performing fund' 
            ELSE 'Non-Performing fund' 
        END AS Performance,
        CASE 
            WHEN e.avg_risk_rating > AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                 AND e.avg_return_rating > AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'High-Risk and High-Return'
            WHEN e.avg_risk_rating >= AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                 AND e.avg_return_rating <= AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'High-Risk and Low-Return'
            WHEN e.avg_risk_rating < AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                 AND e.avg_return_rating < AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'Low-Risk and Low-Return'
            ELSE 'Low-Risk and High-Return' 
        END AS RiskvsReturnMatrix
    FROM (
        SELECT
            ROW_NUMBER() OVER (ORDER BY rating.fund_long_name ASC) AS row_number,
            rating.fund_long_name,
            rating.avg_rating,
            return.avg_return_rating,
            ROUND(risk.avg_risk_rating, 2) AS avg_risk_rating
        FROM (
            SELECT 
                c.fund_long_name,
                SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_rating
            FROM (
                SELECT 
                    b.fund_long_name,
                    MAX(b.yr_2018) AS yr_18,
                    MAX(b.yr_2019) AS yr_19,
                    MAX(b.yr_2020) AS yr_20,
                    MAX(b.yr_2021) AS yr_21
                FROM (
                    SELECT
                        a.fund_long_name,
                        CASE WHEN a.year='2018' THEN a.avg_rating ELSE 0 END AS yr_2018,
                        CASE WHEN a.year='2019' THEN a.avg_rating ELSE 0 END AS yr_2019,
                        CASE WHEN a.year='2020' THEN a.avg_rating ELSE 0 END AS yr_2020,
                        CASE WHEN a.year='2021' THEN a.avg_rating ELSE 0 END AS yr_2021
                    FROM (
                        SELECT 
                            year(inception_date) AS Year,
                            fund_long_name,
                            ROUND(AVG(morningstar_overall_rating), 2) AS avg_rating
                        FROM mutualfunds m
                        GROUP BY fund_long_name, year(inception_date)
                        ORDER BY fund_long_name, year(inception_date) DESC
                    ) a
                ) b
                GROUP BY b.fund_long_name
            ) c
            GROUP BY c.fund_long_name 
        ) rating
        JOIN (
            SELECT 
                c.fund_long_name,
                SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_return_rating
            FROM (
                SELECT 
                    b.fund_long_name,
                    MAX(b.yr_2018) AS yr_18,
                    MAX(b.yr_2019) AS yr_19,
                    MAX(b.yr_2020) AS yr_20,
                    MAX(b.yr_2021) AS yr_21
                FROM (
                    SELECT
                        a.fund_long_name,
                        CASE WHEN a.year='2018' THEN a.avg_return_rating ELSE 0 END AS yr_2018,
                        CASE WHEN a.year='2019' THEN a.avg_return_rating ELSE 0 END AS yr_2019,
                        CASE WHEN a.year='2020' THEN a.avg_return_rating ELSE 0 END AS yr_2020,
                        CASE WHEN a.year='2021' THEN a.avg_return_rating ELSE 0 END AS yr_2021
                    FROM (
                        SELECT 
                            year(inception_date) AS Year,
                            fund_long_name,
                            ROUND(AVG(morningstar_return_rating), 2) AS avg_return_rating
                        FROM mutualfunds m
                        GROUP BY fund_long_name, year(inception_date)
                        ORDER BY fund_long_name, year(inception_date) DESC
                    ) a
                ) b
                GROUP BY b.fund_long_name
            ) c
            GROUP BY c.fund_long_name 
        ) return ON rating.fund_long_name = return.fund_long_name
        JOIN (
            SELECT 
                c.fund_long_name,
                SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_risk_rating
            FROM (
                SELECT 
                    b.fund_long_name,
                    MAX(b.yr_2018) AS yr_18,
                    MAX(b.yr_2019) AS yr_19,
                    MAX(b.yr_2020) AS yr_20,
                    MAX(b.yr_2021) AS yr_21
                FROM (
                    SELECT
                        a.fund_long_name,
                        CASE WHEN a.year='2018' THEN a.avg_risk_rating ELSE 0 END AS yr_2018,
                        CASE WHEN a.year='2019' THEN a.avg_risk_rating ELSE 0 END AS yr_2019,
                        CASE WHEN a.year='2020' THEN a.avg_risk_rating ELSE 0 END AS yr_2020,
                        CASE WHEN a.year='2021' THEN a.avg_risk_rating ELSE 0 END AS yr_2021
                    FROM (
                        SELECT 
                            year(inception_date) AS Year,
                            fund_long_name,
                            ROUND(AVG(morningstar_risk_rating), 2) AS avg_risk_rating
                        FROM mutualfunds m
                        GROUP BY fund_long_name, year(inception_date)
                        ORDER BY fund_long_name, year(inception_date) DESC
                    ) a
                ) b
                GROUP BY b.fund_long_name
            ) c
            GROUP BY c.fund_long_name 
        ) risk ON rating.fund_long_name = risk.fund_long_name
    ) e
    GROUP BY e.row_number, e.fund_long_name, e.avg_rating, e.avg_risk_rating, e.avg_return_rating
) f
GROUP BY f.fund_long_name, f.Performance, f.RiskvsReturnMatrix;

-- 查询 5:绩效和风险与回报矩阵分析 此查询评估各种共同基金的绩效和风险回报状况,以进行业务审查。它比较了表现良好和表现不佳的基金的数量,并将它们归类为不同的风险回报矩阵。
-- Performance and Risk vs Return Matrix analysis
CREATE DYNAMIC TABLE IF NOT EXISTS performance_and_risk_vs_return_matrix_analysis AS
SELECT
    'The Comparision is as follows :- '||
    MAX(h.Performing_fund) AS Value_one,
    MAX(h.Non_Performing_fund) AS Value_two
FROM (
    SELECT 
        g.Performance,
        SUM(CASE WHEN g.performance='Performing fund' THEN 1 ELSE 0 END) AS Performing_fund,
        SUM(CASE WHEN g.performance='Non-Performing fund' THEN 1 ELSE 0 END) AS Non_Performing_fund
    FROM (
        SELECT
            f.fund_long_name,
            f.Performance,
            f.RiskvsReturnMatrix
        FROM (
            SELECT
                e.row_number,
                e.fund_long_name,
                CASE 
                    WHEN e.avg_rating > AVG(e.avg_rating) OVER (ORDER BY e.row_number ASC) THEN 'Performing fund' 
                    ELSE 'Non-Performing fund' 
                END AS Performance,
                CASE 
                    WHEN e.avg_risk_rating > AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                         AND e.avg_return_rating > AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'High-Risk and High-Return'
                    WHEN e.avg_risk_rating >= AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                         AND e.avg_return_rating <= AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'High-Risk and Low-Return'
                    WHEN e.avg_risk_rating < AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                         AND e.avg_return_rating < AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'Low-Risk and Low-Return'
                    ELSE 'Low-Risk and High-Return' 
                END AS RiskvsReturnMatrix
            FROM (
                SELECT
                    ROW_NUMBER() OVER (ORDER BY rating.fund_long_name ASC) AS row_number,
                    rating.fund_long_name,
                    rating.avg_rating,
                    return.avg_return_rating,
                    ROUND(risk.avg_risk_rating, 2) AS avg_risk_rating
                FROM (
                    SELECT 
                        c.fund_long_name,
                        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_rating
                    FROM (
                        SELECT 
                            b.fund_long_name,
                            MAX(b.yr_2018) AS yr_18,
                            MAX(b.yr_2019) AS yr_19,
                            MAX(b.yr_2020) AS yr_20,
                            MAX(b.yr_2021) AS yr_21
                        FROM (
                            SELECT
                                a.fund_long_name,
                                CASE WHEN a.year='2018' THEN a.avg_rating ELSE 0 END AS yr_2018,
                                CASE WHEN a.year='2019' THEN a.avg_rating ELSE 0 END AS yr_2019,
                                CASE WHEN a.year='2020' THEN a.avg_rating ELSE 0 END AS yr_2020,
                                CASE WHEN a.year='2021' THEN a.avg_rating ELSE 0 END AS yr_2021
                            FROM (
                                SELECT 
                                    year(inception_date) AS Year,
                                    fund_long_name,
                                    ROUND(AVG(morningstar_overall_rating), 2) AS avg_rating
                                FROM mutualfunds m
                                GROUP BY fund_long_name, year(inception_date)
                                ORDER BY fund_long_name, year(inception_date) DESC
                            ) a
                        ) b
                        GROUP BY b.fund_long_name
                    ) c
                    GROUP BY c.fund_long_name 
                ) rating
                JOIN (
                    SELECT 
                        c.fund_long_name,
                        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_return_rating
                    FROM (
                        SELECT 
                            b.fund_long_name,
                            MAX(b.yr_2018) AS yr_18,
                            MAX(b.yr_2019) AS yr_19,
                            MAX(b.yr_2020) AS yr_20,
                            MAX(b.yr_2021) AS yr_21
                        FROM (
                            SELECT
                                a.fund_long_name,
                                CASE WHEN a.year='2018' THEN a.avg_return_rating ELSE 0 END AS yr_2018,
                                CASE WHEN a.year='2019' THEN a.avg_return_rating ELSE 0 END AS yr_2019,
                                CASE WHEN a.year='2020' THEN a.avg_return_rating ELSE 0 END AS yr_2020,
                                CASE WHEN a.year='2021' THEN a.avg_return_rating ELSE 0 END AS yr_2021
                            FROM (
                                SELECT 
                                    year(inception_date) AS Year,
                                    fund_long_name,
                                    ROUND(AVG(morningstar_return_rating), 2) AS avg_return_rating
                                FROM mutualfunds m
                                GROUP BY fund_long_name, year(inception_date)
                                ORDER BY fund_long_name, year(inception_date) DESC
                            ) a
                        ) b
                        GROUP BY b.fund_long_name
                    ) c
                    GROUP BY c.fund_long_name 
                ) return ON rating.fund_long_name = return.fund_long_name
                JOIN (
                    SELECT 
                        c.fund_long_name,
                        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_risk_rating
                    FROM (
                        SELECT 
                            b.fund_long_name,
                            MAX(b.yr_2018) AS yr_18,
                            MAX(b.yr_2019) AS yr_19,
                            MAX(b.yr_2020) AS yr_20,
                            MAX(b.yr_2021) AS yr_21
                        FROM (
                            SELECT
                                a.fund_long_name,
                                CASE WHEN a.year='2018' THEN a.avg_risk_rating ELSE 0 END AS yr_2018,
                                CASE WHEN a.year='2019' THEN a.avg_risk_rating ELSE 0 END AS yr_2019,
                                CASE WHEN a.year='2020' THEN a.avg_risk_rating ELSE 0 END AS yr_2020,
                                CASE WHEN a.year='2021' THEN a.avg_risk_rating ELSE 0 END AS yr_2021
                            FROM (
                                SELECT 
                                    year(inception_date) AS Year,
                                    fund_long_name,
                                    ROUND(AVG(morningstar_risk_rating), 2) AS avg_risk_rating
                                FROM mutualfunds m
                                GROUP BY fund_long_name, year(inception_date)
                                ORDER BY fund_long_name, year(inception_date) DESC
                            ) a
                        ) b
                        GROUP BY b.fund_long_name
                    ) c
                    GROUP BY c.fund_long_name 
                ) risk ON rating.fund_long_name = risk.fund_long_name
            ) e
            GROUP BY e.row_number, e.fund_long_name, e.avg_rating, e.avg_risk_rating, e.avg_return_rating
        ) f
        GROUP BY f.fund_long_name, f.Performance, f.RiskvsReturnMatrix
    ) g
    GROUP BY g.Performance
) h;

-- Query 6: 查询 6:业务审查的风险与回报矩阵分析。此查询分析各种共同基金的风险与回报矩阵,以帮助业务审查团队了解绩效趋势。重点是确定属于四类的基金:高风险高回报、低风险高回报、高风险低回报以及低风险低回报。业务推断是顶级组是低-低和高-高,检查点是评估低-低基金的数量,以确定对于普通客户来说低多少。
-- Risk vs Return Matrix analysis for business review

CREATE DYNAMIC TABLE IF NOT EXISTS risk_vs_return_matrix_analysis_for_business_review AS
SELECT
    'The Comparison is as follows :- '||
    MAX(h.High_High) AS High_High,
    MAX(h.Low_High) AS Low_High,
    MAX(h.High_Low) AS High_Low,
    MAX(h.Low_Low) AS Low_Low
FROM (
    SELECT 
        g.RiskvsReturnMatrix,
        SUM(CASE WHEN g.RiskvsReturnMatrix='High-Risk and High-Return' THEN 1 ELSE 0 END) AS High_High,
        SUM(CASE WHEN g.RiskvsReturnMatrix='Low-Risk and High-Return' THEN 1 ELSE 0 END) AS Low_High,
        SUM(CASE WHEN g.RiskvsReturnMatrix='High-Risk and Low-Return' THEN 1 ELSE 0 END) AS High_Low,
        SUM(CASE WHEN g.RiskvsReturnMatrix='Low-Risk and Low-Return' THEN 1 ELSE 0 END) AS Low_Low
    FROM (
        SELECT
            f.fund_long_name,
            f.Performance,
            f.RiskvsReturnMatrix
        FROM (
            SELECT
                e.row_number,
                e.fund_long_name,
                CASE 
                    WHEN e.avg_rating > AVG(e.avg_rating) OVER (ORDER BY e.row_number ASC) THEN 'Performing fund' 
                    ELSE 'Non-Performing fund' 
                END AS Performance,
                CASE 
                    WHEN e.avg_risk_rating > AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                         AND e.avg_return_rating > AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'High-Risk and High-Return'
                    WHEN e.avg_risk_rating >= AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                         AND e.avg_return_rating <= AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'High-Risk and Low-Return'
                    WHEN e.avg_risk_rating < AVG(e.avg_risk_rating) OVER (ORDER BY e.row_number ASC) 
                         AND e.avg_return_rating < AVG(e.avg_return_rating) OVER (ORDER BY e.row_number ASC) THEN 'Low-Risk and Low-Return'
                    ELSE 'Low-Risk and High-Return' 
                END AS RiskvsReturnMatrix
            FROM (
                SELECT
                    ROW_NUMBER() OVER (ORDER BY rating.fund_long_name ASC) AS row_number,
                    rating.fund_long_name,
                    rating.avg_rating,
                    return.avg_return_rating,
                    ROUND(risk.avg_risk_rating, 2) AS avg_risk_rating
                FROM (
                    SELECT 
                        c.fund_long_name,
                        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_rating
                    FROM (
                        SELECT 
                            b.fund_long_name,
                            MAX(b.yr_2018) AS yr_18,
                            MAX(b.yr_2019) AS yr_19,
                            MAX(b.yr_2020) AS yr_20,
                            MAX(b.yr_2021) AS yr_21
                        FROM (
                            SELECT
                                a.fund_long_name,
                                CASE WHEN a.year='2018' THEN a.avg_rating ELSE 0 END AS yr_2018,
                                CASE WHEN a.year='2019' THEN a.avg_rating ELSE 0 END AS yr_2019,
                                CASE WHEN a.year='2020' THEN a.avg_rating ELSE 0 END AS yr_2020,
                                CASE WHEN a.year='2021' THEN a.avg_rating ELSE 0 END AS yr_2021
                            FROM (
                                SELECT 
                                    year(inception_date) AS Year,
                                    fund_long_name,
                                    ROUND(AVG(morningstar_overall_rating), 2) AS avg_rating
                                FROM mutualfunds m
                                GROUP BY fund_long_name, year(inception_date)
                                ORDER BY fund_long_name, year(inception_date) DESC
                            ) a
                        ) b
                        GROUP BY b.fund_long_name
                    ) c
                    GROUP BY c.fund_long_name 
                ) rating
                JOIN (
                    SELECT 
                        c.fund_long_name,
                        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_return_rating
                    FROM (
                        SELECT 
                            b.fund_long_name,
                            MAX(b.yr_2018) AS yr_18,
                            MAX(b.yr_2019) AS yr_19,
                            MAX(b.yr_2020) AS yr_20,
                            MAX(b.yr_2021) AS yr_21
                        FROM (
                            SELECT
                                a.fund_long_name,
                                CASE WHEN a.year='2018' THEN a.avg_return_rating ELSE 0 END AS yr_2018,
                                CASE WHEN a.year='2019' THEN a.avg_return_rating ELSE 0 END AS yr_2019,
                                CASE WHEN a.year='2020' THEN a.avg_return_rating ELSE 0 END AS yr_2020,
                                CASE WHEN a.year='2021' THEN a.avg_return_rating ELSE 0 END AS yr_2021
                            FROM (
                                SELECT 
                                    year(inception_date) AS Year,
                                    fund_long_name,
                                    ROUND(AVG(morningstar_return_rating), 2) AS avg_return_rating
                                FROM mutualfunds m
                                GROUP BY fund_long_name, year(inception_date)
                                ORDER BY fund_long_name, year(inception_date) DESC
                            ) a
                        ) b
                        GROUP BY b.fund_long_name
                    ) c
                    GROUP BY c.fund_long_name 
                ) return ON rating.fund_long_name = return.fund_long_name
                JOIN (
                    SELECT 
                        c.fund_long_name,
                        SUM(c.yr_18 + c.yr_19 + c.yr_20 + c.yr_21) / 4 AS avg_risk_rating
                    FROM (
                        SELECT 
                            b.fund_long_name,
                            MAX(b.yr_2018) AS yr_18,
                            MAX(b.yr_2019) AS yr_19,
                            MAX(b.yr_2020) AS yr_20,
                            MAX(b.yr_2021) AS yr_21
                        FROM (
                            SELECT
                                a.fund_long_name,
                                CASE WHEN a.year='2018' THEN a.avg_risk_rating ELSE 0 END AS yr_2018,
                                CASE WHEN a.year='2019' THEN a.avg_risk_rating ELSE 0 END AS yr_2019,
                                CASE WHEN a.year='2020' THEN a.avg_risk_rating ELSE 0 END AS yr_2020,
                                CASE WHEN a.year='2021' THEN a.avg_risk_rating ELSE 0 END AS yr_2021
                            FROM (
                                SELECT 
                                    year(inception_date) AS Year,
                                    fund_long_name,
                                    ROUND(AVG(morningstar_risk_rating), 2) AS avg_risk_rating
                                FROM mutualfunds m
                                GROUP BY fund_long_name, year(inception_date)
                                ORDER BY fund_long_name, year(inception_date) DESC
                            ) a
                        ) b
                        GROUP BY b.fund_long_name
                    ) c
                    GROUP BY c.fund_long_name 
                ) risk ON rating.fund_long_name = risk.fund_long_name
            ) e
            GROUP BY e.row_number, e.fund_long_name, e.avg_rating, e.avg_risk_rating, e.avg_return_rating
        ) f
        GROUP BY f.fund_long_name, f.Performance, f.RiskvsReturnMatrix
    ) g
    GROUP BY g.RiskvsReturnMatrix
) h;

-- 查询 7:最高周转率分析
-- 此查询分析按基金全名、基金类别、投资类型和规模类型分类的共同基金的最大年度持股周转率。目标是确定哪个基金或类别的周转率最高,从而深入了解每个细分市场的流动性和活动水平。

-- 关键术语:

-- 基金全名:共同基金的扩展名称。
-- 基金类别:根据基金的投资策略或资产类别对基金进行的分类。
-- 投资类型:基金使用的投资方法类型。
-- 规模类型:根据基金规模对基金进行的分类。
-- Highest turnover analysis
CREATE DYNAMIC TABLE IF NOT EXISTS highest_turnover_analysis AS
WITH turnover_data AS (
    SELECT
        fund_long_name,
        fund_category,
        investment_type,
        size_type,
        SUM(annual_holdings_turnover) AS annual_holdings_turnover
    FROM mutualfunds
    GROUP BY fund_long_name, fund_category, investment_type, size_type
)

SELECT
    'MAX turnover by fund_long_name :- ' AS Category_Name,
    fund_long_name AS title,
    MAX(annual_holdings_turnover) AS value
FROM turnover_data
GROUP BY fund_long_name

UNION ALL

SELECT
    'MAX turnover by fund_category :- ' AS Category_Name,
    fund_category AS title,
    MAX(annual_holdings_turnover) AS value
FROM turnover_data
GROUP BY fund_category

UNION ALL

SELECT
    'MAX turnover by investment_type :- ' AS Category_Name,
    investment_type AS title,
    MAX(annual_holdings_turnover) AS value
FROM turnover_data
GROUP BY investment_type

UNION ALL

SELECT
    'MAX turnover by size_type :- ' AS Category_Name,
    size_type AS title,
    MAX(annual_holdings_turnover) AS value
FROM turnover_data
GROUP BY size_type;

-- 查询 8:费用项目趋势
-- 这些查询分析并总结了不同时间范围内(3 年、5 年和 10 年)共同基金的费用预测。此分析有助于了解费用趋势并识别具有重大费用预测的基金家族。

-- Average Expense Projections:

CREATE DYNAMIC TABLE IF NOT EXISTS average_expense_projections AS
SELECT 
    ROUND(AVG(fund_year3_expense_projection), 2) AS avg_expe_3yrs,
    ROUND(AVG(fund_year5_expense_projection), 2) AS avg_expe_5yrs,
    ROUND(AVG(fund_year10_expense_projection), 2) AS avg_expe_10yrs
FROM mutualfunds m;

-- Expense Trend Analysis by Fund Family:

CREATE DYNAMIC TABLE IF NOT EXISTS expense_trend_analysis_by_fund_family AS
WITH expense AS (
    SELECT 
        m.fund_family,
        SUM(CASE WHEN m.fund_year3_expense_projection > 494.35 THEN 1 ELSE 0 END) AS Expense_Project_Trend_3y,
        SUM(CASE WHEN m.fund_year5_expense_projection > 799.61 THEN 1 ELSE 0 END) AS Expense_Project_Trend_5y,
        SUM(CASE WHEN m.fund_year10_expense_projection > 1636.06 THEN 1 ELSE 0 END) AS Expense_Project_Trend_10y 
    FROM mutualfunds m
    GROUP BY m.fund_family
)

SELECT 
    Expense_Year,
    MAX(thitreenDActivist) AS thitreenDActivist,
    MAX(threesixtyoneFunds) AS threesixtyoneFunds,
    MAX(AAM) AS AAM,
    MAX(AAMA) AS AAMA,
    MAX(ABR) AS ABR
FROM (
    SELECT
        '03 years' AS Expense_Year,
        CASE WHEN e.fund_family = '13D Activist Fund' THEN Expense_Project_Trend_3y ELSE 0 END AS thitreenDActivist,
        CASE WHEN e.fund_family = '361 Funds' THEN Expense_Project_Trend_3y ELSE 0 END AS threesixtyoneFunds,
        CASE WHEN e.fund_family = 'AAM' THEN Expense_Project_Trend_3y ELSE 0 END AS AAM,
        CASE WHEN e.fund_family = 'AAMA' THEN Expense_Project_Trend_3y ELSE 0 END AS AAMA,
        CASE WHEN e.fund_family = 'ABR' THEN Expense_Project_Trend_3y ELSE 0 END AS ABR
    FROM expense e

    UNION ALL

    SELECT
        '05 years' AS Expense_Year,
        CASE WHEN e.fund_family = '13D Activist Fund' THEN Expense_Project_Trend_5y ELSE 0 END AS thitreenDActivist,
        CASE WHEN e.fund_family = '361 Funds' THEN Expense_Project_Trend_5y ELSE 0 END AS threesixtyoneFunds,
        CASE WHEN e.fund_family = 'AAM' THEN Expense_Project_Trend_5y ELSE 0 END AS AAM,
        CASE WHEN e.fund_family = 'AAMA' THEN Expense_Project_Trend_5y ELSE 0 END AS AAMA,
        CASE WHEN e.fund_family = 'ABR' THEN Expense_Project_Trend_5y ELSE 0 END AS ABR
    FROM expense e

    UNION ALL

    SELECT
        '10 years' AS Expense_Year,
        CASE WHEN e.fund_family = '13D Activist Fund' THEN Expense_Project_Trend_10y ELSE 0 END AS thitreenDActivist,
        CASE WHEN e.fund_family = '361 Funds' THEN Expense_Project_Trend_10y ELSE 0 END AS threesixtyoneFunds,
        CASE WHEN e.fund_family = 'AAM' THEN Expense_Project_Trend_10y ELSE 0 END AS AAM,
        CASE WHEN e.fund_family = 'AAMA' THEN Expense_Project_Trend_10y ELSE 0 END AS AAMA,
        CASE WHEN e.fund_family = 'ABR' THEN Expense_Project_Trend_10y ELSE 0 END AS ABR
    FROM expense e
) AS combined
GROUP BY Expense_Year
ORDER BY Expense_Year ASC;

-- Year-over-Year Fund Expense Comparison:

CREATE DYNAMIC TABLE IF NOT EXISTS yoy_fund_expense_comparison AS
SELECT
    'fund expense' AS fund_expense_YoY,
    m.total_funds_2019,
    a.total_funds_2020
FROM (
    SELECT
        1 AS col,
        '2019' AS Year,
        SUM(fund_return_2019_q1 + fund_return_2019_q2 + fund_return_2019_q3 + fund_return_2019_q4) AS total_funds_2019
    FROM mutualfunds
) m

JOIN (
    SELECT * FROM (
        SELECT
            1 AS col,
            '2020' AS Year,
            SUM(fund_return_2020_q1 + fund_return_2020_q2 + fund_return_2020_q3 + fund_return_2020_q4) AS total_funds_2020
        FROM mutualfunds
    ) b
) a ON a.col = m.col;

数据标注

这里是数据工程师的美德了!小伙伴们需要更好的理解数据,大模型也同样需要。

ALTER DYNAMIC TABLE the_highest_and_lowest_average_ratings_over_years 
SET COMMENT '四年(2018-2021 年)最高和最低平均评级 此查询的目的是计算每个基金类别的平均同比 (YoY) 评级。它确定了四年(2018-2021 年)平均评级最高和最低的类别。这可以提供有价值的见解,了解哪些类别表现最佳,哪些类别表现不佳,从而有助于战略决策和投资分析。';

ALTER DYNAMIC TABLE average_yoy_rating_for_funds 
SET COMMENT '每个基金的平均同比 (YoY) 评分 此查询根据基金的 fund_long_name 计算每个基金的平均同比 (YoY) 评分。它汇总了每年(2018、2019、2020 和 2021)的评分,并计算这些年的平均评分。从此分析得出的业务推论包括确定平均同比评分最高和最低的基金类别、突出显示不一致之处和趋势以及确定评分为 0 的类别。此信息对于了解各种基金类别随时间推移的业绩动态至关重要。';

ALTER DYNAMIC TABLE comparison_table_for_key_rating_kpis_yoy 
SET COMMENT '关键评级 KPI 同比 (YoY) 比较表 此查询生成关键评级 KPI 同比 (YoY) 比较表。KPI 包括:平均评级 (avg_rating):衡量基金组合的整体表现。回报评级 (avg_return_rating):评估基金产生的产出或回报程度。风险评级 (avg_risk_rating):评估与基金相关的风险程度。查询根据基金的 fund_long_name 汇总每个基金的这些 KPI,并计算 2018、2019、2020 和 2021 年的平均值。';

ALTER DYNAMIC TABLE categorizing_funds_based_on_their_performance_and_risk_return_matrix 
SET COMMENT '根据基金的表现和风险回报矩阵对基金进行分类 此查询旨在通过根据基金的表现和风险回报矩阵对基金进行分类来确定绩效趋势。它有助于业务团队了解不同基金的表现及其相关的风险和回报特征。 评级:反映基金的表现。 风险评级:表示与基金相关的风险程度。 回报评级:表示基金的产出或回报程度。 该查询根据基金的平均评级将基金分为“表现良好”和“表现不佳”。它还将基金分为四个风险回报类别:高风险高回报、高风险低回报、低风险低回报和低风险高回报。';

ALTER DYNAMIC TABLE performance_and_risk_vs_return_matrix_analysis 
SET COMMENT '绩效和风险与回报矩阵分析 此查询评估各种共同基金的绩效和风险回报状况,以进行业务审查。它比较了表现良好和表现不佳的基金的数量,并将它们归类为不同的风险回报矩阵。';

ALTER DYNAMIC TABLE risk_vs_return_matrix_analysis_for_business_review 
SET COMMENT '业务审查的风险与回报矩阵分析。此查询分析各种共同基金的风险与回报矩阵,以帮助业务审查团队了解绩效趋势。重点是确定属于四类的基金:高风险高回报、低风险高回报、高风险低回报以及低风险低回报。业务推断是顶级组是低-低和高-高,检查点是评估低-低基金的数量,以确定对于普通客户来说低多少。';

ALTER DYNAMIC TABLE highest_turnover_analysis 
SET COMMENT '最高周转率分析,分析按基金全名、基金类别、投资类型和规模类型分类的共同基金的最大年度持股周转率。目标是确定哪个基金或类别的周转率最高,从而深入了解每个细分市场的流动性和活动水平。';

ALTER DYNAMIC TABLE average_expense_projections 
SET COMMENT '费用项目趋势-Average Expense Projections,分析并总结了不同时间范围内(3 年、5 年和 10 年)共同基金的费用预测。此分析有助于了解费用趋势并识别具有重大费用预测的基金家族。';

ALTER DYNAMIC TABLE expense_trend_analysis_by_fund_family 
SET COMMENT '费用项目趋势-Expense Trend Analysis by Fund Family,分析并总结了不同时间范围内(3 年、5 年和 10 年)共同基金的费用预测。此分析有助于了解费用趋势并识别具有重大费用预测的基金家族。';

ALTER DYNAMIC TABLE yoy_fund_expense_comparison 
SET COMMENT '费用项目趋势-Year-over-Year Fund Expense Comparison,分析并总结了不同时间范围内(3 年、5 年和 10 年)共同基金的费用预测。此分析有助于了解费用趋势并识别具有重大费用预测的基金家族。';

执行完上述代码,看下小伙伴们是不是就更好理解你的数据产出啦(你不是一个人在战斗!):

image.png

刷新动态表

当第一次建好动态表或者之后在底表数据发生变化时,可以通过执行如下语句刷新动态表里的数据:

image.png

可以看到在云器Lakehouse执行上述10个动态表的结果刷新,仅用了1.6秒。而这10个刷新动作的底表有7千多万行数据,之前担心执行性能的问题,发现也不是问题嘞。

set cz.optimizer.mv.auto.unique.key.enabled = false;
refresh dynamic table the_highest_and_lowest_average_ratings_over_years;
refresh dynamic table average_yoy_rating_for_funds;
refresh dynamic table comparison_table_for_key_rating_kpis_yoy;
refresh dynamic table categorizing_funds_based_on_their_performance_and_risk_return_matrix;
refresh dynamic table performance_and_risk_vs_return_matrix_analysis;
refresh dynamic table risk_vs_return_matrix_analysis_for_business_review;
refresh dynamic table highest_turnover_analysis;
refresh dynamic table average_expense_projections;
refresh dynamic table expense_trend_analysis_by_fund_family;
refresh dynamic table yoy_fund_expense_comparison;

也可以周期性的调度上述代码:

image.png

通过设置每次调度间隔时长为1分钟,这样就可以将动态表的数据新鲜度保持在分钟级了,也就是说当底表数据发生变化1分钟后,结果表的结果就会自动被刷新了。

关于云器Lakehouse动态表详细解释,参考:https://www.yunqi.tech/documents/dynamic_table_summary

查看结果

image.png

一次查询10个动态表里的结果,仅用了656毫秒。

性能体验

image.png

同样的数据在Notebook里运行需要31秒,迁移到云器Lakehouse后,计算和分析的过程仅需要2.3秒,从性能体验的角度看提升了13.5倍,这对数据工作人员而言,确实能够大幅降低时间成本,提高工作效率。

总结

好的数据分析,好的AI效果,总是离不开好的数据。而数据工程又是个繁琐、耗时、需要很大耐心的工作。通过上述实验,将“SQL Mutual Fund Performance”那些复杂的SQL代码迁移到云器Lakehouse的心得总结如下:

  1. 云器Lakehouse SQL语法高度兼容Pandas SQL,迁移过程基本上是个粘贴拷贝的过程,只有一个微小的函数语法改动。
  2. 云器Lakehouse的SQL IDE提供了版本管理、测试、调度和发布集成环境,提高了开发效率。
  3. 云器Lakehouse的动态表技术大幅提升了在大数据量下的性能体验,对比Notebook运行有13.5倍的性能提升。
  4. 因为简单,所以美好。整个实验都是在云器Lakehouse上完成,这给整个实验带来了简单美的体验。相信因为简单,可以让更多数据爱好者更顺利的玩转数据。

Untitled

云器Lakehouse现已开放注册
欢迎申请体验,每个账号开通会获赠一定金额的代金券,助您快速试用体验。如需更多代金券额度,请您联系商务获取。
预约咨询
微信咨询
电话咨询