Databricks Unity Catalog 联邦查询实践

云器 Lakehouse 通过 External Catalog 直接查询 Databricks Unity Catalog 中的表,数据留在 Databricks 的 S3 存储里不动,Lakehouse 负责 SQL 执行和结果返回。本文以 AWS 环境为例,记录从零配通的完整操作路径。


前置条件

  • Databricks 工作区:支持 Unity Catalog(Free Edition 已支持)
  • 云器 Lakehouse 实例:与 Databricks 数据存储(S3)在同一云平台(均为 AWS)
  • 工具:cz-cli,需提前配好对应 AWS profile

涉及的 SQL 命令

命令用途
CREATE CATALOG CONNECTION
CREATE CATALOG CONNECTION
存储 Databricks OAuth M2M 认证信息
CREATE EXTERNAL CATALOG
CREATE EXTERNAL CATALOG
创建指向 Databricks Unity Catalog 的外部目录
SHOW SCHEMAS IN
SHOW SCHEMAS IN
列出 Databricks Catalog 中的 Schema
SHOW TABLES IN
SHOW TABLES IN
列出 Schema 中的表
SELECT
SELECT
查询 Databricks 表数据

Databricks 侧配置

创建 Service Principal

打开

https://accounts.cloud.databricks.com
https://accounts.cloud.databricks.com
User managementService principalsAdd service principal,起一个便于识别的名称(如
lakehouse_connector
lakehouse_connector
)。

创建后点击 SP 名称进入详情页,依次完成:

  1. Roles tab → 开启 Account admin
  2. Principal information tab → 记录 Application ID(即后续用到的
    CLIENT_ID
    CLIENT_ID
    ,格式为 UUID)
  3. Credentials & secrets tab → Generate secret → 记录生成的完整 Secret(即后续用到的
    CLIENT_SECRET
    CLIENT_SECRET

将 SP 加入 Workspace

在 Databricks Workspace → SettingsIdentity and accessService PrincipalsAdd service principal,将刚创建的 SP 加进来。

开启 Metastore External Data Access

在 Databricks Workspace → Catalog → 齿轮图标 → MetastoreDetails tab → 找到 External data access → 打开开关。

不开启此选项,查询数据时会报:

PermissionDenied: External Data Access from non Databricks Compute environment is disabled for metastore

授予 Catalog 和 Schema 权限

在 Databricks Catalog Explorer 里,给 SP 授予以下权限:

  • Catalog 层级
    USE CATALOG
    USE CATALOG
  • Schema 层级
    USE SCHEMA
    USE SCHEMA
    SELECT
    SELECT
    EXTERNAL USE SCHEMA
    EXTERNAL USE SCHEMA

如果你有 SQL 执行环境(Notebook 或 SQL Warehouse),也可以用 GRANT 命令完成授权(将

<application-id>
<application-id>
替换为 SP 的 Application ID,格式为 UUID):

GRANT USE CATALOG ON CATALOG workspace TO `<application-id>`; GRANT USE SCHEMA ON SCHEMA workspace.table_types_demo TO `<application-id>`; GRANT SELECT ON SCHEMA workspace.table_types_demo TO `<application-id>`; -- 联邦查询必须的权限 GRANT EXTERNAL USE SCHEMA ON SCHEMA workspace.table_types_demo TO `<application-id>`;


创建 Catalog Connection

CREATE CATALOG CONNECTION IF NOT EXISTS databricks_conn TYPE databricks HOST = 'https://<workspace-url>.cloud.databricks.com' CLIENT_ID = '<application-id>' CLIENT_SECRET = '<oauth-secret>' ACCESS_REGION = '<s3-bucket-region>';

验证连接:

SHOW CATALOG CONNECTIONS;


创建 External Catalog

CREATE EXTERNAL CATALOG IF NOT EXISTS databricks_catalog CONNECTION databricks_conn OPTIONS ('catalog' = '<databricks-catalog-name>');

catalog
catalog
的值是 Databricks Unity Catalog 里的 catalog 名称。在 Databricks Workspace → 左侧 Catalog 图标 → 展开左侧面板,My organization 下列出的即为可用 catalog(如
workspace
workspace
main
main
hive_metastore
hive_metastore
等)。

验证:

SHOW SCHEMAS IN databricks_catalog;

输出示例:

schema_name ----------- default information_schema table_types_demo


查询数据

-- 查看表列表 SHOW TABLES IN databricks_catalog.table_types_demo; -- 查询数据 SELECT * FROM databricks_catalog.table_types_demo.orders_external LIMIT 10;


联邦查询:跨平台 SQL 分析

联邦查询的核心价值是:Lakehouse 可以直接查询、关联 Databricks 的数据,也可以将 Databricks 数据写入 Lakehouse 本地表,全程无需数据搬迁。

场景一:Databricks 表之间 JOIN

orders_external
orders_external
(订单)和
customers_external
customers_external
(客户)通过
customer_id
customer_id
关联,统计每个客户的订单数和消费总额。两张表的
price
price
字段类型均为
DECIMAL(10,2)
DECIMAL(10,2)
,无需转换:

SELECT c.customer_name, c.country, c.loyalty_tier, COUNT(o.order_id) AS order_count, SUM(o.price) AS total_revenue FROM databricks_catalog.table_types_demo.orders_external o JOIN databricks_catalog.table_types_demo.customers_external c ON o.customer_id = c.customer_id GROUP BY c.customer_name, c.country, c.loyalty_tier ORDER BY total_revenue DESC;

查询结果:

customer_namecountryloyalty_tierorder_counttotal_revenue
Alice ChenChinaGold21698.99
Frank LiuChinaSilver1299.99
Carol ZhangChinaPlatinum2249.98
David LeeSingaporeBronze1129.99
Emma WangChinaSilver189.99

场景二:库存不足预警

inventory_delta
inventory_delta
记录各仓库库存(字段:
product_id
product_id
warehouse_location
warehouse_location
quantity_available
quantity_available
)。查询库存低于阈值的产品,分仓库汇总:

SELECT warehouse_location, COUNT(*) AS low_stock_products, MIN(quantity_available) AS min_stock, AVG(quantity_available) AS avg_stock FROM databricks_catalog.table_types_demo.inventory_delta WHERE quantity_available < 200 GROUP BY warehouse_location ORDER BY min_stock;

查询结果:

warehouse_locationlow_stock_productsmin_stockavg_stock
Warehouse A15050.0
Warehouse B275112.5

场景三:Databricks 数据写入 Lakehouse

联邦查询的结果可以直接写入 Lakehouse 本地表,实现数据汇聚。下面的

public
public
是 Lakehouse 本地 Schema(不是 Databricks 的),替换成你实际的本地 Schema 名称:

-- 将 Databricks 已完成订单抽取到 Lakehouse 本地表 CREATE TABLE public.orders_from_databricks AS SELECT order_id, customer_id, order_date, product_name, CAST(price AS DECIMAL(10,2)) AS price, status FROM databricks_catalog.table_types_demo.orders_external WHERE status = 'Delivered';

查询本地表,无网络开销:

SELECT * FROM public.orders_from_databricks;

order_idcustomer_idorder_dateproduct_namepricestatus
200610052026-05-20Webcam HD89.99Delivered
200110012026-05-15Laptop Pro1299.99Delivered
200410012026-05-18Monitor 27inch399.00Delivered

场景四:Dynamic Table 消费 Databricks 数据

将 Databricks 数据作为 Dynamic Table 的上游,定时聚合到 Lakehouse。

public
public
是 Lakehouse 本地 Schema,替换成你实际的本地 Schema 名称:

CREATE OR REPLACE DYNAMIC TABLE public.orders_daily_summary REFRESH INTERVAL 1 HOUR VCLUSTER DEFAULT COMMENT '从 Databricks 聚合的每日订单摘要' AS SELECT order_date, COUNT(*) AS order_count, SUM(CAST(price AS DECIMAL(10,2))) AS total_revenue, AVG(CAST(price AS DECIMAL(10,2))) AS avg_order_value FROM databricks_catalog.table_types_demo.orders_external GROUP BY order_date;

首次刷新后查询:

REFRESH DYNAMIC TABLE public.orders_daily_summary; SELECT * FROM public.orders_daily_summary ORDER BY order_date;

order_dateorder_counttotal_revenueavg_order_value
2026-05-1511299.991299.99
2026-05-17149.9949.99
2026-05-181399.00399.00
2026-05-191129.99129.99
2026-05-20189.9989.99
2026-06-042499.98249.99

支持的表类型

并非所有 Databricks 表都能从 Lakehouse 查询,支持情况取决于表的存储类型:

表类型格式支持说明
TABLE_DELTA_EXTERNAL
TABLE_DELTA_EXTERNAL
Delta完全支持,推荐
TABLE_DELTA
TABLE_DELTA
Delta完全支持
TABLE_EXTERNAL
TABLE_EXTERNAL
(Delta 格式)
Delta支持
TABLE_EXTERNAL
TABLE_EXTERNAL
(Parquet/CSV/JSON)
非 Delta
unsupported databricks table format
unsupported databricks table format
TABLE_DB_STORAGE
TABLE_DB_STORAGE
Managed Delta不支持跨平台访问
VIEW
VIEW
driver 兼容性问题

关键结论:只有 Delta 格式的表支持联邦查询,无论是 External 还是普通 Delta 表。Parquet、CSV、JSON 格式的外部表目前不支持。

在 Databricks 中建表时,优先使用 Delta 格式:

-- 推荐:Delta External Table CREATE TABLE catalog.schema.my_table USING DELTA LOCATION 's3://my-bucket/my-table/'; -- 不推荐(Lakehouse 无法查询) CREATE TABLE catalog.schema.my_table USING PARQUET LOCATION 's3://my-bucket/my-table/';


常见错误排查

invalid_client
invalid_client

OAuth 认证失败。按顺序检查:

  1. SP 是否已在 Account Console → Roles 里开启 Account admin
  2. SP 是否已在 Workspace → SettingsService Principals 里添加
  3. CLIENT_SECRET
    CLIENT_SECRET
    是否是完整值(不是
    50db****7f61
    50db****7f61
    这样的脱敏值)——去 Account Console → SP → Credentials & secrets 重新生成一个
  4. Secret 是否已过期——在 Credentials & secrets 页面查看
    Expires at
    Expires at
    字段,过期后需生成新 Secret 并重新执行
    CREATE CATALOG CONNECTION
    CREATE CATALOG CONNECTION

PermissionDenied: External Data Access ... is disabled
PermissionDenied: External Data Access ... is disabled

Metastore 未开启外部数据访问。在 Databricks Workspace → Catalog → 齿轮 → Metastore → External data access → 开启。

PermissionDenied: User does not have USE CATALOG
PermissionDenied: User does not have USE CATALOG

SP 没有 Catalog 访问权限。在 Databricks Catalog Explorer 里找到对应 Catalog → Permissions → Grant → 给 SP 加

USE CATALOG
USE CATALOG

PermissionDenied: User does not have EXTERNAL USE SCHEMA
PermissionDenied: User does not have EXTERNAL USE SCHEMA

SP 没有 Schema 的外部访问权限。在 Databricks Catalog Explorer → Schema → Permissions → Grant → 给 SP 加

USE SCHEMA
USE SCHEMA
SELECT
SELECT
EXTERNAL USE SCHEMA
EXTERNAL USE SCHEMA

NotFound: Catalog 'main' does not exist
NotFound: Catalog 'main' does not exist

OPTIONS ('catalog' = ...)
OPTIONS ('catalog' = ...)
填写的 catalog 名称不存在。打开 Databricks Workspace → Catalog 面板,查看实际的 catalog 名称。

查询超时(300 秒)或
PermanentRedirect
PermanentRedirect

ACCESS_REGION
ACCESS_REGION
填错了,S3 请求被重定向。检查表的实际存储位置(Catalog Explorer → 表详情 → Storage Location),确认 S3 bucket 所在 region 后重建 Catalog Connection。

unsupported databricks table format {} [PARQUET/CSV/JSON]
unsupported databricks table format {} [PARQUET/CSV/JSON]

该表使用的是非 Delta 格式,目前不支持从 Lakehouse 查询。在 Databricks 中用 Delta 格式重建该表,或将数据转换为 Delta 表。

Table cannot be accessed from outside of Databricks Compute Environment ... kind being TABLE_DB_STORAGE
Table cannot be accessed from outside of Databricks Compute Environment ... kind being TABLE_DB_STORAGE

该表是 Databricks Managed Table(托管表),数据存在 Databricks 控制的存储里,不支持跨平台直接访问。需在 Databricks 中将其转为 External Table 后才能查询。


注意事项

注意点说明
云平台限制Databricks 的 S3 存储必须与 Lakehouse 实例在同一云平台(均为 AWS)。Databricks 在 GCP/Azure 时无法与 AWS Lakehouse 互联
region 一致性
ACCESS_REGION
ACCESS_REGION
必须与 S3 bucket 所在 region 一致,不是 workspace region
只读限制External Catalog 是只读的——不支持从 Lakehouse 向 Databricks 写入数据(INSERT/UPDATE/DELETE)。反向操作(将 Databricks 数据写入 Lakehouse)完全支持,见场景三
版本要求需要支持 Unity Catalog 的版本。Free Edition 已支持;Community Edition 不支持
Secret 保存OAuth Secret 只在生成时显示完整值,请立即保存;遗失需重新生成

相关文档

联系我们
预约咨询
微信咨询
电话咨询
邮件咨询