使用COPY INTO导出文件

将表或查询结果导出到指定路径。您可以将一张表或查询结果导出为文件并保存到 Volume 的指定路径。请注意,如果指定路径下已存在同名文件,则该文件将会被覆盖。

注意事项

  • 默认文件名:Lakehouse 导出时,默认文件名是相同的。因此,如果您多次运行导出操作并且指定了相同的子目录(SUBDIRECTORY),新文件将覆盖上次导出的文件。
  • 避免覆盖:为了防止意外覆盖现有文件,请确保每次导出时使用唯一的文件名或子目录。

语法

---语法
COPY INTO { VOLUME external_volume_name | TABLE VOLUME table_name  | USER VOLUME }  
SUBDIRECTORY '<path>'
FROM { [schema.]<table_name> | ( <query> ) }
FILE_FORMAT = ( TYPE = { CSV  | PARQUET | JSON }  [ formatTypeOptions ]  ) 
[ copyOptions ]

参数说明

  • COPY INTO:表示向目录中追加数据

  • VOLUME支持external_volume_name:具体介绍参考Volume介绍

    • external_volume_name: 客户指定的外部存储位置,云器仅保留路径元信息。支持的存储产品有:阿里云 OSS 、腾讯云 COS 、 亚马逊云 S3。创建过程可以参考CONNECTION创建VOLUME创建
    • Table Volume:Lakehouse内部存储,与内表对象共同存储与指定Schema路径下
    • User Volume:用户账号关联的文件存储区域,Workspace User默认对该区域具备管理权限。每个Workspace都默认拥有一个具备管理权限的User Volume
      --1.将数据导入到external_volume中。前提必须创建external_volume
      COPY INTO VOLUME my_volume SUBDIRECTORY 'dau_unload/'
      FROM TABLE birds
      file_format = (type = CSV);
      --查看volume中的文件
      --show volume directory
      SHOW VOLUME DIRECTORY my_volume SUBDIRECTORY 'dau_unload/';
      +--------------------------+----------------------------------------------------------------+------+---------------------+
      |      relative_path       |                              url                               | size | last_modified_time  |
      +--------------------------+----------------------------------------------------------------+------+---------------------+
      | dau_unload/part00001.csv | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.csv | 105  | 2024-12-27 16:49:01 |
      +--------------------------+----------------------------------------------------------------+------+---------------------+
      --2.将数据导入到user volume中
      COPY INTO USER VOLUME SUBDIRECTORY 'dau_unload/'
      FROM TABLE birds
      file_format = (type = CSV);
      SHOW  USER VOLUME DIRECTORY  SUBDIRECTORY 'dau_unload/';
      +--------------------------+-------------------------------------------------------------------------------------------------------------------+------+---------------------+
      |      relative_path       |                                                        url                                                        | size | last_modified_time  |
      +--------------------------+-------------------------------------------------------------------------------------------------------------------+------+---------------------+
      | dau_unload/part00001.csv | oss://lakehouse-hz-uat/86/workspaces/qingyun_6552637055735457988/internal_volume/user_13/dau_unload/part00001.csv | 105  | 2024-12-27 16:52:06 |
      +--------------------------+-------------------------------------------------------------------------------------------------------------------+------+---------------------+
      --删除user volume中的文件
      REMOVE  VOLUME my_volume  FILE 'dau_unload/part00001.csv'
      
      --3.将数据导入到table volume中
      COPY INTO TABLE VOLUME birds  SUBDIRECTORY 'dau_unload/'
      FROM TABLE birds
      file_format = (type = CSV);
      SHOW  TABLE VOLUME  DIRECTORY  birds SUBDIRECTORY 'dau_unload/';
      +--------------------------+-------------------------------------------------------------------------------------------------------------------------------------+------+---------------------+
      |      relative_path       |                                                                 url                                                                 | size | last_modified_time  |
      +--------------------------+-------------------------------------------------------------------------------------------------------------------------------------+------+---------------------+
      | dau_unload/part00001.csv | oss://lakehouse-hz-uat/86/workspaces/qingyun_6552637055735457988/internal_volume/table_6808843871866704121/dau_unload/part00001.csv | 105  | 2024-12-27 16:57:54 |
      +--------------------------+-------------------------------------------------------------------------------------------------------------------------------------+------+---------------------+
      --删除table volume中的文件
      REMOVE TABLE VOLUME birds  FILE 'dau_unload/part00001.csv';

  • SUBDIRECTORY:指定子路径,参数必须填写。例如:`subdirectory 'month=02'

  • FROM:支持直接导出表中数据,直接直接写query查询

 --直接指定表
 COPY INTO VOLUME my_volume SUBDIRECTORY 'dau_unload/'
 FROM TABLE birds
 file_format = (type = CSV);
 --将SQL结果集导出
 COPY INTO VOLUME my_volume SUBDIRECTORY 'dau_unload/'
 FROM  (select * from birds limit 1)
 file_format = (type = CSV);
  • formatTypeOptions:文件格式,支持CSV、TEXT、PARQUET、JSON。其中JSON导出的格式是JSON LINE

  • CSV 格式支持的参数:

    • sep:列分隔符,默认为 ”,“。最多支持长度为 1 的字符,例如:sep=','
    • compression:配置文件压缩格式。支持的压缩格式有:gzip/zstd/zlib。例如:compression='gzip'
    • lineSep:行分隔符,默认值为"\n"。最多支持长度为 2 的字符,例如:lineSep='$'
    • quote:设置用于转义引号值的单个字符。默认值为双引号“"“,例如:quote='"'
    • header:是否解析表头,默认为 false。布尔类型,例如:header='true'
    • timeZone:配置时区,没有默认值。用于指定文件中时间格式的时区。例如:timeZone = 'Asia/Shanghai'
    • escape:用于转义已加引号的值中的引号,默认值为”\“,例如:escape='\'
    • nullValue: 用于判定什么样的内容应该是 Null,默认值为 `""`,例如 nullValue='\\N'或者nullValue=r'\N'。使用r可以不需要转译字符,参考正则转译
    • multiLine: 是否有跨行的 csv 记录,默认值为 false,如果有这样的记录,需要配置 multiLine='true'
    --指定分隔符为|和压缩
    COPY INTO VOLUME my_volume  SUBDIRECTORY 'dau_unload/'
    FROM TABLE birds
    file_format = (
        type= CSV
        sep='|'
        compression='gzip'
    );
  • JSON格式支持的参数:

    • compression: 源文件/目标文件是否压缩,默认不压缩,配置如 'compression'='gzip'
    • COPY INTO VOLUME my_volume  SUBDIRECTORY 'dau_unload/'
      FROM TABLE birds
      file_format = (
          type= json
      );
  • Parquet , ORC , BSON 格式:无

  • COPY INTO VOLUME my_volume  SUBDIRECTORY 'dau_unload/'
    FROM TABLE birds
    file_format = (
        type= parquet
    );
  • copyOptions

    • filename_prefix = '<prefex_name>'。可选参数。设置文件前缀,例如:filename_prefix = 'my_prefix_'
     --给文件添加前缀
     COPY INTO VOLUME my_volume  SUBDIRECTORY 'dau_unload/'
     FROM TABLE birds
     file_format = (
         type= json
       
     )
     filename_prefix='birds'
     ;
     --查看目录,如第一行添加了前缀birds
     SHOW VOLUME DIRECTORY my_volume SUBDIRECTORY 'dau_unload/';
    
     +-------------------------------+---------------------------------------------------------------------+------+---------------------+
     |         relative_path         |                                 url                                 | size | last_modified_time  |
     +-------------------------------+---------------------------------------------------------------------+------+---------------------+
     | dau_unload/birds00001.json    | oss://lakehouse-perf-test/test_insert/dau_unload/birds00001.json    | 295  | 2024-12-27 17:29:20 |
     | dau_unload/part00001.csv      | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.csv      | 105  | 2024-12-27 17:15:40 |
     | dau_unload/part00001.csv.gzip | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.csv.gzip | 104  | 2024-12-27 17:19:33 |
     | dau_unload/part00001.json     | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.json     | 295  | 2024-12-27 17:24:26 |
     | dau_unload/part00001.parquet  | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.parquet  | 1886 | 2024-12-27 17:27:15 |
     | dau_unload/part00001.text     | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.text     | 87   | 2024-12-27 17:25:34 |
     +-------------------------------+---------------------------------------------------------------------+------+---------------------+
    • filename_suffix = '<suffix>'。参数。设置文件后缀,例如:filename_suffix = '.data'
          --添加后缀
          COPY INTO VOLUME my_volume  SUBDIRECTORY 'dau_unload/'
          FROM TABLE birds
          file_format = (
              type= json
            
          )
          filename_suffix='.data';
          --查看目录,如第四行添加了前缀birds
          SHOW VOLUME DIRECTORY my_volume SUBDIRECTORY 'dau_unload/';
          +-------------------------------+---------------------------------------------------------------------+------+---------------------+
          |         relative_path         |                                 url                                 | size | last_modified_time  |
          +-------------------------------+---------------------------------------------------------------------+------+---------------------+
          | dau_unload/birds00001.json    | oss://lakehouse-perf-test/test_insert/dau_unload/birds00001.json    | 295  | 2024-12-27 17:29:20 |
          | dau_unload/part00001.csv      | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.csv      | 105  | 2024-12-27 17:15:40 |
          | dau_unload/part00001.csv.gzip | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.csv.gzip | 104  | 2024-12-27 17:19:33 |
          | dau_unload/part00001.data     | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.data     | 295  | 2024-12-27 17:33:49 |
          | dau_unload/part00001.json     | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.json     | 295  | 2024-12-27 17:24:26 |
          | dau_unload/part00001.parquet  | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.parquet  | 1886 | 2024-12-27 17:27:15 |
          | dau_unload/part00001.text     | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.text     | 87   | 2024-12-27 17:25:34 |
          +-------------------------------+---------------------------------------------------------------------+------+---------------------+
    • include_job_id = 'TRUE' | 'FALSE'。 可选参数。设置文件名是否写入作业ID,不设置时默认为不写入作业ID。例如:include_job_id = 'TRUE'
    --导出的文件包含jobid
    COPY INTO VOLUME my_volume  SUBDIRECTORY 'dau_unload/'
    FROM TABLE birds
    file_format = (
        type= json
      
    )
    include_job_id = 'TRUE' ;
    --查看目录,如第8行包含了导出的文件包含了jobid
    +--------------------------------------------------------+----------------------------------------------------------------------------------------------+------+---------------------+
    |                     relative_path                      |                                             url                                              | size | last_modified_time  |
    +--------------------------------------------------------+----------------------------------------------------------------------------------------------+------+---------------------+
    | dau_unload/birds00001.json                             | oss://lakehouse-perf-test/test_insert/dau_unload/birds00001.json                             | 295  | 2024-12-27 17:29:20 |
    | dau_unload/part00001.csv                               | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.csv                               | 105  | 2024-12-27 17:15:40 |
    | dau_unload/part00001.csv.gzip                          | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.csv.gzip                          | 104  | 2024-12-27 17:19:33 |
    | dau_unload/part00001.data                              | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.data                              | 295  | 2024-12-27 17:33:49 |
    | dau_unload/part00001.json                              | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.json                              | 295  | 2024-12-27 17:24:26 |
    | dau_unload/part00001.parquet                           | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.parquet                           | 1886 | 2024-12-27 17:27:15 |
    | dau_unload/part00001.text                              | oss://lakehouse-perf-test/test_insert/dau_unload/part00001.text                              | 87   | 2024-12-27 17:25:34 |
    | dau_unload/part202412271736045501gmspelya5o900001.json | oss://lakehouse-perf-test/test_insert/dau_unload/part202412271736045501gmspelya5o900001.json | 295  | 2024-12-27 17:36:04 |
    +--------------------------------------------------------+----------------------------------------------------------------------------------------------+------+---------------------+

使用示例

导出数据到user volume中

CREATE TABLE birds (
    id INT,
    name VARCHAR(50),
    wingspan_cm FLOAT,
    colors STRING
);
INSERT INTO birds (id, name, wingspan_cm, colors) VALUES
    (1, 'Sparrow', 15.5, 'Brown'),
    (2, 'Blue Jay', 20.2, 'Blue'),
    (3, 'Cardinal', 22.1, 'Red'),
    (4, 'Robin', 18.7, 'Red","Brown');

COPY INTO USER VOLUME SUBDIRECTORY 'dau_unload/'
FROM TABLE birds
file_format = (type = CSV);
--查看是否导出成功
SHOW  USER VOLUME DIRECTORY  SUBDIRECTORY 'dau_unload/';
--删除文件避免占用存储
REMOVE  VOLUME my_volume  FILE 'dau_unload/part00001.csv';

导出数据到table volume中

CREATE TABLE birds (
    id INT,
    name VARCHAR(50),
    wingspan_cm FLOAT,
    colors STRING
);
INSERT INTO birds (id, name, wingspan_cm, colors) VALUES
    (1, 'Sparrow', 15.5, 'Brown'),
    (2, 'Blue Jay', 20.2, 'Blue'),
    (3, 'Cardinal', 22.1, 'Red'),
    (4, 'Robin', 18.7, 'Red","Brown');
COPY INTO TABLE VOLUME birds  SUBDIRECTORY 'dau_unload/'
FROM TABLE birds
file_format = (type = CSV);
--查看是否导出成功
SHOW  TABLE VOLUME  DIRECTORY  birds SUBDIRECTORY 'dau_unload/';
--删除文件避免占用存储
REMOVE  TABLE  VOLUME birds  FILE 'dau_unload/part00001.csv';

导出到外部volume中使用前提需要创建VOLUME和CONNECTION。创建过程可以参考CONNECTION创建VOLUME创建

导出数据到oss中

--创建表
CREATE TABLE birds (
    id INT,
    name VARCHAR(50),
    wingspan_cm FLOAT,
    colors STRING
);
INSERT INTO birds (id, name, wingspan_cm, colors) VALUES
    (1, 'Sparrow', 15.5, 'Brown'),
    (2, 'Blue Jay', 20.2, 'Blue'),
    (3, 'Cardinal', 22.1, 'Red'),
    (4, 'Robin', 18.7, 'Red","Brown');
--创建oss conenction
CREATE STORAGE CONNECTION  catalog_storage_oss
    type OSS
    ACCESS_ID='xxxx'
    ACCESS_KEY='xxxxxxx'
    ENDPOINT='oss-cn-hangzhou-internal.aliyuncs.com';
--创建volume
CREATE EXTERNAL VOLUME my_volume
    location 'oss://mybucket/test_insert/'
    using connection catalog_storage_oss
    directory = (
        enable=true,
        auto_refresh=true
    );
--将数据导出到test_insert子目录下
COPY INTO VOLUME my_volume SUBDIRECTORY 'dau_unload/'
FROM TABLE birds
file_format = (type = CSV);

导出数据到cos中

--创建表
CREATE TABLE birds (
    id INT,
    name VARCHAR(50),
    wingspan_cm FLOAT,
    colors STRING
);
INSERT INTO birds (id, name, wingspan_cm, colors) VALUES
    (1, 'Sparrow', 15.5, 'Brown'),
    (2, 'Blue Jay', 20.2, 'Blue'),
    (3, 'Cardinal', 22.1, 'Red'),
    (4, 'Robin', 18.7, 'Red","Brown');
--创建cos conenction
CREATE STORAGE CONNECTION my_conn 
  TYPE COS
  ACCESS_KEY = '<access_key>'
  SECRET_KEY = '<secret_key>'
  REGION = 'ap-shanghai'
  APP_ID = '1310000503';
--创建volume
CREATE EXTERNAL VOLUME my_volume
    location 'cos://mybucket/test_insert/'
    using connection my_conn
    directory = (
        enable=true,
        auto_refresh=true
    );
--将数据导出到test_insert子目录下
COPY INTO VOLUME my_volume SUBDIRECTORY 'dau_unload/'
FROM TABLE birds
file_format = (type = CSV);

导出数据到s3中

--创建表
CREATE TABLE birds (
    id INT,
    name VARCHAR(50),
    wingspan_cm FLOAT,
    colors STRING
);
INSERT INTO birds (id, name, wingspan_cm, colors) VALUES
    (1, 'Sparrow', 15.5, 'Brown'),
    (2, 'Blue Jay', 20.2, 'Blue'),
    (3, 'Cardinal', 22.1, 'Red'),
    (4, 'Robin', 18.7, 'Red","Brown');
--创建s3 conenction
CREATE STORAGE CONNECTION aws_bj_conn
    TYPE S3
    ACCESS_KEY = 'AKIAQNBSBP6EIJE33***'
    SECRET_KEY = '7kfheDrmq***************************'
    ENDPOINT = 's3.cn-north-1.amazonaws.com.cn'
    REGION = 'cn-north-1';
--创建volume
CREATE EXTERNAL VOLUME my_volume
    location 's3://mybucket/test_insert/'
    using connection aws_bj_conn
    directory = (
        enable=true,
        auto_refresh=true
    );
--将数据导出到test_insert子目录下
COPY INTO VOLUME my_volume SUBDIRECTORY 'dau_unload/'
FROM TABLE birds
file_format = (type = CSV);

联系我们
预约咨询
微信咨询
电话咨询