数据导入Lakehouse操作实践_通过copy命令批量加载
本指南将帮助您通过脚本方式和云器Lakehouse SQLLine命令行工具,将大量公共URL Parquet文件中的数据(例如New York City Taxi and For-Hire Vehicle Data)导入到云器Lakehouse中。
1、download_raw_data.sh,将Public URL Parquet文件下载到本地。本文以NYC Taxi Data为例,包含450多个文件,60GB的数据(Parquet格式)。
2、initialize_database.sh,通过Lakehouse SQLLine命令创建云器Lakehouse的schema和tables。
3、Import taxi and FHV data,用R转换Parquet文件到CSV文件、通过Lakehouse SQLLine copy命令将本地CSV文件里的数据导入到云器Lakehouse的表里,支持批量多个文件的导入。
操作指南
脚本下载地址:https://github.com/yunqiqiliang/nyc-taxi-data-clickzetta
1. Install ClickZetta SQLLine
2. Install R
From CRAN
Note that R used to be optional for this repo, but is required starting with the 2022 file format change. The scripts use R to convert Parquet files to CSV before loading into Postgres. There are other ways to convert from Parquet to CSV that wouldn't require R, but I found that R's arrow
package was faster than some of the other CLI tools I tried
3. Download raw data
./download_raw_data.sh
while done modify download_raw_data.sh and run again to download data from 202212: wget -i setup_files/raw_data_urls_new.txt -P data/ -w 2
4. Initialize database and set up schema
./initialize_database.sh
##5. Import taxi and FHV data
./import_yellow_taxi_trip_data.sh
./import_green_taxi_trip_data.sh
./import_fhv_taxi_trip_data.sh
./import_fhvhv_trip_data.sh
Note that the full import process might take several hours or possibly even over a day depending on computing power
Schema
trips
table contains all yellow and green taxi trips. Each trip has acab_type_id
, which references thecab_types
table and refers to one ofyellow
orgreen
fhv_trips
table contains all for-hire vehicle trip records, including ride-hailing apps Uber, Lyft, Via, and Junofhv_bases
mapsfhv_trips
to base names and "doing business as" labels, which include ride-hailing app namesnyct2010
table contains NYC census tracts plus the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areastaxi_zones
table contains the TLC's official taxi zone boundaries. Starting in July 2016, the TLC no longer provides pickup and dropoff coordinates. Instead, each trip comes with taxi zone pickup and dropoff location IDscentral_park_weather_observations
has summary weather data by date