When AWS RDS exports a table to S3 in Parquet format (a snapshot export), the output is automatically split into multiple files based on size, which keeps the data efficient to read and scalable.
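For context, here is a minimal boto3 sketch of starting such an export; the snapshot ARN, IAM role, KMS key, and identifiers are all hypothetical placeholders:

import boto3

rds = boto3.client('rds')

# Start a snapshot export task; RDS writes the result to S3 as Parquet,
# automatically split across multiple files. All ARNs and identifiers
# below are placeholders.
response = rds.start_export_task(
    ExportTaskIdentifier='my-table-export',
    SourceArn='arn:aws:rds:us-east-1:123456789012:snapshot:my-snapshot',
    S3BucketName='YOUR_AWS_BUCKET_NAME',
    IamRoleArn='arn:aws:iam::123456789012:role/rds-s3-export-role',
    KmsKeyId='arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID',
    ExportOnly=['mydatabase.mytable'],  # optional: limit the export to one table
)
print(response['Status'])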
With boto3 (the AWS SDK for Python), pandas, and awswrangler (AWS SDK for pandas), you can download the exported Parquet data and rewrite it as a partitioned dataset:
import boto3
import pandas as pd
import awswrangler as wr

# Create a session with explicit credentials (or omit them to use the
# default credential chain)
session = boto3.session.Session(
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_KEY',
)
s3 = session.resource('s3')

aws_bucket = 'YOUR_AWS_BUCKET_NAME'
aws_object_key = 'YOUR_AWS_OBJECT_KEY'
local_file = '/tmp/MY_LOCAL_FILE.parquet'

# Download the exported Parquet object to a local file
bucket = s3.Bucket(aws_bucket)
bucket.download_file(aws_object_key, local_file)

# Read the Parquet file into a pandas DataFrame
df = pd.read_parquet(local_file)

# Write the data back to S3 as a partitioned Parquet dataset and register
# it as a Glue Data Catalog table; the partition columns ('year', 'month',
# 'day') must exist in df
wr.s3.to_parquet(
    df=df,
    path=f's3://{aws_bucket}/mytable/',
    dataset=True,
    database='mydatabase',
    table='mytable',
    partition_cols=['year', 'month', 'day'],
    mode='overwrite',
    concurrent_partitioning=True,
    boto3_session=session,
)
Here concurrent_partitioning=True tells awswrangler to write the partitions in parallel, which shortens the write time at the cost of higher memory usage. Because the dataset is partitioned by year/month/day, a very large table is automatically split into separate Parquet files under one S3 prefix per partition value.
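To check the resulting layout, here is a minimal sketch of reading the partitioned dataset back with awswrangler; the partition_filter callback receives each partition's column values as strings, and the year value is just an example:

import awswrangler as wr

# Read only the partitions for a single year; other partitions are pruned
# before any data is downloaded. The path must match the write path above.
df_2023 = wr.s3.read_parquet(
    path='s3://YOUR_AWS_BUCKET_NAME/mytable/',
    dataset=True,
    partition_filter=lambda p: p['year'] == '2023',
)
print(df_2023.shape)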