- Make sure the numeric columns in the source data have the correct data types. If a numeric column is typed as a string, the Glue Job may fail to import the data correctly. In that case, cast the column to a numeric type before transforming or writing the data. The following example casts the "number_col" column to an integer (IntegerType):
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

# Read the source CSV, then cast the string column to an integer type
df = spark.read.format("csv").option("header", "true").load("s3://path/to/input")
df = df.withColumn("number_col", f.col("number_col").cast(IntegerType()))

# Write to RDS over JDBC (fill in the connection options;
# note that "overwrite" mode replaces the target table)
df.write.format("jdbc").option("url", "jdbc:").option("dbtable", "").option("user", "") \
    .option("password", "").option("driver", "org.postgresql.Driver") \
    .mode("overwrite").save()
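One caveat worth knowing: by default (with ANSI mode off), Spark's `cast` does not fail on values it cannot parse; it silently produces NULL for them, so bad source rows can turn into NULLs in RDS. The following is a minimal pure-Python sketch of that "cast or null" behavior (it does not use Spark; `cast_or_null` is a hypothetical helper for illustration only):

```python
def cast_or_null(value):
    """Mimic Spark's cast(IntegerType()) default behavior:
    return an int on success, None (NULL) on failure."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

rows = ["42", "7", "abc", None]
print([cast_or_null(v) for v in rows])  # [42, 7, None, None]
```

Because of this, it can be useful to count NULLs in the cast column (or filter them out) before writing, so silently dropped values do not go unnoticed.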
- Define the schema of the RDS table explicitly in the Glue Job. This ensures that every column you need is mapped correctly to the corresponding column in the RDS table. Example:
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
db_name = ""
tbl_name = ""
rds_url = ""
username = ""
password = ""
# Define schema for RDS table
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("number_col", IntegerType(), True),
StructField("date_col", LongType(), True),
StructField("timestamp_col", LongType(), True)
])
# Read data from source
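Continuing the snippet above, the explicit schema can be applied when reading the source, and the result written to the RDS table over JDBC. This is a sketch under stated assumptions: the S3 path, CSV input format, PostgreSQL driver, and "append" write mode are illustrative choices, not values from the original, and the connection variables must be filled in before the job will run.

```python
# Read the source data with the explicit schema instead of inferring types.
# (Assumption: CSV input; replace the path with your actual source location.)
df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load("s3://path/to/input")

# Write to the RDS table over JDBC using the connection values defined above.
# "append" adds rows to the existing table; "overwrite" would replace it.
df.write.format("jdbc") \
    .option("url", rds_url) \
    .option("dbtable", tbl_name) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "org.postgresql.Driver") \
    .mode("append") \
    .save()
```

With the schema supplied at read time, a type mismatch surfaces as NULLs or read errors in the Glue Job itself rather than as a failed insert on the database side, which is usually easier to debug.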