To scale a column of SparseVectors, you can use StandardScaler from Spark's machine learning library MLlib; no user-defined function (UDF) is needed.
The following example shows how to use StandardScaler to scale a SparseVector column:
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import SparseVector
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create some example data
data = [(1, SparseVector(3, {0: 1.0, 1: 2.0, 2: 3.0})),
        (2, SparseVector(3, {0: 4.0, 1: 5.0, 2: 6.0})),
        (3, SparseVector(3, {0: 7.0, 1: 8.0, 2: 9.0}))]
df = spark.createDataFrame(data, ["id", "features"])

# Create a StandardScaler and fit it to the data
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(df)

# Transform the data with the fitted model
scaledData = scalerModel.transform(df)

# Show the scaled data
scaledData.show(truncate=False)
By default, StandardScaler divides each component by that component's corrected sample standard deviation (withStd=True) without mean-centering (withMean=False), so the output stays sparse. Each column in this example has a sample standard deviation of 3, so the output looks roughly like this:
+---+-------------------------+-------------------------------------------------------+
|id |features                 |scaledFeatures                                         |
+---+-------------------------+-------------------------------------------------------+
|1  |(3,[0,1,2],[1.0,2.0,3.0])|(3,[0,1,2],[0.3333333333333333,0.6666666666666666,1.0])|
|2  |(3,[0,1,2],[4.0,5.0,6.0])|(3,[0,1,2],[1.3333333333333333,1.6666666666666667,2.0])|
|3  |(3,[0,1,2],[7.0,8.0,9.0])|(3,[0,1,2],[2.3333333333333335,2.6666666666666665,3.0])|
+---+-------------------------+-------------------------------------------------------+
As you can see, the features column has been scaled into the scaledFeatures column.
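As a quick sanity check, the scaling factors can be reproduced without Spark: StandardScaler's default behavior (withStd=True, withMean=False) divides each vector component by that component's corrected sample standard deviation across all rows. A minimal pure-Python sketch using the same example values:

```python
from statistics import stdev

# Per-index column values from the example DataFrame above.
columns = [[1.0, 4.0, 7.0], [2.0, 5.0, 8.0], [3.0, 6.0, 9.0]]

# Corrected (n-1) sample standard deviation of each column,
# which is what StandardScaler scales by when withStd=True.
stds = [stdev(col) for col in columns]
print(stds)  # [3.0, 3.0, 3.0]

# The first row (1.0, 2.0, 3.0) scaled component-wise:
scaled_row1 = [v / s for v, s in zip([1.0, 2.0, 3.0], stds)]
print(scaled_row1)  # [0.3333333333333333, 0.6666666666666666, 1.0]
```

This matches the first row of the scaledFeatures column; the digits may differ in the last few decimal places from Spark's output depending on how the division is carried out internally.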