Efficient way to add a UUID in pyspark [duplicate]

This question already has an answer here:

Pyspark add sequential and deterministic index to dataframe (1 answer)

Closed 20 days ago.

I have a DataFrame and I want to add a column containing a distinct uuid4() value for each row. My code:

from uuid import uuid4

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

spark_session = SparkSession.builder.getOrCreate()

df = spark_session.createDataFrame([
    [1, 1, 'teste'],
    [2, 2, 'teste'],
    [3, 0, 'teste'],
    [4, 5, 'teste'],
], list('abc'))

# Constant helper column used as the join key on both sides
df = df.withColumn("_tmp", f.lit(1))

# One UUID string per row of df, built on the driver
uuids = [str(uuid4()) for _ in range(df.count())]
df1 = spark_session.createDataFrame(uuids, StringType())
df1 = df1.withColumn("_tmp", f.lit(1))

df2 = df.join(df1, "_tmp", "inner").drop("_tmp")
df2.show()

But I get this error:

Py4JJavaError: An error occurred while calling o1571.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans

I have already tried using aliases and using monotonically_increasing_id as the join column, but I saw here that I cannot rely on monotonically_increasing_id as a merge column. I expect:

+---+---+-----+------+
|  a|  b|    c| value|
+---+---+-----+------+
|  1|  1|teste| uuid4|
|  2|  2|teste| uuid4|
|  3|  0|teste| uuid4|
|  4|  5|teste| uuid4|
+---+---+-----+------+

What is the correct approach in this case?


4 votes

I used row_number, as @Tetlanesh suggested. I had to create an ID column first so that row_number counts every row of the Window.
