Randomly shuffle pyspark column values?

I'm a beginner at pyspark programming. I have the following data in a csv file, which is being read into a spark dataframe.


# read the csv file in a spark dataframe
df = (spark.read
       .option("inferSchema", "true")
       .option("header", "true")
       .csv(file_path))

I want to shuffle the data in each column independently, as shown in the snapshot below, for "InvoiceNo", "StockCode", and "Description" respectively.


The following code was implemented to shuffle the column values:

from pyspark.sql.functions import *

df.orderBy("InvoiceNo", rand()).show(10)

I'm not getting the correct output even after executing the above. Can anyone help in solving the problem? This link was also referred to: Randomly shuffle column in Spark RDD or dataframe, but the code mentioned there throws an error.

Comments
  • 花=有毒=

The pyspark rand function can be used to create a column of random values on your dataframe. The dataframe can then be ordered by the new column to produce the randomised order you are looking for, e.g.

    from pyspark.sql.functions import rand
    
    df.withColumn('rand', rand(seed=42)).orderBy('rand')
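
    Note that ordering by a single rand column reorders whole rows together. To shuffle each column independently, as the question asks, one approach is to give each column its own random ordering, attach a sequential row index to each, and join the pieces back on that index. Below is a minimal sketch of that idea; the sample data, the `shuffle_column` helper, and the `_idx` column name are all illustrative assumptions, not part of the original question.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand, row_number
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[1]").appName("shuffle").getOrCreate()

    # Made-up sample rows standing in for the csv data in the question.
    df = spark.createDataFrame(
        [("536365", "85123A", "WHITE HANGING HEART"),
         ("536366", "71053",  "WHITE METAL LANTERN"),
         ("536367", "84406B", "CREAM CUPID HEARTS")],
        ["InvoiceNo", "StockCode", "Description"])

    def shuffle_column(df, col_name, seed):
        # Order this one column randomly, then attach a sequential index
        # so the shuffled columns can be zipped back together by position.
        w = Window.orderBy(rand(seed=seed))
        return df.select(col_name).withColumn("_idx", row_number().over(w))

    # Shuffle each column with a different seed and join on the index.
    shuffled = shuffle_column(df, "InvoiceNo", 1)
    for i, c in enumerate(["StockCode", "Description"], start=2):
        shuffled = shuffled.join(shuffle_column(df, c, i), on="_idx")

    result = shuffled.drop("_idx")
    result.show()
    ```

    Each column of `result` contains the same values as the original, but the rows no longer line up across columns. Note that a `Window.orderBy` without a partition pulls all rows to one partition, so this sketch only suits small-to-moderate data.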