Removing rows from a PySpark dataframe by type

I have large files in PySpark containing product items; some are plain numbers, while others contain strings. I want to remove all rows whose item is a number from the dataframe (in a computationally efficient way).

|Product-1| Pj3f|
|Product-2| 12  |
|Product-3| Pj2F|
|Product-4| 20  |

How can I filter rows by the type of the items in a column of a PySpark dataframe? The PySpark filter function does not seem to support this.


qporro

  • Cast the column to int, then keep only the rows where the cast result is null (non-numeric strings cast to null).
  • Or use the .rlike function with a numeric regex.

Example:

from pyspark.sql.functions import col

df.show()
#+---------+-----+
#|  product|descr|
#+---------+-----+
#|Product-1| pj3f|
#|product-2|   12|
#+---------+-----+

df.filter(col("descr").cast("int").isNull()).show()
df.filter(~col("descr").rlike("^([\s\d]+)$")).show()
#+---------+-----+
#|  product|descr|
#+---------+-----+
#|Product-1| pj3f|
#+---------+-----+
bby

Columns in Spark all have a single type. If you mix two columns of different types, for example with a union, Spark will try to convert them to a type that is valid for both, usually String, and store the string representation of the values.

Examples (a sketch follows the list):

  • A String column unioned with a Float column results in a String column, with the floats represented as strings using a dot for decimals. String + Float => String
  • An Integer column unioned with a Float column converts all the integers into Floats. Integer + Float => Float
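
A minimal sketch of this widening behaviour (the column name value and the toy data are made up for illustration; assumes a running SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ints = spark.createDataFrame([(1,), (2,)], ["value"])        # bigint
floats = spark.createDataFrame([(1.5,), (2.5,)], ["value"])  # double
strings = spark.createDataFrame([("a",), ("b",)], ["value"]) # string

# Integer + Float => Float: the integers are widened to doubles
ints.union(floats).printSchema()
#root
# |-- value: double (nullable = true)

# String + Float => String: the floats are kept as their string form
strings.union(floats).printSchema()
#root
# |-- value: string (nullable = true)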

For your case it will depend; if it is a typical mix of string and numeric values, I would go for the regex filtering.

val stringsDF = df.filter(regexp_extract($"column", "^[0-9]+([.,][0-9]+)?$", 0) === "")

This keeps every value that does not match a float or integer.
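
Since the question is about PySpark, here is a Python sketch of the same idea (the column name descr is borrowed from the first answer's example; adjust it to your schema):

from pyspark.sql.functions import col, regexp_extract

# regexp_extract returns "" when the pattern does not match,
# so this keeps only the rows whose value is not purely numeric
strings_df = df.filter(regexp_extract(col("descr"), r"^[0-9]+([.,][0-9]+)?$", 0) == "")
strings_df.show()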
