A fast Spark alternative to WHERE col IN (subquery)

I am looking for a fast PySpark alternative to

SELECT foo FROM bar
WHERE foo IN (SELECT baz FROM bar)

Collecting beforehand into a Python list is absolutely not an option, as the dataframes handled are quite large and collect takes a tremendous amount of time relative to the other option I came up with. So I can't think of a way to use the native PySparkian where(col('foo').isin(baz)), since baz would have to be a Python list in this case.

One option I came up with is a right JOIN as a replacement for IN (and a left_anti JOIN as a replacement for NOT IN); consider the following example:

bar_where_foo_is_in_baz = bar.join(bar.select(col('baz').alias('baz_')), col('foo') == col('baz_'), 'right').drop('baz_')

This however is quite verbose, pretty hard to interpret when read after a while and results in a fair bit of head scratching when a larger number of conditions are handled within WHERE so I'd like to avoid that.

Are there any other options?

Comments
  • 记忆狠清晰

Try array_contains(baz, foo) = True in filter. The array_contains Spark API is available in Spark 1.5+.

    df.show()
    #sample dataframe
    #+---+---------+
    #|foo|      baz|
    #+---+---------+
    #|  1|[1, 2, 3]|
    #|  2|   [1, 2]|
    #|  4|   [1, 3]|
    #+---+---------+
    
    df.select("foo").filter("array_contains(baz,foo)=True").show()
    
    #+---+
    #|foo|
    #+---+
    #|  1|
    #|  2|
    #+---+