实际上是替代other列中WHERE列的快速Spark替代品

我正在寻找一种快速的PySpark替代品

SELECT foo FROM bar
WHERE foo IN (SELECT baz FROM bar)

Collecting beforehand into a Python list is absolutely not an option as the dataframes handled are quite large and collect takes up tremendous amounts of time relative to the other option I came up with. So I can't think of a way to use a native PySparkian where(col(bar).isin(baz)) since baz would have to be a list in this case.

One option I came up with is right JOIN as a replacement of IN and left_semi JOIN as a replacement of NOT IN, consider a following example:

bar_where_foo_is_in_baz = bar.join(bar.select('baz').alias('baz_'), col('foo') == col('baz_'), 'right').drop('baz_')

This however is quite verbose, pretty hard to interpret when read after a while and results in a fair bit of head scratching when a larger number of conditions are handled within WHERE so I'd like to avoid that.

还有其他选择吗?