SELECT foo FROM bar
WHERE foo IN (SELECT baz FROM bar)
Collecting beforehand into a Python list is absolutely not an option as the dataframes handled are quite large and collect takes up tremendous amounts of time relative to the other option I came up with. So I can't think of a way to use a native PySparkian
baz would have to be a list in this case.
One option I came up with is
right JOIN as a replacement of IN and
left_semi JOIN as a replacement of NOT IN, consider a following example:
bar_where_foo_is_in_baz = bar.join(bar.select('baz').alias('baz_'), col('foo') == col('baz_'), 'right').drop('baz_')
This however is quite verbose, pretty hard to interpret when read after a while and results in a fair bit of head scratching when a larger number of conditions are handled within
WHERE so I'd like to avoid that.