I'm looking for a fast PySpark alternative to
SELECT foo FROM bar
WHERE foo IN (SELECT baz FROM bar)
Collecting beforehand into a Python list is absolutely not an option: the dataframes involved are quite large, and collect takes a tremendous amount of time relative to the other option I came up with. So I can't see a way to use a native PySpark where(col('foo').isin(baz)), since baz would have to be a Python list in that case.
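For reference, the rejected collect-based route would look roughly like the following minimal sketch (the names baz_list and filtered are illustrative, assuming bar has columns foo and baz):

from pyspark.sql.functions import col

# The approach ruled out above: pull the distinct baz values to the driver
# as a Python list, then filter with isin -- too slow for large dataframes.
baz_list = [row['baz'] for row in bar.select('baz').distinct().collect()]
filtered = bar.where(col('foo').isin(baz_list))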
One option I came up with is a right JOIN as a replacement for IN and a left_anti JOIN as a replacement for NOT IN; consider the following example:
from pyspark.sql.functions import col
bar_where_foo_is_in_baz = bar.join(bar.select(col('baz').alias('baz_')), col('foo') == col('baz_'), 'right').drop('baz_')
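For completeness, the NOT IN counterpart mentioned above could be sketched like this (the name bar_where_foo_not_in_baz is illustrative; the anti join drops the right-hand columns by itself, so no drop call is needed):

# Sketch of the NOT IN counterpart: keep the rows of bar whose foo value
# does not appear among the baz values.
bar_where_foo_not_in_baz = bar.join(
    bar.select(col('baz').alias('baz_')),
    col('foo') == col('baz_'),
    'left_anti'
)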
This approach, however, is quite verbose, hard to interpret when read again after a while, and causes a fair bit of head scratching when a larger number of conditions has to be handled within the WHERE clause, so I'd like to avoid it.
Are there any other options?
Try array_contains in filter. See array_contains in the Spark API (available since Spark 1.5).
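One possible reading of this answer, sketched under the assumption of a reasonably recent Spark version where collect_set, crossJoin and the SQL array_contains function are all available (this is my interpretation, not the answerer's exact code):

from pyspark.sql import functions as F

# Gather the distinct baz values into a single-row array column, attach it
# to every row of bar with a cross join, then filter on array membership.
baz_array = bar.agg(F.collect_set('baz').alias('baz_values'))
bar_where_foo_is_in_baz = (
    bar.crossJoin(baz_array)
       .filter(F.expr('array_contains(baz_values, foo)'))
       .drop('baz_values')
)

Whether this beats the join-based version likely depends on how many distinct baz values there are, since the whole set is materialized as a single array.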