I'm trying to apply a pandas_udf to my PySpark dataframe for some filtering, following the groupby('Key').apply(UDF) method. To use the pandas_udf I defined an output schema and have a condition on the column Number. As an example, the simplified idea here is that I wish only to return the ID of the rows with odd Number.
This raises a problem: sometimes a group has no odd Number, so the UDF just returns an empty dataframe, which conflicts with the defined schema, where Number is declared as an int.
Is there a way to solve this problem and output only the odd Number rows, combined into a new dataframe?
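from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Output schema declared for the grouped-map UDF
schema = StructType([
    StructField("Key", StringType()),
    StructField("Number", IntegerType())
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    # Keep only the rows of this group whose Number is odd
    odd = df.loc[df['Number'] % 2 == 1]
    return odd[['ID']]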
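
For illustration, the empty-group case described above can be reproduced in plain pandas, without Spark. This is only a minimal sketch: the sample data and the names pdf and results are made up for the example, and the function is the undecorated body of get_odd applied to each group by hand.

```python
import pandas as pd

# Hypothetical sample data: group "a" has one odd Number, group "b" has none.
pdf = pd.DataFrame({
    "Key": ["a", "a", "b", "b"],
    "ID": [1, 2, 3, 4],
    "Number": [1, 2, 4, 6],
})

def get_odd(df):
    # Same filter as in the UDF, applied to one group's pandas DataFrame.
    odd = df.loc[df["Number"] % 2 == 1]
    return odd[["ID"]]

# Call the function once per group, roughly as a grouped-map UDF would.
results = {key: get_odd(group) for key, group in pdf.groupby("Key")}

print(results["a"])        # the single row of group "a" with an odd Number
print(results["b"].empty)  # True: group "b" has no odd Number
```

Group "b" produces a dataframe with an ID column but no rows, which is the case I'm asking about.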