cuDF - groupby UDF to support datetime


I have a cuDF dataframe with the following columns:

columns = ["col1", "col2", "dt"]

The dt column is of type datetime64[ns].

I would like to write a UDF to apply to each group in this dataframe and get the max of dt for each group. Here is what I am trying, but it seems that Numba doesn't support datetime64[ns] values in UDFs.

from numba import cuda
import numpy

def f1(dt, out):
    l = len(dt)
    maxvalue = dt[0]
    # each thread strides over the rows of its group
    for i in range(cuda.threadIdx.x, l, cuda.blockDim.x):
        if dt[i] > maxvalue:
            maxvalue = dt[i]
    out[:] = maxvalue

gdf = df.groupby(["col1", "col2"], method="cudf")
df = gdf.apply_grouped(f1, incols={"dt": "dt"}, outcols=dict(out=numpy.datetime64))


It fails with the following Numba typing error (excerpt):

This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7effda063510>)
[2] During: typing of call at <string> (10)


  • 小梨涡 replied:

    apply_grouped won't give you what I think you're after, which is the max of dt per group. You need to use agg with max on dt instead; cuDF's groupby functions will do the rest. To get your values as datetime64[ms], use astype() and save the result back to the dataframe (very fast). See my example:

    import cudf
    a = cudf.DataFrame({"col1": [1, 1, 1, 2, 2, 2], "col2": [1, 2, 1, 1, 2, 1], "dt": [1, 2, 3, 1, 2, 4]}) 
    a['dt'] = a['dt'].astype('datetime64[ms]')
    gdf = a.groupby(["col1", "col2"]).agg({'dt':'max'})

    The dt column values are interpreted as 1 to 4 milliseconds after Jan 1st, 1970, giving you a printout of

    col1 col2                        
    1    1    1970-01-01 00:00:00.003
         2    1970-01-01 00:00:00.002
    2    1    1970-01-01 00:00:00.004
         2    1970-01-01 00:00:00.002
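
Since the question's column is datetime64[ns] rather than datetime64[ms], here is a minimal sketch of the same aggregation on nanosecond data. It uses pandas as a CPU stand-in, on the assumption that cuDF's groupby API mirrors it; it was not run on a GPU. The int64 round trip in the second half is an assumed workaround for code paths that only accept numeric columns (such as the UDF in the question), not something confirmed by the answer above.

```python
import pandas as pd  # CPU stand-in; swapping in cudf is an untested assumption

df = pd.DataFrame({
    "col1": [1, 1, 1, 2, 2, 2],
    "col2": [1, 2, 1, 1, 2, 1],
    "dt": [1, 2, 3, 1, 2, 4],
})
# Interpret the integers as nanoseconds since the Unix epoch.
df["dt"] = df["dt"].astype("datetime64[ns]")

# Built-in aggregation: no UDF needed to get the max per group.
direct = df.groupby(["col1", "col2"]).agg({"dt": "max"})

# Assumed workaround: aggregate the raw int64 nanosecond counts,
# then cast the result back to datetime64[ns].
as_int = df.assign(dt=df["dt"].astype("int64"))
via_int = as_int.groupby(["col1", "col2"]).agg({"dt": "max"})
via_int["dt"] = via_int["dt"].astype("datetime64[ns]")
```

Both paths should yield the same per-group maxima, because comparing datetime64[ns] values is equivalent to comparing their underlying integer nanosecond counts.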