cuDF groupby UDF to support datetime

I have a cuDF dataframe with the following columns:

columns = ["col1", "col2", "dt"]

The dt column has dtype datetime64[ns].

I would like to write a UDF to apply to each group in this dataframe and get the max of dt for each group. Here is what I am trying, but it seems that numba doesn't support datetime64[ns] values in UDFs.

import numpy
from numba import cuda

def f1(dt, out):
   l = len(dt)
   maxvalue = dt[0]
   # each thread strides over the rows of its group
   for i in range(cuda.threadIdx.x, l, cuda.blockDim.x):
      if dt[i] > maxvalue:
         maxvalue = dt[i]
   out[:] = maxvalue

gdf = df.groupby(["col1", "col2"], method="cudf")
df = gdf.apply_grouped(f1, incols={"dt": "dt"}, outcols=dict(out=numpy.datetime64))

Here is the error I get:

This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7effda063510>)
[2] During: typing of call at <string> (10)

I have similar functions that work fine with integers and floats. Does this mean that numba doesn't support datetimes?

Comments
  • 小梨涡 replied:

    apply_grouped won't give you what I think you're after, which is the max of dt per group. You need to use agg with max on dt; cuDF's groupby functions do the rest. To get your values as datetime64[ms], use astype() and save the result back to the dataframe (very fast). See my example:

    import cudf
    a = cudf.DataFrame({"col1": [1, 1, 1, 2, 2, 2], "col2": [1, 2, 1, 1, 2, 1], "dt": [1, 2, 3, 1, 2, 4]}) 
    a['dt'] = a['dt'].astype('datetime64[ms]')
    gdf = a.groupby(["col1", "col2"]).agg({'dt':'max'})
    print(gdf.head(6))
    

    The dt values are interpreted as 1 to 4 milliseconds after Jan 1st, 1970, giving you a printout of:

                                   dt
    col1 col2                        
    1    1    1970-01-01 00:00:00.003
         2    1970-01-01 00:00:00.002
    2    1    1970-01-01 00:00:00.004
         2    1970-01-01 00:00:00.002
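
The example above builds dt from small integers, which is why it needs the astype() cast. For a column that is already datetime64[ns], as in the question, the same built-in aggregation should work without any cast and without a UDF. Here is a minimal sketch of that case. The timestamps are made up for illustration, and the pandas fallback is only an assumption so that the snippet can also run on a machine without a GPU; the groupby and agg calls used here exist in both libraries.

```python
# Sketch only: cuDF needs an NVIDIA GPU, so fall back to pandas when cuDF
# is not installed. Both libraries accept the groupby/agg calls used below.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf
import numpy as np

# Made-up timestamps, already stored as datetime64[ns] as in the question.
dt = np.array(
    [
        "2020-01-01T00:00:00.000000001",
        "2020-01-02T00:00:00",
        "2020-01-03T00:00:00",
        "2020-01-01T12:00:00",
        "2020-01-02T12:00:00",
        "2020-01-04T00:00:00",
    ],
    dtype="datetime64[ns]",
)

df = xdf.DataFrame(
    {
        "col1": [1, 1, 1, 2, 2, 2],
        "col2": [1, 2, 1, 1, 2, 1],
        "dt": dt,
    }
)

# Built-in max aggregation per (col1, col2) group; sort=True makes the
# group order the same in cuDF and pandas.
result = df.groupby(["col1", "col2"], sort=True).agg({"dt": "max"}).reset_index()
print(result)
```

The dt column in result keeps its datetime64[ns] dtype, so nanosecond precision is preserved and no conversion back from integers is needed.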