Precompiling numba cuda kernels (not JIT)

Hi, I am using CUDA to write some kernels with the @cuda.jit decorator. I have 8 CPU threads, each calling a kernel on one of 2 GPU devices (cpu_idx % len(cuda.gpus), to be specific).
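A minimal sketch of the per-worker setup, just for reference — the kernel name, its body, and the launch configuration are placeholders, not my actual code:

```python
from numba import cuda

# Placeholder kernel standing in for the real per-pixel image processing.
@cuda.jit
def process_image(img, out):
    x, y = cuda.grid(2)
    if x < img.shape[0] and y < img.shape[1]:
        out[x, y] = img[x, y] * 2.0  # stand-in for the real work

def run_on_gpu(cpu_idx, img):
    # Each worker picks one of the two devices: cpu_idx % len(cuda.gpus)
    cuda.select_device(cpu_idx % len(cuda.gpus))
    d_img = cuda.to_device(img)
    d_out = cuda.device_array_like(d_img)
    threads = (16, 16)
    blocks = ((img.shape[0] + 15) // 16, (img.shape[1] + 15) // 16)
    process_image[blocks, threads](d_img, d_out)
    return d_out.copy_to_host()
```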

I believe each CPU thread compiles the kernel itself, which takes a lot of time relative to the time the kernel needs to process an entire image. Ideally it would be compiled only once and shared by all the CPU threads. But I can't initialize any CUDA/GPU code before forking with multiprocessing.Pool, because a CUDA context created before the fork can't be used in the child processes.
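The overall structure looks roughly like this (worker count and image sizes are illustrative; run_on_gpu is the helper from the sketch above). Since CUDA can't be touched in the parent before the pool is created, each worker ends up compiling the kernel independently on its first call:

```python
import multiprocessing as mp
import numpy as np

def worker(args):
    # Each forked worker initializes CUDA itself and JIT-compiles the
    # kernel on its first launch, so the compile cost is paid per worker.
    cpu_idx, img = args
    return run_on_gpu(cpu_idx, img)  # run_on_gpu as in the sketch above

if __name__ == "__main__":
    images = [np.random.rand(1024, 1024).astype(np.float32) for _ in range(8)]
    # No CUDA calls happen before this point; the workers do all GPU work.
    with mp.Pool(processes=8) as pool:
        results = pool.map(worker, list(enumerate(images)))
```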

So, is there a way to precompile the CUDA kernels? I don't want just-in-time compilation.
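To illustrate what I mean: something like eager compilation via an explicit signature moves the compile to the time the decorator runs instead of the first launch, but as far as I understand it each process still ends up doing its own compilation, so it doesn't solve this:

```python
from numba import cuda

# Eager compilation: the explicit signature makes Numba compile the kernel
# when the decorator runs rather than lazily on the first launch. As far as
# I can tell, each worker process still compiles for itself in my setup.
@cuda.jit('void(float32[:,:], float32[:,:])')
def process_image(img, out):
    x, y = cuda.grid(2)
    if x < img.shape[0] and y < img.shape[1]:
        out[x, y] = img[x, y] * 2.0  # placeholder for the real work
```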