Hi, I am using CUDA to write some kernels with the @cuda.jit decorator. I have 8 CPU threads, each calling a kernel on 1 of 2 GPU devices (cpu_idx % len(cuda.gpus), to be specific).
I believe each CPU thread compiles the kernel itself, which takes a lot of time relative to the time the kernel needs to process an entire image. Ideally, the kernel would be compiled only once and shared by all the CPU threads. But I can't initialize any CUDA GPU code before forking with multiprocessing.Pool, because CUDA doesn't like being initialized before a fork.
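For reference, here is a minimal sketch of the setup I am describing. The function names device_for_worker and init_worker are made up for illustration and are not from my real code. It shows the round-robin device mapping and a per-process setup step that only touches CUDA after the fork. It does not solve the repeated compilation, it just makes the layout concrete.

```python
from collections import Counter


def device_for_worker(cpu_idx: int, n_gpus: int) -> int:
    """Round-robin mapping from the post: cpu_idx % len(cuda.gpus)."""
    if n_gpus <= 0:
        raise ValueError("need at least one GPU")
    return cpu_idx % n_gpus


def init_worker(cpu_idx: int) -> None:
    """Hypothetical per-process setup, meant to run inside the child
    process so that no CUDA context exists in the parent before the fork."""
    # Imported lazily so the parent process never initializes CUDA.
    from numba import cuda

    cuda.select_device(device_for_worker(cpu_idx, len(cuda.gpus)))


# 8 workers spread over 2 devices, as in my setup.
assignment = [device_for_worker(i, 2) for i in range(8)]
load = Counter(assignment)  # number of workers per device
```

With this layout each of the 8 processes still triggers its own compilation the first time it calls the kernel, which is the cost I would like to avoid.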