How can I use the same memory address for redundant rows of a numpy array?

I'm making an RNN language model with Keras and in order to train the model (supervised learning) I have to create a numpy array y (with the labels of each observation for each sequence) of shape (num_of_training_sequences, size_of_vocabulary) containing one-hot vectors.

When I have too many training sequences, this array is too big to fit in memory. However, it doesn't have to be! Since there are only size_of_vocabulary distinct one-hot vectors, y could just be an array of length num_of_training_sequences that contains references (aka pointers) to pre-allocated one-hot vectors. This way, if two sequences end in the same word and should have the same one-hot vector, they would simply reference the same address in memory for that one-hot vector.
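To make the idea concrete, here is a minimal sketch of the structure I have in mind. The sizes are toy values and `last_word_ids` is a hypothetical list of label indices, one per training sequence; a plain Python list holds references to rows of a single pre-allocated identity matrix.

```python
import numpy as np

size_of_vocabulary = 5          # assumed toy value
last_word_ids = [2, 0, 2, 4]    # hypothetical label index for each training sequence

# One pre-allocated one-hot vector per vocabulary word.
one_hot_table = np.eye(size_of_vocabulary, dtype=np.float32)

# A list of references: indexing with a single integer returns a view
# onto a row of one_hot_table, not a copy.
y_refs = [one_hot_table[i] for i in last_word_ids]

# Sequences 0 and 2 end in the same word, so their labels share memory.
shares = np.shares_memory(y_refs[0], y_refs[2])
```

Here the only dense storage is the size_of_vocabulary x size_of_vocabulary table; the list itself just holds one reference per sequence.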

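Everyone should be happy, except numpy. When I convert this very efficient data structure into a numpy array, it tries to allocate the whole array in memory, including all the duplicated, redundant one-hot vectors.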
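A small sketch of what I am seeing, again with toy sizes and a hypothetical `last_word_ids` list: as soon as the list of shared rows is passed to `np.array`, every row is copied into one new contiguous buffer and the sharing is lost.

```python
import numpy as np

size_of_vocabulary = 5          # assumed toy value
last_word_ids = [2, 0, 2, 4]    # hypothetical label index for each training sequence

one_hot_table = np.eye(size_of_vocabulary, dtype=np.float32)
y_refs = [one_hot_table[i] for i in last_word_ids]  # views that share memory

# Converting to a numpy array copies each row into a fresh dense buffer.
y = np.array(y_refs)

dense_shape = y.shape                                   # one full row per sequence
still_shared = np.shares_memory(y, one_hot_table)       # no longer backed by the table
dense_bytes = y.nbytes                                  # grows with num_of_training_sequences
```

With real sizes, `dense_bytes` scales as num_of_training_sequences times size_of_vocabulary, which is exactly the allocation I am trying to avoid.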

Is there anything I can do to overcome this? Keras's code and documentation say that fit() only accepts numpy arrays and tensors.
