Suppose I have a dataframe with several numerical variables and one categorical variable with 10000 categories. I use a neural network in Keras to get a matrix of embeddings for the categorical variable. The embedding size is 50, so the matrix that Keras returns has dimension 10002 x 50.
Of the 2 extra rows, one is for unknown categories and I don't know exactly what the other one is for. Adding them was the only way I could get Keras to work, i.e.,
model_i = keras.layers.Embedding(input_dim=num_categories+2, output_dim=embedding_size, input_length=1,
                                 name=f'embedding_{cat_feature}')(input_i)
would not work without the +2.
So, I have a training set with about 12M rows and a validation set with about 1M rows. Now, the way I thought of to reconstruct the embeddings is:
- Have a reversed dictionary with the numerical values (which were encoded earlier to represent the categories) as keys and the category names as values.
- Add 50 NaN columns to the data frame.
- For i in range(10002) (which is the number of categories + 2), look up key i in the reversed dictionary and, if it is in the dictionary, use pandas .loc to replace each row (in those 50 NaN columns) that corresponds to the value of i (i.e., where the categorical variable is equal to the category name that i encodes) with the corresponding row vector from the 10002 x 50 matrix.
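The loop described in the list above can be sketched roughly as follows, on toy sizes. This is only an illustration under assumptions: `code_to_name`, `emb_matrix`, `emb_cols` and the column name `cat` are hypothetical stand-ins, and a 5 x 4 matrix replaces the 10002 x 50 one.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): 3 categories + 2 extra rows, embedding size 4.
embedding_size = 4
code_to_name = {1: "a", 2: "b", 3: "c"}  # reversed dictionary: code -> category name
emb_matrix = np.arange(5 * embedding_size, dtype=float).reshape(5, embedding_size)

df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3, 0.4, 0.5],
    "cat": ["b", "a", "b", "c", "zzz"],  # "zzz" has no code in the dictionary
})

# Add the NaN columns that will hold the embedding.
emb_cols = [f"emb_{j}" for j in range(embedding_size)]
for col in emb_cols:
    df[col] = np.nan

# One full scan of the data frame per code: this is what makes it slow.
for i in range(emb_matrix.shape[0]):
    name = code_to_name.get(i)
    if name is None:
        continue  # codes without a category name are skipped
    mask = df["cat"] == name
    df.loc[mask, emb_cols] = emb_matrix[i]  # row vector broadcast over matching rows
```

With 10002 codes and 12M rows, the loop compares the whole categorical column against a name 10000 times, which is where the inefficiency comes from. Rows whose category has no code keep their NaN values.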
The problem with this solution is that it's highly inefficient.
A friend told me about another solution, which consists of converting the categorical variable to a one-hot sparse matrix with dimensions 12M x 10000 (for the training set) and then using matrix multiplication with the embeddings matrix, which should have dimensions 10000 x 50, thus getting a 12M x 50 matrix that I can then concatenate to my original data frame. The problems here are:
- It won't work on the validation set, because the number of categories appearing there is, or may be, different from the number in training, so the dimensions do not match.
- Even when used on the training set, the matrix Keras gives me has 10002 (= num_categories + 2) rows instead of 10000, so again the dimensions do not match.
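For reference, the one-hot multiplication can be sketched on toy sizes as below. It rests on assumptions not stated above: the one-hot matrix is built from the integer codes (not the category names), its width is fixed to the number of rows of the embedding matrix rather than to the categories seen in a given set, and unseen categories have already been mapped to a reserved code (0 here, purely hypothetical). The helper `one_hot_embed` is illustrative, not part of any library.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the 10002 x 50 matrix: 5 rows, embedding size 4.
n_emb_rows, embedding_size = 5, 4
emb_matrix = np.arange(n_emb_rows * embedding_size, dtype=float).reshape(
    n_emb_rows, embedding_size
)

def one_hot_embed(codes, emb_matrix):
    """Multiply a sparse one-hot encoding of `codes` by `emb_matrix`."""
    codes = np.asarray(codes)
    n = codes.shape[0]
    # Width is taken from the embedding matrix, so it is the same for any data set.
    one_hot = sparse.csr_matrix(
        (np.ones(n), (np.arange(n), codes)),
        shape=(n, emb_matrix.shape[0]),
    )
    return np.asarray(one_hot @ emb_matrix)

train_codes = [2, 1, 2, 3]
valid_codes = [3, 0]  # 0 is the assumed code for categories unseen in training

train_emb = one_hot_embed(train_codes, emb_matrix)  # shape (4, 4)
valid_emb = one_hot_embed(valid_codes, emb_matrix)  # shape (2, 4)
```

Under these assumptions the inner dimensions agree for both sets, because the one-hot width never depends on which categories happen to appear. Whether the real encoding reserves code 0 for unknowns depends on how the categories were encoded before training.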
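Does anyone know a better way to do this, or a way to fix the problems in the second approach? My end goal is a data frame with all the variables except the categorical one, plus 50 additional columns holding the row vectors that represent the embedding of that categorical variable.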
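To make the target shape concrete, here is a toy sketch of such a data frame. It swaps in a different technique from the two approaches above, namely NumPy integer-array indexing into the embedding matrix, and it is only one possible way to get there. The names `name_to_code`, `unknown_code`, `emb_matrix` and the column `cat` are hypothetical, and treating code 0 as the "unknown" row is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): 5 x 4 instead of 10002 x 50.
embedding_size = 4
emb_matrix = np.arange(5 * embedding_size, dtype=float).reshape(5, embedding_size)
name_to_code = {"a": 1, "b": 2, "c": 3}  # the encoding used before training
unknown_code = 0                          # assumption: row 0 is the "unknown" row

df = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4], "cat": ["b", "a", "zzz", "c"]})

# Map names to codes; categories missing from the dictionary get the unknown code.
codes = df["cat"].map(name_to_code).fillna(unknown_code).astype(int).to_numpy()

# emb_matrix[codes] picks one embedding row per data row in a single step.
emb_df = pd.DataFrame(
    emb_matrix[codes],
    columns=[f"emb_{j}" for j in range(embedding_size)],
    index=df.index,
)

# All variables minus the categorical one, plus the embedding columns.
result = pd.concat([df.drop(columns="cat"), emb_df], axis=1)
```

Here `result` has the columns `x`, `emb_0`, ..., `emb_3`, and the same code runs unchanged on a validation frame because it never builds anything whose size depends on which categories are present.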