Limitations of regression in machine learning?

I've been learning some of the core concepts of ML lately and writing code with the scikit-learn library. After some basic practice, I tried my hand at the Airbnb NYC dataset from Kaggle (around 40,000 samples): https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png

Given the dataset's various features, I tried to build a model that could predict room/apartment prices. I recognized this as a regression problem, so, using this sklearn cheat sheet, I started trying out various regression models.
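For reference, my setup looks roughly like this (a simplified sketch: the column names are from the Kaggle CSV, and the feature selection/cleaning shown here is a placeholder for my actual preprocessing):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the CSV as distributed with the Kaggle dataset
df = pd.read_csv("AB_NYC_2019.csv")

# A handful of numeric features plus one-hot encoded room type;
# my real cleaning is more involved -- this is just the shape of it
numeric = ["latitude", "longitude", "minimum_nights",
           "number_of_reviews", "availability_365"]
X = pd.get_dummies(df[numeric + ["room_type"]], columns=["room_type"])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```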

I used sklearn.linear_model.Ridge as my baseline, and after some basic data cleaning I got an abysmal R^2 score of 0.12 on my test set. Then I thought maybe the linear model was too simplistic, so I tried the 'kernel trick' adapted for regression (sklearn.kernel_ridge.KernelRidge), but it took far too long to fit (>1 hr!). To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training, and then used a simple linear regression model. However, even that took a long time to transform and fit once I increased the n_components parameter, which I had to do to get any meaningful increase in accuracy.
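Simplified, the two approaches I tried look like this (alpha and n_components are placeholder values; I had to push n_components much higher to see any accuracy gain, which is where the time blows up):

```python
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Baseline: plain ridge regression (this is what scored R^2 ~ 0.12)
baseline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
baseline.fit(X_train, y_train)
print("Ridge R^2:", baseline.score(X_test, y_test))

# Approximate the RBF kernel map with Nystroem, then fit a linear
# model on the transformed features; raising n_components improves
# the approximation, but transform/fit time grows quickly with it
approx = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=300, random_state=42),
    Ridge(alpha=1.0),
)
approx.fit(X_train, y_train)
print("Nystroem + Ridge R^2:", approx.score(X_test, y_test))
```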

So now I'm wondering: what do you do when you want to run regression on a huge dataset? The kernel trick is extremely expensive computationally, while linear models are too simplistic, since real-world data is rarely linear. Are neural networks the only answer, or is there some clever solution I'm missing?
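For context on the cost claim: as far as I understand, exact kernel ridge has to build (and solve against) the full n x n kernel matrix, which is roughly cubic in the number of samples. A quick back-of-the-envelope for this dataset:

```python
# KernelRidge materializes the full n x n kernel (Gram) matrix
n = 40_000
gram_gb = n * n * 8 / 1e9   # float64 entries
print(f"{gram_gb:.1f} GB")  # ~12.8 GB for the kernel matrix alone,
                            # before the (roughly O(n^3)) solve
```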

P.S. I'm new to Stack Overflow, so please let me know what I can do to improve my question!
