我最近一直在研究高斯过程。概率多输出的观点在我的领域中很有前途。特别是空间统计。但是我遇到了三个问题:
meuse
数据集(来自R包sp
)运行一个简单的案例研究。import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
meuse = pd.read_csv(filepath_or_buffer='https://gist.githubusercontent.com/essicolo/91a2666f7c5972a91bca763daecdc5ff/raw/056bda04114d55b793469b2ab0097ec01a6d66c6/meuse.csv', sep=',')
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(121, aspect=1)
ax1.set_title('Lead')
ax1.scatter(x=meuse.x, y=meuse.y, s=meuse.lead, alpha=0.5, color='grey')
ax2 = fig.add_subplot(122, aspect=1)
ax2.set_title('Copper')
ax2.scatter(x=meuse.x, y=meuse.y, s=meuse.copper, alpha=0.5, color='orange')
plt.plot(meuse['lead'], meuse['copper'], '.')
plt.xlabel('Lead')
plt.ylabel('Copper')
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
reg = GPR(kernel=RBF())
reg.fit(X=meuse[['x', 'y']], y=meuse[['lead', 'copper']])
predicted = reg.predict(meuse[['x', 'y']])
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121)
ax1.set_title('Lead')
ax1.set_xlabel('Measured')
ax1.set_ylabel('Predicted')
ax1.plot(meuse.lead, predicted[:,0], '.')
ax2 = fig.add_subplot(122)
ax2.set_title('Copper')
ax2.set_xlabel('Measured')
ax2.set_ylabel('Predicted')
ax2.plot(meuse.copper, predicted[:,1], '.')
最佳答案
首先,您需要拆分数据。训练模型,然后根据相同的训练数据进行预测,就像您观察到的那样看起来过拟合,但是您没有在任何保留的数据上测试模型,因此您不知道它在野外的性能如何。尝试使用 sklearn.model_selection.train_test_split
分割数据,如下所示:
X_train, X_test, y_train, y_test = train_test_split(meuse[['x', 'y']], meuse[['lead', 'copper']])
length_scale=1e-05
的内核。基本上,您的模型中没有噪音。使用此设置所做的预测将非常紧密地围绕您的输入点(X_train
)集中,您将无法对它们周围的站点进行任何预测。您需要更改alpha
的GaussianProcessRegressor
参数来解决此问题。您可能需要对网格进行搜索,因为默认值为1e-10。例如,我使用了alpha=0.1
。reg = GPR(RBF(), alpha=0.1)
reg.fit(X_train, y_train)
predicted = reg.predict(X_test)
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121)
ax1.set_title('Lead')
ax1.set_xlabel('Measured')
ax1.set_ylabel('Predicted')
ax1.plot(y_test.lead, predicted[:,0], '.')
ax2 = fig.add_subplot(122)
ax2.set_title('Copper')
ax2.set_xlabel('Measured')
ax2.set_ylabel('Predicted')
ax2.plot(y_test.copper, predicted[:,1], '.')
MultiOutputRegressor
。最后将它们与原始模型进行比较。像这样:reg = GPR(RBF(), alpha=1)
reg.fit(X_train, y_train)
preds = reg.predict(X_test)
reg_lead = GPR(RBF(), alpha=1)
reg_lead.fit(X_train, y_train.lead)
lead_preds = reg_lead.predict(X_test)
reg_cop = GPR(RBF(), alpha=1)
reg_cop.fit(X_train, y_train.copper)
cop_preds = reg_cop.predict(X_test)
multi_reg = MultiOutputRegressor(GPR(RBF(), alpha=1))
multi_reg.fit(X_train, y_train)
multi_preds = multi_reg.predict(X_test)
https://stackoverflow.com/questions/43618633/