This project demonstrates the application of Generalized Additive Models (GAMs) and Nadaraya-Watson regression.
Introducing the Data
This dataset describes physicochemical properties of protein tertiary structures. It is a multivariate dataset with 45,730 samples. The attributes are as follows:
Target:
- RMSD - Size of the residue.
Features:
- F1 - Total surface area.
- F2 - Non-polar exposed area.
- F3 - Fractional area of exposed non-polar residue.
- F4 - Fractional area of exposed non-polar part of residue.
- F5 - Molecular mass weighted exposed area.
- F6 - Average deviation from standard exposed area of residue.
- F7 - Euclidean distance.
- F8 - Secondary structure penalty.
- F9 - Spatial distribution constraints (N,K value).
Setting up the DataFrame
Below is the Python code used to set up the DataFrame:
# Mount Google Drive to access the data
from google.colab import drive
drive.mount('/content/drive')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.dpi'] = 120
!pip install pygam
# general imports
import numpy as np
import pandas as pd
from pygam import LinearGAM
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from scipy.optimize import minimize
df = pd.read_csv('/content/drive/MyDrive/CASP(1).csv')
df
We can then divide the dataset into its independent and dependent variables, split it into training and testing sets, and standardize the features.
features = ['F1', 'F2','F3','F4','F5', 'F6','F7','F8','F9']
X = np.array(df[features])
y = np.array(df['RMSD']).reshape(-1,1)
Xdf = df[features]
Xdf.shape
X_train, X_test, y_train, y_test = tts(X,y,test_size=0.1,random_state=2021)
scale = StandardScaler()
Xs_train = scale.fit_transform(X_train)
Xs_test = scale.transform(X_test)
Generalized Additive Modeling (GAM)
Additive models estimate an additive approximation to the multivariate regression function. They can handle highly non-linear and non-monotonic relationships because the data themselves shape the predictor functions, and they help avoid the curse of dimensionality by relying on univariate smoothers. The individual term estimates also describe the relationship between each feature and the target. GAMs should be used when non-linearity in the partial residual plots suggests the need for a semi-parametric model, i.e. one in which the relationships among variables are not restricted to any particular shape.
A GAM divides the range of each predictor into sections and fits the data in each section with spline functions; the fitted smooth functions for all predictors are then added together, and the sum is related to the expected response through a link function. The individual smooths can also be estimated with a local smoother such as LOESS.
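For our nine features, using an identity link (which is what pyGAM's LinearGAM assumes), the model takes the form

$$\mathbb{E}[\mathrm{RMSD}] = \beta_0 + f_1(F1) + f_2(F2) + \dots + f_9(F9),$$

where each $f_j$ is a smooth spline function estimated from the data.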
Python Code
Let us attempt to fit a GAM with the pyGAM library. A GAM is typically fit with a fairly large number of splines (around 20), but let us begin with 6.
#fit the GAM with 6 splines
gam = LinearGAM(n_splines=6).gridsearch(Xs_train, y_train,objective='GCV')
gam.summary()
Below we see the summary of our model: we can use the AIC, GCV, and pseudo R-squared indices to help us understand the model. The summary also reports the smoothing penalty used, lambda = 0.001, which controls the strength of the regularization penalty on each term.
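Because each feature receives its own smooth function, we can also inspect the estimated shapes directly. Below is a minimal sketch using pyGAM's partial-dependence utilities, assuming the fitted gam object and the features list defined earlier (the x-axes are in standardized units, since the model was fit on scaled data):

# Plot the estimated smooth function for each of the nine features
fig, axs = plt.subplots(3, 3, figsize=(12, 9))
for i, ax in enumerate(axs.ravel()):
    XX = gam.generate_X_grid(term=i)                        # grid over feature i
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
    ax.set_title(features[i])
plt.tight_layout()
plt.show()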
RMSE
For this project, we will compare our models using the root mean squared error (RMSE). The RMSE represents the standard deviation of the prediction errors, known as residuals; the residuals measure how far each data point falls from the fitted regression. The RMSE formula is as follows:
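$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

where $y_i$ are the observed target values, $\hat{y}_i$ the model predictions, and $n$ the number of samples.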
To compute the RMSE, we can use the sklearn library and pass the parameter squared=False to mean_squared_error so that it returns the RMSE instead of the MSE. This is shown below:
from sklearn.metrics import mean_squared_error
yhat=gam.predict(Xs_test)
rms = mean_squared_error(y_test, yhat, squared=False)
rms
We want the RMSE to be as small as possible, and it must be evaluated in the context of the range of the target variable. Our RMSE for this model at 6 splines is:
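To put this error in context, a quick look at the spread of the target variable (using the df loaded above) shows the scale against which it should be judged:

# Summary statistics of the target, for judging the size of the RMSE
df['RMSD'].describe()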
R^2
We can also use R^2 to evaluate our models. R-Squared values show how well the data fits the regression model by determining the proportion of variance in y that is captured by the model.
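Concretely, R-squared compares the residual sum of squares to the total sum of squares around the mean $\bar{y}$:

$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}.$$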
from sklearn.metrics import r2_score as R2
yhat=gam.predict(Xs_test)
R_2= R2(y_test, yhat)
R_2
The closer the R-squared value is to one, the better the regression describes the data (a value near zero, or negative on test data, indicates a poor fit). Our R-squared value here is:
Residual Plots
Residuals measure the difference between the fitted regression and the observed data points. A residual plot has the residuals on the y-axis and the target values on the x-axis. Let us now look at the residual plot for this model:
# Compute deviance residuals on the scaled data the model was fit to
residuals = gam.deviance_residuals(Xs_train, y_train)
residualst = gam.deviance_residuals(Xs_test, y_test)
import matplotlib.pyplot as plt
plt.scatter(y_train, residuals, c="pink", s=20, alpha=0.5, label="train")
plt.scatter(y_test, residualst, c="green", s=20, alpha=0.5, label="test")
plt.xlabel("Target value (RMSD)")
plt.ylabel("Deviance residuals")
plt.legend()
plt.show()
We can also plot a histogram of the residuals to see whether they follow a particular distribution; the histogram shows how frequently each range of residual values occurs.
plt.hist(residuals, label='residuals train', alpha=.5, color='pink')
plt.hist(residualst, label='residuals test', alpha=.5, color='green')
plt.legend()
plt.show()
From the residual plot, we can see a large cluster that travels diagonally. This is a warning sign that our model may not be a good fit for this dataset: a well-fitting model produces residuals that are scattered randomly around zero with no clear pattern. Note that the pink dots are the training set and the green dots are the testing set.
Our histogram below is skewed to the right:
Different Splines
Let us see whether changing the number of splines can improve the RMSE and R-squared. We can test spline counts from 4 through 29:
listt = []
listr = []
for i in range(4, 30):
    gam = LinearGAM(n_splines=i).gridsearch(Xs_train, y_train, objective='GCV')
    yhat = gam.predict(Xs_test)
    rms = mean_squared_error(y_test, yhat, squared=False)
    R_2 = R2(y_test, yhat)
    listt.append(rms)
    listr.append(R_2)
RMSE vs. Number of Splines
fig, ax = plt.subplots(figsize=(8, 6))
a_range = range(4, 30)
ax.scatter(a_range, listt)
ax.plot(a_range, listt, c='red')
ax.set_xlabel("Number of splines")
ax.set_ylabel("Test RMSE")
min(listt)
Our lowest RMSE is:
The RMSE improves slightly as the number of splines increases, but it does not change dramatically.
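To identify which spline count attains this minimum, we can index back into the tested range (a small lookup using the listt and a_range objects defined above):

# Number of splines corresponding to the lowest test RMSE
list(a_range)[int(np.argmin(listt))]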
R^2 vs. Number of Splines
fig, ax = plt.subplots(figsize=(8, 6))
a_range = range(4, 30)
ax.scatter(a_range, listr)
ax.plot(a_range, listr, c='red')
ax.set_xlabel("Number of splines")
ax.set_ylabel("Test R-squared")
The R^2 value seems to improve when splines are increased:
KFold Split
Let us now try K-fold cross validation on this model with a k=10:
def do_kfold(X, y, k, rs, n_splines):
    PE_internal_validation = []
    PE_external_validation = []
    kf = KFold(n_splits=k, shuffle=True, random_state=rs)
    for idxtrain, idxtest in kf.split(X):
        X_train = X[idxtrain, :]
        y_train = y[idxtrain]
        X_test = X[idxtest, :]
        y_test = y[idxtest]
        gam = LinearGAM(n_splines=n_splines).gridsearch(X_train, y_train, objective='GCV')
        yhat_test = gam.predict(X_test)
        yhat_train = gam.predict(X_train)
        # internal validation: error on the data the model was fit to
        PE_internal_validation.append(mean_squared_error(y_train, yhat_train, squared=False))
        # external validation: error on the held-out fold
        PE_external_validation.append(mean_squared_error(y_test, yhat_test, squared=False))
    return np.mean(PE_internal_validation), np.mean(PE_external_validation)
do_kfold(X, y, 10, 2021, 6)
The internal and external RMSE validations are:
Nadaraya-Watson Regression
Let us now apply kernel regression to our dataset and see if it improves the results. We explored kernel regression in a previous project.
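As a reminder, the Nadaraya-Watson estimator predicts at a point $x$ by taking a kernel-weighted average of the training targets:

$$\hat{y}(x) = \frac{\sum_{i=1}^{n} K(x, x_i)\, y_i}{\sum_{j=1}^{n} K(x, x_j)}.$$

With the RBF kernel used below, $K(x, x_i) = \exp\!\left(-\gamma \lVert x - x_i \rVert^2\right)$, so larger values of $\gamma$ produce more local (and potentially wigglier) fits.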
Code Application
First, we must either import the Nadaraya-Watson Regression Code or place it into our file:
#from nadaraya_watson import NadarayaWatson, NadarayaWatsonCV
"""
Nadaraya-Watson Regression.
"""
# Authors: Joseph Knox <josephk@alleninstitute.org>
# License: Allen Institute Software License
# TODO : evaluate overwrite of K (kernel)
from __future__ import division
import numpy as np
from scipy.sparse import issparse
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics.scorer import check_scoring
from sklearn.metrics.pairwise import pairwise_kernels
from sklearn.model_selection import GridSearchCV, ParameterGrid, check_cv
from sklearn.model_selection._search import _check_param_grid
from sklearn.utils.validation import check_is_fitted
from sklearn.utils import check_array
from sklearn.utils import check_X_y


def squared_norm(arr):
    """Compute the square frobenius/vector norm.

    Parameters
    ----------
    arr : np.ndarray
        Array of which we compute the norm.

    Returns
    -------
    norm : float
    """
    arr = arr.ravel(order='K')
    return np.dot(arr, arr)
class NadarayaWatson(BaseEstimator, RegressorMixin):
    """NadarayaWatson Estimator.

    Parameters
    ----------
    kernel : string or callable, default="linear"
        Kernel mapping used to compute weights.

    gamma : float, default=None
        Gamma parameter for the RBF, laplacian, polynomial, exponential chi2
        and sigmoid kernels. Ignored by other kernels.

    degree : float, default=3
        Degree of the polynomial kernel. Ignored by other kernels.

    coef0 : float, default=1
        Zero coefficient for polynomial and sigmoid kernels.
        Ignored by other kernels.

    kernel_params : mapping of string to any, optional
        Additional parameters for kernel function passed as callable object.

    Notes
    -----
    See `sklearn.kernel_ridge <http://scikit-learn.org/stable/modules/
    generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.
    KernelRidge>`_, for more info: Kernel Ridge Regression estimator from
    which the structure of this estimator is based.

    Examples
    --------
    >>> import numpy as np
    >>> from nadaraya_watson import NadarayaWatson
    >>> # generate some fake data
    >>> n_samples, n_features = 10, 5
    >>> np.random.seed(0)
    >>> y = np.random.randn(n_samples)
    >>> X = np.random.randn(n_samples, n_features)
    >>> # fit regressor
    >>> reg = NadarayaWatson()
    >>> reg.fit(X, y)
    NadarayaWatson(coef0=1, degree=3, gamma=None, kernel='linear',
                   kernel_params=None)
    """
    def __init__(self, kernel="linear", degree=3,
                 coef0=1, gamma=None, kernel_params=None):
        self.kernel = kernel
        self.gamma = gamma
        self.degree = degree
        self.coef0 = coef0
        self.kernel_params = kernel_params

    def _get_kernel(self, X, y=None):
        """Gets kernel matrix."""
        if callable(self.kernel):
            params = self.kernel_params or {}
        else:
            params = {"gamma": self.gamma,
                      "degree": self.degree,
                      "coef0": self.coef0}
        return pairwise_kernels(X, y, metric=self.kernel,
                                filter_params=True, **params)

    @property
    def _pairwise(self):
        return self.kernel == "precomputed"

    def _check_fit_arrays(self, X, y, sample_weight=None):
        """Checks fit arrays and scales y if sample_weight is not None."""
        # Convert data
        X, y = check_X_y(X, y, accept_sparse=("csr", "csc"),
                         multi_output=True, y_numeric=True)
        if sample_weight is not None and not isinstance(sample_weight, float):
            # TODO: break up?
            sample_weight = check_array(sample_weight, ensure_2d=False)
            # do not want to rescale X!!!!
            y = np.multiply(sample_weight[:, np.newaxis], y)
        if len(y.shape) == 1:
            y = y.reshape(-1, 1)
        return X, y

    def fit(self, X, y, sample_weight=None):
        """Fit Nadaraya Watson estimator.

        Parameters
        ----------
        X : array, shape (n_samples, n_features)
            Training data.

        y : array, shape (n_samples, n_features)
            Target values.

        Returns
        -------
        self : returns an instance of self
        """
        X, y = self._check_fit_arrays(X, y, sample_weight)
        self.X_ = X
        self.y_ = y
        return self

    @staticmethod
    def _normalize_kernel(K, overwrite=False):
        """Normalizes kernel to have row sum == 1 if sum != 0"""
        factor = K.sum(axis=1)
        # if kernel has finite support, do not divide by zero
        factor[factor == 0] = 1
        # divide in place
        if overwrite:
            return np.divide(K, factor[:, np.newaxis], K)
        return K/factor[:, np.newaxis]

    def get_weights(self, X):
        """Return model weights."""
        check_is_fitted(self, ["X_", "y_"])
        K = self._get_kernel(X, self.X_)
        return self._normalize_kernel(K, overwrite=True)

    def predict(self, X):
        """Predict using the Nadaraya Watson model.

        Parameters
        ----------
        X : array, shape (n_samples, n_features)
            Training data.

        Returns
        -------
        C : array, shape (n_samples,) or (n_samples, n_targets)
            Returns predicted values.
        """
        check_is_fitted(self, ["X_", "y_"])
        if len(X.shape) == 1:
            X = X.reshape(-1, 1)
        w = self.get_weights(X)
        # TODO: evaluate sklearn.utils.extmath.safe_sparse_dot()
        if issparse(self.y_):
            # has to be of form sparse.dot(dense)
            # more efficient than w.dot( y_.toarray() )
            return self.y_.T.dot(w.T).T
        return w.dot(self.y_)

    @property
    def nodes(self):
        """Nodes (data)"""
        check_is_fitted(self, ["X_", "y_"])
        return self.y_
class _NadarayaWatsonLOOCV(NadarayaWatson):
    """Nadaraya watson with built-in Cross-Validation

    It allows efficient Leave-One-Out cross validation

    This class is not intended to be used directly. Use NadarayaWatsonCV instead.
    """
    def __init__(self, param_grid, scoring=None, store_cv_scores=False):
        # TODO: check _check_param_grid in proper spot
        self.param_grid = param_grid
        self.scoring = scoring
        self.store_cv_scores = store_cv_scores
        _check_param_grid(param_grid)

    @property
    def _param_iterator(self):
        return ParameterGrid(self.param_grid)

    def _errors_and_values_helper(self, K):
        """Helper function to avoid duplication between self._errors and
        self._values.

        fill diagonal with 0, renormalize
        """
        np.fill_diagonal(K, 0)
        S = self._normalize_kernel(K, overwrite=True)
        return S

    def _errors(self, K, y):
        """ mean((y - Sy)**2) = mean( ((I-S)y)**2 )"""
        S = self._errors_and_values_helper(K)
        # I - S (S has 0 on diagonal)
        S *= -1
        np.fill_diagonal(S, 1.0)
        mse = lambda x: squared_norm(x) / x.size
        return mse(S.dot(y))

    def _values(self, K, y):
        """ prediction """
        S = self._errors_and_values_helper(K)
        return S.dot(y)

    def fit(self, X, y, sample_weight=None):
        """Fit the model using efficient leave-one-out cross validation"""
        X, y = self._check_fit_arrays(X, y, sample_weight)
        candidate_params = list(self._param_iterator)
        scorer = check_scoring(self, scoring=self.scoring, allow_none=True)
        # error = scorer is None
        error = self.scoring is None
        if not error:
            # scorer wants an object to make predictions
            # but are already computed efficiently by _NadarayaWatsonCV.
            # This identity_estimator will just return them
            def identity_estimator():
                pass
            identity_estimator.predict = lambda y_pred: y_pred

        cv_scores = []
        for candidate in candidate_params:
            # NOTE: a bit hacky, find better way
            K = NadarayaWatson(**candidate)._get_kernel(X)
            if error:
                # NOTE: score not error!
                score = -self._errors(K, y)
            else:
                y_pred = self._values(K, y)
                score = scorer(identity_estimator, y, y_pred)
            cv_scores.append(score)

        self.n_splits_ = X.shape[0]
        self.best_index_ = np.argmax(cv_scores)
        self.best_score_ = cv_scores[self.best_index_]
        self.best_params_ = candidate_params[self.best_index_]
        if self.store_cv_scores:
            self.cv_scores_ = cv_scores
        return self
class NadarayaWatsonCV(NadarayaWatson):
    """NadarayaWatson Estimator with built in Leave-one-out cross validation.

    By default, it performs Leave-one-out cross validation efficiently, but
    can accept cv argument to perform arbitrary cross validation splits.

    Parameters
    ----------
    param_grid : dict or list of dictionaries
        Dictionary with parameters names (string) as keys and lists of
        parameter settings to try as values or a list of such dictionaries,
        in which case the grids spanned by each dictionary in the list are
        explored. This enables searching over any sequence of parameter settings.

    scoring : string, callable or None, optional, default: None
        A string (see sklearn.model_evaluation documentation) or a scorer
        callable object / function with signature ``scorer(estimator, X, y)``

    cv : int, cross-validation generator or an iterable, optional, default: None
        Determines the cross-validation splitting strategy. If None, perform
        efficient leave-one-out cross validation, else use
        sklearn.model_selection.GridSearchCV.

    store_cv_scores : boolean, optional, default=False
        Flag indicating if the cross-validation values should be stored in
        `cv_scores_` attribute. This flag is only compatible with `cv=None`.

    Attributes
    ----------
    cv_scores_ : array, shape = (n_samples, ~len(param_grid))
        Cross-validation scores for each candidate parameter (if
        `store_cv_scores=True` and `cv=None`)

    best_score_ : float
        Mean cross-validated score of the best performing estimator.

    n_splits_ : int
        Number of cross-validation splits (folds/iterations)

    Examples
    --------
    >>> import numpy as np
    >>> from nadaraya_watson import NadarayaWatson
    >>> # generate some fake data
    >>> n_samples, n_features = 10, 5
    >>> np.random.seed(0)
    >>> y = np.random.randn(n_samples)
    >>> X = np.random.randn(n_samples, n_features)
    >>> # fit regressor
    >>> param_grid = [dict(kernel=['linear'], degree=np.arange(1, 4)),
    ...               dict(kernel=['rbf'], gamma=np.logspace(-1, 1, 3))]
    >>> reg = NadarayaWatsonCV(param_grid)
    >>> reg.fit(X, y)
    NadarayaWatsonCV(coef0=1, cv=None, degree=3, gamma=1.0, kernel='rbf',
                     kernel_params=None,
                     param_grid=[{'kernel': ['linear'], 'degree': array([1, 2, 3])},
                                 {'kernel': ['rbf'], 'gamma': array([ 0.1, 1. , 10. ])}],
                     scoring=None, store_cv_scores=False)
    """
    def __init__(self, param_grid, scoring=None, cv=None, store_cv_scores=False,
                 kernel="linear", degree=3, coef0=1, gamma=None, kernel_params=None):
        self.param_grid = param_grid
        self.scoring = scoring
        self.cv = cv
        self.store_cv_scores = store_cv_scores
        # NadarayaWatson kwargs :: for compatibility
        self.kernel = kernel
        self.gamma = gamma
        self.degree = degree
        self.coef0 = coef0
        self.kernel_params = kernel_params

    def _update_params(self, param_dict):
        for k, v in param_dict.items():
            setattr(self, k, v)

    def fit(self, X, y, sample_weight=None):
        """Fit Nadaraya Watson estimator.

        Parameters
        ----------
        X : array, shape (n_samples, n_features)
            Training data.

        y : array, shape (n_samples, n_features)
            Target values.

        Returns
        -------
        self : returns an instance of self
        """
        if self.cv is None:
            estimator = _NadarayaWatsonLOOCV(param_grid=self.param_grid,
                                             scoring=self.scoring,
                                             store_cv_scores=self.store_cv_scores)
            estimator.fit(X, y, sample_weight=sample_weight)
            self.best_score_ = estimator.best_score_
            self.n_splits_ = estimator.n_splits_
            best_params_ = estimator.best_params_
            if self.store_cv_scores:
                self.best_index_ = estimator.best_index_
                self.cv_scores_ = estimator.cv_scores_
        else:
            if self.store_cv_scores:
                raise ValueError("cv!=None and store_cv_score=True "
                                 "are incompatible")
            gs = GridSearchCV(NadarayaWatson(), self.param_grid,
                              cv=self.cv, scoring=self.scoring, refit=True)
            gs.fit(X, y, sample_weight=sample_weight)
            estimator = gs.best_estimator_
            self.n_splits_ = gs.n_splits_
            self.best_score_ = gs.best_score_
            best_params_ = gs.best_params_

        # set params for predict
        self._update_params(best_params_)
        # store data for predict
        self.X_ = X
        self.y_ = y
        return self
Once this is done, we can move on to applying the model to our dataset.
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error as mse
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
param_grid = dict(kernel=["rbf"], gamma=np.linspace(30, 50, 50))
# A grid search over gamma could be run with the line below; here we instead fit
# a single model with a fixed bandwidth.
# model = GridSearchCV(NadarayaWatson(), scoring='neg_mean_squared_error', cv=5, param_grid=param_grid)
model = NadarayaWatson(kernel='rbf', gamma=35)
model.fit(X_train, y_train)
yhat = model.predict(X_test)
R-Squared and RMSE
We can find the R-squared value with the predicted and known set:
R2(y_test, yhat)
The R-squared value is:
Our RMSE is:
mse(y_test, yhat, squared=False)
Residuals Plot
Let us now see the Residuals Plot:
from yellowbrick.regressor import ResidualsPlot
visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train)    # Fit the training data to the visualizer
visualizer.score(X_test, y_test)    # Evaluate the model on the test data
visualizer.poof()
The residuals plot also shows an upward diagonal trend, again suggesting that this model may not have worked well for this dataset.
K-Fold Validation
Let us now cross validate the test and training set at k=10:
def DoKFold(X, y, k):
    PE = []
    kf = KFold(n_splits=k, shuffle=True, random_state=2021)
    for idxtrain, idxtest in kf.split(X):
        param_grid = dict(kernel=["rbf"], gamma=np.linspace(30, 50, 50))
        model = GridSearchCV(NadarayaWatson(), scoring='neg_mean_squared_error', cv=5, param_grid=param_grid)
        X_train = X[idxtrain, :]
        y_train = y[idxtrain]
        X_test = X[idxtest, :]
        y_test = y[idxtest]
        model.fit(X_train, y_train)
        yhat_test = model.predict(X_test)
        PE.append(mean_squared_error(y_test, yhat_test, squared=False))
    return np.mean(PE)
DoKFold(X, y, 10)
Unfortunately, the results for this cell never loaded; the code ran for a very long time without finishing, since each fold performs its own grid search over the kernel bandwidth.
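If we only wanted a rough cross-validated estimate, one option (a sketch, not run here, assuming the fixed bandwidth gamma=35 used earlier is acceptable) would be to drop the per-fold grid search and cross-validate the fixed model directly:

# Sketch: K-fold RMSE for a fixed-bandwidth Nadaraya-Watson model,
# avoiding the nested grid search inside each fold.
def DoKFold_fixed(X, y, k):
    PE = []
    kf = KFold(n_splits=k, shuffle=True, random_state=2021)
    for idxtrain, idxtest in kf.split(X):
        model = NadarayaWatson(kernel='rbf', gamma=35)   # fixed kernel and bandwidth
        model.fit(X[idxtrain, :], y[idxtrain])
        yhat_test = model.predict(X[idxtest, :])
        PE.append(mean_squared_error(y[idxtest], yhat_test, squared=False))
    return np.mean(PE)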
Comparison between GAM and Nadaraya-Watson Regression
We can see from the results that GAM performed better in terms of RMSE, although both values were fairly low. Neither model clearly outperformed the other in terms of the R-squared value. Both models' residual plots also contained clusters that sloped diagonally, which suggests that this dataset most likely does not work well with either GAM or kernel regression.