Costa Rican Household Poverty Level Prediction - IDB
Introducing the Data
This dataset comes from an old Kaggle competition asking users to classify household poverty levels on a scale of 1 to 4 based on demographic data. The dataset originally had 142 features with 9557 data entries, but was reduced during the preprocessing and cleanup phases. The dataset's features before preprocessing were as follows:
Preprocessing
The 9557 data entries were reduced to 156 after rows with missing values were dropped. Although NaN values can be estimated with KNN algorithms or imputed with designated values, an enormous dataset was not necessary for the primary purpose of this project. Four columns were also dropped because they were object-type variables; rather than deleting these columns, we could have assigned representative numbers (for example, Yes=0, No=1). A sketch of both alternatives is shown below. The final preprocessed dataset had 156 rows and 138 columns.
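Neither alternative was used here, but they could look something like the following sketch (KNNImputer is scikit-learn's; the file path and the 'dependency' column are purely illustrative):

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('train.csv')  #illustrative path

#Option 1: estimate each NaN from the 5 nearest rows instead of dropping rows
numeric = df.select_dtypes(exclude=['object'])
imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(numeric),
                       columns=numeric.columns)

#Option 2: encode a yes/no object column numerically instead of dropping it
df['dependency'] = df['dependency'].map({'yes': 0, 'no': 1})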
#Import Libraries
import numpy as np
import pandas as pd

#Create Dataset
df = pd.read_csv('/content/drive/MyDrive/costa-rican-household-poverty-prediction/train.csv')

#Drop rows with NaN values (work on a copy so the original frame stays intact)
df2 = df.copy()
df2 = df2.dropna()

#Exclude type: object
df2 = df2.select_dtypes(exclude=['object'])

#Convert all remaining columns to numeric
df2 = df2.apply(pd.to_numeric)

#Define X variable/features
X = df2.drop('Target', axis=1)

#Define y variable/target
y = df2['Target']
We can also see the data distribution in our target:
y.value_counts()
We can see that the dataset has very little data recorded for poverty levels 1 and 2, making it likely that our model will not accurately predict these levels. Many rows in these categories probably contained unknown values and were deleted in our preprocessing phase, so we should reconsider how we process our NaN values in the future.
Correlation
In order to later understand regularization techniques, the correlation of the features must be analyzed.
corr_matrix = np.corrcoef(X, rowvar=False)  #rowvar=False treats each column as a variable
corr_matrix
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(200,200))
sns.heatmap(X.corr(), vmax=1, vmin=-1, cmap="spring", annot=True,fmt='.2f')
plt.show()
Although the number of features makes it somewhat hard to judge specific correlations, we can see that the majority of the feature pairs have only a mild correlation (dominance of the peachy coral pink color).
To further verify that the data is weakly correlated, we can apply a one-sample t-test with a two-sided p-value.
from scipy import stats

def tstat_corr(r, n):
    t = r*np.sqrt(n-2)/np.sqrt(1-r**2)
    pval = stats.t.sf(np.abs(t), n-2)*2  #two-sided p-value with n-2 degrees of freedom
    print('t-statistic = %6.3f pvalue = %6.4f' % (t, pval))

tstat_corr(0.02, len(X))  #put in an r value from the matrix
One of the lowest correlation values in the heatmap was chosen as the r value. We are left with the following results:
The t-statistic measures the size of the correlation relative to the variation in the data; the larger t is, the stronger the evidence of a significant relationship. Our p-value is extremely high at about 0.8, so we cannot reject the hypothesis of zero correlation; the data is most likely weakly correlated.
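As a quick sanity check, plugging r = 0.02 and n = 156 into the formula above approximately reproduces this result:

$$t = \frac{0.02\sqrt{156-2}}{\sqrt{1-0.02^2}} \approx 0.248, \qquad p \approx 0.80$$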
Regularization and Feature Selection
Finding the Weights Mathematically without Regularization
Let us assume we have a model Y with equation Y=F(X1, X2,…,Xp). In linear regression, each feature is multiplied by a weight, and these weights define the sparsity pattern of the data. An error term is also added into the equation:

$$y = w_1 x_1 + w_2 x_2 + \dots + w_p x_p + \epsilon$$

Our goal is to minimize the squared error by obtaining weights that do so. The error term is found by subtracting the predicted values from the observed values and then squaring:

$$SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

The sum of squared errors can also be expressed in matrix form, since our datasets tend to have hundreds or thousands of rows. In matrix notation, the sum of squared errors is the product of the error vector transposed with itself. Substituting in the equation for the error we have:

$$SSE = (y - Xw)^{\top}(y - Xw)$$

Then, the partial derivatives with respect to the weights are taken:

$$\frac{\partial SSE}{\partial w} = -2X^{\top}(y - Xw)$$

and the equation is set to zero to obtain the global minimum:

$$w = (X^{\top}X)^{-1}X^{\top}y$$

This becomes our solution to the linear regression.
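As a minimal numpy sketch (assuming X and y are already numeric arrays), this closed-form solution can be computed directly:

import numpy as np

def ols_weights(X, y):
    #w = (X^T X)^{-1} X^T y; solve() avoids forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)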
Linear Regression Application of the IDB Dataset
Note: I was so excited about finding this dataset that I did not realize it addressed a classification problem rather than a linear regression. It was not until I had finished the majority of the coding that I realized that linear models predict on a continuous scale rather than a discrete one, so I was seeing predicted values like 1.6 rather than 2. To compensate for this error, I decided to round the values to integers. This led to more error/less accuracy than would have occurred with a different X and y. I have kept the error in, since this is the second time I have made it in a machine learning project, and I would like it to remain a reminder to understand the dataset before rushing into the application phase. An SVM would work better for classification problems.
Before applying any regularization, let us examine a linear regression model on the data:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lm = LinearRegression()
#fit the variables to model
lm.fit(X_train, y_train)
#predict output by inputting test variable
yhat_lm=lm.predict(X_test)
#Turn continuous variables into discrete
ylist = []
for output in yhat_lm:
    if output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_lm_rounded = np.array(ylist)
#MAE
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lm.intercept_))
mae = mean_absolute_error(y_test, y_hat_lm_rounded)
print("MAE = {:,.2f}".format(mae))
Predicted Values:
lm.coef_
#max coefficient (range)
max(lm.coef_)
#min coefficient
min(lm.coef_)
Max:
Min:
We can see from the data that our coefficients have a rather wide range and our MAE is high. Let us see what happens when we apply regularization to our model.
Regularization Embedded Models
When we are working with large datasets such as this one, too many features may create bias and variance in our results. Bias is defined as the inability of the model to capture the true relationship in the data, while variance is defined as the difference in fit between the training and testing sets. For example, a model has high variance if it predicts the training set very accurately but fails to do so on the testing set (overfitting). To reduce bias and variance, feature selection, regularization, boosting, and bagging techniques can be applied.
Feature selection is the selection of the features that contribute most to the accuracy of the model. Regularization adds constraints that lower the size of the coefficients, making the model less complex and keeping it from fitting variables that are less important. It penalizes the loss function by adding a regularization term, minimizing the sum of squared residuals under a constraint on the weight vector. The general form is shown below.
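$$\min_{w}\ \sum_{i=1}^{n}\left(y_i - x_i^{\top}w\right)^2 + \lambda\,R(w)$$

Here R(w) is the penalty term (the L1 norm for Lasso, the squared L2 norm for Ridge) and λ (the alpha parameter in scikit-learn) controls how strongly it is enforced.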
Lasso Regression/ L1 Regularization
Lasso Regression will produce a model with high accuracy and a subset of the original features. Lasso regression puts in a constraint where the sum of absolute values of coefficients is less than a fixed value. Thus, it will lower the size of the coefficients and lead some features to have a coefficient of 0, essentially dropping it from the model.
Looking back at our model without regularization, we saw that our coefficients were found with the formula:

$$w = (X^{\top}X)^{-1}X^{\top}y$$

L1 regularization's loss function is defined by:

$$\sum_{i=1}^{n}\left(y_i - x_i^{\top}w\right)^2 + \alpha\sum_{j=1}^{p}|w_j|$$

otherwise known as:

$$\|y - Xw\|_2^2 + \alpha\|w\|_1$$

By adding in this extra cost term, the weights are penalized. Because the absolute-value penalty is not differentiable at zero, there is no general closed-form solution; solvers instead derive the weights iteratively.
This type of regularization should not be applied to a dataset with a low number of features, as it may drop features that are essential to the model. Lasso regularization also does not work well with highly correlated features, as it may drop whole correlated groups. The solution will be sparse, with a large proportion of the weights exactly zero.
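Under the hood, coordinate-descent solvers produce this sparsity by applying a soft-thresholding operator to each coefficient; a minimal sketch of that operator:

import numpy as np

def soft_threshold(rho, lam):
    #Shrink rho toward zero by lam; anything inside [-lam, lam] becomes exactly 0,
    #which is how lasso drops features from the model
    return np.sign(rho) * max(abs(rho) - lam, 0.0)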
Lasso Regression Application
We can first fit our data to a lasso regression:
from sklearn.linear_model import Lasso
ls = Lasso(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
ls.fit(X_train,y_train)
yhat_ls=ls.predict(X_test)
#Turn continuous variables into discrete
ylist = []
for output in yhat_ls:
    if output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_ls_rounded = np.array(ylist)
Our predicted values are:
Let us find the mean absolute error of the model on the testing set:
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(ls.intercept_))
mae = mean_absolute_error(y_test, y_hat_ls_rounded)
print("MAE = {:,.2f}".format(mae))
The weighted coefficients are:
print('Coefficients')
ls.coef_
Let us examine the range of the coefficients:
min(ls.coef_)
max(ls.coef_)
We can see that the range of the coefficients has decreased greatly compared to the non-regularized model.
Since Lasso Regression eliminates certain features, we can derive the ones that are left:
#get features that work (nonzero coefficients)
nonzero_coefs = []
counted_list = []
for count, a in enumerate(ls.coef_):
    if a != 0:
        nonzero_coefs.append(a)
        counted_list.append(count)

#Features
for c in counted_list:
    print(X.columns[c])
Note that we chose an arbitrary penalty term (alpha value) in this example. To choose alpha properly, we should use cross-validation: the alpha value that gives us the lowest validation error is optimal. We can test other alpha values that may make our predictions better; a cross-validated sketch follows the plot below.
#test other alphas
import matplotlib.pyplot as plt

maeLS = []
for i in range(100):
    ls = Lasso(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    ls.fit(X_train, y_train)
    yhat_ls = ls.predict(X_test)
    #Turn continuous variables into discrete
    ylist = []
    for output in yhat_ls:
        if output < 4.5:
            ylist.append(round(output))
        else:
            ylist.append(4)
    y_hat_ls_rounded = np.array(ylist)
    maeLS.append(mean_absolute_error(y_test, y_hat_ls_rounded))
plt.scatter(range(100), maeLS)
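As mentioned above, a true cross-validated search would pick alpha on validation folds rather than on the test set; a sketch using scikit-learn's LassoCV (the alpha grid is illustrative):

from sklearn.linear_model import LassoCV

#5-fold cross-validation over an illustrative grid of alphas
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0], cv=5)
lasso_cv.fit(X_train, y_train)
print("Best alpha:", lasso_cv.alpha_)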
Standardized Lasso Regression Application
Looking at our data, we can observe that the features are on very different scales. Thus, we can standardize our data to attempt to improve our model. We apply the same steps while standardizing our features:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ls = Lasso(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
ls.fit(ss.fit_transform(X_train), y_train)
#Use transform (not fit_transform) so the test set is scaled with the training statistics
yhat_ls = ls.predict(ss.transform(X_test))
#Turn continuous variables into discrete
ylist = []
for output in yhat_ls:
    if output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_ls_rounded = np.array(ylist)
y_hat_ls_rounded
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(ls.intercept_))
mae = mean_absolute_error(y_test, y_hat_ls_rounded)
print("MAE = {:,.2f}".format(mae))
We can see that standardizing our data improves our MAE greatly at the same alpha (from 0.32 to 0.23).
print('Coefficients')
ls.coef_
Let us look at the range of our coefficients:
min(ls.coef_)
max(ls.coef_)
Thus, we can see that the standardized Lasso Regression decreased the range of weights in comparison to the non-standardized Lasso Regression.
#get features that work (nonzero coefficients)
nonzero_coefs = []
counted_list = []
for count, a in enumerate(ls.coef_):
    if a != 0:
        nonzero_coefs.append(a)
        counted_list.append(count)

#Features
for c in counted_list:
    print(X.columns[c])
#test other alphas
import matplotlib.pyplot as plt

maeSLS = []
for i in range(100):
    ss = StandardScaler()
    ls = Lasso(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    ls.fit(ss.fit_transform(X_train), y_train)
    yhat_ls = ls.predict(ss.transform(X_test))
    #Turn continuous variables into discrete
    ylist = []
    for output in yhat_ls:
        if output < 4.5:
            ylist.append(round(output))
        else:
            ylist.append(4)
    y_hat_ls_rounded = np.array(ylist)
    maeSLS.append(mean_absolute_error(y_test, y_hat_ls_rounded))
plt.scatter(range(100), maeSLS)
We see a similar trend in the standardized and non-standardized forms of Lasso regression; the MAE plateaus below 0.22 as alpha grows.
Ridge Regression/L2 Regularization
Ridge Regression shares many conceptual similarities with Lasso Regression; it also adds a penalty to the loss function, but its regularization term is the sum of squares of the feature weights. Unlike Lasso Regression, this type of regression makes the weights smaller but never exactly zero. Ridge regression is not good for data with many outliers, as it blows up their error differences and the regularization term tries to compensate by penalizing the weights. Ridge regression works best when all the features influence the output and the weights are roughly the same size. This technique does not offer feature selection, and its solution is non-sparse. It should be noted that ridge regression can help solve models with fewer data points than parameters. Ridge regression penalizes large weight coefficients more heavily than small ones, as opposed to Lasso regression, which penalizes each coefficient at the same rate.
Ridge regression's loss function is defined as:

$$\|y - Xw\|_2^2 + \alpha\|w\|_2^2$$

One can solve for the weights of ridge regression by taking the derivative and finding the closed-form solution. We end up with the form:

$$w = (X^{\top}X + \alpha I)^{-1}X^{\top}y$$
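A minimal numpy sketch of this closed-form ridge solution:

import numpy as np

def ridge_weights(X, y, alpha):
    #w = (X^T X + alpha * I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)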
Ridge Regression Application
Ridge Regression has a similar application, code-wise, to Lasso Regression:
#Apply Ridge
from sklearn.linear_model import Ridge
lr = Ridge(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lr.fit(X_train,y_train)
yhat_lr=lr.predict(X_test)
#Turn continuous variables into discrete
ylist = []
for output in yhat_lr:
    if output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_lr_rounded = np.array(ylist)
We are able to calculate the MAE on unstandardized data:
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lr.intercept_))
mae = mean_absolute_error(y_test, y_hat_lr_rounded)
print("MAE = {:,.2f}".format(mae))
We see that unstandardized Ridge and Lasso regularization produce the same MAE for this dataset.
print('Coefficients:')
lr.coef_
Our range of weights decreases from the non-regularized linear regression coefficients, but not by as much as with Lasso Regression:
max(lr.coef_)
min(lr.coef_)
Max:
Min:
We once again test for optimal alpha values:
import matplotlib.pyplot as plt

maeLR = []
for i in range(1000):
    lr = Ridge(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    lr.fit(X_train, y_train)
    yhat_lr = lr.predict(X_test)
    #Turn continuous variables into discrete
    ylist = []
    for output in yhat_lr:
        if output < 4.5:
            ylist.append(round(output))
        else:
            ylist.append(4)
    y_hat_lr_rounded = np.array(ylist)
    maeLR.append(mean_absolute_error(y_test, y_hat_lr_rounded))
plt.scatter(range(1000), maeLR)
It can be noted that as alpha gets bigger, y becomes less sensitive to the features. The MAE drops below 0.20 at an alpha of about 250, the best model so far.
Standardized Ridge Regression Application
We repeat the process with standardization:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
lr = Ridge(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lr.fit(ss.fit_transform(X_train), y_train)
#Scale the test set with the training statistics
scaled_yhat_lr = lr.predict(ss.transform(X_test))
#Turn continuous variables into discrete
ylist = []
for output in scaled_yhat_lr:
    if output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_lr_rounded = np.array(ylist)
The MAE has increased above all of the other models, including the non-regularized model, suggesting that this may not be an adequate structure for our data.
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lr.intercept_))
mae = mean_absolute_error(y_test, y_hat_lr_rounded)
print("MAE = {:,.2f}".format(mae))
print('Coefficients:')
lr.coef_
Although our MAE has increased, the range of the weight coefficients is still lower than in the original model.
max(lr.coef_)
min(lr.coef_)
Max:
Min:
Trying different penalty values we get:
import matplotlib.pyplot as plt

maeSLR = []
for i in range(200):
    lr = Ridge(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    lr.fit(ss.fit_transform(X_train), y_train)
    scaled_yhat_lr = lr.predict(ss.transform(X_test))
    #Turn continuous variables into discrete
    ylist = []
    for output in scaled_yhat_lr:
        if output < 4.5:
            ylist.append(round(output))
        else:
            ylist.append(4)
    y_hat_lr_rounded = np.array(ylist)
    maeSLR.append(mean_absolute_error(y_test, y_hat_lr_rounded))
plt.scatter(range(200), maeSLR)
Note that at alpha values of about 90-150, the MAE decreases to 0.15. This is the lowest MAE recorded so far, suggesting that this may indeed be a good model for our data.
Ridge vs Lasso
Let us more clearly compare the two regularization techniques:
ElasticNet
ElasticNet regression combines L1 and L2 regularization:

$$\|y - Xw\|_2^2 + \lambda_1\|w\|_1 + \lambda_2\|w\|_2^2$$

This particular model combines feature elimination with coefficient reduction to create a better model; ElasticNet also does well with highly correlated data. When variables are correlated, Ridge regression shrinks their coefficients towards each other, while Lasso tends to keep one and drop the others. ElasticNet avoids such problems by combining both penalties.
from sklearn.linear_model import ElasticNet
lel = ElasticNet(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lel.fit(X_train,y_train)
yhat_lel=lel.predict(X_test)
#Turn continuous variables into discrete
ylist = []
for output in yhat_lel:
    if output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_lel_rounded = np.array(ylist)
Our MAE becomes:
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lel.intercept_))
mae = mean_absolute_error(y_test, y_hat_lel_rounded)
print("MAE = {:,.2f}".format(mae))
The MAE is about the same as our linear regression model at this alpha. Let us see the range of coefficients:
print('Coefficients:')
lel.coef_
min(lel.coef_)
max(lel.coef_)
Our range has shrunk to a difference of about 0.6. This is a great improvement from the non-regularized model. We can also note that several features have coefficients of zero, meaning they have been deemed less necessary by the model.
nonzero_coefs = []
counted_list = []
for count, a in enumerate(lel.coef_):
    if a != 0:
        nonzero_coefs.append(a)
        counted_list.append(count)

#Features
for c in counted_list:
    print(X.columns[c])
Let us examine other alphas to see if there is a better MAE:
import matplotlib.pyplot as plt

maeEN = []
for i in range(200):
    lel = ElasticNet(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    lel.fit(X_train, y_train)
    yhat_lel = lel.predict(X_test)
    #Turn continuous variables into discrete
    ylist = []
    for output in yhat_lel:
        if output < 4.5:
            ylist.append(round(output))
        else:
            ylist.append(4)
    y_hat_lel_rounded = np.array(ylist)
    maeEN.append(mean_absolute_error(y_test, y_hat_lel_rounded))
plt.scatter(range(200), maeEN)
From the graph we see that we can achieve a lower MAE once we increase the value of alpha. The graph plateaus at an MAE of less than 0.225.
Standardized ElasticNet
Let us examine if standardizing the data can improve our results:
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
lel = ElasticNet(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lel.fit(ss.fit_transform(X_train), y_train)
yhat_lel = lel.predict(ss.transform(X_test))
#Turn continuous variables into discrete
ylist = []
for output in yhat_lel:
    if output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_lel_rounded = np.array(ylist)
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lel.intercept_))
mae = mean_absolute_error(y_test, y_hat_lel_rounded)
print("MAE = {:,.2f}".format(mae))
Our MAE at the same alpha is much lower than with the unstandardized data. Let us examine our range of coefficients:
print('Coefficients:')
lel.coef_
min(lel.coef_)
max(lel.coef_)
Our range of coefficients has roughly halved, going from a difference of ~0.6 to ~0.3.
nonzero_coefs = []
counted_list = []
for count, a in enumerate(lel.coef_):
    if a != 0:
        nonzero_coefs.append(a)
        counted_list.append(count)

#Features
for c in counted_list:
    print(X.columns[c])
Let us examine other alphas:
import matplotlib.pyplot as plt

maeSEN = []
for i in range(200):
    ss = StandardScaler()
    lel = ElasticNet(alpha=i)  #vary alpha with the loop index
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    lel.fit(ss.fit_transform(X_train), y_train)
    yhat_lel = lel.predict(ss.transform(X_test))
    #Turn continuous variables into discrete
    ylist = []
    for output in yhat_lel:
        if output < 4.5:
            ylist.append(round(output))
        else:
            ylist.append(4)
    y_hat_lel_rounded = np.array(ylist)
    maeSEN.append(mean_absolute_error(y_test, y_hat_lel_rounded))
plt.scatter(range(200), maeSEN)
Square Root Lasso/ Scaled Lasso
Square Root Lasso Regularization is a variant of the lasso that is robust to noise in the data and can fit potentially high-dimensional data. It iteratively estimates the noise level via the mean residual square and sets the penalty in proportion to that estimated noise level, balancing it against a sparse regression fit.
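The objective is commonly written as:

$$\min_{w}\ \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}w\right)^2} + \lambda\sum_{j=1}^{p}|w_j|$$

Taking the square root of the mean squared residual makes the optimal penalty level independent of the unknown noise level.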
Application of Square Root Lasso in Python
Let us now try to apply the Square Root Lasso in Python. We can calculate the MAE on a held-out test set:
! pip install --upgrade Cython
! pip install --upgrade git+https://github.com/statsmodels/statsmodels
import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split as tts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
model = sm.OLS(y_train,X_train)
result = model.fit_regularized(method='sqrt_lasso', alpha=0.01)
yhat_test = result.predict(X_test)
#Turn continuous variables into discrete
ylist = []
for output in yhat_test:
    if output < 0.5:
        ylist.append(1)
    elif output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_test_rounded = np.array(ylist)
y_hat_test_rounded
Let us calculate our MAE:
MAE(y_test,y_hat_test_rounded)
Our MAE is 0.34 at alpha 0.01, which is on the higher end of our models. Let us look at the range of the coefficients:
result.params
min(result.params)
max(result.params)
We can see from the results that the range of coefficients has increased; this type of regularization may not be adequate for our model.
Let us examine other alpha values to see if we can improve our results:
maeSL = []
for i in range(200):
    model = sm.OLS(y_train, X_train)
    result = model.fit_regularized(method='sqrt_lasso', alpha=i)
    yhat_test = result.predict(X_test)
    maeSL.append(MAE(y_test, yhat_test))
plt.scatter(range(200), maeSL)
Our results show that the MAE is never better than 0.340, and increases as alpha is increased.
Scaled Square Root Lasso
Let us try to improve our model by scaling our data:
scale = StandardScaler()
Xs_train = scale.fit_transform(X_train)
Xs_test = scale.transform(X_test)
model = sm.OLS(y_train,Xs_train)
result = model.fit_regularized(method='sqrt_lasso', alpha=0.5)
yhat_test = result.predict(Xs_test)
#Turn continuous variables into discrete
ylist = []
for output in yhat_test:
    if output < 0.5:
        ylist.append(1)
    elif output < 4.5:
        ylist.append(round(output))
    else:
        ylist.append(4)
y_hat_test_rounded = np.array(ylist)
y_hat_test_rounded
Our predicted values look very different than our actual values. Let us look at our MAE:
MAE(y_test,y_hat_test_rounded)
Our MAE is now extremely high. Let us look at the range of our coefficients.
result.params
min(result.params)
max(result.params)
Our range of coefficients is not high, but due to our large MAE, this model is not adequate for our data. Let us examine other alpha values:
maeSSL = []
for i in range(200):
    model = sm.OLS(y_train, Xs_train)
    result = model.fit_regularized(method='sqrt_lasso', alpha=i)  #vary alpha with the loop index
    yhat_test = result.predict(Xs_test)
    maeSSL.append(MAE(y_test, yhat_test))
plt.scatter(range(200), maeSSL)
As we can see, increasing or decreasing the alpha values did not have an impact on our MAE.
SCAD
The last feature selection method we have is Smoothly Clipped Absolute Deviations (SCAD).
SCAD encourages sparsity while allowing large weights to escape heavy penalization, which reduces the bias that Lasso introduces on large coefficients. This penalty relaxes its rate of penalization as the absolute value of the weight coefficient increases, unlike Lasso regularization, which increases the penalty linearly with the weight.
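The penalty itself is piecewise, matching the three masks in the code below:

$$p_{\lambda,a}(\beta) = \begin{cases} \lambda|\beta| & \text{if } |\beta| \le \lambda \\ \dfrac{2a\lambda|\beta| - \beta^2 - \lambda^2}{2(a-1)} & \text{if } \lambda < |\beta| \le a\lambda \\ \dfrac{\lambda^2(a+1)}{2} & \text{if } |\beta| > a\lambda \end{cases}$$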
Application of SCAD to IDB problem
We begin by reshaping the data and defining the SCAD algorithm:
y = y.values.reshape(-1,1)
X = X.values

from scipy.optimize import minimize

def scad_penalty(beta_hat, lambda_val, a_val):
    #Masks for the three pieces of the SCAD penalty
    is_linear = (np.abs(beta_hat) <= lambda_val)
    is_quadratic = np.logical_and(lambda_val < np.abs(beta_hat), np.abs(beta_hat) <= a_val * lambda_val)
    is_constant = (a_val * lambda_val) < np.abs(beta_hat)
    linear_part = lambda_val * np.abs(beta_hat) * is_linear
    quadratic_part = (2 * a_val * lambda_val * np.abs(beta_hat) - beta_hat**2 - lambda_val**2) / (2 * (a_val - 1)) * is_quadratic
    constant_part = (lambda_val**2 * (a_val + 1)) / 2 * is_constant
    return linear_part + quadratic_part + constant_part

def scad_derivative(beta_hat, lambda_val, a_val):
    return lambda_val * ((beta_hat <= lambda_val) + (a_val * lambda_val - beta_hat)*((a_val * lambda_val - beta_hat) > 0) / ((a_val - 1) * lambda_val) * (beta_hat > lambda_val))

def scad(beta):
    #Penalized mean squared error to minimize
    beta = beta.flatten()
    beta = beta.reshape(-1,1)
    n = len(y)
    return 1/n*np.sum((y-X.dot(beta))**2) + np.sum(scad_penalty(beta,lam,a))

def dscad(beta):
    #Gradient of the penalized objective
    beta = beta.flatten()
    beta = beta.reshape(-1,1)
    n = len(y)
    return np.array(-2/n*np.transpose(X).dot(y-X.dot(beta))+scad_derivative(beta,lam,a)).flatten()

p = X.shape[1]
b0 = np.random.normal(1,1,p)
lam = 0.5
a = 0.001  #note: SCAD's a is conventionally greater than 2 (Fan and Li suggest 3.7)
output = minimize(scad, b0, method='L-BFGS-B', jac=dscad, options={'gtol': 1e-8, 'maxiter': 50000, 'maxls': 25, 'disp': True})
yhat_test_scad = X_test.dot(output.x)
output.x
min(output.x)
max(output.x)
The range of our output coefficients is rather large.
yhat_test_scad = X_test.dot(output.x)
yhat_test_scad
Let us find the predicted y values:
ylist = []
for pred in yhat_test_scad:
    if pred < 0.5:
        ylist.append(1)
    elif pred < 4.5:
        ylist.append(round(pred))
    else:
        ylist.append(4)
y_hat_test_rounded = np.array(ylist)
y_hat_test_rounded
Let us find the MAE at this beta:
mae = mean_absolute_error(y_test, y_hat_test_rounded)
print("MAE = {:,.2f}".format(mae))
Our MAE is extremely high for our data, as the model predicted a lot of 1 values. Let us try other values of the a parameter:
listofoutput = []
lam = 0.5
for i in range(200):
    a = i
    output = minimize(scad, b0, method='L-BFGS-B', jac=dscad, options={'gtol': 1e-8, 'maxiter': 50000, 'maxls': 25, 'disp': True})
    yhat_test_scad = X_test.dot(output.x)
    ylist = []
    for pred in yhat_test_scad:
        if pred < 0.5:
            ylist.append(1)
        elif pred < 4.5:
            ylist.append(round(pred))
        else:
            ylist.append(4)
    y_hat_test_rounded = np.array(ylist)
    listofoutput.append(mean_absolute_error(y_test, y_hat_test_rounded))
plt.scatter(range(200), listofoutput)
min(listofoutput)
We can see that the minimum MAE is 0.21, occurring at some value of a between 0 and 200.
Standardized SCAD with the IDB dataset
Let us see if standardizing our dataset can help lower our MAE:
#y and X were already converted to numpy arrays above
Xs = scale.fit_transform(X)

def scad_penalty(beta_hat, lambda_val, a_val):
    is_linear = (np.abs(beta_hat) <= lambda_val)
    is_quadratic = np.logical_and(lambda_val < np.abs(beta_hat), np.abs(beta_hat) <= a_val * lambda_val)
    is_constant = (a_val * lambda_val) < np.abs(beta_hat)
    linear_part = lambda_val * np.abs(beta_hat) * is_linear
    quadratic_part = (2 * a_val * lambda_val * np.abs(beta_hat) - beta_hat**2 - lambda_val**2) / (2 * (a_val - 1)) * is_quadratic
    constant_part = (lambda_val**2 * (a_val + 1)) / 2 * is_constant
    return linear_part + quadratic_part + constant_part

def scad_derivative(beta_hat, lambda_val, a_val):
    return lambda_val * ((beta_hat <= lambda_val) + (a_val * lambda_val - beta_hat)*((a_val * lambda_val - beta_hat) > 0) / ((a_val - 1) * lambda_val) * (beta_hat > lambda_val))

def scad(beta):
    #Fit on the standardized features this time
    beta = beta.flatten()
    beta = beta.reshape(-1,1)
    n = len(y)
    return 1/n*np.sum((y-Xs.dot(beta))**2) + np.sum(scad_penalty(beta,lam,a))

def dscad(beta):
    beta = beta.flatten()
    beta = beta.reshape(-1,1)
    n = len(y)
    return np.array(-2/n*np.transpose(Xs).dot(y-Xs.dot(beta))+scad_derivative(beta,lam,a)).flatten()

p = Xs.shape[1]
b0 = np.random.normal(1,1,p)
lam = 1
a = 0.01
output = minimize(scad, b0, method='L-BFGS-B', jac=dscad, options={'gtol': 1e-8, 'maxiter': 50000, 'maxls': 25, 'disp': True})
output.x
Our Min and Max Output:
min(output.x)
max(output.x)
Our y Predictions:
yhat_test_scad = scale.transform(X_test).dot(output.x)  #scale the test features with the same fitted scaler
yhat_test_scad
ylist = []
for pred in yhat_test_scad:
    if pred < 0.5:
        ylist.append(1)
    elif pred < 4.5:
        ylist.append(round(pred))
    else:
        ylist.append(4)
y_hat_test_rounded = np.array(ylist)
y_hat_test_rounded
And finally…our MAE
mae = mean_absolute_error(y_test, y_hat_test_rounded)
print("MAE = {:,.2f}".format(mae))
As we can see, the MAE was hardly reduced and still remains above the majority of the other models. Let us examine whether it can be improved at other values of a:
listofoutput = []
lam = 0.5
for i in range(200):
    a = i
    output = minimize(scad, b0, method='L-BFGS-B', jac=dscad, options={'gtol': 1e-8, 'maxiter': 50000, 'maxls': 25, 'disp': True})
    yhat_test_scad = scale.transform(X_test).dot(output.x)
    ylist = []
    for pred in yhat_test_scad:
        if pred < 0.5:
            ylist.append(1)
        elif pred < 4.5:
            ylist.append(round(pred))
        else:
            ylist.append(4)
    y_hat_test_rounded = np.array(ylist)
    listofoutput.append(mean_absolute_error(y_test, y_hat_test_rounded))
plt.scatter(range(200), listofoutput)
min(listofoutput)
Our minimum MAE is close to our unstandardized SCAD MAE.
Comparison of Regularization Techniques
After trying four different regularization techniques in both standardized and unstandardized versions, we can conclude that Standardized Ridge Regularization reduced our MAE the most (0.15). Looking further at the results, we can note that Lasso Regularization shrank the model the most in terms of the size of the weights (~0.3 between the max and min coefficient). It is no surprise that ElasticNet (~0.6) comes in a close second in this respect, as it is partially built from the Lasso penalty. This confirms that Lasso Regularization will throw out the terms it does not need and simplify the model aggressively. It also helped that our data did not have strong correlations between its features, which let Lasso decently reduce the MAE; had the correlations been stronger, we might have seen a larger difference between Lasso and Ridge. Because Ridge outperformed Lasso by a little in this exercise, we may assume that Lasso slightly oversimplified the relationships between features. However, with so many features and such a small difference between the two techniques, it may still be better to opt for Lasso for its feature selection capabilities. We can also note that at the selected alpha (a=0.01), Standardized Lasso performed best (0.23), while the other regularization techniques were scattered. SCAD worked best at greater values of its tuning parameter: although at a=0.01 it had quite a large MAE, its minimum MAE was in range. Although most of the regularization techniques lowered the MAE into the 0.2s, standardized square root lasso did not improve at all.
Significant Earthquakes, 1965-2016
The same regularization analysis can be applied to other datasets. This dataset, found on Kaggle, describes major earthquakes of magnitude greater than 5.5 from 1965 to 2016. We will attempt to predict earthquake magnitude from factors such as latitude, longitude, and depth.
Data Preprocessing
#import libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

df = pd.read_csv('/content/drive/MyDrive/database.csv')
df2 = df.copy()

#Drop identifier and non-numeric columns
df2 = df2.drop(['Status', 'Date', 'Time', 'Type', 'ID', 'Source', 'Location Source',
                'Magnitude Source', 'Magnitude Type'], axis=1)
df2 = df2.apply(pd.to_numeric)

#Fill the NaN values in each remaining sparse column with that column's mean
for col in ['Depth Error', 'Depth Seismic Stations', 'Magnitude Error',
            'Magnitude Seismic Stations', 'Azimuthal Gap',
            'Horizontal Distance', 'Horizontal Error']:
    df2[col].fillna(df2[col].mean(), inplace=True)

df2 = df2.drop('Root Mean Square', axis=1)
Unlike the IDB dataset, the NaN values in this dataset were filled in with the mean of each column. This may have skewed our data, but removing the NaN values would have left very few viable rows.
Visualizations
#Define X and y
X=df2.drop('Magnitude', axis=1)
y=df2['Magnitude']
X.describe()
y.describe()
Feature Breakdown
Target Breakdown
From the breakdowns of our features and target, we can see that standardization is most likely required, as each feature is on its own scale.
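One way to keep the scaling honest (fit on the training split only, reuse its statistics on the test split) is scikit-learn's Pipeline; a sketch, assuming the train/test split used throughout and Ridge as an example estimator:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

#The pipeline fits the scaler on the training data only,
#then reuses those statistics when transforming at predict time
pipe = make_pipeline(StandardScaler(), Ridge(alpha=0.01))
pipe.fit(X_train, y_train)
yhat = pipe.predict(X_test)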
Magnitude vs Depth
import seaborn as sns
plt.figure(figsize=(10,10))
sns.violinplot(x=df2['Depth'], y=df2['Magnitude'])
Although plotting all of the data would render a variety of plots, the most interesting relationship is between the recorded depth of the earthquake and its magnitude. This feature will most likely have the most impact: from previous research, we know that earthquakes that occur at greater depths tend to cause less shaking at the surface.
Data Correlations
corr_matrix = np.corrcoef(X, rowvar=False)  #rowvar=False treats each column as a variable
corr_matrix
import seaborn as sns
plt.figure(figsize=(20,20))
sns.heatmap(X.corr(), vmax=1, vmin=-1, cmap="spring", annot=True,fmt='.2f')
plt.show()
The Correlation Matrix shows that the majority of the features are not correlated very much at all, with most values being close to zero. This means that regularization may not be as necessary for this particular dataset.
Sample Data t-test
from scipy import stats

def tstat_corr(r, n):
    t = r*np.sqrt(n-2)/np.sqrt(1-r**2)
    pval = stats.t.sf(np.abs(t), n-2)*2  #two-sided p-value with n-2 degrees of freedom
    print('t-statistic = %6.3f pvalue = %6.4f' % (t, pval))

tstat_corr(0.02, len(X))  #put in an r value from the matrix
With a high t-statistic and a low p-value, the correlation is statistically distinguishable from zero, but at r = 0.02 the relationship is still very weak, supporting low correlation between the features.
Linear Regression
#APPLY OLS
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lm = LinearRegression()
#fit the variables to model
lm.fit(X_train, y_train)
#predict output by inputting test variable
yhat_lm=lm.predict(X_test)
yhat_lm
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lm.intercept_))
mae = mean_absolute_error(y_test, yhat_lm)
print("MAE = {:,.2f}".format(mae))
print('coefficients:')
lm.coef_
#max coefficient (range)
max(lm.coef_)
#min coefficient
min(lm.coef_)
Predictions
MAE and Intercept
Coefficients
Max Coefficient:
Min Coefficient:
Ridge Regularization
#Apply Ridge
from sklearn.linear_model import Ridge
lr = Ridge(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lr.fit(X_train,y_train)
yhat_lr=lr.predict(X_test)
yhat_lr
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lr.intercept_))
mae = mean_absolute_error(y_test, yhat_lr)
print("MAE = {:,.2f}".format(mae))
print('Coefficients:')
lr.coef_
max(lr.coef_)
min(lr.coef_)
maeLR = []
for i in range(1000):
    lr = Ridge(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    lr.fit(X_train, y_train)
    yhat_lr = lr.predict(X_test)
    maeLR.append(mean_absolute_error(y_test, yhat_lr))
plt.scatter(range(1000), maeLR)
Predictions
Intercept and MAE
Coefficients
Max Coefficient:
Min Coefficient:
MAE at other Alpha Values
Standardized Ridge
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
lr = Ridge(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lr.fit(ss.fit_transform(X_train), y_train)
#Scale the test set with the training statistics
scaled_yhat_lr = lr.predict(ss.transform(X_test))

from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lr.intercept_))
mae = mean_absolute_error(y_test, scaled_yhat_lr)
print("MAE = {:,.2f}".format(mae))

print('Coefficients:')
lr.coef_

max(lr.coef_)
min(lr.coef_)

maeSLR = []
for i in range(200):
    lr = Ridge(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    lr.fit(ss.fit_transform(X_train), y_train)
    scaled_yhat_lr = lr.predict(ss.transform(X_test))
    maeSLR.append(mean_absolute_error(y_test, scaled_yhat_lr))
plt.scatter(range(200), maeSLR)
Predictions
Intercept and MAE
Coefficients
Min Coefficient:
Max Coefficient:
MAE at other Alpha Values
Lasso Regularization
from sklearn.linear_model import Lasso
ls = Lasso(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
ls.fit(X_train,y_train)
yhat_ls=ls.predict(X_test)
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(ls.intercept_))
mae = mean_absolute_error(y_test, yhat_ls)
print("MAE = {:,.2f}".format(mae))
print('Coefficients')
ls.coef_
min(ls.coef_)
max(ls.coef_)
#get features that work (nonzero coefficients)
nonzero_coefs = []
counted_list = []
for count, a in enumerate(ls.coef_):
    if a != 0:
        nonzero_coefs.append(a)
        counted_list.append(count)
print(nonzero_coefs)
print(counted_list)

#Features
for c in counted_list:
    print(X.columns[c])
maeLS = []
for i in range(100):
    ls = Lasso(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    ls.fit(X_train, y_train)
    yhat_ls = ls.predict(X_test)
    maeLS.append(mean_absolute_error(y_test, yhat_ls))
plt.scatter(range(100), maeLS)
Predictions
Intercept and MAE
Coefficients
Max Coefficients:
Min Coefficients:
Nonzero Coefficient Features
MAE at other Alpha Values
Standardized Lasso
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ls = Lasso(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
ls.fit(ss.fit_transform(X_train), y_train)
yhat_ls = ls.predict(ss.transform(X_test))
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(ls.intercept_))
mae = mean_absolute_error(y_test, yhat_ls)
print("MAE = {:,.2f}".format(mae))
print('Coefficients')
ls.coef_
min(ls.coef_)
max(ls.coef_)
#get features that work (nonzero coefficients)
nonzero_coefs = []
counted_list = []
for count, a in enumerate(ls.coef_):
    if a != 0:
        nonzero_coefs.append(a)
        counted_list.append(count)
print(nonzero_coefs)
print(counted_list)

#Features
for c in counted_list:
    print(X.columns[c])
maeSLS = []
for i in range(100):
    ss = StandardScaler()
    ls = Lasso(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    ls.fit(ss.fit_transform(X_train), y_train)
    yhat_ls = ls.predict(ss.transform(X_test))
    maeSLS.append(mean_absolute_error(y_test, yhat_ls))
plt.scatter(range(100), maeSLS)
Predictions
Intercept and MAE
Coefficients
Min Coefficient:
Max Coefficient:
NonZero Coefficient Features
MAE at other Alpha Values
Elastic Net Regularization
from sklearn.linear_model import ElasticNet
lel = ElasticNet(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lel.fit(X_train,y_train)
yhat_lel=lel.predict(X_test)
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lel.intercept_))
mae = mean_absolute_error(y_test, yhat_lel)
print("MAE = {:,.2f}".format(mae))
print('Coefficients:')
lel.coef_
min(lel.coef_)
max(lel.coef_)
nonzero_coefs = []
counted_list = []
for count, a in enumerate(lel.coef_):
    if a != 0:
        nonzero_coefs.append(a)
        counted_list.append(count)
print(nonzero_coefs)
print(counted_list)

#Features
for c in counted_list:
    print(X.columns[c])
maeEN = []
for i in range(200):
    lel = ElasticNet(alpha=i)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    lel.fit(X_train, y_train)
    yhat_lel = lel.predict(X_test)
    maeEN.append(mean_absolute_error(y_test, yhat_lel))
plt.scatter(range(200), maeEN)
Predictions
Intercept and MAE
Coefficients
Min Coefficient:
Max Coefficients:
Nonzero Coefficient Features
MAE at other Alpha Values
Standardized Elastic Net
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
lel = ElasticNet(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lel.fit(ss.fit_transform(X_train), y_train)
yhat_lel = lel.predict(ss.transform(X_test))
from sklearn.metrics import mean_absolute_error
print("Intercept: {:,.3f}".format(lel.intercept_))
mae = mean_absolute_error(y_test, yhat_lel)
print("MAE = {:,.2f}".format(mae))
print('Coefficients:')
lel.coef_
max(lel.coef_)
min(lel.coef_)
nonzero_coefs = []
counted_list = []
for count, a in enumerate(lel.coef_):
    if a != 0:
        nonzero_coefs.append(a)
        counted_list.append(count)
print(nonzero_coefs)
print(counted_list)

#Features
for c in counted_list:
    print(X.columns[c])
maeSEN = []
for i in range(200):
    ss = StandardScaler()
    lel = ElasticNet(alpha=i)  #vary alpha with the loop index
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
    lel.fit(ss.fit_transform(X_train), y_train)
    yhat_lel = lel.predict(ss.transform(X_test))
    maeSEN.append(mean_absolute_error(y_test, yhat_lel))
plt.scatter(range(200), maeSEN)
Predictions
MAE and Intercept
Coefficients
Min Coefficient:
Max Coefficient:
NonZero Coefficient Features
MAE at other Alpha Values
Square Root Lasso
! pip install --upgrade Cython
! pip install --upgrade git+https://github.com/statsmodels/statsmodels
import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split as tts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
model = sm.OLS(y_train,X_train)
result = model.fit_regularized(method='sqrt_lasso', alpha=0.01)
yhat_test = result.predict(X_test)
MAE(y_test,yhat_test)
result.params
min(result.params)
max(result.params)
maeSL = []
for i in range(200):
    model = sm.OLS(y_train, X_train)
    result = model.fit_regularized(method='sqrt_lasso', alpha=i)
    yhat_test = result.predict(X_test)
    maeSL.append(MAE(y_test, yhat_test))
plt.scatter(range(200), maeSL)
Predictions
MAE
Coefficients
Max Coefficient:
Min Coefficient:
MAE at other Alpha Values
Scaled Square Root Lasso
scale = StandardScaler()
Xs_train = scale.fit_transform(X_train)
Xs_test = scale.transform(X_test)
model = sm.OLS(y_train,Xs_train)
result = model.fit_regularized(method='sqrt_lasso', alpha=0.5)
yhat_test = result.predict(Xs_test)
MAE(y_test, yhat_test)
result.params
max(result.params)
min(result.params)
maeSSL = []
for i in range(200):
    model = sm.OLS(y_train, Xs_train)
    result = model.fit_regularized(method='sqrt_lasso', alpha=i)  #vary alpha with the loop index
    yhat_test = result.predict(Xs_test)
    maeSSL.append(MAE(y_test, yhat_test))
plt.scatter(range(200), maeSSL)
Predictions
MAE
Coefficients
Max Coefficient:
Min Coefficient:
MAE at other Alpha Values
SCAD
y = y.values.reshape(-1,1)
X = X.values

from scipy.optimize import minimize

def scad_penalty(beta_hat, lambda_val, a_val):
    #Masks for the three pieces of the SCAD penalty
    is_linear = (np.abs(beta_hat) <= lambda_val)
    is_quadratic = np.logical_and(lambda_val < np.abs(beta_hat), np.abs(beta_hat) <= a_val * lambda_val)
    is_constant = (a_val * lambda_val) < np.abs(beta_hat)
    linear_part = lambda_val * np.abs(beta_hat) * is_linear
    quadratic_part = (2 * a_val * lambda_val * np.abs(beta_hat) - beta_hat**2 - lambda_val**2) / (2 * (a_val - 1)) * is_quadratic
    constant_part = (lambda_val**2 * (a_val + 1)) / 2 * is_constant
    return linear_part + quadratic_part + constant_part

def scad_derivative(beta_hat, lambda_val, a_val):
    return lambda_val * ((beta_hat <= lambda_val) + (a_val * lambda_val - beta_hat)*((a_val * lambda_val - beta_hat) > 0) / ((a_val - 1) * lambda_val) * (beta_hat > lambda_val))

def scad(beta):
    #Penalized mean squared error to minimize
    beta = beta.flatten()
    beta = beta.reshape(-1,1)
    n = len(y)
    return 1/n*np.sum((y-X.dot(beta))**2) + np.sum(scad_penalty(beta,lam,a))

def dscad(beta):
    #Gradient of the penalized objective
    beta = beta.flatten()
    beta = beta.reshape(-1,1)
    n = len(y)
    return np.array(-2/n*np.transpose(X).dot(y-X.dot(beta))+scad_derivative(beta,lam,a)).flatten()

p = X.shape[1]
b0 = np.random.normal(1,1,p)
lam = 0.5
a = 0.01  #note: SCAD's a is conventionally greater than 2 (Fan and Li suggest 3.7)
output = minimize(scad, b0, method='L-BFGS-B', jac=dscad, options={'gtol': 1e-8, 'maxiter': 50000, 'maxls': 25, 'disp': True})
yhat_test_scad = X_test.dot(output.x)
output.x
min(output.x)
max(output.x)
yhat_test_scad = X_test.dot(output.x)
yhat_test_scad
mae = mean_absolute_error(y_test, yhat_test_scad)
print("MAE = {:,.2f}".format(mae))
listofoutput = []
lam = 0.5
for i in range(200):
    a = i
    output = minimize(scad, b0, method='L-BFGS-B', jac=dscad, options={'gtol': 1e-8, 'maxiter': 50000, 'maxls': 25, 'disp': True})
    yhat_test_scad = X_test.dot(output.x)
    listofoutput.append(mean_absolute_error(y_test, yhat_test_scad))
plt.scatter(range(200), listofoutput)
min(listofoutput)
Predictions
MAE
Coefficients
Max Coefficients:
Min Coefficients:
MAE at other Alpha Values
Scaled SCAD
#y and X were already converted to numpy arrays above
Xs = scale.fit_transform(X)

def scad_penalty(beta_hat, lambda_val, a_val):
    is_linear = (np.abs(beta_hat) <= lambda_val)
    is_quadratic = np.logical_and(lambda_val < np.abs(beta_hat), np.abs(beta_hat) <= a_val * lambda_val)
    is_constant = (a_val * lambda_val) < np.abs(beta_hat)
    linear_part = lambda_val * np.abs(beta_hat) * is_linear
    quadratic_part = (2 * a_val * lambda_val * np.abs(beta_hat) - beta_hat**2 - lambda_val**2) / (2 * (a_val - 1)) * is_quadratic
    constant_part = (lambda_val**2 * (a_val + 1)) / 2 * is_constant
    return linear_part + quadratic_part + constant_part

def scad_derivative(beta_hat, lambda_val, a_val):
    return lambda_val * ((beta_hat <= lambda_val) + (a_val * lambda_val - beta_hat)*((a_val * lambda_val - beta_hat) > 0) / ((a_val - 1) * lambda_val) * (beta_hat > lambda_val))

def scad(beta):
    #Fit on the standardized features this time
    beta = beta.flatten()
    beta = beta.reshape(-1,1)
    n = len(y)
    return 1/n*np.sum((y-Xs.dot(beta))**2) + np.sum(scad_penalty(beta,lam,a))

def dscad(beta):
    beta = beta.flatten()
    beta = beta.reshape(-1,1)
    n = len(y)
    return np.array(-2/n*np.transpose(Xs).dot(y-Xs.dot(beta))+scad_derivative(beta,lam,a)).flatten()

p = Xs.shape[1]
b0 = np.random.normal(1,1,p)
lam = 1
a = 0.01
output = minimize(scad, b0, method='L-BFGS-B', jac=dscad, options={'gtol': 1e-8, 'maxiter': 50000, 'maxls': 25, 'disp': True})
output.x
min(output.x)
max(output.x)
yhat_test_scad = scale.transform(X_test).dot(output.x)  #scale the test features with the same fitted scaler
yhat_test_scad
mae = mean_absolute_error(y_test, yhat_test_scad)
print("MAE = {:,.2f}".format(mae))
listofoutput = []
lam = 0.5
for i in range(200):
    a = i
    output = minimize(scad, b0, method='L-BFGS-B', jac=dscad, options={'gtol': 1e-8, 'maxiter': 50000, 'maxls': 25, 'disp': True})
    yhat_test_scad = scale.transform(X_test).dot(output.x)
    listofoutput.append(mean_absolute_error(y_test, yhat_test_scad))
plt.scatter(range(200), listofoutput)
min(listofoutput)
Predictions
MAE
Coefficients
Max Coefficients:
Min Coefficients:
MAE at other Alpha Values
Discussion Pertaining to the Regularization Methods
Looking at the results, we note that none of the regularization techniques improved the MAE. We can attribute this to the lack of correlation in the data. Looking further at the results, we also note that this dataset may not be adequate for prediction, as the features seem to have little effect on one another. We can further support this conclusion by looking at the Lasso and Elastic Net results, where we lost anywhere between one and five features with little to no effect on the MAE. In Standardized Lasso Regularization, we lost all the features that I hypothesized as having the most effect: location (latitude/longitude) and depth. Thus, this particular dataset does not support the idea that the magnitude of an earthquake can be predicted from latitude and depth alone. It would be interesting to add other features, such as weather and volcanic activity in the region.
We can see that regularization only worsened our results, particularly with SCAD and Square Root Lasso. It is interesting that when alpha was changed, most of the regularization methods either did not change the MAE or worsened it as alpha grew (standardized ridge, square root lasso, scaled SCAD).