Costa Rican Household Poverty Level Prediction- IDB

Introducing the Data

This dataset was scavenged from an old competition on Kaggle asking users to classify household poverty levels on a scale of 1 to 4 based on demographic data. The dataset originally had 142 features with 9557 data entries, but was reduced during pre-processing and clean up phases. The dataset’s features before preprocessing were as follows:

Preprocessing

9557 data entries were reduced to 156 entries due to the amount of values that were missing from the set. Although NaN values can be either estimated with KNN algorithms or inputed with designated values, an enormous data set was not necessary for the primary purpose of this project. Four columns were also dropped due to them being object type variables. Rather than deleting these columns, we could have assigned a representative number (for example, Yes=0 No=1). The final preprocessed dataset had 156 rows and 138 columns

#Import Libraries
import numpy as np
import pandas as pd

#Create Dataset
df = pd.read_csv('/content/drive/MyDrive/costa-rican-household-poverty-prediction/train.csv')

#Drop rows with NaN values
df2=df
df2=df2.dropna()

#Exclude type: object
df2 = df2.select_dtypes(exclude=['object'])

#Convert all objects to Numeric
df2=df2.apply(pd.to_numeric)

#Define X variable/features
X=df2.drop('Target', axis=1)

#Define Y variable/target
y=df2['Target']

We can also see the data distribution in our target:

y.value_counts()

We can see that the dataset hardly has data recorded for level 1 and 2 poverty levels, thus making it more likely that our model will not accurately predict these levels of poverty. It is likely that there were many unknown values in these categories that were deleted in our preprocessing phase. Thus, we should reconsider how we process our NaN values in the future.

Correlation

In order to later understand regularization techniques, the correlation of the features must be analyzed.

corr_matrix=np.corrcoef(X)
corr_matrix

import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(200,200))
sns.heatmap(X.corr(), vmax=1, vmin=-1, cmap="spring", annot=True,fmt='.2f')
plt.show()

Although it is somewhat hard to tell about the nature of specific correlations due to the anount of features in the dataset, we can see that the majority of the data has a mild correlation relationship (dominance of peachy coral pink color).

In order to further verify that the data is weakly correlated, we can apply a one sample t-test with a two sided pvalue.

from scipy import stats
def tstat_corr(r,n):
  t = r*np.sqrt(n-2)/np.sqrt(1-r**2)
  pval = stats.t.sf(np.abs(t), n-1)*2  # two-sided pvalue = Prob(abs(t)>tt)
  print('t-statistic = %6.3f pvalue = %6.4f' % (t, pval))
  
 tstat_corr(0.02,len(X)) #put in an r value from matrix

One of the lowest correlation values in the heatmap was chosen as the r value. We are left with the following results:

The t-value measures the size of the difference relative to the variation in the data. The greater t is, the greater the chance is that there is a significant difference between the data. Our p value is extremly high at about 0.8, thus telling us that the data is most likely weakly correlated.

Regularization and Feature Selection

Finding the Weights Mathematically without Regularization

Let us assume we have a model Y with equation Y=F(X1, X2,…,Xp). In weighted linear regression, this equation is multiplied by certain weights which define the sparsity pattern of the data. An error term is also added into the equation:

Our goal is to minimize the squared error by obtaining weights that will do so. Our error term will be found by subtracting the predicted valaues from the observed values and then squared.

The sum of squared errors can also be expressed in matrix form, since our data sets tend to have hundreds/thousands of items. Thus, in matrix notation, the sum of squared errors is the same as transposing two error vectors.

Putting in the equatoin for error we have:

Then, the partial derivatives of the weights are taken:

And the equation is set to zero to obtain the global minimum:

Thus, this becomes are solution to the linear regression.

Linear Regression Application of the IDB Dataset

Note: I was so excited about finding this dataset that I did not realize that it addressed a classification problem rather than a linear regression. It was not until I finished the majority of the coding that I realized that the linear models predicts on a continuous scale rather than a discrete scale. Thus, I was seeing predicted values like 1.6 rather than 2. In order to try to make up for this error, I decided to round the values to their integers. This led to a lot more error/less accuracy than would have occurred with a different X and y. I have decided to keep the error in since this is the second time I have made this error in a Machine Learning Project, so I would like to keep this as a reminder to understand the dataset before running into the application phase. An SVM would better work for classification problems.

Before applying any regularization, let us examine a linear regression model on the data:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)

lm = LinearRegression()

#fit the variables to model
lm.fit(X_train, y_train)

#predict output by inputting test variable
yhat_lm=lm.predict(X_test)

#Turn continuous variables into discrete
ylist=[]
for output in yhat_lm:
  if output<4.5:
    output=round(output)
    ylist.append(output)
  else:
    output=4
    ylist.append(output)
    
#round
y_hat_lm_rounded=np.array(ylist)

#MAE 
from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lm.intercept_))
    
mae = mean_absolute_error(y_test, y_hat_lm_rounded)
print("MAE = {:,.2f}".format(mae))

Predicted Values:

lm.coef_

#max coefficient (range)
max(lm.coef_)

#min coefficient
min(lm.coef_)

Max:

Min:

We can see from the data that the our coefficients have a rather wide range and our MAE is high. Let us see what happens when we apply regularization to our model.

Regularization Embedded Models

When we are working with large datasets such as this one, too many features may create bias and variance in our results. Bias is defined as the inability for the model to capture the true relationship of the data and while variance is defined as the difference in fit between the training and testing sets. For example, a model has high variance if the model predicts the training set very accurately but fails to do so in the testing set (overfitting). In order to reduce bias and variance, feature selection, regularization, boosting, and bagging techniques can be applied.

Feature selection is defined as the selection of features that best contribute to the accuracy of the model. Regularization will add constraints that lower the size of the coefficients in order to make the model less complex and avoid it from fitting variables that are not as important. This will penalize the loss function by adding a regularization term and minimize the sum of square residuals with constraints on the weight vector.

Lasso Regression/ L1 Regularization

Lasso Regression will produce a model with high accuracy and a subset of the original features. Lasso regression puts in a constraint where the sum of absolute values of coefficients is less than a fixed value. Thus, it will lower the size of the coefficients and lead some features to have a coefficient of 0, essentially dropping it from the model.

Looking back at our model without regularization, we saw that our coefficients were found with the formula:

L1 regularization’s loss function is defined by:

otherwise known as:

By adding in an extra cost term, the weights will be penalized. Lasso Regression will find the closed form solution to this equation in order to derive the weights.

This type of regularization should not be applied to a dataset with a low number of features as it will possibly drop features that are essential to the model. Lasso regularization also does not work well with features that are highly correlated, as it may drop correlated groups. The solution will be sparse as a large proportion of features will have a weight of zero.

Lasso Regression Application

We can first fit our data to a lasso regression:

from sklearn.linear_model import Lasso
ls = Lasso(alpha=0.01)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
ls.fit(X_train,y_train)
yhat_ls=ls.predict(X_test)

#Turn continuous variables into discrete
ylist=[]
for output in yhat_ls:
  if output<4.5:
    output=round(output)
    ylist.append(output)
  else:
    output=4
    ylist.append(output)

y_hat_ls_rounded=np.array(ylist)

Our predicted values are:

Let us find the mean absolute error to compare the model on the training and testing set:

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(ls.intercept_))
    
mae = mean_absolute_error(y_test, y_hat_ls_rounded)
print("MAE = {:,.2f}".format(mae))

The weighted coefficients are:

print('Coefficients')
ls.coef_

Let us examine the range of the coefficients:

min(ls.coef_)

max(ls.coef_)

We can see that the range of the coefficients has decreased greatly compared to the non regularized model.

Since Lasso Regression eliminates certain features, we can derive the ones that are left:

#get features that work
list=[]
count=-1
counted_list=[]
A=ls.coef_
for a in A:
  count=count+1
  if a!=0:
    list.append(a)
    counted_list.append(count)

#Features
for c in counted_list:
  print(X.columns[c])

It can be noticed that we chose an arbitrary penalty term (alpha value) in this example. In order to choose alpha, we must use cross validation. The alpha value that gives us the least variance is the optimal value. We can thus test other alpha values that may make our predictions better:

#test other alphas
import matplotlib.pyplot as plt
maeLS=[]
for i in range(100):
  ls = Lasso(alpha=i)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  ls.fit(X_train,y_train)
  yhat_ls=ls.predict(X_test)
  #Turn continuous variables into discrete
  ylist=[]
  for output in yhat_ls:
    if output<4.5:
      output=round(output)
      ylist.append(output)
    else:
      output=4
      ylist.append(output)

  y_hat_ls_rounded=np.array(ylist)

  maeLS.append(mean_absolute_error(y_test, y_hat_ls_rounded))
  plt.scatter(range(100),maeLS)

Standardized Lasso Regression Application

Looking at our data, we can observe that the values are not nearly in the same ranges. Thus, we can standardize our data to attempt to make our model better. We apply the same steps while standardizing our features:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ls = Lasso(alpha=0.01)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
ls.fit(ss.fit_transform(X_train),y_train)
yhat_ls=ls.predict(ss.fit_transform(X_test))
#Turn continuous variables into discrete
ylist=[]
for output in yhat_ls:
  if output<4.5:
    output=round(output)
    ylist.append(output)
  else:
    output=4
    ylist.append(output)

y_hat_ls_rounded=np.array(ylist)
y_hat_ls_rounded

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(ls.intercept_))
    
mae = mean_absolute_error(y_test, y_hat_ls_rounded)
print("MAE = {:,.2f}".format(mae))

We can see that standardizing our data improves our MAE greatly at the same alpha (from 0.32 to 0.23).

print('Coefficients')
ls.coef_

Let us look at the range of our coefficients:

min(ls.coef_)

max(ls.coef_)

Thus, we can see that the standardized Lasso Regression decreased the range of weights in comparison to the non standardized Lasso Regression.

#get features that work
list=[]
count=-1
counted_list=[]
A=ls.coef_
for a in A:
  count=count+1
  if a!=0:
    list.append(a)
    counted_list.append(count)

#Features
for c in counted_list:
  print(X.columns[c])

#test other alphas
import matplotlib.pyplot as plt
maeSLS=[]
for i in range(100):
  ss = StandardScaler()
  ls = Lasso(alpha=i)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  ls.fit(ss.fit_transform(X_train),y_train)
  yhat_ls=ls.predict(ss.fit_transform(X_test))
  #Turn continuous variables into discrete
  ylist=[]
  for output in yhat_ls:
    if output<4.5:
      output=round(output)
      ylist.append(output)
    else:
      output=4
      ylist.append(output)

  y_hat_ls_rounded=np.array(ylist)

  maeSLS.append(mean_absolute_error(y_test, y_hat_ls_rounded))
  plt.scatter(range(100),maeSLS)

We see a similar trend in the standardized and non-standardized forms of Lasso regression; the MAE plateaus below 0.22 as alpha goes to infinity.

Ridge Regression/L2 Regularization

Ridge Regression shares many conceptual similarities with Lasso Regression; it also adds on a penalty to the loss function. The regularization term is the sum of squares of all the feature weights. Unlike Lasso Regression, this type of regression will make the weights smaller but never zero. Ridge regession is not good for data with a lot of outliers, as it blows up the error differences of the outliers and the regularization term tries to fix it by penalizing the weights. Ridge regression is also better when all the features influence the output and all the weights are roughly the same size. This regularization technique does not offer feature selection and has a non sparse solution. It should be noted that ridge regression can hel solve models in which there are less data points than parameters. Ridge regression will penalize large weight coefficients more than the smaller ones as opposed to Lasso regression which penalizes each coefficient uniformly.

Ridge regression’s loss function is defined as:

One can solve for the weighted term of ridge regression by finding the closed form solution using the derivative. We end up with the form:

Ridge Regression Application

Ridge Regression has a similar application code wise as Lasso Regression:

#Apply Ridge
from sklearn.linear_model import Ridge

lr = Ridge(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lr.fit(X_train,y_train)
yhat_lr=lr.predict(X_test)
#Turn continuous variables into discrete
ylist=[]
for output in yhat_lr:
  if output<4.5:
    output=round(output)
    ylist.append(output)
  else:
    output=4
    ylist.append(output)

y_hat_lr_rounded=np.array(ylist)

We are able to calculate the MAE on unstandardized data:

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lr.intercept_))
    
mae = mean_absolute_error(y_test, y_hat_lr_rounded)
print("MAE = {:,.2f}".format(mae))

We see that unstandardized Ridge and Lasso regularization produce the same MAE for this dataset.

print('Coefficients:')
lr.coef_

Our range of weights decreases from the non-regularized linear regression coefficients, but does not decrease as much as Lasso Regression:

max(lr.coef_)
min(lr.coef_)

Max:

Min:

We once again test for optimal alpha values:

import matplotlib.pyplot as plt
maeLR=[]
for i in range(1000):
  lr = Ridge(alpha=i)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  lr.fit(X_train,y_train)
  yhat_lr=lr.predict(X_test)
  #Turn continuous variables into discrete
  ylist=[]
  for output in yhat_lr:
    if output<4.5:
      output=round(output)
      ylist.append(output)
    else:
      output=4
      ylist.append(output)

  y_hat_lr_rounded=np.array(ylist)
  
  maeLR.append(mean_absolute_error(y_test, y_hat_lr_rounded))
  plt.scatter(range(1000),maeLR)

It can be noted that as alpha gets bigger, y is less sensitive to the features. The optimal alpha value drops below an MAE of 0.20 when alpha is about 250. This so far has presented the best model.

Standardized Ridge Regression Application

We repeat the process with standardization:

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

lr = Ridge(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lr.fit(ss.fit_transform(X_train),y_train)
scaled_yhat_lr=lr.predict(ss.fit_transform(X_test))
#Turn continuous variables into discrete
ylist=[]
for output in scaled_yhat_lr:
  if output<4.5:
    output=round(output)
    ylist.append(output)
  else:
    output=4
    ylist.append(output)

y_hat_lr_rounded=np.array(ylist)

The MAE has increased above all models including the non-regularized model, thus showing that this may not be an adequate structure for our data.

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lr.intercept_))
    
mae = mean_absolute_error(y_test, y_hat_lr_rounded)
print("MAE = {:,.2f}".format(mae))

print('Coefficients:')
lr.coef_

Although our MAE has increased, the range of values of weight coefficients is still lower than the original model.

max(lr.coef_)
min(lr.coef_)

Max:

Min:

Trying different penalty values we get:

import matplotlib.pyplot as plt
maeSLR=[]
for i in range(200):
  lr = Ridge(alpha=i)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  lr.fit(ss.fit_transform(X_train),y_train)
  scaled_yhat_lr=lr.predict(ss.fit_transform(X_test))
  #Turn continuous variables into discrete
  ylist=[]
  for output in scaled_yhat_lr:
    if output<4.5:
      output=round(output)
      ylist.append(output)
    else:
      output=4
      ylist.append(output)

  y_hat_lr_rounded=np.array(ylist)
  
  maeSLR.append( mean_absolute_error(y_test, y_hat_lr_rounded))
  plt.scatter(range(200),maeSLR)

Note that at an alpha value of about 90-150, the MAE decreases to 0.15. This so far has been the lowest recorded MAE, proving that this may indeed be a good model for our data.

Ridge vs Lasso

Let us more clearly compare the two regularization techniques:

ElasticNet

ElasticNet regression combines L1 and L2 regularization:

This particular model combines feature elimination with coefficient reduction to create a better model; ElasticNet will also do well with highly correlated data. When variables are correlated, Ridge regression will shrink the two coefficients towards each other while Lasso will overlook one variable and drop it. ElasticNet will avoid such problems by combining both models.

from sklearn.linear_model import  ElasticNet
lel = ElasticNet(alpha=0.01)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lel.fit(X_train,y_train)
yhat_lel=lel.predict(X_test)
#Turn continuous variables into discrete
ylist=[]
for output in yhat_lel:
  if output<4.5:
    output=round(output)
    ylist.append(output)
  else:
    output=4
    ylist.append(output)

y_hat_lel_rounded=np.array(ylist)

Our MAE becomes:

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lel.intercept_))
    
mae = mean_absolute_error(y_test, y_hat_lel_rounded)
print("MAE = {:,.2f}".format(mae))

The MAE is about the same as our linear regression model at this alpha. Let us see the range of coefficients:

print('Coefficients:')
lel.coef_

min(lel.coef_)

max(lel.coef_)

Our range has shrunk to a difference of about 0.6. This is a great improvement from the non regularized model. We can also note that several features have coefficients of zero, meaning they have been deemed less necessary by the model.

list=[]
count=-1
counted_list=[]
A=lel.coef_
for a in A:
  count=count+1
  if a!=0:
    list.append(a)
    counted_list.append(count)
#Features
for c in counted_list:
  print(X.columns[c])

Let us examine other alphas to see if there is a better MAE:

import matplotlib.pyplot as plt
maeEN=[]
for i in range(200):
  lel = ElasticNet(alpha=i)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  lel.fit(X_train,y_train)
  yhat_lel=lel.predict(X_test)
  #Turn continuous variables into discrete
  ylist=[]
  for output in yhat_lel:
    if output<4.5:
      output=round(output)
      ylist.append(output)
    else:
      output=4
      ylist.append(output)

  y_hat_lel_rounded=np.array(ylist)
  maeEN.append(mean_absolute_error(y_test, y_hat_lel_rounded))
  plt.scatter(range(200),maeEN)

From the graph we see that we can achieve a lower MAE once we increase the value of alpha. The graph plateaus at an MAE of less than 0.225.

Standardized ElasticNet

Let us examine if standardizing the data can improve our results:

from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
lel = ElasticNet(alpha=0.01)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lel.fit(ss.fit_transform(X_train),y_train)
yhat_lel=lel.predict(ss.fit_transform(X_test))
#Turn continuous variables into discrete
ylist=[]
for output in yhat_lel:
  if output<4.5:
    output=round(output)
    ylist.append(output)
  else:
    output=4
    ylist.append(output)

y_hat_lel_rounded=np.array(ylist)

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lel.intercept_))
    
mae = mean_absolute_error(y_test, y_hat_lel_rounded)
print("MAE = {:,.2f}".format(mae))

Our MAE at the same alpha is much lower than the unstandardized data. Let us examine our range of coefficients:

print('Coefficients:')
lel.coef_

min(lel.coef_)

max(lel.coef_)

Our range of coefficients has shrunk by over 100%, going from a difference of ~0.6 to ~0.3

list=[]
count=-1
counted_list=[]
A=lel.coef_
for a in A:
  count=count+1
  if a!=0:
    list.append(a)
    counted_list.append(count)

#Features
for c in counted_list:
  print(X.columns[c])

Examining other alphas, we note that greater alphas have no effect on the MAE:

import matplotlib.pyplot as plt
maeSEN=[]
for i in range(200):
  ss = StandardScaler()
  lel = ElasticNet(alpha=0.01)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  lel.fit(ss.fit_transform(X_train),y_train)
  yhat_lel=lel.predict(ss.fit_transform(X_test))
  #Turn continuous variables into discrete
  ylist=[]
  for output in yhat_lel:
    if output<4.5:
      output=round(output)
      ylist.append(output)
    else:
      output=4
      ylist.append(output)

  y_hat_lel_rounded=np.array(ylist)
  maeSEN.append(mean_absolute_error(y_test, y_hat_lel_rounded))
  plt.scatter(range(200),maeSEN)

Square Root Lasso/ Scaled Lasso

Square Root Lasso Regularization is a type of lasso that addresses noise in the data. A square root lasso model can fit potentially high-dimensional data. This type of model chooses an equilibrium with a sparse regression method by iteratively estimating the noise level via the mean residual square and the penalty in proportion to noise level.

Application of Square Root Lasso in Python

Let us now try to apply Square Root Lasso in python. We can calculate the MAE using cross-validation:

! pip install --upgrade Cython
! pip install --upgrade git+https://github.com/statsmodels/statsmodels
import statsmodels.api as sm

from sklearn.metrics import mean_absolute_error as MAE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)

model = sm.OLS(y_train,X_train)
result = model.fit_regularized(method='sqrt_lasso', alpha=0.01)
yhat_test = result.predict(X_test)

  #Turn continuous variables into discrete
ylist=[]
for output in yhat_test:
  if output<0.5:
      output=1
      ylist.append(output)
  elif output<4.5:
      output=round(output)
      ylist.append(output)
  else:
      output=4
      ylist.append(output)

y_hat_test_rounded=np.array(ylist)
y_hat_test_rounded

Let us calculate our MAE:

MAE(y_test,y_hat_test_rounded)

Our MAE is 0.34 at alpha 0.01, which is on the higher end of our models. Let us look at the range of the coefficients:

result.params

min(result.params)

max(result.params)

We can see from the results that the range of coefficients has increased; this type of regularization may not be adequate for our model.

Let us examine other alpha values to see if we can improve our results:

maeSL=[]
for i in range(200):
  model = sm.OLS(y_train,X_train)
  result = model.fit_regularized(method='sqrt_lasso', alpha=i)
  yhat_test = result.predict(X_test)
  maeSL.append(MAE(y_test,yhat_test))
 plt.scatter(range(200),maeSL)

Our results show that the MAE is never better than 0.340, and increases as alpha is increased.

Scaled Square Root Lasso

Let us try to improve our model by scaling our data:

scale = StandardScaler()
Xs_train = scale.fit_transform(X_train)
Xs_test  = scale.transform(X_test)
model = sm.OLS(y_train,Xs_train)
result = model.fit_regularized(method='sqrt_lasso', alpha=0.5)
yhat_test = result.predict(Xs_test)
#Turn continuous variables into discrete
ylist=[]
for output in yhat_test:
  if output<0.5:
      output=1
      ylist.append(output)
  elif output<4.5:
      output=round(output)
      ylist.append(output)
  else:
      output=4
      ylist.append(output)

y_hat_test_rounded=np.array(ylist)
y_hat_test_rounded

Our predicted values look very different than our actual values. Let us look at our MAE:

MAE(y_test,y_hat_test_rounded)

Our MAE is now extremely high. Let us look at the range of our coefficients.

result.params

min(result.params)

max(result.params)

Our range of coefficients is not high, but due to our large MAE, this model is not adequate for our data. Let us examine other alpha values:

maeSSL=[]
for i in range(200):
  model = sm.OLS(y_train,Xs_train)
  result = model.fit_regularized(method='sqrt_lasso', alpha=0.5)
  yhat_test = result.predict(Xs_test)
  maeSSL.append(MAE(y_test,yhat_test))

plt.scatter(range(200),maeSSL)

As we can see, increasing or decreasing the alpha values did not have an impact on our MAE.

SCAD

The last feature selection method we have is Smoothly Clipped Absolute Deviations (SCAD).

SCAD suffers from bias but encourages sparsity, thus allowing for larger weights. This type of penalty relaxes the rate of penalization as the absolute value of the weight coefficient increases, unlike Lasso regularization which increases the penalty with respect to the weight.

Application of SCAD to IDB problem

We begin by reshaping the data and defining the SCAD algorithm:

y=y.values.reshape(-1,1)
X=X.values

from scipy.optimize import minimize
def scad_penalty(beta_hat, lambda_val, a_val):
    is_linear = (np.abs(beta_hat) <= lambda_val)
    is_quadratic = np.logical_and(lambda_val < np.abs(beta_hat), np.abs(beta_hat) <= a_val * lambda_val)
    is_constant = (a_val * lambda_val) < np.abs(beta_hat)
    
    linear_part = lambda_val * np.abs(beta_hat) * is_linear
    quadratic_part = (2 * a_val * lambda_val * np.abs(beta_hat) - beta_hat**2 - lambda_val**2) / (2 * (a_val - 1)) * is_quadratic
    constant_part = (lambda_val**2 * (a_val + 1)) / 2 * is_constant
    return linear_part + quadratic_part + constant_part
    
def scad_derivative(beta_hat, lambda_val, a_val):
    return lambda_val * ((beta_hat <= lambda_val) + (a_val * lambda_val - beta_hat)*((a_val * lambda_val - beta_hat) > 0) / ((a_val - 1) * lambda_val) * (beta_hat > lambda_val))
    
def scad(beta):
  beta = beta.flatten()
  beta = beta.reshape(-1,1)
  n = len(y)
  return 1/n*np.sum((y-X.dot(beta))**2) + np.sum(scad_penalty(beta,lam,a))
  
def dscad(beta):
  beta = beta.flatten()
  beta = beta.reshape(-1,1)
  n = len(y)
  return np.array(-2/n*np.transpose(X).dot(y-X.dot(beta))+scad_derivative(beta,lam,a)).flatten()
  
p = X.shape[1]
b0 = np.random.normal(1,1,p)

lam = 0.5
a = 0.001
output = minimize(scad, b0, method='L-BFGS-B', jac=dscad,options={'gtol': 1e-8, 'maxiter': 50000,'maxls': 25,'disp': True})
yhat_test_scad = X_test.dot(output.x)

output.x

min(output.x)

max(output.x)

The range of our output is rather large, with a difference of ~.

yhat_test_scad = X_test.dot(output.x)
yhat_test_scad

Let us find the predicted y values:

ylist=[]
for output in yhat_test_scad:
  if output<0.5:
      output=1
      ylist.append(output)
  elif output<4.5:
      output=round(output)
      ylist.append(output)
  else:
      output=4
      ylist.append(output)

y_hat_test_rounded=np.array(ylist)
y_hat_test_rounded

Let us find the MAE at this beta:

mae = mean_absolute_error(y_test, y_hat_test_rounded)
print("MAE = {:,.2f}".format(mae))

Our MAE is extremely high for our data, as it predicted a lot of 1 values. Let us try to use other alpha values:

listofoutput=[]
lam = 0.5
for i in range(200):
  a = i
  output = minimize(scad, b0, method='L-BFGS-B', jac=dscad,options={'gtol': 1e-8, 'maxiter': 50000,'maxls': 25,'disp': True})
  yhat_test_scad = X_test.dot(output.x)
  ylist=[]
  for output in yhat_test_scad:
    if output<0.5:
        output=1
        ylist.append(output)
    elif output<4.5:
        output=round(output)
        ylist.append(output)
    else:
        output=4
        ylist.append(output)

  y_hat_test_rounded=np.array(ylist)
  listofoutput.append(mean_absolute_error(y_test, y_hat_test_rounded))
  
plt.scatter(range(200),listofoutput)

min(listofoutput)

We can see that the minimum MAE is 0.21 and occurs at some alpha in the range of 0 to 200

Standardized SCAD with the IDB dataset

Let us see if standardizing our dataset can help lower our MAE:

y=y.values.reshape(-1,1)
X=X.values

Xs = scale.fit_transform(X)
def scad_penalty(beta_hat, lambda_val, a_val):
    is_linear = (np.abs(beta_hat) <= lambda_val)
    is_quadratic = np.logical_and(lambda_val < np.abs(beta_hat), np.abs(beta_hat) <= a_val * lambda_val)
    is_constant = (a_val * lambda_val) < np.abs(beta_hat)
    
    linear_part = lambda_val * np.abs(beta_hat) * is_linear
    quadratic_part = (2 * a_val * lambda_val * np.abs(beta_hat) - beta_hat**2 - lambda_val**2) / (2 * (a_val - 1)) * is_quadratic
    constant_part = (lambda_val**2 * (a_val + 1)) / 2 * is_constant
    return linear_part + quadratic_part + constant_part
    
def scad_derivative(beta_hat, lambda_val, a_val):
    return lambda_val * ((beta_hat <= lambda_val) + (a_val * lambda_val - beta_hat)*((a_val * lambda_val - beta_hat) > 0) / ((a_val - 1) * lambda_val) * (beta_hat > lambda_val))
    

def scad(beta):
  beta = beta.flatten()
  beta = beta.reshape(-1,1)
  n = len(y)
  return 1/n*np.sum((y-X.dot(beta))**2) + np.sum(scad_penalty(beta,lam,a))
  
def dscad(beta):
  beta = beta.flatten()
  beta = beta.reshape(-1,1)
  n = len(y)
  return np.array(-2/n*np.transpose(X).dot(y-X.dot(beta))+scad_derivative(beta,lam,a)).flatten()
  
p = Xs.shape[1]
b0 = np.random.normal(1,1,p)

lam = 1
a = 0.01
output = minimize(scad, b0, method='L-BFGS-B', jac=dscad,options={'gtol': 1e-8, 'maxiter': 50000,'maxls': 25,'disp': True})

output.x

Our Min and Max Output:

min(output.x)

max(output.x)

Our y Predictions:

yhat_test_scad = X_test.dot(output.x)
yhat_test_scad

ylist=[]
for output in yhat_test_scad:
  if output<0.5:
      output=1
      ylist.append(output)
  elif output<4.5:
      output=round(output)
      ylist.append(output)
  else:
      output=4
      ylist.append(output)

y_hat_test_rounded=np.array(ylist)
y_hat_test_rounded

And finally…our MAE

mae = mean_absolute_error(y_test, y_hat_test_rounded)
print("MAE = {:,.2f}".format(mae))

As we can see, the MAE was hardly reduced and still remains above the majority of the other models. Let us examine if it can be improved at other alphas:

listofoutput=[]
lam = 0.5
for i in range(200):
  a = i
  output = minimize(scad, b0, method='L-BFGS-B', jac=dscad,options={'gtol': 1e-8, 'maxiter': 50000,'maxls': 25,'disp': True})
  yhat_test_scad = X_test.dot(output.x)
  ylist=[]
  for output in yhat_test_scad:
    if output<0.5:
        output=1
        ylist.append(output)
    elif output<4.5:
        output=round(output)
        ylist.append(output)
    else:
        output=4
        ylist.append(output)

  y_hat_test_rounded=np.array(ylist)
  listofoutput.append(mean_absolute_error(y_test, y_hat_test_rounded))
  
plt.scatter(range(200),listofoutput)

min(listofoutput)

Our minimum MAE is close to our unstandardized SCAD MAE.

Comparison of Regularization Techniques

After trying four different regularization techniques and trying both standardized and unstandardized versions of our models, we can conclude that the Standardized Ridge Regularization reduced our MAE the most (0.15). Looking further at the data table, we can note that Lasso Regularization shrunk the model the most in terms of the size of the weights (~0.3 between the max and min coefficient). It is not a surprise that Elastic Net (~0.6) comes in a close second in respect to this, as it is partially made of Lasso Regularization. This confirms that Lasso Regularization will throw out the terms it does not need and will simplify the model in an extreme way. It can also be confirmed that Lasso Regularization was able to decently reduce the MAE because our data did not have strong correlations between its features. Had the data had stronger correlatations, we may have seen a larger difference between Lasso and Ridge. Because Ridge outperformed Lasso by a little in this exercise, we may assume that Lasso may have slightly oversimplified the relationship between features. However, with so many features and such little difference between Lasso and Ridge Regularization, it may be better to opt for Lasso Regularization due to its feature selection capabilities. We can also note that at the selected alpha (a=0.01), Standardized Lasso performed best(0.23), while the other regularization techniques were scattered. It can also be noted that SCAD worked best at greater alphas, and although at a=0.01 it had a quite large MAE, its minimum MAE was in range. Although all the regularization techniques lowered the MAE into the 0.2s, standardized square root lasso did not improve at all.

Significant Earthquakes, 1965-2016

The same regularization analysis can be tested on different datasets. This dataset describes major earthquakes greater than a magnitude of 5.5 from 1965 to 2016 and can be found on Kaggle. We will attempt to predict earthquake magnitude from factors such as latitude, length, and depth.

Data Preprocessing

#import libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

df = pd.read_csv('/content/drive/MyDrive/database.csv')
df2=df
df2=df2.drop('Status', axis=1)
df2=df2.drop('Date', axis=1)
df2=df2.drop(['Time','Type','ID', 'Source', 'Location Source', 'Magnitude Source'], axis=1)
df2=df2.drop('Magnitude Type', axis=1)
df2=df2.apply(pd.to_numeric)

mean_value1=df2['Depth Error'].mean()
df2['Depth Error'].fillna(mean_value1, inplace=True)
mean_value2=df2['Depth Seismic Stations'].mean()
df2['Depth Seismic Stations'].fillna(mean_value2, inplace=True)
mean_value3=df2['Magnitude Error'].mean()
df2['Magnitude Error'].fillna(mean_value3, inplace=True)
mean_value4=df2['Magnitude Seismic Stations'].mean()
df2['Magnitude Seismic Stations'].fillna(mean_value4, inplace=True)
mean_value5=df2['Azimuthal Gap'].mean()
df2['Azimuthal Gap'].fillna(mean_value5, inplace=True)
mean_value6=df2['Horizontal Distance'].mean()
df2['Horizontal Distance'].fillna(mean_value6, inplace=True)
mean_value7=df2['Horizontal Error'].mean()
df2['Horizontal Error'].fillna(mean_value7, inplace=True)
df2=df2.drop('Root Mean Square', axis=1)

Unlike the IDB dataset, the NaN values in this dataset were filled in with the mean of each column. This may have skewed our data, but removing the NaN values would have left very few viable rows.

Visualizations

#Define X and y
X=df2.drop('Magnitude', axis=1)
y=df2['Magnitude']
X.describe()
y.describe()

Feature Breakdown

Target Breakdown

From the breakdowns of our Feature and Targets, we can see that standardization is most likely required, as each feature is on its own scale.

df2=df2.apply(pd.to_numeric)

Magnitude vs Depth

import seaborn as sns
plt.figure(figsize=(10,10))
sns.violinplot(x=df2['Depth'], y=df2['Magnitude'])

Although plotting all of the data would render a variety of plots, it is most interesting to see the relationship between the recorded depth of the Earthquake and Magnitude. It is most likely that this feature will have the most impact. From previous research, we know that earthquakes that occur at greater depths tend to cause less shaking at the surface.

Data Correlations

corr_matrix=np.corrcoef(X)
corr_matrix

import seaborn as sns
plt.figure(figsize=(20,20))
sns.heatmap(X.corr(), vmax=1, vmin=-1, cmap="spring", annot=True,fmt='.2f')
plt.show()

The Correlation Matrix shows that the majority of the features are not correlated very much at all, with most values being close to zero. This means that regularization may not be as necessary for this particular dataset.

Sample Data t-test

from scipy import stats
def tstat_corr(r,n):
  t = r*np.sqrt(n-2)/np.sqrt(1-r**2)
  pval = stats.t.sf(np.abs(t), n-1)*2  # two-sided pvalue = Prob(abs(t)>tt)
  print('t-statistic = %6.3f pvalue = %6.4f' % (t, pval))

tstat_corr(0.02,len(X)) #put in an r value from matrix

With a high t statistic and low p-value, our t-test supports low correlation between the features.

Linear Regression

#APPLY OLS
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)

lm = LinearRegression()
#fit the variables to model
lm.fit(X_train, y_train)
#predict output by inputting test variable
yhat_lm=lm.predict(X_test)
yhat_lm

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lm.intercept_))
    
mae = mean_absolute_error(y_test, yhat_lm)
print("MAE = {:,.2f}".format(mae))

print('coefficients:')
lm.coef_

#max coefficient (range)
max(lm.coef_)

#min coefficient
min(lm.coef_)

Predictions

MAE and Intercept

Coefficients

Max Coefficient:

Min Coefficient:

Ridge Regularization

#Apply Ridge
from sklearn.linear_model import Ridge

lr = Ridge(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lr.fit(X_train,y_train)
yhat_lr=lr.predict(X_test)
yhat_lr

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lr.intercept_))
    
mae = mean_absolute_error(y_test, yhat_lr)
print("MAE = {:,.2f}".format(mae))

print('Coefficients:')
lr.coef_

max(lr.coef_)
min(lr.coef_)

maeLR=[]
for i in range(1000):
  lr = Ridge(alpha=i)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  lr.fit(X_train,y_train)
  yhat_lr=lr.predict(X_test)
  #Turn continuous variables into discrete
  
  maeLR.append(mean_absolute_error(y_test, yhat_lr))
  
 plt.scatter(range(1000),maeLR)

Predictions

Intercept and MAE

Coefficients

Max Coefficient:

Min Coefficient:

MAE at other Alpha Values

Standardized Ridge

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

lr = Ridge(alpha=0.01)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lr.fit(ss.fit_transform(X_train),y_train)
scaled_yhat_lr=lr.predict(ss.fit_transform(X_test))

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lr.intercept_))
    
mae = mean_absolute_error(y_test, scaled_yhat_lr)
print("MAE = {:,.2f}".format(mae))

print('Coefficients:')
lr.coef_

max(lr.coef_)
min(lr.coef_)

maeSLR=[]
for i in range(200):
  lr = Ridge(alpha=i)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  lr.fit(ss.fit_transform(X_train),y_train)
  scaled_yhat_lr=lr.predict(ss.fit_transform(X_test))
  #Turn continuous variables into discrete
 
  
  maeSLR.append( mean_absolute_error(y_test, scaled_yhat_lr))
  
plt.scatter(range(200),maeSLR)

Predictions

Intercept and MAE

Coefficients

Min Coefficient:

Max Coefficient:

MAE at other Alpha Values

Lasso Regularization

from sklearn.linear_model import Lasso
ls = Lasso(alpha=0.01)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
ls.fit(X_train,y_train)
yhat_ls=ls.predict(X_test)

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(ls.intercept_))
    
mae = mean_absolute_error(y_test, yhat_ls)
print("MAE = {:,.2f}".format(mae))

print('Coefficients')
ls.coef_

min(ls.coef_)
max(ls.coef_)

#get features that work
list=[]
count=-1
counted_list=[]
A=ls.coef_
for a in A:
  count=count+1
  if a!=0:
    list.append(a)
    counted_list.append(count)

print(list)
print(counted_list)

#Features
for c in counted_list:
  print(X.columns[c])

maeLS=[]
for i in range(100):
  ls = Lasso(alpha=i)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  ls.fit(X_train,y_train)
  yhat_ls=ls.predict(X_test)

  maeLS.append(mean_absolute_error(y_test, yhat_ls))
  
plt.scatter(range(100),maeLS)

Predictions

Intercept and MAE

Coefficients

Max Coefficients:

Min Coefficients:

Nonzero Coefficient Features

MAE at other Alpha Values

Standardized Lasso

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ls = Lasso(alpha=0.01)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
ls.fit(ss.fit_transform(X_train),y_train)
yhat_ls=ls.predict(ss.fit_transform(X_test))

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(ls.intercept_))
    
mae = mean_absolute_error(y_test, yhat_ls)
print("MAE = {:,.2f}".format(mae))

print('Coefficients')
ls.coef_

min(ls.coef_)
max(ls.coef_)

#get features that work
list=[]
count=-1
counted_list=[]
A=ls.coef_
for a in A:
  count=count+1
  if a!=0:
    list.append(a)
    counted_list.append(count)

print(list)
print(counted_list)

#Features
for c in counted_list:
  print(X.columns[c])

maeSLS=[]
for i in range(100):
  ss = StandardScaler()
  ls = Lasso(alpha=i)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  ls.fit(ss.fit_transform(X_train),y_train)
  yhat_ls=ls.predict(ss.fit_transform(X_test))


  maeSLS.append(mean_absolute_error(y_test, yhat_ls))
  
plt.scatter(range(100),maeSLS)

Predictions

Intercept and MAE

Coefficients

Min Coefficent:

Max Coefficient:

NonZero Coefficient Features

MAE at other Alpha Values

Elastic Net Regularization

from sklearn.linear_model import  ElasticNet
lel = ElasticNet(alpha=0.01)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lel.fit(X_train,y_train)
yhat_lel=lel.predict(X_test)

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lel.intercept_))
    
mae = mean_absolute_error(y_test, yhat_lel)
print("MAE = {:,.2f}".format(mae))

print('Coefficients:')
lel.coef_

min(lel.coef_)

max(lel.coef_)

list=[]
count=-1
counted_list=[]
A=lel.coef_
for a in A:
  count=count+1
  if a!=0:
    list.append(a)
    counted_list.append(count)

print(list)
print(counted_list)

#Features
for c in counted_list:
  print(X.columns[c])

maeEN=[]
for i in range(200):
  lel = ElasticNet(alpha=i)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  lel.fit(X_train,y_train)
  yhat_lel=lel.predict(X_test)

  maeEN.append(mean_absolute_error(y_test, yhat_lel))

plt.scatter(range(200),maeEN)

Predictions

Intercept and MAE

Coefficients

Min Coefficient:

Max Coefficients:

Nonzero Coefficient Features

MAE at other Alpha Values

Standardized Elastic Net

from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
lel = ElasticNet(alpha=0.01)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
lel.fit(ss.fit_transform(X_train),y_train)
yhat_lel=lel.predict(ss.fit_transform(X_test))

from sklearn.metrics import mean_absolute_error 
print("Intercept: {:,.3f}".format(lel.intercept_))
    
mae = mean_absolute_error(y_test, yhat_lel)
print("MAE = {:,.2f}".format(mae))

print('Coefficients:')
lel.coef_

max(lel.coef_)

min(lel.coef_)

list=[]
count=-1
counted_list=[]
A=lel.coef_
for a in A:
  count=count+1
  if a!=0:
    list.append(a)
    counted_list.append(count)

print(list)
print(counted_list)

#Features
for c in counted_list:
  print(X.columns[c])

maeSEN=[]
for i in range(200):
  ss = StandardScaler()
  lel = ElasticNet(alpha=0.01)

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)
  lel.fit(ss.fit_transform(X_train),y_train)
  yhat_lel=lel.predict(ss.fit_transform(X_test))

  maeSEN.append(mean_absolute_error(y_test, yhat_lel))

plt.scatter(range(200),maeSEN)

Predictions

MAE and Intercept

Coefficients

Min Coefficient:

Max Coefficient:

NonZero Coefficient Features

MAE at other Alpha Values

Squared Root Lasso

! pip install --upgrade Cython
! pip install --upgrade git+https://github.com/statsmodels/statsmodels
import statsmodels.api as sm

from sklearn.metrics import mean_absolute_error as MAE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2021)

model = sm.OLS(y_train,X_train)
result = model.fit_regularized(method='sqrt_lasso', alpha=0.01)
yhat_test = result.predict(X_test)

MAE(y_test,yhat_test)

result.params

min(result.params)

max(result.params)

maeSL=[]
for i in range(200):
  model = sm.OLS(y_train,X_train)
  result = model.fit_regularized(method='sqrt_lasso', alpha=i)
  yhat_test = result.predict(X_test)
  maeSL.append(MAE(y_test,yhat_test))

plt.scatter(range(200),maeSL)

Predictions

MAE

Coefficients

Max Coefficient:

Min Coefficient:

MAE at other Alpha Values

Scaled Square Root Lasso

scale = StandardScaler()
Xs_train = scale.fit_transform(X_train)
Xs_test  = scale.transform(X_test)

model = sm.OLS(y_train,Xs_train)
result = model.fit_regularized(method='sqrt_lasso', alpha=0.5)
yhat_test = result.predict(Xs_test)
#Turn continuous variables into discrete

MAE(y_test, yhat_test)

result.params

max(result.params)
min(result.params)

maeSSL=[]
for i in range(200):
  model = sm.OLS(y_train,Xs_train)
  result = model.fit_regularized(method='sqrt_lasso', alpha=0.5)
  yhat_test = result.predict(Xs_test)
  maeSSL.append(MAE(y_test,yhat_test))

plt.scatter(range(200),maeSSL)

Predictions

MAE

Coefficients

Max Coefficient:

Min Coefficient:

MAE at other Alpha Values

SCAD

y=y.values.reshape(-1,1)
X=X.values

from scipy.optimize import minimize
def scad_penalty(beta_hat, lambda_val, a_val):
    is_linear = (np.abs(beta_hat) <= lambda_val)
    is_quadratic = np.logical_and(lambda_val < np.abs(beta_hat), np.abs(beta_hat) <= a_val * lambda_val)
    is_constant = (a_val * lambda_val) < np.abs(beta_hat)
    
    linear_part = lambda_val * np.abs(beta_hat) * is_linear
    quadratic_part = (2 * a_val * lambda_val * np.abs(beta_hat) - beta_hat**2 - lambda_val**2) / (2 * (a_val - 1)) * is_quadratic
    constant_part = (lambda_val**2 * (a_val + 1)) / 2 * is_constant
    return linear_part + quadratic_part + constant_part
    
def scad_derivative(beta_hat, lambda_val, a_val):
    return lambda_val * ((beta_hat <= lambda_val) + (a_val * lambda_val - beta_hat)*((a_val * lambda_val - beta_hat) > 0) / ((a_val - 1) * lambda_val) * (beta_hat > lambda_val))

def scad(beta):
  beta = beta.flatten()
  beta = beta.reshape(-1,1)
  n = len(y)
  return 1/n*np.sum((y-X.dot(beta))**2) + np.sum(scad_penalty(beta,lam,a))
  
def dscad(beta):
  beta = beta.flatten()
  beta = beta.reshape(-1,1)
  n = len(y)
  return np.array(-2/n*np.transpose(X).dot(y-X.dot(beta))+scad_derivative(beta,lam,a)).flatten()

p = X.shape[1]
b0 = np.random.normal(1,1,p)

lam = 0.5
a = 0.01
output = minimize(scad, b0, method='L-BFGS-B', jac=dscad,options={'gtol': 1e-8, 'maxiter': 50000,'maxls': 25,'disp': True})
yhat_test_scad = X_test.dot(output.x)

output.x

min(output.x)

max(output.x)

yhat_test_scad = X_test.dot(output.x)
yhat_test_scad

mae = mean_absolute_error(y_test, yhat_test_scad)
print("MAE = {:,.2f}".format(mae))

listofoutput=[]
lam = 0.5
for i in range(200):
  a = i
  output = minimize(scad, b0, method='L-BFGS-B', jac=dscad,options={'gtol': 1e-8, 'maxiter': 50000,'maxls': 25,'disp': True})
  yhat_test_scad = X_test.dot(output.x)

  y_hat_test_rounded=np.array(ylist)
  listofoutput.append(mean_absolute_error(y_test, yhat_test_scad))

plt.scatter(range(200),listofoutput)

min(listofoutput)

Predictions

MAE

Coefficients

Max Coefficients:

Min Coefficients:

MAE at other Alpha Values

Scaled SCAD

y=y.reshape(-1,1)
X=X.values

Xs = scale.fit_transform(X)

def scad_penalty(beta_hat, lambda_val, a_val):
    is_linear = (np.abs(beta_hat) <= lambda_val)
    is_quadratic = np.logical_and(lambda_val < np.abs(beta_hat), np.abs(beta_hat) <= a_val * lambda_val)
    is_constant = (a_val * lambda_val) < np.abs(beta_hat)
    
    linear_part = lambda_val * np.abs(beta_hat) * is_linear
    quadratic_part = (2 * a_val * lambda_val * np.abs(beta_hat) - beta_hat**2 - lambda_val**2) / (2 * (a_val - 1)) * is_quadratic
    constant_part = (lambda_val**2 * (a_val + 1)) / 2 * is_constant
    return linear_part + quadratic_part + constant_part
    
def scad_derivative(beta_hat, lambda_val, a_val):
    return lambda_val * ((beta_hat <= lambda_val) + (a_val * lambda_val - beta_hat)*((a_val * lambda_val - beta_hat) > 0) / ((a_val - 1) * lambda_val) * (beta_hat > lambda_val))

def scad(beta):
  beta = beta.flatten()
  beta = beta.reshape(-1,1)
  n = len(y)
  return 1/n*np.sum((y-X.dot(beta))**2) + np.sum(scad_penalty(beta,lam,a))
  
def dscad(beta):
  beta = beta.flatten()
  beta = beta.reshape(-1,1)
  n = len(y)
  return np.array(-2/n*np.transpose(X).dot(y-X.dot(beta))+scad_derivative(beta,lam,a)).flatten()

p = Xs.shape[1]
b0 = np.random.normal(1,1,p)

lam = 1
a = 0.01
output = minimize(scad, b0, method='L-BFGS-B', jac=dscad,options={'gtol': 1e-8, 'maxiter': 50000,'maxls': 25,'disp': True})

output.x

min(output.x)

max(output.x)

yhat_test_scad = X_test.dot(output.x)
yhat_test_scad

mae = mean_absolute_error(y_test, yhat_test_scad)
print("MAE = {:,.2f}".format(mae))

listofoutput=[]
lam = 0.5
for i in range(200):
  a = i
  output = minimize(scad, b0, method='L-BFGS-B', jac=dscad,options={'gtol': 1e-8, 'maxiter': 50000,'maxls': 25,'disp': True})
  yhat_test_scad = X_test.dot(output.x)

  listofoutput.append(mean_absolute_error(y_test, yhat_test_scad))
  
plt.scatter(range(200),listofoutput)

min(listofoutput)

Predictions

MAE

Coefficients

Max Coefficients:

Min Coefficients:

MAE at other Alpha Values

Discussion Pertaining to the Regularization Methods

Looking at the results, we note that none of the Regularization techniques improved the MAE. We can attribute this to the lack of correlation between the data. Looking further at the results, we also note that this dataset may not be adequate for prediction, as the features seem to have no effect on each other. We can further support this conclusion by lookig at the Lasso and Elastic Net results, where we lost anywhere in between one and five features with little to no effect on the MAE. In Standardized Lasso Regularization, we lost all the features which I hypothesized as having the most effect: location (latitude/longitude and depth). Thus, we can conclude that this particular dataset does not support that magnitude of an earthquake can be predicted from only latitude and depth. It would be interesting to add in other features such as weather and volcanic activty in the region.

We can see that regularization only worsened our results, particularly with SCAD and Square Root Lasso. It is interesting to see that when alpha was changed, most of the regularization method results either did not change or worsened the MAE as alpha approached infinity (standard ridge, square root lasso, scaled SCAD).