Annette-s-Responses


July 22, 2020

Boosted Trees

  1. What is a one-hot-encoded column and why might it be needed when transforming a feature? Are the source values continuous or discrete?
    • One-hot encoding is a process where categorical variables are converted into a form that is easier for the machine to read and process. Instead of substituting a single number for each category, it creates a binary column per category, which avoids the mistakes an arbitrary numeric ordering could introduce. The source values are discrete.
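A minimal sketch of the encoding described above, using pandas on a hypothetical Titanic-style `deck` column (the column name and values are illustrative, not taken from the tutorial's exact data):

```python
import pandas as pd

# Hypothetical discrete categorical feature
df = pd.DataFrame({"deck": ["A", "C", "A", "B"]})

# One-hot encode: each category becomes its own binary column,
# so no false numeric ordering is implied among the categories
one_hot = pd.get_dummies(df["deck"], prefix="deck")
print(one_hot)
```

Each row ends up with exactly one "hot" (nonzero) entry among the `deck_*` columns.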

ENCODING

  1. What is a dense feature? For example, if you execute example = dict(dftrain) and then tf.keras.layers.DenseFeatures(your_features)(your_object).numpy(), how has the content of your data frame been transformed? Why might this be useful?
    • Dense features produce tensors. As with one-hot encoding, the categorical features are turned into numeric vectors, with zeros marking the absence of a category, which helps the machine process the data efficiently. As we often saw in Data 146, missing or non-numeric data must either be reorganized before the machine can make sense of it, or thrown out. Because throwing data out can mean losing other valuable information, using an explicit placeholder for absent values helps mitigate such losses.
  2. Provide a histogram of the probabilities for the logistic regression as well as your boosted tree model. How do you interpret the two different models? Are their predictions essentially the same, or is there some area where they are noticeably different? Plot the probability density function of the resulting probability predictions from the two models and use them to further illustrate your argument. Include the ROC plot and interpret it with regard to the proportion of true to false positive rates, as well as the area under the ROC curve. How does the measure of the AUC reflect upon the predictive power of your model?
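The `DenseFeatures` transformation from question 1 can be sketched without TensorFlow: the idea is simply that numeric columns and one-hot-encoded categorical columns get packed, row by row, into a single dense matrix. The column names and vocabulary below are hypothetical stand-ins, not the tutorial's exact features:

```python
import numpy as np

# Hypothetical dict of columns, as produced by dict(dftrain)
example = {"age": [22.0, 38.0], "sex": ["male", "female"]}

# One-hot encode the categorical column against a fixed vocabulary;
# a zero marks the absence of that category for the row
sex_vocab = ["female", "male"]
sex_one_hot = np.array([[1.0 if v == cat else 0.0 for cat in sex_vocab]
                        for v in example["sex"]])

# Pack numeric and one-hot columns into one dense tensor per row
age = np.array(example["age"]).reshape(-1, 1)
dense = np.concatenate([age, sex_one_hot], axis=1)
print(dense)  # each row: [age, sex=female, sex=male]
```

This mirrors what `tf.keras.layers.DenseFeatures(...)(example).numpy()` returns: one flat numeric array per training example.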

### Linear

(figure: histogram and probability density of predicted probabilities, logistic regression)

### Boosted Trees

(figure: histogram and probability density of predicted probabilities, boosted trees)
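The AUC asked about above has a direct probabilistic reading: it is the probability that a randomly chosen positive example receives a higher predicted probability than a randomly chosen negative one. A minimal pure-Python sketch (the sample labels and scores are hypothetical, not the models' actual outputs):

```python
def roc_auc(y_true, y_score):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted survival probabilities for four passengers
y_true = [0, 0, 1, 1]
print(roc_auc(y_true, [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 means the model ranks no better than chance, while values near 1.0 mean positives are almost always scored above negatives, so a higher AUC reflects greater predictive power.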

Boosted Trees continued (with model understanding)

  1. Upload your feature values contribution to predicted probability horizontal bar plot as well as your violin plot. Interpret and discuss the two plots. Which features appear to contribute the most to the predicted probability?

(figure: violin plot of feature value contributions to predicted probability)

(figure: horizontal bar plot of average feature contributions)
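The bar plot above ranks features by the mean absolute value of their directional feature contributions (DFCs). A minimal sketch of that ranking step, using hypothetical contribution values rather than the model's actual DFCs:

```python
import pandas as pd

# Hypothetical directional feature contributions: one row per example,
# each value is that feature's signed contribution to the predicted probability
df_dfc = pd.DataFrame({
    "sex": [0.21, 0.18, -0.05],
    "age": [-0.03, 0.07, 0.02],
    "fare": [0.10, -0.02, 0.04],
})

# Rank features by mean absolute contribution, as in the bar plot
importance = df_dfc.abs().mean().sort_values(ascending=False)
print(importance)
```

Features at the top of this ranking are the ones that, on average, move the predicted probability the most in either direction.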