a. How would this customer be classified?
A. This customer would be classified as not accepting the personal loan offer. The KNN_Output suggests overfitting: the training classification matrix shows no errors (Class 0 = 0% error, Class 1 = 0% error, Overall = 0% error), while the validation classification matrix shows substantial error (Class 0 = 4.2% error, Class 1 = 55.85% error, Overall = 9.1% error).
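The per-class and overall percentages quoted above are the kind of figures read off a classification matrix. As a minimal sketch of that arithmetic, assuming a matrix stored as nested dictionaries of counts (the counts below are hypothetical and are not the values from KNN_Output):

```python
def error_rates(matrix):
    """Return (per-class error %, overall error %) from a classification matrix.

    matrix[actual][predicted] holds the number of records with that
    actual class and that predicted class.
    """
    per_class = {}
    total = 0
    wrong = 0
    for actual, row in matrix.items():
        n = sum(row.values())
        errors = n - row.get(actual, 0)  # everything off the diagonal
        per_class[actual] = 100.0 * errors / n if n else 0.0
        total += n
        wrong += errors
    overall = 100.0 * wrong / total if total else 0.0
    return per_class, overall


# Hypothetical validation counts, for illustration only.
validation_matrix = {
    0: {0: 90, 1: 10},  # actual class 0
    1: {0: 5, 1: 15},   # actual class 1
}
per_class, overall = error_rates(validation_matrix)
print(per_class, overall)
```

A large gap between the training and validation values of these rates is the signal of overfitting discussed in the answer.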
b. What is a choice of k that balances between overfitting and ignoring the predictor information?
A. A choice of k that balances between overfitting and ignoring the predictor information would be k = 6. This value was chosen after testing various values of k because it minimizes the validation error: according to the validation error log for different k, at k = 6 the training error is 7.4% and the validation error is 8.75%.
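The selection rule used here (try several values of k and keep the one with the lowest validation error) can be sketched in plain Python. This is an illustration under assumptions, not the tool that produced the output: the toy data, the helper names `knn_predict`, `error_pct`, and `best_k`, the use of Euclidean distance, and the tie-break toward the smaller k are all hypothetical choices.

```python
import math
from collections import Counter


def knn_predict(train, point, k):
    """Majority vote among the k training records nearest to point."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def error_pct(train, data, k):
    """Percentage of records in data misclassified by k-NN built on train."""
    wrong = sum(1 for x, y in data if knn_predict(train, x, k) != y)
    return 100.0 * wrong / len(data)


def best_k(train, valid, candidates):
    """Pick the candidate k with the lowest validation error (smaller k on ties)."""
    errors = {k: error_pct(train, valid, k) for k in candidates}
    chosen = min(errors, key=lambda k: (errors[k], k))
    return chosen, errors


# Toy data, for illustration only: (features, class label).
train = [
    ((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0),
    ((3.0, 3.0), 1), ((3.2, 2.9), 1), ((2.9, 3.1), 1),
    ((1.1, 1.0), 1),  # a noisy record inside the class-0 cluster
]
valid = [((1.0, 0.9), 0), ((3.1, 3.0), 1), ((1.15, 0.95), 0)]

chosen, errors = best_k(train, valid, candidates=[1, 3, 5])
print(chosen, errors)
print(error_pct(train, train, 1))  # each training record is its own nearest neighbor
```

In this toy setup k = 1 reproduces the training data perfectly but follows the noisy record on the validation set, which is the same pattern of overfitting described in part a.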
c. Show the classification matrix for the validation data that results from using the best k.
d. Classify the customer using the best k.
A. According to the best k, the customer would not be inclined to accept the personal loan.
e. Re-partition the data, this time into training, validation, and test sets (50%: 30%: 20%). Apply the k-NN method with the k chosen above. Compare the classification matrix of the test set with that of the training and validation sets. Comment on the differences and their reason.
A. The training, validation, and test matrices show a steady increase in the percentage errors: there is a 5.69% difference from the training to the validation error, and a 14.05% difference from the validation to the test error. Because these discrepancies among the three matrices are minimal, there does not appear to be overfitting. Based on the lift chart, the model appears to make a difference, even though loan acceptance has an 82% error rate in the test classification matrix.
9.3
i. Compare the tree generated by the CT with the one generated by the RT. Are they different? (Look at structure, the top predictors, size of tree, etc.) Why?
A. According to the Regression Tree and Classification Tree Output, both trees use similar predictors, with age, kilometers, and horsepower as the most important car specifications. The regression tree is structurally bigger than the classification tree. According to the classification matrix in the training error report, the percentage error is 0% for all 20 bins.
For the validation error report there is approximately 1 bin with a 100% error rate, and the overall error is 74.88%. Finally, for the test error report there are 4 bins with 100% error rates, and the overall error is 75.98%. There appears to be a slight increase in the overall error percentage between the validation and the test error report, and there is clearly overfitting, given the distinct difference between the training and validation confusion matrices.
ii. Predict the price, using the RT and the CT, of a used Toyota Corolla with the specifications listed in Table 9.3.
A. After running both models, the predicted price was the same.
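Comparing the two predictions takes one extra step, because the CT predicts one of the 20 price bins while the RT predicts a numeric price. A minimal sketch of that comparison, assuming equal-width bins over a hypothetical price range (the binning scheme, the range, and the names `make_bins` and `bin_index` are assumptions, not taken from the output):

```python
def make_bins(min_price, max_price, n_bins=20):
    """Split [min_price, max_price] into n_bins equal-width (low, high) ranges."""
    width = (max_price - min_price) / n_bins
    return [(min_price + i * width, min_price + (i + 1) * width)
            for i in range(n_bins)]


def bin_index(price, min_price, max_price, n_bins=20):
    """Return the 0-based bin a price falls into, clamped to the valid range."""
    width = (max_price - min_price) / n_bins
    idx = int((price - min_price) // width)
    return min(max(idx, 0), n_bins - 1)


# Hypothetical price range and RT prediction, for illustration only.
MIN_PRICE, MAX_PRICE = 4000, 34000
bins = make_bins(MIN_PRICE, MAX_PRICE)
rt_prediction = 10000
rt_bin = bin_index(rt_prediction, MIN_PRICE, MAX_PRICE)
print(rt_bin, bins[rt_bin])
```

Under this reading, the two models agree when the RT's numeric prediction falls inside the bin predicted by the CT.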