Use the Kaggle Credit Card Fraud data set for this exercise. Work with both a 100K-row sample and the entire data set, each containing fraudulent and non-fraudulent transactions. Use the same approach to generate the training and test data sets.
1. Perform ridge and lasso regression to reduce the input feature set. Identify the reduced input feature set, then use it to rerun the logistic regression.
2. Compare the result with the raw logistic regression. Total accuracy is not a good measure for this comparison. Explain why, and use other measures to compare the two models.
As explained in class, this credit card data set is unbalanced. Read https://journal.r-project.org/archive/2014-1/menardi-lunardon-torelli.pdf for a discussion of how to handle unbalanced data sets.
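To make the point about total accuracy concrete, here is a minimal Python sketch. It is not part of the assignment. The labels are hypothetical toy data (998 legitimate and 2 fraudulent transactions), and `confusion_counts` and `summarize` are illustrative helper names. It shows how a model that never predicts fraud can still score a high accuracy while its recall on the fraud class is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Compute accuracy plus measures that focus on the fraud class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels: 998 legitimate (0) and 2 fraudulent (1) rows.
y_true = [0] * 998 + [1] * 2
# A trivial "model" that labels every transaction as legitimate.
always_legit = [0] * len(y_true)

metrics = summarize(y_true, always_legit)
print(metrics)
```

Precision, recall, and F1 on the fraud class are among the alternative measures that could be reported alongside accuracy. Which ones to use is left to the exercise.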
3. Make a PowerPoint presentation on the technique the paper uses for unbalanced data: https://journal.r-project.org/archive/2014-1/menardi-lunardon-torelli.pdf.
4. Use the ROSE package discussed in the paper to adjust for the imbalance in the credit fraud data. Run logistic regression with the new data set. Also check https://cran.r-project.org/web/packages/ROSE/ROSE.pdf for a more concise explanation.
Data set: https://www.kaggle.com/dalpozz/creditcardfraud
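As a rough illustration of the idea behind item 4, the sketch below shows a simplified smoothed bootstrap in Python. It is not the ROSE package and not its API. It assumes each synthetic row is produced by picking a class with probability 1/2, drawing one observed row of that class, and adding Gaussian noise around it. The fixed scalar `bandwidth`, the function name `smoothed_bootstrap`, and the toy rows are all assumptions made for illustration; the package chooses its smoothing parameters differently.

```python
import random


def smoothed_bootstrap(rows, labels, n_new, bandwidth=0.1, seed=0):
    """Generate n_new synthetic rows with roughly balanced binary labels.

    Simplified sketch: the fixed scalar bandwidth is an assumption,
    not the smoothing rule used by the ROSE package.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    new_rows, new_labels = [], []
    for _ in range(n_new):
        label = rng.choice([0, 1])            # each class with probability 1/2
        center = rng.choice(by_class[label])  # an observed row of that class
        # Draw the synthetic row from a Gaussian kernel centered on it.
        new_rows.append([x + rng.gauss(0.0, bandwidth) for x in center])
        new_labels.append(label)
    return new_rows, new_labels


# Hypothetical toy data: 98 legitimate rows and 2 fraudulent rows.
rows = [[0.0, 0.0]] * 98 + [[5.0, 5.0]] * 2
labels = [0] * 98 + [1] * 2

new_rows, new_labels = smoothed_bootstrap(rows, labels, n_new=1000)
fraud_share = sum(new_labels) / len(new_labels)
print(f"fraud share after resampling: {fraud_share:.3f}")
```

In a sketch like this, only the training data would be resampled; the test set would keep its original class proportions so that the evaluation reflects the real imbalance.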