SMOTE Algorithm and Classification: overrated prediction success

I'm facing a problem I can't find an answer to. I have a binary classification problem (output Y=0 or Y=1), with Y=1 the minority class (Y=1 indicates default of a company, with proportion 0.02 in the original dataframe). I therefore oversampled with the SMOTE algorithm on my training set only (after splitting my dataframe into training and testing sets). I train a logistic regression on the oversampled training set (proportion of class "default" = 0.3) and then look at the ROC curve and the MSE to test whether the algorithm predicts defaults well.

I get very good results in terms of both AUC (AUC = 0.89) and MSE (MSE = 0.06). However, when I look more precisely and individually at my predictions, I find that 20% of defaults are not predicted correctly. Do you have a method to properly evaluate the quality of my predictions ("quality" meaning, for me, predictions that detect defaults well)? I thought AUC was a good criterion... Do you also have a method to improve my regression? Thanks
Asked by T. Ciffréo
For every classification problem you can build a confusion matrix.
This is a two-way table that shows not only the true positives and true negatives (TP/TN), which are your correct predictions, but also the false positives (FP) and false negatives (FN), and these are usually your true interest.
FP and FN are the errors your model makes. You can track how well your model detects the positives with sensitivity = TP / (TP + FN), and how well it detects the negatives with specificity = TN / (TN + FP).
Note that, for a fixed model, you can't improve one without lowering the other: moving the decision threshold trades sensitivity against specificity, so sometimes you need to pick one (see the sketch below).
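For instance, here is a minimal sketch of that trade-off using the pROC package; the vectors labels and probs are hypothetical placeholders for your test-set outcomes and predicted probabilities, not objects from your code:

```r
# Minimal sketch of the sensitivity/specificity trade-off.
# 'labels' (true 0/1 outcomes) and 'probs' (predicted P(Y = 1)) are
# hypothetical placeholders for your test-set vectors.
library(pROC)

roc_obj <- roc(labels, probs)

# Sensitivity and specificity at a few candidate thresholds:
# raising the threshold lowers sensitivity and raises specificity.
coords(roc_obj, x = c(0.3, 0.5, 0.7), input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))

# Threshold that maximizes Youden's J (sensitivity + specificity - 1)
coords(roc_obj, x = "best", best.method = "youden")
```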
A good compromise is the F1-score, the harmonic mean of precision and recall (recall being another name for sensitivity).
So if you're more interested in defaults (taking default = positive class), you'll prefer a model with higher sensitivity. But remember not to neglect specificity completely either. Here is example code in R:
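(A minimal sketch: the data frames train and test, and the outcome column default, are hypothetical placeholders for your own objects; it assumes the caret package.)

```r
# Confusion matrix for a logistic regression, with "1" (default) as
# the positive class. 'train', 'test' and the column 'default' are
# hypothetical placeholders for your own objects.
library(caret)

# Fit on the (SMOTE-oversampled) training set
model <- glm(default ~ ., data = train, family = binomial)

# Predicted probabilities on the untouched test set
probs <- predict(model, newdata = test, type = "response")

# Class labels at the 0.5 threshold (adjust the threshold to trade
# sensitivity against specificity, as discussed above)
pred <- factor(ifelse(probs > 0.5, "1", "0"), levels = c("0", "1"))

# Confusion matrix plus sensitivity, specificity, precision and F1
confusionMatrix(pred, factor(test$default, levels = c("0", "1")),
                positive = "1", mode = "everything")
```

From this output you can read the sensitivity directly: it tells you exactly what fraction of true defaults your model catches (the 80% you observed), which is the quantity you care about and which a good AUC alone does not guarantee.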