---
title: "Final Project"
author: "Venkat Kumar"
output:
flexdashboard::flex_dashboard:
theme: simplex
orientation: columns
source_code: embed
---
```{r setup, include=FALSE}
# load necessary packages
library(Amelia)
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard) ## you need this package to create dashboard
library(psych)
library(car)
library(ROCR)
library(InformationValue)
library(pscl)
library(knitr)
library(kableExtra)
# read the Wine Quality data set; fileEncoding = "UTF-8-BOM" keeps the byte-order
# mark from mangling the first column name into "ï..FA"
Z <- read.csv("C:/ProgramData/Microsoft/Windows/Start Menu/Programs/RStudio/Wine.csv",
              fileEncoding = "UTF-8-BOM")
# binarize the response: 1 (good) if Quality is above 5, 0 (bad) otherwise
Z$Y <- ifelse(Z$Q > 5, 1, 0)
Z$Y <- as.factor(Z$Y)
df <- Z[,-12]                             # drop raw Quality, keep the 11 features and Y
df[,1:11] <- apply(df[,1:11], 2, scale)   # standardize the regressors
# 80/20 train/test split: 1,280 of the 1,599 rows go to training
set.seed(200)
train_index <- sample(nrow(df), 1280)
df_train <- df[train_index,]
df_test <- df[-train_index,]
model <- glm(Y ~ ., family = binomial(link = 'logit'), data = df_train)
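# Sanity check (illustrative addition, not part of the original analysis):
# the red-wine data has 1,599 rows, so the 80/20 split should yield
# 1,280 training and 319 test observations
stopifnot(nrow(df_train) == 1280, nrow(df_test) == 319)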
```
Introduction
=======================================================================
Column {data-width=400}
-----------------------------------------------------------------------
### Wine Quality Analysis : A Statistical Approach
Wine Quality analysis is a statistical study of the features that affect the quality of wine. The Wine Quality data set contains 11 such features, and quality is measured on a scale of 0-10. This work aims to identify the specific features that play a key role in determining wine quality using two statistical tools: multiple regression and logistic regression. Box plots of the features and their corresponding effects on quality are shown for clarity. Because of limitations of multiple regression on this data, such as collinearity among the regressors and a categorical response, we have adopted logistic regression. Finally, as an out-of-the-box step, we have tested the predictability of the model, to check whether a model developed on this data set can be used on other data sets.
### Insight to features affecting the Wine Quality
```{r}
G <- read.csv("C:/ProgramData/Microsoft/Windows/Start Menu/Programs/RStudio/Wine.csv",
              fileEncoding = "UTF-8-BOM")   # avoid the BOM-mangled "ï..FA" header
dt <- G[1:6,1:11]
kable(dt, caption = "Table 1: Features affecting Wine Quality") %>% kable_styling()
```
The abbreviations used in Table 1 have the following meanings:\
1- FA - Fixed Acidity\
2- VA - Volatile Acidity\
3- CA - Citric Acid\
4- RS - Residual Sugar\
5- CL - Chlorides\
6- FS - Free Sulfur Dioxide\
7- TS - Total Sulfur Dioxide\
8- D - Density\
9- PH - pH\
10- S - Sulphates\
11- A - Alcohol\
12- Q - Quality (the response variable; not shown in Table 1)\
Column {.tabset data-width=400}
-----------------------------------------------------------------------
### Motivation
Wine, once an expensive good, is now enjoyed by an increasingly wide variety of consumers. In fact, Portugal is one of the top ten wine-exporting countries, with about 32\% of the market share in 2005 [5], and its exports rose to about 36\% in 2007. New technologies have therefore been adopted to enhance the making and selling of wine. In this process there are two major steps: wine certification and quality assessment.
While certification ensures the prevention of illegal adulteration, quality evaluation, which is part of the certification process, is an indicator used to improve wine making: it identifies the most important features and thereby helps classify wines as premium brands.
Generally, wine certification is done using physicochemical and sensory tests: the former characterizes wine through measurements such as density, alcohol, or pH, while sensory tests rely on human senses. **Since taste is the least understood of the human senses [6], the relationship between these two kinds of tests is very difficult to establish,** and wine classification becomes an onerous task.
In this setting, technology makes it possible to collect and store data pertaining to wine quality. These data contain important information that explains the trends and features on which the quality of wine depends, and statistical analysis of them makes it possible to improve quality.
Therefore, in this work we have taken the Wine Quality data set [4] and performed two types of statistical analysis on it: multiple regression and logistic regression. With these analyses we have extracted the important features that affect wine quality and validated the results with measures of "Goodness of fit". Finally, as an out-of-the-box initiative, we went beyond classification and performed prediction with the developed model, to check that it can be used on other data sets too.
### Related Works
Wine quality analysis first appears in the research of P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, in the paper titled "Modeling wine preferences by data mining from physicochemical properties" [5]. In that work, the authors analyzed the features and motivations for wine quality analysis, and used Support Vector Machines (SVM) and neural networks for their data mining.
In [1] the author has made use of pair-wise features that affect the quality of wine; in other words, bivariate and multivariate analyses were done, initially treating quality as a continuous variable and then classifying it into two levels, good and bad. The author concluded with the pairs of features that affect wine quality after several rounds of trial and error.
In [2] the author has done both prediction and classification. For classification, logistic, linear, multiple, and polynomial regressions were used, and the limitations of each methodology were analyzed with the help of measures of Goodness of fit. For prediction, the author applied this data set to an already developed model, analyzed its Goodness of fit, and also developed a model based on this data set.
The author in [3] classified wine quality into three levels - bad, average, and good - based on the values of Quality on a scale of 0-10, and provided a comparative study of the various features and their effect on the quality of wine.
In this work, we have considered two levels based on the value of Quality (on a scale of 0-10), matching the recoding in the setup chunk: \
(i) Good when Quality is above 5\
(ii) Bad when Quality is 5 or below.
In the Data Exploration section, the analysis of the raw data is shown: the features and their effect on the quality of wine are demonstrated using box plots for clarity. In the Methods section, the two statistical approaches that could be used are discussed along with their corresponding limitations. In the Prediction section, a model based on this data is developed and its predictability is discussed. In the last section, the results of this work are compared with those of similar works. References used in this project are listed at the tail end.
### Methods
In the previous sections a brief idea of the Wine Quality data set was given; in this section the raw data needs to be analyzed. Since there are 11 independent variables with Quality as the response variable, two basic approaches come to mind:\
(a) Multiple Regression - one of the basic regression models, applicable for finding the nature of the relationship between the dependent and the independent variables. It also helps determine the relationships among the different variables in the data set. \
(b) Logistic Regression - because of shortcomings of linear regression, such as its inability to deal with a categorical response, logistic regression is the better and more natural choice here (see the sketch below). Combining this logistic regression with prediction-based analysis of the model provides a tangible conclusion.\
But before performing any analysis, we need to examine the effect of the features on quality through box plots in the next section. With this in mind we will first implement Multiple regression, followed by Logistic regression.
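As a minimal sketch of the two candidate approaches (assuming the data frame `Z` from the setup chunk, with the raw Quality `Q` and the binarized response `Y`), the calls differ only in the model function and the response used:

```{r, eval=FALSE}
# Multiple (linear) regression: treats Quality as a continuous response
fit_lm <- lm(Q ~ . - Y, data = Z)

# Logistic regression: models the log-odds that a wine is good (Y = 1)
fit_glm <- glm(Y ~ . - Q, family = binomial(link = "logit"), data = Z)
```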
Data Exploration
=======================================================================
Column {data-width=300}
-----------------------------------------------------------------------
### Fig.1: Effect of RS
```{r}
# Y == 1 marks good wines (Quality above 5); a plotly object renders directly
g <- plot_ly(df, x = ~RS, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
### Fig.2: Effect of FS
```{r}
g <- plot_ly(df, x = ~FS, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
Column {data-width=300}
-----------------------------------------------------------------------
### Fig.3: Effect of CA
```{r}
g <- plot_ly(df, x = ~CA, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
### Fig.4: Effect of VA
```{r}
g <- plot_ly(df, x = ~VA, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
Column {data-width=300}
-----------------------------------------------------------------------
### Fig.5: Effect of Chlorides
```{r}
g <- plot_ly(df, x = ~CL, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
### Fig.6: Effect of TS
```{r}
g <- plot_ly(df, x = ~TS, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
Column {data-width=300}
-----------------------------------------------------------------------
### Fig.7: Effect of FA
```{r}
g <- plot_ly(df, x = ~FA, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
### Fig.8: Effect of pH
```{r}
g <- plot_ly(df, x = ~PH, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
Column {data-width=300}
-----------------------------------------------------------------------
### Fig.9: Effect of Density
```{r}
x <- list(title = "Density")
g <- plot_ly(df, x = ~D, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box") %>%
  layout(xaxis = x)
g
```
### Fig.10: Effect of Alcohol
```{r}
g <- plot_ly(df, x = ~A, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
Column {data-width=300}
-------------------------------------------------------------------
### Fig.11: Effect of Sulphates
```{r}
g <- plot_ly(df, x = ~S, color = ~ifelse(df$Y==1, "Good", "Bad"), type = "box")
g
```
### Fig.12: Histogram of Quality
```{r}
p <- plot_ly(x = ~Z$Q, type = "histogram")
p
```
Multiple Regression
=========================================================================
Column {.tabset data-width=600}
----------------------------------------------------------------------------------
### Diagnostics
In Fig.13 we can see that there is no clear linear relationship between the response variable and the regressor variables, and there is an indication of collinearity among the regressors.
```{r,fig.cap = 'Fig.13: Pair-wise relationship between variables'}
pairs.panels(Z[,1:12])
```
### Crazy diagnostics
```{r,fig.cap="Fig.13: Crazy diagnostics"}
Z$Y <- ifelse(Z$Q>5, 1, 0)
Z$Y <- as.factor(Z$Y)
df <- Z[,-12]
fit <- glm(Y~., family = binomial(link = "logit"), df)
plot(fit)
```
Fig.14 shows "crazy" diagnostics because the response variable is categorical: these diagnostic plots are meaningful for a continuous response, so no proper conclusion can be inferred from them here.
Column {data-width=400}
-------------------------------------------------------------------------
### Model Adequacy check
On first observation of the data set, multiple regression was the natural first choice, since there are several independent variables and one response variable. But if multiple regression were to be applied to this data, five assumptions would need to hold: linearity, zero-mean errors, equal variance, independence, and normality.
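A minimal sketch of how these assumptions are commonly checked (treating Quality `Q` as continuous for the sake of the diagnostics, and assuming the raw data frame `Z` from the setup chunk):

```{r, eval=FALSE}
fit_lm <- lm(Q ~ ., data = Z[, 1:12])  # multiple regression on the 11 features
par(mfrow = c(2, 2))
plot(fit_lm)  # residuals vs fitted, Q-Q, scale-location, leverage
```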
Logistic Regression
=================================================================================
Column {data-width=400}
--------------------------------------------------------------------------
### Insight to Logistic regression
In the previous section on Multiple Regression, the following observations were drawn for this data set:\
(i) This data set has "crazy" diagnostics - no proper conclusions about the linearity, normality, or zero-mean assumptions could be inferred.\
(ii) This data set violates the Normality assumption, and since one of the five criteria is violated it is not necessary to check the others: for regression to be applied, all five conditions should be satisfied.\
(iii) Also, from the pair-wise diagnostics it can be seen that several of the independent variables are either positively or negatively correlated. Such dependence affects the performance of the model to a great extent.\
**What is the reason for such an anomaly, and how can this particular data set be analyzed?**
This serious question has a simple answer. If we dig deeper into the reasons for this behavior, we find that the diagnostics above implicitly assume the response variable is **continuous**, while in this data set it is not.
In this situation we need to convert the response variable into a binary variable. In other words, as mentioned in the Related Works section, by treating Quality values above 5 as good and 5 or below as bad, and applying logistic regression, the problem can be solved.
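Concretely, logistic regression models the log-odds of a wine being good as a linear function of the 11 features: writing $p = P(Y = 1 \mid x_{1}, \ldots, x_{11})$,

$$\log\left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1}x_{1} + \cdots + \beta_{11}x_{11}.$$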
In this section, a logistic regression analysis of this data set is performed by splitting the data into training and test sets. Initially, **model1** is developed on the training set and measures of Goodness of fit are computed. Then the non-significant variables are removed and **model2** is developed, validated, and its measures of Goodness of fit computed. Finally, the two models are compared. In the next section we verify the predictability of the model: we apply the test data to the developed model2 and discuss the appropriate inferences.
**It may be noted here that 80\% of the 1,599 observations (1,280) are used as the training set, while the remaining 20\% (319) are used as the test set.**
Based on the output, the measures of Goodness of fit are computed. The same process is then repeated after removing the non-significant variables, and the two models are compared.
Column {.tabset data-width=400}
--------------------------------------------------------------------------
### Goodness of fit
Goodness of fit is used to find out how well a model fits the data. Generally this check is done after selecting the final model, but measures of Goodness of fit can also be used to compare any two given models.
A model is usually deemed to fit on the basis of hypothesis testing. But such testing is not always sufficient, because non-rejection of the null hypothesis does not always mean that the model fits the data well.
**How do we check if the model fits the data well?**
Use the measures of Goodness of fit [11]:\
(i) ROC curves \
(ii) Logistic Regression $R^{2}$\
(iii) Chi-square goodness of fit\
(iv) Model validation on outside data, or by splitting the data set\
In this work we have considered all of the aforementioned measures of Goodness of fit.
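As a minimal sketch of the logistic regression $R^{2}$ (McFadden's pseudo-$R^{2}$, one of the measures reported by `pscl::pR2()` in the model tabs), assuming `model1` fitted on `df_train`:

```{r, eval=FALSE}
# McFadden's pseudo-R^2: 1 - logLik(fitted model) / logLik(intercept-only model)
null_model <- glm(Y ~ 1, family = binomial(link = "logit"), data = df_train)
1 - as.numeric(logLik(model1)) / as.numeric(logLik(null_model))
```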
### Note
The following definitions are used in the analysis of this data [8] (see the sketch after this list): \
**(i) Misclassification Error** - the percentage mismatch between predicted and actual values. The lower this value, the better the model.\
**(ii) ROC** - the Receiver Operating Characteristic curve traces the percentage of true values correctly predicted by the logit model as the prediction probability cutoff is reduced from 1 to 0. For a good model, as the cutoff is lowered it should mark more 1s than 0s.\
**(iii) Concordance** - the model-calculated probability scores of the positives (the 1s) should be greater than those of the negatives. The higher the concordance, the better the model.\
**(iv) Sensitivity and Specificity** - sensitivity is the percentage of 1s correctly predicted by the model, while specificity is the percentage of 0s correctly predicted.\
**(v) Confusion Matrix** - describes the actual performance of the classification model on a set of test data for which the true values are known.\
**(vi) Accuracy** - the ratio of correct predictions to the total number of samples.
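As a minimal illustration of definitions (iv)-(vi), assuming 0/1 vectors `actual` and `predicted`:

```{r, eval=FALSE}
actual    <- c(1, 1, 0, 0, 1, 0, 1, 0)   # toy labels, for illustration only
predicted <- c(1, 0, 0, 0, 1, 1, 1, 0)

cm <- table(Predicted = predicted, Actual = actual)  # confusion matrix
sum(diag(cm)) / sum(cm)        # accuracy: correct predictions / total
cm["1", "1"] / sum(cm[, "1"])  # sensitivity: share of actual 1s predicted as 1
cm["0", "0"] / sum(cm[, "0"])  # specificity: share of actual 0s predicted as 0
```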
### Model 1
This model is created from the data set as such, after converting the response variable to binary: 1 if Quality is greater than 5 and 0 otherwise.
```{r}
df <- Z[,-12]
df[,1:11] <- apply(df[,1:11], 2, scale)
set.seed(200)                          # make the split reproducible
train_index <- sample(nrow(df), 1280)  # random 80/20 split of the 1,599 rows
df_train <- df[train_index,]
df_test <- df[-train_index,]
model1 <- glm(Y ~ ., family = binomial(link = 'logit'), data = df_train)
summary(model1)
```
We can see here that this table comprises both significant and non-significant variables. We first perform the analysis without removing the non-significant variables.
```{r}
pR2(model1)
fitted.results <- predict(model1, type = 'response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != df_train$Y)  # compare with training labels
print(paste("Misclassification Error is", misClasificError))
p <- predict(model1, df_train, type = 'response')  # predicted probabilities
optCutOff <- optimalCutoff(df_train$Y, p)
con <- Concordance(df_train$Y, p)
print(paste("Concordance is", con$Concordance))    # print the scalar, not the whole list
sens <- sensitivity(df_train$Y, p, threshold = optCutOff)
print(paste("Sensitivity is", sens))
specs <- specificity(df_train$Y, p, threshold = optCutOff)
print(paste("Specificity is", specs))
```
The accuracy of this model can be computed from the confusion matrix:
```{r}
confusionMatrix(df_train$Y, p, threshold = optCutOff)
```
Accuracy: 73.90625\%
### Model 2
As mentioned in the previous section, the measures of Goodness of fit and model diagnostics were computed while considering both significant and non-significant variables. In this section we eliminate the non-significant variables and re-construct the model, using the Wald z-tests from the model summary.
For example, let us consider:
$\beta_{1}$: the change in the log-odds of good wine quality for a one-unit increase in (standardized) fixed acidity, keeping all the other regressor variables constant.
Let $H_{0}$ : $\beta_{1}$ = 0 versus $H_{1}$ : $\beta_{1}$ $\neq$ 0. The p-value is 0.1866 > 0.01 = $\alpha$, the level of significance, so we fail to reject $H_{0}$. Therefore fixed acidity does not contribute significantly to the model, given that the other regressor variables are present in the model.
In this way, we eliminate FA, CA, RS, FS, D, and PH using the same z-tests (a programmatic version is sketched below). With the remaining regressors (VA, CL, TS, S, and A) we create a new model, called model2, and perform the same kinds of tests.
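As a minimal sketch of this screening step (assuming `model1` from above), the p-values can be pulled straight out of the coefficient table; note that the final choice of regressors above also reflects judgment, not the threshold alone:

```{r, eval=FALSE}
coefs <- summary(model1)$coefficients  # Estimate, Std. Error, z value, Pr(>|z|)
pvals <- coefs[-1, "Pr(>|z|)"]         # drop the intercept row
sort(pvals)                            # screen the candidates against alpha
```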
```{r}
# re-read the data (the BOM fix keeps the first column name clean)
Z <- read.csv("C:/ProgramData/Microsoft/Windows/Start Menu/Programs/RStudio/Wine1.csv",
              fileEncoding = "UTF-8-BOM")
set.seed(200)
Z$Y <- ifelse(Z$Q > 5, 1, 0)
Z$Y <- as.factor(Z$Y)
Z[,1:11] <- apply(Z[,1:11], 2, scale)
df <- Z[,-c(1,3,4,6,8,9,12)]           # drop FA, CA, RS, FS, D, PH and raw Quality
train_index <- sample(nrow(df), 1280)  # random 80/20 split
df_train <- df[train_index,]
df_test <- df[-train_index,]
model2 <- glm(Y ~ ., family = binomial(link = 'logit'), data = df_train)
summary(model2)
```
This new model2 has no non-significant variables in it. With this model, let's analyze the same parameters as we did for model1.
```{r}
pR2(model2)
fitted.results <- predict(model2, type = 'response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != df_train$Y)  # compare with training labels
print(paste("Misclassification Error is", misClasificError))
p <- predict(model2, df_train, type = 'response')  # predicted probabilities
optCutOff <- optimalCutoff(df_train$Y, p)
con <- Concordance(df_train$Y, p)
print(paste("Concordance is", con$Concordance))
sens <- sensitivity(df_train$Y, p, threshold = optCutOff)
print(paste("Sensitivity is", sens))
specs <- specificity(df_train$Y, p, threshold = optCutOff)
print(paste("Specificity is", specs))
```
The accuracy can be computed from the confusion matrix:
```{r}
confusionMatrix(df_train$Y, p, threshold = optCutOff)
```
Accuracy: 75.23437\%
### Comparitive analysis
In this section we compare the two models, Model1 and Model2, on the basis of the parameters below.\
(i) McFadden $R^{2}$ - Model1 has a McFadden $R^{2}$ of about 0.264 versus about 0.256 for Model2. The higher the McFadden $R^{2}$, the better the model, so by this measure **Model1 is better than Model2**.\
(ii) Misclassification Error - Model2 has a higher misclassification error (0.491) than Model1 (0.482). The lower the misclassification error, the better the model, so by this measure **Model1 is better than Model2**.\
(iii) Sensitivity and Specificity - Model1 has a sensitivity of 0.765 and a specificity of 0.747, while Model2 has 0.785 and 0.712 respectively. Generally one of these is higher and the other lower for any pair of models. \
(iv) Accuracy - Model1 has an accuracy of 73.906\% while Model2 has an accuracy of 75.234\%, so by this measure **Model2 is better than Model1**: it predicts a larger number of correct values.\
(v) Concordance - both models have almost the same concordance, about 82\%. The higher the concordance, the better the model.\
(vi) Chi-square test - the analysis-of-deviance table below compares the two models: the increase in deviance from dropping the six regressors (15.27 on 6 degrees of freedom, p = 0.018) is not significant at the $\alpha$ = 0.01 level used above, so the smaller Model2 is adequate.
```{r}
# deviance comparison of the nested models; strictly valid only when both are
# fit to the same training data
anova(model2, model1, test = "Chisq")
```
Model2 has a higher accuracy than Model1, so Model2 is used in the Prediction section to test how well the model generalizes.
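For reference, the chi-square comparison reported above can be reproduced by hand from the two residual deviances (a sketch, assuming `model1` and `model2` as fitted above):

```{r, eval=FALSE}
# likelihood-ratio statistic: difference in residual deviance, with df equal
# to the number of dropped regressors
lr_stat <- deviance(model2) - deviance(model1)
pchisq(lr_stat, df = df.residual(model2) - df.residual(model1), lower.tail = FALSE)
```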
Prediction
===================================================================
Column {data-width=600}
-----------------------------------------------------------------------
### Insight to Prediction
Prediction refers to the output of an algorithm after it has been trained on a historical data set and applied to new data [9]. It generates probable values for an unknown variable, helping the modeler identify the most likely value.\
Prediction extracts the significant information from the data set and uses it to forecast behavior patterns and trends [10]. Its foundation is understanding the relationship between the response and the independent variables; based on this, the unknown outcome is predicted. It finds use in various domains and spheres of application.
**How is Prediction done in this work?**
In this work, as mentioned before, the data are split into training and test sets. In the previous section the entire analysis was done on the training set; in this section we apply the test data to the model developed on the training set, check the model's performance, and draw the appropriate conclusions.
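The evaluation in the tabs that follow boils down to a short predict-threshold-compare pipeline. As a minimal sketch (assuming `model2`, `df_test`, and a cutoff `optCutOff` as in the chunks below):

```{r, eval=FALSE}
probs      <- predict(model2, df_test, type = "response")  # P(good) for each test wine
pred_class <- ifelse(probs > optCutOff, 1, 0)              # threshold into 0/1 labels
mean(pred_class == as.numeric(as.character(df_test$Y)))    # test-set accuracy
```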
Column {.tabset data-width=400}
--------------------------------------------------------------------------
### ROC and AuC
The area under the ROC curve (AUC) plays a crucial role in classification problems: it tells us how well the model can distinguish between the classes, and the higher the AUC, the better the model. This model has an AUC of about 78.57\%, which is good, so the predictability of this model is also good.
```{r,fig.width=4,fig.height=4,fig.cap='Fig.15: Area under the Curve'}
Z <- read.csv("C:/ProgramData/Microsoft/Windows/Start Menu/Programs/RStudio/Wine1.csv",
              fileEncoding = "UTF-8-BOM")
set.seed(200)
Z$Y <- ifelse(Z$Q > 5, 1, 0)
Z$Y <- as.factor(Z$Y)
Z[,1:11] <- apply(Z[,1:11], 2, scale)
df <- Z[,-c(1,3,4,6,8,9,12)]
train_index <- sample(nrow(df), 1280)
df_train <- df[train_index,]
df_test <- df[-train_index,]
model2 <- glm(Y ~ ., family = binomial(link = 'logit'), data = df_train)
# ROC and AUC on the held-out test set
p <- predict(model2, df_test, type = 'response')
pr <- prediction(p, df_test$Y)
auc <- performance(pr, measure = "auc")@y.values[[1]]
print(paste("Area under the curve is", auc))
plotROC(df_test$Y, p)
```
### Other measures
In this sub-section the other parameters, namely accuracy, specificity, sensitivity, and misclassification error, are computed on the test set.
```{r}
# rebuild model2 on the training data (as in the previous tabs)
Z <- read.csv("C:/ProgramData/Microsoft/Windows/Start Menu/Programs/RStudio/Wine1.csv",
              fileEncoding = "UTF-8-BOM")
set.seed(200)
Z$Y <- ifelse(Z$Q > 5, 1, 0)
Z$Y <- as.factor(Z$Y)
Z[,1:11] <- apply(Z[,1:11], 2, scale)
df <- Z[,-c(1,3,4,6,8,9,12)]
train_index <- sample(nrow(df), 1280)
df_train <- df[train_index,]
df_test <- df[-train_index,]
model2 <- glm(Y ~ ., family = binomial(link = 'logit'), data = df_train)
# evaluate on the held-out test set
p <- predict(model2, df_test, type = 'response')
fitted.results <- ifelse(p > 0.5, 1, 0)
misClasificError <- mean(fitted.results != df_test$Y)
print(paste("Misclassification error is", misClasificError))
optCutOff <- optimalCutoff(df_test$Y, p)
sen <- sensitivity(df_test$Y, p, threshold = optCutOff)
print(paste("Sensitivity is", sen))
sp <- specificity(df_test$Y, p, threshold = optCutOff)
print(paste("Specificity is", sp))
con <- Concordance(df_test$Y, p)
print(paste("Concordance is", con$Concordance))
```
Accuracy can be computed from the Confusion matrix:
```{r}
confusionMatrix(df_test$Y, p, threshold = optCutOff)
```
Accuracy: (98 + 141)/319 = 74.92\%
Discussion
==========================================================================
Column
------------------------------------------------------------------------
### Prediction Result Analysis
From the prediction performed in the previous section, we find that the model performs well on the test data set. In the Logistic Regression section we considered two models based on the training data. The model developed after removing the non-significant variables, applied here to the test data, has an accuracy of about 74.921\%, which is close to the accuracy of about 75.23\% that Model2 achieved on the training data.
This suggests that model2 generalizes well to held-out test data. It can also be seen through the ROC curve, whose area under the curve of about 78.57\% is acceptable and better than the results obtained in [2]. In [2] the overall accuracy is about 70\%, while in this work the overall accuracy is about 74.921\%. The lower accuracy in [2] could be attributed to not removing the non-significant variables and not considering the collinearity between the regressor variables.
In [1] the author treated the response variable, quality, as a continuous variable and performed the corresponding regression analysis. This resulted in poor model performance, with an adjusted $R^{2}$ as low as about 20\%. The reason was incomplete consideration of the data set: since the author only partially considered the data, the analysis was difficult and the performance poor.
The model developed here takes into account the issues that [1] and [2] overlooked, which results in a model with good performance and predictability that can reasonably be applied to similar data sets.
Still, the model developed here may not be ideal or the best; there is considerable scope for improving it. In future, bivariate and multivariate analyses of this data set could be done, that is, considering pairs of features that contribute significantly to the quality of wine. In terms of prediction, this Wine Quality data set could be applied to a model developed on some other data set to review its performance; conversely, the red wine and white wine data sets available in [3] could be applied to the model developed here and its performance analyzed (a sketch follows).
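As a hedged sketch of that future direction (the file name and column names below are assumptions about the white-wine data from [4], not files used in this work):

```{r, eval=FALSE}
# hypothetical file; the new data must carry the five regressors used by model2
W <- read.csv("winequality-white.csv", sep = ";")
W$Y <- ifelse(W$quality > 5, 1, 0)
# standardize with the new data's own statistics (a simplification)
new_df <- data.frame(scale(W[, c("volatile.acidity", "chlorides",
                                 "total.sulfur.dioxide", "sulphates", "alcohol")]))
names(new_df) <- c("VA", "CL", "TS", "S", "A")
probs <- predict(model2, new_df, type = "response")
mean(ifelse(probs > 0.5, 1, 0) == W$Y)  # accuracy of model2 on the new data
```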
Column
-------------------------------------------------------------------------
### References
[1] https://rpubs.com/prasad_pagade/wine_quality_prediction
[2] http://rstudio-pubs-static.s3.amazonaws.com/438329_edfaab4011ce44a59fb9ae2d216d8dea.html
[3] https://www.kaggle.com/sagarnildass/red-wine-analysis-by-r
[4] https://archive.ics.uci.edu/ml/index.php
[5] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, "Modeling wine preferences by data mining from physicochemical properties".
[6] D. Smith and R. Margolskee, "Making sense of taste", Scientific American, Special issue.
[7] Dr. Chen, Notes, Chapter 4, "Model Adequacy Checking".
[8] https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234
[9] https://www.datarobot.com/wiki/prediction-explanations/
[10] https://en.m.wikipedia.org/wiki/Predictive_analytics
[11] http://www.medicine.mcgill.ca/epidemiology/joseph/courses/epib-621/logfit.pdf