Overview of data to be analyzed: This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (excellent). This is a curated data set provided by Udacity, based on the following research article:
Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J.
Modeling wine preferences by data mining from physicochemical properties.
Decision Support Systems. 2009, 47, 547-553.
Question guiding investigation: Which chemical properties influence the quality of white wines?
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The various acid measures are all roughly normally distributed, with all but pH showing a few outliers (in a visual sense).
Citric acid:
## A few of the more extreme values... c(1.66, 1, 0.99, 1, 1, 1)
## 3rd quartile: 0.39
## 99th percentile: 0.74
The mean (0.3342) and median (0.3200) of citric.acid are close, meaning few wines sit in the tail and most of the wines follow a roughly normal distribution.
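Figures like the 3rd quartile and 99th percentile above can be produced with base R's `quantile()`. The following is a minimal sketch on a made-up vector (the name `citric` and its values are assumptions for illustration, not the real column):

```r
# Illustrative values only; not the real citric.acid column.
citric <- c(0.10, 0.20, 0.30, 0.40, 0.50)

q3  <- quantile(citric, probs = 0.75, names = FALSE)  # 3rd quartile
p99 <- quantile(citric, probs = 0.99, names = FALSE)  # 99th percentile

cat("3rd quartile:", q3, "\n")
cat("99th percentile:", p99, "\n")

# A mean well above the median hints at a right tail; here the gap is zero.
mean(citric) - median(citric)
```

Comparing the 99th percentile with the maximum is a quick way to see how far the most extreme values sit from the bulk of the data.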
Fixed acidity (tartaric acid):
## A few of the more extreme values... c(10.2, 10.3, 10.3, 10.7, 10.7, 14.2)
## 3rd quartile: 7.3
## 99th percentile: 9.2
The mean (6.855) and median (6.800) of fixed.acidity are close, meaning few wines sit in the tail and most of the wines follow a roughly normal distribution.
Volatile acidity (acetic acid):
## A few of the more extreme values... c(0.905, 0.91, 1.005, 0.93, 0.965, 1.1)
## 3rd quartile: 0.32
## 99th percentile: 0.63
The mean (0.2782) and median (0.2600) of volatile.acidity are close, meaning few wines sit in the tail and most of the wines follow a roughly normal distribution.
The sulfur measures are also roughly normally distributed, though an extreme outlier exists in the free.sulfur.dioxide histogram.
Sulfates (added):
The distribution of sulfates is roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
##
## 3rd quartile: 0.55
## 99th percentile: 0.83
Free sulfur dioxide:
The distribution of free.sulfur.dioxide is roughly normal, with an extreme outlier (max).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
##
## 3rd quartile: 46
## 99th percentile: 81
Total sulfur dioxide:
The distribution of total.sulfur.dioxide is roughly normal, with an extreme outlier (max).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
##
## 3rd quartile: 167
## 99th percentile: 241.03
Density and residual sugar:
In the distributions for both density and residual.sugar, we can see an extreme outlier (maybe more than one). The distributions are otherwise relatively narrow. My suspicion is that the outliers are the same wine (or few wines), since dissolved sugar increases the density of water (which is 1.00 g/cm^3).
## High density wine: 1.03898
## High residual sugar wine: 65.8
## residual.sugar density
## 2782 65.8 1.03898
We can see that they are indeed the same wine.
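A check of this kind can be sketched with `which.max()`. The small data frame below is a stand-in: its first three rows use values from the structure output above, and the fourth is the extreme wine just printed. The names `wines`, `i_sugar`, `i_density`, and `same_wine` are assumptions for illustration.

```r
# First three rows come from the str() output above; the fourth is the
# extreme wine printed above (row 2782 in the full data).
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 65.8),
  density        = c(1.001, 0.994, 0.995, 1.03898)
)

i_sugar   <- which.max(wines$residual.sugar)  # row with the most sugar
i_density <- which.max(wines$density)         # row with the highest density

# TRUE when a single row holds both maxima.
same_wine <- i_sugar == i_density
wines[i_sugar, ]
```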
Sodium chloride:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
There is a bit of a tail in the chlorides distribution; most of the values are less than 0.05, yet there are a handful of values greater than 0.2. Let’s do a log base 10 transform on the x-axis.
We see that there really aren’t many data points beyond 0.1. This is consistent with the mean not being far from the median.
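The same check can be made numerically. This is a minimal sketch on made-up chloride values (the vector and the names `log_chlorides` and `tail_share` are assumptions, not the real column); it applies the log base 10 transform and measures how much of the data lies beyond 0.1.

```r
# Made-up chloride values with a long right tail (illustrative only).
chlorides <- c(0.03, 0.04, 0.04, 0.05, 0.05, 0.10, 0.35)

# Log base 10 compresses the tail so the bulk of the data is easier to see.
log_chlorides <- log10(chlorides)

# Share of observations above 0.1; the tail is thin if this is small.
tail_share <- mean(chlorides > 0.1)
tail_share

# hist(log_chlorides) would draw the transformed histogram.
```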
Alcohol (ABV - % by volume):
The distribution of alcohol is roughly normal.
Quality Ratings:
quality ratings are roughly bell-shaped and centered on 6 (the mode).
Wine ratings were made on a scale of 0 to 10, with 0 being the worst. In this data, wines ranged from 3 to 9. I divided the ratings as follows:
> Worst: wines rated 3, 4
> Middle: wines rated 5, 6, 7
> Best: wines rated 8, 9
## Best rated wines: c(FALSE, TRUE) c(4718, 180)
## Middle rated wines: c(FALSE, TRUE) c(363, 4535)
## Worst rated wines: c(FALSE, TRUE) c(4715, 183)
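One way to build this three-way split is with `cut()`. The sketch below is an assumption about how such a grouping could be coded, not necessarily how it was done in this analysis; the break points follow the Worst/Middle/Best definition given above.

```r
quality <- c(3, 4, 5, 6, 7, 8, 9)  # every rating seen in the data

# Intervals are closed on the right: (-Inf, 4], (4, 7], (7, Inf)
quality.category <- cut(
  quality,
  breaks = c(-Inf, 4, 7, Inf),
  labels = c("Worst", "Middle", "Best")
)

table(quality.category)
```

Using `-Inf` and `Inf` as the outer breaks means ratings outside the observed 3 to 9 range would still land in a category.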
I am going to focus on some of the features of the best wines.
It appears that in the best wines (rated 8 or greater), certain features show less variance whereas others vary greatly. For example, chlorides does not seem to vary much (except for an extreme outlier); residual.sugar has a lot of variation.
Most of the distributions are normal or somewhat normal. The chlorides distribution is relatively normal. The residual.sugar distribution seems almost uniform.
This data has 4898 observations (wines) with 13 features (variables) as loaded, including the row index “X”, which the structure output above omits. These features are “X”, “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, and “quality”.
Most of the variables are numerical, except for “quality”. Although it consists of numbers, the numbers represent rank rather than being numbers in the strict mathematical sense. When I loaded the data, R interpreted it as a numerical vector, so I purposely changed it to a factor to properly reflect what it stands for.
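A conversion of this kind can be sketched as follows. Making the factor ordered is an assumption on my part (the text only says "factor"), chosen because it preserves the ranking; the ratings and the name `quality_f` are illustrative.

```r
quality <- c(6L, 5L, 7L, 6L, 8L)  # illustrative ratings

# ordered = TRUE keeps the ranking, so comparisons such as < still work.
quality_f <- factor(quality, levels = 0:10, ordered = TRUE)

is.numeric(quality_f)        # the ratings are no longer treated as numbers
quality_f[2] < quality_f[3]  # rank comparison between a 5 and a 7
```

Listing all levels from 0 to 10 keeps the full rating scale even though only some ratings occur in the data.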
The main feature of my data set is “quality” which, as mentioned in the previous answer, represents a ranking: 0 to 10 with 0 being the worst quality rating and 10 being the best quality rating.
The goal of this analysis is to determine which chemical properties affect the wine’s rating (quality).
Because we want to know which chemical properties affect quality, a feature’s importance will be determined relative to its effect on quality.
So far, a few features stand out: alcohol content, chloride content, free sulfur dioxide, and volatile acidity.
Yes. I created a “quality.category” (factor) variable using the “quality” variable. This factor has three levels: “Worst”, “Middle”, and “Best”. If a wine was rated poorly (0-4), its quality.category value is “Worst”. A rating of 5-7 corresponds to “Middle”. A rating of 8-10 represents “Best”.
The histogram of chloride content had a long tail with what appeared to be many data points along its length. I did a log transformation along its x-axis to see if there really were a good amount of data points in the tail. After the transformation, I found that the tail was less significant than I had originally thought.
For my preliminary bivariate plotting, I am using the ggpairs function to help guide my later, targeted plots. The resulting plot matrix will give me an initial look at any possible relationships that may be worth exploring further.
To prevent overcrowded plot matrices, I will be subsetting the data into different groups based on my hunches about variables that may have something to do with one another. I will also create a subset that includes variables that are, at least to my understanding, seemingly unrelated.
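The subsetting step can be sketched in base R as below. The data frame is a stand-in built from the first rows of the structure output above, and the names `acid_vars`, `acid_subset`, and `acid_cor` are assumptions for illustration. The correlation matrix gives the same r values a plot matrix would display.

```r
# First rows from the str() output above, used as a stand-in for the full data.
wines <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40),
  pH               = c(3, 3.3, 3.26, 3.19, 3.19, 3.26),
  alcohol          = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
)

# Keep only the variables that belong to one thematic group.
acid_vars   <- c("fixed.acidity", "volatile.acidity", "citric.acid", "pH")
acid_subset <- wines[, acid_vars]

# Pairwise Pearson correlations for the subset only.
acid_cor <- round(cor(acid_subset), 2)
acid_cor

# GGally::ggpairs(acid_subset) would draw the matching plot matrix,
# assuming the GGally package is installed.
```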
Acidity subset
There appears to be only one moderate correlation amongst the numeric variables in this subset: between pH and fixed.acidity (r = -0.426).
There also appears to be only one relationship involving quality (categorical variable): the one with pH. This will require further exploration below.
Sulfur subset
Amongst the numerical variables, a correlation exists between free.sulfur.dioxide and total.sulfur.dioxide (r = 0.616).
For quality, the only relationship that appears to exist is between quality and total.sulfur.dioxide, but this requires more investigation below.
Other features subset
A strong positive correlation exists between residual.sugar and density (r = 0.839). There is also a strong negative correlation between alcohol and density (r = -0.78). Finally, there is a modest negative correlation between alcohol and residual.sugar (r = -0.4506).
As for relationships involving quality, there definitely appears to be a positive relationship between quality and alcohol. There also appear to be negative relationships between quality and chlorides as well as quality and density. A relationship between quality and residual.sugar seems unclear. All of these will be explored further below.
Random features subset
A modest positive correlation seems to exist between density and total.sulfur.dioxide (r = 0.53).
No other significant correlations seem to exist amongst the numerical variables.
No new relationships involving quality are obvious.
Here I take a closer look at the above findings for different pairs of chemical properties (based on the correlation coefficients).
pH and Fixed Acidity
With 4 “far out” outliers eliminated
## r = -0.425858290991382
Total sulfur dioxide and free sulfur dioxide
## r = 0.615500965009836
Density vs. residual sugar
## r = 0.838966454904583
Alcohol vs. residual sugar
## r = -0.450631222031729
Density vs. alcohol
## r = -0.780137621425558
Density and Total Sulfur Dioxide
## r = 0.529881323878611
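The r values above are Pearson correlation coefficients. As a rough sketch of how a coefficient can be computed after dropping "far out" points, the example below uses Tukey's 3 × IQR fence on made-up data. The fence rule is an assumption, since the exact criterion used for the four removed outliers is not stated here; the values and variable names are illustrative.

```r
# Made-up values; the last point is a deliberate extreme.
fixed.acidity <- c(6.0, 6.5, 7.0, 7.5, 8.0, 14.2)
pH            <- c(3.40, 3.30, 3.25, 3.15, 3.10, 3.30)

# Tukey's "far out" fence: more than 3 * IQR beyond the quartiles.
q     <- quantile(fixed.acidity, c(0.25, 0.75), names = FALSE)
fence <- 3 * (q[2] - q[1])
keep  <- fixed.acidity >= q[1] - fence & fixed.acidity <= q[2] + fence

r_all  <- cor(fixed.acidity, pH)              # with the extreme point
r_kept <- cor(fixed.acidity[keep], pH[keep])  # without it
cat("r =", r_kept, "\n")
```

In this toy data a single extreme point is enough to hide an otherwise strong negative relationship, which is why it is worth reporting which points were removed.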
Volatile acidity:
## Median volatile acidity for 'Best' wines: 0.26
## Median volatile acidity for 'Middle' wines: 0.26
## Median volatile acidity for 'Worst' wines: 0.32
It seems that poorly rated wines tend to have higher volatile.acidity.
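Per-category medians like the ones reported in this section can be computed by splitting a variable on the category and applying `median()`. This is a minimal sketch on made-up rows (the data frame and the name `medians` are assumptions for illustration):

```r
# Illustrative rows only; not the real data.
wines <- data.frame(
  volatile.acidity = c(0.32, 0.40, 0.24, 0.26, 0.28, 0.22, 0.30),
  quality          = c(3, 4, 5, 6, 7, 8, 9)
)
wines$quality.category <- cut(wines$quality, breaks = c(-Inf, 4, 7, Inf),
                              labels = c("Worst", "Middle", "Best"))

# One median per category, returned as a named numeric vector.
medians <- sapply(split(wines$volatile.acidity, wines$quality.category), median)

for (lvl in rev(names(medians))) {
  cat("Median volatile acidity for '", lvl, "' wines: ", medians[[lvl]], "\n", sep = "")
}
```

The median is used rather than the mean because it is less sensitive to the extreme values seen in several of these variables.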
pH:
## Median pH for 'Best' wines: 3.23
## Median pH for 'Middle' wines: 3.18
## Median pH for 'Worst' wines: 3.16
The best wines appear to have a (relatively) higher pH; the middle rated wines have a lot of variability.
Free sulfur dioxide:
## Median Free Sulfur Dioxide for 'Best' wines: 34.5
## Median Free Sulfur Dioxide for 'Middle' wines: 34
## Median Free Sulfur Dioxide for 'Worst' wines: 18
## free.sulfur.dioxide quality quality.category
## 4746 289 3 Worst
The median free.sulfur.dioxide is lower in the poorly rated wines when considered as a group (with respect to quality.category); however, there is a lot of variability amongst these wines. When considering each quality rating separately, though, the pattern seems dubious. Note that the wine with the highest free.sulfur.dioxide (289) is in the worst category. This may distort the “Worst” category as a group.
Total sulfur dioxide:
## Median Total Sulfur Dioxide for 'Best' wines: 34.5
## Median Total Sulfur Dioxide for 'Middle' wines: 34
## Median Total Sulfur Dioxide for 'Worst' wines: 18
There appears to be a slight positive trend between total.sulfur.dioxide and quality.
Alcohol (ABV - % by volume):
## Median Alcohol for 'Best' wines: 12
## Median Alcohol for 'Middle' wines: 10.3
## Median Alcohol for 'Worst' wines: 10.1
The better rated wines generally have higher alcohol content.
Density:
## Median Density for 'Best' wines: 0.99162
## Median Density for 'Middle' wines: 0.9938
## Median Density for 'Worst' wines: 0.9941
Better rated wines tend to have lower density.
Chlorides:
## Median Chlorides for 'Best' wines: 0.0355
## Median Chlorides for 'Middle' wines: 0.043
## Median Chlorides for 'Worst' wines: 0.046
The worse rated wines tend to have higher chlorides.
I observed several relationships involving the main feature (quality). Poorly rated wines (“Worst” quality.category) tended to have higher volatile.acidity. In the text accompanying the data set, there is information on the variables contained in the data. Of note is this: “volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.” This would explain why at higher values these wines would end up in the worst category. Perhaps “Middle” and “Best” wines do not hit the threshold of “too high of levels”, thereby not experiencing any negative impact from volatile.acidity.
Wines with a higher pH value tended to have a better quality rating. This holds true for all three quality.category subsets (“Worst”, “Middle”, “Best”). I do not believe this is because wines with higher volatile.acidity generally have worse ratings. Below (next question) I discuss how volatile.acidity has a low impact on final pH (in fact, it is fixed.acidity that largely determines final pH). Maybe it is a tasting preference to have slightly less acidic wine (they are all acidic nonetheless).
No relationship appears to exist between quality and free.sulfur.dioxide. Another note from the text accompanying the data set states: “total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.”
Somewhat counterintuitively, there doesn’t seem to be much of a trend. A closer examination shows that the cited 50 ppm corresponds to a free sulfur dioxide level of about 49.9 mg/dm^3. Most wines (for all ratings) are below this threshold. So for the most part, too much free SO2 likely isn’t a factor unless it is present in unreasonably high amounts.
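The unit conversion behind that threshold can be sketched as below, under the assumption that "ppm" here means milligrams of SO2 per kilogram of wine. The exact figure depends on the density assumed; with the median wine density from the summary it comes out slightly lower than the figure quoted above, without changing the conclusion. Variable names are hypothetical.

```r
ppm_threshold  <- 50      # mg of free SO2 per kg of wine (assumed reading of "ppm")
median_density <- 0.9937  # g/cm^3, numerically equal to kg/dm^3; from the summary above

# mg/kg * kg/dm^3 = mg/dm^3
threshold_mg_dm3 <- ppm_threshold * median_density
threshold_mg_dm3

# 3rd quartile of free.sulfur.dioxide from the summary above (mg/dm^3).
free_so2_q3 <- 46

# TRUE means at least three quarters of the wines sit below the threshold.
free_so2_q3 < threshold_mg_dm3
```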
For quality and total.sulfur.dioxide, however, there does appear to be a modest trend. I did a little searching to make sense of this observation. Sulfur dioxide is an important preservative and stabilizer used worldwide in wine making. total.sulfur.dioxide is defined as the “amount of free and bound forms of SO2”.
In cases where the wine is highly oxidized and has ample amounts of acetaldehyde (not good for wine quality), any added SO2 is quickly bound by acetaldehyde and other chemical constituents. It becomes part of total SO2 but not part of free SO2. Bound SO2 does not have antioxidant or anti-microbial properties, but free SO2 does. Poorly made wines that have oxidized require more SO2 to be added: enough that it spills over into free SO2 at a level high enough to have the desired preserving effect. This would explain why we see a modest trend for total.sulfur.dioxide but not for free.sulfur.dioxide.
quality ratings tend to be higher when alcohol is higher. This appears to be a preference. According to an article in Scientific American (“Wine Becomes More Like Whisky as Alcohol Content Gets High”):
> “The recent surge in wine’s punch [alcohol content] is largely a result, scientists say, of a fashion for deeply colored wines with fewer “green” qualities and more bright, ripe, fruity flavors. As New World wines in this style have drawn more fans, even European winemakers accustomed to making lower-alcohol wines in less ripe styles are beginning to follow suit. But producing wines with those flavors means letting grapes hang longer on the vine, and with longer hang times comes bigger sugar. The more sugar the wine yeast S. cerevisiae has to work with, the more alcohol it will make.”
Along similar lines, quality tends to be higher in wines with lower density. This is consistent with the previous relationship that showed wines with higher alcohol content typically had better ratings. (Reminder: alcohol lowers the density of the wine, which is mostly water.)
Finally, a negative trend seems to exist between quality and chlorides. Higher chloride levels seem to negatively affect quality. I found some information to corroborate this observation (see resources text file).
Yes, there were several.
pH and fixed.acidity had a modest negative correlation (r = -0.4259). The relationship suggests that the higher the fixed.acidity, the lower the pH. That is not an implausible correlation. The more acid in a solution, all else equal, the lower the pH of that solution will be.
With that in mind, there is more than one kind of acid in wine that contributes to the final pH of a wine. fixed.acidity in this data set represents tartaric acid, one of the main acid constituents in wine, potentially explaining why we have a noticeable trend. But because it is not the only acid present, a stronger relationship is not observed.
At first glance it may seem curious that there isn’t even a modest relationship between volatile.acidity (acetic acid content) and pH. It may also seem strange that no significant relationship appears to exist between citric.acid and pH.
If you look at the measurements of all three acids, however, the above patterns actually do make sense. All of the acids are measured in grams per cubic decimeter (g/dm^3).
## Fixed acidity summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
##
## Volatile acidity summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
##
## Citric acid summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
fixed.acidity quantities are much higher than either volatile.acidity or citric.acid contents, and they all have comparable pKa ranges (a measure of acidity), meaning that fixed.acidity (tartaric acid) has greater influence on pH.
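To put the size gap in numbers, the medians from the three summaries above can be compared directly. This is simple arithmetic on the reported values; the variable names are assumptions for illustration.

```r
# Medians taken from the summaries above (g/dm^3).
median_fixed    <- 6.80
median_volatile <- 0.26
median_citric   <- 0.32

# How many times more fixed acidity there is than each of the other acids.
fixed_vs_volatile <- median_fixed / median_volatile
fixed_vs_citric   <- median_fixed / median_citric

c(fixed_vs_volatile, fixed_vs_citric)
```

At the median, fixed acidity is present at roughly 26 times the level of volatile acidity and roughly 21 times the level of citric acid, so it is not surprising that it dominates the relationship with pH.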
total.sulfur.dioxide and free.sulfur.dioxide have a positive correlation (r = 0.6155). This makes sense given my previous explanation of their relationship (see previous question).
A strong positive relationship exists between density and residual.sugar (r = 0.8390). This makes sense because dissolved solids increase the density of aqueous solvents (wine is mostly water, whose density is 1 g/cm^3).
A strong negative relationship exists between density and alcohol (r = -0.7801). This makes sense for the opposite reason. Ethanol has a density of 0.7893 g/cm^3, so its presence would decrease the density of a wine.
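The direction of this effect can be illustrated with a deliberately crude ideal-mixing estimate: treat wine as only water and ethanol and assume their volumes simply add. Both simplifications are assumptions (real mixtures contract slightly, and sugar and other solids push density back up), and `mix_density` is a hypothetical helper, not part of the analysis.

```r
density_water   <- 1.0     # g/cm^3, as stated above
density_ethanol <- 0.7893  # g/cm^3, as stated above

# Ideal-mixing assumption: volumes add and nothing else is dissolved.
mix_density <- function(abv_fraction) {
  (1 - abv_fraction) * density_water + abv_fraction * density_ethanol
}

mix_density(0.104)  # median alcohol of 10.4% ABV
mix_density(0.142)  # maximum alcohol of 14.2% ABV
```

The estimate falls as alcohol rises, matching the sign of the correlation. The absolute values should not be compared with the measured densities, which also reflect dissolved sugar and other solids.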
alcohol and residual.sugar have a moderate negative correlation (r = -0.4506). Because alcohol is generated by the fermentation of sugar, if fermentation is limited by the winemaker, more sugar and less alcohol would be in the final wine, all else equal. I believe a stronger relationship is not seen because the amount of sugar at the beginning of fermentation differs amongst wines. This depends on the grapes’ qualities and on whether sugar is added to the starting product.
density and total.sulfur.dioxide share a moderate positive correlation (r = 0.5299). I am unsure why this correlation exists since the density of sulfur dioxide (0.00293 g/cm^3) is lower than that of water.
The strongest relationship amongst numerical variables was between density and residual.sugar (r = 0.8390).
I do not have a quantification for the strongest relationship between quality and another feature; however, visually the relationship between quality and alcohol seems to be the most dramatic.
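One possible way to quantify a relationship with an ordinal rating, which this analysis did not do, is Spearman's rank correlation. The sketch below uses made-up values purely to show the call; the resulting number says nothing about the real data.

```r
# Made-up values for illustration only.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8, 9)
alcohol <- c(9.0, 9.8, 9.5, 10.0, 10.2, 10.8, 11.5, 12.0, 12.5)

# Spearman's rho works on ranks, which suits an ordinal rating.
rho <- cor(quality, alcohol, method = "spearman")
rho
```

If quality has been converted to a factor, it would need to be turned back into numbers first, for example with `as.numeric(as.character(quality))`.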
The first multivariate plot is of alcohol, density, and quality.category. As seen before, alcohol and density are strongly related to one another: regardless of a wine’s category, alcohol and density are negatively correlated. The relationship between quality.category and alcohol seen in the earlier exploration also still holds. The “Best” category wines tend to have higher alcohol than their peers, “Middle” wines tend to fall in between, and “Worst” wines tend to have the least alcohol.
I also plotted volatile.acidity, pH and quality as a scatterplot. I binned pH because there was worse overplotting in the un-binned form. The scatterplot makes it unmistakably clear that the middle rated wines are the most abundant and span the widest pH range. It is also possible to see that wines with higher volatile.acidity are a mix of the worst rated wines and middle rated wines.
I did not see anything new that was not already seen in the bivariate analysis.
N/A
This scatterplot depicts the relationship between pH and fixed.acidity (r = -0.4259). Although it is only a moderate relationship, the inquiry that followed was interesting because I ended up learning about the acid composition of wine.
While it is not surprising that greater fixed.acidity correlates with lower pH, the imperfect relationship suggested that something more was at play. As discussed in the Bivariate Analysis section, fixed.acidity only represents one of the several acids present in wine (tartaric acid). After some brief investigation, it was clear that other acids, like malic acid (not measured in this data set), also contribute to the final pH of a wine. When I checked volatile.acidity (acetic acid) and citric.acid for correlation with pH, the relationships were not strong, which is consistent with there being much less of either acid present compared to tartaric acid.
This boxplot shows alcohol content of wines by quality category. I elected to group the wines into “Worst”, “Middle”, and “Best” for visualization purposes.
This plot was particularly interesting not simply because it showed a clear preference for higher alcohol wines, but because it revealed a market trend. As discussed in the Bivariate Analysis section, New World wines with bolder, “riper” flavors are in vogue, and these consequently come with higher alcohol content.
Although this line plot is somewhat noisy, it actually contains some useful information. It shows a slight trend between quality and volatile acidity. It also shows that no significant relationship exists between pH and volatile acidity (which reinforces previous findings).
At a bigger picture level, this plot demonstrates that more is at work when it comes to quality rating: wines of different quality ratings are present throughout, meaning some wines certainly buck the trend for volatile.acidity. It also shows the diversity in pH for wines. These observations reinforce my general impression that quality is influenced by multiple chemical properties of wine, and that no one property is the ultimate dictator of quality.
This was a rewarding data set to work on from both a professional and personal perspective. Going through a full analysis of the data, starting at a univariate level, progressing through a bivariate level, then finishing at a multivariate level, gave me the opportunity to apply and polish what I have learned. What I especially enjoyed was customizing plots. I was able to learn much more about ggplot2 in the process: the different layers available, some of their respective intricacies, and the various ways I could portray my findings.
From a personal perspective, I loved learning more about wine! My academic background is in chemistry and I have a particular fondness for natural product chemistry and food chemistry.
Although I wouldn’t deem them “struggles” exactly, having to rummage through the internet and other resources (I have a copy of “R in a Nutshell”) for code solutions took time and effort. For example, getting a good grasp of the theme() layer took a good amount of investigation.
Another challenge was making sense of some of the observations resulting from my queries. This was the fun part though! I never knew that “New World” wines were a class of wines, nor that they were defined by having riper, fruitier flavors compared to “Old World” wines. It was also interesting to learn that these in-fashion wines are fermented from riper (more sugar-rich) grapes, leading to more alcoholic wines, which explains my original observation that wines with higher quality ratings tended to have higher alcohol.
I was also surprised by my findings for quality and chlorides. I had no clue that wines could be “salty” to a detectable level, though it makes sense that these wines would have a lower quality. I don’t think I would seek out or enjoy a noticeably salty wine!
In the future, once I have learned more about predictive modeling, I think building a model could be very interesting. It would be a great way to test my present findings and also to get a better understanding of what people look for in wines. Another thing that would be interesting would be repeating the experiment and adding more variables, e.g. all the major acids like malic, lactic, and succinic acid, as well as an oxidation marker of some sort and acetaldehyde levels. Many of these have profound effects on the sensory experience of wine, so we may be missing important information if these are not included.