WHITE WINE QUALITY ANALYSIS by Kasey Cox

Overview of data to be analyzed: This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). This is a curated data set provided by Udacity using the following research article:

Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J.
Modeling wine preferences by data mining from physicochemical properties.
Decision Support Systems. 2009, 47, 547-553.


Question guiding investigation: Which chemical properties influence the quality of white wines?


Univariate Plots Section

Preliminary information on the data set

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Organic acid contents and overall acidity (pH)

The various acid measures are all roughly normally distributed, and all but pH have a few outliers (in a visual sense).

Citric acid:

## A few of the more extreme values... c(1.66, 1, 0.99, 1, 1, 1) 
## 3rd quartile: 0.39 
## 99th percentile: 0.74

The mean (0.3342) and median (0.3200) of citric.acid are similar, suggesting that few wines sit in the tail and that most of the wines follow a roughly normal distribution.

Fixed acidity (tartaric acid):

## A few of the more extreme values... c(10.2, 10.3, 10.3, 10.7, 10.7, 14.2) 
## 3rd quartile: 7.3 
## 99th percentile: 9.2

The mean (6.855) and median (6.800) of fixed.acidity are close, suggesting that few wines sit in the tail and that most of the wines follow a roughly normal distribution.

Volatile acidity (acetic acid):

## A few of the more extreme values... c(0.905, 0.91, 1.005, 0.93, 0.965, 1.1) 
## 3rd quartile: 0.32 
## 99th percentile: 0.63

The mean (0.2782) and median (0.2600) of volatile.acidity are close, suggesting that few wines sit in the tail and that most of the wines follow a roughly normal distribution.
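The "extreme values", 3rd quartile, and 99th percentile lines printed for each acid can be reproduced with a few base R calls. The following is a minimal sketch rather than the code used for this report: describe_tail is a hypothetical helper name, and the input is only the first ten citric.acid values from the structure output above, so the numbers it prints will not match the full-data figures.

```r
# First ten citric.acid values, copied from the str() output above.
# (Illustration only: the full data set has 4898 rows.)
citric_acid <- c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0.34, 0.43)

# Hypothetical helper: the six largest values plus two upper quantiles.
describe_tail <- function(x) {
  list(
    extremes = tail(sort(x), 6),            # six largest values, ascending
    q3 = unname(quantile(x, 0.75)),         # 3rd quartile
    p99 = unname(quantile(x, 0.99)),        # 99th percentile
    mean_minus_median = mean(x) - median(x) # positive values hint at a right tail
  )
}

tail_info <- describe_tail(citric_acid)
print(tail_info)
```

A 99th percentile far above the 3rd quartile, together with a mean above the median, is the pattern described above for a distribution with a thin right tail.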

Sulfur measures

The sulfur measures also have roughly normal distributions, though an extreme outlier exists in the free.sulfur.dioxide histogram.

Sulfates (added):
The distribution of sulfates is roughly normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
## 
## 3rd quartile: 0.55 
## 99th percentile: 0.83

Free sulfur dioxide:
The distribution of free.sulfur.dioxide is roughly normal, with an extreme outlier (max).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
## 
## 3rd quartile: 46 
## 99th percentile: 81

Total sulfur dioxide:
The distribution of total.sulfur.dioxide is roughly normal, with an extreme outlier (max).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
## 
## 3rd quartile: 167 
## 99th percentile: 241.03

Other wine attributes

Density and residual sugar:

The distributions of density and residual.sugar each show an extreme outlier (maybe more than one) and are otherwise relatively narrow. My suspicion is that the outliers are the same wine (or few wines), because dissolved sugar increases the density of water (which is 1.00 g/cm^3).

## High density wine: 1.03898 
## High residual sugar wine: 65.8
##      residual.sugar density
## 2782           65.8 1.03898

We can see that they are indeed the same wine.
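A check like this one can be sketched with which.max. The data frame below is a toy stand-in: the last row carries the outlier values printed above, and the other rows reuse a few rounded values from the structure output, so the row index will not be 2782 as in the full data.

```r
# Toy stand-in for the wine data: four ordinary rows plus the outlier
# (65.8 g/dm^3 residual sugar, density 1.03898) reported above.
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 65.8),
  density = c(1.001, 0.994, 0.995, 0.996, 1.03898)
)

# Row positions of the densest and the sweetest wine.
row_max_density <- which.max(wines$density)
row_max_sugar <- which.max(wines$residual.sugar)

# TRUE when both maxima belong to the same wine.
same_wine <- row_max_density == row_max_sugar
print(wines[row_max_density, ])
print(same_wine)
```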

Sodium chloride:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

There is a little bit of a tail in the chlorides distribution; most of the values are less than 0.05, yet there are a handful of values greater than 0.2. Let’s do a log base 10 transform on the x-axis.

We see that there really aren’t many data points beyond 0.1. This is consistent with the mean not being far from the median.
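The effect of the log transform can be sketched in base R without a plotting device. This is an illustration under assumptions, not the plotting code behind the histogram: the vector holds only the first ten chlorides values from the structure output plus the minimum and maximum from the summary, and the bin edges are arbitrary choices.

```r
# Twelve chlorides values taken from the outputs above: the first ten rows
# from str() plus the reported minimum (0.009) and maximum (0.346).
chlorides <- c(0.045, 0.049, 0.050, 0.058, 0.058, 0.050,
               0.045, 0.045, 0.049, 0.044, 0.009, 0.346)

# Log base 10 transform, then bin the transformed values (no plot drawn).
log_chlorides <- log10(chlorides)
h <- hist(log_chlorides, breaks = seq(-2.5, 0, by = 0.25), plot = FALSE)

# Share of wines beyond 0.1, the region that looked like a long tail.
share_above_tenth <- mean(chlorides > 0.1)

print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
print(share_above_tenth)
```

On the log scale the bulk of the values falls into a few adjacent bins, and the share above 0.1 gives a direct count of how heavy the tail really is.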

Alcohol (ABV - % by volume):

The distribution of alcohol is roughly normal.

Quality Ratings:

quality ratings are roughly symmetrically distributed around 6 (the mode).

More about the best, middle, and worst wines

Wine ratings were made on a scale of 0 to 10, with 0 being the worst. In this data, wines ranged from 3 to 9. I divided the ratings as follows:
> Worst: wines rated 3, 4
> Middle: wines rated 5, 6, 7
> Best: wines rated 8, 9

## Best rated wines: c(FALSE, TRUE) c(4718, 180) 
## Middle rated wines: c(FALSE, TRUE) c(363, 4535) 
## Worst rated wines: c(FALSE, TRUE) c(4715, 183)
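The grouping above can be expressed with cut. This is a minimal sketch on a toy vector of ratings (not the real data); the break points are my reading of the three groups listed above.

```r
# Toy ratings covering the observed range 3 to 9 (not the real data).
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# Intervals are closed on the right by default:
# (-Inf, 4] -> Worst, (4, 7] -> Middle, (7, Inf) -> Best.
quality.category <- cut(
  quality,
  breaks = c(-Inf, 4, 7, Inf),
  labels = c("Worst", "Middle", "Best")
)

category_counts <- table(quality.category)
print(category_counts)
```

Using -Inf and Inf as the outer breaks means ratings outside 3 to 9 would still be assigned to a category if they ever appeared.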

I am going to focus on some of the features of the best wines.

It appears that in the best wines (rated 8 or greater), certain features show less variance whereas others vary greatly. For example, chlorides does not seem to vary much (except for an extreme outlier); residual.sugar has a lot of variation.

Most of the distributions are normal or roughly normal. The chlorides distribution is relatively normal, whereas the residual.sugar distribution seems almost uniform.

Univariate Analysis

What is the structure of your dataset?

As loaded, this data has 4898 observations (wines) with 13 columns: the row index “X” plus the 12 variables shown in the structure output above. Those 12 are “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, and “quality”.

Most of the variables are numerical, except for “quality”. Although it consists of numbers, the numbers represent rank rather than being numbers in the strict mathematical sense. When I loaded the data, R interpreted it as a numerical vector, so I purposely changed it to a factor to properly reflect what it stands for.

What is/are the main feature(s) of interest in your dataset?

The main feature of my data set is “quality” which, as mentioned in the previous answer, represents a ranking: 0 to 10 with 0 being the worst quality rating and 10 being the best quality rating.

The goal of this analysis is to determine which chemical properties affect the wine’s rating (quality).

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Because we want to know which chemical properties affect quality, a feature’s importance will be determined relative to its effect on quality.

So far, a few features stand out: alcohol content, chloride content, free sulfur dioxide, and volatile acidity.

Did you create any new variables from existing variables in the dataset?

Yes. I created a “quality.category” (factor) variable using the “quality” variable. This factor has three levels: “Worst”, “Middle”, and “Best”. If a wine was rated poorly (0-4), its quality.category value is “Worst”. A rating of 5-7 corresponds to “Middle”. A rating of 8-10 represents “Best”.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The histogram of chloride content had a long tail with what appeared to be many data points along its length. I did a log transformation along its x-axis to see if there really were a good amount of data points in the tail. After the transformation, I found that the tail was less significant than I had originally thought.


Bivariate Plots Section

Preliminary plotting

For my preliminary bivariate plotting, I am using the ggpairs function to help guide my later, targeted plots. The resulting plot matrix will give me an initial look at any possible relationships that may be worth exploring further.

To prevent overcrowded plot matrices, I will be subsetting the data into different groups based on my hunches about variables that may have something to do with one another. I will also create a subset that includes variables that are, at least to my understanding, seemingly unrelated.
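The subsetting idea can be sketched as a named list of column groups, each summarized separately. This sketch uses base R's cor as a numeric stand-in for the ggpairs plot matrix (ggpairs comes from the GGally package), and it runs on only the first ten rows from the structure output above, so the coefficients it prints are not the full-data values quoted in this section.

```r
# First ten rows of six variables, copied from the str() output above.
wines <- data.frame(
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  pH = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  free.sulfur.dioxide = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Column groups based on hunches about which variables belong together.
subsets <- list(
  acidity = c("fixed.acidity", "volatile.acidity", "citric.acid", "pH"),
  sulfur = c("free.sulfur.dioxide", "total.sulfur.dioxide")
)

# One small Pearson correlation matrix per group instead of one crowded matrix.
cor_matrices <- lapply(subsets, function(cols) round(cor(wines[, cols]), 2))
print(cor_matrices$acidity)
print(cor_matrices$sulfur)
```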

Acidity subset

There only appears to be one moderate correlation amongst the numeric variables in this subset: between pH and fixed.acidity (r = -0.426).

The only apparent relationship involving quality (a categorical variable) in this subset is with pH. This will require further exploration below.

Sulfur subset

Amongst the numerical variables, a correlation exists between free.sulfur.dioxide and total.sulfur.dioxide (r = 0.616).

For quality, the only relationship that appears to exist is between quality and total.sulfur.dioxide, but this requires more investigation below.

Other features subset

A strong positive correlation exists between residual.sugar and density (r = 0.839). There is also a strong negative correlation between alcohol and density (r = -0.78). Finally, there is a modest negative correlation between alcohol and residual.sugar (r = -0.4506).

As for relationships involving quality, there definitely appears to be a positive relationship between quality and alcohol. There also appear to be negative relationships between quality and chlorides as well as between quality and density. The relationship between quality and residual.sugar is unclear. All of these will be explored further below.

Random features subset

A modest positive correlation seems to exist between density and total.sulfur.dioxide (r = 0.53).

No other significant correlations seem to exist amongst the numerical variables.

No new relationships involving quality are obvious.

Relationships amongst chemical properties

Here I take a closer look at the pairs of chemical properties flagged above, based on their correlation coefficients.

pH and Fixed Acidity
With 4 “far out” outliers eliminated

## r = -0.425858290991382
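One common way to define "far out" outliers is Tukey's outer fences (more than 3 times the interquartile range beyond the quartiles). The sketch below assumes that rule and applies it to fixed.acidity; the report does not state which rule or variable was actually used. The data are a toy example: the first ten rows from the structure output plus one extra wine with the maximum fixed.acidity (14.2) and a made-up pH of 2.9.

```r
# Hypothetical helper: flag values beyond Tukey's outer fences (3 * IQR).
is_far_out <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - 3 * iqr | x > q[2] + 3 * iqr
}

# Toy data: ten rows from str() above plus one extreme wine.
# The pH of 2.9 for the extreme wine is invented for illustration.
fixed.acidity <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1, 14.2)
pH <- c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22, 2.9)

# Keep only wines that are not far out, then compute Pearson's r.
keep <- !is_far_out(fixed.acidity)
r_trimmed <- cor(fixed.acidity[keep], pH[keep])

cat("Rows removed:", sum(!keep), "\n")
cat("r =", r_trimmed, "\n")
```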

Total sulfur dioxide and free sulfur dioxide

## r = 0.615500965009836

Density vs. residual sugar

## r = 0.838966454904583

Alcohol vs. residual sugar

## r = -0.450631222031729

Density vs. alcohol

## r = -0.780137621425558

Density and Total Sulfur Dioxide

## r = 0.529881323878611

Relationships between quality and other chemical properties

Volatile acidity:

## Median volatile acidity for 'Best' wines: 0.26 
## Median volatile acidity for 'Middle' wines: 0.26 
## Median volatile acidity for 'Worst' wines: 0.32

It seems that poorly rated wines tend to have higher volatile.acidity.

pH:

## Median pH for 'Best' wines: 3.23 
## Median pH for 'Middle' wines: 3.18 
## Median pH for 'Worst' wines: 3.16

The best wines appear to have a (relatively) higher pH; the middle rated wines have a lot of variability.

Free sulfur dioxide:

## Median Free Sulfur Dioxide for 'Best' wines: 34.5 
## Median Free Sulfur Dioxide for 'Middle' wines: 34 
## Median Free Sulfur Dioxide for 'Worst' wines: 18
##      free.sulfur.dioxide quality quality.category
## 4746                 289       3            Worst

The median free.sulfur.dioxide is lower in the poorly rated wines when considered as a group (with respect to quality.category); however, there is a lot of variability amongst these wines. When considering each quality rating separately, though, the pattern seems dubious. Note that the wine with the highest free.sulfur.dioxide (289) is in the worst category. This may distort the “Worst” category as a group.

Total sulfur dioxide:

## Median Total Sulfur Dioxide for 'Best' wines: 34.5 
## Median Total Sulfur Dioxide for 'Middle' wines: 34 
## Median Total Sulfur Dioxide for 'Worst' wines: 18

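There appears to be a slight positive trend between total.sulfur.dioxide and quality. Note that the medians printed above are identical to the free.sulfur.dioxide medians and fall well below the first quartile of total.sulfur.dioxide (108.0), so they appear to come from free.sulfur.dioxide rather than total.sulfur.dioxide.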

Alcohol (ABV - % by volume):

## Median Alcohol for 'Best' wines: 12 
## Median Alcohol for 'Middle' wines: 10.3 
## Median Alcohol for 'Worst' wines: 10.1

The better rated wines generally have higher alcohol content.

Density:

## Median Density for 'Best' wines: 0.99162 
## Median Density for 'Middle' wines: 0.9938 
## Median Density for 'Worst' wines: 0.9941

Better rated wines tend to have lower density.

Chlorides:

## Median Chlorides for 'Best' wines: 0.0355 
## Median Chlorides for 'Middle' wines: 0.043 
## Median Chlorides for 'Worst' wines: 0.046

The worse rated wines tend to have higher chlorides.
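The per-category medians printed throughout this section can be computed in one call with tapply. This is a minimal sketch on made-up alcohol values (not the real data), so its medians differ from the ones reported above.

```r
# Toy data: made-up alcohol values with their quality categories.
wines <- data.frame(
  alcohol = c(9.0, 10.2, 10.0, 10.4, 10.6, 11.8, 12.2),
  quality.category = factor(
    c("Worst", "Worst", "Middle", "Middle", "Middle", "Best", "Best"),
    levels = c("Worst", "Middle", "Best")
  )
)

# Median of alcohol within each quality category.
medians <- tapply(wines$alcohol, wines$quality.category, median)

for (category in rev(names(medians))) {
  cat("Median Alcohol for '", category, "' wines: ", medians[[category]], "\n", sep = "")
}
```

Swapping wines$alcohol for any other column gives the corresponding medians, which avoids copying the same code once per variable.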


Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I observed several relationships involving the main feature (quality). Poorly rated wines (“Worst” quality.category) tended to have higher volatile.acidity. In the text accompanying the data set, there is information on the variables contained in the data. Of note is this: “volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.” This would explain why at higher values these wines would end up in the worst category. Perhaps “Middle” and “Best” wines do not hit the threshold of “too high of levels”, thereby not experiencing any negative impact from volatile.acidity.

Wines with a higher pH value tended to have a better quality rating. This holds true for all three quality.category subsets (“Worst”, “Middle”, “Best”). I do not believe this is because wines with higher volatile.acidity generally have worse ratings. Below (next question) I go into a discussion about how volatile.acidity has a low impact on final pH (how, in fact, it is fixed.acidity that largely determines final pH). Maybe it is a tasting preference to have slightly less acidic wine (they are all acidic nonetheless).

No relationship appears to exist between quality and free.sulfur.dioxide. Another note from the text accompanying the data set states: “total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”.

Somewhat counterintuitively, there doesn’t seem to be much of a trend. A closer examination reveals that at the cited 50 ppm, that would mean free sulfur dioxide would have to be at least 49.9 mg/dm^3 to be detectable. Most wines (for all ratings) are less than this threshold. So for the most part, too much free SO2 likely isn’t a factor unless it is at unreasonably high amounts.
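The unit conversion behind that threshold can be sketched as follows. I am assuming here that ppm means mg of SO2 per kg of wine and that the mean density from the summary (0.9940 g/cm^3) is representative; with a slightly different density the result shifts by a few tenths, but it stays a little under 50 mg/dm^3.

```r
# Convert a mass-based ppm value (mg per kg) into mg per dm^3.
# A density in g/cm^3 is numerically equal to kg/dm^3.
ppm_to_mg_per_dm3 <- function(ppm, density_g_per_cm3) {
  ppm * density_g_per_cm3
}

mean_density <- 0.9940   # mean density from the summary output
threshold <- ppm_to_mg_per_dm3(50, mean_density)

# Reference points for free.sulfur.dioxide from the univariate section.
third_quartile <- 46
percentile_99 <- 81

cat("Detection threshold (mg/dm^3):", threshold, "\n")
cat("3rd quartile below threshold:", third_quartile < threshold, "\n")
cat("99th percentile above threshold:", percentile_99 > threshold, "\n")
```

Because the 3rd quartile sits below the threshold and the 99th percentile above it, at least three quarters of the wines are below the detectable level while a small minority exceeds it.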

For quality and total.sulfur.dioxide, however, there does appear to be a modest trend. I did a little searching to make sense of this observation. Sulfur dioxide is an important preservative and stabilizer used world-wide in wine making. total.sulfur.dioxide is defined as the “amount of free and bound forms of SO2”.

In cases where the wine is highly oxidized and has ample amounts of acetaldehyde (not good for wine quality), any added SO2 is quickly bound by acetaldehyde and other chemical constituents. It becomes part of total SO2 but not part of free SO2. Bound SO2 does not have antioxidant or anti-microbial properties, but free SO2 does. Poorly made wines that have oxidized require more SO2 to be added – enough so that it spills over into free SO2 at a level high enough to have the desired preserving effect. This would explain why we see a modest trend for total.sulfur.dioxide but not for free.sulfur.dioxide.

quality ratings tend to be higher when alcohol is higher. This appears to be a preference. According to an article in Scientific American (“Wine Becomes More Like Whisky as Alcohol Content Gets High”):
> “The recent surge in wine’s punch [alcohol content] is largely a result, scientists say, of a fashion for deeply colored wines with fewer “green” qualities and more bright, ripe, fruity flavors. As New World wines in this style have drawn more fans, even European winemakers accustomed to making lower-alcohol wines in less ripe styles are beginning to follow suit. But producing wines with those flavors means letting grapes hang longer on the vine, and with longer hang times comes bigger sugar. The more sugar the wine yeast S. cerevisiae has to work with, the more alcohol it will make.”

Along similar lines, quality tends to be higher in wines with lower density. This is consistent with the previous relationship that showed wines with higher alcohol content typically had better ratings. (Reminder: alcohol lowers the density of the wine, which is mostly water.)

Finally, a negative trend seems to exist between quality and chlorides. More chlorides seem to negatively impact quality. I found some information to corroborate this observation (see resources text file).

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes, there were several.

pH and fixed.acidity had a modest negative correlation (r = -0.4259). The relationship suggests that the higher the fixed.acidity, the lower the pH. That is not an implausible correlation. The more acid in a solution, all else equal, the lower the pH of that solution will be.

With that in mind, more than one kind of acid contributes to the final pH of a wine. fixed.acidity in this data set represents tartaric acid, one of the main acid constituents in wine, potentially explaining why we have a noticeable trend. But because it is not the only acid present, a stronger relationship is not observed.

At first glance it may seem curious that there isn’t even a modest relationship between volatile.acidity (acetic acid content) and pH. It may also seem strange that no significant relationship appears to exist between citric.acid and pH.

If you look at the measurements of all three acids, however, the above patterns actually do make sense. All of the acids are measured in grams per decimeter cubed.

## Fixed acidity summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
## 
##  Volatile acidity summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
## 
##  Citric acid summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

fixed.acidity quantities are much higher than either volatile.acidity or citric.acid contents, and they all have comparable pKa ranges (a measure of acidity), meaning that fixed.acidity (tartaric acid) has greater influence on pH.
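Since pH depends on the number of acid molecules rather than their mass, the comparison can be made a little more precise by converting the median concentrations to moles. This is a rough sketch under the assumption that each measure is a single pure acid (tartaric, acetic, and anhydrous citric acid, as labeled earlier in this report); the molar masses are standard values, not part of the data set.

```r
# Median concentrations in g/dm^3, taken from the summaries above.
median_g_per_dm3 <- c(tartaric = 6.80, acetic = 0.26, citric = 0.32)

# Standard molar masses in g/mol (assumption: each measure is one pure acid).
molar_mass <- c(tartaric = 150.09, acetic = 60.05, citric = 192.12)

# Molar concentration in mol/dm^3.
molar_conc <- median_g_per_dm3 / molar_mass

# How many times more tartaric acid there is, mole for mole.
ratio_to_tartaric <- molar_conc[["tartaric"]] / molar_conc

print(signif(molar_conc, 3))
print(round(ratio_to_tartaric, 1))
```

Even on a molar basis the median wine holds roughly ten times more tartaric acid than acetic acid, which supports the reasoning above.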

total.sulfur.dioxide and free.sulfur.dioxide have a positive correlation (r = 0.6155). This makes sense given my previous explanation of their relationship (see previous question).

A strong positive relationship exists between density and residual.sugar (r = 0.8390). This makes sense because dissolved solids increase the density of aqueous solvents (wine is mostly water, whose density is 1 g / cm^3).

A strong negative relationship exists between density and alcohol (r = -0.7801). This makes sense for the opposite reason. Ethanol has a density of 0.7893 g / cm^3, so its presence would decrease the density of a wine.
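The direction of this effect can be illustrated with a simple volume-weighted estimate. This is a deliberately crude sketch: it treats wine as an ideal mixture of only water and ethanol, ignoring volume contraction on mixing and dissolved solids such as sugar, so it should not be expected to match the measured densities. The ABV values are the minimum, median, and maximum from the summary.

```r
ethanol_density <- 0.7893  # g/cm^3, as quoted above
water_density <- 1.00      # g/cm^3

# Ideal-mixture estimate: weight each density by its volume fraction.
ideal_density <- function(abv_percent) {
  v <- abv_percent / 100
  v * ethanol_density + (1 - v) * water_density
}

abv <- c(8.0, 10.4, 14.2)  # min, median, and max alcohol from the summary
estimates <- ideal_density(abv)

print(data.frame(abv = abv, estimated_density = round(estimates, 4)))
```

The estimate falls steadily as alcohol rises, which is the same direction as the negative correlation observed in the data.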

alcohol and residual.sugar have a moderate negative correlation (r = -0.4506). Because alcohol is generated by the fermentation of sugar, if fermentation is limited by the winemaker more sugar and less alcohol would be in the final wine, all else equal. I believe a stronger relationship is not seen because the amount of sugar at the beginning of fermentation is different amongst wines. This depends on the grapes’ qualities and if sugar is added to the starting product.

density and total.sulfur.dioxide share a moderate positive correlation (r = 0.5299). I am unsure why this correlation exists since the density of sulfur dioxide (0.00293 g/cm^3) is lower than that of water.

What was the strongest relationship you found?

The strongest relationship amongst numerical variables was between density and residual.sugar (r = 0.8390).

I do not have a quantification for the strongest relationship between quality and another feature; however, visually the relationship between quality and alcohol seems to be the most dramatic.


Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The first multivariate plot is of alcohol, density, and quality.category. As seen before, alcohol and density are strongly related to one another: regardless of a wine’s category, alcohol and density are negatively correlated. Also from earlier exploration, the relationship between quality.category and alcohol still holds true. The “Best” category wines tend to have higher alcohol than their peers, “Middle” wines tend to be between, and “Worst” wines tend to have the least alcohol.

I also plotted volatile.acidity, pH and quality as a scatterplot. I binned pH because there was worse overplotting in the un-binned form. The scatterplot makes it unmistakably clear that the middle rated wines are the most abundant and span across the widest pH scale. It is also possible to see that wines with higher volatile.acidity are a mix of the worst rated wines and middle rated wines.

Were there any interesting or surprising interactions between features?

I did not see anything new that was not already seen in the bivariate analysis.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

N/A


Final Plots and Summary

Plot One

Description One

This scatterplot depicts the relationship between pH and fixed.acidity (r = -0.4259). Although it is only a moderate relationship, the inquiry that followed was interesting because I ended up learning about the acid composition of wine.

While it is not surprising that greater fixed.acidity correlates to lower pH, the imperfect relationship suggested that something more was at play. As discussed in the Bivariate Analysis section, fixed.acidity only represents one of the several acids present in wine (tartaric acid). After some brief investigation, it was clear that other acids, like malic acid (not measured in this data set), also contribute to the final pH of a wine. When I checked volatile.acidity (acetic acid) and citric.acid for correlation to pH, the relationships were not strong, which is consistent with there being much less of either acid present compared to tartaric acid.

Plot Two

Description Two

This boxplot shows alcohol content of wines by quality category. I elected to group the wines into “Worst”, “Middle”, and “Best” for visualization purposes.

This plot was particularly interesting not simply because it showed a clear preference for higher alcohol wines, but because it revealed a market trend. As discussed in the Bivariate Analysis section, New World wines with bolder, “riper” flavors are in vogue, which consequently come with higher alcohol content.

Plot Three

Description Three

Although this line plot is somewhat noisy, it contains some useful information. It shows a slight trend in quality and volatile acidity. It also shows that no significant relationship exists between pH and volatile acidity (which reinforces previous findings).

At a bigger picture level, this plot demonstrates that more is at work when it comes to quality rating: there are wines of different quality rating present all about, meaning some wines certainly buck the trend for volatile.acidity. It also shows the diversity in pH for wines. These observations reinforce my general impression that quality is influenced by multiple chemical properties of wine; that no one property is the ultimate dictator of quality.


Reflection

This was a rewarding data set to work on from both a professional and personal perspective. Going through a full analysis of the data, starting at a univariate level, progressing through a bivariate level, then finishing at a multivariate level, gave me the opportunity to apply and polish what I have learned. What I especially enjoyed was customizing plots. I was able to learn much more about ggplot2 in the process: the different layers available, some of their respective intricacies, and the various ways I could portray my findings.

From a personal perspective, I loved learning more about wine! My academic background is in chemistry and I have a particular fondness for natural product chemistry and food chemistry.

Although I wouldn’t deem them “struggles” exactly, having to rummage through the internet and other resources (I have a copy of “R in a Nutshell”) for code solutions took time and effort. For example, getting a good grasp of the theme() layer took a good amount of investigation.

Another challenge was making sense of some of the observations resulting from my queries. This was the fun part though! I never knew that “New World” wines were a class of wines, nor that they were defined by having riper, fruitier flavors compared to “Old World” wines. It was also interesting to learn that these in-fashion wines are fermented with riper grapes (more sugar rich) leading to more alcoholic wines – explaining my original observation that wines with higher quality ratings tended to have higher alcohol.

I was also surprised by my findings for quality and chlorides. I had no clue that wines could be “salty” to a detectable level, though it makes sense that these wines would have a lower quality. I don’t think I would seek out or enjoy a noticeably salty wine!

In the future, once I have learned more about predictive modeling, I think building a model could be very interesting. It would be a great way to test my present findings and also to get a better understanding of what people look for in wines. Another thing that would be interesting is repeating the experiment and adding more variables – e.g. all the major acids like malic, lactic, succinic acid, etc., as well as an oxidation marker of some sort and acetaldehyde levels. Many of these have profound effects on the sensory experience of wine, so we may be missing important information if these are not included.