If I choose a passenger at random, what is the probability they rode in first class?
If I choose a passenger at random, what is the probability they are a woman who rode in first class?
If I choose a woman at random, what is the probability they rode in first class?
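The three questions above differ only in what is counted and who is in the denominator. A minimal sketch on a made-up passenger list (the eight records below are hypothetical, not the real Titanic data):

```python
# Hypothetical toy records: (sex, passenger class). Not the real Titanic data.
passengers = [
    ("female", 1), ("female", 1), ("female", 3), ("female", 2),
    ("male", 1), ("male", 3), ("male", 3), ("male", 2),
]

n = len(passengers)
women = [p for p in passengers if p[0] == "female"]
first_class = [p for p in passengers if p[1] == 1]
women_first = [p for p in passengers if p[0] == "female" and p[1] == 1]

# Marginal: P(first class), counted out of all passengers.
p_first = len(first_class) / n

# Joint: P(woman AND first class), counted out of all passengers.
p_woman_and_first = len(women_first) / n

# Conditional: P(first class | woman), counted out of women only.
p_first_given_woman = len(women_first) / len(women)

print(p_first, p_woman_and_first, p_first_given_woman)
```

The joint and conditional probabilities share a numerator; only the denominator changes from all passengers to women.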
We have analyzed a quantitative variable already. Where?
In the Colombia COVID data!
Departamento Edad ... Fecha de diagnóstico Fecha recuperado
0 Bogotá D.C. 19 ... 2020-03-06 2020-03-13
1 Valle del Cauca 34 ... 2020-03-09 2020-03-19
2 Antioquia 50 ... 2020-03-09 2020-03-15
3 Antioquia 55 ... 2020-03-11 2020-03-26
4 Antioquia 25 ... 2020-03-11 2020-03-23
... ... ... ... ... ...
25361 Buenaventura D.E. 48 ... 2020-05-28 NaN
25362 Valle del Cauca 55 ... 2020-05-28 NaN
25363 Buenaventura D.E. 39 ... 2020-05-28 NaN
25364 Valle del Cauca 13 ... 2020-05-28 NaN
25365 Córdoba 0 ... 2020-05-28 NaN
[25366 rows x 10 columns]
To visualize the age variable, we did the following:
Then, we could treat age as categorical and make a barplot:
A histogram uses equal-width bins to summarize a quantitative variable.
A histogram must use a quantitative variable to look right:
To tweak your histogram, you can change the number of bins:
Recall the distribution of a categorical variable: What are the possible values and how common is each?
The distribution of a quantitative variable is similar: The total area in the histogram is 1.0 (or 100%).
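One way to see why the total area is 1.0 whatever the number of bins: a minimal sketch that builds a density histogram by hand. The ages and the helper `density_histogram` are hypothetical, chosen only for illustration.

```python
# Hypothetical ages, chosen only for illustration.
ages = [2, 5, 13, 19, 21, 24, 30, 34, 35, 41, 47, 52, 58, 63, 77]


def density_histogram(values, n_bins):
    """Split the range of `values` into n_bins equal-width bins.

    Returns (bin_width, densities), where each density is the
    proportion of values in the bin divided by the bin width.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin, not a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    densities = [c / (len(values) * width) for c in counts]
    return width, densities


for n_bins in (3, 5, 15):
    width, densities = density_histogram(ages, n_bins)
    # Area of each bar is height * width; the bars together cover everyone.
    total_area = sum(d * width for d in densities)
    print(n_bins, round(total_area, 10))
```

Changing the number of bins changes the shape of the bars, but each bar's area is still the proportion of observations in that bin, so the areas always add up to 1.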
In this example, we have a limited set of possible values for age: 0, 1, 2, …, 100. We call this discrete.
What if we had a quantitative variable with infinitely many possible values?
For example: Price of a ticket on the Titanic.
We call this continuous.
In this case, it is not possible to list all possible values and how likely each one is.
Instead, we talk about ranges of values.
About what percent of people in this dataset are below 18?
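A question about a range of values can be answered exactly by counting. A minimal sketch, using a short hypothetical list of ages rather than the real dataset:

```python
# Hypothetical ages, chosen only for illustration.
ages = [2, 5, 13, 19, 21, 24, 30, 34, 35, 41, 47, 52, 58, 63, 77]

# True/False for each person: is this age below 18?
below_18 = [a < 18 for a in ages]

# True counts as 1, so the sum is the number of people below 18.
share = sum(below_18) / len(ages)
print(f"{share:.1%}")
```

On a histogram, this share is the area of the bars to the left of 18.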
If you had to summarize this variable with one single number, how would you pick?
One summary of the center of a quantitative variable is the mean.
When you hear “The average age is…” or “The average income is…”, this probably refers to the mean.
Suppose we have five people, ages: 4, 84, 12, 27, 7
The mean age is: \[(4 + 84 + 12 + 27 + 7)/5 = 134/5 = 26.8\]
To refer to our data without having to list all the numbers, we use \(x_1, x_2, ..., x_n\)
In the previous example, \(x_1 = 4, x_2 = 84, x_3 = 12, x_4 = 27, x_5 = 7\). So, \(n = 5\).
To add up all the numbers, we use the summation notation: \[ \sum_{i = 1}^n x_i = 134\]
Therefore, the mean is: \[\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i\]
Long version: find the sum and the number of observations
Short version: use the built-in function!
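Both versions can be sketched on the five ages from the example. This sketch uses `statistics.mean` from the Python standard library as the built-in; for a pandas column, `Series.mean()` plays the same role.

```python
from statistics import mean

ages = [4, 84, 12, 27, 7]

# Long version: the sum of the values divided by the number of observations.
total = sum(ages)   # sum of x_i for i = 1..n
n = len(ages)       # n
mean_long = total / n

# Short version: the built-in function.
mean_short = mean(ages)

print(total, n, mean_long, mean_short)
```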
The mean is only one option for summarizing the center of a quantitative variable. It isn’t perfect!
Let’s investigate this.
Plot the density of ticket prices on the Titanic.
Calculate the mean price.
See how many people paid more than the mean price.
Our fare data was skewed right: Most values were small, but a few values were very large.
These large values “pull” the mean up, just as the value 84 pulled the average age up in our previous example.
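The same effect shows up on a small made-up example. The fares below are hypothetical (not the Titanic data), with most values small and one very large:

```python
from statistics import mean

# Hypothetical right-skewed fares: mostly small, one very large.
fares = [7, 7, 8, 8, 9, 10, 13, 15, 26, 30, 80, 500]

mean_fare = mean(fares)

# Count how many passengers paid more than the mean.
n_above = sum(f > mean_fare for f in fares)
share_above = n_above / len(fares)

print(round(mean_fare, 2), n_above, round(share_above, 3))
```

Only 2 of the 12 made-up fares exceed the mean, because the single fare of 500 drags the mean well above what a typical passenger paid.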
So, why do we like the mean?
Recall: Ages 4, 84, 12, 27, 7.
Imagine that we had to “guess” the age of the next person.
If we guess 26.8, then our “squared error” for these five people is:
array([ 519.8, 3271.8, 219. , 0. , 392. ])
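One reason to like the mean is that no other single guess gives a smaller total squared error. A minimal sketch that checks this on the five ages by trying a grid of guesses; the helper names `squared_errors` and `sse`, and the grid itself, are illustrative choices:

```python
ages = [4, 84, 12, 27, 7]


def squared_errors(guess, values):
    """Squared error of `guess` for each value."""
    return [(v - guess) ** 2 for v in values]


def sse(guess, values):
    """Sum of squared errors of `guess` over all values."""
    return sum(squared_errors(guess, values))


mean_age = sum(ages) / len(ages)

# Squared errors when guessing the mean, rounded to one decimal place.
errors_at_mean = [round(e, 1) for e in squared_errors(mean_age, ages)]
print(errors_at_mean)

# Try every guess from 0.0 to 100.0 in steps of 0.1 and keep the best one.
candidates = [g / 10 for g in range(0, 1001)]
best_guess = min(candidates, key=lambda g: sse(g, ages))
print(best_guess, round(sse(best_guess, ages), 1))
```

The best guess on the grid is the mean itself. Moving the guess away from the mean in either direction increases the sum of squared errors.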
Another summary of center is the median, which is the “middle” of the sorted values.
To calculate the median of a quantitative variable with values \(x_1, x_2, x_3, ..., x_n\), we do the following steps:
Sort the values from smallest to largest: \[x_{(1)}, x_{(2)}, x_{(3)}, ..., x_{(n)}.\]
The “middle” value depends on whether we have an odd or an even number of observations.
If \(n\) is odd, then the middle value is \(x_{(\frac{n+1}{2})}\).
If \(n\) is even, then there are two middle values, \(x_{(\frac{n}{2})}\) and \(x_{(\frac{n}{2} + 1)}\). It is conventional to report the mean of the two values (but you can actually pick any value between them).
Ages: 4, 84, 12, 7, 27. What is the median?
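The steps above can be sketched as a small function and compared with `statistics.median` from the Python standard library. The helper `median_by_hand` is a hypothetical name used only here; for a pandas column, `Series.median()` does the same job.

```python
from statistics import median


def median_by_hand(values):
    """Sort the values, then take the middle one (n odd) or the
    mean of the two middle ones (n even)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2  # 0-based index of x_((n+1)/2) when n is odd
    if n % 2 == 1:
        return s[mid]
    # n even: s[mid - 1] and s[mid] are x_(n/2) and x_(n/2 + 1)
    return (s[mid - 1] + s[mid]) / 2


ages = [4, 84, 12, 7, 27]
print(sorted(ages))
print(median_by_hand(ages), median(ages))

# With an even number of values, the two middle values are averaged.
print(median_by_hand([4, 84, 12, 7]))
```

Sorted, the ages are 4, 7, 12, 27, 84, so the median is 12. Unlike the mean (26.8), it is not pulled up by the value 84.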
Median age in the Colombia data:
One measure of spread is the variance.
The variance of a variable whose values are \(x_1, x_2, x_3, ..., x_n\) is calculated using the formula \[\textrm{var(X)} = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}\]
Does this look familiar? It’s the sum of squared errors from before, divided by \(n-1\) (the “degrees of freedom”)!
We could do this manually:
…or using a built-in Python function.
348.0870469898451
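Both routes can be sketched on the five ages from the earlier example (toy data, so the result differs from the value shown above for the full dataset). The built-in used here is `statistics.variance` from the Python standard library, which divides by \(n - 1\); for a pandas column, `Series.var()` also divides by \(n - 1\) by default.

```python
from statistics import variance

ages = [4, 84, 12, 27, 7]

# Manual version: follow the formula term by term.
n = len(ages)
x_bar = sum(ages) / n
squared_errors = [(x - x_bar) ** 2 for x in ages]
var_manual = sum(squared_errors) / (n - 1)

# Built-in version (sample variance, divides by n - 1).
var_builtin = variance(ages)

print(var_manual, var_builtin)
```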
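Notice that the variance isn’t very intuitive: what do we mean by “The spread is 348”?
This is because the variance is built from squared errors, so its units are squared (here, years squared)!
So, to get back to the original units, we take the square root. The result is called the standard deviation: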
Or, we use the built-in function!
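A minimal sketch of both routes, again on the five toy ages. It uses `math.sqrt` and `statistics.stdev` from the Python standard library; for a pandas column, `Series.std()` is the counterpart. The last line takes the square root of the variance reported above for the age data.

```python
from math import sqrt
from statistics import stdev, variance

ages = [4, 84, 12, 27, 7]

# Manual version: square root of the sample variance.
sd_manual = sqrt(variance(ages))

# Built-in version (sample standard deviation, based on n - 1).
sd_builtin = stdev(ages)

print(sd_manual, sd_builtin)

# The same step applied to the variance reported for the age data.
sd_from_reported_variance = sqrt(348.0870469898451)
print(sd_from_reported_variance)
```

The standard deviation is in years rather than years squared, so it can be read as a rough typical distance from the mean.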
Visualize quantitative variables with histograms or densities.
Summarize the center of a quantitative variable with mean or median.
Describe the shape of a quantitative variable with skew.
Summarize the spread of a quantitative variable with the variance or the standard deviation.