Oct 15, 2018

Basic Probability

  • The set of all possible outcomes of a random experiment is called a sample space, \(\Omega\). An event \(E\) is a subset of \(\Omega\).

Three Axioms of Probability

  1. \(0 \leq P(E) \leq 1)\)
  2. \(P(\Omega)=1\)
  3. For any sequence of mutually exclusive events \(E_1, E_2, \ldots\) (that is, \(E_i \cap E_j = \emptyset\)) for all (\(i \neq j\)),

\[P\left(\bigcup_{i=1}^{\infty}E_i \right) = \sum_{i=1}^{\infty}P(E_i)\]

Birthday Problem

Suppose that a room contains \(m\) students. What is the probability that at least two of them have the same birthday? Start by assuming every day of the year is equally likely to be a birthday, and disregard leap years. That is, assume there are always \(n = 365\) days to a year.

Birthday Problem - Solution

Let the event \(E\) denote two or more students with the same birthday. In this problem, it is easier to find \(E^c\), as there are a number of ways that \(E\) can take place. There are a total of \(365^m\) possible outcomes in the sample space. \(E^c\) can occur in \(365\times 364 \times \cdots \times (365 - m + 1)\) ways. Consequently,

\[P(E^c) = \frac{365\times 364 \times \cdots \times (365 - m + 1)}{365^m}\] and

\[P(E) = 1 - \frac{365\times 364 \times \cdots \times (365 - m + 1)}{365^m}\]

Some R

> m <- 1:60             # vector of number of students
> p <- numeric(60)      # initialize vecotr to 0's
> for(i in m){          # index value for loop
+   q = prod((365:(365 - i + 1))/365)
+   p[i] = 1 - q
+ }
> plot(m, p, pch = 19,
+      ylab = "P(at least two students with the same birthday)",
+      xlab = "m = number of students in the room")
> abline(h = 0.5, lty = 2, col = "blue")
> abline(v = 23, lty = 2, col = "blue")

Resulting Graph

Conditional Probability

  • If \(E\) and \(F\) are any two events in a sample space \(\Omega\) and \(P(E) \neq 0\), the conditional probability of \(F\) given \(E\) is defined as

\[P(F|E) = \frac{P(F\cap E)}{P(E)}\]

Example

Suppose two fair dice are rolled where each of the 36 possible outcomes is equally likely to occur. Knowing that the first die shows a 3, what is the probability that the sum of the two dice equals 6?

  • Define "the sum of the dice equals 6" to be event \(A\) and "3 on the first toss" to be event \(B\).

\[P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/36}{6/36}=\frac{1}{6}\]

Law of Total Probability and Bayes' Rule

Let \(F_1, F_2, \ldots F_n\) be such that \(\bigcup_{i=1}^{n}F_i = \Omega\) and \(F_i \cap F_j = \emptyset\) for all \(i \neq j\), with \(P(F_i) > 0\) for all \(i\). Then, for any event \(E\),

\[P(E) = \sum_{i=1}^{n}P(E \cap F_i) = \sum_{i=1}^{n}P(E|F_i)P(F_i)\]

\[P(F_j|E) = \frac{P(E \cap F_j)}{P(E)} = \frac{P(E|F_i)P(F_i)}{\sum_{i=1}^{n}P(E|F_i)P(F_i)} \]

Choose a Door

The television show Let's Make a Deal hosted by Monty Hall, gave contestants the opportunity to choose one of three doors. Contestants hoped to choose the one that concealed the grand prize. Behind the other two doors were much less valuable prizes. After the contestant chose one of the doors, say Door 1, Monty opened one of the other two doors, say Door3, containing a much less valuable prize. The contestant was then asked whether he or she wished to stay with the original choice (Door 1) or switch to the other closed door (Door2). What should the contestant do? Is it better to stay with original choice of switch to the other closed door?

  • What is the probability of winning by switching doors when given the opportunity?
  • What is the probability of winning by staying with the initial selection?

R - Code

> set.seed(13)
> n <- 10000
> actual <- sample(1:3, size = n, replace = TRUE)
> aguess <- sample(1:3, size = n, replace = TRUE)
> equals <- (actual == aguess)
> not.eq <- (actual != aguess)
> PnoSwitch <- mean(equals)
> PwiSwitch <- mean(not.eq)
> probs <- c(PnoSwitch, PwiSwitch)
> names(probs) <- c("P(Win no Switch)", "P(Win Switch)")
> probs
P(Win no Switch)    P(Win Switch) 
           0.331            0.669 

Monte Hall (continued)

To solve with Bayes' Rule start by assuming the contestant initially guesses Door 1 and that Monty opens Door 3. The the event \(D_i=\) Door \(i\) conceals the prize and \(O_j\) = Monty opens Door \(j\) after the contestant selects Door 1. When a contestant initially selects a door, \(P(D_1)=P(D_2)=P(D_3)=1/3\). Once Monty show the grand prize is not behind Door 3, the probability of winning the grand prize is now one of \(P(D_1|0_3)\) or \(P(D_2|O_3)\). Note that \(P(D_1|O_3)\) corresponds to the strategy of sticking with the initial guess and \(P(D_2|O_3)\) corresponds to the strategy of switching doors.

Monte Hall (continued)

Based on how the show is designed, the following are know:

  • \(P(O_3|D_1)=1/2\) since Monty can open one of either Door 3 or Door 2 without revealing the grand prize.
  • \(P(O_3|D_2)=1\) since the only door Monty can open without revealing the grand prize is Door 3.
  • \(P(O_3|D_3)=0\) since Monty will not open Door 3 if it contains the grand prize.

Monte Hall (continued)

\[P(D_1|O_3) = \frac{P(O_3|D_1)P(D_1)}{P(O_3|D_1)P(D_1) + P(O_3|D_2)P(D_2) + P(O_3|D_3)P(D_3)}\] \[P(D_1|O_3) = \frac{1/2\cdot 1/3}{1/2 \cdot 1/3 + 1 \cdot 1/3 + 0 \cdot 1/3}=\frac{1}{3}\]

\[P(D_2|O_3) = \frac{P(O_3|D_2)P(D_2)}{P(O_3|D_1)P(D_1) + P(O_3|D_2)P(D_2) + P(O_3|D_3)P(D_3)}\]

\[P(D_2|O_3) = \frac{1\cdot 1/3}{1/2 \cdot 1/3 + 1 \cdot 1/3 + 0 \cdot 1/3}=\frac{2}{3}\]

Monte Hall Problem with n Doors

Consider the case with \(n\) Doors and just one grand prize. The probability of winning the grand prize on the first choice is \(\frac{1}{n}\). This is the same probability of winning if you were to use the "stay" strategy. You will win the grand prize if both of the following events occur:

  1. You choose a door that does not have the grand prize \(\left(P(G) = \frac{n-1}{n}\right)\)

  2. You choose the door with the grand prize by switching doors \(\left(P(C|G) = \frac{1}{n-2}\right)\)

Monte Hall Problem with n Doors (continued)

In other words, the probability of winning using the "switch" strategy is

\[P(G \cap C) = P(G) \cdot P(C|G) = \frac{n-1}{n} \cdot \frac{1}{n-2}\] For \(n=3\) doors, the probability of winning with the "switch" strategy is:

\[\frac{n-1}{n}\cdot \frac{1}{n-2} =\frac{3-1}{3}\cdot\frac{1}{3-2} = \frac{2}{3}\] Note: \(\frac{n-1}{n(n-2)} > \frac{1}{n} \rightarrow\) switching is always better!

Monty Hall Simulation Shiny-App

Random Variables

  • A discrete random variable \(X\) is a function from \(\Omega\) into the real numbers \(\bf{R}\) with a range that is finite or countably infinite. That is, \(X: \Omega \rightarrow {x_1, x_2, \ldots, x_m}\), or \(X:\Omega \rightarrow {x_1, x_2, \dots}\).

  • A function \(X\) from \(\Omega\) into the real numbers \(\bf{R}\) is a continuous random variable if there exists a nonnegative function \(f\) such that for every subset \(C\) of \(\bf{R}\), \(P(X \in C) = \int_C\;f(x)\,dx\). In particular, for \(a \leq b\), \(P(a < X \leq b) = \int_a^b\; f(x)\,dx.\)

Probability Mass Function (Discrete)

The function that assigns probability to the values of the random variable is called the probability density function (pdf). Many authors will refer to the pdf as the probability mass function pmf when working with discrete random variables. We will denote the pdf as \(p(x)=P(X=x)\) for each \(x\). All pdfs must satisfy the following two conditions:

  1. \(p(x)\geq0\) for all \(x\).

  2. \(\sum_{\forall x}p(x) = 1\).

Probability Density Function (Continuous)

The function \(f\) is called the probability density function of \(X\) and satisfies the following conditions:

  • \(f(x) > 0 , \quad x\in \Omega\)
  • \(\int_{\Omega}\;f(x)\,dx = 1\)
  • If \((a, b) \subseteq \Omega\), then the probability of the event \(\{a < X < b\}\) is \[P(a < X < b) = \int_a^b\;f(x)\,dx\]

Mean and Variance of X

  • \[\mu_X = E(X) = \sum_{x}x\cdot p(x)\]

  • \[\mu_X = E(X) = \int_{-\infty}^{\infty}\;x\,f(x)\,dx\]

  • \[\sigma^2 = \text{Var}(X) = E[(X - \mu_X)^2] = \sum_{x}(x- \mu_X)^2 \cdot p(x)\]

  • \[\sigma^2 = \text{Var}(X) = E[(X - \mu_X)^2] = \int_{-\infty}^{\infty}(x - \mu_X)^2\;f(x)\,dx\]

Example

  • Consider rolling two dice. Let \(X\) denote the sum of the two numbers that appear. In this case, \(X\) is a discrete random variable. That is, \(X:\Omega \rightarrow {2, 3, 4, \ldots, 11, 12}\).
> S <- expand.grid(roll1 = 1:6, roll2 = 1:6)
> S$Sum <- apply(S, 1, sum)
> head(S)
  roll1 roll2 Sum
1     1     1   2
2     2     1   3
3     3     1   4
4     4     1   5
5     5     1   6
6     6     1   7

Example (continued)

Find the expected value and variance of \(X\)

> table(S$Sum)
 2  3  4  5  6  7  8  9 10 11 12 
 1  2  3  4  5  6  5  4  3  2  1 
> T1 <- MASS::fractions(table(S$Sum)/sum(!is.na(S$Sum)))
> T1
   2    3    4    5    6    7    8    9   10   11   12 
1/36 1/18 1/12  1/9 5/36  1/6 5/36  1/9 1/12 1/18 1/36 

Example (continued)

> x <- as.numeric(names(T1))
> x
 [1]  2  3  4  5  6  7  8  9 10 11 12
> px <- T1
> EX <- sum(x*px)
> EX
[1] 7
> VX <- sum((x - EX)^2*px)
> VX
[1] 35/6

Example Mean and Variance

Let \(X\) be a continuous random variable with pdf \(f(x) = 2x, 0 < x < 1\). Find the mean and variance of \(X\).

\[\mu_x = E(X) = \int_{0}^{1} x \cdot 2x\, dx = \frac{2}{3}x^3 \mid_0^1 = \frac{2}{3}\]

\[\sigma^2 = \text{Var}(X) = E(X^2) - \mu_X^2\] \[\sigma^2 = \int_0^1 x^2(2x)\,dx - \left(\frac{2}{3}\right)^2 = \left[\left(\frac{1}{2}y^4 \right)\right]_0^1 - \frac{4}{9}= \frac{1}{18}\]

Example (continued 1)

> f <- function(x){2*x}
> integrate(f, 0, 1)
1 with absolute error < 1.1e-14
> ex <- function(x){x*f(x)}
> EX <- integrate(ex, 0, 1)
> EX
0.6666667 with absolute error < 7.4e-15
> EX <- MASS::fractions(EX$value)
> EX
[1] 2/3

Example (continued 2)

> g <- function(x){(x - 2/3)^2*f(x)}
> VX <- integrate(g, 0, 1)
> VX
0.05555556 with absolute error < 6.2e-16
> VX <- MASS::fractions(VX$value)
> VX
[1] 1/18

Example (continued 3)

What is the area under \(f\)?

> curve(f, 0, 1)

Odds and Ends

  • The standard deviation of \(X\) is \(\sigma_{X} = \sqrt{\text{Var}[X]}\)

  • If \(X\) is a random variable and \(a\) and \(b\) are constants, then

\[E[a \pm bX] = a \pm bE[X]\] \[\text{Var}[a \pm bX] = b^2\text{Var}[X] \] \[\text{Var}[aX \pm bY] = a^2 \text{Var}[X] + b^2\text{Var}[Y]\]

Sample of random variables

  • Let \(X_1, X_2, \ldots, X_n\) be independent and identically distributed random variables with mean \(\mu\) and variance \(\sigma^2\). Then

\[\bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i\] is called the (sample) mean.

\[E[\bar{X}] = \mu_{\bar{X}} = \mu_{X} = \mu\]

  • If \(X_1, X_2, \ldots, X_n\) are independent and identically distributed random variables with mean \(\mu\) and variance \(\sigma^2\), then

\[\text{Var}[\bar{X}] = \frac{\sigma^2_X}{n} = \frac{\sigma^2}{n}\]

Z- score

\[Z = \frac{\text{statistic} - \mu_{\text{statistic}}}{\sigma_{\text{statistic}}}\]

  • Suppose \(X \sim N(\mu = 10, \sigma = 2)\) and one takes a random sample of size 4 and computes the sample mean to be 14. What is the \(Z-\)score?

\[Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{14 - 10}{\frac{2}{\sqrt{4}}} = 4\]

T- score

\[T = \frac{\text{statistic} - \mu_{\text{statistic}}}{\hat{\sigma}_{\text{statistic}}}\]

  • Suppose \(X \sim N(\mu = 10, \sigma = ?)\). The results from a random sample of size 4 are: 12, 14, 2, and 20. What is the \(T-\)score?
> rs <- c(12, 14, 2, 20)
> xbar <- mean(rs)
> SD <- sd(rs)
> Tscore <- (xbar - 10)/(SD/sqrt(4))
> c(xbar, SD, Tscore)
[1] 12.0000000  7.4833148  0.5345225

\[T = \frac{\bar{X} - \mu_{\bar{X}}}{\hat{\sigma}_{\bar{X}}} = \frac{12 - 10}{\frac{7.483315}{\sqrt{}4}} = 0.5345\]

Independence

Let X and Y be random variables. Then X and Y are independent if for any sets A and B,

\[P(X \in A , Y \in B) = P(X \in A)P(Y \in B).\]

Two events A and B are independent if

\[P(A \cap B) = P(A)P(B)\]

Two events A and B are independent if (and only if)

\[P(B|A) = P(B)\]

Example

Suppose we take a random sample of \(n = 100\) online adults and find 71 use Facebook, 18 use Twitter, and 15 use both.

       Facebook
Twitter Yes No Sum
    Yes  15  3  18
    No   56 26  82
    Sum  71 29 100
  1. Are using Facebook and using Twitter mutually exclusive?
  2. Are Facebooking and tweeting independent?

Partial Solution

Let A = respondent uses Facebook and B = respondent uses Twitter.

We know that \(P(A) = \frac{71}{100}=0.71\), \(P(B) = \frac{18}{100}=0.18\), and \(P(A\cap B)= \frac{15}{100}=0.15\).

Since \(P(A \cap B)\neq 0\), the events A and B are not mutually exclusive.

Recall that A and B are independent if \(P(A \cap B)=P(A)P(B)\). Since \(P(A \cap B) = 0.15 \neq P(A)P(B)=0.71\times 0.18 = 0.1278,\) the events A and B are not independent.

Are Facebook usage and Twitter usage Independent?

What do we expect in each cell?

       Facebook
Twitter Yes No
    Yes  15  3
    No   56 26
       Facebook
Twitter   Yes    No
    Yes 12.78  5.22
    No  58.22 23.78

Test Statistic

\[TS = \sum_{\text{all cells}}\frac{(O - E)^2}{E}\]

  • Create a tidy data set.
> on <- c(15, 3, 56, 26)
> ONL <- matrix(data = on, nrow = 2, byrow = TRUE)
> dimnames(ONL) <- list(Twitter = c("Yes", "No"), Facebook = c("Yes", "No"))
> ONLT <- as.table(ONL)
> ONLTDF <- as.data.frame(ONLT)
> TDF <- vcdExtra::expand.dft(ONLTDF)
> dim(TDF)
[1] 100   2

Tidy Data

Data

> T1 <- xtabs(~Twitter + Facebook, data = TDF)
> T1
       Facebook
Twitter No Yes
    No  26  56
    Yes  3  15
> chisq.test(T1, correct = FALSE)
    Pearson's Chi-squared test

data:  T1
X-squared = 1.6217, df = 1, p-value = 0.2029

Graph Code

> library(tidyverse)
> TDF %>% 
+   ggplot(aes(x = Twitter, fill = Facebook)) + 
+   geom_bar(position = "fill") + 
+   labs(y = "Fraction")

Graph

Randomization Test

> obs.stat <- chisq.test(T1, correct = FALSE)$stat
> obs.stat
X-squared 
 1.621673 
> set.seed(123)
> sims <- 10^4 - 1
> ts <- numeric(sims)
> for(i in 1:sims){
+   ts[i] <- chisq.test(xtabs(~Twitter + sample(Facebook), data = TDF), 
+                       correct = FALSE)$stat
+ }
> pvalue <- (sum(ts >= obs.stat) + 1)/(sims + 1)
> pvalue
[1] 0.2546

Randomiztion Distribution Code

> library(ggplot2)
> ggplot(data = data.frame(x = ts), aes(x = x)) + 
+   geom_density(fill = "peachpuff", adjust = 3) + 
+   theme_bw() + 
+   stat_function(fun = dchisq, args = list(df = 1), color = "red") + 
+   geom_vline(xintercept = obs.stat, linetype = "dashed", 
+              color = "lightblue") + 
+   xlim(0, 8)

Randomization Distribution