
June 03, 2024

Please enter set.seed(10) at the R console before doing any of the R coding below. This will ensure that your random numbers, and therefore your answers, match ours.
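
As a quick illustration of why the seed matters, the sketch below draws three uniform random numbers twice, resetting the seed in between. The variable names a and b are ours, and the final call resets the seed so that the rest of the lab is unaffected.

```r
set.seed(10)
a <- runif(3)      # first set of draws

set.seed(10)
b <- runif(3)      # same seed, so the same draws

identical(a, b)    # TRUE: the random number stream was reproduced

set.seed(10)       # reset the seed before continuing with the lab
```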

We commonly use the following general notation to link models to data:

\[\big[y_i \mid g\big(\mathbf\theta,x_i\big),\sigma^2\big],\]

which represents the probability of obtaining the observation \(y_i\) given that our model predicts the mean of a distribution \(g(\mathbf\theta,x_i)\) with variance \(\sigma^2\). Assume we have count data. What distribution would be a logical choice to model these data? Write out a model for the data.

The Poisson is a logical choice. We predict the mean of the Poisson for each \(x_i\), i.e. \(\lambda_i = g\big(\mathbf\theta,x_i\big)\). This mean also controls the uncertainty, because the variance of a Poisson distribution equals its mean. A model for the data is:

\[ y_i \sim \textrm{Poisson}\big(g\big(\mathbf\theta,x_i\big)\big).\]
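
To make the notation concrete, here is a minimal sketch of evaluating \([y_i \mid g(\mathbf\theta,x_i)]\) in R. The deterministic model g, the parameter values, and the data are all made up for illustration. The exponential form is only one possible choice, used here because it keeps the Poisson mean positive.

```r
# Hypothetical deterministic model (an assumption, not part of the lab):
# g(theta, x) = exp(theta[1] + theta[2] * x)
g <- function(theta, x) exp(theta[1] + theta[2] * x)

theta <- c(1.0, 0.5)   # made-up parameter values
x <- c(0, 1, 2)        # made-up predictor values
y <- c(2, 5, 8)        # made-up observed counts

lambda <- g(theta, x)        # predicted Poisson mean for each x_i
like_i <- dpois(y, lambda)   # probability of each y_i given its predicted mean
like_i

# Log probability of the whole data set, assuming independent observations
total_log_like <- sum(dpois(y, lambda, log = TRUE))
total_log_like
```

Because the Poisson has a single parameter, no separate \(\sigma^2\) appears in the call: the variance of each \(y_i\) is the same \(\lambda_i\) that sets its mean.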

Choose the appropriate distribution for the types of data shown below and justify your decision.

The mass of carbon in above ground biomass in a square-meter plot.
The number of seals on a haul-out beach in the Gulf of Alaska.
Presence or absence of an invasive species in forest patches.
The probability that a white male will vote Republican in a presidential election.
The number of individuals in four mutually exclusive income categories.
The number of diseased individuals in a sample of 100.
You have counts of individuals in three political parties: the Greens, the Workers’ Party, and the Liberals. You seek to understand the proportion of the population in each category.

Random variable	Distribution	Justification
The mass of carbon in above ground biomass in a square-meter plot.	gamma or lognormal	continuous and non-negative
The number of seals on a haul-out beach in the Gulf of Alaska.	Poisson or negative binomial	non-negative integer counts
Presence or absence of an invasive species in forest patches.	Bernoulli	zero or one
The probability that a white male will vote Republican in a presidential election.	beta	continuous, bounded between zero and one
The number of individuals in four mutually exclusive income categories.	multinomial	counts in more than two categories
The number of diseased individuals in a sample of 100.	binomial	counts in two categories: the number of successes in a given number of trials
The political party affiliation (Democrat, Republican, independent) of a voter.	multinomial	counts in more than two categories
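
Each distribution in the table has a density or probability function in base R. The sketch below evaluates one made-up value per distribution. Every parameter value is arbitrary and chosen only for illustration. The last two calls show that values outside a distribution's support receive zero density, which is the reasoning behind the justifications above.

```r
# One illustrative evaluation per distribution (all values are made up)
dens <- c(
  gamma        = dgamma(2.5, shape = 2, rate = 1),
  lognormal    = dlnorm(2.5, meanlog = 0, sdlog = 1),
  poisson      = dpois(3, lambda = 4),
  neg_binomial = dnbinom(3, size = 2, mu = 4),
  bernoulli    = dbinom(1, size = 1, prob = 0.3),   # binomial with one trial
  beta         = dbeta(0.4, shape1 = 2, shape2 = 3),
  binomial     = dbinom(12, size = 100, prob = 0.1),
  multinomial  = dmultinom(c(3, 5, 2, 10), prob = c(0.2, 0.3, 0.1, 0.4))
)
dens

# Values outside the support have zero density
dgamma(-1, shape = 2, rate = 1)       # 0: a mass cannot be negative
dbeta(1.2, shape1 = 2, shape2 = 3)    # 0: a probability cannot exceed 1
```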

Find the mean, variance, and 2.5% and 97.5% quantiles (the bounds of the central 95% interval) of 10000 random draws from a Poisson distribution with \(\lambda=33\).

lambda <- 33
n <- 10000
y <- rpois(n, lambda)
mean(y)

## [1] 33.0171

var(y)

## [1] 33.06711

quantile(y, c(0.025, 0.975))

##  2.5% 97.5% 
##    22    45
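
The simulated values can be checked against theory without drawing any random numbers. For a Poisson distribution the mean and variance both equal \(\lambda\), and qpois gives the exact quantiles. The names theory_q and coverage below are ours.

```r
lambda <- 33

theory_mean <- lambda   # Poisson mean
theory_var  <- lambda   # Poisson variance equals the mean

# Exact 2.5% and 97.5% quantiles of the Poisson(33) distribution
theory_q <- qpois(c(0.025, 0.975), lambda)
theory_q

# Probability of landing between the two quantiles, inclusive.
# The distribution is discrete, so this is at least 0.95 rather than exactly 0.95.
coverage <- ppois(theory_q[2], lambda) - ppois(theory_q[1] - 1, lambda)
coverage
```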

Simulate one observation of survey data with five categories on a Likert scale, i.e. strongly disagree to strongly agree. Assume a sample of 80 respondents and the following probabilities:

Strongly disagree = 0.07
Disagree = 0.13
Neither agree nor disagree = 0.15
Agree = 0.23
Strongly agree = 0.42

prob <- c(0.07, 0.13, 0.15, 0.23, 0.42)  # category probabilities
size <- 80                               # number of respondents
n <- 1                                   # number of simulated surveys
rmultinom(n, size, prob)

##      [,1]
## [1,]    7
## [2,]    7
## [3,]   12
## [4,]   20
## [5,]   34
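
A few deterministic checks help when reading a multinomial draw like the one above: the probabilities should sum to one, the counts should sum to the number of respondents, and the expected count in each category is size * prob. The sketch below copies the simulated counts from the output into a vector named draw, a name of our choosing.

```r
prob <- c(0.07, 0.13, 0.15, 0.23, 0.42)
size <- 80

all.equal(sum(prob), 1)     # the probabilities must sum to 1

expected <- size * prob     # expected count in each category
expected

draw <- c(7, 7, 12, 20, 34) # the simulated observation shown above
sum(draw) == size           # every respondent falls in exactly one category

# Probability of this particular set of counts under the stated probabilities
dmultinom(draw, prob = prob)
```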

The average above ground biomass in a grazing allotment of sagebrush grassland is 103 g/m², with a standard deviation of 23 g/m². You clip a 1 m² plot. Write out the model for the probability density of the data point. What is the probability density of an observation of 94 g, assuming the data are normally distributed? Is there a problem with using a normal distribution? What is the probability that your plot will contain between 90 and 110 g of biomass?

The normal distribution isn’t an ideal choice because it extends below 0, which isn’t possible for measurements of above ground biomass. Nonetheless:

\[y_i \sim \textrm{normal}(103, 23^{2})\]

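x <- 94       # observed biomass
mu <- 103     # mean; named mu to avoid masking the base function mean()
sigma <- 23   # standard deviation; named sigma to avoid masking sd()
dnorm(x, mean = mu, sd = sigma)

## [1] 0.01606693

q <- c(110, 90)   # upper and lower bounds of the interval
p.bound <- pnorm(q, mean = mu, sd = sigma)
p.bound[1] - p.bound[2]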

## [1] 0.3336056
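
To gauge how much the choice of a normal distribution matters here, the sketch below computes the probability that the normal model assigns to negative biomass. It then repeats both calculations with a lognormal distribution whose parameters are chosen by moment matching so that it has the same mean and standard deviation. The lognormal is our choice of alternative, and the gamma would serve equally well.

```r
mu <- 103     # mean biomass, from the problem
sigma <- 23   # standard deviation, from the problem

# Probability the normal model places on impossible negative biomass
pnorm(0, mu, sigma)

# Lognormal parameters with the same mean and variance (moment matching)
sdlog <- sqrt(log(1 + sigma^2 / mu^2))
meanlog <- log(mu) - sdlog^2 / 2

# Same two questions under the lognormal
dlnorm(94, meanlog, sdlog)
plnorm(110, meanlog, sdlog) - plnorm(90, meanlog, sdlog)
```

With a mean more than four standard deviations above zero, the normal puts very little probability below zero in this problem, so the two models give similar answers.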

The prevalence of a disease in a population is the proportion of the population that is infected with the disease. The prevalence of chronic wasting disease in male mule deer on winter range near Fort Collins, CO is 12 percent. A sample of 24 male deer included 4 infected individuals. Write out a model that represents how the data arise. What is the probability of obtaining these data conditional on the given prevalence (p=0.12)?

\[y_i \sim \textrm{binomial}(24, 0.12)\]

x <- 4 
size <- 24
p <- 0.12
dbinom(x, size, p)

## [1] 0.1709024
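
The value returned by dbinom can be reproduced from the binomial probability mass function, \(\binom{n}{y} p^{y} (1-p)^{n-y}\). The name by_hand is ours.

```r
x <- 4
size <- 24
p <- 0.12

# Number of ways to choose 4 infected deer out of 24, times the probability
# of any one such arrangement
by_hand <- choose(size, x) * p^x * (1 - p)^(size - x)
by_hand

all.equal(by_hand, dbinom(x, size, p))   # TRUE
```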

Researchers know that the true proportions of age-sex classes for elk in Rocky Mountain National Park are: Adult females (p = 0.56), Yearling males (p = 0.06), Bulls (p = 0.16), and Calves (p = 0.22). What is the probability of obtaining the classification data conditional on the known age-sex population proportions, given the following counts?

Adult females (count = 65)
Yearling males (count = 4)
Bulls (count = 25)
Calves (count = 26)

p <- c(0.56, 0.06, 0.16, 0.22)
y <- c(65, 4, 25, 26)

dmultinom(x = y, prob = p)

## [1] 0.0003043713
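
Note that dmultinom takes the total number of animals classified from sum(y), so no size argument is needed. As a check, the same probability can be built from the multinomial formula, \(\frac{n!}{\prod_k y_k!}\prod_k p_k^{y_k}\). The sketch works on the log scale with lgamma, which is our choice for numerical stability, and the names log_coef and by_hand are ours.

```r
p <- c(0.56, 0.06, 0.16, 0.22)
y <- c(65, 4, 25, 26)
n <- sum(y)   # total number of elk classified

# log of n! / (y_1! * y_2! * y_3! * y_4!), using lgamma(k + 1) = log(k!)
log_coef <- lgamma(n + 1) - sum(lgamma(y + 1))

# add the log of p_1^y_1 * ... * p_4^y_4, then return to the probability scale
by_hand <- exp(log_coef + sum(y * log(p)))
by_hand

all.equal(by_hand, dmultinom(x = y, prob = p))
```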

Nitrogen fixation by free-living bacteria occurs at a mean rate of 1.9 g N/ha/yr with a standard deviation (\(\sigma\)) of 1.4. What is the fixation rate below which 2.5% of the distribution falls (the 0.025 quantile)? Use a normal distribution for this problem, but discuss why this might not be a good choice.

mu <- 1.9
sigma <- 1.4
p <- 0.025
qnorm(p, mu, sigma)

## [1] -0.8439496

The normal distribution isn’t an ideal choice because it extends below 0, which isn’t possible for measurements of nitrogen fixation. The problem is visible in the answer itself: the 2.5% quantile is negative.
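
The size of the problem can be quantified. The sketch below computes the probability that the normal model assigns to negative fixation rates. For comparison, it then finds the same quantile from a gamma distribution that is moment matched to the stated mean and standard deviation. The gamma is our choice of a non-negative alternative and is not part of the original question.

```r
mu <- 1.9
sigma <- 1.4

# Share of the normal distribution that lies below zero (nearly 9% here)
pnorm(0, mu, sigma)

# Gamma parameters with the same mean and variance (moment matching)
shape <- mu^2 / sigma^2
rate <- mu / sigma^2

# The 2.5% quantile under the gamma is positive
qgamma(0.025, shape, rate)
```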

Bayesian Models for Ecologists

Probability Lab 2: Probability Distributions
