Category Archives: Probability

Independence

Definition. Let $(\Omega,\mathscr{U},P)$ be a probability space and let $A,B\in\mathscr{U}$ be two events with $P(B)>0$. The probability of $A$ given $B$, denoted $P(A|B)$, is defined by
$$P(A|B)=\frac{P(A\cap B)}{P(B)}\ \mbox{if}\ P(B)>0$$
If the events $A$ and $B$ are independent, knowing that $B$ has occurred should not change the probability of $A$, so
$$P(A)=P(A|B)=\frac{P(A\cap B)}{P(B)}$$
i.e.
$$P(A\cap B)=P(A)P(B)$$
This derivation requires $P(B)>0$, but we take the product formula as the definition even when $P(B)=0$.

Definition. Two events $A$ and $B$ are independent if
$$P(A\cap B)=P(A)P(B)$$
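As a quick numerical illustration (my addition, not part of the original notes), here is a minimal Python sketch, assuming NumPy is available, for the fair-die events $A=\{\mbox{outcome is even}\}$ and $B=\{\mbox{outcome}\leq 4\}$; these are independent since $P(A\cap B)=\frac{1}{3}=P(A)P(B)$.

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1_000_000)   # fair die: outcomes 1..6

A = (rolls % 2 == 0)        # event A: outcome is even
B = (rolls <= 4)            # event B: outcome is at most 4

p_A, p_B = A.mean(), B.mean()
p_AB = (A & B).mean()       # empirical P(A ∩ B)

print(p_AB, p_A * p_B)      # both should be close to 1/3

Up to Monte Carlo error, the two printed numbers agree, matching the product formula.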

Definition. Let $X_i:\Omega\longrightarrow\mathbb{R}^n$, $i=1,2,\cdots$, be random variables. The random variables $X_1,X_2,\cdots$ are said to be independent if for all integers $k\geq 2$ and all choices of Borel sets $B_1,\cdots,B_k\subset\mathbb{R}^n$
\begin{align*}
P(X_1\in B_1,X_2\in B_2,&\cdots,X_k\in B_k)=\\
&P(X_1\in B_1)P(X_2\in B_2)\cdots P(X_k\in B_k)
\end{align*}

Theorem. The random variables $X_1,\cdots,X_m:\Omega\longrightarrow\mathbb{R}^n$ are independent if and only if
\begin{equation}
\label{eq:indepdistrib}
F_{X_1,\cdots,X_m}(x_1,\cdots,x_m)=F_{X_1}(x_1)\cdots F_{X_m}(x_m)
\end{equation}
$\forall x_i\in\mathbb{R}^n$, $\forall i=1,\cdots,m$. If the random variables have densities, \eqref{eq:indepdistrib} is equivalent to
$$f_{X_1,\cdots,X_m}(x_1,\cdots,x_m)=f_{X_1}(x_1)\cdots f_{X_m}(x_m)$$
$\forall x_i\in\mathbb{R}^n$, $\forall i=1,\cdots,m$, where the functions $f$ are the appropriate densities.

Proof. Suppose that $X_1,\cdots,X_m$ are independent. Then
\begin{align*}
F_{X_1,\cdots,X_m}(x_1,\cdots,x_m)&=P(X_1\leq x_1,\cdots, X_m\leq x_m)\\
&=P(X_1\leq x_1)\cdots P(X_m\leq x_m)\\
&=F_{X_1}(x_1)\cdots F_{X_m}(x_m)
\end{align*}
Conversely, suppose the densities factor as in the statement of the theorem, and let $B_1,B_2,\cdots,B_m\subset\mathbb{R}^n$ be Borel sets. Then
\begin{align*}
P(X_1\in B_1,\cdots,X_m\in B_m)&=\int_{B_1\times\cdots\times B_m}f_{X_1,\cdots,X_m}(x_1,\cdots,x_m)dx_1\cdots dx_m\\
&=\left(\int_{B_1}f_{X_1}(x_1)dx_1\right)\cdots\left(\int_{B_m}f_{X_m}(x_m)dx_m\right)\\
&=P(X_1\in B_1)P(X_2\in B_2)\cdots P(X_m\in B_m)
\end{align*}
So, $X_1,\cdots,X_m$ are independent.
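The factorization of the joint distribution function can also be checked numerically. The following sketch (my addition, assuming NumPy) draws independent samples $X_1\sim N(0,1)$ and $X_2\sim\mbox{Uniform}(0,1)$ and compares the empirical joint distribution function with the product of the marginals at one test point.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X1 = rng.standard_normal(n)       # X1 ~ N(0,1)
X2 = rng.uniform(0.0, 1.0, n)     # X2 ~ Uniform(0,1), independent of X1

x1, x2 = 0.5, 0.3                 # test point
F_joint = np.mean((X1 <= x1) & (X2 <= x2))      # empirical F_{X1,X2}(x1,x2)
F_prod = np.mean(X1 <= x1) * np.mean(X2 <= x2)

print(F_joint, F_prod)            # agree up to Monte Carlo error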

Theorem. If $X_1,\cdots,X_m$ are independent real-valued random variables with $E(|X_i|)<\infty$ ($i=1,\cdots,m$), then $E(|X_1\cdots X_m|)<\infty$ and
$$E(X_1\cdots X_m)=E(X_1)\cdots E(X_m)$$

Proof. For simplicity assume the random variables have densities; since they are independent, the joint density factors. Then
\begin{align*}
E(X_1\cdots X_m)&=\int_{\mathbb{R}^m}x_1\cdots x_m f_{X_1,\cdots,X_m}(x_1,\cdots,x_m)dx_1\cdots dx_m\\
&=\left(\int_{\mathbb{R}}x_1f_{X_1}(x_1)dx_1\right)\cdots\left(\int_{\mathbb{R}}x_mf_{X_m}(x_m)dx_m\right)\\
&=E(X_1)\cdots E(X_m)
\end{align*}
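A quick Monte Carlo sanity check of the theorem (my sketch, assuming NumPy): for independent $X_1\sim N(2,1)$ and $X_2\sim\mbox{Exponential}(1)$, the sample mean of $X_1X_2$ should be close to $E(X_1)E(X_2)=2\cdot 1=2$.

import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
X1 = rng.normal(loc=2.0, scale=1.0, size=n)   # E(X1) = 2
X2 = rng.exponential(scale=1.0, size=n)       # E(X2) = 1

print(np.mean(X1 * X2), np.mean(X1) * np.mean(X2))   # both ≈ 2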

Theorem. If $X_1,\cdots,X_m$ are independent real-valued random variables with $V(X_i)<\infty$, $i=1,\cdots,m$, then
$$V(X_1+\cdots+X_m)=V(X_1)+\cdots+V(X_m)$$

Proof. We prove the case $m=2$; the general case follows by induction. Let $m_1=E(X_1)$ and $m_2=E(X_2)$. Then
\begin{align*}
E(X_1+X_2)&=\int_{\Omega}(X_1+X_2)dP\\
&=\int_{\Omega}X_1dP+\int_{\Omega}X_2dP\\
&=E(X_1)+E(X_2)\\
&=m_1+m_2
\end{align*}
\begin{align*}
V(X_1+X_2)&=\int_{\Omega}(X_1+X_2-(m_1+m_2))^2dP\\
&=\int_{\Omega}(X_1-m_1)^2dP+\int_{\Omega}(X_2-m_2)^2dP\\
&\quad +2\int_{\Omega}(X_1-m_1)(X_2-m_2)dP\\
&=V(X_1)+V(X_2)+2E[(X_1-m_1)(X_2-m_2)]
\end{align*}
Since $X_1$ and $X_2$ are independent, $E[(X_1-m_1)(X_2-m_2)]=E(X_1-m_1)E(X_2-m_2)=0$. This completes the proof.
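Numerically, for independent samples the empirical variance of the sum is close to the sum of the empirical variances, while this fails for dependent variables. A minimal sketch (my addition, assuming NumPy):

import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
X1 = rng.normal(0.0, 2.0, n)      # V(X1) = 4
X2 = rng.uniform(-1.0, 1.0, n)    # V(X2) = 1/3

print(np.var(X1 + X2), np.var(X1) + np.var(X2))   # both ≈ 4 + 1/3

# A dependent pair for contrast: V(X1 + X1) = 4 V(X1), not 2 V(X1).
print(np.var(X1 + X1), 2 * np.var(X1))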

References:

Lawrence C. Evans, An Introduction to Stochastic Differential Equations, Lecture Notes

Distribution Functions

Let $(\Omega,\mathscr{U},P)$ be a probability space and $X:\Omega\longrightarrow\mathbb{R}^n$ a random variable. We define an ordering between two vectors in $\mathbb{R}^n$ as follows: let $x=(x_1,\cdots,x_n),y=(y_1,\cdots,y_n)\in\mathbb{R}^n$. Then $x\leq y$ means $x_i\leq y_i$ for $i=1,\cdots,n$.

Definition. The distribution function of $X$ is the function $F_X: \mathbb{R}^n\longrightarrow[0,1]$ defined by
$$F_X(x):=P(X\leq x)$$
for all $x\in\mathbb{R}^n$. If $X_1,\cdots,X_m:\Omega\longrightarrow\mathbb{R}^n$ are random variables, their joint distribution function $F_{X_1,\cdots,X_m}:(\mathbb{R}^n)^m\longrightarrow[0,1]$ is defined by
$$F_{X_1,\cdots,X_m}(x_1,\cdots,x_m):=P(X_1\leq x_1,\cdots,X_m\leq x_m)$$
for all $x_i\in\mathbb{R}^n$ and for all $i=1,\cdots,m$.

Definition. Let $X$ be a random variable, $F=F_X$ its distribution function. If there exists a nonnegative integrable function $f:\mathbb{R}^n\longrightarrow\mathbb{R}$ such that
$$F(x)=F(x_1,\cdots,x_n)=\int_{-\infty}^{x_1}\cdots\int_{-\infty}^{x_n}f(y_1,\cdots,y_n)dy_1\cdots dy_n$$
then $f$ is called the density function for $X$. More generally,
$$P(X\in B)=\int_B f(x)dx$$
for all $B\in\mathscr{B}$ where $\mathscr{B}$ is the Borel $\sigma$-algebra.

Example. If $X:\Omega\longrightarrow\mathbb{R}$ has the density function
$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{|x-m|^2}{2\sigma^2}},\ x\in\mathbb{R}$$
then we say $X$ has a Gaussian or normal distribution with mean $m$ and variance $\sigma^2$. In this case, we write “$X$ is an $N(m,\sigma^2)$ random variable.”
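As a sanity check (my addition, assuming NumPy and SciPy; $m=1$ and $\sigma^2=4$ are just illustrative values), one can verify numerically that this density integrates to 1 and that samples drawn from it have the stated mean and variance.

import numpy as np
from scipy.integrate import quad

m, sigma2 = 1.0, 4.0
sigma = np.sqrt(sigma2)
f = lambda x: np.exp(-(x - m)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

total, _ = quad(f, -np.inf, np.inf)
print(total)                              # ≈ 1, so f is a density

rng = np.random.default_rng(4)
samples = rng.normal(m, sigma, size=1_000_000)
print(samples.mean(), samples.var())      # ≈ 1 and 4, i.e. m and σ²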

Example. If $X: \Omega\longrightarrow\mathbb{R}^n$ has the density
$$f(x)=\frac{1}{\sqrt{(2\pi)^n\det C}}e^{-\frac{1}{2}(x-m)C^{-1}(x-m)^t},\ x\in\mathbb{R}^n$$
for some $m\in\mathbb{R}^n$ and some positive definite symmetric matrix $C$, we say that “$X$ has a Gaussian or normal distribution with mean $m$ and covariance matrix $C$.” We write $X$ is an $N(m,C)$ random variable. The covariance matrix is given by
\begin{equation}\label{eq:covmatrix}C=E[(X-E(X))^t(X-E(X))]\end{equation}
where $X=(X_1,\cdots,X_n)$, i.e. $C$ is the matrix whose $(i,j)$ entry is the covariance
$$C_{ij}=\mathrm{cov}(X_i,X_j)=E[(X_i-E(X_i))(X_j-E(X_j))]=E(X_iX_j)-E(X_i)E(X_j)$$
Clearly $C$ is a symmetric matrix. Recall that for a real-valued random variable $X$ the variance $\sigma^2$ is given by
$$\sigma^2=V(X)=E[(X-E(X))^2]=E[(X-E(X))\cdot (X-E(X))]$$
So one readily sees that \eqref{eq:covmatrix} is a generalization of variance to higher dimensions. It follows from \eqref{eq:covmatrix} that for a vector $b\in\mathbb{R}^n$,
$$V(Xb^t)=bCb^t$$
Since the variance $V(Xb^t)=bCb^t$ is nonnegative for every $b\in\mathbb{R}^n$, the covariance matrix is positive semidefinite; for the density above we assume in addition that $C$ is positive definite. Since $C$ is symmetric, $PCP^{-1}=D$ where $P$ is an orthogonal matrix and $D$ is a diagonal matrix whose main diagonal contains the eigenvalues of $C$. Recall that for two $n\times n$ matrices $A$ and $B$, $\det(AB)=\det(A)\det(B)$, so $\det(C)=\det(D)$. Since all the eigenvalues of a positive definite matrix are positive, $\det(C)>0$.
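The following sketch (my addition, assuming NumPy; the particular $m$ and $C$ are illustrative) samples from $N(m,C)$ and checks that the empirical covariance matrix recovers $C$, that the eigenvalues of $C$ are positive, and that $\det C>0$.

import numpy as np

m = np.array([1.0, -2.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # symmetric and positive definite

print(np.linalg.eigvalsh(C))      # positive eigenvalues
print(np.linalg.det(C))           # det C > 0

rng = np.random.default_rng(5)
X = rng.multivariate_normal(m, C, size=1_000_000)

print(X.mean(axis=0))             # ≈ m
print(np.cov(X, rowvar=False))    # ≈ C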

Lemma. Let $X:\Omega\longrightarrow\mathbb{R}^n$ be a random variable and assume that its distribution function $F=F_X$ has the density $f$. Suppose $g:\mathbb{R}^n\longrightarrow\mathbb{R}$ is Borel measurable and $Y=g(X)$ is integrable. Then
$$E(Y)=\int_{\mathbb{R}^n}g(x)f(x)dx$$

Proof. Suppose first that $g$ is a simple function on $\mathbb{R}^n$.
$$g=\sum_{i=1}^mb_iI_{B_i}\ (B_i\in\mathscr{B})$$
\begin{align*}E(g(X))&=\sum_{i=1}^mb_i\int_{\Omega}I_{B_i}(X)dP\\&=\sum_{i=1}^mb_iP(X\in B_i).\end{align*}
But
\begin{align*}\int_{\mathbb{R}^n}g(x)f(x)dx&=\sum_{i=1}^mb_i\int_{\mathbb{R}^n}I_{B_i}(x)f(x)dx\\&=\sum_{i=1}^mb_i\int_{B_i}f(x)dx\\&=\sum_{i=1}^mb_iP(X\in B_i)\end{align*}
This proves the lemma when $g$ is a simple function. For general $g$, approximate $g$ by simple functions and pass to the limit.
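Here is a one-dimensional numerical illustration of the lemma (my sketch, assuming NumPy and SciPy): with $X\sim N(0,1)$ and $g(x)=x^2$, the Monte Carlo estimate of $E(g(X))$ agrees with $\int_{\mathbb{R}}g(x)f(x)dx=1$.

import numpy as np
from scipy.integrate import quad

g = lambda x: x**2
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0,1) density

integral, _ = quad(lambda x: g(x) * f(x), -np.inf, np.inf)

rng = np.random.default_rng(6)
X = rng.standard_normal(1_000_000)
monte_carlo = np.mean(g(X))

print(integral, monte_carlo)   # both ≈ 1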

Corollary. If $X:\Omega\longrightarrow\mathbb{R}^n$ is a random variable and its distribution function $F=F_X$ has the density $f$, then
$$V(X)=\int_{\mathbb{R}^n}|x-E(X)|^2f(x)dx$$

Proof. Recall that $V(X)=E(|X-E(X)|^2)$. Define $g:\mathbb{R}^n\longrightarrow\mathbb{R}$ by
$$g(x)=|x-E(X)|^2$$
for all $x\in\mathbb{R}^n$. Then by the Lemma we have
$$V(X)=\int_{\mathbb{R}^n}|x-E(X)|^2f(x)dx$$

Corollary. If $X:\Omega\longrightarrow\mathbb{R}$ is a random variable and its distribution function $F=F_X$ has the density $f$, then $E(X)=\int_{-\infty}^\infty xf(x)dx$ and $V(X)=\int_{-\infty}^\infty |x-E(X)|^2f(x)dx$.

Proof. The formula for $E(X)$ follows from the Lemma by taking $g:\mathbb{R}\longrightarrow\mathbb{R}$ to be the identity map; the formula for $V(X)$ is the previous corollary with $n=1$.

Corollary. If $X=(X_1,\cdots,X_n):\Omega\longrightarrow\mathbb{R}^n$ is a random variable and its distribution function $F=F_X$ has the density $f$, then
$$E(X_1\cdots X_n)=\int_{\mathbb{R}^n}x_1\cdots x_nf(x)dx$$

Proof. Define $g:\mathbb{R}^n\longrightarrow\mathbb{R}$ by
$$g(x)=x_1\cdots x_n\ \mbox{for all}\ x=(x_1,\cdots,x_n)\in\mathbb{R}^n$$
Then the rest follows by the Lemma.

Example. If $X$ is $N(m,\sigma^2)$ then
\begin{align*}
E(X)&=\frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^\infty xe^{-\frac{(x-m)^2}{2\sigma^2}}dx\\
&=m\\
V(X)&=\frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^\infty (x-m)^2e^{-\frac{(x-m)^2}{2\sigma^2}}dx\\
&=\sigma^2
\end{align*}
Therefore, $m$ is the mean and $\sigma^2$ is the variance.
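These two integrals are easy to check numerically (my sketch, assuming SciPy; $m=3$ and $\sigma^2=2$ are arbitrary test values).

import numpy as np
from scipy.integrate import quad

m, sigma2 = 3.0, 2.0
f = lambda x: np.exp(-(x - m)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mean, _ = quad(lambda x: x * f(x), -np.inf, np.inf)
var, _ = quad(lambda x: (x - m)**2 * f(x), -np.inf, np.inf)

print(mean, var)   # ≈ 3 and 2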
References:

Lawrence C. Evans, An Introduction to Stochastic Differential Equations, Lecture Notes

Probability Measure

In these lecture notes, we study basic measure theory in terms of probability. If you want to learn more about general measure theory, I recommend [2].

Let $\Omega$ be a set whose elements will be called samples.

Definition. A $\sigma$-algebra is a collection $\mathscr{U}$ of subsets of $\Omega$ satisfying

  1. $\varnothing,\Omega\in\mathscr{U}$
  2. If $A\in\mathscr{U}$, then $A^c\in\mathscr{U}$
  3. If $A_1,A_2,\cdots\in\mathscr{U}$, then $\bigcup_{k=1}^\infty A_k,\bigcap_{k=1}^\infty A_k\in\mathscr{U}$

Note: In condition 3, it suffices to require only that if $A_1,A_2,\cdots\in\mathscr{U}$, then $\bigcup_{k=1}^\infty A_k\in\mathscr{U}$, or only that if $A_1,A_2,\cdots\in\mathscr{U}$, then $\bigcap_{k=1}^\infty A_k\in\mathscr{U}$. For example, let us assume that if $A_1,A_2,\cdots\in\mathscr{U}$, then $\bigcup_{k=1}^\infty A_k\in\mathscr{U}$. Let $A_1,A_2,\cdots\in\mathscr{U}$. Then by condition 2, $(A_1)^c,(A_2)^c,\cdots\in\mathscr{U}$, so $\bigcup_{k=1}^\infty (A_k)^c\in\mathscr{U}$. By condition 2 again together with De Morgan’s laws, $\bigcap_{k=1}^\infty A_k=\left[\bigcup_{k=1}^\infty (A_k)^c\right]^c\in\mathscr{U}$.

Definition. Let $\mathscr{U}$ be a $\sigma$-algebra of subsets of $\Omega$. A map $P:\mathscr{U}\longrightarrow[0,1]$ is called a probability measure if $P$ satisfies

  1. $P(\varnothing)=0$, $P(\Omega)=1$
  2. If $A_1,A_2,\cdots\in\mathscr{U}$, then $$P\left(\bigcup_{k=1}^\infty A_k\right)\leq\sum_{k=1}^\infty P(A_k)$$
  3. If $A_1,A_2,\cdots\in\mathscr{U}$ are mutually disjoint, then $$P\left(\bigcup_{k=1}^\infty A_k\right)=\sum_{k=1}^\infty P(A_k)$$

Proposition. Let $A,B\in\mathscr{U}$. If $A\subset B$ then $P(A)\leq P(B)$.

Proof. Let $A,B\in\mathscr{U}$ with $A\subset B$. Then $B=(B-A)\dot\cup A$ where $\dot\cup$ denotes disjoint union. So by condition 3, $P(B)=P(B-A)+P(A)\geq P(A)$ since $P(B-A)\geq 0$.

Definition. A triple $(\Omega,\mathscr{U},P)$ is called a probability space. We say $A\in\mathscr{U}$ is an event and $P(A)$ is the probability of the event $A$. A property which is true except for an event of probability zero is said to hold almost surely (abbreviated “a.s.”).

Example. The smallest $\sigma$-algebra containing all the open subsets of $\mathbb{R}^n$ is called the Borel $\sigma$-algebra and is denoted by $\mathscr{B}$. Here we mean “open subsets” in terms of the usual Euclidean topology on $\mathbb{R}^n$. Since $\mathbb{R}^n$ with the Euclidean topology is second countable, the “open subsets” can be replaced by “basic open subsets”. Assume that $f$ is a nonnegative, integrable function (whatever that means; we will talk about it later) with $\int_{\mathbb{R}^n}f(x)dx=1$. Define
$$P(B)=\int_Bf(x)dx$$ for each $B\in\mathscr{B}$. Then $(\mathbb{R}^n,\mathscr{B},P)$ is a probability space. The function $f$ is called the density of the probability measure $P$.
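For instance, taking $f$ to be the standard normal density on $\mathbb{R}$ gives a probability measure on $(\mathbb{R},\mathscr{B})$. A quick numerical check (my sketch, assuming SciPy) that $P(\mathbb{R})=1$ and that $P$ of an interval is the usual normal probability:

import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density

P = lambda a, b: quad(f, a, b)[0]    # P(B) for an interval B = (a, b)

print(P(-np.inf, np.inf))   # ≈ 1, so P is a probability measure
print(P(-1.0, 1.0))         # ≈ 0.6827, the probability of (-1, 1)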

Definition. Let $(\Omega,\mathscr{U},P)$ be a probability space. A mapping $X:\Omega\longrightarrow\mathbb{R}^n$ is called an $n$-dimensional random variable if for each $B\in\mathscr{B}$, $X^{-1}(B)\in\mathscr{U}$. Equivalently, we say $X$ is $\mathscr{U}$-measurable. The probability space $(\Omega,\mathscr{U},P)$ is a mathematical construct that we cannot observe directly, but the values $X(\omega)$, $\omega\in\Omega$, of the random variable $X$ are observable. Following customary notation in probability theory, we write $X(\omega)$ simply as $X$. Also, $P(X^{-1}(B))$ is denoted by $P(X\in B)$.

Definition. Let $A\in\mathscr{U}$. Then the indicator $I_A: \Omega\longrightarrow\{0,1\}$ of $A$ is defined by $$I_A(\omega)=\left\{\begin{array}{ccc}1 & \mbox{if} & \omega\in A\\0 & \mbox{if} & \omega\not\in A\end{array}\right.$$
In measure theory the indicator of $A$ is also called the characteristic function of $A$ and is usually denoted by $\chi_A$. Here we reserve the term “characteristic function” for something else. Clearly the indicator is a random variable: both $\{0\}$ and $\{1\}$ are open in the subspace topology on $\{0,1\}$, so the Borel $\sigma$-algebra of $\{0,1\}$ coincides with the discrete topology, i.e. the power set of $\{0,1\}$. Alternatively, without mentioning the subspace topology, let $B\in\mathscr{B}$, the Borel $\sigma$-algebra of $\mathbb{R}$. If $0\in B$ and $1\notin B$ then $I_A^{-1}(B)=A^c\in\mathscr{U}$. If $0\notin B$ and $1\in B$ then $I_A^{-1}(B)=A\in\mathscr{U}$. If $0,1\notin B$ then $I_A^{-1}(B)=\varnothing\in\mathscr{U}$. If $0,1\in B$ then $I_A^{-1}(B)=\Omega\in\mathscr{U}$.

If $A_1,A_2,\cdots,A_m\in\mathscr{U}$ with $\Omega=\bigcup_{i=1}^m A_i$ and $a_1,a_2,\cdots,a_m\in\mathbb{R}$, then
$$X=\sum_{i=1}^m a_iI_{A_i}$$ is a random variable called a simple function.

[Figure: graph of a simple function]

Lemma. Let $X: \Omega\longrightarrow\mathbb{R}^n$ be a random variable. Then
$$\mathscr{U}(X)=\{X^{-1}(B): B\in\mathscr{B}\}$$ is the smallest $\sigma$-algebra with respect to which $X$ is measurable. $\mathscr{U}(X)$ is called the $\sigma$-algebra generated by $X$.

Definition. A collection $\{X(t)|t\geq 0\}$ of random variables parametrized by time $t$ is called a stochastic process. For each $\omega\in\Omega$, the map $t\longmapsto X(t,\omega)$ is the corresponding sample path.

Let $(\Omega,\mathscr{U},P)$ be a probability space and $X=\sum_{i=1}^k a_iI_{A_i}$ a simple random variable, where the $A_i$ are disjoint and the $a_i$ distinct. The probability that $X=a_i$ is $P(X=a_i)=P(X^{-1}(\{a_i\}))=P(A_i)$, so $\sum_{i=1}^k a_iP(A_i)$ is the natural candidate for the expected value of $X$. We define the integral of $X$ by
\begin{equation}\label{eq:integral}\int_{\Omega}XdP=\sum_{i=1}^k a_iP(A_i)\end{equation}
if $X$ is a simple random variable. A random variable is not necessarily simple so we obviously want to extend the notion of integral to general random variables. First suppose that $X$ is a nonnegative random variable. Then we define
\begin{equation}\label{eq:integral2}\int_{\Omega}XdP=\sup_{Y\leq X,\ Y\ \mbox{simple}}\int_{\Omega}YdP\end{equation}
Let $X$ be a random variable. Let $X^+=\max\{X,0\}$ and $X^-=\max\{-X,0\}$. Then $X=X^+-X^-$. Define
\begin{equation}\label{eq:integral3}\int_{\Omega}XdP=\int_{\Omega}X^+dP-\int_{\Omega}X^-dP\end{equation}
provided at least one of the two integrals on the right is finite. For a random variable $X$, we still call the integral \eqref{eq:integral3} the expected value of $X$ and denote it by $E(X)$. This integral is called the Lebesgue integral in real analysis (see [2]). When I first learned the Lebesgue integral in my senior year in college, it wasn’t very clear to me what motivated one to define the Lebesgue integral the way it is. In terms of probability the motivation is so much clearer. I personally think that it would be better to introduce the Lebesgue integral to undergraduate students in the context of probability theory rather than abstract real analysis.

If $X:\Omega\longrightarrow\mathbb{R}^n$ is a vector-valued random variable and $X=(X_1,X_2,\cdots,X_n)$, we define
$$\int_{\Omega}XdP=\left(\int_{\Omega}X_1dP,\int_{\Omega}X_2dP,\cdots,\int_{\Omega}X_ndP\right)$$
As one would expect from an integral, the expected value $E(\cdot)$ is linear.
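To make the definition \eqref{eq:integral} concrete, here is a small sketch (my addition, assuming NumPy) on the sample space $\Omega=\{1,\cdots,6\}$ of a fair die: for the simple random variable $X=1\cdot I_{A_1}+5\cdot I_{A_2}$ with $A_1=\{1,2,3\}$ and $A_2=\{4,5,6\}$, the sum $\sum_i a_iP(A_i)$ matches a Monte Carlo average of $X$.

import numpy as np

# Simple random variable on Ω = {1,...,6} with P uniform (fair die).
a = [1.0, 5.0]
A = [{1, 2, 3}, {4, 5, 6}]             # A_1, A_2 partition Ω
P = {omega: 1/6 for omega in range(1, 7)}

# ∫_Ω X dP = Σ a_i P(A_i)
integral = sum(a_i * sum(P[w] for w in A_i) for a_i, A_i in zip(a, A))

# Monte Carlo: average X(ω) over many sampled ω
rng = np.random.default_rng(7)
omegas = rng.integers(1, 7, size=1_000_000)
X = np.where(np.isin(omegas, list(A[0])), a[0], a[1])

print(integral, X.mean())   # both ≈ 3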

Definition. We call $$V(X)=\int_{\Omega}|X-E(X)|^2dP$$the variance of $X$.

It follows from the linearity of $E(\cdot)$ that $$V(X)=E(|X-E(X)|^2)=E(|X|^2)-|E(X)|^2$$

Lemma. If $X$ is a random variable and $1\leq p<\infty$, then \begin{equation}\label{eq:chebyshev}P(|X|\geq\lambda)\leq\frac{1}{\lambda^p}E(|X|^p)\end{equation}for all $\lambda>0$. The inequality \eqref{eq:chebyshev} is called Chebyshev’s inequality.

Proof. Since $1\leq p<\infty$, $|X|\geq\lambda\Rightarrow |X|^p\geq\lambda^p$. So, \begin{align*}E(|X|^p)&=\int_{\Omega}|X|^pdP\\&\geq\int_{|X|\geq\lambda}|X|^pdP\\
&\geq\lambda^p\int_{|X|\geq\lambda}dP\\&=\lambda^pP(|X|\geq\lambda).\end{align*}

Example. Let a random variable $X$ have the probability density function $$f(x)=\left\{\begin{array}{ccc}\frac{1}{2\sqrt{3}} & \mbox{if} & -\sqrt{3}<x<\sqrt{3}\\ 0 & \mbox{elsewhere}
\end{array}\right.$$For $p=1$ and $\lambda=\frac{3}{2}$, $\frac{1}{\lambda}E(|X|)=\frac{1}{\sqrt{3}}\approx 0.58$. Note that $E(|X|)=\int_{-\infty}^\infty |x|f(x)dx$. (We will discuss this later.) $P(|X|\geq\frac{3}{2})=1-\int_{-\frac{3}{2}}^{\frac{3}{2}}f(x)dx=1-\frac{\sqrt{3}}{2}\approx 0.134$. Since $0.134\leq 0.58$, Chebyshev’s inequality is confirmed.
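The numbers in this example are easy to reproduce numerically (my sketch, assuming NumPy): draw samples from the uniform density on $(-\sqrt{3},\sqrt{3})$ and compare the empirical $P(|X|\geq\frac{3}{2})$ with the Chebyshev bound $\frac{1}{\lambda}E(|X|)$.

import numpy as np

rng = np.random.default_rng(8)
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=1_000_000)

lam = 1.5
prob = np.mean(np.abs(X) >= lam)    # ≈ 1 - √3/2 ≈ 0.134
bound = np.mean(np.abs(X)) / lam    # ≈ 1/√3 ≈ 0.577

print(prob, bound, prob <= bound)   # Chebyshev: prob ≤ bound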
References (in no particular order):

  1. Lawrence C. Evans, An Introduction to Stochastic Differential Equations, Lecture Notes
  2. H. L. Royden, Real Analysis, Second Edition, Macmillan
  3. Robert V. Hogg, Joseph W. McKean, Allen T. Craig, Introduction to Mathematical Statistics, Sixth Edition, Pearson