1. Introduction
I would like to start off by stating that I’m no expert when it comes to measure theory, to say the least. I’m trying to make my way through Tom Lindstrøm’s “Spaces: An Introduction to Real Analysis”, but it takes time to wrap one’s head around real analysis.
So, what am I trying to do here? If you are familiar with probability, you should remember the relation \(\mathbb{P}[x \le X \le x + \Delta{x}] \approx f_{X}(x) \Delta{x}\), where, of course, \(X\) is a random variable and \(f_{X}(x)\) is its PDF. This is proven in some books and glossed over in others. Even though the univariate proof is easy to find, I couldn’t find one for the joint probability of two random variables, let alone three, to say nothing of a random vector of arbitrary size. Thus, I would like to derive those relations here.
1.1. Assumptions
To make matters clear: we work only with differentiable and integrable functions here, and we don’t ask whether a function is Riemann integrable or Lebesgue integrable; it just is. We deal only with PDFs that satisfy these assumptions.
2. Univariate case
This is basic, and can easily be found online or in any decent (or even not-so-decent) book on probability theory, but I will derive it for completeness. We let our random variable be defined on some arbitrary region in \(\mathbb{R}\) by stating \(X(\Omega) = [a, b]\), with \(a < b\). If we let \(\Delta{x} > 0\), we define the probability of \(X\) falling in a region \([x, x + \Delta{x}] \subseteq [a, b]\) as:
\[\begin{align} \mathbb{P}[x \le X \le x + \Delta{x}] = \int_{x}^{x + \Delta{x}}f_{X}(t)dt \end{align}\]This is true for continuous distributions as well as discrete ones, provided we represent the latter using a so-called “delta train”, \(f_{X}(x) = \sum_{k \in X(\Omega)} \mathbb{P}[X=k] \delta(x-k)\).
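As a quick sanity check of the delta-train representation, here is a minimal sketch using sympy’s DiracDelta; the fair six-sided die is my own choice of example:

```python
import sympy as sp

x = sp.symbols("x", real=True)

# Delta train for a fair six-sided die: f_X(x) = sum_k (1/6) * delta(x - k)
f_X = sum(sp.Rational(1, 6) * sp.DiracDelta(x - k) for k in range(1, 7))

# Integrating over [1/2, 5/2] picks up the point masses at k = 1 and k = 2,
# so P[1/2 <= X <= 5/2] should come out to 2/6 = 1/3
p = sp.integrate(f_X, (x, sp.Rational(1, 2), sp.Rational(5, 2)))
print(p)  # 1/3
```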
Thus, given that \(f_{X}(x)\) is a PDF, there must exist a corresponding CDF, and the two are related by:
\[\begin{align} f_{X}(x) = \frac{d F_{X}(x)}{dx} \end{align}\]This shows that the CDF is an antiderivative of the PDF, and using the Fundamental Theorem of Calculus we can show that,
\[\begin{align} \int_{x}^{x + \Delta{x}}f_{X}(t)dt = F_{X}(x + \Delta{x}) - F_{X}(x) \end{align}\]Now, by the Mean Value Theorem there exists some value \(x^{*} \in [x, x + \Delta{x}]\) such that:
\[\begin{align} F_{X}(x + \Delta{x}) - F_{X}(x) = \frac{d F_{X}(x^{*})}{dx} \Delta{x} = f_{X}(x^{*}) \Delta{x} \end{align}\]Putting this together we see that,
\[\begin{align} \mathbb{P}[x \le X \le x + \Delta{x}] = f_{X}(x^{*}) \Delta{x} \end{align}\]Because we don’t know what \(x^{*}\) is, we approximate \(f_{X}(x^{*})\) by \(f_{X}(x)\), obtaining the final relation,
\[\begin{align} \mathbb{P}[x \le X \le x + \Delta{x}] \approx f_{X}(x) \Delta{x} \end{align}\]
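Before moving on, a quick numerical check of this relation; a minimal sketch of my own using scipy’s standard normal, where the point \(x\) and the width \(\Delta{x}\) are arbitrary choices:

```python
from scipy.stats import norm

x, dx = 1.0, 1e-3

# Exact probability via the CDF: P[x <= X <= x + dx] = F(x + dx) - F(x)
exact = norm.cdf(x + dx) - norm.cdf(x)

# The approximation derived above: f_X(x) * dx
approx = norm.pdf(x) * dx

print(exact, approx)  # nearly identical; the gap shrinks as dx does
```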
3. Bivariate case
Here we are dealing with two random variables \(X\) and \(Y\), defined on some arbitrary region, which we denote as \(X(\Omega_{X}) \times Y(\Omega_{Y}) = [a_{1}, b_{1}] \times [a_{2}, b_{2}]\), with \((a_{1}, a_{2}) \prec (b_{1}, b_{2})\). Now, as in the univariate case, let \(\Delta{x} > 0\) and \(\Delta{y} > 0\); we then define the probability of \((X, Y)\) falling in a rectangle \([x, x + \Delta{x}] \times [y, y + \Delta{y}] \subseteq [a_{1}, b_{1}] \times [a_{2}, b_{2}]\) as,
\[\begin{align} \mathbb{P}[x \le X \le x + \Delta{x}, y \le Y \le y + \Delta{y}] = \int_{x}^{x + \Delta{x}} \int_{y}^{y + \Delta{y}} f_{X, Y}(t_{1}, t_{2}) dt_{2} dt_{1} \end{align}\]We want to show that this is approximately \(f_{X,Y}(x, y) \Delta{x} \Delta{y}\). Let’s first consider the simplest case, where \(X\) and \(Y\) are independent (they need not be identically distributed); then
\[\begin{align} \int_{x}^{x + \Delta{x}} \int_{y}^{y + \Delta{y}} f_{X, Y}(t_{1}, t_{2}) dt_{2} dt_{1} = \int_{x}^{x + \Delta{x}} f_{X}(t_{1}) dt_{1} \int_{y}^{y + \Delta{y}} f_{Y}(t_{2}) dt_{2} \end{align}\] \[\begin{align} \approx f_{X}(x) f_{Y}(y) \Delta{x} \Delta{y} = f_{X,Y}(x, y) \Delta{x} \Delta{y} \end{align}\]We see that if the random variables are indeed independent, then the approximation holds.
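Here is a numerical illustration of the independent case, again a sketch of my own, with \(X\) and \(Y\) normals of different location and scale (independent, but not identically distributed):

```python
from scipy.stats import norm

# Independent, not identically distributed: X has mean 0, sd 1; Y has mean 1, sd 2
X, Y = norm(0, 1), norm(1, 2)
x, y, dx, dy = 0.5, 1.5, 1e-2, 1e-2

# For independent variables the rectangle probability factorizes
exact = (X.cdf(x + dx) - X.cdf(x)) * (Y.cdf(y + dy) - Y.cdf(y))

# The approximation: f_X(x) * f_Y(y) * dx * dy
approx = X.pdf(x) * Y.pdf(y) * dx * dy

print(exact, approx)  # close; the agreement improves as dx, dy -> 0
```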
In the general case we cannot rely on this factorization. The univariate case was proven using the relation between the PDF and the CDF, so it is natural to try the same approach here. We could apply the Fundamental Theorem of Calculus twice in a row, but we will instead work directly with the relation between the joint PDF and CDF:
\[\begin{align} f_{X, Y}(x, y) = \frac{\partial^{2} F_{X, Y} (x, y)}{\partial{x} \partial{y}} \end{align}\]Let us substitute this relation into our probability definition:
\[\begin{align} \int_{x}^{x + \Delta{x}} \int_{y}^{y + \Delta{y}} f_{X, Y}(t_{1}, t_{2}) dt_{2} dt_{1} = \int_{x}^{x + \Delta{x}} \int_{y}^{y + \Delta{y}} \left. \frac{\partial^{2} F_{X, Y} (x, y)}{\partial{x} \partial{y}} \right|_{(x, y) = (t_{1}, t_{2})} dt_{2} dt_{1} \end{align}\]For more compact notation we rewrite the last step, substituting \(t_{1}\) and \(t_{2}\) for \(x\) and \(y\) respectively.
\[\begin{align} \int_{x}^{x + \Delta{x}} \int_{y}^{y + \Delta{y}} \left. \frac{\partial^{2} F_{X, Y} (x, y)}{\partial{x} \partial{y}} \right|_{(x, y) = (t_{1}, t_{2})} dt_{2} dt_{1} = \int_{x}^{x + \Delta{x}} \int_{y}^{y + \Delta{y}} \frac{ \partial^{2} F_{X, Y} (t_{1}, t_{2}) }{ \partial{t_{1} } \partial{ t_{2} }} dt_{2} dt_{1} \end{align}\]We can now split the mixed derivative and proceed with the calculation as if it were the CDF of a single random variable:
\[\begin{align} \int_{x}^{x + \Delta{x}} \int_{y}^{y + \Delta{y}} \frac{ \partial^{2} F_{X, Y} (t_{1}, t_{2}) }{ \partial{t_{1} } \partial{ t_{2} }} dt_{2} dt_{1} = \int_{x}^{x + \Delta{x}} \frac{\partial}{\partial{t_{1}}} \left[ \int_{y}^{y + \Delta{y}} \frac{\partial F_{X, Y} (t_{1}, t_{2}) }{\partial{t_{2}}} dt_{2} \right] dt_{1} \end{align}\]We can take what is inside the brackets and focus on that integral alone:
\[\begin{align} \int_{y}^{y + \Delta{y}} \frac{\partial F_{X, Y} (t_{1}, t_{2}) }{\partial{t_{2}}} dt_{2}= F_{X, Y} (t_1, y + \Delta{y}) - F_{X, Y} (t_1, y) \end{align}\]This is a simple relation, especially if you think of it as a function of one variable. By the Mean Value Theorem there exists \(y^{*} \in [y, y + \Delta{y}]\) such that,
\[\begin{align} F_{X, Y} (t_1, y + \Delta{y}) - F_{X, Y} (t_1, y) = \frac{\partial F_{X, Y} (t_{1}, y^{*})}{\partial y} \Delta{y} \end{align}\]and so we can substitute this result into the brackets in the probability integral (strictly speaking, \(y^{*}\) may depend on \(t_{1}\), but since everything here is assumed smooth we suppress that dependence), and we get
\[\begin{align} \int_{x}^{x + \Delta{x}} \frac{\partial}{\partial{t_{1}}} \left[ \frac{\partial F_{X, Y} (t_{1}, y^{*})}{\partial y} \Delta{y} \right] dt_{1} = \Delta{y} \frac{\partial}{\partial y} \int_{x}^{x + \Delta{x}} \frac{\partial F_{X, Y} (t_{1}, y^{*})}{\partial{t_{1}}} dt_{1} \end{align}\]The last step is due to the symmetry of second derivatives, which comes from Clairaut’s theorem. The integral we are left with has the same form as the one we have already solved, and so,
\[\begin{align} \int_{x}^{x + \Delta{x}} \frac{\partial F_{X, Y} (t_{1}, y^{*})}{\partial{t_{1}}} dt_{1} = F_{X, Y} (x + \Delta{x}, y^{*}) - F_{X, Y} (x, y^{*}) = \frac{\partial F_{X, Y} (x^{*}, y^{*})}{\partial x} \Delta{x} \end{align}\]Substituting this into the previous expression we get,
\[\begin{align} \Delta{y} \frac{\partial}{\partial y} \int_{x}^{x + \Delta{x}} \frac{\partial F_{X, Y} (t_{1}, y^{*})}{\partial{t_{1}}} dt_{1} = \frac{\partial^{2} F_{X, Y} (x^{*}, y^{*})}{\partial x \partial y} \Delta{x} \Delta{y} \end{align}\]And finally, by using the relation between PDF and CDF we can write,
\[\begin{align} \frac{\partial^{2} F_{X, Y} (x^{*}, y^{*})}{\partial x \partial y} \Delta{x} \Delta{y} = f_{X, Y} (x^{*}, y^{*}) \Delta{x} \Delta{y} \end{align}\]As in the univariate case, we don’t know \(x^{*}\) and \(y^{*}\), so the best we can do is approximate. That gives us the result we wanted,
\[\begin{align} \mathbb{P}[x \le X \le x + \Delta{x}, y \le Y \le y + \Delta{y}] \approx f_{X, Y} (x, y) \Delta{x} \Delta{y} \end{align}\]
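To see that independence is not needed for this result, here is one more sketch of my own: a correlated bivariate normal, with the exact rectangle probability computed by numerically integrating the joint PDF (all parameter choices below are arbitrary):

```python
from scipy.integrate import dblquad
from scipy.stats import multivariate_normal

# Correlated bivariate normal: rho = 0.7, so X and Y are NOT independent
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.7], [0.7, 1.0]])
x, y, dx, dy = 0.3, -0.2, 1e-2, 1e-2

# Exact rectangle probability: the double integral of the joint PDF,
# inner variable t2 over [y, y + dy], outer variable t1 over [x, x + dx]
exact, _ = dblquad(lambda t2, t1: mvn.pdf([t1, t2]),
                   x, x + dx, lambda t1: y, lambda t1: y + dy)

# The approximation: f_{X,Y}(x, y) * dx * dy
approx = mvn.pdf([x, y]) * dx * dy

print(exact, approx)  # close; the agreement improves as dx, dy -> 0
```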
4. Random vectors
We now consider a random vector \(\mathbb{X} \in \mathbb{R}^{n}\), defined on an arbitrary hyperrectangle \(\mathbb{X} (\Omega) = \prod_{i=1}^{n} [a_{i}, b_{i}]\) with \(\textbf{a} \prec \textbf{b}\), where \(\textbf{a} = [a_{i}]_{i=1}^{n}\) and \(\textbf{b} = [b_{i}]_{i=1}^{n}\). The simplest way of working with such cases is to use vector notation; let us introduce it now to make the further discussion and calculations clear.
We let \(\textbf{x} = [x_{i}]_{i=1}^{n}\) be a vector in \(\mathbb{R}^{n}\) denoting a particular realization of the random vector. Let \(\Delta{\textbf{x}} = [\Delta{x}_{i}]_{i=1}^{n}\), and write \(\textbf{dt} = dt_{n} \dotsi dt_{1}\) as well as \(\partial \textbf{t} = \partial t_{1} \dotsi \partial t_{n}\). Don’t confuse the symbol \(\frac{\partial^{n}}{\partial \textbf{t}}\) with the gradient \(\nabla_{\textbf{t}}\).
With this simple notation out of the way, the probability of \(\mathbb{X}\) falling into a region \(\prod_{i=1}^{n} [x_{i}, x_{i} + \Delta{x}_{i}] \subseteq \prod_{i=1}^{n} [a_{i}, b_{i}]\) for \(\Delta \textbf{x} \succ \textbf{0}\) is
\[\begin{align} \mathbb{P} [\textbf{x} \le \mathbb{X} \le \textbf{x} + \Delta \textbf{x}] = \int_{\textbf{x}}^{\textbf{x} + \Delta \textbf{x}} f_{\mathbb{X}}(\textbf{t}) \textbf{dt} \end{align}\]How can we deal with this? We have shown that the approximation holds in the univariate and bivariate cases, so by mathematical induction, if we assume it holds for \(k\) variables and show that it then holds for \(k + 1\), it must hold for any \(n\). So, we assume that,
\[\begin{align} \mathbb{P} [\textbf{x} \le \mathbb{X} \le \textbf{x} + \Delta \textbf{x}] = f_{\mathbb{X}}(\textbf{x}^{*}) \prod_{i=1}^{k} \Delta{x}_{i} \approx f_{\mathbb{X}}(\textbf{x}) \prod_{i=1}^{k} \Delta{x}_{i} \end{align}\]We now want to show that this also holds for \(k + 1\) variables:
\[\begin{align} \int_{\textbf{x}}^{\textbf{x} + \Delta \textbf{x}} \int_{x_{k+1}}^{x_{k+1} + \Delta{x}_{k+1}} f_{\mathbb{X}}(\textbf{t}, t_{k+1}) dt_{k+1} \textbf{dt} = \int_{\textbf{x}}^{\textbf{x} + \Delta \textbf{x}} \int_{x_{k+1}}^{x_{k+1} + \Delta{x}_{k+1}} \frac{\partial^{k}}{\partial \textbf{t}} \frac{\partial F_{\mathbb{X}} (\textbf{t}, t_{k+1})}{\partial t_{k+1}} dt_{k+1} \textbf{dt} \end{align}\]By Clairaut’s and Fubini’s theorems the order of differentiation and integration does not matter, so we can swap it and use our assumption for the case of \(k\) variables.
\[\begin{align} \int_{\textbf{x}}^{\textbf{x} + \Delta \textbf{x}} \int_{x_{k+1}}^{x_{k+1} + \Delta{x}_{k+1}} \frac{\partial^{k}}{\partial \textbf{t}} \frac{\partial F_{\mathbb{X}} (\textbf{t}, t_{k+1})}{\partial t_{k+1}} dt_{k+1} \textbf{dt} \end{align}\] \[\begin{align} = \int_{x_{k+1}}^{x_{k+1} + \Delta{x}_{k+1}} \frac{\partial}{\partial t_{k+1}} \left[ \int_{\textbf{x}}^{\textbf{x} + \Delta \textbf{x}} \frac{\partial^{k} F_{\mathbb{X}} (\textbf{t}, t_{k+1})}{\partial \textbf{t}} \textbf{dt} \right] dt_{k+1} \end{align}\]Using the result from the bivariate case we can solve the inner integral, obtaining
\[\begin{align} \prod_{i=1}^{k} \left( \Delta{x}_{i} \right) \frac{\partial^{k}}{\partial \textbf{x}} \int_{x_{k+1}}^{x_{k+1} + \Delta{x}_{k+1}} \frac{\partial F_{\mathbb{X}} (\textbf{x}^{*}, t_{k+1})}{\partial t_{k+1}} dt_{k+1} \end{align}\]We can write this in a single step because, once our assumption for \(k\) variables holds, we can take a shortcut using the relation already obtained in the bivariate case:
\[\begin{align} \int_{x}^{x + \Delta{x}} \frac{\partial}{\partial{t_{1}}} \left[ \int_{y}^{y + \Delta{y}} \frac{\partial F_{X, Y} (t_{1}, t_{2}) }{\partial{t_{2}}} dt_{2} \right] dt_{1} \to \Delta{y} \frac{\partial}{\partial y} \int_{x}^{x + \Delta{x}} \frac{\partial F_{X, Y} (t_{1}, y^{*})}{\partial{t_{1}}} dt_{1} \end{align}\]This is exactly what we have done here, but with \(k\) variables instead of just one. We are one integration away from completing the proof, so let’s do it:
\[\begin{align} \prod_{i=1}^{k} \left( \Delta{x}_{i} \right) \frac{\partial^{k}}{\partial \textbf{x}} \int_{x_{k+1}}^{x_{k+1} + \Delta{x}_{k+1}} \frac{\partial F_{\mathbb{X}} (\textbf{x}^{*}, t_{k+1})}{\partial t_{k+1}} dt_{k+1} \end{align}\] \[\begin{align} = \prod_{i=1}^{k} \left( \Delta{x}_{i} \right) \frac{\partial^{k}}{\partial \textbf{x}} \left( F_{\mathbb{X}} (\textbf{x}^{*}, x_{k+1} + \Delta{x}_{k+1}) - F_{\mathbb{X}} (\textbf{x}^{*}, x_{k+1}) \right) \end{align}\] \[\begin{align} = \prod_{i=1}^{k} \left( \Delta{x}_{i} \right) \frac{\partial^{k}}{\partial \textbf{x}} \left( \frac{\partial F_{\mathbb{X}} (\textbf{x}^{*}, x_{k+1}^{*})}{\partial x_{k+1}} \Delta{x}_{k+1} \right) \end{align}\] \[\begin{align} = \frac{\partial^{k+1} F_{\mathbb{X}} (\textbf{x}^{*}, x_{k+1}^{*})}{\partial \textbf{x} \partial x_{k+1}} \prod_{i=1}^{k+1} \left( \Delta{x}_{i} \right) \end{align}\] \[\begin{align} = f_{\mathbb{X}} (\textbf{x}^{*}, x_{k+1}^{*}) \prod_{i=1}^{k+1} \left( \Delta{x}_{i} \right) \end{align}\] \[\begin{align} \approx f_{\mathbb{X}} (\textbf{x}, x_{k+1}) \prod_{i=1}^{k+1} \left( \Delta{x}_{i} \right) \end{align}\]This completes the proof.
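As a closing sanity check, here is one last sketch of my own for \(n = 3\). The box probability is recovered from the joint CDF by inclusion-exclusion over the \(2^{n}\) corners of the box, which is the discrete analogue of the mixed partial \(\frac{\partial^{n} F_{\mathbb{X}}}{\partial \textbf{t}}\), and is then compared against \(f_{\mathbb{X}}(\textbf{x}) \prod_{i=1}^{n} \Delta{x}_{i}\). The covariance matrix and the box are arbitrary choices:

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

# A 3-dimensional normal with non-trivial correlations (not independent)
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
mvn = multivariate_normal(mean=np.zeros(3), cov=cov)

x = np.array([0.2, -0.1, 0.4])
dx = np.array([0.2, 0.2, 0.2])
n = len(x)

# Box probability by inclusion-exclusion over the 2^n corners:
# P = sum over s in {0,1}^n of (-1)^(n - |s|) * F(x + s * dx)
exact = sum((-1) ** (n - sum(s)) * mvn.cdf(x + np.array(s) * dx)
            for s in itertools.product((0, 1), repeat=n))

# The approximation: f(x) * prod_i dx_i
approx = mvn.pdf(x) * np.prod(dx)

print(exact, approx)  # close, up to the CDF quadrature tolerance
```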