The conditional expectation is a property associated with a random variable that tells you its average value given that some other event has already occurred. Before we can discuss this we need to cover some basic definitions first. As always we begin with a probability space (\Omega,\mathcal{F},\mathbb{P}) consisting of the sample space of all outcomes \Omega , the sigma-algebra of possible/measurable events \mathcal{F} , and a probability measure \mathbb{P} that assigns a real number from the unit interval [0,1] to each event E from the sigma-algebra. This is elementary stuff. Next we introduce the conditional probability measure as a functional of some event A that has already occurred (so we require \mathbb{P}(A) > 0 ), and another event E whose probability we wish to measure. The first thing to understand is that the probability that some event E will occur can be very different from the probability that the same event will occur given that some other event has already occurred. This is because the conditional probability measure \mathbb{P}_A (as opposed to the ordinary probability measure \mathbb{P} ) effectively shrinks the sample space down to the event that has already occurred and re-scales the probabilities of the remaining to-be-determined events so that they again sum to one. It is an entirely new measure.

DEFINITION: Conditional Probability Measure
Let (\Omega,\mathcal{F},\mathbb{P}) be a probability space and let A \in \mathcal{F} be an event with \mathbb{P}(A) > 0 . Then the functional defined by

\mathbb{P}_A(E)=\mathbb{P}(E|A):=\frac{\mathbb{P}(E \cap A)}{\mathbb{P}(A)}

is called a conditional probability measure.
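To make this concrete, here is a minimal Python sketch of the definition on a finite sample space (the helper name cond_prob and the die example are my own, not from any library; exact arithmetic via fractions keeps the probabilities honest):

```python
from fractions import Fraction

def cond_prob(E, A, prob):
    """Conditional probability P(E | A) = P(E ∩ A) / P(A).

    `prob` maps each outcome of a finite sample space to its probability;
    the conditioning event A must have positive probability.
    """
    P = lambda S: sum(prob[w] for w in S)  # measure of an event
    if P(A) == 0:
        raise ValueError("conditioning event must have positive probability")
    return P(E & A) / P(A)

# Fair six-sided die: P(even | at least 4) = P({4,6}) / P({4,5,6}) = 2/3
die = {w: Fraction(1, 6) for w in range(1, 7)}
even = {2, 4, 6}
at_least_4 = {4, 5, 6}
print(cond_prob(even, at_least_4, die))  # 2/3
```

Note how conditioning changed the answer: unconditionally P(even) = 1/2, but knowing the roll is at least 4 pushes it up to 2/3.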

Partitions

The next ingredient we need is that of a partition, which is simply the act of splitting up a set \Omega into smaller subsets \{B_1,\dots,B_n\} . These smaller subsets are called blocks and they satisfy two properties: distinct blocks never overlap, which means that their pairwise intersections are empty: B_i \cap B_j = \emptyset for i \neq j ; and if you stick all of them together you end up reconstructing the whole set that you began with, which means that their union is the whole set: B_1 \cup \cdots \cup B_n = \Omega . We say that one can partition a set into a collection \mathcal{P} of smaller blocks \{B_1,\dots,B_n\} which are mutually exclusive and collectively exhaustive with respect to the set being partitioned. There is no mathematical trickery going on here: we are simply breaking up the set, which can be done in a number of different ways, but more importantly, can always be done.
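The two defining properties are easy to check mechanically. A small sketch (the function name is_partition is my own invention for illustration):

```python
def is_partition(blocks, omega):
    """Check that `blocks` are nonempty, pairwise disjoint, and cover `omega`."""
    blocks = [frozenset(b) for b in blocks]
    nonempty = all(blocks)  # no empty blocks allowed
    disjoint = all(not (b1 & b2)
                   for i, b1 in enumerate(blocks) for b2 in blocks[i + 1:])
    exhaustive = frozenset().union(*blocks) == frozenset(omega)
    return nonempty and disjoint and exhaustive

omega = {1, 2, 3, 4, 5, 6}
print(is_partition([{1, 3, 5}, {2, 4, 6}], omega))       # True
print(is_partition([{1, 2}, {2, 3}, {4, 5, 6}], omega))  # False: blocks overlap
```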

Now why are we interested in partitioning a set? Well, if the set to be partitioned is the sample space \Omega then we should be very interested. Think about what a random variable does: it maps outcomes (points of the sample space) to the real numbers. Now think about the inverse image of a random variable: it maps a real number back to the subset of the sample space consisting of all outcomes sent to that number. Thinking about that some more, we can imagine that, depending on the mapping, many different outcomes may get sent to the same real number, so you have various subsets of the sample space each labelled by a single number via the inverse image. If you cut up the sample space by these numbers you are partitioning it! Kind of like painting by numbers. Said another way: a random variable naturally partitions the sample space. Let’s see exactly why this is the case.

Refinements

Okay, so we are focusing on discrete probability theory here, and in particular discrete random variables. We have at our disposal the idea of partitioning a set, or the sample space, into a collection \mathcal{P} of blocks. We’ve also hinted that in some way, a random variable partitions the sample space as well. Before we can prove that we need to introduce a very special kind of partition: a refinement.

DEFINITION: Refinement
Let \mathcal{P}=\{B_1,\dots,B_n\} be a partition of a set \Omega . Then a partition \mathcal{R}=\{B_1^{\prime},\dots,B_m^{\prime}\} is called a refinement of \mathcal{P} if each block B_j^{\prime} of \mathcal{R} is contained in some block B_i of \mathcal{P} ; equivalently, each block B_i of \mathcal{P} is the union of the blocks of \mathcal{R} it contains: B_i = \bigcup_{B_j^{\prime} \subseteq B_i} B_j^{\prime} . We denote such a refinement by \mathcal{P} \prec \mathcal{R} .
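The containment condition translates directly into code. A sketch (the helper name refines is hypothetical, chosen to match the \mathcal{P} \prec \mathcal{R} notation):

```python
def refines(R, P):
    """True if every block of R is contained in some block of P, i.e. P ≺ R."""
    return all(any(b.issubset(B) for B in P) for b in R)

P = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
R = [frozenset({1}), frozenset({2, 3}), frozenset({4, 5, 6})]
print(refines(R, P))  # True: R splits {1,2,3} further, keeps {4,5,6}
print(refines(P, R))  # False: {1,2,3} fits inside no block of R
```

Note that every partition refines itself, so \prec here is not strict.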

We will need this concept of a refinement when it comes time to prove the connection between the partition a random variable naturally induces on the sample space and its measurability.

Random Variables and Natural Partitions

Consider a (discrete) random variable X ; recall that it maps outcomes of the sample space to the real numbers. Think of all the real numbers x_i that the random variable might map to. Collect these together and form a set \{x_1,\dots,x_n\} ; this is called the image of the random variable – it is the collection of all the points where the random variable sends outcomes. We denote this by \Im(X) = \{x_1,\dots,x_n\} . Now, the inverse image X^{-1}(x_i) goes backward: it is the subset E of the sample space \Omega consisting of every outcome that X sends to x_i . If we know the random variable then we know its image \Im(X) , and all we need to do is follow each real number back via the inverse image and see which subset of the sample space we end up in. Then, as you would with paint-by-numbers, we partition the sample space using those numbers. That is, we naturally obtain a partition \mathcal{P}_X of the sample space from the inverse images of the elements of \Im(X) . In symbols: \mathcal{P}_X:=\{X^{-1}(x)\,|\,x\in \Im(X)\} .
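A small sketch of \mathcal{P}_X in Python, grouping outcomes by the value X assigns them (the function name natural_partition and the parity example are mine):

```python
def natural_partition(X, omega):
    """The partition P_X of omega into level sets X^{-1}(x), keyed by x in Im(X)."""
    blocks = {}
    for w in omega:
        blocks.setdefault(X(w), set()).add(w)  # group outcomes by their X-value
    return {x: frozenset(b) for x, b in blocks.items()}

# Example random variable: the parity of a die roll
omega = {1, 2, 3, 4, 5, 6}
X = lambda w: w % 2
parts = natural_partition(X, omega)
print(parts[1] == {1, 3, 5}, parts[0] == {2, 4, 6})  # True True
```

The dictionary keys are exactly \Im(X) and the values are the blocks of \mathcal{P}_X .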

Measurable Random Variables

If you begin with a probability space (a sample space, a sigma-algebra of events, and a probability measure) and introduce a random variable as a mapping from the sample space to the real numbers, what does it mean to say that the random variable is measurable? After all, the measure that comes equipped with the probability space measures probabilities of events, not random variables. But, with the help of partitions and refinements, we can make sense of measuring a random variable. Here’s how we do it:

We take a probability space and any random variable X . We then partition the sample space \Omega into \mathcal{P}_X , the partition induced naturally by the random variable. Since each block of the partition is the inverse image of a single real number, we know for sure that the random variable is constant on each block. In other words, the random variable is constant on each block precisely because we built the blocks from the inverse images. If we used some coarser partition then it might no longer be constant. Now we define it:

DEFINITION: Random Variable Measurability
Let \mathcal{P} be a partition of the sample space \Omega . A random variable X is said to be \mathcal{P}-measurable if the random variable is constant on each block B_i of the partition \mathcal{P}.

Could the same random variable be considered measurable with respect to some other partition, and not just the natural one? Maybe. It depends on whether or not that partition is a refinement of the natural partition.

THEOREM
A random variable X is \mathcal{R}-measurable if and only if \mathcal{R} is a refinement of its natural partition \mathcal{P}_X .
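We can see both directions of the theorem at work numerically. A sketch (is_measurable is a made-up helper that checks the constancy condition from the definition):

```python
def is_measurable(X, partition):
    """X is P-measurable iff X takes a single value on each block of the partition."""
    return all(len({X(w) for w in block}) == 1 for block in partition)

X = lambda w: w % 2                  # parity of a die roll
natural = [{1, 3, 5}, {2, 4, 6}]     # the natural partition P_X
finer = [{1}, {3, 5}, {2, 4, 6}]     # a refinement of P_X
coarser = [{1, 2, 3}, {4, 5, 6}]     # not a refinement: mixes parities
print(is_measurable(X, natural))   # True
print(is_measurable(X, finer))     # True  (refinements work too)
print(is_measurable(X, coarser))   # False (X is not constant on {1,2,3})
```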

What About Infinite Sample Spaces?

We have seen so far that for finite sample spaces, partitions are intimately connected with random variables via their inverse images. As it happens, the notion of a partition does not generalise readily to infinite sample spaces. Instead we use an algebra.

Recall that an algebra \mathcal{A} is a collection of subsets of some larger set \Omega such that the empty set is in the collection, and the collection is closed under complements and under finite unions. Now, just as a random variable can naturally define a partition of the sample space, it can also define a natural algebra \mathcal{A}_X consisting, again, of inverse images (together with everything you can build from them by complements and unions). Now let us re-define the measurability of a random variable with respect to an algebra instead of a partition. It comes as no surprise that for a random variable to be algebra-measurable (instead of partition-measurable) we want each of its level sets \{X=c\} to be an event in the algebra.

DEFINITION: Measurability of Random Variables
Let X be a random variable on a sample space \Omega . Let \mathcal{A} be some algebra of subsets of \Omega . Then the random variable X is \mathcal{A} -measurable if each level set \{X=c\} belongs to the algebra, or:

\{X=c\}\in\mathcal{A},\,\forall\,c\in \Im(X)

Continuing in the same fashion as before, we can say that a random variable X is measurable with respect to an algebra \mathcal{A} so long as the algebra contains (not a refinement, but the algebra analogue of one) the algebra naturally induced by the random variable: \mathcal{A}_X \subseteq \mathcal{A} .
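For a finite sample space the natural algebra \mathcal{A}_X is just all unions of blocks of \mathcal{P}_X , so we can sketch the level-set criterion directly (both helper names here are my own):

```python
from itertools import chain, combinations

def generated_algebra(partition):
    """The finite algebra generated by a partition: all unions of its blocks."""
    blocks = [frozenset(b) for b in partition]
    selections = chain.from_iterable(
        combinations(blocks, k) for k in range(len(blocks) + 1))
    return {frozenset().union(*s) for s in selections}

def is_algebra_measurable(X, omega, algebra):
    """X is A-measurable iff every level set {X = c} belongs to the algebra."""
    return all(frozenset(w for w in omega if X(w) == c) in algebra
               for c in {X(w) for w in omega})

omega = {1, 2, 3, 4, 5, 6}
X = lambda w: w % 2                               # parity of a die roll
A_X = generated_algebra([{1, 3, 5}, {2, 4, 6}])   # natural algebra of X
trivial = {frozenset(), frozenset(omega)}         # smallest possible algebra
print(is_algebra_measurable(X, omega, A_X))      # True
print(is_algebra_measurable(X, omega, trivial))  # False: level sets missing
```

Any algebra containing \mathcal{A}_X (for instance one generated by a finer partition) also passes the check, matching the inclusion \mathcal{A}_X \subseteq \mathcal{A} .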

Finally, Conditional Expectation!

We are now ready to formulate the definition of conditional expectation: we just take the ordinary expectation of a random variable X , but now we do it using the conditional probability measure \mathbb{P}_A , where A is some event that has already occurred. Thus,

DEFINITION: Conditional Expectation
Let (\Omega, \mathcal{F},\mathbb{P}) be a probability space and let A be an event with \mathbb{P}(A) > 0 . Then the conditional expectation of a random variable X given A , written \mathbb{E}^{\mathbb{P}}[X|A] (taken with respect to the ordinary probability measure \mathbb{P} that comes equipped with the probability space), equals the expectation, taken with respect to the new conditional probability measure \mathbb{P}_A , of just the random variable with no conditioning. In symbols:

\mathbb{E}^{\mathbb{P}}[X|A] = \mathbb{E}^{\mathbb{P}_A}[X]
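A small sketch of the right-hand side on a finite sample space (cond_expectation is my own helper name; \mathbb{P}_A puts weight \mathbb{P}(\{w\})/\mathbb{P}(A) on each outcome w of A and zero elsewhere):

```python
from fractions import Fraction

def cond_expectation(X, A, prob):
    """E^{P_A}[X]: the expectation of X under the conditional measure P_A."""
    PA = sum(prob[w] for w in A)
    if PA == 0:
        raise ValueError("conditioning event must have positive probability")
    # outcomes outside A have P_A-weight zero, so the sum runs over A only
    return sum(X(w) * prob[w] for w in A) / PA

die = {w: Fraction(1, 6) for w in range(1, 7)}
X = lambda w: w          # the face value itself
at_least_4 = {4, 5, 6}
print(cond_expectation(X, at_least_4, die))  # 5
```

Compare with the unconditional expectation of a fair die, 7/2: conditioning on "at least 4" pulls the average up to (4+5+6)/3 = 5.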

If we use the indicator function \mathbb{I}_A , which returns 1 when it eats an outcome belonging to A and a 0 when it eats one which does not, and we multiply it with the random variable X , we get the product X\mathbb{I}_A , which essentially forces the random variable to only act on outcomes in the specified event A (the result is otherwise zero, which contributes nothing to the expectation). Thus the product X\mathbb{I}_A can be loosely viewed in terms of set operations as X \cap A – this is clear abuse of notation as X is not a set, hence why we need the indicator function to do the same job as set intersection. Using the ordinary probability measure \mathbb{P} , we can take the ordinary expectation of this, \mathbb{E}^{\mathbb{P}}[X\mathbb{I}_A] , and then divide this number (remember that the expectation operator \mathbb{E} is a functional which assigns real numbers to random variables) by the probability of the event A occurring, and we get another representation of the conditional expectation that really lines up with the representation of the conditional probability measure introduced at the beginning of this article:

LEMMA: Conditional Expectation
Given a probability space, a random variable X and an event A with \mathbb{P}(A) > 0 , the conditional expectation of X with respect to the event A can be expressed as

\mathbb{E}^{\mathbb{P}}[X|A] = \frac{\mathbb{E}^{\mathbb{P}}[X\mathbb{I}_A]}{\mathbb{P}(A)}
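We can verify numerically that the two representations agree, again on the fair-die example (all names here are illustrative, not from any library):

```python
from fractions import Fraction

die = {w: Fraction(1, 6) for w in range(1, 7)}  # fair six-sided die
X = lambda w: w
A = {4, 5, 6}

P_A = sum(die[w] for w in A)
# Definition: expectation of X under the conditional measure P_A
lhs = sum(X(w) * die[w] / P_A for w in A)
# Lemma: E^P[X·1_A] / P(A), using the ordinary measure P over all of omega
indicator = lambda w: 1 if w in A else 0
rhs = sum(X(w) * indicator(w) * die[w] for w in die) / P_A
print(lhs == rhs, lhs)  # True 5
```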