Calculating the average, or most typical value, of something has always been an interesting question in mathematics because it is a nice and useful way to represent a wide range of objects as one single number: in many ways, the individual that best represents the whole. Of course, this is not the only interesting way to represent a wide range of things as one, but it’s the one I’m going to talk about in this article. Computing the average of a bunch of things means first converting those things into numbers (using a random variable is a good way to do this, but any mapping from your abstract set to the real numbers will do), adding them up (you may need an integral here if the regular old sum can’t cut it), counting the number of things, and then dividing the sum by the count. In essence you are doing this:

\displaystyle\textup{Average}=\frac{x_1+x_2+\cdots+x_n}{n}

where each x_i is the number representing an object (or just a number in its own right), and n is the number of objects.
Let’s look at an example: suppose you roll a fair 6-sided die with the letters A through F on its faces. What is the average letter rolled?
Wait a second, you can’t add letters up!
Okay, so we need a map that takes letters to numbers. It is important to understand that we can build such a map in many, many ways. The easiest, and the most uniform (so that we don’t introduce bias), is to map A to 1, B to 2, C to 3, D to 4, E to 5 and F to 6 (note the equal spacing between the numbers!). Let’s call this map X . This map actually satisfies all the awesome properties of being a random variable, so that’s cool to know. Now we add up all the numbers we used, 1+2+3+4+5+6=21 , and divide by how many there are, which is six, to get 3.5 – and this is the image of the average. To get the actual average we need to go back, using the inverse map, from numbers to letters. Unfortunately, due to our choice of map there is no inverse image for 3.5 – after all, there is no C-and-a-half! However, 3.5 is exactly halfway between three and four, so we’ll say that our average letter is half the time C and half the time D.

To illustrate that there are many ways to map abstract objects to numbers and still find an average, let’s choose a different map, say: A to 2, B to 4, C to 6, D to 8, E to 10 and F to 12. The sum is forty-two; divide it by six and we get 7 as our average. Once again there is no inverse image for seven, but it is exactly halfway between six and eight, which do have inverse images. Thus we again arrive halfway between the letters C and D.
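To make the two maps concrete, here is a small Python sketch that just reproduces the arithmetic above; the variable names are purely illustrative.

```python
# The two letter-to-number maps from the example above; the names are just for illustration.
letters = ["A", "B", "C", "D", "E", "F"]

def average_under(mapping):
    """Send each letter through the map, then average the resulting numbers."""
    values = [mapping[letter] for letter in letters]
    return sum(values) / len(values)

map_one = {letter: i + 1 for i, letter in enumerate(letters)}        # A->1, ..., F->6
map_two = {letter: 2 * (i + 1) for i, letter in enumerate(letters)}  # A->2, ..., F->12

print(average_under(map_one))  # 3.5 -- halfway between C (3) and D (4)
print(average_under(map_two))  # 7.0 -- halfway between C (6) and D (8)
```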

These examples illustrate a very common procedure that mathematicians use all the time when computing averages and expectations. We usually have some sort of experiment or abstract set, usually of things that aren’t necessarily numbers (they might be, for example, heads and tails, or letters, or people’s names, cities, etc…). We then have to find a way of mapping those objects to the real numbers. We can then add, subtract, divide and multiply those numbers, and afterwards map the result back, using the inverse map, to an abstract object.

From Averages to Expectations

So how do we move from taking the average to taking the expectation of something? Well, unlike the average – which is a property of the collection of things – the expectation is a property of the mapping we used to turn the objects into numbers. In our first example above, the average was C half of the time and D the other half, whereas the expectation was 3.5. That’s the difference. Before you can start calculating expectations you need a map, in particular a random variable.

Not only is the expectation a property of a random variable, it is also what’s known as an operator. If the random variable is denoted by X , then its expectation is written \mathbb{E}[X] , in good old-fashioned operator notation (note the square brackets, indicating that the expectation is an operator and not a function). Before we can introduce the definition of the expectation operator on a random variable, we need one more ingredient: the probability.

Recall that we rolled a fair six-sided die with the letters A through F. Within the word fair lies a rather large assumption: we assume that the probability of rolling any side of the die is exactly equal to that of rolling any other. For a six-sided die that’s one divided by six, or a one-sixth chance of rolling any one side. A coin would have one divided by two, or a one-half chance of flipping either side. The probability for each possible outcome of rolling the die or flipping the coin is said to be assigned uniformly, which basically means we can neglect it in our calculation of the expectation. But from now on we won’t. We will re-write our formula for the expectation by pulling the count n up into the numerator:

\mathbb{E}[X] := \displaystyle\frac{x_1+x_2+\cdots+x_n}{n}= x_1\frac{1}{n}+x_2\frac{1}{n}+\cdots+x_n\frac{1}{n}

Now we claim that each fraction \frac{1}{n} is precisely the probability, and we write for this case that the probability \mathbb{P}(x_i) for each side of the die x_i (where i ranges from one to six) is equal to one divided by the count of sides, or \frac{1}{n} .
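Here is a quick numerical sanity check of that claim for the fair die, assuming the first letter-to-number map from earlier: the plain average and the probability-weighted sum come out to the same number.

```python
# Fair-die sanity check, assuming the map A->1, ..., F->6.
values = [1, 2, 3, 4, 5, 6]
n = len(values)
p_uniform = 1 / n                                  # P(x_i) = 1/n for every side

plain_average = sum(values) / n                    # (x_1 + ... + x_n) / n
weighted_sum = sum(x * p_uniform for x in values)  # x_1*(1/n) + ... + x_n*(1/n)

print(plain_average, weighted_sum)                 # both are 3.5 (up to rounding)
```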

Unfortunately, this only works when the assignment of probabilities is uniform. In most cases we won’t have a nice fraction like \frac{1}{n} to assign to each outcome, because our experiments won’t always be fair. Instead we need a general function to assign the probabilities. This is where we make use of a probability measure, denoted by \mathbb{P} . The simple add-and-divide-by-n form no longer applies, and our formula for the expectation becomes:

\displaystyle \mathbb{E}[X] := x_1\mathbb{P}(x_1)+x_2\mathbb{P}(x_2)+\cdots+x_n\mathbb{P}(x_n)
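To see why the general formula matters, here is a sketch with made-up probabilities for an unfair die that favours F; the weights below are purely illustrative and just have to sum to one.

```python
# A hypothetical unfair die that favours F, still using the map A->1, ..., F->6.
values        = [1,    2,    3,    4,    5,    6]
probabilities = [0.10, 0.10, 0.10, 0.10, 0.10, 0.50]   # made-up weights, summing to 1

expectation = sum(x * p for x, p in zip(values, probabilities))
print(expectation)  # about 4.5 -- pulled toward F, compared with 3.5 for the fair die
```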

And in the case of a countably infinite number of outcomes and probabilities we get

\boxed{\mathbb{E}[X] := \displaystyle\sum\limits_{i=1}^{\infty} x_i\mathbb{P}(x_i)}
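To see the infinite sum in action, here is a small sketch built on a classic example: let X be the number of fair-coin flips needed to get the first head, so that x_i = i and \mathbb{P}(x_i) = (1/2)^i , and the expectation works out to exactly 2.

```python
# X = number of fair-coin flips until the first head, so x_i = i and
# P(x_i) = (1/2)**i for i = 1, 2, 3, ...  The infinite sum equals 2 exactly.
partial_sum = 0.0
for i in range(1, 60):          # truncate the infinite series after 59 terms
    partial_sum += i * 0.5 ** i
print(partial_sum)              # roughly 2.0
```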

Exploding Expectations

Expectations can explode! Suppose you take the absolute value of all the numbers being added up in the definition of the expectation, which gives the expectation of |X| :

\mathbb{E}[|X|] = \displaystyle\sum\limits_{i=1}^{\infty} |x_i|\mathbb{P}(x_i)

This makes all the summands non-negative and removes any scenario where positive terms cancel with negative ones. If this infinite series adds up to infinity, we say that the expectation does not exist. Said another way, the expectation fails to exist because the series does not converge absolutely.

So in most cases we want:

\mathbb{E}[|X|] = \displaystyle\sum\limits_{i=1}^{\infty} |x_i|\mathbb{P}(x_i) < \infty
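A classic way to watch an expectation explode is the St. Petersburg-style setup, where x_i = 2^i with \mathbb{P}(x_i) = (1/2)^i : every summand |x_i|\mathbb{P}(x_i) equals 1, so the partial sums grow without bound. A short sketch:

```python
# St. Petersburg-style values: x_i = 2**i with P(x_i) = (1/2)**i.
# Every term |x_i| * P(x_i) equals 1, so the partial sums never settle down.
for n_terms in (10, 100, 1000):
    partial = sum((2 ** i) * (0.5 ** i) for i in range(1, n_terms + 1))
    print(n_terms, partial)     # 10.0, then 100.0, then 1000.0 -- growing without bound
```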

Continuous Expectations

Let’s go back to the random variable X . Suppose the probability is no longer attached to a list of separate outcomes but is spread out over a whole continuum of values x . If there is a function f(x) that describes how that probability is spread out, then f(x) is called the probability density of the random variable X . Since f is defined for every number x in the continuum, we can convert the infinite sum into a Riemann integral over the whole real line, with f(x)\mbox{d}x playing the role that the probabilities \mathbb{P}(x_i) played in the sum, i.e.:

\displaystyle \boxed{\mathbb{E}[X] := \displaystyle\int_{-\infty}^{+\infty} xf(x)\mbox{d}x}
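As a rough numerical illustration, here is a sketch that approximates the boxed integral with a plain Riemann sum, assuming the exponential density f(x) = e^{-x} for x \ge 0 (and zero otherwise), whose mean is known to be 1.

```python
import math

# Assumed density: f(x) = e^{-x} for x >= 0 and 0 otherwise (the exponential
# distribution), whose true mean is 1.  We chop [0, 50) into thin slices; the
# tail beyond 50 contributes essentially nothing.
dx = 0.001
total = 0.0
for k in range(50_000):
    x = k * dx
    total += x * math.exp(-x) * dx   # the summand x * f(x) dx from the boxed formula

print(total)                         # roughly 1.0
```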