In this blog we are going to start working on a brand-new Python library, and the first thing we will do is create probability distribution objects.

Now, scipy.stats does an excellent job of providing such probability distributions as class objects, and we will lean heavily on this Python package; however, we will design our own (wrapper) classes on top of it to satisfy our particular needs.

Distributions

The first Python module we are going to need is one that contains all of our probability distributions as Python objects. So let’s open up the PyCharm IDE, create a new Python file, and call it distributions.py.

In this file we are going to need numpy and scipy, so we are going to install these Python packages and import them:

Fig 1 – The beginnings of a Probability Distribution Python Module.
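Since Fig 1 is a screenshot, here is a minimal sketch of what distributions.py might contain at this point (the import aliases are my assumption):

```python
# distributions.py
import numpy as np           # used later for sampling
import scipy.stats as stats  # used later for the analytic functions
```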

They are greyed out because we haven’t written any code yet that uses them! Let us now create the skeleton of an abstract base class for probability distributions:

Fig 2. The beginnings of an abstract base class for probability distributions.

The first method this class will have is the ability to return to the user a (univariate) probability density function f(x). We are not yet sure how to implement this method, so we will leave it as a stub for now:

Fig 3. The Beginnings of the class’s first method: the ability for it to return a probability density function.
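A sketch of what Figs 2 and 3 might look like together; the class name ProbDist is taken from later in the post, and the docstrings are my own:

```python
class ProbDist:
    """Abstract base class for probability distributions."""

    def pdf(self, x):
        """Return the probability density function f(x) evaluated at x.

        We are not yet sure how to implement this, so it stays a stub.
        """
        pass
```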

What else can we do?

Well, we know that a Normal probability distribution should be defined here as well. All Normal distributions are completely specified by their first two moments, i.e. by their mean \mu and variance \sigma^2 (equivalently, the standard deviation \sigma).

We can define a skeleton derived class here like so:

Fig 4. The Beginnings of a normal probability distribution function within the distributions.py file.
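A sketch of the derived class; the argument name moments is an assumption based on the discussion below (and, as noted, it gets renamed later):

```python
class Normal(ProbDist):
    """A univariate Normal, fully specified by its first two moments."""

    def __init__(self, moments):
        # `moments` plays the role of the parameter set Theta.
        self.moments = moments
```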

This class will be constructed by passing it a set of parameters \Theta that define the probability distribution. We may change that argument name later on… spoiler: we do!

Sampling

This prompts us to add a new method to the base class: the ability to sample random variates from the distribution!

Let us call this class method rvs, for “random variates” (following scipy’s naming convention), and implement it in the base class:

Fig 5. A second class method for our distribution class: the ability for it to return a random variable sample from the distribution.

Let us defer the implementation of the random variates to the inheriting class by raising the built-in NotImplementedError exception:

Fig 6. Deferring the implementation of the Random Variable rvs class method.
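A sketch of the deferred method, assuming the base class simply raises:

```python
class ProbDist:
    def rvs(self, size=None):
        """Draw random variates from the distribution.

        The base class cannot know how to sample, so every
        subclass must override this method.
        """
        raise NotImplementedError
```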

This means that the Normal distribution class will need to implement its own rvs method:

Fig 7. The Normal Distribution implementing its own rvs method.

We will use NumPy’s random module to implement the Normal rvs method. However, it needs loc (i.e. the first parameter: the mean \mu) and scale (i.e. the second parameter: the standard deviation \sigma) arguments, as can be seen in the above figure.
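A sketch of the overriding method, assuming the loc and scale attributes that the refactor below introduces:

```python
class Normal(ProbDist):
    def rvs(self, size=None):
        # np.random.normal interprets loc as the mean and scale
        # as the standard deviation.
        return np.random.normal(loc=self.loc, scale=self.scale, size=size)
```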

This means that the ‘moments’ argument \Theta we had for the Normal class needs to provide a loc and a scale parameter of type float.

But let’s be a little more clever than that. Let us define a new class object (called TwoParameterDist) that holds any two generic parameters called loc and scale:

Fig 8 – Implementing a small data class to hold the location and scale for any 2-parameter probability distribution.
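A sketch of such a data class; the default values are my assumption:

```python
class TwoParameterDist:
    """Holds the two generic parameters of any 2-parameter distribution."""

    def __init__(self, loc=0.0, scale=1.0):
        self.loc = loc      # location (the mean, for a Normal)
        self.scale = scale  # scale (the standard deviation, for a Normal)
```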

Now, we can simply initialise the Normal class with a TwoParameterDist instance, and refer to any loc or scale as TwoParameterDist.loc or TwoParameterDist.scale.

When we instantiate a Normal ProbDist from a TwoParameterDist object, the Normal class takes its loc and scale from TwoParameterDist. Thus, we can always access data held in the TwoParameterDist object after instantiating Normal by typing self.loc and self.scale.

Fig 9 – Extracting the loc and scale data from the TwoParameterDist class (instead of passing them as data). However, we still need the size parameter.
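A sketch of the constructor this describes, assuming the parameters are copied out of the data object:

```python
class Normal(ProbDist):
    def __init__(self, moments: TwoParameterDist):
        # Extract the data rather than storing the object itself,
        # so the parameters are available as self.loc and self.scale.
        self.loc = moments.loc
        self.scale = moments.scale
```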

The rvs method of the Normal ProbDist class now returns a NumPy random.normal random variate, where:

  • loc <- the mean, \mu, and
  • scale <- the standard deviation, \sigma > 0.

The Size of a Probability Distribution

Note that we allow an optional parameter called size to potentially reshape the distribution if we need to. The size, in this sense, is the dimensionality of the distribution. For all univariate cases, the size is 1. In later blogs extending this module we will investigate multivariate probability distributions where the size is greater than 1.

This allows us to make full use of the NumPy random.normal function. The size (or dimension) parameter is not a moment, so it can’t reside in the TwoParameterDist class. Hence, we place the size further up, in the parent base class ProbDist (i.e. the dimension of a probability distribution is a property of a ProbDist):

Fig 10 – Adding the Size of a Probability Distribution to the Base Class ProbDist.
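A sketch of the base class carrying the size, with the method-style accessor the text describes:

```python
class ProbDist:
    def __init__(self, size=1):
        self._size = size  # dimensionality: 1 for every univariate case

    def size(self):
        return self._size


class Normal(ProbDist):
    def __init__(self, moments: TwoParameterDist, size=1):
        super().__init__(size)  # the dimension lives in the base class
        self.loc = moments.loc
        self.scale = moments.scale
```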

Note that since size() is implemented as a method in the base class, to access the size parameter we must call the method with the parentheses operator. The other two parameters are accessed as public, inherited class data.

The Inverse CDF

Another method we can quickly implement here without additional infrastructure is the percent point function (or inverse CDF), evaluated at a particular quantile q of the given random variable. To implement this method we only need a loc and a scale, hence:

Fig 11 – The Normal class gets an inverse CDF method!
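A sketch of the method, leaning on scipy.stats for the analytic inverse CDF:

```python
class Normal(ProbDist):
    def ppf(self, q):
        # stats.norm.ppf is the percent point function (inverse CDF);
        # it needs only the loc and scale we already hold.
        return stats.norm.ppf(q, loc=self.loc, scale=self.scale)
```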

Testing

Let us now perform a unit test of this implementation.

We create a new Python file called main.py and import our distributions module as dist:

Fig 12 – Setting up the Test Case.

Notice that I have also created a little print function, so I can print the results of calling rvs or ppf to the console.
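A sketch of the setup; the helper’s name is my own invention:

```python
# main.py
import distributions as dist


def print_result(label, value):
    """Echo the result of an rvs or ppf call to the console."""
    print(f"{label} = {value}")
```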

Next, in the usual fashion, we create a test case:

Fig 13 – The Test Case.

Line 15 shows us instantiating a new Normal probability distribution class with the initial parameters \texttt{loc} = \mu = 0 and \texttt{scale} = \sigma = 1. Together, these satisfy the signature of a TwoParameterDist class.

Line 17 then shows how we ask the instantiated distribution object for a single random variate x, by specifying \texttt{size} = n = 1.
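A sketch of what those two lines might look like (the line numbers above refer to the figure, not to this snippet):

```python
theta = dist.TwoParameterDist(loc=0.0, scale=1.0)  # mu = 0, sigma = 1
normal = dist.Normal(theta)

x = normal.rvs(size=1)  # a single random variate
print_result("x", x)
```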

The random variate we got in this case was -0.81744753.

Structured Distribution

A structured distribution (or \texttt{StructDist}) is a class that inherits from \texttt{ProbDist} and serves as a register for pairs of keys and laws.

By a \texttt{law} we mean a probability distribution, such as \texttt{Normal}.

Since \texttt{StructDist} inherits from \texttt{ProbDist}, it must implement (or override) the class methods. This is how it is initialised:

Fig 14 – The constructor for a StructDist class.
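A sketch of the constructor, assuming an OrderedDict is used to preserve the user’s ordering of labels:

```python
from collections import OrderedDict


class StructDist(ProbDist):
    """A register of (label, law) pairs, e.g. {'Y': a Normal law}."""

    def __init__(self, laws):
        self.laws = OrderedDict(laws)  # label -> probability distribution
```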

We implement this class to sit between user-defined dictionaries of random variables and their probability distribution functions. This allows the user to specify a random variable label, say Y, and a distribution, say \texttt{Normal}. Then these pairs are registered with this class.

Note that while the \texttt{StructDist} class inherits from the base class \texttt{ProbDist}, it does not take the usual distribution parameters. Instead, this class only requires a \texttt{laws} dictionary to be passed to it:

Fig 15 – Passing a laws dictionary to the constructor of the StructDist class. The OrderedDict class then copies the dictionary into self.laws.

After processing, the laws dictionary becomes part of the \texttt{StructDist} class data. Below, we can see that the law label Y is associated with a Normal probability distribution:

Fig 16 – A closer look at the StructDist class. It has successfully identified the random variable Y with a Normal Distribution.

This is a good way to associate a variable name Y with a distribution. We can extend \texttt{ProbDist} to handle even more probability distributions. But for now, we will leave just the Normal distribution implemented.

The StructDist class will also need to implement its own rvs method, except this time it is generic. That is, whenever we need a particular probability distribution, we refer to it as a law:

Fig 14b – the rvs method of the StructDist class object. This uses placeholders instead of concrete probability distribution names to achieve generality.
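A sketch of that generic rvs, delegating to whichever law is registered under each label:

```python
class StructDist(ProbDist):
    def rvs(self, size=1):
        # The concrete distribution behind each label is never named
        # here; each registered law does its own sampling.
        return {label: law.rvs(size=size) for label, law in self.laws.items()}
```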

Likewise, for the corresponding implementation of the Inverse CDF method:

Fig 14c – the ppf method of the StructDist class object. This uses placeholders instead of concrete probability distribution names to achieve generality.
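And the matching sketch for the inverse CDF:

```python
class StructDist(ProbDist):
    def ppf(self, q):
        # Same placeholder pattern: each law supplies its own inverse CDF.
        return {label: law.ppf(q) for label, law in self.laws.items()}
```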

Plotting Samples Drawn from Distributions

We instruct Python to make histogram plots of samples drawn from a probability distribution via the use of a dictionary. Here, I have labelled my random variable Y and supplied the dictionary with this label, the Normal distribution, and the 2 parameters: \texttt{loc} = \mu = 0.5 and \texttt{scale} = \sigma = 0.1:
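A sketch of such a dictionary (the variable name my_laws is my own):

```python
my_laws = {
    'Y': dist.Normal(dist.TwoParameterDist(loc=0.5, scale=0.1)),
}
```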

We then instantiate a StructDist with the above ‘law’ dictionary:
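For example:

```python
struct_dist = dist.StructDist(my_laws)
```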

Now, we run a test. Let us attempt to sample 5000 random variates Y and plot the results as a 30-bin histogram:

Fig 17 – The Unit Test for Histogram Plotting Random Variables Drawn from a Probability Distribution specified as a registered pair of labels ‘Y’ and a law ‘Normal’.
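A sketch of the test, assuming the plot helper shown further below:

```python
samples = struct_dist.rvs(size=5000)   # 5000 draws per registered law
plot_histogram(samples['Y'], bins=30)  # see the custom plot function below
```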

giving the following Seaborn plot:

Fig 18 – A Seaborn histogram plot of the drawn random samples.

This uses a custom plot function:

Fig 19 – The custom plot function to show the Seaborn histogram.

where we have imported the following:

Fig 20 – the Imports of this Unit Test.
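Putting Figs 19 and 20 together, a sketch of the helper and its imports. Note that sns.histplot is the current Seaborn API; the original code may have used an older function such as distplot:

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_histogram(samples, bins=30):
    """Show a Seaborn histogram of the drawn samples."""
    sns.histplot(samples, bins=bins)
    plt.show()
```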

Conclusion

In this blog we have successfully wrapped the scipy.stats normal distribution in a custom wrapper class. We also created a new structure of distributions to hold pairs of labels and laws. Then, we showed how easy it is to store such pairs and sample from them.

In the next blog we will implement a state space model.

References

https://docs.scipy.org/doc/scipy/reference/stats.html

https://seaborn.pydata.org/