Creating a Gaussian Mixture Model using BNT --
a Short Tutorial, by Richard W. DeVaul
Version 1.0
IMPORTANT NOTE
The error in the BNT that caused EM to fail should now be fixed in the
latest version of the BNT. Please download the latest version from
Kevin Murphy's BNT website.
ABSTRACT
This document presents a simple tutorial on creating Gaussian mixture
models using Kevin Murphy's Bayes Net Toolbox for Matlab, a collection of
freely available (GPLd) graphical model tools for the extremely useful
(but not free) Matlab numerical environment.
It assumes little on the part of the user beyond a basic
understanding of statistical machine learning and Matlab. Deep
expertise in machine learning or graphical models is not required.
INTRODUCTION
I am a consumer of machine learning techniques, not a machine learning
researcher. Although I have acquired some familiarity with the tools
and notation of modern statistical machine learning and graphical
models, I am not an expert. The literature of this field tends to be
long on math and short on tutorials, which can be frustrating for the
non-specialist.
This document grew out of my experience (and frustration) implementing
a simple Gaussian mixture model using Matlab and BNT. I present this
in the hope that others might have an easier time getting started. I
may expand this tutorial to cover other kinds of models as time and my
own explorations permit.
PURPOSE OF THE TUTORIAL
The goal of this tutorial is to set out a complete example of
defining, training, and testing a simple Gaussian mixture model. This
tutorial includes real research data and step-by-step instructions.
It will not teach you machine learning or graphical models, nor is it
an introduction to Matlab or BNT. It is simply a step-by-step example
of how to use these tools in one very specific application.
PREREQUISITES
This tutorial assumes you have a basic understanding of statistical
machine learning and Matlab. If you don't know what a Gaussian
mixture model is or why you might use one, this document probably
won't help you. Likewise you won't (easily) learn Matlab notation
from this tutorial.
You don't need to know much machine learning to do this tutorial, but
you will need to know something about the bigger-picture issues
(feature selection, model selection, model complexity, generalization
and over-fitting, regularization, the pros and cons of generative
vs. discriminative modeling, etc.) in order to usefully apply this
material to your own work.
In addition to having the requisite knowledge, I assume you have access
to a recent version of Matlab with BNT properly installed.
INCLUDED IN THIS TUTORIAL
This tutorial requires the following files, which are available on the
web:
mixtureBNT.txt -- this file.
mixtureBNT.mat -- the Matlab data file
mixtureBNT.m -- a Matlab script generated from this instruction file.
If you didn't get the mixtureBNT.mat and mixtureBNT.m files along with
this text file, you can find them as part of a zip or tar archive on
the web at the following URLs:
http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.zip
http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.tar.gz
GETTING STARTED
Start Matlab if you haven't already. The file 'mixtureBNT.mat' should
be somewhere in your path; the current working directory is fine. In
this tutorial, Matlab commands I expect you to type are denoted by two
"greater than" symbols at the beginning of a line. You should type
all such lines (except for the leading ">>") exactly as shown and in
the order presented.
All of the Matlab code has been tested under Matlab 6.0.0.88 Release
12 and the "28 July 2002" release of BNT under Linux. If something
doesn't work, go back and make sure you have typed things properly and
haven't skipped any steps.
If you are still having problems, restart Matlab and run the included
'mixtureBNT.m' script -- it was automatically generated from this text
file and includes all commands verbatim. Further difficulties may be
the result of a Matlab path or BNT installation problem.
THE PROBLEM
For this tutorial we want to create an activity classifier that will
classify two activity states: walking and running. The data we are
working with is produced by a body-worn accelerometer, worn near the
subject's center of mass (the hips).
MODEL ARCHITECTURE
We will construct the activity classifier using a Gaussian mixture model,
represented as a simple graphical model and constructed within the BNT
framework. We will train this model using real research data. For the
purposes of this tutorial it doesn't really matter what this data is
or where it comes from. However, this is not a toy problem -- we will
be working in a 31 dimensional feature space, and the results will be
a generative model suitable for use in a real-time activity
classification system.
THE DATA
A large amount of labeled data was gathered while the subject was
engaged in various activities; the data we are using here comes from
walking and running segments. We have already conducted our
preliminary analysis and chosen appropriate features, resulting in 31
element feature vectors** (see endnote).
IMPORTANT NOTE: The BNT implementation of EM requires that the
magnitude of our features be in some "reasonable" range, which AFAIK
is not documented. The data provided with the tutorial has already
been appropriately scaled by dividing by 1e+6. Your own data may
also require a prescaling operation.
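The prescaling itself is nothing more than an element-wise division. A
minimal sketch in Python, using made-up raw values (the 1e+6 divisor is
the one the tutorial data actually used; everything else here is
hypothetical):

```python
# Hypothetical illustration of the prescaling step: BNT's EM can fail
# when feature magnitudes are too large, so the tutorial data was
# divided by 1e+6 before distribution.  The raw values below are
# made up for demonstration only.

raw_features = [2.4e6, 8.1e5, 3.3e6, 9.9e5]   # hypothetical raw feature values
SCALE = 1e6                                    # divisor used for the tutorial data

scaled_features = [x / SCALE for x in raw_features]

# After scaling, all magnitudes sit in a "reasonable" range (order 1).
max_magnitude = max(abs(x) for x in scaled_features)
```

If your own data spans a very different range, a similar constant
divisor (or per-feature standardization) is worth trying before
concluding that EM is broken.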
The first step is to load the tutorial data:
>> load 'mixtureBNT.mat'
Assuming this works, the walking and running data sets are now loaded.
This defines the following Matlab variables:
walkingX -- walking features, 570 samples, 31 elements
runningX -- running features, 120 samples, 31 elements
You can verify this by typing the following:
>> size(walkingX)
>> size(runningX)
Matlab should report sizes of 570 x 31 and 120 x 31, respectively.
THE MODEL
The model is a two-class, two component mixture model:
Class 1 -- "walking"
two 31 dimensional Gaussians (means and covariances) with associated
mixing parameters.
Class 2 -- "running"
... the same as 1, but with different parameter values.
This structure is captured in the following graphical model. Note
that the square nodes represent discrete values and the round (well,
hexagonal) node represents continuous values.
+-------+
Node 1 | Class |
| A/B |
+-------+
| \
| \
| \
| V
| +-----------+
| | component | Node 2
| | 1/2 |
| +-----------+
| /
| /
V V
.--------.
Node 3 / \
/ Gaussian \
\ mu, sigma /
\ /
\________/
The graph structure of this model can be represented by the following
adjacency matrix:
0 1 1
0 0 1
0 0 0
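The adjacency matrix reads row by row: a 1 in row i, column j means a
directed edge from node i to node j. A small Python sketch of that
reading, with the matrix copied from the text:

```python
# The DAG from the tutorial: row i, column j == 1 means an edge i -> j.
dag = [
    [0, 1, 1],   # Node 1 (class) points at nodes 2 and 3
    [0, 0, 1],   # Node 2 (mixture component) points at node 3
    [0, 0, 0],   # Node 3 (Gaussian observation) has no children
]

# Recover the edge list, using 1-based node numbers as in the tutorial.
edges = [(i + 1, j + 1)
         for i, row in enumerate(dag)
         for j, v in enumerate(row) if v == 1]
```

The three edges correspond exactly to the three arrows in the diagram
above: class to component, class to Gaussian, component to Gaussian.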
CREATING THE MODEL USING BNT
To construct this model in BNT, we issue the following commands:
>> dag = [ 0 1 1 ; 0 0 1 ; 0 0 0 ];
>> discrete_nodes = [1 2];
>> nodes = [1 : 3];
>> node_sizes=[ 2 2 31];
>> bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
>> bnet.CPD{1} = tabular_CPD(bnet,1);
>> bnet.CPD{2} = tabular_CPD(bnet,2);
>> bnet.CPD{3} = gaussian_CPD(bnet, 3);
>> %% Alternatively, use diagonal covariances to reduce the parameter count:
>> %bnet.CPD{3} = gaussian_CPD(bnet, 3,'cov_type','diag');
ORGANIZING THE DATA
The next step is organizing our feature data into training and testing
sets that can be used with the BNT tools. Our walking and running data
sets are of different sizes (570 and 120 feature vectors respectively)
so we will use the first 100 feature vectors from each in our training
set, and the next 20 of each in our testing set.
>> trainingX = walkingX(1:100,:);
>> trainingX(101:200,:)=runningX(1:100,:);
We must also label the training features:
>> trainingC(1:100) = 1; %% Class 1 is walking
>> trainingC(101:200) = 2; %% Class 2 is running
Now, the testing set:
>> testX(1:20,:) = walkingX(101:120,:); %% The first 20 are walking
>> testX(21:40,:) = runningX(101:120,:); %% The next 20 are running
No labels here, since the labels are what we want the trained model to
produce.
Unfortunately, this data is in the form of arrays, and BNT requires
cell arrays. Here we convert the data into an appropriate form. Note
that what we are actually doing is specifying the observed nodes of
the Bayes net; what is observed in this case is the output (Node 3,
the 31 dimensional feature) and the class (the two-state label, Node
1). What is hidden is Node 2, which represents the mixing weights (or
priors) of the Gaussian components.
>> training= cell(3,length(trainingX));
>> training(3,:) = num2cell(trainingX',1);
>> training(1,:) = num2cell(trainingC,1);
Note the matrix transpose in the second step -- I like to organize my
data such that each row is a feature vector, and the BNT expects
column vectors.
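The cell-array layout can be pictured as one column per training case,
with one slot per node: observed nodes carry values and the hidden node
stays empty. A hypothetical Python analogue of a single case (using
None to mark the hidden slot; the dict is purely illustrative, not a
BNT structure):

```python
# Hypothetical mirror of one column of the BNT cell array: a dict
# keyed by node number.  Node 1 (class) and node 3 (feature vector)
# are observed; node 2 (mixture component) is hidden, marked None.
feature_vec = [0.1] * 31          # stand-in for one 31-element feature vector

case = {
    1: 1,            # class label: 1 == walking
    2: None,         # mixture component is hidden -- EM will infer it
    3: feature_vec,  # observed feature vector (a column vector in BNT)
}

hidden_nodes = [n for n, v in case.items() if v is None]
```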
TRAINING THE MODEL
Now we are ready to train the model using the EM algorithm. EM works
by starting with a randomly initialized model, and then iteratively
refines the model parameters to produce a locally optimal
maximum-likelihood fit. The EM algorithm is composed of two steps.
In the first, each data point undergoes a soft-assignment to each
mixture component. In the second, the parameters of the model are
adjusted to fit the data based on the soft assignment of the previous
step.
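The two steps can be made concrete with a toy, one-dimensional,
two-component version of the same algorithm. This is an illustrative
pure-Python sketch of what learn_params_em does in 31 dimensions; the
data and starting values are made up:

```python
import math
import random

# Toy EM for a 1-D, two-component Gaussian mixture.  All data and
# initial parameter values are synthetic, for illustration only.
random.seed(0)
data = ([random.gauss(-2.0, 0.5) for _ in range(100)] +
        [random.gauss(3.0, 0.5) for _ in range(100)])

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Deliberately poor initial guesses.
mu = [-1.0, 1.0]
var = [1.0, 1.0]
weight = [0.5, 0.5]

for _ in range(20):                     # maxiter, as in the tutorial
    # E-step: soft-assign each point to each component.
    resp = []
    for x in data:
        p = [weight[k] * gauss_pdf(x, mu[k], var[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: re-fit the parameters from the soft assignments.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
        weight[k] = nk / len(data)
```

After a handful of iterations the means migrate from the poor initial
guesses (-1 and 1) to the true cluster centers (near -2 and 3), which
is exactly the locally optimal maximum-likelihood behavior described
above.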
First, we create the (exact) inference engine that will allow EM to
estimate the model parameters.
>> engine = jtree_inf_engine(bnet);
Next, we fit the model:
>> maxiter=10; %% The number of iterations of EM (max)
>> epsilon=1e-100; %% A very small stopping criterion
>> [bnet2, ll, engine2] = learn_params_em(engine,training,maxiter,epsilon);
EVALUATING THE MODEL -- GENERATIVE MODELING
If all goes well, you should now have a new trained Bayes net and
inference engine. One thing we can do to verify that the model is
reasonable is to draw samples from it and visually compare them to the
training data.
We can draw samples iteratively using the following procedure:
>> class0= cell(3,1); %% Create an empty cell array for observations
>> class1 = class0;
>> class2 = class0;
>> class1{1} = 1; %% The class node is observed to be walking
>> class2{1} = 2; %% The class node is observed to be running
>> for i=1:100
>> sample1=sample_bnet(bnet2,'evidence',class1);
>> sample2=sample_bnet(bnet2,'evidence',class2);
>> modelX(i,:)=sample1{3}';
>> modelX(i+100,:)=sample2{3}';
>> end
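What sample_bnet is doing here is ancestral sampling: with the class
node clamped, it first draws a mixture component from that class's
mixing weights, then draws from the chosen component's Gaussian. A
hypothetical one-dimensional sketch (all parameter values below are
invented, not the trained tutorial values):

```python
import random

random.seed(1)

# Hypothetical per-class mixture parameters -- 1-D stand-ins for the
# 31-dimensional Gaussians in the tutorial.
params = {
    1: {"weights": [0.7, 0.3], "mu": [-2.0, -0.5], "sigma": [0.4, 0.6]},  # "walking"
    2: {"weights": [0.5, 0.5], "mu": [2.5, 4.0], "sigma": [0.5, 0.5]},    # "running"
}

def sample_class(cls):
    """Ancestral sampling: pick a component, then sample its Gaussian."""
    p = params[cls]
    k = 0 if random.random() < p["weights"][0] else 1
    return random.gauss(p["mu"][k], p["sigma"][k])

walking_samples = [sample_class(1) for _ in range(200)]
running_samples = [sample_class(2) for _ in range(200)]
```

Plotting such samples next to the training data is the same visual
sanity check the Matlab code above performs.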
Plot the original training data:
>> figure
>> subplot(2,1,1);
>> plot(trainingX);
Plot the synthetic data drawn from the model distribution:
>> subplot(2,1,2);
>> plot(modelX);
The two plots should look similar.
EVALUATING THE MODEL -- CLASSIFICATION
Generative modeling is useful to see what the model is doing, but our
stated goal is classification. In order to test the performance of
the model as a classifier, we will see how well it classifies our
held-out (testing) data.
The first step is to enter each of our held-out features as
evidence and calculate the marginal of the class (Node 1).
>> evidence=class0; %% Start out with nothing observed
>> for i=1:40
>> evidence{3}=testX(i,:)';
>> [engine3, ll] = enter_evidence(engine2,evidence);
>> marg = marginal_nodes(engine3,1);
>> p(i,:)=marg.T';
>> end
The array 'p' now contains the marginal (posterior) probabilities of
the two classes for each of our held-out features. We will plot this
against the data. As in our training set, the first half of the data
is walking, the second half is running.
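The class marginal that enter_evidence computes is, in effect, Bayes'
rule: the posterior for each class is proportional to the class prior
times that class's mixture likelihood of the observed feature. A
one-dimensional sketch with invented mixture parameters (not the
trained tutorial values):

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical 1-D mixture parameters for the two classes:
# each entry is (mixing weight, mean, standard deviation).
mixtures = {
    1: [(0.6, -2.0, 0.5), (0.4, -0.8, 0.5)],  # "walking"
    2: [(0.5, 2.0, 0.5), (0.5, 3.5, 0.5)],    # "running"
}
prior = {1: 0.5, 2: 0.5}

def class_posterior(x):
    """p(class | x) via Bayes' rule over the two mixture likelihoods."""
    lik = {c: sum(w * gauss_pdf(x, mu, s) for w, mu, s in mixtures[c])
           for c in (1, 2)}
    evidence = sum(prior[c] * lik[c] for c in (1, 2))
    return {c: prior[c] * lik[c] / evidence for c in (1, 2)}

post = class_posterior(-1.5)   # a point near the "walking" components
```

For a feature that clearly belongs to one class, the posterior for
that class sits at or near 1.0, which is exactly the behavior the
plots below should show.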
>> figure;
>> subplot(2,1,1);
>> plot(testX);
>> hold on
>> plot(p(:,1)); %% Plot the output of the walking classifier
>> subplot(2,1,2);
>> plot(testX);
>> hold on
>> plot(p(:,2)); %% Plot the output of the running classifier
You should see that the marginals for each class are at or near 1.0 in
the appropriate regions of the data, and at or near zero elsewhere.
Congratulations, your Bayes net performs well as both a generative and
discriminative model.
----------------------------------------
BIBLIOGRAPHY
Excellent introduction to modern statistical methods for
classification:
R.O. Duda, P.E. Hart, and D.G. Stork.
"Pattern Classification." Wiley-Interscience,
2nd edition, 2000.
Classic EM algorithm reference:
A.P. Dempster, N. M. Laird, and D. B. Rubin.
"Maximum likelihood from incomplete data via the EM algorithm."
JRSSB, 39:1--38, 1977.
Advanced reading on graphical models:
M. Jordan, Ed.
"Learning in Graphical Models."
Kluwer Academic Publishers, 1998.
----------------------------------------
ENDNOTES
** The feature we have chosen is the power spectrum (minus the DC
component) of the magnitude of the three-axis acceleration vector. The
feature data was computed in real-time using a 64-sample FFT window
with 32 sample overlap. Each feature represents a 1.28 second "slice"
of time, with 50% overlap with the previous slice.
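The feature computation described above can be sketched in Python.
Note the hedges: the windowing (64 samples, 32-sample hop) and the
"minus the DC component" step come from the endnote, but keeping
exactly bins 1 through 31 (dropping DC and the Nyquist bin) is only
one plausible reading that yields 31 values per window; the original
pipeline may have differed, and the test signal below is made up.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (for illustration, not speed)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def power_features(accel_xyz, win=64, hop=32):
    """Power spectrum of the acceleration magnitude, 50%-overlapped windows."""
    magnitude = [math.sqrt(x * x + y * y + z * z) for x, y, z in accel_xyz]
    features = []
    for start in range(0, len(magnitude) - win + 1, hop):
        spectrum = dft(magnitude[start:start + win])
        # Keep bins 1..31: drop DC (bin 0) and the Nyquist bin.
        features.append([abs(c) ** 2 for c in spectrum[1:win // 2]])
    return features

# Made-up signal: 128 samples of a 1 g baseline plus a small
# oscillation at 8 cycles per 64-sample window.
samples = [(0.0, 0.0, 1.0 + 0.1 * math.sin(2 * math.pi * 8 * t / 64))
           for t in range(128)]
feats = power_features(samples)
```

Each window yields one 31-element feature vector, and the 32-sample
hop gives the 50% overlap between consecutive 1.28 second slices.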