Creating a Gaussian Mixture Model using BNT -- a Short Tutorial
by Richard W. DeVaul
Version 1.0

IMPORTANT NOTE

The error in BNT that caused EM to fail should now be fixed in the latest version of BNT. Please download the latest version from Kevin Murphy's BNT website.

ABSTRACT

This document is a short tutorial on creating Gaussian mixture models using Kevin Murphy's Bayes Net Toolbox (BNT) for Matlab, a collection of freely available (GPL'd) graphical model tools for the extremely useful (but not free) Matlab numerical environment. It assumes little on the part of the reader beyond a basic understanding of statistical machine learning and Matlab; deep expertise in machine learning or graphical models is not required.

INTRODUCTION

I am a consumer of machine learning techniques, not a machine learning researcher. Although I have acquired some familiarity with the tools and notation of modern statistical machine learning and graphical models, I am not an expert. The literature in this field tends to be long on math and short on tutorials, which can be frustrating for the non-specialist. This document grew out of my experience (and frustration) implementing a simple Gaussian mixture model using Matlab and BNT. I present it in the hope that others will have an easier time getting started. I may expand this tutorial to cover other kinds of models as time and my own explorations permit.

PURPOSE OF THE TUTORIAL

The goal of this tutorial is to set out a complete example of defining, training, and testing a simple Gaussian mixture model, using real research data and step-by-step instructions. It will not teach you machine learning or graphical models, nor is it an introduction to Matlab or BNT. It is simply a step-by-step example of how to use these tools in one very specific application.

PREREQUISITES

This tutorial assumes you have a basic understanding of statistical machine learning and Matlab. If you don't know what a Gaussian mixture model is or why you might use one, this document probably won't help you. Likewise, you won't (easily) learn Matlab notation from this tutorial. You don't need to know much machine learning to work through this tutorial, but you will need to know something about the bigger-picture issues (feature selection, model selection, model complexity, generalization and over-fitting, regularization, the pros and cons of generative vs. discriminative modeling, etc.) in order to usefully apply this material to your own work. In addition to having the requisite knowledge, I assume you have access to a recent version of Matlab with BNT properly installed.

INCLUDED IN THIS TUTORIAL

This tutorial requires the following files, which are available on the web:

mixtureBNT.txt  -- this file.
mixtureBNT.mat  -- the Matlab data file.
mixtureBNT.m    -- a Matlab script generated from this instruction file.

If you didn't get the mixtureBNT.mat and mixtureBNT.m files along with this text file, you can find them as part of a zip or tar archive on the web at the following URLs:

http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.zip
http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.tar.gz

GETTING STARTED

Start Matlab if you haven't already. The file 'mixtureBNT.mat' should be somewhere in your path -- the current working directory is fine. In this tutorial, Matlab commands I expect you to type are denoted by two "greater than" symbols at the beginning of a line. You should type all such lines (except for the leading ">>") exactly as shown and in the order presented.
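Before starting, if you want to confirm that BNT and the tutorial data are visible to Matlab (an optional sanity check of my own, not part of the original script), you can type:

>> which mk_bnet                   %% should print the path to BNT's mk_bnet.m if BNT is installed
>> exist('mixtureBNT.mat','file')  %% should return 2 if the data file is on your path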
All of the Matlab code has been tested under Matlab 6.0.0.88 Release 12 and the "28 July 2002" release of BNT under Linux. If something doesn't work, go back and make sure you have typed things properly and haven't skipped any steps. If you are still having problems, restart Matlab and run the included 'mixtureBNT.m' script -- it was automatically generated from this text file and includes all of the commands verbatim. Further difficulties may be the result of a Matlab path or BNT installation problem.

THE PROBLEM

For this tutorial we want to create an activity classifier that distinguishes two activity states: walking and running. The data we are working with is produced by a body-worn accelerometer, worn near the subject's center of mass (the hips).

MODEL ARCHITECTURE

We will construct the activity classifier using a Gaussian mixture model, represented as a simple graphical model and constructed within the BNT framework. We will train this model using real research data. For the purposes of this tutorial it doesn't really matter what this data is or where it comes from. However, this is not a toy problem -- we will be working in a 31-dimensional feature space, and the result will be a generative model suitable for use in a real-time activity classification system.

THE DATA

A large amount of labeled data was gathered while the subject was engaged in various activities; the data we are using here comes from walking and running segments. We have already conducted our preliminary analysis and chosen appropriate features, resulting in 31-element feature vectors** (see endnote).

IMPORTANT NOTE: The BNT implementation of EM requires that the magnitude of our features be in some "reasonable" range, which as far as I know is not documented. The data provided with the tutorial has already been appropriately scaled by dividing by 1e+6. Your own data may also require a prescaling operation.

The first step is to load the tutorial data:

>> load 'mixtureBNT.mat'

Assuming this works, the walking and running data sets are now loaded. This defines the following Matlab variables:

walkingX -- walking features, 570 samples, 31 elements each
runningX -- running features, 120 samples, 31 elements each

You can verify this by typing the following:

>> size(walkingX)
>> size(runningX)

Matlab should produce the appropriate numbers.
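If you are adapting this tutorial to your own data, a quick way to check whether a prescaling step is needed (an optional check of my own, not part of the original script) is to look at the largest feature magnitude:

>> max(abs(walkingX(:)))   %% the tutorial data is already scaled; values should be of modest magnitude
>> max(abs(runningX(:)))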
THE MODEL

The model is a two-class, two-component mixture model:

Class 1 -- "walking": two 31-dimensional Gaussians (means and covariances) with associated mixing parameters.
Class 2 -- "running": the same structure as Class 1, but with different parameter values.

This structure is captured in the following graphical model. Note that the square nodes represent discrete values and the round (well, hexagonal) node represents continuous values.

     +-------+
     | Class |    Node 1
     |  A/B  |
     +-------+
      |      \
      |       \
      |        V
      |   +-----------+
      |   | component |    Node 2
      |   |    1/2    |
      |   +-----------+
      |       /
      |      /
      V     V
      .--------.
     /          \
    /  Gaussian  \    Node 3
    \  mu, sigma /
     \          /
      \________/

The graph structure of this model can be represented by the following adjacency matrix:

0 1 1
0 0 1
0 0 0

CREATING THE MODEL USING BNT

To construct this model in BNT, we issue the following commands:

>> dag = [ 0 1 1 ; 0 0 1 ; 0 0 0 ];
>> discrete_nodes = [1 2];
>> nodes = [1 : 3];
>> node_sizes=[ 2 2 31];
>> bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
>> bnet.CPD{1} = tabular_CPD(bnet,1);
>> bnet.CPD{2} = tabular_CPD(bnet,2);
>> bnet.CPD{3} = gaussian_CPD(bnet, 3);
>> %bnet.CPD{3} = gaussian_CPD(bnet, 3,'cov_type','diag');

ORGANIZING THE DATA

The next step is organizing our feature data into training and testing sets that can be used with the BNT tools. Our walking and running data sets are of different sizes (570 and 120 feature vectors respectively), so we will use the first 100 feature vectors from each in our training set, and the next 20 from each in our testing set.

>> trainingX = walkingX(1:100,:);
>> trainingX(101:200,:)=runningX(1:100,:);

We must also label the training features:

>> trainingC(1:100) = 1;   %% Class 1 is walking
>> trainingC(101:200) = 2; %% Class 2 is running

Now, the testing set:

>> testX(1:20,:) = walkingX(101:120,:);  %% The first 20 are walking
>> testX(21:40,:) = runningX(101:120,:); %% The next 20 are running

There are no labels here, since the labels are what we want the trained model to produce.

Unfortunately, this data is in the form of ordinary arrays, and BNT requires cell arrays. Here we convert the data into an appropriate form. Note that what we are actually doing is specifying the observed nodes of the Bayes net: what is observed in this case is the output (Node 3, the 31-dimensional feature) and the class (the two-state label, Node 1). What is hidden is Node 2, which represents the mixing weights (or priors) of the Gaussian components.

>> training= cell(3,length(trainingX));
>> training(3,:) = num2cell(trainingX',1);
>> training(1,:) = num2cell(trainingC,1);

Note the matrix transpose in the second step -- I like to organize my data such that each row is a feature vector, while BNT expects column vectors.

TRAINING THE MODEL

Now we are ready to train the model using the EM algorithm. EM starts with a randomly initialized model and then iteratively refines the model parameters to produce a locally optimal maximum-likelihood fit. The EM algorithm is composed of two steps. In the first, each data point is softly assigned to each mixture component. In the second, the parameters of the model are adjusted to fit the data based on the soft assignments of the previous step.

First, we create the (exact) inference engine that will allow EM to estimate the model parameters:

>> engine = jtree_inf_engine(bnet);

Next, we fit the model:

>> maxiter=10;     %% The (maximum) number of EM iterations
>> epsilon=1e-100; %% A very small stopping criterion
>> [bnet2, ll, engine2] = learn_params_em(engine,training,maxiter,epsilon);
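The second return value, ll, should hold the log-likelihood trace of EM (one value per iteration). If you want a quick convergence check (my addition, not part of the original script), you can plot it:

>> plot(ll);       %% assumes ll is a vector of per-iteration log-likelihood values
>> xlabel('EM iteration'); ylabel('log likelihood');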
EVALUATING THE MODEL -- GENERATIVE MODELING

If all goes well, you should now have a newly trained Bayes net and inference engine. One thing we can do to verify that the model is reasonable is to draw samples from it and visually compare them to the training data. We can draw samples iteratively using the following procedure:

>> class0= cell(3,1); %% Create an empty cell array for observations
>> class1 = class0;
>> class2 = class0;
>> class1{1} = 1;     %% The class node is observed to be walking
>> class2{1} = 2;     %% The class node is observed to be running
>> for i=1:100
>>   sample1=sample_bnet(bnet2,'evidence',class1);
>>   sample2=sample_bnet(bnet2,'evidence',class2);
>>   modelX(i,:)=sample1{3}';
>>   modelX(i+100,:)=sample2{3}';
>> end

Plot the original training data:

>> figure
>> subplot(2,1,1);
>> plot(trainingX);

Plot the synthetic data drawn from the model distribution:

>> subplot(2,1,2);
>> plot(modelX);

The two plots should look similar.

EVALUATING THE MODEL -- CLASSIFICATION

Generative modeling is useful for seeing what the model is doing, but our stated goal is classification. In order to test the performance of the model as a classifier, we will see how well it classifies our held-out (testing) data. The first step is to enter each of our held-out feature vectors as evidence and calculate the marginal of the class (Node 1):

>> evidence=class0; %% Start out with nothing observed
>> for i=1:40
>>   evidence{3}=testX(i,:)';
>>   [engine3, ll] = enter_evidence(engine2,evidence);
>>   marg = marginal_nodes(engine3,1);
>>   p(i,:)=marg.T';
>> end

The array 'p' now contains the marginal probabilities (likelihoods) of the two classes for each of our held-out feature vectors. We will plot these against the data. As in our training set, the first half of the data is walking and the second half is running.

>> figure;
>> subplot(2,1,1);
>> plot(testX);
>> hold
>> plot(p(:,1)); %% Plot the output of the walking classifier
>> subplot(2,1,2);
>> plot(testX);
>> hold
>> plot(p(:,2)); %% Plot the output of the running classifier

You should see that the marginal for each class is at or near 1.0 in the appropriate region of the data, and at or near zero elsewhere. Congratulations -- your Bayes net performs well as both a generative and a discriminative model.

----------------------------------------

BIBLIOGRAPHY

Excellent introduction to modern statistical methods for classification:
R.O. Duda, P.E. Hart, and D.G. Stork. "Pattern Classification." Wiley-Interscience, 2nd edition, 2000.

Classic EM algorithm reference:
A.P. Dempster, N.M. Laird, and D.B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm." JRSSB, 39:1--38, 1977.

Advanced reading on graphical models:
M. Jordan, Ed. "Learning in Graphical Models." Kluwer Academic Publishers, 1998.

----------------------------------------

ENDNOTES

** The feature we have chosen is the power spectrum (minus the DC component) of the magnitude of the three-axis acceleration vector. The feature data was computed in real time using a 64-sample FFT window with a 32-sample overlap. Each feature vector represents a 1.28-second "slice" of time, with 50% overlap with the previous slice.
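For readers who want to derive a comparable feature from their own three-axis accelerometer data, here is a minimal sketch of the computation described in the endnote. The variable names (accX, accY, accZ), the choice of keeping FFT bins 2 through 32 to obtain 31 elements, and the final scaling step are my assumptions, not part of the tutorial data:

% accX, accY, accZ: column vectors of raw accelerometer samples (assumed names)
mag = sqrt(accX.^2 + accY.^2 + accZ.^2);     % magnitude of the acceleration vector
win = 64; hop = 32;                          % 64-sample FFT window, 50% (32-sample) overlap
nFrames = floor((length(mag) - win)/hop) + 1;
features = zeros(nFrames, 31);
for k = 1:nFrames
  seg = mag((k-1)*hop + (1:win));            % one 64-sample slice of the magnitude signal
  P   = abs(fft(seg)).^2;                    % power spectrum of the slice
  features(k,:) = P(2:32)';                  % drop the DC bin; keep 31 frequency bins
end
features = features / 1e+6;                  % prescale, as noted in THE DATA section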