Babbage Session 6-7/1/98
back to home
Internal parametrization, Analog VLSI, and the art of nonlinear search
A most excellent Babbage session.
These notes are a bit ambiguous because at no time in the course of the meeting did everyone understand what was going on. We spent most of the time discussing a casual suggestion by Neil to effectively model some results best expressed with the terminology of functional analysis in terms of information theory. Goals for the next week include Finding The Point and Being Precise.
We began by discussing a pretty result from Carver Meade's analog vlsi group. Al Barr et. al. found a way to perform gradient descent using a noise source, two differentiators, an integrator, and a multiplier. The paper is almost self-explanatory, and many of us gave one-line summaries of The Point, although two issues of contention arose:
- why noise? It allows for the most perfect matched filter in order to synchronously detect the gradient. Any other signal would do, but with varying degrees of success because they might lose information (if they lacked a certain spectral component) or not dominate dy/dt (if they didn't have sharp enough slope at all points). Matched filtering with noise is difficult unless you use the noise right then and there, and perform analogous transforms on both components.
- what does the analog hardware do? The important thing to remember is that although the math works out perfectly, there is no such thing as a perfect integrator/differentiator/averager, and the analog vlsi components must simply be good approximations of these components. It is nice that the system works, but not much can be said beyond that.
We then began discussing the paper by Barron. We really need to read it in more depth, despite the good intentions of several people to disassociate the technical nature of the proof from insight into how it works. The Point was not clear, although several conclusions were easily Several significant points include:
- The conclusion. We knew something like this all along, from Bishop. If the first moment of the Fourier transform is bounded = Cf, then n functions with internally varying parameters can approximate a function to within Cf^2/n, whereas a linear superposition can only approach Cf/d (1/n)^1/d. This has profound implications for why neural networks are so good. Key points: this works for *all* functions in a convex set in function space, and is not a mere average-case scenario (despite the use of the WLLN in the proof), and has to do with the 'smearing-out' that the internal parameters can do. Keep in mind the nonlinear nature of diodes and other physical devices when considering such theorems: *how could I actually build something that does this?* The first example, the gradient descent using noise, and the Barron paper, are not entirely unlinked.
- We resolved that sinusoids, sigmoids, etc. are equally powerful in the Cf-L2 sense that Barron uses. Bernd wisely observed that this does not always hold for real-life problems. Sigmoids and RBFs are better than mere sinusoids in almost all NN problems. This is because the probability measure defined over the set of 'smooth functions' is not very well-defined; using the trivial distribution is probably not a very useful thing to do.
- We observed that the term 'exponential expressiveness,' which everyone is much fond of tossing around, is useless unless we do some real work explaining what it really is.
- We observed that there is an exponential constraint on the smoothness. Say that my deviation along any one dimension is k. Then Cf ~ k^d, e.g. my one-D deviation must go down exponentially as my dimension increases. We note, however, that under the same circumstances the linear case fares far worse. We also discussed the search process -- e.g., the linear case is essentially a series of projections which can be done in linear time (or, for the matrix case, n^3).
- We noted the definition of Hilbert space and asked the Dirac question.
- Quantum mechanics offers a square root speedup, although internal parametrization offers an exponential decrease in the number of functions needed. How does this affect the best computer on earth.
- We discussed Kirkpatricks 'hard problems'. We discusses how reparametrizing the variables and the measures could make hard problems seem easy. Discussed 'natural measures.'
- Bernd: 'in practice...'
- Quick discussion of EM algorithm, error surfaces, and other stuff. Mentioned Tad Hogg, structured search. Mentioned spin glasses.
Finally an effort was made to rephrase a simple question, "How can one do an information theoretic representation of the Barron result?" The remainder of the session was a flurry of attempts to define not much. I will summarize:
- How can one link entropy with the moment of the Fourier distribution?
- Answer: not obvious. We must decide on some probability measure that suggests a good way to do this.
- Hint: uniform continuity and generalizations thereof.
- Is the nonlinear basis more powerful than the linear basis in the infinite limit?
- Can we operate in a stochastic fashion?
- Should we discard dynamics and proceed after this fashion.
Ben had some interesting diagrams which I will not put up here, as well as an interesting analysis of the problem of analyzing a frequency spectrum.
Things to do: read Bennett's packet.
Things for me to do: Write to Bennett. Reinterpret Barron. Read Cover and Thomas 8, 9, 11, 12, 13, 16. Read Tad Hogg's structured search paper. Read Amari; review differential geometry and statistics. Review embedding and the topological description of dynamics. Scan measure theory and functional analysis. Spin glasses? Kirkpatrick's problem phase transition? (Check NMM reference)
Algebraic topology? Higher-dimensional algebra (category/8-dimension repeat (Bott periodicity)/manifold properties/antisym/etc.). Fields (integrability, etc.)
tr. by Ed Boyden
babbage@media.mit.edu· 20 Ames St· Cambridge, MA 02139 USA · 617.253.2383