

TypeContour: Interactive Installation Exploring Typography and Body

Lingdong Huang

A virtual world painted with letters and words.

Continuing the spirit of BodyType, which I made for member's week in 2023, I present TypeContour for this spring's member's week, this time with cooler neural networks and even cooler ways to use them with openFrameworks, a creative coding framework started by my advisor, Prof. Zach Lieberman.

The idea of the piece, initially described to me by Zach, was simple: A virtual world painted with letters and words, with each word being the name of the object it paints with its letters. 

For the semantic segmentation neural network, I tried out DETR and SAM2, and found the former to run faster and fit my needs better.

For BodyType, VisionOSC and my earlier interactive installations involving machine vision and openFrameworks, I was mixing Objective-C++ with the rest of my code base in order to use Apple's shiny new Vision / CoreML technologies that run blazing fast. It was quite a pain. One would think C++ or Objective-C alone is terrible enough; now multiply them together, throw in the idiosyncrasies of Xcode and openFrameworks, and you get the most unreadable, uncompilable, unlinkable pile of mumbo jumbo.

Luckily, it so happens that I'm working on the FFI of my new programming language for my PhD thesis, and have been into compiling, dynamically linked libraries and all that stuff lately. Then it hit me: I could easily do the same for the Apple Vision APIs.

The idea is that I pack up all the unsavory Objective-C stuff into a dynamic library. Now it's no longer Objective-C -- it becomes machine code, so I can just call it from C/C++. The only thing I need to do is find the symbols in the compiled binary and tell C/C++: "trust me bro, this address here is a function pointer of such and such signature, just call it, it's gonna work".
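Concretely, that "trust me bro" step is just POSIX dlopen / dlsym. A minimal sketch of the idea (the library name, the exported function, and its signature below are invented purely for illustration, and the error handling is bare-bones):

```cpp
#include <dlfcn.h>   // POSIX dynamic loading: dlopen / dlsym / dlclose
#include <cstdio>

int main() {
    // Load the compiled Objective-C wrapper; the file name is hypothetical.
    void* handle = dlopen("libvisionwrapper.so", RTLD_NOW);
    if (!handle) { std::fprintf(stderr, "dlopen failed: %s\n", dlerror()); return 1; }

    // Reinterpret the raw symbol address as a function pointer of the
    // signature we know the wrapper exports (again, a made-up example).
    using detect_faces_fn = int (*)(const unsigned char* rgba, int w, int h,
                                    float* out_landmarks, int max_landmarks);
    auto detect_faces =
        reinterpret_cast<detect_faces_fn>(dlsym(handle, "detect_faces"));
    if (!detect_faces) { std::fprintf(stderr, "dlsym failed: %s\n", dlerror()); return 1; }

    // ... call detect_faces(pixels, w, h, landmarks, n) from plain C/C++ ...

    dlclose(handle);
    return 0;
}
```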

And it did. No more trouble sweeping up after compiler/linker barfs, just one mysterious .so file (the result of a 30-minute struggle) that I dynamically load in sweet, pure C, with the promise of never having to touch Objective-C again (for the length of this project, at least).

The beauty of this approach: openFrameworks doesn't know about it, Xcode doesn't know about it, so they have no way of complaining about it.

There are two major ways of drawing anything: one is to delineate its outline, the other is to fill its volume. In drawing you have line drawing versus hatching; in Processing it's called stroke() and fill(). It's no different when drawing with letters and words: you can lay a bunch of letters along the contour of something, or you can fill the shape with letters.

DETR gives the most likely label for each pixel. I use the findContour algorithm (the same one used by OpenCV, which I reimplemented myself so I can be free of OpenCV) to trace the contour of each of the top classes. To draw a contour with words, I simply lay letters along the traced (and smoothed) contour, repeating the word until the contour runs out. To fill a shape with words, I test the original segmentation bitmap to see if a position is still within bounds -- this is faster than computing point-in-polygon with vector-based algorithms.
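The contour pass boils down to walking the polyline and dropping one glyph every fixed step of arc length, rotated to the local tangent. A simplified sketch of that walk (not the installation's actual code; the structs and the fixed glyph advance are placeholders):

```cpp
#include <cmath>
#include <string>
#include <vector>

struct Vec2 { float x, y; };
struct PlacedGlyph { char ch; Vec2 pos; float angle; };

// Walk a (smoothed) contour polyline and place one glyph every `step` pixels
// of arc length, cycling through the word until the contour runs out. Each
// glyph gets a position and an angle so it can follow the outline.
std::vector<PlacedGlyph> layoutAlongContour(const std::vector<Vec2>& contour,
                                            const std::string& word,
                                            float step) {
    std::vector<PlacedGlyph> out;
    if (contour.size() < 2 || word.empty() || step <= 0.0f) return out;
    float distToNext = step;   // arc length left before the next glyph
    float posOnSeg   = 0.0f;   // how far we've walked along the current segment
    size_t k = 0;              // index into the word, wraps around
    for (size_t i = 0; i + 1 < contour.size(); ) {
        const Vec2 a = contour[i], b = contour[i + 1];
        const float dx = b.x - a.x, dy = b.y - a.y;
        const float len = std::sqrt(dx * dx + dy * dy);
        const float remaining = len - posOnSeg;
        if (distToNext > remaining) {      // glyph lands beyond this segment
            distToNext -= remaining;
            posOnSeg = 0.0f;
            ++i;
            continue;
        }
        posOnSeg += distToNext;
        distToNext = step;
        const float t = posOnSeg / len;
        out.push_back({word[k++ % word.size()],
                       {a.x + t * dx, a.y + t * dy},
                       std::atan2(dy, dx)});  // rotate glyph to the tangent
    }
    return out;
}
```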

When filling with words, I predict whether the horizontal space is going to run out before a word is appended, and put it on the next line (or past the obstacle) if so -- just like in a typesetting scenario.
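A sketch of that fill-and-wrap idea, assuming a binary segmentation mask and a fixed glyph width (both simplifications of what the piece actually does):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct WordPlacement { std::string word; int x, y; };

// Fill a segmentation mask with copies of `word`, row by row. Before appending
// a word, check that the whole horizontal span it would occupy is still inside
// the mask (a cheap bitmap test rather than point-in-polygon); if not, slide
// past the obstacle, like a typesetter breaking to the next line.
std::vector<WordPlacement> fillWithWords(const std::vector<uint8_t>& mask, // 1 = inside
                                         int w, int h,
                                         const std::string& word,
                                         int glyphW, int lineH) {
    std::vector<WordPlacement> out;
    const int wordW = glyphW * static_cast<int>(word.size());
    for (int y = lineH / 2; y < h; y += lineH) {
        for (int x = 0; x + wordW <= w; ) {
            bool fits = true;
            for (int dx = 0; dx < wordW; ++dx)      // predict overflow before placing
                if (!mask[y * w + x + dx]) { fits = false; break; }
            if (fits) { out.push_back({word, x, y}); x += wordW + glyphW; }
            else      { x += glyphW; }              // skip ahead past the obstacle
        }
    }
    return out;
}
```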

In addition to DETR, I use Apple Vision's built-in facial landmark detection, as I felt a little extra attention to detail would be nice for the human visage -- indeed, it's all we humans care about. I draw the facial features with the contour algorithm. I think I look more handsome with this letter filter.

Now it's a question of which things should be drawn with lines, which things with fills, and which things not drawn at all. DETR tends to be quite excited about useless background elements, and has classes such as "wall (brick)" and "wall (stone)", as well as "window (blind)". If all these are drawn with outlines, the composition looks terribly crowded. 

So my idea was that background elements should be drawn with fills, while the foreground gets contours. I dumped the ~200 labels into ChatGPT and told it to use its AI big brain to figure out which are interesting foreground objects and which are silly background noise (and to format the output nicely too).

I'm pretty happy with the looks, except that the background fills often look kind of static. I decided to add some dynamism by driving the text with a vector field, making the words feel like they're floating on the surface of an ocean or something.
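The general recipe is to sample a smooth, time-varying vector field at each glyph's position and nudge the glyph along it every frame. A toy version, with a sine-based field standing in for whatever the piece actually uses:

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// A toy flow field: glyphs advected by it bob and drift as if floating on a
// gentle swell. The exact field in the piece is not specified; this one is
// only a stand-in for illustration.
Vec2 flowField(float x, float y, float t) {
    float u = 0.6f * std::sin(0.01f * y + 0.8f * t) + 0.3f * std::sin(0.02f * x + 0.5f * t);
    float v = 0.4f * std::cos(0.015f * x + 0.6f * t);
    return {u, v};
}

// Per frame, take a small Euler step along the field for each background glyph.
void advect(Vec2& glyphPos, float t, float dt) {
    Vec2 f = flowField(glyphPos.x, glyphPos.y, t);
    glyphPos.x += f.x * dt * 60.0f;   // scaled so the drift is visible at ~60 FPS
    glyphPos.y += f.y * dt * 60.0f;
}
```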

While playing around with it, I came up with a better idea: what if words come out of your mouth, fly into the sky, and fill the background that way? Get rid of all the background classes, 'cause who cares.

For real-time, offline speech recognition, I use whisper.cpp. It's made by this badass guy who wrote his own ML backend!

I put the machine vision stuff and the speech-to-text stuff in separate threads, to get 120 FPS on the main graphics thread; the vision stuff runs at 20-30 FPS. The way I use whisper.cpp is that I keep feeding the network the last 10 seconds of audio picked up from the microphone, and compute a Levenshtein distance to find out what new words have popped up. The reason for this is that inference takes about the same amount of time for inputs ranging from 1 to 20-ish seconds. It works alright, but there's room for improvement. The problem lies in the fact that the network keeps correcting previously established words as new samples come in. For example, say you utter "the weather is nice". In an ideal situation for the algorithm, the network would spit out "the weather", "weather is", "is nice", which are easily spliced together as "the weather is nice". However, sometimes it might first spit out "the wet", then "whether is", then "is nigh", etc., and now it gets resolved as "the wet whether is the weather is nigh weather is nice". Confusing!
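Sketched out, the splicing step could look something like this (a simplification of the idea, not necessarily the exact logic in the piece; whisper.cpp integration and audio handling are omitted): tokenize the previous and the newest transcript of the sliding window, try every split point of the new one, and treat whatever falls after the best-matching head as the newly spoken words.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Word-level Levenshtein distance between two token sequences.
static size_t editDistance(const std::vector<std::string>& a,
                           const std::vector<std::string>& b) {
    std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= b.size(); ++j) {
            size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Given the previous transcript of the audio window and the latest one, guess
// which trailing words are genuinely new: pick the split of the new transcript
// whose head is closest (in edit distance) to the old transcript.
std::vector<std::string> newWords(const std::vector<std::string>& oldT,
                                  const std::vector<std::string>& newT) {
    size_t bestSplit = newT.size(), bestCost = SIZE_MAX;
    for (size_t split = 0; split <= newT.size(); ++split) {
        std::vector<std::string> head(newT.begin(), newT.begin() + split);
        size_t cost = editDistance(oldT, head);
        if (cost < bestCost) { bestCost = cost; bestSplit = split; }
    }
    return std::vector<std::string>(newT.begin() + bestSplit, newT.end());
}
```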

If I were writing transcription software, it would be easy enough to just correct previous mistakes in the output. However, since animated letterforms fly out in real time from tracked facial landmarks, it becomes quite a bit more involved to edit the ongoing animations. Another heuristic would be to detect "word boundaries" and chop the audio at those places -- but that would be sensitive to the level of background noise. Given the time constraints, I decided the current approach is good enough for the time being.

Anyways, I found making the whole thing pretty fun.

Next, I had to port the software from my M3 MacBook Pro to the M1 Mac mini to be used for the member's week installation. This proved to be a bit more trouble than I expected. First, the mini was on an older OS, and the new CoreML models simply wouldn't run. I tried to re-encode the models, but no luck -- apparently an "mlpackage" is a totally different thing from an "mlmodel". I had to upgrade the operating system. Then it ran, but terribly slowly -- I guess the M3 is just that much better than the M1. I was pretty set on just using my laptop for the duration of the open house, but before giving up, I dumped my CoreML wrapper code into ChatGPT just in case there was some weird Objective-C / Apple thing I hadn't noticed that could magically make everything faster. I just said "make this faster!". And, among other not-so-useful advice, ChatGPT spotted one line where I told CoreML to use "MLComputeUnitsCPUAndGPU": it turns out that "BOTH cpu AND gpu" (which I thought meant going at full power) actually means "DO NOT USE the Apple Neural Engine", which is apparently the actual fast thing. Boom! After I made that change, it did magically get much faster! Still not as fast as on my laptop, but fast enough for interactivity!

Many people came to try out the piece at member's week. The two questions I got asked most were "what's the inspiration" and "what's the application", to which I initially answered "Zach wants it" and "it's art". But later I modified my first answer to "concrete poetry" to make it sound better.