The Ocarina Test for AI Image Generators
Published on December 27, 2024 at 12:46pm by Matt the Tall
A while back in a discussion about LLMs and image models, I mentioned to a friend offhandedly that image models don't really understand what ocarinas are, and he said I should coin an “ocarina test” for them. That idea developed into the article you're reading now. I've played with AI models here and there, mostly setting up a few local models like StableDiffusion to run on my own hardware and occasionally seeing how bigger models like GPT-4 or Dall-E differ from locally run models. Put simply, image generation models don't handle ocarinas well for a number of reasons.
Even in human circles, the ocarina is a pretty obscure instrument. In my unscientific observation of anecdotes, I'd say roughly 30-50% of those I meet know the ocarina as an instrument when I play it, but the majority of those who recognize it don't know it's an instrument with more history than the Legend of Zelda games. The transverse form factor that's familiar from Legend of Zelda was invented in 1853 Italy by Giuseppe Donati. The wider family of instruments ocarinas are a part of, the vessel flute, has been around for thousands of years.
Take a look at an example of a traditional Italian 10-hole ocarina from maker Menaglio in Italy:
For more eye candy, these are examples of Asian style transverse ocarinas:
Clockwise from top left:
- 12-hole Signature Alto C from OcarinaMusic
- 12-hole Soprano G from Aketa
- 11-hole Alto D from Pure Ocarinas
- 11-hole Alto C from Oberon Ocarinas
For the rest of the article, I'm going to use terms for ocarina anatomy assuming you know what they mean. If those terms aren't familiar to you, please look at this article on the Pure Ocarinas site with a ton of other great info. There will be easy reference links the first time terms show up.
If you're wondering what's different between an Asian and an Italian design, it comes down to how the upper range is played. To play the highest two notes in the natural scale of the ocarina (not using sharps/flats) on an Italian ocarina, you lift the left pinky followed by the right thumb as you climb the two highest notes in the range. With an Asian ocarina, the order is reversed: right thumb followed by left pinky. The size of the tone holes for those two fingers reflects the difference.
In addition to transverse ocarinas, there's another type to further muddy the waters for hapless AI models: the pendant ocarina. There are both English style pendants and those in a more transverse looking shape, not to mention more artistic pieces that function as playable sculptures.
We haven't even covered all the major types of ocarinas that could be taken into account, but let's take a look at how ChatGPT 4 currently handles a couple of simple image prompts.
With the prompt “generate an image of an ocarina” I got this:
Better than I would have expected, but very obviously not playable and WAY too many holes for what otherwise looks like a pendant form factor. The shiny blue glaze and raised designs do look like a plausible art style.
Let's see what happens if I try to be a little more specific with “generate an image of a 10 hole Italian transverse ocarina.”
It actually got worse! There are 18 holes rather than 10, and it's linear instead of transverse (a transverse flute is held with the body perpendicular to the mouth). I will give credit: it looks like the unglazed clay more common to Italian ocarinas, but that's about the only thing this one got right. Worth noting as well, the windway opening is entirely the wrong shape for sending the air to the voicing in a useful way. With rare exceptions, ocarina windways are rectangular.
Let's dig even deeper into this rabbit hole, shall we?
First, there are some limitations to the ocarina as an instrument. There's a limit to how much range can be added to an ocarina by opening up holes in the body. For an ocarina to sound good across its range, this usually means a maximum of 12 holes in one chamber, aside from some uncommon sopranos that can support 13. Personally, I prefer ocarinas with 10 or 11 holes because they feel nicer to play and sound better across their range. As a way around the limited range of single chamber ocarinas, the multichamber ocarina was invented. They are oft referred to as double, triple, and quadruple ocarinas to denote the number of chambers.
I suspect it's not difficult to see how the piles of statistics we call AI systems would have problems with figuring out how to generate a double or even a triple ocarina like these pictured, which also have multiple fingering systems their makers might use.
This is not exhaustive, as there are even more variants on both single chamber and multichamber ocarinas I've omitted for something resembling brevity.
Let's see how ChatGPT fares.
Generate a picture of a double ocarina:
Generate a picture of an alto C multichamber ocarina:
These have a lot of commonalities with previous AI generated examples in not being remotely practical as an instrument, and don't even resemble the form factor they should if the result was to be convincing.
My takeaway here is that with sufficient training, you can create a model that will be very effective in a specialized generation role. IDE autocomplete has been made massively more useful thanks to LLMs, and the ability of AI models to generate remotely interesting images is impressive all its own, not to mention what specialized image models are capable of. I'm sure image models have similar failings with less obscure musical instruments, but I suspect the ocarina will continue to be an issue for image models for several reasons:
- They aren't nearly as standardized as common orchestral instruments, so there are many designs, and image models would need to distinguish them.
- Nintendo can be a bit litigious about their trademarks, and asking for an image of the Ocarina of Time provided a response saying it wouldn't generate that particular ocarina before generating one in a shape that is definitely not transverse. The Ocarina of Time isn't a practical design, but it does have that transverse form factor.
- Models don't understand what makes a playable instrument, and that's required to make a convincing image. Tone holes are not all the same size, and other holes in the instrument serve specific purposes with their size/shape, such as subholes or split holes.
Additionally, for models which are trying to be more generalist, there are massive opportunities for unseen edge cases where they will fail. Those failures can be amusing like the examples here or they can cause major problems when AI falls prey to malicious use. Just look at the Morris-II worm, which attacks Generative AI email assistants. If any prompt engineers out there find methods to coax better results out of existing models, check my link tree page for ways to reach out.
Autumn Rust Boost Recap Feed is Live
Published on November 5, 2024 at 4:25pm by Matt the Tall
After the Satellite Skirmish: Autumn Rust concluded, boo-bury organized a session to record a reading of all the boosts that came in during the show and up to the recap. If you didn't catch the Skirmish, you can watch the recording here or find it in a new podcast app.
Thanks to the RSS feed for the show, people can send value even after the festivities are over, so there were some reads of boosts that came in between Autumn Rust and the recap. It was fun hanging out with boo-bury, Frankiepaint, SirSpencer, and Em of Survival Guide reading through the boosts to credit and thank everyone who sent them. Much like the Skirmish, you can listen to the recap in your choice of new podcast app or follow this link.
That Time I Discovered Unicode has a Formatting Stack While Fixing a Localization Bug
Published on October 4, 2024 at 12:55pm by Matt the Tall
This story happened back in 2015. It feels so long ago, back when Sublime Text was my primary text editor outside an IDE rather than VS Code (or VS Codium these days when I don't want the telemetry).
For those unaware, RTL languages (Arabic and Hebrew are what I worked with) can cause all kinds of annoyances and problems in getting the UI to look nice and be comprehensible to those who use those languages. Everything you're used to having on the left side of the UI switches to being on the right side and vice versa, and text reads from the right side of the line.
Back when this happened, there had been a number of RTL (Right to Left) issues where localization to RTL languages was causing a problem with the old Windows Open File dialog. When opening that window for a user to select a file, you pass it a filter string, which looks something like this:
Extension type (*.ext)|*.ext
When the direction was set to RTL, it looked more like this:
txe.*|(txe.*) epyt noisnetxE
The substring after the | is what's used to filter which files show in the UI. If you have a Unicode “Right To Left override” character to make the RTL languages readable for native readers, it breaks the format and you don't get any filtering.
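To make the format concrete, here is a minimal Python sketch of how a pipe-delimited filter string could be split into a description and a pattern, and how an invisible control character in the pattern half changes what the pattern is. The function names and file names are made up for illustration, and the toy matcher is an assumption on my part, not how the Windows dialog actually processes the string.

```python
from fnmatch import fnmatch

def parse_filter(filter_string):
    """Split 'Description|pattern|Description|pattern...' into (description, pattern) pairs."""
    parts = filter_string.split("|")
    if len(parts) % 2 != 0:
        raise ValueError("filter string needs description|pattern pairs")
    return list(zip(parts[0::2], parts[1::2]))

def visible_files(filter_string, file_names):
    """Return the names that match the pattern of the first filter entry (case-insensitive)."""
    _description, pattern = parse_filter(filter_string)[0]
    return [name for name in file_names if fnmatch(name.lower(), pattern.lower())]

# Hypothetical file names for the demo
files = ["device.ext", "notes.txt", "backup.EXT"]

good = "Extension type (*.ext)|*.ext"
# Hypothetical broken case: a Right-to-Left Override (U+202E) lands in the pattern half
broken = "Extension type (*.ext)|\u202e*.ext"

print(parse_filter(good))
print(visible_files(good, files))

# The override is invisible on screen but is still part of the pattern string
print(repr(parse_filter(broken)[0][1]))
# In this toy matcher nothing matches; the real dialog may fail differently
print(visible_files(broken, files))
```

The point of the sketch is only that the pattern half is processed as data rather than read as text, so anything added to it for display purposes, visible or not, becomes part of the pattern.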
When trying to figure the issue out, one of the technical translators recommended something that led down an unexpected rabbit hole. She pointed me toward an editor specifically for dealing with Unicode that had a useful ability to insert and render invisible control characters, including a Right to Left Override character. (Back then, text editors weren't as good at dealing with Unicode as they are now.) Seeing that override, I figured there would be a matching Left to Right Override, and there was. I then spotted a name for a character whose acronym collides with a well known document format: PDF, or Pop Directional Formatting. I then went and read the official spec for that character to gain a deeper understanding.
Anyone who's dealt with computing and knows what a stack is probably just reacted like I did on discovering that: Unicode has a directional formatting stack where you can push and pop elements!? That character allowed me to properly fix the issue. Now, someone might think: does it really make a difference that the text direction is perfect when people who read Hebrew or Arabic probably have to get used to reading it backward for dealing with software that doesn't support the language as well? Yes and no. It matters less for Hebrew because each character is distinct like in printed English, but Arabic is a script you could compare to cursive, and since the characters are linked together, words would look very different when the characters are reversed. Also, I'm just a stickler for getting the UI right.
Because of how other text editors rendered the text at the time, the end result looked more like this when the text wasn't in the localization editor:
<LeftToRight override char> <RightToLeft override char> <Pop Directional Format char> RTL reading Arabic text (*.ext)|*.ext
For editors that displayed the directional formatting correctly, it looked more like this:
<LeftToRight override char> <RightToLeft override char> RTL Arabic text <Pop Directional Format char> (*.ext)|*.ext
With this as the end result:
شهادة الجهاز *.pfx |(*.pfx)
If you copy/paste that result text into an editor, put your cursor somewhere near the end of the string, and tap the left arrow on your keyboard to see where the cursor goes, you'll get a better idea of how the text is read vs how Windows processes the text. I hope you enjoyed learning this obscure bit of information about text display and processing.
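For anyone who wants to poke at the push and pop behavior without a special editor, here is a small Python sketch that walks a string and tracks the explicit directional formatting characters as a stack. It is a simplification: it ignores the depth limits and the newer isolate characters described in the Unicode Bidirectional Algorithm, and the string it builds is my reconstruction of the layout shown above, not the exact code point sequence from the original fix.

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Bidi classes that open an embedding or an override
PUSH = {"LRE", "RLE", "LRO", "RLO"}

def trace_stack(text):
    """Return (character name, stack after it) for each explicit formatting character.

    Simplified model: no depth limit, no isolates, unmatched pops are ignored.
    """
    stack, events = [], []
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in PUSH:
            stack.append(bidi_class)
        elif bidi_class == "PDF":
            if stack:
                stack.pop()
        else:
            continue  # ordinary text does not touch the stack
        events.append((unicodedata.name(ch), list(stack)))
    return events

# Assumed reconstruction: LRO, RLO, Arabic description, PDF, then the filter parts
filter_string = LRO + RLO + "شهادة الجهاز" + PDF + " (*.pfx)|*.pfx"

for name, stack in trace_stack(filter_string):
    print(name, stack)

# The part after the | is untouched, so it can still be used as a pattern
print(filter_string.split("|")[1])
```

After the pop, only the Left to Right Override is still on the stack, which is why the description can read right to left while the `(*.pfx)|*.pfx` portion stays in left to right order.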
PS: If you were wondering whether I made the title sound like one of those overly long anime/manga titles on purpose, wonder no more. 🙂
After the Satellite Skirmish: Autumn Rust
Published on September 23, 2024 at 2:08pm by Matt the Tall
The Skirmish is over, and it was a great time. Something went wrong with the stream during the first song of the set, so there was a sound loop that threw things off. If it was the restreaming system, it sounds like it should get fixed in the recording when that's all put together (edit: not an issue in the recording). Despite that causing a temporary halt to my set, I played through the rest of it afterward and managed to squeeze in the finish with something like 15 seconds to spare.
Lessons learned:
- Record locally while streaming, just in case
- Get a condenser mic and an arm for it (probably AT2020, but open to recommendations)
  - Should pick up less of the room than the dynamic I have right now
  - Easier to get close so my voice is audible
  - One caveat will be to figure out shortcuts/macros to switch between audio profiles for ocarina and voice
- Use a relaxant, perhaps? I was chock full of adrenaline during that performance, particularly after the stream issues with the first song, so a shot of liquor shortly before the set might have taken the edge off.
It was a ton of fun going through the live chat after the set to both respond to people and see the discussion about my sword when people noticed it in the background. 😆
Thank you to the production crew and everyone else who made the event happen! Would love to perform in future Skirmishes and more events.
Some major thank yous are in order for those who boosted during my set:
- ericpp
- boo-bury
- DuhLaurien (and the Bowlette boost, glad the kids loved both the sword in the background and hearing The Stables)
- Boolysteed
- cbrooklyn
- Salty Crayon
- Kolomona - Sir Libre
- petar
- frankiepaint
- cottongin
- HeyCitizen
- beamus
- Andy RNR Breakheart
- netned
- Em
- dude
- ChadF
- SirSpencer
- marykateultra
- Heather Larson
- natejohnivan
If you want me to include your note(s) in addition to your name (or change your name from how it came in the boost), reach out and let me know. I have a few ways you can get in contact here.
Performing in the Satellite Skirmish: Autumn Rust
Published on September 20, 2024 at 12:04pm by Matt the Tall
The show will be kicking off at 2pm Pacific Sunday. The Skirmish is an online Value 4 Value battle of the bands format where the artist who receives the most donations in satoshis (a small fraction of a bitcoin) wins the competition.
You can watch the show here, and if you want to send boosts (pieces of bitcoin with a message) live during the show to support me, I'd recommend downloading Podcast Guru and setting up a Lightning wallet that connects to the app.
I'll be unveiling three songs I'm working on that have yet to be recorded. Anyone who wants a say in which one gets published first can send boosts during the show saying which one, and the song that gets the most sats will be recorded before the others.
Hello World!
Published on June 4, 2024 at 10:06pm by Matt the Tall
For anyone who works remotely close to software, the title choice for this post probably comes as zero surprise. I'm a software engineer by trade who dabbles in a host of unrelated subjects. For a long time I've played with the idea of setting up a blog, and getting the domain name on the cheap for the first year helped me decide to actually pull the trigger on the domain.
This blog was set up using MyWebLog made by Daniel J. Summers. When I asked about recommendations for blogging systems, I went with his system both because of our common enjoyment of No Agenda, a comedic show full of news deconstruction and analysis, and because it's always nice to be able to ask questions directly of the person who wrote the software you're using.
An added reason for the choice is there's support for Podcasting 2.0 features. If that's a new term for you, you can read more about it at the Podcast Index. I don't know whether I'll go about starting my own podcast, but I do want to take advantage of those features for my music.
Yes, music. One of the cool things that's spun out of Podcasting 2.0 is its enabling of V4V (Value 4 Value) podcasts, and that tech is also being used for music in the same vein. Eventually I want to host my V4V music, though I haven't gotten to that yet. I play a somewhat esoteric instrument that most people either haven't heard of or think only exists in the Legend of Zelda games: the ocarina.