
The Ocarina Test for AI Image Generators

Published on December 27, 2024 at 12:46pm by Matt the Tall

A while back, in a discussion about LLMs and image models, I offhandedly mentioned to a friend that image models don't really understand what ocarinas are, and he said I should coin an “ocarina test” for them. That idea developed into the article you're reading now. I've done some playing with AI models here and there, mostly setting up a few local models like Stable Diffusion to run on my own hardware and occasionally seeing how bigger models like GPT-4 or DALL-E differ from locally run models. Put simply, image generation models don't handle ocarinas well for a number of reasons.

Even in human circles, the ocarina is a pretty obscure instrument. By my unscientific, anecdotal estimate, roughly 30-50% of the people I meet recognize the ocarina as an instrument when I play it, but most of those who recognize it don't know it has a history beyond the Legend of Zelda games. The transverse form factor that's familiar from Legend of Zelda was invented in Italy in 1853 by Giuseppe Donati. The wider family of instruments that ocarinas belong to, the vessel flutes, has been around for thousands of years.

Image of Donati

Take a look at an example of a traditional Italian 10-hole ocarina from maker Menaglio in Italy:

Budrio style Italian ocarina from Menaglio

For more eye candy, these are examples of Asian style transverse ocarinas:

Asian ocarinas

Clockwise from top left:

  • 12-hole Signature Alto C from OcarinaMusic
  • 12-hole Soprano G from Aketa
  • 11-hole Alto D from Pure Ocarinas
  • 11-hole Alto C from Oberon Ocarinas

For the rest of the article, I'm going to use terms for ocarina anatomy assuming you know what they mean. If those terms aren't familiar to you, please look at this article on the Pure Ocarinas site with a ton of other great info. There will be easy reference links the first time terms show up.

If you're wondering what's different between an Asian and an Italian design, it comes down to how the upper range is played. To play the two highest notes in the natural scale of the ocarina (not using sharps/flats) on an Italian ocarina, you lift the left pinky followed by the right thumb as you climb. With an Asian ocarina, the order is reversed: right thumb followed by left pinky. The size of the tone holes for those two fingers reflects the difference.
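For the programmers in the audience, here is a minimal sketch of that difference as a lookup table. The dictionary and function names are my own invention for illustration and don't come from any ocarina library; the only facts encoded are the two lift orders described above.

```python
# Which finger lifts first when climbing the two highest natural notes,
# as described in the paragraph above. All names here are illustrative.
HIGH_NOTE_LIFT_ORDER = {
    "italian": ("left pinky", "right thumb"),
    "asian": ("right thumb", "left pinky"),
}


def high_note_lift_order(style: str) -> tuple:
    """Return the fingers to lift, in order, for the top two natural notes."""
    return HIGH_NOTE_LIFT_ORDER[style.strip().lower()]


for style in ("Italian", "Asian"):
    first, second = high_note_lift_order(style)
    print(f"{style}: lift the {first}, then the {second}")
```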

In addition to transverse ocarinas, there's another type to further muddy the waters for hapless AI models: the pendant ocarina. There are both English style pendants and those in a more transverse looking shape, not to mention more artistic pieces that function as playable sculptures.

Pendant ocarinas

We haven't even covered all the major types of ocarinas that could be taken into account, but let's take a look at how ChatGPT 4 currently handles a couple of simple image prompts.

With the prompt “generate an image of an ocarina” I got this:

AI generated ocarina image

Better than I would have expected, but very obviously not playable and WAY too many holes for what otherwise looks like a pendant form factor. The shiny blue glaze and raised designs do look like a plausible art style.

Let's see what happens if I try to be a little more specific with “generate an image of a 10 hole Italian transverse ocarina.”

AI generated ocarina image

It actually got worse! There are 18 holes rather than 10, and it's linear instead of transverse (a transverse flute is held with the body perpendicular to the mouth). I will give credit: it looks like the unglazed clay more common to Italian ocarinas, but that's about the only thing this one got right. It's also worth noting that the windway opening is entirely the wrong shape for sending the air to the voicing in a useful way. With rare exceptions, ocarina windways are rectangular.

Ocarina with windway visible

But wait, there's more

Let's dig even deeper into this rabbit hole, shall we?

First, there are some limitations to the ocarina as an instrument. There's a maximum amount of range that can be added to an ocarina by opening up holes in the body. For an ocarina to sound good across its range, this usually means a maximum of 12 holes in one chamber, aside from some uncommon sopranos that can support 13. Personally, I prefer ocarinas with 10 or 11 holes because they feel nicer to play and sound better across their range. As a way around the limited range of single chamber ocarinas, the multichamber ocarina was invented. These are often referred to as double, triple, and quadruple ocarinas to denote the number of chambers.

I suspect it's not difficult to see how the piles of statistics we call AI systems would have problems with figuring out how to generate a double or even a triple ocarina like these pictured, which also have multiple fingering systems their makers might use.

Multichamber ocarinas

This is not exhaustive, as there are even more variants on both single chamber and multichamber ocarinas I've omitted for something resembling brevity.

Let's see how ChatGPT fares.

Generate a picture of a double ocarina: AI generated double ocarina

Generate a picture of an alto C multichamber ocarina: AI generated AC multichamber

These have a lot in common with the previous AI generated examples: they aren't remotely practical as instruments, and they don't even resemble the form factor they would need for the result to be convincing.
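If you wanted to apply the same checks I've been eyeballing in a more repeatable way, a rough sketch might look like the following. To be clear about the assumptions: nothing here analyzes an image. A human fills in every field by looking at the picture, and the `Observation` class, its field names, and the `ocarina_test` function are all hypothetical names made up for this example. The 12-hole limit (13 for some sopranos) and the rectangular windway are the rules of thumb from this article, and both have rare exceptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """What a human reviewer sees in one generated image (hand-entered)."""
    requested_holes: Optional[int]  # hole count asked for in the prompt, if any
    holes_per_chamber: List[int]    # holes counted on each chamber in the image
    requested_form: str             # e.g. "transverse" or "pendant"
    observed_form: str              # what the image actually looks like
    windway_rectangular: bool       # does the windway opening look rectangular?
    soprano: bool = False           # some sopranos can support 13 holes


def ocarina_test(obs: Observation) -> List[str]:
    """Return the failed checks; an empty list means the image passes."""
    failures = []
    limit = 13 if obs.soprano else 12  # rule of thumb, not a hard law
    if any(count > limit for count in obs.holes_per_chamber):
        failures.append("too many holes for one chamber")
    if (obs.requested_holes is not None
            and sum(obs.holes_per_chamber) != obs.requested_holes):
        failures.append("hole count does not match the prompt")
    if obs.observed_form != obs.requested_form:
        failures.append("wrong form factor")
    if not obs.windway_rectangular:
        failures.append("windway is not rectangular")
    return failures


# My notes on the "10 hole Italian transverse ocarina" result from earlier.
second_prompt = Observation(
    requested_holes=10,
    holes_per_chamber=[18],
    requested_form="transverse",
    observed_form="linear",
    windway_rectangular=False,
)
for failure in ocarina_test(second_prompt):
    print("FAIL:", failure)
```

Run against my notes for the second prompt, all four checks fail. The list is deliberately short; checks for things like tone hole sizing would need someone who knows the specific fingering system.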

My takeaway here is that with sufficient training, you can create a model that will be very effective in a specialized generation role. IDE autocomplete has been made massively more useful thanks to LLMs, and the ability of AI models to generate remotely interesting images is impressive on its own, let alone what specialized image models are capable of. I'm sure image models have similar failings with less obscure musical instruments, but I suspect the ocarina will continue to be an issue for image models for several reasons:

  • They aren't nearly as standardized as common orchestral instruments, so there are many designs, and image models would need to distinguish them.
  • Nintendo can be a bit litigious about their trademarks. When I asked for an image of the Ocarina of Time, the response said it wouldn't generate that particular ocarina, and then it generated one in a shape that is definitely not transverse. The Ocarina of Time isn't a practical design, but it does have that transverse form factor.
  • Models don't understand what makes a playable instrument, and that's required to make a convincing image. Tone holes are not all the same size, and other holes in the instrument serve specific purposes with their size/shape, such as subholes or split holes.

Additionally, for models that are trying to be more generalist, there are massive opportunities for unseen edge cases where they will fail. Those failures can be amusing, like the examples here, or they can cause major problems when AI falls prey to malicious use. Just look at the Morris-II worm, which attacks generative AI email assistants. If any prompt engineers out there find methods to coax better results out of existing models, check my link tree page for ways to reach out.

Categorized under: Music, Ocarina, Technology, AI/ML
Tagged: ai, generative ai, image models, llm, music, ocarina