Music: Category Archive
The Ocarina Test for AI Image Generators
Published on December 27, 2024 at 12:46pm by Matt the Tall
A while back, in a discussion about LLMs and image models, I mentioned offhandedly to a friend that image models don't really understand what ocarinas are, and he said I should coin an “ocarina test” for them. That idea developed into the article you're reading now. I've played with AI models here and there, mostly setting up a few local models like Stable Diffusion to run on my own hardware and occasionally seeing how bigger models like GPT-4 or DALL-E differ from locally run ones. Put simply, image generation models don't handle ocarinas well for a number of reasons.
Even in human circles, the ocarina is a pretty obscure instrument. In my unscientific observation of anecdotes, I'd say roughly 30-50% of those I meet know the ocarina as an instrument when I play it, but the majority of those who recognize it don't know it's an instrument with more history than the Legend of Zelda games. The transverse form factor that's familiar from Legend of Zelda was invented in 1853 Italy by Giuseppe Donati. The wider family of instruments ocarinas are a part of, the vessel flute, has been around for thousands of years.
Take a look at an example of a traditional Italian 10-hole ocarina from maker Menaglio in Italy:
For more eye candy, these are examples of Asian style transverse ocarinas:
Clockwise from top left:
- 12-hole Signature Alto C from OcarinaMusic
- 12-hole Soprano G from Aketa
- 11-hole Alto D from Pure Ocarinas
- 11-hole Alto C from Oberon Ocarinas
For the rest of the article, I'm going to use terms for ocarina anatomy assuming you know what they mean. If those terms aren't familiar to you, please look at this article on the Pure Ocarinas site with a ton of other great info. There will be easy reference links the first time terms show up.
If you're wondering what's different between an Asian and an Italian design, it comes down to how the upper range is played. To play the highest two notes in the natural scale of the ocarina (not using sharps/flats) on an Italian ocarina, you lift the left pinky followed by the right thumb as you climb the two highest notes in the range. With an Asian ocarina, the order is reversed: right thumb followed by left pinky. The size of the tone holes for those two fingers reflects the difference.
In addition to transverse ocarinas, there's another type to further muddy the waters for hapless AI models: the pendant ocarina. There are both English style pendants and those in a more transverse looking shape, not to mention more artistic pieces that function as playable sculptures.
We haven't even covered all the major types of ocarinas that could be taken into account, but let's take a look at how ChatGPT 4 currently handles a couple of simple image prompts.
With the prompt “generate an image of an ocarina” I got this:
Better than I would have expected, but very obviously not playable and WAY too many holes for what otherwise looks like a pendant form factor. The shiny blue glaze and raised designs do look like a plausible art style.
Let's see what happens if I try to be a little more specific with “generate an image of a 10 hole Italian transverse ocarina.”
It actually got worse! There are 18 holes rather than 10, and it's linear instead of transverse (a transverse flute is held with the body perpendicular to the mouth). I will give credit, it looks like an unglazed clay more common to Italian ocarinas, but that's about the only thing this one got right. Worth noting as well, the windway opening is entirely the wrong shape for sending the air to the voicing in a useful way. With rare exceptions, ocarina windways are rectangular.
Let's dig even deeper into this rabbit hole, shall we?
First, there are some limitations to the ocarina as an instrument. There's a maximum amount of range that can be added to an ocarina by opening up holes in the body. To sound good across its range, an ocarina usually has a maximum of 12 holes in one chamber, aside from some uncommon sopranos that can support 13. Personally, I prefer ocarinas with 10 or 11 holes because they feel nicer to play and sound better across their range. As a way around the limited range of single chamber ocarinas, the multichamber ocarina was invented. They are oft referred to as double, triple, and quadruple ocarinas to denote the number of chambers.
I suspect it's not difficult to see how the piles of statistics we call AI systems would have problems with figuring out how to generate a double or even a triple ocarina like these pictured, which also have multiple fingering systems their makers might use.
This is not exhaustive, as there are even more variants on both single chamber and multichamber ocarinas I've omitted for something resembling brevity.
Let's see how ChatGPT fares.
Generate a picture of a double ocarina:
Generate a picture of an alto C multichamber ocarina:
Like the previous AI-generated examples, these aren't remotely practical as instruments, and they don't even resemble the form factor they would need for the result to be convincing.
My takeaway here is that with sufficient training, you can create a model that will be very effective in a specialized generation role. IDE autocomplete has been made massively more useful thanks to LLMs, and the ability of AI models to generate remotely interesting images is impressive all on its own, let alone what specialized image models are capable of. I'm sure image models have similar failings with less obscure musical instruments, but I suspect the ocarina will continue to be an issue for image models for several reasons:
- They aren't nearly as standardized as common orchestral instruments, so there are many designs, and image models would need to distinguish them.
- Nintendo can be a bit litigious about their trademarks, and asking for an image of the Ocarina of Time provided a response saying it wouldn't generate that particular ocarina before generating one in a shape that is definitely not transverse. The Ocarina of Time isn't a practical design, but it does have that transverse form factor.
- Models don't understand what makes a playable instrument, and that's required to make a convincing image. Tone holes are not all the same size, and other holes in the instrument serve specific purposes with their size/shape, such as subholes or split holes.
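The checks I've been eyeballing in this article can be written down as a rough checklist. This is only a sketch under some assumptions: a human still has to look at the image and fill in what they see, the `ObservedOcarina` record and `ocarina_test` function are names I made up for illustration, and it only covers a request for a single chamber transverse ocarina (a pendant would need different rules).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hole limits from earlier in the article: single chamber ocarinas usually top
# out at 12 holes, with uncommon sopranos that can support 13.
MAX_HOLES = 12
MAX_HOLES_SOPRANO = 13


@dataclass
class ObservedOcarina:
    """What a human reviewer counted in a generated image (hypothetical record)."""
    holes: int
    transverse: bool           # body held perpendicular to the mouth
    rectangular_windway: bool  # rare real-world exceptions exist
    soprano: bool = False


def ocarina_test(obs: ObservedOcarina, requested_holes: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means the image passes."""
    problems = []
    limit = MAX_HOLES_SOPRANO if obs.soprano else MAX_HOLES
    if obs.holes > limit:
        problems.append(f"{obs.holes} holes exceeds the single chamber limit of {limit}")
    if requested_holes is not None and obs.holes != requested_holes:
        problems.append(f"asked for {requested_holes} holes, counted {obs.holes}")
    if not obs.transverse:
        problems.append("body is not transverse")
    if not obs.rectangular_windway:
        problems.append("windway is not rectangular")
    return problems


# The 10-hole Italian transverse attempt from above: 18 holes, linear body,
# wrong windway shape.
second_attempt = ObservedOcarina(holes=18, transverse=False, rectangular_windway=False)
for problem in ocarina_test(second_attempt, requested_holes=10):
    print(problem)
```

Passing this checklist wouldn't make an image playable, since it says nothing about relative tone hole sizes, subholes, or split holes. It only catches the most obvious misses.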
Additionally, for models which are trying to be more generalist, there are massive opportunities for unseen edge cases where they will fail. Those failures can be amusing like the examples here or they can cause major problems when AI falls prey to malicious use. Just look at the Morris-II worm, which attacks Generative AI email assistants. If any prompt engineers out there find methods to coax better results out of existing models, check my link tree page for ways to reach out.
Autumn Rust Boost Recap Feed is Live
Published on November 5, 2024 at 4:25pm by Matt the Tall
After the Satellite Skirmish: Autumn Rust concluded, boo-bury organized a session to record a reading of all the boosts that came in during the show and up to the recap. If you didn't catch the Skirmish, you can watch the recording here or find it in a new podcast app.
Thanks to the RSS feed for the show, people can send value even after the festivities are over, so there were some reads of boosts that came in between Autumn Rust and the recap. It was fun hanging out with boo-bury, Frankiepaint, SirSpencer, and Em of Survival Guide reading through the boosts to credit and thank everyone who sent them. Much like the Skirmish, you can listen to the recap in your choice of new podcast app or follow this link.
After the Satellite Skirmish: Autumn Rust
Published on September 23, 2024 at 2:08pm by Matt the Tall
The Skirmish is over, and it was a great time. Something went wrong with the stream during the first song of the set, so there was a sound loop that threw things off. If it was the restreaming system, it sounds like it should get fixed in the recording when that's all put together (edit: not an issue in the recording). Despite that causing a temporary halt to my set, I played through the rest of it afterward and managed to squeeze in finishing before the time was up with something like 15 seconds to spare.
Lessons learned:
- Record locally while streaming, just in case
- Get a condenser mic and an arm for it (probably AT2020, but open to recommendations)
  - Should pick up less of the room than the dynamic I have right now
  - Easier to get close so my voice is audible
  - One caveat will be to figure out shortcuts/macros to switch between audio profiles for ocarina and voice
- Use a relaxant, perhaps? I was chock full of adrenaline during that performance, particularly after the stream issues with the first song, so a shot of liquor shortly before the set might have taken the edge off.
It was a ton of fun going through the live chat after the set to both respond to people and see the discussion about my sword when people noticed it in the background. 😆
Thank you to the production crew and everyone else who made the event happen! Would love to perform in future Skirmishes and more events.
Some major thank yous are in order for those who boosted during my set:
- ericpp
- boo-bury
- DuhLaurien (and the Bowlette boost, glad the kids loved both the sword in the background and hearing The Stables)
- Boolysteed
- cbrooklyn
- Salty Crayon
- Kolomona - Sir Libre
- petar
- frankiepaint
- cottongin
- HeyCitizen
- beamus
- Andy RNR Breakheart
- netned
- Em
- dude
- ChadF
- SirSpencer
- marykateultra
- Heather Larson
- natejohnivan
If you want me to include your note(s) in addition to your name (or change your name from how it came in the boost), reach out and let me know. I have a few ways you can get in contact here.
Performing in the Satellite Skirmish: Autumn Rust
Published on September 20, 2024 at 12:04pm by Matt the Tall
The show will be kicking off at 2pm Pacific on Sunday. The Skirmish is an online Value 4 Value battle of the bands format where the artist who receives the most donations in satoshis (one hundred-millionth of a bitcoin each) wins the competition.
You can watch the show here, and if you want to send boosts (pieces of bitcoin with a message) live during the show to support me, I'd recommend downloading Podcast Guru and setting up a Lightning wallet that connects to the app.
I'll be unveiling three songs I'm working on that have yet to be recorded. Anyone who wants a say in which one gets published first can send boosts during the show saying which one, and the song that gets the most sats will be recorded before the others.
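For the curious, the vote works out to a sum of sats per song rather than a count of boosts. Here's a minimal sketch of that tally, assuming each boost has already been matched by hand to the song it names. The senders, amounts, and song names below are placeholders, not real boosts or real titles.

```python
from collections import Counter

# Hypothetical boosts: (sender, sats, song named in the message).
boosts = [
    ("listener_a", 2100, "Song A"),
    ("listener_b", 500, "Song B"),
    ("listener_c", 1000, "Song B"),
    ("listener_d", 333, "Song C"),
]


def tally_sats(boosts):
    """Sum sats per song; the vote is weighted by amount, not by boost count."""
    totals = Counter()
    for _sender, sats, song in boosts:
        totals[song] += sats
    return totals


totals = tally_sats(boosts)
winner, winning_sats = totals.most_common(1)[0]
print(f"{winner} gets recorded first with {winning_sats} sats")
```

In this made-up data, Song B has more boosts but Song A has more sats, so Song A comes out on top. A tie isn't handled here and would need a tiebreaker.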
Hello World!
Published on June 4, 2024 at 10:06pm by Matt the Tall
For anyone who works remotely close to software, the title choice for this post probably comes as zero surprise. I'm a software engineer by trade who dabbles in a host of unrelated subjects. For a long time I've played with the idea of setting up a blog, and getting the domain name on the cheap for the first year helped me decide to actually pull the trigger.
This blog was set up using MyWebLog, made by Daniel J. Summers. When I asked for recommendations for blogging systems, I went with his system both because of our common enjoyment of No Agenda, a comedic show full of news deconstruction and analysis, and because it's always nice to be able to directly ask questions of the person who wrote the software you're using.
An added reason for the choice is there's support for Podcasting 2.0 features. If that's a new term for you, you can read more about it at the Podcast Index. I don't know whether I'll go about starting my own podcast, but I do want to take advantage of those features for my music.
Yes, music. One of the cool things that's spun out of Podcasting 2.0 is its enabling of V4V (Value 4 Value) podcasts, and that tech is also being used for music in the same vein. Eventually I want to host my V4V music, though I haven't gotten to that yet. I play a somewhat esoteric instrument that most people either haven't heard of or think only exists in the Legend of Zelda games: the ocarina.