I love when machine learning is used to create genuinely new things* instead of trying to imitate human processes. Both "alien" and "human-like" directions are related, but the human-like results aren't as mind-blowing because, well, we already do it ourselves. This reminds me of those ControlNet images that represent different things at different scales, results that feel like magic.
* Yes, the Aphex face (HN favorite) is mentioned in the paper along with a bunch of other interesting references. I'd argue that there is a difference here. It's the way in which audio and spectrogram are both representational while being encoded by the same information. The difference might be a bit subjective.
wizzwizz4 149 days ago [-]
> It's the way in which audio and spectrogram are both representational while being encoded by the same information.
They're not, really. If you've spent much time reading spectrograms, you can look at the images and see which part is the sound. Occasionally the picture and sound are "chosen" to line up (notably the birdsong, disguised as distant blurry flower petals – notably not the large flowers, which don't really sound like birdsong at all), but usually the sound appears as clear, visually-distinct banding (e.g. the kittens meowing), and the image shows up as very audible distortion (e.g. the tigers).
Choosing sounds and images that line up properly might not be something a human has tried before, but I can already see how I'd do it, if I were better at drawing. It's like a multimedia ambigram: https://en.wikipedia.org/wiki/Ambigram. The diffusion model is missing a lot of tricks: I could get similar quality by making a collage of clip art.
Edit: just saw the third "painting of a farm" one. That one is actually quite good, the way it uses the solid space in the barns. The first "painting of a farm" has the right idea (if I may anthropomorphise the system), working the birdsong in as clouds and trees, but its execution is severely lacking.
mrkramer 150 days ago [-]
>I love when machine learning is used to create genuinely new things* instead of trying to imitate human processes.
Machine learning is based upon the human nervous system, so it was directly inspired by humans, and all use cases so far try to mimic human behaviour, e.g. vision, speech, data and information synthesis, art generation, etc.
gessha 149 days ago [-]
> Machine learning is based upon the human nervous system
Very, very loosely if at all. There was a moment in computer vision history where researchers began to focus more on practical algorithmic improvements and mathematical optimization techniques rather than direct biological realism. This was around LeCun’s time in the 1990s.
colechristensen 150 days ago [-]
The extent to which LLMs are comparable to how collections of neurons work is really quite limited.
mrkramer 148 days ago [-]
All I meant is that it's loosely based on animal/human neural networks.
This process has an interesting history. In 1937, Russian engineer Evgeny Murzin created a photoelectronic synthesizer that used images on glass plates to play music. It was called the ANS synthesizer and was used for the soundtrack of the Tarkovsky film Solaris. Somebody made an app version of it, called Virtual ANS, that is generally available.
https://en.wikipedia.org/wiki/ANS_synthesizer
A commercial product called MetaSynth, released in 1998, was the program Aphex Twin used for the song referenced here; it was also used in the Matrix movie for the bullet-time sound effects. It's amazing, but very expensive.
naltroc 150 days ago [-]
super cool. I love to think of the spectrogram as a blank canvas, where you mix spectral masks to produce interesting synth sounds.
This approach takes the "canvas" part more literally, producing recognizable objects where we more often see fuzzy or geometric data.
Like the article mentions, it is highly constrained: the subject needs a plausible sound whose spectrogram can share features with the object's appearance, such as its timbre.
It works very well on the dog because a dog's bark is already closer to "noise" than "instrument" or "voice", and you can fit that noise to any octave that is convenient for the image.
Notice on the horse how the "neigh" sound wiggles up and down, and thus there are features of the horse that also have wiggling.
The castle uses "bell" tones, a great choice because these types of tones prefer non-integer values for their harmonics (which come across as the horizontal brick lines in the castle walls). Extra convenient because you can generate new harmonics to get a new wall.
The garden views largely use bird sounds. Remember what we said about the dog sound being translatable noise, and the horse having a wiggle? The bird sound applies both of those techniques, and its sound is represented by the petals or leaves we see.
Since we perceive sound on a logarithmic scale, it is possible to show a visual representation of the object without it being perceptible in the final mix! Our ears are most sensitive to the most prominent frequencies, which here look like the darkest pixels. So the lighter pixels can provide some kind of shaping (like edges or boundaries) while the darker pixels provide the sound we perceive.
cortland [dot] mahoney [at] gmail [dot] com
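The pixel-to-partial mapping described above is basically ANS-style additive synthesis, and it can be sketched in a few lines of numpy. Everything here (sample rate, log-spaced partial frequencies, column duration) is an assumed toy parameterisation, not the paper's actual pipeline:

```python
import numpy as np

def image_to_audio(img, sr=22050, col_dur=0.05, fmin=200.0, fmax=8000.0):
    """Toy ANS-style sonification: each column of `img` (floats in [0, 1],
    row 0 = top of the image) becomes a slice of audio built by summing
    sine partials, one per row, weighted by pixel darkness."""
    n_rows, n_cols = img.shape
    freqs = np.geomspace(fmax, fmin, n_rows)  # log-spaced: top row = highest pitch
    n = int(sr * col_dur)                     # samples per image column
    t = np.arange(n) / sr
    phase = np.zeros(n_rows)
    slices = []
    for c in range(n_cols):
        amps = 1.0 - img[:, c]                # darker pixel -> louder partial
        w = 2 * np.pi * freqs
        slices.append((amps[:, None] * np.sin(w[:, None] * t + phase[:, None])).sum(axis=0))
        phase = (phase + w * n / sr) % (2 * np.pi)  # keep partials phase-continuous
    audio = np.concatenate(slices)
    peak = np.abs(audio).max()
    return audio / peak if peak > 0 else audio

# A 4x8 "image" with one dark horizontal stripe -> a steady tone.
img = np.ones((4, 8))
img[2, :] = 0.0
audio = image_to_audio(img)
```

Darker pixels get louder partials, so a dark horizontal stripe becomes a steady tone, and a wiggly dark line becomes the kind of pitch wobble described for the horse's neigh.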
kazinator 149 days ago [-]
Oscillofun: stream of stereo samples that are music, and also a vector animation if interpreted as the X and Y deflection signals on an oscilloscope.
An aside, but the images produced from this are some of the coolest AI art I've seen, stylistically. Would love to have some of this displayed on my wall if it weren't machine-generated.
wsintra2022 147 days ago [-]
> Would love to have some of this displayed on my wall if it weren't machine-generated.
Not sure I understand? The machine wouldn't mind you displaying it on the wall.
Jerrrrry 148 days ago [-]
Meta, but the ironic juxtaposition of this sentiment is far too unsubtle not to be authentic.
I would give a comparable analogy, but I am unable.
If beauty is in the eye of the beholder, and truly subjective, then this is truly an illogical, human sentiment.
This is a cool demonstration of shared latent spaces. Whether you're generating audio and images, or audio and text, or robot controls and video, or... whatever. The concept of having a shared latent space for different modalities really has legs.
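A toy sketch of that idea, with random linear projections standing in for trained encoders (all names and dimensions here are made up for illustration): each modality gets its own encoder, but both land in one unit-normalised latent space where a plain dot product compares them, CLIP-style.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoders": random linear projections standing in for trained networks.
# Audio features are 64-dim, image features 128-dim; both map into a shared
# 16-dim latent space.
W_audio = rng.normal(size=(16, 64))
W_image = rng.normal(size=(16, 128))

def embed(W, x):
    z = W @ x
    return z / np.linalg.norm(z)  # unit-normalise, as CLIP-style models do

audio_feat = rng.normal(size=64)
image_feat = rng.normal(size=128)

z_a = embed(W_audio, audio_feat)
z_i = embed(W_image, image_feat)

# Because both embeddings live in the same space, one dot product compares
# an audio clip to an image directly.
similarity = float(z_a @ z_i)
```

In a real system the projections are trained (e.g. contrastively) so that matching audio/image pairs end up close together; the shared space is what lets you swap modalities in and out.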
rcpt 149 days ago [-]
Also worth checking out is oscilloscope music which pops up on this forum every few years
https://www.youtube.com/watch?v=o4YyI6_y6kw
(History of numerous appearances on HN).
Dwedit 149 days ago [-]
Not a joke, a spectrogram is where you split a sound signal into its frequencies and amplitude for each frequency using a Fourier transform. 2D images can be encoded that way, though they usually don't sound like anything significant.
Using the singular "sound" as a verb is unusual; it is a little bit of a joke, but it checks out.
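The frequency split described above is a short-time Fourier transform; here's a minimal magnitude-spectrogram sketch in numpy (frame length, hop size, and window choice are arbitrary assumptions):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: slice the signal into overlapping frames,
    window each one, and take the FFT magnitude per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape (n_frames, frame_len//2 + 1)

# A pure 1 kHz tone at an 8 kHz sample rate shows up as one bright horizontal band.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(spec.mean(axis=0).argmax())  # bin of the strongest frequency
```

A pure tone becomes a single bright horizontal band, which is why arbitrary drawn images tend to sound like broadband distortion rather than anything musical.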
esafak 150 days ago [-]
It's perfectly understandable once you read the article.
Y_Y 149 days ago [-]
I read the article and didn't understand it, hence my post. If you'd be kind enough to explain it then that would be appreciated.
Using "sound" on its own as a verb doesn't work in modern English. I don't think even a heavy goods poetic license permits that.
mejutoco 149 days ago [-]
That sounds wrong to me.
I am not an authority on English, but using sound as a verb seems common, and idiomatic. Not sure where you got this idea.
Clamchop 148 days ago [-]
Sound is usually found as a transitive verb "sound the alarm" or immediately followed by an adjective or adjective phrase "it sounds like it's coming from the attic".
It's almost never seen naked except when talking about "how something sounds" which follows from the latter usage above.
We're beyond grammar and in the weeds of idiomatic usage, but hopefully it's clear to English speakers that "it sounds" full stop is so eyebrow-raising that it should normally be avoided.
Unless you're being clever or meta, as the title here is, so it gets a pass in my opinion.
If it were written idiomatically, though, it'd be "images that make sounds".
mejutoco 147 days ago [-]
I agree with you. I guess "sound on its own" was not the best explanation either, compared to the transitive one you provided.
StevenWaterman 149 days ago [-]
Image is a verb here
> _makes a representation of the external form of_ that sound
And Sound is a verb here
> Images that _emit or cause to emit sound_
It's a punny title pointing out that it's both at once
thfuran 149 days ago [-]
Sound, as in "sound the alarm" or "the horn sounded", is a verb. The title literally means "images that make noise".
hrnnnnnn 149 days ago [-]
I'm a native English speaker and the title, using "sound" on its own as a verb, made immediate sense to me - images that [make] sound.
dylan604 150 days ago [-]
It's up there with phrases like "the color of sound"
yobananaboy 149 days ago [-]
My dog never responds to barking sounds in shows / movies, but when I played the spectrogram one he went wild. Anyone have an idea of what makes it different?
thfuran 149 days ago [-]
Probably a lot of extra high frequency content. How does your dog feel about dog (or regular) whistles?
Oh no, nightmare territory for a painting of a beach. Fascinating project though, I can think of a lot of use cases in artistic work.
tomduncalf 149 days ago [-]
This is really cool and creative! Honestly one of my favourite AI things I've seen recently. More like this please :)
lxe 149 days ago [-]
I love projects that have 0 practical applications and are 100% artistic and creative.
suhacker256 140 days ago [-]
amazing! reminds me a little of polyglot files
zaptrem 149 days ago [-]
They should use BigVGAN instead of HiFiGAN, afaik it’s better.
azinman2 150 days ago [-]
Whacky AF! Not sure about the utility but wow that’s cool!
underlipton 150 days ago [-]
According to financial media, market-moving memes (based on a Mr. Robot Easter egg)[1]. (I'm personally not convinced that Twitter shitposts are enough to momentarily blast a stock price 800% week-to-week, but who's to say?)
still, why has AFX not blasted the stock market with Windowlicker?
Gotta buy some AFX...
1: https://twitter.com/GtradeCrypto/status/1791616356263072137
Jerrrrry 150 days ago [-]
>utility
f'.... utility is the bane of existence...
i can now embed an obfuscated <script> in a barely-discernible QR code that exists solely on the spectroscopic output of an arbitrary noise
the stegacoustic opportunities also have practical phylactologic applications, outside of inane google easteregg hunts to find tail-end dwellers
wizzwizz4 149 days ago [-]
No, we've been able to do that since 1995, when the <script> tag was first invented. (Shortly after 1994, when we got the QR code.) But a QR code would definitely be audible as a series of quiet beeps, unless you exploited auditory masking.
Jerrrrry 149 days ago [-]
>But a QR code would definitely be audible as a series of quiet beeps, unless you exploited auditory masking.
The article we are discussing, and just a cursory google search, may interest you.
Very cool. Also related: the Riffusion project, a text-to-audio model fine-tuned from the Stable Diffusion image model! It also outputs spectrogram images (but more music-inclined).
This paper is like an advanced version of a YouTube video by Japhy Riddle that I watched recently: https://youtu.be/qeUAHHPt-LY?si=b21Op1nmi_o-M6Od
https://youtu.be/ywdRQ3zU6Uc?si=EZc6Pz8j6y1i83DJ
Also I believe they missed out on citing the pioneering work of Snares et al. 2001 [0]
[0] https://youtube.com/watch?v=BHup81lEjqo
Also thanks for the pointer to Venetian Snares! We had cited "Songs about my Cats" but it somehow got lost in the editing :/ We'll add the citation back in for the next version, and we've cited it on our website for now
My favorite example of hiding an image in a song was in Doom 2016: https://www.youtube.com/watch?v=yzFit0nldf4
https://antfu.me/posts/ai-qrcode