@@ValerioVelardoTheSoundofAI Hello, thanks to you from me too. I'm working on an architecture that controls a VST and compares its output to sounds from songs or samples whose sound design mostly stems from synths. The one problem I still have to sort out is how to speed up the intermediate render step for each input attempt. If you know whether there are ways to parallelize the VST -> DSP -> render pipeline (as far as I can tell, that always runs on the CPU), I would be very thankful.
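To make it concrete, this is roughly the kind of parallelism I'm after (just a sketch; `render_patch` is a hypothetical stand-in for whatever hosts the plugin headlessly, e.g., Spotify's pedalboard can load VST3s from Python):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def render_patch(params):
    # Hypothetical stand-in: a real version would load the VST in an
    # offline host, apply `params`, and render audio. Dummy output here
    # because plugin hosting is host-specific.
    return np.zeros(44100, dtype=np.float32)

if __name__ == "__main__":
    # e.g., 32 candidate parameter sets to audition
    candidates = [{"cutoff": c} for c in np.linspace(0.0, 1.0, 32)]
    # each worker process hosts its own plugin instance, so the
    # offline renders run on separate CPU cores in parallel
    with ProcessPoolExecutor(max_workers=8) as pool:
        renders = list(pool.map(render_patch, candidates))
```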
You could perhaps have elaborated on the slide that presented different architectures such as GAN, AE, VAE, and VQ-VAE. I could look them up on the net to get an idea, but that slide was important for following what we were talking about, and it would have saved some distraction if you had spent one more minute on it. I guess a novice feels where the shoe pinches! I have done online courses on machine learning and audio signal processing, and I found your channel while going through deep learning. I appreciate the effort you have put into sharing so much in such a lucid manner.
Thank you so much for this series! You're helping so many people get introduced to DL for audio ... thank you. I was wondering about the data point count: why is the time-frequency representation more "compact"? If we take an FFT with a 512-sample window and no overlap (best-case scenario), we get 256 amplitude points and 256 phase points, i.e., 512 data points for every 512 samples. Isn't that the same number of data points per second?
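To make my counting explicit, here's a quick librosa check (assuming hop_length equal to the window, i.e., no overlap):

```python
import numpy as np
import librosa

# 1 second of (random) audio at 22050 Hz
y = np.random.randn(22050).astype(np.float32)

# 512-sample window, no overlap (hop_length == n_fft)
S = librosa.stft(y, n_fft=512, hop_length=512)

print(y.shape)  # (22050,)  -> 22050 real samples
print(S.shape)  # (257, 44) -> 257 complex bins per frame
# 257 complex values = 514 real numbers (magnitude + phase)
# per 512 input samples, so the raw STFT isn't more compact.
```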
Hello, can you please tell me how to show many librosa spectrograms in a matplotlib subplot structure? For example, if there are 7 different kinds of sound, I'd like a spectrogram of each sound, with all the spectrograms shown in a single figure like matplotlib subplots.
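In case it helps: librosa.display.specshow accepts a matplotlib Axes via its ax parameter, so something like this sketch should work (file names are hypothetical):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

files = [f"sound{i}.wav" for i in range(1, 8)]  # hypothetical paths

fig, axes = plt.subplots(nrows=7, ncols=1, figsize=(8, 14), sharex=True)

for path, ax in zip(files, axes):
    y, sr = librosa.load(path)
    # log-magnitude spectrogram of each sound
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log", ax=ax)
    ax.set_title(path)

plt.tight_layout()
plt.show()
```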
Great video as always, Valerio! Do you know where I could listen to an example piece of generated audio that demonstrates what a problematic phase reconstruction sounds like?
Not sure what you are looking for, but probably the most outstanding job was done by Aiva Technologies. There is a free tool to compose your own music and to become aware of the challenges you can meet: www.aiva.ai/. The problem with "reconstruction" is highly related to the deep neural network itself (mainly the architecture, hyperparameters, proper datasets, etc.). We also need to remember the complexity of the signal spectrum: a "complicated" spectrum decreases the generation quality (the neural network may not be capable of approximating our signal function). Good luck!
I have a doubt: when you said we would be using variational autoencoders, it means the model would be predicting the spectrogram as an image. I don't know how you would convert that back into a WAV file. If there are any articles related to this, do share. And yes, I love your videos so much; I learn a lot from them!
Thank you Harish! Once you generate a spectrogram with a VAE, you can convert it back to a waveform using the inverse short-time Fourier transform. I'll cover this topic in upcoming videos in the series.
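Roughly along these lines (a minimal sketch; since a VAE typically outputs only magnitudes, Griffin-Lim estimates the missing phase and then applies the inverse STFT internally):

```python
import numpy as np
import librosa
import soundfile as sf

# hypothetical decoder output: a log-magnitude spectrogram
# of shape (1 + n_fft // 2, n_frames), here with n_fft = 512
S_db = np.load("generated_spectrogram.npy")  # hypothetical file

# back from decibels to linear magnitude
S_mag = librosa.db_to_amplitude(S_db)

# iteratively estimate the phase, then invert the STFT
y = librosa.griffinlim(S_mag, n_iter=32, hop_length=256, win_length=512)

sf.write("generated.wav", y, 22050)  # assumed sample rate
```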
Can you please make a video on the generated audio? I have worked with deep convolutional GANs (DCGANs) for generating images from random noise, but I've never trained one on audio signals.
Speaking of autoencoders, do you think you could eventually talk about audio clustering using the embeddings? That would be really cool! I've also seen some interesting loss functions applied to the embedding to enforce small distances between two samples of the same class (or similar classes) and large distances between very different classes.
Just to add: there's a paper called "SCAN: Learning to Classify Images without Labels" that is SOTA for image clustering. They do something cool where they feed an image plus an augmented version of the same image into the network and use that to make the embedding distances close. I wonder if something similar could be done with audio augmentation!
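Something like this, maybe (a very rough PyTorch sketch of the "pull augmented pairs together" idea, not the actual SCAN code; encoder is any embedding network, and for audio x_aug could come from pitch shifting, time stretching, added noise, etc.):

```python
import torch
import torch.nn.functional as F

def pair_similarity_loss(encoder, x, x_aug):
    """Pull the embedding of each clip toward the embedding of its
    augmented version (simplified SimCLR/SCAN-style pretext loss)."""
    z1 = F.normalize(encoder(x), dim=1)      # (batch, dim), unit norm
    z2 = F.normalize(encoder(x_aug), dim=1)  # (batch, dim), unit norm
    # cosine similarity of matching pairs; maximizing it
    # means minimizing its negation
    return -(z1 * z2).sum(dim=1).mean()
```

In practice you'd also need negatives (as in SimCLR's NT-Xent loss), otherwise the encoder can collapse everything to a single point.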
@@Erosis thank you for the suggestion! I'm definitely going to cover clustering / music similarity with embeddings. Thank you for the reference too :) I wasn't aware of the SCAN paper.
@@hijonk9510 Yes. Both the original image and the augmented image enter a network with the same weights and structure. Keep in mind, SCAN has more steps after that (mainly regarding the clustering).