Master's final project

The extraction of musical instruments from bachata songs
using semantic segmentation for audio source separation

CNN / Semantic Segmentation / Meta-learning / Signal Processing / STFT / PyTorch

Could we teach a machine to separate the individual instruments within a bachata song (a genre of Latin music), given its typical instruments, polyrhythms and section structure? That is what we attempted to do in this project. To test our hypothesis and model idea, we first studied a much simpler problem using the MNIST dataset: extract a digit (e.g. 5) known to be present in a mixture of digits (e.g. 5, 1, 3 and 9 drawn on top of one another). After successfully solving this problem with high accuracy using a U-Net architecture and semantic segmentation, we transferred the same model structure to our original, more complex but structurally similar problem in the audio domain.

Our dataset was a large collection of 4- or 8-beat-long loops of individual instruments, played in particular sections of bachata songs (e.g. intro, chorus, verse, bridge, outro). The preprocessing and data-preparation phase included selecting, sorting and organising the right loops, followed by resampling and normalising the clips, shrinking/expanding them using the Fast Fourier Transform so that they all share a universal length and tempo (BPM) and can be used for machine learning.

The model had three input clips: a mixture, the ground truth (the clip actually present in the mixture), and the target (a clip of the same instrument as, but different from, the ground truth). Their spectrograms were produced with the STFT, rendered as heatmaps and fed into our U-Net-like CNN. Essentially, the goal was to find a target-like pattern in the mixture and compute the loss by comparing the prediction to the ground-truth pattern. Because the mixture, target and ground-truth clips were chosen randomly, we could build our input data on the fly.

Once the model performed well enough, we moved on to testing it on real bachata songs. Real songs needed careful preprocessing before they could be fed to the model: beat tracking to split a song into 4-beat clips, resampling, stretching/shrinking, and STFT plus heatmap rendering to produce the input spectrograms. The predicted clips were then stretched/shrunk back to their original tempo and concatenated to form the full-song-length extracted instrument.
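To make the loop preparation step concrete, here is a minimal sketch assuming librosa handles the resampling, the tempo normalisation (a phase-vocoder stretch, which is FFT based) and the STFT. The sample rate, FFT size and universal tempo are illustrative placeholders, not the project's actual settings.

```python
# Hedged sketch of the loop preparation: load, resample, tempo-normalise, STFT.
import librosa
import numpy as np

SR = 22050          # assumed working sample rate
TARGET_BPM = 130    # assumed universal tempo all loops are stretched to
N_FFT = 1024
HOP = 256

def loop_to_spectrogram(path: str, source_bpm: float) -> np.ndarray:
    """Load a 4- or 8-beat loop, stretch it to the universal tempo and
    return its magnitude spectrogram ("heatmap")."""
    y, _ = librosa.load(path, sr=SR)                  # load + resample
    rate = TARGET_BPM / source_bpm                    # >1 shrinks, <1 expands
    y = librosa.effects.time_stretch(y, rate=rate)    # phase-vocoder stretch
    S = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)  # complex STFT
    return np.abs(S)                                  # magnitude spectrogram
```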
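The U-Net-like CNN and its training objective could look roughly like the sketch below: the mixture and target (query) spectrograms are stacked as two input channels and the network predicts the instrument's spectrogram, with the loss computed against the ground truth. The channel counts, depth and L1 loss are assumptions for illustration, not the exact architecture used in the project.

```python
# Hedged sketch of a tiny U-Net-style network and one training step in PyTorch.
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = block(2, 16)           # input channels: [mixture, target query]
        self.enc2 = block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = block(32, 16)          # skip connection from enc1
        self.out = nn.Conv2d(16, 1, 1)     # predicted instrument spectrogram

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return torch.sigmoid(self.out(d1))

# One hypothetical training step: spectrograms are (batch, 1, freq, time),
# scaled to [0, 1]; the loss compares the prediction to the ground-truth clip.
model, loss_fn = TinyUNet(), nn.L1Loss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

mixture = torch.rand(4, 1, 128, 128)       # placeholder batch
target_query = torch.rand(4, 1, 128, 128)  # a different loop of the same instrument
ground_truth = torch.rand(4, 1, 128, 128)  # the loop actually present in the mixture

optimiser.zero_grad()
prediction = model(torch.cat([mixture, target_query], dim=1))
loss = loss_fn(prediction, ground_truth)
loss.backward()
optimiser.step()
```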
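The full-song inference pipeline described above could be sketched as follows, again assuming librosa for beat tracking and tempo handling; model_predict is a hypothetical placeholder for the trained U-Net, and the mixture phase is reused when inverting the predicted magnitudes.

```python
# Hedged sketch of inference on a real song: beat-track, split into 4-beat
# clips, stretch to the model tempo, predict, stretch back and concatenate.
import librosa
import numpy as np

def extract_instrument(song_path, model_predict, target_bpm=130, sr=22050):
    y, _ = librosa.load(song_path, sr=sr)
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr, units="samples")
    rate = target_bpm / float(tempo)                   # stretch factor to model tempo

    pieces = []
    for start, end in zip(beats[:-4:4], beats[4::4]):  # consecutive 4-beat clips
        clip = librosa.effects.time_stretch(y[start:end], rate=rate)
        S = librosa.stft(clip, n_fft=1024, hop_length=256)
        mag_pred = model_predict(np.abs(S))            # U-Net prediction (magnitude)
        clip_pred = librosa.istft(mag_pred * np.exp(1j * np.angle(S)),
                                  hop_length=256, length=len(clip))
        pieces.append(librosa.effects.time_stretch(clip_pred, rate=1 / rate))
    return np.concatenate(pieces)                      # full-song-length extraction
```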

Further improvements to be made: implement a GAN for sharper predictions, add residual blocks to the U-Net, train the model on original audio quality, examine the louder first beat in the audio clips, and estimate downbeat locations instead of plain beat locations to resolve issues on songs that have an intro.

RESULTS: NUMBER MODEL

RESULTS: AUDIO MODEL

[Audio examples 1-5: for each example, the mixture clip and the ground-truth vs. predicted extractions of the bongo, guira, bass and guitar.]
RESULTS ON BACHATA SONGS

[For each song: the original audio and the extracted/predicted bongo, guira, bass and guitar tracks, with a relative assessment of extraction quality.]

artist & song name                         | relative assessment
Prophex: Vanidosa                          | good
Lucho Panic: El Guardaespaldas             | good
Prince Royce: Te robare                    | fair
Kiko Rodriguez: Vagabundo, borracho & Loco | poor