the year is wrapping up and i realized i hadn't done much in terms of AI. to come up with an article idea, i started thinking about what media type i wanted to work with : images, audio, video, text, links, ... decided to work with audio because my last couple of AI articles had dealt with images or video. then i started brainstorming about what problems AI could be used to solve regarding audio. a couple things bubbled to the top of the list :
from the /noReco and /aiSomPic articles, i already had an initial idea of how these problems might be solved. the basics being to create a feature array using the frequency of the song over time, and then use that to train a SOM (Self Organizing Map) neural network to cluster similar frequencies. some searches later, and i found an article that verified my initial guess : Self-Organizing Maps for Content-Based Music Clustering. this article will use similar techniques to try and solve the different problems listed above.
the first step was to figure out how to do the audio processing. from MS, i thought about DirectShow and Windows MediaPlayer. looked at the MediaPlayer APIs first. all i could find to get wave and frequency data was the Visualizer API. that wasn't going to work, because i need this data in an external application and not from a plugin hosted in MediaPlayer. DirectShow was next. the problem with DirectShow is that MS does not provide a managed wrapper. there are a number of 3rd party wrappers, but i haven't been all that happy with the couple i've used in the past. some more searches, and fmod.org turned up. it's a cross platform audio playback library. just so happens that it has a newer version called FMOD Ex that can be called from C#. downloaded it and it had a C# sample called 'spectrum'. this sample shows how an external application can use FMOD Ex to play an audio file and get the wave data to display an oscilloscope and the frequency data to display an equalizer. sweet! that is exactly what i need. the screen cap below shows the output from a modified version of that app displaying both the wave and frequency data.
looking around some more turned up the NOSOUND_NRT flag which keeps the track from being played back and allows for quick audio processing. turning that flag on lets my app grab the frequency values for a 5 minute song in about 30 seconds. the screen cap below shows an image representing the frequency bands (y) over time (x) for an mp3.
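as a rough illustration of what that extracted data looks like, here's a sketch in python/numpy rather than the C#/FMOD Ex app actually used here. it assumes decoded PCM samples are already in hand (FMOD does the decoding in the real app), and the band count and FFT windowing are generic choices, not FMOD's :

```python
import numpy as np

def frequency_bands(samples, rate, bands=64):
    """sample the frequency content once per second, returning a
    2D array of shape (seconds, bands) : bands (y) over time (x)"""
    seconds = len(samples) // rate
    spectra = []
    for s in range(seconds):
        window = samples[s * rate:(s + 1) * rate]
        mags = np.abs(np.fft.rfft(window))  # magnitude spectrum for this second
        # collapse the full spectrum down to a fixed number of bands
        spectra.append([c.mean() for c in np.array_split(mags, bands)])
    return np.array(spectra)

# a 3 second 440 Hz tone stands in for decoded mp3 data
rate = 8000
t = np.arange(3 * rate) / rate
pcm = np.sin(2 * np.pi * 440 * t)
spec = frequency_bands(pcm, rate)
print(spec.shape)  # (3, 64)
```

each row of the result is one vertical slice of the frequency image shown above.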
so FMOD Ex is awesome. its primary use is for playback, but its feature set is large enough to be used for audio processing. it also supports a lot of different audio formats, with mp3, wma, and wav being the ones needed for this project. finally, it's pretty quick. the only problem i had was with the Channel.setPosition() call. that would fail for some of the mp3 files that i used with it. amazingly, the lead developer responded in the forums, checked out my repro test code, and has made a fix for the next version ... did i say awesome! the current workaround is to call System.createStream with the FMOD.MODE.ACCURATETIME flag to force it to a valid mpeg boundary.
the next step was to use the extracted frequency data to train a SOM neural network to have it cluster the data. for an explanation about SOMs, you should read the /aiSomPic article. in short, a SOM is a technique for clustering similar data in a 2D space. e.g. the /aiSomPic article used a SOM to cluster similar images, and this article will use it to cluster similar music samples.
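for readers who want the gist in code, here's a minimal SOM trainer sketched in python/numpy. it's illustrative only ... the grid size, decay schedule, and gaussian neighborhood are generic textbook choices, not the exact settings used in this article :

```python
import numpy as np

def train_som(data, width, height, epochs=100, seed=0):
    """train a tiny Self-Organizing Map; returns the weight grid
    of shape (height, width, features)"""
    rng = np.random.default_rng(seed)
    weights = rng.random((height, width, data.shape[1]))
    coords = np.dstack(np.meshgrid(np.arange(width), np.arange(height)))
    for epoch in range(epochs):
        # learning rate and neighborhood radius both decay over time
        frac = 1.0 - epoch / epochs
        lr = 0.5 * frac
        radius = max(width, height) / 2.0 * frac + 0.5
        for v in data:
            # best matching unit : the cell whose weights are closest
            dists = np.linalg.norm(weights - v, axis=2)
            by, bx = np.unravel_index(dists.argmin(), dists.shape)
            # pull the BMU and its neighbors toward the input vector
            grid_d = np.linalg.norm(coords - np.array([bx, by]), axis=2)
            influence = np.exp(-(grid_d ** 2) / (2 * radius ** 2))
            weights += lr * influence[..., None] * (v - weights)
    return weights

def best_cell(weights, v):
    """map a feature vector to its winning (row, col) cell"""
    dists = np.linalg.norm(weights - v, axis=2)
    return np.unravel_index(dists.argmin(), dists.shape)

# two clearly different clusters of fake feature vectors
rng = np.random.default_rng(1)
low = rng.normal(0.1, 0.02, size=(5, 4))
high = rng.normal(0.9, 0.02, size=(5, 4))
som = train_som(np.vstack([low, high]), width=4, height=4)
```

after training, vectors from the two clusters land on different cells of the 4x4 grid, which is the clustering behavior the rest of this article leans on.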
for each specific problem, the frequency data had to be processed differently and the SOM had to be trained differently.
the first problem i have is how to find duplicate songs in my media library. by duplicate, i mean :
a lot of these problems can be solved by using file system data (name and size) and ID3 tags. the problem is that i don't trust tagged data. tags could be misspelled, have extra spaces (or no spaces), be represented differently (special characters, underscores vs hyphens), be missing, or just be plain wrong. what i trust is the actual audio content itself.
for this test, i used a random sample of about 50 mp3 tracks from various artists. it's my collection, so the genres are mostly techno and numetal. for each song, the frequency data was sampled once every second for the entire song. that 2 dimensional array was then broken up into quadrants. the pic below shows 4x4 quadrants, but i actually used 25x25 (with overlap).
the values of each quadrant were summed and normalized. the resulting value from each quadrant made up a feature vector of length 625. a 10x10 SOM was then trained with these 50 feature vectors. the next step was to see if it could find duplicates. got on eMule and downloaded a bunch of different versions of Korn's song 'It's On'. fed it an mp3, and the screen shot below shows that it matched the same Korn song which i had ripped from CD (yes, i really do own it). further tests showed that it did match variable bit rate vs fixed bit rate, mp3 vs wma, different fixed bit rates, and different codecs. off to a good start ...
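the quadrant summing reduces every song, regardless of length, to the same fixed-length vector. a python/numpy sketch of that step (without the overlap mentioned above, which the real run used) :

```python
import numpy as np

def quadrant_features(spec, time_cells=25, band_cells=25):
    """break a (time x bands) frequency image into a grid of quadrants,
    sum each quadrant, and normalize the sums into one feature vector"""
    sums = []
    for tchunk in np.array_split(spec, time_cells, axis=0):       # time axis
        for bchunk in np.array_split(tchunk, band_cells, axis=1):  # band axis
            sums.append(bchunk.sum())
    v = np.array(sums)
    n = np.linalg.norm(v)
    return v / n if n else v  # unit length, so songs are comparable

# 5 minutes sampled once per second, 64 frequency bands
spec = np.random.default_rng(1).random((300, 64))
fv = quadrant_features(spec)
print(fv.shape)  # (625,)
```

np.array_split tolerates uneven divisions, which is what makes songs of different lengths collapse to the same 625 features.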
the test above was to find a duplicate song, but the power of SOMs is that they can group similar items. since i listen to techno music, i have a lot of remixes from different artists for a single song. so how can i group those remixes? didn't have a remix for the Korn song, but i did have a demo version. the demo song is practically a remix. it's pretty much the same song, but just a little different. the intro is different, it sounds slightly different throughout the whole song, and the length of the song is not the same. when i fed this into the already trained SOM, it was just 1 quadrant off. while the original song was found in 2:9, the demo was placed in 2:8 (where no other song is). for kicks, i shrunk the SOM down to 7x7, since 10x10 had a bunch of empty quadrants. rerunning the same test, the demo was matched to the original song.
for the next step i tried a live recording of the song. the first problem is that the live recording had an MC intro. to get past this i just chopped off the first 15 seconds and last 15 seconds of each track before it's processed. when running it through the SOM, the results weren't so good. it was matched 4 quadrants away (10x10 SOM) from where it should be placed. listening to the live version revealed some obvious differences. first, the quality was pretty low. i don't think this is the problem, because i'm already using FMOD to reduce the quality of the songs when being processed. also, low quality duplicates were matched above. second, the length of the song is different. it's about a minute shorter than the original. this might be the problem, but the way quadrants are processed reduces the time difference. third, the audience is cheering throughout the whole song. my guess is that this is the main problem, the reason being that i treat the higher sounds with more significance when training the SOM. the results from this test were that demos and mixes that weren't radically different from the original would get clustered nearby, but mixes that were radically different and live versions would not.
listening to techno, i have a bunch of 60+ minute mp3 mixes. or as my friends joke about ... one really long song that all sounds the same to them :) these are either mix albums or live DJ sets. the problem is jumping from one song to the next ... because it's mixed. usually, i just skip ahead some X number of minutes and then adjust from there, but i'd rather have a system that could help me make a better skip. this is really bad in my car because my car mp3 player's fast forward is really slow. if it could figure out the different tracks, then i could chop the single mp3 file up into multiple tracks and be able to skip around the mix in my car.
for this, i had to process the songs differently. instead of creating a single feature vector for each file, i took multiple 5 second samples throughout a track and created a feature vector for each sample. then the SOM was trained on all the samples from that one song. the end result being that similar samples were clustered together, giving me a delineation of the entire mix. selecting a different cell in the SOM would jump to a different position in the mix. then i did some more processing on the resulting SOM. all it did was go through each node of the SOM (and its neighbors) to determine the start and end positions of possible tracks. it could have used some more work because some of the ranges overlapped, but jumping around the starting points showed that it picked what seemed like reasonable positions to my untrained ear. it definitely worked better than just chopping the song up every 3 to 5 minutes.
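that post-processing step can be sketched without the SOM itself : given the cell that each 5 second sample mapped to, merge runs of samples landing on the same (or a neighboring) cell into candidate tracks. the function below is a hypothetical reconstruction of the idea, not the code actually used, and unlike the real run it produces non-overlapping ranges :

```python
def track_ranges(cells, sample_len=5, min_len=60):
    """merge runs of samples that landed on the same (or an adjacent)
    SOM cell into candidate tracks, returned as (start, end) seconds"""
    def adjacent(a, b):
        # neighboring cells count as the same cluster
        return max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= 1
    ranges, start = [], 0
    for i in range(1, len(cells) + 1):
        if i == len(cells) or not adjacent(cells[i], cells[i - 1]):
            if (i - start) * sample_len >= min_len:  # ignore tiny runs
                ranges.append((start * sample_len, i * sample_len))
            start = i
    return ranges

# two stable runs of 30 samples each -> two candidate tracks
cells = [(0, 0)] * 30 + [(3, 3)] * 30
print(track_ranges(cells))  # [(0, 150), (150, 300)]
```

the start of each range is a skip point, which is exactly what my car mp3 player needs.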
the final problem is now that i have possible tracks from the mix, how can i use that to find the songs that were used in the mix. so the next attempt was to try and find a song from a short sample, like the service shazam.com does. NOTE this is different than finding duplicates above, because i'm only using about 15 seconds of audio data to find the original track. the first step was to process multiple samples from multiple tracks. the samples taken were 15 seconds long and it took samples from half of every song. then the SOM was trained.
for input i just used the tracks downloaded from eMule. after selecting a file, it would randomly skip to some position in the file and grab a 15 second sample for processing. then that would be input into the SOM. the first attempt worked (pic below) and it found the song from the sample. when i fed it the same input song again, it randomly picked another sample point and tried again ... but returned the wrong song. tried again ... and it worked. overall, it gave about a 66% success rate for the different songs i tested with. not very good for a test library of only 50 songs. this would need a lot more work to actually make it usable. so i have much respect for shazam.com being able to accomplish this with a music library of 1.6 million songs from telephony quality recordings in noisy environments.
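stripped down, the lookup amounts to labeling each SOM cell with the songs whose training samples landed there, then returning the labels on the query sample's best matching cell. a python/numpy sketch, with a hand-built 2x2 weight grid and made-up song labels standing in for a trained SOM :

```python
import numpy as np
from collections import Counter

def build_cell_labels(weights, samples, labels):
    """record which song(s) landed on each SOM cell during training"""
    cell_labels = {}
    for v, song in zip(samples, labels):
        d = np.linalg.norm(weights - v, axis=2)
        cell = np.unravel_index(d.argmin(), d.shape)
        cell_labels.setdefault(cell, []).append(song)
    return cell_labels

def find_song(weights, cell_labels, query):
    """return the most common song label on the query's best cell"""
    d = np.linalg.norm(weights - query, axis=2)
    cell = np.unravel_index(d.argmin(), d.shape)
    if cell not in cell_labels:
        return None  # the sample fell on a cell no training sample hit
    return Counter(cell_labels[cell]).most_common(1)[0][0]

# hand-built 2x2x2 weight grid standing in for a trained SOM
weights = np.array([[[0., 0.], [0., 1.]],
                    [[1., 0.], [1., 1.]]])
table = build_cell_labels(weights, [[0., 0.], [1., 1.]], ['korn', 'demo'])
print(find_song(weights, table, np.array([0.1, 0.05])))  # korn
```

the misses described above correspond to a query sample winning a cell whose labels belong to a different song (or to no song at all).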
did not actually implement this, but the Content-Based Music Clustering article (linked above) used these techniques a little differently. it used music samples from each music track (like 'Find song from a sample' above), but then it went one step further. it built another SOM on top of those results. so the feature vector for the 2nd SOM were made up of where the samples for a song were located on the 1st SOM. when trained, the 2nd SOM could be used to find similar music tracks.
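one plausible encoding for that 2nd-level feature vector (a guess at the paper's approach, not something implemented here) is a normalized hit map : count where each song's samples land on the 1st SOM and flatten the grid, so songs whose samples spread over the same cells end up with similar vectors :

```python
import numpy as np

def song_level_vector(weights, sample_vectors):
    """2nd-level feature vector for one song : a flattened, normalized
    hit map of where its samples landed on the 1st SOM"""
    h, w, _ = weights.shape
    hits = np.zeros((h, w))
    for v in sample_vectors:
        d = np.linalg.norm(weights - v, axis=2)
        hits[np.unravel_index(d.argmin(), d.shape)] += 1
    return hits.flatten() / len(sample_vectors)

# hand-built 2x2x2 weight grid standing in for a trained 1st SOM
weights = np.array([[[0., 0.], [0., 1.]],
                    [[1., 0.], [1., 1.]]])
vec = song_level_vector(weights, [[0., 0.], [0.05, 0.02], [1., 1.]])
# two thirds of the samples hit cell (0,0), one third hit cell (1,1)
print(vec)
```

the 2nd SOM would then be trained on one such vector per song to cluster similar music tracks.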
this article showed how a SOM trained with music frequency data could be used to find duplicate songs, cluster remixes, and split DJ mix sets into multiple tracks. this solves some of the problems that i have with my current media library. it showed some success for finding a song from a sample, but the current processing is returning too many false positives to make it usable.
this sort of content based searching needs to start making it into more applications, such as the current applications that find duplicate mp3s based off of ID3 tags. i would be much more trusting of an application that looked at both the ID3 tags and the content. i would also like it if media players could fast forward to significant points within a song. how many times are you listening to a song and you really just want to fast forward to the chorus? this is a similar problem to what i have with 70 minute mixes. i don't want to skip to the next 70 minute mix, or fast forward, i just want to jump to the next significant point within the same track. something between fast forwarding and skipping to the next track.
also, i'd like for .NET proper to better support audio processing applications. FMOD Ex worked great, but since MS keeps hyping Vista with its Media capabilities, they should really give us APIs to do more interesting media applications. still waiting for a System.ArtificialIntelligence namespace too!
not making the source available
if i were to update this i would change the SOM into a GHSOM (Growing Hierarchical SOM) so that it would scale better. i would also break it out into individual applications for finding duplicates and splitting mixes.
still need to look at WWF. later