BIRD AI
JP Merz
with Raymond Finzel, a machine learning researcher at the University of Minnesota
and Maya Livio, a creative researcher and writer at the University of Colorado Boulder
Inspired by an essay on bird data, surveillance, and algorithmic bias written by Maya Livio, this project uses SampleRNN, a recurrent neural network, to create AI-generated audio that examines biases in bird song data. The final outcomes of this research are multi-part and multi-modal: 1) the essay itself, 2) a designed website which incorporates text from the essay with AI-generated sound, and 3) a composition for flute and AI-generated electronics.
Livio’s essay looks at the histories of collecting bird data, which often involved shooting the more colorful (male) birds out of the sky. Natural history collections are filled with drawers and drawers of stuffed, colorful, male birds. In her essay, Livio thinks through the implications of this history: because mostly male bird bodies were collected, much more is known about them, their bodies, and their behaviors. Less colorful, often brown, female birds benefit from avoiding detection and interest, and therefore from avoiding being shot, but conservation efforts in the face of the climate crisis may be less effective for them because their behaviors and bodies are less studied and understood. While shooting birds is no longer practiced as data collection, harmful practices around collecting data, and around how that data is implemented, have clear implications for the present day, a topic Livio’s essay further explores through the work of Ruha Benjamin and Safiya Noble.
Bird AI picks up on another thread of Livio’s essay: algorithmic bias in bird song data collection. Traditionally, bird song has been viewed as a primarily male trait, and female song has been thought of as rare and anomalous. Additionally, the way “song” has been defined is based on classical music biases about what counts as a melody or a pitch. This is reinforced by the use of bird song in classical music itself, which has largely drawn on the melodies of birds that most closely resemble human melodies. And so, over hundreds of years of ornithological research, female bird vocalizations have largely gone undocumented and unstudied, although several recent initiatives, such as http://femalebirdsong.org/ led by the Cornell Lab of Ornithology, have begun to reverse this trend. Female bird vocalizations can be incredibly complex and are often driven more by timbre (the sound itself) than by pitch or rhythm, as with typical “songbirds.”
The essay also addresses a recent mass die-off of birds in the American Southwest, which took place on September 9 and 10, 2020, in which thousands, likely tens of thousands, of birds suddenly died, primarily of starvation. It’s speculated that historic wildfires drove these birds from their typical food sources and that a sudden shift to cold temperatures worsened these conditions. As the climate crisis continues, events like this are likely to keep occurring. Conservation efforts increasingly rely on machine learning tools, and so it is with a great sense of urgency that algorithmic bias must be investigated.
To respond to these provocations, this project uses SampleRNN, a recurrent neural network (RNN) used for audio generation. Specifically, we used an iteration of SampleRNN created by Dadabots, a well-known music duo working with AI-generated sound. The simplest and clearest explanation of SampleRNN I could find comes from Dadabots’ own GitHub repository:
- Load a dataset of audio
- Train a model on that audio to predict "given what just happened, what comes next?"
- Generate new audio by iteratively choosing "what next comes" indefinitely
There are several types of neural network architectures; RNNs are best suited for interpreting and generating sequential information (“what comes next?”) and are often applied to tasks like speech recognition, text generation (e.g. a Google search auto-complete), or handwritten signature analysis. This sequential nature also lends itself to audio generation.
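To make that loop concrete, here is a minimal sketch of the “what comes next?” idea. It is written in PyTorch for readability (the project itself uses Dadabots’ Theano-based code), and all names and sizes here are illustrative, not theirs:

```python
import torch
import torch.nn as nn

class NextSampleRNN(nn.Module):
    """Predicts a distribution over the next 8-bit (256-level) audio sample."""
    def __init__(self, q_levels=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(q_levels, 64)  # quantized samples -> vectors
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.out = nn.Linear(hidden, q_levels)   # logits over "what comes next"

    def forward(self, samples, state=None):
        h, state = self.rnn(self.embed(samples), state)
        return self.out(h), state

# Generation: choose a next sample, feed it back in, repeat indefinitely.
model = NextSampleRNN()  # in practice, trained on a dataset of audio first
sample = torch.zeros(1, 1, dtype=torch.long)  # seed with silence
state, generated = None, []
for _ in range(16000):  # one second of audio at 16 kHz
    logits, state = model(sample, state)
    probs = torch.softmax(logits[:, -1], dim=-1)
    sample = torch.multinomial(probs, 1)  # sample "what comes next"
    generated.append(sample.item())
```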
SampleRNN’s main innovation for audio generation is its use of a hierarchical structure: neural networks are applied at different temporal layers, from each individual sample (a slice of digital time) up to longer durations of time (seconds, then minutes, etc.), allowing for more variation over greater time spans.
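As a rough illustration of that hierarchy, the two-tier case can be sketched like this, again in PyTorch with assumed sizes (a frame of 16 samples); this simplifies SampleRNN considerably and is not a reproduction of it:

```python
import torch
import torch.nn as nn

FRAME_SIZE, Q_LEVELS, DIM = 16, 256, 512  # assumed sizes for illustration

class TwoTierSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Slow tier: one RNN step per 16-sample frame (a coarser time scale).
        self.frame_rnn = nn.GRU(FRAME_SIZE, DIM, batch_first=True)
        # Fast tier: predicts each individual sample, conditioned on the
        # slow tier's summary of the audio so far.
        self.sample_net = nn.Sequential(
            nn.Linear(DIM + FRAME_SIZE, DIM),
            nn.ReLU(),
            nn.Linear(DIM, Q_LEVELS),  # logits over the next sample
        )

    def forward(self, frames, prev_samples):
        # frames: (batch, n_frames, FRAME_SIZE), past audio scaled to [-1, 1]
        # prev_samples: (batch, FRAME_SIZE), the immediately preceding samples
        summary, _ = self.frame_rnn(frames)  # coarse temporal context
        conditioning = summary[:, -1]        # the latest frame's summary
        return self.sample_net(torch.cat([conditioning, prev_samples], dim=-1))
```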
Dadabots’ implementation also adds several key features to the original SampleRNN algorithm:
- New scripts for different sample rates are available (16k, 32k). 32k audio sounds better, but the nets take longer to train, and they don't learn structure as well as 16k.
- Any processed datasets can be loaded into the two-tier network via arguments. This significantly speeds up the workflow without having to change code.
- Any number of RNN layers is now possible (until you run out of memory). This was significant to getting good results. The original limit was insufficient for music, we get good results with 5 layers.
When running SampleRNN, there are several “hyperparameters,” or arguments, which can be specified to alter the training and generation processes. Dadabots’ paper on their process goes into the technical details in much greater depth.
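As an illustration, the kinds of knobs involved look like this (the names follow the flags in the original SampleRNN repository as I understand them; the values are examples, not our settings):

```python
# Illustrative hyperparameters for a two-tier model; treat names and values
# as assumptions, not our actual configuration.
hyperparams = {
    "n_frames": 64,     # frames fed to the frame-level tier per training step
    "frame_size": 16,   # samples per frame, i.e. the hierarchy's granularity
    "dim": 1024,        # hidden units per RNN layer
    "n_rnn": 5,         # stacked RNN layers (see Dadabots' note above)
    "rnn_type": "GRU",  # GRU or LSTM cells
    "q_levels": 256,    # quantization levels (8-bit audio)
    "batch_size": 128,  # training sequences processed in parallel
}
```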
Our methods:
As a group, we decided to create three models based on three data sets.
Whole Bird data set:
The first data set we called the Whole Bird data set. To obtain this data set of recorded bird songs, I first used the Wildlife Health Information Sharing Partnership’s online database to find the species names of the birds which died in the mass die-off in the American Southwest. I then cross-referenced those species names with available audio recordings in the Macaulay Library online database. I accessed all of the file names and their metadata through Macaulay’s “Export” feature, creating a manageable .csv file of 4363 items. As Macaulay’s online request system happened to be down when I was seeking these files, I had to reach out to Macaulay through email. I was pleasantly surprised that they were willing to accommodate my very large request, although because of its size, they were only able to send .mp3 files rather than .wav. Since SampleRNN is capable of reading .wav or .mp3 files and the final output is fairly low-quality audio, this concession was fine, although using .wav files certainly could have resulted in different outcomes. For training this set, we used the 32k, two-tier version.
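The cross-referencing step itself is straightforward to script; a sketch in Python follows, with hypothetical file names and column headers (the real Macaulay export uses its own labels):

```python
import pandas as pd

# Species reported in the die-off records (compiled by hand from WHISPer).
die_off = set(pd.read_csv("die_off_species.csv")["Scientific Name"])

# Macaulay Library "Export" of candidate recordings and their metadata.
macaulay = pd.read_csv("macaulay_export.csv")

# Keep only recordings of species involved in the die-off.
matches = macaulay[macaulay["Scientific Name"].isin(die_off)]
matches.to_csv("whole_bird_dataset.csv", index=False)
print(f"{len(matches)} matching recordings")
```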
Unknown data set:
The second model we call the Unknown data set; it is a subset of the Whole Bird data set containing only the birds whose sex is not labeled in the metadata. Our thought process was to bring awareness to the binary essentialisms that databases and data collection often encode, and to complicate ideas around non-binaryness and intersex in nonhumans. Using the .csv, I sorted the items by sex and removed all of the unsexed birds from the list, leaving just the files I wanted to delete. Then, with a simple Terminal script, I deleted all of those unwanted files, creating a new folder of just the unsexed bird calls. For training this set, we used the 32k, two-tier version.
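A Python sketch of that filtering step follows (the actual work used a spreadsheet and a short shell script; the paths and column names here are hypothetical):

```python
import os
import pandas as pd

meta = pd.read_csv("whole_bird_dataset.csv")

# Recordings whose "Sex" field is filled in are the ones to remove,
# leaving only the unsexed bird calls behind.
sexed = meta[meta["Sex"].notna()]["ML Catalog Number"]

for catalog_number in sexed:
    path = os.path.join("unknown_dataset", f"{catalog_number}.mp3")
    if os.path.exists(path):
        os.remove(path)
```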
Classical Music data set:
Using the “Birds In Music” Wikipedia entry and my own familiarity with classical music, I identified some of the most well-known uses of bird song in classical music, from Baroque music (e.g. Handel and Vivaldi) to Modernist 20th-century works (e.g. Messiaen and Britten). This research confirmed my assumption that most bird song in classical music comes from male passerine birds, a very large phylogenetic order that includes species like nightingales and warblers, among many others. I then looked up YouTube performances of each piece and selected ones I deemed to be of decent recording quality, though I was not too discriminating. Using a YouTube-to-mp3 web service, I downloaded each selected file. For training this set, we used the 16k, two-tier version.
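For anyone reproducing this step without a web service, the downloads can also be scripted; here is a sketch using the yt-dlp library (not what I used, and the URL list is a placeholder):

```python
from yt_dlp import YoutubeDL

urls = ["https://www.youtube.com/watch?v=..."]  # the selected performances

options = {
    "format": "bestaudio/best",
    "outtmpl": "classical_dataset/%(title)s.%(ext)s",
    "postprocessors": [
        {"key": "FFmpegExtractAudio", "preferredcodec": "mp3"},
    ],
}
with YoutubeDL(options) as ydl:
    ydl.download(urls)
```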
Results:
The training process outputs 30 samples, each 30 seconds long, when it completes a training epoch. Here are a few selected samples from the different models; the examples were chosen based on my own sonic interest in them.
Whole Bird data set:
Unknown data set:
Classical Music data set:
Challenges:
Several challenges came up when working on this project.
- Updating the Dadabots implementation of SampleRNN that is hosted on their GitHub repository was the main technical challenge of this project. Many of the dependencies are no longer supported and only work with an older (and also unsupported) version of Python.
- Integrating SampleRNN with the USC CARC cluster also presented some difficulty. The main issues arose when compiling the Singularity container. Unusually, the container worked in CARC’s interactive session mode, but when the same container was submitted as a job using the Slurm system, the script froze and timed out, suggesting that there were global settings in the Discovery environment that might be affecting the compilation. The solution was to add some flags provided by the USC CARC support team.
- It’s also worth mentioning that there were several bureaucratic challenges and setbacks along the way. Getting Raymond (a non-USC researcher) access to the CARC system involved coordination across several offices. Dealing with a university system has its disadvantages in these ways. But the benefit of having access to this processing power for free (I briefly entertained using Google Cloud virtual machines; even a small job would have left me with a bill of ~$1000 USD, even with researcher discounts), and with the support of the incredibly helpful team of graduate students assigned to assist CARC users, was certainly worth these delays.
- I also want to mention that, with this project at least, once again the latest technology is only available to those with the resources and connections to access it. I am a White-presenting Armenian-American man and my technical collaborator is a White man. We both have strong ties to universities and access to the many resources that come along with that affiliation. However, even within the university, I don’t think many students realize that they have access to the resources at CARC. I wasn’t aware of it until Raymond mentioned that his university has such a system and that USC might as well. If universities want to promote greater diversity and equity in tech across all disciplines, they could begin by letting students know what resources are available to them. To CARC’s credit, the documentation on their site is incredibly thorough, and they offer a tremendous amount of support to researchers who may be less familiar with machine learning but wish to incorporate it into their projects.
Moving forward:
We are continuing to train the neural networks and generate new audio. As of March 3, 2021, the Whole Bird data set is still producing interesting and somewhat variable audio after 17 “epochs” of training. We are planning to train again from scratch with slightly adjusted hyperparameters to see if there are similarities or differences. The Unknown data set produced interesting results in epochs 3-7 but started to degrade at epoch 8. I’m planning to clean up that set of files using an algorithm that automatically removes silences, to see if the quality and variability can last longer. The Classical Music data set has produced interesting results after 10 epochs but seems to be getting stuck on a particular tonality and a particular violin timbre. We are planning to experiment with the hyperparameters to get greater sonic variety.
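A sketch of that silence-removal pass, using the pydub library (the thresholds are guesses to be tuned by ear, and the file name is hypothetical):

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_mp3("unknown_0001.mp3")

# Split wherever a pause is long and quiet enough to count as silence.
chunks = split_on_silence(
    audio,
    min_silence_len=500,  # pauses longer than 0.5 s are treated as silence
    silence_thresh=-40,   # anything below -40 dBFS counts as silent
)

# Re-join the non-silent chunks into one continuous file.
trimmed = sum(chunks, AudioSegment.empty())
trimmed.export("unknown_0001_trimmed.mp3", format="mp3")
```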