Machine Learning for Sound Recognition

A project of the Canadian Centre for Ethnomusicology

Introduction to Deep Learning for Sound Recognition

How do we recognize sound? How do we identify sound's many components and attributes? How do animals do this? Can an algorithm carry out these tasks?

Given a recording (whether of musical, linguistic, or environmental sounds), how do we extract sonic features (acoustic or psychoacoustic: pitch, meter, emotion), classify types (genres, styles, dialects, species, cultures), segment units (phonemes, notes, songs), and identify particular sources (speakers, singers, instruments, composers)? The general problem is complex, due to the effects of polyphony, especially when visual and other sensory information is lacking.

Some recordings capture a single sound source (one instrument, speaker, or bird); others gather multiple but coordinated sources (a musical ensemble or a conversation). Such recordings, however, often represent artificial situations. More typically, recordings from ethnomusicological, linguistic, or bioacoustic fieldwork mix together a range of uncoordinated sound sources. The result is a total soundscape combining music, speech, and environmental sound in complex ways: music from multiple groups performing simultaneously, many speakers talking at once, or multiple environmental sound sources. Recordings of such soundscapes layer “signals” (sounds of research interest) with “noise” (unwanted sound, including anthropogenic sounds of crowds, highways, and factories; biogenic sounds of animals and plants; environmental sounds of rain, wind, or thunder; and sounds introduced by the recording process itself).

Compared with the analogous challenge of recognizing components of visual “recordings” (photographs), our ability to recognize features of complex sound environments in audio recordings remains rather mysterious. More complex still are the psychoacoustic and cognitive processes by which we infer emotions conveyed by sounds, particularly in speech and music.

In contrast to an earlier era of “small data” (largely the result of the limited capacity of expensive analog recorders), the advent of inexpensive, portable digital recording devices of enormous capacity, combined with a growing interest in sound across the humanities, social sciences, and sciences, has produced vast collections of sound recordings and brought sound into the realm of “big data.” To date, however, most of these collections are not annotated and are therefore, for practical purposes, inaccessible for research.

Computational recognition of sound, its types, sources, attributes, and components (what may be called "machine audition," by analogy to the better-developed field of "machine vision") is therefore crucial for a wide array of fields, including ethnomusicology, music studies, sound studies, linguistics (especially phonetics), media studies, library and information science, and bioacoustics, in order to enable indexing, searching, retrieval, and regression of audio information. While expert human listeners may be able to recognize certain complex sound environments with ease, the process is slow: they listen in real time, and they must be trained to hear sonic events contrapuntally. Sound recognition algorithms are thus of great potential value as research tools.

To the extent that such algorithms resemble biological processes of sound recognition, in humans or other species, they may also cast light on the nature of hearing, contributing to the study of perception, psychoacoustics, and auditory cognition.

This project explores the use of artificial intelligence to address these problems. Specifically, we apply machine learning (primarily deep learning neural networks trained on large datasets) to develop sound recognition algorithms. Such algorithms enable labelling of digital repositories, supporting interdisciplinary research on sound and potentially contributing to our understanding of auditory perception and cognition. More theoretically, this research may help develop machine audition and machine learning.

Deep Learning for Sound Recognition (DLSR) sub-projects


DL for Instrument and Polyphony Recognition

DL for Recognition of Musical Provenance

DL for Cantometrics Code Identification

DL for Drone-based Music Therapies

DL for Track Segment Recognition


DL for Speech Segment Recognition

DL for English Speech Accent Recognition


DL for Bird Species Recognition

Team Members

Principal Investigator: Michael Frishkopf, Professor of Ethnomusicology, Department of Music
Antti Arppe, Assistant Professor of Quantitative Linguistics
Erin Bayne, Professor, Department of Biological Sciences
Vadim Bulitko, Associate Professor, Department of Computing Science
Astrid Ensslin, Professor of Media and Digital Communication
Abram Hindle, Assistant Professor, Department of Computing Science
Mary Ingraham, Professor of Musicology, Director, Sound Studies Initiative, Department of Music
Sean Luyk, Music Librarian and Service Manager of ERA Audio + Video, University of Alberta Libraries
Scott Smallwood, Associate Professor of Music Composition, Department of Music
Benjamin V. Tucker, Associate Professor of Phonetics, Department of Linguistics


Ichiro Fujinaga, Associate Professor in Music Technology, Schulich School of Music, McGill University
George Tzanetakis, Associate Professor, Department of Computer Science, University of Victoria (developer of Marsyas)
Anna Lomax Wood, President and Director of Research for the Association for Cultural Equity, Hunter College, NYC
Michael Cohen, Professor of Computer Science, University of Aizu, Aizu-Wakamatsu, Japan.
Diane Thram, Professor Emerita, Music Department, Rhodes University, South Africa
Philippe Collard, André Lapointe, Frédéric Osterrath, & Gilles Boulianne, Centre de recherche informatique de Montréal (CRIM)


Sergio Poo Hernandez, PhD student in Computing Science
Matthew Kelley, PhD student in Linguistics
Rameel Sethi, MA student in Computing Science
Noah Weninger, Undergraduate Research Assistant, Computing Science
Shelby Carleton, Undergraduate Research Assistant, MLCS
Yourui Guo, Undergraduate Research Assistant, Computing Science

Funding Support (U of A)

KIAS Team Grant 2016
KIAS Cluster Grant 2017
Canadian Centre for Ethnomusicology
Hindle/Bulitko Computing Science Labs
Bioacoustic Unit (Biological Sciences)
Alberta Phonetics Laboratory (Linguistics)
Alberta Language Technology Lab (Linguistics)
University of Alberta Research Experience (UARE)

Funding Support (Other)

NVIDIA Corporation
Spatial Media Laboratory, University of Aizu, Japan
Compute Canada
Centre de recherche informatique de Montréal

Publications and Presentations


Kelley, Matthew C. and Benjamin V. Tucker. A comparison of input types to a deep neural network-based forced aligner. Accepted for Interspeech 2018, Sep 2-6, Hyderabad, India.

Ensslin, Astrid, Tejasvi Goorimoorthee, Shelby Carleton, Vadim Bulitko, and Sergio Poo Hernandez (2017), “Deep Learning for Speech Accent Detection in Videogames,” ed. Mike Cook et al., Proceedings of AIIDE / EXAG (Experimental AI in Games) 4, Oct 5-9th 2017, University of Utah.

Music and Ethnomusicology:

Jayarathne, Isuru, Michael Cohen, Michael Frishkopf, and Gregory Mulyk. 2019. “Relaxation ‘Sweet Spot’ Exploration in Pantophonic Musical Soundscape Using Reinforcement Learning.” In Proceedings of the 24th International Conference on Intelligent User Interfaces: Companion, 55–56. IUI ’19. New York, NY, USA: ACM.

Michael Frishkopf, Yourui Guo, Noah Weninger, Matthew Kelley, Sergio Hernandez, Vadim Bulitko. Deep Learning for Sound Recognition. Peer reviewed roundtable accepted for the 2018 Annual Meeting of the Society for Ethnomusicology, Albuquerque.

Rameel Sethi, Noah Weninger, Abram Hindle, Vadim Bulitko, Michael Frishkopf (2018), "Training Deep Convolutional Networks with Unlimited Synthesis of Musical Examples for Multiple Instrument Recognition." Sound & Music Computing, July 2018.

Frishkopf, Michael, with research from Sergio Hernandez, supervised by Vadim Bulitko. Towards an Extensible Global Jukebox: Deep Learning for Cantometrics Coding; for panel, "The Global Jukebox: Science, Humanism and Cultural Equity." Chair: Anna Wood, Association for Cultural Equity. Society for Ethnomusicology annual meeting, Denver, 2017. Presented again at the Eighteenth International Symposium on Spatial Media, Aizu-Wakamatsu, Japan, March 3-4, 2018, with additional research results from Yourui Guo.

A Framework for Synthesis of Musical Training Examples for Polyphonic Instrument Recognition. M.Sc. thesis, Rameel Sethi, Sep 2018. Supervised by A. Hindle and V. Bulitko.


Knight, E. C., K. C. Hannah, G. Foley, C. Scott, R. Mark Brigham, and E. Bayne. 2017. Recommendations for acoustic recognition. Avian Conservation and Ecology 12(2):14.

Shonfield, J., and E. M. Bayne. 2017. Autonomous recording units in avian ecological research: current use and future applications. Avian Conservation and Ecology 12(1):14.

Yip, D. A., L. Leston, E. M. Bayne, P. Sólymos, and A. Grover. 2017. Experimentally derived detection distances from audio recordings and human observers enable integrated analysis of point count data. Avian Conservation and Ecology 12(1):11.

Use of an Acoustic Location System to Understand Songbird Response to Vegetation Regeneration on Reclaimed Wellsites in the Boreal Forest of Alberta. M.Sc. thesis, S. Wilson, Sep 2017. Supervised by E. Bayne.