Multi-Instrument Detection in Polyphonic Music with Cultural Instruments

Abstract

The study adapts several machine-learning and deep-learning architectures to recognize 63 traditional instruments in weakly labelled, polyphonic audio synthesized from the proprietary Sound Infusion collection. Ten thousand 5s clips were algorithmically generated, features such as Mel-spectrograms, MFCCs, and VGGish embeddings were extracted, and six models were evaluated. The re-implemented Han et al. Convolutional Neural Network (CNN) attained the best result (micro F1 = 0.55; macro F1 = 0.50), approaching published performance on mainstream Instrument Recognition in Musical Audio Signals (IRMAS) data. Results highlight data scarcity and class imbalance as key obstacles for culturally diverse MIR.

Share and Cite:

Meas, S. and Moieni, R. (2025) Multi-Instrument Detection in Polyphonic Music with Cultural Instruments. Open Journal of Social Sciences, 13, 278-310. doi: 10.4236/jss.2025.139017.

1. Introduction

In recent years, music has become more accessible than ever for the 67.9% of the human population with internet access. With advances in technology, music can be digitalised and accessed online without the need to visit a local store to purchase physical copies such as CDs or DVDs. This convenient access is not universal, however, since roughly 32% of the world still does not have access to the internet (Destatis, 2025). For those who are connected, it brings easy exposure to many kinds of music, created with both well-known instruments and cultural instruments that are unique to each culture and its history. Despite the growing popularity of and excitement for cultural instruments, such as the Chinese hulusi, these instruments do not enjoy the same level of representation as the instruments commonly used in popular music. Even when the opportunity arises to hear an under-represented instrument, it is often not as clearly distinguishable to listeners as common European instruments, such as the piano or violin. The ability to distinguish the sounds of many traditional instruments is diminishing as the number of practitioners dwindles (Vaiedelich & Fritz, 2017).

The human ear is capable of recognizing and identifying musical instruments in music, but the more instruments that form part of a composition, the more challenging this becomes. There is an abundance of research dedicated to tackling the problem of distinguishing musical instruments in musical compositions in the field of Musical Information Retrieval (MIR). However, most of the success is on single note or isolated note instrument recordings. In recent years, there has been an increase in the number of papers addressing the problem of multi-instrument detection, which all relied on datasets of better-known instruments, such as the piano or the guitar.

There are only a few available algorithms that can detect and identify musical instruments using a large dataset. Kailewang and Moieni (2022) proposed a method that can detect a single instrument with high precision when not combined with other instruments or vocals. Their objective was to introduce this innovative technology to the market, aiming to address an increasing demand for learning about such instruments while promoting and protecting cultural diversity.

Section 2 will discuss previous research achievements for single and ‘predominant’ instrument detection and the lack of multi-instrument detection work covering cultural as well as modern instruments (piano, violin, etc.). Data collection, data generation, and data preprocessing are described in Section 3, followed by the model architectures in Section 4. Section 5 gives an overview and discussion of each model’s performance. The paper concludes in Section 6, followed by areas of improvement in Section 7.

2. History of Study

The growth of technology has been remarkable, with calculators that help solve difficult mathematical problems in a matter of seconds, computer vision software capable of recognizing patterns at a level that the human eye cannot see, and programs that can identify the title of a piece of music from a few seconds of audio. Yet, these innovations in audiovisual recognition software have overlooked instruments from diverse cultural backgrounds. Cultural instruments often mean a lack of data. Sometimes the practice of these instruments is lost in time and only scant recordings remain. This emphasizes the important role of instrument detection software that can recognize these instruments for the protection and promotion of cultural diversity.

Musical instruments all have many similarities, especially when we compare the guitar with the ukulele, or the cello with the violin. In the field of computer vision, there are countless research papers dedicated to developing object detection software that can identify and recognise musical instruments from images. Dewi, Chen, and Christanto (2023) used the YOLO (You Only Look Once) models, such as YOLOv5 and YOLOv7, to identify musical instruments and achieved an average accuracy of around 85%. These are advanced object detection models that have been popularised in recent years due to their speed and accuracy. The YOLO models are Convolutional Neural Network (CNN) based models that process an entire image in a single pass, which makes them computationally efficient in object detection tasks (Kundu, 2023). Human eyes can identify and recognise similar objects just from a glance; however, this technology can be used to help aspiring musicians identify the differences between the instruments and learn, and it also benefits individuals with visual impairment.

Shazam is one of the most popular applications for music recognition. Its main function is to identify music in the user’s environment while minimizing false positives. Shazam’s database contains approximately 1.8 million songs and tracks, which it uses for music recognition (Wang, 2003). The performance of Shazam is dependent on the signal-to-noise ratio and the length of the audio sample, but it is able to achieve close to perfection when the two variables are optimal. The software is currently widely used by iPhone users; the Shazam app is integrated with the phone and can be activated at any time for music identification.

In the case of instrument detection, there are no real-world applications comparable to Shazam that can determine what instruments are present in an audio sample. Even so, MIR is a growing field of research. MIR involves processing audio content to understand or categorise music (Kaminskas & Ricci, 2012). It involves extracting features like individual notes from the instruments in music, which can then be used to identify music genres or analyse the styles of a particular artist. Numerous researchers in the past were mainly focused on musical instrument detection on isolated or solo recordings of a particular instrument.

Single instrument detection has achieved significant success, as evidenced by Li, Qian, and Wang (2015), where they began to incorporate deep learning models such as CNNs in place of traditional machine learning methods. CNNs are deep learning models that are especially good at processing and analyzing spatial patterns, such as images or audio spectrograms, like those that previous researchers have used in MIR. Success in identifying musical instruments using CNNs over traditional methods such as random forest or logistic regression has shown improvement across the board, except for precision. CNNs process an image by using filters to detect patterns in the image, such as an instrument’s spectrogram, enabling the model to learn about abstract features that cannot be discerned by the human eye. Li, Qian, and Wang’s (2015) CNN trained on audio was able to achieve a macro average F1 score of 0.6433 and an accuracy of 82.74%, in comparison to their traditional machine learning method, random forest, which was only able to achieve an F1 score of 0.4471 and 82.13% accuracy.

Traditional machine learning methods have performed exceptionally well, mainly the XGBoost model proposed by Liu et al. (2022), which addresses the single instrument detection problem and achieved a result of 97.65%, outperforming every other classification model they tested, including support vector machine (SVM), logistic regression, etc. They achieved this by incorporating various features such as the root mean square energy, zero crossing rate, spectral bandwidth, harmonic signals, spectral rolloff, spectral centroid, and Mel-frequency cepstral coefficients (MFCC).

Recently, higher quality data has become more publicly available. Datasets curated for ‘predominant’ instrument detection aim to detect the main instrument in a music recording, while ‘multiple instrument detection’ aims to detect all instruments in a particular recording. These popular datasets are used in conjunction with ever more powerful deep learning models, such as CNNs and Recurrent Neural Networks (RNNs), for improved performance. A paper by Hing and Settle (2021) utilises the IRMAS dataset, where a CNN model architecture is constructed using transfer learning; they were able to achieve a high F1 score of 0.86 when classifying voice audio, whereas the performance for classifying instruments was only between 0.47 and 0.69.

Another paper by Han et al. (2017) also addresses the problem of ‘predominant’ instrument detection; however, they constructed their CNN model to detect multiple ‘predominant’ instruments. This model proposed by Han et al. (2017) was regarded as the state-of-the-art model, able to achieve a result of 0.50 in macro and 0.65 in micro F1 score. This model was then used by multiple papers as a baseline for further improvements.

Reghunath and Rajan (2021) made use of the IRMAS dataset and found promising results with CNN and RNN models, as well as with two CNN variants that append LSTM and GRU layers to the end of the architecture (C-LSTM and C-GRU). The CRNN variants performed best, outperforming the standalone CNN and RNN architectures in ‘predominant’ instrument recognition. Their results show a significant improvement over the state-of-the-art model by Han et al. (2017), with up to a 9.09% improvement in macro F1 score and a 7.81% improvement in micro F1 score.

A recent paper by Zhong et al. (2023) tackled the ‘predominant’ instrument detection problem by approaching it from a different perspective. According to Zhong et al. (2023), their instrument recognition model contains a learnable front-end layer and a CNN-based feature extraction layer followed by a learnable pooling layer. Instead of the traditional method of providing the audio recording for feature extraction, Zhong et al. (2023) augmented their monophonic data and used it to pre-train their model, then finetuned it for ‘predominant’ instrument recognition with multi-instrument recordings. Zhong et al. (2023) achieved up to 0.674 in micro F1 score and 0.584 in macro F1 score. However, the focus of the authors’ research was on datasets containing common instruments such as piano, violin, etc.

It is evident that the single instrument detection and ‘predominant’ instrument detection problems have been addressed by models proposed in previous research with good performance. However, multi-instrument detection remains difficult. A paper by Lei (2022) addressed this problem by proposing a two-level classification model based on CNNs, but instead of extracting features such as Mel spectrograms and MFCCs as in previous papers, they use the Constant Q Transform (CQT) matrix as input. The first classification level takes this feature as input and produces a rough classification, which is then passed on to the second level. The final classification level contains residual networks of the same architecture trained to identify instruments. Lei (2022) used a combination of well-annotated datasets, consisting of MedleyDB, MIXING SECRETS, and Bach10, to create variety in the training data. As a result, they were able to achieve an average accuracy of 85% across all instruments.

Chen et al. (2024) addressed the multi-instrument detection problem by constructing a binary classifier for each instrument with a CNN configuration. Their binary classifier is trained to identify whether or not the given sample contains the specified instrument. Chen et al. (2024) utilised a One-vs-All (OvA) model by using CNN to extract features from the input spectrograms. Chen et al. (2024) model’s performance was measured using accuracy, and their model was capable of achieving an accuracy of up to 53% with a 6-instrument recording dataset. For duo instrument accuracy, they achieved up to 77%, followed by 71% for trio instrument recordings, 62% for 4-instrument recordings, and 58% for 5-instrument recordings.

The success and attention that MIR has received in recent years in musical instrument detection rests on its ability to detect well-known instruments like the piano, guitar, and violin. Cultural instruments are missing due to limited access around the world and insufficient data, particularly strongly labelled data. For instance, Lei (2022) combined the available datasets into one for optimal training and testing, which is reflected in the strong result, yet those datasets are not fully inclusive. Previous research has certainly produced well-developed algorithms and applications, such as Lei (2022) with an exceptional model achieving 85% average accuracy, and Chen et al. (2024) with a multi-instrument recognition model capable of detecting up to 6 instruments at above 50% accuracy, but these models were built on datasets of common instruments. This paper focuses on musical instrument recognition for cultural instruments.

A paper by Kailewang and Moieni (2022) tackled this problem by shifting the focus from common musical instruments to cultural musical instruments to support cultural diversity. They constructed a bidirectional RNN, a fully connected network with an attention layer, and a CNN with an attention layer, and found very little success due to model overfitting.

According to Kailewang and Moieni (2022), when users encounter a cultural instrument that sparks their interest, they can record its sound and upload it to their website for identification. However, the authors’ models were not able to perform well on the test set. Their best model, an SVM, achieved an F1 score of 0.55; the F1 score balances precision and recall to summarise a model’s performance. Its precision of 0.71 means 71% of the predicted instruments were correctly labelled, and its recall of 0.47 means the model missed 53% of the instruments that were actually present. Kailewang and Moieni’s (2022) deep learning models overfitted due to the limited dataset as well as the low variety in the audio samples of the instruments.

This paper will make use of precision, recall, and F1 score as the means to measure and evaluate the models. Given that there is practically no focus on cultural instruments, this paper aims to tackle this problem with a dataset similar to the one Kailewang and Moieni (2022) used in their paper.

3. Data

3.1. Data Collection

There is a clear lack of publicly available data for indigenous musical instruments around the world. The proposed method was trained on the Sound Infusion dataset, which consists of hundreds of different types of cultural instruments with samples of varying lengths. In this paper, we decided to extract only the same 63 unique instruments (a subset of the entire Sound Infusion dataset), excluding the vocals, that were used by Kailewang and Moieni (2022), to allow a direct comparison between their approach and ours. The songs compiled from the available instruments also follow the same structure as in Kailewang and Moieni (2022), with 2 to 5 instruments per song. We did not include vocals in the training dataset.

Instead of manually creating the music by playing the instruments together and recording the result with an external program, we generated the music more efficiently using Python and its librosa library. To achieve this, we used separate audio files for each of the instruments, which Sound Infusion provided along with multiple variations of each instrument. Most instruments have between 5 and 15 unique recordings, though a few have only one, while others have more than 25. Such factors can affect the strength of the model in detecting certain instruments, particularly those with few variations. Low variation creates an overfitting issue in which the training and testing data are the same recording. These instruments are Cameroon-DrumsetBikutsi, Cameroon-ShakerBikutsi, China-GongsTunedSoftMallet, China-GongsTunedWoodMallet, and China-SmallErhuPlectrum. Hence, the performance of the model on these instruments is invalid (see Table 1).

3.2. Data Generation

Kailewang and Moieni (2022) manually generated their dataset by recording each song using Sound Infusion’s individual instrument samples, which represents approximately 4300 sample songs. This is an inefficient and time-consuming process, so instead of manually recording the songs, we created a simple algorithm in Python to automate song creation and data labelling. Our audio files were recorded in mp3 format with a Constant Bit Rate and a sampling rate of 48,000 Hz.

Table 1. Instrument recording train and test split.

Instrument | Training | Testing
Armenia-Duduk | 3 | 1
Bali-Gamelan Ensemble | 7 | 2
Bolivia-Charango | 7 | 3
Bolivia-Moseno | 15 | 5
Bolivia-Roncoro Chords | 4 | 1
Bolivia-Roncoro Solo | 4 | 1
Brazil-Afuche Cabasa | 4 | 1
Brazil-Agogo | 7 | 2
Brazil-Bass Guitar Bossa | 4 | 2
Brazil-Berimbau | 4 | 1
Brazil-Bongos | 4 | 1
Brazil-BongosCowbell | 4 | 1
Brazil-Cabasa | 3 | 1
Brazil-Claves | 6 | 2
Brazil-Cuica | 4 | 2
Brazil-EggShaker | 5 | 2
Brazil-Guitar | 4 | 1
Brazil-Pandeiro | 4 | 1
Brazil-PercussionSet | 6 | 2
Brazil-RainStick | 4 | 1
Brazil-Surdo | 7 | 3
BurkinaFaso-BaraDrum | 8 | 2
BurkinaFaso-BassLine | 22 | 8
Cameroon-Congas | 1 | 1
Cameroon-Djembe | 1 | 1
Cameroon-DrumsetBikutsi | 1 | same as train
Cameroon-PercussionSetBikutsi | 1 | 1
Cameroon-ShakerBikutsi | 1 | same as train
China-Bawu | 4 | 2
China-BeijingOperaGongs | 4 | 1
China-BianzhongBells | 8 | 2
China-BigErhuPlectrum | 6 | 2
China-CeylonGuitar | 6 | 2
China-ChauGongs | 3 | 1
China-Dizi | 4 | 1
China-Dongxiao | 4 | 1
China-Erhu | 7 | 2
China-FengGong | 5 | 2
China-Gaohu | 8 | 2
China-GongsTunedMetalMallet | 1 | 1
China-GongsTunedSoftMallet | 1 | same as train
China-GongsTunedWoodmallet | 1 | same as train
China-Hulusi | 4 | 2
China-JinghuOperaViolin | 4 | 1
China-KouXian | 4 | 1
China-Pipa | 4 | 2
China-ShanghaiBabyPiano | 8 | 2
China-Sheng | 4 | 2
China-SmallErhuPlectrum | 1 | same as train
China-WuhanTamTam | 3 | 1
China-Xiao | 6 | 2
China-YangQin | 4 | 1
Congo-Bongos | 8 | 2
Congo-Sanzas | 8 | 2
Cuba-Guiro | 3 | 1
Cuba-Triangle | 2 | 1
Egypt-Fiddle | 5 | 2
Germany-CrumhornAlto | 4 | 2
Germany-CrumhornBass | 4 | 2
Germany-CrumhornConsortium | 4 | 2
Germany-CrumhornSoprano | 4 | 2
Germany-CrumhornTenor | 4 | 2
Germany-Gemshorn | 5 | 2

Figure 1. Pseudocode to create folders.

Figure 2. Pseudocode to generate a CSV file with random song combinations.

We separated the musical instruments into 4 categories, Membranophone, Chordophone, Aerophone, and Idiophone, as Kailewang and Moieni (2022) did. We created folders for each instrument using the code in Figure 1 (above) for the training instruments and testing instruments, to address the overfitting problem Kailewang and Moieni (2022) experienced with their CNN and RNN models. We then populated the folders with all the sample recordings of their respective instruments from Sound Infusion’s dataset. For folders with multiple sample recordings, we reserved approximately 80% of the samples for training and used the remainder for testing. For instance, instruments with 5 samples are split such that 4 samples are used for training and 1 for testing. Table 1 shows our data split strategy; we prioritised including more training data than testing data with a view to scaling this technology in the future. As mentioned previously, a few instruments’ recordings (Cameroon-DrumsetBikutsi, Cameroon-ShakerBikutsi, China-GongsTunedSoftMallet, China-GongsTunedWoodMallet, and China-SmallErhuPlectrum) were used in both the training and testing data.
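Since the pseudocode in Figure 1 is not reproduced here, the following is a minimal sketch of what such a folder-creation and roughly 80/20 split step could look like. The directory names (raw, train, test) and the helper function are assumptions made for illustration; only the 80/20 split and the reuse of single-sample instruments come from the text.

```python
import math
import shutil
from pathlib import Path

RAW_DIR = Path("raw")     # assumption: one source folder per instrument, e.g. raw/China-Erhu/*.mp3
TRAIN_DIR = Path("train")
TEST_DIR = Path("test")

def split_instrument_folders(train_ratio: float = 0.8) -> None:
    """Create train/test folders per instrument and copy roughly 80% of samples to train."""
    for instrument_dir in sorted(RAW_DIR.iterdir()):
        if not instrument_dir.is_dir():
            continue
        samples = sorted(instrument_dir.glob("*.mp3"))
        # Keep at least one sample on the training side.
        n_train = max(1, math.floor(len(samples) * train_ratio))
        for dest_root, subset in (
            (TRAIN_DIR, samples[:n_train]),
            # Single-sample instruments reuse the training clip ("same as train" in Table 1).
            (TEST_DIR, samples[n_train:] or samples[:1]),
        ):
            dest = dest_root / instrument_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for sample in subset:
                shutil.copy2(sample, dest / sample.name)

if __name__ == "__main__":
    split_instrument_folders()
```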

Figure 2 shows the unique generator function, which is responsible for generating instrument combinations and exporting them to a Comma Separated Values (CSV) file. This function uses Python’s random library to decide how many instruments are present and which instruments to include (drawn from the Membranophone, Chordophone, Aerophone, and Idiophone categories). The function generates a number of songs specified by user input. For this paper, we specified a total of 10,000 unique instrument combinations from both training and testing samples, exported to a CSV file. The data generation process involved two steps: 1) generating 8,000 random audio samples for training data; and 2) generating 2,000 samples for testing data.

The function starts by randomly picking a number between 2 and 5; this number, n, decides how many instruments will be included in the song. If n is 3 or fewer, the function randomly selects n instruments. If n equals 4, it picks one instrument from each of the 4 categories. If n is 5, it picks one instrument from each of the 4 categories and then adds one more random instrument. The CSV file has 5 columns to indicate the 5 possible instruments in each song (each row); for songs with fewer than 5 instruments, the unused columns are filled with “NA.”
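As a concrete illustration of the logic described above (Figure 2), here is a minimal sketch. The category dictionary is populated with a few placeholder names only, and the function name and CSV file names are assumptions; the sampling rules (2 to 5 instruments, the category constraints for 4 and 5, and the “NA” padding) follow the text.

```python
import csv
import random

# Assumption: instruments grouped by the four categories used in the paper (illustrative entries only).
CATEGORIES = {
    "Membranophone": ["Brazil-Surdo", "Cameroon-Djembe"],
    "Chordophone":   ["Bolivia-Charango", "China-Pipa"],
    "Aerophone":     ["Armenia-Duduk", "China-Hulusi"],
    "Idiophone":     ["Brazil-Agogo", "China-FengGong"],
}
ALL_INSTRUMENTS = [name for names in CATEGORIES.values() for name in names]

def generate_combinations(n_songs, csv_path, seed=None):
    """Write n_songs rows of 5 columns; unused slots are filled with 'NA'."""
    rng = random.Random(seed)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_songs):
            n = rng.randint(2, 5)
            if n <= 3:
                chosen = rng.sample(ALL_INSTRUMENTS, n)
            else:
                # One instrument from each category, plus one extra random pick when n == 5.
                chosen = [rng.choice(names) for names in CATEGORIES.values()]
                if n == 5:
                    chosen.append(rng.choice(ALL_INSTRUMENTS))
            writer.writerow(chosen + ["NA"] * (5 - len(chosen)))

generate_combinations(8000, "train_songs.csv", seed=1)
generate_combinations(2000, "test_songs.csv", seed=2)
```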

Figure 3. Pseudocode to create the dataset.

Once the song combination CSV file and instrument folders are created, the CSV file is passed to the data generator function in Figure 3, which reads each row to determine the instruments used in each song. The function then locates each instrument’s folder and overlays the selected recordings using the PyDub library to generate the final track. It exports the completed composition as an mp3 file to the specified path. Each exported file is named after the instruments used in the song, with instrument names separated by underscores. For example, a song featuring only the Armenian Duduk and Chinese Bawu is named “Armenia-Duduk_China-Bawu_NA_NA_NA.mp3.”
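A minimal sketch of this mixing step is shown below, assuming the folder layout from the earlier split and a random choice among each instrument’s recordings; only the use of PyDub’s overlay, the mp3 export, and the underscore naming convention come from the text.

```python
import csv
import random
from pathlib import Path

from pydub import AudioSegment  # requires ffmpeg for mp3 decoding and encoding

TRAIN_DIR = Path("train")       # assumption: per-instrument folders created earlier
OUT_DIR = Path("songs/train")   # assumption: output location for the mixed tracks

def render_song(instruments, src_dir=TRAIN_DIR, out_dir=OUT_DIR):
    """Overlay one randomly chosen recording of each listed instrument and export an mp3."""
    names = [name for name in instruments if name != "NA"]
    mix = None
    for name in names:
        sample_path = random.choice(sorted((src_dir / name).glob("*.mp3")))
        layer = AudioSegment.from_file(str(sample_path))
        # overlay() keeps the length of the first (base) segment; later layers are truncated to it.
        mix = layer if mix is None else mix.overlay(layer)
    out_dir.mkdir(parents=True, exist_ok=True)
    filename = "_".join(names + ["NA"] * (5 - len(names))) + ".mp3"
    mix.export(str(out_dir / filename), format="mp3")

with open("train_songs.csv", newline="") as f:
    for row in csv.reader(f):
        render_song(row)
```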

Figure 4. Pseudocode for data labeling function.

This naming convention is essential as it simplifies the labeling process for the dataset. The data labeling function in Figure 4 processes the folder containing the exported dataset by reading the filenames, extracting the instrument names (excluding the “NA” entries), and saving them into a text file. This labeling process is similar to that used in the OpenMIC-2018 dataset (Humphrey et al., 2018); however, our dataset is weakly labeled, meaning it does not include timestamps for the instrument appearances.
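The labelling step in Figure 4 can be sketched as follows; the file names and the tab-separated output format are assumptions, while parsing instrument names from the filename and dropping the “NA” entries follow the text.

```python
from pathlib import Path

SONGS_DIR = Path("songs/train")          # assumption: exported mixes live here
LABELS_FILE = Path("train_labels.txt")   # assumption: plain-text label file

def write_labels(songs_dir=SONGS_DIR, labels_file=LABELS_FILE):
    """Parse instrument names out of each filename, dropping the 'NA' placeholders."""
    with open(labels_file, "w") as out:
        for mp3 in sorted(songs_dir.glob("*.mp3")):
            instruments = [part for part in mp3.stem.split("_") if part != "NA"]
            out.write(f"{mp3.name}\t{','.join(instruments)}\n")

write_labels()
```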

3.3. Data Preprocessing

We extracted various audio features from the files to train our models, including the Mel spectrogram, MFCCs, VGGish embeddings, spectral bandwidth, spectral rolloff, spectral centroid, harmonic signals, Root Mean Square (RMS) energy, and zero crossing rate. We used Python’s librosa library to extract the musical features from all 10,000 audio files. Because the songs varied in length from 5 to 15 seconds, we trimmed the audio in both the training and testing data so that feature extraction covered just the first 5 seconds, during which all instruments are playing; this trimming was applied when extracting the Mel spectrogram, VGGish embeddings, and MFCCs. For the machine learning methods, we extracted audio features and used the mean and variance of each as input to the model. This approach allowed the model to incorporate a wider range of features while minimising data complexity. However, it introduced limitations, as summarising features in this way discards temporal information: the model can rely only on the mean and variance to differentiate between instruments. The resulting input for the machine learning methods is a 1 × 52 dimensional feature vector.

3.3.1. Mel Spectrogram

The Mel spectrogram was a widely used feature in previous studies, including Reghunath and Rajan (2021), Mukhedkar (2020), and Han et al. (2017). For our dataset, we set the FFT window size (n_fft) to 1024, the hop length to 512, and the number of Mel bands (n_mels) to 128, using a Hann window function. Additionally, to better replicate human auditory perception, which is logarithmic, we applied a logarithmic scale to the Mel spectrogram, following the approach of Kailewang and Moieni (2022). The resulting output dimension for each audio sample is 128 × 216.
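A minimal librosa sketch of this extraction is given below. Loading at librosa’s default 22,050 Hz sample rate is our assumption, made because it reproduces the stated 128 × 216 output shape for 5 seconds of audio; the file path is illustrative.

```python
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=22050, duration=5.0):
    """Load the first 5 s of audio and return a log-scaled Mel spectrogram (n_mels x frames)."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=128, window="hann"
    )
    return librosa.power_to_db(mel, ref=np.max)

S = log_mel_spectrogram("songs/train/Armenia-Duduk_China-Bawu_NA_NA_NA.mp3")
print(S.shape)  # approximately (128, 216) for 5 s of audio at 22,050 Hz
```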

3.3.2. Mel Frequency Cepstral Coefficients (MFCCs)

In addition to the Mel spectrogram, we also extracted MFCCs from the data. We followed the same feature extraction parameters used by Kailewang and Moieni (2022), setting the frame length to 2048, the hop length to 160, and using a Hann window. To compute the frequency spectrum, we set n_fft to 512 and used 40 Mel filters. For each audio sample, we extracted 24 MFCCs. However, unlike Kailewang and Moieni (2022), who used only VGGish embeddings, our deep learning models took both MFCCs and Mel spectrograms as input. The MFCC feature has a shape of 24 × 2756 for each audio file.

In addition, we extracted a second form of MFCC that follows the same feature extraction process by Liu et al. (2022), where the MFCCs are compressed by calculating the mean and variance of each coefficient. For this extraction, we set the number of MFCC (n_mfcc) to 20, while keeping all other parameters at their default values in the librosa library. We then used these compressed features as input for the traditional machine learning methods.

3.3.3. VGGish

We also adopted an approach similar to that used in the OpenMIC-2018 dataset (Humphrey et al., 2018) by leveraging resources from the developers of AudioSet and VGGish. First, we trimmed the generated audio files to 5 seconds and transformed this raw audio into a 128-dimensional vector every 0.96 seconds using a fixed window size. To account for variations in audio length, we standardized each audio sample to 5 seconds, resulting in a 5 × 128 matrix of VGGish embeddings. This method was applied in previous research on polyphonic music instrument detection by Mukhedkar (2020), which reported promising results on datasets of well-represented instruments such as MUSDB18.
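One possible route to these embeddings is sketched below, assuming the community PyTorch port of VGGish available through torch.hub (the paper does not name a specific implementation) and wav input; mp3 files may need to be converted first. Only the 0.96-second windows, the 128-dimensional vectors, and the standardisation to a 5 × 128 matrix come from the text.

```python
import numpy as np
import torch

# Assumption: the harritaylor/torchvggish port; other VGGish implementations would work similarly.
vggish = torch.hub.load("harritaylor/torchvggish", "vggish")
vggish.eval()

def vggish_5x128(path, n_frames=5):
    """Return a (5, 128) matrix: one 128-d embedding per 0.96 s window, padded or truncated to 5."""
    with torch.no_grad():
        emb = vggish.forward(path)  # one row per 0.96 s window of the file
    emb = emb.detach().cpu().numpy() if hasattr(emb, "detach") else np.asarray(emb)
    if emb.shape[0] < n_frames:     # pad short clips with zero rows
        emb = np.vstack([emb, np.zeros((n_frames - emb.shape[0], emb.shape[1]), dtype=emb.dtype)])
    return emb[:n_frames]

X = vggish_5x128("songs/train/Armenia-Duduk_China-Bawu_NA_NA_NA.wav")
print(X.shape)  # (5, 128)
```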

3.3.4. Spectral Bandwidth

We also extracted spectral bandwidth, defined by Kailewang and Moieni (2022) as the weighted average frequency of the signal at each frame. This feature was extracted using Python, and its mean and variance were used as input for the traditional machine learning methods.

3.3.5. Spectral Rolloff

The spectral rolloff threshold was set to 85%, following the approach of Kailewang and Moieni (2022) and Liu et al. (2022). We extracted this feature and used its mean and variance as input. Additionally, we adopted Liu et al.’s (2022) method of separating audio signals into harmonic and percussive components. In this paper, we focused only on the harmonic signals, such as those produced by the Armenian Duduk, since our dataset contains a limited variety of percussive instruments. This separation was performed using the Harmonic-Percussive Source Separation (HPSS) algorithm, the preprocessing technique used by Liu et al. (2022). We then computed the mean and variance of the harmonic component for use in our machine learning methods.

3.3.6. Spectral Centroid

The spectral centroid was extracted from the raw audio files. Put simply, this feature reflects the brightness of the timbre. As described by Liu et al. (2022), the spectral centroid is the average frequency of the signal weighted by the energy in each frequency range; it therefore shifts depending on whether the audio contains more high or low frequencies.

3.3.7. Zero Crossing Rate

The zero crossing rate is the rate at which an audio waveform changes sign, i.e., how often it crosses the zero axis.

3.3.8. Root Mean Square (RMS) Energy

Lastly, we extracted the Root Mean Square (RMS) energy, a commonly used measure for calculating the amplitude envelope of an audio signal. We included this feature as part of our input data, as it was also used by Liu et al. (2022) in their XGBoost model.
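Putting the compressed MFCCs of Section 3.3.2 together with the features of Sections 3.3.4 to 3.3.8, one plausible composition of the 1 × 52 summary vector (20 MFCC means, 20 MFCC variances, plus the mean and variance of the six remaining features) can be sketched with librosa as follows; the exact ordering and composition are our assumption.

```python
import librosa
import numpy as np

def summary_feature_vector(path, duration=5.0):
    """Mean and variance of 20 MFCCs plus six spectral/temporal features -> 52 values."""
    y, sr = librosa.load(path, duration=duration)
    harmonic, _ = librosa.effects.hpss(y)                     # keep only the harmonic component
    features = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),          # 20 rows
        librosa.feature.spectral_bandwidth(y=y, sr=sr),       # 1 row each below
        librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        harmonic[np.newaxis, :],                              # raw harmonic signal, summarised like the rest
        librosa.feature.rms(y=y),
        librosa.feature.zero_crossing_rate(y),
    ]
    stats = [(f.mean(axis=1), f.var(axis=1)) for f in features]
    return np.concatenate([v for pair in stats for v in pair])  # 20*2 + 6*2 = 52 values

x = summary_feature_vector("songs/train/Armenia-Duduk_China-Bawu_NA_NA_NA.mp3")
print(x.shape)  # (52,)
```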

4. Model Architecture

4.1. Machine Learning Methods

4.1.1. Random Forest

This is our baseline model implemented with traditional machine learning methods. Because a random forest alone cannot handle multi-label data, we wrapped it with scikit-learn’s MultiOutputClassifier to enable multi-label classification. This model served as a benchmark for comparison with the other machine learning method, XGBoost. We set the number of estimators to 100 and the maximum tree depth to 6.
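A minimal scikit-learn sketch of this setup is below; the cached feature file, the label-file format, and the random_state are assumptions, while the MultiOutputClassifier wrapper, 100 estimators, and maximum depth of 6 come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: X_train is the (n_songs, 52) summary-feature matrix and y_labels holds
# the instrument-name lists parsed from the label file written earlier.
X_train = np.load("X_train.npy")                      # hypothetical cached features
y_labels = [line.split("\t")[1].strip().split(",")    # filename<TAB>comma-separated instruments
            for line in open("train_labels.txt")]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_labels)                 # (n_songs, 63) binary indicator matrix

rf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
)
rf.fit(X_train, Y_train)
```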

4.1.2. XGBoost

As noted by Liu et al. (2022), this model performs exceptionally well for detecting single instruments in audio. In our work, we adapted it to handle multi-label data for multi-instrument detection. However, like the random forest model, XGBoost cannot natively handle multi-label classification. Therefore, we wrapped the XGBoost model with scikit-learn’s MultiOutputClassifier to enable multi-label support, similar to our approach with the random forest model. We used the same training parameters as Liu et al. (2022) with 100 estimators, a learning rate of 0.05, a maximum depth of 6, a subsample ratio of 0.8, and a minimum child weight of 1. Additionally, we used ‘logloss’ as the evaluation metric.
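The XGBoost counterpart, under the same assumptions about X_train and Y_train as in the random forest sketch above, might look like this; the listed hyper-parameters are the ones stated in the text.

```python
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

# X_train and Y_train as prepared in the random forest sketch.
xgb = MultiOutputClassifier(
    XGBClassifier(
        n_estimators=100,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        min_child_weight=1,
        eval_metric="logloss",
    )
)
xgb.fit(X_train, Y_train)
```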

4.2. Deep Learning Methods

Unlike traditional machine learning methods, modern deep learning methods require additional considerations such as batch size, optimiser, learning rate, loss function, and the number of epochs. For our training, we set the batch size to 64. We used the Adam optimiser with a learning rate of 0.001, following Han et al. (2017). However, unlike Han et al., who used the categorical cross-entropy loss function, we used binary cross-entropy loss, since this is a multi-label problem. Training loss was used to decide when to stop training: training halts once the training loss has not improved for 5 epochs.
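A sketch of this training configuration is given below, assuming a Keras-style workflow (the framework is not named in the text) and taking any of the models defined in the following subsections as the `model` argument.

```python
import tensorflow as tf

def compile_and_fit(model, X_train, Y_train):
    """Compile with Adam + binary cross-entropy and stop when training loss stops improving."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",                         # one independent sigmoid per instrument
    )
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)
    return model.fit(X_train, Y_train, batch_size=64, epochs=100, callbacks=[early_stop])
```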

4.2.1. bi-LSTM

The bi-LSTM is our baseline model among the deep learning methods. This model is based on the architecture from Mukhedkar (2020) (shown in Figure 5). However, the architecture was reconfigured to handle multi-label input with 63 unique classes by using a sigmoid activation function for the output layer.

Figure 5. bi-LSTM model architecture (Mukhedkar, 2020).

Our recorded songs are treated as time series data, which can be approached as a standard sequence learning problem, making RNN models a natural choice (Mukhedkar, 2020). Accordingly, we used the Long Short-Term Memory (LSTM) architecture as applied by Mukhedkar (2020). The model takes the extracted VGGish embeddings as input. However, the main drawback of the RNN model is its inability to capture frequency-domain invariances. The training parameters are shown in Table 2.
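A minimal Keras sketch of such a bi-LSTM over the 5 × 128 VGGish input is shown below; the number of LSTM units, the second bidirectional layer, and the dropout rate are assumptions, since the text fixes only the input shape and the 63-way sigmoid output.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_bilstm(n_classes=63, n_frames=5, emb_dim=128, units=128):
    inputs = tf.keras.Input(shape=(n_frames, emb_dim))           # 5 x 128 VGGish embeddings
    x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(units))(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)   # one independent sigmoid per instrument
    return tf.keras.Model(inputs, outputs)

model = build_bilstm()
model.summary()
```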

Table 2. Deep learning models’ hyper-parameters.

Model | Epochs | Batch size | Early-stop patience (training loss) | Learning rate | Optimiser | Loss function
CRNN | 100 | 64 | 3 | 0.005 | Adam | Binary Cross Entropy
Han's CNN | 100 | 64 | 3 | 0.005 | Adam | Binary Cross Entropy
bi-LSTM | 100 | 64 | 3 | 0.005 | Adam | Binary Cross Entropy

4.2.2. C-RNN

Reghunath and Rajan (2021) designed a model architecture that combines the benefits of CNNs and RNNs for ‘predominant’ instrument detection. The proposed CRNN architecture is shown in Figure 6. We adapted this model for our multi-label dataset, and we replaced the softmax activation function at the output layer with a sigmoid function. According to Reghunath and Rajan (2021), CNNs are unable to retain temporal context information, whereas RNNs can. Therefore, the model architecture includes convolutional layers followed by two bidirectional LSTM units. Training parameters are shown in Table 2.
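A sketch of a C-RNN of this shape is given below, assuming a log-Mel spectrogram input of 128 × 216. The filter counts, pooling sizes, and LSTM widths are assumptions; only the overall layout (convolutional blocks followed by two bidirectional LSTM units and a sigmoid output) follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(n_classes=63, n_mels=128, n_frames=216):
    inputs = tf.keras.Input(shape=(n_mels, n_frames, 1))          # log-Mel spectrogram
    x = inputs
    for filters in (32, 64, 128):                                 # assumed filter progression
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    # Collapse the frequency axis so the time axis becomes the RNN sequence.
    x = layers.Permute((2, 1, 3))(x)                              # (time, freq, channels)
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

model = build_crnn()
```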

Figure 6. C-RNN model architecture by Reghunath and Rajan (2021).

4.2.3. Han’s CNN

We also used the state-of-the-art CNN model developed by Han et al. (2017). Their architecture was specifically designed to handle multi-class and multi-label data; however, their input dataset contained 11 unique instruments, whereas ours has 63, so we modified only the number of output classes for the sigmoid activation function. Additionally, we used the basic ReLU activation instead of Leaky ReLU, since our dataset differs slightly from IRMAS; a previous study by Xu et al. (2015) showed instances where ReLU outperforms Leaky ReLU depending on the dataset. Although Han et al.’s (2017) model was primarily developed for ‘predominant’ instrument detection, it is capable of identifying multiple well-known musical instruments in an audio file, so we can evaluate its performance in detecting all instruments in the audio file. The original model architecture is shown in Figure 7. The training parameters for this model are shown in Table 2.
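Since Figure 7 is not reproduced here, the following sketch shows a Han-style CNN as adapted for our setting: plain ReLU activations and a 63-way sigmoid output, as stated above. The filter counts, 3 × 3 pooling, dense width, and dropout rates are our reading of the published architecture rather than values given in this section, and the 128 × 216 input shape is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_han_cnn(n_classes=63, n_mels=128, n_frames=216):
    inputs = tf.keras.Input(shape=(n_mels, n_frames, 1))
    x = inputs
    for filters in (32, 64, 128, 256):                            # assumed per-block filter counts
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(3, 3))(x)
        x = layers.Dropout(0.25)(x)
    x = layers.GlobalMaxPooling2D()(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)    # 63 instrument classes
    return tf.keras.Model(inputs, outputs)

model = build_han_cnn()
```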

Figure 7. Han et al. (2017) Model architecture.

5. Results and Discussion

For all our deep learning models, an instrument is considered present in a song when its sigmoid output is greater than or equal to 0.5, and absent otherwise. For instance, if the Armenian Duduk’s sigmoid output is 0.5 or higher, the instrument is counted as present. For our results, we focused on the micro and macro averages of the F1 score, recall, and precision, as in previous research such as Han et al. (2017). These metrics were calculated using scikit-learn’s “classification_report”, and we used the Pandas library to format and export the results into CSV files.
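A minimal sketch of this evaluation step, assuming a trained Keras model together with the binarised test labels Y_test and the fitted MultiLabelBinarizer mlb from the earlier sketches:

```python
import pandas as pd
from sklearn.metrics import classification_report

probs = model.predict(X_test)                      # (n_songs, 63) sigmoid outputs
preds = (probs >= 0.5).astype(int)                 # an instrument counts as present at >= 0.5

report = classification_report(
    Y_test, preds,
    target_names=mlb.classes_,
    output_dict=True,
    zero_division=0,                               # classes that are never predicted score 0
)
pd.DataFrame(report).transpose().to_csv("model_report.csv")
```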

The random forest model’s performance, measured by micro and macro F1 scores, was only 0.225 and 0.19, respectively. Although the model achieved a high micro-average precision of 0.946, its micro-average recall was only 0.128: it rarely predicted any instrument as present and missed the vast majority of instrument occurrences. The class-level F1 scores, precision, and recall (see Table 3) make this clear: almost all instruments have a precision of 1 while their recall is close to 0, meaning the model’s positive predictions were usually correct but extremely infrequent. For example, the Brazil rainstick has a precision of 1 but a recall of only 0.015, so the model detected the rainstick in only a tiny fraction of the samples in which it was actually present. Some instruments were completely undetected; for instance, Germany-Gemshorn’s precision, recall, and F1 score were all 0. Many other instruments suffered similar issues, as shown in Table 3, for instance the China-Xiao.

Table 3. Random forest performance.

Instrument | Precision | Recall | F1-score | Support
Armenia-Duduk | 0.75 | 0.069 | 0.126 | 131
Bali-Gamelan Ensemble | 0 | 0 | 0 | 115
Bolivia-Charango | 1 | 0.118 | 0.211 | 119
Bolivia-Moseno | 1 | 0.223 | 0.365 | 130
Bolivia-Roncoro Chords | 1 | 0.016 | 0.032 | 122
Bolivia-Roncoro Solo | 0 | 0 | 0 | 132
Brazil-Afuche Cabasa | 0.625 | 0.094 | 0.164 | 106
Brazil-Agogo | 0.667 | 0.017 | 0.034 | 115
Brazil-Bass Guitar Bossa | 1 | 0.259 | 0.412 | 108
Brazil-Berimbau | 0.333 | 0.008 | 0.016 | 119
Brazil-Bongos | 0 | 0 | 0 | 104
Brazil-BongosCowbell | 0 | 0 | 0 | 86
Brazil-Cabasa | 1 | 0.129 | 0.228 | 101
Brazil-Claves | 1 | 0.036 | 0.07 | 111
Brazil-Cuica | 0 | 0 | 0 | 94
Brazil-EggShaker | 1 | 0.039 | 0.075 | 103
Brazil-Guitar | 1 | 0.351 | 0.52 | 94
Brazil-Pandeiro | 1 | 0.01 | 0.02 | 101
Brazil-PercussionSet | 0 | 0 | 0 | 91
Brazil-RainStick | 1 | 0.015 | 0.03 | 130
Brazil-Surdo | 0.989 | 0.701 | 0.82 | 127
BurkinaFaso-BaraDrum | 0 | 0 | 0 | 89
BurkinaFaso-BassLine | 1 | 0.061 | 0.115 | 115
Cameroon-Congas | 0 | 0 | 0 | 96
Cameroon-Djembe | 0 | 0 | 0 | 93
Cameroon-DrumsetBikutsi | 0.936 | 0.427 | 0.587 | 103
Cameroon-PercussionSetBikutsi | 0 | 0 | 0 | 106
Cameroon-ShakerBikutsi | 1 | 0.021 | 0.041 | 95
China-Bawu | 0.889 | 0.058 | 0.11 | 137
China-BeijingOperaGongs | 1 | 0.521 | 0.685 | 121
China-BianzhongBells | 1 | 0.286 | 0.444 | 105
China-BigErhuPlectrum | 1 | 0.505 | 0.671 | 103
China-CeylonGuitar | 0 | 0 | 0 | 116
China-ChauGongs | 0.348 | 0.075 | 0.124 | 106
China-Dizi | 0.75 | 0.18 | 0.291 | 133
China-Dongxiao | 0 | 0 | 0 | 117
China-Erhu | 0.868 | 0.4 | 0.548 | 115
China-FengGong | 1 | 0.02 | 0.039 | 100
China-Gaohu | 0.981 | 0.525 | 0.684 | 101
China-GongsTunedMetalMallet | 1 | 0.265 | 0.42 | 113
China-GongsTunedSoftMallet | 1 | 0.409 | 0.58 | 115
China-GongsTunedWoodmallet | 1 | 0.458 | 0.629 | 96
China-Hulusi | 1 | 0.015 | 0.029 | 137
China-JinghuOperaViolin | 0 | 0 | 0 | 121
China-KouXian | 1 | 0.12 | 0.214 | 125
China-Pipa | 1 | 0.008 | 0.017 | 119
China-ShanghaiBabyPiano | 1 | 0.008 | 0.015 | 130
China-Sheng | 0 | 0 | 0 | 133
China-SmallErhuPlectrum | 1 | 0.531 | 0.694 | 113
China-WuhanTamTam | 1 | 0.01 | 0.019 | 105
China-Xiao | 0 | 0 | 0 | 132
China-YangQin | 1 | 0.042 | 0.081 | 143
Congo-Bongos | 1 | 0.163 | 0.28 | 92
Congo-Sanzas | 0 | 0 | 0 | 123
Cuba-Guiro | 1 | 0.017 | 0.033 | 121
Cuba-Triangle | 1 | 0.026 | 0.051 | 114
Egypt-Fiddle | 0.968 | 0.536 | 0.69 | 112
Germany-CrumhornAlto | 1 | 0.062 | 0.116 | 130
Germany-CrumhornBass | 1 | 0.113 | 0.203 | 115
Germany-CrumhornConsortium | 1 | 0.177 | 0.301 | 113
Germany-CrumhornSoprano | 1 | 0.051 | 0.098 | 117
Germany-CrumhornTenor | 0.667 | 0.015 | 0.03 | 132
Germany-Gemshorn | 0 | 0 | 0 | 124
micro avg | 0.946 | 0.128 | 0.225 | 7165
macro avg | 0.679 | 0.13 | 0.19 | 7165
weighted avg | 0.686 | 0.128 | 0.187 | 7165
samples avg | 0.427 | 0.151 | 0.214 | 7165

The final machine learning method we used was the XGBoost model, which performed exceptionally well in single instrument recognition tasks according to Liu et al. (2022). We adapted this model for multi-label classification in the same way as the random forest model and found some interesting results. The micro and macro F1 scores were 0.426 and 0.363, respectively, a significant improvement over the random forest model. However, XGBoost exhibited a similar pattern of high precision but very low recall (see Table 4), with a micro precision of 0.829 against a micro recall of only 0.286, and a macro precision of 0.751 against a macro recall of only 0.287. The results were much better than those of the random forest model, but XGBoost still failed to detect most of the instruments that were actually present in the audio samples. This issue is particularly evident for individual instruments such as Brazil-Bongos, Brazil-Pandeiro, China-Bawu, China-FengGong, Cuba-Triangle, and Congo-Sanzas.

Table 4. XGBoost performance.

Instrument | Precision | Recall | F1-score | Support
Armenia-Duduk | 0.583 | 0.107 | 0.181 | 131
Bali-Gamelan Ensemble | 0.967 | 0.252 | 0.4 | 115
Bolivia-Charango | 0.933 | 0.471 | 0.626 | 119
Bolivia-Moseno | 0.837 | 0.554 | 0.667 | 130
Bolivia-Roncoro Chords | 0.528 | 0.385 | 0.445 | 122
Bolivia-Roncoro Solo | 0.6 | 0.023 | 0.044 | 132
Brazil-Afuche Cabasa | 0.421 | 0.075 | 0.128 | 106
Brazil-Agogo | 0.667 | 0.017 | 0.034 | 115
Brazil-Bass Guitar Bossa | 0.8 | 0.481 | 0.601 | 108
Brazil-Berimbau | 0.25 | 0.008 | 0.016 | 119
Brazil-Bongos | 1 | 0.029 | 0.056 | 104
Brazil-BongosCowbell | 0 | 0 | 0 | 86
Brazil-Cabasa | 0.692 | 0.178 | 0.283 | 101
Brazil-Claves | 1 | 0.144 | 0.252 | 111
Brazil-Cuica | 0.875 | 0.074 | 0.137 | 94
Brazil-EggShaker | 0.5 | 0.097 | 0.163 | 103
Brazil-Guitar | 0.923 | 0.638 | 0.755 | 94
Brazil-Pandeiro | 0.9 | 0.089 | 0.162 | 101
Brazil-PercussionSet | 0 | 0 | 0 | 91
Brazil-RainStick | 0.75 | 0.023 | 0.045 | 130
Brazil-Surdo | 0.953 | 0.961 | 0.957 | 127
BurkinaFaso-BaraDrum | 1 | 0.067 | 0.126 | 89
BurkinaFaso-BassLine | 0.857 | 0.104 | 0.186 | 115
Cameroon-Congas | 0.25 | 0.01 | 0.02 | 96
Cameroon-Djembe | 1 | 0.097 | 0.176 | 93
Cameroon-DrumsetBikutsi | 0.784 | 0.738 | 0.76 | 103
Cameroon-PercussionSetBikutsi | 1 | 0.028 | 0.055 | 106
Cameroon-ShakerBikutsi | 1 | 0.095 | 0.173 | 95
China-Bawu | 0.929 | 0.19 | 0.315 | 137
China-BeijingOperaGongs | 0.965 | 0.917 | 0.941 | 121
China-BianzhongBells | 0.971 | 0.324 | 0.486 | 105
China-BigErhuPlectrum | 0.988 | 0.786 | 0.876 | 103
China-CeylonGuitar | 0.24 | 0.103 | 0.145 | 116
China-ChauGongs | 0.074 | 0.019 | 0.03 | 106
China-Dizi | 0.582 | 0.429 | 0.494 | 133
China-Dongxiao | 0.961 | 0.624 | 0.756 | 117
China-Erhu | 0.775 | 0.809 | 0.791 | 115
China-FengGong | 0.939 | 0.31 | 0.466 | 100
China-Gaohu | 0.927 | 0.752 | 0.831 | 101
China-GongsTunedMetalMallet | 1 | 0.558 | 0.716 | 113
China-GongsTunedSoftMallet | 1 | 0.626 | 0.77 | 115
China-GongsTunedWoodmallet | 1 | 0.667 | 0.8 | 96
China-Hulusi | 0.667 | 0.102 | 0.177 | 137
China-JinghuOperaViolin | 0.944 | 0.14 | 0.245 | 121
China-KouXian | 0.96 | 0.192 | 0.32 | 125
China-Pipa | 0.714 | 0.042 | 0.079 | 119
China-ShanghaiBabyPiano | 0.375 | 0.046 | 0.082 | 130
China-Sheng | 0.696 | 0.12 | 0.205 | 133
China-SmallErhuPlectrum | 0.99 | 0.867 | 0.925 | 113
China-WuhanTamTam | 0.455 | 0.048 | 0.086 | 105
China-Xiao | 0 | 0 | 0 | 132
China-YangQin | 0.813 | 0.182 | 0.297 | 143
Congo-Bongos | 0.926 | 0.543 | 0.685 | 92
Congo-Sanzas | 1 | 0.016 | 0.032 | 123
Cuba-Guiro | 0.75 | 0.025 | 0.048 | 121
Cuba-Triangle | 1 | 0.061 | 0.116 | 114
Egypt-Fiddle | 0.908 | 0.705 | 0.794 | 112
Germany-CrumhornAlto | 0.971 | 0.523 | 0.68 | 130
Germany-CrumhornBass | 0.758 | 0.409 | 0.531 | 115
Germany-CrumhornConsortium | 0.976 | 0.363 | 0.529 | 113
Germany-CrumhornSoprano | 0.964 | 0.462 | 0.624 | 117
Germany-CrumhornTenor | 0.957 | 0.333 | 0.494 | 132
Germany-Gemshorn | 0.081 | 0.024 | 0.037 | 124
micro avg | 0.829 | 0.286 | 0.426 | 7165
macro avg | 0.751 | 0.287 | 0.363 | 7165
weighted avg | 0.751 | 0.286 | 0.363 | 7165
samples avg | 0.743 | 0.319 | 0.422 | 7165

Our deep learning baseline, the bi-LSTM model, was not able to capture distinct features for instrument recognition, scoring 0 across all metrics: F1 score, recall, and precision (see Table 5). The bi-LSTM was, however, expected to be the weakest of the deep learning models. RNN architectures commonly struggle with high-dimensional inputs such as Mel spectrograms or MFCCs, since long temporal sequences cannot capture invariances in the frequency domain (Reghunath & Rajan, 2021). In our study we reduced the input dimension to 5 × 128 by using the VGGish embeddings, but our dataset was not sufficiently diverse for the model to learn any meaningful information for its predictions, so despite its success in related tasks in previous work, the bi-LSTM was the worst-performing model in our study.

Table 5. Bi-LSTM performance.

Instrument | Precision | Recall | F1-score | Support
Armenia-Duduk | 0 | 0 | 0 | 131
Bali-Gamelan Ensemble | 0 | 0 | 0 | 115
Bolivia-Charango | 0 | 0 | 0 | 119
Bolivia-Moseno | 0 | 0 | 0 | 130
Bolivia-Roncoro Chords | 0 | 0 | 0 | 122
Bolivia-Roncoro Solo | 0 | 0 | 0 | 132
Brazil-Afuche Cabasa | 0 | 0 | 0 | 106
Brazil-Agogo | 0 | 0 | 0 | 115
Brazil-Bass Guitar Bossa | 0 | 0 | 0 | 108
Brazil-Berimbau | 0 | 0 | 0 | 119
Brazil-Bongos | 0 | 0 | 0 | 104
Brazil-BongosCowbell | 0 | 0 | 0 | 86
Brazil-Cabasa | 0 | 0 | 0 | 101
Brazil-Claves | 0 | 0 | 0 | 111
Brazil-Cuica | 0 | 0 | 0 | 94
Brazil-EggShaker | 0 | 0 | 0 | 103
Brazil-Guitar | 0 | 0 | 0 | 94
Brazil-Pandeiro | 0 | 0 | 0 | 101
Brazil-PercussionSet | 0 | 0 | 0 | 91
Brazil-RainStick | 0 | 0 | 0 | 130
Brazil-Surdo | 0 | 0 | 0 | 127
BurkinaFaso-BaraDrum | 0 | 0 | 0 | 89
BurkinaFaso-BassLine | 0 | 0 | 0 | 115
Cameroon-Congas | 0 | 0 | 0 | 96
Cameroon-Djembe | 0 | 0 | 0 | 93
Cameroon-DrumsetBikutsi | 0 | 0 | 0 | 103
Cameroon-PercussionSetBikutsi | 0 | 0 | 0 | 106
Cameroon-ShakerBikutsi | 0 | 0 | 0 | 95
China-Bawu | 0 | 0 | 0 | 137
China-BeijingOperaGongs | 0 | 0 | 0 | 121
China-BianzhongBells | 0 | 0 | 0 | 105
China-BigErhuPlectrum | 0 | 0 | 0 | 103
China-CeylonGuitar | 0 | 0 | 0 | 116
China-ChauGongs | 0 | 0 | 0 | 106
China-Dizi | 0 | 0 | 0 | 133
China-Dongxiao | 0 | 0 | 0 | 117
China-Erhu | 0 | 0 | 0 | 115
China-FengGong | 0 | 0 | 0 | 100
China-Gaohu | 0 | 0 | 0 | 101
China-GongsTunedMetalMallet | 0 | 0 | 0 | 113
China-GongsTunedSoftMallet | 0 | 0 | 0 | 115
China-GongsTunedWoodmallet | 0 | 0 | 0 | 96
China-Hulusi | 0 | 0 | 0 | 137
China-JinghuOperaViolin | 0 | 0 | 0 | 121
China-KouXian | 0 | 0 | 0 | 125
China-Pipa | 0 | 0 | 0 | 119
China-ShanghaiBabyPiano | 0 | 0 | 0 | 130
China-Sheng | 0 | 0 | 0 | 133
China-SmallErhuPlectrum | 0 | 0 | 0 | 113
China-WuhanTamTam | 0 | 0 | 0 | 105
China-Xiao | 0 | 0 | 0 | 132
China-YangQin | 0 | 0 | 0 | 143
Congo-Bongos | 0 | 0 | 0 | 92
Congo-Sanzas | 0 | 0 | 0 | 123
Cuba-Guiro | 0 | 0 | 0 | 121
Cuba-Triangle | 0 | 0 | 0 | 114
Egypt-Fiddle | 0 | 0 | 0 | 112
Germany-CrumhornAlto | 0 | 0 | 0 | 130
Germany-CrumhornBass | 0 | 0 | 0 | 115
Germany-CrumhornConsortium | 0 | 0 | 0 | 113
Germany-CrumhornSoprano | 0 | 0 | 0 | 117
Germany-CrumhornTenor | 0 | 0 | 0 | 132
Germany-Gemshorn | 0 | 0 | 0 | 124
micro avg | 0 | 0 | 0 | 7165
macro avg | 0 | 0 | 0 | 7165
weighted avg | 0 | 0 | 0 | 7165
samples avg | 0 | 0 | 0 | 7165

The C-RNN (or C-LSTM) model trained on MFCCs also performed poorly, achieving 0.237 for the micro F1 score and 0.22 for the macro F1 score (see Table 6). Considering that the MFCCs provide relevant information about the input audio, we had expected this model to perform better at instrument detection. However, it failed entirely to detect some instruments, namely China-WuhanTamTam and Brazil-Afuche Cabasa. Even for the instruments the model could detect, the precision and recall scores were underwhelming, with most values for both precision and recall falling between 0.3 and 0.4. Some instruments, such as Bolivia-Roncoro Chords, had near-zero scores, with just 0.048 for precision and 0.008 for recall.

Table 6. C-RNN (MFCC-trained) performance.

Instrument | Precision | Recall | F1-score | Support
Armenia-Duduk | 0.224 | 0.115 | 0.152 | 131
Bali-Gamelan Ensemble | 0.177 | 0.122 | 0.144 | 115
Bolivia-Charango | 0.478 | 0.37 | 0.417 | 119
Bolivia-Moseno | 0.232 | 0.177 | 0.201 | 130
Bolivia-Roncoro Chords | 0.048 | 0.008 | 0.014 | 122
Bolivia-Roncoro Solo | 0.137 | 0.053 | 0.077 | 132
Brazil-Afuche Cabasa | 0 | 0 | 0 | 106
Brazil-Agogo | 0.067 | 0.087 | 0.075 | 115
Brazil-Bass Guitar Bossa | 0.412 | 0.259 | 0.318 | 108
Brazil-Berimbau | 0.333 | 0.017 | 0.032 | 119
Brazil-Bongos | 0.151 | 0.135 | 0.142 | 104
Brazil-BongosCowbell | 0.084 | 0.081 | 0.083 | 86
Brazil-Cabasa | 0.495 | 0.485 | 0.49 | 101
Brazil-Claves | 0.024 | 0.027 | 0.026 | 111
Brazil-Cuica | 0.102 | 0.138 | 0.117 | 94
Brazil-EggShaker | 0.053 | 0.058 | 0.056 | 103
Brazil-Guitar | 0.602 | 0.628 | 0.615 | 94
Brazil-Pandeiro | 0.562 | 0.406 | 0.471 | 101
Brazil-PercussionSet | 0.056 | 0.077 | 0.065 | 91
Brazil-RainStick | 0.427 | 0.515 | 0.467 | 130
Brazil-Surdo | 0.404 | 0.449 | 0.425 | 127
BurkinaFaso-BaraDrum | 0.032 | 0.022 | 0.026 | 89
BurkinaFaso-BassLine | 0.359 | 0.443 | 0.397 | 115
Cameroon-Congas | 0.114 | 0.042 | 0.061 | 96
Cameroon-Djembe | 0.053 | 0.011 | 0.018 | 93
Cameroon-DrumsetBikutsi | 0.952 | 0.961 | 0.957 | 103
Cameroon-PercussionSetBikutsi | 0.125 | 0.009 | 0.018 | 106
Cameroon-ShakerBikutsi | 0.911 | 0.968 | 0.939 | 95
China-Bawu | 0.235 | 0.088 | 0.128 | 137
China-BeijingOperaGongs | 0.141 | 0.083 | 0.104 | 121
China-BianzhongBells | 0.045 | 0.048 | 0.046 | 105
China-BigErhuPlectrum | 0.05 | 0.01 | 0.016 | 103
China-CeylonGuitar | 0.182 | 0.034 | 0.058 | 116
China-ChauGongs | 0.307 | 0.292 | 0.3 | 106
China-Dizi | 0.364 | 0.12 | 0.181 | 133
China-Dongxiao | 0.102 | 0.043 | 0.06 | 117
China-Erhu | 0.467 | 0.304 | 0.368 | 115
China-FengGong | 0.039 | 0.03 | 0.034 | 100
China-Gaohu | 0.348 | 0.307 | 0.326 | 101
China-GongsTunedMetalMallet | 0.588 | 0.265 | 0.366 | 113
China-GongsTunedSoftMallet | 0.907 | 0.852 | 0.879 | 115
China-GongsTunedWoodmallet | 0.916 | 0.906 | 0.911 | 96
China-Hulusi | 0.452 | 0.139 | 0.212 | 137
China-JinghuOperaViolin | 0.033 | 0.008 | 0.013 | 121
China-KouXian | 0.096 | 0.056 | 0.071 | 125
China-Pipa | 0.081 | 0.025 | 0.038 | 119
China-ShanghaiBabyPiano | 0.111 | 0.038 | 0.057 | 130
China-Sheng | 0.053 | 0.008 | 0.013 | 133
China-SmallErhuPlectrum | 0.965 | 0.982 | 0.974 | 113
China-WuhanTamTam | 0 | 0 | 0 | 105
China-Xiao | 0.103 | 0.053 | 0.07 | 132
China-YangQin | 0.235 | 0.133 | 0.17 | 143
Congo-Bongos | 0.259 | 0.315 | 0.284 | 92
Congo-Sanzas | 0.221 | 0.236 | 0.228 | 123
Cuba-Guiro | 0.036 | 0.033 | 0.034 | 121
Cuba-Triangle | 0.064 | 0.088 | 0.074 | 114
Egypt-Fiddle | 0.548 | 0.152 | 0.238 | 112
Germany-CrumhornAlto | 0.295 | 0.177 | 0.221 | 130
Germany-CrumhornBass | 0.302 | 0.139 | 0.19 | 115
Germany-CrumhornConsortium | 0.15 | 0.053 | 0.078 | 113
Germany-CrumhornSoprano | 0.333 | 0.068 | 0.113 | 117
Germany-CrumhornTenor | 0.333 | 0.144 | 0.201 | 132
Germany-Gemshorn | 0.034 | 0.016 | 0.022 | 124
micro avg | 0.294 | 0.198 | 0.237 | 7165
macro avg | 0.27 | 0.205 | 0.22 | 7165
weighted avg | 0.268 | 0.198 | 0.215 | 7165
samples avg | 0.302 | 0.209 | 0.226 | 7165

Unlike the C-RNN trained with MFCCs, the C-RNN trained with Mel spectrograms performed significantly better, achieving a micro F1 score of 0.39 and a macro F1 score of 0.361 (see Table 7). Unlike the machine learning methods, XGBoost and random forest, its precision and recall are far more balanced, without the extreme gap between near-perfect precision and near-zero recall. Despite this advantage, the model still underperformed relative to Reghunath and Rajan (2021), who achieved a micro F1 score of up to 0.65 and a macro F1 score of 0.56 for their C-LSTM (C-RNN) model trained on Mel spectrograms. This model encountered a similar issue to its MFCC-trained counterpart: it was also unable to detect one instrument, namely the China-CeylonGuitar.

Table 7. C-RNN (Mel spectrogram-trained) performance.

Instrument | Precision | Recall | F1-score | Support
Armenia-Duduk | 0.355 | 0.168 | 0.228 | 131
Bali-Gamelan Ensemble | 0.226 | 0.226 | 0.226 | 115
Bolivia-Charango | 0.638 | 0.37 | 0.468 | 119
Bolivia-Moseno | 0.346 | 0.354 | 0.35 | 130
Bolivia-Roncoro Chords | 0.709 | 0.459 | 0.557 | 122
Bolivia-Roncoro Solo | 0.089 | 0.038 | 0.053 | 132
Brazil-Afuche Cabasa | 0.133 | 0.094 | 0.11 | 106
Brazil-Agogo | 0.107 | 0.078 | 0.09 | 115
Brazil-Bass Guitar Bossa | 0.692 | 0.667 | 0.679 | 108
Brazil-Berimbau | 0.529 | 0.076 | 0.132 | 119
Brazil-Bongos | 0.264 | 0.269 | 0.267 | 104
Brazil-BongosCowbell | 0.063 | 0.07 | 0.066 | 86
Brazil-Cabasa | 0.43 | 0.366 | 0.396 | 101
Brazil-Claves | 0.073 | 0.036 | 0.048 | 111
Brazil-Cuica | 0.581 | 0.457 | 0.512 | 94
Brazil-EggShaker | 0.289 | 0.233 | 0.258 | 103
Brazil-Guitar | 0.715 | 0.989 | 0.83 | 94
Brazil-Pandeiro | 0.556 | 0.347 | 0.427 | 101
Brazil-PercussionSet | 0.098 | 0.132 | 0.113 | 91
Brazil-RainStick | 0.14 | 0.138 | 0.139 | 130
Brazil-Surdo | 0.846 | 0.906 | 0.875 | 127
BurkinaFaso-BaraDrum | 0.313 | 0.461 | 0.373 | 89
BurkinaFaso-BassLine | 0.536 | 0.774 | 0.633 | 115
Cameroon-Congas | 0.208 | 0.052 | 0.083 | 96
Cameroon-Djembe | 0.167 | 0.022 | 0.038 | 93
Cameroon-DrumsetBikutsi | 0.963 | 1 | 0.981 | 103
Cameroon-PercussionSetBikutsi | 0.105 | 0.019 | 0.032 | 106
Cameroon-ShakerBikutsi | 0.938 | 0.958 | 0.948 | 95
China-Bawu | 0.337 | 0.204 | 0.255 | 137
China-BeijingOperaGongs | 0.305 | 0.24 | 0.269 | 121
China-BianzhongBells | 0.215 | 0.162 | 0.185 | 105
China-BigErhuPlectrum | 0.892 | 0.32 | 0.471 | 103
China-CeylonGuitar | 0 | 0 | 0 | 116
China-ChauGongs | 0.314 | 0.151 | 0.204 | 106
China-Dizi | 0.419 | 0.135 | 0.205 | 133
China-Dongxiao | 0.452 | 0.239 | 0.313 | 117
China-Erhu | 0.815 | 0.843 | 0.829 | 115
China-FengGong | 0.593 | 0.54 | 0.565 | 100
China-Gaohu | 0.452 | 0.416 | 0.433 | 101
China-GongsTunedMetalMallet | 0.4 | 0.018 | 0.034 | 113
China-GongsTunedSoftMallet | 0.867 | 0.965 | 0.914 | 115
China-GongsTunedWoodmallet | 0.931 | 0.99 | 0.96 | 96
China-Hulusi | 0.294 | 0.036 | 0.065 | 137
China-JinghuOperaViolin | 0.55 | 0.091 | 0.156 | 121
China-KouXian | 0.469 | 0.184 | 0.264 | 125
China-Pipa | 0.346 | 0.151 | 0.211 | 119
China-ShanghaiBabyPiano | 0.382 | 0.1 | 0.159 | 130
China-Sheng | 0.4 | 0.135 | 0.202 | 133
China-SmallErhuPlectrum | 0.926 | 0.991 | 0.957 | 113
China-WuhanTamTam | 0.53 | 0.419 | 0.468 | 105
China-Xiao | 0.088 | 0.023 | 0.036 | 132
China-YangQin | 0.776 | 0.678 | 0.724 | 143
Congo-Bongos | 0.473 | 0.946 | 0.63 | 92
Congo-Sanzas | 0.411 | 0.415 | 0.413 | 123
Cuba-Guiro | 0.124 | 0.099 | 0.11 | 121
Cuba-Triangle | 0.07 | 0.088 | 0.078 | 114
Egypt-Fiddle | 0.797 | 0.839 | 0.817 | 112
Germany-CrumhornAlto | 0.507 | 0.269 | 0.352 | 130
Germany-CrumhornBass | 0.717 | 0.287 | 0.41 | 115
Germany-CrumhornConsortium | 0.455 | 0.221 | 0.298 | 113
Germany-CrumhornSoprano | 0.635 | 0.342 | 0.444 | 117
Germany-CrumhornTenor | 0.604 | 0.22 | 0.322 | 132
Germany-Gemshorn | 0.073 | 0.048 | 0.058 | 124
micro avg | 0.471 | 0.333 | 0.39 | 7165
macro avg | 0.44 | 0.342 | 0.361 | 7165
weighted avg | 0.438 | 0.333 | 0.354 | 7165
samples avg | 0.49 | 0.348 | 0.38 | 7165

The state-of-the-art model by Han et al. (2017) was the best performing of all the models we evaluated, achieving a micro F1 score of 0.55 and a macro F1 score of 0.504. One of its most notable achievements was its ability to detect instruments like the Germany-CrumhornAlto, Germany-CrumhornBass, Germany-CrumhornConsortium, Germany-CrumhornSoprano, and Germany-CrumhornTenor with high precision and recall where other models struggled (see Table 8). The model failed to detect some instruments, such as the China-Hulusi and China-JinghuOperaViolin. However, its overall performance was close to results reported by Han et al. (2017), whose model trained on the IRMAS dataset achieved a micro F1 score of 0.602 and a macro F1 score of 0.503. Considering that their model was built to perform well for instrument recognition in mainstream polyphonic music, our 10% lower performance on the micro F1 score and similar macro F1 score for instrument recognition of cultural instruments was a strong result.

Table 8. Han’s CNN performance.

Instrument | Precision | Recall | F1-score | Support
Armenia-Duduk | 0.667 | 0.321 | 0.433 | 131
Bali-Gamelan Ensemble | 0.376 | 0.635 | 0.472 | 115
Bolivia-Charango | 0.457 | 0.622 | 0.527 | 119
Bolivia-Moseno | 0.316 | 0.746 | 0.444 | 130
Bolivia-Roncoro Chords | 0.5 | 0.074 | 0.129 | 122
Bolivia-Roncoro Solo | 0.496 | 0.841 | 0.624 | 132
Brazil-Afuche Cabasa | 0.119 | 0.151 | 0.133 | 106
Brazil-Agogo | 0.1 | 0.026 | 0.041 | 115
Brazil-Bass Guitar Bossa | 0.963 | 0.713 | 0.819 | 108
Brazil-Berimbau | 0.986 | 0.597 | 0.743 | 119
Brazil-Bongos | 0.206 | 0.067 | 0.101 | 104
Brazil-BongosCowbell | 0.169 | 0.151 | 0.16 | 86
Brazil-Cabasa | 0.72 | 0.842 | 0.776 | 101
Brazil-Claves | 0.893 | 0.982 | 0.936 | 111
Brazil-Cuica | 0.44 | 0.543 | 0.486 | 94
Brazil-EggShaker | 0.789 | 0.437 | 0.563 | 103
Brazil-Guitar | 0.895 | 0.904 | 0.899 | 94
Brazil-Pandeiro | 0.736 | 0.772 | 0.754 | 101
Brazil-PercussionSet | 0.172 | 0.121 | 0.142 | 91
Brazil-RainStick | 0.391 | 0.069 | 0.118 | 130
Brazil-Surdo | 0.622 | 0.984 | 0.762 | 127
BurkinaFaso-BaraDrum | 0.728 | 0.753 | 0.74 | 89
BurkinaFaso-BassLine | 0.758 | 0.791 | 0.774 | 115
Cameroon-Congas | 0.375 | 0.031 | 0.058 | 96
Cameroon-Djembe | 0.038 | 0.011 | 0.017 | 93
Cameroon-DrumsetBikutsi | 0.864 | 0.99 | 0.923 | 103
Cameroon-PercussionSetBikutsi | 0.6 | 0.028 | 0.054 | 106
Cameroon-ShakerBikutsi | 1 | 0.958 | 0.978 | 95
China-Bawu | 0.522 | 0.606 | 0.561 | 137
China-BeijingOperaGongs | 0.162 | 0.157 | 0.16 | 121
China-BianzhongBells | 0.258 | 0.371 | 0.305 | 105
China-BigErhuPlectrum | 0.978 | 0.437 | 0.604 | 103
China-CeylonGuitar | 0.385 | 0.362 | 0.373 | 116
China-ChauGongs | 0.02 | 0.009 | 0.013 | 106
China-Dizi | 0.796 | 0.677 | 0.732 | 133
China-Dongxiao | 0.524 | 0.949 | 0.675 | 117
China-Erhu | 0.703 | 0.904 | 0.791 | 115
China-FengGong | 0.326 | 0.43 | 0.371 | 100
China-Gaohu | 0.7 | 0.832 | 0.76 | 101
China-GongsTunedMetalMallet | 0.35 | 0.062 | 0.105 | 113
China-GongsTunedSoftMallet | 0.972 | 0.904 | 0.937 | 115
China-GongsTunedWoodmallet | 0.856 | 0.865 | 0.86 | 96
China-Hulusi | 1 | 0.022 | 0.043 | 137
China-JinghuOperaViolin | 0.824 | 0.116 | 0.203 | 121
China-KouXian | 0.76 | 0.304 | 0.434 | 125
China-Pipa | 0.373 | 0.21 | 0.269 | 119
China-ShanghaiBabyPiano | 0.614 | 0.662 | 0.637 | 130
China-Sheng | 1 | 0.444 | 0.615 | 133
China-SmallErhuPlectrum | 0.982 | 0.973 | 0.978 | 113
China-WuhanTamTam | 0.138 | 0.171 | 0.153 | 105
China-Xiao | 0.033 | 0.008 | 0.012 | 132
China-YangQin | 0.916 | 0.762 | 0.832 | 143
Congo-Bongos | 0.727 | 0.783 | 0.754 | 92
Congo-Sanzas | 0.389 | 0.415 | 0.402 | 123
Cuba-Guiro | 0.2 | 0.091 | 0.125 | 121
Cuba-Triangle | 0.4 | 0.07 | 0.119 | 114
Egypt-Fiddle | 0.62 | 0.902 | 0.735 | 112
Germany-CrumhornAlto | 0.842 | 0.985 | 0.908 | 130
Germany-CrumhornBass | 0.911 | 0.887 | 0.899 | 115
Germany-CrumhornConsortium | 0.84 | 0.929 | 0.882 | 113
Germany-CrumhornSoprano | 0.869 | 0.966 | 0.915 | 117
Germany-CrumhornTenor | 0.894 | 0.833 | 0.863 | 132
Germany-Gemshorn | 0.4 | 0.081 | 0.134 | 124
micro avg | 0.593 | 0.513 | 0.55 | 7165
macro avg | 0.582 | 0.513 | 0.504 | 7165
weighted avg | 0.587 | 0.513 | 0.504 | 7165
samples avg | 0.63 | 0.531 | 0.558 | 7165

6. Conclusion

Despite technological advances such as applications like Shazam that can identify a song within seconds, such tools are geared towards mainstream music, and cultural instruments remain virtually unrepresented. This paper builds on the work of Kailewang and Moieni (2022) by extending multi-instrument detection to cultural instruments and applying modern deep learning techniques to the dataset provided by Cultural Infusion, thereby contributing to cultural awareness and to the protection and promotion of cultural diversity, an obligation under the UNESCO 2005 Convention on the Protection and Promotion of the Diversity of Cultural Expressions, an international treaty ratified by 151 signatory states.

We extracted features including Mel spectrograms, MFCCs, spectral bandwidth, spectral centroid, spectral rolloff, zero-crossing rate, root-mean-square energy, and VGGish embeddings for model training. Random forest served as our machine-learning baseline and performed poorly on our dataset. The XGBoost model, which had performed exceptionally well for single-instrument detection in a previous paper by Liu et al. (2022), also achieved subpar results. The bi-LSTM model, our deep-learning baseline, was unable to detect any instruments in the test data. The two CRNN models (trained on MFCCs and Mel spectrograms, respectively) performed better, but neither reached an F1 score of 0.5 in the micro or macro average. The best-performing model was the state-of-the-art CNN by Han et al. (2017), which we adapted to detect our cultural instruments. Despite the challenges in the dataset, this model matched the macro F1 score of 0.50 that Han et al. (2017) reported on the conventional IRMAS dataset and reached a micro average F1 score of 0.55, only about 10% lower than their 0.602 (see Table 9 and Table 10).
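
As an illustration of the feature-extraction step described above, the snippet below computes the hand-crafted features with librosa. The file path, sample rate, and analysis parameters are assumptions made for the example, and the VGGish embeddings are omitted because they require a separate pretrained model.

```python
# Sketch of the hand-crafted feature extraction described above, using librosa.
# File path, sample rate, and parameters are illustrative assumptions only.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=22050, duration=5.0)   # hypothetical 5 s clip

features = {
    "mel_spectrogram":    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128),
    "mfcc":               librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),
    "spectral_bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr),
    "spectral_centroid":  librosa.feature.spectral_centroid(y=y, sr=sr),
    "spectral_rolloff":   librosa.feature.spectral_rolloff(y=y, sr=sr),
    "zero_crossing_rate": librosa.feature.zero_crossing_rate(y),
    "rms":                librosa.feature.rms(y=y),
}

# Log-scale the Mel spectrogram before feeding it to a CNN, as is common practice.
log_mel = librosa.power_to_db(features["mel_spectrogram"], ref=np.max)
print({name: value.shape for name, value in features.items()})
```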

Table 9. Macro average result.

Model | Precision | Recall | F1-score
Random Forest | 0.679 | 0.13 | 0.19
XGBoost | 0.751 | 0.287 | 0.363
Han | 0.582 | 0.513 | 0.504
CRNN mfcc | 0.27 | 0.205 | 0.22
CRNN mel spec | 0.44 | 0.342 | 0.361

Table 10. Micro average result.

Model | Precision | Recall | F1-score
Random Forest | 0.946 | 0.128 | 0.225
XGBoost | 0.829 | 0.286 | 0.426
Han | 0.593 | 0.513 | 0.55
CRNN mfcc | 0.294 | 0.198 | 0.237
CRNN mel spec | 0.471 | 0.333 | 0.39
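
All of the deep models compared in Table 9 and Table 10 treat the task as multi-label classification: a clip may contain several instruments, so the output layer uses one sigmoid unit per instrument with a binary cross-entropy loss rather than a softmax over classes. The PyTorch sketch below illustrates only this output convention on a deliberately small network; the layer sizes and the log-Mel input shape are our own assumptions, and this is not the Han et al. (2017) architecture.

```python
# Generic multi-label CNN sketch (sigmoid + binary cross-entropy), illustrating
# the output convention used for multi-instrument detection. Layer sizes and the
# 128x431 log-Mel input shape are illustrative assumptions only.
import torch
import torch.nn as nn

N_CLASSES = 63

class TinyInstrumentCNN(nn.Module):
    def __init__(self, n_classes=N_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> (batch, 32, 1, 1)
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                       # x: (batch, 1, n_mels, n_frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)               # raw logits, one per instrument

model = TinyInstrumentCNN()
criterion = nn.BCEWithLogitsLoss()              # one sigmoid/BCE term per class
logits = model(torch.randn(8, 1, 128, 431))     # dummy batch of log-Mel patches
loss = criterion(logits, torch.randint(0, 2, (8, N_CLASSES)).float())
probs = torch.sigmoid(logits)                   # thresholded (e.g. at 0.5) at test time
```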

7. Areas of Improvement

The major challenge in this study was dataset quality: the training and testing audio was compiled by randomly combining instrument recordings into single audio files, so the resulting compositions are not entirely realistic compared with public datasets used for mainstream or single-instrument recognition, such as IRMAS, OpenMIC-2018, and MedleyDB. In addition, the dataset lacked sufficient variety for training, particularly in the number of samples available per instrument. Cultural-instrument data remain scarce, and no publicly available dataset currently offers a broad and diverse set of cultural instruments.

Although we compiled 8000 training samples and 2000 testing samples, the limited variety per instrument posed challenges. Where an instrument had only two recordings, one was used for training and the other for testing, so the model often could not generalise to the unseen recording. Another limitation, also noted by Kailewang and Moieni (2022), is that every instrument is present throughout each audio sample, which does not reflect real music, in which instruments enter and exit at different times.
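
To make the generation procedure concrete, the sketch below mixes a few randomly chosen single-instrument stems into one clip and records which instruments it contains as weak labels. The file names, sample rate, clip length, and peak normalisation are assumptions made for illustration, not the exact script used to build the dataset.

```python
# Rough sketch of randomly mixing single-instrument stems into one polyphonic
# clip. Paths, sample rate, and normalisation are illustrative assumptions.
import random
import numpy as np
import soundfile as sf
import librosa

def mix_random_clip(stem_paths, n_instruments=3, sr=22050, duration=5.0):
    chosen = random.sample(stem_paths, n_instruments)
    length = int(sr * duration)
    mix = np.zeros(length, dtype=np.float32)
    for path in chosen:
        y, _ = librosa.load(path, sr=sr, duration=duration, mono=True)
        y = np.pad(y, (0, max(0, length - len(y))))[:length]   # pad/trim to 5 s
        mix += y
    mix /= max(1e-9, np.abs(mix).max())          # peak-normalise to avoid clipping
    return mix, chosen                            # audio plus its weak labels

# Example usage (hypothetical file names):
# mix, labels = mix_random_clip(["duduk_01.wav", "erhu_02.wav", "claves_01.wav", "dizi_01.wav"])
# sf.write("mixed_clip.wav", mix, 22050)
```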

In the future, the issue of limited instrument variety could be resolved by utilising Generative Adversarial Networks (GANs) such as WaveGAN to generate additional audio samples for each instrument. This experiment was conducted on a subset of the Sound Infusion cultural instrument dataset, which contains approximately 100 unique instruments with various audio samples for each. Our paper explored approximately 63 of these instruments, which leaves room for improvement and scalability.

To strengthen model performance and reduce the risk of overfitting, the data collection process could also be refined by filtering for instruments with at least five available audio samples prior to training.
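
One simple way to apply such a filter, assuming a (hypothetical) layout with one folder of WAV recordings per instrument, is sketched below.

```python
# Sketch of the proposed filtering step: keep only instruments with at least
# five recordings. The per-instrument folder layout is an assumption for
# illustration, not the project's actual directory structure.
from pathlib import Path

MIN_SAMPLES = 5
dataset_root = Path("cultural_instruments")      # hypothetical root directory

eligible = {
    folder.name: list(folder.glob("*.wav"))
    for folder in dataset_root.iterdir()
    if folder.is_dir() and len(list(folder.glob("*.wav"))) >= MIN_SAMPLES
}
print(f"{len(eligible)} instruments have at least {MIN_SAMPLES} recordings")
```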

Acknowledgments

We thank Sound Infusion for supporting this project through their music studio platform, which contains the instrument samples used in this work. We extend our thanks to the CEO of Cultural Infusion, Peter Mousaferiadis, for his support of the project. We would also like to thank Om Kadem, Anjaly Sajeevkumar, Mary Legrand, Aida Hakemi, Mohsen Sadegh Zadeh, and Nicole Lee for their additional support in deploying this algorithm for use on http://www.soundinfusion.io/.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Chen, R., Akbar, G., & Ajit, N. (2024). Musical Instrument Recognition in Polyphonic Audio Through Convolutional Neural Networks and Spectrograms. DigitalNZ.
https://digitalnz.org/records/56071138/musical-instrument-recognition-in-polyphonic-audio-through-convolutional-ne
[2] Destatis (2025). 32% of the World’s Population Do Not Use the Internet.
https://www.destatis.de/EN/Themes/Countries-Regions/International-Statistics/Data-Topic/Science-Research-Digital/InternetUse.html
[3] Dewi, C., Chen, A. P. S., & Christanto, H. J. (2023). Recognizing Similar Musical Instruments with YOLO Models. Big Data and Cognitive Computing, 7, Article 94.
https://doi.org/10.3390/bdcc7020094
[4] Han, Y., Kim, J., & Lee, K. (2017). Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25, 208-221.
https://doi.org/10.1109/taslp.2016.2632307
[5] Hing, D., & Settle, C. (2021). Detecting and Classifying Musical Instruments with Convolutional Neural Networks. Stanford University.
http://cs230.stanford.edu/projects_winter_2021/reports/70770755.pdf
[6] Humphrey, E. J., Durand, S., & McFee, B. (2018). OpenMIC-2018: An Open Dataset for Multiple Instrument Recognition. International Society for Music Information Retrieval Conference (pp. 438-444). ISMIR.
https://doi.org/10.5281/zenodo.1492445
[7] Kailewang & Moieni, R. (2022). Multi-Instrument Detection in Culture Music Using Machine Learning Models. International Journal of Management and Applied Science (IJMAS), 8, 48-57.
[8] Kaminskas, M., & Ricci, F. (2012). Contextual Music Information Retrieval and Recommendation: State of the Art and Challenges. Computer Science Review, 6, 89-119.
https://doi.org/10.1016/j.cosrev.2012.04.002
[9] Kundu, R. (2023). YOLO Algorithm for Object Detection Explained [+Examples].
https://www.v7labs.com/blog/yolo-object-detection
[10] Lei, L. (2022). Multiple Musical Instrument Signal Recognition Based on Convolutional Neural Network. Scientific Programming, 2022, Article ID: 5117546.
https://doi.org/10.1155/2022/5117546
[11] Li, P., Qian, J., & Wang, T. (2015). Automatic Instrument Recognition in Polyphonic Music Using Convolutional Neural Networks. arXiv.
https://arxiv.org/abs/1511.05520v1
[12] Liu, Y., Yin, Y., Zhu, Q., & Cui, W. (2022). Musical Instrument Recognition by XGBoost Combining Feature Fusion. arXiv, 2206.00901.
https://arxiv.org/abs/2206.00901v1
[13] Mukhedkar, D. (2020). Polyphonic Music Instrument Detection on Weakly Labelled Data Using Sequence Learning Models.
https://www.diva-portal.org/smash/get/diva2:1458608/FULLTEXT02
[14] Reghunath, L. C., & Rajan, R. (2021). Predominant Instrument Recognition in Polyphonic Music Using Convolutional Recurrent Neural Networks. In M. Aramaki, K. Hirata, T. Kitahara, R. Kronland-Martinet, & S. Ystad (Eds.), Music in AI Era (pp. 214-227). Springer.
[15] Vaiedelich, S., & Fritz, C. (2017). Perception of Old Musical Instruments. Journal of Cultural Heritage, 27, S2-S7.
https://doi.org/10.1016/j.culher.2017.02.014
[16] Wang, A. (2003). An Industrial Strength Audio Search Algorithm. ResearchGate.
https://www.researchgate.net/publication/220723446_An_Industrial_Strength_Audio_Search_Algorithm
[17] Xu, B., Wang, N., Chen, T., & Li, M. (2015). Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv, 1505.00853.
https://arxiv.org/abs/1505.00853v2
[18] Zhong, L., Cooper, E., Yamagishi, J., & Minematsu, N. (2023). Exploring Isolated Musical Notes as Pre-Training Data for Predominant Instrument Recognition in Polyphonic Music. 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 2312-2319). IEEE.
https://doi.org/10.1109/apsipaasc58517.2023.10317292

Copyright © 2025 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.