Can You Answer Calls on a Smartwatch? (The ULTIMATE Guide)
Are you curious about the capabilities of a smartwatch? Are you wondering if you can answer calls on a smartwatch? If you are, you have come to the right place! In this ULTIMATE guide, we will discuss all the features of a smartwatch, how you can answer calls on one, what else you can do with a smartwatch, the benefits and drawbacks of answering calls on one, and the best smartwatches for answering calls.
So, let's get started!
Short Answer
Yes, you can answer calls on a smartwatch.
Most smartwatches are designed to be compatible with your phone, meaning that you can receive calls on your watch and answer them using the device.
Many smartwatches also include Bluetooth capabilities, allowing you to make and receive calls without having to take out your phone.
Some smartwatches also include a built-in microphone and speaker, allowing you to talk directly into the watch.
What is a Smartwatch?
A smartwatch is a wearable device that combines the functionality of a smartphone with that of a traditional watch.
Smartwatches are essentially miniature computers that are connected to your phone via Bluetooth and can monitor your daily activity, track your fitness goals, and even make and receive calls.
Smartwatches allow you to stay connected and answer calls quickly and easily without having to pull out your phone.
They are becoming increasingly popular and more functional, with more and more features being added all the time.
Smartwatches are able to monitor heart rate, track steps taken, and even receive notifications from your phone.
In addition, they can also answer calls, access your contact list, and make calls directly from the watch.
The microphone and speaker on the watch make it easy to have a conversation directly from your wrist.
Smartwatches are perfect for people who want to stay connected and always be available to answer calls, but don't want to be weighed down by carrying a bulky smartphone.
What Features Do Smartwatches Offer?
Smartwatches are becoming an increasingly popular and fashionable accessory that offers a range of features.
In addition to being able to answer calls, the features of a smartwatch include the ability to track fitness activity, store music and media, take pictures and videos, and access notifications.
Not only do smartwatches have the ability to answer calls, they also come with a microphone and a speaker, so you can have a full conversation with the person on the other end.
Smartwatches also come with a contact list so you can easily identify who is calling and decide whether or not you want to answer.
You can also make calls from your watch, so you don't have to reach for your phone or fumble for it in your pocket.
In addition to the ability to take and make calls, smartwatches are also designed to be a personal assistant.
They offer a range of features that can help you stay organized and on top of your tasks.
You can set reminders, calendar events, alarms, and other notifications to help you stay on track and make sure you don't miss any important events.
Smartwatches also come with health and fitness tracking features.
You can track steps, calories burned, heart rate, sleep patterns, and more.
These features can help you stay active and healthy and make sure you're getting the right amount of exercise.
Overall, smartwatches offer a range of features that make them an incredibly useful tool.
With the ability to answer calls, take pictures and videos, access notifications, and track fitness activity, smartwatches are a must-have accessory.
How Can You Answer Calls On a Smartwatch?
Answering calls on a smartwatch is easier than ever before.
With the latest advancements in technology, it is now possible to answer calls directly from your wrist.
Smartwatches are equipped with a microphone and speaker, allowing you to have a conversation directly from the device.
You can also access your contact list and make calls from the watch.
The process of answering calls on a smartwatch can vary from device to device, but generally, you will receive an alert when someone is trying to call you.
From there, you can either accept or reject the call.
If you accept the call, it will be routed to the speaker and microphone on the device.
You will then be able to hold a conversation with the caller directly from your smartwatch.
For those who don't want to answer calls directly from their device, some smartwatches also offer the ability to send the call to your paired smartphone.
This allows you to answer the call from your smartphone instead.
Some smartwatches also offer the ability to reject a call and send a pre-set message to the caller.
Answering calls on a smartwatch is a convenient way to stay connected while on the go.
You can quickly and easily answer calls directly from your wrist, and you don't have to worry about missing important calls.
Additionally, smartwatches also offer a variety of other features such as fitness tracking, notifications, and more.
What Else Can You Do With a Smartwatch?
Smartwatches are not just for taking calls; they offer a variety of features that can make your life easier.
You can use a smartwatch to access your music, listen to podcasts, and control your smart home devices.
You can also get notifications and alerts right on your wrist, so you don't have to constantly check your phone.
You can even use a smartwatch to track your workouts and monitor your health, such as your heart rate, steps taken, and calories burned.
Smartwatches also come with a variety of apps that can help you with daily tasks, such as managing your calendar, finding directions, and even ordering food.
With a smartwatch, you can stay connected and productive without having to constantly be reaching for your phone.
What Are the Benefits of Answering Calls on a Smartwatch?
Answering calls on a smartwatch offers several unique advantages for users.
First, it allows for hands-free communication, meaning that users can multitask without having to constantly pick up and put down their phones.
This is especially useful when driving, cooking, or performing any other task that requires the use of both hands.
Additionally, answering calls on a smartwatch is much more discreet than using a phone.
This is especially true for those who find themselves in situations where answering a call on a phone would be inappropriate or rude.
Moreover, answering calls on a smartwatch is much faster than using a phone.
This is because the user does not have to take the time to unlock their phone and open the phone app.
Instead, they can simply press a button on their watch to answer the call.
This makes it easier to respond to calls in a timely manner and to avoid missing important calls.
Finally, answering calls on a smartwatch can add a layer of security.
The watch is paired to your phone over an encrypted Bluetooth connection, and because it stays on your wrist, it is harder to snatch or misplace than a phone, making it more difficult for an attacker to get at your data.
In short, answering calls on a smartwatch offers several distinct advantages, including hands-free communication, increased discretion, faster response time, and added security.
What Are the Drawbacks of Answering Calls on a Smartwatch?
Answering calls on a smartwatch can be a great convenience, but there are also some drawbacks.
For starters, not all smartwatches are equipped with the same features and capabilities.
Some models may not have a microphone and speaker, or may not be able to access your contact list.
Additionally, many smartwatches don't have the same battery life as a smartphone, so you may find yourself needing to recharge your watch more often if you're frequently answering calls.
Furthermore, it can be difficult to hear the caller on a smartwatch when you're in a noisy environment.
This is because the speakers on a smartwatch are usually not as powerful as the speakers on a smartphone.
And, depending on the model, you may also have to hold your arm up close to your ear in order to hear the caller clearly.
This can be uncomfortable and awkward.
Finally, answering calls on a smartwatch can be a bit of a security risk.
Because smartwatch calls usually play through a small speaker rather than a private earpiece, it's easier for people nearby to overhear your conversations.
It's also easier for someone who gets hold of the watch to access your contacts and other personal information stored on it.
So if you're concerned about your privacy and security, it's best to stick with your smartphone for answering calls.
What Are Some of the Best Smartwatches for Answering Calls?
When it comes to answering calls on a smartwatch, there are many options to choose from.
It is important to consider the features of each watch to determine which one is best for you.
Many smartwatch manufacturers offer devices that are designed to make answering calls easy and convenient.
The Apple Watch Series 6 is a great option for those who want to be able to answer calls from their wrist.
It offers a large selection of features, including the ability to make and receive calls directly from the watch, a built-in microphone and speaker for holding a conversation, and a customizable contact list so you can quickly reach the people you call most.
The Samsung Galaxy Watch 3 offers a very similar calling experience: a built-in microphone and speaker, the ability to make and receive calls from your wrist, and easy access to your contacts.
The Fossil Gen 5 and the Fitbit Versa 3 round out the list.
Both include a built-in microphone and speaker, support making and receiving calls directly from the wrist, and let you access your contact list to place calls quickly.
Whichever model you choose, weigh the features of each watch to determine which one is best for you.
With the right smartwatch, you can stay connected and answer calls quickly and easily.
Final Thoughts
Answering calls on a smartwatch is a convenient and efficient way to stay connected.
With a smartwatch, you can answer calls directly from your wrist and access your contact list.
You can also make calls and have a conversation directly from your watch.
With the many features offered by smartwatches, you can make the most of your device and stay connected on the go.
Now that you know all about answering calls on a smartwatch, why not give it a try and see how it can make your life easier?
Musical Instrument Identification Using Deep Learning Approach
Sensors (Basel). 2022 Apr; 22(8): 3033.
Maciej Blaszke 1 and Bożena Kostek 2,*
1 Multimedia Systems Department, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Narutowicza 11/12, 80-233 Gdańsk, Poland
2 Audio Acoustics Laboratory, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Narutowicza 11/12, 80-233 Gdańsk, Poland
Academic Editor: Marco Leo
Received 2022 Mar 22; Accepted 2022 Apr 13.
Abstract
The work aims to propose a novel approach for automatically identifying all instruments present in an audio excerpt using sets of individual convolutional neural networks (CNNs) per tested instrument. The paper starts with the background presentation, i.e., metadata description, and a review of related works focused on the tasks performed, input types, algorithms employed, and metrics used. This is followed by a description of the dataset prepared for the experiment and its division into subsets: training, validation, and evaluation. Then, the analyzed architecture of the neural network model is presented. Based on the described model, training is performed, and several quality metrics are determined for the training and validation sets. The results of the evaluation of the trained network on a separate set are shown. Detailed values for precision, recall, and the number of true and false positive and negative detections are presented. The model efficiency is high, with the metric values ranging from 0.86 for the guitar to 0.99 for drums. Finally, a discussion and a summary of the results obtained follow.
Keywords: deep learning, musical instrument identification, musical information retrieval
1. Introduction
The identification of complex audio, including music, has proven to be complicated. This is due to the high entropy of the information contained in audio signals, the wide range of sources, mixing processes, and the difficulty of analytical description; hence the variety of algorithms for the separation and identification of sounds from musical material. They mainly use spectral and cepstral analyses, enabling them to detect the fundamental frequency and its harmonics and assign the retrieved patterns to a particular instrument. However, this comes with some limitations: at the expense of increasing temporal resolution, frequency resolution decreases, and vice versa. In addition, it should be noted that these algorithms do not always allow the extraction of percussive tones and other non-harmonic effects, which may therefore constitute a source of interference, hindering the algorithm's operation and reducing the accuracy and reliability of the result.
Moreover, articulation such as glissando or tremolo causes frequency shifts in the spectrum; transients may generate additional components in the signal spectrum. Another important factor should be kept in mind: music in Western culture is based, to some extent, on consonances, which, although pleasing to the ear, are based on frequency ratios to fundamental tones. Thus, an obvious consequence is the overlap of harmonic tones in the spectrum, which creates a problem for most algorithms.
It should be remembered that recording musical instruments requires sensors. It is of enormous importance how a particular instrument is recorded. Indeed, the acoustic properties of musical instruments, researched theoretically for many epochs, as well as sound engineering practice, prescribe how to register an instrument in a given environment and conditions almost perfectly. These were the days of music recording in studios with acoustics designed for that purpose or registering music during a live concert with a lot of expertise on what microphones to use. On that basis, identifying a musical instrument sound within a recording is reasonably affordable both in terms of the human ear and automatic recognition. However, music instrument recording and its processing have changed over the last few decades. Nowadays, music is recorded everywhere and with whatever sensors are available, including smartphones. As a consequence, the automated identification task has become both much more demanding and more necessary. This is because identifying musical instruments is of importance in many areas no longer closely related to music, e.g., automatically creating sound for games, organizing music social services, separating music mixes into tracks, amateur recordings, etc. Moreover, instruments may become sensors, addressing an interesting concept: could the sound of a musical instrument be used to infer information about the instrument's physical properties [1]? This is based on the notion that any vibrating instrument body part may be used for measuring its physical properties. Building new interfaces for musical expression (NIME) is another paradigm related to new sonic creation and a new way of musical instrument sound expression and performance [2]. Last but not least, smart musical instruments, a class of IoT (Internet of Things) devices, should be mentioned in the context of music creation [3]. Turchet et al. devised a sound engine incorporating digital signal processing, sensor fusion, and embedded machine learning techniques to classify the position, dynamics, and timbre of each hit of a smart cajón [4].
Overall, both classical and sensor-based instruments need to be subject to sound identification and further applications, e.g., computational auditory scene analysis (CASA), human–computer interaction (HCI), music post-production, music information retrieval, automatic music mixing, music recommendation systems, etc. The identification of various instruments in the music mix, as well as the retrieval of melodic lines, belongs to the task of automatic music transcription (AMT) systems [5]. This also concerns blind source separation (BSS) [6,7]. Moreover, some other methods should be cited as they constitute the basis of BSS, e.g., independent component analysis (ICA) [8] or empirical mode decomposition (EMD) [9].
However, the problem in some of the analyzed cases is the classification: assigning the analyzed sample to a specific class, that is, in this case, the musical instrument. The work aims to propose an algorithm for automatic identification of all instruments present in an audio excerpt using sets of individual convolutional neural networks (CNN) per tested instrument. The motivation for this work was the need for a flexible model where any instrument could be added to the previously trained neural network. The novelty of the proposed solution lies in splitting the model into separate processing paths, one per instrument to be identified. Such a solution allows using models with various architecture complexity for different instruments, adding new submodels to the previously trained model, or replacing one instrument with another.
The paper starts with a review of tasks related to musical instrument identification. It focuses on the tasks performed, input type, algorithms employed, and metrics used. The main part of the study shows the dataset prepared for the experiment and its division into subsets: training, validation, and evaluation. The following section presents the analyzed architecture of the neural network model and its flexibility to expand. Based on the described model, training is performed, and several identification quality metrics are determined for training and validation sets. Then, the results of the evaluation of the trained network on a separate set are shown. Finally, a discussion and a summary of the results obtained follows.
2. Study Background
2.1. Metadata
The definition of metadata refers to data that provides information about other data. Metadata is also one of the basic sources of information about songs and audio samples. The ID3v2 informal standard [10] evolved from the ID3 tagging system, and it is a container of additional data embedded in the audio stream. Besides the typical parameters of the signal based on the MPEG-7 standard [11], information such as the performer, music genre, the instruments used, etc., usually appears in the metadata [12,13].
While in the case of newly created songs, individual sound examples, and music datasets, this information is already inserted in the audio file, older databases may not have such metadata tags. This is of particular importance when the task considered is to name all musical instruments present in a song by retrieving an individual stem from an audio file [14,15,16,17]. To this end, two approaches are still seen in this research area. The first consists of extracting a feature vector (FV) containing audio descriptors and using the baseline machine learning algorithms [12,15,16,17,18,19,20,21,22,23,24,25,26,27]. The second is based on the 2D audio representation and a deep learning model [28,29,30,31,32,33,34,35,36,37,38,39,40,41], or a more automated version when a variational or deep softmax autoencoder is used for the audio representation retrieval [32,42]. Therefore, by employing machine learning, it is possible to implement a classifier for particular genres or instrument recognition.
An example of a precisely specified feature vector in the audio domain is the MPEG-7 standard, described in ISO/IEC 15938 [11]. It contains descriptors divided into six main groups, illustrated with a short sketch after the list:
Basic: based on the value of the audio signal samples;
BasicSpectral: simple time–frequency signal analysis;
SpectralBasis: one-dimensional spectral projection of a signal prepared primarily to facilitate signal classification;
SignalParameters: information about the periodicity of the signal;
TimbralTemporal: time and musical timbre features;
TimbralSpectral: description of the linear-frequency relationships in the signal.
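As a rough, non-normative illustration of what such low-level descriptors capture, the sketch below computes a Basic-style power measure and a BasicSpectral-style spectral centroid per frame with plain NumPy; the frame length, hop size, and windowing are assumptions and do not follow the formal MPEG-7 definitions.

import numpy as np

def basic_and_spectral_descriptors(signal, sample_rate=44100, frame=2048, hop=512):
    # Per-frame audio power (Basic-like) and spectral centroid (BasicSpectral-like),
    # as a simplified illustration only, not the normative MPEG-7 descriptors.
    powers, centroids = [], []
    freqs = np.fft.rfftfreq(frame, d=1.0 / sample_rate)
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = signal[start:start + frame] * np.hanning(frame)
        powers.append(np.mean(chunk ** 2))                       # frame power
        spectrum = np.abs(np.fft.rfft(chunk))
        centroids.append(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    return np.array(powers), np.array(centroids)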
Reviewing the literature that describes the classification of musical instruments, it can be seen that this area has been in development for almost three decades [17,18,25,28,36,41]. These works use various sets of signal and statistical parameters for the analyzed samples, standard MPEG-7 descriptors, spectrograms, mel-frequency cepstral coefficients (MFCC), or the constant-Q transform (CQT) as the basis for their operation. Similar to the input data, the baseline algorithms employed for classification also differ. They are as follows: HMM (hidden Markov model), k-NN (k-nearest neighbors) classifier, SOM (self-organizing map), SVM (support vector machine), decision trees, etc. Depending on the FVs and algorithms applied, they achieve an efficiency of up to 99% for musical instrument recognition. However, as already said, some issues remain, such as instruments with differentiated articulation. The newer studies refer to deep models; however, the outcomes vary between studies.
2.2. Related Work
Musical instrument identification also has a vital role in various classification tasks in audio fields. One such example is genre classification. In this context, many algorithms were used but obtained similar results. It should be noted that a music genre is conditioned by the instruments present in a musical piece. For example, the cello and saxophone are often encountered in jazz music, whereas the banjo is almost exclusively associated with country music. In music genre classification, several well-known techniques have been used, such as SVM (support vector machine) [14,19,25,26,33], ANN (artificial neural networks) [24,40], etc., as well as CNN (convolutional neural networks) [28,30,34,35,36,37,38,39,40], RNN (recurrent neural networks) [28,41], and CRNN (convolutional recurrent neural network) [31,34].
Table 1 shows an overview of the various algorithms and tasks described above, along with the obtained results [14,15,16,18,19,20,21,22,23,24,25,26,27,28,29,30,31,33,34,35,36,37,38,39,40,41].
Table 1
Authors | Year | Task | Input Type | Algorithm | Metrics
---|---|---|---|---|---
Avramidis K., Kratimenos A., Garoufis C., Zlatintsi A., Maragos P. [28] | 2021 | Predominant instrument recognition | Raw audio | RNN (recurrent neural networks), CNN (convolutional neural networks), and CRNN (convolutional recurrent neural network) | LRAP (label ranking average precision) 0.747; F1 micro 0.608; F1 macro 0.543
Kratimenos A., Avramidis K., Garoufis C., Zlatintsi A., Maragos P. [36] | 2021 | Instrument identification | CQT (constant-Q transform) | CNN | LRAP 0.805; F1 micro 0.647; F1 macro 0.546
Zhang F. [41] | 2021 | Genre detection | MIDI music | RNN | Accuracy 89.91%; F1 macro 0.9
Shreevathsa P.K., Harshith M., A.R.M., and Ashwini [40] | 2020 | Single instrument classification | MFCC (mel-frequency cepstral coefficients) | ANN (artificial neural networks) and CNN | ANN accuracy 72.08%; CNN accuracy 92.24%
Blaszke M., Koszewski D., Zaporowski S. [30] | 2019 | Single instrument classification | MFCC | CNN | Precision 0.99; Recall 1.0; F1 score 0.99
Das O. [33] | 2019 | Single instrument classification | MFCC and WLPC (warped linear predictive coding) | Logistic regression and SVM (support vector machine) | Accuracy 100%
Gururani S., Summers C., Lerch A. [34] | 2018 | Instrument identification | MFCC | CNN and CRNN | AUC ROC 0.81
Rosner A., Kostek B. [26] | 2018 | Genre detection | FV (feature vector) | SVM | Accuracy 72%
Choi K., Fazekas G., Sandler M., Cho K. [31] | 2017 | Audio tagging | MFCC | CRNN (convolutional recurrent neural network) | ROC AUC (receiver operating characteristic) 0.65–0.98
Han Y., Kim J., Lee K. [35] | 2017 | Predominant instrument recognition | MFCC | CNN | F1 score macro 0.503; F1 score micro 0.602
Pons J., Slizovskaia O., Gong R., Gómez E., Serra X. [39] | 2017 | Predominant instrument recognition | MFCC | CNN | F1 score micro 0.503; F1 score macro 0.432
Bhojane S.B., Labhshetwar O.G., Anand K., Gulhane S.R. [29] | 2017 | Single instrument classification | FV (MIR Toolbox) | k-NN (k-nearest neighbors) | A system that can listen to the musical instrument tone and recognize it (no metrics shown)
Lee J., Kim T., Park J., Nam J. [37] | 2017 | Instrument identification | Raw audio | CNN | AUC ROC 0.91; Accuracy 86%; F1 score 0.45%
Li P., Qian J., Wang T. [38] | 2015 | Instrument identification | Raw audio, MFCC, and CQT (constant-Q transform) | CNN | Accuracy 82.74%
Giannoulis D., Benetos E., Klapuri A., Plumbley M.D. [20] | 2014 | Instrument identification | CQT (constant-Q transform of a time domain signal) | Missing feature approach with AMT (automatic music transcription) | F1 0.52
Giannoulis D., Klapuri A. [21] | 2013 | Instrument recognition in polyphonic audio | A variety of acoustic features | Local spectral features and missing-feature techniques, mask probability estimation | Accuracy 67.54%
Bosch J.J., Janer J., Fuhrmann F., Herrera P. [14] | 2012 | Predominant instrument recognition | Raw audio | SVM | F1 score micro 0.503; F1 score macro 0.432
Heittola T., Klapuri A., Virtanen T. [16] | 2009 | Instrument recognition in polyphonic audio | MFCC | NMF (non-negative matrix factorization) and GMM | F1 score 0.62
Essid S., Richard G., David B. [19] | 2006 | Single instrument classification | MFCC and FV | GMM (Gaussian mixture model) and SVM | Accuracy 93%
Kostek B. [23] | 2004 | Single instrument classification (12 instruments) | Combined MPEG-7 and wavelet-based FVs | ANN | Accuracy 72.24%
Eronen A. [15] | 2003 | Single instrument classification | MFCC | ICA (independent component analysis), ML, and HMM (hidden Markov model) | Accuracy between 62–85%
Kitahara T., Goto M., Okuno H. [22] | 2003 | Single instrument classification | FV | Discriminant function based on the Bayes decision rule | Recognition rate 79.73%
Tzanetakis G., Cook P. [27] | 2002 | Genre detection | FV and MFCC | SPR (subtree pruning–regrafting) | Accuracy 61%
Kostek B., Czyżewski A. [24] | 2001 | Single instrument classification | FV | ANN | Accuracy 94.5%
Eronen A., Klapuri A. [18] | 2000 | Single instrument classification | FV | k-NN | Accuracy 80%
Marques J., Moreno P.J. [25] | 1999 | Single instrument classification | MFCC | GMM and SVM | Error rate 17%
As already mentioned, the aim of this study is to build an algorithm for automatic identification of instruments present in an audio excerpt using sets of individual convolutional neural networks (CNN) per tested instrument. Therefore, a flexible model where any instrument could be added to the previously trained neural network should be created.
3. Dataset
In our study, the Slakh dataset was used, which contains 2100 audio tracks with aligned MIDI files and separate instrument stems along with tagging [43]. From all of the available instruments, four were selected for the experiment: bass, drums, guitar, and piano. After selection, each song was split into 4-second excerpts. If the level of the instrument signal in the extracted part was lower than −60 dB, then this instrument was excluded from the example. This made it possible to decrease computing costs and increase the variability of the instrument count in the mix. Additionally, in each part, the gain is selected randomly for each instrument separately. Example spectrograms of selected instruments and the prepared mix are presented in the figure below.
Example of spectrograms of selected instruments and the prepared mix.
The examples were then stored as NumPy-format files that contain the mixed signal, the instrument references, and a vector of labels indicating which instruments were used in the mix [44].
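As a hedged illustration of this preparation step (not the authors' exact pipeline), one 4-second example could be mixed, filtered by the −60 dB level check, and stored as a compressed NumPy archive as follows; the key names, gain range, and sample rate are assumptions.

import numpy as np

INSTRUMENTS = ["bass", "drums", "guitar", "piano"]
SR = 44100            # assumed sample rate
EXCERPT_LEN = 4 * SR  # 4-second excerpts

def rms_db(x):
    # RMS level in dBFS, with a small epsilon to avoid log(0)
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))) + 1e-12)

def make_example(stems, path):
    # stems: dict mapping instrument name -> mono excerpt of length EXCERPT_LEN
    labels = np.zeros(len(INSTRUMENTS), dtype=np.float32)
    mix = np.zeros(EXCERPT_LEN, dtype=np.float32)
    for i, name in enumerate(INSTRUMENTS):
        stem = stems.get(name, np.zeros(EXCERPT_LEN, dtype=np.float32))
        if rms_db(stem) < -60.0:                 # exclude near-silent instruments
            continue
        gain = np.random.uniform(0.5, 1.0)       # random per-instrument gain (range assumed)
        mix += gain * stem
        labels[i] = 1.0
    np.savez_compressed(path, mix=mix, labels=labels)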
To achieve repeatability of the training results, the whole dataset was divided a priori into three parts, with the condition that a single audio track cannot appear in more than one part:
Training set: 116,413 examples;
Validation set: 5970 examples;
Evaluation set: 6983 examples.
The numbers of appearances of individual instruments in the mixes are not equal. To avoid favoring any of them, a class weighting vector is passed to the training algorithm to balance the results between instruments. The calculated weights are as follows (one way to derive such weights is sketched after the list):
Bass: 0.65;
Guitar: 1.0;
Piano: 0.78;
Drums: 0.56.
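The paper does not state the formula behind these weights. A plausible sketch, assuming inverse-frequency weights normalized so the least frequent class (guitar) receives 1.0, is shown below; the appearance counts are placeholders, not values from the paper.

import numpy as np

appearances = {"bass": 90000, "drums": 104000, "guitar": 58000, "piano": 74000}  # hypothetical counts
inv = {k: 1.0 / v for k, v in appearances.items()}
scale = 1.0 / max(inv.values())          # rarest class -> weight 1.0
class_weights = {k: round(v * scale, 2) for k, v in inv.items()}
print(class_weights)  # with these placeholder counts: {'bass': 0.64, 'drums': 0.56, 'guitar': 1.0, 'piano': 0.78}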
Furthermore, the number of instruments in a given sample also varies. Due to the structure of music pieces, the largest part (about 1/3 of all of the examples in the dataset) contains three instruments. Four, two, and then one instrument populate the remaining parts. In addition, music samples that do not contain any instrument are introduced to the algorithm input to train the system to recognize that such a case can also occur. Histograms of the instrument classes in the mixes and of the number of instruments in a mix are presented in the figures below.
Histogram of the instrument classes in the mixes.
Histogram of instruments in the mixes.
4. Model
The proposed neural network was implemented using the Keras framework and functional API [45]. The model initially produces MFCCs (mel-frequency cepstral coefficients) from the raw audio signal using built-in Keras methods [46]; the framing and mel-filterbank parameters for these operations are fixed when the MFCC front end is built.
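The exact MFCC settings are not restated in this text, so the following is only a minimal sketch of such a front end built from TensorFlow signal operations; the sample rate, frame length, hop size, mel-band count, frequency edges, and number of coefficients are illustrative assumptions rather than the authors' values.

import tensorflow as tf

def compute_mfcc(audio, sample_rate=44100, frame_length=2048, frame_step=512,
                 num_mel_bins=128, num_mfcc=30):
    # audio: float32 tensor of shape (batch, samples); all parameter values here are assumptions
    stft = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step,
                          fft_length=frame_length)
    spectrogram = tf.abs(stft)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=frame_length // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=20.0,
        upper_edge_hertz=8000.0)
    mel_spectrogram = tf.matmul(spectrogram, mel_matrix)
    log_mel = tf.math.log(mel_spectrogram + 1e-6)
    mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :num_mfcc]
    return mfccs[..., tf.newaxis]  # channel axis for the 2D convolution layers

In the functional model shown further below, a function like this can be wrapped in a tf.keras.layers.Lambda layer to play the role of prepareMfccModel.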
In contrast to other methods where a single model performs identification or classification of all instruments, the proposed model employs a set of individual, identically defined submodels, one per instrument. The proposed architecture contains 2-dimensional convolution layers at the beginning. The numbers of filters were, respectively, 128, 64, and 32, with (3, 3) kernels and the ReLU activation function [47]. In addition, 2-dimensional max pooling and batch normalization are incorporated into the model after each convolution [48,49,50]. To obtain the decision, four dense layers were used with 64, 32, 16, and 1 units, respectively [50]. The model contains 706,182 trainable parameters. The topology of the network is presented in the accompanying figure.
The simplified code for model preparation is presented below. Each instrument has its own model preparation function, where a new model could be created, or a pre-trained model could be loaded. In the last operation, outputs from all models are concatenated and set as a whole model output.
from tensorflow.keras.layers import Input, Concatenate
from tensorflow.keras.models import Model

def prepareModel(input_shape):
    dense_outputs = []
    # shared raw-audio input and MFCC front end
    input = Input(shape=input_shape)
    mfcc = prepareMfccModel(input)
    # one independent identification path (submodel) per instrument
    dense_outputs.append(prepareBassModel(mfcc))
    dense_outputs.append(prepareGuitarModel(mfcc))
    dense_outputs.append(preparePianoModel(mfcc))
    dense_outputs.append(prepareDrumsModel(mfcc))
    # concatenate the per-instrument outputs into a single output vector
    concat = Concatenate()(dense_outputs)
    model = Model(inputs=input, outputs=concat)
    return model
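The per-instrument builders called above (prepareBassModel and the like) are not listed in this text. A minimal sketch of one such submodel, following the architecture described earlier (three Conv2D blocks with 128, 64, and 32 filters, 3x3 kernels, ReLU, max pooling, and batch normalization, followed by dense layers of 64, 32, 16, and 1 units), is given below; the pooling size, the flattening step, and the sigmoid output activation are assumptions.

from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Flatten, Dense

def prepareInstrumentModel(mfcc, name):
    # three convolutional blocks: Conv2D -> MaxPooling2D -> BatchNormalization
    x = mfcc
    for filters in (128, 64, 32):
        x = Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        x = MaxPooling2D(pool_size=(2, 2))(x)  # pooling size assumed
        x = BatchNormalization()(x)
    x = Flatten()(x)  # flattening before the dense head (assumed)
    # dense decision head: 64 -> 32 -> 16 -> 1 units
    for units in (64, 32, 16):
        x = Dense(units, activation="relu")(x)
    # one output unit per instrument; sigmoid assumed for a 0-1 presence score
    return Dense(1, activation="sigmoid", name=name)(x)

def prepareBassModel(mfcc):
    return prepareInstrumentModel(mfcc, "bass")

prepareGuitarModel, preparePianoModel, and prepareDrumsModel can be defined analogously, or replaced with pre-trained variants as described above.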
4.1. Training
The training was performed using the TensorFlow framework for the Python language. The model was trained for 100 epochs with the mean squared error (MSE) as the loss function. Additionally, during training, the precision, recall, and AUC ROC (area under the receiver operating characteristic curve) were calculated. The best model was selected based on the AUC ROC metric [51]. Precision is the ratio of true positive examples to all examples identified as the examined class. The definition of this metric is presented in Equation (1) [52].
Precision = TP / (TP + FP)    (1)
Recall, the ratio of true positive examples to all examples in the examined class, is defined by Equation (2):
Recall = TP / (TP + FN)    (2)
Additionally, for evaluation purposes, the F1 score was used [53]. This metric represents a harmonic mean of precision and recall. The exact definition is presented in Equation (3) [52].
F1 = 2 × Precision × Recall / (Precision + Recall)    (3)
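As a concrete check of these definitions (not the authors' evaluation code), the following sketch computes precision, recall, F1, and AUC ROC for a small batch of multi-label predictions using scikit-learn; the toy arrays are illustrative only.

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# toy ground truth and scores for four instruments (bass, drums, guitar, piano)
y_true = np.array([[1, 1, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])
y_score = np.array([[0.9, 0.8, 0.2, 0.7],
                    [0.3, 0.9, 0.6, 0.4],
                    [0.8, 0.6, 0.4, 0.9]])
y_pred = (y_score >= 0.5).astype(int)   # fixed 0.5 decision threshold

print("precision:", precision_score(y_true, y_pred, average="micro"))
print("recall:   ", recall_score(y_true, y_pred, average="micro"))
print("F1 score: ", f1_score(y_true, y_pred, average="micro"))
print("AUC ROC:  ", roc_auc_score(y_true, y_score, average="micro"))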
The receiver operating characteristic (ROC) shows the trade-off between true and false positive rates as a function of the decision threshold. The ROC and AUC ROC are illustrated in the figure below. We included this illustration to visualize the importance of true and false positives in the identification process.
Example of the receiver operating characteristic and area under the curve.
Precision, recall, and AUC ROC calculated during training are presented in the figures below. On the training set, recall starts from 0.67 and increases to 0.93. Precision starts from a higher value, 0.78, and increases throughout the whole training process to 0.93. AUC ROC builds up from the lowest value, 0.63, but increases to the highest value, 0.96. The values of the metrics for the validation sets look similar to those of the training set. During training, both metrics increase to 0.95 and 0.97 for the training set and, respectively, 0.95 and 0.94 for the validation set. The recall starts from 0.84 and increases to 0.93, precision from 0.82 to 0.92, and AUC ROC from 0.74 to 0.95.
Metrics achieved by the algorithm on the training set.
Metrics achieved by the algorithm on the validation set.
The training was performed using a single RTX2070 graphics card with an AMD Ryzen 5 3600 processor and 32 GB of RAM. The duration of a single epoch is about 8 min using multiprocessing data loading and with a batch size of 200.
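A hedged sketch of how the training configuration described above (MSE loss, precision/recall/AUC metrics, best-model selection on validation AUC ROC, 100 epochs, batch size 200) could be expressed in Keras is shown below; the optimizer choice and the placeholder training arrays are assumptions, and prepareModel refers to the sketches given earlier.

import numpy as np
import tensorflow as tf

# placeholder arrays standing in for batches assembled from the stored .npz examples
x_train = np.random.randn(32, 4 * 44100).astype("float32")
y_train = np.random.randint(0, 2, size=(32, 4)).astype("float32")
x_val, y_val = x_train[:8], y_train[:8]

model = prepareModel(input_shape=(4 * 44100,))   # 4-second excerpts; sample rate assumed

model.compile(
    optimizer="adam",    # optimizer not stated in the text; assumed
    loss="mse",          # mean squared error, as described
    metrics=[tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall"),
             tf.keras.metrics.AUC(name="auc")])

# keep only the weights with the best AUC ROC on the validation set
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_auc", mode="max", save_best_only=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,
          batch_size=200,
          callbacks=[checkpoint])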
4.2. Evaluation Results and Discussion
The evaluation was carried out using a set of 6983 examples prepared from audio tracks not present in the training and validation sets. The processing time for a single example was about 0.44 s, so the algorithm works approximately 10 times faster than real time. The averaged results for the individual metrics are as follows:
Precision: 0.92;
Recall: 0.93;
AUC ROC: 0.96;
F1 score: 0.93.
The individual components of precision and recall are as follows:
True positive: 17,759;
True negative: 6610;
False positive: 1512;
False negative: 1319.
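As a quick consistency check, the aggregate metrics above follow directly from these counts:
Precision = 17,759 / (17,759 + 1,512) ≈ 0.92
Recall = 17,759 / (17,759 + 1,319) ≈ 0.93
F1 score = 2 × 0.92 × 0.93 / (0.92 + 0.93) ≈ 0.93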
Based on the results obtained, more detailed analyses were also carried out, discerning individual instruments. The ROC curves are presented in the figure below. They indicate that the most easily identifiable class is percussion, which can obtain a true positive rate of 0.95 for a relatively low false positive rate of about 0.01. The algorithm is slightly worse at identifying bass because, to achieve similar effectiveness, the false positive rate for the bass would have to be 0.2. When it comes to guitar and piano, to achieve an effectiveness of about 0.9, one has to accept a false positive rate of 0.27 and 0.19, respectively.
ROC curves for each instrument tested.
Detailed values for precision, recall, and the number of true and false positive and negative detections are presented in Table 2. By comparing these results with the ROC plot above, one can see confirmation that the model is more capable of recognizing drums and also bass. Looking at the metric values for the guitar, one can see that the numbers of false negative and false positive samples are similar. For the piano, the opposite happens, i.e., more samples are marked as false positives. Table 3 presents the confusion matrix.
Table 2
Metric | Bass | Drums | Guitar | Piano |
---|---|---|---|---|
Precision | 0.94 | 0.99 | 0.82 | 0.87 |
Recall | 0.94 | 0.99 | 0.82 | 0.91 |
F1 score | 0.95 | 0.99 | 0.82 | 0.89 |
True positive | 5139 | 6126 | 2683 | 3811 |
True negative | 1072 | 578 | 2921 | 2039 |
False positive | 288 | 38 | 597 | 589 |
False negative | 301 | 58 | 599 | 361 |
Table 3
Confusion matrix (in percentage points).
Predicted Instrument \ Ground Truth Instrument [%] | Bass | Guitar | Piano | Drums
---|---|---|---|---
Bass | 81 | 8 | 7 | 0
Guitar | 4 | 69 | 13 | 0
Piano | 5 | 12 | 77 | 0
Drums | 3 | 7 | 6 | 82
4.3. Redefining the Models
Using the ability of the model's infrastructure to easily swap entire blocks for individual instruments, an additional experiment was conducted. The submodel for drums was changed to a smaller one and the submodel for guitar to a bigger one. A detailed comparison of the block structure before and after the changes introduced is shown in Table 4 (a sketch of such a swap follows the table).
Table 4
Comparison between the first submodel and models after modification.
Block Number | Unified Submodel | Guitar Submodel | Drums Submodel
---|---|---|---
1 | 2D convolution; 2D max pooling; batch normalization | 2D convolution; 2D max pooling; batch normalization | 2D convolution; 2D max pooling; batch normalization
2 | 2D convolution; 2D max pooling; batch normalization | 2D convolution; 2D max pooling; batch normalization | 2D convolution; 2D max pooling; batch normalization
3 | 2D convolution; 2D max pooling; batch normalization | 2D convolution; 2D max pooling; batch normalization | 2D convolution; 2D max pooling; batch normalization
4 | Dense layer: 64 units | Dense layer: 64 units | Dense layer: 64 units
5 | Dense layer: 32 units | Dense layer: 32 units | Dense layer: 32 units
6 | Dense layer: 16 units | Dense layer: 16 units | Dense layer: 16 units
7 | Dense layer: 1 unit | Dense layer: 1 unit | Dense layer: 1 unit
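To illustrate the swap mechanism (not the authors' exact code), a smaller drums submodel could be defined with narrower convolution blocks and substituted in prepareModel without touching the other paths; the filter counts below are assumptions, since the modified widths are not given in this text.

from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Flatten, Dense

def prepareSmallDrumsModel(mfcc):
    # Hypothetical smaller drums path: narrower convolution blocks than the
    # unified 128/64/32 design (the modified widths are not given in the text).
    x = mfcc
    for filters in (32, 16, 8):          # assumed smaller filter counts
        x = Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        x = MaxPooling2D(pool_size=(2, 2))(x)
        x = BatchNormalization()(x)
    x = Flatten()(x)
    for units in (64, 32, 16):
        x = Dense(units, activation="relu")(x)
    return Dense(1, activation="sigmoid", name="drums")(x)

# In prepareModel, only the drums line changes:
# dense_outputs.append(prepareSmallDrumsModel(mfcc))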
AUC ROC curves calculated during the training and validation stages are presented in the figures below. The training and validation curves look similar, but the modified model achieves better results by about 0.01.
AUC ROC achieved by the algorithms on the training set.
AUC ROC achieved by the algorithms on the validation set.
4.4. Evaluation Result Comparison
The evaluation of the new model was prepared under the same conditions as in Section 4.2. A comparison of results for the first submodel and the models after modifications is presented in Table 5, whereas Table 6 shows the results per modified instrument. Because the metric values are rounded to two decimal places, the differences are not strongly visible when looking at the entire evaluation set. However, comparing the raw counts, the modified model yields more true positives and true negatives and fewer false positives than the unified model, by roughly 130 to 240 examples each; only the false negative count is worse, by 61 examples. Looking at the results for the changed instruments, one can see that the smaller model for drums performs similarly to the unified model, whereas the larger model for the guitar gives better precision and F1 score.
Table 5
Comparison between the first submodel and models after modifications.
Metric | Unified Model | Modified Model |
---|---|---|
Precision | 0.92 | 0.93 |
Recall | 0.93 | 0.93 |
AUC ROC | 0.96 | 0.96 |
F1 score | 0.93 | 0.93 |
True positive | 17,759 | 17,989 |
True negative | 6610 | 6851 |
False positive | 1512 | 1380 |
False negative | 1319 | 1380 |
Table 6
Results per modified instrument models.
Metric | Drums (Unified Model) | Drums (Modified Model) | Guitar (Unified Model) | Guitar (Modified Model)
---|---|---|---|---
Precision | 0.99 | 0.99 | 0.82 | 0.86 |
Recall | 0.99 | 0.99 | 0.82 | 0.8 |
F1 score | 0.99 | 0.99 | 0.82 | 0.83 |
True positive | 6126 | 6232 | 2683 | 2647 |
True negative | 578 | 570 | 2921 | 3150 |
False positive | 38 | 47 | 597 | 444 |
False negative | 58 | 51 | 599 | 659 |
The ROC curves for the unified and modified models are presented in the figure below. Focusing on the modified models for drums and guitar, it can be noticed that the smaller drums model has an almost identical ROC curve shape. In contrast, the guitar model shows better results with the bigger model; e.g., at a false positive rate of 0.1, the true positive rate increases from 0.72 to 0.77.
ROC curves for each instrument tested on the unified and modified models.
The figure below presents reduced heatmaps of the last convolutional layers per identified instrument for one of the examples from the evaluation dataset (one way to extract such maps is sketched after the figure caption). Comparing the heatmaps with each other, one can see that the bass model focuses mainly on lower frequencies across the whole signal, the guitar model on low and mid frequencies, the piano model on mid and high frequencies but also the whole signal, and finally, the drums model on all frequencies and on short time segments.
Reduced heatmaps for the last convolutional layers per identified instrument.
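The reduction method is not spelled out here; a sketch of one plausible way to obtain such heatmaps (channel-averaged activations of each submodel's last convolutional layer) is shown below, with the layer name being a hypothetical placeholder.

import numpy as np
import tensorflow as tf

def last_conv_heatmap(model, example, layer_name):
    # Build a sub-graph from the model input to a chosen convolutional layer and
    # average its activations over the channel axis to obtain a 2D heatmap.
    # layer_name is hypothetical; real names depend on how the submodels were built.
    activation_model = tf.keras.Model(inputs=model.input,
                                      outputs=model.get_layer(layer_name).output)
    activations = activation_model.predict(example[np.newaxis, ...])  # add batch axis
    return activations[0].mean(axis=-1)  # time frames x MFCC bins

# e.g., bass_map = last_conv_heatmap(model, mix_waveform, "bass_last_conv")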
5. Discussion
The presented results show that it is possible to determine the instruments present in a given excerpt of a musical recording with a precision of 93% and an F1 score of 0.93 using a simple convolutional network based on the MFCC.
The experiment also shows that the effectiveness of identification depends on the instrument tested. The drums are more easily identifiable, while the guitar and piano produced worse results.
The current state of the art in audio recognition fields focuses on single or predominant instrument recognition and genre classification. With regard to the results of those tasks, an accuracy of about 100% can be found, but when looking at musical instrument recognition results, the metric values are lower, e.g., AUC ROC of approximately 0.91 [37] or F1 score of about 0.64 [36]. The proposed solution can achieve an AUC ROC of about 0.96 and an F1 score of about 0.93, outperforming the other methods.
An additional difference compared to state-of-the-art methods is the flexibility of the model. The presented results show that swapping a submodel allows, for example, reducing the size of the model when an instrument is readily identifiable, without affecting the architecture of the other identifiers. Thus, it is possible to save computational power compared to a model with a large, unified architecture. On the other hand, a submodel can be enlarged to improve the results for an instrument with poorer identification quality without affecting the other instruments under study.
6. Conclusions
The novelty of the proposed solution lies in the model architecture, where every instrument has an individual and independent identification path. Each path learns to focus on specific patterns in the MFCC representation depending on the examined instrument, in contrast to state-of-the-art methods, where a single convolutional part learns one set of patterns for all instruments.
The proposed framework is very flexible, so it could use instrument models with various complexity: more advanced for those with weaker results and more straightforward for those with better results. Another advantage of this flexibility is the opportunity to extend the model with more instruments by adding new submodels in the architecture proposed. This thread will be pursued further, especially as a new dataset is being prepared that will contain musical instruments that are underrepresented in music repositories, i.e., the harp, Rav vast, and Persian cymbal (santoor). Recordings of these instruments are created with both dynamic and condenser microphones at various distances and angles of microphone positioning, and they will be employed for creating new submodels in the identification system.
Additionally, the created model will be worked on toward on-the-fly musical instrument identification as this will enable its broader applicability in real-time systems.
Moreover, we may use other neural network structures as known in the literature [54,55], e.g., using sample-level filters instead of frame-level input representations [56], and trying other approaches to music feature extraction, e.g., including derivation of rhythm, melody, and harmony and determining their weights by employing the exponential analytic hierarchy process (AHP) [57]. Lastly, the model proposed may be tested with audio signals other than music, such as classification of urban sounds [58].
Author Contributions
Conceptualization, M.B. and B.K.; methodology, M.B.; software, M.B.; validation, M.B. and B.K.; investigation, M.B. and B.K.; data curation, M.B.; writing (original draft preparation), M.B. and B.K.; writing (review and editing), B.K.; supervision, B.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
2. Tanaka A. Sensor-based musical instruments and interactive music. In: Dean T.T., editor. The Oxford Handbook of Computer Music. Oxford University Press; Oxford, UK: 2012.
3. Turchet L., McPherson A., Fischione C. Smart instruments: Towards an ecosystem of interoperable devices connecting performers and audiences. Proceedings of the Sound and Music Computing Conference; Hamburg, Germany, 31 August–3 September 2016; pp. 498–505.
4. Turchet L., McPherson A., Barthet M. Real-Time Hit Classification in a Smart Cajón. Front. ICT. 2018;5:16. doi:10.3389/fict.2018.00016.
5. Benetos E., Dixon S., Giannoulis D., Kirchhoff H., Klapuri A. Automatic music transcription: Challenges and future directions. J. Intell. Inf. Syst. 2013;41:407–434. doi:10.1007/s10844-013-0258-3.
6. Brown J.C. Computer Identification of Musical Instruments using Pattern Recognition with Cepstral Coefficients as Features. J. Acoust. Soc. Am. 1999;105:1933–1941. doi:10.1121/1.426728.
7. Dziubiński M., Dalka P., Kostek B. Estimation of Musical Sound Separation Algorithm Effectiveness Employing Neural Networks. J. Intell. Inf. Syst. 2005;24:133–157. doi:10.1007/s10844-005-0320-x.
8. Hyvärinen A., Oja E. Independent component analysis: Algorithms and applications. Neural Netw. 2000;13:411–430. doi:10.1016/S0893-6080(00)00026-5.
9. Flandrin P., Rilling G., Goncalves P. Empirical mode decomposition as a filter bank. IEEE Signal Processing Lett. 2004;11:112–114. doi:10.1109/LSP.2003.821662.
12. Burgoyne J.A., Fujinaga I., Downie J.S. Music Information Retrieval. In: A New Companion to Digital Humanities. John Wiley & Sons, Ltd.; Chichester, UK: 2015; pp. 213–228.
14. Bosch J.J., Janer J., Fuhrmann F., Herrera P. A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals. Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012); Porto, Portugal, 8–12 October 2012; pp. 559–564.
15. Eronen A. Musical instrument recognition using ICA-based transform of features and discriminatively trained HMMs. Proceedings of the International Symposium on Signal Processing and Its Applications (ISSPA); Paris, France, 1–4 July 2003; pp. 133–136.
16. Heittola T., Klapuri A., Virtanen T. Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model for Sound Separation. Proceedings of the 10th International Society for Music Information Retrieval Conference; Utrecht, The Netherlands, 9–13 August 2009; pp. 327–332.
17. Martin K.D. Toward Automatic Sound Source Recognition: Identifying Musical Instruments. Proceedings of the NATO Computational Hearing Advanced Study Institute; Il Ciocco, Italy, 1–12 July 1998.
18. Eronen A., Klapuri A. Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP); Istanbul, Turkey, 5–9 June 2000; pp. 753–756.
19. Essid S., Richard G., David B. Musical Instrument Recognition by pairwise classification strategies. IEEE Trans. Audio Speech Lang. Processing. 2006;14:1401–1412. doi:10.1109/TSA.2005.860842.
20. Giannoulis D., Benetos E., Klapuri A., Plumbley M.D. Improving Instrument recognition in polyphonic music through system integration. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Florence, Italy, 4–9 May 2014.
21. Giannoulis D., Klapuri A. Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach. IEEE Trans. Audio Speech Lang. Processing. 2013;21:1805–1817. doi:10.1109/TASL.2013.2248720.
22. Kitahara T., Goto M., Okuno H. Musical Instrument Identification Based on F0-Dependent Multivariate Normal Distribution. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '03); Hong Kong, China, 6–10 April 2003; pp. 421–424.
23. Kostek B. Musical Instrument Classification and Duet Analysis Employing Music Information Retrieval Techniques. Proc. IEEE. 2004;92:712–729. doi:10.1109/JPROC.2004.825903.
24. Kostek B., Czyżewski A. Representing Musical Instrument Sounds for Their Automatic Classification. J. Audio Eng. Soc. 2001;49:768–785.
25. Marques J., Moreno P.J. A Study of Musical Instrument Classification Using Gaussian Mixture Models and Support Vector Machines. Camb. Res. Lab. Tech. Rep. Ser. CRL. 1999;4:143.
26. Rosner A., Kostek B. Automatic music genre classification based on musical instrument track separation. J. Intell. Inf. Syst. 2018;50:363–384. doi:10.1007/s10844-017-0464-5.
27. Tzanetakis G., Cook P. Musical genre classification of audio signals. IEEE Trans. Speech Audio Processing. 2002;10:293–302. doi:10.1109/TSA.2002.800560.
28. Avramidis K., Kratimenos A., Garoufis C., Zlatintsi A., Maragos P. Deep Convolutional and Recurrent Networks for Polyphonic Instrument Classification from Monophonic Raw Audio Waveforms. Proceedings of the 46th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021); Toronto, Canada, 6–11 June 2021; pp. 3010–3014.
29. Bhojane S.S., Labhshetwar O.G., Anand K., Gulhane S.R. Musical Instrument Recognition Using Machine Learning Technique. Int. Res. J. Eng. Technol. 2017;4:2265–2267.
30. Blaszke M., Koszewski D., Zaporowski S. Real and Virtual Instruments in Machine Learning: Training and Comparison of Classification Results. Proceedings of the IEEE 2019 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA); Poznan, Poland, 18–20 September 2019.
31. Choi K., Fazekas G., Sandler M., Cho K. Convolutional recurrent neural networks for music classification. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); New Orleans, LA, USA, 5–9 March 2017; pp. 2392–2396.
32. Sawhney A., Vasavada V., Wang W. Latent Feature Extraction for Musical Genres from Raw Audio. Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS 2018); Montréal, QC, Canada, 28 December 2021.
33. Das O. Musical Instrument Identification with Supervised Learning. Comput. Sci. 2019:14.
34. Gururani S., Summers C., Lerch A. Instrument Activity Detection in Polyphonic Music using Deep Neural Networks. Proceedings of the ISMIR; Paris, France, 23–27 September 2018; pp. 569–576.
35. Han Y., Kim J., Lee K. Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music. IEEE/ACM Trans. Audio Speech Lang. Process. 2017;25:208–221. doi:10.1109/TASLP.2016.2632307.
36. Kratimenos A., Avramidis K., Garoufis C., Zlatintsi A., Maragos P. Augmentation methods on monophonic audio for instrument classification in polyphonic music. Proceedings of the European Signal Processing Conference; Dublin, Ireland, 23–27 August 2021; pp. 156–160.
37. Lee J., Kim T., Park J., Nam J. Raw waveform based audio classification using sample level CNN architectures. Proceedings of the Machine Learning for Audio Signal Processing Workshop (ML4Audio); Long Beach, CA, USA, 4–8 December 2017.
38. Li P., Qian J., Wang T. Automatic Instrument Recognition in Polyphonic Music Using Convolutional Neural Networks. arXiv preprint. 2015. arXiv:1511.05520.
39. Pons J., Slizovskaia O., Gong R., Gómez E., Serra X. Timbre analysis of music audio signals with convolutional neural networks. Proceedings of the 25th European Signal Processing Conference (EUSIPCO); Kos, Greece, 28 August–2 September 2017; pp. 2744–2748.
40. Shreevathsa P.K., Harshith M., Rao A. Music Instrument Recognition using Machine Learning Algorithms. Proceedings of the 2020 International Conference on Computation, Automation and Knowledge Management (ICCAKM); Dubai, United Arab Emirates, 9–11 January 2020; pp. 161–166.
41. Zhang F. Research on Music Classification Technology Based on Deep Learning. Secur. Commun. Netw. 2021;2021:7182143. doi:10.1155/2021/7182143.
42. Dorochowicz A., Kurowski A., Kostek B. Employing Subjective Tests and Deep Learning for Discovering the Relationship between Personality Types and Preferred Music Genres. Electronics. 2020;9:2016. doi:10.3390/electronics9122016.
43. Slakh Demo Site for the Synthesized Lakh Dataset (Slakh). Available online: http://www.slakh.com/ (accessed on 1 April 2022).
54. Samui P., Roy S.S., Balas V.E., editors. Handbook of Neural Computation. Academic Press; Cambridge, MA, USA: 2017.
55. Balas V.E., Roy S.S., Sharma D., Samui P., editors. Handbook of Deep Learning Applications. Springer; New York, NY, USA: 2019.
56. Lee J., Park J., Kim K.L., Nam J. Sample CNN: End-to-end deep convolutional neural networks using very small filters for music classification. Appl. Sci. 2018;8:150. doi:10.3390/app8010150.
57. Chen Y.T., Chen C.H., Wu S., Lo C.C. A two-step approach for classifying music genre on the strength of AHP weighted musical features. Mathematics. 2018;7:19. doi:10.3390/math7010019.
58. Roy S.S., Mihalache S.F., Pricop E., Rodrigues N. Deep convolutional neural network for environmental sound classification via dilation. J. Intell. Fuzzy Syst. 2022:17. doi:10.3233/JIFS-219283.