|Home | About | Journals | Submit | Contact Us | Français|
This paper describes a multiresolution system based on MPEG-7 audio signature descriptors for music identification. Such an identification system may be used to detect illegally copied music circulated over the Internet. In the proposed system, low-resolution descriptors are used to search likely candidates, and then full-resolution descriptors are used to identify the unknown (query) audio. With this arrangement, the proposed system achieves both high speed and high accuracy. To deal with the problem that a piece of query audio may not be inside the system's database, we suggest two different methods to find the decision threshold. Simulation results show that the proposed method II can achieve an accuracy of 99.4% for query inputs both inside and outside the database. Overall, it is highly possible to use the proposed system for copyright control.
With the SOPA (stop online piracy act) bill  proposed in 2011, the protection of copyrighted intellectual property, such as digital content, once again brought to public attention. Despite the controversial issues of the SOPA bill, it is commonly agreed that copyrighted digital content should be protected. However, the first step toward the protection of copyrighted content is to identify whether a piece of digital content is copyrighted, and if so, who owns it. In this regard, it is important to identify (detect) whether a digital work is copyrighted or not.
Among digital content, soundtracks (usually in the form of audio files) are one type of content that is easily to be illegally reproduced. Owing to the advanced techniques in audio compression, music soundtracks are usually distributed over the Internet in compressed form rather than in uncompressed form. Therefore, any approach for copyright detection must be able to deal with both compressed and uncompressed audio files.
A typical method to attach the copyright information to a piece of music is by embedding watermarks . Though effective, this method has some limitations, such as the watermarks must be embedded into the source soundtracks before release. Therefore, it is not possible to identify the rights owner of a piece of music without watermarks. Another concern is that the embedding process usually introduces distortion. Thus, the quality of the embedded audio may be degraded.
In addition to the watermarking technique, it is also possible to identify rights owner by comparison. For example, if an unknown soundtrack is very similar to a soundtrack owned by a company, then the unknown soundtrack is highly likely copyrighted in that company's name. This type of approach is especially suitable for audio files because, in practice, tremendous amount of records currently available do not embed watermarks or any kind of copyright information.
When comparing a piece of music with a music database, the comparison may be accomplished based on the melody (i.e., musical notes) of the music . For this type of comparison, however, if two persons sing the same song, these two works will be recognized as the same one. Since different artists may perform the same song, known as the cover version, a comparison based on melody cannot solve this problem.
Another type of comparison is based on the waveform of the music. This technique is also known as music identification. In this case, the same song performed by different artists generally does not have the same waveforms, and therefore they can be correctly identified. Though conceptually simple, it is not plausible to directly compare PCM samples of two pieces of music because it would take too much time for the comparison. For example, a typical compact disc (CD) has about 600M bytes of PCM samples to store about ten songs. If a database contains 10,000 different songs, then the PCM samples occupy about 600G bytes of space. A piece of unknown music with duration of ten seconds has about 880k bytes of data. It is obvious that it requires a huge amount of computation to sequentially compare the 880k bytes of data with the 600G bytes of data in the database. Therefore, dimension-reduced representations of the PCM waveforms, known as fingerprints, are used in comparison. Among the fingerprints, most of them are defined by individual companies or groups. Some of them are briefly explained in the following.
Researchers in Google develop a fingerprinting scheme for audio called Waveprint  based on wavelets. With the aid of wavelets, the fingerprint is invariant to timescale change. In other words, whether the audio piece is played faster or slower than the normal speed, the fingerprint is unchanged. The fingerprint of a piece of 4-minute music is around 64k bytes, equivalently 2133 bits per second.
Shazam  is a company (and service) dedicated for music identification. Its database contains around eleven million soundtracks. As described in , the fingerprints used are sets of triplets based on spectrogram peaks. For example, if (t1, f1) and (t2, f2) are two peaks at time t1 and t2 and frequency f1 and f2, then the triplet ((t2 − t1), f1, (f2 − f1)) is a feature. Based on the realization of , the fingerprint in this scheme uses 400 bits per second.
Researchers in Philips also propose a fingerprinting scheme . The computation of the fingerprints includes: framing, windowing (von Hann window), FFT (fast fourier transform), band decision, energy computation, and then quantization (into binary). In the typical setting, one second of audio has around 2,730 bits of fingerprint.
Microsoft's Robust Audio Recognition Engine (RARE)  divides the incoming audio into overlapping frames. Each frame is converted to spectral domain by MCLT (modulated complex lapped transform). The spectral values are applied to two layers of OPCA (oriented principle component analysis) to reduce the dimensionality of the spectral data. For this method, 344 features (11,008 bits if one feature is stored in 4 bytes in a floating point) are obtained per second.
In addition to the above methods, there are actually many more different types of audio fingerprinting schemes available, such as Music Brainz , Audible Magic , and Gracenote's MusicID . According to , there are more than ten different audio fingerprinting schemes available.
Since there are vast amount of different fingerprinting schemes available, some researchers then conducted experiments to compare the relative performance among some of them. The results show that, if the schemes use the same number of bits to represent fingerprints, they have comparable performance . Therefore, the selection of the fingerprinting schemes should also consider other factors (such as interoperability to be addressed below) rather than merely the minor performance difference.
With the ever-increasing amount of multimedia content over the Internet and in the multimedia databases, it is an important task to exchange multimedia content. To respond the public demands, ISO's (International Standardization Organization) working group developed MPEG-7 standard [15, 16]. In the audio part of the standard , a high-level tool is developed for audio identification called audio signature description scheme. The fingerprints used in the scheme are called audio signature descriptors, and they have good identification accuracy [18, 19]. In the following, we interchangeably use descriptors and fingerprints without distinction.
Although proprietary audio fingerprints have excellent identification performance, the MPEG-7 audio descriptors offer some advantages. First, being an international standard ensures the open and fair use (subject to license fee) of the technology. Second, such an international standard makes the interoperability possible. For example, if a mobile phone installs an application program to convert a piece of recorded audio to MPEG-7 descriptors, the descriptors can be sent to any website accepting the descriptors. On the other hand, it is not possible to send proprietary fingerprints used in one company to database systems owned by competitors. Third, different companies may share or exchange their audio descriptors (fingerprints) in their databases without any difficulties. In the current situation, each company has to compute fingerprints for newly released albums. With the use of MPEG-7 descriptors, the redundant efforts of computing fingerprints can be minimized.
Although a music identification system based on audio fingerprints has several applications , we concentrate on the issue of detecting if a piece of circulated music is highly similar to a copyrighted work or not. In a typical case, the similarity is measured by a distance metric. If the distance is shorter than a threshold, the two pieces of music are considered as similar. Although it is not trivial to determine a suitable threshold , this problem is not fully studied. For example,  does not indicate any approach to determine the threshold. In addition, the audio files to be compared may be very large; therefore it is very important to reduce the comparison time while maintaining high identification accuracy. Since not many papers address these two issues based on MPEG-7 descriptors, we report in this paper our approaches and experimental results.
This paper is organized as follows. Section 2 is an overview of the MPEG-7 audio signature descriptors. Section 3 is the system model for music identification. Section 4 describes the dimensionality reduction method used in the paper. Section 5 is the proposed strategy to determine the threshold. Section 6 covers the experiments and results. Section 7 is the conclusion.
Part 4  of the MPEG-7 standard includes low-level and high-level descriptors for various applications. Low-level descriptors are derived from the temporal and spectral characteristics of the waveform, whereas the high-level descriptors are constructed based on low-level descriptors. In this section, we will briefly describe the descriptors related to music identification.
There are 17 low-level audio descriptors defined in the standard. All of these low-level descriptors are derived from the waveform of the music. They can be divided into six different categories: basic, basic spectral, signal parameter, temporal timbral, spectral timbral, and spectral basis representation. These low-level features may be directly used or may serve as the basics for constructing high-level descriptors.
Based on low-level descriptors, MPEG-7 audio standard also defines high-level description schemes for various applications. These include audio signature description, instrument timbre description, general sound recognition and indexing description, and spoken content description. The audio signature descriptors, one type of fingerprints, are used for music identification.
Although low-level descriptors in MPEG-7 audio may also be used to identify music, it is shown that the audio signature descriptors provide better identification performance [18, 19]. Therefore, the audio signature descriptors are adopted in the proposed system.
The MPEG-7 audio signature descriptors are computed as follows.
With the computational steps of the audio signature descriptor, we may compute the number of descriptors in a piece of 15-second music. In the time domain, about (15−0.09)/0.03 = 497 windows are used to cover the 15-second signal because the hop size is 30ms. Since the scaling ratio is set to 16, there are 497/16 ≈ 31 values per subband. In the spectral domain, since there are four subbands per octave and we use six octaves (250Hz ~ 16kHz), there are totally 24 subbands in the spectral domain. Thus, the 15-second signal produces 24 × 31 = 744 mean values and other 744 variance values. The descriptors are arranged in a matrix form, as shown in Figure 1.
Since the mean values are sufficient for the identification purpose , we will not consider the variance descriptors in the following. The obtained descriptors are referred to as high-resolution (or full-dimensional) descriptors. According to , if a piece of query music is to be compared with a reference piece, it should be done with a sliding comparison, as shown in Figure 2. In the figure, one segment of line represents descriptors in a piece of music arranged in one dimensional structure. Using the representation in Figure 1, the query segment is arranged as
Since only descriptors from the same subband are to be compared, the hop size between two query segments in Figure 2 is 24 descriptors, or 0.03 · 16 = 0.48 second. In other words, one descriptor (in a subband) represents 0.48 second of audio samples. The smallest Euclidean distance obtained during the sliding operation is recorded as the distance between the query input and the reference music. For example, if the query Q is to be compared with reference A, then we may obtain the Euclidean distance EA,B(k) for the k-th sliding comparison as
The recorded distance between these two pieces of music is
Although using audio signature descriptors greatly reduces the computational burden for comparison, the required computation using (3) is still very large. According to , suppose that a database contains 1,000 audio files, with each one having duration of 30 seconds, and the music to be identified has a duration of 15 seconds, then 47, 616, 000 arithmetic operations are required. Therefore, it is beneficial to further reduce the number of comparison, which can be accomplished by employing a multiresolution strategy.
As discussed previously, the computational cost is still very high even if we use the MPEG-7 descriptors. Therefore, we will use the multiresolution strategy to reduce the computational complexity. To do so, in addition to the above-mentioned high-resolution descriptors, we also need to generate low-resolution descriptors (see also Section 4). Therefore, the music database contains high-resolution and low-resolution descriptors for each soundtrack to be identified. This step can be accomplished during the setup of the database. In addition, a training process is conducted to find a distance threshold to determine whether the query input is in the database or not. Section 5 has a detailed description about the training process.
Once the database is constructed, as shown in Figure 3, the music identification procedure consists of the following steps:
During the low-resolution comparison, it is also possible to use an existing algorithm to reduce the comparison time. For example, we may arrange the descriptors in k-d tree (k-dimensional binary search tree) [22, 23] structure to reduce the search time. However, using the k-d tree structure implies that every segment of music in the database has the same number of descriptors, or same duration. Note that the unit to be compared is one segment, and one soundtrack can be divided into many overlapped segments, as shown in Figure 2. Therefore, if the duration of the segment is set to 15 seconds, then the query music must also have a duration of 15 seconds to use the system. In a practical situation, if the query input is longer than 15 seconds, then only 15 seconds of the music is used for computing the descriptors and comparison.
This section briefly explains the dimensionality reduction technique used in the experiments. Although a block-average method is proposed in  for this purpose, we use a different technique in this paper.
Conceptually we may increase the scaling ratio (given in Section 2) to reduce the number of descriptors during computing them. For example, by increasing the scaling ratio from 16 to 256, the number of descriptors is reduced by 16 times. Unfortunately, this approach does not yield satisfactory results because of insufficient time resolution. Recall that the descriptors are derived based on the windowed waveform. Therefore, if two segments of the soundtrack are not highly similar, their corresponding descriptors usually have large differences. With a scaling ratio of 256, each descriptor represents around 7.7 seconds of audio samples. Therefore, unless the query input also starts at a point very close to the segment boundary of a soundtrack, a comparison based on these descriptors may fail. As illustrated in Figure 4, descriptors of an,k (or an+1,k) and qn,k are quite different, and therefore the descriptor-based identification scheme cannot identify the query input. Therefore, a suitable time resolution, for example, 0.48 second, should be maintained.
In contrast to reduce the number of descriptors by increasing the scaling ratio, we may reduce them in each temporal-spectral block (i.e., the matrix in Figure 1) representing a segment of (15-second) music. However, to maintain a high time resolution, successive temporal-spectral block should have a small time difference, as shown in Figure 5. With this arrangement, we achieve both high identification rate and low comparison complexity.
For each temporal-spectral block, we use PCA (principal component analysis) [21, 24] to reduce the number of descriptors. Since it is difficult to directly use PCA to obtain good low-resolution descriptors , we use an alternative approach. Its basic idea is to partition the entire block into four subblocks, as shown in Figure 6, and then use PCA to reduce the number of descriptors of each subblock into two values (descriptors), called low-resolution descriptors. Also, high-frequency descriptors in the original temporal-spectral block are all discarded because they are susceptible to noise . With this arrangement, totally eight descriptors are used to represent a segment of 15-second music. Note that the actual duration of a segment used in the experiments is 14.04 seconds (though we still say 15 seconds) because some audio samples are lost after MP-3 compression and decompression in the experiments.
We now describe how to calculate low-resolution descriptors. Since using PCA for dimensionality reduction is a well-known approach, we only describe how to construct the covariance matrix for PCA computation and omit the computation of finding principal components. Suppose that there are N segments of music pieces in the database, with their descriptor matrices denoted as A(1) to A(N). To simplify the argument, we consider the subblock matrix A1,1(k) (referring to Figure 6) in the following. Other subblocks can be computed by the same procedure. First, collect all subblocks from A1,1(k) to form a big matrix as follows:
Then, we may consider column vectors of matrix B as bi vectors. Having the data vectors bi, the covariance matrix and, subsequently, the principal components can be found. By keeping only two principal components, we can obtain ci = [ci,1ci,2] from bi. Since there are eight column vectors in a subblock, descriptors in A1,1(k) are reduced to eight ci(8k − 7 ≤ i ≤ 8k) vectors. Next, we can rearrange the obtained ci vectors as
Again, by treating each row of matrix C as the data to be processed by PCA, we can compute the covariance matrix, and finally reduce each column vector to one value. Since there are two columns originally from one subblock, there are totally two (low-resolution) descriptors per subblock. Equivalently, a segment of 15-second music is represented by eight descriptors. During the computation, the principal components obtained in the first and the second steps should be stored in the database. When a query input (in the form of high-resolution descriptors) is supplied, these components are used to obtain the low-resolution descriptors for the input.
Since, in practice, the database cannot collect all music soundtracks in the world, we have to have a strategy to determine if the query input is actually in the database or not. In our case, the decision is accomplished with the aid of a threshold. If the shortest distances between the input and the candidates are greater than a threshold, then the query input is not in the database; otherwise, it is in the database. Though the concept is simple, it is not trivial to determine the threshold .
Recall that when a query input is applied to the proposed system, the system computes the Euclidean distances between the low-resolution descriptors of the query input and those in the database. Accordingly, m (say, m = 20) segments of soundtracks from distinct titles are selected based on the computed distances. Next, the Euclidean distances between the query and the selected segments are computed again using full-resolution descriptors. By sorting the distances from small to large, the system creates a candidate list for the query input.
To compute the previously mentioned threshold, we examine two different methods. In the first method (method I), we define the “first” distance d1 as the (full-resolution) distance associated with the first candidate in the list. Similarly, the “second” distance d2 is the distance associated with the second candidate, as shown in Figure 7. The method to compute the first and second distances is used both in training and in identification. For the training phase, assume that there are NT soundtracks used. For a segment from a soundtrack indexed n, let the first distance be d1(n) and the second distance be d2(n). The threshold TA is then computed as
During the identification phase, the first distance for the query input is computed. If the first distance is smaller than the threshold TA, the query input is determined to be highly similar to the top soundtrack title in the candidate list. Otherwise, the query input is not inside the database.
In addition to method I, we also propose an alternative method (method II) to compute the first and second distances and the threshold. Referring to Figure 8, the first distance d1′(n) of method II for a training segment in soundtrack n is
where dm(n) is the distance associated with the mth candidate in the list, and M is a constant. Based on our experiment, M = 10 is sufficient. By the same arrangement, the second distance d2′(n) is calculated as
Similar to (7), the threshold TA′ is determined as
During identification phase, if the computed distance d1′ is greater thanTA′, the query input is determined as not in the database. Otherwise, the query input is highly similar to the top soundtrack title in the candidate list.
Before conducting the experiments, we collect 750 soundtracks from many CD titles for constructing the database and for identification. In the experiments, 30 seconds of music is excerpted from the soundtracks as the reference items to be identified. Then, low- and full-resolution descriptors of the reference items are calculated and stored in the database. We also randomly excerpt 15-second query items from the reference items. Note that a query item may start from any sample on the first half of the reference. To test the identification accuracy for compressed audio, the 15-second query inputs are also encoded and then decoded with an MP-3 coder with bitrates of 192k and 96k, respectively. However, the database only contains descriptors from the uncompressed items. In addition, the principal components are also obtained using uncompressed items.
Several experiments are conducted to examine the performance of the proposed system. The first experiment checks the identification accuracy using high-resolution descriptors. The second one compares the relative identification accuracy between the proposed method (in Section 4.2) and the method given in . The third experiment compares the search speed by using the linear search and k-d tree search. The next experiment intends to determine a suitable value of M used in (8) and (9). Having the value of M, we report the identification accuracy using a multiresolution strategy in experiment five. In this experiment, both method I and method II are examined. To have a complete comparison between method I and II, we also report the results in terms of the ROC (receiver operating characteristics)  curve and the DET-like (detection error tradeoff)  curve but without logarithm.
The first experiment is to examine the identification accuracy using the high-resolution MPEG-7 audio signature descriptors (without reduction). To evaluate the influences of high-frequency descriptors, we also examine the accuracies with and without using high-frequency (greater than 2.5kHz) descriptors. The results are given in Table 2. In the table, “match in first one” is the rate that the query input is correctly identified with the shortest distance among all reference items. Similarly, “match in first 15” is the rate that the query is correctly identified within a list of 15 reference items sorted by distance. The results indicate that high-resolution descriptors have very good identification accuracy. Also, for uncompressed or 192k MP-3 items, whether using high-frequency descriptors does not affect the identification accuracy. However, for 96k MP-3 items, discarding high-frequency descriptors greatly improves the accuracy.
The second experiment compares the relative identification accuracies of the proposed dimensionality reduction technique and the averaging method proposed in . In this experiment, eight descriptors are used to represent one segment of music (15 seconds). The results are shown in Table 3. For this experiment, the accuracy rate in “match in first 15” is more important than the figure in “match in first one” because the low-resolution descriptors are used to produce a candidate list. Therefore, it is acceptable as long as the correct item is within the list, as the high-resolution descriptors are to be used in the second stage. In this regard, the proposed approach is slightly better than the averaging method proposed in  for 96k MP-3 query inputs.
The third experiment compares the time to generate a list of 20 candidates by using k-d tree search and linear (exhausted) search for low-resolution descriptors. The computing time is obtained by averaging the required time in ten trials. For each trial, 500 query items are randomly selected among the 750 available items. The experimental platform is a personal computer with a 1.6GHz Pentium CPU and 1.5GB memory. The program is written in C with a Borland C++ compiler. The results are shown in Table 4. The results show that we can increase the search time by 10-fold if the low-resolution descriptors are embedded in the k-d tree structure.
The fourth experiment is to determine the number M in method II. Recall that the thresholds in both methods are computed based on high-resolution descriptors. To determine the optimal value of M, we vary this value from one to eighteen in (8) and (9) to compute the threshold TA′ and then to perform identification. Note that method II is degenerated to method I if M = 1. In this experiment, 96k MP-3-coded items are used as the query. In addition, we deliberately keep the high-frequency descriptors in computing the distances because keeping them makes it easier to examine the influence of M versus identification rate (defined in experiment five). The experimental results are given in Figure 9. The results show that when M is equal or greater than 10, the identification rate remains almost constant. Therefore, M = 10 is sufficient for method II. This value is then used in experiment five.
The fifth experiment compares the identification rates between method I and method II. The thresholds are calculated using (7) and (10), respectively. In this experiment, 375 of the 750 15-second items are randomly chosen for training (i.e., to obtain the thresholds). The rest of 375 items are for query. Note that all of the items are excerpted from the reference items in the database. In addition, we test other 500 query items (also with duration of 15sec) not excerpted from database references.
Since some query inputs are not in the database, we also need to consider the situations of falsely identifying an outside database item as one inside the database, and vice versa. Suppose that Td query items are actually inside the database and To query items are not inside the database. Assume that the identification system correctly identifies Nd inside-database query item, erroneously rejects Nr inside-database query items, and erroneously accepts Na outside database query items. Then, the IDR (identification rate), FAR (false-accept rate), and FRR (false-rejection rate) are computed as follows:
In addition, the accuracy of the system is computed as
where No is the number of query items correctly identified as outside-database items and can be computed as No = To − Na.
In this experiment, the low-resolution descriptors are used to find 20 candidates. Next, high-resolution descriptors are used to find the best-matched one in the list. If the distance associated with the best-match is greater than the threshold, the query input is determined as not in the database. Otherwise, the query input is identified as the best-matched one. In this experiment, high-frequency descriptors are kept in the comparison. With the use of (11) to (14), the average values after four trials are given in Table 5. From the results we know that the system does not perform well for identifying 96k MP-3 query items. This is mainly due to high distortion in high-frequency descriptors. A similar phenomenon can also be observed in Table 2. Later on, we will repeat this experiment but discard high-frequency descriptors.
To further examine the performance differences between method I and method II, we again use 96k MP-3 items as the query inputs. But this time we vary the threshold values (for both methods) and then record the IDR, FAR, and FRR values. The results are presented as ROC curves  and DET-like curves  (but without logarithm), as shown in Figures Figures1010 and and11,11, respectively. The ROC curve plots FAR versus TPR (true positive rate), which is defined as
Conceptually, a better classifier should have a higher TPR for a fixed FAR. Therefore, a better classifier should have an ROC curve closer to the left-upper corner. With this interpretation, from Figure 10 we know that method II is a better classifier.
In this experiment, we also show DET-like curves. A “true” DET curve plots FAR versus FRR using logarithm scales. In our case, however, we do not use logarithm scales to exhibit the similarity between ROC and DET curves. Again, conceptually we know that a better classifier should have a lower FRR over a fixed FAR. Thus, a better classifier should have a curve closer to the left-bottom corner. By examining Figure 11, we again confirm that method II is a better method.
Since the high-frequency descriptors actually affect the identification rate at low bitrates, we again repeat this experiment without using high-frequency descriptors. The results are given in Table 6. By comparing the results in Tables Tables55 and and6,6, we know that removing high-frequency descriptors slightly reduces the accuracy of uncompressed items. However, it greatly improves the accuracy for 96k MP-3 items. Since method II has an accuracy of at least 99.4% in all cases, it is highly plausible to identify copyrighted audio materials using this method.
In this paper, we propose a system using a multiresolution strategy to identify whether a piece of unknown music is identical to one of the pieces in the database. In the system, high-resolution descriptors are MPEG-7 audio signature descriptors, and the low-resolution descriptors are obtained from high-resolution descriptors with the aid of PCA to reduce their dimensionality. Experimental results show that low-resolution descriptors still have high identification accuracies. To reduce the time to generate the candidate list, we use the k-d tree structure to store low-resolution descriptors. Experimental results show that using k-d tree structure increases the search time by ten-folds. Since not every piece of query input is within the database, we also proposed two methods to determine the distance thresholds. Experimental results show that the proposed method II provides an accuracy of 99.4%. Therefore, the proposed system can be used in real applications, such as identifying copyrighted audio files circulated over the Internet. As it can be easily extended to operate with multiple computers, the proposed system is a plausible starting point to construct a large, operable database.
This work was supported in part by National Science Council of Taiwan through Grants NSC 94-2213-E-027-042 and 99-2221-E-027-097.