With the SOPA (Stop Online Piracy Act) bill [1] proposed in 2011, the protection of copyrighted intellectual property, such as digital content, was once again brought to public attention. Despite the controversy surrounding the SOPA bill, it is commonly agreed that copyrighted digital content should be protected. The first step toward protecting copyrighted content, however, is to determine whether a piece of digital content is copyrighted and, if so, who owns it. It is therefore important to detect whether a digital work is copyrighted.
Among digital content, soundtracks (usually in the form of audio files) are particularly easy to reproduce illegally. Owing to advances in audio compression, music soundtracks are usually distributed over the Internet in compressed rather than uncompressed form. Any approach to copyright detection must therefore be able to handle both compressed and uncompressed audio files.
A typical method of attaching copyright information to a piece of music is to embed watermarks [2]. Though effective, this method has limitations: the watermarks must be embedded into the source soundtracks before release, so the rights owner of a piece of music without watermarks cannot be identified. Another concern is that the embedding process usually introduces distortion, which may degrade the quality of the watermarked audio.
In addition to watermarking, it is also possible to identify the rights owner by comparison. For example, if an unknown soundtrack is very similar to a soundtrack owned by a company, the unknown soundtrack is highly likely to be copyrighted in that company's name. This type of approach is especially suitable for audio because, in practice, the tremendous number of recordings currently in circulation carry no embedded watermarks or any other kind of copyright information.
When comparing a piece of music against a music database, the comparison may be based on the melody (i.e., the musical notes) of the music [3]. With this type of comparison, however, two performances of the same song by different singers are recognized as the same work. Since different artists may perform the same song (known as a cover version), a comparison based on melody cannot tell such performances apart.
Another type of comparison is based on the waveform of the music; this technique is also known as music identification. In this case, the same song performed by different artists generally does not have the same waveform, so the performances can be correctly distinguished. Though conceptually simple, directly comparing the PCM samples of two pieces of music is impractical because the comparison would take far too long. For example, a typical compact disc (CD) holds about 600 MB of PCM samples for roughly ten songs. If a database contains 10,000 different songs, the PCM samples occupy about 600 GB of space. A piece of unknown music ten seconds long amounts to about 880 kB of data, and sequentially comparing those 880 kB against the 600 GB in the database requires a huge amount of computation. Therefore, dimension-reduced representations of the PCM waveforms, known as fingerprints, are used for comparison. Most fingerprints are defined by individual companies or groups; some of them are briefly described below.
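The storage figures above can be checked with a few lines of arithmetic (assuming CD-quality 44.1 kHz, 16-bit, single-channel PCM for the ten-second excerpt, which reproduces the quoted figure of about 880 kB):

```python
CD_BYTES = 600e6               # ~600 MB of PCM samples per CD (~ten songs)
N_SONGS = 10_000               # database size used in the example
db_bytes = (N_SONGS / 10) * CD_BYTES    # one CD's worth per ten songs

SAMPLE_RATE = 44_100           # samples per second (CD quality)
BYTES_PER_SAMPLE = 2           # 16-bit PCM, single channel
query_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * 10   # ten-second excerpt

print(f"database: {db_bytes / 1e9:.0f} GB")   # database: 600 GB
print(f"query: {query_bytes / 1e3:.0f} kB")   # query: 882 kB
```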
Researchers at Google developed a wavelet-based audio fingerprinting scheme called Waveprint [4]. With the aid of wavelets, the fingerprint is invariant to timescale changes; in other words, whether the audio is played faster or slower than normal speed, the fingerprint is unchanged. The fingerprint of a four-minute piece of music is around 64 kB, equivalent to 2,133 bits per second.
Shazam [5] is a company (and service) dedicated to music identification. Its database contains around eleven million soundtracks. As described in [6], the fingerprints used are sets of triplets derived from spectrogram peaks. For example, if (t1, f1) and (t2, f2) are two peaks at times t1 and t2 and frequencies f1 and f2, then the triplet (f1, f2, t2 - t1) is a feature. Based on the realization in [7], the fingerprint in this scheme uses 400 bits per second.
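The peak-pairing idea behind such triplets can be sketched as follows. This is a toy illustration rather than the actual implementation: the function names, the naive per-frame peak picking, and the pairing window are all our own illustrative choices.

```python
import numpy as np

def spectrogram_peaks(x, frame_len=1024, hop=512, n_peaks=3):
    """Pick the strongest FFT bins in each frame as (time, frequency)
    peaks -- a toy stand-in for robust spectrogram peak picking."""
    peaks = []
    for t, start in enumerate(range(0, len(x) - frame_len, hop)):
        mag = np.abs(np.fft.rfft(x[start:start + frame_len]))
        for f in np.argsort(mag)[-n_peaks:]:
            peaks.append((t, int(f)))
    return peaks

def triplet_features(peaks, max_dt=10):
    """Pair each peak (t1, f1) with nearby later peaks (t2, f2) and
    emit (f1, f2, t2 - t1) triplets as features."""
    feats = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            if 0 < t2 - t1 <= max_dt:
                feats.add((f1, f2, t2 - t1))
    return feats
```

Because each feature records only frequencies and a time difference, the features are insensitive to where in the recording the excerpt was taken.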
Researchers at Philips also proposed a fingerprinting scheme [8]. The computation of the fingerprints involves framing, windowing (von Hann window), an FFT (fast Fourier transform), band division, energy computation, and finally quantization into binary values. In the typical setting, one second of audio yields around 2,730 bits of fingerprint.
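These steps can be sketched in a few lines. This is a simplified illustration of the pipeline, not the actual Philips algorithm: linearly spaced bands stand in for the scheme's logarithmically spaced bands, the frame and hop sizes are illustrative, and the final bits are taken as the sign of band-wise energy differences across consecutive frames, in the spirit of the scheme's binary quantization.

```python
import numpy as np

def subfingerprints(x, frame_len=2048, hop=64, n_bands=33):
    """Simplified sketch: Hann window -> FFT -> band energies ->
    binary quantization.  Bands are linearly spaced here, whereas
    the actual scheme uses logarithmic spacing."""
    window = np.hanning(frame_len)
    energies = []
    for start in range(0, len(x) - frame_len, hop):
        spec = np.abs(np.fft.rfft(x[start:start + frame_len] * window)) ** 2
        bands = np.array_split(spec, n_bands)   # contiguous frequency bands
        energies.append([b.sum() for b in bands])
    e = np.asarray(energies)
    # One bit per band pair: sign of the band-difference change across
    # consecutive frames, giving (n_bands - 1) bits per frame.
    d = np.diff(e, axis=1)
    return (d[1:] - d[:-1]) > 0
```

With 33 bands this gives 32 bits per frame; the quoted 2,730 bits per second then corresponds to roughly 85 highly overlapping frames per second.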
Microsoft's Robust Audio Recognition Engine (RARE) [9] divides the incoming audio into overlapping frames. Each frame is converted to the spectral domain by the MCLT (modulated complex lapped transform). The spectral values are passed through two layers of OPCA (oriented principal component analysis) to reduce the dimensionality of the spectral data. This method produces 344 features per second (11,008 bits if each feature is stored as a 4-byte floating-point number).
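The two-layer dimensionality reduction can be illustrated with a small sketch. Plain PCA is used here as a stand-in for OPCA, which additionally accounts for known distortion directions, and all dimensions below are illustrative rather than RARE's actual parameters.

```python
import numpy as np

def pca_layer(frames, k):
    """One linear projection layer: project mean-centred rows onto the
    top-k principal directions.  Plain PCA stands in here for the
    distortion-aware OPCA used by RARE."""
    centred = frames - frames.mean(axis=0)
    # Right singular vectors of the data matrix are the principal directions.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T

# Two stacked layers, mirroring the RARE pipeline: each spectral frame is
# reduced once, then groups of consecutive outputs are concatenated and
# reduced again.
rng = np.random.default_rng(0)
frames = rng.normal(size=(128, 1024))   # stand-in for MCLT magnitude frames
layer1 = pca_layer(frames, 64)          # per-frame reduction -> (128, 64)
grouped = layer1.reshape(32, 4 * 64)    # concatenate 4 consecutive outputs
features = pca_layer(grouped, 16)       # second reduction -> (32, 16)
```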
In addition to the above methods, many other audio fingerprinting schemes are available, such as MusicBrainz [10], Audible Magic [11], and Gracenote's MusicID [12]. According to [13], more than ten different audio fingerprinting schemes exist.
Given the large number of fingerprinting schemes available, some researchers have conducted experiments to compare the relative performance of several of them. The results show that, if the schemes use the same number of bits to represent fingerprints, they have comparable performance [14]. Therefore, the choice of fingerprinting scheme should also take other factors into account (such as interoperability, addressed below) rather than only minor performance differences.
With the ever-increasing amount of multimedia content on the Internet and in multimedia databases, exchanging multimedia content has become an important task. In response to public demand, a working group of the ISO (International Organization for Standardization) developed the MPEG-7 standard [15]. The audio part of the standard [17] provides a high-level tool for audio identification called the audio signature description scheme. The fingerprints used in the scheme are called audio signature descriptors, and they offer good identification accuracy [18]. In the following, we use the terms descriptors and fingerprints interchangeably.
Although proprietary audio fingerprints offer excellent identification performance, the MPEG-7 audio descriptors have several advantages. First, being an international standard ensures open and fair use of the technology (subject to license fees). Second, an international standard makes interoperability possible. For example, if a mobile phone runs an application that converts a piece of recorded audio into MPEG-7 descriptors, the descriptors can be sent to any website that accepts them; by contrast, proprietary fingerprints used by one company cannot be sent to database systems owned by its competitors. Third, different companies may share or exchange the audio descriptors (fingerprints) in their databases without difficulty. At present, each company has to compute fingerprints for newly released albums on its own; with MPEG-7 descriptors, this redundant effort can be minimized.
Although a music identification system based on audio fingerprints has several applications [8], we concentrate on detecting whether a piece of circulating music is highly similar to a copyrighted work. Typically, the similarity is measured by a distance metric: if the distance is below a threshold, the two pieces of music are considered similar. Although determining a suitable threshold is nontrivial [20], the problem has not been fully studied; for example, [20] does not indicate any approach for determining the threshold. In addition, the audio files to be compared may be very large, so it is important to reduce the comparison time while maintaining high identification accuracy. Since few papers address these two issues in the context of MPEG-7 descriptors, this paper reports our approaches and experimental results.
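For binary fingerprints, the distance-versus-threshold decision is typically based on the bit error rate (normalized Hamming distance). A minimal sketch follows; the threshold value of 0.35 is purely illustrative, and choosing it well is exactly the open problem discussed above.

```python
import numpy as np

def is_match(fp_query, fp_ref, threshold=0.35):
    """Declare two equal-length binary fingerprints similar when their
    bit error rate (normalised Hamming distance) is below a threshold."""
    ber = np.mean(fp_query != fp_ref)
    return bool(ber < threshold)

rng = np.random.default_rng(1)
ref = rng.integers(0, 2, size=1024, dtype=np.uint8)
distorted = ref ^ (rng.random(1024) < 0.10).astype(np.uint8)  # ~10% bits flipped
unrelated = rng.integers(0, 2, size=1024, dtype=np.uint8)

print(is_match(ref, distorted))  # True: distortion keeps the BER below 0.35
print(is_match(ref, unrelated))  # False: unrelated fingerprints differ by ~50%
```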
This paper is organized as follows. Section 2 gives an overview of the MPEG-7 audio signature descriptors. Section 3 presents the system model for music identification. Section 4 describes the dimensionality reduction method used in this paper. Section 5 presents the proposed strategy for determining the threshold. Section 6 covers the experiments and results. Section 7 concludes the paper.