Traditional approaches mainly involve two steps: comparison of record pairs to generate comparison vectors, and classification of record pairs into three sets, true match, possible match, and true non-match, based on the comparison vectors [9]. A probabilistic model was introduced by Fellegi and Sunter in 1969 [11], in which the comparison considers only match/non-match values. Many studies have followed this probabilistic model [12]. The expectation maximization (EM) algorithm has been used to obtain better decision rules once the maximum likelihood is reached [15]. While a relational probability model was used for citation matching [16], conditional models have been proposed to capture dependency features in context, for instance, using conditional random fields [17]. Based on such conditional models, deduplication has been studied in relational databases of different data types, e.g., where the information of an entire record is kept in different data tables [19]. The probabilistic model can also support categorical comparison values [14]. Several continuous-value comparisons have also been proposed to deal with typographic errors [21], such as the Hamming distance, edit distance [22], Soundex code and Metaphone (http://www.sound-ex.com/alternative_zu_soundex), Jaro-Winkler comparison [23], q-grams (or n-grams) [24], longest common substring [25], and so on.
Several software tools and packages have also been developed to solve the record linkage problem. FEBRL is an excellent example: it employs many existing techniques and can handle datasets with several hundred thousand records, but without a clear accuracy and time analysis [26]. Other tools include TAILOR [29], IntelliClean [30], Merge/Purge [31], and so on. Most of these algorithms pre-process the linkage with a blocking phase, for instance, hashing by a blocking key [29]. The Sorted-Neighborhood Method (SNM) [31] limits the comparisons to a window after sorting the data and is used for blocking by FEBRL [26], TAILOR [29], and IntelliClean [30]. Canopy clustering [33] is also used as a blocking method in FEBRL [26]. However, not much work has been done to integrate multiple datasets. We propose a novel technique based on hierarchical clustering. Clustering partitions a set of observations into subsets (called clusters) such that observations in the same cluster are similar. Clustering has been applied in many fields such as data mining [34], machine learning [36], psychology [37], and even genetics [38]. Hierarchical clustering creates clusters in a tree structure (called a dendrogram) (see, e.g., [42]). Various parallel algorithms for it have also appeared (see, e.g., [46]). Given a set of n points to be clustered, agglomerative hierarchical clustering starts with n clusters, where each cluster contains a single input point and has a distance of 0. From there on, the algorithm proceeds in stages. In each stage the two closest clusters are merged into one, and the distance of this new cluster is the distance between the merged clusters. The merging continues until we are left with one cluster (containing all the input points). We can cut the tree at any desired threshold level. Note that each cluster is associated with a distance. All the root clusters at the threshold level are output by the algorithm. Any hierarchical clustering algorithm can be classified based on the distance measure (linkage) used between clusters. In this paper we use single linkage clustering. It has been shown that results are similar with different linkages in hierarchical clustering [49], and single linkage has an advantage in time complexity [47].
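To illustrate this procedure, the following is a minimal sketch (not the implementation evaluated in this paper) of single linkage clustering of a few toy one-dimensional points, with the dendrogram cut at a threshold, using SciPy's linkage and fcluster routines; the points and the threshold are assumptions chosen purely for illustration.

```python
# A minimal sketch (not this paper's implementation): single linkage
# clustering of a few toy 1-D points, with the dendrogram cut at a threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[0.0], [0.4], [0.5], [5.0], [5.2]])  # toy data (assumption)
threshold = 1.0                                          # toy threshold (assumption)

dists = pdist(points)                         # pairwise distances (condensed form)
dendrogram = linkage(dists, method="single")  # build the dendrogram

# Cut the dendrogram at the threshold: points merged at a distance <= threshold
# end up with the same cluster label.
labels = fcluster(dendrogram, t=threshold, criterion="distance")
print(labels)  # e.g., [1 1 1 2 2]: two clusters at this threshold
```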
Our approach to data integration is to treat each record in each data set as a point in a multi-dimensional space. The dimension of this space is determined by the number of attributes in the records. We then define an appropriate distance measure for these points. Finally, we cluster these points, and the resulting clusters lead us to the identification of common records.
Record distance calculation
In this paper, we consider attributes such as the first name, last name, gender, zip code, etc. However, the techniques described are generic and should be applicable to any set of attributes. We define the distance between two records in a number of ways by appropriately combining attribute distances. We use RD(R1,R2) to denote the distance between two records R1 and R2. Given R1 and R2 from any of the data sets, the common attributes of the two records are used to calculate RD(R1,R2).
We consider two kinds of input errors, based on typing errors and sound similarity. We use the edit distance as the basic distance measure for typing errors, while Metaphone is used to deal with errors based on sound similarity. The edit distance (also known as the Levenshtein distance) between two strings is the minimum number of operations, such as deletions, insertions, and exchanges, that are required to transform one string into the other [22]. One lower bound on the edit distance between two strings is the difference in the lengths of the two strings. One could use dynamic programming to compute the edit distance between two strings of lengths m and n, respectively, in O(mn) time (see, e.g., [50]). With the Four-Russians speedup, which partitions the dynamic programming table into blocks and looks up the distance of each block from a pre-computed block offset function, the edit distance can be computed in O(l^2/log l) time when an appropriate block size is chosen (where l is the length of the two strings) [51]. In this paper, we assume a penalty of 1 for each of the operations insertion, deletion, and exchange. Given two attributes A1 and A2, the edit distance of (A1,A2) is the attribute distance based on edit distance. We incorporate into the edit distance several methods to deal with common typing errors: 1) reversal of the first name and the last name (reversal distance): in this case we use the smaller of the distance between the names and the distance between one name and the reversal of the other; 2) the use of nicknames (nickname distance): in this case we look up a nickname table (http://www.cc.kyoto-su.ac.jp/~trobb/nicklist.html) and use the smallest edit distance; and 3) truncation of attributes (truncation distance): only the first few letters are considered. Given two names, the name distance between them is defined as the smallest of the names' edit distance, reversal distance, nickname distance, and truncation distance. Errors based on pronunciation or sound similarity are another main issue when data are entered. Metaphone is a phonetic algorithm that encodes strings based on phonetic similarity and works more effectively than Soundex coding. The phonetic distance between two attributes is defined as: 1) zero, if their Metaphone codes are the same; 2) the edit distance between the Metaphone codes, otherwise.
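As a concrete illustration, the following minimal sketch combines these attribute distances. The tiny nickname table, the truncation length of 5, the case normalization, and the use of the jellyfish package's metaphone() encoder are assumptions made for illustration, and the particular way the first and last name distances are combined here is one possible choice, not necessarily the one used in this paper.

```python
# A minimal sketch of the attribute distances (assumptions: toy nickname table,
# truncation length 5, jellyfish for the Metaphone encoding).
import jellyfish  # assumed available; provides metaphone()

def edit_distance(s, t):
    """Levenshtein distance with a penalty of 1 per insertion/deletion/exchange."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}  # toy table

def name_distance(first1, last1, first2, last2, trunc=5):
    """Smallest of the edit, reversal, nickname, and truncation distances."""
    first1, last1, first2, last2 = (s.lower() for s in (first1, last1, first2, last2))
    plain = edit_distance(first1, first2) + edit_distance(last1, last2)
    reversal = edit_distance(first1, last2) + edit_distance(last1, first2)
    nick = (edit_distance(NICKNAMES.get(first1, first1), NICKNAMES.get(first2, first2))
            + edit_distance(last1, last2))
    truncated = (edit_distance(first1[:trunc], first2[:trunc])
                 + edit_distance(last1[:trunc], last2[:trunc]))
    return min(plain, reversal, nick, truncated)

def phonetic_distance(a1, a2):
    """Zero if the Metaphone codes match; otherwise the edit distance of the codes."""
    c1, c2 = jellyfish.metaphone(a1), jellyfish.metaphone(a2)
    return 0 if c1 == c2 else edit_distance(c1, c2)

print(name_distance("Bill", "Smith", "William", "Smith"))  # 0 via the nickname table
print(phonetic_distance("Catherine", "Katharine"))         # 0 if the codes agree
```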
The basic algorithm
We employ hierarchical clustering to deal with the problem of integrating multiple (more than two) datasets. Our basic algorithm (Algorithm 1) is to take every record in each data set as a point (in an appropriate space). Single linkage hierarchical clustering is applied on these points. The clustering yields a dendrogram. A relevant threshold is employed to cut this dendrogram to obtain clusters of interest. In Step 2, a threshold is needed to collect the required clusters. We provide two types of thresholds - constant and proportional thresholds. The constant threshold allows a certain number of errors occurring in record comparisons, while the proportional threshold requires the number of errors to be limited to a percentage of the total record length.
Algorithm 1: Basic integration algorithm (BIA)
Step 1. Construct the dendrogram using hierarchical clustering.
Step 2. Cut the dendrogram at the level of a threshold. Collect the root clusters at the threshold level and output.
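To make the two steps concrete, here is a minimal sketch of BIA in Python, assuming a record_distance(r1, r2) function that implements RD (for example, built from the attribute distances sketched earlier) and using SciPy's single linkage routine; it is a sketch, not the exact implementation whose performance is reported below.

```python
# A minimal sketch of BIA. Assumes record_distance(r1, r2) implements RD.
from collections import defaultdict
from itertools import combinations
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def basic_integration(records, record_distance, threshold):
    # Step 1: build the dendrogram from the condensed pairwise distance matrix.
    condensed = np.array([record_distance(a, b) for a, b in combinations(records, 2)])
    dendrogram = linkage(condensed, method="single")
    # Step 2: cut the dendrogram at the threshold and collect the root clusters.
    labels = fcluster(dendrogram, t=threshold, criterion="distance")
    clusters = defaultdict(list)
    for record, label in zip(records, labels):
        clusters[label].append(record)
    return list(clusters.values())
```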
Ideally, each cluster will only have points corresponding to the same person, i.e., all the versions of the same record will be in one cluster and this cluster will not have any other records. However in practice this may not always be the case. Thus we characterize the performance of this technique with an accuracy parameter (See RESULTS Section).
If l is the maximum length of any record, the distance between two records can be computed in O(l^2) time. Therefore, when the total number of records is n, Steps 1 and 2 need O(n^2 l^2) time and O(n^2) space.
Algorithm 1 takes too much time and space even on reasonably small data sizes. Note that the time and space requirements of Algorithm 1 are quadratic in the total number of records, and data sets of practical interest have millions of records. The table below displays the performance of Algorithm 1 on small datasets. Extrapolating from the numbers shown in this table, we can expect that Algorithm 1 will take around 20 hours when the number of records is 100,000. To overcome these shortcomings, we have employed a series of techniques to improve its performance. In this section we describe these techniques.
Table: Experimental results on simulated data sets (constant threshold)
Partial construction of the dendrogram
Instead of building the entire dendrogram and cutting at the threshold level, the idea here is to construct only portions of the tree that are below the threshold level. The resultant algorithm is shown as Algorithm 2.
Algorithm 2: PCD
Step 1. Collect all the records in all the data sets. Each such record is a cluster by itself at level 0 initially.
Step 2. Calculate the record distances between each pair of clusters. Keep a nearest neighbor for each cluster.
Step 3. Find the pair of clusters with the smallest distance.
Step 4. If this distance is smaller than or equal to the threshold, then merge the two clusters into a new cluster with the distance as the new cluster’s level. Otherwise, go to Step 7.
Step 5. Update the pairwise distances and the new nearest neighbors’ information.
Step 6. Repeat Steps 3, 4, and 5 until we end up with a single cluster.
Step 7. Collect and output the root clusters.
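The following minimal sketch captures the flow of PCD under simplifying assumptions: it rescans all cluster pairs in every stage instead of maintaining nearest-neighbor information, and it assumes a record_distance function as before.

```python
# Sketch of PCD: build only the part of the dendrogram below the threshold.
# Assumes record_distance(r1, r2) implements RD; not optimized (quadratic scans).
def partial_construction(records, record_distance, threshold):
    # Step 1: every record starts as its own cluster at level 0.
    clusters = [[r] for r in records]

    def cluster_distance(c1, c2):
        # Single linkage: smallest record distance across the two clusters.
        return min(record_distance(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Steps 2-3: find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 4: stop as soon as the closest pair exceeds the threshold.
        if d > threshold:
            break
        # Steps 4-5: merge the two closest clusters.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    # Step 7: the remaining clusters are the root clusters below the threshold.
    return clusters
```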
Although Algorithm 2 also takes O(n^2 l^2) time, its expected run time is better than that of Algorithm 1. This is especially true if the error rate is low. If the errors in the records occur with a low probability, then the threshold value will be low and we only have to construct a small portion of the dendrogram.
Ignoring the dendrogram structure
Since only certain clusters are collected as the output, the structure of the dendrogram is not necessary. Algorithm 3 shows the details of this technique.
Algorithm 3: IDS
Step 1. Collect all the records in all the data sets. Let this collection be D.
Step 2. While D is not empty do: pick one of the records in D arbitrarily and let R be this record. We will form a cluster CR corresponding to R as follows. To begin with, CR has only R in it. We identify all the records in D that are within the threshold distance of R. Let the collection of these records be C'. Add all the records of C' to CR. Following this, identify all the records of D that are within the threshold distance of any of the records of C'. Let the collection of these records be C''. Add all the records of C'' to CR. Continue this process of identifying neighbors and adding them to CR until no new neighbors can be found. At this point output CR; this is one of the clusters of interest. Set D := D - CR.
Since Algorithm 3 does not generate a distance matrix, its memory usage is only O(n) and the worst case run time is still O(n^2 l^2).
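A minimal sketch of IDS, under the same assumptions (a record_distance function and a threshold): each cluster is grown by repeatedly absorbing unassigned records within the threshold distance of the records added in the previous round.

```python
# Sketch of IDS: form clusters directly, without storing a dendrogram
# or a full distance matrix. Assumes record_distance(r1, r2) implements RD.
def ignore_dendrogram_structure(records, record_distance, threshold):
    remaining = list(records)           # the collection D
    clusters = []
    while remaining:                    # Step 2: while D is not empty
        seed = remaining.pop()          # pick a record R arbitrarily
        cluster = [seed]                # CR initially contains only R
        frontier = [seed]
        while frontier:                 # keep expanding until no new neighbors
            neighbors = [r for r in remaining
                         if any(record_distance(r, c) <= threshold
                                for c in frontier)]
            remaining = [r for r in remaining if r not in neighbors]
            cluster.extend(neighbors)
            frontier = neighbors
        clusters.append(cluster)        # output CR; D := D - CR
    return clusters
```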
Faster computation of the edit distance
The general algorithm for edit distance computation takes quadratic time (see, e.g., [22]). Since we have to calculate edit distances a total of n^2 times, a speedup in the edit distance calculation will improve the run time of the entire data integration task. The Four-Russians speedup algorithm computes the edit distance in O(l^2/log l) time when the block size is chosen to be (log_{3σ}(l))/2 (where σ is the alphabet size and l is the length of the two strings). However, it is unfortunately inapplicable here because of the short length of the strings involved: in this problem σ is 26 and l, the length of a name, is usually quite small.
Observation 1: The edit distance between two strings is always larger than or equal to the difference in the lengths of the two strings.
It is easy to see that even if one string S1 is a substring of the other string S2, we still need |S2|-|S1| insertions and/or deletions to transform one into the other. This lower bound can reveal, even before computing the edit distance, that the distance between two strings exceeds the threshold; in such cases we can omit the calculation of the edit distance altogether.
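As a small illustration, the following check (an illustrative helper with a toy threshold) lets us skip the edit distance computation whenever the length difference alone already exceeds the threshold.

```python
# Skip the full edit distance computation when the length difference
# (a lower bound on the edit distance) already exceeds the threshold.
def exceeds_threshold_by_length(s1, s2, threshold):
    return abs(len(s1) - len(s2)) > threshold

print(exceeds_threshold_by_length("jonathan", "jon", 2))  # True: no need to compute
```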
In summary, let t be a given threshold. If the distance between two strings S1 and S2 is less than or equal to t, then we can compute this distance in O(tl) time (where l is the length of the strings), as follows. Dynamic programming is typically used to compute the distance between two strings. This algorithm employs a table of size (|S1|+1) × (|S2|+1) to compute partial distances (see, e.g., [34]). If the edit distance between the two strings is smaller than t, then the trace-back path in the dynamic programming table must lie within a strip of width (2t+1) centered on the diagonal. The diagonal is where we align the ith character of one string with the ith character of the other string. A departure from the diagonal means an insertion or a deletion, so when at most t insertions or deletions are allowed, the trace-back path can depart from the diagonal by at most t cells in a row. If the edit distance is at most t, what we calculate is the exact edit distance. Otherwise, what we calculate is not exact, but since we then know that the edit distance is more than t, it does not matter that the computed value is inexact. We refer to this technique of computing edit distances as FCED (fast computation of edit distances). This idea is also described in Gusfield's book ([52], pages 261–262). In particular, in any row, column, or diagonal of the dynamic programming table (for edit distance computation), two adjacent cells can have values that differ by at most one ([52], pages 305–306). Also, in any diagonal of the dynamic programming matrix D (for edit distance computation), either D[i,j] = D[i-1,j-1] or D[i,j] = D[i-1,j-1] + 1. (Matrix D is defined as follows. If we are interested in computing the distance between two strings S1 and S2 using dynamic programming, then a matrix of size (|S1|+1) × (|S2|+1) will be employed. In particular, D[i,j] will be the distance between the first i characters of S1 and the first j characters of S2 (for all values of i and j).) We can see that D[i,j] is calculated as the minimum of D[i-1,j-1] plus the exchange cost, D[i-1,j] + 1, and D[i,j-1] + 1. Without loss of generality, assume that D[i,j] < D[i-1,j-1]; then D[i-1,j] or D[i,j-1] would have to be smaller than D[i-1,j-1] - 1, so two adjacent cells in a row or column would differ by more than one, which is a contradiction.
This fact implies that the values along the diagonals of the dynamic programming table are non-decreasing.
Using Observation 1, we first check whether the edit distance must be larger than the threshold. If not, we employ the O(tl) time algorithm described above. We keep checking the diagonal to see whether the allowed threshold t has been exceeded. If the edit distance is seen to be larger than the threshold, the algorithm terminates immediately. Otherwise, the exact edit distance is calculated and returned. In this algorithm, the threshold is treated as the current distance allowed. While dynamically calculating the edit distance, the diagonal value is compared with the threshold. If the value is already larger than the threshold, no further calculation is necessary and an infinite distance is assigned to such comparisons. If there is another pair of attributes to compare, a new allowed distance is computed as the threshold minus the distance already used. This method is implemented in Algorithm 3 and in Algorithm 4 (in the next section).
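The following is a minimal sketch of the FCED idea under the stated assumptions (unit costs and a small threshold t): it applies the Observation 1 check first, then fills only a strip of width 2t+1 of the dynamic programming table, stopping as soon as every value in the current banded row exceeds t and reporting an out-of-range result (float('inf') here) in that case. It is a sketch, not the exact implementation used in Algorithms 3 and 4.

```python
import math

def fced(s1, s2, t):
    """Return the edit distance if it is at most t, otherwise math.inf (FCED sketch)."""
    # Observation 1: the length difference is a lower bound on the edit distance.
    if abs(len(s1) - len(s2)) > t:
        return math.inf
    m, n = len(s1), len(s2)
    INF = math.inf
    # Row 0 of the dynamic programming table, restricted to the band |i - j| <= t.
    prev = [j if j <= t else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        lo, hi = max(1, i - t), min(n, i + t)
        cur = [INF] * (n + 1)
        if i <= t:
            cur[0] = i
        row_min = cur[0] if i <= t else INF
        for j in range(lo, hi + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # exchange or match
                         prev[j] + 1,         # deletion
                         cur[j - 1] + 1)      # insertion
            row_min = min(row_min, cur[j])
        # Early termination: if no cell in this banded row is <= t, then no
        # alignment of cost <= t can pass through row i, so the distance > t.
        if row_min > t:
            return INF
        prev = cur
    return prev[n] if prev[n] <= t else INF

print(fced("jonathan", "johnathan", 2))  # 1: one insertion
print(fced("jonathan", "smith", 2))      # inf: the distance exceeds the threshold
```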
A two-phase algorithm
Now we describe a two-phase algorithm. In the first phase (called the blocking phase), records sharing something in common are indexed into the same block; the records in each block are integrated later. The blocking phase uses the last name attribute, which is considered more accurate than the first name.
In each record, the last name is parsed into l-mers (substrings of length l), and the record is added into the blocks indexed by these l-mers. For instance, in the case of l = 3, ``Rueckl" is indexed into 4 blocks: ``rue", ``uec", ``eck", and ``ckl". The total number of possible blocks is 26^l. Since one record may belong to multiple blocks, after the integration of each block, a post-processing step is done to merge clusters with duplicate records. Details are shown as Algorithm 4.
Algorithm 4: TPA
Step 1. Collect all the records in all the datasets. Put them into blocks based on l-mers of the last names.
Step 2. Integrate records in each block using the algorithm IDS.
Step 3. Merge the clusters with the same records.
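A minimal sketch of TPA, reusing the IDS sketch above; the assumption that each record's last name is its second field and the toy choice l = 3 are illustrative, not part of this paper's specification.

```python
# A minimal sketch of TPA. Assumes ignore_dendrogram_structure() from the IDS
# sketch, a record_distance function, and that a record's last name is rec[1]
# (an illustrative assumption).
from collections import defaultdict

def two_phase(records, record_distance, threshold, l=3):
    # Step 1: index every record into the blocks given by its last-name l-mers.
    blocks = defaultdict(list)
    for rec in records:
        last = rec[1].lower()
        for i in range(max(1, len(last) - l + 1)):
            blocks[last[i:i + l]].append(rec)

    # Step 2: integrate the records within each block using IDS.
    clusters = []
    for block in blocks.values():
        clusters.extend(ignore_dendrogram_structure(block, record_distance, threshold))

    # Step 3: merge clusters that share a record (a record may fall into many
    # blocks), using a simple union-find over the original record positions.
    parent = list(range(len(records)))
    position = {id(rec): i for i, rec in enumerate(records)}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for cluster in clusters:
        root = find(position[id(cluster[0])])
        for rec in cluster[1:]:
            parent[find(position[id(rec)])] = root

    merged = defaultdict(list)
    for i, rec in enumerate(records):
        merged[find(i)].append(rec)
    return list(merged.values())
```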
Assuming that there are b blocks, integrating each block takes O((n/b)^2 l^2) time on average. Over all the b blocks the expected run time is O(n^2 l^2 / b). Note that b is typically a large number: for 3-mers b can be as large as 26^3 = 17,576, and for 4-mers as large as 26^4 = 456,976.
In summary, we have proposed four algorithms for multiple-source integration, together with six distance calculations: the edit distance to handle common typos, the reversal distance to handle reversals of the last and first names, the nickname distance to handle distances involving nicknames, the truncation distance to handle errors with abbreviations, the phonetic distance to handle sound similarity, and the name distance to capture all the features of the edit, reversal, nickname, and truncation distances on names. Without loss of generality, we validated our algorithms on simulated data using 1) the reversal distance for the first and last names and the edit distance for the other attributes (RDED); 2) the name distance for the first and last names and the edit distance for the other attributes (NDED), where the truncation length is 5 for both the first and the last names; and 3) the phonetic distance for the first and last names and the edit distance for the other attributes (PDED). For real data, because of the limited common attributes (only first name, last name, and date of birth), EDname calculates the edit distances based on the first and the last names, EDall considers all three attributes, PDname calculates the phonetic distances based on the first and the last names, and PDED calculates the phonetic distances on the first and the last names and the edit distance on the date of birth.