While trying to integrate multiple data sets collected by different researchers, we noticed that the sample names were frequently entered inconsistently. Most of the variations appeared to involve punctuation, white space, or their absence, at the juncture between alphabetic and numeric portions of the cell line name.
Reasoning that the variant names could be described in terms of mutations or deletions of character strings, we implemented a simple version of the Needleman-Wunsch global sequence alignment algorithm and applied it to the cell line names. All correct matches were found by this procedure. Incorrect matches occurred only when a cell line was present in one data set but not in the other. The raw match scores tended to be substantially worse for the incorrect matches.
A simple application of the Needleman-Wunsch global sequence alignment algorithm provides a useful first pass at matching sample names from different data sets.
While trying to integrate multiple data sets collected by different researchers, we noticed that the sample names were frequently entered inconsistently. These inconsistencies make it difficult to automate the process of matching data correctly, since matching procedures tend to be based on exact matches of character strings, and they can cause problems in a variety of contexts. For example, searching on cell line names at the web site for the American Type Culture Collection (ATCC) can fail if the name copied from a publication is not exactly the same as the name stored in their database.
While the problem of inconsistent names appears to be especially prevalent when using cell lines, our experience indicates that similar problems can arise when using other kinds of samples. For instance, laboratory technicians often add extra information (in the form of a prefix or suffix) to the character string naming the sample in order to annotate something special that happened during sample preparation. In this case, the sample identifiers no longer exactly match the corresponding identifiers in a database containing clinical information about the samples. In an ideal world, of course, all of the extra information would be stored in a database that carefully regulated the forms of identifiers that could be used. In practice, data is often transferred to statisticians or bioinformaticians in spreadsheets or other files that do not adhere to strict standards or naming conventions.
Most of the variations in sample identifiers appear to involve punctuation, white space, or their absence at the juncture between alphabetic and numeric portions of the cell line name. The second most common variations appear to involve suffixes or prefixes added to the “standard” version of the sample identifier. Because the variant names can be described in terms of mutations, deletions, or insertions of character strings, we reasoned that algorithms that had already been developed for alignment of DNA or protein sequences could be applied to the problem of matching cell line names. The most commonly used sequence alignment algorithm at present is the Basic Local Alignment Search Tool (BLAST) [1]. However, our task at present is to align fairly short sequences as completely as possible, and thus a global alignment algorithm seems more appropriate. As a result, we chose to implement a simple version of the original Needleman-Wunsch global sequence alignment algorithm [2]. Scripts to implement and apply the algorithm in the R statistical software environment [3] are available from the authors upon request.
We tried to integrate three kinds of data. The first data set contained radiation response data (obtained by estimating SF2, the surviving fraction after treatment with 2 Gray) on 33 head and neck squamous cell carcinoma (HNSCC) cell lines, 33 lung cancer cell lines, and 63 cell lines related to the NCI60 (also known as the NCI-60). The second data set contained reverse phase protein lysate array (RPPA) data on 224 HNSCC or lung cancer cell lines. The third data set contained Illumina gene expression data on 105 HNSCC or lung cancer cell lines. Complete lists of the cell line names in the three data sets are contained in the supplementary Excel file (namesMatched.xlsx).
We applied the Needleman-Wunsch algorithm to the cell line names, using a mismatch penalty of 2, a gap penalty of 1, and a match score of 2. Since the primary goal of the intended study was to relate gene or protein expression to outcome in the form of radiation response, we applied the algorithm as follows. First, for each cell line name used in the SF2 data set, we computed the Needleman-Wunsch score against all of the cell line names in both the RPPA data set and the Illumina data set. We then recorded all cell lines with the maximum score as potential matches. Representative results are shown in Table 1; a complete list of the results is contained in the supplementary Excel file (namesMatched.xlsx). Because the NCI60 cell lines (other than lung cancer lines) were present only in the SF2 data set, they provide useful information about the behavior of the algorithm when no correct matches exist.
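The scoring procedure described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R implementation; the function names `nw_score` and `best_matches` are hypothetical. It assumes the stated parameters (match score of 2, mismatch penalty of 2, gap penalty of 1, with penalties entered as negative numbers) and a linear gap cost, and it computes only the alignment score, not the alignment itself.

```python
def nw_score(a, b, match=2, mismatch=-2, gap=-1):
    """Needleman-Wunsch global alignment score of strings a and b.

    Assumed parameters: match score 2, mismatch penalty 2, gap penalty 1
    (penalties are passed as negative numbers).
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b costs i gaps
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_matches(query, candidates):
    """Return the top score and every candidate name that achieves it."""
    scores = {name: nw_score(query, name) for name in candidates}
    top = max(scores.values())
    return top, [name for name, s in scores.items() if s == top]


# Names taken from the examples discussed in the text.
print(best_matches("NCI-H23", ["H23", "PCI-22B", "OSC19 LN2"]))
# (2, ['H23', 'PCI-22B'])
```

Keeping every candidate that ties for the maximum score, rather than only the first, is what makes ambiguous matches visible for later manual review.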
We summarize the results in Table 2. For each raw score, we show the number of correct matches and the number of cell line names that could not be matched because there was no valid counterpart in the other data set. The raw match scores tended to be substantially worse for the incorrect (because impossible) matches than for the correct matches. There were no correct matches with a score less than 2. With a score less than or equal to 5, there were only 6 (5.7%) correct matches compared to 134 (87.0%) names that were impossible to match.
All correct matches were found by this procedure, with only one name providing an ambiguous match. The cell line name “NCI-H23” in the SF2 data set was correctly matched with the name “H23” in the RPPA data set, but incorrectly matched with the name “PCI-22B” in the same data set. Both matches yielded a raw score of 2. Not surprisingly, this suggests that shorter cell line names are more difficult to match correctly and unambiguously. With that single exception, incorrect matches occurred only when a cell line was present in one data set but not in the other.
The results also suggest that it is impossible to set a cutoff on the score that will ensure that putative matches with at least that score will be correct. For example, the SF2 data set contains cell line names “OSC19 LN1” and “OSC19 LN2”. Only the second of these cell lines is contained in the Illumina data set. Thus, the best match to the “OSC19 LN1” cell line in the SF2 data set is the cell line “OSC19 LN2”, which has a raw match score of 14. Because many cell lines have names that differ only in a single digit, we expect that highly similar but incorrect matches will be common.
Note that our current implementation does not correct for differences of case (e.g., “CALU1” vs. “Calu-1”). Differences in case can be handled either by forcing everything to upper case or by adding a more elaborate mismatch penalty matrix that imposes smaller penalties for changes in case.
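The two ways of handling case mentioned above can be compared in a short Python sketch. This is an illustration under stated assumptions, not the authors' code: the helper names `exact` and `case_tolerant` are hypothetical, and the score of 1 assigned to a pair of letters that differ only in case is an arbitrary choice standing in for a "smaller penalty".

```python
def nw_score(a, b, pair_score, gap=-1):
    """Needleman-Wunsch global alignment score with a pluggable pair scorer."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            curr.append(max(prev[j - 1] + pair_score(ca, cb),
                            prev[j] + gap,
                            curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def exact(ca, cb):
    # Scoring described in the text: match 2, mismatch penalty 2.
    return 2 if ca == cb else -2


def case_tolerant(ca, cb):
    # Hypothetical scoring: letters differing only in case score 1,
    # between an exact match (2) and a true mismatch (-2).
    if ca == cb:
        return 2
    if ca.upper() == cb.upper():
        return 1
    return -2


print(nw_score("CALU1", "Calu-1", exact))                  # -3
print(nw_score("CALU1".upper(), "Calu-1".upper(), exact))  # 9
print(nw_score("CALU1", "Calu-1", case_tolerant))          # 6
```

Forcing upper case is simpler and treats case as irrelevant; the graded scorer still prefers an exact-case match when one exists, which may matter if two distinct samples differ only in case.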
Consequently, we view the use of the Needleman-Wunsch sequence alignment algorithm as a first step in the process of correctly matching sample identifiers, especially across data sets containing hundreds of samples. With this tool, it is possible to quickly assemble a spreadsheet that shows the best putative matches (such as the one provided as a supplementary file). Having this file makes it easy for a researcher to scan through and indicate which matches are correct and which are incorrect. If that information is entered as another column in the same spreadsheet, then the resulting documentation can be used directly by statistical software packages to automate the next step in the process of merging the data.
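The review workflow described above might look like the following Python sketch. It is a hedged illustration only: the column names (`sf2_name`, `candidate`, `score`, `correct`), the function names, and the convention of marking confirmed rows with "yes" are all assumptions, and CSV text held in memory stands in for the spreadsheet file.

```python
import csv
import io

# Hypothetical column layout for the review sheet.
FIELDS = ["sf2_name", "candidate", "score", "correct"]


def write_review_sheet(matches, handle):
    """Write putative matches with an empty 'correct' column for the reviewer."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for query, candidate, score in matches:
        writer.writerow({"sf2_name": query, "candidate": candidate,
                         "score": score, "correct": ""})


def confirmed_mapping(handle):
    """Read a reviewed sheet and keep only rows the reviewer marked 'yes'."""
    return {row["sf2_name"]: row["candidate"]
            for row in csv.DictReader(handle)
            if row["correct"].strip().lower() == "yes"}


# Putative matches and scores taken from the examples in the text.
putative = [("NCI-H23", "H23", 2),
            ("NCI-H23", "PCI-22B", 2),
            ("OSC19 LN1", "OSC19 LN2", 14)]
sheet = io.StringIO()
write_review_sheet(putative, sheet)

# Simulated result of a researcher filling in the last column by hand.
reviewed = io.StringIO(
    "sf2_name,candidate,score,correct\n"
    "NCI-H23,H23,2,yes\n"
    "NCI-H23,PCI-22B,2,no\n"
    "OSC19 LN1,OSC19 LN2,14,no\n"
)
mapping = confirmed_mapping(reviewed)
print(mapping)  # {'NCI-H23': 'H23'}
```

The resulting dictionary can then serve as a lookup table when merging the data sets, so that only manually confirmed name pairs are joined.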
This work was supported in part by the Department of Defense grant W81XWH 07 1 0306 02, and by National Institutes of Health/National Cancer Institute grants P50 CA070907, P50 CA097007, and P01 CA006294.
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.