U87MG cells were ordered from ATCC (HTB-14) and cultured in a standard way. Genomic DNA was isolated from cultured U87MG cells using Qiagen Gentra Puregene reagents. DNA was stored at −20C until library generation.
Long-Mate-Paired Library Construction: The U87MG genomic DNA 2× 50bp long mate-paired library construction was carried out using the reagents and protocol provided by Applied Biosystems (SOLiD 3 System Library Preparation Guide). A similar protocol was reported previously 
. Briefly, 45ug of genomic DNA was fragmented by HydroShear (Digilab Genomic Solutions Inc) to 1.0–2.5kb. The fragmented DNA was repaired by the End-It DNA End-Repair Kit (Epicentre). Subsequently, the LMP CAP adaptor was ligated to the ends. DNA Fragments between 1.2–1.7kb were selected by 1.0% agarose gel to avoid concatamers and circularized with a biotinylated internal adaptor. Non-circularized DNA fragments were eliminated by Plasmid-Safe ATP-Dependent DNase (Epicentre) and 3ug of circularized DNA was recovered after purification. Original DNA nicks at the LMP CAP oligo/genomic insert border were translated into the target genomic DNA about 100bp by nick translation using E. coli DNA polymerase I. Fragments containing the target genomic DNA and adaptors were cleaved from the circularized DNA by single-strand specific S1 nuclease. P1 and P2 adaptors were ligated to the fragments and the ligated mixture was used to create two separate libraries with 10 cycles of PCR amplification. Finally, 250–300bp fragments were selected to generate mate paired sequencing libraries with average target genomic DNA on each end around 90bp by excision from PAGE gel and use as emulsion PCR template. Templated Beads Preparation: The templated beads preparation was performed using the reagents and protocol from the manufacturer (Applied Biosystems SOLiD 3 Templated Beads Preparation Guide). SOLiD 3 Sequencing: The 2×50b mate-paired sequencing was performed exactly according to the Applied Biosystems SOLiD 3 System Instrument Operation Guide and using the reagents from Applied Biosystems.
Exon Pull-Down Capture Sequencing with Illumina GAII
We used an array pull-down capture strategy established in our lab 
. An Agilent custom array for capturing 5,253 “cancer-related” genes was designed through Agilent e-array system (www.agilent.com
). Only the amino acid encoding regions were targeted with 60mer oligos spaced center-to-center 20–30bp. The probes were randomly distributed across two separate 244K arrays. The library for cancer gene capture sequencing was generated following the standard Illumina paired-end library preparation protocol. 5ug of genomic DNA was used for the starting material and 250–300bp fragments were size-selected during the gel-extraction step. In the last step, 18 cycles of PCR were performed in multiple tubes to yield 4ug of product and mixed with 50ug of Human Cot-1 DNA (Invitrogen), 52ul of Agilent 10× Blocking Agent, 260ul of Agilent 2× Hybridization Buffer and 10× molar concentration of unpurified Illumina paired-end primer pairs custom made according to the sequences provided by Illumina (Oligonucleotide sequences, 2008, Illumina, Inc: available on request from Illumina). The mix was then diluted with elution buffer for the final volume of 520ul and then incubated at 95°C for 3 min and 37°C for 30min. 490ul of the hybridization mix was added to the array and hybridized in the Agilent hybridization oven (Robins Scientific) for 65 hrs at 65°C, 20rpm. After hybridization, the array was washed according to the Agilent wash procedure A protocol. The second wash was extended to 5 minutes to increase the wash stringency. After washing, the array was stripped by incubating it in the Agilent hybridization oven at 95°C for 10min, 20rpm with 1.09× Titanium Taq PCR Buffer (Clonetech). After the incubation and collection of the solution, 4 tubes of PCR were performed with each tube containing 96ul of the collected solution, 1ul of dNTPs (10mM each), 1ul of Titanium Taq (Clonetech) and Solexa primers, 1ul each. 15 cycles of PCR was performed at the following condition: 30sec at 95°C, (10 sec at 95°C, 30 sec at 65°C, 30 sec at 72°C)×18 cycles, 5 min at 72°C and hold at 4°C. The amplified product was purified using QIAquick PCR Purification Kit and eluted in 30ul of EB. After confirming the size of the amplicon on 2% agarose gel and measuring the concentration, the amplicon was diluted to 10nM, the working concentration for cluster generation. The Illumina flowcell was prepared according to the manufacturer's protocol and the Genome Analyzer was run using standard manufacturer's recommended protocols. The image data produced were converted to intensity files and were processed through the Firecrest and Bustard algorithms (1.3.2) provided by Illumina to call the individual sequence reads.
ABI SOLiD Sequence Alignment and Consensus Base Calling
We used Blat-like Fast Accurate Search Tool version 0.5.3 (BFAST http://bfast.sourceforge.net
to perform sequence alignment of the two-base encoded reads off the ABI SOLiD to the NCBI human reference genome (build 36.1). Utilizing the local alignment algorithm included in BFAST 
, we were able to simultaneously decode the short reads, while searching for color errors (encoding errors), base changes, insertions, and deletions.
We found candidate alignment locations (CALs) for each end independently. We utilized ten indexes to be robust to up to six color errors, equating to a 12% per-read error rate:
We also set parameters to use only informative keys when looking up reads in each index (BFAST parameter -K 8), and to ignore reads with too many CALs aggregated across all indexes (BFAST parameter -M 384). If reads mapped to greater than 384 locations, then they were categorized as ‘unmapped’. We then performed local alignment for each of the returned CALs, simultaneously decoding the read from color space searching for color errors (encoding errors), base changes, insertions, and deletions 
. We choose the “best scoring” alignment, accepting an alignment only if it was at least the equivalent edit distance of two color errors away from the next best alignment. This is approximately similar to a ‘mapping quality’ of 20 or better from the MAQ program output, for reference. We removed duplicate reads using the alignment filtering utility found in DNAA (http://dnaa.sourceforge.net
). For single-end and mate-paired reads where only one end mapped, we removed duplicates based on reads having identical stat positions. For mate-paired reads, we removed duplicates where both ends had the same start position.
Illumina Genome Analyzer Sequence Alignment
Illumina generated sequence was aligned to the NCBI human reference genome (build 36.1) using BFAST with the following parameters applied. Each end of the fragment library was mapped independently to identify CALs, utilizing ten indexes to be robust to errors and variants in the short (typically 36bp) reads:
We also set parameters to use only informative keys when looking up reads in each index (BFAST parameter -K 8), and to ignore reads with too many CALs aggregated across all indexes (BFAST parameter -M 1280). We then performed a standard local alignment for each CAL. Reads were declared mapped if a single unique best scoring alignment was identified within the genome. Duplicate reads were filtered out in the same manner as for the ABI SOLiD data.
Single Nucleotide Variant and Small Insertion and Deletion Detection
To find SNVs including SNPs and small indels, we assumed the MAQ consensus-calling model 
utilizing the implementation in SAMtools 
. We used a value of 0.0000007 for the prior of a difference between two haplotypes (-r parameter). This was chosen based on ROC analysis of a test dataset (data not shown).
Structural Variation Detection
Structural variations were detected using custom algorithms designed to comprehensively search for groups of mate-pair reads with aberrant paired-end insert size distributions that are consistently identifying a unique structural variant in the genome. We utilized the “dtranslocations” utility in the DNAA package (http://dnaa.sourceforge.net
) for the primary structural variation candidate search. The utility first selected all pairs for which each end is uniquely mapped to a single location in the human genome and for which the mate-pair reads are not positioned in the expected size range relative to the consensus genome. Then we filter out false positives that are not consistent with a chromosomal difference on an allele. Briefly, the genome was divided into 500-base bins sequentially stepped 100-bases apart from their start positions. Each bin was then paired with other bins on the basis of containing similar ‘mismapped’ mate-pair reads. The aberrant mate-paired reads were defined as reads that were mapping less than 1000 or greater than 2000 bases apart within the reference genome sequence, which is selected based on the insert size distribution calculated from the aggregate dataset (Figure S2
). These were then rank-ordered based on the number of mate-pairs meeting criteria, and the destination bin with the most reads within it was paired with a given source bin to create a ‘binset’. Binsets containing less than 4 reads were filtered out, removing 98.3% of the candidates based on having too little evidence supporting them. The resulting list of filtered binsets was then scanned for clusters of binsets. Binset clusters are groups of binsets where the source bins occur within 2000 bases of each other and the destination bins occur within 2000 bases of each other. Redundant binsets were combined and those binset clusters that contain too few (less than 9 binsets spanning at least 1000 bases) or too many binsets (greater than 29 binsets spanning at most 3000 bases—higher is impossible given our insert size distribution) were removed as artifacts. The resulting binset clusters represent the reads immediately flanking structural breakpoint events. This detection process is currently being automated as Breakway (http://breakway.sourceforge.net
), but was done using custom scripts at the time of analysis.
The structural variations were then separated into interchromosomal and intrachromosomal events. Intrachromosomal events of less than 1Mb are assessed for deletion status by averaging base coverage within the bounds of the event and comparing it to base coverage 200kb outside the event on both sides. Those that have average interior base coverage less than 25% of the average exterior base coverage are classified as “complete” deletions. Those with average interior base coverage between 25% and 75% that of average exterior base coverage are classified as “heterozygous deletions” (deletions of at least one copy of the region, but with at least one copy remaining).
Genes Affected by Mutations in Coding Sequence
Variant calls from the SAMtools pileup tool were first loaded into a SeqWare QueryEngine database and subsequently filtered to produce BED files. This filtering criteria required that a variant be seen at least 4 times and at most 60 times with an observation occurring on each strand at least once. For SNVs we further enforced the criteria that SNVs should only be called in reads lacking indels and the last 5 bases of the reads were also ignored. This reduced the likelihood that spurious mismappings were used to predict SNVs and eliminated the lowest quality bases from consideration. For small indels (<21bp) we enforced a slightly different filter by requiring that any reads supporting an indel were only allowed to contain one contiguous indel and these reads were not considered if the indel occurred on either the beginning or end of the read. These criteria, like the SNV criteria, were used to reduce the likelihood of using mismapped reads or locally misaligned reads in the variant calling algorithm. The elimination of reads with indels at the beginning or end of the read was intended to remove potential alignment artifacts caused by ambiguous gap introduction due to lack of information at the ends to guide proper alignment. Together, these filtering criteria reduced the likelihood that sequencing errors were identified as SNV or indel variants. We used scripts available in the BFAST toolset and SeqWare Pipeline to filter and annotate the variant calls. Variants passing these filters were further annotated by their overlap with dbSNP version 129. Variants were required to share the same genomic position as a dbSNP entry along with matching the allele present in the database to be considered overlapping. Mapping to dbSNP allowed us to filter out known SNPs from de novo variants.
Filtered SNV and indel variants were then analyzed for their affect within the genome that is annotated with gene models. This analysis used scripts from the SeqWare Pipeline project and gene models downloaded from the UCSC hg18 human genome annotation database. Six different gene model sets from hg18 were considered: UCSC genes (knownGene), RefSeq genes (refGene, http://www.ncbi.nlm.nih.gov/RefSeq
), Consensus Coding Sequence genes (ccdsGene, http://www.ncbi.nlm.nih.gov/CCDS
), Mammalian Gene Collection genes (mgcGenes, http://mgc.nci.nih.gov
), Vertebrate Genome Annotation genes (vegaGene, http://vega.sanger.ac.uk
), and Ensembl genes (ensGene, http://www.ensembl.org
). Each variant was evaluated for overlap with genes from each of the 6 gene models. If overlap was detected the variant was examined and tagged with one or more of the following terms depending on the nature of the event: “utr-mutation”, “coding-nonsynonymous”, “coding-synonymous”, “abnormal-ref-gene-model-lacking-stop-codon”, “abnormal-ref-gene-model-lacking-start-codon”, “frameshift”, “early-termination”, “inframe-indel”, “intron-splice-site-mutation”, “stop-codon-loss”, and/or “start-codon-loss”. The variant was also tagged with the gene symbol and other accessions to facilitate lookups. This information was loaded into a SeqWare QueryEngine database to allow for querying and filtering of the variants as needed.
Genes affected by structural variations were assessed in two ways depending on the structural variation type. For interchromosomal translocation events, a gene was considered “affected” when either end of an interchromosomal translocation event fell in a genic region (including the entire coding region plus 1kb up- or down-stream of the gene's coding region). The same criteria were used for all intrachromosomal translocation events. For events that were classified as complete or heterozygous deletions, a gene was considered affected if all or part of a coding exon was deleted.
Annotation of Relevant Mutated Genes
Homozygous SNVs, small indels, large deletions, and translocation events for variants that included predicted coding sequence changes were tallied. This became a reference list of variants with serious homozygous mutations that likely completely disrupted, or “knocked out”, the normal function or synthesis of the target protein.
For the SNVs and small indels, a “knockout” variant was defined as a homozygous call by the SAMtools variant caller where the variant was predicted by the SeqWare Pipeline scripts to change coding sequence with one or more of the following annotations: “early-termination”, “frameshift”, “intron-splice-site-mutation”, “start-codon-loss”, and/or “stop-codon-loss”. The “early-termination” event represented a stop codon introduced upstream of the annotated stop codon. The “frameshift” represented an indel that resulted in a shifting of the reading frame of the gene resulting in, typically, early termination and non-sense coding sequence. The “intron-splice-site-mutation” referred to a mutation in the two consensus splice site intronic bases flanking exons (GT at the 5′ splice site and AG at the 3′ splice site). Finally, “stop-codon-loss” and “start-codon-loss” simply refer to variants that interrupt the stop or start codons. We chose to not include “coding-nonsynonymous” and “inframe-indel” annotations in this list of knocked out variants because, while potentially serious as these mutations are, they are not guaranteed to result in an unexpressed or non-functional protein. However, homozygous frameshift, early termination, splice site, and stop/start codon loss mutations are very likely to interrupt a gene's expression and translation to functional protein.
As described above, large microdeletions that removed all or part of an exon and interchromosomal translocation events that fell within 1kb of a gene's coding region were also classified as mutated genes.
Once suspect knockout variants were identified, a mapping process was used to translate one or more variants to the gene symbol. This mapping allowed us to condense multiple variants affecting multiple gene models to a more abbreviated list of gene symbols likely to be affected by these knockout mutations. The mapping from variants to gene symbols used variants identified with gene models from the refGene and the knownGene tables in the UCSC hg18 database and mapped these variants to gene symbols using queries against the name field of the knownGene table and the alias field of the kgAlias table. The UCSC table browser was used to accomplish these queries and map the knownGene identifiers to gene symbols via the kgXref table. A similar approach was used for homozygous large-scale microdeletions and translocation events.
The list of knockout genes was uploaded to the Database for Annotation, Visualization, and Integrated Discovery (DAVID, version 2008) to identify enriched Gene Ontology (GO) terms 
. Overlap with GO terms from the biological process, cellular component, and molecular function ontologies were considered. The default parameters were used and a p-value cutoff of <
0.01 was considered significant.
Cancer Gene Census
The overlap between the Cancer Gene Census genes and those identified as knockouts in U87MG were compared. The Cancer Gene Census project is an ongoing effort to catalog genes with mutations that have been implicated in cancer 
. It is a highly curated list that includes annotations for each gene including tumor types, class of mutations, and other genetic properties. We used the gene symbol list from the September 30th
, 2009 complete working list, which includes 412 gene symbols.
The overlap between mutations in the Cancer Genome Atlas (TCGA) and those identified as knockouts in U87MG was analyzed. TCGA is an ongoing effort to understand the molecular basis of cancer through large-scale copy number analysis, expression profiling, genome sequencing, and methylation studies among other techniques 
. It provides information on mutations found by Sanger sequencing on many patient samples. For glioblastoma this includes sequence data aberrations detected in 158 patient samples and 1,177 genes.
Genomic Identification of Significant Targets in Cancer
The Genomic Identification of Significant Targets in Cancer (GISTIC) method was used to find significant areas of deletion in 293 samples from the TCGA 
. The GISTIC technique was designed to identify and analyze chromosomal aberrations across a set of cancer samples, based on the amplitude of the aberrations as well as the frequency across samples. This approach produced a series of commonly deleted regions across the set of TCGA GBMs. To calculate the areas of deletion, we used 293 Affymetrix SNP 6.0 samples segmented using the GLAD SNP analysis module 
. Default parameters of GISTIC were used. GISTIC produces peak limits, wide peak limits, and in addition broader region limits. These commonly deleted broader regions were then scanned for predicted knockout genes in U87MG.
Indel Size Distribution and Nucleotide Substitution Frequencies
The distribution of small indel sizes was examined for both deletions and insertions. Indels classified as affecting coding-sequence by the SeqWare Pipeline (see above) were compared to those outside coding regions. Raw counts were collected, recalculated as percents of total, and compared directly.
Similarly, nucleotide substitution frequency was examined for SNVs from U87MG both genome-wide and only in coding regions. Once binned appropriately, the SNV nucleotide substitutions were counted, tallied in a table, and graphed as percents of total.
Individual Genome Comparison
Variants from the Watson and Yan Huang genome were downloaded from each respective project from the following URLs: ftp://jimwatsonsequence.cshl.edu/jimwatsonsequence/watson-454-snp-v01.txt.gz
. These files contained variant calls for each genome along with annotations describing the variant as novel or occurring in dbSNP. The Watson genome only contained SNV calls so our comparison was limited to just SNVs. The Yan Huang genome also contained calls indicating heterozygous or homozygous. However, a variant was considered to match between genomes regardless of zygosity state. We compared the overlap of the U87MG genome, dbSNP and each of these genomes in turn. SNVs from U87MG that were considered for comparison had to meet our criteria; variants had to be observed at least 4 times, at most 60 times, at least once per strand, and with a minimum phred score of 10. SNVs in the three-way comparison were said to match if the position and allele matched between the genomes. If both variants matched between U87MG and the other genome and one was annotated in dbSNP, then the other was considered in dbSNP as well. If neither contained annotations from dbSNP the variant was considered novel. A similar process was carried out for variants distinct to each genome. The results were recorded as Venn diagrams showing the overlap between dbSNP, U87MG, and the Watson or Yan Huang genome.
Illumina SNP Chip
Genomic DNA from U87MG was submitted to the Southern California Genotyping Consortium to be run on the Illumina Human 1M-Duo BeadChip, which consists of 1,199,187 probes scattered across the human genome. The Illumina Beadstudio program was used to analyze the resulting intensity data. Loss of heterozygosity was determined by analyzing B-allele frequency as determined by the Beadstudio program. Normal two-copy regions of the genome are represented by long stretches of probes with B-allele frequencies of 0, 0.5 or 1. Regions of LOH, on the other hand, deviate from this pattern significantly. Copy number was determined by looking at probe intensity.
Sanger Sequencing Validation
Primers for validation were designed by targeting regions immediately flanking the event predicted by our whole genome sequence analysis using the Primer3 tool (http://frodo.wi.mit.ed/primer3/
). Polymerase chain reaction was performed following standard protocols using Finnzymes Phusion Hot-Start High Fidelity polymerase. Products were run on 2% agarose gel electrophoresis and product purity and size was assessed by staining with ethidium bromide. Sanger sequencing was performed at the UCLA Genotyping and Sequencing core facility using an ABI 3730 Capillary DNA Analyzer. Sequence trace files were analyzed using Geospiza FinchTV. Validation status and PCR primers are listed in Table S1
Data Deposition and Availability
Intensities, quality scores, and color space sequence for the genomic sequence of U87 SOLiD were uploaded to the Sequence Read Archive under the accession SRA009912.1/Sequence of U87 Glioblastoma Cell-line. Intensities, quality scores, and nucleotide space sequence for the exon capture U87 Illumina sequence were also uploaded to the Short Read Archive under the same accession. For both datasets, alignment files have been uploaded to the Short Read Archive as additional analysis results.
Variant calls for both datasets are available via a SeqWare QueryEngine web service at http://genome.ucla.edu/U87
. This tool allows for querying the variants using a variety of search criteria including coverage, mutational consequence, gene symbol, and others. SeqWare QueryEngine produces results in both BED and WIG format making it compatible with the majority of genome browsers such as the UCSC genome and table browsers. Variant data will be uploaded to SRA as metadata along with the raw sequences. For the whole genome SOLiD alignment, small indels (<21bp), SNVs, large deletions, and translocation events can be queried. For the exon capture Illumina alignment, small indels and SNVs can be queried.
Most software used for this project is open-source and freely available. We created two software projects that were instrumental in the analysis of the U87MG data: BFAST and SeqWare. The color- and nucleotide-space alignment tool BFAST can be downloaded from http://bfast.sourceforge.net
and many of our alignment filtering as well as the primary step in structural variation detection can be found in the DNAA package at http://dnaa.sourceforge.net
. The SeqWare software project was used throughout the analysis of variant calls. We used the SeqWare LIMS tool for sample tracking, the SeqWare Pipeline analysis programs for annotating variants with dbSNP status and mutational consequence predictions, and SeqWare QueryEngine was used to database and query variant calls and annotations. This software and documentation can be downloaded from http://seqware.sourceforge.net