Three sets of human disease genes were used in our study. We obtained a list of 1003 human genes (1006 Swiss-Prot entries) with disease-associated non-synonymous mutations from the Swiss-Prot database (July 2005; http://expasy.org/cgi-bin/listshumsavar.txt). The list of 881 human disease genes (923 OMIM entries) with annotated phenotypes was taken from the study by Jimenez-Sanchez et al. We also considered another disease set consisting of genes annotated as “disease”, but neither as “susceptibility” nor as “non-disease”, in the OMIM Morbid Map. This set included 1609 genes (2239 MIM entries). Two sets of all human genes were used, based on the Ensembl and Swiss-Prot databases. The longest protein isoform of every human gene was obtained from the Ensembl human genome build 35. We only retained genes annotated as “pep:known” or “pep:CCDS” (representing genes mapped to human-specific entries of Swiss-Prot, RefSeq, SPTrEMBL or CCDS). In total, 20,262 genes were included. The other all-human gene set consisted of 12,211 protein sequences from the Swiss-Prot database. All-against-all BLASTP searches were performed using standard parameters. Sequence homologs were identified as non-self hits with E-value < 0.001 that could be aligned over more than 80% of both the query length and the length of the identified sequence. Throughout the manuscript the term “singleton human genes” is used to describe the genes without any sequence homologs that can be identified by the BLASTP searches.
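The homolog criteria can be illustrated with a minimal sketch. The `Hit` record, its field names, and the toy hits are hypothetical, introduced only for illustration; they are not a BLAST output format and not the code used in the study. Coverage is assumed here to mean aligned residues divided by sequence length.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One BLASTP hit; field names are illustrative, not a BLAST output format."""
    query: str
    subject: str
    evalue: float
    query_aligned: int    # residues of the query covered by the alignment
    subject_aligned: int  # residues of the subject covered by the alignment
    query_len: int
    subject_len: int

E_VALUE_CUTOFF = 0.001  # threshold stated in the text
MIN_COVERAGE = 0.8      # "more than 80%" of both sequences

def is_homolog(hit: Hit) -> bool:
    """Non-self hit, E-value below the cutoff, coverage above 80% on both sides."""
    if hit.query == hit.subject:
        return False
    if hit.evalue >= E_VALUE_CUTOFF:
        return False
    return (hit.query_aligned > MIN_COVERAGE * hit.query_len
            and hit.subject_aligned > MIN_COVERAGE * hit.subject_len)

def singletons(all_genes, hits):
    """Genes with no qualifying homolog among the supplied hits."""
    with_homolog = {h.query for h in hits if is_homolog(h)}
    return sorted(set(all_genes) - with_homolog)

# Toy data, invented for this example.
hits = [
    Hit("A", "A", 0.0, 100, 100, 100, 100),  # self hit, ignored
    Hit("A", "B", 1e-30, 90, 95, 100, 110),  # passes both filters
    Hit("B", "A", 1e-30, 95, 90, 110, 100),  # passes both filters
    Hit("C", "A", 1e-5, 50, 50, 100, 100),   # coverage too low
    Hit("D", "B", 0.5, 100, 100, 100, 110),  # E-value too high
]
print(singletons(["A", "B", "C", "D"], hits))
```

In this toy example, genes C and D are reported as singletons because neither has a hit that satisfies both the E-value and the coverage criteria.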
We obtained H. sapiens to D. rerio, H. sapiens to G. gallus, and H. sapiens to M. musculus orthology information, as well as paralogous relationships within D. rerio, G. gallus, and M. musculus, from the Ensembl database. Ka and Ka/Ks values of all 1:1 human-mouse orthologous pairs, calculated using the PAML package, were obtained directly from the Ensembl database.
The sets of synonymous and non-synonymous human SNPs were obtained from the dbSNP database. These included 87,920 SNPs corresponding to 14,825 human genes. For each bin of homolog sequence identity, the Ka/Ks ratio was calculated. The proportion of non-synonymous sites (0.717) was calculated from simulation; for each nucleotide in the protein coding region a random transition or transversion mutation was performed at the ratio of 0.6/0.4, according to the published estimates in mammals.
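The simulation of non-synonymous sites can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: it uses the standard genetic code, treats the two possible transversions as equally likely, and counts changes to or from a stop codon as non-synonymous. The function names and the `rounds` parameter are hypothetical.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}

def mutate(base, rng):
    """Transition with probability 0.6, otherwise one of the two transversions."""
    if rng.random() < 0.6:
        return TRANSITION[base]
    return rng.choice(TRANSVERSIONS[base])  # assumption: both equally likely

def nonsyn_fraction(cds, rng, rounds=100):
    """Fraction of simulated point mutations that change the encoded residue."""
    changed = total = 0
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    for _ in range(rounds):
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            for pos in range(3):
                mutant = codon[:pos] + mutate(codon[pos], rng) + codon[pos + 1:]
                changed += CODON_TABLE[mutant] != CODON_TABLE[codon]
                total += 1
    return changed / total

rng = random.Random(0)
print(nonsyn_fraction("ATGTGG", rng))  # Met and Trp each have a single codon
print(nonsyn_fraction("GGG", rng))     # only the third position is degenerate
```

Applied to all coding sequences of a genome, a procedure of this kind yields a single genome-wide proportion comparable to the value of 0.717 quoted in the text; the toy sequences here only demonstrate the mechanics.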
We used manually curated phenotypes from the study by Jimenez-Sanchez et al. to calculate Spearman's rank correlation between reduction in life expectancy (ordinal data: none, mild, moderate, and severe) and sequence identity to the closest homolog.
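As an illustration of the rank correlation, the sketch below encodes the four ordinal categories as integers and computes Spearman's coefficient as the Pearson correlation of average ranks. The numeric encoding, the helper names, and the example values are assumptions made for this example; they are not data from the study.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Ordinal encoding of the life-expectancy categories (an assumption).
SEVERITY = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}

# Invented example values, not data from the study.
life_expectancy_reduction = ["none", "mild", "mild", "moderate", "severe"]
identity_to_closest_homolog = [92.0, 75.0, 81.0, 60.0, 35.0]

rho = spearman([SEVERITY[s] for s in life_expectancy_reduction],
               identity_to_closest_homolog)
print(round(rho, 3))
```

Because Spearman's coefficient depends only on ranks, any order-preserving encoding of the four categories gives the same result.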
The functional categories of human genes used in our study were based on the annotation by GOA; 53 GO slim terms for GOA (http://www.geneontology.org/GO_slims/goslim_goa.obo) were considered, and the Benjamini-Hochberg algorithm was applied for multiple hypothesis correction.
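The multiple hypothesis correction can be sketched with the standard Benjamini-Hochberg step-up procedure. This is a minimal sketch, assuming one p-value per GO slim category; the function name and the example p-values are illustrative and not taken from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value to the smallest, keeping a running minimum
    # so that adjusted values are monotone in the original ranking.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Invented p-values for four hypothetical categories.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

A category would then be reported as significant when its adjusted p-value falls below the chosen false discovery rate.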
The gene expression profiles in 79 human tissues were obtained from the study by Su et al. We eliminated probe sets with cross-hybridization effects (as identified by Su et al.). In total, we considered expression profiles for 15,097 human genes. The expression value of gene G in tissue T was set to 1 if at least one of gene G's transcripts was detected as a “Present call” in tissue T based on the Affymetrix detection algorithm (provided by Su et al.), and to 0 otherwise. Similarity of Tissue Expression (STE) of a gene pair was defined as the Jaccard coefficient of the binary expression profiles of the two genes, that is, the ratio of the number of tissues where both genes are expressed to the number of tissues where at least one of the genes is expressed. We performed a likelihood ratio test to investigate whether the similarity in tissue expression influences the probability of being a disease gene independently of the sequence identity to the closest homolog. Logistic regression was used to model the probability of being a disease gene using the expression and sequence similarities. Under the null hypothesis, the disease gene probability is determined only by the sequence identity of the closest homolog; under the alternative hypothesis, the probability is determined by both the sequence identity and the tissue expression similarity of the closest homolog.
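The STE definition and the likelihood ratio test can be sketched as follows. This is a minimal illustration, not the code used in the study. The function names are hypothetical, returning `None` when neither gene is expressed in any tissue is an assumption, and the test is assumed to have one degree of freedom because the alternative model adds a single predictor. Fitting the two logistic regressions is not shown; their maximized log-likelihoods are taken as inputs.

```python
import math

def tissue_similarity(profile_a, profile_b):
    """Jaccard coefficient of two binary tissue-expression profiles."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    if either == 0:
        return None  # undefined when neither gene is expressed (assumption)
    return both / either

def lrt_pvalue(loglik_null, loglik_alt):
    """P-value of a likelihood ratio test with one degree of freedom.

    For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (loglik_alt - loglik_null)
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Toy profiles over five tissues (invented values).
gene_g = [1, 1, 0, 0, 1]
closest_homolog = [1, 0, 0, 1, 1]
print(tissue_similarity(gene_g, closest_homolog))  # 2 shared / 4 in union

# Hypothetical log-likelihoods of the null and alternative logistic models.
print(lrt_pvalue(-520.0, -515.0))
```

A small p-value would indicate that adding tissue expression similarity improves the fit beyond what sequence identity alone explains.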
The probabilities shown in represent conditional probabilities. Specifically, the conditional probability P(disease | seq_id_homolog) that a gene is associated with a genetic disease, given that it has a closest homolog with a certain sequence identity, was calculated according to the equation:

$$P(\mathrm{disease} \mid \mathrm{seq\_id\_homolog}) = \frac{P(\mathrm{seq\_id\_homolog} \mid \mathrm{disease})\, P(\mathrm{disease})}{P(\mathrm{seq\_id\_homolog})}$$

where P(seq_id_homolog | disease) is the probability that the closest homolog of a disease gene has a certain sequence identity, P(seq_id_homolog) is the probability that a randomly selected human gene (disease or non-disease) has a closest homolog with a certain sequence identity, and P(disease) is the probability that a random human gene is associated with a genetic disease. Importantly, because P(disease) is currently unknown (as we know only a fraction of all disease genes), we estimate P(disease | seq_id_homolog) up to a constant by assuming a certain value of P(disease). For display purposes, we assumed P(disease) = 0.2 in .
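Applying Bayes' rule to the quantities defined above, the conditional probability can be computed per sequence identity bin from two histograms: one over known disease genes and one over all human genes. The sketch below is illustrative only; the function name, bin labels, and counts are invented, and P(disease) = 0.2 is the display value assumed in the text.

```python
def conditional_disease_probability(disease_counts, all_counts, p_disease=0.2):
    """P(disease | identity bin) for each bin, via Bayes' rule.

    disease_counts: genes per identity bin among known disease genes
    all_counts:     genes per identity bin among all human genes
    p_disease:      assumed prior probability that a gene is a disease gene
    """
    n_disease = sum(disease_counts.values())
    n_all = sum(all_counts.values())
    result = {}
    for bin_label, count_all in all_counts.items():
        if count_all == 0:
            continue  # P(bin) is zero, so the conditional is undefined
        p_bin_given_disease = disease_counts.get(bin_label, 0) / n_disease
        p_bin = count_all / n_all
        result[bin_label] = p_bin_given_disease * p_disease / p_bin
    return result

# Invented counts for two coarse identity bins.
disease = {"low identity": 60, "high identity": 40}
everything = {"low identity": 400, "high identity": 600}
print(conditional_disease_probability(disease, everything))
```

Changing `p_disease` rescales every bin by the same factor, which is why the estimate is determined only up to a constant and why the relative differences between bins do not depend on the assumed prior.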