Our brains rapidly map incoming language onto what we hold to be true. Yet there are claims that such integration and verification processes are delayed in sentences containing negation words like ‘not’. However, research studies have often confounded whether a statement is true and whether it is a natural thing to say during normal communication. In an event-related potential (ERP) experiment, we aimed to disentangle the effects of truth-value and pragmatic licensing on the comprehension of affirmative and negated real-world statements. As in affirmative sentences, false words elicited a larger N400 ERP than true words in pragmatically licensed negated sentences (e.g., “In moderation, drinking red wine isn’t bad/good…”), whereas true and false words elicited similar responses in unlicensed negated sentences (e.g., “A baby bunny’s fur isn’t very hard/soft…”). These results suggest that negation poses no principled obstacle for readers to immediately relate incoming words to what they hold to be true.
The construct of calling has recently been applied to the vocation of medicine. We explored whether medical students endorse the presence of a calling or a search for a calling, and how calling related to initial speciality interest. 574 first-year medical students (84% response rate) were administered the Brief Calling Survey and indicated their speciality interest. For presence of a calling, the median response was mostly true for ‘I have a calling to a particular kind of work’ and moderately true for ‘I have a good understanding of my calling as it applies to my career’. For search for a calling, the median response was mildly true for both ‘I am trying to figure out my calling in my career’ and ‘I am searching for my calling as it applies to my career’. Mann–Whitney U (p < 0.05) results indicate that students interested in primary care (n = 185) versus non-primary care (n = 389) are more likely to endorse the presence of a calling. Students were more likely to endorse the presence of a calling than a search for a calling, with those interested in primary care expressing a stronger presence of a calling to medicine.
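The group comparison described above can be sketched with SciPy's Mann–Whitney U test. The Likert-style responses below are invented for illustration only; they are not the study's data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical Likert responses (1 = not at all true ... 5 = totally true)
# to a presence-of-calling item, for two interest groups.  These numbers
# are illustrative, not the survey's actual responses.
primary_care = [5, 4, 5, 5, 4, 5, 3, 5, 4, 5]
non_primary_care = [3, 4, 2, 3, 4, 3, 2, 4, 3, 3]

# Two-sided Mann-Whitney U test on the ordinal responses.
u_stat, p_value = mannwhitneyu(primary_care, non_primary_care,
                               alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```

The Mann–Whitney U test is a natural choice here because Likert responses are ordinal, so a rank-based test avoids assuming interval-scale or normally distributed data.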
Speciality; Calling; Medical students; Career
Comparison of a group of multiple observer segmentations is known to be a challenging problem. A good segmentation evaluation method would allow different segmentations not only to be compared, but to be combined to generate a “true” segmentation with higher consensus. Numerous multi-observer segmentation evaluation approaches have been proposed in the literature; STAPLE in particular probabilistically estimates the true segmentation by optimally combining the observed segmentations with a prior model of the truth. As an Expectation–Maximization (EM) algorithm, STAPLE converges to the desired local optimum only when given good initializations for the truth prior and the observer-performance prior. However, accurate modeling of the initial truth prior is nontrivial. Moreover, of the two priors, the truth prior always dominates, so in scenarios where meaningful observer-performance priors are available, STAPLE cannot take advantage of that information. In this paper, we propose a Bayesian decision formulation of the problem that permits the two types of prior knowledge to be integrated in a complementary manner in four cases with differing application purposes: (1) with known truth prior; (2) with observer prior; (3) with neither truth prior nor observer prior; and (4) with both truth prior and observer prior. The third and fourth cases are not addressed (or are effectively ignored) by STAPLE, so we propose a new method to combine multiple-observer segmentations based on the maximum a posteriori (MAP) principle, which respects the observer prior regardless of the availability of the truth prior. Based on the four scenarios, we have developed a web-based software application that implements this flexible segmentation evaluation framework for digitized uterine cervix images.
Experimental results show that our framework can flexibly and effectively integrate different priors for multi-observer segmentation evaluation, and that it generates results comparing favorably with those of the STAPLE algorithm and the Majority Vote Rule.
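Of the two baselines named, the Majority Vote Rule is simple enough to sketch directly: a pixel is labeled foreground when more than half of the observers mark it. A minimal NumPy illustration on toy binary masks (the masks are invented, not cervix-image data):

```python
import numpy as np

def majority_vote(segmentations):
    """Combine binary observer segmentations with the Majority Vote Rule:
    a pixel is foreground when more than half of the observers mark it."""
    stack = np.stack(segmentations).astype(int)
    votes = stack.sum(axis=0)
    # Strict majority: votes * 2 > number of observers.
    return (votes * 2 > stack.shape[0]).astype(int)

# Three toy 3x3 observer masks.
obs1 = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
obs2 = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
obs3 = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])
consensus = majority_vote([obs1, obs2, obs3])
print(consensus)
```

Unlike STAPLE or the MAP formulation above, this rule weighs all observers equally and uses no prior, which is exactly the limitation the Bayesian framework is designed to address.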
Ground truth; Bayesian decision; Precision; Segmentation; Multi-observer; Sensitivity; Specificity; STAPLE; Validation
The accurate assessment of the calibration of classification models is severely limited by the fact that there is no easily available gold standard against which to compare a model’s outputs. The usual procedures group expected and observed probabilities, and then perform a χ2 goodness-of-fit test. We propose an entirely new approach to calibration testing that can be derived directly from the first principles of statistical hypothesis testing. The null hypothesis is that the model outputs are correct, i.e., that they are good estimates of the true unknown class membership probabilities. Our test calculates a p-value by checking how (im)probable the observed class labels are under the null hypothesis. We demonstrate by experiments that our proposed test performs comparably to, and sometimes even better than, the Hosmer-Lemeshow goodness-of-fit test, the de facto standard in calibration assessment.
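The abstract states only the null hypothesis, so the following is a hedged Monte Carlo sketch of the general idea, not the authors' exact procedure: under the null, the labels are Bernoulli draws from the model's own predicted probabilities, so we can simulate such draws and ask how improbable the observed labels' log-likelihood is among the simulations.

```python
import math
import random

def mc_calibration_pvalue(probs, labels, n_sim=10000, seed=0):
    """Monte Carlo sketch of a first-principles calibration test.
    Under H0 the labels are Bernoulli draws from the model's predicted
    probabilities; the p-value is the fraction of simulated label sets
    whose log-likelihood is at least as low (as improbable) as observed."""
    rng = random.Random(seed)

    def loglik(ys):
        return sum(math.log(p if y else 1 - p) for p, y in zip(probs, ys))

    observed = loglik(labels)
    hits = 0
    for _ in range(n_sim):
        sim = [rng.random() < p for p in probs]
        if loglik(sim) <= observed:   # simulated labels at least as improbable
            hits += 1
    return hits / n_sim

# Illustrative predictions and labels (invented data).
probs = [0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.3, 0.85, 0.15, 0.5]
labels = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
print(mc_calibration_pvalue(probs, labels))
```

A large p-value means the observed labels are about as (im)probable as typical draws from the model's own probabilities, i.e., no evidence of miscalibration; grossly inverted labels would drive the p-value toward zero.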
OBJECTIVES: To examine the accuracy of a commercial polymerase chain reaction (PCR) test (Amplicor CTR, Roche Diagnostic Systems, Branchburg, NJ) for identification of endocervical chlamydial infections, through both laboratory evaluation and testing among a diverse teaching hospital patient population. METHODS: Testing of reliable threshold inocula and reproducibility were carried out using laboratory stock organisms. Paired endocervical samples from patients with a wide range of indications were tested by PCR and an established culture procedure, and discrepant pairs were further analyzed to determine true results. RESULTS: Laboratory evaluation suggested that one copy of target DNA from a viable organism consistently yielded a positive result, and test reproducibility was very good, with an overall coefficient of variation of 15%. Compared to true results in 1,588 paired clinical samples from 1,489 women with a 10% prevalence of infection, the PCR test and culture yielded respective sensitivities of 87.4% and 78.0%, and negative predictive values of 98.6% and 97.6%. Specificity and positive predictive value for both tests were 100%. Cost per specimen was nearly identical, at $18.84 and $18.88 respectively. Polymerase inhibitors and organisms lacking target DNA were not found in false-negative PCR samples. CONCLUSION: This commercial PCR test is accurate, cost-competitive, and much faster than culture for the diagnosis of endocervical chlamydial infections in a population with an intermediate prevalence of infection.
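The reported accuracy figures follow from the standard 2×2 formulas. The sketch below uses illustrative counts chosen to be consistent with the abstract's 10% prevalence and zero false positives; it is not the study's actual cross-tabulation:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 diagnostic-accuracy metrics from true/false
    positive/negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn)
    return sensitivity, specificity, ppv, npv

# Illustrative counts only: 159 infected of 1,588 samples (about the
# reported 10% prevalence), with zero false positives as reported.
sens, spec, ppv, npv = diagnostic_metrics(tp=139, fp=0, fn=20, tn=1429)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} "
      f"PPV={ppv:.1%} NPV={npv:.1%}")
```

Note how a high NPV can coexist with a moderate sensitivity when prevalence is low: most negatives are true negatives simply because most patients are uninfected.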
Use of a yeast-lactobacillus differential microbiological assay permitted investigation into the synthesis of biotin vitamers by a variety of bacteria. A major portion of the biotin activity was found extracellularly. The level of total biotin (assayable with yeast) greatly exceeded the level of true biotin (assayed with lactobacillus). Values for intracellular biotin generally showed good agreement between the assays, suggesting the presence of only true biotin within the cells. Bioautographic analysis of the medium after growth of each organism revealed the presence of large amounts of a vitamer which corresponded to dl-desthiobiotin on the basis of Rf value and biological activity. Biotin, when detected at all, was at very low concentrations. Also, an avidin-uncombinable vitamer was synthesized by a majority of the bacteria. Addition of d-biotin to the growth medium prevented completely the synthesis of both vitamers of biotin. d-Biotin-d-sulfoxide had no effect on the synthesis of desthiobiotin or the avidin-uncombinable vitamer. Addition of dl-desthiobiotin did not prevent its own synthesis nor that of the other vitamer. Control of vitamer synthesis is therefore highly specific for d-biotin. The avidin-uncombinable vitamer was produced only at repressed levels in the presence of high concentrations of both d-biotin and dl-desthiobiotin, which suggested that it is not a degradation product of these substances. A possible mechanism for the overproduction of the biosynthetic precursors of biotin is presented.
Acutely damaged myocardium was shown in 103 patients with suspected acute myocardial infarction using 99Tcm pyp. A significant incidence of false positive and false negative results occurred, 'true' results being defined by standard clinical, electrocardiographic, and enzyme criteria. Localisation of infarction compared reasonably well with standard electrocardiographic criteria but more frequently suggested true posterior involvement. Serial estimates of infarct size may be of value in the recognition of infarct extension during the acute phase. Viable perfused myocardium was shown in 63 patients with a variety of cardiac disorders using 129Cs. The technique gives a reliable indication of anterior infarction but tends to underestimate inferior infarction. There was good correlation with the electrocardiogram with regard to localisation and extent of infarction. Nineteen patients received both isotopes and were included in each of the above groups. The combination permits further assessment of equivocal results. Furthermore, as 129Cs demonstrates both previous and recent infarction whereas 99Tcm pyp accumulates only in acutely damaged myocardium, it was possible to estimate the extent of previous and recent myocardial damage.
Cardiac tumours can mimic collagen vascular disease and they are often accompanied by profound systemic upset. Both benign and malignant tumours may present in this way. Three cases of cardiac tumour, two malignant and one benign, are reported with just such a presentation. A review of fifteen similar case reports showed that a spectrum of different collagen vascular diseases was diagnosed and treated before the true diagnosis emerged. In half of these cases the cardiac tumour was only diagnosed at necropsy. The diagnosis of collagen vascular disease should not be made in the absence of corroborative laboratory data. In cases of malignant cardiac tumour, and less commonly with atrial myxoma, M mode and cross sectional echocardiography may not exclude the diagnosis. There may be a good response to steroid treatment in cases of suspected but not confirmed collagen vascular disease in which the true diagnosis is cardiac tumour.
Immediate and late postoperative results in 70 patients undergoing resection of a true left ventricular aneurysm (50 patients) or of an asynergic area (20 patients) are presented. The operative mortality was 14 per cent. Predicted survival by actuarial methods was 80 per cent at one year after operation and 65 per cent at six years. Functional improvement was obvious, with most of the survivors falling in NYHA class I or II. Factors influencing operative mortality were the clinical indication for operation and the anatomical lesion. Late postoperative results were better for true aneurysms than for asynergic areas. An asynergic area was usually associated with multiple coronary vessel lesions and a diffusely ischaemic myocardium. An aneurysm was often associated with single coronary vessel disease and with good function of the non-infarcted myocardium.
We used the Gompertz growth curve to model a simulated longitudinal dataset provided by the QTLMAS2009 workshop and applied genomic evaluation to the derived model parameters and to a model-predicted trait value.
Prediction of phenotypic information from the Gompertz curve allowed us to obtain genomic breeding value estimates for a time point with no phenotypic records. Although the true model used to simulate the data was the logistic growth model, the Gompertz model provided a good fit to the data. Genomic breeding values calculated from predicted phenotypes were highly correlated with the breeding values obtained by directly using the respective observed phenotypes. The accuracies (correlations between true and estimated breeding values) at time 600 were above 0.93, even though t600 was outside the time range used when fitting the data. The analysis of the parameters of the Gompertz curve successfully discriminated regions with QTL affecting the asymptotic final value, but it was less successful in finding QTL affecting the other parameters of the logistic growth curve. In this study we estimated the proportion of SNPs affecting a given trait, in contrast with previously reported implementations of genomic selection in which this parameter was assumed to be known without error.
The two-step approach used to combine curve fitting and genomic selection on longitudinal data provided a simple way for combining these two complex tasks without any detrimental effect on breeding value estimation.
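The first step of the two-step approach, fitting a growth curve and predicting the trait at an unrecorded time point, can be sketched with SciPy. The data below are synthetic, generated from a logistic curve plus the Gompertz fit described above to mimic the model mismatch; they are not the QTLMAS2009 records:

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, A, b, c):
    # Gompertz growth curve: asymptote A, displacement b, rate c.
    return A * np.exp(-b * np.exp(-c * t))

# Synthetic longitudinal records at five time points, generated from a
# logistic curve (the simulation's true model) without noise, purely
# for illustration.
t_obs = np.array([0.0, 132.0, 265.0, 397.0, 530.0])
y_obs = 100.0 / (1.0 + 9.0 * np.exp(-0.01 * t_obs))

# Fit the (mis-specified) Gompertz model to the logistic data.
params, _ = curve_fit(gompertz, t_obs, y_obs, p0=[100.0, 2.0, 0.01],
                      maxfev=10000)
A, b, c = params
# Predict the trait at t = 600, outside the fitted time range.
y600 = gompertz(600.0, A, b, c)
print(f"A={A:.1f}, predicted y(600)={y600:.1f}")
```

The fitted curve then supplies a "phenotype" at t = 600 for animals with no record there, which is what the genomic evaluation in the second step consumes.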
Openness is one of the most important principles in scientific inquiry, but there are many good reasons for maintaining secrecy in research, ranging from the desire to protect priority, credit, and intellectual property, to the need to safeguard the privacy of research participants or minimize threats to national or international security. This article examines the clash between openness and secrecy in science in light of some recent developments in information technology, business, and politics, and makes some practical suggestions for resolving conflicts between openness and secrecy.
“By academic freedom I understand the right to search for the truth and to publish and teach what one holds to be true. This right also implies a duty; one must not conceal any part of what one has recognized to be true. It is evident that any restriction of academic freedom serves to restrain the dissemination of knowledge, thereby impeding rational judgment and action.”
Albert Einstein, quotation inscribed on his statue in front of the National Academy of Sciences, Washington, DC.
Recently, an exact binomial test called SGoF (Sequential Goodness-of-Fit) has been introduced as a new method for handling high-dimensional testing problems. SGoF looks for statistical significance by comparing the number of null hypotheses individually rejected at level γ = 0.05 with the number expected under the intersection null, and then proceeds to declare a number of effects accordingly. SGoF detects an increasing proportion of true effects with the number of tests, unlike other methods for which the opposite is true. It is worth mentioning that the choice γ = 0.05 is not essential to the SGoF procedure, and more power may be reached at other values of γ depending on the situation. In this paper we enhance the possibilities of SGoF by letting γ vary over the whole interval (0,1). In this way, we introduce the ‘SGoFicance Trace’ (from SGoF's significance trace), a graphical complement to SGoF which can help to make decisions in multiple-testing problems. A script has been written for the computation in R of the SGoFicance Trace. This script is available from the web site http://webs.uvigo.es/acraaj/SGoFicance.htm.
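The core computation behind the trace is, for each γ, an exact one-sided binomial test of whether more p-values fall below γ than expected under the intersection null. The following is a minimal Python sketch with invented p-values; it is not the authors' R script:

```python
import math

def sgof_trace(pvalues, gammas):
    """Sketch of the SGoFicance Trace: for each threshold gamma, count
    the p-values at or below gamma and compute the exact one-sided
    binomial tail P(X >= k) with X ~ Binomial(n, gamma), i.e. the
    probability of that many rejections under the intersection null."""
    n = len(pvalues)
    trace = []
    for g in gammas:
        k = sum(p <= g for p in pvalues)
        tail = sum(math.comb(n, i) * g**i * (1 - g) ** (n - i)
                   for i in range(k, n + 1))
        trace.append((g, k, tail))
    return trace

# 20 illustrative tests: five genuine effects among uniform noise.
pvals = [0.001, 0.004, 0.009, 0.02, 0.03,
         0.11, 0.19, 0.27, 0.33, 0.41, 0.48, 0.55, 0.62, 0.70,
         0.76, 0.81, 0.88, 0.92, 0.96, 0.99]
for g, k, tail in sgof_trace(pvals, [0.01, 0.05, 0.1]):
    print(f"gamma={g:.2f}: {k} rejections, binomial tail p={tail:.4f}")
```

Plotting the tail probability against γ over (0,1) yields the trace itself; regions where the curve dips low indicate thresholds at which SGoF would declare effects.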
Our research is motivated by 2 methodological problems in assessing the diagnostic accuracy of traditional Chinese medicine (TCM) doctors in detecting a particular symptom whose true status has an ordinal scale and is unknown: imperfect gold standard bias and ordinal-scale symptom status. In this paper, we propose a nonparametric maximum likelihood method for estimating and comparing the accuracy of different doctors in detecting a particular symptom without a gold standard when the true symptom status has multiple ordered classes. In addition, we extend the concept of the area under the receiver operating characteristic curve to a hyper-dimensional overall accuracy for diagnostic accuracy, and propose alternative graphs for displaying the results visually. The simulation studies showed that the proposed method had good performance in terms of bias and mean squared error. Finally, we applied our method to our motivating example on assessing the diagnostic abilities of 5 TCM doctors in detecting symptoms related to Chills disease.
Bootstrap; Diagnostic accuracy; EM algorithm; MSE; Ordinal tests; Traditional Chinese medicine (TCM); Volume under the ROC surface (VUS)
Accurate determination of gestational age underpins good obstetric care. We assessed the performance of six existing ultrasound reference charts to determine gestational age in 1268 singleton IVF pregnancies, where “true” gestational age could be precisely calculated from the date of fertilisation. All charts generated dates significantly different from IVF dates (P < 0.0001 for all comparisons). We therefore generated a new reference chart, The Monash Chart, based on a line of best fit describing crown-rump length across 6 + 1 to 9 + 0 weeks of gestation (true gestational age) in the IVF singleton cohort. The Monash Chart, but none of the existing charts, accurately determined gestational age among an independent IVF twin cohort (185 twin pairs). When applied to scans of 3052 naturally conceived singletons, The Monash Chart generated estimated due dates that differed from those of all existing charts (P ≤ 0.004 for all comparisons). We conclude that commonly used ultrasound reference charts have inaccuracies. We have generated a CRL reference chart based on true gestational age in an IVF cohort that can accurately determine gestational age at 6–9 weeks of gestation.
Recognition codes for protein-DNA interactions typically assume that the interacting positions contribute additively to the binding energy. While this is known not to be precisely true, an additive model over the DNA positions can be a good approximation, at least for some proteins. Much less information is available about whether the protein positions contribute additively to the interaction.
Using EGR zinc finger proteins, we measure the binding affinity of six different variants of the protein to each of six different variants of the consensus binding site. Both the protein and binding site variants include single and double mutations that allow us to assess how well additive models can account for the data. For each protein and DNA alone we find that additive models are good approximations, but over the combined set of data there are context effects that limit their accuracy. However, a small modification to the purely additive model, with only three additional parameters, improves the fit significantly.
The additive model holds very well for every DNA site and every protein included in this study, but clear context dependence in the interactions was detected. A simple modification to the independent model provides a better fit to the complete data.
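An additive model of the kind tested above can be fit by least squares on a one-hot design matrix, one coefficient per (position, base) pair. The sketch below uses invented per-position energies for a 3-bp site, not the EGR measurements, and checks that the fit recovers the additive data exactly:

```python
import numpy as np

# Toy additive model: the binding energy of a 3-bp site is the sum of
# independent per-position, per-base contributions (invented values).
bases = "ACGT"
true_w = np.array([[0.0, 1.2, 0.8, 2.0],
                   [0.0, 0.5, 1.5, 0.9],
                   [0.0, 2.1, 0.3, 1.1]])  # rows: positions, cols: ACGT

def energy(site):
    return sum(true_w[j, bases.index(b)] for j, b in enumerate(site))

# Enumerate all 64 sites and fit additive weights by least squares on a
# one-hot design matrix (4 indicator columns per position).
sites = [a + b + c for a in bases for b in bases for c in bases]
X = np.zeros((len(sites), 12))
for i, s in enumerate(sites):
    for j, b in enumerate(s):
        X[i, 4 * j + bases.index(b)] = 1.0
y = np.array([energy(s) for s in sites])

# The design is rank-deficient (each position's columns sum to one), so
# lstsq returns the minimum-norm solution; predictions are still exact.
w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ w_fit
print(f"max |residual| = {np.abs(resid).max():.2e}")
```

With real measurements the residuals would not vanish; systematic residuals for particular protein-DNA combinations are exactly the "context effects" that motivate adding a few interaction parameters to the purely additive model.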
Self-recording of the blood pressure by patients away from hospital or office ("home blood pressure") has been advocated as providing a better estimate of "true" blood pressure. The reliability of home blood-pressure recording has been assessed only by standard indirect methods which themselves are subject to considerable error and variability. The accuracy of self-recorded blood pressures was therefore assessed in 57 patients with essential hypertension by comparison with simultaneous measurements of clinic blood pressures and with intra-arterial blood pressures recorded at home and at hospital. Home systolic blood pressures showed good agreement with clinic and intra-arterial pressures, but home diastolic blood pressures overestimated intra-arterial pressures, as did clinic diastolic pressures. The clinic and home diastolic pressures showed good agreement. There was considerable variability in individual differences comparing the indirect and intra-arterial methods, though the two indirect methods showed much closer agreement. This study suggests that home blood pressures are as accurate as clinic readings but may be recorded more frequently and thus provide more useful information. Neither is likely to approximate the intra-arterial blood pressure.
We report the largest and most comprehensive comparison of protein structural alignment methods. Specifically, we evaluate six publicly available structure alignment programs: SSAP, STRUCTAL, DALI, LSQMAN, CE and SSM by aligning all 8,581,970 protein structure pairs in a test set of 2930 protein domains specially selected from CATH v.2.4 to ensure sequence diversity.
We consider an alignment good if it matches many residues and the two substructures are geometrically similar. Even with this definition, evaluating structural alignment methods is not straightforward. First, we compared the rates of true and false positives using receiver operating characteristic (ROC) curves with the CATH classification taken as a gold standard. This proved unsatisfactory in that the quality of the alignments is not taken into account: sometimes a method that finds poorer alignments scores better than a method that finds better alignments. We correct this intrinsic limitation by using four different geometric match measures (SI, MI, SAS, and GSAS) to evaluate the quality of each structural alignment. With this improved analysis we show that there is a wide variation in the performance of different methods; the main reason for this is that it can be difficult to find a good structural alignment between two proteins even when such an alignment exists.
We find that STRUCTAL and SSM perform best, followed by LSQMAN and CE. Our focus on the intrinsic quality of each alignment allows us to propose a new method, called “Best-of-All” that combines the best results of all methods. Many commonly used methods miss 10–50% of the good Best-of-All alignments.
By putting existing structural alignments into proper perspective, our study allows better comparison of protein structures. By highlighting limitations of existing methods, it will spur the further development of better structural alignment methods. This will have significant biological implications now that structural comparison has come to play a central role in the analysis of experimental work on protein structure, protein function and protein evolution.
comparison of structural alignment; protein structure alignment; protein structure comparison; geometric measures; ROC curves
ABSTRACT Good guidelines will help us to take evidence into practice. In a survey among Dutch orthopedic surgeons, the development and use of evidence-based guidelines was perceived as one of the best ways of moving from opinion-based to evidence-based orthopedic practice. The increasing number of guidelines means that knowing how to make a critical appraisal of guidelines is now a key part of every surgeon’s life. This is particularly true because guidelines use varying systems to judge the quality of evidence and the strength of recommendations. In this manuscript we discuss what a guideline is, where we can find guidelines, and how to evaluate the quality of guidelines, and finally we provide an example of the different steps of guideline development. Thus, we show that good guidelines are a summary of the best available evidence and that they provide graded recommendations to help surgeons in evidence-based practice.
Indirect reciprocity is a form of reciprocity where help is given to individuals based on their reputation. In indirect reciprocity, bad acts (such as not helping) reduce an individual's reputation while good acts (such as helping) increase an individual's reputation. Studies of indirect reciprocity assume that good acts and bad acts are weighted equally when assessing the reputation of an individual. As different information can be processed in different ways, this is not likely to be the case, and it is possible that an individual could bias an actor's reputation by putting more weight to acts of defection (not helping) than acts of co-operation (helping) or vice versa. We term this difference ‘judgement bias’, and build an individual-based model of image scoring to investigate the conditions under which it may evolve. We find that, if the benefits of co-operation are small, judgement bias is weighted towards acts perceived to be bad; if the benefits are high, the reverse is true. Our result is consistent under both scoring and standing strategies, and we find that allowing judgement bias to evolve increases the level of co-operation in the population.
information processing; social information; co-operation; game theory; reputation; tragedy of the commons
Curves of growth delay (GD) or 'cure' after graded doses of radiation have been analysed for 16 lines of human and animal tumours grown as multicellular spheroids in vitro. Dose-survival curves were derived for those cellular units from which spheroids regrow after unsuccessful irradiation (spheroid-regenerating cellular units, SRU). For 10 sets of data from 6 spheroid lines, the Do's and extrapolation numbers of the SRU derived by GD could be compared with the response of the clonogenic cells of the spheroids. For Do, a good correlation (r = 0.910) was found between the two; this was true also for Do derived from curves of spheroid 'cure' (7 sets of data from 6 spheroid lines) and clonogenic cells (r = 0.986). Using GD, the correlation of extrapolation numbers was less good (r = 0.682), the values for SRU commonly being higher than those for clonogenic cells. This may reflect features of the growth curves of spheroids after the lower range of doses of radiation. For human and animal tumour spheroids of 250 microns or less, derived Do ranged from 0.5 to 2.5 Gy. For spheroids of 350 microns or more, derived Do for animal tumour lines ranged from 3.4 to 4.2 Gy, for human lines from 1.5 to 2.1 Gy.
This study examined maltreated and non-maltreated children’s (N = 183) emerging understanding of “truth” and “lie,” terms about which they are quizzed to qualify as competent to testify. Four- to six-year-old children were asked to accept or reject true and false (T/F) statements, label T/F statements as the “truth” or “a lie,” label T/F statements as “good” or “bad,” and label “truth” and “lie” as “good” or “bad.” The youngest children were at ceiling in accepting/rejecting T/F statements. The labeling tasks revealed improvement with age and children performed similarly across the tasks. Most children were better able to evaluate “truth” than “lie.” Maltreated children exhibited somewhat different response patterns, suggesting greater sensitivity to the immorality of lying.
Child witnesses; Child maltreatment; Competency examination; Moral development; Cognitive development
Systems are being developed that utilize algorithms to predict impending hypoglycemia using commercially available continuous glucose monitoring (CGM) devices and to discontinue insulin delivery if hypoglycemia is predicted. In outpatient studies designed to test such systems, CGM-measured glycemic indices will not only be important outcome measures of efficacy but, in certain cases, will be the only good outcome measure. This is especially true in short-term studies designed to reduce hypoglycemia, since the event rate for severe hypoglycemic events is too low for severe hypoglycemia to serve as a good outcome measure, and milder hypoglycemia often will be variably detected. Continuous glucose monitoring inaccuracy can be accounted for in the study design by increasing sample size and/or study duration.
continuous glucose monitoring; nocturnal hypoglycemia; severe hypoglycemia; type 1 diabetes mellitus
Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust, with good generalization performance to new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval of the error rate using repeated design and test sets selected from the available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore, methods for small-sample performance estimation, such as the recently proposed procedure called Repeated Random Sampling (RSS), are also expected to result in heavily biased estimates, which in turn translates into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT).
Our simulations reveal that repeated designs and tests based on resampling in a fixed bag of samples yield a biased variance estimate. We also demonstrate that it is possible to obtain an improved variance estimate by means of a procedure that explicitly models how this bias depends on the number of samples used for testing. For the special case of repeated designs and tests using new samples for each design and test, we present an exact analytical expression for how the expected value of the bias decreases with the size of the test set.
We show that via modeling and subsequent reduction of the small sample bias, it is possible to obtain an improved estimate of the variance of classifier performance between design sets. However, the uncertainty of the variance estimate is large in the simulations performed, indicating that the method in its present form cannot be directly applied to small data sets.
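The inflation of the observed between-design variance by small-test-set noise can be illustrated with a toy simulation. This is a hedged sketch under an assumed Uniform(0.1, 0.3) distribution of true error rates, not the paper's RIDT algorithm: the variance of observed test-set error rates is the true between-design variance plus binomial noise of roughly E[e(1-e)]/n_test, which shrinks as the test set grows.

```python
import random

random.seed(1)

def simulate(n_test, n_repeats=3000):
    """Each repeat designs a classifier whose true error rate varies
    between design sets (assumed Uniform(0.1, 0.3) here) and evaluates
    it on n_test fresh samples.  Returns the sample variance of the
    observed test-set error rates across repeats."""
    observed = []
    for _ in range(n_repeats):
        e = random.uniform(0.1, 0.3)          # true error of this design
        errors = sum(random.random() < e for _ in range(n_test))
        observed.append(errors / n_test)
    mean = sum(observed) / len(observed)
    return sum((x - mean) ** 2 for x in observed) / (len(observed) - 1)

true_var = (0.3 - 0.1) ** 2 / 12              # Var of Uniform(0.1, 0.3)
for n_test in (10, 100, 1000):
    print(f"n_test={n_test:5d}: observed var={simulate(n_test):.5f} "
          f"(true between-design var={true_var:.5f})")
```

With only 10 test samples the observed variance is several times the true between-design variance, which is the bias that motivates modeling its dependence on test-set size.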
Searching databases for distant homologues using alignments instead of individual sequences increases the power of detection. However, most methods assume that protein evolution proceeds in a regular fashion, with the inferred tree of sequences providing a good estimation of the evolutionary process. We investigated the combined HMMER search results from random alignment subsets (with three sequences each) drawn from the parent alignment (Rand-shuffle algorithm), using the SCOP structural classification to determine true similarities. At a false-positive rate of 5%, the Rand-shuffle algorithm improved HMMER's sensitivity by 37.5% compared with HMMER alone when easily identified similarities (identifiable by BLAST) were excluded from consideration. An extension of the Rand-shuffle algorithm (Ali-shuffle) weights sampling towards more informative sequence subsets. This approach improved the performance over HMMER alone and PSI-BLAST, particularly at higher false-positive rates. The improvements in performance of these sequence sub-sampling methods may reflect lower sensitivity to alignment error and irregular evolutionary patterns. The Ali-shuffle and Rand-shuffle sequence homology search programs are available by request from the authors.
The phylogenetic inference of ancestral protein sequences is a powerful technique for the study of molecular evolution, but any conclusions drawn from such studies are only as good as the accuracy of the reconstruction method. Every inference method leads to errors in the ancestral protein sequence, resulting in potentially misleading estimates of the ancestral protein's properties. To assess the accuracy of ancestral protein reconstruction methods, we performed computational population evolution simulations featuring near-neutral evolution under purifying selection, speciation, and divergence using an off-lattice protein model where fitness depends on the ability to be stable in a specified target structure. We were thus able to compare the thermodynamic properties of the true ancestral sequences with the properties of “ancestral sequences” inferred by maximum parsimony, maximum likelihood, and Bayesian methods. Surprisingly, we found that methods such as maximum parsimony and maximum likelihood that reconstruct a “best guess” amino acid at each position overestimate thermostability, while a Bayesian method that sometimes chooses less-probable residues from the posterior probability distribution does not. Maximum likelihood and maximum parsimony apparently tend to eliminate variants at a position that are slightly detrimental to structural stability simply because such detrimental variants are less frequent. Other properties of ancestral proteins might be similarly overestimated. This suggests that ancestral reconstruction studies require greater care to come to credible conclusions regarding functional evolution. Inferred functional patterns that mimic reconstruction bias should be reevaluated.
It is now possible to apply computational methods to known current protein sequences to recreate the sequences of ancestral proteins. By synthesising these proteins and measuring their properties in the laboratory, we can gain much information about the nature of evolution, better understand how proteins change and adapt over time, and develop insights into the environments of ancient organisms. Unfortunately, the accuracy of these reconstructions is difficult to evaluate. We simulate protein evolution using a simplified computational model and apply the various reconstruction methods to the sequences that arise from our simulations. Because we have the complete record of the evolutionary history, we can evaluate the reconstruction accuracy directly. We demonstrate that the reconstruction procedures in common use may have a bias toward overestimating the properties of these ancestral proteins, opposite to what has been assumed previously. An alternative method of creating these sequences, Bayesian sampling, is presented that can eliminate this bias and provide more robust conclusions.