Conceived and designed the experiments: AKF. Performed the experiments: AKF. Analyzed the data: AKF. Contributed reagents/materials/analysis tools: AKF. Wrote the paper: AKF.
An often reported but nevertheless persistently striking observation, formalized as the Newcomb-Benford law (NBL), is that the frequencies with which the leading digits of numbers occur in a large variety of data are far from uniform. Most spectacular seems to be the fact that in many data the leading digit 1 occurs in nearly one third of all cases. Explanations proposed for this uneven distribution of the leading digits include, among others, scale- and base-invariance. Little attention, however, has been paid to the interrelation between the distribution of the significant digits and the distribution of the observed variable. It is shown here by simulation that long right-tailed distributions of a random variable are compatible with the NBL, and that for distributions of the ratio of two random variables the fit generally improves. Distributions not putting most mass on small values of the random variable (e.g. symmetric distributions) fail to fit. Hence, the validity of the NBL requires the predominance of small values and, when thinking of real-world data, a majority of small entities. Analyses of data on stock prices, the areas and numbers of inhabitants of countries, and the starting page numbers of papers from a bibliography support this conclusion. In all, these findings may help to understand the mechanisms behind the NBL and the conditions needed for its validity. That this law is not only of scientific interest per se but also has substantial implications can be seen from the fields in which it has been suggested for practical use. These fields range from the detection of irregularities in data (e.g. economic fraud) to optimizing the architecture of computers regarding number representation, storage, and round-off errors.
Newcomb  observed how much faster the first pages of tables of decadic logarithms wear out than the last ones, indicating that the first significant figure is oftener 1 than any other digit, and that the frequency diminishes up to 9. Without giving actual numerical data and a strict formal proof, he reached the conclusion that “The law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable”, so that “every part of a table of anti-logarithms is entered with equal frequency”. This resulted in a table giving the probabilities of occurrence in the case of the first two significant digits; see Table 1.
More than half a century later, Benford rediscovered Newcomb's observation. Based on substantial empirical evidence from 20 different domains, such as the surface areas of 335 rivers, the sizes of 3259 U.S. populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an actual issue of Readers' Digest, the street addresses of the first 342 persons listed in American Men of Science, and 418 death rates, Benford stated a logarithmic law of frequencies of significant digits. This law gives

$$P(d_1) = \log_{10}\left(1 + \frac{1}{d_1}\right), \quad d_1 = 1, \ldots, 9, \qquad (1)$$

for the probability of the digit $d_1$ in the first place of observed numbers and

$$P(d_2) = \sum_{d_1=1}^{9} \log_{10}\left(1 + \frac{1}{10 d_1 + d_2}\right), \quad d_2 = 0, \ldots, 9, \qquad (2)$$

for the probability of the second-place digit $d_2$. Most of the 20 domain-specific distributions of the first-place digits showed rather good agreement with the logarithmic law (1), which later came to be known as Benford's or the Newcomb-Benford law (NBL), but the averaged distribution fitted nearly perfectly.
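As a small illustration, the two logarithmic formulas can be evaluated directly. The following Python sketch is not part of the original study; the function names are chosen here for illustration only. It tabulates the first- and second-place digit probabilities implied by the law.

```python
import math

def nbl_first_digit(d1):
    """NBL probability that the first significant digit equals d1 (1..9)."""
    return math.log10(1 + 1 / d1)

def nbl_second_digit(d2):
    """NBL probability that the second significant digit equals d2 (0..9).

    Obtained by summing the two-digit probabilities over all first digits.
    """
    return sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))

first = {d: nbl_first_digit(d) for d in range(1, 10)}
second = {d: nbl_second_digit(d) for d in range(10)}

for d, p in first.items():
    print(f"first digit {d}: {p:.4f}")
for d, p in second.items():
    print(f"second digit {d}: {p:.4f}")
```

Both sets of probabilities sum to one, and the second-place distribution is already much flatter than the first-place one.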
These findings initiated “a varied literature, among the authors of which are mathematicians, statisticians, economists, engineers, physicists, and amateurs”, as Raimi wrote in his comprehensive review of the first digit problem (p.521). After describing and discussing several approaches taken to ground the NBL, namely density and summability, scale-invariance, base-invariance, and mixture-distribution arguments, he concludes that, up to that time, Pinkham's scale-invariance argument gave the first theoretical explanation of the NBL, although it assumes a cumulative distribution function that cannot exist, pp.253–264, and yields only a poor numerical approximation. As an example, on p.533 Raimi mentions the half Cauchy distribution with scale parameter $a$ and density $f(x) = 2a / [\pi (a^2 + x^2)]$, $x \geq 0$, which satisfies all the relevant hypotheses stated by Pinkham. For this distribution, Pinkham's formula gives the lower and upper bounds .05 and .55, respectively, for the probability that the first-place digit is 1. But some 15 years earlier, Furry and Hurwitz had already derived much more precise bounds for the half Cauchy distribution.
Since Raimi's  review, the literature on the NBL has been expanded considerably. This can be seen from the bibliography compiled by Hürlimann  in 2006 with its 350 entries and from the up-to-date online bibliography implemented by Berger and Hill  that actually lists nearly 600 sources related to the NBL. Major theoretical advances have to be attributed to Hill who showed in a series of papers that base-invariance implies the NBL , , and that random samples coming from many random distributions may generate a compound distribution fulfilling the NBL . Schatte  and Lolbert  studied the NBL in dependence on the numeral base, with the result that the approximation by the NBL becomes worse outside a limited range of bases. The relationship between the distribution of first digits with Zipf's law, with prime numbers and Riemann zeta zeroes, and with order statistics was investigated by Irmay , Luque and Lacasa , and Miller and Nigrini , respectively. Further, it was shown that exponential random variables  and other survival distributions  obey the NBL, that mixtures of uniform distributions fulfill a (generalized) version of the NBL , , and that data coming from different types of multiplicative processes also result in a first-digit distribution following the NBL – (but see also earlier results in , ) as do geometric sequences, for example powers of two , p.525. Bounds for the approximation error to the NBL were given by Dümbgen and Leuenberger  for the (half-)normal, the log-normal, the Gumbel, and the Weibull (including the exponential) distributions.
The NBL has been shown to fit many empirical data rather closely: in addition to most of those analyzed by Benford, these include, among others, stock index returns, stock prices, eBay auctions, and consumer prices half a year after the introduction of the Euro in 2002. In contrast, the latter study found deviations from the NBL due to psychological pricing (consumer prices preferably ending in 0, 5, or 9) immediately after and a full year after the introduction of the Euro. So, the NBL may be useful as a benchmark for detecting irregularities in data. It has come into widespread use in economic fraud detection (e.g. tax evasion), but the NBL has also been proposed as a means to identify possible problems with survey data, self-reported ratings, and scientific results. Because of its prominence, the NBL has even found its way into esteemed newspapers. Another field for putting the NBL into practice, so far largely a prospective one, is computer design. Theoretical considerations concerned the interrelation between number representation and storage requirements, as well as round-off errors arising in the computation of products. Interestingly, Torres et al. provided empirical evidence that file sizes on PCs behave according to the NBL for the first and second digit. In all, the goal could be to optimize the architecture of computers in order to speed up precise calculations and to save storage, both by taking the implications of the NBL into account.
On the other hand, many data fit the NBL rather poorly or not at all, for example some mathematical functions such as square roots and the inverse. In his review, Raimi gave two empirical examples of failure of the NBL: the 1974 Vancouver (Canada) telephone book, where no number began with the digit 1, and the sizes of populations of all populated places with a population of at least 2500 from five US states according to the censuses of 1960 and 1970, where only 19% began with digit 1 but 20% with digit 2. To give but one recent empirical example, Beer's finding should be mentioned that terminal digits of data in pathology reports do not follow the NBL. A simple explanation of the incompatibility of empirical data with the NBL cannot be found in every case, but these three cases have their obvious peculiarities: assignment of telephone numbers in an arbitrary manner, truncation of population size at 2500 inhabitants, and rounding of data, comparable to psychological pricing of consumer goods.
It seems that the practical potential of the NBL has now been recognized, and that this empirically derived law can meanwhile be considered theoretically well analyzed. However, its relation to common distributions of random variables has so far been investigated only rudimentarily. In addition, previous studies concentrated on the first digit, derived the deviation of the distribution under consideration from the NBL by calculating or approximating the respective integrals, and did not consider functions of random variables. In contrast, the present study investigates the leading ten digits and counts their frequencies in simulated data for different numbers of generated figures, and it does so not only for the random variables themselves but also for ratios of them. This procedure gives an impression of the degree to which real data of finite sample size may approach the distribution predicted by the NBL, while adopting one of Newcomb's arguments stated in the second paragraph of his two-page 1881 note: “As natural numbers occur in nature, they are to be considered as the ratios of quantities. Therefore, instead of selecting a number at random, we must select two numbers at random, and inquire what is the probability that the first significant digit of their ratio is the digit n. To solve the problem we may form an indefinite number of such ratios, taken independently;…” (p.39).
This statement suggests the interpretation that Newcomb did not intend to consider numbers stemming from one and the same domain, for example from one of those investigated later by Benford, but that he had in mind numbers drawn at random from the universe of all possible domains. If so, the measure available for an object can be understood as the ratio of two numbers, one representing the object's size “per se” and the other representing the scaling unit. If the objects stem from the same domain and were measured on the same scale, the scaling constant is no longer of interest and considering the measures as given entities is appropriate, the more so as the Newcomb-Benford distribution has been shown to be scale-invariant. (This means that performing an admissible transformation of the ratio scale, that is, multiplying all of the values by a positive constant, neither reduces nor improves the degree to which the NBL fits the data.) Therefore, both relations are of interest: within one domain the relation between the NBL and the distribution of a random variable, and across domains the relation between the NBL and the ratio distribution of two random variables.
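Scale-invariance can be illustrated numerically. The following Python sketch is not taken from the original study. It assumes a variable whose decadic logarithm is uniform on [0, 1), which follows the NBL exactly, and rescales it by an arbitrary constant; the constant 3.7, the seed, and the sample size are hypothetical choices made for illustration.

```python
import math
import random

def first_digit(x):
    """First significant digit of a positive number, via scientific notation."""
    return int(f"{x:.15e}"[0])

def max_dev_from_nbl(values):
    """Maximal absolute deviation of first-digit frequencies from the NBL."""
    n = len(values)
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return max(abs(counts[d] / n - math.log10(1 + 1 / d)) for d in range(1, 10))

rng = random.Random(2024)                      # fixed seed, illustrative only
base = [10 ** rng.random() for _ in range(100_000)]   # log10 uniform on [0, 1)
rescaled = [3.7 * x for x in base]             # hypothetical change of unit

dev_base = max_dev_from_nbl(base)
dev_rescaled = max_dev_from_nbl(rescaled)
print(f"max deviation before rescaling: {dev_base:.4f}")
print(f"max deviation after rescaling:  {dev_rescaled:.4f}")
```

Both deviations should stay at the level of sampling noise, which is what scale-invariance predicts.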
Out of the manifold of common distributions seven were selected for the simulation study. Criteria for inclusion were, first, that each one of the distributions gives support for only, second, that some of the earlier investigated distributions should be included in order to allow comparisons, and, third, that across the selected distributions their shape should vary from right-skewed to left-skewed, including symmetric distributions. The seven types of distributions illustrated in Figure 1 and the resulting ratio-distributions (cf. Figure 2 for some of them) are the following ones.
For each of these distributions, random numbers were generated for a series of increasing sample sizes. For the ratio distributions, pairs of random numbers were generated, from which the ratios were calculated; that is, the ratio distributions were not involved directly. This can be seen as an advantage of the simulation approach: in principle, the distribution of the ratio of two (independent) random variables can be generated in this way for any two distributions of the numerator and the denominator, even without knowing the form of the distribution of the ratio. To save space, results will not be presented for all sample sizes under study, but mostly for two of them: a moderate one (a realistic sample size for real data) and a large one (to approximate the true distributions). In the next step, the frequencies of the first ten leading digits were counted. As for the sample sizes, results will be given in a reduced manner, namely for the first- and the second-place digits only. (No drastic irregularities became observable for the third-place and later digits. Moreover, it has been known since Newcomb that the distribution of the third-place digit already follows the uniform rather closely; see Table 1.) All of the calculations were performed in double precision by a FORTRAN program using the built-in function RANDOM, which produces uniformly distributed pseudo-random variables between 0 and 1.
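The simulation procedure described above can be sketched in a few lines. The following is a minimal Python illustration, not the author's FORTRAN program; the seed, the sample size of 50,000, the unit exponential parameter, and all function names are assumptions made here. It draws samples from a uniform, an exponential, and the ratio of two independent exponentials, counts the first significant digits, and reports the maximal absolute deviation from the NBL.

```python
import math
import random
from collections import Counter

def leading_digits(x):
    """Return (first, second) significant digit of a positive number."""
    s = f"{x:.15e}"                 # e.g. '1.230000000000000e-03'
    return int(s[0]), int(s[2])

def first_digit_freqs(samples):
    """Relative frequencies of the first significant digits 1..9."""
    counts = Counter(leading_digits(x)[0] for x in samples)
    n = len(samples)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}

def max_abs_dev(freqs):
    """Maximal absolute deviation from the NBL first-digit probabilities."""
    return max(abs(freqs[d] - math.log10(1 + 1 / d)) for d in range(1, 10))

def draw_positive(draw):
    """Redraw until the value is strictly positive (guards against exact zeros)."""
    x = draw()
    while x <= 0.0:
        x = draw()
    return x

rng = random.Random(12345)          # fixed seed, illustrative only
n = 50_000                          # hypothetical sample size

def expo():
    return draw_positive(lambda: rng.expovariate(1.0))

uniform_sample = [draw_positive(rng.random) for _ in range(n)]
expo_sample = [expo() for _ in range(n)]
ratio_sample = [expo() / expo() for _ in range(n)]   # ratio of two independent draws

dev_uniform = max_abs_dev(first_digit_freqs(uniform_sample))
dev_expo = max_abs_dev(first_digit_freqs(expo_sample))
dev_ratio = max_abs_dev(first_digit_freqs(ratio_sample))

print(f"uniform:              {dev_uniform:.4f}")
print(f"exponential:          {dev_expo:.4f}")
print(f"ratio of exponentials:{dev_ratio:.4f}")
```

The ratio is formed from pairs of draws, so the ratio distribution itself never has to be written down, which is the advantage of the simulation approach noted above.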
The numerical results for the uniform distribution and the ratio distribution of two uniforms are shown in Table 2. The uniform distribution produces a uniform distribution of first- and second-place digits, as was to be expected. Hence, the clear conclusion is that the uniform distribution and the NBL are incompatible. Nevertheless, it is instructive to consider in more detail the discrepancies between the simulated relative frequencies of the digits and their theoretical values, to get an impression of the precision that can be expected from the simulation study. Assuming a uniform distribution, the probabilities of occurrence are 1/9 for each first-place digit and 1/10 for each second-place digit. For the first-place digit, the deviation of the simulated relative frequencies from these values does not exceed .0231 for the realistic sample size and .0013 for the large sample size. Similar maximal discrepancies (.0250 and .0018, respectively) are obtained for the second-place digit. Nearly perfect agreement for the first two digits is found at a sample size under which the true distribution is generated nearly perfectly. Across all sample sizes, for each digit its simulated relative frequency lies within the approximate 99% confidence interval (corrected overall for the number of tests) around the corresponding probability. From this it can be concluded that the pseudo-random number generator works properly. Therefore, rather reliable results can be expected for all distributions under study, even for the realistic sample size, in terms of absolute differences between simulated and true distributions. But in terms of relative differences, that is, absolute differences divided by the true probabilities, the agreement must be expected to be much weaker: the maximal relative differences turn out to be 20.29% for the realistic and 1.17% for the large sample size in the case of the first-place digit, and 25.00% and 1.80%, respectively, in the case of the second-place digit.
One has to bear in mind these facts when evaluating the fit to the NBL in the presence of real data with moderate sample size, as well as when interpreting the results for the various distributions in the following.
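The distinction between absolute and relative differences, and the idea of a confidence interval corrected for the number of tests, can be made concrete with a short Python sketch. It is illustrative only: the exact correction used in the study is not recoverable from the text, so a Bonferroni correction with a normal approximation is assumed here, and the sample size of 1000 and the number of tests are hypothetical.

```python
import math
from statistics import NormalDist

def corrected_ci_half_width(p, n, n_tests, level=0.99):
    """Half-width of a normal-approximation confidence interval around p.

    The overall error probability (1 - level) is split evenly over n_tests
    two-sided intervals (Bonferroni correction, an assumption of this sketch).
    """
    alpha = (1 - level) / n_tests
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt(p * (1 - p) / n)

def relative_difference(freq, p):
    """Absolute difference between frequency and probability, relative to p."""
    return abs(freq - p) / p

# Uniform first-place digits: each of the 9 digits has probability 1/9.
half_width = corrected_ci_half_width(p=1 / 9, n=1000, n_tests=9)
print(f"corrected 99% half-width for n=1000: {half_width:.4f}")

# An absolute difference of .0013 from 1/9 is already more than 1% in relative terms.
rel = relative_difference(1 / 9 + 0.0013, 1 / 9)
print(f"relative difference: {100 * rel:.2f}%")
```

Because the digit probabilities are small, even modest absolute deviations translate into sizeable relative ones, which is the caveat stated above.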
In contrast to the uniform distribution, the ratio distribution of two uniforms fits the NBL rather well. For the realistic sample size, the maximal absolute difference between the simulated relative frequencies and the probabilities according to the NBL amounts to .0322; it is found for the leading digit 1 and corresponds to a relative difference of 10.7%. For the same sample size, even larger relative differences are found in some cases. For example, for the leading digit 9 the absolute difference is only .0159, which nevertheless results in a relative difference of nearly 35%. Especially for the large sample size, most of the simulated relative frequencies fall outside any usual confidence interval around the digits' probabilities as given by the NBL. Thus, the NBL does not hold in a strict sense for the ratio distribution of two uniforms; that is, for unrealistically large sample sizes the H0 “The digits' distributions follow the NBL” would have to be rejected. But the NBL approximates the digits' distributions to such a degree that it may be acceptable as an H0 in the presence of real data sets of typical sample size.
The numerical results for the exponential distribution with parameter values 0.5, 1, and 2 and for the ratio distribution of two exponentials are given in Table 3. The exponential distribution produces first- and second-place digits' distributions that come close to the Newcomb-Benford distribution. As derived theoretically by Engel and Leuenberger and also shown numerically by Leemis, Schmeiser and Evans, although for the leading digit only, the maximal absolute deviation is less than 0.03. This result was reproduced here, and it applies not only to large samples but also to samples of moderate size. Further, it generalizes to the second-place digit. Note that for both the first and the second place the quality of fit depends on the parameter and varies across the digits.
The ratio distribution of two exponentials with clearly outperforms these results. For , the maximal absolute deviation amounts to .0150 for the first-place digit to be 1 and to .0151 for the second-place digit also to be 1; for , the maximal deviation is found to be .0011 (first-place digit 1). Most simulated relative frequencies look as if they were generated under the NBL, and they lie within the confidence intervals introduced above, except for . Comparable results were obtained for some ratio distributions of exponentials with , but details will be omitted.
The numerical results for the half-normal distribution with parameter values 1, 2.5, and 5 and for the ratio distribution of two half-normals are shown in Table 4. The three half-normals under investigation do not fit the NBL as well as was to be expected following Dümbgen and Leuenberger, but far better than reported by Furry and Hurwitz. According to our results, the maximal deviance across all cases studied is found to be .0790 for the first-place digit 1 with parameter value 1, whereas Furry and Hurwitz reported .33. (Note that although Furry and Hurwitz speak of the normal distribution, they in fact investigated the half-normal, as can be seen from their formula (a) on p.53. Note further that they reported .115 for the deviance of the exponential distribution, which is now known to be much smaller (see above), but .0557 for the half Cauchy distribution, which was not included in the present study because of its similarity with the normal distribution; cf. p.300 in Johnson, Kotz and Balakrishnan.) The digits' distributions remain unaffected when the parameter is multiplied by integer powers of 10, so that, for example, the entries found in the first half of Table 4 also apply to the half-normal with parameter values 10, 25, and 50, respectively.
Surprisingly good fit to the NBL shows the ratio distribution of two half-normals with (independent of their actual values), , , and , . The fit is not as perfect as it is for the ratio of two exponentials, but it is better than that of the ratio of two uniforms. Especially good agreement is observed for the second-place digit under all three scenarios studied here, and even for the first-place digit the maximal deviance is found to be only .0089, .0087, and .0256, respectively (digit 1, ). Overall, it seems to make little difference of whether the variances of the two random variables are equal or not, with the slight tendency to worsen the fit if the variance of the variable in the denominator, , exceeds that of the numerator, , in the ratio .
The numerical results for the right-truncated normal distribution and the ratio distribution of two right-truncated normals are given in Table 5. Its entries speak for themselves so that a short comment will suffice. As compared with survival distributions, the right-truncated normal shows inverse behaviour in that it puts most mass on large values of the random variable. That is why the right-truncated normal was selected for inclusion in the present study. It turns out that it may serve as a prototypical example of distributions of random variables not leading to first- and second-place digits' distributions obeying the NBL. Presented are the figures only for , two distributions, with =1.1, =0.25 and =100, =15, and their ratio distributions. The discrepancies between the simulated digits' distributions and the Newcomb-Benford distribution are such that even for small sample sizes conventional goodness-of-fit tests, for example Pearson's chi-square and the likelihood-ratio test, have a good chance to become significant. Considering the ratio distribution of two right-truncated normals does not improve matters. (Note that nonconformance to the NBL was reported for the Gumbel distribution whose density also increases with increasing value of the random variable .)
Similar results were obtained for the normal distribution and the ratio distribution of two normals; see Table 6. The normal distribution, putting most mass around the mean of the random variable, was selected for inclusion in the present study as a further possible candidate of nonconformity with the NBL. Neither the normal distribution nor the ratio distribution of two normals disappointed this expectation. As for the right-truncated normal, figures are presented for and two sets of parameters only.
The numerical results for the chi-square distribution and the ratio distribution of two chi-squares are shown in Table 7. Regarding the chi-square distribution, a clear tendency becomes obvious. Very good fit to the NBL is found for the chi-square with (, maximal deviance .0065 for first-place digit 2), increasing the (shown for and ) worsens the fit considerably. This does not come as a surprise when taking the shape of the chi-square distribution into account: the chi-square with behaves like a survival distribution, for increasing it approaches a normal distribution.
The ratio distribution of two chi-squares (F-distribution) with fits better than does the chi-square. Moreover, the ratio distribution of two chi-squares proves more robust against increasing the . For , the simulated first- and second-place digits' distributions are nearly indistinguishable from the Newcomb-Benford distribution, and up to the deviance increases rather slowly. Note that the F-distribution with is formally identical to the ratio distribution of two exponentials with . Therefore, figures for were omitted; see Table 3.
The numerical results for the log-normal distribution are given in Table 8. For this two-parameter distribution, the fit to the NBL heavily depends on and slightly depends on . The larger and/or , the better is the fit. For , and , the misfit is massive, so that considering the effect of sample size becomes obsolete; hence figures are given for only. The best fit amongst the cases reported here is obtained with , : the simulated first- and second-place digits' distributions come very close to the Newcomb-Benford distribution when ; the maximal deviance amounts to .0064 and refers to the first-place digit 1. As the ratio distribution of two log-normals also follows the log-normal, no separate presentation of results is needed.
The results of the simulation study may be summarized in two statements. First, all types of distributions that turned out to be compatible with the NBL exhibit a common feature: they are long right-tailed and thus put most mass on small values of the random variable. These distributions include the exponential, the chi-square with very small degrees of freedom, the log-normal with large variance, and, with some limitations, the half-normal. The uniform, the normal, and the right-truncated normal distributions proved incompatible with the NBL. Second, the fit to the NBL generally improves when considering distributions of ratios of random variables. Among the seven types of ratio distributions studied here, five emerged as being consistent with the NBL. The ratio distribution of two exponentials, the ratio distribution of two chi-squares (F-distribution) with small degrees of freedom, and the ratio distribution of two log-normals with large variance fitted the first- and second-place digits' distributions as given by the NBL nearly perfectly; the ratio distributions of two uniforms and of two half-normals fitted them sufficiently well; only the ratio distributions of two normals and of two right-truncated normals completely failed to fit.
Together with findings reported earlier , , ,  regarding the conformance to the NBL for some survival distributions (exponential, Muth, Gompertz, Weibull, gamma, log-logistic, and exponential power distributions) our results indicate that the validity of the NBL requires that the frequency of ‘natural’ numbers in the sense of Newcomb  decreases with increasing magnitude. Roughly speaking, this means that small numbers have to be predominant. That is, when thinking of real-world data, conformity to the NBL necessitates a majority of small objects. As the NBL has often been shown to be valid, conversely it can be deduced that, at least within numerous domains of our world, small objects must occur much more frequently than do large ones. Some examples given in the following will sustain this conclusion.
Analyzed were the distributions of the following five variables plus their first- and second-place digits' distributions.
Overall, results are as expected. First, all five variables possess a marked majority of small and a clear minority of large realizations. Four of the five variables exhibit a distribution coming more (areas and inhabitants of countries) or less (stock prices: very low values are underrepresented) close to survival distributions. The distribution of one variable (the bibliography data) follows rather a step function than a continuously decreasing density function: the highest frequency is found for starting pages 1 to 99 as it was to be expected; the starting pages 100 to 199, 200 to 299, and 300 to 399 occur with markedly lower, but approximately constant frequency; then the frequency decreases sharply to a level remaining approximately constant for the following five 100-pages sections (Figure 3; frequency distributions on the left).
Second, in all five cases the first-place digit 1 is slightly underrepresented. Nevertheless, based on the Pearson chi-squared goodness-of-fit test (5% significance level), all of the first- and second-place digits' distributions are compatible with the NBL, with one exception: the first-place digit's distribution of the bibliography data clearly fails to fit the NBL; see Table 9. The best fit is found for the areas of countries and their numbers of inhabitants, weaker fit is found for both variants of stock prices (prices in Euro vs. prices in local currencies). Note that the second-place digit's distribution of the stock prices in Euro is a borderline case pointing at the importance not to look at the first-place digit only when testing for the fit of the NBL. The first- and second-place digits' distributions are shown in Figure 3 on the right, whereby observed values are represented by bars, values expected according to the NBL by a line.
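A Pearson chi-squared goodness-of-fit test of first-place digit counts against the NBL can be sketched as follows. This Python example is illustrative only: the digit counts are hypothetical and are not the data of Table 9, and the critical value 15.507 is the 95% quantile of the chi-square distribution with 8 degrees of freedom (nine digit categories minus one).

```python
import math

CHI2_CRIT_8DF_5PCT = 15.507   # 95% quantile of chi-square with 8 df

def nbl_first_digit_probs():
    """NBL probabilities for the first significant digits 1..9."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]

def pearson_chi2(observed, probs):
    """Pearson chi-squared statistic for observed counts vs. expected n * p."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

probs = nbl_first_digit_probs()

# Hypothetical counts: one set close to the NBL, one set with equal counts.
benford_like = [round(1000 * p) for p in probs]
uniform_like = [111] * 9

stat_benford = pearson_chi2(benford_like, probs)
stat_uniform = pearson_chi2(uniform_like, probs)

for name, stat in [("NBL-like", stat_benford), ("uniform-like", stat_uniform)]:
    verdict = "reject" if stat > CHI2_CRIT_8DF_5PCT else "do not reject"
    print(f"{name}: chi2 = {stat:.2f}, {verdict} the NBL at the 5% level")
```

For the second-place digit there are ten categories, so the test would use 9 degrees of freedom instead.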
Third, and most importantly, the examples demonstrate the link between the distribution of a random variable on the one hand and the first- and second-place digits' distributions on the other hand. The closer the shape of the distribution of a random variable comes to that of a survival distribution or a distribution behaving like a survival distribution, the better the first- and second-place digits' distributions follow the NBL. Regarding our five examples, the same ordering according to both properties is observable: the areas of countries and their numbers of inhabitants perform best, both versions of stock prices perform to some extent, and the bibliography data do not perform at all.
In the first part of this study seven types of common distributions were investigated regarding their conformance to the NBL. The results of the simulations showed first that all types of distributions behaving like survival distributions, that is, putting most mass on small values of the random variable and being long right-tailed, were compatible with the NBL. Second, distributions of the ratio of two random variables fitted better than did the distributions of a single random variable. For symmetric distributions (illustrated by example of the normal distribution), distributions tending to symmetry as a function of their parameters (illustrated by example of the chi-square and the log-normal distributions), and distributions whose density increases with increasing value of the random variable (illustrated by example of the right-truncated normal distribution), the misfit to the NBL was found to be substantial up to massive.
These observations together with the fact that the NBL – at least approximately – applies to many empirical data led to the suspicion that the size of ‘natural’ objects must follow a distribution behaving like a survival distribution in order to be able to obey the NBL. This suspicion could be substantiated by analyzing five sets of data. It turned out that the closer the distribution of a variable comes to that of a survival distribution the better is the fit to the NBL. Thereby, the fit to the NBL was tested formally by chi-square goodness-of-fit tests of the first- and second-place digits' distributions, whereas the fit of the observed variable's distribution to a survival distribution of unspecified form was informally assessed by visual inspection.
The overall conclusion resulting from the present study reads very simply. The frequently found good fit of the NBL to empirical data can be explained by the fact that in many cases the frequency with which objects occur in ‘nature’ is an inverse function of their size. Very small objects occur much more frequently than do small ones which in turn occur more frequently than do large ones and so on. Thus, the variable's distribution looks like a survival distribution whose leading digits' distributions follow the NBL, at least approximately.
It is somewhat surprising that in the literature on the NBL the connection between the distribution of a random variable and the leading digits' distributions has so far been investigated only for a handful of mainly survival distributions. Studies referring to empirical data concentrated solely on the leading digits' distributions, nearly always on the most significant digit only, neither discussing the relationship between the leading digits' distributions and the variable's distribution nor presenting the latter. As a consequence, reanalyzing empirical data collected earlier was not possible and new data had to be found. Presumably the present study is therefore the first one focusing on the connection between the variable's distribution and the leading digits' distributions, in both theoretical and empirical settings. It is to be hoped that future investigations on and applications of the NBL will pursue the approach taken here.
The author wishes to thank the Scientific Editor, Dr. Richard J. Morris, and an anonymous referee for their comments and suggestions that improved the completeness and presentation of this article. The author is especially grateful to Professor Theodore P. Hill, who also acted as a referee, for a very helpful discussion that clarified some critical issues arising from the literature. For having professionally prepared the figures, the author is indebted to Dipl.Ing. Martina Edl and Mike Swazina.
Competing Interests: The author has declared that no competing interests exist.
Funding: The author has no support or funding to report.