The identification of circulating tumor cells (CTCs) in the human circulatory system dates back to Ashworth’s 1869 paper 
in which he identified cells similar to those found in the primary tumor of a deceased cancer victim and pointed out their potential significance. Since then, there has been sporadic focus on CTCs as a key diagnostic tool in the fight against cancer, based mostly on the so-called ‘seed-and-soil’ hypothesis 
of cancer metastasis, in which the CTCs play the role of seeds that detach from the primary tumor, disperse through the bloodstream, and get trapped at various distant sites (typically small blood vessels of organ tissues); then, if conditions are favorable, they extravasate, form metastases, and subsequently colonize. The metastatic sites offer the soil for potential subsequent growth of secondary tumors. Paget’s 1889 seed-and-soil hypothesis 
asserts that the development of secondary tumors is not due to chance alone, but depends on detailed interactions, or cross-talk, between select cancer cells and specific organ microenvironments. In 1929, J. Ewing challenged the seed-and-soil hypothesis 
by proposing that metastatic dissemination occurs based on purely mechanical factors resulting from the anatomical structure of the vascular system, a proposal that is now known to be too simplistic an explanation for the metastatic patterns that are produced over large populations. While the seed-and-soil hypothesis remains a bedrock theory in cancer research, it has been significantly refined over the years to incorporate our current understanding of how the ability of a tumor cell to metastasize depends on its complex interactions with the homeostatic factors that promote tumor cell growth, cell survival, angiogenesis, invasion, and metastasis.
A schematic diagram associated with the metastatic process is shown in Figure 1. Here, the primary tumor (from which the CTCs detach) is located in the lower part of the diagram, and the distant potential secondary locations where CTCs get trapped and form metastases are shown. In this paper, we will not be concerned with extravasation, colonization, and the formation of secondary tumors, which are complex processes in their own right 
, but rather with a probabilistic description of metastatic progression from primary neoplasm to metastatic sites; hence, we provide a quantitative framework for charting the time-evolution of cancer progression along with a stochastic description of the complex interactions of these cells with the organ microenvironment. Also shown in the figure are representative scales of a typical red blood cell (8 µm), capillary diameter (5–8 µm), CTC (20 µm), and human hair diameter (100 µm). The total number of remote sites at which metastases are found for any given type of primary cancer is relatively small (see the autopsy data set described in 
), say on the order of 50 locations, those sites presumably being the locations at which CTCs get trapped and subsequently colonize. For any individual making up the ensemble, of course, the number of sites with metastatic tumors would be much smaller. A ‘ballpark’ estimate, based on the ratio of metastases to primary tumors (from 
) suggests a number around 9484/3827 ≈ 2.5 metastatic sites per patient, although in the modern era, this number is probably higher. A reasonably thorough overview of this process is described in 
Figure 1 Schematic diagram of human circulatory system showing circulating tumor cells (CTCs) detaching from primary tumor and getting trapped in capillary beds and other potential future metastatic locations as outlined by the ‘seed-and-soil’ ...
It was not until recently, however, that important technological developments made it possible to identify, isolate, extract, and genetically and mechanically study CTCs from cancer patients (see, for example 
). These new approaches, in turn, created the need for quantitative models that can predict and track CTC dispersal and transport in the circulatory and lymphatic systems of cancer patients for potential diagnostic purposes. As a rough estimate, data (based primarily on animal studies) show that within 24 hours after release from the primary tumor, fewer than 0.1% of CTCs are still viable, and of those, perhaps only a few can give rise to a metastasis. There are, however, potentially hundreds of thousands, millions, or billions of these cells detaching from the primary tumor continually over time 
, and we currently do not know how to deterministically predict which of these cells are the future seeds, or where they will take root. All of these estimates, along with our current lack of detailed understanding of the full spectrum of the biological heterogeneity of cancer cells, point to the utility of a statistical or probabilistic framework for charting the progression of cancer metastasis. This is a particularly important step for any potential future comprehensive computer simulation of cancer progression, something not currently feasible. Although the dispersion of CTCs is the underlying dynamical mechanism by which the disease spreads, the probabilistic framework obviates the need to model all of the biomechanical features of the complex processes by which cells journey through the vascular/lymphatic system. This paper provides the mathematical/computational framework for such an approach.
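To give a feel for the numbers quoted above, the following is a minimal sketch of the arithmetic, assuming the 0.1% figure is used as an upper bound on the fraction of cells still viable after 24 hours. The function name `expected_viable` and the three shedding counts are illustrative choices taken from the orders of magnitude mentioned in the text, not measured values.

```python
def expected_viable(cells_shed, viable_fraction=1e-3):
    """Upper bound on the expected number of shed cells still viable after 24 h.

    viable_fraction defaults to 0.1%, the upper bound quoted in the text.
    """
    return cells_shed * viable_fraction


# Illustrative shedding counts (orders of magnitude only, not measured data).
for shed in (10**5, 10**6, 10**9):
    bound = expected_viable(shed)
    print(f"{shed:>13,d} cells shed -> at most about {bound:,.0f} viable after 24 h")
```

Even at the low end, the surviving pool is far too large to track cell by cell, which is the practical motivation for a probabilistic description.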
In this paper, we develop a new Markov chain-based model of metastatic progression for primary lung cancer, which offers a probabilistic description of the time-history of the disease as it unfolds through the metastatic cascade 
. The Markov chain is a dynamical system whose state vector is made up of all potential metastatic locations identified in the data set described in 
(defined in the metastatic site numbering table), with normalized entries that can be interpreted as the time-evolving probability (measured in discrete steps k) of a metastasis developing at each of the sites in the network. One of the strengths of such a statistical approach is that we need not offer specific biomechanical, genetic, or biochemical reasons for the spread from one site to another; those reasons will presumably become available through more research on the interactions between CTCs and their microenvironment. We account for all such mechanisms by defining a transition probability (which is itself a random variable) of a random walker dispersing from one site to another, thus creating a quantitative and computational framework for the seed-and-soil hypothesis. This ensemble-based first step can then be further refined, primarily by using larger, better, and more targeted databases, such as ones that focus on specific genotypes or phenotypes, or by more refined modeling of the correlations between the trapping of a CTC at a specific site and the probability of secondary tumor growth at that location.
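The mechanics of such a chain can be illustrated with a minimal sketch. It assumes a row-stochastic transition matrix and a state vector updated as v_{k+1} = v_k A. The three-site network, the site labels, and all transition probabilities below are hypothetical placeholders chosen only to make the example run; they are not the values estimated in this paper, and the function names (`step`, `evolve`, `random_walk`) are likewise illustrative.

```python
import random


def step(state, transition):
    """One discrete step of the chain: v_{k+1} = v_k A (row vector times matrix)."""
    n = len(state)
    return [sum(state[i] * transition[i][j] for i in range(n)) for j in range(n)]


def evolve(state, transition, steps):
    """Return the list of state vectors v_0, v_1, ..., v_steps."""
    history = [state]
    for _ in range(steps):
        state = step(state, transition)
        history.append(state)
    return history


def random_walk(start, transition, steps, rng):
    """Sample one random walker's path through the sites (indices into the matrix)."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        nxt = rng.choices(range(len(transition)), weights=transition[current])[0]
        path.append(nxt)
    return path


# Hypothetical 3-site network; labels and probabilities are placeholders.
SITES = ["lung (primary)", "site A", "site B"]
A = [
    [0.5, 0.3, 0.2],  # transitions out of the primary site
    [0.1, 0.8, 0.1],  # transitions out of site A
    [0.2, 0.2, 0.6],  # transitions out of site B
]
v0 = [1.0, 0.0, 0.0]  # all probability mass starts at the primary tumor

history = evolve(v0, A, steps=3)
for k, v in enumerate(history):
    print(k, [round(p, 4) for p in v])

path = random_walk(0, A, steps=3, rng=random.Random(0))
print([SITES[i] for i in path])
```

Because each row of the matrix sums to one, every state vector in `history` remains a probability distribution over the sites, which is what allows its entries to be read as the probability of a metastasis at each site after k steps.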
Metastatic site numbering system.
The Markov chain dynamical system evolves on a network-based model of metastatic disease, which we construct from the available data over large populations of patients. In particular, we use the data described in the autopsy analysis in 
in which metastatic distributions in a population of 3827 deceased cancer victims were analyzed. None of the victims received chemotherapy or radiation. The autopsies were performed between 1914 and 1943 at 5 separate affiliated centers, with an ensemble distribution of 41 primary tumor types and 30 metastatic locations. Figure 2 shows histograms of the number of metastases found at the various sites in the population: one panel shows the metastatic distribution in the entire population, while the other shows the distribution in the subset of the population with primary lung cancer. We note that these data offer no particular information on the time history of the disease for the population or for individual patients; they give only the long-time metastatic distribution in a population of patients, where long-time is associated with end of life, a timescale that varies significantly from patient to patient (even among those with nominally the same disease). Although this paper focuses on a model for primary lung cancer, the approach would work equally well for all of the main tumor types. Indeed, one of the goals of future studies will be to compare the models obtained for different cancer types.
Figure 2 Metastatic distributions from autopsy data set extracted from 3827 patients.
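Histogram counts of this kind become a probability distribution over sites once they are normalized. The sketch below shows only that normalization step, under stated assumptions: the site labels and counts are placeholders and are not taken from the autopsy data set, and `normalize` is a hypothetical helper name.

```python
def normalize(counts):
    """Convert per-site metastasis counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {site: n / total for site, n in counts.items()}


# Placeholder counts for illustration only (NOT values from the autopsy data).
counts = {"site A": 40, "site B": 35, "site C": 25}
distribution = normalize(counts)

for site, p in distribution.items():
    print(f"{site}: {p:.2f}")
```

The resulting entries sum to one, so they can be compared directly with the normalized state vector of a Markov chain at long times.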
Network-based models of disease progression have been developed recently in various contexts such as the spread of computer viruses 
, general human diseases 
, and even cancer metastasis 
, but as far as we are aware, our Markov chain/random walk approach to modeling the dynamics of the disease on networks constructed for each primary cancer type from patient populations offers a new and potentially promising computational framework for simulating disease progression. More general developments on the structure and dynamics on networks can be found in the recent works 
. For brief introductions to some of the mathematical ideas developed in this paper, see