We present here, for the first time, a comprehensive assessment of gene expression patterns in the developing human lung between 53 and 154 dpc, a time interval spanning the pseudoglandular and canalicular stages. These data, including thousands of individual gene expression patterns, are publicly available. We have applied novel nonparametric, regression-based smoothing to the primary dataset in an effort to minimize subject/sample and age estimation-dependent effects. This approach effectively improves dataset quality as objectively defined by increasing similarity between trajectories for multiple probe sets interrogating the same transcript (data not shown). The resulting dataset significantly expands our current understanding of changes in the expression of genes during human lung development.
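As an illustration of what nonparametric, regression-based smoothing of an expression trajectory can look like, the following minimal sketch applies Gaussian kernel regression (a Nadaraya-Watson estimator) to a single probe set. This section does not name the smoother that was used, so the estimator, the function name `kernel_smooth`, the bandwidth, and the toy values are all assumptions made for illustration only.

```python
import math

def kernel_smooth(ages, values, bandwidth):
    """Nadaraya-Watson smoothing: each smoothed value is a Gaussian-weighted
    average of all observed values, weighted by closeness in age.
    (Assumed smoother; the study's actual method is not specified here.)"""
    smoothed = []
    for a0 in ages:
        weights = [math.exp(-0.5 * ((a - a0) / bandwidth) ** 2) for a in ages]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed

# Hypothetical ages (dpc) and noisy expression values for one probe set.
ages = [53, 63, 73, 83, 93, 103]
values = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
smoothed = kernel_smooth(ages, values, bandwidth=10.0)
```

A larger bandwidth borrows information from more distant ages, which suppresses sample-specific fluctuation at the cost of blurring rapid transitions.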
We distilled the developing human lung transcriptome into independent, dominant directions of gene expression variation using unsupervised PCA. Analyzing the developmental time series in this way, we can objectively resolve the contribution of different biological/functional modules throughout the course of development. For each PC, a characteristic gene set was defined and its corresponding bioontologic attribute profile identified. The characteristic gene sets were significantly enriched for many attributes known to be involved in lung development, but whose roles in these early stages have not been well characterized (e.g., surfactant). In addition, a number of genes previously unappreciated as playing a role in lung development were discovered using this approach.
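The definition of a characteristic gene set can be sketched as follows, under stated assumptions. The example computes the first principal component of a small simulated samples-by-genes matrix by power iteration and then keeps the genes whose loading magnitudes fall in the top 5%. The function names, the simulated data, and the use of power iteration are illustrative choices and are not taken from the study's pipeline.

```python
import math
import random

def center_columns(X):
    """Subtract each gene's mean across samples (rows = samples, columns = genes)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def first_pc_loadings(X, n_iter=200, seed=0):
    """Leading principal direction of the centered matrix via power iteration."""
    rng = random.Random(seed)
    Xc = center_columns(X)
    n_genes = len(Xc[0])
    v = [rng.gauss(0, 1) for _ in range(n_genes)]
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in Xc]      # Xc v
        v = [sum(s * row[j] for s, row in zip(scores, Xc))               # Xc^T (Xc v)
             for j in range(n_genes)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v

def characteristic_genes(loadings, gene_names, top_fraction=0.05):
    """Genes with the largest loading magnitudes (top 5% by default)."""
    k = max(1, math.ceil(top_fraction * len(loadings)))
    ranked = sorted(range(len(loadings)), key=lambda j: abs(loadings[j]), reverse=True)
    return [gene_names[j] for j in ranked[:k]]

# Simulated data: two age-dependent genes among 38 low-variance noise genes.
rng = random.Random(0)
ages = list(range(10))
gene_names = ["gene_up", "gene_down"] + [f"noise_{j}" for j in range(38)]
X = [[2.0 * t, -2.0 * t] + [rng.gauss(0, 0.1) for _ in range(38)] for t in ages]

loadings = first_pc_loadings(X)
characteristic = characteristic_genes(loadings, gene_names)
```

Because selection uses the magnitude of the loading, genes that rise and genes that fall along a component are both retained.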
Our PCA characterization facilitated visualization of the time-series as a linear trajectory defined by gene expression variation. The most dominant direction of sample variation in this developing human lung transcriptome (PC1) was strongly correlated with estimated gestational age. This is intuitive and consistent with our earlier studies of the developing mouse lung transcriptome time series (18). However, the short gestational period of the mouse coupled with the low sampling frequency of that dataset limited the ability to detect finer transcriptomic granularity in that process.
Traditional embryology partitions growth of the developing human lung into five stages: embryonic (26 dpc to 5 weeks), pseudoglandular (5–16 weeks), canalicular (16–26 weeks), saccular (26 weeks to birth), and alveolar (birth to 6 months), based upon selective morphological features that are observed within these intervals. However, it has long been appreciated that complex biological processes occurring during lung development are not specifically associated with this classical stage definition. In the human data, we observed a gene expression transition point at approximately 117 dpc, reflecting the age associated with the morphologic transition from the pseudoglandular to canalicular stages. A second transition point was noted at approximately 94 dpc, suggesting the presence of at least two distinct molecular phases within the pseudoglandular stage. The substages identified by this PCA characterization might reflect critical molecular windows of lung development (3) and suggest that molecular staging in this manner captures information that is complementary to, but distinct from, classic morphologic staging alone. Given sufficient sampling/measurement frequency, PCs based upon genome-wide expression data could form a rational basis for an alternative and objective molecular taxonomy of lung development.
To further explore the biology identified by our analysis of the developing human lung transcriptome, we focused on characteristic gene sets (the highest 5% loading coefficient magnitude genes) of PC1–3 and their ontological attributes. We defined the 3,223-gene union of PC1–3 characteristic genes as the developing lung characteristic subtranscriptome (DLCS), which may be regarded as the minimal set of genes that accounts for a significant proportion of the transcriptomic variation in the early fetal developing human lung. Among the 46 genes common to all PC1–3 characteristic gene sets are three surfactant-associated genes (SFTPB, SFTPC, and CLDN18), three MHC class II genes, and 12 substrate-specific transporter genes. Overall, the DLCS contained 28 of 77 genes previously identified to be functionally involved in general lung development (14), which is a 2.9-fold enrichment (odds ratio P < 0.5 × 10⁻⁴ by two-tailed Fisher exact test; Table E9). This enrichment for functionally relevant genes and the prominence of surfactants strongly suggest the PC characterization to be a meaningful abstraction of the specific molecular biology underlying lung development.
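The enrichment calculation can be sketched as a 2 × 2 contingency table evaluated with a two-tailed Fisher exact test. The counts 28, 77, and 3,223 come from the text; the background size (the total number of genes considered) is not stated in this section, so `background_size` below is a hypothetical placeholder, and the resulting fold enrichment and P value are not expected to reproduce the reported figures.

```python
from math import comb

def hypergeom_pmf(x, row1, col1, total):
    """Probability of x overlaps when row1 genes are drawn from `total`,
    of which col1 belong to the gene set of interest."""
    return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact test for the table [[a, b], [c, d]]:
    sums the probabilities of all tables no more likely than the observed one."""
    row1, col1, total = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - total), min(row1, col1)
    p_obs = hypergeom_pmf(a, row1, col1, total)
    p = sum(pr for pr in (hypergeom_pmf(x, row1, col1, total)
                          for x in range(lo, hi + 1))
            if pr <= p_obs * (1 + 1e-7))
    return min(1.0, p)

known_genes = 77          # genes previously implicated in lung development
known_in_dlcs = 28        # of those, the number found in the DLCS
dlcs_size = 3223          # size of the DLCS
background_size = 20_000  # HYPOTHETICAL: not given in this section

a = known_in_dlcs
b = known_genes - known_in_dlcs
c = dlcs_size - known_in_dlcs
d = background_size - known_genes - c

fold_enrichment = (a / known_genes) / (dlcs_size / background_size)
p_value = fisher_exact_two_tailed(a, b, c, d)
```

The fold enrichment here is the observed fraction of known genes in the DLCS divided by the fraction expected by chance, so it scales directly with the assumed background size.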
There were 35 bioontologic attributes common to all PC1–3 characteristic gene sets. Not surprisingly, these included attributes related to cell cycle and cellular division, which are universal to general developmental processes. Additionally, there was a significant presence of surfactant–gaseous exchange and immunologic–MHC class II attributes. Genes with these attributes possess expression profiles that were largely increasing from 53 to 140 dpc. Although the production of surfactant is associated with later gestational development, our observation that these surfactant proteins are expressed relatively early in fetal lung development is consistent with a prior description of SFTPB and SFTPC expression as early as Week 13 in humans (30) and before the end of the pseudoglandular stage in mice (31). The early expression of surfactant proteins may also indicate a necessity of early molecular programming of the lung for subsequent development. This is supported by recent work revealing the potential of embryonic stem cells to form glandular respiratory epithelium after coculture with dissociated fetal lung (32). Additionally, forkhead box M1 (FOXM1), recently described as a key regulator of surfactant protein expression (33), is a PC1 characteristic gene here, further supporting the role of surfactants in early lung development.
We addressed the robustness of our findings in multiple ways. First, we noted that PCA using the DLCS, a feature set enriched in informative expression profiles likely to increase the signal-to-noise ratio, returned a time-contiguous trajectory very similar to that of the full transcriptome. qPCR for DLCS genes demonstrated a very high rate of global validation for individual gene expression trajectories (83%) and for differential expression of substage marker genes (70%). Next, we assessed the validity of this PCA-based characterization by testing its ability to predict the age of independent developing human or mouse lung samples. We divided the 38 fetal human lung samples into training (22 samples) and test (16 samples) sets and used the principal components of the training set to predict gestational age in the test set. We found that we could accurately estimate the gestational age of the test samples using the 3,223-gene DLCS or the full (genome-wide) transcriptome. The ability to independently predict gestational age from a lung transcriptomic profile may be viewed as proof of the concept that the PC characterization captures the essential biology of the system because one expects age to be a critical parameter in the lung development process. The ability to chronologically order developing mouse lung profiles by their age in the human lung development PC space further suggests the DLCS identifies conserved molecular mechanisms and bioontologic attributes.
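The training/test validation can be illustrated with a minimal sketch on simulated data. It derives the first principal component from 22 training samples, projects the 16 held-out samples onto it, and predicts age by a linear fit of age on the PC1 score. Only the 38/22/16 split sizes and the 53 to 154 dpc range come from the text; the simulated expression values, the use of a single component, and the linear fit are assumptions, since this section does not specify how the components were mapped to age.

```python
import math
import random

def leading_pc(Xc, n_iter=200, seed=0):
    """Leading principal direction of an already centered matrix (power iteration)."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in Xc[0]]
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in Xc]
        v = [sum(s * row[j] for s, row in zip(scores, Xc)) for j in range(len(v))]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Simulated data: 38 samples, 30 genes with age-dependent expression plus noise.
rng = random.Random(1)
ages = [53 + 101 * i / 37 for i in range(38)]
gene_slopes = [rng.uniform(-1, 1) for _ in range(30)]
X = [[s * a / 50 + rng.gauss(0, 0.1) for s in gene_slopes] for a in ages]

# Random split into 22 training and 16 test samples.
idx = list(range(38))
rng.shuffle(idx)
train, test = idx[:22], idx[22:]

# Gene means and PC1 are learned from the training samples only.
means = [sum(X[i][j] for i in train) / len(train) for j in range(30)]

def center(row):
    return [x - m for x, m in zip(row, means)]

pc1 = leading_pc([center(X[i]) for i in train])

def score(row):
    return sum(x * w for x, w in zip(center(row), pc1))

slope, intercept = fit_line([score(X[i]) for i in train], [ages[i] for i in train])
predicted = [slope * score(X[i]) + intercept for i in test]
mean_abs_error = sum(abs(p - ages[i]) for p, i in zip(predicted, test)) / len(test)
```

Centering and the component are estimated without reference to the test samples, so the prediction error reflects performance on data the model has not seen.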
There are several limitations to this study. First, the dataset is limited to a relatively small number of samples from a specific window of time. We analyzed only 38 samples, including 29 distinct time points, over a 101-day sampling age range solely from early fetal stages. A larger sampling window and a higher sampling frequency would add to the details of the analysis and may lend further support for the presence of additional molecular stages that we are unable to investigate more comprehensively with the current data. However, the lack of available human samples during late fetal gestation means that extending the embryonic time points beyond the current age window is essentially impossible. We have performed preliminary microarray analysis of newborn and juvenile human lung tissue. Data from these samples are confounded by multiple variables, including varying age at birth and primary or secondary pulmonary complications. Preliminary analysis of data from these samples suggests they represent a poor comparison group for the fetal data set, and thus we have decided against including them in the current study. Second, there may be a degree of error in the estimation of fetal gestational age. This and other factors, such as maternal smoking history (34), may affect gene expression in individual samples. However, our findings were robust in the prediction of gestational age. In fact, we implemented an expression profile smoothing approach specifically to minimize variability due to sampling. Third, we did not dissect individual cellular contributors to fetal lung development. Instead, we defined the process as the dynamic sum of all cellular constituents. Whole lung tissue transcriptome profiles may have limited success in resolving cell/tissue specific molecular events, particularly during the later stages of development when the lung cell population is more heterogeneous with distinct cooccurring and specialized molecular processes. Nonetheless, our data are representative of the overall molecular events occurring in the early fetal human lung.
In conclusion, we present novel expression patterns for thousands of genes in the developing normal human lung. Our analyses of the human fetal lung transcriptome support current developmental paradigms and have revealed the existence of molecular phases of development. Our results have been validated via different modalities and appear robust. These observations, and the application of these analytic strategies to more comprehensive data sets, will further our understanding of essential molecular events in normal development and pathological derangements of the lung and will provide further insights into critical windows of development wherein environmental exposures may lead to subsequent disease.