
Ann Appl Stat. Author manuscript; available in PMC 2017 December 20.

Published in final edited form as:

PMCID: PMC5737754

NIHMSID: NIHMS901911

Egocentric network sampling observes the network of interest from the point of view of a set of sampled actors, who provide information about themselves and anonymized information on their network neighbors. In survey research, this is often the most practical, and sometimes the only, way to observe certain classes of networks, with the sexual networks that underlie HIV transmission being the archetypal case. Although methods exist for recovering some descriptive network features, there is no rigorous and practical statistical foundation for estimation and inference for network models from such data. We identify a subclass of exponential-family random graph models (ERGMs) amenable to being estimated from egocentrically sampled network data, and apply pseudo-maximum-likelihood estimation to do so and to rigorously quantify the uncertainty of the estimates. For ERGMs parametrized to be invariant to network size, we describe a computationally tractable approach to this problem. We use this methodology to help understand persistent racial disparities in HIV prevalence in the US. We also discuss some extensions, including how our framework may be applied to triadic effects when data about ties among the respondent’s neighbors are also collected.

There is growing interest in the statistical modeling of network data across a wide range of fields: from the study of political coalitions in the social sciences, to protein-protein interaction networks in genetics and the spread of infectious diseases in epidemiology. In some cases, it is possible to observe the complete network of interest, but in others the network must be sampled. Estimating network models from sampled data raises some unique issues. While progress has been made in developing a general framework for statistical inference (Handcock and Gile, 2010), there is a need for feasible methods that can be used with common network sampling designs in different fields.

In this paper we present a framework for inference from egocentrically sampled network data, a network sampling design that is common in the social and population sciences. Egocentrically sampled network data contain very limited information about network structure: for those individuals in the sample, only information about their immediate partners in the network is observed, and even that information is often limited to non-identifying demographics. The work was motivated by a specific question in the field of HIV epidemiology—Does network structure help explain the persistent racial disparities in HIV prevalence in the United States?—but it has the potential for wide application given the simplicity of collecting egocentrically sampled network data in the population sciences.

The HIV epidemic in the US is now in its third decade. While the rate of transmission has dropped, the racial disparities in HIV prevalence have become entrenched. An African American today is 10 times more likely than a white American to be living with HIV/AIDS. The disparity begins early in life (Morris et al., 2006), persists through to old age (NCHH-STP, 2013), and is evident among all risk groups: heterosexuals, men who have sex with men (MSM), and injection drug users. The disproportionate risks faced by heterosexual African-American women are especially steep. In 2010, the most recent year for which statistics are available (NCHH-STP, 2012), the annual rates of heterosexually acquired infections by demographic subgroup were roughly 33, 19, 7 and 1 per 100,000 persons for African-American women and men and White women and men, respectively. Similar disparities are found among other sexually transmitted infections, both bacterial and viral (Morris et al., 2006), and for the older reportable STIs, like gonorrhea and syphilis, they stretch back to the earliest reports in the 1960s (NCDC, 1967).

Empirical studies repeatedly find that these disparities cannot be explained by systematic differences in individual behavior, such as higher numbers of partners or rates of injection drug use, or lower condom use (Hallfors et al., 2007, for example). Nor have race-linked biological differences been identified that could explain disparities across this wide range of pathogens. What all of these infections do share, however, is an underlying transmission network. This network can channel the spread of infection in the same way that a transportation network channels the flow of traffic, with emergent patterns that reflect the connectivity of the system, rather than the behavior of any particular element.

A growing body of work is therefore focused on the role that network structure may play in explaining these disparities. Descriptive analyses and simulation studies (Laumann et al., 1992, 1994; Morris, 1993a; Morris and Kretzschmar, 1997) have focused attention on two structural features: homophily and concurrency. Homophily is the strong propensity for within-group partner selection. It is a common pattern for many social attributes, though not all. (For example, most sexual partnerships are cross-sex rather than same-sex.) When present, homophily leads to clustered, segregated networks. Concurrency is non-monogamy—having partners that overlap in time. While there is a very strong norm of monogamy in sexual partnerships, deviations from the norm occur. When present, concurrency increases network connectivity by allowing for the emergence of stable network connected components larger than dyads (pairs of individuals).

The hypothesis is that these two network properties together can produce the sustained HIV/STI prevalence differentials we observe: differences in concurrency between groups are the mechanism that generates the prevalence disparity, while homophily is the mechanism that sustains it. To test this hypothesis, we need to assess the strength and significance of observed concurrency differentials and homophily by race, and to evaluate whether the observed network mechanisms predict differentials in network exposure by race and sex that are consistent with the differentials in observed HIV prevalence.

Statistical models for social networks, like exponential-family random graph models (ERGMs), let us test for these effects, and we can simulate from them to predict network exposure; but these models must first be fit to available data that can support broad population-level inference. Our main statistical challenge is, therefore, to fit network models (ERGMs in particular) to egocentrically sampled data, to obtain rigorous measures of uncertainty for these fits, and to do so in a computationally feasible manner when the population of interest is very large or its size is unknown. We elaborate on these models and these data in turn.

Exponential-family random graph models (ERGMs) are a popular and, importantly for us, parsimonious class of stochastic models for graphs in general and for social networks in particular (Frank and Strauss, 1986; Wasserman and Pattison, 1996; Hunter and Handcock, 2006). An ERGM expresses the probability of an observed graph **y** as an exponential family:

$${\mathrm{Pr}}_{\mathit{g}}(\mathit{Y}=\mathit{y};\mathit{x},\mathit{\theta})\equiv \exp\{{\mathit{\theta}}^{\top}\mathit{g}(\mathit{y},\mathit{x})\}/{\kappa}_{\mathit{g}}(\mathit{\theta},\mathit{x}),\quad \mathit{y}\in \mathcal{Y}.$$

(1.1)
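As a minimal illustration of (1.1), and not part of the methodology developed in this paper, the following Python sketch evaluates an ERGM by brute-force enumeration on a network small enough for the normalizing constant to be summed directly. The choice of statistics (edge count and number of actors with exactly one tie), the parameter values, the omission of the covariates *x*, and all function names are assumptions made purely for illustration.

```python
from itertools import combinations, product
from math import exp


def all_graphs(n):
    """Yield every undirected graph on actors 0..n-1 as a frozenset of dyads."""
    dyads = list(combinations(range(n), 2))
    for mask in product([0, 1], repeat=len(dyads)):
        yield frozenset(d for d, on in zip(dyads, mask) if on)


def stats(y, n):
    """Illustrative g(y): (edge count, number of actors with exactly one tie)."""
    degree = [0] * n
    for i, j in y:
        degree[i] += 1
        degree[j] += 1
    return (len(y), sum(1 for d in degree if d == 1))


def ergm_pmf(theta, n):
    """Return {graph: probability} under (1.1); kappa is found by enumeration."""
    weight = {
        y: exp(sum(t * g for t, g in zip(theta, stats(y, n))))
        for y in all_graphs(n)
    }
    kappa = sum(weight.values())  # normalizing constant
    return {y: w / kappa for y, w in weight.items()}


# Hypothetical parameter values on a 4-actor network (2**6 = 64 graphs).
pmf = ergm_pmf(theta=(-1.0, 0.5), n=4)
```

Enumeration of this kind is feasible only for a handful of actors, because the number of graphs doubles with every additional dyad.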

It is specified by the sample space of possible networks (configurations of relationships) and a sufficient statistic vector ** g**(

Analogously to Pr**_{g}**(·;

Estimating **θ** facilitates inference about the social forces that shape the network as well as principled simulation of complete networks whose features are similar, on average, to those of the network observed. In the case of sampled network data in particular, it would allow recovering possible full networks from which the sample may have been drawn. Therefore,

Network data are distinguished by having two units of analysis: the actors and the links between the actors. This gives rise to a range of sampling designs that can be classified into two groups: link-tracing designs (e.g., snowball and respondent-driven sampling) and egocentric designs. Much of the recent literature has focused on developing model- or design-based inference for link-tracing designs (Thompson and Frank, 2000; Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008; Snijders, 2010; Handcock and Gile, 2010; Tomas and Gile, 2011; Illenberger and Flötteröd, 2012; Pattison et al., 2013). Our work focuses on the egocentric designs that are more commonly used in the social sciences, but are less well developed statistically.

Egocentric network sampling comprises a range of designs developed specifically for the collection of network data in social science survey research. The design is (ideally) based on a probability sample of respondents (“egos”) who, via interview, are asked to nominate a list of persons (“alters”) with whom they have a specific type of relationship (“tie”), and then asked to provide information on the characteristics of the alters and/or the ties. The alters are not directly observed. Depending on the study design, alters may or may not be uniquely identifiable, and respondents may or may not be asked to provide information on one or more ties among alters (the “alter” matrices). Alters can, in theory, also be present in the data as an ego or as an alter of a different ego; the likelihood of this depends on the sampling fraction and the network’s topology.

In this work, we focus on the minimal egocentric network study design, in which alters cannot be uniquely identified and alter matrices are not collected. The minimal design is more common, and the data are more widely available, for three reasons.

If the relationship of interest is sensitive, requiring full identification of the alters is likely to reduce respondent disclosure, and knowledge of alter–alter ties by the respondent may be unreliable. In addition, Institutional Review Boards often forbid the collection of identifiable data about the alters, as the alters have not given informed consent for their personal information to be collected. The minimal egocentric design allows for representative data to be collected in such contexts, with less intrusion and full consent. In public health research on HIV and other STIs, for example, egocentric study designs make it possible to conduct empirical research on how individual sexual behavior influences the population structure of infection transmission networks. There is a growing international archive of public data from such studies, with comparable surveys now available from over 50 different countries as far back as the early 1990s (MEASURE DHS, 2000–2014; Tanfer, 1991; Laumann et al., 1992; Udry, 2003; NSFG, 2002, 2006–2011, for example).

The number of potential alter–alter ties grows quadratically with the number of alters nominated by the ego, quickly making data collection burdensome for the respondent, and difficult to justify in surveys that must serve multiple needs. As a result, even the less sensitive forms of social network data tend to be collected using the minimal egocentric design. Perhaps the best known example is the egocentric friendship network data collected annually by the General Social Survey since 1985 (Burt, 1984), which was used in the landmark study of the decline in American friendship and social support networks as described in *Bowling Alone* (Putnam, 2000).

While in theory, the “seed nodes” for a link-traced sample could be chosen at random from the population of interest, the real strength of these designs is the ability to sample from hidden or inaccessible populations, where no sampling frame is available, and this is the application context in which they are most often used. Egocentric designs, by contrast, sample egos using standard sampling methods, and the sampling of links is implemented through the survey instrument. As a result, these methods are easily integrated into population-based surveys, and, as we show below, inherit many of the inferential benefits.

This sampling design has immediate implications for the scope of models that can be fit. For example, lack of information on alter–alter ties means that triadic (friend-of-a-friend) effects are not identifiable, and the lack of information on alter degree means that *degree assortativity* (a correlation between degrees of linked actors) (Gupta et al., 1989) also cannot be identified. These limitations must be kept in mind when drawing inferences, but they are not likely to undermine the specific inferences we draw here. Further discussion can be found in Section 8.2.

Despite the widespread availability of minimal egocentrically sampled network data, statistical methods for analyzing them are still relatively undeveloped. Early work focused on descriptive methods for analyzing “mixing matrices”, cross-tabulations of ego–alter dyads by actor attributes (Marsden, 1981; Morris, 1991), or bivariate associations between ego attributes and alter summary statistics (Marsden, 1987; Admiraal, 2009). van Duijn, van Busschbach, and Snijders (1999) used a multilevel (mixed effects) model to analyze factors affecting changes over time in such networks, based on repeated observations of the same relations and conditioning on the initial time point’s relationships. More recent work has focused on the key topic of recovering whole-network attributes, such as clique size distributions, from egocentric data (Gjoka, Smith, and Butts, 2014b), but does not provide a framework for inference.

Handcock and Gile (2010) established a general framework for model-based inference for networks based on sampled data that allows for egocentrically sampled data as a special case: when only dyads incident on those in the sample are observed. (Koskinen, Robins, and Pattison (2010) developed a similar approach in a Bayesian framework.) Unfortunately, their likelihood approach is infeasible for our problem for three reasons. First, it requires fitting an ERGM to a network of size equal to that of the population from which the egos were sampled, which is often on the order of millions, and possibly unknown. Second, it assumes uniquely identified alters: one can identify when one ego nominates another ego and when two egos nominate the same alter. For most egocentrically sampled data (including all of the studies cited above), alters are not identified. Although a likelihood can be derived for this case too, it requires integration over the space of networks that produce *exactly* the observed dataset, a more complex constraint. Third, if the data come from a complex (even just weighted) design, ignorability of the sampling process might not hold, requiring nested integration over the sampling process as well.

Krivitsky, Handcock, and Morris (2011) described how the sufficient statistic needed to fit certain ERGMs may be derived from egocentrically sampled data and used to simulate networks consistent with egocentric observations. This approach has been used in applied contexts (Morris et al., 2009; Goodreau et al., 2010; Smith, 2012). What remains lacking, however, is a general, rigorous framework for ERGM inference for such data, and we turn to the pseudo-MLE (PMLE)^{1} (Binder, 1983; Pfeffermann, 1993, for example) approach to do this.

The rest of the article proceeds as follows. In Section 2, we describe the notation and the sampling framework for egocentrically sampled network data, and in Section 3, we specify an ERGM subfamily amenable to being fit to such data. The pseudo-MLE of **θ** (calculated by obtaining a design-based estimator of the ERGM sufficient statistic of the network of interest and fitting the ERGM to that) and its asymptotic properties are derived in Section 4, along with a method for quantifying its uncertainty. Overviews of implementation issues and of a validating simulation study are given in Sections 5 and 6, respectively, with the details left to the supplemental article (Krivitsky and Morris, 2016). Finally, in Section 7, we apply our developments to the question of the impact of network structure on persistent racial disparities in HIV prevalence in the US.

Let *N* be the population being studied: a very large, but finite, set of actors whose relations are of interest, and let *x** _{i}* be a vector of attributes (e.g., age, sex, race) of an actor

Throughout, ** y** will refer to what we will call the

Now, let *e** _{i}* be the “egocentric” view of network

Then, (*e** _{i}*)

In the following developments, we will assume that egocentric observations are sampled using a conventional sampling design, with *N* as the sampling frame, though as we discuss in Section 5, this is not critical in practice. The proposed methods can be applied to more complex—stratified, for example—designs, but here, we focus on simple probability designs, and designs that can be approximated with simple probability designs. Specifically, let inclusion probabilities *π _{i}* Pr(

Analogously to the ERGM process, we will use E* _{S}*(·) and var

Even if the whole population is observed (i.e., *S* = *N*, a census), not every ERGM can be fit to such data, and we turn to the notion of sufficiency to identify those that can. Define an ERGM of the form (1.1) to be *egocentric* if both its sufficient statistic and its sample space constraints (if any) can be recovered from an egocentric census. We discuss them in turn.

We call a network statistic *g _{k}*(·

$${g}_{k}(\mathit{y},\mathit{x})\equiv {\sum}_{i\in N}{h}_{k}({\mathit{e}}_{i}),$$

(3.1)

for some function *h _{k}*(·) of egocentric information associated with a single actor. For the egocentric observations

Examples of egocentric statistics for undirected networks. x_{i,k} may be a dummy variable indicating i’s membership in a particular exogenously defined group. h_{k}(**e**_{i}) that sum over ties are halved because each tie is observed egocentrically twice: **...**

As one might expect from the discussion in Section 1.2, statistics measuring triadic closure, degree assortativity, and 4-cycles are not egocentric under this sample design. However, some of them (and thus the ERGMs that use them) may be egocentric under different egocentric sampling designs, which we discuss in Section 8.2.
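The egocentric form (3.1) can be made concrete with a small sketch. The Python code below builds each actor's egocentric view from a toy undirected network (an egocentric census) and recovers three network totals by summing per-actor contributions; tie-based contributions are halved because each tie is reported by both of its endpoints. The toy network, the group labels, and all function names are hypothetical and serve only to illustrate the idea.

```python
def egocentric_view(i, edges, attr):
    """Return e_i: ego i's attribute and its alters' attributes (no alter IDs)."""
    alters = [attr[b] if a == i else attr[a] for a, b in edges if i in (a, b)]
    return {"ego": attr[i], "alters": alters}


def h_edges(e):
    """Contribution to the edge count; halved since each tie is seen twice."""
    return len(e["alters"]) / 2


def h_homophily(e):
    """Contribution to the count of within-group ties; also halved."""
    return sum(1 for a in e["alters"] if a == e["ego"]) / 2


def h_monogamy(e):
    """Contribution to the count of actors with exactly one tie (not halved)."""
    return 1.0 if len(e["alters"]) == 1 else 0.0


def egocentric_total(h, views):
    """g_k = sum over actors of h_k(e_i), as in (3.1)."""
    return sum(h(e) for e in views)


# Toy census: five actors in groups A/B, three ties, one isolate.
attr = ["A", "A", "B", "B", "A"]
edges = [(0, 1), (1, 2), (2, 3)]
views = [egocentric_view(i, edges, attr) for i in range(len(attr))]

g_edges = egocentric_total(h_edges, views)
g_homophily = egocentric_total(h_homophily, views)
g_monogamy = egocentric_total(h_monogamy, views)
```

A statistic such as the number of triangles has no analogous `h` function here, because the views carry no information on ties among alters.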

Other statistics that are not egocentric include the average number of neighbors of an actor—*g _{k}*(

We call the sample space $\mathcal{Y}(\cdot,\cdot)$ of an ERGM *egocentric* if it can be expressed as

$$\mathcal{Y}(N,\mathit{x})\equiv \left\{\mathit{y}\in {2}^{\mathbb{Y}(N)}:\prod_{i\in N}\mathscr{H}({\mathit{e}}_{i})\ne 0\right\},$$

for some indicator function $\mathscr{H}(\cdot)$ that depends only on egocentric information associated with a single actor. For example, $\mathscr{H}({\mathit{e}}_{i})={1}_{\mid {\mathit{e}}_{i}^{\mathrm{a}}\mid \le d}$ would constrain **y** (

For the remainder of this paper, we will fix (*e** _{i}*) = 1 so that (

Our inferential goal is to fit ERGMs to unobserved networks based on egocentric samples from them: to recover the parameters that would have been estimated had an ERGM been fit to fully observed ** y**. Because

Most treatments of ERGM estimation treat ** θ** as a parameter of a superpopulation process of which

$$\text{sc}(\widehat{\mathit{\theta}})\equiv \mathit{g}(\mathit{y})-{\mathit{\mu}}_{\mathit{g}}(\widehat{\mathit{\theta}})=\mathbf{0},$$

(4.1)

which has a unique solution ** = **** θ_{g}**{

In contrast, we treat ** θ** as a finite population parameter, defined implicitly for the unobserved population network

Our inferential approach is, therefore, to obtain an estimator for ** g** based on the available data, then substitute it into (4.1), solving it to obtain a pseudo-MLE

Following Binder (1983), substituting (3.1) into (4.1) gives a score equation of the form of Binder’s eq. 2.6. Binder’s Assumptions (a), (c), and (d) (open parameter space, smoothness, and continuity of variance, respectively) are guaranteed by finite exponential family properties of ERGMs. Assumption (b) calls for an asymptotically normal estimator of the population total **g** and a consistent estimator of its variance. We treat them in turn.

For our design, we use the inverse-probability weighted estimator (*Hájek estimator*) scaled to the population size (Hájek, 1971). With **y**,

$$\stackrel{\sim}{\mathit{h}}({\mathit{e}}_{S})\equiv {\sum}_{i\in S}{w}_{i}\mathit{h}({\mathit{e}}_{i})/w.$$

(4.2)

is a design-consistent—if slightly biased—estimator of ** ***g**/*|*N*|, the population mean contribution of each actor to the sufficient statistic. (Fuller, 2011, p. 61) Scaling it to the population size, **(***e** _{S}*) |

$${\mid S\mid}^{\frac{1}{2}}(\stackrel{\sim}{\mathit{g}}({\mathit{e}}_{S})-\mathit{g})=\,\mid N\mid {\mid S\mid}^{\frac{1}{2}}(\stackrel{\sim}{\mathit{h}}({\mathit{e}}_{S})-\overline{\mathit{h}})\stackrel{d}{\to}{\text{MVN}}_{p}(\mathbf{0},{\mid N\mid}^{2}{\mathit{\Sigma}}_{\mathrm{H}}),$$

where

$${\mathit{\Sigma}}_{\mathrm{H}}\equiv {\mu}_{w}^{-2}\,\left(\overline{\mathit{h}}\,{\overline{\mathit{h}}}^{\top}{\mathit{\Sigma}}_{w,w}-\overline{\mathit{h}}{\mathit{\Sigma}}_{w,w\mathit{h}}-{\mathit{\Sigma}}_{w\mathit{h},w}{\overline{\mathit{h}}}^{\top}+{\mathit{\Sigma}}_{w\mathit{h},w\mathit{h}}\right),$$

(4.3)

with *μ _{w}* E

$$\left[\begin{array}{cc}{\mathit{\Sigma}}_{w,w}& {\mathit{\Sigma}}_{w,w\mathit{h}}\\ {\mathit{\Sigma}}_{w\mathit{h},w}& {\mathit{\Sigma}}_{w\mathit{h},w\mathit{h}}\end{array}\right]\equiv {\mathit{\Sigma}}_{[w,w\mathit{h}]}\equiv {\mathrm{var}}_{S}\,\left(\left[\begin{array}{c}{w}_{i}\\ {w}_{i}\mathit{h}({\mathit{e}}_{i})\end{array}\right]\right).$$

Provided the fourth moments of *w _{i}* and

Then, the PMLE ** = **** θ_{g}**{

$${\mid S\mid}^{\frac{1}{2}}(\stackrel{\sim}{\mathit{\theta}}-\mathit{\theta})\stackrel{d}{\to}{\text{MVN}}_{p}\,\left(\mathbf{0},{\{{\nabla}_{\mathit{\theta}}{\mathit{\mu}}_{\mathit{g}}(\mathit{\theta})\}}^{-1}{\mid N\mid}^{2}{\mathit{\Sigma}}_{\mathrm{H}}{[{\{{\nabla}_{\mathit{\theta}}{\mathit{\mu}}_{\mathit{g}}(\mathit{\theta})\}}^{-1}]}^{\top}\right).$$

(4.4)

Notably, ** θ_{g}**{

In addition to the estimator for *Σ*_{H}, the variance in (4.4) calls for an estimator of **_{θ}μ_{g}**(

$${\mathrm{var}}_{S}(\stackrel{\sim}{\mathit{\theta}})\approx {[{\stackrel{\sim}{\mathrm{var}}}_{\mathit{g}}\{\mathit{g}(\mathit{Y});\stackrel{\sim}{\mathit{\theta}}\}]}^{-1}({\mid N\mid}^{2}{\stackrel{\sim}{\mathit{\Sigma}}}_{\mathrm{H}}/\mid S\mid ){[{\stackrel{\sim}{\mathrm{var}}}_{\mathit{g}}\{\mathit{g}(\mathit{Y});\stackrel{\sim}{\mathit{\theta}}\}]}^{-1},$$

(4.5)

an estimator of the form of Binder (1983, eq. 3.4).
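To make the two design-based ingredients concrete, the following Python sketch computes the weighted mean (4.2) and a plug-in estimate of the matrix in (4.3) from a weighted egocentric sample. It relies on the algebraic identity that (4.3) is the covariance of the linearized residuals *w*_{i}(**h**(**e**_{i}) − mean of **h**) divided by the mean weight, with sample moments substituted for population ones. The function names, the use of the *n* − 1 denominator, and the toy data are assumptions for illustration, not a description of any package's implementation.

```python
def hajek_mean(w, h):
    """Weighted mean of per-ego contributions, eq. (4.2); h is a list of p-vectors."""
    total_w = sum(w)
    p = len(h[0])
    return [sum(wi * hi[k] for wi, hi in zip(w, h)) / total_w for k in range(p)]


def sigma_h(w, h):
    """Plug-in estimate of (4.3) via residuals z_i = w_i (h_i - h~) / mean(w).

    The residuals sum to zero by construction, so no further centering is
    needed before forming the sample covariance.
    """
    n, p = len(w), len(h[0])
    h_tilde = hajek_mean(w, h)
    w_bar = sum(w) / n
    z = [[wi * (hi[k] - h_tilde[k]) / w_bar for k in range(p)]
         for wi, hi in zip(w, h)]
    return [[sum(zi[a] * zi[b] for zi in z) / (n - 1) for b in range(p)]
            for a in range(p)]


# Hypothetical sample: weights and per-ego (edges, monogamy) contributions.
weights = [1.0, 2.0, 1.0, 0.5]
contribs = [[0.5, 1.0], [0.0, 0.0], [1.0, 0.0], [0.5, 1.0]]
h_tilde = hajek_mean(weights, contribs)
cov_h = sigma_h(weights, contribs)
```

With equal weights the residuals reduce to deviations from the sample mean, so the estimate coincides with the ordinary sample covariance of the contributions.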

Section 4 leads to the following procedure:

1. Estimate the sufficient statistic of the ERGM with $\stackrel{\sim}{\mathit{g}}({\mathit{e}}_{S})$.
2. Obtain $\stackrel{\sim}{\mathit{\theta}}$, using MCMLE to solve $\stackrel{\sim}{\text{sc}}(\stackrel{\sim}{\mathit{\theta}})=\mathbf{0}$.
3. As a byproduct of Step 2, obtain ${\stackrel{\sim}{\mathrm{var}}}_{\mathit{g}}\{\mathit{g}(\mathit{Y});\stackrel{\sim}{\mathit{\theta}}\}$.
4. Estimate ${\mathrm{var}}_{S}(\stackrel{\sim}{\mathit{\theta}})$ as described in Section 4.2.
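The four steps can be traced end to end in the one case where no MCMC is needed: an ERGM whose only statistic is the edge count, for which ties are independent with a common probability. The Python sketch below is an illustration under that simplifying assumption and substitutes closed-form expressions for the MCMLE of Step 2; the function name, the toy degrees, and the population size are hypothetical.

```python
import math


def fit_edges_only(degrees, weights, pop_size):
    """Pseudo-MLE and standard error for an edges-only ERGM (closed form).

    degrees  : reported number of ties for each sampled ego
    weights  : sampling weights of the egos
    pop_size : |N|, assumed known for this illustration
    """
    n = len(degrees)
    h = [d / 2 for d in degrees]  # per-ego contribution to the edge count

    # Step 1: design-based estimate of the population edge count.
    h_tilde = sum(w * x for w, x in zip(weights, h)) / sum(weights)
    g_tilde = pop_size * h_tilde

    # Step 2: solve mu_g(theta) = g_tilde; here mu_g(theta) = dyads * expit(theta).
    dyads = pop_size * (pop_size - 1) / 2
    p = g_tilde / dyads
    theta = math.log(p / (1 - p))

    # Step 3: model variance of the edge count at theta (binomial variance).
    var_g = dyads * p * (1 - p)

    # Step 4: sandwich variance as in (4.5), with scalar Sigma_H as in (4.3).
    w_bar = sum(weights) / n
    sigma_h = sum((w * (x - h_tilde) / w_bar) ** 2
                  for w, x in zip(weights, h)) / (n - 1)
    var_theta = (pop_size ** 2 * sigma_h / n) / var_g ** 2
    return theta, math.sqrt(var_theta)


theta_hat, se_hat = fit_edges_only(
    degrees=[1, 0, 2, 1, 1, 0], weights=[1.0] * 6, pop_size=1000
)
```

For models with dyadic dependence, Steps 2 and 3 have no closed form and require simulation, which is where the MCMLE enters.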

For the simulation study and the analysis that follow, we use, mainly, the R (R Core Team, 2013) package `ergm` (Hunter et al., 2008b; Handcock et al., 2014) for fitting and simulating from ERGMs. The extensions to fit ERGMs to egocentrically sampled data have been implemented in a new R package, `ergm.ego`, under development for public release. We also use the R package `sna` (Butts, 2008) to calculate network connected component sizes.

Some additional implementation challenges arise as well.

Formally, our procedure depends on ** x** being observed completely (i.e., a census), or, at least, its distribution being known to a very high degree of accuracy. Step 2’s MCMLE, in particular, requires sampling over the space of possible population networks, conditional on all actor attributes, and its implementation requires constructing a network having actor attributes

Therefore, in the analyses performed here, we use the design-based estimator of the finite-population distribution of *x** _{N}*: we replicate each

The procedure also calls for fitting an ERGM to a network of size |*N*|. While possible in principle, this is often a computationally infeasible task. For example, the “population” of the NHSLS study we consider below is all individuals aged 18 through 59 and living in the US at the time of the study (1992), a population in the hundreds of millions, and naively fitting an ERGM to a smaller subnetwork is unlikely to produce meaningful estimates due to non-projectivity of ERGMs with dyadic dependence (Shalizi and Rinaldo, 2013). We work around this using the network-size-invariant parametrization of Krivitsky et al. (2011): by adding an offset term, some ERGMs can be adjusted so that fitting them to networks having similar structure and composition but different sizes produces the same parameter estimates. We thus construct a “scaled-down” pseudopopulation of interest, *N′*, and fit the adjusted model to it, approximating the $\stackrel{\sim}{\mathit{\theta}}$ that would have been obtained by fitting to the full *N*. This adjustment is not applicable to every network feature of every possible ERGM, but its applicability to a given model can be verified by simulation. Further details are given in the supplemental article (Krivitsky and Morris, 2016, App. A).
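The effect of a size offset can be seen in the simplest setting. The Python sketch below assumes that the offset takes the form of subtracting the log of the network size from the edge coefficient, which is our reading of Krivitsky et al. (2011) and is not spelled out in the text above; it then shows, for an edges-only model with independent ties, that the implied mean degree is nearly constant across network sizes with the offset and grows with size without it. The function name and parameter values are hypothetical.

```python
from math import exp, log


def mean_degree(theta_edges, n, offset=True):
    """Expected degree in an edges-only model on n actors.

    With offset=True the tie log-odds are theta_edges - log(n); this form of
    the offset is an assumption made for illustration.
    """
    eta = theta_edges - (log(n) if offset else 0.0)
    tie_prob = 1.0 / (1.0 + exp(-eta))
    return (n - 1) * tie_prob


theta = -0.5  # hypothetical size-adjusted edge coefficient
sizes = [10**3, 10**4, 10**6]
with_offset = [mean_degree(theta, n) for n in sizes]
without_offset = [mean_degree(theta, n, offset=False) for n in sizes]
```

In this toy case the same coefficient describes a network of a thousand actors and one of a million, which is the property that makes fitting to a scaled-down pseudopopulation meaningful.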

Thus, using the network-size-invariant parametrization, and estimating *x** _{N}* from

To evaluate the properties of our estimators we performed a simulation study, constructing a large population network with known ERGM parameters and simulating egocentric samples from it, using two sampling designs: unweighted and weighted. The sampling weights have a range similar to the NHSLS, oversample some groups of actors (A and C), and are correlated with a continuous covariate used in the model (*x*_{i,}_{2}). For each sample, we calculated point estimates and standard errors in order to assess their accuracy and the coverage of Wald confidence intervals. Details and full results are given in the supplemental article (Krivitsky and Morris, 2016, App. B).

Selected bias and coverage results for sample size |*S*| = 1,000 are shown in Figure 1. The unweighted sampling estimates display some bias, though it does not appear to have a systematic pattern as a function of |*N′*| or model term. None of the estimated biases are greater than 10% of the standard deviation of $\stackrel{\sim}{\mathit{\theta}}$ under repeated sampling; that is, bias accounts for less than 1% of the mean squared error (MSE) of the estimator.

Simulated bias in the point estimates, relative to simulated standard deviation (top); and 95% confidence interval coverage, relative to the miss probability of 5% (bottom) for |*S*| = 1,000. Dashed lines are level 0.05 critical values: 95% of the results **...**

The weighted sampling estimators are, as one would expect, highly biased for smaller |*N′*|. For the largest |*N′*|, the bias tends to approach that of the unweighted estimators, and the bias of the most biased parameter (Difference in *x*_{·,2}) is less than 20% of its standard deviation (≈ 4% of MSE). A possible reason why it is the most biased is that egos with small *x*_{i,2} are by design severely undersampled, which means that there will exist many samples where the full range of *x*_{·,2} is not represented. This is likely to be less problematic in real-world applications like the analysis in Section 7, where continuous covariates (like age) have an explicit range of interest.
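The conversions between relative bias and share of MSE quoted in this section follow from the decomposition of MSE into squared bias plus variance: a bias of *r* standard deviations accounts for *r*²/(1 + *r*²) of the MSE. A short Python check, with a hypothetical function name, is:

```python
def bias_share_of_mse(relative_bias):
    """Share of MSE due to bias when bias = relative_bias * standard deviation."""
    return relative_bias ** 2 / (1 + relative_bias ** 2)


share_10 = bias_share_of_mse(0.10)  # bias at 10% of the standard deviation
share_20 = bias_share_of_mse(0.20)  # bias at 20% of the standard deviation
```

These evaluate to just under 1% and just under 4%, respectively, in line with the figures reported.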

Overall, we found the standard errors for both weightings to be conservative, overestimating the simulated standard deviation by between 1% and 20% (in a few cases). The resulting confidence interval coverage is consistent with these observations: for almost all terms, the intervals are somewhat conservative for both sampling designs (given sufficient |*N′*|).

We replicated the study for |*S*| = 2,000, and found that the biases decrease (both absolutely, and relative to their standard deviation, which is, itself, smaller), and the standard errors became more accurate as well. The coverage remains somewhat conservative. (Krivitsky and Morris, 2016, App. B.2)

This study demonstrates parameter recovery when the egocentric ERGM on observed actor attributes is the “true” model (though dropping the assumption that the whole **x**_{N} is observed). We discuss the issue of model misspecification in Section 8.2.

We now return to our motivating questions: 1) How strong is the race homophily in the population? 2) Are there differences in the propensity towards monogamy and concurrency for the races and the sexes? And, 3) What impact do these network features have on overall network connectivity and differentials in network exposure by race and sex?

The National Health and Social Life Survey (NHSLS) of 1992 (Laumann et al., 1992, 1994) was undertaken at the start of the AIDS epidemic in the US. The objectives of the study included obtaining the data on sexual behavior necessary to predict the long term trajectory of HIV and AIDS prevalence, and to understand the disparities in HIV prevalence by race that had already begun to emerge. The survey collected, among other information, a representative egocentric sample of sexual partnerships of a stratified sample of residents of the US aged 18–59 (inclusive). A rich set of socio-demographic attributes was collected, including the respondent’s age, sex, race/ethnicity, and respondents were asked for similar information about all of their sexual partners from the previous twelve months.

For this analysis, we focus on modeling the cross-sectional network of ongoing sexual partnerships of persons aged 18–59. Some reported partners were outside this age range, and a small number of respondents had turned 60 by the time they were interviewed. We exclude these partnerships and persons, as well as those with missing data on the necessary attributes (race, sex, and age, for both ego and alter). More appropriate handling of missing actor data in egocentrically sampled networks is a subject for future research. These exclusions lead to dropping 75 egos and 215 alters, leaving 3,357 egos and 2,555 alters in the sample for analysis. The ego degree distribution for the analytic sample was not significantly different from the full sample.

The NHSLS study used a stratified multistage cluster sample, with oversampling of Black households. The public dataset includes weights that account for both stratification and attribute-based non-response, so we approximate the design by an independent weighted sample.

We divide the respondents into three racial/ethnic categories: White, Black, and Other. While the primary contrast of interest here is between Whites and Blacks, a non-negligible fraction of egos reported other identifications for themselves and their alters. These cannot be dropped in a network analysis, as they can serve as connecting elements that influence the measures of interest.

Homophily is operationalized as an edge covariate, and is defined as concordance in actor attributes in a partnership, as reported by ego. We focus on homophily by sex and race in this analysis, allowing for differential homophily by race. Concurrency is operationalized at the actor level, and is defined as actor degree greater than 1. For modeling purposes, we fit a monogamy term to capture these effects, defined as actor degree equal to 1, again allowing for group-specific propensities for monogamy. This produces more stable estimates, especially for smaller groups with very low rates of reported concurrency.

We fit a sequence of nested models to test the network hypothesis for the racial disparities in HIV prevalence. *Model 1* serves as a baseline, fitting the observed mean degree for each sex by race, as well as the prevalence of heterosexual mixing. It has terms for the main effects for each sex and each race, and a homophily term for sex. Since this is a largely heterosexual population, we expect the sex homophily term will be strongly negative, but same-sex partnerships are not precluded. This model assumes partners are selected at random with respect to race, and there is no propensity for monogamy in sexual partnerships. *Model 2* tests homophily by race by adding a term for each race to capture the prevalence of within-group mixing. We expect these terms to be large and positive. *Model 3* tests heterogeneities in the propensity for monogamy by adding a term for each sex by race to capture the prevalence of persons with exactly one partner. We expect these terms to be positive, given the strong norm of monogamy in sexual partnerships, but we also expect there to be significant differences by race and sex. Since the group-specific mean degrees have been fit by the baseline terms, lower coefficient values on the monogamy terms will imply higher prevalence of concurrent partners.

We evaluate the goodness of fit of each model using the following criteria:

We compare the observed degree distribution to 100 realizations of complete networks from each specified model. In principle, we can evaluate the goodness of fit to any egocentric statistics that are not already in the model, following the general approach of Hunter, Goodreau, and Handcock (2008a). We choose the degree distribution because it is a primary determinant of network connectivity. A model that does not fit the degree distribution well is unlikely to produce the unobserved network connectivity that we wish to infer.

We simulate complete networks from the fitted models, which allows us to evaluate whether the micro-level processes specified by each model produce non-egocentric macro-level outcomes (network connectivity and exposure) that are consistent with observed epidemiological data. We measure overall network connectivity using the connected component size distribution. The propensity for monogamy in *Model 3* is expected to increase the number of components of size 2 (mutual monogamy) and decrease the number and size of the larger components. We measure network exposure at the actor level, using the probability of membership in components of size 3 or greater. This represents the risk of indirect exposure: an actor with only one partner will have little direct exposure, but by virtue of the network she or he may still be exposed to her or his partner’s other partner(s) and beyond. Under the network hypothesis, only *Model 3* is expected to produce differentials in network risk exposure that are consistent with the observed disparities in HIV incidence.
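The two connectivity measures described above can be computed on a simulated network as in the following Python sketch. It assumes the simulated network is available as a list of actors and an undirected edge list; the function names and the seven-actor example are hypothetical and serve only to illustrate the component size distribution and the probability of belonging to a component of size 3 or greater.

```python
from collections import Counter

def component_sizes(actors, edges):
    """Map each actor to the size of its connected component (union-find)."""
    parent = {a: a for a in actors}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    size_of_root = Counter(find(a) for a in actors)
    return {a: size_of_root[find(a)] for a in actors}

def exposure_probability(actors, edges, group, threshold=3):
    """Fraction of each group belonging to components of size >= threshold."""
    sizes = component_sizes(actors, edges)
    members, exposed = Counter(), Counter()
    for a in actors:
        members[group[a]] += 1
        if sizes[a] >= threshold:
            exposed[group[a]] += 1
    return {g: exposed[g] / members[g] for g in members}

# Hypothetical toy network: one monogamous pair, one 3-component, two isolates.
toy_actors = [1, 2, 3, 4, 5, 6, 7]
toy_edges = [(1, 2), (3, 4), (4, 5)]
toy_sex = {1: "M", 2: "F", 3: "F", 4: "M", 5: "F", 6: "M", 7: "F"}

toy_sizes = component_sizes(toy_actors, toy_edges)
toy_exposure = exposure_probability(toy_actors, toy_edges, toy_sex)
```

Here the component of size 3 consists of one man with two partners and two women with one partner each, so both women are counted as exposed even though each reports a single partner.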

The models we consider are, of course, not intended to be complete specifications of the network process, as they exclude many factors that are known to influence sexual behavior and partner selection. For our purposes, the question is whether our results are robust to the inclusion of these factors, which would depend on whether the factors are correlated with race and sex, or interact with them in relevant ways. A good example to consider is age, as it influences both degree and partner selection patterns, and we examine its effects by adding to *Model 3* terms to capture the impact of age on both degree distribution and partner selection. If the addition of these terms renders the monogamy-bias or homophily terms nonsignificant, it would suggest that there is an age mechanism that underlies these biases, and the marginal impacts in *Model 3* are due to differences in age composition by race.

The population network would have had |*N*| ≈ 147 million persons (Population Estimates Program, 2001), so we take advantage of the scaled-down approach mentioned in Section 5.2. Based on reasoning detailed in the supplemental article (Krivitsky and Morris, 2016, App. C.1), we select, conservatively, |*N′*| ≈ 45,000 ≈ 13.4 × |*S*|, or 44,859 after rounding the scaled sampling weights. This network size is also used in the simulation results we report. Notably, this may be overly conservative in practice: we obtained very similar results in a pilot analysis using |*N′*| ≈ 15,000.
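To illustrate why rounding scaled sampling weights can yield a network size slightly different from the nominal target, here is a minimal Python sketch. It assumes that each respondent is replicated a whole number of times proportional to his or her weight; the function name and the toy weights are hypothetical, and the actual construction is described in Section 5 and the supplemental article.

```python
def scale_and_round_weights(weights, target_size):
    """Scale weights to sum to target_size, then round each to a whole count."""
    total = sum(weights)
    scaled = [w * target_size / total for w in weights]
    return [round(s) for s in scaled]

# Hypothetical weights for four respondents and a nominal target of 13 actors.
toy_weights = [1.0, 1.0, 1.0, 2.0]
toy_counts = scale_and_round_weights(toy_weights, target_size=13)
realized_size = sum(toy_counts)
```

In this toy case the scaled weights are 2.6, 2.6, 2.6, and 5.2, so the rounded counts sum to 14 rather than 13, which is the same kind of discrepancy as that between 45,000 and 44,859.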

Verification of the assumption that our models are amenable to network-size-invariant parametrization is given in the supplemental article (Krivitsky and Morris, 2016, App. C.4).

We report the model fits in Table 2. *Model 1* results are consistent with expectations. There is a significant and strong propensity for heterosexual ties, and a slightly higher mean degree for men than women. There are no significant differences in mean degree by race. In *Model 2*, the results are consistent with the network hypothesis: all of the race homophily terms are large and significant. In *Model 3*, the results are again consistent with the hypothesis: there is a strong propensity for monogamy in all groups, but the propensity is relatively higher among White men and women. The difference between Whites and Blacks is significant for men (contrast diff. = 1.17, s.e. = 0.32, *P*-value *<* 0.001), but not for women (diff. = 0.45, s.e. = 0.51, *P*-value *>* 0.3). Women have higher rates of monogamy than men in all groups, especially among Blacks, but these differences are not statistically significant.
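The reported contrasts can be checked with a short calculation. The sketch below assumes a two-sided Wald test with a standard normal reference distribution, which is an assumption on our part rather than a statement of how the reported *P*-values were obtained; the function name is hypothetical.

```python
import math

def wald_p_value(difference, standard_error):
    """Two-sided P-value for a contrast under a normal approximation."""
    z = difference / standard_error
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail probability.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Contrasts between Whites and Blacks in the monogamy terms, as reported above.
p_men = wald_p_value(1.17, 0.32)
p_women = wald_p_value(0.45, 0.51)
```

Under this approximation, the contrast for men falls below 0.001 and the contrast for women exceeds 0.3, in agreement with the reported results.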

Coefficients and standard errors for the three models. Coefficients reported are in the presence of an edge count offset of −log(44859) = −10.71.
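The offset value can be verified directly, and a toy calculation shows why such an offset keeps the implied mean degree roughly stable as the network size changes. The sketch below assumes an edges-only (Bernoulli) model with a hypothetical coefficient of −0.1; it is an illustration of the size adjustment, not a reproduction of the fitted models.

```python
import math

def edge_offset(network_size):
    """Edge-count offset of -log(network size)."""
    return -math.log(network_size)

def mean_degree(theta, network_size):
    """Expected degree in an edges-only model with the size offset applied."""
    eta = theta + edge_offset(network_size)
    tie_probability = 1.0 / (1.0 + math.exp(-eta))
    return (network_size - 1) * tie_probability

offset_used = edge_offset(44859)

# Hypothetical coefficient; the two sizes are those mentioned in the text.
degree_small = mean_degree(-0.1, 15000)
degree_large = mean_degree(-0.1, 44859)
```

With the offset in place, the tie probability shrinks in proportion to the network size, so the expected degree is nearly the same at both sizes.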

The goodness of fit for each model is shown in Figure 2a. The first two models do a very poor job of fitting the observed degree distribution: both underestimate the fraction of persons with only one partner and overestimate both the fraction with no partner and the fraction with more than one partner. The data clearly indicate a strong propensity for monogamy, and *Model 3* captures this well. Because there are race- and sex-specific monogamy terms in *Model 3*, it provides a good fit for all groups. (A more detailed breakdown and discussion are given in the supplemental article (Krivitsky and Morris, 2016, App. C.2).)

Simulation results based on 100 realizations from each of the fitted models: (a) goodness-of-fit plot, comparing simulated degree frequencies (dot plot) to that observed in the data; (b) simulated network component size distributions, averaged over each **...**

The overall network connectivity predicted by each model can be seen in Figure 2b. The plot shows the distribution of component sizes produced by each model. The first two models are, again, similar: both predict that about three quarters of the components are size 1 (isolated actors), and the ties distributed to the remaining actors produce components with sizes that can reach 100 or more. By contrast, *Model 3* predicts that the modal component is size 2 (mutual monogamy), that only about 20% of the actors are isolates, and the maximum component size attained in the 100 simulated realizations has fallen to only 6. Monogamy thus has the expected effect: it dramatically reduces the connectivity in the overall network.

The differential network risk exposure by race and sex predicted by each model can be seen in Figure 2c. This plot shows the group-specific distributions of the probability of belonging to a component of size 3 or more for each model. *Models 1* and *2* produce very similar results: overall network exposure probabilities are about 40%, and the race-specific differences are small, with lower probabilities of exposure predicted for Blacks than Whites. Since lower probabilities of exposure imply lower transmission risks, this pattern is the opposite of what we expect given the HIV/STI prevalence disparities. Adding the differential monogamy terms in *Model 3* reverses the predicted network exposure risk differentials for both sexes, producing a pattern that is consistent with the observed racial disparities in HIV/STI prevalence, and consistent with the network hypothesis.

Note that within each racial group, women are more likely than men to be in components of size 3 or more. This is because 3 is by far the most common component size predicted among those not mutually monogamous (≈ 80%, as seen in Figure 2b), and, coupled with the higher rates of concurrency among men than women, this means that these components typically comprise 1 man and 2 women. This is a good example of the somewhat counterintuitive logic of network exposure in infectious diseases: your exposure is not just a function of your own behavior, but also a function of your partner’s. In countries with generalized heterosexual HIV epidemics, such as those in sub-Saharan Africa, concurrency is similarly gendered, and women’s HIV prevalence is typically much higher than men’s (40% higher across this particular region (UNAIDS, 2014)).

*Model 3* is clearly the best fit, and it predicts network exposure risks that are consistent with the observed disparities in HIV prevalence by race and sex. Per Section 7.2, we augmented it with several age-related terms. The results, given in the supplemental article (Krivitsky and Morris, 2016, App. C.3), are that age effects are generally significant, but that the key homophily and monogamy-bias results reported for *Model 3* are robust.

There are many challenges in developing rigorous inference for ERGM estimates from egocentrically sampled network data: the stochastic dependence in networks can be complex, the information present in an egocentric sample is limited, and the use of the data is often secondary. We have proposed a technique to conduct statistically valid ERGM inference under these conditions, using pseudo-maximum-likelihood estimation and exploiting exponential-family properties. This makes it possible to estimate the parameters of a class of statistical models defined by the structural properties of a network that can be observed from an egocentric sample, test hypotheses of interest, assess goodness of fit, and conduct principled simulation from a superpopulation of networks having properties similar to those observed. By making use of a network-size-invariant parametrization for the ERGMs of interest, this can be done even when the target population is very large or even unknown. The result is a general statistical framework for leveraging whole network information from an efficient minimal sampling design.

As with all inferential frameworks, ours rests on a set of approximations and assumptions, and comes with limitations, some of which may be addressed in future research:

Per Section 5.1, we had constructed ***x***_{*N′*} by extrapolating from the sample, but we did not take this into account in our inference. Our simulation study suggests that the inference is still valid (conservative, in fact), but this issue can be addressed more rigorously. Recall that we only require the joint distribution of

Alternatively, uncertainty from ***x*** being sampled may be incorporated into the inferential procedure: in particular, Fellows and Handcock (2012) propose an exponential-family model for jointly modeling actor attributes and ties. Provided the sufficient statistic associated with actor attributes could be recovered from egocentric data, our results should be applicable.

We approximated the sampling design of the NHSLS study with that of Section 2.2: a simple probability sample. A more accurate estimate of var* _{S}*(

In our application, we assumed that the responses were accurate: that, for example, the male respondents did not overreport their partnerships and female respondents did not underreport. While the discrepancy in the number of lifetime partners reported by men and women is well-established in the literature (Morris, 1993b), the magnitude of the discrepancy declines with the length of the recall period (Hamilton and Morris, 2010). In the NHSLS data used here, there is almost perfect correspondence between the weighted total number of ongoing heterosexual partnerships reported by women and men (1366 and 1388, respectively), which suggests some internal validity to the reports.

The potential for model misspecification is particularly serious for egocentric inference. This is not because of the inferential approach as such: our asymptotic results guarantee that for an egocentric ERGM, the PMLE (say, ${\stackrel{\sim}{\mathit{\theta}}}_{\mathrm{E}}$) consistently estimates the parameters (*θ*_{E}) that would have been obtained had the same model been fit to the full population network, with variability of ${\stackrel{\sim}{\mathit{\theta}}}_{\mathrm{E}}$ under repeated sampling accurately quantified. Rather, this is because of how the data limit the scope of the models that can be estimated: even where the “true” model specification is known and is an ERGM (say, with coefficients *θ*_{T}), it might not be estimable (identifiable) with the available data. Therefore, the question about the impact of misspecification in egocentric inference reduces to the question of whether substantive conclusions drawn based on the coefficients in *θ*_{E} are different from those that would be drawn from the corresponding coefficients of *θ*_{T}, were it estimable.

While assessing general misspecification through traditional goodness-of-fit approaches is uniquely challenging in the context of egocentric data constraints, it is still possible to assess some aspects of model specification this way, as we have done in Section 7.2. Others can be tested by validating model predictions against specific hypotheses, for example, demonstrating that the disparities in network exposure predicted by the model are consistent with the observed disparities in HIV prevalence only when both of the hypothesized network features (homophily and monogamy bias) are included. But this still leaves open the question of whether unobserved higher-order (such as triadic or degree-assortative) effects significantly influence the overall network structure, and potentially alter the impacts of the observed network features on the outcomes of interest.

In general, the potential impact of specific unobserved features could be examined through simulation. One could construct or simulate networks whose egocentric statistics match those inferred from the data but which also possess the specific hypothesized higher-order properties, such as more transitivity than predicted by simulating from ${\stackrel{\sim}{\mathit{\theta}}}_{\mathrm{E}}$ obtained from the egocentric submodel alone. Estimating the “true” MLE ${\hat{\mathit{\theta}}}_{\mathrm{T}}$ and the egocentric submodel’s MLE ${\hat{\mathit{\theta}}}_{\mathrm{E}}$ (≈ ${\stackrel{\sim}{\mathit{\theta}}}_{\mathrm{E}}$, by construction) based on these networks and comparing their common elements should provide information about what biases may have been introduced by omitting the higher-order terms.

In our application, we did not use simulation to address this question, because the specific alternative hypothesized network effects in this case can be assessed more directly. For example, one might wish to consider how unobserved triadic effects or degree-based mixing might influence the conclusions we have drawn. Both effects would clearly influence the overall network structure in ways that might alter the spread of an infectious disease: positive/negative triadic effects might cluster/disseminate prevalence, and assortative/disassortative degree-based mixing would have similar effects. But both of these are also examples of network configurations that can be produced by other mechanisms that, in our case, are observable in egocentric data. In particular, a triangle would have to contain at least one tie between nodes of the same sex. Given the very low prevalence of same-sex sexual ties observed in our population-level data, the potential for triangles is very limited, and our models capture this through negative sex homophily even if the data preclude us from identifying specific triadic biases (positive or negative) beyond that. Similarly, one of the most striking features of the HIV prevalence disparities in this application is the lack of correlation with observed activity levels (degree). This would be predicted by degree-disassortative mixing, which we cannot observe in egocentric data. However, we do observe and model a mechanism that would produce degree-disassortative configurations: the systematically higher monogamy bias among women, combined with disassortative mixing on sex. In both cases, therefore, we are producing structural configurations that we cannot directly observe by representing mechanisms that are well grounded in the science, observable, and statistically testable.

In other application settings, where explicitly triadic mechanisms may be at work, studies of friendship networks in which such triadic mechanisms can be empirically distinguished from homophily (Goodreau et al., 2009, in particular) suggest that misspecification bias is a bigger problem for estimates of triadic effects than for estimates of homophily. That is, unaccounted-for homophily can lead to upward bias in the triadic closure coefficients, but the converse does not hold: unaccounted-for triadic closure does not appear to affect homophily coefficients substantively, and certainly not to the point of, e.g., turning assortative into disassortative mixing or *vice versa*. Whether this is a general property of the model, or an empirical feature of social networks, is a topic for future research.

One may also be interested in considering potential confounders for the effect of race on the observed racial homophily: for example, homophily on socioeconomic status and residential segregation may both play a role, and (in the United States) both are associated with race and ethnicity. However, this would not alter the main finding that racial homophily contributes to HIV prevalence disparities by race: from the pathogen’s point of view, it does not matter *why* a particular group structure has emerged, only that it has.

Ultimately, the underlying generative process for the network is outside of the scope of models that we consider for other reasons as well (Airoldi, Blei, Fienberg, Goldenberg, Xing, and Zheng, 2008, Panel Discussion). The model we consider here is a model only for the static relations, but the underlying process is dynamic. Representing these dynamics would require models for tie formation and dissolution (Krivitsky and Handcock, 2014), and may also require models for random nodal attributes, nodal entry and exit (Airoldi et al., 2008, Blei), and (depending on research goals) the underlying utility functions that represent nodal preferences (Airoldi et al., 2008, Shalizi). This means that our methodology is subject to the usual limitations in terms of causal inference.

We can consider a number of directions for expanding the scope of these models, subject to data availability.

Our development was aimed at undirected relations on unipartite networks, but the general inferential technique should be applicable to directed relations, provided each ego’s in-ties, as well as out-ties, are observed, and to bipartite networks.

We note in Section 3.1 that whether an ERGM term is egocentric (that is, whether it can be identified given available data) is a function of the egocentric survey design: what information does ***e***_{*i*} contain? Though less common, information about alters’ connections is sometimes available in an egocentric design through solicitation of alter–alter ties. Smith (2012) and Gjoka, Smith, and Butts (2014a) consider this case for estimating clique size distributions and for reconstructing plausible networks, though not for ERGM inference. Other variations and extensions include studies that collect egocentric-like data but allow some individuals appearing twice in the data to be matched, e.g., couple studies (where the two individuals with a link are recruited together and each is asked about their alters), or a one-wave snowball sample. In epidemiology, such data may be collected in more focused studies of networks of men who have sex with men (MSM) and injecting drug users (IDUs), where triadic effects may be highly relevant (Dhanjal et al., 2011; Trotter et al., 1995).

In such studies, ***e***_{*i*} would contain some additional information, and the set of statistics expressible in the form of (3.1) would expand accordingly. For example, the

$${g}_{k}(\mathit{y},\mathit{x})=\sum _{(i,j)\in \mathit{y}}\underset{k\in N\backslash \{i,j\}}{max}{\mathit{y}}_{i,k}{\mathit{y}}_{k,j}$$

can be expressed in the form of (3.1) as

$${h}_{k}({\mathit{e}}_{i})=\frac{1}{2}\sum _{j\in {\mathit{y}}_{i}}\underset{k\in {\mathit{y}}_{i}\backslash \{j\}}{max}{\mathit{y}}_{j,k},$$

which depends only on the set of *i*’s immediate neighbors (i.e., ***y***_{*i*}) and which of those neighbors,

Minimal egocentric surveys often do include temporal information on the ties (such as duration of ongoing and recent partnerships), and inference for dynamic network models based on such data is the focus of ongoing work (Krivitsky, 2012).

Expressibility in the form (3.1) is sufficient but not strictly necessary for the estimation: for example, the global clustering coefficient (3 *×* (# of triangles)/(# of 2-stars)) cannot be so expressed, but both its dividend and its divisor can (given alter–alter data), so the Delta Method or a resampling method can be used to obtain an estimator and its variance. At the same time, such model terms would tend not to be local (Krivitsky et al., 2011): a given dyad’s state would be conditionally dependent on the whole network, rather than on some reasonable social neighborhood of the actors involved, causing difficulties in interpretation and network-size adjustment.
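As an illustration of the ratio argument, the following Python sketch computes the global clustering coefficient on a small, fully observed graph and applies the standard first-order Delta Method approximation to the variance of a ratio. The function names, the toy graph, and the variance and covariance inputs are hypothetical; in practice the latter would come from the survey-based estimators of the dividend and the divisor.

```python
from itertools import combinations

def triangles_and_two_stars(adjacency):
    """Count triangles and 2-stars from a dict mapping each node to its neighbor set."""
    two_stars = sum(len(nb) * (len(nb) - 1) // 2 for nb in adjacency.values())
    # A closed 2-star is a pair of neighbors that are themselves tied;
    # each triangle is counted once from each of its three nodes.
    closed = sum(
        1
        for nb in adjacency.values()
        for j, k in combinations(sorted(nb), 2)
        if k in adjacency[j]
    )
    return closed // 3, two_stars

def ratio_delta_method(num, den, var_num, var_den, cov):
    """First-order Delta Method estimate and variance for the ratio num / den."""
    ratio = num / den
    variance = (var_num - 2.0 * ratio * cov + ratio**2 * var_den) / den**2
    return ratio, variance

# Hypothetical toy graph: a triangle a-b-c with a pendant node d tied to c.
toy_adjacency = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
n_triangles, n_two_stars = triangles_and_two_stars(toy_adjacency)

# Hypothetical variances and covariance of the dividend and the divisor.
clustering, clustering_var = ratio_delta_method(
    3.0 * n_triangles, float(n_two_stars), var_num=0.9, var_den=1.0, cov=0.5
)
```

The toy graph has 1 triangle and 5 2-stars, giving a clustering coefficient of 0.6; the variance output depends entirely on the assumed inputs.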

Our implementation utilizes the network-size-invariant parametrization of Krivitsky et al. (2011) primarily to facilitate computation, that is, to allow us to fit the model to a “microcosm” *N′* of the population *N*.

This parametrization also facilitates interpretation by giving us parameter estimates invariant to our choice of |*N′*|, which was the original motivation for its development. It is highly likely that this scaling could be extended to bipartite networks, and Krivitsky and Kolaczyk (2015) propose a size adjustment for ERGM mutuality terms that could be used for egocentric data on directed networks, along with an approach that holds promise for triadic effects. The discussion in the supplemental article (Krivitsky and Morris, 2016, App. A.1) may provide some guidance for developing and testing such adjustments in the future.

In general, an ERGM specification is not guaranteed to have a meaningful size-invariant parametrization. But size invariance is not strictly necessary for our framework: given sufficient computing power, one may dispense with *N′* and estimate parameters on *N* directly, leaving unaffected both the inferential results and the ability to simulate from the fit networks consistent with the observed data.

Our simulation study in Section 6 and the supplemental article (Krivitsky and Morris, 2016, App. B) shows our estimators to be slightly biased. There are four likely sources of bias: (i) sampling variation of ***x***; (ii) biasedness of the Hájek estimator (4.2) for

We attempted to reduce (ii) using jackknife, with little noticeable improvement in the simulation studies, suggesting that it is not a major source of bias in this case. Judging by the small difference between the biases from the higher values of |*N′*| considered (found in the supplemental article (Krivitsky and Morris, 2016, App. B.2)), (iv) likely becomes negligible reasonably quickly: the estimates converge to a nonzero bias. We believe this remaining bias to be primarily due to (i) and (iii).

Unfortunately, while a technique like nonparametric bootstrap jointly resampling *w*_{*i*} and

Lastly, in our framework, the population network ***y*** is fixed and unknown and

$${var}_{S\circ \mathit{g}}\phantom{\rule{0.16667em}{0ex}}[\stackrel{\sim}{\mathit{g}}\{{\mathit{e}}_{S}(\mathit{Y})\};\mathit{\theta}]={\mathrm{E}}_{\mathit{g}}\phantom{\rule{0.16667em}{0ex}}({var}_{S}[\stackrel{\sim}{\mathit{g}}\{{\mathit{e}}_{S}(\mathit{Y})\}\mid \mathit{Y}];\mathit{\theta})+{var}_{\mathit{g}}\phantom{\rule{0.16667em}{0ex}}({\mathrm{E}}_{S}[\stackrel{\sim}{\mathit{g}}\{{\mathit{e}}_{S}(\mathit{Y})\}\mid \mathit{Y}];\mathit{\theta})\phantom{\rule{0ex}{0ex}}\approx {\mathrm{E}}_{g}\phantom{\rule{0.16667em}{0ex}}({var}_{S}[\stackrel{\sim}{\mathit{g}}\{{\mathit{e}}_{S}(\mathit{Y})\}\mid \mathit{Y}];\mathit{\theta})+{var}_{\mathit{g}}\phantom{\rule{0.16667em}{0ex}}\{\mathit{g}(\mathit{Y});\mathit{\theta}\},$$

which, for an ERGM superpopulation and the sampling process (*S*) being independent of the network (***Y***), something likely to hold in social surveys, reduces to

$${var}_{S\circ \mathit{g}}(\stackrel{\sim}{\mathit{\theta}})\approx {\stackrel{\sim}{\mathit{V}}}^{-1}({\mid N\mid}^{2}{\mathit{\sum}^{\sim}}_{\mathrm{H}}/\mid S\mid +\stackrel{\sim}{\mathit{V}}){\stackrel{\sim}{\mathit{V}}}^{-1}\phantom{\rule{0ex}{0ex}}\approx {\stackrel{\sim}{\mathit{V}}}^{-1}({\mid N\mid}^{2}{\mathit{\sum}^{\sim}}_{\mathrm{H}}/\mid S\mid ){\stackrel{\sim}{\mathit{V}}}^{-1}+{\stackrel{\sim}{\mathit{V}}}^{-1},$$

for
$\stackrel{\sim}{\mathit{V}}={\stackrel{\sim}{var}}_{\mathit{g}}\{\mathit{g}(\mathit{Y});\stackrel{\sim}{\mathit{\theta}}\}$, and if E**_{g}** (var

^{1}This is not to be confused with the maximum pseudolikelihood estimation (MPLE) of Strauss and Ikeda (1990), the technique for approximating the MLE for an intractable likelihood for fully observed networks. We do not make direct use of it in this work.

Additional derivations and results referenced in Sections 5, 6, 7, and 8.

- Admiraal R. PhD thesis. University of Washington; Seattle, WA: 2009. Dynamic Network Models based on Revealed Preference for Observed Relations and Egocentric Data.
- Airoldi E, Blei D, Fienberg S, Goldenberg A, Xing E, Zheng A. Statistical Network Analysis: Models, Issues, and New Directions: ICML 2006 Workshop on Statistical Network Analysis; Pittsburgh, PA, USA. June 29, 2006; Berlin Heidelberg: Springer; 2008. Revised Selected Papers.
- Binder DA. On the Variances of Asymptotically Normal Estimators from Complex Surveys. Int Stat Rev. 1983;51:279–292.
- Brown LD. Lecture Notes—Monograph Series. Vol. 9. Institute of Mathematical Statistics; Hayward, California: 1986. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory.
- Burt RS. Network Items and the General Social Survey. Soc Networks. 1984;6:293–339.
- Butts CT. Social Network Analysis with `sna`. J Stat Softw. 2008;24:1–51.
- Dhanjal C, Clémençon S, Arazoza HD, Rossi F, Tran VC. The Evolution of the Cuban HIV/AIDS Network. 2011 arXiv preprint arXiv:1109.2499.
- Fellows I, Handcock MS. Exponential-Family Random Network Models. 2012 arXiv preprint arXiv:1208.0121.
- Firth D. Bias Reduction of Maximum Likelihood Estimates. Biometrika. 1993;80:27–38.
- Frank O, Strauss D. Markov Graphs. J Am Stat Assoc. 1986;81:832–842.
- Fuller WA. Wiley Series in Survey Methodology. Vol. 560. John Wiley & Sons; 2011. Sampling Statistics.
- Geyer CJ, Thompson EA. Constrained Monte Carlo Maximum Likelihood for Dependent Data (with discussion) J R Stat Soc Ser B. 1992;54:657–699.
- Gjoka M, Smith E, Butts C. Estimating Clique Composition and Size Distributions from Sampled Network Data. Sixth IEEE International Workshop on Network Science for Communication Networks. 2014a.
- Gjoka M, Smith E, Butts CT. Design-based Estimators for Attribute-Labeled, Low-Semidiameter Subgraphs. 34th Sunbelt Network Conference; St. Pete Beach: INSNA; 2014b.
- Goodreau S, Cassels S, Kasprzyk D, Montaño D, Greek A, Morris M. Concurrent Partnerships, Acute Infection and HIV Epidemic Dynamics Among Young Adults in Zimbabwe. AIDS Behav. 2010:1–11.
- Goodreau SM, Kitts JA, Morris M. Birds of a Feather, or Friend of a Friend? Using Exponential Random Graph Models to Investigate Adolescent Social Networks. Demography. 2009;46:103–125.
- Gupta S, Anderson RM, May RM. Networks of sexual contacts: implications for the pattern of spread of HIV. AIDS. 1989;3:807–18.
- Hájek J. Comment on An Essay on the Logical Foundations of Survey Sampling by Basu, Debabrata. In: Godambe VP, Sprott DA, editors. Foundations of Statistical Inference: Proceedings of the Symposium on the Foundations of Statistical Inference; March 31 to April 9, 1970. Ont., Canada: René Descartes Foundation, Holt McDougal, Department of Statistics, University of Waterloo; 1971.
- Hallfors DD, Iritani BJ, Miller WC, Bauer DJ. Sexual and Drug Behavior Patterns and HIV and STD Racial Disparities: The Need for New Directions. Am J Public Health. 2007;97:125–132.
- Hamilton D, Morris M. Consistency of Self-Reported Sexual Behavior in Surveys. Arch Sex Behav. 2010;39:842–860.
- Handcock MS, Gile KJ. Modeling Social Networks from Sampled Data. Ann Appl Stat. 2010;4:5–25.
- Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Morris M. `ergm`: Fit, Simulate and Diagnose Exponential-Family Models for Networks. The Statnet Project. 2014 (http://www.statnet.org). R package version 3.1.2.
- Hummel RM, Hunter DR, Handcock MS. Improving Simulation-Based Algorithms for Fitting ERGMs. J Comput Graph Stat. 2012;21:920–939.
- Hunter DR, Goodreau SM, Handcock MS. Goodness of Fit for Social Network Models. J Am Stat Assoc. 2008a;103:248–258.
- Hunter DR, Handcock MS. Inference in Curved Exponential Family Models for Networks. J Comput Graph Stat. 2006;15:565–583.
- Hunter DR, Handcock MS, Butts CT, Goodreau SM, Morris M. `ergm`: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks. J Stat Softw. 2008b;24:1–29.
- Illenberger J, Flötteröd G. Estimating Network Properties from Snowball Sampled Data. Soc Networks. 2012;34:701–711.
- Koskinen JH, Robins GL, Pattison PE. Analysing Exponential Random Graph (p-star) Models with Missing Data Using Bayesian Data Augmentation. Stat Methodol. 2010;7:366–384.
- Krivitsky PN. Tech Rep 2012-01. Pennsylvania State University Department of Statistics; 2012. Modeling of Dynamic Networks based on Egocentric Data with Durational Information.
- Krivitsky PN, Handcock MS. A separable model for dynamic networks. Journal of the Royal Statistical Society, Series B. 2014;76:29–46.
- Krivitsky PN, Handcock MS, Morris M. Adjusting for Network Size and Composition Effects in Exponential-Family Random Graph Models. Stat Methodol. 2011;8:319–339.
- Krivitsky PN, Kolaczyk ED. On the Question of Effective Sample Size in Network Modeling: An Asymptotic Inquiry. Statistical Science. 2015;30:184–198.
- Krivitsky PN, Morris M. Supplement to “Inference for Social Network Models from Egocentrically-Sampled Data, with Application to Understanding Persistent Racial Disparities in HIV Prevalence in the US”. 2016.
- Laumann EO, Gagnon JH, Michael RT, Michaels S. National Health and Social Life Survey. Chicago, IL, USA: University of Chicago and National Opinion Research Center [producer]; Ann Arbor, MI, USA: Inter-university Consortium for Political and Social Research [distributor]; 1992. 1995. 2008-04-17. Computer file.
- Laumann EO, Gagnon JH, Michael RT, Michaels S. The Social Organization of Sexuality. University of Chicago Press; Chicago: 1994.
- Marsden PV. Models and Methods for Characterizing the Structural Parameters of Groups. Soc Networks. 1981;3:1–27.
- Marsden PV. Core Discussion Networks of Americans. Am Sociol Rev. 1987;52:122–131.
- MEASURE DHS. Demographic and Health Surveys. ICF International; 2000–2014.
- Morris M. A Log-Linear Modeling Framework for Selective Mixing. Math Biosci. 1991;107:349–77. [PubMed]
- Morris M. Epidemiology and Social Networks: Modeling Structured Diffusion. Socio Meth Res. 1993a;22:99–126.
- Morris M. Telling Tails Explain the Discrepancy in Sexual Partner Reports. Nature. 1993b;365:437–440. [PubMed]
- Morris M, Handcock MS, Miller WC, Ford CA, Schmitz JL, Hobbs MM, Cohen MS, Harris KM, Udry JR. Prevalence of HIV Infection among Young Adults in the U.S.: Results from the ADD Health Study. Am J Public Health. 2006;96:1091–1097. [PubMed]
- Morris M, Kretzschmar M. Concurrent Partnerships and the Spread of HIV. AIDS. 1997;11:641–648. [PubMed]
- Morris M, Kurth AE, Hamilton DT, Moody J, Wakefield S. Concurrent Partnerships and HIV Prevalence Disparities by Race: Linking Science and Public Health Practice. Am J Public Health. 2009;99:1023–1031. [PubMed]
- National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention (NCHHSTP) Tech Rep. 4. Vol. 17. Centers for Disease Control and Prevention; 2012. HIV Surveillance Supplemental Report: Estimated HIV incidence in the United States, 2007–2010.
- National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention (NCHHSTP) Tech Rep. 3. Vol. 18. Centers for Disease Control and Prevention; 2013. HIV Surveillance Supplemental Report: Diagnoses of HIV Infection among Adults Aged 50 Years and Older in the United States and Dependent Areas, 2007–2010.
- National Communicable Disease Center (NCDC) Tech Rep. 53. Vol. 15. U.S. Department of Health, Education, and Welfare; Atlanta, GA: 1967. Morbidity and Mortality Weekly Report: Reported Incidence of Notifiable Diseases in the United States, 1966.
- National Survey of Family Growth Staff. Tech rep. Division of Vital Statistics, National Center for Health Statistics; 2002. 2006–2011 National Survey of Family Growth (NSFG)
- Pattison PE, Robins GL, Snijders TA, Wang P. Conditional Estimation of Exponential Random Graph Models from Snowball Sampling Designs. J Math Psychol. 2013;57:284–296.
- Pfeffermann D. The Role of Sampling Weights when Modeling Survey Data. Int Stat Rev. 1993;61:317–337.
- Population Estimates Program. Resident Population Estimates of the United States by Age and Sex: April 1, 1990 to July 1, 1999, with Short-Term Projection to November 1, 2000. Population Division, U.S. Census Bureau; 2001. Online. Retrieved June 9, 2009.
- Putnam RD. Bowling Alone: The Collapse and Revival of American Community. Simon & Schuster; New York: 2000.
- R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2013.
- Salganik MJ, Heckathorn DD. Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling. Sociol Methodol. 2004;34:193–239.
- Shalizi CR, Rinaldo A. Consistency under Sampling of Exponential Random Graph Models. Ann Stat. 2013;41:508–535. [PMC free article] [PubMed]
- Smith JA. Macrostructure from Microstructure: Generating Whole Systems from Ego Networks. Sociol Methodol. 2012;42:155–205. [PMC free article] [PubMed]
- Snijders TA. Conditional Marginalization for Exponential Random Graph Models. J Math Sociol. 2010;34:239–252.
- Strauss D, Ikeda M. Pseudolikelihood Estimation for Social Networks. J Am Stat Assoc. 1990;85:204–212.
- Tanfer K. National Survey of Women. In: McKean EA, Muller KL, Lang EL, editors. AIDS/STD Data Archive. Sociometrics Corporation; Los Altos, CA: 1991. pp. 17–19.
- Thompson SK, Frank O. Model-Based Estimation with Link-Tracing Sampling Designs. Survey Methodol. 2000;26:87–98.
- Tomas A, Gile KJ. The Effect of Differential Recruitment, Non-Response and Non-Recruitment on Estimators for Respondent-Driven Sampling. Electron J Stat. 2011;5:899–934.
- Trotter RT, Baldwin JA, II, Bowen AM. Network Structure and Proxy Network Measures of HIV, Drug and Incarceration Risks for Active Drug Users. Connections. 1995;18:88–103.
- Udry JR. Tech rep. Carolina Population Center, University of North Carolina; Chapel Hill: 2003. The National Longitudinal Study of Adolescent Health (Add Health), Waves I & II, 1994–1996; Wave III, 2001–2002.
- UNAIDS. Tech rep. United Nations; 2014. HIV Estimates with Uncertainty Bounds 1990–2013.
- van Duijn MAJ, van Busschbach JT, Snijders TAB. Multilevel analysis of personal networks as dependent variables. Soc Networks. 1999;21:187–210.
- Volz E, Heckathorn DD. Probability Based Estimation Theory for Respondent Driven Sampling. J Off Stat. 2008;24:79–97.
- Wasserman SS, Pattison P. Logit Models and Logistic Regressions for Social Networks: I. An Introduction to Markov Graphs and *p*^{*}. Psychometrika. 1996;61:401–425.
