
Multistage models of disease development are a theoretically and
practically useful way of modeling time delays between when a
disease process begins and when it is manifest. Modeling time
delays is useful for both infectious and non-infectious diseases
in two ways. First it captures important aspects of dynamics.
Second it creates a foundation for modeling the joint effects
of exposures and for understanding the dynamics of screening programs.
Modeling the incubation period for a contagious disease is useful
because the distribution of incubation periods between onset of
the infection process and the onset of contagiousness affects
the speed with which an infection can spread through a population.
For chronic non-infectious diseases the dynamics of disease development
in individuals will affect the age distribution of disease. For
example the rising frequency of cancer with age has been explained
mostly in terms of multistage models. The dynamics of disease
development after a transient exposure will affect the way we
analyze exposure effects and evaluate the effects of exposure
control programs. Some model of disease development dynamics is
always assumed when analyzing exposure-disease relationships.
While we address these direct uses of multistage modeling in this
chapter, they do not present the major motivation for the presenting
multistage models. This chapter is meant to create a basis for
further practical and theoretical modeling issues. One practical
issue for which it creates a modeling foundation is the design
and evaluation of screening programs. In the next chapter we will
use multistage models to gain insight into how predictive values
should change after the institution of a screening program and
how intervals between repeat screenings will affect the performance
of screening programs. These issues are not well addressed by
the static approaches to relating sensitivity, and specificity
to predictive values at different levels of disease prevalence.
An issue needing multistage models that has even broader practical
and theoretical implications is the joint effects of two or more
different exposures. The pattern of disease in the population
by exposure is dependent in predictable ways on how two different
exposures relate to each other in causing disease. There are three
major ways that they may relate to each other: 1) they may contribute
to advancing the same stage of a disease process, 2) they may
contribute to different stages of the same disease process, or
3) they may contribute to distinct pathways to the disease. To
correctly assess exposure effects statistically and to predict
the consequences of eliminating exposures, the analysis model
used must have joint effects that correspond those produced by
the underlying causal relationships. By building models of these
three relationships we will gain insight into the analytic models
that will be appropriate for different causal situations.
Single Stock Models of Disease Incubation:
Up to this point we have dealt with disease processes as if the
time that you entered a disease process didn't influence the time
when you developed a disease. Making this assumption simplifies
the mathematics of disease processes and it reduces the time a
computer must take to simulate a process. Although such assumptions
are usually unrealistic, they provide appropriate caricatures
of systems in the initial exploration of how those systems might
be expected to behave.
Simple one stage disease process models can be particularly useful
for studying the equilibrium states of systems. Sometimes, but
not always, the equilibrium states don't depend on the distribution
of times that it took individuals to reach a disease state. Sometimes
they only depend upon the mean time that it takes individuals
to reach the disease state. We will illustrate one such situation
in this chapter. Later in the course we will also see that the
shape of the distribution of incubation periods can markedly affect
how fast an infectious disease epidemic rises but it does not
affect the equilibrium levels of infectious diseases.
A simple alternative to modeling the development time to disease
is to assume that there is a fixed time to disease development
so that everyone takes exactly this fixed time to develop the
disease. We can produce simulations with such fixed delays using
the conveyor stock in Stella. The conveyor model is equivalent
to having a compartment for each dt in the numerical estimation
procedure and having the entire contents of each compartment move
into the next compartment each dt. The negative exponential distribution
and the fixed delay represent two extremes of possible distributions
of incubation periods. Let us generate these two extreme distributions
using the following Stella models:
In the conveyor we put 1 individual in the most distant time slot
from the exit and zeros in the 99 other time slots. In the Well
Reservoir we put one individual as well. You can appreciate that
we have to plot the flows from the well state on different scales
in these two models. The maximum rate from the reservoir is 0.01
but there are continuous flows. From the conveyor, our one individual
flows out all at once. Make sure you can reproduce the output
in graph 9.1.
Certainly there must be some more realistic alternatives between
these two extremes.
Let us consider one implication of these two models for infectious
diseases. Rather than using these models for incubation periods,
we first consider a simple SIR model where we use these two models
for the infectious period. In the first SIR model we use a reservoir
for the infection compartment so that the distribution of times
spent in that compartment has a negative exponential form as seen
in Graph 9.1. In the second, we use a conveyor so that every individual
has exactly the same duration of stay in the infectious state.
We set the rate of exit from the reservoir equal to the inverse
of the time length of the conveyor so that in each case the average
duration in the infectious state is the same. The diagram for
our model is seen in Diagram 9.2.
COMPARTMENTS
Icon(t) = Icon(t - dt) + (NewIcon - NewRcon) * dt
INIT Icon = .0001
TRANSIT TIME = 2
INFLOW LIMIT = INF
CAPACITY = INF
INFLOWS:
NewIcon = Scon*c*beta*(Icon/(Icon+Rcon+Scon))
OUTFLOWS:
NewRcon = CONVEYOR OUTFLOW
Ires(t) = Ires(t - dt) + (NewIres - NewRres) * dt
INIT Ires = .0001
INFLOWS:
NewIres = Sres*c*beta*(Ires/(Ires+Rres+Sres))
OUTFLOWS:
NewRres = Ires/D
Rcon(t) = Rcon(t - dt) + (NewRcon) * dt
INIT Rcon = 0
INFLOWS:
NewRcon = CONVEYOR OUTFLOW
Rres(t) = Rres(t - dt) + (NewRres) * dt
INIT Rres = 0
INFLOWS:
NewRres = Ires/D
Scon(t) = Scon(t - dt) + (- NewIcon) * dt
INIT Scon = .999
OUTFLOWS:
NewIcon = Scon*c*beta*(Icon/(Icon+Rcon+Scon))
Sres(t) = Sres(t - dt) + (- NewIres) * dt
INIT Sres = .999
OUTFLOWS:
NewIres = Sres*c*beta*(Ires/(Ires+Rres+Sres))
PARAMETERS
beta = .25
c = 4
D = 2
Now let us examine the incidence and cumulative incidence curves
from this model. We see in Graph 9.2 that the epidemic rises faster
given the conveyor model of duration of contagiousness. Since
the basic reproduction number is the same for both models, however,
both have the same threshold values for contact rates and transmission
probabilities at which there will be rising transmission and both
come to the same final fraction of the population infected.
Let us consider some technical points about how we generated these
curves in Graph 9.2. Whenever a delay, a quick burst inflow, or
a conveyor is used in our simulations, we should stick to the
Euler method of numerical estimation. To attain accuracy with
the Euler method, a very small time step is required in this model
because during the rapid rise of the epidemic in the reservoir
case, flows can be relatively large. When a sufficiently small
time step is used, however, you see that the final fraction of
individuals infected is the same for both models. The initial
rate of rise of the epidemics, however, is different.
If we start with a small enough seed of initially infected individuals,
we can examine the initial period of the epidemic to see how long
it takes either the total number of infected individuals or the
flow of newly infected individuals to double. The reservoir model
has a constant doubling time from the start which persists until
the fraction of the population which is not susceptible can perceptibly
affect the doubling time. The conveyor model starts out with a
uniform distribution and takes a few doubling times until cases
are distributed along the conveyor in an exponential fashion.
After that point there is again a period of constant doubling
times until the decreased fraction susceptible begins to slow
the rise.
The period of constant doubling times at the very beginning of
an epidemic can be used to estimate the basic reproduction rate
of the transmission process. We can see from these results, however,
that to use that doubling time to estimate the basic reproduction
number, we will have to have a model that describes the distribution
of infectious periods and not just an average duration for the
infectious period.
Homework 9.1
Construct two SIR models with
vital dynamics that have an unchanging population size. In one
use a reservoir for the infectious period and in the other use
a conveyor. Demonstrate that when the average duration of the
infectious period is the same, these two models lead to the same
equilibrium values.
Multistage Disease Processes:
Realistically the time one develops a disease is likely to depend
upon when one started incubating the disease. That is to say that
a one stage reservoir model is not completely realistic. There
will almost always be a distribution of disease onsets after exposure
where some individuals have short incubation periods and some
have longer periods. If that is the case then the conveyor model
is not ideally realistic either. For some purposes we might construct
our models with these two extremes and see if we come to the same
conclusion with either extreme. If we do, then we could generally
feel confident that a more realistic intermediate model would
also lead us to that same conclusion. Because of limitations we
find in Stella on the nature of flows into and out of conveyors,
however, we may not be able to test our model with these two extremes
in all cases. Thus we are motivated to find an intermediate model.
Fortunately there is a very convenient one readily at hand. It
is the multistage model.
We can create a realistic distribution of incubation periods with
most incubation periods clustered around a peak by introducing
multiple stages to the disease development process.
Let us think about a multistage process in physiological terms.
Consider inhaling a dose of influenza viruses as the chance event
that starts a disease process. It takes some time for those viruses
to bounce around at random until they finally attach to the cells
that they can invade. Then the cell membrane molecules have to
bounce around at random until all the necessary linkages between
virus and cell membranes are made that lead to actual virus entry
and separation from virus RNA and attached proteins from its viral
membrane. Then the virus proteins have to randomly go about until
they hit their targets. It seems that we could always go into
finer and finer detail with such descriptions. In that sense,
any disease development process seems to have a lot of random
event sequences that it must go through. Only a few of those,
however, will be long enough to make much of a difference. For
the influenza virus situation, there will have to be several generations
of virus replication going through all of these minute stages
before there is enough virus present to make one ill. We could
think of each viral generation as a stage of the incubation period.
For the incubation of cancer we generally think of stages during
which somatic cells have to undergo mutations. Each cumulative
mutations marks another stage. We will see later that such multistage
models can be elaborated further by specifying whether the mutations
have to occur in sequential order or whether any order will do
the job.
A simple two stage model:
Lets build a simple two stage disease process and examine the
distributions of times to entry into the disease state by plotting
the flows into the disease state. The model we will build is as
seen in STELLA diagram 9.3
COMPARTMENTS:
First_stage_incubating(t) = First_stage_incubating(t - dt) + (- Asymptomatic_progressions) * dt
INIT First_stage_incubating = 1
OUTFLOWS: Asymptomatic_progressions = First_stage_incubating*Stage_1_Progression_rate
Second_stage_incubating(t)
= Second_stage_incubating(t - dt) + (Asymptomatic_progressions - Disease_onsets) * dt
INIT Second_stage_incubating = 0
INFLOWS: Asymptomatic_progressions = First_stage_incubating*Stage_1_Progression_rate
OUTFLOWS: Disease_onsets = Second_stage_incubating*Stage_2_progression_rate
Diseased(t) = Diseased(t - dt) + (Disease_onsets) * dt
INIT Diseased = 0
INFLOWS: Disease_onsets = Second_stage_incubating*Stage_2_progression_rate
PARAMETERS
Stage_1_Progression_rate = 1
Stage_2_Progression_rate = 1
DERIVED VARIABLES
Well_Total = First_stage_incubating+Second_stage_incubating
Incidence_rate_or_Hazard_function = Disease_onsets/Well_Total
In this model we put all of the modeled population into the first
stage compartment before we begin running the program. There is
no input into this compartment so everyone in the compartment
will have spent the same amount of time in the compartment. Therefore
we can think of this as a model of a cohort of individuals who
are identified at some point of time and followed. We will consider
the start of the model to be just at the moment that individuals
have begun their first step toward developing disease. Then individuals
will have to take a second step to get into the second compartment
and a third step to get into the compartment where disease is
manifest. For example, we might think of time zero in the first
stage as the time of HIV infection and the disease onsets in the
third compartment as AIDS onsets.
Alternatively, we might think of an entire cohort of individuals
being exposed to some hazard beginning at time zero that defines
the onset of the disease process. For example we might be interested
in examining the incubation period through which smoking generates
cancer. Smoking might generate a constant chance of reaching the
next stage of the disease process which might be some somatic
mutation. Then once that stage is reached (in the second compartment
in our model), there could be a constant hazard that the next
step which takes one into the disease state could be taken. This
step might be the growth of a tumor to a critical detectable size
or it might be attainment of the capacity to metastasize. For
most cancers we will in fact probably want more than two stages.
Growth processes in general require us to use several stages and
metastases don't become immediately manifest as symptoms. Thus
in this example we might want actually to use more than two stages.
But we want to begin our multistage modeling with the simplest
multistage model.
In either example, because we have put everyone into the first
stage of disease at time zero, time zero is when we define the
risk as beginning. At that time no one will have made any progression
to disease. Of course with many disease processes, if we were
to select a cohort of well individuals for study, some of those
individuals would be at various stages already along the progression
to disease. We assume in this model that we have some way of defining
a beginning point for the process.
Suppose we want the total average incubation period or time spent
in non-diseased stages to be two time units so we will divide
the total average incubation period into two parts.
Note that we can treat the flow into the diseased state in this two stage model just as we treated the flow into the diseased state in the one stage model. Across any discrete time interval, the number of individuals flowing into the diseased state represents the number of individuals that have an incubation period corresponding to the midpoint of that time interval. To emphasize this we present graph 9.3. The total number of healthy individuals in this graph has been calculated by summing the individuals in the two different incubating states.
As the "dt" constituting the base of the rectangles
used to measure the volume under the curve defined by "Disease
Onsets" gets smaller and smaller, the sum of the volume of
those rectangles gets closer to the total volume under that curve.
That total volume is one since we put one individual in the first
well stage. As the step sizes go to zero, the rectangles will
have no volume. That is to say the probability of any individual
fitting into any particular rectangle as the dt goes to zero will
approximate zero. But the height of the curve can be interpreted
as the probability density of having an incubation period of the
corresponding length of time.
Note that just as in the case of one stage disease, the survival
function is one minus the cumulative risk function, the probability
density function is the derivative of the cumulative risk function
or the negative derivative of the survival function, and the hazard
function is the probability density function divided by the survival
function. Let us summarize this symbolically:
For those of you with a calculus background, you might also be interested to see that
l(x) = -d ln(S(x)) / dx.
In this case, as opposed to the one stage time independent or autonomous model, our hazard function is not constant over time. Note that our model is still time independent (flows don't depend on the values of stocks at any time in the past) and it is still autonomous (time itself does not affect parameter values). We have not put any delays or conveyors into the system and we have not put time into any of the functions defining flows. The hazard function is the instantaneous rate at which disease is developing in well individuals. In Stella we have to create this as a derived variable by taking the flow into the disease stage per unit time and dividing by the total well population. Note that unlike the one stage disease process, this hazard function changes value over time. The reason for this is that at first there is no one in the stage just before disease so no one can get disease. As more and more individuals flow into that pre-disease state, the total proportion of individuals in that state will rise, and the overall disease rate will rise. Let us look at this function graphically.
The duration of separate stages and the overall duration
There are various ways we could divide the incubation period into
two parts. We could have a lot of the incubation period in the
first stage and little in the second. We could make the two parts
equal as done for graph 9.3. Or we could make the first stage
short and the second stage long. We want now to examine how these
different divisions of the total incubation time into the different
stages affects the probability density function of the incubation
period.
If we divided our incubation stages evenly so that the average
time in each stage is 1 time unit, the flow rates out of each
incubating stage would have to be one in order for the average
time spent in each of them to equal one. We will label the flow
into the diseased state as IR 50 per 100 for this 50-50 division
of the incubation period. Alternatively we might divide the time
periods up so that on the average half a year is spent in the
first stage incubating and 1.5 years in the second stage incubating.
To make this division, we would make the flow rate out of the
first stage to equal 1/0.5 or 2 and out of the second stage to
equal 1/1.5 or .66667. In the graph below, we label the flow into
the diseased state as IR 25 per 100 for these values.
Note that in order to get a total time of 2, it is not the rates
that add up, it is one over the rates! We could also divide the
time up so that only 1.5 time units on average were spent in the
first stage and 0.5 were spent in the second stage. We would just
switch around the previous flow rates for this purpose. We label
the corresponding flow IR 75 per 100. Similarly we could divide
the periods into .2 and 1.8 years so that the flow rates would
be 5 and 1/1.8. (IR 10 per 100) Or we could go to the extreme
where 99% of the incubating time would be spent in the second
stage so the flow rates would be 50 and 1/1.98. (IR 1 per 100)
(Better use a very small time step when you have a flow rate of
50 or alternatively change your time units!)
Given these combinations of flow rates, we get the distributions of incubation times as seen in graph 9.5 by plotting the flows into the diseased state.
Homework 9.2
Reproduce a set of curves like those in graph 5.4 but where the
average total incubation period is only 0.5 time units. Make sure
that your results are not influenced by the time period for the
dt which you have chosen by demonstrating that your results are
similar even when you halve the dt. With a total incubation time
in the two stages of 0.5, you are going to have some pretty big
flow rates. Make sure you understand how to handle such high rates
without generating artifacts.
Note that in a two stage model like this, the final distribution
of incubation periods or times spent in the non-diseased state
will be the same if we make the first stage fast and the second
slow or vice versa. We see this above graph where the IR 25 per
100 and the IR 75 per 100 curves are congruent.
Homework 9.3
Explain why the IR 25 per 100 and the IR 75 per 100 curves are
congruent. Be as specific as possible. See if you can use any
mathematical logic in your explanation.
Note that for all practical purposes, we could treat the IR 1
per 100 model as a one stage process. The second stage happens
so fast that it doesn't make a noticeable difference in the incubation
period. From exercise 1 above you should have seen that adding
a small, fast moving stage complicates the calculation process
considerably. Thus there is every reason to avoid such stages
in a model and little reason to include them.
Note also that to move the peak of our distribution of incubation
times maximally to the larger numbers, we divided the total incubation
period evenly. But even then, the curve has a very short left
side and a long right side.
How can we move the peak even further to the right while still preserving the same total average incubation time? We can do that by adding more stages. Graph 9.6 presents the distribution of onset times for a total average incubation time of 10 years when there are 1, 2, 4, 20, and 40 stages. Each stage is given exactly the same length. The flow rate out of each stage is r/10 where r is the number of stages.
You will note that as the number of stages is increased, the curve
gets more bell-like but that it still has a longer tail on the
right than on the left.
The Erlang distribution:
The curves in graph 9.6 are Erlang distributions. Erlang distributions
are a subset of a larger family of distributions called gamma
distributions. Gamma distributions can describe probability density
functions just like the negative exponential function rate*e-rate*time
describes a probability density function. In fact the negative
exponential distributions are a subset of the Erlang distributions.
There is a closed form for the gamma distribution which is kemtm-1e
-kt / m! . When m is constrained to the integer numbers,
we have an Erlang distribution. In that case k corresponds to
the flow rate out of each of the sequential compartments and m
corresponds to the number of compartments.
The average incubation period is m / k . That is because the average
duration in each stage is one over k and there are m stages.
The Multi-Hit process.
In the above examples, one has to be in a stage before the events
that lead to some other stage can take place. In some situations
where two different events are involved in leading up to a disease
process, it may not be necessary to have one event occur before
the other. For example, perhaps there are a set of genetic mutations
that need to be accumulated before the disease develops and it
makes no difference which mutation occurs first.
The accumulation of random risk over time (Weibull) process.
The Weibull distribution:
Consider another way of making the flow out of an incubating disease
stage depend upon the time that one has been in that incubating
stage. Instead of adding additional stages, we could make the
flow dependent upon the amount of time that has been spent in
a stage as we fill in the formula for the flow regulator. That
approach will only work in a cohort study where everyone starts
in the incubating compartment at time 0 and no new individuals
come into that compartment after time zero. For that reason this
approach does not have as wide applicability in compartmental
modeling as the multistage model does. In the model in Diagram
9.4 we have filled in the flow rate as having a linear dependence
on time.
Diseased(t) = Diseased(t - dt) + (Disease_Onsets) * dt
INIT Diseased = 0
Well(t) = Well(t - dt) + (- Disease_Onsets) * dt
INIT Well = 1
Weibull_Parameter = .25
Disease_Onsets = Well*Weibull_Parameter*time
Incidence_density_or_Hazard_Function = Disease_Onsets/Well
This gives us graph 5.6
Curve 3 is the Weibull distribution. It has the statistical advantage
of having only one parameter. The closed form for this curve is
at(base of the natural logarithms to
the power)[-at2
/ 2]. The a
is the Weibull parameter and t is the time since entering the
state. In this model the time of entering the state is time zero
so that t is just simulation time which we can enter into STELLA®
just by typing time.
The average incubation period given this type of process is .
You can see by the fact that the flow rate out of the healthy
state can exceed in value the number of remaining healthy incubating
individuals that this distribution has a different shape from
the Erlang distribution. Also the first part of the curve is not
concave as is the case with the Erlang distribution.
For your entertainment:
See how close you can make a Weibull distribution approximate
a 3 stage Erlang distribution or the generalized three stage Erlang
from the last exercise. Start with a Weibull parameter of around
0.02. Also see how well you can make a Weibull approximate a 3
stage Erlang distribution or a 10 stage Erlang distribution. Suppose
you are trying to model the incubation period from the onset of
HIV infection to the onset of AIDS and you have data only on half
of the rising part of the curve. Comment upon the problems of
using a Weibull distribution if the real distribution were an
Erlang distribution.
The reason for using a distribution of this sort in data models
is the advantage of having only one parameter instead of two required
by Erlang distributions. It is difficult to think of a causal
process, however, that would make the flow rate linearly dependent
upon the time in a state. Moreover the fact that in Stella we
can only use this distribution when examining a cohort that all
start their risk at the same time makes it of limited application
for compartmental models.
Review Questions:
1 Describe the two extreme distributions for a one stage disease
process. How can you generate these with a Stella® simulation?
2 Describe two different approaches to generating distributions
that are intermediate between the extreme distributions discussed
in question 1. Discuss the biological relevance of these approaches.
3 Describe what happens to the shape of the incubation period
distribution as "n" increases when there are "n"
stages of disease and the flow out of each stage is some constant
times "n".
4 How can one generate the probability density function of an
Erlang distribution using Stella?
5 How can one generate the cumulative density function of an Erlang
distribution using Stella?
6 How can one generate the survival function associated with an Erlang distribution using Stella.