Relative survival analysis in R
Relative survival techniques are used to compare the survival experience in a study cohort with the one expected should they follow the background population mortality rates. The techniques are especially useful when the cause-specific death information is not accurate or not available since they provide a measure of excess mortality in a group of patients with a certain disease. There are several approaches to modeling relative survival, but there is no widely used statistical package that would incorporate the relevant techniques. The existing software was mostly written by the authors of different methods, in different computer languages and with different requirements for the data input, which makes it almost impossible for a user to choose between available models. We describe our R package that provides functions for easy and flexible fitting of several relative survival regression models.
Introduction
Survival analysis is concerned with studying the random variable T, representing the time between entry to a study and some event of interest (e.g. death, recurrence of disease, ...). Instead of the cumulative distribution function F(t), we usually estimate the cumulative survival function S(t), which is defined as S(t) = P(T > t) = 1 − F(t). In the case of the final event being death, S(t) is the probability of being alive at time t. When the final event of interest is death due to a certain disease, but the causes of death are not known, it is not possible to directly estimate the proportion of deaths due to the disease in question. We then resort to the methods of relative survival. The cumulative relative survival function r(t) is defined as
r(t) = S(t) / S_P(t),
where S(t) denotes the observed survival and S_P(t) stands for the population or expected survival, which is estimated on the basis of population life tables. Obviously, r(t) can be any non-negative number, although the methods are most often applied to data where r(t) is less than 1.
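As a rough illustration using the survival package alone (the data frame d and its variables time, status, age, sex and year are assumed for this sketch and are not part of the package described below; age must be given in days and sex and year coded as in the chosen rate table), the two curves can be estimated and compared directly:

library(survival)
# Observed survival of the cohort (Kaplan-Meier estimate)
obs <- survfit(Surv(time, status) ~ 1, data = d)
# Expected survival of a demographically matched group, taken from a
# population rate table
expected <- survexp(~ 1, data = d, ratetable = survexp.us,
                    rmap = list(age = age, sex = sex, year = year))
# Cumulative relative survival r(t) at five years of follow-up
t5 <- 5 * 365.24
summary(obs, times = t5)$surv / approx(expected$time, expected$surv, xout = t5)$y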
Relative survival regression models
The relative survival literature most frequently refers to the additive models, dominating especially in cancer research. The main assumption is that the hazard of every individual can be split into two additive components:
λ(t) = λ_P(t) + λ_E(t),
where λ_P(t) is the hazard every patient carries because of his age, sex and cohort year (or any other combination of covariates included in the population mortality data), λ_E(t) denotes the excess hazard, specific for the disease in question, and λ(t) is used for the observed hazard, the one we can estimate from our data. Using the equality S(t) = exp(−∫_0^t λ(u) du), we can write S(t) = S_P(t) S_E(t).
The multiplicative models assume the hazard components to be multiplicative:
λ(t) = λ_P(t) λ_E(t).
Such a model does not assume that the observed hazard is always greater than the population hazard, but its interpretation is less obvious than that of the additive model. Our programs provide for fitting the transformation model, the multiplicative model and three different approaches to fitting the additive model. Only the last four have been widely used in the past, while the transformation approach has only recently been published and, given its simplicity, it might become popular in the future, especially in studies where long-term observations are common.
i) Hakulinen–Tenkanen additive survival method: in order to fit the model, the patients must first be grouped into K strata, indexed by k, with one stratum for each combination of the relevant predictor variables (age, sex, cohort year, stage, ...), and a life table must be estimated for each stratum, with intervals indexed by i. The excess hazard λ_E is assumed to be a multiplicative function of the covariates and constant within each time interval i. Denoting by p_ki the observed and by p*_ki the expected probability of surviving interval i in stratum k, the model can be written as
log[−log(E(p_ki)/p*_ki)] = β'z_k + γ_i,
which is a generalized linear model with a binomial error structure and a complementary log–log link function combined with a division by p*_ki.
ii) The glm model with the Poisson error structure: the method is similar to the Hakulinen–Tenkanen approach and requires the same grouping of the data. The number of deaths d_ki is assumed to be distributed as Poisson(μ_ki), where μ_ki = λ_ki y_ki, λ_ki being the observed hazard assumed constant for stratum k on interval i, and y_ki is the person-time at risk for group k of observations on time interval i. The additive assumption then gives
μ_ki = d*_ki + y_ki exp(β'z_k + γ_i)   and   log(μ_ki − d*_ki) = β'z_k + γ_i + log(y_ki),
where d*_ki denotes the expected number of deaths in stratum k on interval i.
The model can be rewritten to yield the same likelihood function as the Estève model, meaning that the grouping of the variables is not crucial.
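As a sketch of how a model of this form can be fitted with the standard glm function (this is not the package's own code; the grouped data frame grouped with columns d, d.star, y, fu and z is assumed purely for illustration), a custom link of the form log(mu − d*) can be supplied to the Poisson family:

# Custom "link-glm" object implementing eta = log(mu - d.star); d.star is the
# vector of expected deaths, matched row by row to the grouped data.
rs.link <- function(d.star) {
  out <- list(
    linkfun  = function(mu)  log(mu - d.star),    # eta as a function of mu
    linkinv  = function(eta) d.star + exp(eta),   # mu as a function of eta
    mu.eta   = function(eta) exp(eta),            # derivative d mu / d eta
    valideta = function(eta) TRUE,
    name     = "log(mu - d.star)")
  class(out) <- "link-glm"
  out
}

# d = observed deaths, fu = follow-up interval, z = covariate, y = person-time;
# log(y) enters as an offset, as in the second equation above.  (The default
# starting values assume observed deaths exceed expected deaths in each cell.)
fit <- glm(d ~ as.factor(fu) + z + offset(log(y)),
           family = poisson(link = rs.link(grouped$d.star)),
           data = grouped)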
iii) The Estève additive survival model: the method uses individual data and estimates the coefficients using the maximum likelihood approach. The excess hazard of individual j is assumed to equal exp(β'z_j + γ'fu_j(t)), where fu_j(t) is a vector of indicators for the follow-up intervals. The likelihood for the additive model is
L = ∏_j [λ_P(t_j) + exp(β'z_j + γ'fu_j(t_j))]^δ_j exp[−Λ_P(t_j) − ∫_0^{t_j} exp(β'z_j + γ'fu_j(s)) ds],
where t_j is the event time for person j and δ_j the status indicator. The log-likelihood function is
l = Σ_j { δ_j log[λ_P(t_j) + exp(β'z_j + γ'fu_j(t_j))] − ∫_0^{t_j} exp(β'z_j + γ'fu_j(s)) ds − Λ_P(t_j) },
where Λ_P(t) = ∫_0^t λ_P(u) du is the cumulative population hazard.
iv) The Andersen multiplicative model: the model assumes λ_E(t) = λ_0(t) exp(β'z), which is the same form as in the Cox model. The observed hazard thus becomes
λ(t) = λ_P(t) λ_0(t) exp(β'z).
This model can be rewritten in the form
λ(t) = λ_0(t) exp(β'z + log λ_P(t)),
from which it is obvious that this is a Cox model with an additional time-dependent variable log λ_P(t) whose coefficient is fixed to 1. Though λ_P(t) is a continuous process, the population mortality tables are usually organised in yearly intervals. In order to fit the model using Cox model procedures, the data can be split into yearly intervals, with λ_P updated on each of the intervals.
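A rough illustration of this idea using only the survival package (the data frame split.data and its columns, including the population hazard lambda.p, are hypothetical and would have to be prepared beforehand, e.g. with survSplit and a lookup in the population tables):

library(survival)
# split.data: one row per person-year, with interval boundaries (start, stop),
# the event indicator for that interval, the covariate x and the population
# hazard lambda.p valid for that person in that interval.
fit <- coxph(Surv(start, stop, status) ~ x + offset(log(lambda.p)),
             data = split.data)

Fixing the coefficient of log λ_P(t) to 1 is exactly what the offset term does.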
v) The transformation approach: the individual survival times t are first transformed as
y = F_P(t),
where F_P is the cumulative distribution function of a person of a certain age, sex and cohort year (or any other combination included in the population tables) that would apply if the person were a typical representative of the general population. This distribution function is calculated from the general population mortality data. The values y can thus be interpreted as the achieved values on the expected cumulative distribution function for each individual. By transforming to the new scale, the population hazard is automatically taken into account; consequently, all that is left is precisely the disease-specific hazard, which we can thus model directly. One way is to use the Cox model
λ(y) = λ_0(y) exp(β'z).
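As a minimal sketch (the column y holding the transformed times is an illustrative name; in the package the transformation and the fit are carried out together by the rstrans function described below), the final modelling step is an ordinary Cox regression on the transformed scale:

library(survival)
# y: transformed times F_P(t) for each individual, status: censoring indicator
fit <- coxph(Surv(y, status) ~ x, data = transformed.data)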
The package
The core of the package consists of three functions that fit the models described in the previous section:
• rsadd fits an additive model. The default is the Estève method (iii), which is specified by method="max.lik"; the other two options are method="glm.bin" and method="glm.poi" for the models (i) and (ii), respectively. When using one of the glm methods, the observed and expected survival proportions for each group are returned in the component groups.
• rsmul fits the multiplicative model as described in (iv). A more computationally intensive alternative is chosen by specifying method="mul1", which splits the intervals at every time point at which the population mortality changes, for example on 1 January and on the individual's birthday.
• rstrans fits the transformation model as described in (v). If only the transformed times are needed, they can be computed directly with the survexp function of the survival package or taken from the rstrans output, where the transformed times are returned in the component y.
The package also includes functions for testing goodness of fit and plotting relevant graphs for all the above models. Additionally, two data sets are included in the package. One is called rdata and contains survival data that can be used as an example. The data set slopop contains the population mortality tables for Slovenia.
All the functions are called in the same way, for example:
rstrans(formula, data, ratetable, int, na.action, init, control)
Two data sets are required for any relative survival model. One is our observed data set, which we will pass to the function as argument data. The other is the population mortality table that we want to compare our observed data to, and this is specified in the argument ratetable. The population mortality tables have to be organised as an object of class ratetable (defined in package survival), default is the survexp.us table that contains the US data (also in the survival package).
The model is specified within the argument formula. It is organised following the syntactic rules of an R language formula and thus similar to any other survival package formula. The left-hand side must be a Surv object. For example, if time and status are the survival time variable and the censoring indicator, and x is a covariate, then the command may be
rstrans(Surv(time,status)∼x)
If the variables that are used for the expected survival calculation (for example age, sex and year) are not organised and not named in the same way as in the population tables, a special term ratetable is to be included in the formula, for example:
Surv(time,status)∼x+ratetable(age=age,sex="male", year=diag)
Argument int specifies the number of yearly intervals to be used in rsadd function. int can either be a single number or a vector. The latter is used when the intervals are not all one year long, for example int=c(0,0.5,1,5,10) means a couple of short intervals at the beginning and longer intervals thereafter with the maximum follow-up time of 10 years. Each interval includes the right endpoint, but not the left one, except for the first interval which also includes the left endpoint. In this case the third interval would therefore be (1, 5 ].
This option can also be used in rsmul and rstrans functions, serving simply as the maximum follow-up time after which all the data are censored, enabling in this way a more straightforward comparison of models. Missing data are handled in the usual way through the argument na.action and the initial values for the model can be specified via the argument init. The number of iterations and other fitting specifications can be specified through the control argument, which follows the glm.control function in the rsadd function and the coxph.control function in rsmul and rstrans.
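For illustration, a call combining these arguments might look as follows (the data set my.data and its variables are hypothetical, and the control settings merely show the kind of values that can be passed):

rsadd(Surv(time, status) ~ x + ratetable(age = age*365.24, sex = "male", year = diag),
      data = my.data, ratetable = slopop,
      int = c(0, 0.5, 1, 5, 10),         # uneven intervals, follow-up cut at 10 years
      na.action = na.omit,               # drop observations with missing values
      control = glm.control(maxit = 50)) # fitting control, as for glm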
The user, however, does not have to rename or rescale the variables in the data set; all the matching can be done with the help of a special term in the formula named ratetable. In order to construct this term, first check the names of the rate table dimensions. For the slopop rate table the names are
> attributes(slopop)$dimid
"age" "year" "sex"
These names are now to be matched with the variable names, for example:
Surv(time,status)∼x+ratetable(age=age*365.24,
sex="male",year=diag)
The names on the left of each equality sign are the dimension names of the rate table, while the expressions on the right refer to variables in the observed data set. This example is for a data set in which no variable sex is included, as all the patients are male; the calendar year variable is stored as diag (diagnosis year), and age is recorded in years and must therefore be multiplied by 365.24 to match the data in the rate table (note that .24 is used in all R calculations as there are 24 leap years per century).
If the variable names and units in the data set already match those in the rate table, no ratetable term is needed and a call can be as simple as
rsadd(Surv(time,status)∼x,ratetable=slopop,
data=my.data)
The rsadd function returns an object of class rsadd, while rsmul and rstrans return a coxph object. All the returned values can be viewed as such or with the help of the summary function. Apart from the coefficients, their standardized values and the other usually returned values, the common component returned by all the functions is named y and contains the survival and censoring times for each individual, organised as a Surv object. In the case of the rstrans function, these are already the transformed times, and this function can thus also be used just for making the transformation.
The max.lik method of the rsadd function is based on maximum likelihood and therefore the output also contains the log-likelihood values, stored as loglik. The other two methods are based on the glm function and the returned objects preserve all its values. The original and the grouped data are both stored in the output object as data and groups, respectively. The original data set contains the demographic variables in the format that matches the rate table (stored as X1, X2, ...) and the covariates in the format in which they were used in the model.
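For instance, reusing the simple call from above, the components described here can be inspected along these lines:

fit <- rsadd(Surv(time, status) ~ x, ratetable = slopop, data = my.data)
summary(fit)   # coefficients, standard errors and Wald tests
fit$loglik     # log-likelihood (max.lik method)
head(fit$y)    # Surv object with the individual survival and censoring times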
Example
To illustrate the usage of the program, we use a subset of data from a study of the survival of patients after acute myocardial infarction, included in the package as the data set rdata. The data were collected in a study carried out at the University Clinical Center in Ljubljana and contain 1040 patients diagnosed between 1982 and 1986 and followed up until 1997. During this time 547 deaths occurred, and as the causes of death are not given, this is a good example of the need for the relative survival methodology. The data set contains the individual survival times (time), the censoring indicator (cens), the date of diagnosis (year) and the demographic variables age, agegr (age group) and sex.
We fit all the models to the data in the same way:
• Time and censoring are already in the right format, so we put them in the formula as Surv(time,cens);
• sex and agegr (“under 54” is the reference group) are taken as covariates; agegr is forced to be a factor variable =⇒ sex+as.factor(agegr);
• We concentrate on the first 5 years of follow-up (and assume the hazard is constant in yearly intervals for all the additive models) =⇒ int=5;
• The name of the observed data set is rdata, the study was performed in Slovenia, therefore we use Slovenian population tables =⇒ data=rdata, ratetable=slopop;
• Variables sex and year are in the same format as in the rate table slopop, age must be put into days =⇒ ratetable(age=age*365.24, sex=sex, year=year).
The command lines for our five models therefore look very similar:
> rsadd(Surv(time,cens)∼sex+as.factor(agegr)+
+ ratetable(age=age*365.24,sex=sex,year=year),
+ data=rdata,ratetable=slopop,int=5,method="glm.bin")
> rsadd(Surv(time,cens)∼sex+as.factor(agegr)+
+ ratetable(age=age*365.24,sex=sex,year=year),
+ data=rdata,ratetable=slopop,int=5,method="glm.poi")
> rsadd(Surv(time,cens)∼sex+as.factor(agegr)+
+ ratetable(age=age*365.24,sex=sex,year=year),
+ data=rdata,ratetable=slopop,int=5)
> rsmul(Surv(time,cens)∼sex+as.factor(agegr)+
+ ratetable(age=age*365.24,sex=sex,year=year),
+ data=rdata,ratetable=slopop,int=5)
> rstrans(Surv(time,cens)∼sex+as.factor(agegr)+
+ ratetable(age=age*365.24,sex=sex,year=year),
+ data=rdata,ratetable=slopop,int=5)
The outputs of all these functions are organised in the same way; the computed values are stored in objects that closely match those of other standard regression model functions in R (e.g. the coxph function). As an example of the most important part of the output (obtained with the summary function), we report the estimates of the β and γ coefficients, their standard errors and the results of the Wald test for each covariate (z and p values) for the Estève additive model (the default of the rsadd function).
The covariate agegr has four possible values, and the youngest age group is automatically taken as the reference group. The output therefore consists of nine coefficients, the last five representing the follow-up interval indicators. The positive sex coefficient (γ = 0.903) implies that the survival of men is relatively better. Age, however, does not seem to be a very important factor, with even the oldest group coefficient not differing significantly from the youngest (p = 0.052). This means that the differences in survival between the age groups that we would get from any classical survival method are nearly wholly attributable to the population risks. The coefficients for the follow-up years are similar as well; only the first year seems to have a larger hazard (exp(fu[0, 1]) = 0.015).
To conclude
The usage of any statistical method depends mainly on the availability of adequate software, and relative survival techniques are no exception. Until now, the choice of the model, and of the method to fit it, has mostly been influenced by the software at hand. Different methods were programmed in different languages, and their usage was hindered by the considerable effort needed to learn the special requirements of such programs. We believe that by incorporating all the different methods into an R package, now that R is widely used by statisticians and other professionals, such obstacles are mostly overcome. All the programs use the standard R format for commands, and their usage is as easy as that of any other regression program in R. Users experienced in relative survival analysis and in programming in R will find it easy to adjust the programs to their needs if necessary.
The package was written by the first author and is publicly available on CRAN.
Creating a new rate table
To use population tables other than those already available (such as survexp.us or slopop), the mortality data must first be arranged into an array with one dimension for each demographic variable; the rest of the work consists of specifying the attributes of this array. The easiest way to do this is to check the attributes of an existing rate table and imitate them (as is done in the following example for an array we named my.rate). The compulsory attributes are the following:
++ dimid. The names of the dimensions of the array. If we have a four-dimensional array with dimensions age, sex, year and race, we can specify this attribute as:
> attributes(my.rate)$dimid <- c("age","sex", "year","race")
++ dim. The dimensions of the array, for example
> attributes(my.rate)$dim <- c(100,2,4,3)
++ dimnames. The names of each of the groups, i.e. a vector of names for each of the dimid variables. The race names could be specified as (race is the fourth dimension of the array)
> attributes(my.rate)$dimnames[[4]]
<- c("white","black","other")
++ factor. This attribute specifies which of the variables are constant in time: the values for sex and race will therefore be equal to 1, and the values for age and calendar year will be 0. Other values are allowed and are used in the US population tables, but special functions must of course be programmed when using those. In our example the value would be
> attributes(my.rate)$factor <- c(0,1,0,1)
++ cutpoints. A list of vectors denoting the cut points for each of the dimid variables (the constant ones, like sex, must have the value NULL). If our population tables are estimated every ten years over four decades, the vector of cut points for calendar year (the third dimension) will be
> attributes(my.rate)$cutpoints[[3]]
<- as.date(c("1Jan70",
+ "1Jan80","1Jan90","1Jan2000"))
++ class. Set the class of the object to
> attributes(my.rate)$class <- "ratetable"
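Putting these steps together, a minimal sketch might look as follows; the array my.rate, its dimensions and the rates themselves are purely illustrative placeholders, and is.ratetable from the survival package is used only as a rough check of the attributes:

library(survival)

# Placeholder daily hazards for ages 0-99 and two sexes (illustrative values only)
my.rate <- matrix(runif(100 * 2, 1e-6, 1e-3), nrow = 100, ncol = 2)

attributes(my.rate)$dimid     <- c("age", "sex")
attributes(my.rate)$dimnames  <- list(as.character(0:99), c("male", "female"))
attributes(my.rate)$factor    <- c(0, 1)                     # sex is constant in time
attributes(my.rate)$cutpoints <- list((0:99) * 365.24, NULL) # age cut points in days
attributes(my.rate)$class     <- "ratetable"

is.ratetable(my.rate)   # TRUE if the attributes are mutually consistent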