GLM.Basics

Reading: Goldburd, M.; Khare, A.; Tevet, D.; and Guller, D., "Generalized Linear Models for Insurance Rating," CAS Monograph #5, 2nd ed., Chapters 1 and 2.

Synopsis: This article gives a solid grounding in the foundations of generalized linear modeling. All of the material is directly testable but there is a lot of it so don't expect it to all come up on any one exam.

Study Tips

None of the material is particularly hard but there's a lot to remember. Return to the Battlecards frequently to test your memory, and make sure you work the past exam problems as you go along to solidify the concepts. The recent syllabus changes have placed more emphasis on GLMs so expect more questions from this reading than in the past.

Estimated study time: 24 Hours (not including subsequent review time)

BattleTable

Based on past exams, the main things you need to know (in rough order of importance) are:

  • Be able to calculate/work with GLMs to determine outcomes.
  • Be able to list and describe the components of a GLM.
  • Recommend an appropriate error distribution for a GLM.
  • Understand when to use an offset and be able to calculate one.
  • Explain when it is appropriate to use a GLM and what the limitations are.
Questions from the Fall 2019 exam are held out for practice purposes. (They are included in the CAS practice exam.)
reference | part (a) | part (b) | part (c) | part (d)
E (2018.Fall #7) | GLM offset - calculate | GLM fitted probability - calculate | Link functions - describe properties | GLM offsets - describe usage
E (2017.Fall #1) | Statistical considerations - discuss | GLM premium - calculate | Variable choice - explain reasons | Rating plan development - discuss issues
E (2016.Fall #4) | Variable choice - explain considerations | GLMs and deductibles - explain considerations | GLM offset - explain incorporation |
E (2016.Fall #6) | Apply GLM - calculate premium | GLM offsets - describe application | Model Validation - GLM.ModelRefinement |
E (2015.Fall #2) | Aliasing - explain types | Near aliasing - describe impact | Aliasing - alternate approach |
E (2015.Fall #3) | Apply GLM - calculate frequency | | |
E (2014.Fall #3) | GLM variance - define parameters | Error distributions - recommend & explain | Error distributions - parameter selection |
E (2013.Fall #2) | Design matrix - define | Response vector - define | Solve GLM - calculate 1 parameter | Error distributions - describe appropriateness
E (2013.Fall #4) | GLM properties - suitability to task | PC analysis - outdated | Cluster analysis - outdated | Test statistics - outdated
E (2012.Fall #2) | GLM properties - describe components | Missing data - GLM.DataPrep | |
E (2012.Fall #4) | GLM probability - calculate | Model improvement - GLM.ModelSelection | |
Full BattleQuiz

In Plain English!

You should skim the first chapter of the text. Generalized Linear Models (GLMs) are widely used; the goal of the text is to get you up and running quickly with using GLMs in classification ratemaking. The focus is on ratemaking, but the ideas translate to other types of modelling, such as predicting whether a customer will be retained or not.

Overview of Technical Foundations

We want to explain how a collection of explanatory variables (the predictors) relate to a target variable. A generalized linear model (GLM) is one way of quantifying the relationship. A GLM outputs the expected value of the outcome for a given set of predictors. The actual results observed may vary considerably but should have the same expected value if the model is good.

Notation: Let y be the target variable and [math]x_1,\ldots,x_n[/math] be the predictors.

The output of a GLM is the result of two components:

  • Systematic Component: This is the variation in outcomes that can be explained by the predictors. A good predictor will explain more of the variation than a bad predictor.
  • Random Component: This is the variation in the outcome that is not explained by the predictors. This could be because we didn't choose the right set of predictors (or weren't allowed to include a predictor such as credit-based insurance score), or simply because of randomness that is unpredictable - after all, accidents have to happen to someone, even if there's no set of characteristics that makes them more likely than others to have one.

By using a GLM the modeler is making assumptions about the behaviour of the systematic and random components.

The target variable is assumed to be a random variable from the exponential family of distributions. This family includes the Normal, Poisson, Gamma, Binomial, and Tweedie distributions. All members of the exponential family have two parameters: a mean denoted by [math]\mu[/math], and a dispersion parameter [math]\phi[/math]. The dispersion parameter is related to the variance of the distribution. The choice of distribution from within the exponential family is important. When we pick, say, a Gamma distribution, what we're saying is that, after accounting for the predictor variables, the target variable is randomly distributed according to a Gamma distribution whose parameters we will find. That is, the GLM random component has a Gamma distribution in this case.

The estimate of the parameter [math]\mu[/math] is very important as it is the mean of the distribution and is the expected value of the outcome. It is the output of the model.

A GLM builds an estimate of [math]\mu[/math] for each set of predictor values. Thus, the output of a GLM is really [math]\mu_i[/math] with i denoting the ith set of predictor values. The dispersion parameter is assumed to be constant across all sets of predictor values.

A key idea behind GLMs is that there is a function g, called the link function, which relates the mean of the ith target variable, [math]\mu_i[/math], to a linear combination of the predictor variables. That is, if we have p predictor variables called [math]x_{i,1},x_{i,2},\ldots,x_{i,p}[/math], then there is a function g such that [math]g(\mu_i)=\beta_0+\beta_1x_{i,1}+\beta_2x_{i,2}+\ldots+\beta_px_{i,p}[/math].

The right side of the equation is known as the linear predictor since the predictor variables enter only through linear terms, with no other powers or functions applied. [math]\beta_0[/math] is known as the intercept and [math] \beta_1,\ldots,\beta_p[/math] are the coefficients. The goal of a GLM is to use the data available to determine the values of [math]\beta_0, \beta_1,\ldots, \beta_p[/math] for a predetermined link function. If the link function is chosen to be invertible then [math]\mu_i[/math] can be obtained directly and predictions made for any set of values of the predictor variables.

There are many types of link functions which may be used. However, it is common (at least in insurance applications) to use the natural logarithm, [math]\ln(x)[/math]. This is because inverting the logarithm turns the right-hand side into a product of exponentiated terms, i.e. a multiplicative structure. That is, [math]\mu_i = e^{\beta_0}\cdot e^{\beta_1x_{i,1}}\cdot\ldots\cdot e^{\beta_px_{i,p}}[/math].
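To make the mechanics concrete, here is a minimal sketch in Python of turning a fitted linear predictor into a prediction under a log link. The coefficient and predictor values are hypothetical, chosen only for illustration.

<pre>
# Minimal sketch (hypothetical coefficients): with a log link, the prediction
# is the exponentiated linear predictor, i.e. a product of factors.
import math

beta = [0.5, 0.3, -0.2]           # hypothetical: intercept, beta_1, beta_2
x = [1.0, 2.0]                    # one record's predictor values x_1, x_2

linear_predictor = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
mu = math.exp(linear_predictor)   # inverse of the log link gives the predicted mean

# Equivalent multiplicative form: exp(b0) * exp(b1*x1) * exp(b2*x2)
mu_mult = math.exp(beta[0]) * math.exp(beta[1] * x[0]) * math.exp(beta[2] * x[1])
assert abs(mu - mu_mult) < 1e-12
</pre>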

Question: Why are multiplicative rating structures common?
Solution:
Multiplicative rating structures are common because they are:
  1. Simple and practical to implement,
  2. Guaranteed to produce a positive premium whenever the base rate is positive, so there is less need for minimum premium requirements,
  3. Intuitive: a flat $500 surcharge for a violation is not much of a deterrent on a $10,000 premium, yet might be considered unfair on a $100 premium. It is much easier to explain (and fairer) that having a violation increases the premium by 10%.

Exponential Family Variance

Let [math]\mu[/math] denote the mean of a distribution from the exponential family. Then the variance may be written as [math]Var(y)=\phi\cdot V(\mu)[/math]. Here, [math]\phi[/math] is the dispersion parameter and [math]V(\mu)[/math] is called the variance function.

The variance function depends on [math]\mu[/math] and is usually an increasing function of [math]\mu[/math] (though it may be a constant). This is helpful because one requirement of GLMs is that the dispersion parameter [math]\phi[/math] be constant across all risks; an increasing variance function still lets the model reflect that larger risks typically have a larger variance.

Although memorization isn't a large part of the exam, the following table is one you should be familiar with.

Distribution | Variance Function, [math]V(\mu)[/math] | Variance, [math]\phi\cdot V(\mu)[/math]
Normal | 1 | [math]\phi[/math]
Poisson | [math]\mu[/math] | [math]\phi\cdot\mu[/math]
Gamma | [math]\mu^2[/math] | [math]\phi\cdot\mu^2[/math]
Inverse Gaussian | [math]\mu^3[/math] | [math]\phi\cdot\mu^3[/math]
Negative binomial | [math]\mu(1+\kappa\mu)[/math] | [math]\phi\cdot\mu(1+\kappa\mu)[/math]
Binomial | [math]\mu(1-\mu)[/math] | [math]\phi\cdot\mu(1-\mu)[/math]
Tweedie | [math]\mu^p[/math] | [math]\phi\cdot\mu^p[/math]

Note you only need to memorize the first two columns, since the variance is just the variance function multiplied by [math]\phi[/math].

Let's look at an example from the text on how to calculate with a GLM.

Predict Auto Claim Severity

Alice: Don't forget to practice! So here's another version for you to try...

Practice Predicting Auto Claim Severity
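If you want to see these mechanics in software rather than by hand, here is a minimal sketch (not the text's example) of fitting a severity GLM with a Gamma error distribution and a log link, assuming the Python statsmodels package and entirely synthetic data with hypothetical predictor names.

<pre>
# A minimal severity-model sketch (synthetic data): Gamma error distribution, log link.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
driver_age = rng.uniform(18, 80, n)
vehicle_value = rng.uniform(5, 50, n)             # in $000s, hypothetical

# Simulated severities whose mean is multiplicative in the predictors
true_mu = np.exp(8.0 - 0.01 * driver_age + 0.4 * np.log(vehicle_value))
severity = rng.gamma(shape=2.0, scale=true_mu / 2.0)

X = sm.add_constant(np.column_stack([driver_age, np.log(vehicle_value)]))
model = sm.GLM(severity, X, family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()
print(result.summary())                            # coefficients, std errors, p-values
print(result.predict(X[:5]))                       # fitted mu_i for the first records
</pre>

The summary output includes the coefficient estimates, standard errors, and p-values discussed in the next section.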

mini BattleQuiz 1

Variable Significance

Since a GLM uses observed data as its input for determining the coefficients of the predictors, a (small) change in the target variable values for the observed data (for the same inputs) will likely result in different estimates of the coefficients for the predictors. Two important questions are:

What is the true value of each coefficient?
Does each of the predictors have an impact on the outcome?

It is possible for the true coefficient to be zero for a predictor, i.e. the predictor has no impact on the target variable. Yet the GLM's estimated coefficient for that predictor may be non-zero because we lack perfect information about all risks. Three statistics which help us understand the predictor coefficients are the standard error, the p-value, and the confidence interval.

The standard error is the estimated standard deviation of the coefficient estimate. If we collected the same dataset a number of times over, but with slightly different target variable values each time (same predictor inputs), and built the GLM on each dataset, then the standard deviation of the resulting coefficient estimates would be approximately the standard error.

A small standard error implies the estimated coefficient is likely close to the true value of the coefficient. A large standard error says a wide range of estimates may occur due to randomness.

The standard error is related to the estimated value of the dispersion parameter, [math]\phi[/math]. A larger value of [math]\phi[/math] implies greater variance, i.e. more variation in the outcomes which is reflected in a larger standard error.

The p-value for the estimate of a coefficient is the probability of observing an estimate at least that large in magnitude purely by chance if the true coefficient were zero. A variable with a low p-value is called significant. Testing whether a predictor is significant is normally performed as a hypothesis test: assume the true value of the predictor coefficient is zero (the null hypothesis) and reject this assumption if the p-value is below a predetermined cut-off. Typically [math]p=0.05[/math] is used as a cut-off. However, it's important to note this means there's a 1-in-20 chance of rejecting the null hypothesis when it is true, i.e. we would include the predictor in the model even though, based on the data in the GLM, it has no effect. Since insurance models generally consider a lot more than 20 variables, there can be a good reason to use even smaller p-value cut-offs.

If a variable is not significant then it should be removed from the model. If a sub-category of a categorical variable is not significant, it should be grouped with the base level for the variable instead of removing it from the dataset.

The confidence interval is the range of coefficient values for the predictor which we would not reject; it is a reasonable range of estimates for the coefficient. The confidence interval is based on a p-value and is called the [math](1-p)\%[/math] confidence interval. For instance, if the p-value is 0.05 then we have a 95% confidence interval.

Question: The GLM software gives the coefficient as 0.48, the p-value as 0.00056, and a 95% confidence interval of [0.17, 0.79].

Can we reject the null hypothesis at the 5% significance level?

Solution:
Yes. The p-value is very small, much smaller than our significance threshold of 0.05, so we reject the null hypothesis and conclude this variable is significant.
The confidence interval tells us that although we observed a coefficient of 0.48, the true value is somewhere between 0.17 and 0.79, if we're willing to be wrong once every 20 times.
If the observed p-value was instead 0.3, which is greater than 0.05, then we would not reject the null hypothesis that the true value is 0 (despite observing 0.48), and we would remove this variable from the model or combine it with the base level if it is categorical.
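To see how the three statistics tie together, here is a small sketch under a normal approximation. The standard error below is a hypothetical value back-derived from the quoted confidence interval, so the p-value it implies is only roughly consistent with the software output in the question (GLM software may also use a t-distribution rather than the normal).

<pre>
# Sketch of the estimate / standard error / p-value / CI relationships
# (normal approximation; hypothetical standard error back-derived from the CI).
from scipy.stats import norm

estimate = 0.48
std_error = (0.79 - 0.17) / (2 * 1.96)         # half-width of the 95% CI divided by 1.96

z = estimate / std_error                        # test statistic under H0: beta = 0
p_value = 2 * (1 - norm.cdf(abs(z)))            # two-sided p-value
ci_95 = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(round(std_error, 3), round(p_value, 4), ci_95)
</pre>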

Types of Variables to Use in a GLM

Variables are either continuous (such as the amount of insurance in dollars) or categorical. A categorical variable can be numeric (such as the territory number assigned) or non-numeric (such as the type of protective device found in a dwelling).

When dealing with continuous variables, you can usually use them right away with minimal transformation. The important transformation is to match the scale of your continuous variable to the scale of your link function. So if you're using a log link function then you should take the log of your continuous variables before applying the GLM process.

Using logged continuous data with a log link function is said to be a power transform of the original variable. This is because if the linear predictor is [math]\beta_0+\beta_1\ln(x)[/math] and we have a log link, then [math]\ln(\mu)=\beta_0+\beta_1\ln(x)[/math] and taking exponents on both sides gives [math]\mu=e^{\beta_0}\cdot x^{\beta_1}[/math]. That is, the predicted coefficient is applied directly to the original variable, rather than wrapping it up in [math]e^{\beta_1x}[/math].

If the original predictor and the target variable have a linear relationship and we're using a log link with the predictor logged, then the fitted coefficient will be close to 1. For a logged original predictor, if the coefficient is larger than 1, then the target variable increases at an increasing rate as the original predictor grows. If the coefficient is between 0 and 1, then the target variable is related to the original predictor by a curve that increases at a progressively slower (decreasing) rate. This is often a good shape for insurance applications.

If there is a log link function and the continuous predictor is not logged, then the fitted relationship is an exponential growth curve whenever the GLM produces a positive coefficient. Sometimes this is appropriate, such as when wanting to pick up trend effects over time. However, in general it is much better to log continuous variables when using a log link function.

A limitation to be aware of when using the log link is that the log is undefined if your continuous predictor variable has zero or negative values. For instance, consider age of home. If new homes are coded as 0 then you cannot take the log. Two approaches to solve this are:

  1. Shift the curve by adding 1 (or whatever amount is necessary) to all values of the predictor variable to ensure it only has positive values.
  2. Create a categorical variable that flags the zero (or negative) values and replace those values with some other appropriate value. For instance, if working with age-of-home data that includes age 0 (i.e. no negative values to worry about), create an indicator variable that is 1 if age of home = 0 and 0 if age of home > 0, and replace all of the zeros in the original variable with 1's. The new indicator variable will reflect the difference in mean between a 1-year-old home and a brand-new home (see the sketch after this list).
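Here is a minimal pandas sketch of approach 2 (hypothetical column names), preparing an age-of-home variable that contains zeros for use with a log link.

<pre>
# Sketch of approach 2 for age of home = 0 under a log link:
# add an indicator for new homes and replace 0 with 1 before logging.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age_of_home": [0, 1, 5, 12, 0, 30]})

df["is_new_home"] = (df["age_of_home"] == 0).astype(int)   # 1 for brand-new homes, else 0
df["age_of_home_adj"] = df["age_of_home"].replace(0, 1)    # so the log is defined
df["log_age_of_home"] = np.log(df["age_of_home_adj"])      # logged continuous predictor

print(df)
</pre>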

Other potential limitations to be aware of are

  • Avoid logging variables such as year when trying to account for trend across multiple years,
  • Some "continuous" variables such as credit-based insurance score may not have a response which is best captured through applying the log transform first - consider modeling both ways.

An assumption when using a log link function with a logged continuous predictor in the model is that the logged predictor has a linear relationship with the logged mean of the outcome. This isn't always the case; for instance, auto premiums are typically high for young drivers, then decline, but pick back up once drivers reach their senior years. These situations require special attention.

mini BattleQuiz 2

How to Handle Categorical Variables

A categorical variable is a variable which takes one value out of at least two discrete values. The possible discrete values are called levels. The levels may be numeric (e.g. Territory numbers, 1 through 100) or non-numeric (e.g. vehicle type: Car/SUV/Truck), or a mix of both (e.g. Number of claims in past 3-years: 0, 1, 2, 3+). Categorical variables are translated by GLM software into a sequence of new indicator variables which together make up a design matrix.

Example:
Suppose we have a variable called Vehicle Type which has categories: Car, Van, Truck, & SUV. We first designate a base level, which is typically the category with the most exposures. The GLM will predict everything relative to the base level. We'll assume the Car category has the highest exposures in our data set, so we'll designate Car as the base level. The design matrix for our example has three columns, one each for Van, Truck, and SUV. Each record in the data set has these three columns appended, with a 1 if it's in that category and a 0 otherwise. So a Truck record would have "0, 1, 0" appended to it, a Van would have "1, 0, 0" appended to it, and a Car would have "0, 0, 0" appended to it as it's the base level. Note that we can't have say "1, 1, 0" because a vehicle cannot be in two categories at once. Also note the original column Vehicle Type containing "Car", "Van", "Truck", and "SUV" is removed from the data set because that information is completely determined by the three columns appended.
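Here is a small pandas sketch of building that design matrix (hypothetical data; the column order differs from the example above but the idea is identical).

<pre>
# Sketch of the design matrix for Vehicle Type with Car as the base level.
import pandas as pd

df = pd.DataFrame({"vehicle_type": ["Car", "Truck", "Van", "SUV", "Car"]})

# Build indicator columns, then drop the Car column explicitly to make Car the base level
design = pd.get_dummies(df["vehicle_type"], prefix="vt").drop(columns="vt_Car")
print(design.astype(int))
# A Truck row becomes (SUV=0, Truck=1, Van=0); a Car row is all zeros.
</pre>

In practice the GLM software usually builds the design matrix for you; the point is to see which 0/1 patterns correspond to which levels.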

Alice: "Now try the following problem..."

Create the Design Matrix & Response Vector

The summary statistics for the coefficients produced by the GLM in the case of categorical variables tell us the relative difference against the base level. For instance, if the Truck variable has a coefficient of 1.23 and a p-value of 0.0087 then this says (after undoing the log link) that trucks have a factor of [math]e^{1.23}=3.421[/math] and since cars have a factor of [math]e^0=1[/math], trucks are 242% higher than cars. The very low p-value indicates we are highly unlikely to observe this result if there is actually no difference between cars and trucks.

However, suppose the GLM output says the Van coefficient is -0.3 and the p-value is 0.3567. Then the factor for vans is [math]e^{-0.3}=0.741[/math] so the factor for vans is 25.9% lower than the factor for cars (1.000). The high p-value says we don't have enough evidence to conclude vans are different from cars. Since this is a categorical variable, this means we should include all van records in the base level. So our base level now becomes the set of Cars & Vans and we'd calculate a new design matrix and set of statistics before continuing on.

It is fairly common to present a graphical view of the GLM model output for categorical variables. The categories are on the x-axis with bars plotted to display the number of exposures in each category. The point estimates of each coefficient are plotted along with the [math](1-p)\%[/math] confidence interval (typically 95%). The wider the confidence interval, the less confidence there is in the point estimate. This data can be plotted either before or after undoing the log link function. When the [math](1-p)\%[/math] confidence interval, also known as the error bar, crosses the level indicated by the base level category then that estimate is not significant.

Choosing a base

You can choose any category within a variable to refer to as your base level once the analysis is complete. However, it's important to choose the base level carefully when you begin your analysis because although you'll get the same estimates of the coefficients, the significance statistics will likely change substantially. If we choose the base category as one which has few exposures then the error bars tend to be large for all levels, whereas if we choose the base category as one with a lot of exposures then the error bars tend to reflect the number of exposures in the other categories. This is because the significance statistics for a categorical predictor measure the confidence in each level being different from the base level. So it's important to have confidence in the mean response of the base level as well as the mean response of the other categories. If the base level has low exposures (thin data) then we can't be sure of its mean and consequently can't be sure of the relativity of any other category to it.

Data Set Weights

The way the input data for a GLM is aggregated can influence the results. In particular, one row of data might correspond to several exposures having the same risk characteristics, while the next row may have only one exposure for a different set of risk characteristics. GLMs include a weight variable to assign a weight to each row so that, logically, a row with more exposures can have more influence when estimating the coefficients. Let [math]\omega_i[/math] be the weight of the ith observation in the data set. It is positive for each observation and modifies the variance of record i to be [math]Var(y_i)=\frac{\phi\cdot V(\mu_i)}{\omega_i}[/math].
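As a minimal sketch of how weights are supplied in software (hypothetical variable names, assuming statsmodels), the var_weights argument divides the assumed variance of each record, matching the relation above.

<pre>
# Sketch: severity rows aggregated by claim count, weighted so that
# Var(y_i) = phi * V(mu_i) / w_i with w_i = claim_count.
import numpy as np
import statsmodels.api as sm

severity = np.array([1200.0, 800.0, 1500.0, 950.0])   # average loss per claim on each row
claim_count = np.array([10.0, 3.0, 1.0, 6.0])          # number of claims aggregated into each row
X = sm.add_constant(np.array([0.0, 1.0, 1.0, 0.0]))    # a single hypothetical predictor

model = sm.GLM(severity, X,
               family=sm.families.Gamma(link=sm.families.links.Log()),
               var_weights=claim_count).fit()
print(model.params)
</pre>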

Offsets

An offset is an adjustment to the mean of the target variable whereas changing the record weights is an adjustment to the variance of the target variable.

GLMs are good at certain things and not so good at others (such as estimating deductibles). Often, some variables in the GLM will have their factors derived via another method but we want to keep them in the model for their effect on the other predictors. We do this via an offset which is a predictor variable with the coefficient set equal to 1. That is, [math]g(\mu_i)=\beta_0+\beta_1x_{i,1}+\ldots+\beta_px_{i,p}+1*(\mbox{offset variable})[/math].

It is very important that the offset is on the same scale as the rest of the linear predictor. So if a log link function is used, the offset variable must be logged before going in the model.

Let's look at another example from the text on how to apply an offset.

Offset Deductibles in a GLM
Exposure Offsets

One example of when you might use an offset is if you're modelling claim counts. When modelling claim frequency you would use the number of exposures as the weights in the model. However, if you're modelling claim counts then you should offset for the exposures, because the more exposures there are, the more claims there will be (assuming no difference in claim likelihood). Assuming you're using a log link, to offset for the exposures you add the log of the exposures to the linear predictor for each record.

A claim count model which includes exposures as an offset is equivalent to a claim frequency model which includes exposures as the weights. Both models result in the same predictions, relativities and standard errors.
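Here is a sketch of that equivalence on synthetic data, assuming statsmodels: a Poisson claim-count model with log(exposure) as an offset versus a Poisson claim-frequency model weighted by exposure.

<pre>
# Sketch (synthetic data): count model with exposure offset vs. frequency model with exposure weights.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
exposure = rng.uniform(0.1, 2.0, n)
x = rng.normal(size=n)
counts = rng.poisson(exposure * np.exp(-2.0 + 0.5 * x))

X = sm.add_constant(x)

# Claim-count model: exposure enters the linear predictor as a logged offset
count_model = sm.GLM(counts, X, family=sm.families.Poisson(),
                     offset=np.log(exposure)).fit()

# Claim-frequency model: target is counts/exposure, exposure used as the weight
freq_model = sm.GLM(counts / exposure, X, family=sm.families.Poisson(),
                    var_weights=exposure).fit()

print(count_model.params, freq_model.params)   # coefficients agree (to convergence tolerance)
</pre>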

mini BattleQuiz 3

Distributions for Severity

The Gamma and inverse Gaussian distributions are often used to model claim severity.

The variance function for the Gamma distribution is [math]V(\mu)=\mu^2[/math], thus claims which have a higher mean will also have a higher variance, which is intuitive. Also, for two Gamma distributions with the same [math]\mu[/math] but different dispersion parameters, [math]\phi[/math], the lower the value of [math]\phi[/math], the lower the variance. The Gamma distribution has a lower bound at zero and is right skewed with a long tail.

Inverse Gaussian distributions also have a lower bound at zero and are right skewed. Inverse Gaussian distributions have a sharper peak and wider tail than Gamma distributions which make them better for modelling more extreme skewed severity distributions. Its variance function is [math]V(\mu)=\mu^3[/math].

Distributions for Frequency

The Poisson and negative binomial distributions are commonly used to measure claim frequency.

Although the Poisson distribution takes on only non-negative integer values, the implementation of a Poisson model within a GLM allows it to produce fractional values as well. This is good for measuring frequency in terms of number of claims per exposure, which may not be an integer.

Claim frequency distributions are often overdispersed. Overdispersion of a distribution is when the variance is greater than the mean.

Since a Poisson distribution has equal mean and variance, this is a limitation. Overdispersion occurs because there is variance due to the underlying Poisson process but also variance in the value of μ between policyholders. In other words, not all policyholders have the same likelihood, μ, of a claim occurring. The GLM attempts to account for differences in μ between risks but won't be perfect and this leads to overdispersion.

When modelling claim frequency with a Poisson distribution it is likely the variance will be underestimated. This distorts the standard error and p-values.

Instead, we could use an overdispersed Poisson distribution (ODP). Both the Poisson and overdispersed Poisson have variance function [math]V(\mu)=\mu[/math]. The Poisson has dispersion parameter [math]\phi=1[/math] so the variance equals the mean. The overdispersed Poisson has variance [math]\phi\mu[/math] for some [math]\phi\gt 0[/math].

Poisson and ODP distributions produce the same estimates of the model coefficients but the statistics (model diagnostics) are different.

Another way of dealing with overdispersion is to use a negative binomial distribution. Since it is the variation in the Poisson mean between risks that is a cause of overdispersion, we can address this by modelling the Poisson parameter, [math]\lambda[/math], with another distribution. It is common to use a Gamma distribution. The compounding of a Poisson and a Gamma distribution yields a negative binomial distribution. The variance function for a negative binomial distribution is [math]V(\mu)=\mu(1+\kappa\mu)[/math]. [math]\kappa[/math] is the overdispersion parameter. The dispersion parameter, [math]\phi[/math], for a negative binomial is 1.

Distributions for Pure Premiums

This is a technically challenging problem because most policies do not incur a loss, yet when a loss does occur, the distribution of losses is highly skewed. So any modelling distribution needs most of its mass at 0 with the rest of its mass right skewed.

The Tweedie distribution addresses this problem by introducing a power parameter, p, which can take any real number except those strictly between 0 and 1. Its variance function is [math]V(\mu)=\mu^p[/math].

Clearly, the normal, Poisson, Gamma, and inverse Gaussian distributions are special cases of the Tweedie distribution (p = 0, 1, 2, and 3, respectively). Further, since Gamma distributions have moderate skewness and inverse Gaussian distributions have extreme skewness, the Tweedie distribution gives some control over skewness by choosing a p value between 2 and 3. For p values strictly between 1 and 2, the Tweedie distribution is a compound of Poisson and Gamma distributions, which is good for modelling pure premiums or loss ratios.

For a Poisson distribution with mean [math]\lambda[/math] and a Gamma distribution with parameters α and θ, the mean of the Tweedie distribution is [math]\lambda\cdot\alpha\cdot\theta[/math], i.e. the product of the expected frequency and the expected severity.

The power parameter for the Tweedie distribution may be expressed as [math]p=\frac{\alpha+2}{\alpha+1}[/math]. Since the coefficient of variation of a Gamma distribution is [math]\frac{1}{\sqrt{\alpha}}[/math], the power is related to the variation of the severity distribution. Typical insurance applications use p values between 1.5 and 1.8.

As the p value gets close to 1, the total loss amount is driven by the frequency component, leading to discrete spikes as there is little variation in claim size. As the variation of the Gamma distribution gets larger, the p value moves away from 1 and the total loss distribution becomes much smoother (see Figure 5. in the text).

The dispersion parameter for a Tweedie distribution is [math]\phi=\frac{\lambda^{1-p}\cdot (\alpha\theta)^{2-p}}{2-p}[/math].
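These two formulas are easy to package into a small helper. The following sketch (with hypothetical parameter values) computes the power, mean, and dispersion of the Tweedie distribution from the Poisson mean and the Gamma parameters.

<pre>
# Helper for the Tweedie power and dispersion formulas, given Poisson mean lam
# and Gamma parameters alpha, theta (hypothetical values for illustration).
def tweedie_parameters(lam, alpha, theta):
    p = (alpha + 2) / (alpha + 1)                                  # power parameter
    mu = lam * alpha * theta                                       # mean = expected freq x expected sev
    phi = lam ** (1 - p) * (alpha * theta) ** (2 - p) / (2 - p)    # dispersion parameter
    return p, mu, phi

print(tweedie_parameters(lam=0.1, alpha=1.5, theta=2000))
</pre>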

When a Tweedie distribution with mean μ and variance [math]\phi\cdot V(\mu)=\phi\cdot\mu^p[/math] is used in a GLM, [math]\mu[/math] varies with each record in the data set while [math]\phi[/math] and p are assumed constant across all records.

Question: How can we determine the value of p for a Tweedie distribution?
Solution:
  • Use GLM software to estimate the value of p. This can be computationally intensive/time consuming for large data sets.
  • Consider a selection of values of p, such as 1.6, 1.65, 1.7, and then use statistical measures to determine which one worked best (see GLM.ModelRefinement).
  • Use actuarial judgement to select a reasonable value since small changes to the value of p usually have a minimal impact on the model estimates.

If we think of a Tweedie distribution as being the compound distribution of a Poisson frequency and Gamma severity distribution, the Tweedie distribution assumes frequency and severity move in the same direction. That is, an increase in the target variable consists of an increase in both frequency and severity. This assumption often fails in real life.

mini BattleQuiz 4

Logistic Regression

The target variable isn't always going to be a number from a continuous range of values. Often it is the answer to a question such as "is this claim potentially fraudulent or not?" This is an example of a dichotomous or binary variable. The target variable is assigned 0 if the event did not occur and 1 if it did. Answering this type of question is well suited to using a binomial distribution. The resulting mean of the binomial distribution is the probability the event occurs.

The linear predictor [math]\beta_0+\beta_1x_{i,1}+\ldots+\beta_px_{i,p}[/math] can take on any value in the range [math]\left(-\infty,\infty\right)[/math], and when modelling a continuous variable, μ is also in this range, so the link function isn't limited other than by practical considerations such as wanting a multiplicative rating structure. For a binary target variable, however, μ is between 0 and 1, so the link function [math]g(\mu)[/math] is restricted to functions which map [math](0,1)[/math] onto [math](-\infty,\infty)[/math].

It is common to use the logit function, [math]g(\mu)=\ln\left(\frac{\mu}{1-\mu}\right)[/math]. The inverse of the logit function is the logistic function, [math]g^{-1}(x)=\frac{1}{1+e^{-x}}[/math], which maps the linear predictor back to a probability. The ratio [math]\frac{\mu}{1-\mu}[/math] is known as the odds. By taking exponents, the GLM equation expresses the effect of the predictor variables on the odds via a product of multiplicative terms.
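A tiny sketch of the two functions and how they invert each other (the linear predictor value is hypothetical):

<pre>
# Sketch of the logit link and its inverse, the logistic function.
import math

def logit(mu):
    return math.log(mu / (1 - mu))      # maps (0, 1) to (-inf, inf)

def logistic(eta):
    return 1 / (1 + math.exp(-eta))     # maps the linear predictor back to (0, 1)

eta = 0.8                               # hypothetical linear predictor value
print(logistic(eta))                    # fitted probability
print(logit(logistic(eta)))             # recovers eta
</pre>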

Predictor Correlation, Multicollinearity, and Aliasing

A weakness of univariate analyses is two correlated variables may pick up the same signal separately. Reacting to each separate univariate analysis could lead to adverse selection due to double counting the effects. GLMs are designed to handle all predictor variables together in the presence of correlations between predictors and still give accurate estimates of the coefficients.

The key assumption is the correlation between predictor variables is moderate. Before beginning a GLM project, look at the correlations between pairs of predictor variables as this will improve your understanding of the GLM output.

Since a GLM is forced not to double count, when the correlation between two variables is high the GLM allocates the target variable impact between both predictors and the level of uncertainty associated with each coefficient rises. Consequently, coefficients may become very high or very low, the standard errors associated with the coefficients are large, and adding or removing records from the data set can cause large changes to the model. All in all, the model is unstable.

By examining for correlations prior to analyzing with a GLM, you can ensure you start with a good set of predictors. When predictors are highly correlated, there are a few approaches available:

  1. Remove all except one of the correlated predictors from the model.
    • This is simple to implement but may cause loss of information.
  2. Pre-process the data using dimensionality reduction techniques such as principal component analysis or factor analysis to create new variables from correlated groups of predictors. The new variables have limited correlation between them.

Multicollinearity is an extension of the idea of two variables being highly correlated. It occurs when two or more variables can be combined to closely reproduce the effect of another predictor variable. Since it involves more than a pair of variables (3 or more), it is hard to detect in a correlation matrix.

The variance inflation factor (VIF) statistic is produced by statistical packages to measure the increase in the squared standard error of a predictor due to the presence of correlations with the other predictors. It is produced by running a linear model for each of the predictors using all of the other predictors as inputs - it's computationally intensive!

A VIF of 10 or more is considered high. As always, try to understand the structure of your data before making any decisions about predictor variables.
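As a sketch of how VIFs are obtained in practice (synthetic, deliberately correlated predictors; assuming the statsmodels package):

<pre>
# Sketch of computing VIFs with statsmodels on synthetic, correlated predictors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # highly correlated with x1
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)                                   # x1 and x2 show large VIFs; x3 is near 1
</pre>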

Variables which are perfectly correlated are called aliased. In this situation, the GLM cannot produce a unique solution. Typically, the GLM software can detect this and remove one of the predictors. However, if the predictors are close to aliased then it may not detect it. This can cause the model to be unstable or not converge.

Multicollinearity is where [math]x_1+x_2\approx x_3[/math], whereas aliasing is when [math]x_1+x_2=x_3[/math].

mini BattleQuiz 5

Limitations of GLMs

  1. GLMs assign full credibility to the data. The GLM can indicate via a large standard error and high p-value that there may be insufficient data but the GLM doesn't correct for it. Extending GLMs to consider credibility is beyond the scope of Exam 8.
  2. GLMs assume the randomness of outcomes is uncorrelated. That is, the piece of the outcome (target variable) which is not explained by the predictor variables is assumed to be independent between records in the data set. Two examples of where this assumption may be violated are:
    • Many records in the data set may be renewals of the same policy and thus have correlated outcomes. For instance, if you're a bad driver in year 1 then you're probably still a bad driver in years 2, 3 and 4.
    • When modelling a peril that is prone to storms, policyholders in the same area likely have similar outcomes since losses may be driven by storms which impacted multiple insureds at once.

Large amounts of correlated outcomes will cause the GLM to place too much weight on those events, causing it to pick up too much random noise and produce poor estimates.

Full BattleQuiz