Degrees of Freedom Consumed by Polynomial Terms - F2018 Q6
A 2nd order polynomial is added to the model in this problem. The answer indicates that this will require 2 new parameters be fit. Is the answer key assuming that the 2nd order polynomial being fit is of the form ax^2 + bx? I.e. 2 new parameters a and b, one linear and one quadratic?
My interpretation was that adding a single 2nd order polynomial would just mean adding beta_k*(driver_age)^2 to the model, which I would have suspected added just 1 new parameter. Is this incorrect? Maybe it's just another classic case of state your assumptions, and you'll be fine, but I want to be sure I'm not blind to the actual the dummy variable implementation running behind the scenes in the GLM software. I haven't yet spotted any explicit mention of this in the source text.
Thanks.
Comments
An nth degree polynomial is of the form a_0 + a_1*x + a_2*x^2 + ... + a_n*x^n. So a second degree polynomial requires 3 coefficients. However, we only gain two additional degrees of freedom because we already have an intercept term in the GLM. The intercept still has to be re-estimated but we already had a parameter for it.
A potential way to add a second degree polynomial to the GLM yet only add one degree of freedom could be something like:
"A log-link GLM is built using territory and driver age, where driver age is treated as a continuous variable. The modeler wants to improve the fit by adding a second order polynomial for driver age to the model. How many degrees of freedom are added?"
It's implicit here the model looks like g(u) = b_0 + b_1*ln(age) + b_2TerrA + b_3TerrB + ...
Adding a general second order polynomial would mean including c_0 + c_1*ln(Age) + c_2*(ln(age))^2
But you can do the algebra to combine terms that are already in the model, so you notice you're only adding the c_1 coefficient. That is, only gaining one degree of freedom.
Ok. I have always been taught that the degree of the polynomial is the value of the highest power, with no dependence on the number of non-zero monomials behind it: 2x^3 is still a 3rd degree polynomial; to save degrees of freedom we wouldn't blindly keep the other terms if their significance does not warrant it. I'll run with the logic presented, and state assumptions as needed.
And yes, we should normally assume the continuous variables will match the form of the link function, unless otherwise stated; though logging the continuous Insurance Score was not an accepted assumption in this problem (maybe because the variable name was not logInsuranceScore?). Sometimes not matching the scale of the link function yields better fit to some data structures, as noted in the source.
Thanks again.
Splitting hairs here. You're correct that the order (or degree) of a polynomial refers to its highest power. However, by referring to a polynomial we're implicitly allowing all lower order terms to be included else we would refer to the monomial of that degree.
I think the general approach implied by the text is you start with the most general polynomial of the required degree and would then look at the associated model p-values to determine which terms you actually want to keep. Only the new non-zero parameters would be the correct number of added parameters when it comes to the F-statistic.
You're right that the text calls out Credit Based Insurance Score as an example of including a continuous variable without necessarily matching the scale of the link function. As you say, state your assumptions clearly and hope for the best!