Fall 2018 Q5
Hi,
The examiner report states "For the hinge function, the candidate using an incorrect function form by including log transformation, for example, max (log(age) - 25, 0)", while the Battlewiki states "Unlike binning, the predictor variable is first logged if a log link function is used. That is, transform your variable before breaking into pieces."
May I know when we should transform and when we should not?
Thanks.
Comments
In general, if you are using a log link function then you should log your continuous variables. The examiner's comment about max(log(age) - 25, 0) arises because the age variable is not logged in this partial residual plot: the break point of 25 is on the unlogged age scale, so subtracting it from log(age) places the hinge point at an unhelpful location.
A log transformation isn't appropriate if you believe the response variable should follow a power law. See https://battleacts8.ca/8/wiki/index.php?title=Goldburd.Interactions#Interacting_a_Categorical_Variable_with_a_Continuous_Variable for a description of this.
A log-transformation also may not be appropriate for something like age, where we have a continuous variable that really only takes a finite number of values. We rarely distinguish between the rate for a 35-year-old and the rate for someone aged 35 years and two days.
Lastly, in this case, the residual plot does not suggest a power curve (something that looks like an exponential curve) so a log-transformation is highly unlikely to make the residual plot look random.
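To make the hinge point issue from the first paragraph concrete, here is a minimal sketch in Python. The sample ages and the helper name hinge are made up for illustration, and the knot of 25 is taken from the expression quoted in the examiner report. The third variant, with the knot also logged, is only my reading of what "transform your variable before breaking into pieces" would mean in a case where logging is appropriate.

```python
import math

def hinge(x, knot):
    """Hinge function max(x - knot, 0)."""
    return max(x - knot, 0.0)

# Hypothetical sample ages; 25 is the knot from the quoted expression.
ages = [18, 25, 30, 45, 60, 90]
KNOT = 25

# Knot and variable both on the unlogged age scale.
unlogged = [hinge(a, KNOT) for a in ages]

# The form the examiner report flags: logged variable, unlogged knot.
mixed = [hinge(math.log(a), KNOT) for a in ages]

# Assumption: if age were logged, the knot would be logged as well.
logged = [hinge(math.log(a), math.log(KNOT)) for a in ages]

# Age at which log(age) would first exceed 25.
age_needed = math.exp(KNOT)

print(unlogged)
print(mixed)
print(logged)
print(age_needed)
```

With the mixed form the hinge term is zero for every realistic age, since log(age) only reaches 25 at an age of roughly 7 x 10^10, so the term adds nothing to the model.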
Sorry I read the "Interacting a Categorical Variable with a Continuous Variable" section but I don't think it mentioned power law - do you mind pointing me to the section that describes this?
Paragraphs (2) and (4) sound contradictory to me - if the residual plot suggests a power curve (i.e. if we believe the response variable should follow a power law), is a log-transformation appropriate and likely to make the residual plot look random?
It doesn't call out the term "power law" explicitly. It's assumed that AOI^k, where k is a constant and AOI is the variable of interest, is recognized as an example of a power law.
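As a rough sketch of where a factor like AOI^k comes from: under a log link, a term k * ln(AOI) in the linear predictor gives a mean of exp(b0 + k * ln(AOI)) = exp(b0) * AOI^k. The coefficient values and function names below are hypothetical and chosen only to check the identity numerically.

```python
import math

def log_link_mean(aoi, b0, k):
    """Mean under a log link with linear predictor b0 + k * ln(AOI)."""
    return math.exp(b0 + k * math.log(aoi))

def power_law(aoi, b0, k):
    """The same quantity written as a power law in AOI."""
    return math.exp(b0) * aoi ** k

# Hypothetical intercept and exponent, for illustration only.
b0, k = -1.0, 0.6

for aoi in (100.0, 250.0, 500.0):
    assert math.isclose(log_link_mean(aoi, b0, k), power_law(aoi, b0, k))

# Doubling AOI multiplies the mean by 2^k, whatever the starting value.
ratio = log_link_mean(200.0, b0, k) / log_link_mean(100.0, b0, k)
print(ratio, 2 ** k)
```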
If the residual plot showed something looking like exponential growth in the residuals then that's a clear indication to log the predictor. It may be helpful to read the following again: https://battleacts8.ca/8/wiki/index.php?title=Goldburd.Basics#Types_of_Variables_to_Use_in_a_GLM
The partial residual plot here doesn't immediately imply a power curve as the end-scale behavior isn't obviously going to either 0 or infinity and we have a large flat section. To replicate this using a linear combination of power law variables (x^k + x^m + ... + x^2 + x, with separate coefficients in front of each) would likely require a huge number of terms and violate the parsimony requirement for our model.