今天用R glm做逻辑回归分析时,遇到模型系数为NA,效果如下

Call:
glm(formula = is_lost ~ ., family = "binomial", data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4928  -1.2366   0.8919   0.9078   6.0179  

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        0.716540   0.011068  64.743  < 2e-16 ***
gold_num          -0.266506   0.052782  -5.049 4.44e-07 ***
ladder_single_num -0.054578   0.003099 -17.613  < 2e-16 ***
ladder_double_num -0.052016   0.009559  -5.442 5.28e-08 ***
room_num          -0.184745   0.008601 -21.480  < 2e-16 ***
robot_num         -0.073728   0.003479 -21.193  < 2e-16 ***
normal_num               NA         NA      NA       NA    
adventure_num     -0.043090   0.001246 -34.571  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 71820  on 52243  degrees of freedom
Residual deviance: 63239  on 52237  degrees of freedom
AIC: 63253

Number of Fisher Scoring iterations: 6

问题的原因是列线性依赖,列线性依赖,列线性依赖(重要的事情说三遍)。导致有无数种系数组合。R控制台会有警告信息,如下

 
> lr <- glm(is_lost~., data = data, family='binomial')
Warning message:
glm.fit:拟合機率算出来是数值零或一 

后来查ETL SQL,发现有个低级错误,导致两个列的数据一样,:-(。Cross Validate中也有同样的问题和解答,供参考。