Allstate Claims Severity Solution

This post walks through my solution to the Allstate Claims Severity competition on Kaggle.


The Allstate Corporation is an American insurance company that has been headquartered in Northfield Township, Illinois, near Northbrook, since 1967. Founded in 1931 as part of Sears, Roebuck and Co., it was spun off in 1993. The company also has personal lines insurance operations in Canada.

“Good Hands”

Allstate’s slogan “You’re in good hands” was created in the 1950s by Allstate Insurance Company’s sales executive, Davis W. Ellis. When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect.

As part of its commitment to a better claims experience, Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims.

The post follows the steps of my approach:

  1. Understanding Business Problem
  2. Business Objective and Constraints
  3. Data
  4. Evaluation Metric
  5. EDA
  6. Data Preparation
  7. Modeling
  8. Future Work
  9. Reference

1. Understanding Business Problem:

A good rule for dealing with customers is to put yourself in their shoes. Every company wants to keep its customers. A happy customer isn’t just someone who makes a purchase today; a truly happy customer stays loyal to the business for a long time, and loyalty and happiness tend to spread. As the number of unhappy customers grows, the company's growth slows. A company can address this with loss prediction. Our task is to predict how severe a claim will be for a new household, that is, to predict the future loss from the given features.

2. Business Objectives and Constraints:

  • There is no strict low latency requirement.
  • Minimize the error of the predicted claim loss.

3. Data:

Allstate has provided anonymized data (the feature names have been changed), which makes the features really hard to interpret. The dataset contains claim records of Allstate clients. Two data sets are available on the competition website on Kaggle.

  • train.csv: 188,318 rows and 132 columns, made up of id (unique id), 130 features and loss (the target variable).
  • test.csv: 125,546 rows and 131 columns, made up of id (unique id) and the same 130 features. You must predict the loss value for the ids in this file.

The anonymity of the data makes it difficult to get an intuitive feeling for what the results or models are telling us. It is, however, understandable that Allstate takes such a measure to protect confidentiality. Because the target variable, loss, is numeric, we can pose this as a regression problem. In my opinion, solving it well will improve overall customer service, which benefits the claimant as well as the insurance company.

4. Evaluation Metric:

The goal of this project is to predict the cost of insurance claims in order to improve claim severity analysis and provide better targeted assistance to customers. The prediction is a real-valued amount in currency units, so I wanted an error metric expressed in the same units as the loss rather than a unitless goodness-of-fit score such as R-squared or adjusted R-squared.

Metric to be used — Mean Absolute Error (MAE)

The mean absolute error measures the average magnitude of the errors on a set of forecasts, without considering the direction.

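$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y^{(i)} - \hat{y}^{(i)} \right|$$

where n is the number of data points, y(i) are the observed values and y^(i) are the predicted values.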

MAE is the average, over the verification sample, of the absolute values of the differences between each forecast and the corresponding observation. It is a linear score, which means that all the individual differences are weighted equally in the average.
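As a small illustration of the definition above, here is a minimal NumPy sketch of MAE. The function name and the three toy values are made up for the example and are not taken from the competition data.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted| over all data points."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(np.abs(y_true - y_pred)))


# Toy values (hypothetical): the errors are 200, 500 and 0.
observed = [1000.0, 2500.0, 400.0]
predicted = [1200.0, 2000.0, 400.0]
example_mae = mean_absolute_error(observed, predicted)
print(example_mae)
```

An over-prediction of 200 and an under-prediction of 500 both count with their full magnitude, which is what "without considering the direction" means.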

5. EDA:

5.1 Dataset overview:

The data contains only features named cat (categorical) and cont (continuous), with no indication of what they actually represent. Because the features are anonymous, EDA becomes a very important stage for gaining insight into the data.

There are 130 distinct features in total, excluding the target and the id. Of these, 116 are categorical and 14 are numerical. There are no missing values in the train dataset, which shows that Allstate provided clean, preprocessed data.

5.2 Missing values:

Let’s check for missing values.

As there are no missing values in the train and test data, no imputation is needed.
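A check like this can be sketched with pandas as follows. The helper name and the toy frame are hypothetical; on the real data you would pass the frames loaded from train.csv and test.csv, where the report is expected to come back empty.

```python
import numpy as np
import pandas as pd


def missing_value_report(df):
    """Return the number of missing values per column, only for columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Toy frame with deliberately missing entries (hypothetical values).
toy = pd.DataFrame(
    {
        "cat1": ["A", "B", None],
        "cont1": [0.1, np.nan, 0.3],
        "loss": [100.0, 250.0, 80.0],
    }
)
report = missing_value_report(toy)
print(report)
print("any missing:", bool(toy.isnull().values.any()))
```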

5.3 Plots:

Histogram of continuous features

There are many spikes in the histograms of the continuous features, which shows that they are not normally distributed. We can try transforming these features to make their distributions more Gaussian, but this might not improve the model’s performance.

Box-Plot of continuous features

There are a few outliers in the cont7, cont9 and cont10 features. In this context outliers are important to the analysis, because they may correspond to genuinely high claims, so we are not going to remove them.

Correlation matrix of continuous features

  1. cont11 and cont12 show an almost linear relationship, so one of them should be removed.
  2. cont1 and cont9 are highly correlated, so either of them could be safely removed.
  3. cont6 and cont10 are strongly correlated too.

We see high correlation between the features listed above. This is data-based multicollinearity, where two or more predictors are highly correlated. It makes coefficient estimates unstable and hard to interpret, so we should be very careful when fitting linear regression models on this dataset.
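Pairs like the ones listed above can also be pulled out of the correlation matrix programmatically instead of being read off a heatmap. The sketch below is one way to do it; the 0.8 threshold, the helper name and the synthetic columns x1, x2, x3 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            value = float(corr.iloc[i, j])
            if value >= threshold:
                pairs.append((cols[i], cols[j], value))
    return sorted(pairs, key=lambda p: -p[2])


# Synthetic data: x2 is x1 plus a little noise, x3 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "x1": base,
        "x2": base + rng.normal(scale=0.1, size=500),
        "x3": rng.normal(size=500),
    }
)
pairs = highly_correlated_pairs(toy, threshold=0.8)
print(pairs)
```

On the real data you would pass only the cont columns of the train frame.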

Calculating the skewness on continuous features

cont2 and cont3 are left skewed (cont3 only slightly, with skewness close to 0). cont9, with skewness above 1.0, is more skewed than the other continuous variables. The target variable is highly skewed. The Box-Cox transformation reshapes the data so that it more closely resembles a normal distribution, which matters because many statistical techniques assume normally distributed errors.
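To make the effect of Box-Cox concrete, here is a minimal sketch using SciPy on a synthetic right-skewed feature. The synthetic data stands in for a skewed cont column and is an assumption; note that scipy.stats.boxcox requires strictly positive input, so a column containing zeros would need a small shift first.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed feature (not competition data).
rng = np.random.default_rng(42)
feature = rng.lognormal(mean=0.0, sigma=0.8, size=5000)

skew_before = float(stats.skew(feature))

# With no lambda given, boxcox fits it by maximum likelihood
# and returns (transformed data, fitted lambda).
transformed, fitted_lambda = stats.boxcox(feature)
skew_after = float(stats.skew(transformed))

print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
print(f"fitted lambda: {fitted_lambda:.3f}")
```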

Analyzing Target variable

There are several distinctive peaks in the loss values representing severe accidents (high claims). This distribution makes the target very skewed and can result in suboptimal performance of the regressor. The skewness results show that loss is right skewed (3.7). We can apply a log transformation to the loss variable and plot the result to visualize the transformed loss.

From the above figure, we can confirm that after the log transform the data looks approximately normally distributed.
Because the model is trained on the log of the target, we have to apply the exponential function to its predictions on the test data to get back the loss (e**y).
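The transform and its inverse can be sketched as below, using log1p and expm1 as the matching pair. The synthetic loss series is an assumption that only mimics a right-skewed target; it is not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "loss" values (hypothetical, for illustration only).
rng = np.random.default_rng(1)
loss = pd.Series(rng.lognormal(mean=7.7, sigma=0.8, size=10000))

log_target = np.log1p(loss)       # what the model would be trained on
recovered = np.expm1(log_target)  # inverse transform, applied to predictions

print("skew of loss:      ", round(float(loss.skew()), 2))
print("skew of log target:", round(float(log_target.skew()), 2))
```

Whatever transform is applied to the target during training, its exact inverse has to be applied to the predictions before computing MAE or writing the submission file.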

Categorical Feature Analysis

There are 116 categorical features, so please check out my GitHub for the full set of plots. Here I analyze the 10 most important categorical features.

Top 10 important categorical features ( Random Forest )
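A ranking like this can be produced from the impurity-based importances of a random forest. The sketch below assumes scikit-learn's RandomForestRegressor on integer-encoded categories; the column names, the toy target and the model settings are hypothetical and not the configuration used for the plot.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rank_features(X, y, n_estimators=50, random_state=0):
    """Fit a random forest and return feature importances, largest first."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)


# Toy data: the target depends on cat_signal only, cat_noise is irrelevant.
rng = np.random.default_rng(0)
n = 400
cats = pd.DataFrame(
    {
        "cat_signal": rng.choice(["A", "B", "C", "D"], size=n),
        "cat_noise": rng.choice(["A", "B"], size=n),
    }
)
X = cats.apply(lambda col: pd.factorize(col)[0])  # integer-encode each column
level_effect = {"A": 500.0, "B": 3000.0, "C": 1500.0, "D": 6000.0}
y = cats["cat_signal"].map(level_effect) + rng.normal(scale=100.0, size=n)

ranking = rank_features(X, y)
top_10 = ranking.head(10)
print(top_10)
```

Using ranking.tail(10) in the same way gives the least important features.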


  1. For Cat79, Cat57, Cat87, Cat81, Cat12, Cat89 and Cat7, the data is spread around the median.
  2. For Cat80 the median is equal to the 75th percentile, so more of the data falls below the median.
  3. For Cat101 we can observe that most of the data falls above the median.
  4. Cat106 has a cleaner spread than the other features in terms of its minimum, maximum and IQR values.
  5. From the box plot we can observe that most of the claims are limited to small values. A scatter plot shows the same thing.
  6. There are a few high claims for the Cat101 and Cat106 features.


From the top 10 important features:

  • Cat80, Cat79, Cat87 and Cat81 have 4 classes {A, B, C, D}.
  • Cat57, Cat12 and Cat7 have 2 classes {A, B}.
  • Cat89 has 8 classes.
  • Cat106 has 17 classes.
  • Cat101 has 19 classes.

1. From the cat plot / scatter plot it is clear that, in almost all categorical classes, claims are under 40,000.

2. In the four-class features Cat80, Cat79, Cat87 and Cat81, class A has far fewer records than B, C and D, so its contribution is minimal. B and D contribute more than C and are the most important classes.

3. For Cat80 and Cat79, classes A and C contribute about the same loss. Class B in Cat80 has nearly the same loss as class D in Cat79, and class D in Cat80 has nearly the same loss as class B in Cat79.

4. In the two-class features Cat57, Cat12 and Cat7, the counts of A and B are almost equal.

5. In Cat89, which has 8 classes, classes E, H, I and G are negligible compared with A, B, C and D. More than 90% of the loss is contributed by classes A, B, C and D. C and D are almost equal, while A is slightly higher than B.

6. In Cat106, which has 17 classes, all classes matter for the loss except B and P, whose counts are too small compared with the others. For most classes the loss is below 20,000. Classes G, I, H, F, E, D and C are very important for the loss.

7. The last feature in the top 10 is Cat101, with 19 classes. Classes G, F, O, D, J, A, C, Q, M, I, L, R and S contribute about equally to the loss (each above 20,000), while classes E, N, H, B, U and K contribute very little and can be neglected.

10 least important categorical features

Least 10 important features ( Random Forest )


From the box plot above, it is clear that all values in the 10 least important features are concentrated around the median, to the point that the boxes are barely visible.

From the scatter / cat plot above, it is clear that categorical features with two classes tend to be less important, especially when class A dominates the counts and B is negligible by comparison.

Are the train and test datasets identical?

Classes missing in Train and Test

From the above table we can observe that there are a few classes which are present in the train data but missing in the test data, and vice versa.
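A comparison like the one in the table can be sketched with set differences per categorical column. The helper name, the output column names and the toy frames are hypothetical.

```python
import pandas as pd


def unseen_classes(train, test, cat_cols):
    """For each categorical column, list classes that appear in only one of the two frames."""
    rows = []
    for col in cat_cols:
        train_set = set(train[col].dropna().unique())
        test_set = set(test[col].dropna().unique())
        only_train = sorted(train_set - test_set)
        only_test = sorted(test_set - train_set)
        if only_train or only_test:
            rows.append({"feature": col, "only_in_train": only_train, "only_in_test": only_test})
    return pd.DataFrame(rows, columns=["feature", "only_in_train", "only_in_test"])


# Toy frames: cat1 differs between train and test, cat2 does not.
train = pd.DataFrame({"cat1": ["A", "B", "C"], "cat2": ["A", "B", "A"]})
test = pd.DataFrame({"cat1": ["A", "B", "D"], "cat2": ["B", "A", "A"]})
diff = unseen_classes(train, test, ["cat1", "cat2"])
print(diff)
```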

6. Data Preparation

The dataset has no missing values, and outliers are important here because they might be high-value claims, so we can’t remove them. I applied a skewness correction to the continuous variables and replaced the categorical classes that are missing from either train or test with NumPy null values (NaN). pandas factorize then replaces each categorical value with an integer, and NaN with -1.
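The categorical part of this step can be sketched as follows for a single column. Factorizing train and test together, so that both share one integer mapping, is an assumption of this sketch, as are the helper name and the toy values.

```python
import numpy as np
import pandas as pd


def encode_shared_classes(train_col, test_col):
    """Replace classes not present in both columns with NaN, then integer-encode.

    pd.factorize assigns -1 to NaN, so every non-shared class ends up as -1.
    """
    shared = set(train_col.unique()) & set(test_col.unique())
    train_clean = train_col.where(train_col.isin(shared), np.nan)
    test_clean = test_col.where(test_col.isin(shared), np.nan)
    # Assumption: factorize on the concatenation so train and test share one mapping.
    combined = pd.concat([train_clean, test_clean], ignore_index=True)
    codes, classes = pd.factorize(combined)
    n_train = len(train_col)
    return codes[:n_train], codes[n_train:], list(classes)


# Toy column: "C" appears only in train and "D" only in test.
train_col = pd.Series(["A", "B", "C", "A"])
test_col = pd.Series(["B", "D", "A"])
train_codes, test_codes, classes = encode_shared_classes(train_col, test_col)
print(train_codes, test_codes, classes)
```

Tree-based models can then treat -1 as one more category value, so a class seen in only one of the two files does not get its own code.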

Initially I applied a log transform (log1p()), and it worked well enough, but later I switched to other transformations such as log(loss + 200), which worked somewhat better.
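The shifted variant and its inverse can be written as a pair of helpers, sketched below. The function names and the toy loss values are hypothetical; only the offset of 200 comes from the text.

```python
import numpy as np

SHIFT = 200.0  # the offset mentioned in the text


def to_model_space(loss, shift=SHIFT):
    """Target transform applied before training: log(loss + shift)."""
    return np.log(np.asarray(loss, dtype=float) + shift)


def to_loss_space(pred, shift=SHIFT):
    """Inverse transform applied to predictions: exp(pred) - shift."""
    return np.exp(np.asarray(pred, dtype=float)) - shift


# Toy loss values (hypothetical) spanning small and very large claims.
loss = np.array([0.67, 1204.0, 3095.0, 121012.25])
encoded = to_model_space(loss)
decoded = to_loss_space(encoded)
print(encoded)
print(decoded)
```

One plausible reason the shift helps, stated here as an assumption rather than a measured result, is that it compresses the gap between very small losses, so the model spends less effort separating claims that differ by only a few units.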

7. Modeling


XGBRegressor

I used XGBRegressor with 7 parameters, 2 of which (learning rate and number of estimators) were tuned with GridSearchCV cross-validation.

With the tuned hyperparameters I achieved a Mean Absolute Error of 1150.

Gradient Boosting

I used GradientBoostingRegressor (GBR) with 5 parameters, all of which were tuned with GridSearchCV cross-validation.

With the GBR model I achieved a Mean Absolute Error of 1147.86.
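For readers who have not used it, a grid search scored by MAE can be sketched as below. The grid values, the synthetic data and the choice of only two parameters are assumptions made to keep the example fast; they are not the settings behind the scores reported here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (not the competition data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical grid; a real search would cover more values and parameters.
param_grid = {"learning_rate": [0.05, 0.1], "n_estimators": [50, 100]}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
    cv=3,
)
search.fit(X, y)

best_mae = -search.best_score_
print(search.best_params_, round(best_mae, 4))
```

The same pattern works for any estimator that follows the scikit-learn interface, and RandomizedSearchCV takes the same arguments with param_distributions and n_iter in place of param_grid.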


LGBMRegressor

I used LGBMRegressor with 8 parameters, 6 of which were tuned with RandomizedSearchCV cross-validation.

With this model I got a Mean Absolute Error of 1138.95 locally, and the submission file scored 1122.

I built the three models above by applying a label encoder to the categorical features and a skewness correction to the continuous features.

Final Model

For the final model I replaced the classes that are available in train but not in test, and vice versa, with NumPy null values and factorized the result with pandas.

I trained the model through the native xgboost interface, using DMatrix, a watchlist, a param dictionary and train. With this setup I found hyperparameter tuning hard, because I could only tune two parameters at a time.

I trained the model on DMatrix inputs using k-fold cross-validation.

DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct DMatrix from multiple different sources of data.

This gave a Mean Absolute Error of 1136 locally, and the submission file scored 1115.11.

This score falls at 261st position out of 3045 entries, which puts it in the top 10%.


8. Future Work

There is a good amount of possible future work on this project. The most obvious ideas are:

  1. Stacking may work very well.
  2. Instead of 3-fold, we can try 10-fold cross-validation.
  3. Categorical embeddings with a neural network and different embedding layer sizes.
  4. TF-IDF on categorical features.

9. References:


Please go through my GitHub profile to have a glance at code.

Click here to connect with me in LinkedIn.

A Machine learning Enthusiast!
