Insurance pricing game
Insurance pricing game: EDA
Hello everyone! This notebook is aimed mainly at new competitors. It should save you time when analyzing the data and give you a more detailed overview. I hope it is helpful. If it proves useful, I will keep adding to it.
Please like it if you find it helpful.
This notebook will examine the competition data in more detail! Ideal for new members.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
df = pd.read_csv('/kaggle/input/motor-insurance-market/training (1).csv')
df.head()
Understanding data¶
The training set presents policy data for the last 4 years for each user. Let's check it out.
df.id_policy.value_counts()
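To make the "4 rows per policy" check easier to read, the counts can be summarized as "how many policies have k rows". This is a minimal sketch on made-up toy data; the helper name `rows_per_policy` and the `year` column are assumptions for illustration and are not taken from the competition files.

```python
import pandas as pd


def rows_per_policy(df: pd.DataFrame, key: str = "id_policy") -> pd.Series:
    """Return how many policies have 1, 2, 3, ... rows."""
    return df.groupby(key).size().value_counts().sort_index()


# Toy data standing in for the real training file (values are made up).
toy = pd.DataFrame({
    "id_policy": ["A"] * 4 + ["B"] * 4 + ["C"] * 3,
    "year": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3],  # hypothetical column
})

summary = rows_per_policy(toy)
print(summary)
```

If every policy really has four rows in the training set, the summary for `df` should contain the single index value 4.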
Let's get acquainted with ordinal, categorical and binary features:¶
Ordinal:
- pol_coverage: Min, Med1, Med2, Max, in this order.
Categorical:
- pol_pay_freq: the price of the insurance coverage can be paid annually, biannually, quarterly or monthly.
- pol_usage: WorkPrivate, Retired, Professional, AllTrips.
- drv_sex1: H or F.
- drv_sex2: H or F.
- vh_make_model: hashes representing car models, such as Honda Civic or Fiat Uno.
- vh_fuel: Diesel, Gasoline or Hybrid.
- vh_type: Tourism or Commercial.
Boolean:
- pol_payd: Pay As You Drive. Indicates whether a client subscribed to a mileage-based policy or not.
- drv_drv2: Indicates the presence of a secondary driver in the contract.
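The feature types listed above suggest different encodings. The following is one possible sketch, not a recommended pipeline: the ordinal `pol_coverage` is mapped to integer codes in the stated order, and a nominal feature such as `vh_fuel` is one-hot encoded. The helper `encode_coverage` and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Order taken from the feature description: Min < Med1 < Med2 < Max.
COVERAGE_ORDER = ["Min", "Med1", "Med2", "Max"]


def encode_coverage(s: pd.Series) -> pd.Series:
    """Map coverage levels to 0..3; unknown or missing levels become -1."""
    cat = pd.Categorical(s, categories=COVERAGE_ORDER, ordered=True)
    return pd.Series(cat.codes, index=s.index, name=s.name)


# Toy rows (made up) using the category names from the list above.
toy = pd.DataFrame({
    "pol_coverage": ["Max", "Min", "Med2", "Med1"],
    "vh_fuel": ["Diesel", "Gasoline", "Hybrid", "Diesel"],
})

toy["pol_coverage_code"] = encode_coverage(toy["pol_coverage"])
fuel_dummies = pd.get_dummies(toy["vh_fuel"], prefix="vh_fuel")
print(toy)
print(fuel_dummies)
```

Integer codes preserve the ordering for tree-based models, while one-hot columns avoid implying an order among fuel types.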
Missing data¶
df.info()
Note that a lot of the data associated with the second driver is missing. This is most likely because those policies have no second driver. These missing values may need to be encoded as a separate category.
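One way to act on this, sketched under assumptions: measure the share of missing values per column, then give the missing second-driver gender its own label. The label `"NoDriver"` and both helper names are hypothetical, and the toy frame only mimics the structure of the real data.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def fill_second_driver_sex(df: pd.DataFrame, label: str = "NoDriver") -> pd.DataFrame:
    """Return a copy where missing drv_sex2 values get their own category."""
    out = df.copy()
    out["drv_sex2"] = out["drv_sex2"].fillna(label)
    return out


# Toy data (made up): two policies without a second driver.
toy = pd.DataFrame({
    "drv_sex1": ["H", "F", "H", "F"],
    "drv_sex2": ["F", np.nan, np.nan, "H"],
})

shares = missing_share(toy)
filled = fill_second_driver_sex(toy)
print(shares)
print(filled)
```

Before relying on this, it is worth confirming on the real data that the missing rows coincide with `drv_drv2` indicating no second driver.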
Data analysis¶
Correlations¶
Unfortunately, very few features show a noticeable linear correlation with the target variable. Getting good predictions won't be easy.
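# numeric_only=True skips string columns, which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True))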
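A heatmap can be hard to read for a single target, so a ranked list of correlations with `claim_amount` may help. This is a small sketch on made-up data; `target_correlations`, `feature_a` and `feature_b` are hypothetical names. It uses Pearson correlation, which only captures linear relationships.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str = "claim_amount") -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data (made up): feature_a is an exact linear function of the target.
toy = pd.DataFrame({
    "claim_amount": [0, 0, 100, 200, 0, 300],
    "feature_a": [1, 1, 201, 401, 1, 601],
    "feature_b": [1, 2, 3, 4, 5, 6],
    "vh_fuel": ["Diesel", "Gasoline", "Diesel", "Hybrid", "Diesel", "Gasoline"],
})

ranked = target_correlations(toy)
print(ranked)
```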
Target¶
Let's see what the distribution of the target variable looks like. As expected, there are many zeros, meaning that no claim was made for that policy in that year. There is also a long right tail.
# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(df['claim_amount'])
sns.histplot(df[(df['claim_amount'] > 0) & (df['claim_amount'] < 10000)]['claim_amount'])
sns.boxplot(x=df['claim_amount'])
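Because of the long right tail, the positive claims may be easier to inspect on a log scale. The sketch below is one option among several, shown on made-up values; `log_positive_claims` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def log_positive_claims(claims: pd.Series) -> pd.Series:
    """Keep strictly positive claims and return log(1 + amount)."""
    positive = claims[claims > 0]
    return np.log1p(positive)


# Toy claims (made up): mostly zeros, one small and one very large claim.
toy_claims = pd.Series([0.0] * 8 + [100.0, 10000.0])

log_claims = log_positive_claims(toy_claims)
print(log_claims)
```

On the real data, the transformed series could be passed to the same histogram call used above.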
Now let's look at the ratio of rows with a claim to rows without one.
df['Damage'] = np.where(df['claim_amount'], 1, 0)
sns.countplot(x=df['Damage'])
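The count plot shows the imbalance visually; the exact shares can also be printed. A minimal sketch on made-up data, building the same kind of `Damage` flag:

```python
import pandas as pd

# Toy data (made up): one claim in five rows.
toy = pd.DataFrame({"claim_amount": [0.0, 0.0, 0.0, 250.0, 0.0]})
toy["Damage"] = (toy["claim_amount"] > 0).astype(int)

# normalize=True turns counts into fractions of all rows
shares = toy["Damage"].value_counts(normalize=True)
print(shares)
```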
Categorical features¶
# recent seaborn versions require the column to be passed as x=
sns.countplot(x='pol_coverage', data=df, hue='Damage')
sns.countplot(x='pol_pay_freq', data=df, hue='Damage')
sns.countplot(x='pol_usage', data=df, hue='Damage')
sns.countplot(x='drv_sex1', data=df, hue='Damage')
sns.countplot(x='vh_fuel', data=df, hue='Damage')
sns.countplot(x='vh_type', data=df, hue='Damage')
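Raw counts are dominated by the largest categories, so it can help to look at the claim rate within each level as well. This sketch assumes the `Damage` flag defined earlier; `claim_rate_by` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def claim_rate_by(df: pd.DataFrame, column: str, flag: str = "Damage") -> pd.DataFrame:
    """Share of rows with a claim, and row count, for each level of a column."""
    return df.groupby(column)[flag].agg(rate="mean", n="size")


# Toy data (made up).
toy = pd.DataFrame({
    "pol_coverage": ["Min", "Min", "Max", "Max", "Max", "Max"],
    "Damage": [0, 0, 1, 0, 1, 0],
})

rates = claim_rate_by(toy, "pol_coverage")
print(rates)
```

Keeping the row count `n` next to the rate makes it easier to spot levels whose rate rests on very few rows.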
Crashes¶
Find out how many policies had no claims over the 4 years, and how many had claims in one or more years.
count_damage_per_user = df.groupby('id_policy')['Damage'].sum()
count_damage_per_user.value_counts()
Let's explore which features change over time within a policy. For some policies, the recorded driver gender changes between years. I think this needs to be investigated in more detail.
no_change, change = [], []
for column in df.columns:
    constant_cells = (df.groupby(['id_policy'])[column].value_counts() == 4).all()
    if constant_cells:
        no_change.append(column)
    else:
        change.append(column)
change
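To follow up on a single changing feature such as the driver gender, the affected policies can be listed directly. This is a sketch under assumptions: `policies_with_changes` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def policies_with_changes(df: pd.DataFrame, column: str, key: str = "id_policy") -> list:
    """Return the policy ids whose value in `column` is not constant across rows."""
    n_values = df.groupby(key)[column].nunique(dropna=False)
    return n_values[n_values > 1].index.tolist()


# Toy data (made up): policy B switches from H to F halfway through.
toy = pd.DataFrame({
    "id_policy": ["A"] * 4 + ["B"] * 4,
    "drv_sex1": ["H", "H", "H", "H", "H", "H", "F", "F"],
})

changed = policies_with_changes(toy, "drv_sex1")
print(changed)
```

Looking at the full rows for these policies may show whether the change reflects a new main driver or a data entry issue.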
Comments
I read it. Not all claims are due to damage to the insured vehicle, if you read the description of pol_coverage you’ll see that all categories cover third party liability.
Thank you for your comment!