What is Selection Bias — Data Science

VISHWAS UPADHYAYA
Mar 20, 2020

Hello! In this post we will learn about selection bias.
First, let's look at the name: selection means choosing from something, and bias means partiality.
So the term selection bias means selecting something with partiality.

For example, suppose your college teacher is selecting a team for a cricket match and picks only batsmen or only bowlers. The teacher is being biased in selecting the team, and because of this selection your team may lose the match.

The example above shows that selection bias is a problem. The same problem also occurs in Data Science.
Now let's look at this problem from a Data Science point of view.

Suppose you have a dataset for spam detection, and you select some of it for training and some for testing.
The data has two columns: one with the label (ham or spam) and one with the message.

Now you have selected the train and test data. Suppose the train data contains only rows labeled ham (this can happen because there is a lot of ham). When you train your model on this data, it never learns what spam looks like, so when you test it you will get very low accuracy, because the model didn't learn how to recognize spam messages.

One more thing: even if your train data contains a few (one or two) spam rows, your model still will not train well.

This is selection bias, because the selection of your train data is heavily biased (towards ham).
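
To see how easily this can happen, here is a minimal sketch in plain Python. The dataset is a made-up toy example (60 ham labels followed by 10 spam labels), and the split simply takes the first half of the rows without shuffling, which is an assumption chosen to show the problem.

```python
from collections import Counter

# Hypothetical toy labels: 60 ham rows followed by 10 spam rows.
labels = ["ham"] * 60 + ["spam"] * 10

# Naive split: first half for training, second half for testing, no shuffling.
train_labels = labels[:35]
test_labels = labels[35:]

train_counts = Counter(train_labels)
test_counts = Counter(test_labels)

print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

The train half contains no spam at all, so a model trained on it never sees a single spam message.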

To fix this particular problem, first count the number of spam and ham messages. Then split each class in the same proportion between the train and test data, so that both sets contain spam and ham in the same ratio as the full dataset.

Example:
spam = 10, ham = 60
train_data = 5 spam + 30 ham
test_data = 5 spam + 30 ham
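
The counting and splitting above can also be done by hand. The sketch below is only an illustration: the function name stratified_half_split and the toy labels are made up for this post, and it always splits each class half and half, like the example.

```python
import random
from collections import Counter, defaultdict

def stratified_half_split(labels, seed=42):
    """Return (train_idx, test_idx) so each label is split half and half."""
    # Group row positions by their label.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)              # pick rows at random within each class
        half = len(idxs) // 2
        train_idx.extend(idxs[:half])  # first half of the class goes to train
        test_idx.extend(idxs[half:])   # the rest goes to test
    return train_idx, test_idx

# Hypothetical toy labels matching the example: 10 spam, 60 ham.
labels = ["spam"] * 10 + ["ham"] * 60
train_idx, test_idx = stratified_half_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

Both sets end up with 5 spam and 30 ham, and no row appears in both.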

For this you can use the scikit-learn library.
In sklearn.model_selection there is a class called StratifiedShuffleSplit.

Here is the code:

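from sklearn.model_selection import StratifiedShuffleSplit

# df is assumed to be a pandas DataFrame with a 'label' column (ham or spam)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df['label']):
    # split() yields positional indices, so use iloc rather than loc
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]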

You can use this code with any categorical feature (column) in your data, so that the train and test sets both keep the same proportion of each value.
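
If you want to check that the proportions really are kept, one option is to count the labels after splitting. The sketch below uses train_test_split with its stratify argument, which is a swapped-in shortcut from the same sklearn.model_selection module rather than the class used above. The messages and labels are made-up toy data, and test_size=0.5 is chosen only so the numbers match the 5 + 30 example.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 spam and 60 ham messages.
messages = [f"message {i}" for i in range(70)]
labels = ["spam"] * 10 + ["ham"] * 60

# stratify=labels keeps the spam/ham ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.5, stratify=labels, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

If the counts in the two sets are far from the ratio in the full data, the split is probably not stratified on the column you intended.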

There are lots of other ways to understand selection bias.

That’s it for this post.

Thank you.

VISHWAS UPADHYAYA

Currently working on Data Science. Python Developer (Django Framework)