Feature engineering is a critical stage in any machine learning project. Its main purpose is to extract features from the raw data and turn them into numerical values that a machine learning model can be trained on. In this notebook, we will explore the different types of features and how to generate them.
Label encoding
Label encoding assigns an integer code to each category:
import numpy as np
import sklearn.preprocessing as preprocessing

targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"])
labelenc = preprocessing.LabelEncoder()
labelenc.fit(targets)
targets_trans = labelenc.transform(targets)
print("The original data")
print(targets)
print("The transformed data using LabelEncoder")
print(targets_trans)
The original data
['Sun' 'Sun' 'Moon' 'Earth' 'Moon' 'Venus']
The transformed data using LabelEncoder
[2 2 1 0 1 3]
The label encoding must map categories consistently across the train and test datasets, for example by fitting the encoder on both at the same time. We can also use .astype("category") together with pandas.Series.cat.codes to achieve the same result:
import pandas as pd

df = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"]})
print("The original types of DataFrame")
print(df.dtypes)
print("*" * 30)
df["col1"] = df["col1"].astype("category")
print("The new types of DataFrame")
print(df.dtypes)
print("*" * 30)
df["col1_label_encoding"] = df["col1"].cat.codes
print("The new column.")
df
The original types of DataFrame
col1 object
dtype: object
******************************
The new types of DataFrame
col1 category
dtype: object
******************************
The new column.
    col1  col1_label_encoding
0    Sun                    2
1    Sun                    2
2   Moon                    1
3  Earth                    0
4   Moon                    1
5  Venus                    3
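As noted above, one way to keep the mapping consistent between train and test is to fit the encoder on the combined values and then transform each set. A minimal sketch; the column name col1 and the example values are just for illustration:

import pandas as pd
import sklearn.preprocessing as preprocessing

train = pd.DataFrame({"col1": ["Sun", "Moon", "Earth", "Venus"]})
test = pd.DataFrame({"col1": ["Moon", "Sun", "Pluto"]})

labelenc = preprocessing.LabelEncoder()
# Fit on the union of train and test values so every category gets a code
labelenc.fit(pd.concat([train["col1"], test["col1"]]))
train["col1_label_encoding"] = labelenc.transform(train["col1"])
test["col1_label_encoding"] = labelenc.transform(test["col1"])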
One-hot encoding
Even though we have generated numbers from the categories, in most cases these numbers have no order (i.e. they are nominal, not ordinal). In that case, the proper way to encode them is one-hot encoding.
import numpy as np
import sklearn.preprocessing as preprocessing

targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"])
labelEnc = preprocessing.LabelEncoder()
new_target = labelEnc.fit_transform(targets)
onehotEnc = preprocessing.OneHotEncoder()
onehotEnc.fit(new_target.reshape(-1, 1))
targets_trans = onehotEnc.transform(new_target.reshape(-1, 1))
print("The original data")
print(targets)
print("The transformed data using OneHotEncoder")
print(targets_trans.toarray())
The original data
['Sun' 'Sun' 'Moon' 'Earth' 'Moon' 'Venus']
The transformed data using OneHotEncoder
[[0. 0. 1. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]]
Mean encoding
Mean encoding maps each category to the mean of the target variable for that category (statistics other than the mean, such as the variance or standard deviation, can be used as well):
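A minimal pandas sketch of mean encoding; the column names col1 and target are just for illustration, and in practice the per-category means should be computed on the training data only to avoid target leakage:

import pandas as pd

df = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"],
                   "target": [1, 0, 1, 1, 0, 0]})
# Mean of the target for each category of col1
means = df.groupby("col1")["target"].mean()
df["col1_mean_encoding"] = df["col1"].map(means)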
Weight of evidence (WOE) encoding
WOE is a technique for encoding categorical features in classification tasks. For each category we compute the probability of the positive class, p(1), and of the negative class, p(0), and take the log of their ratio. For a binary classification problem, WOE is defined as:
\[ WOE = \log \frac{p(1)}{p(0)} \]
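A minimal sketch of one common formulation of WOE for a binary target, assuming hypothetical columns col1 and target; here p(1) and p(0) are the shares of the positive and negative examples that fall in each category, and a small constant avoids division by zero:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"],
                   "target": [1, 0, 1, 1, 0, 0]})
eps = 1e-6  # avoid division by zero and log(0)
stats = df.groupby("col1")["target"].agg(["sum", "count"])
p1 = stats["sum"] / df["target"].sum()                            # share of positives per category
p0 = (stats["count"] - stats["sum"]) / (df["target"] == 0).sum()  # share of negatives per category
woe = np.log((p1 + eps) / (p0 + eps))
df["col1_woe"] = df["col1"].map(woe)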
Feature combination
Combining two categorical features into a single crossed feature can expose interactions between them:
The original feature matrix is
fea1 fea2
0 a red
1 b yellow
2 a white
3 b blue
4 a red
==============================
The new feature matrix is
fea1 fea2 fea1_fea2
0 a red a_red
1 b yellow b_yellow
2 a white a_white
3 b blue b_blue
4 a red a_red
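The combined column above can be produced by plain string concatenation of the two categorical columns; a minimal sketch:

import pandas as pd

df = pd.DataFrame({"fea1": ["a", "b", "a", "b", "a"],
                   "fea2": ["red", "yellow", "white", "blue", "red"]})
# Cross the two categorical columns into a single combined feature
df["fea1_fea2"] = df["fea1"] + "_" + df["fea2"]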
Datetime features
We can decompose a datetime into year, month, day, hour, and similar numeric features:
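A minimal pandas sketch, assuming a hypothetical datetime column named dt:

import pandas as pd

df = pd.DataFrame({"dt": pd.to_datetime(["2021-01-01 10:30:00", "2021-06-15 22:05:00"])})
# Extract calendar and clock components as separate numeric features
df["year"] = df["dt"].dt.year
df["month"] = df["dt"].dt.month
df["day"] = df["dt"].dt.day
df["hour"] = df["dt"].dt.hour
df["dayofweek"] = df["dt"].dt.dayofweek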