Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Preprocessing resolves such issues and prepares the raw data for further processing.
Data preprocessing in Python
In Python we can use the pandas library to preprocess data.
import pandas as pd
df = pd.read_csv('../data/titanic-train.csv')
df.head()      # show the first rows of the loaded dataset
df.info()      # number of entries for each feature and feature type
df.describe()  # summary statistics of the numerical features
See the pandas examples from the Udemy course.
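Since real-world data is often incomplete, a common first step is handling missing values. A minimal sketch on the Titanic data, assuming the standard 'Age', 'Embarked', and 'Sex' columns are present:
df.isnull().sum()  # count missing values per column
df['Age'] = df['Age'].fillna(df['Age'].median())  # fill missing ages with the median age
df = df.dropna(subset=['Embarked'])  # drop the few rows with no port of embarkation
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})  # encode a categorical feature as numbers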
IMDB dataset
We’ll want to build a reverse index (to map word indices back to words) and pad our data so that all reviews have the same length.
from keras.datasets import imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data('/tmp/imdb.npz',
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
idx = imdb.get_word_index()  # dictionary mapping words to indices
max(idx.values())  # number of different words
rev_idx = {v + 3: k for k, v in idx.items()}  # shift by index_from=3 to match the loaded data
rev_idx[0] = 'padding_char'
rev_idx[1] = 'start_char'
rev_idx[2] = 'oov_char'
rev_idx[3] = 'unk_char'
#transform review from indices to words
example_review = ' '.join([rev_idx[word] for word in X_train[0]])
print(example_review)
from keras.preprocessing.sequence import pad_sequences
maxlen = 100  # maximum review length to keep; 100 is an arbitrary choice
# this type of padding ('pre', the default) preserves the last maxlen datapoints
X_train_pad = pad_sequences(X_train, maxlen=maxlen)
X_test_pad = pad_sequences(X_test, maxlen=maxlen)
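To see this 'pre' truncation and padding on a toy example (maxlen=3 here is just for illustration):
pad_sequences([[1, 2, 3, 4, 5], [6, 7]], maxlen=3)
# array([[3, 4, 5],    # long sequence: only the last 3 indices survive
#        [0, 6, 7]])   # short sequence: zero-padded at the front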
One-hot encoding
from keras.utils import np_utils
import numpy as np
uniques, ids = np.unique(y, return_inverse=True)  # map each label in y to an integer id
y_code = np_utils.to_categorical(ids, len(uniques))  # one row per sample, a single 1 per row
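For example, with string labels and three classes (a quick sketch; the labels are made up):
y = np.array(['cat', 'dog', 'cat', 'bird'])
uniques, ids = np.unique(y, return_inverse=True)
# uniques = ['bird', 'cat', 'dog'], ids = [1, 2, 1, 0]
y_code = np_utils.to_categorical(ids, len(uniques))
# [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]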
Train/test set split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y_code, test_size=0.2)  # hold out 20% of the data for testing
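Passing random_state makes the split reproducible, and stratify keeps the class proportions equal in both splits (the seed value 42 is arbitrary):
X_train, X_test, y_train, y_test = train_test_split(
    X.values, y_code, test_size=0.2, random_state=42, stratify=y_code)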
Normalization
Standard normalization
We want each feature to have mean μ = 0 and variance σ² = 1, so that every feature ends up on the same scale (σ(Xi) = σ(Xj) for any j ≠ i).
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
# fit on training data, then apply the same transformation to unseen data
train = ss.fit_transform(train)  # learn the scaling/shifting that maps the data to mean 0, variance 1
test = ss.transform(test)        # apply the same operations to previously unseen test data
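A quick toy check that the fitted scaler really produces mean 0 and variance 1 per column (the numbers are made up):
import numpy as np
data = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaled = StandardScaler().fit_transform(data)
print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # ~[1. 1.]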
MinMax normalization
We’ll scale our data to fit in the [0.0, 1.0] range.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
# fit on training data, then apply the same transformation to unseen data
train = mms.fit_transform(train)  # learn the scaling/shifting that maps the data into the [0, 1] range
test = mms.transform(test)        # apply the same operations to previously unseen test data
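The analogous check for MinMaxScaler: after fitting, each column's minimum maps to 0.0 and its maximum to 1.0 (same made-up numbers):
import numpy as np
data = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaled = MinMaxScaler().fit_transform(data)
print(scaled.min(axis=0))  # [0. 0.]
print(scaled.max(axis=0))  # [1. 1.]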