Data manipulation#

Learning objectives#

  • Discover what tensors are and how to manipulate them with NumPy and PyTorch.

  • Be able to load and prepare datasets of different types (tabular data, images or videos) for training a Machine Learning model.

  • Learn how the pandas and scikit-learn libraries can help achieve the previous task.

Environment setup#

# pylint: disable=wrong-import-position

import os

# Installing the ainotes package is only necessary in standalone runtime environments like Colab
if os.getenv("COLAB_RELEASE_TAG"):
    print("Standalone runtime environment detected, installing ainotes package")
    %pip install ainotes

# pylint: enable=wrong-import-position
import platform

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import sklearn
from sklearn.datasets import load_sample_images
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

import torch

from ainotes.utils.train import get_device
# Setup plots

# Include matplotlib graphs into the notebook, next to the code
# https://stackoverflow.com/a/43028034/2380880
%matplotlib inline

# Improve plot quality
%config InlineBackend.figure_format = "retina"
# Print environment info
print(f"Python version: {platform.python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"PyTorch version: {torch.__version__}")


# PyTorch device configuration
device, message = get_device()
print(message)
Python version: 3.11.1
NumPy version: 1.26.4
scikit-learn version: 1.4.1.post1
PyTorch version: 2.2.1
Using MPS GPU :)
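For reference, here is a minimal sketch of what such a device-selection helper might look like (an illustration using standard PyTorch APIs, not the actual ainotes implementation):

# Hypothetical sketch of a device-selection helper (not the actual ainotes code)
def get_device_sketch():
    """Return the most capable available PyTorch device and a description message"""

    if torch.cuda.is_available():
        # NVIDIA GPU
        return torch.device("cuda"), "Using CUDA GPU :)"
    if torch.backends.mps.is_available():
        # Apple Silicon GPU
        return torch.device("mps"), "Using MPS GPU :)"
    # Default to the CPU
    return torch.device("cpu"), "No GPU found, using CPU instead"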

Working with tensors#

Definition#

In the context of AI, a tensor is a set of primitive values sharing the same type (most often numerical), shaped into an array of any number of dimensions. It is a fancy name for a multidimensional array.

Tensors are heavily used by AI algorithms to represent and manipulate information. They are, in particular, the core data structures of Machine Learning.

Tensor properties#

  • A tensor’s dimension is also called an axis.

  • A tensor’s rank is its number of axes.

  • The tensor’s shape describes the number of values along each axis.

In mathematical terms, a rank 0 tensor is a scalar, a rank 1 tensor is a vector and a rank 2 tensor is a matrix.

Warning: rank and dimension are polysemic terms, which can be confusing.

Tensors in Python#

Python offers limited native support for manipulating tensors. Lists can be used to store information, but their mathematical capacities are insufficient for any serious work.

# A vector (rank 1 tensor)
a = [1, 2, 3]
print(a)

# A matrix (rank 2 tensor)
b = [a, [4, 5, 6]]
print(b)
[1, 2, 3]
[[1, 2, 3], [4, 5, 6]]

Dedicated libraries#

Over the years, several tools have been developed to overcome Python’s native limitations.

The most widely used is NumPy, which supports tensors in the form of ndarray objects. It offers a comprehensive set of operations on them, including creating, sorting, selecting, linear algebra and statistical operations.
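For instance, sorting, boolean selection and basic statistics are all one-liners. The following snippet is a quick illustration (not part of the original examples):

# A small sample of the operations offered by NumPy (illustrative sketch)
x = np.array([3, 1, 2])

# Sorting (returns a new, sorted tensor)
print(np.sort(x))  # [1 2 3]

# Boolean selection: keep only values greater than 1
print(x[x > 1])  # [3 2]

# Basic statistics
print(x.mean(), x.std())  # 2.0 0.816...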

Tensor management with NumPy#

Creating tensors#

The np.array function creates and returns a new tensor.

NumPy array creation

The NumPy API contains many functions for creating tensors using predefined values.

def print_tensor_info(t):
    """Print values, number of dimensions and shape of a tensor"""

    print(t)
    print(f"Dimensions: {t.ndim}")
    print(f"Shape: {t.shape}")
# Create a scalar
x = np.array(12)

print_tensor_info(x)
12
Dimensions: 0
Shape: ()
# Create a vector (1D tensor)
x = np.array([1, 2, 3])

print_tensor_info(x)
[1 2 3]
Dimensions: 1
Shape: (3,)
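The API also provides functions that create tensors filled with predefined values; a few common ones are sketched below (illustrative examples):

# Create tensors filled with predefined values (illustrative sketch)

# 2x3 matrix of zeros
print(np.zeros(shape=(2, 3)))

# Vector of ones
print(np.ones(shape=(3,)))

# 2x2 matrix filled with a constant value
print(np.full(shape=(2, 2), fill_value=7))

# Vector of evenly spaced values: [0 2 4 6 8]
print(np.arange(0, 10, 2))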

Generating random tensors#

The NumPy API also permits the creation of (pseudo-)randomly valued tensors, using various statistical distributions and data types.

# Init a NumPy random number generator
rng = np.random.default_rng()

# Create a 3x4 random matrix (2D tensor) with real values sampled from a uniform distribution
x = rng.uniform(size=(3, 4))

print_tensor_info(x)
[[0.87327715 0.69783603 0.91380947 0.95877851]
 [0.3600699  0.87642275 0.40869857 0.43925051]
 [0.54297924 0.77137579 0.81503553 0.34392423]]
Dimensions: 2
Shape: (3, 4)
# Create a 3x2x5 3D tensor with integer values sampled from a uniform distribution
x = rng.integers(low=0, high=100, size=(3, 2, 5))

print_tensor_info(x)
[[[97 16 21 94  2]
  [66 48 96  7 51]]

 [[10 32 50 31 30]
  [98 72 28  2 29]]

 [[11  7 20 85 10]
  [80 97 64 92 11]]]
Dimensions: 3
Shape: (3, 2, 5)

Shape management#

A common operation on tensors is reshaping: giving a tensor a new shape without changing its data.

The new shape must be compatible with the current one: the new tensor needs to have the same number of elements as the original one.

NumPy reshaping

# Reshape a 3x2 matrix into a 2x3 matrix
x = np.array([[1, 2], [3, 4], [5, 6]])
x_reshaped = x.reshape(2, 3)

print_tensor_info(x_reshaped)
[[1 2 3]
 [4 5 6]]
Dimensions: 2
Shape: (2, 3)
# Reshape the previous matrix into a vector
x_reshaped = x.reshape(
    6,
)

print_tensor_info(x_reshaped)
[1 2 3 4 5 6]
Dimensions: 1
Shape: (6,)
# Error: incompatible shapes!
# x.reshape(5, )

Indexing and slicing#

Tensors can be indexed and sliced just like regular Python lists.

NumPy indexing

x = np.array([1, 2, 3])

# Select element at index 1
assert x[1] == 2

# Select elements between indexes 0 (included) and 2 (excluded)
assert np.array_equal(x[0:2], [1, 2])

# Select elements starting at index 1 (included)
assert np.array_equal(x[1:], [2, 3])

# Select last element
assert np.array_equal(x[-1], 3)

# Select all elements but last one
assert np.array_equal(x[:-1], [1, 2])

# Select last 2 elements
assert np.array_equal(x[-2:], [2, 3])

# Select second-to-last element
assert np.array_equal(x[-2:-1], [2])
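Indexing and slicing also work along each axis of a multidimensional tensor. Here is an additional illustration (not part of the original examples):

# Indexing and slicing a matrix (2D tensor)
m = np.array([[1, 2, 3], [4, 5, 6]])

# Select the element at row 0, column 2
assert m[0, 2] == 3

# Select the first row
assert np.array_equal(m[0], [1, 2, 3])

# Select the second column (all rows, index 1 on the second axis)
assert np.array_equal(m[:, 1], [2, 5])

# Select a 2x2 sub-matrix (all rows, first two columns)
assert np.array_equal(m[:, 0:2], [[1, 2], [4, 5]])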

Tensor axes#

Many tensor operations can be applied along one or several axes, which are indexed starting at 0.

NumPy axes

# Create a 2x2 matrix (2D tensor)
x = np.array([[1, 1], [2, 2]])
print(x)

# Sum values along the first axis (rows), computing one total per column
print(x.sum(axis=0))

# Sum values along the second axis (columns), computing one total per row
print(x.sum(axis=1))
[[1 1]
 [2 2]]
[3 3]
[2 4]

Element-wise operations#

These operations are applied independently to each entry in the tensors being considered.

# Element-wise product between two matrices (shapes must be identical)
x = np.array([[1, 2, 3], [3, 2, -2]])
y = np.array([[3, 0, 2], [1, 4, -2]])
z = x * y

print(x)
print(y)
print_tensor_info(z)
[[ 1  2  3]
 [ 3  2 -2]]
[[ 3  0  2]
 [ 1  4 -2]]
[[3 0 6]
 [3 8 4]]
Dimensions: 2
Shape: (2, 3)

Dot product#

In contrast, operations like the dot product combine entries of the input tensors to produce a differently shaped result.

# Dot product between two matrices (shapes must be compatible)
x = np.array([[1, 2, 3], [3, 2, 1]])
y = np.array([[3, 0], [2, 1], [4, -2]])
# alternative syntax: z = x.dot(y)
z = np.dot(x, y)

print(x)
print(y)
print_tensor_info(z)
[[1 2 3]
 [3 2 1]]
[[ 3  0]
 [ 2  1]
 [ 4 -2]]
[[19 -4]
 [17  0]]
Dimensions: 2
Shape: (2, 2)

Broadcasting#

Broadcasting is a mechanism that allows operations to be performed on tensors of different shapes. Subject to certain constraints, the smaller tensor may be “broadcast” across the larger one so that they have compatible shapes.

NumPy broadcasting

Broadcasting provides an efficient means of vectorizing tensor operations.

# Broadcasting between a vector and a scalar
x = np.array([1.0, 2.0])
print(x * 1.6)
[1.6 3.2]
# Broadcasting between a matrix and a vector
x = np.array([[0, 1, 2], [-2, 5, 3]])
y = np.array([1, 2, 3])
z = x + y

print_tensor_info(z)
[[ 1  3  5]
 [-1  7  6]]
Dimensions: 2
Shape: (2, 3)

GPU-based tensors#

For all its qualities, NumPy has a limitation which can be critical in some contexts: it only runs on the machine’s CPU.

Among other advantages, newer tools offer support for dedicated high-performance processors like GPUs or TPUs, while providing a NumPy-like API to make onboarding easier. The most prominent ones are currently TensorFlow, PyTorch and JAX.

# Create a 2x2 random PyTorch tensor, trying to store it into the GPU memory
x = torch.rand(size=(2, 2), device=device)
print(x)
tensor([[0.5224, 0.7456],
        [0.5272, 0.0403]], device='mps:0')
/Users/baptiste/Documents/Projets/GitHub/bpesquet/ainotes/.venv/lib/python3.11/site-packages/torch/_tensor_str.py:137: UserWarning: MPS: nonzero op is supported natively starting from macOS 13.0. Falling back on CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Indexing.mm:283.)
  nonzero_finite_vals = torch.masked_select(
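PyTorch tensors and NumPy arrays can also be converted back and forth, which eases interoperability between the two libraries. A quick sketch (float32 is used because some accelerators do not support 64-bit floats):

# Convert a NumPy array into a PyTorch tensor (memory is shared on CPU)
x_np = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
x_pt = torch.from_numpy(x_np)
print(x_pt)

# Move the tensor to the selected device, then bring it back as a NumPy array
x_back = x_pt.to(device).cpu().numpy()
print(x_back)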

Loading and exploring data#

Introduction to pandas#

The pandas library is dedicated to data analysis in Python. It greatly facilitates loading, exploring and processing tabular data files.

The primary data structures in pandas are implemented as two classes:

  • DataFrame, which is quite similar to a relational data table, with rows and named columns.

  • Series, which represents a single data column. A DataFrame contains one or more Series and a name for each Series.

The DataFrame is a commonly used abstraction for data manipulation.

# Create two data Series
pop = pd.Series({"CAL": 38332521, "TEX": 26448193, "NY": 19651127})
area = pd.Series({"CAL": 423967, "TEX": 695662, "NY": 141297})

# Create a DataFrame containing the two Series
# The df_ prefix is used to distinguish pandas dataframes from plain NumPy arrays
df_poprep = pd.DataFrame({"population": pop, "area": area})

print(df_poprep)
     population    area
CAL    38332521  423967
TEX    26448193  695662
NY     19651127  141297
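Columns of a DataFrame can be accessed as Series and combined to create new ones. For example, continuing with the DataFrame just created (an illustrative addition):

# Access a single column (returns a Series)
print(df_poprep["area"])

# Create a new column computed from existing ones
df_poprep["density"] = df_poprep["population"] / df_poprep["area"]
print(df_poprep)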

Loading a tabular dataset#

Tabular datasets are most often stored as CSV (Comma-Separated Values) text files.

The pd.read_csv function can load a CSV file into a DataFrame from either a local path or a URL.

The following code loads a dataset which was extracted from a Kaggle competition.

# Load a CSV file into a DataFrame
# Data comes from a Kaggle competition
df_olympics = pd.read_csv(
    "https://raw.githubusercontent.com/bpesquet/ainotes/master/data/athlete_events.csv"
)

Exploring tabular data#

Once a dataset is loaded into a DataFrame, many operations can be applied to it for visualization or transformation purposes. For more details, see the 10 minutes to pandas tutorial.

Let’s use pandas to perform the very first steps of what is often called Exploratory Data Analysis.

# Print dataset shape (rows x columns)
print(f"df_olympics: {df_olympics.shape}")
df_olympics: (271116, 15)
# Print a concise summary of the dataset
df_olympics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
# Print the first 10 rows of the dataset
df_olympics.head(n=10)
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 5 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
| 6 | 5 | Christine Jacoba Aaftink | F | 25.0 | 185.0 | 82.0 | Netherlands | NED | 1992 Winter | 1992 | Winter | Albertville | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 7 | 5 | Christine Jacoba Aaftink | F | 25.0 | 185.0 | 82.0 | Netherlands | NED | 1992 Winter | 1992 | Winter | Albertville | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
| 8 | 5 | Christine Jacoba Aaftink | F | 27.0 | 185.0 | 82.0 | Netherlands | NED | 1994 Winter | 1994 | Winter | Lillehammer | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 9 | 5 | Christine Jacoba Aaftink | F | 27.0 | 185.0 | 82.0 | Netherlands | NED | 1994 Winter | 1994 | Winter | Lillehammer | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
# Print 5 random samples
df_olympics.sample(n=5)
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 234619 | 117659 | Andrzej Szymczak | M | 27.0 | 193.0 | 96.0 | Poland | POL | 1976 Summer | 1976 | Summer | Montreal | Handball | Handball Men's Handball | Bronze |
| 165291 | 83006 | John William Mulhall | M | 26.0 | 165.0 | 62.0 | Great Britain | GBR | 1964 Summer | 1964 | Summer | Tokyo | Gymnastics | Gymnastics Men's Horse Vault | NaN |
| 226761 | 113946 | Barbora Špotáková | F | 31.0 | 182.0 | 80.0 | Czech Republic | CZE | 2012 Summer | 2012 | Summer | London | Athletics | Athletics Women's Javelin Throw | Gold |
| 223951 | 112493 | Clement Eyer Smoot | M | 20.0 | NaN | NaN | Western Golf Association-1 | USA | 1904 Summer | 1904 | Summer | St. Louis | Golf | Golf Men's Team | Gold |
| 196758 | 98783 | Shankar Ramu | M | 22.0 | NaN | NaN | Malaysia | MAS | 1992 Summer | 1992 | Summer | Barcelona | Hockey | Hockey Men's Hockey | NaN |
# Print descriptive statistics for all numerical attributes
df_olympics.describe()
| | ID | Age | Height | Weight | Year |
|---|---|---|---|---|---|
| count | 271116.000000 | 261642.000000 | 210945.000000 | 208241.000000 | 271116.000000 |
| mean | 68248.954396 | 25.556898 | 175.338970 | 70.702393 | 1978.378480 |
| std | 39022.286345 | 6.393561 | 10.518462 | 14.348020 | 29.877632 |
| min | 1.000000 | 10.000000 | 127.000000 | 25.000000 | 1896.000000 |
| 25% | 34643.000000 | 21.000000 | 168.000000 | 60.000000 | 1960.000000 |
| 50% | 68205.000000 | 24.000000 | 175.000000 | 70.000000 | 1988.000000 |
| 75% | 102097.250000 | 28.000000 | 183.000000 | 79.000000 | 2002.000000 |
| max | 135571.000000 | 97.000000 | 226.000000 | 214.000000 | 2016.000000 |
# Print descriptive statistics for all non-numerical attributes
df_olympics.describe(include=["object", "bool"])
| | Name | Sex | Team | NOC | Games | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 271116 | 271116 | 271116 | 271116 | 271116 | 271116 | 271116 | 271116 | 271116 | 39783 |
| unique | 134732 | 2 | 1184 | 230 | 51 | 2 | 42 | 66 | 765 | 3 |
| top | Robert Tait McKenzie | M | United States | USA | 2000 Summer | Summer | London | Athletics | Football Men's Football | Gold |
| freq | 58 | 196594 | 17847 | 18853 | 13821 | 222552 | 22426 | 38624 | 5733 | 13372 |
# Print the number of samples by sport
df_olympics["Sport"].value_counts()
Sport
Athletics        38624
Gymnastics       26707
Swimming         23195
Shooting         11448
Cycling          10859
                 ...  
Racquets            12
Jeu De Paume        11
Roque                4
Basque Pelota        2
Aeronautics          1
Name: count, Length: 66, dtype: int64
# Print rounded average age for women
age_mean = df_olympics[df_olympics["Sex"] == "F"]["Age"].mean()
print(f"{age_mean:.0f}")
24
# Print percent of athletes for some countries

athletes_count = df_olympics.shape[0]

for country in ["USA", "FRA", "GBR", "GER"]:
    percent = (df_olympics["NOC"] == country).sum() / athletes_count
    print(f"Athletes from {country}: {percent*100:.2f}%")
Athletes from USA: 6.95%
Athletes from FRA: 4.71%
Athletes from GBR: 4.52%
Athletes from GER: 3.63%

Loading images#

Digital images are stored using either the bitmap format (an array of color values for all individual pixels in the image) or the vector format (a description of the elementary shapes in the image).

Bitmap images can be easily manipulated as tensors. Each pixel color is typically expressed using a combination of the three primary colors: red, green and blue.

RGB wheel

# Load sample images provided by scikit-learn into a NumPy array
images = np.asarray(load_sample_images().images)

# Load the last sample image
sample_image = images[-1]

# Show image
plt.imshow(sample_image)

print(f"Images: {images.shape}")
print(f"Sample image: {sample_image.shape}")
print(f"Sample pixel: {sample_image[225, 300]}")
Images: (2, 427, 640, 3)
Sample image: (427, 640, 3)
Sample pixel: [219  78  60]

Preparing data for training#

A mandatory step#

In Machine Learning, the chosen dataset has to be carefully prepared before using it to train a model. This can have a major impact on the outcome of the training process.

This important task, sometimes called data preprocessing, might involve:

  • Splitting data between training, validation and test sets.

  • Reshaping data.

  • Removing superfluous features (if any).

  • Filling in (imputing) missing values.

  • Scaling data.

  • Transforming values into numeric form.

  • Augmenting data.

  • Engineering new features.

Data splitting#

Once trained, an ML model must be able to generalize (perform well with new data). In order to assess this ability, data is always split into two or three sets before training:

  • Training set (typically 80% or more): fed to the model during training.

  • Validation set: used to tune the model without biasing it in favor of the test set.

  • Test set: used to check the final model’s performance on unseen data.

Dataset splitting

# Demonstrate the use of scikit-learn's train_test_split for splitting a dataset

# Create a random 30x4 matrix (fictitious inputs) and a random 30x1 vector (fictitious results)
x = np.random.rand(30, 4)
y = np.random.rand(30)
print(f"x: {x.shape}. y: {y.shape}")

# Split fictitious dataset between training and test sets, using a 75/25 ratio
# A single call to train_test_split is needed to maintain the correspondence between inputs and targets across samples
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

print(f"x_train: {x_train.shape}. y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}. y_test: {y_test.shape}")
x: (30, 4). y: (30,)
x_train: (22, 4). y_train: (22,)
x_test: (8, 4). y_test: (8,)
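When a validation set is needed, a common approach is to split the training set a second time. A minimal sketch (the ratios and variable names are arbitrary):

# Split the training set again to obtain a validation set (75/25 of the training data)
x_train_final, x_val, y_train_final, y_val = train_test_split(
    x_train, y_train, test_size=0.25
)

print(f"x_train_final: {x_train_final.shape}. x_val: {x_val.shape}. x_test: {x_test.shape}")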

Image and video reshaping#

A bitmap image can be represented as a 3D multidimensional array of dimensions \(height \times width \times color\_channels\).

A video can be represented as a 4D multidimensional array of dimensions \(frames \times height \times width \times color\_channels\).

They have to be reshaped, or more precisely flattened in that case, into a vector (1D tensor) before being fed to most ML algorithms.

Reshaping an image

# Flatten the image, which is a 3D tensor, into a vector (1D tensor)
flattened_image = sample_image.reshape((427 * 640 * 3,))

# Alternative syntaxes to achieve the same result
# -1 means the new dimension is inferred from current dimensions
# Difference between flatten() and ravel() is explained here:
# https://numpy.org/doc/stable/user/absolute_beginners.html#reshaping-and-flattening-multidimensional-arrays
flattened_image = sample_image.reshape((-1,))
flattened_image = sample_image.ravel()
flattened_image = sample_image.flatten()

print(f"Flattened image: {flattened_image.shape}")
Flattened image: (819840,)

Handling of missing values#

Most ML algorithms cannot work with missing values in features.

Depending on the percentage of missing data, three options exist:

  • remove the corresponding data samples;

  • remove the whole feature(s);

  • replace the missing values (using 0, the mean, the median or something more meaningful in the context).

# Demonstrate the use of scikit-learn's SimpleImputer for handling missing values

# Create a tensor with missing values
x = np.array([[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
print(x)

# Replace missing values with column-wise mean
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(x))

# Replace missing values with "Unknown"
imputer = SimpleImputer(strategy="constant", missing_values=None, fill_value="Unknown")
print(imputer.fit_transform([["M"], ["M"], [None], ["F"], [None]]))
[[ 7.  2. nan]
 [ 4. nan  6.]
 [10.  5.  9.]]
[[ 7.   2.   7.5]
 [ 4.   3.5  6. ]
 [10.   5.   9. ]]
[['M']
 ['M']
 ['Unknown']
 ['F']
 ['Unknown']]
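The first two options (removing samples or whole features) are straightforward with pandas. A minimal sketch on a small hypothetical DataFrame:

# Create a small DataFrame containing missing values
df_demo = pd.DataFrame({"a": [7, 4, 10], "b": [2, np.nan, 5], "c": [np.nan, 6, 9]})

# Option 1: remove the data samples (rows) containing missing values
print(df_demo.dropna(axis=0))

# Option 2: remove the features (columns) containing missing values
print(df_demo.dropna(axis=1))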

Feature scaling#

Most ML algorithms work best when all features have a similar scale. Several solutions exist:

  • Min-Max scaling: features are shifted and rescaled to the \([0,1]\) range by subtracting the min value and dividing by (max - min) along the first axis.

  • Standardization: features are centered (their mean is subtracted) then reduced (divided by their standard deviation) along the first axis. All resulting features have a mean of 0 and a standard deviation of 1.

# Demonstrate the use of scikit-learn's MinMaxScaler to rescale values

# Generate a random 3x4 tensor with integer values between 1 and 9 (the upper bound is exclusive)
x = np.random.randint(1, 10, (3, 4))
print(x)

# Compute min and max then scale tensor in one operation
x_scaled = MinMaxScaler().fit_transform(x)

print(x_scaled)
print(f"Minimum: {x_scaled.min(axis=0)}. Maximum: {x_scaled.max(axis=0)}")
[[4 2 1 1]
 [4 9 7 4]
 [3 1 9 3]]
[[1.         0.125      0.         0.        ]
 [1.         1.         0.75       1.        ]
 [0.         0.         1.         0.66666667]]
Minimum: [0. 0. 0. 0.]. Maximum: [1. 1. 1. 1.]
# Demonstrate the use of scikit-learn's StandardScaler to standardize values

# Generate a random (3,4) tensor with integer values between 1 and 9 (the upper bound is exclusive)
x = np.random.randint(1, 10, (3, 4))
print(x)

# Center and reduce data
scaler = StandardScaler().fit(x)
print(scaler.mean_)

x_scaled = scaler.transform(x)
print(x_scaled)

# New mean is 0. New standard deviation is 1
print(f"Mean: {x_scaled.mean()}. Std: {x_scaled.std()}")
[[8 3 6 7]
 [2 7 7 6]
 [7 9 8 5]]
[5.66666667 6.33333333 7.         6.        ]
[[ 0.88900089 -1.33630621 -1.22474487  1.22474487]
 [-1.3970014   0.26726124  0.          0.        ]
 [ 0.50800051  1.06904497  1.22474487 -1.22474487]]
Mean: 1.850371707708594e-17. Std: 1.0

Feature scaling and training/test sets#

In order to avoid information leakage, the test set must be scaled with values calculated on the training set.

# Compute mean and std on training set
scaler = StandardScaler().fit(x_train)

# Standardize training and test sets, using mean and std computed on training set
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(f"Train mean: {x_train_scaled.mean(axis=0)}")
print(f"Train std: {x_train_scaled.std(axis=0)}")
print(f"Test mean: {x_test_scaled.mean(axis=0)}")
print(f"Test std: {x_test_scaled.std(axis=0)}")
Train mean: [ 1.71579922e-16 -1.36885452e-16 -6.92627773e-16 -2.52323415e-17]
Train std: [1. 1. 1. 1.]
Test mean: [ 0.38510838  0.74722209  0.15848908 -0.06560817]
Test std: [0.95896239 1.37968827 1.12558319 1.33617739]

Image and video scaling#

Individual pixel values for images and videos are typically integers in the \([0,255]\) range. This is not ideal for most ML algorithms.

Dividing them by \(255.0\) to obtain floats into the \([0,1]\) range is a common practice.

# Scaling sample image pixels between 0 and 1
scaled_image = sample_image / 255.0

# Check that all values are in the [0,1] range
assert scaled_image.min() >= 0
assert scaled_image.max() <= 1

Encoding of categorical features#

Some features or targets may come as discrete rather than continuous values. Moreover, these discrete values might be strings. Most ML models can only handle numerical data.

A solution is to apply one-of-K encoding, also named dummy encoding or one-hot encoding. Each categorical feature with K possible values is transformed into a vector of K binary features, in which one value is 1 and all the others are 0.

Note: using arbitrary integer values rather than binary vectors would create an artificial proximity relationship between categories, which could confuse the model during training.

# Demonstrate the use of scikit-learn's OneHotEncoder to one-hot encode categorical features

# Create a categorical variable with 3 different values
x = [["GOOD"], ["AVERAGE"], ["GOOD"], ["POOR"], ["POOR"]]

# Encoder input must be a matrix
# Output will be a sparse matrix where each column corresponds to one possible value of one feature
encoder = OneHotEncoder().fit(x)
x_hot = encoder.transform(x).toarray()

print(x_hot)
print(encoder.categories_)
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]
[array(['AVERAGE', 'GOOD', 'POOR'], dtype=object)]
# Demonstrate one-hot encoding of categorical features given as integers

# Generate a (5,1) tensor with integer values between 0 and 8 (the upper bound is exclusive)
x = np.random.randint(0, 9, (5, 1))
print(x)

# Encoder input must be a matrix
# Output will be a sparse matrix where each column corresponds to one possible value of one feature
x_hot = OneHotEncoder().fit_transform(x).toarray()

print(x_hot)
[[0]
 [2]
 [6]
 [6]
 [0]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]]

One-hot encoding and training/test sets#

Depending on value distribution between training and test sets, some categories might appear only in one set.

The best solution is to one-hot encode based on the training set categories, ignoring test-only categories.

x_train = [["Blue"], ["Red"], ["Blue"], ["Green"]]
# "Yellow" is not present in training set
x_test = [
    ["Red"],
    ["Yellow"],
    ["Green"],
    ["Yellow"],
]

# Using categories from the training set, ignoring unknown categories
encoder = OneHotEncoder(handle_unknown="ignore").fit(x_train)
print(encoder.transform(x_train).toarray())
print(encoder.categories_)

# Unknown categories will result in a binary vector with all zeros
print(encoder.transform(x_test).toarray())
[[1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
[array(['Blue', 'Green', 'Red'], dtype=object)]
[[0. 0. 1.]
 [0. 0. 0.]
 [0. 1. 0.]
 [0. 0. 0.]]
x = [["M"], ["M"], [None], ["F"]]

# Replace missing values with constant
print(
    SimpleImputer(
        strategy="constant", missing_values=None, fill_value="Unknown"
    ).fit_transform(x)
)
[['M']
 ['M']
 ['Unknown']
 ['F']]

Data augmentation#

Data augmentation is the process of enriching a dataset by adding new samples: slightly modified copies of existing data, or newly created synthetic data.

Image augmentation example
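As a minimal illustration (assuming the sample_image tensor and the rng generator defined earlier are still available; dedicated libraries offer far richer transformations), simple image augmentations can be expressed directly with NumPy:

# Simple image augmentations expressed with NumPy (illustrative sketch)

# Horizontal flip: reverse the pixel order along the width axis
flipped_image = sample_image[:, ::-1, :]

# Random crop: keep a 300x300 region starting at a random position
top = rng.integers(0, sample_image.shape[0] - 300)
left = rng.integers(0, sample_image.shape[1] - 300)
cropped_image = sample_image[top : top + 300, left : left + 300, :]

print(f"Flipped image: {flipped_image.shape}")
print(f"Cropped image: {cropped_image.shape}")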

Feature engineering#

Feature engineering is the process of preparing the proper input features in order to facilitate the learning task: the problem is made easier by expressing it in a simpler way. This usually requires good domain knowledge.

The ability of deep neural networks to discover useful features by themselves has somewhat reduced the criticality of feature engineering. Nevertheless, it remains important in order to solve problems more elegantly and with less data.

Example (taken from the book Deep Learning with Python): the task of learning the time of day from a clock is far easier with engineered features than with raw clock images.

Feature engineering
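As another illustration (not taken from the book), a raw timestamp can be turned into cyclical hour-of-day features, so that 23:00 and 01:00 end up close to each other in feature space:

# Engineer cyclical time-of-day features from raw timestamps (illustrative sketch)
df_time = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 23:00", "2024-01-02 01:00", "2024-01-02 12:00"]
        )
    }
)

# Express the hour as an angle on a 24-hour circle, then take its sine and cosine
hours = df_time["timestamp"].dt.hour
df_time["hour_sin"] = np.sin(2 * np.pi * hours / 24)
df_time["hour_cos"] = np.cos(2 * np.pi * hours / 24)

print(df_time)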