About Data in ML

ML practitioners spend far more time evaluating, cleaning, and transforming data than building models.

These notes cover two kinds of data:

Numerical data
Categorical data

Numerical Data

This unit focuses on numerical data, meaning integers or floating-point values that behave like numbers. That is, they are additive, countable, ordered, and so on.

Feature Vector

A feature vector is the array of feature values that makes up an example. The model takes feature vectors as input during both training and inference.

Feature vectors seldom use the dataset’s raw values. Instead, you typically must process the dataset’s values into representations that your model can learn from more easily. This process is called feature engineering.

Every value in a feature vector must be a floating-point value. However, many features are naturally strings or other non-numerical values. Consequently, a large part of feature engineering is representing non-numerical values as numerical values.
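As a rough illustration of turning a string feature into floating-point values, the sketch below uses one-hot encoding. The notes do not name a specific encoding, so one-hot is an assumption here, and the example data, the vocabulary, and the helper names (one_hot, to_feature_vector) are all hypothetical.

```python
# Hypothetical raw example: one numerical feature and one string feature.
raw_example = {"temperature_c": 21.5, "weather": "rainy"}

# Assumed vocabulary of categories seen in the (hypothetical) dataset.
WEATHER_VOCAB = ["sunny", "cloudy", "rainy"]


def one_hot(value, vocabulary):
    """Return a list of floats with 1.0 at the position of `value`, else 0.0."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def to_feature_vector(example):
    """Concatenate the numerical feature with the encoded string feature."""
    return [float(example["temperature_c"])] + one_hot(example["weather"], WEATHER_VOCAB)


feature_vector = to_feature_vector(raw_example)
print(feature_vector)  # [21.5, 0.0, 0.0, 1.0]
```

Every entry of the resulting vector is a float, even though one of the original features was a string.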

The most common feature engineering techniques are:

Normalization: Converting numerical values into a standard range.
Binning (also referred to as bucketing): Converting numerical values into buckets of ranges.
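The two techniques above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes min-max (linear) scaling as the normalization method, which is only one of several options, and the bucket boundaries, sample ages, and function names are hypothetical.

```python
from bisect import bisect_right


def min_max_normalize(values):
    """Linearly scale values into the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bin_index(value, boundaries):
    """Return the bucket index for `value` given sorted bucket boundaries.

    Bucket 0 holds values below boundaries[0]; bucket i holds values v
    with boundaries[i - 1] <= v < boundaries[i]; the last bucket holds
    values at or above the final boundary.
    """
    return bisect_right(boundaries, value)


ages = [18, 25, 40, 62, 80]  # hypothetical raw feature values

normalized = min_max_normalize(ages)

AGE_BOUNDARIES = [30, 60]  # assumed buckets: under 30, 30 to 59, 60 and over
buckets = [bin_index(age, AGE_BOUNDARIES) for age in ages]

print(normalized)
print(buckets)  # [0, 0, 1, 2, 2]
```

Normalization keeps the values numerical but rescales them, while binning replaces each value with the index of the range it falls into.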

Data Process

Visualize your data

Visualizations help you continually check your assumptions. Use pandas for visualization and for checking missing values:

Working with Missing Data: pandas documentation
Visualizations: pandas documentation

Statistically evaluate your data

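Use the pandas DataFrame.describe() method to get summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numerical column.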
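A short sketch of what describe() reports, using a small made-up DataFrame (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical dataset; the last "rooms" value is deliberately suspicious.
df = pd.DataFrame(
    {
        "rooms": [2, 3, 3, 4, 50],
        "price": [100.0, 150.0, 155.0, 210.0, 220.0],
    }
)

# One column of statistics per numerical column in the DataFrame.
summary = df.describe()
print(summary)

# Comparing the maximum with the 75th percentile is a quick sanity check:
# a very large gap suggests the column may contain unusual values.
gap = summary.loc["max", "rooms"] - summary.loc["75%", "rooms"]
print(gap)
```

In this sample the maximum of "rooms" sits far above its 75th percentile, which is the kind of signal worth a closer look.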

Find outliers
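The notes do not say which outlier test to use. As one possible approach, the sketch below assumes the common 1.5 × IQR rule of thumb: a value is flagged if it lies more than 1.5 interquartile ranges below the first quartile or above the third. The function name iqr_outliers, the multiplier k, and the sample data are hypothetical.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Assumption: the 1.5 * IQR rule is only a rule of thumb; other
    criteria (for example z-scores) may suit a given dataset better.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


rooms = pd.Series([2, 3, 3, 4, 50], name="rooms")  # hypothetical values
outliers = iqr_outliers(rooms)
print(outliers.tolist())  # [50]
```

Whether a flagged value is a mistake or a genuine extreme is a separate judgment; the rule only points at candidates.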

Categorical Data

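Categorical data can also include numbers that behave like categories. For example, postal codes are written as digits, but adding or averaging them is meaningless.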