ML practitioners spend far more time evaluating, cleaning, and transforming data than building models. These notes cover two kinds of data:
- numerical data
- categorical data
Numerical Data
This unit focuses on numerical data, meaning integers or floating-point values that behave like numbers. That is, they are additive, countable, ordered, and so on.
Feature Vector
A feature vector is an array of feature values comprising an example. The model takes feature vectors as input during both training and inference.
Feature vectors seldom use the dataset’s raw values. Instead, you must typically process the dataset’s values into representations that your model can better learn from. This process is called feature engineering.
Every value in a feature vector must be a floating-point value. However, many features are naturally strings or other non-numerical values. Consequently, a large part of feature engineering is representing non-numerical values as numerical values.
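As a minimal sketch of the idea above, the following turns one example with a string feature into a feature vector of floats. The feature names, the vocabulary, and the helper `to_feature_vector` are all hypothetical, and one-hot encoding is only one common way to represent a string as numbers.

```python
# Hypothetical example: one house described by two numeric features
# and one string feature.
raw_example = {"bedrooms": 3, "area_sq_m": 120.5, "heating": "gas"}

# Assumed vocabulary: the values the string feature is allowed to take.
HEATING_VOCAB = ["electric", "gas", "oil"]


def to_feature_vector(example):
    """Return a flat list of floats representing one example."""
    # Numeric features are simply cast to float.
    vector = [float(example["bedrooms"]), float(example["area_sq_m"])]
    # One-hot encode the string: 1.0 at the matching category, 0.0 elsewhere.
    vector += [1.0 if example["heating"] == v else 0.0 for v in HEATING_VOCAB]
    return vector


feature_vector = to_feature_vector(raw_example)
print(feature_vector)  # [3.0, 120.5, 0.0, 1.0, 0.0]
```

Every entry in the result is a float, including the three slots that stand in for the string value.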
The most common feature engineering techniques are:
- Normalization: Converting numerical values into a standard range.
- Binning (also referred to as bucketing): Converting numerical values into buckets of ranges.
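The two techniques above can be sketched in a few lines. This example assumes min-max scaling as the normalization method (other methods exist) and uses made-up temperature values and bucket boundaries; the helper names are hypothetical.

```python
import bisect


def min_max_scale(values):
    """Linearly rescale values into the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bin_index(value, boundaries):
    """Return the index of the bucket that value falls into.

    boundaries must be sorted; a value equal to a boundary goes
    into the bucket above that boundary.
    """
    return bisect.bisect_right(boundaries, value)


temperatures = [10.0, 15.0, 20.0, 30.0]   # made-up raw values
boundaries = [15.0, 25.0]                 # buckets: <15, 15 to <25, >=25

scaled = min_max_scale(temperatures)
bins = [bin_index(t, boundaries) for t in temperatures]

print(scaled)  # [0.0, 0.25, 0.5, 1.0]
print(bins)    # [0, 1, 1, 2]
```

Normalization keeps the relative spacing between values, while binning discards it and keeps only which range each value belongs to.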
Data Process
Visualize your data
Visualizations help you continually check your assumptions. pandas is a good tool for this; see:
- Working with Missing Data: pandas documentation
- Visualizations: pandas documentation
Statistically evaluate your data
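Use the pandas describe() method to get summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numerical column.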
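A short sketch of what this looks like in practice, using a small made-up DataFrame (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Made-up dataset; note the suspicious value 40 in the "rooms" column.
df = pd.DataFrame(
    {
        "rooms": [2, 3, 3, 4, 40],
        "price": [100.0, 150.0, 160.0, 210.0, 220.0],
    }
)

# describe() returns a DataFrame indexed by statistic name:
# count, mean, std, min, 25%, 50%, 75%, max.
stats = df.describe()
print(stats)

# Comparing the mean with the median (the "50%" row) is a quick sanity check.
print(stats.loc[["mean", "50%", "max"]])
```

Here the mean of "rooms" is far above its median, which hints that at least one value is unusually large and worth a closer look.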
Find outliers
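One possible way to flag outliers is the interquartile range (IQR) rule, sketched below. The notes do not prescribe this method; it is an assumption chosen for illustration, as are the data and the conventional 1.5 multiplier.

```python
import pandas as pd

# Made-up values; 40 is far from the rest.
rooms = pd.Series([2, 3, 3, 4, 40])

# First and third quartiles, and the spread between them.
q1 = rooms.quantile(0.25)
q3 = rooms.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends; ~ inverts the boolean mask.
outliers = rooms[~rooms.between(lower, upper)]
print(outliers.tolist())  # [40]
```

A flagged value is not necessarily a mistake; it is a prompt to check whether the value is an error or a genuine but rare case.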
Categorical Data
Categorical data can include numbers that behave like categories, such as postal codes.