Here are some concepts you need to understand about datasets:

  • There are four different characteristics of data and datasets.
  • There are four different causes of data unreliability.
  • What a label is.
  • You can improve the quality of human-rated labels.
  • When you train a model, you will subdivide a dataset into a training set, validation set, and test set.
  • What overfitting is.
  • What regularization is.

Data characteristics

Tables are an intuitive input format for machine learning models. You can imagine each row of the table as an example and each column as a potential feature or label.

Common sources of tabular data include:

  • comma-separated values (CSV) files
  • spreadsheets
  • database tables
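
To make the row-as-example, column-as-feature idea concrete, here is a minimal sketch using only the Python standard library. The CSV content, the column names (square_feet, bedrooms, price), and the choice of price as the label are all made up for illustration; they do not come from any real dataset.

```python
import csv
import io

# Hypothetical CSV content; the column names and values are illustrative only.
CSV_TEXT = """square_feet,bedrooms,price
1200,2,250000
1800,3,340000
950,1,180000
"""

# Assumption: we treat the "price" column as the label and the rest as features.
LABEL_COLUMN = "price"


def load_examples(text, label_column):
    """Turn each table row into one (features, label) example."""
    examples = []
    for row in csv.DictReader(io.StringIO(text)):
        label = float(row[label_column])
        features = {
            name: float(value)
            for name, value in row.items()
            if name != label_column
        }
        examples.append((features, label))
    return examples


examples = load_examples(CSV_TEXT, LABEL_COLUMN)
for features, label in examples:
    print(features, "->", label)
```

Each row becomes one example, and every column other than the chosen label becomes a candidate feature. Which column serves as the label depends on what you want the model to predict.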

Types of data

  • numerical data
  • categorical data
  • human language, including individual words and sentences
  • multimedia (such as images, videos, and audio files)
  • outputs from other ML systems
  • embedding vectors

Quantity of data

Quality and reliability of data

Complete vs. incomplete

Real-world examples are often incomplete, meaning that at least one feature value is missing.
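
One rough way to see how incomplete a dataset is: scan each example and record which feature values are missing. The sketch below assumes that a missing value shows up as an empty CSV field; the column names and readings are hypothetical, and real datasets may encode missing values differently (for example, as "NA" or a sentinel number).

```python
import csv
import io

# Hypothetical readings; empty fields stand in for missing feature values.
CSV_TEXT = """temperature,humidity,wind_speed
21.5,0.40,12
,0.55,8
19.0,,
23.1,0.35,15
"""


def split_complete(text):
    """Separate complete examples from those missing at least one value."""
    complete, incomplete = [], []
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: a missing value is an empty (or absent) field.
        missing = [
            name
            for name, value in row.items()
            if value is None or value.strip() == ""
        ]
        if missing:
            incomplete.append((row, missing))
        else:
            complete.append((row, missing))
    return complete, incomplete


complete, incomplete = split_complete(CSV_TEXT)
print(f"{len(complete)} complete, {len(incomplete)} incomplete")
for row, missing in incomplete:
    print("missing:", missing)
```

An example counts as incomplete as soon as one feature value is missing, which matches the definition above.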