Definitions and useful references

Listed here in the order in which they are used in the course.

Data Scientist: “Person who is better at statistics than any software engineer and better at software than any statistician.” (J. Wills)

Feature: A property of an instance that can be used in a prediction task. For example, "The household has 2 children".

Data, training and validation sets:

The original data set, which may have several dimensions, can be divided into two complementary data sets: the training data set and the validation data set.


The model is initially estimated on a training set, a subset of the full data set that is used to fit the model's parameters with an optimization method. At this stage the model is estimated with the training set only.

Subsequently, the estimated model is used to predict the responses for the observations in a second data set called the validation set. Comparing the observed values with the predicted ones provides an unbiased evaluation of the quality of the prediction on data not seen by the model. This indicator is then used to fine-tune the model's hyperparameters.

In practice, the validation set is in fact composed of randomly selected observations from the original data set, not of a fixed block of observations. See also https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
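The random split described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function name `train_validation_split`, the `validation_fraction` and `seed` parameters, and the toy data are assumptions made for this example, not part of the course material.

```python
import random


def train_validation_split(observations, validation_fraction=0.2, seed=0):
    """Randomly split observations into (training, validation) lists.

    Hypothetical helper for illustration; the names and defaults are assumptions.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(observations)))
    rng.shuffle(indices)  # random selection instead of a fixed block

    n_validation = int(round(len(observations) * validation_fraction))
    validation_idx = set(indices[:n_validation])

    # The two sets are complementary: every observation lands in exactly one.
    training = [obs for i, obs in enumerate(observations) if i not in validation_idx]
    validation = [obs for i, obs in enumerate(observations) if i in validation_idx]
    return training, validation


# Toy data set of 10 observations (made up for the example).
data = list(range(10))
training, validation = train_validation_split(data, validation_fraction=0.3, seed=42)
print("training:  ", training)
print("validation:", validation)
```

The model would then be fitted on `training` only, and its predictions compared with the observed values in `validation`.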

Cross-Validation (k-fold cross-validation): A technique that partitions a data set into complementary subsets: a training set used to perform the analysis, and a validation set used to validate the analysis on new data (unseen and unused in the analysis). In k-fold cross-validation the data set is split into k folds and the procedure is repeated k times, each fold serving once as the validation set while the remaining folds form the training set. This helps to tackle overfitting, since the model is judged on its predictions for many new (unseen) validation sets rather than a single one.

In practice, the observations are also assigned at random to the validation set of each of the k folds. See also https://en.wikipedia.org/wiki/Cross-validation_(statistics)
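As a rough sketch of k-fold cross-validation with random fold assignment, the example below uses a deliberately trivial "model" that predicts the mean of the training responses and scores it with the mean squared error. The helper names `k_fold_indices` and `cross_validate`, the toy model, the scoring choice and the data are all assumptions made for illustration.

```python
import random


def k_fold_indices(n_observations, k, seed=0):
    """Randomly assign observation indices to k folds of near-equal size."""
    if not 2 <= k <= n_observations:
        raise ValueError("k must be between 2 and the number of observations")
    rng = random.Random(seed)
    indices = list(range(n_observations))
    rng.shuffle(indices)
    # Fold j takes every k-th shuffled index, starting at position j.
    return [indices[j::k] for j in range(k)]


def cross_validate(responses, k=5, seed=0):
    """Return one validation score (mean squared error) per fold."""
    scores = []
    for validation_idx in k_fold_indices(len(responses), k, seed):
        held_out = set(validation_idx)
        training = [y for i, y in enumerate(responses) if i not in held_out]
        validation = [responses[i] for i in validation_idx]

        # "Fit" the toy model on the training folds only.
        prediction = sum(training) / len(training)

        # Evaluate on the held-out fold, which the model has not seen.
        mse = sum((y - prediction) ** 2 for y in validation) / len(validation)
        scores.append(mse)
    return scores


# Made-up responses for the example.
responses = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
scores = cross_validate(responses, k=5, seed=1)
average_score = sum(scores) / len(scores)
print("score per fold:", scores)
print("average score: ", average_score)
```

The average over the k folds is the figure one would typically compare across candidate hyperparameter settings.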

Bias: The consistent deviation of estimated results from the "true" (unknown) value. This is different from the (observed) error, which measures the difference between the estimates and the observed values.
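Bias can only be measured directly when the "true" value is known, which is the case in a simulation. The sketch below is one possible illustration, not taken from the course: it repeatedly draws small samples from a normal distribution with a known variance and applies the variance estimator that divides by n, which is known to underestimate the true variance by a factor (n - 1)/n on average. All names and constants (`TRUE_VARIANCE`, `SAMPLE_SIZE`, `N_REPETITIONS`, `variance_divide_by_n`) are assumptions for this example.

```python
import random

TRUE_VARIANCE = 4.0    # known only because the data are simulated
SAMPLE_SIZE = 5        # small samples make the bias easy to see
N_REPETITIONS = 20000  # number of simulated samples


def variance_divide_by_n(sample):
    """Variance estimator that divides by n (biased for small samples)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


rng = random.Random(0)
estimates = []
for _ in range(N_REPETITIONS):
    sample = [rng.gauss(0.0, TRUE_VARIANCE ** 0.5) for _ in range(SAMPLE_SIZE)]
    estimates.append(variance_divide_by_n(sample))

# Bias: the consistent deviation of the estimates from the true value,
# approximated here by averaging over many repetitions.
average_estimate = sum(estimates) / len(estimates)
estimated_bias = average_estimate - TRUE_VARIANCE

# Theory for this estimator: E[estimate] = (n - 1) / n * variance,
# so the bias is -variance / n.
theoretical_bias = -TRUE_VARIANCE / SAMPLE_SIZE

print(f"average estimate: {average_estimate:.3f}")
print(f"estimated bias:   {estimated_bias:.3f}")
print(f"theoretical bias: {theoretical_bias:.3f}")
```

Individual estimates scatter widely around their average; the bias is the part of the deviation that does not vanish when the experiment is repeated many times.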