Data-driven approach: collect a training dataset of labeled images and learn from it. Image classification pipeline: input (a training set of N images, each labeled with one of K classes) -> learning (train a classifier / learn a model) -> evaluation (predict labels for a new set of images).
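Nearest neighbor: compare two images pixel by pixel and add up the differences, e.g. with the L1 distance (sum of absolute differences) or the L2 distance (square root of the sum of squared differences).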
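A minimal sketch of the two distances, assuming each image has already been flattened into a plain list of pixel values of equal length. The function names `l1_distance` and `l2_distance` are illustrative, not from the notes.

```python
import math


def l1_distance(a, b):
    """L1 (Manhattan) distance: sum of absolute pixel differences."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """L2 (Euclidean) distance: square root of the sum of squared differences."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two tiny "images" with two pixels each (made-up values for illustration).
img_a = [0, 0]
img_b = [3, 4]
print(l1_distance(img_a, img_b))  # 7
print(l2_distance(img_a, img_b))  # 5.0
```

Both functions return 0 for identical images and grow as the images differ more; which one works better is itself a hyperparameter choice.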
k-nearest neighbors: find the top k closest images and let them vote on the label. A larger k tends to give smoother decision boundaries.
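The voting step can be sketched as follows. This assumes images are flat lists of pixel values and uses the L1 distance. The name `knn_predict` and the tie-breaking rule (a tie goes to the label that appears first among the nearest neighbors) are assumptions made for this example.

```python
from collections import Counter


def l1_distance(a, b):
    """Sum of absolute pixel differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_images, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training images."""
    # Compare the query against every stored training image.
    distances = [
        (l1_distance(image, query), label)
        for image, label in zip(train_images, train_labels)
    ]
    # Sort by distance only; the sort is stable, so equal distances keep training order.
    distances.sort(key=lambda pair: pair[0])
    top_k_labels = [label for _, label in distances[:k]]
    # most_common(1) returns the label with the most votes.
    return Counter(top_k_labels).most_common(1)[0][0]


# Made-up toy data: two "cat" images near the origin, three "dog" images far away.
train_images = [[0, 0], [1, 1], [10, 10], [11, 11], [12, 12]]
train_labels = ["cat", "cat", "dog", "dog", "dog"]

print(knn_predict(train_images, train_labels, [2, 2], k=3))  # cat
print(knn_predict(train_images, train_labels, [2, 2], k=5))  # dog
```

With k=5 every training image votes, so the majority class wins regardless of the query, which illustrates why k has to be tuned rather than simply made large.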
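Hyperparameters (e.g. k, the distance metric): the test set cannot be used to tweak hyperparameters, because that overfits to the test set and the reported performance no longer reflects generalization. To tune hyperparameters, split the training set in two (a training set and a slightly smaller validation set) and choose the best k on the validation set. Cross-validation: iterate over different validation sets and average the performance. **In practice**, avoid cross-validation because it is computationally expensive; usually use 50%-90% of the training data to train and the rest to validate.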
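A rough sketch of both tuning strategies, assuming the data is already shuffled and small enough to hold in lists. The helper names (`choose_k_holdout`, `cross_validate`), the 80% split, and the toy data are assumptions for illustration only.

```python
from collections import Counter


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training images (L1 distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: l1_distance(p[0], query))
    return Counter(label for _, label in nearest[:k]).most_common(1)[0][0]


def accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation images whose predicted label is correct."""
    correct = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(val_x, val_y))
    return correct / len(val_x)


def choose_k_holdout(images, labels, candidate_ks, train_fraction=0.8):
    """Single validation split: train on the first part, validate on the rest."""
    split = int(len(images) * train_fraction)
    train_x, val_x = images[:split], images[split:]
    train_y, val_y = labels[:split], labels[split:]
    scores = {k: accuracy(train_x, train_y, val_x, val_y, k) for k in candidate_ks}
    # On ties, max() keeps the first candidate in `candidate_ks`.
    return max(scores, key=scores.get), scores


def cross_validate(images, labels, k, num_folds=5):
    """Average validation accuracy over `num_folds` different validation sets."""
    fold_size = len(images) // num_folds  # leftover samples stay in training
    scores = []
    for fold in range(num_folds):
        start, end = fold * fold_size, (fold + 1) * fold_size
        val_x, val_y = images[start:end], labels[start:end]
        train_x = images[:start] + images[end:]
        train_y = labels[:start] + labels[end:]
        scores.append(accuracy(train_x, train_y, val_x, val_y, k))
    return sum(scores) / len(scores)


# Made-up one-pixel "images": class "a" near 0, class "b" near 10.
images = [[0], [10], [1], [11], [2], [12], [0], [10], [1], [11]]
labels = ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"]

best_k, scores = choose_k_holdout(images, labels, candidate_ks=[1, 3])
print(best_k, scores)
print(cross_validate(images, labels, k=1, num_folds=5))
```

The test set never appears in either function: it is only touched once, at the very end, with the hyperparameters already fixed.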
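Nearest neighbor classifier: training just stores the data, so it takes essentially no time. Prediction is slow, because each test image has to be compared with every training image.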