Assignment 1

Dates and rules.

This assignment is for groups of 2 students. Student groups must be registered by October 7.

The deadline for submitting the assignment is October 18, plus 48 hours for solving any problems with your submission. No submissions will be accepted after 23:59 on October 20.

You must submit a zip file containing, at least, these three files, named exactly as specified (names are case-sensitive):

report.pdf
This is your report, in PDF format.
tp1.py
This is a Python 3.x script that can be used to run your code for this assignment.
TP1-data.csv
This is the original data file we provide.

You can include other files in the zip file if necessary (e.g. other Python modules, if you wish to separate your code into several files).

Your report may be in English or Portuguese.

Objective

The goal of this assignment is to parametrize, fit and compare logistic regression, K-nearest neighbours and Naive Bayes classifiers. The data set is the banknote authentication data set, which you can download here: TP1-data.csv.

This data set was obtained from the UCI machine learning repository.

You will have to load and pre-process the data, select the best regularization parameter for the logistic regression (using the four features provided; do not add extra features), select the best k value for the K-Nearest Neighbours classifier and select the best bandwidth parameter for the Kernel Density Estimators used in the Naive Bayes classifier. Assume the same bandwidth for all Kernel Density Estimators used.

You must implement your own Naive Bayes classifier, using Kernel Density Estimators for the probability distributions of the features. For this, you can use any code from the lectures, lecture notes and tutorials that you find useful. You can also use the KernelDensity class from sklearn.neighbors for the density estimation (older Scikit-learn releases exposed it as sklearn.neighbors.kde). For the other two classifiers, use the Logistic Regression and K-Nearest Neighbours implementations available in the Scikit-learn library.
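
As a rough illustration of the idea, and not a reference solution, the sketch below fits one one-dimensional Gaussian KDE per class and per feature, then classifies each point by the largest sum of log prior and per-feature log densities. The function names fit_kde_naive_bayes and predict_kde_naive_bayes are hypothetical, chosen only for this example.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def fit_kde_naive_bayes(X, y, bandwidth):
    """Fit one 1-D Gaussian KDE per (class, feature) pair and store the log priors.

    Hypothetical helper for illustration; the same bandwidth is used everywhere.
    """
    model = {}
    for label in np.unique(y):
        rows = X[y == label]
        log_prior = np.log(rows.shape[0] / X.shape[0])
        kdes = [
            KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(rows[:, [j]])
            for j in range(X.shape[1])
        ]
        model[label] = (log_prior, kdes)
    return model


def predict_kde_naive_bayes(model, X):
    """Return, for each row of X, the class with the highest log posterior score."""
    labels = sorted(model)
    scores = np.empty((X.shape[0], len(labels)))
    for col, label in enumerate(labels):
        log_prior, kdes = model[label]
        # Naive Bayes assumption: features are conditionally independent,
        # so the per-feature log densities are simply added.
        log_likelihood = sum(
            kde.score_samples(X[:, [j]]) for j, kde in enumerate(kdes)
        )
        scores[:, col] = log_prior + log_likelihood
    return np.array(labels)[scores.argmax(axis=1)]
```

Working with log densities (score_samples returns logarithms) avoids numerical underflow when several small densities are multiplied.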

Finally, you must compare the performance of the three classifiers, choose the best and discuss if it is significantly better than the others.
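
One way to judge whether a difference is significant is McNemar's test, which the implementation guidelines ask for. The sketch below is a minimal version with continuity correction, computed from the test-set predictions of two classifiers. The function names are hypothetical, and the threshold 3.84 is the critical value of the chi-squared distribution with one degree of freedom at the 5% significance level.

```python
def mcnemar_statistic(y_true, pred_a, pred_b):
    """McNemar's statistic (with continuity correction) for classifiers A and B."""
    # e01: examples that A misclassifies and B classifies correctly.
    e01 = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    # e10: examples that A classifies correctly and B misclassifies.
    e10 = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    if e01 + e10 == 0:
        return 0.0  # the two classifiers never disagree on correctness
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)


# Chi-squared critical value, 1 degree of freedom, 5% significance level.
CHI2_CRITICAL_95 = 3.84


def significantly_different(y_true, pred_a, pred_b):
    """True if the null hypothesis of equal performance is rejected at 95%."""
    return mcnemar_statistic(y_true, pred_a, pred_b) >= CHI2_CRITICAL_95
```

Only the examples on which exactly one of the two classifiers is wrong contribute to the statistic; examples that both get right or both get wrong carry no information about which classifier is better.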

The data are available in a .csv file in which each line corresponds to a bank note. The five comma-separated values on each line are, in order, the four features (variance, skewness and kurtosis of the wavelet-transformed image, and the entropy of the bank note image) and the class label, an integer with value 0 or 1 that distinguishes real bank notes from fake ones.
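
A minimal loading sketch, assuming the file has no header row and using NumPy only, could look like the following. The function name load_banknotes and the seed parameter are hypothetical, and standardizing over the whole data set is just one simple choice.

```python
import numpy as np


def load_banknotes(source, seed=0):
    """Load the CSV, shuffle the rows and standardize the four features.

    `source` is a file name or an open text file. `seed` is an illustrative
    parameter, not something the assignment specifies.
    """
    data = np.loadtxt(source, delimiter=",")
    rng = np.random.default_rng(seed)
    rng.shuffle(data)  # shuffles the rows in place
    features = data[:, :4]
    labels = data[:, 4].astype(int)
    # Standardize each feature to zero mean and unit standard deviation.
    features = (features - features.mean(axis=0)) / features.std(axis=0)
    return features, labels
```

With the provided file in the working directory, a call such as load_banknotes("TP1-data.csv") would return the feature matrix and the label vector.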

Your report should explain and justify the parameter selections and final conclusions, show the relevant error plots and explain your implementation of the Naive Bayes classifier.

Guidelines for the implementation

These are suggestions to help you understand what is required in this assignment. This section may be updated to clarify questions that arise during the assignment.

  • Process the data correctly, including randomizing the order of the data points and standardizing the values.
  • Determine the parameters with cross validation on two thirds of the data, leaving one third out for testing.
  • For the regularization parameter of the logistic regression classifier, start with a C value of 1 and double it at each iteration for 20 iterations. Plot the errors against the logarithm of the C value.
  • For the k value of the K-Nearest Neighbours classifier, test k values from 1 to 39 using odd values only.
  • Use a Gaussian kernel (default) for all the Kernel Density Estimators in your Naive Bayes classifier.
  • Use the same bandwidth value for all the Kernel Density Estimators in your Naive Bayes classifier, and try values from 0.01 to 1 with a step of 0.02.
  • When splitting your data, for testing and for cross validation, use stratified sampling.
  • Use 5 folds for cross validation.
  • Use the fraction of incorrect classifications as the measure of the error. This is equal to 1-accuracy, and the accuracy can be obtained with the score method of the logistic regression and KNN classifiers in Scikit-learn.
  • For the NB classifier, you can implement your own measure of the accuracy or use the accuracy_score function in the metrics module.
  • For comparing the classifiers, use McNemar's test at a 95% confidence level.
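
The parameter search described in the list above can be sketched as follows. This is only an outline under stated assumptions: the helper names stratified_split, cross_val_errors and select_best are hypothetical, and X and y are assumed to be NumPy arrays that are already shuffled and standardized.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Candidate values taken from the guidelines.
C_VALUES = [2.0 ** i for i in range(20)]   # 1, 2, 4, ..., doubling 20 times
K_VALUES = list(range(1, 40, 2))           # odd values from 1 to 39
BANDWIDTHS = np.arange(0.01, 1.0, 0.02)    # from 0.01 in steps of 0.02


def stratified_split(X, y, seed=0):
    """Keep one third of the data for testing, preserving class proportions."""
    return train_test_split(X, y, test_size=1 / 3, stratify=y, random_state=seed)


def cross_val_errors(make_classifier, X, y, folds=5):
    """Return the mean (training error, validation error) over stratified folds."""
    splitter = StratifiedKFold(n_splits=folds)
    train_err, valid_err = 0.0, 0.0
    for train_idx, valid_idx in splitter.split(X, y):
        clf = make_classifier().fit(X[train_idx], y[train_idx])
        # score() returns the accuracy, so the error is 1 - accuracy.
        train_err += 1.0 - clf.score(X[train_idx], y[train_idx])
        valid_err += 1.0 - clf.score(X[valid_idx], y[valid_idx])
    return train_err / folds, valid_err / folds


def select_best(values, make_classifier_for, X, y):
    """Pick the candidate value with the lowest mean validation error."""
    errors = [
        cross_val_errors(lambda v=v: make_classifier_for(v), X, y)[1]
        for v in values
    ]
    return values[int(np.argmin(errors))], errors


# Hypothetical usage on the training portion only:
# X_tr, X_te, y_tr, y_te = stratified_split(X, y)
# best_c, c_errs = select_best(C_VALUES, lambda c: LogisticRegression(C=c), X_tr, y_tr)
# best_k, k_errs = select_best(K_VALUES, lambda k: KNeighborsClassifier(n_neighbors=k), X_tr, y_tr)
```

The test third is never touched during the search; it is used once, at the end, to estimate the true error of each classifier trained with its selected parameter.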

Guidelines for the report

The report for this assignment should:

  • Explain your implementation of the Naïve Bayes classifier
  • Explain what the three optimized parameters are and how each one affects its classifier. This will also require a brief explanation of how each classifier works.
  • Explain the method by which the optimal values were found, noting the differences between the training and validation errors and the importance of leaving out a test set for the final evaluation
  • Estimate the true error for each classifier, compare the classifiers with McNemar's test and discuss which, if any, is better for this application
  • The report should show the error plots but no other plots are necessary (since the data is in four dimensions, plotting the data and the classification would not be a simple task).