K-Nearest Neighbor Machine Learning Algorithm

The German credit dataset can be downloaded from the UC Irvine Machine Learning Repository. It records whether each loan applicant defaulted or not, which is the outcome we want to predict. This post applies logistic regression with three variables (duration, amount, and installment), K-means classification, and the K-nearest neighbor machine learning algorithm.

# Logistic regression
# Load the file from the hard disk after setting the work directory
germandata <- read.csv("Creditdata.csv")
# Print the dataset to see the pattern of the data
germandata
# The Response variable is used to evaluate the probability of a default on the credit loan
germandata$Response <- factor(germandata$Response)
# Subset the data to the variables duration, amount, installment, and Response
germandata <- germandata[, c("duration", "amount", "installment", "Response")]
# Print the dataset to see the data for these variables
germandata
# Run summary() on the dataset
summary(germandata)

# Sample output (first rows):
> germandata
duration amount installment Response
1 6 1169 A143 1
2 48 5951 A143 2
3 12 2096 A143 1
4 42 7882 A143 1
5 24 4870 A143 2
6 36 9055 A143 1
7 24 2835 A143 1
8 36 6948 A143 1
9 12 3059 A143 1
10 30 5234 A143 2
11 12 1295 A143 2

# Create the matrix for generating the indicator variables

creditmatrix <- model.matrix(Response ~ ., data = germandata)[, -1]
creditmatrix[1:1000, ]

# Generate training and testing datasets:
# select 900 cases as training data and the rest as the testing dataset.
# Use set.seed() so the random sampling is reproducible
set.seed(1)
# Training set
sampledata <- sample(1:1000, 900)
crematrix1 <- creditmatrix[sampledata, ]
crematrix11 <- creditmatrix[-sampledata, ]
crematrix2 <- germandata$Response[sampledata]
crematrix22 <- germandata$Response[-sampledata]
germanglm <- glm(Response ~ ., family = binomial,
                 data = data.frame(Response = crematrix2, crematrix1))
summary(germanglm)

# Output
> germanglm <- glm(Response ~ ., family = binomial, data = data.frame(Response = crematrix2, crematrix1))
> summary(germanglm)

Call:
glm(formula = Response ~ ., family = binomial, data = data.frame(Response = crematrix2,
crematrix1))

Deviance Residuals:
Min 1Q Median 3Q Max
-1.7156 -0.8321 -0.6935 1.2619 1.8573

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.140e+00 2.292e-01 -4.975 6.54e-07 ***
duration 3.035e-02 7.638e-03 3.973 7.09e-05 ***
amount 3.852e-05 3.187e-05 1.209 0.22674
installmentA142 -2.138e-01 3.748e-01 -0.570 0.56846
installmentA143 -5.877e-01 2.021e-01 -2.909 0.00363 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1106.2 on 899 degrees of freedom
Residual deviance: 1057.4 on 895 degrees of freedom
AIC: 1067.4

Number of Fisher Scoring iterations: 4
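
Since the glm() coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to read. This step is not in the original listing but follows directly from the output above:

# Express the fitted coefficients as odds ratios
exp(coef(germanglm))
# e.g. exp(0.03035) is about 1.031, so each extra month of duration
# multiplies the odds of a default by roughly 1.03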

#Testing set
testingset <- predict(germanglm, newdata = data.frame(crematrix11), type = "response")
data.frame(crematrix22, testingset)

# Sample output
> data.frame(crematrix22, testingset)
crematrix22 testingset
832 2 0.2433087
767 1 0.3323543
273 2 0.5492819
188 2 0.2320270
225 1 0.2510320
62 2 0.2291168
60 1 0.4024574
148 1 0.2079565
73 1 0.2989247
43 2 0.2804005
518 1 0.4108038
781 1 0.4123850
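
To turn the predicted probabilities into class labels, a common (if arbitrary) choice is a 0.5 cutoff. A minimal sketch, not in the original, assuming level 2 marks a default as in the data above:

# Classify as level 2 (default) when the predicted probability exceeds 0.5
predictedclass <- ifelse(testingset > 0.5, 2, 1)
# Confusion matrix of actual outcomes vs. predictions on the 100 test cases
table(actual = crematrix22, predicted = predictedclass)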

K-means classification

K-means classification partitions the dataset into clusters. The kmeans() function takes the data matrix kmeansclass and the number of clusters, here 2, and performs the K-means classification. Other functions such as pam() and pamk() can be applied as well; in this case I applied the kmeans() function (Quick-R, n.d.).
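
For comparison, here is a minimal sketch of the pam() alternative mentioned above; it assumes the cluster package is installed and mirrors the kmeansclass matrix built further below (pamk() lives in the fpc package and can pick the number of clusters automatically):

library(cluster)

# Same four columns used for kmeans() below; cbind() coerces the
# factor columns to their integer codes
kmeansclass <- cbind(germandata$Response, germandata$duration,
                     germandata$amount, germandata$installment)

# Partitioning Around Medoids with 2 clusters; medoids are actual
# observations, which makes the cluster centers easier to interpret
pamresult <- pam(kmeansclass, 2)
pamresult$medoids            # the two representative observations
table(pamresult$clustering)  # cluster sizes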

library(class)

# Load the file from the hard disk
germandata <- read.csv("Creditdata.csv")
# Print the dataset to see the pattern of the data
germandata
# The Response variable is used to evaluate the probability of a default on the credit loan
germandata$Response <- factor(germandata$Response)
germandata <- germandata[, c("duration", "amount", "installment", "Response")]
germandata[1:5, ]

# Output
> germandata[1:5,]
duration amount installment Response
1 6 1169 A143 1
2 48 5951 A143 2
3 12 2096 A143 1
4 42 7882 A143 1
5 24 4870 A143 2

#Print summary of German data

summary(germandata)
#Output

> summary(germandata)
duration amount installment Response
Min. : 4.0 Min. : 250 A141:139 1:700
1st Qu.:12.0 1st Qu.: 1366 A142: 47 2:300
Median :18.0 Median : 2320 A143:814
Mean :20.9 Mean : 3271
3rd Qu.:24.0 3rd Qu.: 3972
Max. :72.0 Max. :18424

# For the K-means classification, use the duration, amount, and
# installment columns alongside Response; note that cbind() coerces
# the factor columns (Response, installment) to their integer codes
kmeansclass <- cbind(germandata$Response, germandata$duration,
                     germandata$amount, germandata$installment)
result <- kmeans(kmeansclass, 2)
result
result$cluster

# Output
> result
K-means clustering with 2 clusters of sizes 825, 175

Cluster means:
[,1] [,2] [,3] [,4]
1 1.275152 17.96000 2185.781 2.687273
2 1.417143 34.77714 8388.509 2.617143

Clustering vector: (Sample output):
[1] 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 2 2 2
[991] 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 1125263911 1280057044
(between_SS / total_SS = 69.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"
[7] "size"         "iter"         "ifault"

> result$cluster (Sample output):
[1] 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 2 2 2
[46] 1 1 1 2 1 1 2 1 1 1 1 2 2 1 2 1 1 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1
(Stats, n.d.)
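
Two checks worth adding (neither is in the original): comparing the clusters against the actual Response labels, and standardizing the columns first, since amount is on a much larger scale than the others and dominates the Euclidean distances:

# How the two clusters line up with the actual good/bad outcomes
table(cluster = result$cluster, response = germandata$Response)

# Rescale each column to mean 0 and standard deviation 1 so that
# amount does not dominate, then re-run K-means
scaledclass <- scale(kmeansclass)
scaledresult <- kmeans(scaledclass, 2)
scaledresult$size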

Cross-validation with k = 5 for the nearest neighbor

The K-nearest neighbor algorithm, applied with the knn() function to the training and testing German credit datasets, predicts the outcome for each new observation from the outcomes of its closest neighbors in the training dataset. K-nearest neighbor is a machine learning algorithm. It gave me numerous errors when applied to the training and testing datasets. The only way I was able to resolve them was to subset the training and testing data with drop = FALSE, which makes R keep the result as a data frame instead of collapsing it to a vector. Once I subset with drop = FALSE, knn() accepted the datasets. I used the three continuous variables duration, amount, and installment (Stats, n.d.).
## k-nearest neighbor method
library(class)
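
The objects xtrain, xnew, ytrain, and ynew are used below but never defined in the listing; a plausible reconstruction, assuming the same 900/100 split drawn with set.seed(1) in the logistic regression section:

# Reconstructed split: 900 training rows, 100 testing rows
set.seed(1)
sampledata <- sample(1:1000, 900)
xtrain <- germandata[sampledata, c("duration", "amount", "installment")]
xnew   <- germandata[-sampledata, c("duration", "amount", "installment")]
ytrain <- germandata$Response[sampledata]
ynew   <- germandata$Response[-sampledata]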

nearest5 <- knn(train = xtrain[, 2, drop = FALSE], test = xnew[, 2, drop = FALSE], cl = ytrain, k = 5)
nearest5

# Output
> nearest5
[1] 1 2 2 1 2 2 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1
[47] 1 1 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1
[93] 1 1 1 1 1 1 2 2
Levels: 1 2

data.frame(ynew,nearest5)[1:10,]

> data.frame(ynew,nearest5)[1:10,]
ynew nearest5
1 2 1
2 1 2
3 2 2
4 1 1
5 1 2
6 1 2
7 1 2
8 1 1
9 1 1
10 1 1

## calculate the percentage of correct classifications over the 100 test cases
corrclassfn5=100*sum(ynew==nearest5)/100
corrclassfn5
#Output
> ## correct classifications
> corrclassfn5=100*sum(ynew==nearest5)/100
> corrclassfn5
[1] 69
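
The section title mentions cross-validation; knn.cv() from the class package makes that literal by classifying each training case from its neighbors among the remaining training cases (leave-one-out). A minimal sketch for comparing several values of k; the loop is my addition, not part of the original:

# Leave-one-out cross-validation accuracy on the training data
# for several candidate values of k
for (k in c(1, 3, 5, 7, 9)) {
  loofit <- knn.cv(train = xtrain[, 2, drop = FALSE], cl = ytrain, k = k)
  cat("k =", k, "accuracy =", mean(loofit == ytrain), "\n")
}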

References

Quick-R (n.d.). Cluster Analysis. Retrieved February 26, 2016, from http://www.statmethods.net/advstats/cluster.html
Stats (n.d.). K-Means Clustering. Retrieved February 23, 2016, from https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
Stats (n.d.). k-Nearest Neighbour Classification. Retrieved February 26, 2016, from https://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html
