Machine Learning Algorithms

K-Nearest Neighbors (KNN)

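K-nearest neighbors (KNN) is a supervised learning algorithm. It has no explicit training phase: it keeps the labeled data as-is and produces a result by comparing each new sample with the stored samples, so learning happens only implicitly, at prediction time. KNN can be used for classification problems as well as regression problems.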
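Since regression is mentioned alongside classification, here is a minimal sketch of KNN regression. It assumes the prediction is the plain, unweighted mean of the targets of the k nearest samples. The function name `knn_regress` and the toy data are hypothetical and serve only as an illustration.

```python
import numpy as np

def knn_regress(X, y, query, k=3):
    """Predict a numeric target as the mean of the k nearest neighbours' targets."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Euclidean distance from the query to every stored sample
    dists = np.sqrt(((X - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]  # indices of the k closest samples
    return float(y[nearest].mean())

# Toy data (made up): one feature, target equal to the feature value
X_toy = [[0.0], [1.0], [2.0], [10.0]]
y_toy = [0.0, 1.0, 2.0, 10.0]
pred = knn_regress(X_toy, y_toy, [1.2], k=3)
print(pred)
```

The only difference from classification is the last step: the neighbours' targets are averaged instead of being counted as votes.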

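Steps

For each point in the dataset whose class is unknown, perform the following operations in turn:

Compute the distance between the current point and every point in the dataset of known classes
Sort those points by increasing distance
Select the k points closest to the current point
Count how often each class occurs among these k points
Return the most frequent class among the k points as the predicted class of the current point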
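The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the name `knn_classify` and the toy points are assumptions, and ties are resolved in favour of the class that appears first among the nearest neighbours.

```python
from collections import Counter
import numpy as np

def knn_classify(X, labels, query, k=3):
    X = np.asarray(X, dtype=float)
    # Step 1: distance from every labelled point to the query point
    dists = np.sqrt(((X - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
    # Steps 2-3: sort by increasing distance and keep the k closest
    nearest = np.argsort(dists, kind="stable")[:k]
    # Step 4: count how often each class appears among them
    votes = Counter(labels[i] for i in nearest)
    # Step 5: return the most frequent class
    return votes.most_common(1)[0][0]

# Toy data (made up): two well-separated groups
X_demo = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9]]
labels_demo = ["A", "A", "A", "B", "B"]
result = knn_classify(X_demo, labels_demo, [1.1, 1.0], k=3)
print(result)
```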

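To measure the distance between a sample and the current point, the Euclidean distance is usually used: $$d=\sqrt{(x_1-x_0)^2+(y_1-y_0)^2}$$ Changing the exponent in this formula (together with the matching root) yields other distance measures; with an exponent of 1, for example, it becomes the Manhattan distance.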
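The remark about changing the exponent can be made concrete with the Minkowski distance of order $p$, which for two-dimensional points reads $$d_p=\left(|x_1-x_0|^p+|y_1-y_0|^p\right)^{1/p}$$ The sketch below evaluates it for $p=1$ (Manhattan) and $p=2$ (Euclidean). The helper name `minkowski_distance` and the sample points are hypothetical.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points (p >= 1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

p0, p1 = (0.0, 0.0), (3.0, 4.0)
d_manhattan = minkowski_distance(p0, p1, p=1)  # |3| + |4|
d_euclidean = minkowski_distance(p0, p1, p=2)  # sqrt(3^2 + 4^2)
print(d_manhattan, d_euclidean)
```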

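Consider the example shown in the figure below.

1. When the solid circle is taken as the neighborhood, i.e. $k=3$, triangles are the most frequent class among the three points nearest to the green point, so the green point is classified as a triangle.

2. When the dashed circle is taken as the neighborhood, i.e. $k=5$, squares are the most frequent class among the five points nearest to the green point, so the green point is classified as a square.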
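The effect of k can be reproduced with a small made-up point set. The coordinates below are assumptions chosen only to mimic the situation described (they are not read off the figure): the two nearest neighbours of the query point are triangles and the next three are squares.

```python
import math
from collections import Counter

query = (0.0, 0.0)  # plays the role of the green point
points = [
    ((0.5, 0.0), "triangle"),
    ((0.0, 0.6), "triangle"),
    ((-0.7, 0.0), "square"),
    ((1.0, 0.0), "square"),
    ((0.0, -1.1), "square"),
    ((2.0, 2.0), "triangle"),
]

# Class labels ordered from the nearest point to the farthest
ranked = [label for _, label in sorted(points, key=lambda p: math.dist(p[0], query))]

def vote(k):
    """Majority class among the k nearest points."""
    return Counter(ranked[:k]).most_common(1)[0][0]

print(vote(3), vote(5))
```

With these points the majority flips from triangle at $k=3$ to square at $k=5$, which is why k is usually treated as a parameter to tune rather than fixed in advance.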

The following uses the iris dataset as an example to test KNN.

Code

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# print(df)
df['label'] = iris.target
# print(len(df))  # 150
# print(df.info())  # show the column data types
# Draw scatter plots: one color per sample, chosen by its class label
label_colors = {0: 'black', 1: 'red', 2: 'orange'}
Colors = [label_colors[item] for item in df.iloc[:, -1]]  # last column is the label
# First two features (sepal length vs. sepal width)
plt.rcParams['font.sans-serif'] = ['Simhei']
bgplt = plt.figure(figsize=(12, 8))

fig1 = bgplt.add_subplot(221)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], marker='.', c=Colors)
plt.xlabel('sepal length')
plt.ylabel('sepal width')

# Features 1 and 3 (sepal length vs. petal length)
fig2 = bgplt.add_subplot(222)
plt.scatter(df.iloc[:, 0], df.iloc[:, 2], marker='.', c=Colors)
plt.xlabel('sepal length')
plt.ylabel('petal length')

# Features 3 and 4 (petal length vs. petal width)
fig3 = bgplt.add_subplot(223)
plt.scatter(df.iloc[:, 2], df.iloc[:, 3], marker='.', c=Colors)
plt.xlabel('petal length')
plt.ylabel('petal width')

plt.show()

new_data = [5.7, 3.3, 6.2, 0.7]

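def KnnAlgorithm(data, target, k=3):
    # Euclidean distance from target to every sample (the first four columns are the features)
    tmp_list = list((((data.iloc[:, 0:4] - target) ** 2).sum(axis=1)) ** 0.5)
    # Pair each distance with the label of its sample (the last column)
    dist_l = pd.DataFrame({'dist': tmp_list, 'label': data.iloc[:, -1]})
    # Sort by increasing distance and keep the k nearest samples
    dist_sort = dist_l.sort_values(by='dist')[:k]
    # print(dist_sort)
    # Count the labels among the k nearest samples and return the most frequent one
    res = dist_sort.loc[:, 'label'].value_counts()
    return res.index[0]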


print(KnnAlgorithm(df, new_data, k=4))