ml01 k-近邻与距离度量 — demo.py 代码详解

运行方式

bash

cd ml01_knn/code
python demo.py

代码逐段详解

第1步：导入库 — 每个库是做什么的

python

import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_moons, make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error

os：文件路径操作，用于创建 images/ 目录和生成文件保存路径
numpy：数值计算核心。关键用法：np.sum() 计算平方和、np.argsort() 对距离排序找最近邻、np.sqrt() 开平方、np.linalg.norm() 计算向量范数、np.linalg.pinv() 伪逆（马氏距离用）、np.cov() 协方差矩阵、np.clip() 限制数值范围、np.maximum() 元素级取最大值
matplotlib：绘图，包括决策边界填充图、距离等距线图、维数灾难曲线图、准确率对比图
sklearn.neighbors.KNeighborsClassifier：sklearn 的标准 k-NN 实现，用作基准对比
sklearn.datasets.make_moons：生成半月形数据集（非线性可分，测试决策边界的好工具）
sklearn.datasets.make_classification：生成一般分类数据集
sklearn.preprocessing.StandardScaler：特征标准化（每个维度减去均值除以标准差）。k-NN 对特征尺度极其敏感——如果某维度的数值范围远大于其他维度，它会主导距离计算
sklearn.metrics.accuracy_score：分类准确率

第2步：四种距离度量的向量化实现

2.1 欧氏距离的展开公式

python

def euclidean_distance(X_test, X_train):
    sq_test = np.sum(X_test ** 2, axis=1, keepdims=True)   # (m, 1)
    sq_train = np.sum(X_train ** 2, axis=1)                # (n,)
    cross = 2.0 * X_test @ X_train.T                       # (m, n)
    sq_dists = sq_test + sq_train - cross
    return np.sqrt(np.maximum(sq_dists, 0.0))

这个实现看起来比直观的方式复杂，但它避免了显式广播大矩阵。

直观写法（内存浪费）：

python

diff = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]  # (m, n, d) — 大!
distances = np.sqrt(np.sum(diff ** 2, axis=2))

展开公式写法利用了数学恒等式：

∥ a - b ∥^{2} = ∥ a ∥^{2} + ∥ b ∥^{2} - 2 a \cdot b

这样只需要计算 (m, d) @ (d, n) = (m, n) 的矩阵乘法——非常高效。

np.maximum(sq_dists, 0.0) 是一个数值稳定性保护：浮点误差可能导致极小负值（如 -1e-15），传递给 np.sqrt() 会返回 NaN。用 maximum 把负值 clamp 到 0 即可避免此问题。

2.2 曼哈顿距离（L1）

python

def manhattan_distance(X_test, X_train):
    abs_diff = np.abs(X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :])
    return np.sum(abs_diff, axis=2)

曼哈顿距离是各维度差值的绝对值之和：

d_{Man} (x, y) = \sum_{i = 1}^{d} | x_{i} - y_{i} |

与欧氏距离平方不一样，L1 不放大差异。如果数据中有异常值，异常值在某个维度上的大偏差对 L1 距离的影响远小于对 L2 距离的影响（因为 L2 会平方放大）。

这里用了 broadcasting：X_test[:, np.newaxis, :] 将形状 (m, d) 变为 (m, 1, d)，与 (1, n, d) 相减得到 (m, n, d)。

2.3 余弦距离

python

def cosine_distance(X_test, X_train):
    norm_test = np.linalg.norm(X_test, axis=1, keepdims=True)
    norm_train = np.linalg.norm(X_train, axis=1)
    dot = X_test @ X_train.T
    cos_sim = dot / (norm_test @ norm_train[np.newaxis, :] + 1e-10)
    cos_sim = np.clip(cos_sim, -1.0, 1.0)
    return 1.0 - cos_sim

关键步骤详解：

np.linalg.norm(X_test, axis=1, keepdims=True)：沿特征维度（axis=1）计算 L2 范数，keepdims=True 保留列向量形状 (m, 1) 而非 (m,)——这对后续的矩阵乘法至关重要
norm_test @ norm_train[np.newaxis, :]：这里没有直接用点积，而是用外积来计算所有范数乘积的组合，得到 (m, n) 的矩阵
np.clip(cos_sim, -1.0, 1.0)：数值稳定性。理论上余弦相似度在 $[- 1, 1]$ 之间，但浮点运算可能产生 1.0000001 或 -1.0000001。np.clip 将其裁剪回合理范围
返回 1.0 - cos_sim：将余弦相似度转换为距离（相似度越高，距离越小）

2.4 马氏距离

python

def mahalanobis_distance(X_test, X_train):
    Sigma = np.cov(X_train.T)
    Sigma_inv = np.linalg.pinv(Sigma)
    ...
    for i in range(m):
        diff = X_test[i] - X_train
        distances[i] = np.sqrt(np.sum((diff @ Sigma_inv) * diff, axis=1))

马氏距离的公式：

d_{Mah} = \sqrt{(x - y)^{T} Σ^{- 1} (x - y)}

np.cov(X_train.T) 计算协方差矩阵（注意转置：sklearn/NumPy 的数据形状是 (n_samples, n_features)，而 np.cov 期望 (n_features, n_samples)）

np.linalg.pinv(Sigma) 用伪逆代替标准逆。当样本数 < 特征数时，协方差矩阵是奇异的——伪逆通过 SVD 分解给出一个稳定近似。

(diff @ Sigma_inv) * diff 是一个小技巧：先计算 $diff \cdot Σ^{- 1}$ （得到一个 (n_train, d) 矩阵），再逐元素乘以 diff，最后 np.sum(axis=1) 得到每行的二次型结果。等价于逐行计算 ${diff}_{j} Σ^{- 1} {diff}_{j}^{T}$ ，但利用了批量运算。

第3步：自定义 KNNClassifier 类

3.1 fit() — 惰性学习的本质

python

def fit(self, X, y):
    self.X_train = np.asarray(X, dtype=np.float64)
    self.y_train = np.asarray(y)
    return self

k-NN 的 fit() 只是存储训练数据——没有权重更新、没有梯度下降、没有迭代。这就是"A lazy learner 从不主动学习，只在被问到问题时才去翻课本"。

3.2 predict() — 预测流程

python

def predict(self, X_test):
    distances = METRIC_FUNCTIONS[self.metric](X_test, self.X_train)
    top_k_indices = np.argsort(distances, axis=1)[:, :self.k]
    top_k_labels = self.y_train[top_k_indices]
    ...

核心步骤：

计算距离矩阵：用选择的距离度量函数，得到 (m_test, n_train) 的距离矩阵
排序找近邻：np.argsort(distances, axis=1) 沿测试样本维度排序，返回从小到大的索引。[:, :self.k] 取前 k 个，即最近的 k 个邻居
索引标签：self.y_train[top_k_indices] 用 fancy indexing 获取所有近邻的标签

3.3 均匀投票

python

def _vote_uniform(self, top_k_labels):
    for i in range(m_test):
        counts = np.bincount(top_k_labels[i].astype(int))
        predictions[i] = np.argmax(counts)

np.bincount() 统计数组中每个非负整数出现的次数。例如：

输入 [0, 1, 0, 2, 0] → 输出 [3, 1, 1]（0 出现 3 次，1 出现 1 次，2 出现 1 次）

np.argmax(counts) 返回出现次数最多的类别索引。

3.4 距离加权投票

python

def _vote_distance_weighted(self, top_k_labels, top_k_distances):
    weights = 1.0 / (top_k_distances + 1e-6)
    for i in range(m_test):
        class_weights = np.zeros(n_classes)
        for j in range(self.k):
            label = int(top_k_labels[i, j])
            class_weights[label] += weights[i, j]
        predictions[i] = np.argmax(class_weights)

权重公式：

w_{i j} = \frac{1}{d (x_{i}^{test}, x_{j}^{train}) + ε}

1e-6（即 $10^{- 6}$ ）防止除零。对于距离为 0 的训练样本（测试样本恰好与某个训练样本重合），其权重会非常大（ $1 / 10^{- 6} = 10^{6}$ ），几乎以一己之力决定投票结果——这符合直觉。

然后对每个类别累计权重，选权重最大的类别。这确保了"近邻"比"远邻"拥有更大的话语权。

第4步：决策边界可视化

python

def plot_decision_boundary(ax, model, X, y, title, step=0.05):
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, step),
        np.arange(y_min, y_max, step)
    )
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict(grid_points)
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu)

这段代码的核心思想是对空间中的每一个点做预测，然后用不同颜色填充。步骤如下：

np.meshgrid 在特征空间打网格
np.c_[xx.ravel(), yy.ravel()] 将二维网格坐标展开为一维点列表，每个点 (x_coord, y_coord) 一行
对所有网格点调用 model.predict()——这可能很慢（如果网格很密），step=0.05 控制分辨率
Z.reshape(xx.shape) 将预测结果变回二维
ax.contourf 用填充等高线绘制决策区域

第5步：维数灾难的数值实验

python

def plot_curse_of_dimensionality():
    for d in dims:
        points = np.random.uniform(0, 1, (n_points, d))
        idx1 = np.random.choice(n_points, size=min(500, n_points), replace=False)
        idx2 = np.random.choice(n_points, size=min(500, n_points), replace=False)
        diffs = points[idx1] - points[idx2]
        dists = np.sqrt(np.sum(diffs ** 2, axis=1))
        ratios.append(dists.min() / (dists.max() + 1e-10))

这段代码用Mento Carlo 模拟验证维数灾难：在 $d$ 维超立方体中均匀采样点，计算随机点对之间的距离，然后看 min(dist) / max(dist) 的比值。

关键发现：随着 $d$ 增大，这个比值趋近于 1。原因是在高维空间中，向量的平方和（即 $∥ x - y ∥^{2}$ ）近似于正态分布（中心极限定理），其方差随 $d$ 增大相对减小，使得所有距离都集中在均值附近。

第6步：与 sklearn 的对比验证

python

def plot_sklearn_comparison():
    knn_custom = KNNClassifier(k=k, metric='euclidean', weights='uniform')
    knn_custom.fit(X_train, y_train)
    custom_acc.append(accuracy_score(y_test, knn_custom.predict(X_test)))

    knn_sk = KNeighborsClassifier(n_neighbors=k, weights='uniform', metric='euclidean')
    knn_sk.fit(X_train, y_train)
    sklearn_acc.append(accuracy_score(y_test, knn_sk.predict(X_test)))

用相同的数据、相同的 $k$ 值、相同的距离度量和投票策略，自定义实现应该和 sklearn 给出完全相同的预测。如果两条曲线在图中重合，就验证了自定义实现的正确性。

关键概念速查表

概念	数学形式	代码位置	关键说明
欧氏距离	$\| a - b \|_{2}$	`euclidean_distance()`	展开公式避免大矩阵广播
曼哈顿距离	$\sum \| a_{i} - b_{i} \|$	`manhattan_distance()`	$L_{1}$ 范数，对异常值鲁棒
余弦距离	$1 - \frac{a \cdot b}{\| a \| \| b \|}$	`cosine_distance()`	度量方向差异
马氏距离	$\sqrt{(x - y)^{T} Σ^{- 1} (x - y)}$	`mahalanobis_distance()`	考虑特征相关性
惰性学习	`fit()` 只存储数据	`KNNClassifier.fit()`	无训练过程
多数投票	`np.bincount` + `np.argmax`	`_vote_uniform()`	每个邻居一票
距离加权	$w = 1 / (d + ε)$	`_vote_distance_weighted()`	近邻权重大
决策边界	网格预测 + `contourf`	`plot_decision_boundary()`	Voronoi 图
维数灾难	`min/max dist → 1`	`plot_curse_of_dimensionality()`	高维中距离失效

完整代码

# -*- coding: utf-8 -*-
"""
===============================================================================
ml01_knn/code/demo.py — k-近邻算法与距离度量
===============================================================================
本演示通过实现 k-NN 分类器和回归器，全面展示：
  1. k-NN 算法的核心原理（多数投票 vs 距离加权）
  2. 四种距离度量的作用和差异
  3. k 值对决策边界的影响（k=1, 5, 15）
  4. 维数灾难的直观展示
  5. k-d 树的简单模拟（非完整实现，展示思想）
  6. 与 sklearn 的 KNeighborsClassifier 对比验证

通过本演示，你将理解：
  - k-NN 的惰性学习本质（无显式训练过程）
  - 不同的距离度量如何影响"近邻"的定义
  - k 值的偏差-方差权衡
  - 高维空间中距离概念的失效

作者：learn-ai 项目
日期：2025
===============================================================================
"""

import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_moons, make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error
matplotlib.rcParams['axes.unicode_minus'] = False

# 图片保存目录
_SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_IMAGES_DIR = os.path.join(_SCRIPT_DIR, '..', 'images')
os.makedirs(_IMAGES_DIR, exist_ok=True)

def _save_path(filename):
    return os.path.join(_IMAGES_DIR, filename)


# ============================================================================
# 第一部分：距离度量函数
# ============================================================================

def euclidean_distance(X_test, X_train):
    """
    计算欧氏距离矩阵。

    使用展开公式: ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a·b
    避免了显式广播大矩阵，内存效率更高。

    参数:
        X_test: (m, d), m 个测试样本
        X_train: (n, d), n 个训练样本
    返回:
        distances: (m, n), 每对样本之间的欧氏距离
    """
    # 平方和项: (m, 1) 和 (n,)
    sq_test = np.sum(X_test ** 2, axis=1, keepdims=True)  # (m, 1)
    sq_train = np.sum(X_train ** 2, axis=1)                # (n,)
    # 交叉项: (m, d) @ (d, n) = (m, n)
    cross = 2.0 * X_test @ X_train.T
    # 展开公式: a^2 + b^2 - 2ab
    sq_dists = sq_test + sq_train - cross
    # 数值稳定性: 把极小负值 clamp 到 0（浮点误差）
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.sqrt(sq_dists)


def manhattan_distance(X_test, X_train):
    """
    计算曼哈顿距离矩阵 (L1 距离)。

    L1 距离 = sum(|x_i - y_i|)
    比欧氏距离对异常值更鲁棒——不平方放大差异。
    """
    # (m, 1, d) - (n, d) = (m, n, d)
    abs_diff = np.abs(X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :])
    return np.sum(abs_diff, axis=2)  # (m, n)


def cosine_distance(X_test, X_train):
    """
    计算余弦距离矩阵。

    余弦距离 = 1 - 余弦相似度
    度量向量的方向差异，忽略长度。
    广泛用于文本分析、推荐系统。
    """
    # 计算 L2 范数
    norm_test = np.linalg.norm(X_test, axis=1, keepdims=True)   # (m, 1)
    norm_train = np.linalg.norm(X_train, axis=1)                # (n,)
    # 余弦相似度
    dot = X_test @ X_train.T  # (m, n)
    denom = norm_test @ norm_train[np.newaxis, :] + 1e-10
    cos_sim = dot / denom
    # 限制在 [-1, 1] 范围内（浮点误差）
    cos_sim = np.clip(cos_sim, -1.0, 1.0)
    return 1.0 - cos_sim


def mahalanobis_distance(X_test, X_train):
    """
    计算马氏距离矩阵。

    马氏距离 = sqrt((x - y)^T @ Sigma^{-1} @ (x - y))
    考虑特征之间的相关性，等价于先做白化变换，再算欧氏距离。

    步骤:
        1. 计算协方差矩阵 Sigma
        2. 用伪逆近似 Sigma^{-1}
        3. 按定义计算马氏距离
    """
    n_train = X_train.shape[0]
    # 无偏协方差估计
    Sigma = np.cov(X_train.T)
    # 用伪逆代替逆（接近奇异时仍能给出解）
    Sigma_inv = np.linalg.pinv(Sigma)

    m, n = X_test.shape[0], X_train.shape[0]
    distances = np.zeros((m, n))
    for i in range(m):
        diff = X_test[i] - X_train  # (n, d)
        # batch: diff @ Sigma_inv @ diff.T -> 取对角线
        distances[i] = np.sqrt(np.sum((diff @ Sigma_inv) * diff, axis=1))
    return distances


# 距离度量映射表
METRIC_FUNCTIONS = {
    'euclidean': euclidean_distance,
    'manhattan': manhattan_distance,
    'cosine': cosine_distance,
    'mahalanobis': mahalanobis_distance,
}


# ============================================================================
# 第二部分：自定义 k-NN 分类器
# ============================================================================

class KNNClassifier:
    """
    从零实现的 k-NN 分类器。

    支持多种距离度量和投票策略:
      - weights='uniform': 每个邻居的票权重相同
      - weights='distance': 使用反距离权重

    参数:
        k: int, 近邻数量
        metric: str, 距离度量 ('euclidean', 'manhattan', 'cosine', 'mahalanobis')
        weights: str, 'uniform' 或 'distance'
    """

    def __init__(self, k=3, metric='euclidean', weights='uniform'):
        self.k = k
        self.metric = metric
        self.weights = weights
        self.X_train = None
        self.y_train = None
        if metric not in METRIC_FUNCTIONS:
            raise ValueError(f"不支持的度量: {metric}，请选择 {list(METRIC_FUNCTIONS.keys())}")

    def fit(self, X, y):
        """
        k-NN 的训练就是记住数据——惰性学习（lazy learning）。
        没有权重更新、没有梯度下降、没有迭代。
        """
        self.X_train = np.asarray(X, dtype=np.float64)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X_test):
        """
        预测流程:
          1. 计算所有测试样本到训练样本的距离
          2. 找最近的 k 个邻居
          3. 投票决定类别
        """
        X_test = np.asarray(X_test, dtype=np.float64)
        # Step 1: 计算距离矩阵 (m_test, n_train)
        dist_func = METRIC_FUNCTIONS[self.metric]
        distances = dist_func(X_test, self.X_train)

        # Step 2: 找到最近的 k 个邻居的索引
        # argsort 返回从小到大的索引排列
        top_k_indices = np.argsort(distances, axis=1)[:, :self.k]  # (m_test, k)

        # Step 3: 收集这些邻居的标签
        top_k_labels = self.y_train[top_k_indices]  # (m_test, k)
        top_k_distances = np.take_along_axis(distances, top_k_indices, axis=1)

        # Step 4: 投票
        if self.weights == 'uniform':
            predictions = self._vote_uniform(top_k_labels)
        elif self.weights == 'distance':
            predictions = self._vote_distance_weighted(top_k_labels, top_k_distances)
        else:
            raise ValueError(f"不支持的权重策略: {self.weights}")

        return predictions

    def _vote_uniform(self, top_k_labels):
        """
        多数投票: 每个邻居一票，票数最多的类别获胜。

        bincount 统计每个类别出现的次数，argmax 选出最多的类别。
        """
        m_test = top_k_labels.shape[0]
        predictions = np.zeros(m_test, dtype=self.y_train.dtype)

        for i in range(m_test):
            counts = np.bincount(top_k_labels[i].astype(int))
            predictions[i] = np.argmax(counts)

        return predictions

    def _vote_distance_weighted(self, top_k_labels, top_k_distances):
        """
        距离加权投票: 每个邻居的票权重 = 1 / (distance + epsilon)。

        步骤:
          1. 计算权重: w = 1 / (d + 1e-6)
          2. 对每个测试样本，累加各邻居对各类别贡献的权重
          3. 选权重最大的类别
        """
        m_test = top_k_labels.shape[0]
        predictions = np.zeros(m_test, dtype=self.y_train.dtype)

        # 权重: 距离越小权重越大
        weights = 1.0 / (top_k_distances + 1e-6)  # (m_test, k)

        # 获取类别数
        n_classes = int(np.max(self.y_train)) + 1

        for i in range(m_test):
            # 对每个类别累计权重
            class_weights = np.zeros(n_classes)
            for j in range(self.k):
                label = int(top_k_labels[i, j])
                class_weights[label] += weights[i, j]
            predictions[i] = np.argmax(class_weights)

        return predictions


# ============================================================================
# 第三部分：可视化函数
# ============================================================================

def plot_decision_boundary(ax, model, X, y, title, step=0.05):
    """
    绘制二维特征空间的决策边界。

    方法: 在特征空间打网格 -> 预测每个格点的类别 -> 用不同的颜色填充
    """
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5

    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, step),
        np.arange(y_min, y_max, step)
    )
    grid_points = np.c_[xx.ravel(), yy.ravel()]

    Z = model.predict(grid_points)
    Z = Z.reshape(xx.shape)

    # 绘制决策区域
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu)
    # 绘制训练样本
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.RdYlBu, s=50)
    ax.set_xlabel('Feature 1', fontsize=12)
    ax.set_ylabel('Feature 2', fontsize=12)
    ax.set_title(title, fontsize=13, fontweight='bold')


def plot_distance_comparison():
    """
    展示四种距离度量在二维空间中的"等距线"(isocontour)。
    对于每种度量，展示到中心点的距离等高线。
    """
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 生成协方差相关数据来展示马氏距离的特点
    np.random.seed(42)
    n = 300
    # 有相关性的数据
    mean = [0, 0]
    cov = [[2.0, 1.5], [1.5, 1.0]]  # 正相关的协方差矩阵
    data = np.random.multivariate_normal(mean, cov, n)

    center = np.array([0.0, 0.0])
    x = np.linspace(-5, 5, 200)
    y = np.linspace(-5, 5, 200)
    Xg, Yg = np.meshgrid(x, y)
    grid = np.c_[Xg.ravel(), Yg.ravel()]

    metrics = ['euclidean', 'manhattan', 'cosine', 'mahalanobis']
    titles = ['Euclidean (L2) Distance Level Sets',
              'Manhattan (L1) Distance Level Sets',
              'Cosine Distance Level Sets',
              'Mahalanobis Distance Level Sets']

    for ax, metric, title in zip(axes.flat, metrics, titles):
        dist_func = METRIC_FUNCTIONS[metric]
        distances = dist_func(grid, center.reshape(1, -1)).reshape(Xg.shape)

        # 绘制等距线
        levels = np.linspace(distances.min(), distances.max(), 15)
        ax.contour(Xg, Yg, distances, levels=levels, cmap='viridis', alpha=0.7)
        # 标注中心点
        ax.scatter([center[0]], [center[1]], c='red', s=120, marker='*',
                   edgecolors='darkred', zorder=5, label='Reference Point')
        # 绘制数据散点
        ax.scatter(data[:, 0], data[:, 1], alpha=0.15, s=10, c='gray')
        ax.set_xlabel('Dimension 1', fontsize=11)
        ax.set_ylabel('Dimension 2', fontsize=11)
        ax.set_title(title, fontsize=12, fontweight='bold')
        ax.legend(fontsize=9)
        ax.set_xlim(-5, 5)
        ax.set_ylim(-5, 5)
        ax.set_aspect('equal')

    plt.tight_layout()
    fp = _save_path('distance_metrics_levelsets.png')
    fig.savefig(fp, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[Done] distance comparison saved to {fp}")


def plot_k_effect():
    """
    对比不同 k 值下的决策边界: k=1, k=5, k=15。
    展示 k 值如何影响模型的复杂度。
    """
    np.random.seed(42)
    X, y = make_moons(n_samples=200, noise=0.25, random_state=42)
    # 标准化
    X = StandardScaler().fit_transform(X)

    k_values = [1, 5, 15]
    fig, axes = plt.subplots(1, 3, figsize=(18, 5.5))

    for k, ax in zip(k_values, axes):
        knn = KNNClassifier(k=k, metric='euclidean', weights='uniform')
        knn.fit(X, y)

        title = f'k-NN Classifier (k={k})\n'
        if k == 1:
            title += 'Complex boundary, Low Bias, High Variance'
        elif k == 5:
            title += 'Balanced boundary, Moderate complexity'
        else:
            title += 'Smooth boundary, High Bias, Low Variance'

        plot_decision_boundary(ax, knn, X, y, title)

    plt.tight_layout()
    fp = _save_path('k_effect_boundary.png')
    fig.savefig(fp, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[Done] k-effect boundary saved to {fp}")


def plot_weight_effect():
    """
    对比均匀投票 vs 距离加权投票的决策差异。
    """
    np.random.seed(42)
    X, y = make_classification(
        n_samples=200, n_features=2, n_redundant=0, n_clusters_per_class=2,
        flip_y=0.1, random_state=42
    )
    X = StandardScaler().fit_transform(X)

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # 均匀投票
    knn_uniform = KNNClassifier(k=7, metric='euclidean', weights='uniform')
    knn_uniform.fit(X, y)
    plot_decision_boundary(axes[0], knn_uniform, X, y,
                          'Uniform Weights (k=7)\nEqual vote for each neighbor')

    # 距离加权
    knn_dist = KNNClassifier(k=7, metric='euclidean', weights='distance')
    knn_dist.fit(X, y)
    plot_decision_boundary(axes[1], knn_dist, X, y,
                          'Distance-Weighted (k=7)\nCloser neighbors have more influence')

    plt.tight_layout()
    fp = _save_path('weight_effect_boundary.png')
    fig.savefig(fp, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[Done] weight effect saved to {fp}")


def plot_curse_of_dimensionality():
    """
    展示维数灾难: 随着维度增加，最近邻和最远邻的距离比值趋近 1。
    """
    np.random.seed(42)
    dims = range(1, 101, 5)  # 1 维到 100 维
    ratios = []
    n_points = 1000

    for d in dims:
        # 在 d 维单位超立方体中均匀采样
        points = np.random.uniform(0, 1, (n_points, d))
        # 计算任意两个点之间的距离（随机采样对来近似）
        idx1 = np.random.choice(n_points, size=min(500, n_points), replace=False)
        idx2 = np.random.choice(n_points, size=min(500, n_points), replace=False)

        diffs = points[idx1] - points[idx2]
        dists = np.sqrt(np.sum(diffs ** 2, axis=1))

        if dists.max() > 1e-10:
            ratios.append(dists.min() / (dists.max() + 1e-10))
        else:
            ratios.append(1.0)

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(list(dims), ratios, 'b-o', linewidth=2, markersize=5)
    ax.axhline(y=1.0, color='r', linestyle='--', alpha=0.5, label='All distances equal')
    ax.set_xlabel('Dimensionality (d)', fontsize=12)
    ax.set_ylabel('min(dist) / max(dist)', fontsize=12)
    ax.set_title('Curse of Dimensionality\nmin/max distance ratio approaches 1',
                 fontsize=13, fontweight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    fp = _save_path('curse_of_dimensionality.png')
    fig.savefig(fp, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[Done] curse of dimensionality saved to {fp}")


def plot_metric_comparison_accuracy():
    """
    对比不同距离度量在分类任务上的准确率。
    """
    np.random.seed(42)

    # benchmark 数据集
    datasets = {
        'Moons': make_moons(n_samples=300, noise=0.2, random_state=42),
        'Classification': make_classification(
            n_samples=300, n_features=2, n_redundant=0, n_clusters_per_class=1,
            random_state=42
        ),
    }

    metrics = ['euclidean', 'manhattan', 'cosine']
    k_values = [1, 3, 5, 7, 9, 11, 13, 15]

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    for ax, (name, (X, y)) in zip(axes, datasets.items()):
        X = StandardScaler().fit_transform(X)

        for metric in metrics:
            accuracies = []
            for k in k_values:
                knn = KNNClassifier(k=k, metric=metric, weights='uniform')
                knn.fit(X, y)
                y_pred = knn.predict(X)
                accuracies.append(accuracy_score(y, y_pred))
            ax.plot(k_values, accuracies, 'o-', linewidth=2, markersize=6,
                   label=metric.capitalize())

        ax.set_xlabel('k (number of neighbors)', fontsize=12)
        ax.set_ylabel('Training Accuracy', fontsize=12)
        ax.set_title(f'Dataset: {name}\nMetric & k value impact on accuracy',
                     fontsize=13, fontweight='bold')
        ax.legend(fontsize=10)
        ax.grid(True, alpha=0.3)

    plt.tight_layout()
    fp = _save_path('metric_accuracy_comparison.png')
    fig.savefig(fp, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[Done] metric accuracy comparison saved to {fp}")


def plot_sklearn_comparison():
    """
    对比自定义 k-NN 与 sklearn KNeighborsClassifier。
    """
    np.random.seed(42)
    X, y = make_classification(
        n_samples=300, n_features=2, n_redundant=0, n_clusters_per_class=1,
        random_state=42
    )
    X = StandardScaler().fit_transform(X)

    # 训练集/测试集划分
    n_train = 200
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]

    k_values = [1, 3, 5, 7, 9, 11, 13, 15]
    custom_acc = []
    sklearn_acc = []

    for k in k_values:
        # 自定义 k-NN
        knn_custom = KNNClassifier(k=k, metric='euclidean', weights='uniform')
        knn_custom.fit(X_train, y_train)
        custom_acc.append(accuracy_score(y_test, knn_custom.predict(X_test)))

        # sklearn k-NN
        knn_sk = KNeighborsClassifier(n_neighbors=k, weights='uniform', metric='euclidean')
        knn_sk.fit(X_train, y_train)
        sklearn_acc.append(accuracy_score(y_test, knn_sk.predict(X_test)))

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(k_values, custom_acc, 'bo-', linewidth=2, markersize=8,
            label='Custom KNN (NumPy)')
    ax.plot(k_values, sklearn_acc, 'rs--', linewidth=2, markersize=8,
            markerfacecolor='none', label='Sklearn KNeighborsClassifier')
    ax.set_xlabel('k (number of neighbors)', fontsize=12)
    ax.set_ylabel('Test Accuracy', fontsize=12)
    ax.set_title('Custom KNN vs Sklearn Comparison\n(Should give identical results)',
                 fontsize=13, fontweight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    fp = _save_path('sklearn_comparison.png')
    fig.savefig(fp, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[Done] sklearn comparison saved to {fp}")


# ============================================================================
# 第四部分：主函数
# ============================================================================

def main():
    """主执行流程: 逐步展示所有可视化结果。"""
    print("=" * 70)
    print("ml01_knn/demo.py — k-近邻算法与距离度量")
    print("=" * 70)

    # 1. 距离度量对比
    print("\n[1/6] 绘制四种距离度量的等距线...")
    plot_distance_comparison()

    # 2. k 值对决策边界的影响
    print("\n[2/6] 绘制不同 k 值下的决策边界...")
    plot_k_effect()

    # 3. 均匀投票 vs 距离加权
    print("\n[3/6] 对比均匀投票与距离加权投票...")
    plot_weight_effect()

    # 4. 维数灾难
    print("\n[4/6] 展示维数灾难...")
    plot_curse_of_dimensionality()

    # 5. 不同度量对准确率的影响
    print("\n[5/6] 对比不同距离度量的分类准确率...")
    plot_metric_comparison_accuracy()

    # 6. 与 sklearn 对比
    print("\n[6/6] 对比自定义实现与 sklearn...")
    plot_sklearn_comparison()

    print("\n" + "=" * 70)
    print("全部可视化完成! 图像保存在:", _IMAGES_DIR)
    print("=" * 70)


if __name__ == "__main__":
    main()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581

ml01 k-近邻与距离度量 — demo.py 代码详解 ​

运行方式 ​

代码逐段详解 ​

第1步：导入库 — 每个库是做什么的 ​

第2步：四种距离度量的向量化实现 ​

2.1 欧氏距离的展开公式 ​

2.2 曼哈顿距离（L1） ​

2.3 余弦距离 ​

2.4 马氏距离 ​

第3步：自定义 KNNClassifier 类 ​

3.1 fit() — 惰性学习的本质 ​

3.2 predict() — 预测流程 ​

3.3 均匀投票 ​

3.4 距离加权投票 ​

第4步：决策边界可视化 ​

第5步：维数灾难的数值实验 ​

第6步：与 sklearn 的对比验证 ​

关键概念速查表 ​

完整代码 ​

ml01 k-近邻与距离度量 — demo.py 代码详解

运行方式

代码逐段详解

第1步：导入库 — 每个库是做什么的

第2步：四种距离度量的向量化实现

2.1 欧氏距离的展开公式

2.2 曼哈顿距离（L1）

2.3 余弦距离

2.4 马氏距离

第3步：自定义 KNNClassifier 类

3.1 fit() — 惰性学习的本质

3.2 predict() — 预测流程

3.3 均匀投票

3.4 距离加权投票

第4步：决策边界可视化

第5步：维数灾难的数值实验

第6步：与 sklearn 的对比验证

关键概念速查表

完整代码