A Simple Algorithm for kNN Sampling in General Metrics
The Canadian Conference on Computational Geometry, August 8, 2020
Finding the $k$th nearest neighbor of a query point is a ubiquitous operation in many kinds of metric computations, especially those arising in unsupervised machine learning. In many such cases, the distance from a query point to its $k$th nearest sample point is used as an estimate of the local density of the sample. In this paper, we give an algorithm that takes a finite metric $(P, d)$ and an integer $k$ and produces a subset $M\subseteq P$ with the property that for any $q\in P$, the distance from $q$ to its second nearest point in $M$ is a constant-factor approximation to the distance from $q$ to its $k$th nearest point in $P$. Thus, the sample $M$ may be used in lieu of $P$. In addition to being much smaller than $P$, distance queries on $M$ require finding only the second nearest neighbor instead of the $k$th nearest neighbor. This is a significant improvement, especially because theoretical guarantees on $k$th nearest neighbor methods often require $k$ to grow as a function of the input size $n$.
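To make the stated guarantee concrete, the following is a minimal brute-force sketch in Python that checks it for a given sample: for every $q\in P$, the second-nearest-neighbor distance in $M$ and the $k$th-nearest-neighbor distance in $P$ should agree up to a constant factor $c$. This is only a verification of the property, not the paper's construction of $M$; the helper names (`check_sample`, `kth_nearest_dist`), the value of $c$, and the placeholder sample are illustrative assumptions.

```python
import math
import random

def kth_nearest_dist(q, points, k, d):
    """Distance from q to its k-th nearest point of `points` under metric d.
    If q is itself in `points`, it counts as its own nearest neighbor
    (distance 0), which is why the sample M is queried for the *second*
    nearest neighbor."""
    return sorted(d(q, p) for p in points)[k - 1]

def check_sample(P, M, k, d, c):
    """Return True if, for every q in P, the 2nd-NN distance in M and the
    k-th-NN distance in P are within a factor c of each other."""
    for q in P:
        dk = kth_nearest_dist(q, P, k, d)  # k-th NN distance in the full set
        d2 = kth_nearest_dist(q, M, 2, d)  # 2nd NN distance in the sample
        if not (d2 <= c * dk and dk <= c * d2):
            return False
    return True

if __name__ == "__main__":
    # Example usage on random points in the plane with the Euclidean metric.
    random.seed(0)
    d = math.dist
    P = [(random.random(), random.random()) for _ in range(200)]
    M = P[::20]  # placeholder sample; the paper constructs M differently
    print(check_sample(P, M, k=10, d=d, c=4.0))
```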