A Simple Algorithm for kNN Sampling in General Metrics
CCCG: Proceedings of the 32nd Canadian Conference on Computational Geometry, 345–351
2020
Finding the $k$th nearest neighbor to a query point is a ubiquitous operation in many types of metric computations, especially those in unsupervised machine learning. In many such cases, the distance to the $k$th nearest sample point is used as an estimate of the local density of the sample. In this paper, we give an algorithm that takes a finite metric $(P, d)$ and an integer $k$ and produces a subset $M\subseteq P$ with the property that for any $q\in P$, the distance from $q$ to its second nearest point of $M$ is a constant factor approximation to the distance from $q$ to its $k$th nearest point of $P$. Thus, the sample $M$ may be used in lieu of $P$. In addition to $M$ being much smaller than $P$, distance queries on $M$ only require finding the second nearest neighbor instead of the $k$th nearest neighbor. This is a significant improvement, especially because theoretical guarantees on $k$th nearest neighbor methods often require $k$ to grow as a function of the input size $n$.
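To make the quantity being approximated concrete, here is a minimal brute-force sketch of the $k$th nearest neighbor distance in a finite point set. This is only an illustration of the query the paper speeds up, not the paper's sampling algorithm; the function name and the example points are hypothetical.

```python
import math

def kth_nn_dist(q, points, k):
    """Distance from q to its kth nearest neighbor among points,
    excluding q itself. Brute force: sort all pairwise distances to q."""
    dists = sorted(math.dist(q, p) for p in points if p != q)
    return dists[k - 1]

# Hypothetical example point set in the Euclidean plane.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (0.5, 0.5)]

# Distance from the origin to its 3rd nearest neighbor in pts.
d3 = kth_nn_dist((0.0, 0.0), pts, 3)
```

The paper's contribution is a sample $M$ on which the much cheaper second-nearest-neighbor query (`k = 2` above, run over $M$) approximates this $k$th nearest neighbor distance over all of $P$ up to a constant factor.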
@inproceedings{gardner20simple,
  title     = {A Simple Algorithm for $k$NN Sampling in General Metrics},
  author    = {Kirk P. Gardner and Donald R. Sheehy},
  booktitle = {Proceedings of the 32nd Canadian Conference on Computational Geometry},
  pages     = {345--351},
  year      = {2020}
}