# K-means clustering - NumPy API

The `pykeops.numpy.LazyTensor.argmin()` reduction supported by KeOps `pykeops.numpy.LazyTensor` allows us to perform bruteforce nearest neighbor search with four lines of code. It can thus be used to implement a large-scale K-means clustering, without memory overflows.

Note

For large and high dimensional datasets, this script is outperformed by its PyTorch counterpart which avoids transfers between CPU (host) and GPU (device) memories.

## Setup

Standard imports:

```import time

import numpy as np
from matplotlib import pyplot as plt

from pykeops.numpy import LazyTensor
import pykeops.config

dtype = "float32"  # May be 'float32' or 'float64'
```

Simple implementation of the K-means algorithm:

```def KMeans(x, K=10, Niter=10, verbose=True):
N, D = x.shape  # Number of samples, dimension of the ambient space

# K-means loop:
# - x  is the point cloud,
# - cl is the vector of class labels
# - c  is the cloud of cluster centroids
start = time.time()
c = np.copy(x[:K, :])  # Simplistic random initialization
x_i = LazyTensor(x[:, None, :])  # (Npoints, 1, D)

for i in range(Niter):
c_j = LazyTensor(c[None, :, :])  # (1, Nclusters, D)
D_ij = ((x_i - c_j) ** 2).sum(
-1
)  # (Npoints, Nclusters) symbolic matrix of squared distances
cl = D_ij.argmin(axis=1).astype(int).reshape(N)  # Points -> Nearest cluster

Ncl = np.bincount(cl).astype(dtype)  # Class weights
for d in range(D):  # Compute the cluster centroids with np.bincount:
c[:, d] = np.bincount(cl, weights=x[:, d]) / Ncl

end = time.time()

if verbose:
print(
"K-means example with {:,} points in dimension {:,}, K = {:,}:".format(
N, D, K
)
)
print(
"Timing for {} iterations: {:.5f}s = {} x {:.5f}s\n".format(
Niter, end - start, Niter, (end - start) / Niter
)
)

return cl, c
```

## K-means in 2D

First experiment with N=10,000 points in dimension D=2, with K=50 classes:

```N, D, K = 10000, 2, 50
```

Define our dataset:

```x = np.random.randn(N, D).astype(dtype) / 6 + 0.5
```

Perform the computation:

```cl, c = KMeans(x, K)
```
```K-means example with 10,000 points in dimension 2, K = 50:
Timing for 10 iterations: 0.00458s = 10 x 0.00046s
```

Fancy display:

```plt.figure(figsize=(8, 8))
plt.scatter(x[:, 0], x[:, 1], c=cl, s=30000 / len(x), cmap="tab10")
plt.scatter(c[:, 0], c[:, 1], c="black", s=50, alpha=0.8)
plt.axis([0, 1, 0, 1])
plt.tight_layout()
plt.show()
```

## K-means in dimension 100

Second experiment with N=1,000,000 points in dimension D=100, with K=1,000 classes:

```if pykeops.config.gpu_available:
N, D, K = 1000000, 100, 1000
x = np.random.randn(N, D).astype(dtype)
cl, c = KMeans(x, K)
```
```K-means example with 1,000,000 points in dimension 100, K = 1,000:
Timing for 10 iterations: 5.68102s = 10 x 0.56810s
```

Total running time of the script: (0 minutes 7.842 seconds)

Gallery generated by Sphinx-Gallery