2022-07-10 ~1 min read

UMAP

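UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique, commonly used to visualize large, high-dimensional datasets in two or three dimensions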

Steps

UMAP calculates the distance between every pair of points in the high-dimensional space
For each point:
1. It calculates the target similarity score from the number-of-neighbours hyperparameter $k$. Let’s call this $S$; in UMAP, $S = \log_2(k)$
2. It fits a similarity curve over the distances from that point to the other data points. Each of the point's $k$ nearest neighbours gets a similarity value read off this curve, and the width of the curve is chosen so that those values sum to $S$
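Once we have the curve for each data point, the similarity from point A to point B generally differs from the similarity from B to A, because each point has its own curve. We make the scores symmetric with a fuzzy union: $S_{sym} = S_{AB} + S_{BA} - S_{AB} S_{BA}$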
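
The curve fitting and the symmetrization can be sketched in plain Python. This is only a rough illustration, assuming the curve has the form $e^{-(d - \rho)/\sigma}$, where $\rho$ is the distance to the nearest neighbour and $\sigma$ is the width being fitted. The symbols $\rho$ and $\sigma$ and the helper names `similarities`, `fit_sigma` and `fuzzy_union` are not from the notes above; they are made up for this example, and the distances are toy values.

```python
import math

def similarities(dists, sigma):
    """Similarity of one point to its neighbours: exp(-(d - rho) / sigma)."""
    rho = min(dists)  # distance to the nearest neighbour (assumed curve form)
    return [math.exp(-(d - rho) / sigma) for d in dists]

def fit_sigma(dists, n_iter=64):
    """Binary-search sigma so the similarities sum to S = log2(k)."""
    target = math.log2(len(dists))
    lo, hi = 1e-6, 1e6
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        if sum(similarities(dists, mid)) > target:
            hi = mid  # curve too wide: shrink sigma
        else:
            lo = mid  # curve too narrow: grow sigma
    return (lo + hi) / 2

def fuzzy_union(a, b):
    """Symmetrize two directed scores: a + b - a * b."""
    return a + b - a * b

# Distances from one point to its k = 4 nearest neighbours (toy values).
dists = [0.5, 1.0, 1.5, 3.0]
sigma = fit_sigma(dists)
scores = similarities(dists, sigma)
```

With this curve the nearest neighbour always gets a similarity of 1, and the remaining scores shrink with distance until the total reaches $\log_2(4) = 2$.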
Initialize all the data points in a lower-dimensional space using spectral embedding
Randomly select a point to move in the correct direction (the reference point)
1. Pick a random point from a neighbouring cluster, with the choice weighted by the symmetric similarity scores calculated above
2. Pick a random point that is not in a neighbouring cluster
3. Calculate the low-dimensional similarity score between the reference point and each of the two points above
  1. Use the formula $\text{score}_{low} = \frac{1}{1 + \alpha d^{2\beta}}$ where $d$ is the distance in the lower dimension
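4. Once we have the low-dimensional scores, move the reference point such that the cost function for that point is minimized
5. The cost function is $C = \log\left(\frac{1}{\text{neighbour}}\right) + \log\left(\frac{1}{1 - \text{not neighbour}}\right)$, where "neighbour" and "not neighbour" are the low-dimensional scores to the neighbouring and non-neighbouring points
6. This cost function is minimized using stochastic gradient descent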
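
The score, the cost and a single move of the reference point can be sketched as below. This is a toy version under stated assumptions: $\alpha = \beta = 1$ are placeholder values, the gradient is estimated by finite differences to keep the code short (real implementations use an analytic gradient), and the function names and coordinates are made up for this example.

```python
import math

def low_dim_score(p, q, alpha=1.0, beta=1.0):
    """1 / (1 + alpha * d^(2 * beta)), d = Euclidean distance between p and q."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))  # squared distance
    return 1.0 / (1.0 + alpha * d2 ** beta)

def cost(ref, neighbour, non_neighbour):
    """C = log(1 / s_neighbour) + log(1 / (1 - s_non_neighbour))."""
    s_near = low_dim_score(ref, neighbour)
    s_far = low_dim_score(ref, non_neighbour)
    return math.log(1.0 / s_near) + math.log(1.0 / (1.0 - s_far))

def gradient_step(ref, neighbour, non_neighbour, lr=0.1, eps=1e-6):
    """Move ref one step down a finite-difference estimate of the gradient."""
    base = cost(ref, neighbour, non_neighbour)
    grad = []
    for i in range(len(ref)):
        bumped = list(ref)
        bumped[i] += eps
        grad.append((cost(bumped, neighbour, non_neighbour) - base) / eps)
    return [x - lr * g for x, g in zip(ref, grad)]

# Toy 2-D layout: the neighbour starts far away, the non-neighbour too close.
ref, near, far = [0.0, 0.0], [2.0, 0.0], [-0.5, 0.0]
before = cost(ref, near, far)
for _ in range(50):
    ref = gradient_step(ref, near, far)
after = cost(ref, near, far)
```

Each step pulls the reference point towards its neighbour and pushes it away from the non-neighbour, so `after` ends up lower than `before`.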
After many iterations, we’ll have a lower-dimensional representation of the higher-dimensional data