This is part of the documentation for uwot.
The default initialization for UMAP uses the eigenvectors of a normalized graph Laplacian of the sparse affinity matrix. This normally works well, but under some circumstances there is a risk that RSpectra can have trouble converging.
An alternative to the spectral initialization is to use PCA, although depending on the scaling of your input variables, this could lead to quite a spread-out initial set of coordinates, which can be hard to optimize. For the related technique t-SNE, Kobak and Berens recommend scaling the PCA results so that the standard deviation of each dimension is 1e-4. You can get that behavior in uwot by using init = "spca".
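As an illustration of what the spectral approach computes, here is a minimal, dense base-R sketch (uwot itself works on the sparse affinity matrix via RSpectra; the toy matrix and everything below is made up purely for illustration):

```r
# Toy symmetric affinity matrix: two triangles joined by one weak edge.
A <- matrix(0, 6, 6)
A[1, 2] <- A[2, 3] <- A[1, 3] <- 1  # first cluster
A[4, 5] <- A[5, 6] <- A[4, 6] <- 1  # second cluster
A[3, 4] <- 0.1                      # weak link keeping the graph connected
A <- A + t(A)

# Symmetrically normalized graph Laplacian: L = I - D^(-1/2) A D^(-1/2)
d <- rowSums(A)
L <- diag(6) - diag(1 / sqrt(d)) %*% A %*% diag(1 / sqrt(d))

# The initial coordinates come from the eigenvectors with the smallest
# eigenvalues, skipping the trivial one whose eigenvalue is (near) zero.
e <- eigen(L, symmetric = TRUE)  # eigenvalues returned in decreasing order
n_components <- 2
init <- e$vectors[, (6 - 1):(6 - n_components)]
```

The first retained column (the Fiedler-like vector) already separates the two triangles into positive and negative values, which is why this makes a good starting layout.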
You don’t need to worry about disconnected components when attempting to use spectral initialization: if uwot detects more than one component in the affinity graph, it will fall back to using the "spca" initialization automatically.
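The component check itself is conceptually simple; here is a minimal sketch of counting components with breadth-first search on a symmetric adjacency matrix (illustrative only: the function name is made up, and this is not uwot’s internal code):

```r
# Count connected components of a symmetric adjacency matrix via BFS.
n_components_of <- function(A) {
  n <- nrow(A)
  visited <- rep(FALSE, n)
  n_comp <- 0
  for (start in seq_len(n)) {
    if (visited[start]) next
    n_comp <- n_comp + 1
    queue <- start
    visited[start] <- TRUE
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      nbrs <- which(A[node, ] > 0 & !visited)
      visited[nbrs] <- TRUE
      queue <- c(queue, nbrs)
    }
  }
  n_comp
}

A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 1] <- 1  # component {1, 2}
A[3, 4] <- A[4, 3] <- 1  # component {3, 4}
n_components_of(A)  # 2: spectral init would fall back to "spca" here
```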
However, using PCA ignores the information in the affinity matrix that might be useful for initialization: for example, when using supervised UMAP or mixing different types of distance metrics. And conceptually, using the affinity matrix has a lot of appeal. Dzwinel and co-workers describe a very fast multi-dimensional scaling-like method called ivhd which generates a connected graph by knowing only a few neighbor and non-neighbor points and pretending the target distances are 0 and 1, respectively. Similar theoretical support for this is given by Linderman and co-workers. On the basis of this work, I have implemented an “approximate global” spectral method of initialization, which is a very minor modification of the standard spectral initialization: a small affinity is assigned to a random selection of non-neighbor entries in the graph, n_neg_samples of them for each positive edge in the graph. This is consistent with the ratio of neighbor to non-neighbor edge sampling during optimization and, if using a default value of n_neg_samples, is a similar order of magnitude as the ratio of neighbor to non-neighbor entries in the graph that is used with ivhd. You can get this behavior in uwot with init = "agspectral".
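To make the idea concrete, here is a sketch (illustrative only, not uwot’s implementation) of augmenting a disconnected affinity matrix with a few small “negative edge” affinities, after which the spectral initialization can proceed on the whole dataset at once:

```r
set.seed(42)
# Two disconnected chains of five points each.
n <- 10
A <- matrix(0, n, n)
for (i in 1:4) A[i, i + 1] <- 1  # chain 1: points 1..5
for (i in 6:9) A[i, i + 1] <- 1  # chain 2: points 6..10
A <- A + t(A)

# Assign a small affinity to randomly chosen non-neighbor pairs: one per
# positive edge in this sketch (uwot uses n_neg_samples per positive edge).
eps <- 1e-3
pos <- which(upper.tri(A) & A > 0)
non <- which(upper.tri(A) & A == 0)
A[sample(non, length(pos))] <- eps
A <- pmax(A, t(A))  # restore symmetry

# With high probability the graph is now connected, so the normalized
# Laplacian eigenvectors can be computed as in the standard spectral case.
```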
Another option for a connected graph is to use a variation on the normalization used in the graph Laplacian to get something close to Laplacian Eigenmaps, with init = "laplacian". In von Luxburg’s spectral clustering tutorial (PDF), this form of normalization is recommended over the one used by init = "spectral".
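The difference between the two spectral options lies in how the Laplacian is normalized. A base-R sketch contrasting the symmetrized form with the random-walk form associated with Laplacian Eigenmaps (for intuition only; uwot’s exact formulation may differ). The two matrices share the same eigenvalues, while their eigenvectors differ by a D^(1/2) scaling:

```r
# Path graph on three points: degrees differ, so the two forms differ too.
A <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)
d <- rowSums(A)

L_sym <- diag(3) - diag(1 / sqrt(d)) %*% A %*% diag(1 / sqrt(d))
L_rw  <- diag(3) - diag(1 / d) %*% A   # random-walk normalization

ev_sym <- sort(eigen(L_sym, symmetric = TRUE)$values)
ev_rw  <- sort(Re(eigen(L_rw)$values))
# The spectra coincide even though the matrices (and eigenvectors) do not.
```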
If all else fails, you can use random initialization. In the spirit of generosity, uwot offers two types of random initialization. The first is the one used by UMAP: a random uniform distribution between -10 and +10 along each axis. You can get this with init = "rand". Alternatively, there is the method favoured by LargeVis (and used by default with lvish), which is a Gaussian distribution with standard deviation 1e-4; this is also the method used by t-SNE. You can get it by setting init = "lvrand".
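The two random schemes differ only in distribution and scale; a minimal sketch of what each generates (illustrative, not uwot’s internal code):

```r
set.seed(1337)
n <- 1000

# init = "rand": UMAP-style, uniform between -10 and +10 on each axis.
rand_init <- matrix(runif(n * 2, min = -10, max = 10), ncol = 2)

# init = "lvrand": t-SNE/LargeVis-style, Gaussian with sd 1e-4.
lvrand_init <- matrix(rnorm(n * 2, mean = 0, sd = 1e-4), ncol = 2)

range(rand_init)              # roughly -10 to 10
sd(as.vector(lvrand_init))    # roughly 1e-4
```

Note the five-orders-of-magnitude difference in spread between the two: the optimization starts from a very different place in each case.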
To summarize your options:
# Default initialization: use spectral initialization if possible
# falling back to scaled PCA if multiple components are found in the graph
embedding <- umap(data)
# same as the above
embedding <- umap(data, init = "spectral")
# use scaled PCA
embedding <- umap(data, init = "spca")
# use an "approximate global" spectral initialization that should be applicable
# even under conditions where "spectral" fails and has to use spca
embedding <- umap(data, init = "agspectral")
# use a Laplacian Eigenmap style initialization, which will also fall back to
# spca if it has to
embedding <- umap(data, init = "laplacian")
# use random initialization, UMAP style
embedding <- umap(data, init = "rand")
# use random initialization, t-SNE/LargeVis style
embedding <- umap(data, init = "lvrand")
Below, we’ll explore the effect of these settings. For more details on the datasets, see the examples page.
Apart from changing init, mainly default settings were used, except for pca = 100, which is necessary with high-dimensional datasets for the Annoy-based nearest neighbor search to complete in a reasonable time. An example command, using the mnist dataset and "agspectral" initialization:
set.seed(1337)
mnist_umap <- umap(mnist, pca = 100, init = "agspectral")
Below are some examples of using either the scaled PCA initialization or a spectral approach. On the first row, the left-hand image is the result of using init = "spca", and the right-hand image is init = "agspectral". On the second row, the left-hand image is for init = "spectral", and the right-hand image is init = "laplacian". Some datasets don’t have the second row of images because, with the settings used here, they generate more than one component in the graph, and therefore both init = "spectral" and init = "laplacian" would fall back to the same results as seen with init = "spca".
As mentioned earlier, the init = "spca" setting scales the PCA initialization to have a standard deviation of 1e-4, as recommended by Kobak and Berens. This value is taken from the t-SNE initialization, so if it seems arbitrary, at least it has a history of success on its side. But for MNIST, one of the clusters has been broken up, which wasn’t seen with the spectral initialization. I repeated the embedding with a different seed (results not shown) and again saw a split cluster, so this probably wasn’t due to bad luck with the random number generator.
Could the small standard deviation of the initial coordinates be to blame? Although this setting works well with t-SNE, the normalization involved in that method may mean that the gradients seen at those distances are quite different from those in UMAP. To test this, below are results using a larger standard deviation. Results on the left use:
mnist_umap <- umap(mnist, init = "spca", init_sdev = 0.01)
and those on the right:
mnist_umap <- umap(mnist, init = "spca", init_sdev = 1)
Results are much better with a larger value of init_sdev. As the typical standard deviation of a spectral initialization for these datasets is in the range of 1-5, using init_sdev = 1 seems like a good place to start.
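The effect of init_sdev can be mimicked directly: center each column of the initial coordinates and rescale it to the target standard deviation. A sketch of the idea (scale_init is a made-up name, not part of uwot):

```r
# Rescale each column of an initialization to a target standard deviation.
scale_init <- function(coords, sdev = 1e-4) {
  scale(coords) * sdev  # scale() centers and sets each column's sd to 1
}

set.seed(1337)
coords <- matrix(rnorm(200, sd = 5), ncol = 2)  # a spread-out initialization
init <- scale_init(coords, sdev = 1)
apply(init, 2, sd)  # each column now has sd 1
```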
rand
Below are two results each from using random initialization, so you can get a sense for how much variation you can expect from getting lucky or not with the initialization.
# left-hand plots used this seed
set.seed(1337)
# right-hand plots used this seed
# set.seed(42)
mnist_umap <- umap(mnist, init = "rand")
Broken-up clusters are again observed, and with increased prevalence. I tried repeating these embeddings with init_sdev = 1, but that didn’t help.
lvrand
And here is the result of using random initialization in the t-SNE/LargeVis style, which uses a Gaussian distribution over a much smaller initial range of coordinate values. Will it make much difference?
# left-hand plots used this seed
set.seed(1337)
# right-hand plots used this seed
# set.seed(42)
mnist_umap <- umap(mnist, init = "lvrand")