🔪 The Sharp Bits 🔪
Read ahead for some pitfalls, counter-intuitive behavior, and sharp edges that we had to introduce in order to make this work.
NetKet computations run mostly via Jax’s XLA. Compared to NetKet 2, this means that we can automatically benefit from multiple CPUs without having to use MPI. This is because mathematical operations such as matrix multiplications and others are split into sub-chunks and distributed across different CPU cores. This behaviour is triggered only for matrices/vectors above a certain size, and will not perform particularly well for small matrices or if you have many CPU cores. To disable this behaviour, refer to Jax#743, which mainly suggests setting the following two flags in the XLA_FLAGS environment variable:
export XLA_FLAGS="--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1"
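When exporting shell variables is inconvenient (for example in a notebook), the same flags can be set from Python instead. The sketch below assumes that nothing in the process has imported or initialised Jax yet; the variable names `flags` and `existing` are only illustrative.

```python
import os

# The two flags quoted above. Set them before importing jax, so that XLA
# sees them when it starts up.
flags = "--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1"

# Append to any flags that are already set instead of overwriting them.
existing = os.environ.get("XLA_FLAGS", "")
os.environ["XLA_FLAGS"] = f"{existing} {flags}".strip()

print(os.environ["XLA_FLAGS"])

# Only import jax / netket after this point.
```

Appending rather than overwriting keeps any other XLA flags that were already exported in the shell.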
In our experience, the best performance is usually achieved by combining BLAS parallelism and MPI, for example by assigning 2 to 4 CPU cores (depending on your problem size) to every MPI process.
Running on CPU when GPUs are present
If you have the CUDA version of jaxlib installed, then computations will, by default, run on the GPU. For small systems this will be very inefficient. To check if this is the case, run the following code:
import jax
print(jax.devices())
If the output is [CpuDevice(id=0)], then computations will run by default on the CPU; if instead you see [GpuDevice(id=0)], computations will run on the GPU.
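The printed device names can differ between Jax versions, so it may be more robust to inspect the `platform` attribute of each device than to compare the printed list. A minimal sketch, where the variable names are illustrative:

```python
import jax

# Devices of the default backend, i.e. where computations run by default.
devices = jax.devices()

# Each device exposes a `platform` string such as "cpu" or "gpu".
platforms = sorted({d.platform for d in devices})

print(f"default backend: {jax.default_backend()}")
print(f"{len(devices)} device(s) on platform(s): {platforms}")
```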
To force Jax/XLA to run computations on the CPU, set the environment variable
Jax supports GPUs, so your calculations should run fine on a GPU. However, there are a few gotchas:
GPUs have a much higher overhead, so you will see very bad performance at small system sizes (typically below 40 spins).
Not all Metropolis Transition Rules work on GPUs. To work around that, those rules have been rewritten in NumPy in order to run on the CPU, so you might need to use netket.sampler.MetropolisSamplerNumpy instead of netket.sampler.MetropolisSampler.
Eventually we would like the selection to be automatic, but this has not yet been implemented.
Please open tickets if you find issues!
NaNs in training and loss of precision
If you find NaNs while training, especially if you are using your own model, there might be a few reasons:
It might simply be a precision issue, as you might be using single precision (np.complex64) instead of double precision (np.complex128). Be careful: if you use complex as dtype, it will not always behave as you expect! Such dtypes are known as weak dtypes, and when multiplied by a single-precision number they will be converted to single precision. This issue might manifest especially when using Flax, which respects type promotion, as opposed to jax.experimental.stax, which does not.
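The weak-dtype promotion described above can be checked directly in Jax. The sketch below uses a Python complex scalar, which Jax treats as weakly typed, and compares it with an explicitly typed complex128 value. It assumes 64-bit mode is switched on explicitly through `jax_enable_x64`; the array names are illustrative.

```python
import jax

# Enable 64-bit types explicitly, so that complex128 is not truncated.
jax.config.update("jax_enable_x64", True)

import jax.numpy as jnp

x32 = jnp.ones(3, dtype=jnp.float32)

# A Python complex scalar is weakly typed: the result follows the
# single-precision array and ends up as complex64.
weak = x32 * 1j

# An explicitly typed complex128 value is strongly typed: the result is
# promoted to double precision.
strong = x32 * jnp.asarray(1j, dtype=jnp.complex128)

print(weak.dtype)    # complex64
print(strong.dtype)  # complex128
```

Printing the dtype of intermediate results in this way is a quick test for silent loss of precision in a custom model.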
Check the initial parameters. In NetKet 2, models were always initialized with normally distributed weights. In NetKet 3, netket.nn layers use the same default (normal distribution with standard deviation 0.01), but if you use general Flax layers they might use different initializers. Different initialisation distributions have particularly strong effects when working with complex-valued models. A good way to enforce the same distribution across all your weights, similar to NetKet 2 behaviour, is to use