For good performance, VASP requires highly optimised BLAS routines. This package can be retrieved from many public domain servers, for instance ftp.netlib.org. Most machine suppliers also offer optimised BLAS packages. BLAS routines are for instance part of the following libraries:
    libessl (on IBM)
    libcxml (on DEC ALPHA)
    libblas (available from SGI)
    libmkl (available from INTEL)
    libgoto (P4/Athlon, http://www.cs.utexas.edu/users/kgoto/signup_first.html)

These packages reach peak performance on most machines (up to 6 Gflops). Whenever possible one should obtain these routines from the manufacturer of the machine. As an alternative, one can install the public domain versions, but this might slow down VASP by a factor of 1.5 to 2 for very large systems.
If possible, an optimised LAPACK should also be installed, although this is less important for good performance. All required LAPACK routines are also available in the file vasp.lib/lapack_double.f. If optimised LAPACK routines are not available, it is often possible to improve performance slightly by specifying -DNOZTRMM (see section 3.5.4) in the makefile. Whether this helps can be determined by running a large test system (for instance bench.Hg.tar) with IALGO=-1 specified in the INCAR file. The only timing influenced is ORTHCH.
The performance of the FFT routines is also of considerable importance. VASP is supplied with routines written and optimised by J. Furthmüller (a version of Swarztrauber's multiple sequence FFT, supporting radices 2, 3, 4, 5 and 7). On most machines these routines outperform the manufacturer supplied routines (for instance on CRAY C90, SGI, DEC). It is possible to optimise these routines by supplying an additional flag to the pre-compiler:
    -DCACHE_SIZE=XXXXX

The following values resulted in optimal performance:

    IBM       -DCACHE_SIZE=32768
    T3D       -DCACHE_SIZE=8000
    DEC ev5   -DCACHE_SIZE=8000
    LINUX     -DCACHE_SIZE=16000

CACHE_SIZE=0 has a special meaning. It performs the FFTs in the x and y directions plane by plane, which increases the cache consistency on some machines, so it is worthwhile trying this setting as well. After changing CACHE_SIZE in the makefile, fft3dfurth.F must be touched

    touch fft3dfurth.F

and VASP recompiled. On vector computers CACHE_SIZE should be set to 0. One can also try increasing the optimisation level for these routines, although in our tests we have never found a significant performance improvement.
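The edit-and-touch step can be scripted when several CACHE_SIZE values are to be compared. The following is a minimal sketch only: set_cache_size is a hypothetical helper that is not part of the VASP distribution, and it assumes that the makefile already contains a -DCACHE_SIZE=<number> entry.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with VASP): rewrite the numeric value of
# -DCACHE_SIZE=... in a makefile and touch the FFT source file.
# Usage: set_cache_size <makefile> <value> [<fft source, default fft3dfurth.F>]
set_cache_size() {
    makefile=$1
    value=$2
    fft_source=${3:-fft3dfurth.F}
    # Replace every -DCACHE_SIZE=<digits> occurrence; write to a temporary
    # file first so the original is only overwritten if sed succeeds.
    sed "s/-DCACHE_SIZE=[0-9][0-9]*/-DCACHE_SIZE=${value}/g" "$makefile" > "$makefile.tmp" &&
        mv "$makefile.tmp" "$makefile" &&
        # Update the timestamp so that make recompiles the FFT routines.
        touch "$fft_source"
}
```

For instance, calling set_cache_size makefile 0 in the source directory and then running make would rebuild with the plane-by-plane setting. The name of the makefile and the way make is invoked depend on the local installation.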
There are a few other routines which might benefit from higher optimisation; the most important are nonl.F and nonlr.F. Tests for these routines can be done with bench.Hg.tar and IALGO=-1. For LREAL=.TRUE. the timings for RPRO and RACC (nonlr.F) are affected, whereas for LREAL=.FALSE. the timings for VNLACC and PROJ (nonl.F) are affected. In particular, one can try to set -Davoidalloc in the makefile (see Sec. 3.5.12). In this case ALLOCATE and DEALLOCATE sequences are avoided in some performance sensitive areas. Notably under LINUX, ALLOCATE and DEALLOCATE are slow, and avoiding them improves the performance of nonlr.F by roughly 10% (presently this option is selected on all Linux platforms).
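When comparing builds, it can help to pull only the relevant timings out of the output of an IALGO=-1 run. The sketch below makes two assumptions: the timing output has been saved to a file (OUTCAR is used as an example name), and each timing line contains the routine name. print_timings is a hypothetical helper, not a VASP tool.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of an output file that mention the
# given routine names (for example ORTHCH, RPRO, RACC, VNLACC, PROJ).
# Usage: print_timings <output file> <routine> [<routine> ...]
print_timings() {
    file=$1
    shift
    for routine in "$@"; do
        # Assumption: every timing line contains the routine name verbatim.
        grep -- "$routine" "$file"
    done
}
```

A run with LREAL=.TRUE. could then be inspected with print_timings OUTCAR RPRO RACC, and a run with LREAL=.FALSE. with print_timings OUTCAR VNLACC PROJ. Saving the result for each build makes it easy to compare the timings side by side.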