The table below shows the scaling of the VASP.4 code on the T3D. The system is l-Fe with a cell containing 64 atoms; only the Gamma point was used, the number of plane waves is 12500, and the number of included bands is 384.
CPUs | 4 | 8 | 16 | 32 | 64 | 128 |
NPAR | 2 | 4 | 4 | 8 | 8 | 16 |
POTLOK: | 11.72 | 5.96 | 2.98 | 1.64 | 0.84 | 0.44 |
SETDIJ: | 4.52 | 2.11 | 1.17 | 0.61 | 0.36 | 0.24 |
EDDIAG: | 73.51 | 35.45 | 19.04 | 10.75 | 5.84 | 3.63 |
RMM-DIIS: | 206.09 | 102.80 | 52.32 | 28.43 | 13.87 | 6.93 |
ORTHCH: | 22.39 | 8.67 | 4.52 | 2.4 | 1.53 | 0.99 |
DOS: | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
LOOP: | 319.07 | 155.42 | 80.26 | 44.04 | 22.53 | 12.39 |
efficiency (%): | | 100 | 99 | 90 | 90 | 80 |
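The final row of the table appears to be the parallel efficiency in percent, measured relative to the 4-CPU run. A minimal sketch of that arithmetic, using the LOOP timings from the table, is given below. The function name `parallel_efficiency` and the choice of the 4-CPU run as reference are assumptions made for illustration; neither is part of VASP.

```python
# LOOP timings taken from the table above, keyed by CPU count.
LOOP_TIMES = {4: 319.07, 8: 155.42, 16: 80.26, 32: 44.04, 64: 22.53, 128: 12.39}


def parallel_efficiency(times, ref_cpus=4):
    """Return {cpus: efficiency in percent} relative to the ref_cpus run.

    Efficiency is (t_ref * ref_cpus) / (t * cpus): 100 % means the run
    time halves every time the CPU count doubles.
    """
    t_ref = times[ref_cpus]
    return {n: 100.0 * t_ref * ref_cpus / (t * n) for n, t in sorted(times.items())}


if __name__ == "__main__":
    for n, eff in parallel_efficiency(LOOP_TIMES).items():
        print(f"{n:4d} CPUs: {eff:6.1f} %")
```

With these timings the sketch gives roughly 103, 99, 91, 89 and 80 % for 8 to 128 CPUs, close to the rounded values quoted in the table.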
The main problem with the current algorithm is the subspace rotation. Subspace rotation requires the diagonalisation of a relatively small matrix (in this case 384 × 384), and this step scales badly on a massively parallel machine. VASP currently uses either scaLAPACK or a fast Jacobi matrix diagonalisation scheme written by Ian Bush (T3D, T3E only). On 64 nodes, the Jacobi scheme requires around 1 second to diagonalise the matrix, but increasing the number of nodes does not improve the timing. The scaLAPACK routine requires at least 2 seconds, and it already reaches this performance with 16 nodes.
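The effect of a step that does not speed up with more nodes can be illustrated with a simple Amdahl-style estimate. The sketch below is not taken from VASP. It assumes, purely for illustration, that each iteration contains one fixed 1 s diagonalisation (the Jacobi figure quoted above) and that everything else scales perfectly from the measured 64-node LOOP time of 22.53 s. The function name `estimated_loop_time` and the perfect-scaling assumption are hypothetical.

```python
def estimated_loop_time(nodes, t_ref=22.53, ref_nodes=64, t_fixed=1.0):
    """Estimate the LOOP time in seconds on `nodes` nodes.

    Assumption (illustrative only): t_fixed seconds do not scale at all,
    while the remaining t_ref - t_fixed seconds scale ideally with the
    number of nodes.
    """
    scalable = t_ref - t_fixed
    return t_fixed + scalable * ref_nodes / nodes


if __name__ == "__main__":
    for nodes in (64, 128, 256, 512, 1024):
        t = estimated_loop_time(nodes)
        share = 100.0 * 1.0 / t  # share of the fixed 1 s step
        print(f"{nodes:5d} nodes: {t:6.2f} s, fixed step = {share:4.1f} % of LOOP")
```

Under these assumptions the fixed step grows from about 4 % of the iteration on 64 nodes to over 40 % on 1024 nodes, which is why a non-scaling diagonalisation limits the usable machine size.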
[Figure: scalePdO_3.2G.eps]