Bug #4: strange convergence issue
Status: open
% Done: 0%
Description
I'm running some fairly simple spin-constrained calculations: a slightly deformed FCC Fe supercell, 32 atoms, with moments constrained to +/-2.22 (randomly distributed, summing to 0). Usually this works fine (i.e., for various particular strains/perturbations and spin orders), but some runs fail to converge with messages like `Can't find root of fermi function!`.
There's no evidence from the `F(n)` lines that convergence is going wrong, but the last `MSpin` line looks quite bad, with spins that are very far from +/-2.
```
MSpinIN and MSpinPlusIN = [1.71358, 2.15209, -2.20616, -2.12958, 1.83245, -2.19669, -2.51251, 2.29917, 2.03888, -2.6082, -2.54471, -2.26177, 2.14971, -2.36319, -2.10124, 2.12354, 1.73145, 2.13388, 1.87072, -2.20505, -2.12301, 1.95849, 1.60102, 2.59554, -1.92163, -2.14923, 2.03021, -1.98307, -2.19011, 1.79122, 1.80599, 2.1234, 2.27402, -2.06464, -2.33764, -2.10806] [7.93397, 0.421756, -0.184141, -0.212914, 7.86488, -0.158079, 7.56214, 0.307096, 7.72943, 7.6946, 7.6804, -0.172813, 0.369972, -0.128673, -0.217443, 5.90343, 7.82785, 7.55684, 8.0082, -0.136673, -0.160927, 7.91901, 7.99078, -7.35754, -7.70475, -0.173438, 7.78037, -7.68236, -0.171121, 7.95852, 7.95509, 0.449824, 0.256467, -6.66463, -0.102838, -0.22757]
```
I've attached the badly behaved stdout file. What other information would be useful to understand what's going on?
Files
Updated by Christoph Freysoldt about 1 month ago
- Assignee set to Christoph Freysoldt
Dear Noam,
I have tried to understand what might have happened, but I can only offer speculation.
According to the output, everything looks perfectly fine until the last scf iteration, and within that, the loop that optimizes the Lagrange multipliers (nu) looks ok until the line where Gamma jumps to 5.66314e+08. This is a conjugate-gradient minimization; kappa is the trial step length for the line minimization and kappa opt the optimized step length. The "gradient" for the search direction is the deviation in the atomic spins, and gamma is the ratio of the new to the previous gradient (squared). That should not be able to jump by 8 orders of magnitude after a small step in the previous iteration.
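In formula form, as a sketch assuming a standard Fletcher-Reeves-type update (which matches the description above; the actual scheme in the code may differ):

```
\gamma_k = \frac{\lVert g_k \rVert^2}{\lVert g_{k-1} \rVert^2}, \qquad
d_k = -g_k + \gamma_k \, d_{k-1}
```

where g_k is the deviation of the atomic spins from their target values in iteration k and d_k is the new search direction. Under that assumption, gamma ~ 5.7e8 would mean the norm of the spin deviation grew by roughly four orders of magnitude in a single step.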
I suspect that this results from a synchronization error between the MPI tasks, which can happen if numerical noise accumulates in (conceptually global) quantities that are computed independently on each MPI task. What then happens is that the MPI tasks disagree on whether the iterations have converged, and they no longer work on the same part of the code. Some MPI tasks then send completely different data, the eigenvalues pick up random values, the Fermi-level search fails to find a reasonable value, and you get the symptoms I see in the log file.
One frequent symptom of an MPI mismatch is that the behavior is not reproducible, i.e., the same calculation stops at different points, or runs in some cases and stops in others. One could also try setting the environment variable SX_LOG_STDOUT=ALL; then all MPI tasks produce output (tasks 1 to N write to hidden files .stdoutDump_<i>), and one can compare whether they diverge at the end, e.g. whether some tasks continue with the next step while others keep iterating the nu's. Also, if affordable, a non-MPI run should not show such issues.
I had similar problems a long time ago with the scf iterations, and I ended up broadcasting the convergence decision. Before I implement a corresponding fix for the nu optimization loop, could you check whether my suspicion is correct by rerunning the same calculation with SX_LOG_STDOUT=ALL and looking at whether the final lines of output show any signs of discrepancies?
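For illustration, here is a minimal sketch of what "broadcasting the convergence decision" can look like; this is not the actual code from this project, and the residue variable and tolerance are hypothetical placeholders. The point is simply that one rank's decision is used by everyone, so all MPI tasks leave the loop in the same iteration.

```
#include <mpi.h>
#include <cmath>

// Hypothetical convergence check for an iterative loop (e.g. the nu
// optimization). Without the broadcast, rounding noise can make the
// locally computed residue differ slightly between ranks, so some ranks
// may exit the loop while others keep iterating.
bool convergedEverywhere(double localResidue, double tol)
{
    int converged = (std::fabs(localResidue) < tol) ? 1 : 0;

    // Make rank 0's decision authoritative for all MPI tasks,
    // so every rank takes the same branch.
    MPI_Bcast(&converged, 1, MPI_INT, 0, MPI_COMM_WORLD);
    return converged != 0;
}
```

An equivalent choice would be an MPI_Allreduce with MPI_LAND over the per-rank flags; either way, all ranks end up agreeing on a single yes/no.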
If that is the case, I apologize for the inconvenience. MPI parallelization was implemented in a second round, using simple test cases that I could also run in serial for comparison. Those simple test cases may be better behaved.
Updated by Noam Bernstein 25 days ago
Thank you for taking a look. It does look like the behavior is not reproducible, which suggests your idea about synchronization between the MPI tasks makes sense. Unfortunately I haven't yet been able to reproduce this behavior with a system cheap enough to run serially, but I can do several parallel runs to get a sense of how often it happens, and save the SX_LOG_STDOUT=ALL output.
Updated by Noam Bernstein 22 days ago
I was able to reproduce this behavior in a run with SX_LOG_STDOUT=ALL. How do we go about confirming your hypothesis? Is there something I should look at, or should I just upload all 8 stdout files?
Updated by Christoph Freysoldt 22 days ago
Thanks.
There should be differences toward the end, notably in the nu loop (kappa / kappa opt / Gamma etc.). Some tasks could finish at low residues right before the others explode.
If you don't see anything, please upload all 8 log files (or the last scf iteration, if they are too big).
Updated by Noam Bernstein 21 days ago
- File stdouts.tar.gz stdouts.tar.gz added
I don't see any differences in the last 100 or so lines (although there are differences at the machine-precision level in some quantities earlier in the process). I've uploaded the output files. One thing I also have, which I'll inspect, is the corresponding set of files from a run with supposedly the same inputs that didn't blow up. If I see anything interesting I'll upload those as well.