Feature #2: slow writing of waves.sxb

Added by Noam Bernstein 20 days ago. Updated 14 days ago.

Status: New
Priority: Normal
Assignee: -
Start date: 03/12/2025
Due date: -
% Done: 0%
Estimated time: -
Description

For some reason my sphinx takes a very long time to write waves.sxb, even for systems that aren't that large (primitive bcc Fe cell, 680 eV cutoff, 12^3 k-point mesh, ~300 MB waves.sxb): judging by the file's last-modified time and the stdout, it appears to hang for 5-15 minutes. This may or may not be the fault of the way I compiled parallel I/O and/or of our filesystem. Regardless, I'm wondering about two possibilities:
  1. turning off writing of waves.sxb entirely - this doesn't seem to be possible, even if I put noRhoStorage and noWavesStorage in the initialGuess and all the SCF sections.
  2. putting in some timing code to figure out what exactly is so slow.

Would you have suggestions for trying either of these possibilities?

Actions #1

Updated by Christoph Freysoldt 19 days ago

Hi Noam,

I cannot reproduce your observation that noWavesStorage and noRhoStorage are ignored when I use my current version. Which sphinx version do you use (sphinx --version should tell)?

When I put these flags, the log file says "storage omitted", and the sxb files aren't written (except for vElStat-eV.sxb).

If this is not a human mistake (like misspelling the camel case), you can try commenting out the write calls in the SxHamSolver::writeData routine in dft/SxHamSolver.cpp.

If something else is slow, let me know. A long time ago, I had problems when writing with certain combinations of parallel netcdf and MPI libraries, but nothing I really understood (a different library version solved the problem). I also had severe problems when multiple sphinx runs were trying to write to the same file, e.g. when I by mistake started several serial executables in parallel instead of the MPI one. I think that had to do with file locking in the netcdf library. But in that case the log files get corrupted, too.

But I agree that this slow writing should not happen at all. That's why we had the no...Storage flags in the first place.

Actions #2

Updated by Noam Bernstein 15 days ago

3.0.9. I'd be happy to update if this is something likely to be fixed in a newer version.

This is my main section:

main {
    scfDiag {
        blockCCG {
            blockSize = 32;
            maxStepsCCG = 4;
        }
        dEnergy = 0.001 / 27.211386024367243;
        rhoMixing = 0.5;
        spinMixing = 0.5;
        maxSteps = 100;
        nPulaySteps = 20;
        preconditioner {
            type = KERKER;
            scaling = 0.5;
        }
        noRhoStorage;
        noWavesStorage;
    }
    evalForces { file = "forces.sx"; }
}

and this is the initial guess:

initialGuess {
    waves { lcao {} }
    rho { atomicOrbitals;
        atomicSpin { label="L_Fe_0"; spin=2.29999; }
    }
    noRhoStorage;
    noWavesStorage;
}

I definitely get rho.sxb and waves.sxb files created by this run. Am I specifying something wrong?

Actions #3

Updated by Noam Bernstein 15 days ago

Oddly, I do see "storage omitted" in the stdout file:

tin 2124 : fgrep storage sphinx.stdout 
storage omitted
|   Wavefunctions ...    storage omitted
storage omitted
|   Wavefunctions ...    storage omitted

but the files definitely exist:

tin 2126 : ls -ltr
total 323533
-rwx------ 1 bernstei bernstei       982 Mar 17 12:31 base.sx.0*
-rwx------ 1 bernstei bernstei    238254 Mar 17 12:31 POTCAR.Fe*
-rwx------ 1 bernstei bernstei       565 Mar 17 12:31 struct.sx.0*
-rw------- 1 bernstei bernstei         0 Mar 17 12:31 fftwisdom.dat
-rw------- 1 bernstei bernstei     12128 Mar 17 12:31 AtomicOrbitals00.dat
-rw------- 1 bernstei bernstei     12128 Mar 17 12:31 AtomicOrbitals01.dat
-rw------- 1 bernstei bernstei     12128 Mar 17 12:31 AtomicOrbitals02.dat
-rw------- 1 bernstei bernstei      1277 Mar 17 12:32 energy.dat
-rw------- 1 bernstei bernstei       159 Mar 17 12:32 spins.dat
-rw------- 1 bernstei bernstei       317 Mar 17 12:32 residue.dat
-rw------- 1 bernstei bernstei    124444 Mar 17 12:32 vElStat-eV.sxb
-rw------- 1 bernstei bernstei       383 Mar 17 12:32 forces.sx
-rw------- 1 bernstei bernstei    242329 Mar 17 12:32 rho.sxb
-rw------- 1 bernstei bernstei 379509440 Mar 17 12:40 waves.sxb
-rw------- 1 bernstei bernstei    252642 Mar 17 12:40 eps.0.dat
-rw------- 1 bernstei bernstei    252642 Mar 17 12:40 eps.1.dat
-rw------- 1 bernstei bernstei       167 Mar 17 12:40 parallelHierarchy.sx.actual
-rw------- 1 bernstei bernstei   2849664 Mar 17 12:40 sphinx.stdout

Actions #4

Updated by Christoph Freysoldt 15 days ago

I think I know what is going on.

The waves/rho are written by the evalForces{} group. That is consistent with the timing of the files (forces.sx written directly before rho.sxb), as well as with the source code.

This has been changed in 3.1 (where evalForces never writes rho/waves again), but on the code side prior versions should still respect noRhoStorage/noWavesStorage also for evalForces. However, I am not entirely sure whether the format checker complains about extra flags being set in the evalForces group - if so, one would have to declare the flags in share/sphinx/std/paw.std.

As an even better solution, it should also be possible to set the flags at the top level, i.e., outside the main{} group: there the format checker allows any settings, and they are found from within any sublevel. The only exceptions to this general rule are a few settings like dEnergy that occur at multiple levels with different meanings; for these, the bottom level must not look outside when the setting is missing, but rather use a default value.
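
For illustration, a sketch of what the top-level placement could look like, reusing the groups from the input posted above verbatim (the rest of the input file, e.g. structure and basis, is omitted; whether the 3.0.9 format checker accepts this is exactly what needs to be tested):

noRhoStorage;
noWavesStorage;

initialGuess {
    waves { lcao {} }
    rho { atomicOrbitals;
        atomicSpin { label="L_Fe_0"; spin=2.29999; }
    }
}

main {
    scfDiag {
        blockCCG { blockSize = 32; maxStepsCCG = 4; }
        dEnergy = 0.001 / 27.211386024367243;
        rhoMixing = 0.5;
        spinMixing = 0.5;
        maxSteps = 100;
        nPulaySteps = 20;
        preconditioner { type = KERKER; scaling = 0.5; }
    }
    evalForces { file = "forces.sx"; }
}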

I didn't do any checks on an actual 3.0.9 installation. If things do not work as expected, let me know.

Actions #5

Updated by Noam Bernstein 14 days ago

OK - moving the noRho and noWaves outside all of the sections indeed works, even with the current 3.0.9 version. That's a good workaround for this particular set of runs, but I'd like to resolve some of the slow I/O issues as well. I guess we can close this issue, and I'll investigate other MPI versions and see whether I see anything systematic. I can always open another issue for that if I can't resolve it.

I don't suppose you happen to remember if the old issues were solved by netcdf or MPI version changes or both?
