[lammps-users] LAMMPS and eventually FFTW3

Dear LAMMPS-developers,

I am aware that LAMMPS currently uses the (now classified as legacy) FFTW2 library, so this isn't another one of those mails. I am interested in FFTW3 support in LAMMPS and would be willing to help out. I'm sending this e-mail just to get a grounding in the facts.

MPI FFTs in FFTW3 seem to be coming out "soon" (on the order of months, judging from their changelogs; then again, FFTW3 was released back in 2003). It is worth noting that MPI FFT support has been merged into and then removed from the development trunk across past alpha builds.

So I ask (pertaining to when the FFTW folks actually release MPI FFTs in v3):
Are "fft3d.cpp" and "fft3d.h" the only LAMMPS files related to FFT handling?
If the correct code is added to just those two files, guarded by "#ifdef" blocks, would support be enabled simply by compiling with "-DFFT_FFTW3"?
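
For concreteness, the kind of guarded block I have in mind for fft3d.h would be something like this; it is only a sketch that follows the pattern of the existing FFT_FFTW block, and the FFT_DATA typedef name is taken from that pattern rather than verified against the current file:

  #ifdef FFT_FFTW3
  #include "fftw3.h"                /* FFTW3 public header */
  typedef fftw_complex FFT_DATA;    /* fftw_complex is double[2] in FFTW3 */
  #endif

plus the matching "#ifdef FFT_FFTW3" branches in the plan struct and in the create/compute/destroy routines of fft3d.cpp.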

The only thing LAMMPS uses from FFTW is on-processor 1-d FFTs.
It doesn't use, nor probably will it use in the future, any MPI FFTW
stuff (e.g. its parallel FFTs), unless there is a good reason to do so.

So all you really need to figure out (with the current FFTW3) is what
the interface is to its 1-d FFTs, to do a group of them. This is
what LAMMPS calls from FFTW2, but the interface changed in FFTW3,
and we haven't bothered to upgrade. If you want to add the changes,
it would be exactly what you describe below - additions to fft3d.cpp/h
with some new ifdefs.
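
For reference, the FFTW2 call LAMMPS makes is the multi-FFT interface (plan once, pass the arrays every call), and from a quick look at the FFTW3 docs the equivalent appears to be fftw_plan_many_dft, where the arrays and the group layout go into the plan itself. A sketch with placeholder variables (n = 1-d length, howmany = number of FFTs in the group, in/out = the data arrays), not the actual fft3d.cpp code:

  /* FFTW2 (#include "fftw.h"): arrays are passed at execute time */
  fftw_plan p2 = fftw_create_plan(n, FFTW_FORWARD, FFTW_ESTIMATE);
  fftw(p2, howmany, in, 1, n,        /* howmany contiguous length-n FFTs in 'in' */
                    out, 1, n);      /* written to 'out' with the same layout */
  fftw_destroy_plan(p2);

  /* FFTW3 (#include "fftw3.h"): arrays and layout are given at plan time */
  fftw_plan p3 = fftw_plan_many_dft(1, &n, howmany,
                                    in,  NULL, 1, n,   /* input:  stride 1, distance n */
                                    out, NULL, 1, n,   /* output: same layout */
                                    FFTW_FORWARD, FFTW_ESTIMATE);
  fftw_execute(p3);                  /* transforms the arrays given when the plan was made */
  fftw_destroy_plan(p3);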

Steve

Find attached my first stab at FFTW3 support: the "fft3d.*" files to replace in "src".

Simply compile with "-DFFT_FFTW3" and link with "-lfftw3". I got an executable on an Ubuntu 10.04 x64 machine with the GNU compilers and OpenMPI, though none of that should matter.
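
In the machine Makefile that comes down to something like the following (variable names as in the stock src/MAKE makefiles; add the usual -I/-L paths if FFTW3 is not installed in a default location):

  FFT_INC  = -DFFT_FFTW3
  FFT_PATH =
  FFT_LIB  = -lfftw3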

I'm concerned about whether FFTW3 reproduces the FFTW2 results. I ran the "peptide" example using:
(1) The unchanged 1May10 release
(2) The source with my 2 changed files, with FFTW2 ("-DFFT_FFTW")
(3) The source with my 2 changed files, with FFTW3 ("-DFFT_FFTW3")

(1) and (2) give identical results, as they should. However, (2) and (3) don't. I am pleased to see the run timings are similar, but the system states (temperature, energies, etc.) differ. Aren't the differences numerically significant? I want to say yes, so I'm opening the question to the developers and the community for input or for corrections to my source changes. I can provide the associated dump files too, or you can simply recompile with these two fft3d.* files and link against FFTW3 yourself to play with it.

Also, I’m inclined to say FFT_INTEL refers to an interface that was deprecated with MKL 9.0. I may attempt to add a newer interface for it (say for -DFFT_INTEL_MKL9, which would be the DFTI interface).
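
If I get to it, the DFTI version of a group of 1-d FFTs would look roughly like this (an untested sketch against mkl_dfti.h, with n, howmany, in, and out as placeholders as above; -DFFT_INTEL_MKL9 is just my working name for the define):

  #include "mkl_dfti.h"

  DFTI_DESCRIPTOR_HANDLE h;
  DftiCreateDescriptor(&h, DFTI_DOUBLE, DFTI_COMPLEX, 1, (MKL_LONG) n);
  DftiSetValue(h, DFTI_NUMBER_OF_TRANSFORMS, (MKL_LONG) howmany);
  DftiSetValue(h, DFTI_INPUT_DISTANCE,  (MKL_LONG) n);   /* consecutive 1-d transforms are n apart */
  DftiSetValue(h, DFTI_OUTPUT_DISTANCE, (MKL_LONG) n);
  DftiSetValue(h, DFTI_PLACEMENT, DFTI_NOT_INPLACE);     /* write to a separate output array */
  DftiCommitDescriptor(h);
  DftiComputeForward(h, in, out);                        /* DftiComputeBackward for the inverse */
  DftiFreeDescriptor(&h);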

fft3d.cpp (31.9 KB)

fft3d.h (5.87 KB)

log.fftw2.lammps (5.63 KB)

log.fftw3.lammps (5.63 KB)

yes - the differences look too big. I would guess there
would be zero difference if implemented correctly. So I think
you will need to debug what FFTW3 is returning vs FFTW2 and
figure out why it is different. You should be able to
do that for a 0 timestep run, for the initial energy calculation.
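
E.g. you could print a simple fingerprint of the data after each fft_3d() call in both builds and diff the two logs. Something like this throwaway helper (nothing that exists in the code; it assumes the complex data is laid out as pairs of doubles, which holds for both libraries in double precision):

  #include <math.h>
  #include <stdio.h>

  /* print a fingerprint of n complex values viewed as 2*n doubles */
  static void fft_fingerprint(const char *tag, double *a, int n)
  {
    double sum = 0.0, amax = 0.0;
    for (int i = 0; i < 2*n; i++) {
      sum += a[i];
      if (fabs(a[i]) > amax) amax = fabs(a[i]);
    }
    printf("%s: sum = %.15g  max = %.15g\n", tag, sum, amax);
  }

  /* e.g. fft_fingerprint("after fft1", (double *) work1, nfft); */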

Steve

Well, I've gotten a little further. FFTW v3's encapsulation of the input/output arrays is quite a thorn for a package that has supported many types of FFTs.
The FFTW folks justify the encapsulation of the input/output arrays (something not done in FFTW2) with:
"In most high-performance applications, as far as we can tell, you are usually transforming the same array over and over, so FFTW's semantics should not be a burden"

Looking at LAMMPS, I’d be willing to nod my head yes to that statement.
However, observe the only lines of code that involve the FFT calls:

grep -n "->compute(" pppm.cpp

The first 4 hits are production calls, the last 4 are timing calls.
Why is "work1" used with "fft2->compute" in the timing calculations instead of "work2", which would be consistent with the production calculation lines?
The only reason it bothers me is that it would mess up this whole encapsulation paradigm FFTW3 is so intent on using.

Can anyone shed some light on it for me?

Work1 and 2 are just scratch arrays. You can see some lines
in pppm.cpp that use both, so both are needed.

Steve

Indeed, that is clear from the code. What I couldn’t understand was how the FFT3d object named “fft2” first operated on work2, and later it switched to operating on work1, whereas “fft1” stayed consistent with work1.

I'd be tempted to conclude that all fft2 calls should operate on work2?
That is, "fft2->compute(work1,work1,-1)" should be "fft2->compute(work2,work2,-1)"?

hi,

i have just started looking into fftw3 support in LAMMPS yesterday,
since i am looking for a thread parallel FFT.

the work arrays are ok in principle. there is one issue, but that has nothing
to do with fftw. there is one big change between fftw2 and fftw3 that would
require some changes to the code flow: in fftw3 the array pointers are part
of the plan. the way the FFTs are called through the FFT3d objects thus
requires generating the plans on the fly and not ahead of time.
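
for illustration, with plain fftw3 calls (not the fft3d wrappers): the plan remembers the arrays it was created with, and one way around re-planning every call would be the new-array execute interface, which requires the replacement arrays to have the same layout and alignment. a rough sketch:

  fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
  fftw_execute(p);                  /* always transforms the in/out given at plan time */
  fftw_execute_dft(p, in2, out2);   /* new-array execute: same layout/alignment required */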

the issue with the work arrays is that ANSI c/c++ forbids argument aliasing.
that means the code fft2->compute(work1,work1,...) is not compliant and
can result in miscompilations, unless the compiler is told not to enforce
ANSI compliance. generally, assuming that there is no argument aliasing
helps the compiler optimize, so rewriting the code in question to have three work
buffers instead of two is probably a good investment in the future.

we've been through this with the potential tables already, and it has
helped several pair styles run faster with recent GNU and Intel compilers.

suggestions, comments, etc. are highly welcome.

cheers,
    axel.

Comment below.

Steve

> Indeed, that is clear from the code. What I couldn't understand was how the
> FFT3d object named "fft2" first operated on work2, and later it switched to
> operating on work1, whereas "fft1" stayed consistent with work1.
>
> I'd be tempted to conclude that all fft2 calls should operate on work2?
> That is, "fft2->compute(work1,work1,-1)" should be
> "fft2->compute(work2,work2,-1)"?

> hi,
>
> i have just started looking into fftw3 support in LAMMPS yesterday,
> since i am looking for a thread parallel FFT.
>
> the work arrays are ok in principle. there is one issue, but that has nothing
> to do with fftw. there is one big change between fftw2 and fftw3 that would
> require some changes to the code flow: in fftw3 the array pointers are part
> of the plan. the way the FFTs are called through the FFT3d objects thus
> requires generating the plans on the fly and not ahead of time.
>
> the issue with the work arrays is that ANSI c/c++ forbids argument aliasing.
> that means the code fft2->compute(work1,work1,...) is not compliant and
> can result in miscompilations, unless the compiler is told not to enforce
> ANSI compliance. generally, assuming that there is no argument aliasing
> helps the compiler optimize, so rewriting the code in question to have three work
> buffers instead of two is probably a good investment in the future.

I don't understand this. The call to compute() ends up in a call to
fft_3d(in,out) where in may = out, but it
is not doing anything invalid wrt aliasing, so far as I know. That routine
calls lower-level functions, including library 1-d FFTs, and whenever in = out
in one of those calls, it is valid for something like doing an FFT in place
or a memory copy. So I don't see where we would ever be confusing
the compiler with a call to a low-level routine that does actual work.

> I don't understand this. The call to compute() ends up in a call to
> fft_3d(in,out) where in may = out, but it
> is not doing anything invalid wrt aliasing, so far as I know. That routine

you are right. i've been mixing up fortran with c/c++.

if this were a fortran routine, in = out would not be
allowed at all! i just double checked, and in c++
this is allowed, but _only_ if in and out are of the
same data type.

with the table lookups there was an illegal aliasing between
an int * and a float *, which was then miscompiled
by compilers and required using a union type to signal to
the compilers that aliasing does happen in this case.
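
roughly like this, with made-up names (illustration only, not the actual table code):

  /* reading a float's bits through an int pointer is illegal aliasing
     and can be reordered or dropped by an optimizing compiler: */
  float f = 1.0f;
  int bad = *(int *) &f;

  /* the union makes the overlap explicit, and the compilers we care
     about then handle the type punning correctly: */
  union int_float_u { int i; float f; };
  union int_float_u u;
  u.f = 1.0f;
  int ok = u.i;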

> calls lower-level functions, including library 1-d FFTs, and whenever in = out
> in one of those calls, it is valid for something like doing an FFT in place
> or a memory copy. So I don't see where we would ever be confusing
> the compiler with a call to a low-level routine that does actual work.

compilers these days do strange things, especially with automatic
vectorization and other memory access optimizations. for wrapper calls,
this is probably not an issue, but i consider it good programming practice
not to depend on something going right. in any case, as long as in and
out are of the same data type, this is a safe operation and the compiler is
required to assume that in may be equal to out.

cheers,
    axel.