[lammps-users] LAMMPS and eventually FFTW3

Hello!

I have written the code to solve the encapsulation issue. Some FFT3d object methods are now parameter-less, which should satisfy the stricter requirements of FFTW3. This code compiles and runs with FFTW2, and I see no detectable changes in the “peptide” example. I assume I have not broken any of the other FFTs, but I have no way to test them.

I have also added FFTW3 code. Linking is straightforward (-DFFT_FFTW3 + -lfftw3) and it successfully compiles for me. But it segfaults at runtime. I feel like I’m close, but I’m missing something.

Find attached the patch that encapsulates arrays into the plan.

PS: Oops, I left the mailing list off the CC.

patch.15May_encaps-borkfftw3 (15.7 KB)

Hello!

I have written the code to solve the encapsulation issue. Some FFT3d object
methods are now parameter-less, which should satisfy the further
stringencies of FFTW3. This code compiles and runs with FFTW2 and there

i don't agree with your approach. this is problematic in many ways.

it would be much cleaner to leave the fft3d class wrapper itself
untouched and rather call the planner right before the fft_execute.
with FFTW this has no real performance cost, since fftw records
and stores plan information internally. in fact, you might be able to speed
things up by pre-planning during instantiation of the FFT3d class.

i will give it a try later today and see what i can come up with.

cheers,
     axel.

rather call the planner right before the fft_execute.
with FFTW this has no real performance problem,

I'm not clear at what level you are talking about this.
But it would be a mistake, I think, to call the LAMMPS 3d FFT
plan creator routine every time; that operation is not lightweight.
It should be called once at setup, and the plan then invoked
repeatedly, as it is now.

Steve

I know that for FFTW3 only the first planner invocation for a given set of parameters is expensive; subsequent plans with the same parameters are fast. I do not think the same holds for the other FFT routines, including FFTW2, so I am happy leaving the program flow as it is.

I agree that the way I encapsulated things isn’t pretty. At least it’s functional (for FFTW2) and doesn’t seem to be slower than the stable trunk. My main focus right now is getting FFTW3 functional at all, i.e. getting past the segfault.

At this point, I’m convinced my problem is in the planner step, located in the block beginning at line 776 of my version of fft3d.cpp. I invoke “fftw_plan_many_dft”, since it most closely resembles the FFTW2 call that was already there. However, this new interface requires “{i,o}nembed” parameters, for which I just pass NULL.
FFTW3 planner docs
What should those two parameters be?

rather call the planner right before the fft_execute.
with FFTW this has no real performance problem,

I'm not clear on at what level you are talking about this.
But it would be a mistake, I think, to call the LAMMPS 3d FFT
plan creator routine every time. That operation is not lightweight.
It should be called once at setup and then invoked repeatedly,
as it is now.

i am not suggesting to change the calling sequence of the
LAMMPS code. on the contrary: i suggest keeping it
as unchanged as possible.

however, fftw planning operates a bit differently than other
fft libraries. fftw maintains internal copies of all factorizations/plans
it has generated, so only the very first planning call is time consuming;
all later calls are very fast. it is therefore not a problem with fftw
to only record the array sizes and do the "pre-planning" at the time
when the lammps fft planner is called, and then create a new plan (which
has to include the array pointers handed in, and those keep changing)
at each compute() invocation.

it might be _even_ faster to generate plans for all permutations,
but that would be another optimization step. let us first get it to work. ;-)

cheers,
     axel.

please try out this patch.

http://git.icms.temple.edu/git/?p=lammps-icms.git;a=commitdiff;h=3d8af5246422f5cfaaa6c18c0a9b2b63293b75df

this implements FFTW3 support the way i was suggesting.
not bad for a couple of hours worth of work. ;-)

I know that for FFTW3 only the first invocation for a given plan is a
non-fast operation. Subsequent plans with the same parameters are
then fast. I do not think that same quality holds for the other FFT
routines, including FFTW2, so I feel happy leaving program flow as it
is.

FFTW2 has the same feature, but you have to use the FFTW_USE_WISDOM
flag to enable it. LAMMPS doesn't use it.

I can agree that the way I encapsulated isn't really that pretty. At
least it's functional (for FFTW2) and doesn't seem to be slower than
the stable trunk. My main focus right now is just trying to get a
functional FFTW3 to work. Just trying to get out of segfault'ing.

At this point, I'm convinced my problem is in the planner step,
located in the block beginning with my version of the code at line 776
in fft3d.cpp. I invoke "fftw_plan_many_dft", since I feel it most
closely resembled the FFTW2 call that was already written. However,
this new interface calls for "{i,o}nembed" parameters, which I just
pass NULL for.

that is fine. the docs explain this.

FFTW3 planner docs
What should those two parameters be?

NULL is correct. your problems are elsewhere.
at the time of planning you don't yet have the
data arrays that you use for the transform.

creating and destroying a new plan is the safe way
to handle this. i'm now looking at implementing the
"new array" variant.

that should allow keeping the plans around and provide
a little speedup. since the new arrays require the same memory alignment,
one needs to replace calls to malloc(3) with calls to posix_memalign(3)
with a proper alignment (16 bytes should be ok for all current platforms).
that should actually give an overall improvement on machines
with vectorizing compilers (e.g. intel icpc).
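a minimal sketch of such an aligned allocator (the fft_malloc/fft_free names are hypothetical placeholders):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* hypothetical fft_malloc/fft_free pair: 16-byte aligned allocation
 * via posix_memalign(3), so SIMD FFT kernels and vectorizing
 * compilers can rely on aligned loads/stores. */
static void *fft_malloc(size_t size)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, 16, size) != 0)
        return NULL;
    return ptr;
}

static void fft_free(void *ptr)
{
    free(ptr); /* posix_memalign memory is released with free(3) */
}
```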

cheers,
    axel.

creating and destroying a new plan is the safe way
to handle this. i'm now looking at implementing the
"new array" variant.

ok. the next step is done, too.

you can check out the changes from my personal
LAMMPS repository at http://git.icms.temple.edu/lammps-icms.git

or browse the changes and extract the individual files from:
http://git.icms.temple.edu/git/?p=lammps-icms.git;a=summary

i'll now have a look into the alignment issues and will
then see if there is any chance to get OpenMP support
working, which is the reason why i wanted to use fftw3
in the first place.

let me know if you can make this code crash. it works
with all my test inputs and with various processor counts.

thanks,
   axel.

I see you beat me to putting the fft_malloc calls into the ICMS code base. With that addition, I’d be willing to declare FFTW3 support (without threads) complete.
I was thinking the new-array syntax would be nicer, but I just couldn’t get it to compile. I don’t know what I was doing wrong, since your additions are minimal and clean.

Consider a short equilibration run of a prototypical 3 nm cube of TIP4P water (attached). With this example alone, I find the FFTW3-linked code runs within 5% of the FFTW2-linked code: sometimes faster, sometimes slower, but with about the same mean runtime. The initial energy calculations are identical; however, the values after timestep 0 differ. I have also attached log files to show the differences I get. Should we be worried that the values differ?

If you answer no to that, I look forward to seeing FFTW3 support merged into the trunk :-)
And I’m working on getting an updated FFT_INTEL_DFTI interface added to fft3d.*

in.lammps (676 Bytes)

data.water (156 KB)

log.lammps.fftw2 (13.1 KB)

log.lammps.fftw3 (13.1 KB)

I see you beat me to putting in the fft_malloc calls in to the ICMS
code-base. With that addition, I'd be willing to declare FFTW3 support
without threads complete.

there is little hope of gaining anything from threading the FFTs.
i have to have a closer look at what fftw does internally.
if at all, the threading has to be done very similarly to what
i am doing in the USER-OPENMP package, where you
basically use threads almost like you program MPI and
massively reduce the thread launch and join overhead.

i've been doing some tests over the last weekend, but
i don't want to push the code to my repository, as i am not
very happy with it and the results are not encouraging.
i can send you a patch, if you are curious to try it out.

I was thinking the new-array syntax would be nicer, but I just couldn't get
it to compile. Don't know what I was doing wrong, because your additions
are minimal and clean.

thanks. i had a bit of an advantage since i have moved several codes
from fftw2 to fftw3 in the past (for example CPMD). i also know from working
on a number of well maintained packages how software maintainers
like contributed changes to be done. it takes a bit of practice, so don't feel
too bad about it.

Consider a prototypical TIP4P water 3 nm cube short equilibration run
(attached). With this example alone, I find the FFTW3-linked code runs
within 5% of the FFTW2-linked code, sometimes faster, sometimes slower, but
with about the same mean runtime. The initial energy calculations are
identical. However, the post-0-timestep values are different. I have also
attached log files to show the differences I get. Should we be worried that
the values differ?

yes! this requires careful attention. i have seen no differences between fftw2
and fftw3 for any of my test inputs. there may be something missing or different
between the two trees. i will check it out later today.

If you answer no to that, I look forward to seeing FFTW3 support merged to
the trunk :-)
And I'm working on getting an updated FFT_INTEL_DFTI interface added to
fft3d.*

for starters you can just compile the fftw2 or fftw3 wrappers and use those...

cheers,
    axel.

Consider a prototypical TIP4P water 3 nm cube short equilibration run
(attached). With this example alone, I find the FFTW3-linked code
runs within 5% of the FFTW2-linked code, sometimes faster, sometimes
slower, but with about the same mean runtime. The initial energy
calculations are identical. However, the post-0-timestep values are
different. I have also attached log files to show the differences I
get. Should we be worried that the values differ?

i can confirm that there is a problem. i get some strange error
on freeing the plans after the minimization, and then a segfault. there
has to be some kind of typo somewhere. i have a couple more hours to track
this down before i have to be on (another) conference call
and can go home. when using the same number of processors, the
results between the different FFTs should be identical (if they
are double precision, that is). i do get identical results when comparing
FFTW2 with the KISSFFT that i hacked in yesterday (for easier profiling
and investigating alternate strategies for multi-threading).

If you answer no to that, I look forward to seeing FFTW3 support
merged to the trunk :-)
And I'm working on getting an updated FFT_INTEL_DFTI interface added
to fft3d.*

since this is getting very technical, i suggest we continue the
discussion off-list and then announce the final outcome here. ok?

more later,
   axel.