using pigz for multi-core gzipping

Hello people,

Here is a small tip for those who frequently save gzipped dump files while running on multi-core nodes: replace the call to gzip (which is usually a single-core executable on Linux systems) with the multi-core gzipping program pigz, see

http://zlib.net/pigz/

Using pigz, you avoid the wasted time when e.g. 15 out of 16 cores sit idle while one core handles the zipping process all by itself. pigz scales very well over multiple cores. On a very short run that mostly just wrote out big gzipped dump files every few steps, I got a factor ~6 speedup on an octocore system and a factor ~15 on a 48-core system. The 48-core test probably didn't scale as well as the octocore one because writing to disk was more of a bottleneck there.

I couldn't get this to work by aliasing gzip to pigz in my account or job submit file, but that may just be how our cluster and queue system are set up. It is very easy, though, to replace a few instances of the word gzip in dump.cpp with pigz, requiring exactly 20 letters of code changes:

[[email protected]:src]$ diff dump.cpp dump.cpp-orig
463,464c463,464
< char pigz[128];
< sprintf(pigz,"pigz -6 > %s",filecurrent);
---
> char gzip[128];
> sprintf(gzip,"gzip -6 > %s",filecurrent);
You are missing another possible source of performance degradation:
converting numbers to formatted text. That is quite CPU-time
consuming. If you create so much output that the time spent on
compression makes a significant impact, you might consider writing out
binary trajectory files instead. There are also options to write out
data in parallel.

axel.