Thank you very much for your patient explanation.
My simulation now runs to completion after I reduced the number of MPI processes.
By the way, I am trying to speed up the simulation based on the following formula: number of CPU cores = number of MPI processes P * number of OpenMP threads. I have tried to use all available CPU cores in a node of the supercomputer by keeping the number of MPI processes small enough and using OpenMP at the same time. However, it does not seem very effective. As mentioned in many discussions, this seems difficult because MPI and OpenMP use different memory models (distributed versus shared memory).
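To make the formula concrete, here is a minimal sketch in Python that lists every combination of MPI processes and OpenMP threads that fills one node exactly. The function name `hybrid_layouts` and the core count of 16 are only assumptions for illustration, not the actual configuration of my node.

```python
def hybrid_layouts(cores_per_node):
    """Return every (mpi_processes, omp_threads) pair whose product
    equals cores_per_node, i.e. layouts that use the whole node."""
    return [
        (p, cores_per_node // p)
        for p in range(1, cores_per_node + 1)
        if cores_per_node % p == 0  # keep only exact divisors
    ]


if __name__ == "__main__":
    # 16 is a hypothetical core count, chosen only for illustration.
    for p, t in hybrid_layouts(16):
        print(f"{p} MPI processes x {t} OpenMP threads")
```

Each pair it prints is one candidate layout that I could benchmark against the others.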
So, is it an inherent bottleneck of parallel computing that I cannot use all of my computational resources for a single simulation?
Or, if there is an effective setup for the hybrid MPI-OpenMP method, where should a beginner like me start?
Also, are there any other methods that I should consider? I am working on Windows, so I think the GPU package is not provided.
It would be great if you could give me some advice.