
## High-performance phase field crystal (6.8.2016)
In this report I demonstrate my understanding of high-performance computing by applying it to efficient phase field crystal (PFC) calculations; see my previous report for a brief introduction to PFC. The daring can also check out our recently published paper.

PFC is computationally not very demanding and thereby allows studying large mesoscopic systems. Of the conventional atomistic techniques, molecular dynamics can also reach the length scales attainable to PFC, but it is limited to the time scales of atomic vibrations. In contrast, standard PFC dynamics neglects the jiggling of atoms and instead captures a time-averaged, diffusive picture. This is very advantageous when pursuing the equilibrium state of complex material systems.

It seems that we, people dealing with computational sciences, are always able to find problems whose solutions require studying ever-larger systems. In such instances it is important to squeeze all the computing power out of the computers at hand to find the answers in reasonable time. To this end I benchmark different high-performance PFC codes written by me and determine which is best suited for systems of varying sizes. While the graphical PFC tool from my previous report is useful in certain situations, the codes I present in this report are true high-performance programs for serious scientific work.

## Phase field crystal
In this report I will cover only the basic equations of the standard PFC model; see the links in the first paragraph for further details. PFC systems are described in terms of a density field ψ(r) whose energetics are given by a free energy functional F[ψ(r)]. For the standard PFC model F reads

F[ψ] = ∫ dr ( ψ[ϵ + (1 + ∇²)²]ψ/2 + τψ³/3 + ψ⁴/4 ),

where ϵ is related to temperature (ϵ < 0 for a crystalline state, ϵ > 0 for a liquid one), the nonlocal interaction term (1 + ∇²)² gives rise to spatial oscillations and elastic behavior, the cubic term breaks the symmetry between positive and negative densities, and the quartic term stabilizes crystalline phases (it renders large-amplitude oscillations too expensive).
Nature strives for the energy minimum; potential energy gives rise to a force pointing "downhill" in energy. Real-world systems can be modeled using PFC by starting with an approximate initial configuration and by minimizing the free energy of this PFC system. So far I have only needed to worry about the equilibrium state, not the dynamics of getting there, whereby I can assume nonconserved dissipative dynamics corresponding to steepest descent for the PFC systems under study. Nonconserved dynamics can be written as

∂ψ/∂t = −δF/δψ = −[ϵ + (1 + ∇²)²]ψ − τψ² − ψ³,

where t is time.
τ fixes the otherwise nonconserved average density to a nonzero value, which is important as the equilibrium phase (liquid, striped, triangular, bcc, ...) given by the free energy functional depends on the average density.
A numerical solution to the equation above is given by the semi-implicit spectral method

ψ̂(k, t + Δt) = [ψ̂(k, t) − Δt F{τψ² + ψ³}] / [1 + Δt(ϵ + (1 − k²)²)],

where ψ̂ = F{ψ} is the discrete Fourier transform (DFT) of the density field and k is the wave vector; the linear part is treated implicitly and the nonlinear part explicitly. The pros and cons of the spectral approach: (+) derivatives are diagonal in k-space (ik, −k², ...) and can be evaluated with high accuracy; (−) two DFTs need to be computed on each iteration, and the spectral treatment results in periodic boundary conditions (PBCs). Using FFTs this method scales almost linearly, as O(n log n) (naive DFTs: O(n²)), where n is the total number of grid points. PBCs can be a severe limitation in some situations, but I have gotten by with them so far. On the other hand, PBCs eliminate possible edge effects (boundaries cost extra energy compared to the bulk, etc.).
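The operators `A` and `B` appearing in the algorithm below can be precomputed from this update rule. Here is a minimal sketch along one axis, assuming a time step `dt` and grid spacing `dx` (the function and variable names are mine, not from the actual codes; in the 2D codes k² is the sum kx² + ky²):

```c
#include <assert.h>
#include <math.h>

/* Precompute the semi-implicit operators along one axis:
   A = 1/(1 + dt*(eps + (1 - k^2)^2)), B = -dt/(1 + dt*(eps + (1 - k^2)^2)).
   The minus sign of the update is absorbed into B. */
void compute_operators(float *A, float *B, int L, float dx,
                       float dt, float eps) {
    const float PI = 3.14159265358979f;
    for (int i = 0; i < L; i++) {
        int j = (i <= L/2) ? i : i - L;          /* wrap for PBCs */
        float k   = 2.0f*PI*j/(L*dx);            /* wave number   */
        float k2  = k*k;
        float lin = eps + (1.0f - k2)*(1.0f - k2);
        A[i] = 1.0f/(1.0f + dt*lin);
        B[i] = -dt/(1.0f + dt*lin);
    }
}
```

Since `A` and `B` depend only on k, they are computed once before the iteration starts.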
## Implementations
All of the codes in this report are written in C and are compiled with compiler optimizations enabled. I consider square systems of linear size L = 2ⁿ, where n is an integer: powers of small primes, especially of two, should be fastest with FFTW's routines, and my simple FFT routines are in any case limited to power-of-two sizes.
I use in-place FFTs only, as they should be fast (at least for FFTW with MPI) and moreover they minimize the amount of memory required, whereby larger systems can be treated. With in-place FFTs the spectral method can be implemented as follows:
1. allocate arrays `p`, `q`, `A`, and `B`
2. initialize `p` with the initial state of `ψ`
3. compute `A` and `B` (the minus signs are absorbed in these operators), and divide `B` by the system size*
4. copy `p` to `q`
5. transform `p` and descale it*
6. every hundred iterations or so: copy `q` to `p` and do 5. (avoids numerical instability by resetting the error accumulating in `p`)
7. compute `q = τ*q.*q + q.*q.*q` (`.` denotes element-wise operations)**
8. transform `q`
9. compute `p = A.*p + B.*q`**
10. inverse transform `q`
11. go to 6. (unless converged or the maximum number of iterations is reached)
The iteration (6. - 11.) consists of two FFTs and two steps of element-wise floating-point operations. For FFTW plans I use the planner flag
My own FFT routines do not support real-data transforms, but the FFTW-based PFC codes exploit them. If the input data of the forward transform is real, as both `p` and `q` are in real space, almost half of the complex output is redundant and need not be computed or stored, which roughly halves both the work and the memory footprint.
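To gauge the saving, here is a quick sketch of the single-precision memory footprints per array, following FFTW's documented in-place real-to-complex layout (each row of L reals is padded to 2·(L/2 + 1) floats):

```c
#include <assert.h>

/* Bytes for an L-by-L array of single-precision complex values
   (2 floats of 4 bytes each per grid point). */
long complex_bytes(long L)     { return L*L*2*4; }

/* Bytes for the same array stored for an in-place real-to-complex
   transform: L rows, each padded to 2*(L/2 + 1) floats. */
long r2c_inplace_bytes(long L) { return L*2*(L/2 + 1)*4; }
```

For L = 2048 the full-complex array takes 32 MiB while the padded real array takes just over 16 MiB, so the real-data layout nearly halves the memory per array.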
I use single precision in all of the codes to save memory. A quick secondary relaxation in double precision may be needed for more accurate results. Single-precision floating-point operations are rarely slower than double-precision ones, and halving the size of the data also halves the memory traffic.
## FFT routines
My baseline routines implement the standard iterative radix-2 Cooley–Tukey FFT: after a bit-reversal sort of the input, the butterflies are computed level by level in a breadth-first order. Figure 1 illustrates this on a fictitious computer whose cache fits only a fraction of the data.
The code can be found here.
The top-right subfigure in Figure 1 reveals that this algorithm fails to exploit the caches completely—the initial values have to be looked up in the global memory, which is costly. An alternative is to calculate the butterflies recursively in a depth-first order. Figure 2 illustrates such an algorithm on the same fictitious computer.
It is obvious that this approach utilizes caches much better. And because it splits the problem recursively in half, it automatically produces subproblems that fit better in the different levels of cache. There is no need to hardcode cache-optimized handling of subproblems—hence the term "cache oblivious". Instead of applying this algorithm row-by-row, column-by-column to compute 2D transforms, I extended the recursion to two dimensions: the original square-shaped problem is split into four equal-sized subproblems that are further split and so on until the base case of 2-by-2 butterflies is encountered. Here each value is needed to calculate not just two but four others which allows a much more efficient recycling of intermediate results.
I also replaced the bit reversal sort function with one that uses a precomputed lookup table to shuffle the data:

```c
map[0] = 0;
int n, m;
for(m = 1; m < L; m *= 2) {
    for(n = 0; n < m; n++) {
        map[n] *= 2;
        map[m+n] = map[n]+1;
    }
}
```
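Wrapped in a function, the table construction can be checked against the expected bit-reversed order; for L = 8, for instance, it yields {0, 4, 2, 6, 1, 5, 3, 7}:

```c
#include <assert.h>

/* Build the bit-reversal lookup table for a power-of-two length L:
   each doubling pass appends the odd counterparts of the entries
   built so far. */
void build_map(int *map, int L) {
    int n, m;
    map[0] = 0;
    for (m = 1; m < L; m *= 2)
        for (n = 0; n < m; n++) {
            map[n] *= 2;
            map[m+n] = map[n] + 1;
        }
}
```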
This approach seems to give a modest performance improvement, but because the sort scales only linearly in the number of points, i.e., its cost is negligible, I did not benchmark it in detail. The code can be found here.

## PFC codes
I benchmark five PFC codes: PFC-CO, a serial code using my cache-oblivious FFT routines; PFC-FFTW, a serial code using FFTW; PFC-FFTW-OMP, a multithreaded variant using OpenMP; PFC-FFTW-MPI, a distributed-memory variant using MPI; and PFC-cuFFT, a GPU variant using CUDA and cuFFT. The codes can be found here.

## Results

## FFT routines
Figure 3 gives the execution times of 2D FFTs computed using my naive and cache-oblivious codes, and using FFTW's routines. The curves give the average over forward and inverse transforms, which displayed very similar performance in all cases. The execution times are normalized per transform and per grid point.
The cache-oblivious routines outperform the naive ones, and for larger problem sizes they offer a speedup of up to a factor of 3.7. Nevertheless, FFTW wipes the floor with my routines, executing an order of magnitude faster. The naive FFTs display a step in execution time at intermediate problem sizes; this is probably due to cache spilling, as the cache-oblivious FFTs demonstrate no such behavior.
## PFC codes
Figure 4 gives the total execution time for a single iteration (6. - 11.) of Algorithm 1 computed using the different PFC codes.
As expected, PFC-CO gives the worst performance. PFC-FFTW executes a steady order of magnitude faster, and the parallel codes offer varying amounts of further improvement. Recall that the OpenMP and MPI codes are parallelized over eight threads and eight processes, respectively.
Figure 5 gives the execution time for the forward transform in the different PFC codes.
Figure 6 gives the execution time for the inverse transform in the different PFC codes.
Figures 5 and 6 appear very similar to Figure 4; naturally, the execution times depicted in the former two sum up close to the totals shown in the latter.
Figure 7 gives the execution time for the "nonlinear" step (7.) of Algorithm 1 in the different PFC codes.
This element-wise step clearly becomes memory bound at the largest sizes, around L = 2048. Compared to the previous figures PFC-cuFFT shows weak performance here. The way in which I divided the task into blocks, one block with a single warp per each row of data, is not optimal, but since the FFTs comprise two thirds of the execution time, neither step 7. nor step 9. is very critical.
Figure 8 gives the execution time for the "update" step (9.) of Algorithm 1 in the different PFC codes.
In Figure 8, PFC-CO, working with full-complex data, is again somewhat slower than PFC-FFTW. The complex addition and multiplication operations are apparently enough workload for the CPU to keep the memory from becoming a complete bottleneck, as both parallel CPU codes execute a little faster than PFC-FFTW. The CUDA code also beats the serial CPU code.

## Summary and conclusions

In this report I studied different implementations for high-performance PFC calculations, namely different FFT routines (my own, FFTW, and cuFFT) and different types of parallelization (OpenMP, MPI, and CUDA). Based on my understanding of the memory hierarchy in computers I was able to improve the performance of my own FFT routines by up to almost a factor of four. I also demonstrated parallel PFC codes that give significant speedups over a serial code exploiting FFTW's routines. Depending on the problem size and on the memory and GPU available, the OpenMP, the MPI, or the CUDA code is the best option.

I benchmarked the codes on a single desktop PC only, and different hardware should be expected to give different results. Nevertheless, here are some general recommendations:

- If you do not want to (or cannot) install any libraries, use PFC-CO.
- PFC-FFTW is not much slower than the parallel codes and is the most efficient choice if there is enough memory to run a separate problem on each CPU core.
- PFC-FFTW-OMP is fast and easily modifiable: just leave out the pragmas to write serial code in between the multithreaded parts.
- PFC-FFTW-MPI is a bit more complicated with its distributed memory handling and transposed data in k-space, but if you have truly huge systems to study and a supercomputer lying around, it is the way to go.
- With a better GPU, PFC-cuFFT could easily be the best option for more modest system sizes.
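To illustrate the "leave out the pragmas" point, this is how an element-wise loop looks in the OpenMP code (a sketch with illustrative names; compiled without OpenMP support, the pragma is simply ignored and the same loop runs serially):

```c
#include <assert.h>

/* Step 7. multithreaded with OpenMP: the iterations are independent,
   so the loop can be split across threads with a single pragma. */
void nonlinear_step_omp(float *q, float tau, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        q[i] = tau*q[i]*q[i] + q[i]*q[i]*q[i];
}
```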
Regarding further improvements, it is not realistic to assume that I could match the performance of FFTW's routines, so I have to stick with FFTW and seek performance gains in the nonlinear (7.) and update (9.) steps of Algorithm 1, or in its higher-level structure. However, the algorithm is already so minimalistic that I have trouble imagining further simplifications. The memory bottlenecks in steps 7. and 9. could possibly be alleviated by absorbing them into the FFT steps, but that would again mean competing with FFTW's performance. In the end the FFTs comprise most of the execution time in all cases, so little further improvement is achievable in software. The biggest bottleneck is on the hardware side: a task-specific FFT chip could boost the transforms.
But it is not that bad: using the approach outlined in our graphene article, the largest problem size I studied here, linear size L = 2048, already goes a long way.