Projects

About my projects

In this section I publish reports of my hobby projects. These are mostly programming-related, but I occasionally try to do something ex silico as well. I like to write programs that model and simulate physical phenomena or serve as solvers for various puzzles and problems. I do not want to rely too much on pre-existing tools; instead I prefer writing my own and learning from the mistakes I make. Like my interests, my project topics cover a broad range of scientific disciplines. I wish to demonstrate that I am capable of learning various new subjects independently and in detail, and better yet, that I am adept at applying such knowledge in practice.

High-performance phase field crystal (6.8.2016)

In this report I demonstrate my understanding of high-performance computing by applying it to efficient phase field crystal (PFC) calculations; see my previous report for a brief introduction to PFC. The daring can also check out our recently published paper[arXiv] on PFC modeling of graphene that gives a detailed, yet concise, treatment of PFC (before a lengthy analysis of four PFC variants). Here I point out the benefits and limitations set by the basic building blocks of modern computers: CPUs, GPUs, and memory hierarchies. I compare different implementations, for example serial and parallel (OpenMP, MPI, and CUDA) PFC codes. I also take a closer look at fast Fourier transforms (FFTs) that are needed for solving the equilibrium state of a PFC system both accurately and efficiently. I compare my own FFT routines with FFTW's.

PFC is computationally not very demanding and thereby allows studying large mesoscopic systems. Of the conventional atomistic techniques, molecular dynamics can also reach the length scales attainable with PFC, but it is limited to the time scales of atomic vibrations. In contrast, standard PFC dynamics neglects the jiggling of atoms and instead captures a time-averaged diffusive picture. This is very advantageous when pursuing the equilibrium state of complex material systems.

It seems that we—people dealing with computational sciences—are always able to find problems whose solutions require studying ever-larger systems. In such instances it is important to squeeze out all the computing power from the computers at hand to find the answers in reasonable time. To this end I benchmark different high-performance PFC codes written by me and determine which is best suited for systems of varying sizes. While the graphical PFC tool from my previous report is useful in certain situations, the codes I present in this report are true high-performance programs for serious scientific work.

Phase field crystal

In this report I will cover only the basic equations of the standard PFC model; see the links in the first paragraph for further details. PFC systems are described in terms of a density field ψ(r) that is governed by a free energy functional F[ψ(r)]. For the standard PFC model F reads

  F[ψ] = ∫ dr ( ψ/2 (ϵ + (1 + ∇²)²) ψ + τ ψ³/3 + ψ⁴/4 ),

where ϵ is related to temperature (ϵ < (>) 0 for a crystalline (liquid) state), the nonlocal interaction term (1 + ∇²)² gives rise to spatial oscillations and elastic behavior, the cubic term breaks the symmetry between positive and negative densities, and the quartic term stabilizes crystalline phases (it renders large-amplitude oscillations too expensive).
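
For later reference, the functional derivative of the above F, which is the quantity that drives the dynamics below, works out to

  δF/δψ = (ϵ + (1 + ∇²)²)ψ + τψ² + ψ³,

whose nonlinear part, N = τψ² + ψ³, is exactly what the numerical scheme below evaluates in real space on every iteration.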

Nature strives for the energy minimum—potential energy gives rise to a force pointing "downhill" in energy. Real-world systems can be modeled using PFC by starting with an approximate initial configuration and by minimizing the free energy of this PFC system. So far I have only needed to worry about the equilibrium state—not the dynamics of getting there—whereby I can assume nonconserved dissipative dynamics corresponding to steepest descent for the PFC systems under study. Nonconserved dynamics can be written as

  ∂ψ/∂t = −δF/δψ = −(ϵ + (1 + ∇²)²)ψ − τψ² − ψ³.

(For conserved dynamics the minus signs in the equation above need to be replaced with ∇².) Under nonconserved dynamics mass can (dis)appear locally, which equilibrates the system significantly faster than mass transport via diffusion. For a closed system this would violate mass conservation, but a two-dimensional (2D) open system can exchange atoms with a surrounding gaseous or liquid phase in this manner. The parameter τ fixes the otherwise nonconserved average density to a nonzero value, which is important because the equilibrium phase (liquid, striped, triangular, bcc, ...) given by the free energy functional depends on the average density.

A numerical solution to the equation above is given by the semi-implicit spectral method

  ψ̂(t + Δt) = (ψ̂(t) − Δt N̂(t)) / (1 + Δt (ϵ + (1 − k²)²)),

where the carets denote discrete Fourier transforms (DFTs), N = τψ² + ψ³ is the nonlinear part of the dynamics, k is the wave number, and Δt is the step size. This method has its pros (+) and cons (−): (+) it is very stable, and in k-space gradients reduce to algebraic expressions (∇, ∇², ... → ik, −k², ...) and can be evaluated with high accuracy; (−) two DFTs need to be computed on each iteration, and the spectral treatment results in periodic boundary conditions (PBCs). Using FFTs this method scales almost linearly as O(n log n) (naive DFTs: O(n²)), where n is the total number of grid points. PBCs can be a severe limitation in some situations, but I have gotten by with them so far. On the other hand, PBCs eliminate possible edge effects (boundaries cost extra energy compared to the bulk, etc.).
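
As a concrete illustration of the update formula, the sketch below shows in C how a single Fourier mode would be advanced and how the wave numbers of a periodic L-point axis are formed. This is a minimal illustration with made-up names, not code from the benchmarked programs; the actual codes precompute the whole denominator into an array (operator A of Algorithm 1 below).

  #include <complex.h>

  /* Wave number of grid index i on an L-point axis with spacing dx
     (indices past L/2 wrap around to negative frequencies). */
  static float wavenumber(int i, int L, float dx)
  {
      int j = (i <= L / 2) ? i : i - L;
      return 2.0f * 3.14159265f * (float)j / ((float)L * dx);
  }

  /* One semi-implicit step for a single mode: psi_hat and N_hat are the DFTs
     of psi and of the nonlinear term N = tau*psi^2 + psi^3 at this mode,
     k2 = kx*kx + ky*ky is the squared wave number. */
  static float complex step_mode(float complex psi_hat, float complex N_hat,
                                 float k2, float eps, float dt)
  {
      float lin = eps + (1.0f - k2) * (1.0f - k2);
      return (psi_hat - dt * N_hat) / (1.0f + dt * lin);
  }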

Implementations

All of the codes in this report are written in C and compiled with the optimization flag -O3. First, I compare my own serial FFT routines with FFTW's, and second, I compare serial (with my fastest FFT routines and with FFTW) and parallel (OpenMP, MPI, CUDA) PFC codes. I use 2D systems initialized with random noise as the benchmark for my different implementations. I let these systems evolve for between a single iteration step and 10 000 steps, depending on the problem size. To simplify the analysis I consider only square-shaped systems. Furthermore, all systems have dimensions of 2ⁿ grid points, where n is an integer. Powers of small primes—especially of two—should be fastest with FFTW's routines. My simple FFT routines are in any case limited to power-of-two sizes.

I use in-place FFTs only, as they should be fast (at least for FFTW with MPI) and moreover they minimize the amount of memory required, whereby larger systems can be treated. With in-place FFTs the spectral method can be implemented as follows:

  1. allocate arrays p, q, A, and B
  2. initialize p with the initial state of ψ
  3. compute A and B (the minus signs are absorbed in these operators), and divide B by the system size (the DFT routines are unnormalized)
  4. copy p to q
  5. transform p and descale it (divide by the system size)
  6. every hundred iterations or so: copy q to p and do 5. (avoids numerical instability by resetting the error accumulating in p)
  7. compute q = τ*q.*q + q.*q.*q (.* denotes element-wise operations)
  8. transform q
  9. compute p = A.*p + B.*q and copy the result to q
  10. inverse transform q
  11. go to 6. (unless converged or the maximum number of iterations is reached)
Algorithm 1. The algorithm for equilibrating PFC systems using the spectral method.

The iteration (6. - 11.) consists of two FFTs and two steps of element-wise floating-point operations. For FFTW plans I use the planner flag FFTW_MEASURE in all cases.
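
To make the loop structure concrete, here is a stripped-down sketch of how the serial iteration could look with FFTW. This is not the actual PFC-FFTW source: for brevity it uses complex-to-complex transforms (the real codes exploit real-data transforms, see below), it takes ready-made in-place plans for q as arguments (FFTW_MEASURE plans must be created before the data is initialized, since planning overwrites the arrays), and it leaves out convergence checks and the periodic resets of step 6.

  #include <complex.h>
  #include <fftw3.h>

  /* Steps 7.-10. of Algorithm 1 repeated for a fixed number of iterations.
     On entry p holds the descaled transform of psi and q holds psi in real
     space; A and B are the precomputed k-space operators of step 3 (B already
     divided by the system size n). Link with -lfftw3f -lm. */
  static void pfc_iterate(long n, int steps, float tau,
                          float complex *p, float complex *q,
                          const float *A, const float *B,
                          fftwf_plan fwd_q, fftwf_plan bwd_q)
  {
      for (int it = 0; it < steps; it++) {
          for (long i = 0; i < n; i++)          /* step 7: N = tau*psi^2 + psi^3 */
              q[i] = tau * q[i] * q[i] + q[i] * q[i] * q[i];
          fftwf_execute(fwd_q);                 /* step 8: q -> DFT of N */
          for (long i = 0; i < n; i++) {        /* step 9: semi-implicit update */
              p[i] = A[i] * p[i] + B[i] * q[i];
              q[i] = p[i];                      /* carry the new k-space field... */
          }
          fftwf_execute(bwd_q);                 /* step 10: ...back to real space */
      }
  }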

My own FFT routines do not support real-data transforms, but the FFTW-based PFC codes exploit them. If the input data for the forward transform is real—as both ψ and N indeed are—the output is conjugate symmetric. Exploiting conjugate symmetry cuts almost in half both the workload and the memory needed to store the output.
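
For reference, this is roughly how the in-place real-data plans are set up with FFTW (a sketch with names of my own choosing, not the actual code): for an L-by-L real array the forward transform produces only L × (L/2 + 1) complex values, and the in-place layout pads each row of the real array to 2 × (L/2 + 1) floats.

  #include <stddef.h>
  #include <fftw3.h>

  /* Create in-place real-to-complex and complex-to-real plans for an L-by-L
     grid; the same buffer holds the padded real field and its half-spectrum. */
  static float *make_real_plans(int L, fftwf_plan *fwd, fftwf_plan *bwd)
  {
      float *psi = fftwf_alloc_real((size_t)L * 2 * (L / 2 + 1));
      fftwf_complex *psi_hat = (fftwf_complex *)psi;
      *fwd = fftwf_plan_dft_r2c_2d(L, L, psi, psi_hat, FFTW_MEASURE);
      *bwd = fftwf_plan_dft_c2r_2d(L, L, psi_hat, psi, FFTW_MEASURE);
      return psi;
  }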

I use single precision in all of the codes to save memory. A quick secondary relaxation using double precision may be needed for more accurate results. Single-precision floating-point operations are often no faster than double-precision ones[link]. However, one also has to consider the memory and caches[another link]. In-place FFT algorithms do not reuse their intermediate results that many times before overwriting them, whereby memory access is likely to become the bottleneck. I did not investigate this, but double precision could give worse performance by spilling the caches more (recent values are stored in small caches from where they can be reaccessed much faster).

I ran all the calculations on a desktop PC with a quad-core Intel CPU (eight logical cores with hyper-threading), 16 GB of RAM, and an Nvidia GPU. I ran all the OpenMP (MPI) PFC calculations with eight threads (processes). Varying the number of threads (processes) would have introduced too many degrees of freedom.

FFT routines

Naive FFT — I wrote my first set of "naive" FFT routines based on Steven Smith's examples. He gives clear diagrams of a two-stage in-place FFT algorithm with time domain decomposition and frequency domain synthesis steps. The decomposition step comprises a bit reversal sorting of the input data, and the synthesis comprises log₂ n cycles of so-called butterfly operations on the data. This diagram illustrates beautifully the flow of information during the synthesis, and indicates that the synthesis undoes the bit reversal. Two-dimensional DFTs can be computed simply by FFT'ing the data row by row, then column by column, or vice versa.
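
In code, the synthesis in my naive routines boils down to the textbook stage-by-stage loop sketched below (a simplified 1D illustration, not the actual routines): after the bit reversal sort, log₂ n stages of butterflies are applied in breadth-first order.

  #include <complex.h>
  #include <stddef.h>

  /* Breadth-first radix-2 synthesis of a length-n transform (n a power of
     two), applied to data that has already been bit reversal sorted. */
  static void fft_synthesis(float complex *x, size_t n)
  {
      for (size_t len = 2; len <= n; len *= 2) {      /* log2(n) stages */
          float complex wlen = cexpf(-2.0f * 3.14159265f * I / (float)len);
          for (size_t i = 0; i < n; i += len) {       /* butterfly groups */
              float complex w = 1.0f;
              for (size_t k = 0; k < len / 2; k++) {  /* individual butterflies */
                  float complex t = w * x[i + k + len / 2];
                  x[i + k + len / 2] = x[i + k] - t;
                  x[i + k]          += t;
                  w *= wlen;
              }
          }
      }
  }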

The code can be found here.

Cache-oblivious FFT — In Smith's examples and in my naive routines the data is traversed in breadth-first order during the frequency domain synthesis. For large enough transforms this means that the intermediate results from previous butterfly calculations spill the caches and have to be fetched from higher-level cache or global memory for the next stage of butterfly calculations. The following diagram (Figure 1) demonstrates this for a 1D transform of 16 values on a simplified fictitious computer. Assume that the computer has two levels of cache dedicated entirely for FFT data: L1 can hold four complex values and L2 stores eight; the rest are spilled to global memory.

Figure 1. Intermediate stages of naive frequency domain synthesis (open in a new tab to enlarge). The arrows indicate which values are needed for the calculation of each new value. Black (gray) arrows indicate butterflies that have already (not yet) been calculated. The green, yellow and red dots indicate whether the intermediate result is found in L1, L2, or global memory, respectively.

The top-right subfigure in Figure 1 reveals that this algorithm fails to exploit the caches effectively—the initial values have to be looked up in global memory, which is costly. An alternative is to calculate the butterflies recursively in a depth-first order. Figure 2 illustrates such an algorithm on the same fictitious computer.

Figure 2. Intermediate stages of "cache-oblivious" frequency domain synthesis.

It is obvious that this approach utilizes the caches much better. And because it splits the problem recursively in half, it automatically produces subproblems that fit better in the different levels of cache. There is no need to hardcode cache-optimized handling of subproblems—hence the term "cache oblivious". Instead of applying this algorithm row by row and column by column to compute 2D transforms, I extended the recursion to two dimensions: the original square-shaped problem is split into four equal-sized subproblems that are split further, and so on, until the base case of 2-by-2 butterflies is reached. Here each value is needed to calculate not just two but four others, which allows much more efficient recycling of intermediate results.
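
In 1D the depth-first idea reduces to a short recursion like the one below (a simplified sketch of the principle, not my actual 2D cache-oblivious code):

  #include <complex.h>
  #include <stddef.h>

  /* Depth-first radix-2 synthesis on bit reversal sorted data of length n
     (a power of two): each half is finished completely before the halves are
     combined, so every subproblem eventually fits in some level of cache. */
  static void fft_synthesis_df(float complex *x, size_t n)
  {
      if (n < 2)
          return;
      fft_synthesis_df(x, n / 2);            /* transform of even-indexed samples */
      fft_synthesis_df(x + n / 2, n / 2);    /* transform of odd-indexed samples  */
      for (size_t k = 0; k < n / 2; k++) {   /* combine the two half-spectra      */
          float complex w = cexpf(-2.0f * 3.14159265f * I * (float)k / (float)n);
          float complex t = w * x[k + n / 2];
          x[k + n / 2] = x[k] - t;
          x[k]        += t;
      }
  }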

I also replaced the bit reversal sort function with one that uses a precomputed lookup table to shuffle the data. N 1D tables suffice for N-dimensional data (or just one if the data is square-, cube-, ..., shaped). A lookup table map of length L can be formed as follows:

  int map[L];   /* L is the transform length, a power of two */
  int n, m;
  map[0] = 0;
  for (m = 1; m < L; m *= 2) {
      for (n = 0; n < m; n++) {
          map[n] *= 2;
          map[m+n] = map[n] + 1;
      }
  }
Algorithm 2. The algorithm for generating a bit reversal sort lookup table.
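
Once the table exists, the shuffle itself is just a matter of swapping pairs; since bit reversal is its own inverse, each pair needs to be swapped only once. For a single row of complex data it could look like this (illustrative sketch, not the actual code):

  #include <complex.h>

  /* Shuffle a row of length L into bit reversed order using the lookup table
     from Algorithm 2; map[map[i]] == i, so each pair is swapped exactly once. */
  static void bit_reverse_row(float complex *row, const int *map, int L)
  {
      for (int i = 0; i < L; i++) {
          if (map[i] > i) {
              float complex tmp = row[i];
              row[i]      = row[map[i]];
              row[map[i]] = tmp;
          }
      }
  }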

This approach seems to give a modest performance improvement, but because the sorting scales only linearly, a negligible share of the total O(n log n) work, I did not benchmark it in detail.

The code can be found here.

PFC codes

PFC-CO — A serial PFC code that computes the FFTs using my—spoiler alert!—faster cache-oblivious routines. Although much slower than FFTW's routines, the reader can use this version without installing FFTW (or without an Nvidia GPU, CUDA, and cuFFT).

The code can be found here.

PFC-FFTW — A serial PFC code exploiting FFTW's routines. Basically for readers that want to modify the code but are not that familiar with parallel programming.

The code can be found here.

PFC-FFTW-OMP — A PFC code exploiting FFTW's routines and parallelized with OpenMP. Both the FFTs and other functions that are called repeatedly during the iteration are multithreaded.
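
For readers replicating the setup: enabling FFTW's threading takes two calls before any plans are created, and the element-wise steps parallelize with plain OpenMP pragmas. A hedged sketch with names of my own choosing (the actual code may organize this differently); link with -lfftw3f_threads -lfftw3f and compile with -fopenmp.

  #include <complex.h>
  #include <fftw3.h>

  /* Let FFTW split its transforms over several threads; must be called
     before creating any plans. */
  static void enable_fftw_threads(int nthreads)
  {
      fftwf_init_threads();
      fftwf_plan_with_nthreads(nthreads);   /* the benchmarks here use 8 */
  }

  /* Step 7 of Algorithm 1 (N = tau*psi^2 + psi^3) multithreaded with OpenMP. */
  static void nonlinear_step(float complex *q, long n, float tau)
  {
      #pragma omp parallel for
      for (long i = 0; i < n; i++)
          q[i] = tau * q[i] * q[i] + q[i] * q[i] * q[i];
  }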

The code can be found here.

PFC-FFTW-MPI — A PFC code exploiting FFTW's routines and parallelized with MPI. Using MPI the computation is carried out in processes that have separate memory. Multiple processes working on a single transform must therefore communicate with each other to be able to accomplish the task. In practice the processes work on their local data, come together to redistribute, or transpose, the global data amongst themselves, again work on their new local data, and again transpose to restore the initial ordering of the data. Transposing is expensive, but fortunately FFTW offers the possibility to leave out the final transposition whereby the data is left in transposed order. A forward transform results in transposed data in k-space, and a subsequent inverse transform restores the order in direct space. Hence the communication effort is cut in half at the cost of swapping the coordinates in k-space. This code can be run distributed over several compute nodes on a supercomputer.
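
In FFTW terms the transposed-order trick amounts to a pair of planner flags. Below is a hedged sketch of the plan creation (complex-to-complex for brevity, names of my own choosing); it assumes MPI_Init and fftwf_mpi_init have already been called, and the operators A and B then of course have to be laid out in the transposed (swapped-coordinate) k-space order as well.

  #include <stddef.h>
  #include <complex.h>
  #include <fftw3-mpi.h>

  /* Create in-place FFTW-MPI plans for an L-by-L grid that leave the k-space
     data transposed, halving the communication. Each process gets a slab of
     local_n0 rows starting at row local_0_start. */
  static fftwf_complex *make_mpi_plans(ptrdiff_t L, fftwf_plan *fwd, fftwf_plan *bwd)
  {
      ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;
      ptrdiff_t alloc = fftwf_mpi_local_size_2d_transposed(
          L, L, MPI_COMM_WORLD,
          &local_n0, &local_0_start, &local_n1, &local_1_start);
      fftwf_complex *data = fftwf_alloc_complex(alloc);

      /* The forward transform leaves the data transposed in k-space... */
      *fwd = fftwf_mpi_plan_dft_2d(L, L, data, data, MPI_COMM_WORLD, FFTW_FORWARD,
                                   FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
      /* ...and the inverse transform accepts transposed input, restoring the
         original ordering in direct space. */
      *bwd = fftwf_mpi_plan_dft_2d(L, L, data, data, MPI_COMM_WORLD, FFTW_BACKWARD,
                                   FFTW_MEASURE | FFTW_MPI_TRANSPOSED_IN);
      return data;
  }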

The code can be found here.

PFC-cuFFT — A PFC code exploiting CUDA and cuFFT on Nvidia GPUs—multiple GPUs are not supported (I'm still a CUDA novice and want to keep things simple). Not just the FFTs, but the entire iteration is performed on the GPU side. Real-data transforms are supported by cuFFT and the code exploits them.
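
For completeness, the host-side setup of the real-data transforms with cuFFT is compact (a hedged sketch only; the kernels that run steps 7 and 9 on the GPU, and all error checking, are omitted, and the names are mine):

  #include <stddef.h>
  #include <cuda_runtime.h>
  #include <cufft.h>

  /* Allocate the padded in-place buffer on the GPU and plan the real-data
     transforms for an L-by-L grid; they are later run with cufftExecR2C and
     cufftExecC2R on the same buffer. */
  static cufftReal *make_gpu_plans(int L, cufftHandle *fwd, cufftHandle *bwd)
  {
      cufftReal *d_psi;
      cudaMalloc((void **)&d_psi, sizeof(cufftComplex) * (size_t)L * (L / 2 + 1));
      cufftPlan2d(fwd, L, L, CUFFT_R2C);
      cufftPlan2d(bwd, L, L, CUFFT_C2R);
      return d_psi;
  }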

The code can be found here.

Results

FFT routines

Figure 3 gives the execution times of 2D FFTs computed using my naive and cache-oblivious codes, and using FFTW's routines. The curves give the average over forward and inverse transforms, which in all cases performed very similarly. The execution times are given per transform and normalized by the FFT size, i.e., by the total number of grid points n.

Figure 3. FFT execution time per single grid point as a function of the transform size for my naive and cache-oblivious (c.-o.) FFTs, and for FFTW's routines. The values are averages of forward and inverse transforms (negligible differences).

The cache-oblivious routines outperform the naive ones, and for larger problem sizes they offer a speed-up of up to a factor of 3.7. Nevertheless, FFTW wipes the floor with my routines, with execution times an order of magnitude faster. The naive FFTs display a step in execution time at intermediate problem sizes. This is probably due to cache spilling, as the cache-oblivious FFTs demonstrate no such behavior.

PFC codes

Figure 4 gives the total execution time for a single iteration (6. - 11.) of Algorithm 1 computed using the different PFC codes.

Figure 4. Total execution time for a single iteration with different PFC codes.

As expected, PFC-CO gives the worst performance. PFC-FFTW executes a steady order of magnitude faster, and the parallel codes offer varying amounts of further improvement. Recall that the OpenMP and MPI codes are parallelized into eight threads and processes, respectively. For linear system sizes L = 256, 512, 1024 the parallel codes give a reasonable boost of about half an order of magnitude. In general one cannot expect a linear performance improvement as a function of the number of CPU cores utilized, unless the problem is embarrassingly parallel and its execution is CPU bound rather than limited by cache and memory. For systems of linear size L ≥ 2048 the performance gain given by the parallel CPU codes diminishes: for PFC-FFTW-MPI immediately and for PFC-FFTW-OMP after L = 8192. This behavior is probably linked to the caches and memory becoming more of a bottleneck. At L = 16 384 the GPU runs out of memory, and at L = 32 768 so does the CPU when running PFC-CO, which works with full-complex data. PFC systems with L = 32 768 can comprise a few tens of millions of atoms depending on the spatial discretization and lattice structure chosen.

Figure 5 gives the execution time for the forward transform in the different PFC codes.

Figure 5. Execution time of the forward transform in different PFC codes.

Figure 6 gives the execution time for the inverse transform in the different PFC codes.

Figure 6. Execution time of the inverse transform in different PFC codes.

Figures 5 and 6 appear very similar to Figure 4, which is no surprise, as the execution times depicted in the former two sum up to nearly the total shown in the latter.

Figure 7 gives the execution time for the "nonlinear" step (7.) of Algorithm 1 in the different PFC codes.

Figure 7. Execution time of the nonlinear step in different PFC codes.

PFC-CO is a little slower than the other codes because it works with complex values, which take twice the memory and are more laborious to multiply. PFC-FFTW-MPI gives a significant speedup until L = 2048, where the execution clearly becomes memory bound. Compared to the previous figures, PFC-cuFFT shows weak performance. The way in which I divided the task into blocks—one block with a single warp for each row of data—is not optimal, but the FFTs make up two thirds of the execution time, so neither step 7. nor step 9. is very critical.

Figure 8 gives the execution time for the "update" step (9.) of Algorithm 1 in the different PFC codes.

Figure 8. Execution time for the update step in different PFC codes.

In Figure 8 PFC-CO, working with full-complex data, is again somewhat slower than PFC-FFTW. The complex addition and multiplication operations apparently give the CPU enough work to keep the memory from becoming a complete bottleneck, as both parallel CPU codes execute a little faster than PFC-FFTW. The CUDA code also beats the serial CPU code.

Summary and conclusions

In this report I studied different implementations for high-performance PFC calculations, namely different FFT routines (my own, FFTW's, and cuFFT's) and different types of parallelization (OpenMP, MPI, and CUDA). Based on my understanding of the memory hierarchy in computers I was able to improve the performance of my own FFT routines by up to a factor of almost four. I also demonstrated parallel PFC codes that give a significant speedup over a serial code exploiting FFTW's routines.

Depending on the problem size and on the memory and GPU available, the OpenMP, the MPI, or the CUDA code is the best option. I benchmarked the codes on a single desktop PC only, and different results should be expected on different hardware. Nevertheless, here are some general recommendations: If the reader does not want to (or cannot) install any libraries, use PFC-CO. PFC-FFTW is not much slower than the parallel codes and is the most efficient choice if there is enough memory to run a separate problem on each CPU core. PFC-FFTW-OMP is fast and easily modifiable—just leave out the pragmas to write serial code in between the multithreaded parts. PFC-FFTW-MPI is a bit more complicated with its distributed-memory handling and transposed data in k-space, but if you have truly huge systems to study and a supercomputer lying around, it is the way to go. With a better GPU, PFC-cuFFT could easily be the best piece of code for more modest system sizes.

Regarding further improvements, it is not realistic to assume that I could match the performance of FFTW's routines, so I have to stick with FFTW and seek performance gains in the nonlinear (7.) and update (9.) steps of Algorithm 1, or in its higher-level structure. However, the algorithm is already so minimalistic that I have trouble imagining further simplifications. The memory bottlenecks in steps 7. and 9. could possibly be alleviated by absorbing them into the FFT steps, but that would again mean competing with FFTW's performance. In the end the FFTs make up most of the execution time in all cases, so little further improvement is achievable. The biggest bottleneck is on the hardware side—a task-specific FFT chip could boost the transforms.

But it is not that bad: Using the approach outlined in our graphene article, the largest problem size I studied here, linear size L = 32 768 grid points, can correspond to almost 900 nm. A square-micron sheet of monolayer graphene has roughly 38 million carbon atoms. Using the optimization algorithm we mention in our paper, the typical number of iterations needed to relax a PFC system is perhaps on the order of 10⁴. Based on Figure 4 it takes some days for a desktop PC to crunch micron-sized systems of graphene—not bad, right? And if that is not enough, there are still amplitude expansions.
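
As a quick sanity check of that atom count: graphene's unit cell contains two carbon atoms in an area of (√3/2)a² with a ≈ 0.246 nm, i.e., about 0.052 nm², so

  2 / 0.052 nm² ≈ 38 atoms/nm² ≈ 3.8 × 10⁷ atoms/µm²,

which matches the figure quoted above.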
