This is the API documentation for milupHPC - a parallel smoothed particle hydrodynamics (SPH) code targeting GPU clusters via CUDA-aware MPI.
About
Smoothed Particle Hydrodynamics (SPH) is a well-suited method for solving the hydrodynamic equations and, in combination with a method for self-gravity, is applicable to astrophysical simulations. Since such simulations are computationally expensive, appropriate methods and technologies need to be employed, parallelization being the most promising approach.
Thus, this work describes and implements a multi-GPU Smoothed Particle Hydrodynamics code with self-gravity via the Barnes-Hut method, using CUDA-aware MPI and targeting GPU clusters. The Barnes-Hut method is an approximate N-body tree method that subdivides the simulation domain hierarchically and approximates the potential of a distant group of particles by a multipole expansion, which reduces the overall complexity of the gravitational force computation. While parallelization for shared-memory systems is relatively straightforward, many challenges arise for distributed-memory parallelization.
The approach presented here is based on parallelizing the tree constructed for the Barnes-Hut method. Every process' local tree contains the particles assigned to that process and, in addition, nodes that are identical on all processes and form a common coarse tree. This shared part of the tree enables each process to gain knowledge about the local trees on the other processes. Moreover, the tree constructed in this way permits the introduction of space-filling curves, which in turn are the basis for an efficient domain decomposition and load balancing. In addition, the parallel tree facilitates determining the communication patterns necessary for calculating the forces arising from the Barnes-Hut method and SPH; particle exchange via message passing is required since, on a distributed-memory architecture, not all particles are accessible on all processes. Besides, this work investigates efficient GPU implementations of SPH and the Barnes-Hut algorithm, including the gravitational force computation and the fixed-radius near-neighbor search.
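For orientation: in the commonly used geometric variant of the Barnes-Hut method, a tree node of edge length s, seen from a particle at distance d to the node's center of mass, is treated as a single pseudo-particle if s/d < θ; its contribution is then approximated, to lowest order, by the monopole term a_i ≈ -G M (r_i - R) / |r_i - R|³ with node mass M and center of mass R. The opening angle θ corresponds to the theta parameter in the config file described below.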
High Performance Computing Smooth(ed) Particle Hydrodynamics
The successor of miluphcuda, targeting GPU clusters via CUDA-aware MPI.
This repository implements a multi-GPU SPH and N-body (via Barnes-Hut) algorithm using C++11 and CUDA-aware MPI by combining already proven parallelization strategies and available implementations:
- single-GPU version inspired by/adapted from:
- multi-node (or rather multi-CPU) version inspired by/adapted from:
  - M. Griebel, S. Knapek, and G. Zumbusch. Numerical Simulation in Molecular Dynamics: Numerics, Algorithms, Parallelization, Applications. 1st ed. Springer Publishing Company, Incorporated, 2010. ISBN: 3642087760
with new ideas and parallelization strategies.
- for a versatile single-GPU implementation refer to miluphcuda
- for a multi-CPU implementation (self-gravity via Barnes-Hut only) refer to jammartin/ParaLoBstar
Parallelization/Implementation
- Parallelization embraces both
  - multi-node parallelization via message passing and
  - single-node parallelization for **GPU**s via CUDA,
implemented using C++ and CUDA-aware MPI. A minimal sketch of this hybrid pattern follows.
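The sketch below is illustrative only, assuming one GPU per MPI rank; the kernel and buffer names are placeholders and not taken from the code base:

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// hypothetical kernel standing in for any of the tree/SPH kernels
__global__ void rhsKernel(double *acc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { acc[i] = 0.; /* ... compute right-hand side contribution ... */ }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, numProcesses;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);

    // single-node parallelization: bind each MPI process to one GPU ...
    int numDevices;
    cudaGetDeviceCount(&numDevices);
    cudaSetDevice(rank % numDevices);

    const int n = 1000;                          // local particle count (placeholder)
    double *d_acc;
    cudaMalloc(&d_acc, n * sizeof(double));

    // ... and launch CUDA kernels on the locally assigned particles
    rhsKernel<<<(n + 255) / 256, 256>>>(d_acc, n);
    cudaDeviceSynchronize();

    // multi-node parallelization: CUDA-aware MPI can communicate device buffers directly
    MPI_Allreduce(MPI_IN_PLACE, d_acc, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_acc);
    MPI_Finalize();
    return 0;
}
```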

Implementation details
include/src
├── cuda_utils // CUDA utilities
│ ├── cuda_launcher.cuh/cu
│ ├── cuda_runtime.h/cpp
│ ├── cuda_utilities.cuh/cu
│ └── linalg.cuh/cu
├── device_rhs.cuh/cu
├── gravity // Gravity related functions/kernels
│ └── gravity.cuh/cu
├── helper.cuh/cu // buffer class
├── helper_handler.h/cpp // buffer handling
├── integrator // Integrator classes, inheriting from miluphpc.h/cpp
│ ├── device_explicit_euler.cuh/cu
│ ├── device_leapfrog.cuh/cu
│ ├── device_predictor_corrector_euler.cuh/cu
│ ├── explicit_euler.h/cpp
│ ├── leapfrog.h/cpp
│ └── predictor_corrector_euler.h/cpp
├── main.cpp // main
├── materials // material handling (for SPH)
│ ├── material.cuh/cu
│ └── material_handler.cuh/cpp
├── miluphpc.h/cpp // base class for dispatching simulation
├── particle_handler.h/cpp // particle class handling
├── particles.cuh/cu // particle information/class
├── processing
│ └── kernels.cuh/cu
├── simulation_time.cuh/cu // simulation time class
├── simulation_time_handler.cpp // simulation time handling
├── sph // SPH related functions/kernels
│ ├── density.cuh/cu
│ ├── internal_forces.cuh/cu
│ ├── kernel.cuh/cu
│ ├── kernel_handler.cuh/cu
│ ├── pressure.cuh/cu
│ ├── soundspeed.cuh/cu
│ ├── sph.cuh/cu
│ ├── stress.cuh/cu
│ └── viscosity.cuh/cu
├── subdomain_key_tree // Tree related functions/kernels as well as domain decomposition
│ ├── subdomain.cuh/cu
│ ├── subdomain_handler.h/cpp
│ ├── tree.cuh/cu
│ └── tree_handler.h/cpp
└── utils // C++ utilities
├── config_parser.h/cpp
├── h5profiler.h/cpp
├── logger.h/cpp
└── timer.h/cpp
The actual implementation and dispatching of the simulation includes
- pre-processing (initial particle distribution, ...)
- preparation tasks like
  - initializing the MPI and CUDA environment
  - reading the initial particle distribution
  - memory allocation, ...
- the actual simulation, depending on the integration scheme used
  - encapsulated in the right-hand side (see the sketch after this list)
  - advancing the particles
- post-processing
- ...
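The control flow can be summarized by the following hedged sketch; all class, method, and variable names are placeholders and do not reflect the actual interface of miluphpc.h or the integrator classes:

```cpp
#include <cstdio>

// Minimal illustration of the dispatch structure (placeholder names only)
struct Simulation {
    double dt = 1e-4;     // (initial) time step
    int numSteps = 10;    // number of integration steps

    void rhs()              { /* domain decomposition, tree construction, gravity, SPH forces */ }
    void advance(double dt) { /* update positions and velocities with the computed derivatives */ }
    void writeOutput(int step) { std::printf("output after step %d\n", step); }

    void run() {
        // preparation: MPI/CUDA initialization, reading the initial distribution, memory allocation
        for (int step = 0; step < numSteps; ++step) {
            rhs();            // right-hand side, evaluated once or several times per step,
            advance(dt);      // depending on the chosen integration scheme
            writeOutput(step);
        }
        // post-processing
    }
};

int main() { Simulation simulation; simulation.run(); return 0; }
```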
However, the simulation flow differs between single-node and multi-node execution: multi-node execution requires additional and more sophisticated functionality to ensure the correctness of the algorithms.

Prerequisites/Dependencies
For more information and instructions refer to Prerequisites.md
- in general, neither the GNU compiler nor OpenMPI is strictly required: any suitable C++ compiler and CUDA-aware MPI implementation can be used, provided the Makefile is adjusted accordingly
Usage
- you need to provide an appropriate HDF5 file as the initial (particle) distribution
- build/compile using the Makefile via:
make
- for debug:
make debug
- using cuda-gdb:
./debug/cuda_debug.sh
- for single-precision:
make single-precision
(default: double-precision)
- run via:
mpirun -np <np> <binary> -n <#output files> -f <input hdf5 file> -C <config file> -m <material-config>
  - <binary>: within bin/, e.g. bin/runner
  - <input hdf5 file>: appropriate HDF5 file as initial (particle) distribution
  - <config file>: configurations
  - <material-config>: material configurations
- as well as correct preprocessor directives in include/parameter.h
- clean via: make clean or make cleaner
- rebuild via:
make remake
Preprocessor directives: parameter.h
#define DEBUGGING 0
#define SAFETY_LEVEL 2
#define DIM 3
#define SI_UNITS 1
#define CUBIC_DOMAINS 1
#define GRAVITY_SIM 1
#define SPH_SIM 1
#define INTEGRATE_ENERGY 0
#define INTEGRATE_DENSITY 1
#define INTEGRATE_SML 0
#define DECOUPLE_SML 0
#define VARIABLE_SML 1
#define SML_CORRECTION 0
#define SPH_EQU_VERSION 1
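These directives are evaluated at compile time and select the code paths that are built. The following sketch only illustrates how such switches typically gate code; the comments are interpretations of the directive names, not code taken from the repository:

```cpp
#include "parameter.h"   // the directives listed above

// Illustrative compile-time gating (not actual repository code)
void rightHandSideSketch() {
#if GRAVITY_SIM
    // tree construction and self-gravity via the Barnes-Hut method
#endif
#if SPH_SIM
    // fixed-radius near neighbor search, density, pressure, internal forces
#if INTEGRATE_DENSITY
    // density evolved via its evolution equation instead of direct summation
#endif
#endif
#if DIM == 3
    // code specific to three spatial dimensions
#endif
}
```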
Input HDF5 file
- for gravity only:
  - provide mass "m", position "x" and velocity "v"
GROUP "/" {
DATASET "m" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( <num particles> ) / ( <num particles> ) }
}
DATASET "v" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( <num particles> , <dim> ) / ( <num particles> , <dim> ) }
}
DATASET "x" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( <num particles> , <dim> ) / (<num particles> , <dim> ) }
}
}
- with SPH provide (at least):
  - mass "m", material identifier "materialId", internal energy "u", position "x" and velocity "v"
GROUP "/" {
DATASET "m" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( <num particles> ) / ( <num particles> ) }
}
DATASET "materialId" {
DATATYPE H5T_STD_I8LE
DATASPACE SIMPLE { ( <num particles> ) / ( <num particles> ) }
}
DATASET "u" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( <num particles> ) / ( <num particles> ) }
}
DATASET "v" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( <num particles> , <dim> ) / ( <num particles> , <dim> ) }
}
DATASET "x" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( <num particles> , <dim> ) / (<num particles> , <dim> ) }
}
}
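For illustration, such an input file could be generated with the plain HDF5 C API roughly as follows; this is a sketch with placeholder particle numbers and values, and the gravity-only layout is obtained by simply omitting "materialId" and "u":

```cpp
#include <hdf5.h>
#include <vector>

int main() {
    const hsize_t n = 2, dim = 3;                          // placeholder particle number and dimension
    std::vector<double> m(n, 1.0), u(n, 0.0);              // mass, internal energy
    std::vector<double> x(n * dim, 0.0), v(n * dim, 0.0);  // position, velocity
    std::vector<signed char> matId(n, 0);                  // material identifier

    hid_t file = H5Fcreate("initial.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims1[1] = { n };
    hsize_t dims2[2] = { n, dim };
    hid_t s1 = H5Screate_simple(1, dims1, NULL);
    hid_t s2 = H5Screate_simple(2, dims2, NULL);

    // scalar datasets: "m", "u", "materialId"
    hid_t dm = H5Dcreate2(file, "m", H5T_IEEE_F64LE, s1, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dm, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, m.data());
    hid_t du = H5Dcreate2(file, "u", H5T_IEEE_F64LE, s1, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(du, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, u.data());
    hid_t dmat = H5Dcreate2(file, "materialId", H5T_STD_I8LE, s1, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dmat, H5T_NATIVE_SCHAR, H5S_ALL, H5S_ALL, H5P_DEFAULT, matId.data());

    // (num particles, dim) datasets: "x", "v"
    hid_t dx = H5Dcreate2(file, "x", H5T_IEEE_F64LE, s2, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dx, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, x.data());
    hid_t dv = H5Dcreate2(file, "v", H5T_IEEE_F64LE, s2, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dv, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, v.data());

    H5Dclose(dm); H5Dclose(du); H5Dclose(dmat); H5Dclose(dx); H5Dclose(dv);
    H5Sclose(s1); H5Sclose(s2); H5Fclose(file);
    return 0;
}
```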
Config file
; IO RELATED
; ------------------------------------------------------
; output directory (will be created if it does not exist)
directory bb/
; outputRank (-1 corresponds to all)
outputRank -1
; omit logType::TIME for standard output
omitTime true
; create log file (including warnings, errors, ...)
log true
; create performance log
performanceLog true
; write particles to be sent to h5 file
particlesSent2H5 true
; INTEGRATOR RELATED
; ------------------------------------------------------
; integrator selection
; explicit euler [0], predictor-corrector euler [1], leapfrog [2]
integrator 1
; initial time step
timeStep 1e-4
; max time step allowed
maxTimeStep 1e-4
; end time for simulation
;timeEnd 6e-2
; SIMULATION RELATED
; ------------------------------------------------------
; space-filling curve selection
; lebesgue [0], hilbert [1]
sfc 1
; theta-criterion for Barnes-Hut (approximative gravity)
theta 0.5
; smoothing parameter for gravitational forces
smoothing 2.56e+20
; SPH smoothing kernel selection
; spiky [0], cubic spline [1], wendlandc2 [3], wendlandc4 [4], wendlandc6 [5]
smoothingKernel 1
; remove particles (corresponding to some criterion)
removeParticles true
; spherically [0], cubic [1]
removeParticlesCriterion 0
; allowed distance to center (0, 0, 0)
removeParticlesDimension 3.6e14
; execute load balancing
loadBalancing false
; interval for executing load balancing (every Nth step)
loadBalancingInterval 1
; how much memory to allocate (1.0 -> all particles can in principle be on one process)
particleMemoryContingent 1.0
; calculate angular momentum (and save to output file)
calculateAngularMomentum true
; calculate (total) energy (and save to output file)
calculateEnergy true
; calculate center of mass (and save to output file)
calculateCenterOfMass false
; IMPLEMENTATION SELECTION
; ------------------------------------------------------
; force version for gravity (use [2])
; burtscher [0], burtscher without presorting [1], miluphcuda with presorting [2],
; miluphcuda without presorting [3], miluphcuda shared memory (experimental) [4]
gravityForceVersion 0
; fixed radius NN version for SPH (use [0])
; normal [0], brute-force [1], shared-memory [2], within-box [3]
sphFixedRadiusNNVersion 3
Material config file
materials = (
{
ID = 0
name = "IsothermalGas"
#sml = 1e12
sml = 5.2e11
interactions = 50
artificial_viscosity = { alpha = 1.0; beta = 2.0; };
eos = {
type = 3
};
}
);
...
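The material configuration follows libconfig syntax (as used by miluphcuda). Assuming the libconfig++ library is available, reading such a file boils down to something like the following stand-alone sketch; it is not the project's material handler, and the accessed fields are just examples:

```cpp
#include <libconfig.h++>
#include <iostream>

int main() {
    libconfig::Config cfg;
    cfg.readFile("config/material.cfg");                // path is a placeholder

    const libconfig::Setting &materials = cfg.lookup("materials");
    for (int i = 0; i < materials.getLength(); ++i) {
        const libconfig::Setting &mat = materials[i];
        int id = 0, interactions = 0;
        double sml = 0.;
        mat.lookupValue("ID", id);
        mat.lookupValue("sml", sml);
        mat.lookupValue("interactions", interactions);
        double alpha = mat["artificial_viscosity"]["alpha"];
        int eosType = mat["eos"]["type"];
        std::cout << "material " << id << ": sml = " << sml
                  << ", alpha = " << alpha << ", eos type = " << eosType << std::endl;
    }
    return 0;
}
```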
Command line arguments
./bin/runner -h
gives help:
Multi-GPU CUDA Barnes-Hut NBody/SPH code
Usage:
HPC NBody [OPTION...]
-n, --number-output-files arg
number of output files (default: 100)
-t, --max-time-step arg time step (default: -1.)
-l, --load-balancing load balancing
-L, --load-balancing-interval arg
load balancing interval (default: -1)
-C, --config arg config file (default: config/config.info)
-m, --material-config arg material config file (default:
config/material.cfg)
-c, --curve-type arg curve type (Lebesgue: 0/Hilbert: 1)
(default: -1)
-f, --input-file arg File name (default: -)
-v, --verbosity arg Verbosity level (default: 0)
-h, --help Show this help
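A typical invocation on four GPUs might thus look like (file names are placeholders):
mpirun -np 4 bin/runner -n 100 -f initial.h5 -C config/config.info -m config/material.cfg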
Samples/Validation
The code validation comprises the correctness of the dispatched simulations on one and on multiple GPUs, although identical simulations on one and on multiple GPUs are not necessarily bitwise identical. With a suitable choice of compiler flags and depending on the architecture used, bitwise identity is in principle attainable; however, it is generally not worthwhile for performance reasons and is therefore not required here. Among others, three test cases were used for validating the implementation:
- the Plummer test case is a gravity-only test case
- the Taylor–von Neumann–Sedov blast wave test case is a pure hydrodynamical SPH test case
- the isothermal collapse test case is a self-gravitating SPH test case utilizing almost all implemented features
In the following figures, each color represents a process and thus a GPU.
Plummer
- Plummer model: four GPUs with dynamic load balancing every 10th step (top: Lebesgue, bottom: Hilbert)

The Plummer model is a gravity-only test case; the particle distribution is stable over time, which enables the validation shown in the following picture.

Taylor–von Neumann–Sedov blast wave
- Sedov explosion: one and two GPUs

The density as a function of radius at t = 0.06, with the semi-analytical solution for comparison.

Boss-Bodenheimer: Isothermal collapse
- Boss-Bodenheimer: isothermal collapse

Contour plot showing the density at 1.2 times the free-fall time.