Content - fb4d2a261f4ae04a0bb54fe51f454e45e941f0d4 - b9a1384/TODO

visit type:

Tip revision: ced4d9906bc8d32ac610097399fdfdf7a5e40a0a authored by Jussi Eloranta on 20 December 2021, 02:30:57 UTC
Update README

Tip revision: ced4d99

TODO

- use malloc/free when moving blocks to/from GPU. This way the GPU node does not need as much memory (but we can use swap...)
- write kernels in .h files that can be included by both openmp and cuda code.
- try using local memory for derivatives in CUDA (avoid jumping in memory)
- CUDA performance of Crank-Nicolson is poor... too complex kernels. There are many if's but at least they don't diverge.
- Poisson solver for other than periodic BC
- Fortran support: iso_c_binding, interface / end interface.
- CUDA: most things have periodic BCs hardwired...
- For mem2gpu we need to track in which space we are so that we can upload the data to GPU in the correct subformat. 
- GPU shuffle routines not tested.
- FFT abs boundaries for predict-correct.
- update manual - it is now quite out of date.
- Implement cfunction the same way as rfunction. These functions need to use linear interpolation...
- When many GPUs are available, it might be beneficial to run the low leverl GPU loops using OpenMP. For just a couple of GPUs, there did not seem to be benefit.
- Finite difference routines are much faster on single GPU than FFT-based. It is likely that the top level user code is doing excess FFT/IFFTs.

also fgrep TODO *.[ch] to find out more specific issues.

=====================================

CUDA performance next:

- Striding over nx (2d grid with 1d inside thread) did not improve performance.
- __restrict__ keyword
- if-else tests

Browse the archive

https://github.com/jmeloranta/libgrid