Tip revision: afd65f03a52e4d9d2536feb9be93150bdb8dec37 authored by Christian Trott on 12 February 2024, 22:42:12 UTC
Merge pull request #2093 from ndellingwood/master-release-4.2.01
# Change Log

## [4.2.01]( (2024-01-17)
[Full Changelog](

### Bug Fixes:

- LAPACK: magma tpl fixes [\#2044](
- BLAS: fix bug in TPL layer of `KokkosBlas::swap` [\#2052](
- ROCm 6 deprecation fixes for rocsparse [\#2050](

## [4.2.00]( (2023-11-06)
[Full Changelog](

### New Features

#### BLAS updates
- Implement BLAS2 syr() and her() functionalities under kokkos-kernels syr() [\#1837](

- New component added for the implementation of LAPACK algorithms and to support associated TPLs [\#1985](
- Fix some issues with unit-test definition for SYCL backend in the new LAPACK component [\#2024](

#### Sparse updates
- Extract diagonal blocks from a CRS matrix into separate CRS matrices [\#1947](
- Adding exec space instance to spmv [\#1932](
- Add merge-based SpMV [\#1911](
- Stream support for Gauss-Seidel: Symbolic, Numeric, Apply (PSGS and Team_PSGS) [\#1906](
- Add a MergeMatrixDiagonal abstraction to KokkosSparse [\#1780](

#### ODE updates
- Newton solver [\#1924](

### Enhancements:

#### Sparse
- MDF performance improvements exposing more parallelism in the implementation
  - MDF: convert remaining count functor to hierarchical parallelism [\#1894](
  - MDF: move most expensive kernels over to hierarchical parallelism [\#1893](
- Improvements to the Block Crs Matrix-Vector multiplication algorithm
  - Improve BSR matrix SpMV Performance [\#1740](
  - Disallow BsrMatrix tensor-core SpMV on non-scalar types [\#1937](
  - remove triplicate sanity checks in BsrMatrix [\#1923](
  - remove duplicate BSR SpMV tests [\#1922](
- Only deep_copy from device to host if supernodal sptrsv algorithms are used [\#1993](
- Improve KokkosSparse_kk_spmv [\#1979](
  - Add 5 warm-up calls to get accurate, consistent timing
  - Print out the matrix dimensions correctly when loading from disk
- sparse/impl: Make PSGS non-blocking [\#1917](

#### ODE
- ODE: changing layout of temp mem in RK algorithms [\#1908](
- ODE: adding adaptivity test for RK methods [\#1896](

#### Common utilities
- Common: remove half and bhalf implementations (now in Kokkos Core) [\#1981](
- KokkosKernels: switching from printf macro to function [\#1977](
- OrdinalTraits: constexpr functions [\#1976](
- Parallel prefix sum can infer view type [\#1974](

#### TPL support
- BSPGEMM: removing cusparse testing for version older than 11.4.0 [\#1996](
- Revise KokkosBlas::nrm2 TPL implementation [\#1950](
- Add TPL oneMKL GEMV support [\#1912](
- oneMKL spmv [\#1882](

### Build System:
- CMakeLists.txt: Update Kokkos version to 4.2.99 for version check [\#2003](
- CMake: Adding logic to catch bad Kokkos version [\#1990](
- Remove calling tribits_exclude_autotools_files() [\#1888](

### Documentation and Testing:
- Update create_gs_handle docs [\#1958](
- docs: Add testing table [\#1876](
- docs: Note which builds have ETI disabled [\#1934](
- Generate HTML docs [\#1921](
- github/workflows: Pin sphinx version [\#1948](
- github/workflows/docs.yml: Use up-to-date doxygen version [\#1941](

- Unit-Test: adding specific test for block sparse functions [\#1944](
- Update SYCL docker image to Cuda 11.7.1 [\#1939](
- Remove printouts from the unit tests of ger() and syr() [\#1933](
- update testing scripts [\#1960](
- Speed up BSR spmv tests [\#1945](
- Test_ODE_Newton: Add template parameters for Kokkos::pair [\#1929](
- par_ilut: Update documentation for fill_in_limit [\#2001](

### Benchmarks:
- perf_test/sparse: Update GS perf_test for streams [\#1963](
- Batched sparse perf_tests: Don't write to source tree during build [\#1904](
- ParILUT bench: fix unused IS_GPU warning [\#1900](
- BsrMatrix SpMV Google Benchmark [\#1886](
- Use extraction timestamps for fetched Google Benchmark files [\#1881](
- Improve help text in perf tests [\#1875](

### Cleanup:
- iostream clean-up in benchmarks [\#2004](
- Rename TestExecSpace to TestDevice [\#1970](
- remove Intel 2017 code (no longer supported) [\#1920](
- clean-up implementations for move of HIP outside of experimental [\#1999](

### Bug Fixes:
- upstream iostream removal fix [\#1991](, [\#1995](
- Test and fix gemv stream interface [\#1987](
- Test_Sparse_spmv_bsr.hpp: Workaround cuda 11.2 compiler error [\#1983](
- Fix improper use of execution space instances in ODE tests. Better handling of CudaUVMSpaces during build. [\#1973](
- Don't assume the default memory space is used [\#1969](
- MDF: set default verbosity explicitly to avoid valgrind warnings [\#1968](
- Fix sort_and_merge functions for in-place case [\#1966](
- SPMV_Struct_Functor: initialize numExterior to 0 [\#1957](
- Use rank-1 impl types when rank-2 vector is dynamically rank 1 [\#1953](
- BsrMatrix: Check if CUDA is enabled before checking architecture [\#1955](
- Avoid enum without fixed underlying type to fix SYCL [\#1940](
- Fix SpAdd perf test when offset/ordinal is not int [\#1928](
- Add KOKKOSKERNELS_CUDA_INDEPENDENT_THREADS definition for architectures with independent thread scheduling [\#1927](
- Fix cm_generate_makefile --boundscheck [\#1926](
- Bsr compatibility [\#1925](
- BLAS: fix assignable check in gemv and gemm [\#1914](
- mdf: fix initial value in select pivot functor [\#1916](
- add missing headers, std::vector -> std::vector<...> [\#1909](
- Add missing <vector> include to Test_Sparse_MergeMatrix.hpp [\#1907](
- Remove non-existent dir from CMake include paths [\#1892](
- cusparse 12 spmv: check y vector alignment [\#1889](
- Change 'or' to '||' to fix compilation on MSVC [\#1885](
- Add missing KokkosKernels_Macros.hpp include [\#1884](
- Backward-compatible fix with kokkos@4.0 [\#1874](
- Fix for rocblas builds [\#1871](
- Correcting 'syr test' bug causing compilation errors with Trilinos [\#1870](
- Workaround for spiluk and sptrsv stream tests with OMP_NUM_THREADS of 1, 2, 3 [\#1864](
- bhalf_t fix for isnan function [\#2007](

## [4.1.00]( (2023-06-16)
[Full Changelog](

### New Features

#### BLAS updates
- Adding interface with execution space instance argument to support execution of BLAS on stream
  - Norms on stream [\#1795](
  - Blas1 on stream [\#1803](
  - Blas2 and 3 on stream [\#1812](
- Improving BLAS level 2 support by adding native implementation and TPL for GER, HER and SYR
  - Implementation for BLAS2 ger [\#1756](
  - Implement BLAS2 syr() and her() functionalities under kokkos-kernels syr() [\#1837](

#### Batched updates
- Optimizing algorithms for single input data
  - Add calls to KokkosBlas Dot and Axpy for team batched kernels when m==1 [\#1753](
  - Add calls to KokkosBlas Gemv and Spmv for team batched kernels when m==1 [\#1770](

#### Sparse updates
- Adding stream support to ILUK/SPTRSV and sort/merge
  - Streams interface for SPILUK numeric [\#1728](
  - Stream interface for SPTRSV solve [\#1820](
  - Add exec instance support to sort/sort_and_merge utils [\#1744](
- Add BsrMatrix SpMV in rocSparse TPL, rewrite BsrMatrix SpMV unit tests [\#1769](
- sparse: Add coo2crs, crs2coo and CooMatrix [\#1686](
- Adds team- and thread-based lower-bound and upper-bound search and predicates [\#1711](
- Adds KokkosKernels::Impl::Iota, a view-like where iota(i) = i + offset [\#1710](

#### Misc updates
- ODE: explicit integration methods [\#1754](

### Enhancements:

#### BLAS
- refactor blas3 tests to use benchmark library [\#1751](

#### Batched
- batched/eti: ETI host-level interfaces [\#1783](
- batched/dense: Add gesv DynRankView runtime checks [\#1850](

#### Sparse
- Add support for complex data types in MDF [\#1776](
- Sort and merge improvements [\#1773](
- spgemm handle: check that A,B,C graphs never change [\#1742](
- Fix/enhance backend issues on spadd perftest [\#1672](
- Spgemm perf test enhancements [\#1664](
- add explicit tests of opt-in algorithms in SpMV [\#1712](

#### Common utilities
- Added TplsVersion file and print methods [\#1693](
- Add basis skeleton for KokkosKernels::print_configuration [\#1665](
- Add git information to benchmark context [\#1722](
- Test mixed scalars: more fixes related to mixed scalar tests [\#1694](
- PERF TESTS: adding utilities and instantiation wrapper [\#1676](

#### TPL support
- Refactor MKL TPL for both CPU and GPU usage [\#1779](
- MKL: support indices properly [\#1868](
- Use rocsparse_spmv_ex for rocm >= 5.4.0 [\#1701](

### Build System:
- Do not change memory spaces instantiation defaults based on Kokkos_ENABLE_CUDA_UVM [\#1835](
- KokkosKernels: Remove TriBITS Kokkos subpackages (trilinos/Trilinos#11545) [\#1817](
- CMakeLists.txt: Add alias to match what is exported from Trilinos [\#1855](
- KokkosKernels: Don't list include for non-existent 'batched' build dir (trilinos/Trilinos#11966) [\#1867](
- Remove non-existent subdir kokkos-kernels/common/common (#11921, #11863) [\#1854](
- KokkosKernels: Remove non-existent common/src/[impl,tpls] include dirs (trilinos/Trilinos#11545) [\#1844](

### Documentation and Testing:
- Enable sphinx werror [\#1856](
- Update cmake option naming in docs/comments [\#1849](
- docs/developer: Add Experimental namespace [\#1852](
- docs: Add profiling for compile times [\#1843](
- Ger: adding documentation stubs in apidocs [\#1822](
- .github/workflows: Summarize github-DOCS errors and warnings [\#1814](
- Blas1: docs update for PR #1803 [\#1805](
- apt-get update in hosted runner docs check [\#1797](
- scripts: Fix github-DOCS [\#1796](
- Add --enable-docs option to cm_generate_makefile [\#1785](
- docs: Add stubs for some sparse APIs [\#1768](
- .github: Update to actions/checkout@v3 [\#1767](
- docs: Include BatchedGemm [\#1765](
- .github: Automation reminder [\#1726](
- Allow an HTML-only docs build [\#1723](
- SYCL CI: Specify the full path to the compiler [\#1670](
- Add github DOCS ci check & disable Kokkos tests [\#1647](
- Add rocsparse,rocblas, to enabled TPLs in cm_test_all_sandia when --spot-check-tpls [\#1841](
- cm_test_all_sandia: update to add caraway queues for MI210, MI250 [\#1840](
- Support rocSparse in rocm 5.2.0 [\#1833](
- Add KokkosKernels_PullRequest_VEGA908_Tpls_ROCM520 support, only enable KokkosBlas::gesv where supported [\#1816](
- scripts: Include OMP settings [\#1801](
- Print the patch that clang-format-8 wants to apply [\#1714](

### Benchmarks:
- Benchmark cleanup for par_ilut and spmv [\#1853](
- SpMV: adding benchmark for spmv [\#1821](
- New performance test for par_ilut, ginkgo::par_ilut, and spill [\#1799](
- Include OpenMP environment variables in benchmark context [\#1789](
- Re-enable and clean up triangle counting perf test [\#1752](
- Include google/benchmark lib version in benchmark output [\#1750](
- Refactor blas2 test for benchmark feature [\#1733](
- Adds a better parilut test with gmres [\#1661](
- Refactor blas1 test for benchmark feature [\#1636](

### Cleanup:
- Drop outdated workarounds for backward compatibility with Kokkos [\#1836](
- Remove dead code guarded [\#1834](
- Remove decl ETI files [\#1824](
- Reorganize par_ilut performance test [\#1818](
- Deprecate Kokkos::Details::ArithTraits [\#1748](
- Drop obsolete workaround #ifdef KOKKOS_IF_ON_HOST [\#1720](
- Drop pre Kokkos 3.6 workaround [\#1653](
- View::Rank -> View::rank [\#1703](
- Prefer Kokkos::View::{R->r}ank [\#1679](
- Call concurrency(), not impl_thread_pool_size() [\#1666](
- Kokkos moves ALL_t out of Impl namespace [\#1658](
- Add KokkosKernels::Impl::are_integral_v helper variable template and quit using Kokkos::Impl::are_integral trait [\#1652](

### Bug Fixes:
- Kokkos 4 compatibility: modifying the preprocessor logic [\#1827](
- blas/tpls: Fix gemm include guard typo [\#1848](
- spmv cusparse version check modified for cuda/11.1 [\#1828](
- Workaround for #1777 - cusparse spgemm test hang [\#1811](
- Fix 1798 [\#1800](
- BLAS: fixes and testing for LayoutStride [\#1794](
- Fix 1786: check that work array is contiguous in SVD [\#1793](
- Fix unused variable warnings [\#1790](
- Use KOKKOS_IMPL_DO_NOT_USE_PRINTF in Test_Common_UpperBound.hpp [\#1784](
- Batched Gesv: initializing variable to make compiler happy [\#1778](
- perf test utils: fix device ID parsing [\#1739](
- Fix OOB and improve comments in BsrMatrix COO constructor [\#1732](
- batched/unit_test: Disable simd dcomplex4 test for intel > 19.05 and <= 2021 [\#1857](
- rocsparse spmv tpl: Fix rocsparse_spmv call for rocm < 5.4.0 [\#1716](
- compatibility with 4.0.0 [\#1709](
- team mult: fix type issue in max_error calculation [\#1706](
- cast Kokkos::Impl::integral_constant to int [\#1697](

## [4.0.01]( (2023-04-19)
[Full Changelog](

### Bug Fixes:
- Use the options ENABLE_PERFTEST, ENABLE_EXAMPLES [\#1667](
- Kokkos Kernels version: need to use upper case variables [\#1707](
- CUSPARSE_MM_ALG_DEFAULT deprecated by cuSparse 11.1 [\#1698](
- blas1: Fix a couple documentation typos [\#1704](
- CUDA 11.4: fixing some -Werror [\#1727](
- Remove unused variable in KokkosSparse_spgemm_numeric_tpl_spec_decl.hpp [\#1734](
- Reduce BatchedGemm test coverage time [\#1737](
- Fix kk_generate_diagonally_dominant_sparse_matrix hang [\#1689](
- Temporary spgemm workaround matching Trilinos 11663 [\#1757](
- MDF: Minor changes to interface for ifpack2 impl [\#1759](
- Rocm TPL support upgrade [\#1763](
- Fix BLAS cmake check for complex types [\#1762](
- ParIlut: Adds a better parilut test with gmres [\#1661](
- GMRES: fixing some type issues related to memory space instantiation (partial) [\#1719](
- ParIlut: create and destroy spgemm handle for each usage [\#1736](
- ParIlut: remove par ilut limitations [\#1755](
- ParIlut: make Ut_values view atomic in compute_l_u_factors [\#1781](

## [4.0.0]( (2023-02-21)
[Full Changelog](

### Features:
- Copyright update 4.0 [\#1657](
- Added Google Benchmark to Kokkos Kernels and to the CI [\#1626](

#### Completing BLAS Level 1:
- ROTG: implementation of BLAS level1 rotg [\#1529](
- ROT: adding function to rotate two vectors using Givens rotation coefficients [\#1581](
- ROTMG: adding rotmg implementation to KokkosBlas [\#1560](
- ROTM: adding blas 1 function for modified rotation [\#1583](
- SWAP: adding implementation of level 1 BLAS function [\#1612](

#### New incomplete factorization algorithms:
- MDF implementation in parallel [\#1393]( and [\#1624](
- Jgfouca/par ilut [\#1506](

#### New additional features
- Add utility `KokkosSparse::removeCrsMatrixZeros` [\#1681](
- Add spgemm TPL support for cuSparse and rocSparse [\#1513](
- Add csr2csc [\#1446](
- Adding my weighted graph coarsening code into kokkos-kernels [\#1043](
- VBD/VBDBIT D1 coloring: support distributed graphs [\#1598](

### Implemented enhancements:
- New tests for mixed-precision GEMM, some fixes for BLAS tests with non-ETI types [\#1615](
- Spgemm non-reuse: unification layer and TPLs [\#1678](
- Remove "slow mem space" device ETI [\#1619](
- First phase of SpGEMM TPL refactor [\#1582](
- Spgemm TPL refactor [\#1618](
- cleaned messages printed at configuration time [\#1616](
- Batched dense tests: splitting batched dense unit-tests [\#1608](
- sparse/unit_test: Use native spmv impl in bsr unit tests [\#1606](
- ROT* HIP: testing and improving rocBLAS support for ROT* kernels [\#1594](
- Add main functions for batched sparse solver performance tests [\#1554](
- Batched sparse kernels update [\#1546](
- supernodal SpTRSV : require invert-diag option to use SpMV [\#1518](
- Update --verbose option in D2 coloring perftest [\#1486](

### Reorganization:
- Modular build: allowing to build components independently [\#1504](
- Move GMRES from example to sparse experimental [\#1620](
- Remove Experimental::BlockCrsMatrix (replaced with Experimental::BsrMatrix) [\#1458](
- Move {Team,TeamVector}Gemv to KokkosBlas [\#1435](
- Move SerialGEMV to KokkosBlas [\#1433](

### Build System:
- CMake: export version and subversion to config file [\#1680](
- CMake: update package COMPATIBILITY mode in anticipation of release 4.0 [\#1645](
- FindTPLMKL.cmake: fix naming of mkl arg to FIND_PACKAGE_HANDLE_STANDARD_ARGS [\#1644](
- KokkosKernels: Use KOKKOSKERNELS_INCLUDE_DIRECTORIES() (TriBITSPub/TriBITS#429) [\#1635](
- Fix docs build [\#1569](
- KokkosKernels: Remove listing of undefined TPL deps (trilinos/Trilinos#11152) [\#1568](

### Testing:
- Update nightly SYCL setup [\#1660](
- Add github DOCS ci check & disable Kokkos tests [\#1647](
- docs: Fix RTD build [\#1490](
- sparse/unit_test: Disable spmv_mv_heavy for all A64FX builds [\#1555](
- ROTMG: rocblas TPL turned off [\#1603](
- Fix HIP nightly build on ORNL Jenkins CI server [\#1544](
- Turn on cublas and cusparse in CLANG13CUDA10 CI check [\#1584](
- Add clang13+cuda10 PR build [\#1524](
- .github/workflows: Fix redundant workflow triggers [\#1527](
- Add GCC test options for C++17 and disable perftests for INTEL19 [\#1511](
- Add INTEL19 and CUDA11 CI settings [\#1505](
- .github/workflows: use c++17 [\#1484](

### Bug Fixes:
- Workaround for array_sum_reduce if scalar is half_t and N is 3, 5 or 7 [\#1675](
- Fix the nondeterministic issue in SPILUK numeric [\#1683](
- Fix an error in Krylov Handle documentation [\#1659](
- ROTMG: loosen unit-test tolerance for Host TPLs [\#1638](
- SWAP: fixing obvious mistake in TPL layer : ( [\#1637](
- Fix 1631: Use Kokkos::LayoutRight with CrsMatrix values_type (Trilinos compatibility) [\#1633](
- Cuda/12 with CuSPARSE updates [\#1632](
- Fix 1627: cusparse 11.0-11.3 spgemm symbolic wrapper [\#1628](
- Make sure to call ExecutionSpace::concurrency() from an object [\#1614](
- SPGEMM: fixing the rocsparse interface [\#1607](
- Fix Trilinos issue 11033: remove compile time check to allow compilation with non-standard scalar types [\#1591](
- SPMM: fixing cuSPARSE issue with incompatible compute type and op [\#1587](
- ParILUT: convert two lambdas to functors [\#1580](
- Update kk_get_free_total_memory for SYCL [\#1579](
- SYCL: Use KOKKOS_IMPL_DO_NOT_USE_PRINTF instead of printf in kernels [\#1567](
- Rotg fixes for issue 1577 [\#1578](
- Rotg update: fixing the interface [\#1566](
- Fix rotg eti [\#1534](
- Fix to include KokkosBatched_Util.hpp [\#1565](
- TeamGemvInternal: reintroduce 12-arg invoke method [\#1561](
- Rename component options to avoid overloaded usage in Trilinos [\#1641](
- Avoid the SIMD code branch if the batched size is not a multiple of the vector length [\#1552](
- SYCL: Fix linking with ze_loader in Trilinos [\#1551](
- ARMPL Fixes and Workarounds [\#1543](
- Test_Graph_coarsen: replace HostMirror usage with auto [\#1538](
- Fix spgemm cusparse [\#1535](
- Warning fixes: Apple Clang complains about [-Werror,-Wunused-but-set-variable] [\#1532](
- In src/batched/dense: Barrier after broadcast [\#1520](
- Graph coarsen: fix test [\#1517](
- KokkosGraph_CoarsenHeuristics: remove volatile qualifier from join [\#1510](
- Replace capture [\#1502](
- utils: implicit copy-assign deprecated in array_sum_reduce [\#1494](

## [3.7.01]( (2022-12-01)
[Full Changelog](

### Bug Fixes:

- Use CRS matrix sort, instead of Kokkos::sort on each row [\#1553](
- Change template type for StaticCrsGraph in BsrMatrix [\#1531](
- Remove listing of undefined TPL deps [\#1568](
- Fix using SpGEMM with nonstandard scalar type, with MKL enabled [\#1591](
- Move destroying dense vector descriptors out of cuSparse sptrsv handle [\#1590](
- Fix `cuda_data_type_from` to return `CUDA_C_64F` for `Kokkos::complex<double>` [\#1604](
- Disable compile-time check in cuda_data_type_from on supported scalar types for cuSPARSE [\#1605](
- Reduce register pressure in batched dense algorithms [\#1588](

### Implemented enhancements:

- Use new cusparseSpSV TPL for SPTRSV when cuSPARSE is enabled with CUDA >= 11.3 [\#1574](

## [3.7.00]( (2022-08-18)
[Full Changelog](

### Features:

#### Final Bsr algorithms implemented for multigrid:
- Sparse: bsr transpose algorithm [\#1477](
- BSR block SpGEMM implementation [\#1099](

#### Adding batched dense linear and non-linear system solvers:
- Add batched GESV [\#1384](
- Newton solver: serial on device implementation of Newton's method [\#1479](

#### Add sparse matrix conversion:
- Add csc2csr [\#1342](
- csc2csr: update Kokkos_Numeric.hpp header inclusion [\#1449](
- sparse: Remove csc2csr copy [\#1375](

#### New documentation in readthedocs
- Added [\#1451](
- Restructure docs [\#1368](

#### Fix issues with TPLs for multivector SPMV
- Add cuSparse TPL files for CrsMatrix-multivector product [\#1427](

### Deprecations:
- Add template params to forwarding calls in deprecated KokkosKernels::… [\#1441](

### Implemented enhancements:

- SPILUK: Move host allocations to symbolic [\#1480](
- trsv: remove assumptions about entry order within rows [\#1463](

#### Hierarchical BLAS algorithms, added and moved from batched:
- Blas serial axpy and nrm2 [\#1460](
- Move Set/Scale unit test to KokkosBlas [\#1455](
- Move {Serial,Team,TeamVector} Set to KokkosBlas [\#1454](
- Move {Serial,Team,TeamVector}Scale to KokkosBlas [\#1448](

#### Code base organization and clean-ups:
- Common Utils: removing dependency on Sparse Utils in Common Utils [\#1436](
- Common cleanup [\#1431](
- Clean-up src: re-organizing the src directory [\#1398](
- Sparse utils namespace [\#1439](

#### perf tests updates, fixes and clean-ups:
- dot perf test: adding support for HIP and SYCL backend [\#1453](
- Add verbosity parameter to GMRES example. Turn off for testing. [\#1385](
- KokkosSparse_spiluk.cpp perf test: add int-int guards to cusparse codes [\#1369](
- perf_test/blas: Check ARMPL build version [\#1352](
- Clean-up batched block tridiag perf test [\#1343](
- Reduce lots of macro duplication in sparse unit tests [\#1340](

#### Infrastructure changes: ETI and testing upgrades, minor fixes
- sycl: re-enabling test now that dpcpp has made progress [\#1473](
- Only instantiate Kokkos's default Cuda mem space [\#1361](
- Sparse and CI updates [\#1411](
- Newer sparse tests were not following the new testing pattern [\#1356](
- Add ETI for D1 coloring [\#1401](
- Add ETI to SpAdd (symbolic and numeric) [\#1399](
- Reformat example/fenl files changed in 1382 [\#1464](
- Change Controls::getParameter error message from stdout to stderr [\#1416](

#### Kokkos alignment: update our implementations to use newer Kokkos features
- Arith traits integral nan [\#1438](
- Kokkos_ArithTraits: re-implementation using Kokkos Core [\#1406](
- Value-initialize result of MaxLoc reduction to avoid maybe uninitialized warning [\#1383](
- Remove volatile qualifiers in reducer join(), init(), and operator+= methods [\#1382](

#### BLAS and batched algorithms updates
- Update Batched GMRES [\#1392](
- GEMV: accumulate in float for scalar = bhalf_t [\#1360](
- Restore BLAS-1 MV paths for 1 column [\#1354](

#### Sparse and Graph updates
- Minor updates to cluster Gauss-Seidel [\#1372](
- Add unit test for BsrMatrix and BlockCrsMatrix spmv [\#1338](
- Refactor SPGEMM MKL Impl [\#1244](
- D1 coloring: remove unused but set variable [\#1403](

#### half precision paper
- Minor changes for half precision paper [\#1429](
- Add benchmarks for us-rse escience 2022 half precision paper [\#1422](

### Bug Fixes:
- TPLs: adding CUBLAS in the list of dependencies [\#1482](
- Fix MKL build errors [\#1478](
- Fixup drop layout template param in rank-0 views [\#1476](
- BLAS: fixing test that accesses results before synching [\#1472](
- Fix D1 color ETI with both CudaSpace and UVM [\#1471](
- Fix arithtraits warning [\#1468](
- Fix build when double not instantiated [\#1467](
- Fix -Werror [\#1466](
- Fix GitHub CI failing on broken develop [\#1461](
- HIP: fix warning from ExecSpaceUtils and GEMV [\#1459](
- Removes a duplicate cuda_data_type_from when KOKKOS_HALF_T_IS_FLOAT [\#1456](
- Fix incorrect function call in KokkosBatched::TeamGEMV unit test [\#1444](
- Fix SYCL nightly test [\#1419](
- Fix issues with cuSparse TPL availability for BsrMatrix SpMV [\#1418](
- SpMV: fixing issues with unit-tests tolerance [\#1412](
- Address 1409 [\#1410](
- Fix colliding include guards (copy-paste mistake) [\#1408](
- src/sparse: Fix & check for fence post errors [\#1405](
- Bspgemm fixes [\#1396](
- Fix unused parameter warnings in GEMM test. [\#1381](
- Fixes code deprecation warnings. [\#1379](
- Fix sign-compare warning in SPMV perf test [\#1371](
- Minor MKL fixes [\#1365](
- perf_test/batched: Temporarily disable tests [\#1359](
- Fix nightly builds following promotion of the math functions in Kokkos [\#1339](

## [3.6.01]( (2022-05-23)
[Full Changelog](

### Bug Fixes and Improvements:

- Improve spiluk numeric phase to avoid race conditions and processing in chunks [\#1390](
- Improve sptrsv symbolic phase performance (level scheduling) [\#1380](
- Restore BLAS-1 MV paths for 1 column [\#1354](
- Fix check that view has const type [\#1370](
- Fix check that view has const type part 2 [\#1394](

## [3.6.00]( (2022-02-18)
[Full Changelog](

### Features: 

#### Batched Sparse Linear algebra
Kokkos Kernels is adding a new component to the library: batched sparse linear algebra.
As with the current dense batched algorithms, the new algorithms are called from
the GPU and provide Team and TeamVector levels of parallelism; SpMV also provides a Serial
call on GPU.

- Add Batched CG and Batched GMRES [\#1155](
- Add Jacobi Batched preconditioner [\#1219](

#### Bsr and Tensor core algorithm for sparse linear algebra
After the introduction of the BsrMatrix in release 3.5.0, new algorithms now support this format.
Release 3.6.0 adds matrix-vector (matvec) multiplication and Gauss-Seidel, as well as an
implementation of matvec that leverages tensor cores on Nvidia GPUs. More kernels are expected to
support the Bsr format in future releases.

- Add Spmv for BsrMatrix [\#1255](
- Add BLAS to SpMV operations for BsrMatrix [\#1297](
- BSR format support in block Gauss-Seidel [\#1232](
- Experimental tensor-core SpMV for BsrMatrix [\#1090](
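
For readers unfamiliar with the Bsr (block sparse row) layout mentioned above, the following is a minimal plain C++ sketch of a block matvec. It does not use Kokkos and is not the KokkosSparse implementation or API: the `BsrSketch` struct, its field names, the row-major block storage, and the `bsrSpmv` function are all assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for illustration only; this is NOT KokkosSparse::BsrMatrix.
// Every stored block is a dense blockDim x blockDim tile.
struct BsrSketch {
  std::size_t numBlockRows;          // number of block rows
  std::size_t blockDim;              // edge length of each square block
  std::vector<std::size_t> rowPtr;   // size numBlockRows + 1, offsets into colInd
  std::vector<std::size_t> colInd;   // block-column index of each stored block
  std::vector<double> values;        // blockDim * blockDim entries per block (row-major, assumed)
};

// Computes y = A * x, one dense block at a time.
std::vector<double> bsrSpmv(const BsrSketch& A, const std::vector<double>& x) {
  std::vector<double> y(A.numBlockRows * A.blockDim, 0.0);
  for (std::size_t ib = 0; ib < A.numBlockRows; ++ib) {
    for (std::size_t k = A.rowPtr[ib]; k < A.rowPtr[ib + 1]; ++k) {
      const std::size_t jb = A.colInd[k];
      const double* blk = &A.values[k * A.blockDim * A.blockDim];
      // Small dense matvec of the block against the matching slice of x.
      for (std::size_t r = 0; r < A.blockDim; ++r) {
        double sum = 0.0;
        for (std::size_t c = 0; c < A.blockDim; ++c) {
          sum += blk[r * A.blockDim + c] * x[jb * A.blockDim + c];
        }
        y[ib * A.blockDim + r] += sum;
      }
    }
  }
  return y;
}
```

Storing one column index per block rather than per scalar entry is what lets the inner loops run as small dense operations, which is the property a tensor-core variant can plausibly exploit.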

#### Improved AMD math libraries support
rocBLAS and rocSPARSE TPLs are now officially supported and can be enabled at configure time.
The initial kernels that can call rocBLAS are GEMV, GEMM, IAMAX and SCAL, while rocSPARSE can be
called for matrix-vector multiplication. Further support for TPL calls can be requested on Slack
and through GitHub issues.

- Tpl rocBLAS and rocSPARSE [\#1153](
- Add rocBLAS GEMV wrapper [\#1201](
- Add rocBLAS wrappers for GEMM, IAMAX, and SCAL [\#1230](
- SpMV: adding support for rocSPARSE TPL [\#1221](

#### Additional new features
- bhalf: Unit test Batched GEMM [\#1251]( and demonstrate GMRES example convergence with bhalf_t (
- Stream interface: adding stream support in GEMV and GEMM [\#1131](
- Improve double buffering batched gemm performance [\#1217](
- Allow choosing coloring algorithm in multicolor GS [\#1199](
- Batched: Add armpl dgemm support [\#1256](

### Deprecations:
- Deprecation warning: SpaceAccessibility move out of impl, see #1140 [\#1141](

### Backends and Archs Enhancements:

#### SYCL:
- Full Blas support on SYCL [\#1270](
- Get sparse tests enabled and working for SYCL [\#1269](
- Changes to make graph run on SYCL [\#1268](
- Allow querying free/total memory for SYCL [\#1225](
- Use KOKKOS_IMPL_DO_NOT_USE_PRINTF instead of printf in kernels [\#1162](

#### HIP:
- Work around hipcc size_t/int division with remainder bug [\#1262](

#### Other Improvements:
- Replace std::abs with ArithTraits::abs [\#1312](
- Batched/dense: Add Gemm_DblBuf LayoutLeft operator [\#1299](
- KokkosKernels: adding variable that returns version as a single number [\#1295](
- Add KOKKOSKERNELS_FORCE_SIMD macro (Fix #1040) [\#1290](
- Algo::Level{2,3}::Blocked::mb() [\#1265](
- Batched: Use SerialOpt2 for 33 to 39 square matrices [\#1261](
- Prune extra dependencies [\#1241](
- Improve double buffering batched gemm perf for matrix sizes >64x64 [\#1239](
- Improve graph color perf test [\#1229](
- Add custom implementation for strcasecmp [\#1227](
- Replace __restrict__ with KOKKOS_RESTRICT [\#1223](
- Replace array reductions in BLAS-1 MV reductions [\#1204](
- Update MIS-2 and aggregation [\#1143](
- perf_test/blas/blas3: Update SHAs for benchmarking [\#1139](

### Implemented enhancements BuildSystem
- Bump ROCm version 4.2 -> 4.5 in nightly Jenkins CI build [\#1279](
- scripts/cm_test_all_sandia: Add A64FX ci checks [\#1276](
- github/workflows: Add osx CI [\#1254](
- Update SYCL compiler version in CI [\#1247](
- Do not set Kokkos variables when exporting CMake configuration [\#1236](
- Add nightly CI check for SYCL [\#1190](
- Update cmake minimum version to 3.16 [\#866](

### Incompatibilities:
- Kokkos::Impl: removing a few more instances of throw_runtime_exception [\#1320](
- Remove Kokkos::Impl::throw_runtime_exception from Kokkos Kernels [\#1294](
- Remove unused memory space utility [\#1283](
- Clean up Kokkos header includes [\#1282](
- Remove private Kokkos header include (Cuda/Kokkos_Cuda_Half.hpp) [\#1281](
- Avoid using #ifdef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_* macro guards [\#1266](
- Rename enumerator Impl::Exec_{PTHREADS -> THREADS} [\#1253](
- Remove all references to the Kokkos QThreads backend [\#1238](
- Replace more occurrences of Kokkos::Impl::is_view [\#1234](
- Do not use Kokkos::Impl::is_view [\#1214](
- Replace Kokkos::Impl::if_c -> std::conditional [\#1213](

### Bug Fixes:
- Fix bug in spmv_mv_bsrmatrix() for Ampere GPU arch [\#1315](
- Fix std::abs calls for rocBLAS/rocSparse [\#1310](
- cast literal 0 to fragment scalar type [\#1307](
- Fix 1303: maintain correct #cols on A in twostage [\#1304](
- Add dimension checking to generic spmv interface [\#1301](
- Add missing barriers to TeamGMRES, fix vector len [\#1285](
- Examples: fixing some issues related to type checking [\#1267](
- Restrict BsrMatrix specialization for AMPERE and VOLTA to CUDA [\#1242](
- Fix compilation errors for multi-vectors in kk_print_1Dview() [\#1231](
- src/batched: Fixes #1224 [\#1226](
- Fix SpGEMM crashing on empty rows [\#1220](
- Fix issue #1212 [\#1218](
- example/gmres: Specify half_t namespace [\#1208](
- Check that ordinal types are signed [\#1188](
- Fixing a couple of small issues with tensor core spmv [\#1185](
- Fix #threads setting in pcg for OpenMP [\#1182](
- SpMV: fix catch all case to avoid compiler warnings [\#1179](
- using namespace should be scoped to prevent name clashes [\#1177](
- using namespace should be scoped to prevent name clashes, see issue #1170 [\#1171](
- Fix bug with mkl impl of spgemm [\#1167](
- Add missing $ to KOKKOS_HAS_TRILINOS in sparse_sptrsv_superlu check [\#1160](
- Small fixes to spgemm, and plug gaps in testing [\#1159](
- SpMV: mismatch in #ifdef check and kernel specialization [\#1151](
- Fix values dimension for block sparse matrices [\#1147](

## [3.5.00]( (2021-10-19)
[Full Changelog](

- Batched serial SVD [\#1107](
- Batched: Add BatchedDblBufGemm [\#1095](
- feature/gemv rps test -- RAJAPerf Suite Version of the BLAS2 GEMV Test [\#1085](
- Add new bsrmatrix [\#1077](
- Adding Kokkos GMRES example [\#1028](
- Add fast two-level mode N GEMV (#926) [\#939](
- Batched: Add BatchedGemm interface [\#935](
- OpenMPTarget: adding ETI and CMake logic for OpenMPTarget backend [\#886](

**Implemented enhancements Algorithms and Archs:**
- Use float as accumulator for GEMV on half_t (Fix #1081) [\#1082](
- Supernodal SpTRSV: add option to use MAGMA TPL for TRTRI [\#1069](
- Updates for running GMRES example with half precision [\#1067](
- src/blas/impl: Explicitly cast to LHS type for ax [\#1073](
- Update BatchedGemm interface to match design proposal [\#1054](
- Move dot-based GEMM out of TPL CUBLAS [\#1050](
- Adding ArmPL option to spmv perf_test [\#1038](
- Add (right) preconditioning to GMRES [\#1078](
- Supernodal SpTRSV: perform TRMM only if TPL CuBLAS is enabled [\#1027](
- Supernodal SpTRSV: support SuperLU version < 5 [\#1012](
- perf_test/blas/blas3: Add dgemm armpl experiment [\#1005](
- Supernodal SpTRSV: run TRMM on device for setup [\#983](
- Merge pull request #951 from vqd8a/move_sort_ifpack2riluk [\#972](
- Point multicolor GS: faster handling of long/bulk rows [\#993](
- Make CRS sorting utils work with unmanaged [\#963](
- Add sort and make sure using host mirror on host memory in kspiluk_symbolic [\#951](
- GEMM: call GEMV instead in certain cases [\#948](
- SpAdd performance improvements, better perf test, fix mtx reader columns [\#930](

**Implemented enhancements BuildSystem:**
- Automate documentation generation [\#1116](
- Move the batched dense files to specific directories [\#1098](
- cmake: Update SUPERLU tpl option for Tribits [\#1066](
- cmake/Modules: Allow user to use MAGMA_DIR from env [\#1007](
- Supernodal SpTRSV: update TPLs requirements [\#997](
- cmake: Add MAGMA TPL support [\#982](
- Host only macro: adding macro to check for any device backend [\#940](
- Prevent redundant spmv kernel instantiations (reduce library size) [\#937](
- unit-test: refactor infrastructure to remove most *.cpp [\#906](

**Implemented enhancements Other:**
- Allow reading integer mtx files into floating-point matrices [\#1100](
- Warnings: remove -Wunused-parameter warnings in Kokkos Kernels [\#962](
- Clean up CrsMatrix raw pointer constructor [\#949](
- unit_test/batched: Remove *_half fns from gemm unit tests [\#943](
- Move sorting functionality out of Impl:: [\#932](

- Deprecation warning: SpaceAccessibility move out of impl [\#1141](
- Workaround error with intel [\#1128](
- gmres: disable examples for builds with ibm/xl [\#1123](
- CrsMatrix: deprecate constructor without ncols input [\#1115](
- perf_test/blas/blas3: Disable simd verify for cuda/10.2.2 [\#1093](
- Replace impl/Kokkos_Timer.hpp includes with Kokkos_Timer.hpp [\#1074](
- Remove deprecated ViewAllocateWithoutInitializing [\#1058](
- src/sparse: spadd resolve deprecation warnings [\#1053](
- Give full namespace path for D2 coloring [\#999](
- Fix -Werror=deprecated errors with c++20 standard [\#964](
- Deprecation: a deprecated function is called in the SpADD perf_test [\#954](

**Enabled tests:**
- HIP: enabling all unit tests [\#968](
- Fix build and add CI coverage for LayoutLeft=OFF [\#965](
- Enable SYCL tests [\#927](
- Fixup HIP nightly builds [\#907](

**Fixed Bugs:**
- Fix SpGEMM for Nvidia Turing/Ampere [\#1118](
- Fix #1111: spmv tpl instantiations [\#1112](
- Fix C's numCols in spadd simplified interface [\#1102](
- Fix #1089 (failing batched UTV tests) [\#1096](
- Blas GEMM: fix early exit logic, see issue #1088 [\#1091](
- Fix #1048: handle mode C spmv correctly in serial/openmp [\#1084](
- src/batched: Fix multiple definitions of singleton [\#1072](
- Fix host accessing View in non-host space [\#1057](
- Fix559: Intel 18 has trouble with pointer in ternary expr [\#1042](
- Work around team size AUTO issue on kepler [\#1020](
- Supernodal SpTrsv: fix out-of-bound error [\#1019](
- Some fixes for MAGMA TPL and gesv [\#1008](
- Merge pull request #981 from Tech-XCorp/4005-winllvmbuild [\#984](
- This is a PR for 4005 vs2019build, which fixes a few things on Windows [\#981](
- Fix build for no-ETI build [\#977](
- Fix invalid mem accesses in new GEMV kernel [\#961](
- Kokkos_ArithTraits.hpp: Fix isInf and isNan with complex types [\#936](

## [3.4.01]( (2021-05-19)
[Full Changelog](

**Fixed Bugs:**
- Windows: Fixes for Windows [\#981](
- Sycl: ArithTraits fixes for Sycl [\#959](
- Sparse: Added code to allow KokkosKernels coloring to accept partial colorings [\#938](
- Sparse: Include sorting within spiluk [\#972](
- Sparse: Fix CrsMatrix raw pointer constructor [\#971](
- Sparse: Fix spmv Serial beta==-1 code path [\#947](

## [3.4.00]( (2021-04-25)
[Full Changelog](

- SYCL: adding ETI and CMake logic for SYCL backend [\#924](

**Implemented enhancements Algorithms and Archs:**
- Two-stage GS: add damping factors [\#921](
- Supernodal SpTRSV, improve symbolic performance [\#899](
- Add MKL SpMV wrapper [\#895](
- Serial code path for spmv [\#893](

**Implemented enhancements BuildSystem:**
- Cmake: Update ArmPL support [\#901](
- Cmake: Add ARMPL TPL support [\#880](
- IntelClang guarding __assume_aligned with !defined(__clang__) [\#878](

**Implemented enhancements Other:**
- Add static_assert/throw in batched eigendecomp [\#931](
- Workaround using new/delete in kernel code [\#925](
- Blas perf_test updates [\#892](

**Fixed bugs:**
- Fix ctor CrsMat mirror with CrsGraph mirror [\#918](
- Fix nrm1, removed cublas nrminf, improved blas tests [\#915](
- Fix and testing coverage mainly in graph coarsening [\#910](
- Fix KokkosSparse for nightly test failure [\#898](
- Fix view types across ternary operator [\#894](
- Make work_view_t typedef consistent [\#885](
- Fix supernodal SpTRSV build with serial+openmp+cuda [\#884](
- Construct SpGEMM C with correct ncols [\#883](
- Matrix Converter: fixing issue with deallocation after Kokkos::finalize [\#882](
- Fix >1024 team size error in sort_crs_* [\#872](
- Fixing seg fault with empty matrix in kspiluk [\#871](

## [3.3.01]( (2021-01-18)
[Full Changelog](

**Fixed Bugs:**
- With cuSPARSE enabled, too many variants of SpMV were instantiated even when not requested, increasing executable size by up to 1 GB.

## [3.3.00]( (2020-12-16)
[Full Changelog](

**Implemented enhancements:**
- Add permanent RCM reordering interface, and a basic serial implementation [\#854](
- Half\_t explicit conversions [\#849](
- Add batched gemm performance tests [\#838](
- Add HIP support to src and perf\_test [\#828](
- Factor out coarsening [\#827](
- Allow enabling/disabling components at configuration time [\#823](
- HIP: CMake work on tests and ETI  [\#820](
- HIP: KokkosBatched - hip specialization [\#812](
- Distance-2 maximal independent set [\#801](
- Use batched TRTRI & TRMM for Supernode-sptrsv setup [\#797](
- Initial support for half precision [\#794](

**Fixed bugs:**
- Fix issue with HIP and Kokkos\_ArithTraits [\#844](
- HIP: fixing round of issues on AMD [\#840](
- Throw an exception if BLAS GESV is not enabled [\#837](
- Fixes -Werror for gcc with c++20 [\#836](
- Add fallback condition to use spmv\_native when cuSPARSE does not work [\#834](
- Fix install testing refactor for inline builds [\#811](
- HIP: fix ArithTraits to support HIP backend [\#809](
- cuSPARSE 11: fix spgemm and spmv\_struct\_tunning compilation error [\#804](

- Remove pre-3.0 deprecated code [\#825](

## [3.2.01]( (2020-11-17)
[Full Changelog](

**Fixed bugs:**

- Cpp14 Fixes: [\#790](

## [3.2.00]( (2020-08-19)
[Full Changelog](

**Implemented enhancements:**

- Add CudaUVMSpace specializations for cuBLAS IAMAX and SCAL [\#758](
- Add wiki examples [\#735](
- Support complex\_float, complex\_double in cuSPARSE SPMV wrapper [\#726](
- Add performance tests for trmm and trtri [\#711](
- SpAdd requires output values to be zero-initialized, but this shouldn't be needed [\#694](
- SpAdd doesn't merge entries correctly [\#685](
- cusparse SpMV merge algorithm [\#670](
- TPL support for SpMV [\#614](
- Add two BLAS/LAPACK calls needed by: Sptrsv supernode \#552 [\#589](
- HashmapAccumulator has several unused members, misnamed parameters [\#508](

**Fixed bugs:**

- Nightly test failure: spgemm unit tests failing on White \(Power8\) [\#780](
- supernodal does not build with UVM enabled [\#633](

## [3.1.01]( (2020-05-04)
[Full Changelog](

**Fixed bugs:**

- KokkosBatched QR PR breaking nightly tests [\#691](

## [3.1.00]( (2020-04-14)
[Full Changelog](

**Implemented enhancements:**

- Two-stage & Classical Gauss-Seidel [\#672](
- Test transpose utilities [\#664](
- cuSPARSE spmv wrapper doesn't actually use 'mode' [\#650](
- Distance-2 improvements [\#625](
- FindMKL module: which mkl versions to prioritize [\#480](
- Add SuperLU as optional CMake TPL [\#545](
- Revamp the ETI system [\#460](

**Fixed bugs:**

- 2-stage GS update breaking cuda/10+rdc build [\#673](
- Why CrsMatrix::staticcrsgraph\_type uses execution\_space and not device\_type? [\#665](
- TRMM and TRTRI build failures with clang/7+cuda9+Cuda\_OpenMP and gcc/5.3+OpenMP [\#657](
- cuSPARSE spmv wrapper doesn't actually use 'mode' [\#650](
- Block Gauss-Seidel test fails when cuSPARSE is enabled [\#648](
- cuda uvm test failures without launch blocking - expected behavior? [\#636](
- graph\_color\_d2\_symmetric\_double\_int\_int\_TestExecSpace seg faults in cuda/10.1 + Volta nightly test on kokkos-dev-2 [\#634](
- Build failures on kokkos-dev with clang/7.0.1 cuda/9.2 and blas/cublas/cusparse tpls [\#629](
- Distance-2 improvements [\#625](
- trsv - internal compiler error with intel/19 [\#607](
- complex\_double misalignment still breaking SPGEMM [\#598](
- PortableNumericCHASH can't align shared memory  [\#587](
- Remove all references to Kokkos::Impl::is\_same [\#586](
- Can I run KokkosKernels spgemm with float or int32 type? [\#583](
- Kokkos Blas: gemv segfaults [\#443](
- Generated kokkos-kernels file names are too long and are crashing cloning Trilinos on Windows [\#395](

## [3.0.00]( (2020-01-27)
[Full Changelog](

**Implemented enhancements:**

- BuildSystem: Standalone Modern CMake support [\#491](
- Cluster GS and SGS: add cluster gauss-seidel implementation [\#455](
- spiluk: Add sparse ILUK implementation [\#459](
- BLAS gemm: Dot-based GEMM Cuda optimization for C = beta\*C + alpha\*A^T\*B [\#490]
- Sorting utilities: [\#461](
- SGS: Support multiple rhs in SGS efficiently [\#488](
- BLAS trsm: Add support and interface for trsm [\#513](
- BLAS iamax: Implement iamax [\#87](
- BLAS gesv: [\#449](
- sptrsv supernodal: Add supernodal sparse triangular solver [\#552](
- sptrsv: Add cusparse tpl support for sparse triangular solve, cudagraphs to fallback [\#555](
- KokkosGraph: Output colors assigned during graph coloring [\#444](
- MatrixReader: Full matrix market support [\#466](

**Fixed bugs:**

- gemm: Fix bug for complex types in fallback impl [\#550](
- gemv: Fix degenerate matrix cases [\#514](
- spgemm: Fix cuda build with complex\_double misaligned shared memory access [\#500](
- spgemm: Wrong team size heuristic used for SPGEMM when Kokkos deprecated=OFF [\#474](
- dot: Improve accuracy for float and complex_float [\#574](
- SpMV Struct: Fix bug with intel\_17\_0\_1 [\#456](
- readmtx: Fix invalid read due to loop condition [\#453](
- spgemm: Fix hashmap accumulator bug yielding crashes and wrong results [\#402](
- KokkosGraph: Fix distance-1 graph coloring segfault [\#275](
- UniformMemoryPool: does not re-initialize chunks that are freed [\#530](

## [2.9.00]( (2019-06-24)
[Full Changelog](

**Implemented enhancements:**

- KokkosBatched: Add specialization for float2, float4 and double4 [\#427](
- KokkosBatched: Reduce VectorLength (16 to 8) [\#432](
- KokkosBatched: Remove experimental name space for batched blas [\#371](
- Capability: Initial sparse triangular solve capability [\#435](
- Capability: Add support for MAGMA GESV TPL [\#409](
- cuBLAS: Add CudaUVMSpace specializations for GEMM [\#397](

**Fixed bugs:**

- Deprecated Code Fixes [\#411](
- BuildSystem: Compilation error on rzansel [\#401](

## [2.8.00]( (2019-02-05)
[Full Changelog](

**Implemented enhancements:**

- Capability, Tests: C++14 Support and Testing [\#351](
- Capability: Batched getrs [\#332](
- More Kernel Labels for KokkosBlas [\#239](
- Name all parallel kernels and regions [\#124](

**Fixed bugs:**

- BLAS TPL: BLAS underscore mangling [\#369](
- BLAS TPL, Complex: Promotion 2.7.24 broke MV unit tests in Tpetra with complex types [\#360](
- GEMM: GEMM uses wrong function for computing shared memory allocation size [\#368](
- BuildSystem: BLAS TPL macro not properly enabled with MKL BLAS [\#347](
- BuildSystem: make clean - errors [\#353](
- Compiler Workaround: Internal compiler error in KokkosBatched::Experimental::TeamGemm [\#349](
- KokkosBlas: Some KokkosBlas kernels assume default execution space [\#14](

## [2.7.24]( (2018-11-04)
[Full Changelog](

**Implemented enhancements:**

- Enhance test\_all\_sandia script to set scalar and ordinal types [\#315](
- Batched getri need [\#305](
- Deterministic Coloring [\#271](
- MKL - guard minor version for MKL v. 18 [\#268](
- TPL Support for all BLAS functions using CuBLAS [\#247](
- Add L1 variant to multithreaded Gauss-Seidel [\#240](
- Multithreaded Gauss-Seidel does not support damping [\#221](
- Guard 1-phase SpGEMM in Intel MKL  [\#217](
- generate makefile with-spaces option  [\#98](
- Add MKL version check [\#7](

**Fixed bugs:**

- Perf test failures w/ just CUDA enabled [\#257](
- Wrong signature for axpy blas functions [\#329](
- Failing unit tests with float - unit test error checking issue [\#322](
- cuda.graph\_graph\_color\* COLORING\_VBD test failures with cuda/9.2 + gcc/7.2 on White [\#317](
- KokkosBatched::Experimental::SIMD\<T\> does not build with T=complex\<float\> [\#316](
- simple test program fails using 3rdparty Eigen library [\#309](
- KokkosBlas::dot is broken for complex, due to incorrect assumptions about Fortran ABI [\#307](
- strides bug in kokkos tpl interface.  [\#292](
- Failing spgemm unit test with MKL [\#289](
- Fix the block\_pcg perf-test  when offsets are size\_t [\#287](
- spotcheck warnings from kokkos  [\#284](
- Linking error in tpl things [\#282](
- Build failure with clang 3.9.0 [\#281](
- CMake modification for TPLs. [\#276](
- KokkosBatched warnings [\#259](
- KokkosBatched contraction length bug [\#258](
- Small error in KokkosBatched\_Gemm\_Serial\_Imp.hpp with SerialGemm\<Trans::Transpose,\*,\*\> [\#147](

## [2.7.00]( (2018-05-24)
[Full Changelog](

**Implemented enhancements:**

- Tests: add capability to build a unit test standalone [\#233](
- Make KokkosKernels work without KOKKOS\_ENABLE\_DEPRECATED\_CODE [\#223](
- Replace KOKKOS\_HAVE\_\* FLAGS with KOKKOS\_ENABLE\_\* [\#219](
- Add team-based scal, mult, update, nrm2 [\#214](
- Add team based abs [\#209](
- Generated CPP files moving includes inside the ifdef's [\#199](
- Implement BlockCRS in Kokkoskernels [\#184](
- Spgemm hash promotion [\#171](
- Batched BLAS enhancement [\#170](
- Document & check CMAKE\_CXX\_USE\_RESPONSE\_FILE\_FOR\_OBJECTS=ON in CUDA build [\#148](

**Fixed bugs:**

- Update drivers in perf\_tests/graph to use Kokkos::initialize\(\) [\#200](
- unit tests failing/hanging on Volta [\#188](
- Inner TRSM: SIMD build error; manifests in Ifpack2 [\#183](
- d2\_graph\_color doesn't have a default coloring mechanism [\#168](
- Unit tests do not build with Serial backend [\#154](

## [2.6.00]( (2018-03-07)
[Full Changelog](

**Implemented enhancements:**

- Spgemm hash promotion [\#171](
- Batched BLAS enhancement [\#170](

**Fixed bugs:**

- d2\_graph\_color doesn't have a default coloring mechanism [\#168](
- Build error when MKL TPL is enabled [\#135](

## [2.5.00]( (2017-12-15)
[Full Changelog](

**Implemented enhancements:**

- KokkosBlas:   Add GEMM interface  [\#105](
- KokkosBlas:   Add GEMM default Kernel [\#125](
- KokkosBlas:   Add GEMV that wraps BLAS \(and cuBLAS\) [\#16](
- KokkosSparse: Make SPMV test not print GBs of output if something goes wrong.  [\#111](
- KokkosSparse: ETI SpGEMM and Gauss Seidel and take it out of Experimental namespace [\#74](
- BuildSystem:  Fix Makesystem to correctly build library after aborted install [\#104](
- BuildSystem:  Add option to generate\_makefile.bash to define memoryspaces for instantiation [\#89](
- BuildSystem:  generate makefile tpl option [\#66](
- BuildSystem:  Add a simpler compilation script, README update etc [\#96](

**Fixed bugs:**

- Internal Compiler Error GCC in GEMM [\#129](
- Batched Team LU: bug for small team\_size [\#110](
- Compiler BUG in IBM XL pragma unrolling [\#92](
- Fix Blas TPL enables build [\#77](
- Batched Gemm Failure [\#73](
- CUDA 7.5 \(GCC 4.8.4\) build errors [\#72](
- Cuda BLAS tests fail with UVM if CUDA\_LAUNCH\_BLOCKING=1 is not defined on Kepler [\#51](
- CrsMatrix: sumIntoValues and replaceValues incorrectly count the number of valid column indices. [\#11](
- findRelOffset test assumes UVM [\#32](

## [0.10.03]( (2017-09-11)

**Implemented enhancements:**

- KokkosSparse: Fix unused variable warnings in spmv\_impl\_omp, spmv Test and graph color perf\_test [\#63](
- KokkosBlas:  dot: Add unit test [\#15](
- KokkosBlas:  dot: Add special case for multivector \* vector \(or vector \* multivector\) [\#13](
- BuildSystem: Make KokkosKernels build independently of Trilinos [\#1](
- BuildSystem: Fix ETI System not to depend on Tpetra ETI [\#5](
- BuildSystem: Change CMake to work with new ETI system [\#19](
- BuildSystem: Fix TpetraKernels names to KokkosKernels [\#4](
- BuildSystem: Trilinos/KokkosKernels reports no ETI in almost any circumstance [\#29](
- General:     Kokkos::ArithTraits\<double\>::nan\(\) is very slow [\#35](
- General:     Design and Define New UnitTest infrastructure [\#28](
- General:     Move Tpetra::Details::OrdinalTraits to KokkosKernels [\#22](
- General:     Rename files and NameSpace to KokkosKernels [\#12](
- General:      PrepareStandalone: Get rid of Teuchos usage [\#2](
- General:      Fix warning with char being either signed or unsigned in ArithTraits [\#60](
- Testing:      Make all tests run with -Werror [\#68](

**Fixed bugs:**

- SPGEMM Test Fails for Cuda when compiled through Trilinos  [\#49](
- Fix ArithTraits min for floating points [\#47](
- Pthread ETI error [\#25](
- Fix CMake Based ETI for Threads backend [\#46](
- KokkosKernels\_ENABLE\_EXPERIMENTAL causes build error  [\#59](
- ArithTraits warnings in CUDA build [\#71](
- Graph coloring build warnings [\#3](

\* *This Change Log was automatically generated by [github_changelog_generator](*