Stretching GPU performance for GEMMs and tensor contractions
https://github.com/ROCmSoftwarePlatform/Tensile