Matrix multiplication is one of the most commonly used computations across a wide range of applications. For example, most implementations of neural networks perform very large numbers of matrix multiplications. Almost all processor and GPU manufacturers provide libraries with carefully hand-tuned fast matrix multiplication routines. All the best-known algorithms for matrix multiplication are out-of-place; that is, the result matrix occupies separate memory from the input matrices. This is in contrast to other problems, such as sorting, which have efficient in-place algorithms that produce the output in the same array as the input, by overwriting the input. The goal of this project is to design and create an efficient implementation of one or more in-place matrix multiplication algorithms.
Last year a student did their project on in-place matrix multiplication for square matrices, that is, for matrices where the number of rows is equal to the number of columns. They found that the square matrices could be divided into sub-blocks to create a matrix multiplication algorithm that is not fully in-place, but which uses much less memory than the fully out-of-place algorithm. The goal of this project is to develop an analogous technique for non-square matrices.
Multiplication of rectangular matrices can be more complicated than for square matrices. For some matrix dimensions, the size of the result matrix is larger than the two input matrices combined. In this case it is clear that no in-place algorithm is possible. For other matrix dimensions, the size of the result matrix is very much smaller than the size of the input matrices. In this case, an in-place algorithm seems much more feasible. Matrix multiplication is often described as C = A x B, where A is m x k, B is k x n, and the result C is m x n. If the A matrix is short and wide and the B matrix is tall and narrow, the resulting C matrix will be short and narrow. But if the A matrix is tall and narrow, and the B matrix is short and wide, the C matrix will be tall and wide.
An important aspect of this project is an efficient implementation of the in-place matrix multiplication algorithm. The most obvious choices for an efficient implementation are C/C++ or Fortran. There is potential to do the project on a GPU using CUDA, but it would make sense to start on the CPU and only extend it to the GPU later if you have plenty of time remaining.