In-place matrix multiplication

Matrix multiplication is one of the most commonly used computations across a wide range of applications. For example, most neural network implementations perform very large numbers of matrix multiplications. Almost all processor and GPU manufacturers provide libraries with carefully hand-tuned matrix multiplication routines. All the best-known algorithms for matrix multiplication are out-of-place; that is, the result matrix occupies memory separate from the input matrices. This contrasts with other problems, such as sorting, where efficient in-place algorithms produce the output in the same array as the input by overwriting it. The goal of this project is to design and create an efficient implementation of one or more in-place matrix multiplication algorithms.

An important question is whether in-place matrix multiplication is possible at all. Part of the answer depends on the dimensions of the matrices. For some dimensions, the result matrix is larger than the two input matrices combined; for example, multiplying an N x 1 column vector by a 1 x N row vector produces an N x N result from just 2N input values. In this case it is clear that no in-place algorithm is possible. For other dimensions, the result is much smaller than the inputs; multiplying a 1 x N row vector by an N x 1 column vector produces a single value. In this case, an in-place algorithm seems potentially more feasible. An important special case is where both input matrices are square with dimensions N x N. The result matrix then also has dimensions N x N, meaning that it has the same size as either of the input matrices. The square matrix case is the case that the project should start with, and depending on the algorithms you develop for the square case, you might complete the entire project using square matrices.

An important aspect of this project is an efficient implementation of the matrix multiplication algorithm. The most obvious choices for an efficient implementation are C or C++, perhaps with OpenMP for thread and vector parallelism. There is potential to do the project on a GPU using CUDA, but it would make sense to start on the CPU and extend to the GPU only if you have plenty of time remaining (which is unusual).