Efficient double-double matrix multiplication using the AVX-512 vector instruction set [Taken]

Current processors typically support at least two floating point formats. Single precision (or binary32) numbers consist of 32 bits: a sign bit, 8 exponent bits, and 23 bits for the significand (the significant digits of the number). Double precision (or binary64) numbers consist of 64 bits: a sign bit, 11 exponent bits, and 52 significand bits, which gives roughly 16 significant decimal digits. However, there are applications where double precision does not provide enough significant digits to give an accurate result. In these cases, a higher precision floating point type may be necessary.

The double-double type represents a number x using two double precision floating point numbers, x.high and x.low, where the number represented has the value x.high + x.low. The high part represents the most significant digits of the floating point number. The low part represents the next most significant digits that cannot be represented in the high part. There are well-known algorithms for performing addition, multiplication, and other operations on double-double types. For example, see: https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic
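
As an illustration of what such algorithms look like, the following is a minimal C sketch of double-double addition and multiplication built from the classic error-free transformations (Knuth's TwoSum and Dekker's product with Veltkamp splitting). The names dd, two_sum, split, two_prod, dd_add and dd_mul are illustrative only and are not taken from any particular library. The addition shown is the simpler, less accurate variant, and overflow, underflow and special values are ignored. The code assumes the compiler does not fuse multiplies and adds (for example, GCC's -ffp-contract=off).

```c
/* A double-double value: the number represented is high + low. */
typedef struct { double high; double low; } dd;

/* TwoSum: returns s = fl(a + b) and the exact rounding error e,
   so that s + e == a + b. */
static dd two_sum(double a, double b) {
    double s = a + b;
    double bb = s - a;
    double e = (a - (s - bb)) + (b - bb);
    dd r = { s, e };
    return r;
}

/* Veltkamp splitting: a == *hi + *lo, each half fitting in 26 bits. */
static void split(double a, double *hi, double *lo) {
    double t = 134217729.0 * a;          /* 2^27 + 1 */
    *hi = t - (t - a);
    *lo = a - *hi;
}

/* Dekker's product: returns p = fl(a * b) and the exact error e,
   so that p + e == a * b. With hardware FMA, e could instead be
   computed as fma(a, b, -p). */
static dd two_prod(double a, double b) {
    double p = a * b;
    double ah, al, bh, bl;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    double e = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
    dd r = { p, e };
    return r;
}

/* Double-double addition (simple variant). */
static dd dd_add(dd x, dd y) {
    dd s = two_sum(x.high, y.high);      /* exact sum of the high parts */
    double e = s.low + x.low + y.low;    /* collect the low-order terms */
    return two_sum(s.high, e);           /* renormalize */
}

/* Double-double multiplication; the x.low * y.low term is dropped
   because it lies below the precision of the result. */
static dd dd_mul(dd x, dd y) {
    dd p = two_prod(x.high, y.high);
    p.low += x.high * y.low + x.low * y.high;
    return two_sum(p.high, p.low);       /* renormalize */
}
```

For instance, squaring x = 1 + 2^-30 (with a zero low part) using dd_mul gives high = 1 + 2^-29 and low = 2^-60, a term that a plain double precision multiplication would round away.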

The goal of this project is to explore the implementation of one or more fast matrix multiplication routines for double-double arithmetic using the AVX-512 instruction set. AVX-512 is a vector instruction set supported by recent x86_64 processors from Intel. In AVX-512 the vector registers are 512 bits (64 bytes) wide, which means that eight double precision floating point values can fit within each vector register. The project should investigate building highly efficient implementations of matrix multiplication, with optimizations similar to those found in highly tuned libraries for double precision arithmetic.
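
To make the idea concrete, here is a hedged sketch of one possible building block: a row update c[j] += a * b[j] in double-double arithmetic, which is the inner step of C[i][:] += A[i][k] * B[k][:] in a matrix product. It is only one of many possible kernel designs and is not tuned. The function names dd_muladd and dd_axpy, and the choice to store high and low parts in separate arrays, are assumptions made for this example. When the compiler targets AVX-512 (for example with -mavx512f), eight elements are processed per iteration using intrinsics from immintrin.h; otherwise only the scalar loop is compiled. The low-order terms are rounded slightly differently on the two paths, so results may differ in the last bits of the low part. As before, the compiler should not be allowed to fuse multiplies and adds on its own (for example, GCC's -ffp-contract=off).

```c
#include <stddef.h>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

/* Scalar (*ch, *cl) += (ah, al) * (bh, bl) in double-double arithmetic.
   Used for the loop remainder and when AVX-512 is not available. */
static void dd_muladd(double ah, double al, double bh, double bl,
                      double *ch, double *cl) {
    /* Product: p + e approximates a * b (Dekker's splitting for the error). */
    double p = ah * bh;
    double t = 134217729.0 * ah, a1 = t - (t - ah), a2 = ah - a1;
    double u = 134217729.0 * bh, b1 = u - (u - bh), b2 = bh - b1;
    double e = ((a1 * b1 - p) + a1 * b2 + a2 * b1) + a2 * b2;
    e += ah * bl + al * bh;
    /* Sum: TwoSum on the high parts, then fold in the low-order terms. */
    double s = *ch + p;
    double v = s - *ch;
    double r = (*ch - (s - v)) + (p - v);
    r += *cl + e;
    /* Renormalize with FastTwoSum; this assumes |s| >= |r|. */
    *ch = s + r;
    *cl = r - (*ch - s);
}

/* c[j] += a * b[j] for j in [0, n), with high and low parts in separate arrays. */
void dd_axpy(size_t n, double ah, double al,
             const double *bh, const double *bl, double *ch, double *cl) {
    size_t j = 0;
#ifdef __AVX512F__
    const __m512d vah = _mm512_set1_pd(ah);   /* broadcast a to all 8 lanes */
    const __m512d val = _mm512_set1_pd(al);
    for (; j + 8 <= n; j += 8) {
        __m512d vbh = _mm512_loadu_pd(bh + j), vbl = _mm512_loadu_pd(bl + j);
        __m512d vch = _mm512_loadu_pd(ch + j), vcl = _mm512_loadu_pd(cl + j);
        /* Product: the fused multiply-subtract yields the exact error of p. */
        __m512d p = _mm512_mul_pd(vah, vbh);
        __m512d e = _mm512_fmsub_pd(vah, vbh, p);
        e = _mm512_fmadd_pd(vah, vbl, e);
        e = _mm512_fmadd_pd(val, vbh, e);
        /* Sum: same sequence of operations as the scalar code. */
        __m512d s = _mm512_add_pd(vch, p);
        __m512d v = _mm512_sub_pd(s, vch);
        __m512d r = _mm512_add_pd(_mm512_sub_pd(vch, _mm512_sub_pd(s, v)),
                                  _mm512_sub_pd(p, v));
        r = _mm512_add_pd(r, _mm512_add_pd(vcl, e));
        __m512d hi = _mm512_add_pd(s, r);
        __m512d lo = _mm512_sub_pd(r, _mm512_sub_pd(hi, s));
        _mm512_storeu_pd(ch + j, hi);
        _mm512_storeu_pd(cl + j, lo);
    }
#endif
    for (; j < n; ++j)
        dd_muladd(ah, al, bh[j], bl[j], &ch[j], &cl[j]);
}
```

A tuned implementation would presumably go further, for example by keeping a block of C in registers across several values of k and by blocking for the cache hierarchy, as highly tuned double precision libraries do.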

The project may explore different data layouts for matrices of double-double values. Multicore parallelism is also likely to be important, given that each double-double operation requires many double precision floating point operations.
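
As a rough sketch of two candidate layouts, the C fragment below contrasts an array-of-structures layout, where the high and low parts of each element are adjacent in memory, with a structure-of-arrays layout, where all high parts are stored contiguously and all low parts are stored contiguously. The second form lets a vector load fetch eight high parts (or eight low parts) at once without shuffling. The type and function names (dd, dd_matrix_soa, dd_aos_to_soa) are hypothetical, and other layouts, such as interleaving blocks of eight high parts with blocks of eight low parts, may well turn out to be preferable.

```c
#include <stddef.h>

/* Array-of-structures element: high and low parts are adjacent in memory. */
typedef struct { double high; double low; } dd;

/* Structure-of-arrays matrix: two separate row-major arrays,
   each holding rows * cols doubles. */
typedef struct {
    size_t rows, cols;
    double *high;
    double *low;
} dd_matrix_soa;

/* Copy a row-major array-of-structures matrix into caller-allocated
   structure-of-arrays storage with the same dimensions. */
void dd_aos_to_soa(const dd *src, dd_matrix_soa *dst) {
    for (size_t i = 0; i < dst->rows; ++i) {
        for (size_t j = 0; j < dst->cols; ++j) {
            size_t idx = i * dst->cols + j;   /* row-major index of (i, j) */
            dst->high[idx] = src[idx].high;
            dst->low[idx] = src[idx].low;
        }
    }
}
```

Because each row (or block) of the result matrix can be computed independently of the others, either layout leaves room to distribute rows or blocks across cores.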