Luke Leighton is a Libre Ethical Technology Specialist.
He has been using, programming and reverse-engineering computing
devices continuously for 44 years, has a BEng (Hons), ACGI, in
Theory of Computing from Imperial College, and recently put that
education to good use in the form of the Libre-SOC
Project: an entirely Libre-Licensed 3D Hybrid CPU-VPU-GPU based on
OpenPOWER. He writes poetry and has been developing a HEP Physics theory
for the past 36 years in his spare time.

Advanced Cray-style Vectors, named SVP64, are being developed for the Power
ISA as a Draft Extension for submission to the new OpenPOWER ISA Working
Group. Whilst in-place Matrix Multiply was planned for a much
later advanced version of SVP64, an investigation into putting FFMPEG's
MP3 CODEC inner loop into Vectorised Assembler resulted in such a large
drop in code size (over 4x reduction) that it warranted priority
investigation.
Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT)
and Number-Theoretic Transform (NTT) form the basis of high-priority
algorithms too numerous to count. Normal SIMD Processors and even
normal Vector Processors have a hard time dealing with them: inspecting
FFMPEG's source code reveals that heavily optimised inline assembler (no
loops, just hundreds to thousands of lines of assembler) is not uncommon.
The focus of this NLnet-sponsored research is therefore to create enhancements
to SVP64 to be able to cover DFT, DCT, NTT and Matrix-Multiply entirely
in-place. In-place is crucially important for many applications (3D, Video)
to keep power consumption down by avoiding register spill as well as L1/L2
cache strip-mining. General-purpose RADIX-2 DCT and complex DFT will be
shown and explained, as well as the in-place Matrix Multiply which does
not require transposing or register spill for Matrices of any size (including
non-power-of-two) up to 128 FMACs. The basics of SVP64, covered in the Overview, will also
be briefly described.
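
To make "in-place" concrete, here is a minimal sketch of a textbook iterative
RADIX-2 complex DFT that overwrites its input array rather than allocating
a second buffer. It is plain Python for illustration only: it is not SVP64
assembler and not the Libre-SOC implementation, and the function name
fft_inplace is a hypothetical one chosen for this example. It assumes the
input length is a power of two.

```python
import cmath


def fft_inplace(x):
    """Radix-2 decimation-in-time DFT that overwrites the list x.

    Illustrative sketch only (hypothetical name, textbook algorithm);
    assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    # Bit-reversal permutation, performed by swapping elements inside x,
    # so no second buffer is needed.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]

    # Butterfly stages: each butterfly reads two elements and writes the
    # results back to the same two positions.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(start, start + half):
                t = w * x[k + half]
                x[k + half] = x[k] - t
                x[k] = x[k] + t
                w *= w_step
        size *= 2
    return x
```

Calling fft_inplace(data) on a list of 8 complex or real values transforms
data itself. Every butterfly touches only the two elements it updates, which
is the property that allows the working set to stay in one fixed block of
storage.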