Project Stage 3 - Does the shoe fit?

This blog post is for the SPO600 class I am taking at Seneca College; it relates to stage 3 of our final project, detailed here.

The post detailing my selection and general course of action for this project can be found here. The previous parts of my project, where I detail my implementation plan and the issues I encountered, can be found here and here.

For the final portion of my project I received some interesting information from the author of RTM, Nicholas Frechette. While reaching out to discuss the build errors I encountered during my attempts to test the library on the CDOT AArch64 system "Israel", Nicholas mentioned that SVE optimizations may not make sense for the library. My correspondence with Nicholas was, unfortunately, limited, but I believe he gave me enough information to investigate this claim. He left me two threads to pull on: scalability is not necessary, and RTM is optimized for latency rather than throughput.

Scalable Vector Extension

The first and most obvious application of SVE is the scalability it offers in the form of variable-length vectors ranging from 128 to 2048 bits. This is the primary feature of SVE/SVE2, and I have a quote directly from Nicholas explaining why these variable-length vector registers aren't a priority for RTM:

"RTM is built for the type of realtime applications used in games, VR, and other such use cases. This type of code involves mostly the following types: quaternions, QVV transforms, 4x3 affine matrices, and 4x4 matrices, and vector3/vector4. Code often mixes many of these types and as such the width of types used is known ahead of time. It is typically very rare to process large amounts of these in a Structure Of Array form which would benefit from SVE." 
-Nicholas Frechette

It appears to me the sentiment is that the extra work of dealing with scalable vectors, controlling lane predication, and filling the extra width of SVE registers is simply not worth it. If the width of the data in this kind of code is always known ahead of time and never exceeds 128 bits, then NEON on AArch64 already does everything RTM needs. RTM is also used mainly in gaming applications, so there is a lack of truly large-scale calculations that would warrant the extra scalability of the SVE registers. The optimizations gained by processing larger data sets with SVE are not the kind RTM is aiming for: RTM aims to do each calculation as quickly as possible, because in games and other realtime applications the fastest possible response time is the point, not churning through ever-larger batches of data. This is a very important distinction, and it leads into the final half of my investigation.

Throughput vs. Latency

To introduce this final point I would like to share another quote provided to me by Nicholas:

"RTM is optimized for latency while SPMD and SVE are aimed at throughput. Latency is the primary reason why RTM doesn't enable FMA on x64 by default as that too is currently optimized for throughput (at least on AMD hardware)." 
-Nicholas Frechette

So what exactly is the difference between latency and throughput? To find out I had to do some digging, and the most straightforward answer I could find was in response to this stackoverflow post. The answer references an old Intel article covering how each of these is measured, but it is no longer accessible, so this is the closest we can get.

The way I understand it, the difference comes down to overlap. Latency is how many CPU cycles pass before an instruction's result is available to the instructions that depend on it. Throughput is how often the CPU can start new, independent instructions of that kind, since it can begin them before earlier results are ready. It's a matter of raw speed for one chain of dependent operations versus efficiently working through lots of independent data, as I understand it, but feel free to leave a comment if I'm off target on this one!

If this is in fact how it works, then I completely understand the thought process here. As a gamer I see the benefit in each calculation finishing as fast as possible. Bringing realtime calculations down to the absolute fastest execution time does wonders for quality of life and smoothness of gameplay. If the existing SIMD intrinsics simply do the job they need to do, and do it with best-in-class latency, then there's no use in changing course.

Conclusion

My conclusion here will be no surprise: I'm certain Nicholas knows his library best and I trust his judgment, and I believe the logic is sound; the jump to supporting SVE isn't a priority. I'd argue that solving the issues that still stand with RTM on AArch64 is a much larger priority, but that's beside the point. I will note that in my correspondence Nicholas did express interest in investigating the applications of SVE, but only once the library develops and expands into use cases that better suit the optimizations SVE provides. Perhaps in the future RTM will expand to ARMv9 support and leverage SVE/SVE2 in its own way.

For now though, NEON will have to cut it for AArch64! 
