Posted on 31 October 2025

ARM vs x86 memory ordering in Unreal Engine

I thought I’d write up the most infuriating bug I’ve had to deal with lately - how it manifested; the many things that could’ve caused it (but didn’t); what actually did cause it; and how I fixed it.

Metron, our realtime dynamic music middleware, is a portable C++ replayer core which is wrapped by a tracker-style DAW interface and can be used standalone or as a plugin for Unreal Engine. Writing Metron is effectively how I migrated my Java skills to C++ and it’s been a very rewarding experience on the whole. UE4’s Audio Mixer gave me a great sandbox in which to learn how to create DSP effects and synths, while breaking Metron out to raw C++ has in turn been a wild ride of learning to cope without all the safety nets UE gives you.

Learning realtime programming, you very quickly run into the various issues that certain well-established golden rules exist to prevent: to generalise, they mostly boil down to “do not cause or allow anything to happen if you don’t know exactly how long it will take”. With realtime audio, you’re not doing all your work 60 or 120 times per second like the graphics and gameplay folks are. You have to do it 48,000 times per second - and if you can’t, your game’s unplayable. People will forgive minor visual glitches a lot more willingly than stuttering audio. This means you have a tight time budget that you can calculate by estimating an idealised maximum time that your per-buffer DSP work can take before buffer underruns occur, e.g.

(512 / 48000) * 1000 = 10.67ms

and then ensuring that you do all your work in slightly less than that time headroom - ideally a LOT less than that.
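That arithmetic is worth wrapping in a helper so it's easy to sanity-check against different buffer sizes and sample rates. A minimal sketch (the function name is mine, not from any real codebase):

```cpp
// Per-buffer time budget in milliseconds: buffer length divided by
// sample rate, scaled to ms. Exceed this and you get an underrun.
constexpr double BufferBudgetMs(int bufferSamples, int sampleRate)
{
    return (static_cast<double>(bufferSamples) / sampleRate) * 1000.0;
}

// 512 samples at 48 kHz: roughly 10.67 ms of headroom per callback.
static_assert(BufferBudgetMs(512, 48000) > 10.66 &&
              BufferBudgetMs(512, 48000) < 10.67, "budget check");
```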

Keeping to this budget means you can’t have your core DSP loop sprinkled with calls to anything that happens in unbounded time, which is…actually quite a lot of stuff if you want your code to form part of any interactive application, game or VST plugin. So your core loop needs to be free of potentially latent or expensive things such as memory allocation (including vector/TArray resizing, which does memory allocation under the hood), UI updates, networking calls, input polling and so on; all those things need to be run on a different thread. Synchronisation between your realtime thread and your ‘everything else’ thread (let’s call it the UI thread) should be very carefully managed so that it a) doesn’t happen way more often than it absolutely needs to and b) can’t cause - or risk causing, due to factors outside your control - a buffer underrun. Those factors, by the way, include almost everything else that’s going on in your game or the user’s operating system.

Safe communication between RT (realtime audio) and UI threads is something I had to learn pretty early on; Unreal Engine has some good tools for this and I was able to replicate some of those behaviours in raw C++. The idea is for the producing thread to queue up events which the consuming thread can dequeue at a safe interval of its choosing, so that data races can be avoided. If the UI thread gets a user interaction that should affect the sound (ie a knob is turned which should control a sound’s amplitude), then an event is pushed into the queue so that it can be dequeued on the RT side when it’s safe to do so - usually at the buffer block processing boundary, just before the next buffer block’s DSP work is done. The audio callback is called, the queue is flushed and all lambdas/etc are fired, then the buffer loop begins so that per-sample work can be done.
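That producer/consumer pattern can be sketched as a fixed-capacity single-producer/single-consumer ring buffer. This is a minimal illustration under my own assumptions - `UiEvent` and the member names are invented, not Metron's or Unreal's actual API - but it shows the key properties: no locking, no allocation, and a bounded amount of work on the RT side:

```cpp
#include <atomic>
#include <array>
#include <cstddef>
#include <optional>

// Illustrative payload: which parameter changed and its new value.
struct UiEvent
{
    int   paramId;
    float value;
};

// Single-producer (UI thread) / single-consumer (RT thread) queue.
template <std::size_t kCapacity>
class SpscEventQueue
{
public:
    // Called on the UI thread. Returns false when full rather than
    // blocking or allocating.
    bool Push(const UiEvent& e)
    {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % kCapacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false; // full: drop or retry on the UI side
        slots_[head] = e;
        head_.store(next, std::memory_order_release); // publish the slot
        return true;
    }

    // Called on the RT thread at the buffer boundary, before DSP work.
    std::optional<UiEvent> Pop()
    {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt; // empty
        UiEvent e = slots_[tail];
        tail_.store((tail + 1) % kCapacity, std::memory_order_release);
        return e;
    }

private:
    std::array<UiEvent, kCapacity> slots_{};
    std::atomic<std::size_t> head_{0}; // only written by the producer
    std::atomic<std::size_t> tail_{0}; // only written by the consumer
};
```

The RT thread would drain this in a loop at the top of the audio callback, applying each event before any per-sample work begins.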

There’s a similar need for extreme care when passing data back to the UI thread from the RT thread, and in an audio context that’s often going to be the sort of telemetry that a user needs in addition to the audio they can hear: dB levels of each track, so a level meter can be drawn on screen; current sequencer transport position and playback state; that sort of thing. If the data is a single integer, float or bool, you’ll often get away with using a std::atomic variable that can be safely written on one side and read on the other side. There’s a range of memory order options to choose from. That’s foreshadowing…
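For the single-value case, the whole pattern is a handful of lines. A sketch, with invented names (`trackLevelDb` etc. aren't from the real codebase):

```cpp
#include <atomic>

// Latest meter reading, published by the RT thread, read by the UI thread.
std::atomic<float> trackLevelDb{-96.0f};

// RT thread: publish the level after each processed buffer.
void PublishLevel(float db)
{
    trackLevelDb.store(db, std::memory_order_release);
}

// UI thread: sample it whenever a meter frame is drawn.
float ReadLevel()
{
    return trackLevelDb.load(std::memory_order_acquire);
}
```

It's worth checking `std::atomic<float>::is_always_lock_free` on your target platform - if the atomic falls back to a lock, it's no longer RT-safe.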

Metron is a dynamic music system, so being able to communicate between UI and RT threads only once per buffer is inadequate; we need much tighter timing than that, so I segment the buffer into even smaller chunks according to musical sequencer timing (all of which is calculated on the audio thread). You could segment the buffer per-sample, ie run 512 processing loops in a 512-sample block with a dequeue from the UI thread between each one, but that would normally destroy your CPU budget. For a PPQN of 24 (which is low by modern DAW standards but adequate for Metron), even at high musical tempos we only need to segment each 512 or 1024 sample buffer a few times.
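The arithmetic behind "a few times" is simple enough to show. Samples per sequencer tick is sampleRate × 60 / (bpm × ppqn); the 174 BPM figure below is my own assumption (a typical drum and bass tempo), not a number from the post:

```cpp
// How many audio samples elapse between sequencer ticks.
constexpr double SamplesPerTick(double sampleRate, double bpm, int ppqn)
{
    return sampleRate * 60.0 / (bpm * ppqn);
}

// At 174 BPM with PPQN 24 and a 48 kHz sample rate, a tick lands
// roughly every 690 samples - so a 512-sample buffer needs at most
// one split, and a 1024-sample buffer one or two.
static_assert(SamplesPerTick(48000.0, 174.0, 24) > 689.0 &&
              SamplesPerTick(48000.0, 174.0, 24) < 690.0, "tick spacing");
```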

So timing is tight and the music system is very responsive to UI-rate control input from the game thread. The SampleVoice instrument which is core to Metron and makes it an extremely flexible sample-based synth works by having an instance per tracker-channel (which is how polyphony caps are kept under control) and then swapping in new sampledata pointers every time a new note is encountered, regardless of the note’s instrument. If the last note was a kick drum and the new one is a snare drum, that’s fine: just swap the sampledata pointer and read what’s there. If the pointer is wrong, or has died because it wasn’t properly kept alive, we’ll soon know about it because there’ll be a crash (or a load of telltale logs).

Why, then, was I getting a bug where an instrument would go silent on perhaps one note in a dozen, or perhaps once every 30 seconds it would start at the wrong sample offset? And why would this happen only on ARM, never on x86? With no errors or asserts firing, no absence of PCM data, no resource pressure (songs would be using around 4% of audio CPU budget)? It was massively annoying and it took me about 6 months, on and off, to track down. You can get away with this sort of bug sometimes with pad sounds or lots of staccato lead notes, but this game has a jungle/drum and bass soundtrack full of meticulously sequenced breakbeat chops: if one of those goes wrong, the whole flow is ruined. So I couldn’t ignore it.

After trying loads of stuff that didn’t work, I ended up going down some esoteric rabbit holes: thread migration seemed like a likely candidate. The main box it ticked was that it’s naturally something that’ll vary between CPU architectures. Also, UE does migrate the audio thread around quite a bit on x86 as far as I can tell: there’s nothing sacrosanct about the audio thread in terms of absolute thread ID, and it can jump around according to <waves hands at UE’s bazillion lines of mysterious core engine code>. But this worked fine on x86. I wasted a lot of time trying to demand at init time that the audio thread be constrained to a single core, or just two, because I could see that my buffer callbacks were alternating between cores 3, 4 and 5. But the target platform - Meta Quest 3 - doesn’t respect those requests, so nope.

Anyway, I wasn’t a million miles away. It was an ARM vs x86 thing, but it was…atomic memory read/write order, just as foreshadowed! x86 has a strong memory model (total store ordering): in practice, every plain store already behaves like a release and every plain load like an acquire, so sloppy memory_order arguments rarely have any visible effect. It lets you get away with being sloppy. And because I didn’t truly understand the implications of the memory order args on my atomic load()/store() calls, I was very sloppy. ARM has a much weaker memory model and doesn’t give you that safety net: the CPU is free to reorder loads and stores that your memory_order arguments don’t explicitly constrain. If you don’t plan your atomic operations properly, you’ll get valid but unexpected results and no errors will be thrown, especially if your code is very defensively written.

So the bug was that the state of a sample voice’s PCM-buffer reader was subject to tearing, which could (safely!) invalidate a note right after it was triggered (hence the dropped notes) and/or load a stale start-offset position into the note (hence the wonky breakbeat chops). The solution was to hold an atomic state struct per PCM-buffer reader that was trivially copyable POD (plain old data), and to use the correct std::memory_order args when updating its PCM pointer or ready flag. Pointers, flags and vars in the reader now can’t be touched out of sequence and can’t become stale. All my stolen notes, returned.
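The underlying guarantee at work is release/acquire publication. This sketch uses my own invented names (`VoiceState`, `pcm`, `startOffset` - not Metron's actual code), and shows the flag-publish form of the pattern rather than the exact atomic-struct fix described above; the ordering guarantee is the same. The plain fields are only read after an acquire load observes the release store, so on ARM (as on x86) they can't be seen torn or stale:

```cpp
#include <atomic>
#include <cstdint>

struct VoiceState
{
    const float*      pcm         = nullptr; // sample data for the new note
    std::uint32_t     startOffset = 0;       // where playback should begin
    std::atomic<bool> ready{false};
};

// Sequencer/UI side: fill in the plain fields FIRST, then publish.
// The release store makes everything written before it visible to
// any thread whose acquire load sees ready == true.
void TriggerNote(VoiceState& v, const float* pcm, std::uint32_t offset)
{
    v.pcm = pcm;
    v.startOffset = offset;
    v.ready.store(true, std::memory_order_release);
}

// RT side: only touch the plain fields after the acquire load succeeds.
bool TryConsumeNote(VoiceState& v, const float** pcmOut, std::uint32_t* offsetOut)
{
    if (!v.ready.load(std::memory_order_acquire))
        return false;
    *pcmOut    = v.pcm;
    *offsetOut = v.startOffset;
    v.ready.store(false, std::memory_order_release); // hand the slot back
    return true;
}
```

With relaxed (or defaulted-but-misplaced) ordering instead, ARM is free to let the reader observe `ready == true` while still seeing the old pointer or offset - exactly the "valid but unexpected" behaviour described above.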

Honestly, I still couldn’t explain every memory_order option if my life depended on it, but at least I’ve learnt that if my writer/producer uses release and my reader/consumer uses acquire, I’ll probably be okay on ARM. And I’ll also be okay on x86, which doesn’t really care.