class: center, middle

# We need to go faster

A communication from the native world to the Web Audio community

Paul Adenot, Web Audio Conference 2022, Cannes, France

---

class: center, middle

## Not a "this year in Web Audio" this time around

(although I hear there is news on that tomorrow just before lunch, and there is one (1) new API I'm talking about here, can't help it, sorry)

---

class: center, middle

# Doing more with less

The Zen of high-performance real-time programming

Web Audio Conference 2022, Cannes, France

---

class: center, middle

# 27 weird tricks to improve your audio code! The 19th will surprise you!

Web Audio Conference 2022, Cannes, France

---

class: center, middle

# Real-time audio programming is demanding for both the programmer and the machine

The maths are complex, the programming is often low-level, real-time programming has lots of constraints, and latencies are increasingly small

---

class: center, middle

# On the web, you don't know what the machine running your code is capable of

Is it a $50 smartphone or a $5000 MacBook Pro M1 Max? What else is running on the machine? What browser does the user run? What OS and OS version? Is the GPU using shared memory? What audio output and input devices are hooked up to the device? Is your DSP code running on an efficient core or a powerful core? What frequency stepping is the CPU running at right now? Is the battery under 15%? Is the computer plugged in? Is the CPU thermal throttling?

---

class: center, middle

# The amount of unknowns in a space that requires determinism is rather problematic

---

class: center, middle

# We've had these problems in native for years, time to apply some classic solutions to the Web

---

# From dog slow code
---

# To dog fast (!?) code
---

# Agenda

## First, cajole the CPU
## Then, always be good friends with the memory
## Give the scheduler a hug
## ... but in the end, always verify

---

class: middle, center

# Cajole the CPU

---

# CPUs are complex beasts

CPUs are really impressive these days, but some work is needed from the programmer to reach high efficiency.

> The best optimization one can implement is always not to do the thing one
> thought they needed to do.

Out-of-order execution, prefetching, superscalar pipelining, power management, multi-core, hyperthreading, branch prediction, register renaming, speculative execution...

---

class: center, middle

# Implement your DSP using WebAssembly, minimize JS⇔WASM roundtrips

If there is only one slide to remember...
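As a concrete sketch of this rule (mine, not from the talk): do the work inside WASM and cross the JS⇔WASM boundary once per block, not once per sample. The hand-assembled module below exports a linear memory and a `gain(ptr, frames, g)` function that scales `frames` float32 samples in place, so a 128-frame quantum costs a single call.

```javascript
// Minimal hand-assembled WASM module (an illustration, not production code):
// exports "mem" (one 64 kB page) and "gain(ptr, frames, g)".
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // \0asm, version 1
  0x01, 0x07, 0x01, 0x60, 0x03, 0x7f, 0x7f, 0x7d, 0x00, // type: (i32, i32, f32) -> ()
  0x03, 0x02, 0x01, 0x00,                               // one function, type 0
  0x05, 0x03, 0x01, 0x00, 0x01,                         // memory: min 1 page
  0x07, 0x0e, 0x02, 0x03, 0x6d, 0x65, 0x6d, 0x02, 0x00, // export "mem" (memory 0)
  0x04, 0x67, 0x61, 0x69, 0x6e, 0x00, 0x00,             // export "gain" (func 0)
  0x0a, 0x33, 0x01, 0x31, 0x01, 0x01, 0x7f,             // code: one body, one i32 local
  0x20, 0x00, 0x20, 0x01, 0x41, 0x04, 0x6c, 0x6a, 0x21, 0x03, // end = ptr + frames * 4
  0x02, 0x40, 0x03, 0x40,                               // block / loop
  0x20, 0x00, 0x20, 0x03, 0x46, 0x0d, 0x01,             // if ptr == end, break
  0x20, 0x00, 0x20, 0x00, 0x2a, 0x02, 0x00,             // load f32 at ptr
  0x20, 0x02, 0x94, 0x38, 0x02, 0x00,                   // multiply by g, store back
  0x20, 0x00, 0x41, 0x04, 0x6a, 0x21, 0x00,             // ptr += 4
  0x0c, 0x00, 0x0b, 0x0b, 0x0b,                         // continue loop; ends
]);
const { exports } = new WebAssembly.Instance(new WebAssembly.Module(bytes));

// One Float32Array view over WASM memory, created once: JS writes samples
// directly into WASM's heap, no per-call marshalling or copying.
const block = new Float32Array(exports.mem.buffer, 0, 128);
block.fill(0.5);
exports.gain(0, 128, 0.25); // one boundary crossing for 128 samples
console.log(block[0]); // → 0.125
```

The anti-pattern this avoids is calling an exported function once per sample: the per-call overhead of crossing the boundary quickly dominates the actual DSP.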
---

# Caching computations vs. recomputing

This needs to be measured each time: memory accesses are slow, but sometimes the cached result is very slow to compute.

---

# Cheating is OK

- How much resolution/precision do you need?
- Hybrid lookup table + interpolation?
- What is the input domain of your trigonometry functions anyway?
- A table-based approximation can be 3.2 times faster than _sin(x)_

---

class: middle, center

# Be good friends with the memory

---

# Memory accesses often (but not always) dominate

Fetching a **single** byte from memory is about 400 times slower than doing an addition. But you can't fetch a single byte, you can only fetch a complete cache line: pack your stuff neatly.

CPUs have caches, but caches are small: it's best to fully utilize them. The bigger a cache is, the slower it is.

Caches are organized in lines (64 bytes on x86 and regular ARM, 128 bytes on Apple Silicon).

---

# To put things in perspective

(Typical numbers on x86/DDR4)
Scaled to human time, so that one 3 GHz CPU cycle lasts about one second:

- 3 GHz CPU cycle: ≈ 1 s
- L1 cache access (64 kB): ≈ 2 s
- L2 cache access (256 kB): ≈ 9 s
- L3 cache access (16 MB, shared): ≈ 43 s
- Main memory access (shared): ≈ 6 min
[Admiral Grace Hopper Explains the Nanosecond](https://www.youtube.com/watch?v=9eyFDBPk4Yw)

---

# Sequential and temporal locality

Things that are accessed together should be together in memory:

- filter coefficients
- other algorithm state
- planar vs. interleaved audio
- multi-channel processing
- **You name it, it's all memory!**

---

# KISS

**Always use arrays, never ever a linked list or anything else fancy\*; keep it simple and stupid most of the time.**
\* applies almost always, you will know when not to respect this rule, terms and conditions apply.
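A small sketch of what "just use arrays" looks like in audio code (my example, not from the talk): DSP state and audio both live in flat, preallocated `Float32Array`s, so rendering a block is one linear sweep through contiguous memory, with no allocation and no pointer chasing in the callback.

```javascript
// One-pole smoothing filter over a block, all state in flat typed arrays.
// `state` persists across blocks without allocating anything per call.
function onePole(input, output, state, coeff) {
  let y = state[0];
  for (let i = 0; i < input.length; i++) {
    y = y + coeff * (input[i] - y); // y moves toward input[i]
    output[i] = y;
  }
  state[0] = y; // save for the next block
}

const input = new Float32Array(128).fill(1);
const output = new Float32Array(128);
const state = new Float32Array(1); // starts at 0
onePole(input, output, state, 0.5);
console.log(output[0], output[1]); // → 0.5 0.75
```

The same filter with its state scattered across heap-allocated objects (or a linked list of nodes) touches many more cache lines for the same amount of arithmetic.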
---

# Reduce working set size

- Caches have a finite size
- The faster they are, the smaller they are
- You really really _really_ want to maximize their utilization
- Don't waste memory: choose the right types, pack your structs

[The Lost Art of Structure Packing](http://www.catb.org/esr/structure-packing/)

---

# Speed of memory copies

Rule of thumb on x86/DDR4: roughly 1 ms per megabyte for multi-megabyte copies when caches aren't hot (often the case when fetching audio assets). Obscenely fast on Apple Silicon compared to x86, but still slow.

Copying trashes caches, so schedule copies in a smart way: stagger/delay memory copies if they are really needed, but not right now.

⚠️ shared memory with the GPU!

---

# Pay attention to the big picture

Heavy memory operations on the device slow down **everything**: the memory bus is a shared resource.

The GPU frequently shares memory with the CPU, and has high bandwidth requirements.

---

class: middle, center

# Hug the scheduler

---

class: middle, center

# Scheduling
In real-time programming, it doesn't matter if it's fast or slow: it needs to finish before the deadline.
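To put a number on that deadline (my arithmetic, using typical Web Audio figures rather than anything from the talk): with a 128-frame render quantum at 48 kHz, the audio callback must finish in under about 2.67 ms, every single time, or the output glitches.

```javascript
// Render budget for one audio callback: the wall-clock time covered by one
// block of audio. Exceed it once and the user hears a glitch.
function renderBudgetMs(framesPerQuantum, sampleRate) {
  return (framesPerQuantum / sampleRate) * 1000;
}

console.log(renderBudgetMs(128, 48000).toFixed(2)); // → "2.67"
console.log(renderBudgetMs(128, 44100).toFixed(2)); // → "2.90"
```

Note that finishing in 0.1 ms on average is worthless if one callback in a thousand takes 5 ms: only the worst case matters.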
---

class: middle, center

# ... but in the end, always verify

---

Coming: source mapping of C++/Rust/whatever directly in the profiler
---

# Reasoning about optimizations: the percentage reduction gotcha
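The slide's original figures aren't preserved here, but one classic form of the gotcha goes like this: equal-sounding percentage reductions buy wildly different speedups, because the percentage is taken off an ever-smaller base.

```javascript
// Cutting runtime by a fraction `reduction` gives a speedup of
// 1 / (1 - reduction): the mapping is very non-linear.
function speedup(reduction) {
  return 1 / (1 - reduction);
}

console.log(speedup(0.5)); // → 2 (50% less time is only 2x faster)
console.log(Math.round(speedup(0.9))); // → 10 (90% less time is 10x faster)
```

So "we shaved another 5%" means very different things at the start and at the end of an optimization effort.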
---

# The AudioRenderCapacity API

Allows run-time measurements of the rendering load, to e.g. back off processing to something lighter on some devices.
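A hedged sketch of what that can look like: `AudioContext.renderCapacity` is specified to deliver periodic `update` events carrying `averageLoad`, `peakLoad`, and `underrunRatio` (browser availability varies). The back-off policy `qualityForLoad` below is my own illustration, not something from the talk.

```javascript
// Map a measured render load (fraction of the callback budget used) to a
// processing tier. Thresholds here are arbitrary example values.
function qualityForLoad(averageLoad) {
  if (averageLoad > 0.75) return "light";  // drop the expensive effects
  if (averageLoad > 0.5) return "medium";
  return "full";
}

// Browser-only part, guarded so this file also runs outside a browser.
if (typeof AudioContext !== "undefined") {
  const ctx = new AudioContext();
  const capacity = ctx.renderCapacity;
  capacity.start({ updateInterval: 1 }); // seconds between measurements
  capacity.addEventListener("update", (e) => {
    // e.averageLoad and e.peakLoad are in [0, 1]; e.underrunRatio is the
    // fraction of callbacks that missed their deadline.
    console.log(qualityForLoad(e.averageLoad), e.underrunRatio);
  });
}

console.log(qualityForLoad(0.9)); // → "light"
```

The point is to adapt at run time instead of guessing the device class up front: the same page can run the full DSP chain on a desktop and a cheaper approximation on a throttled phone.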
---

class: small

# Thanks!

Nested for loop goes brrrrrr

A good boy called Flitwick