

Building standalone model binaries for audio inference

For a recent problem I had to crawl and categorize a large number of audio files from the internet. Since the bottleneck was expected to be the downstream link, I decided to use a pre-trained event classification model and run inference on the CPU.

Another requirement was that the crawler would run on multiple servers, consistently, for several weeks. My language of choice for such an infrastructure-heavy task is Rust, so I looked into ways to deploy an inference model with it.

It turned out that the process for creating a standalone binary in Rust is pretty simple. Even better, wrapping a model pre-trained on a large event classification dataset is possible. This means that in the end I could just copy the binary to any x86 machine and run it, without setting up virtual environments or installing any packages.

In this post I will demonstrate how to convert a pre-trained model to ONNX and embed it in a binary. Our goal is a standalone executable that takes audio samples as input and classifies them into 527 known audio classes.

ONNX and audio pattern recognition

The Open Neural Network Exchange (ONNX) is an open standard for machine learning interoperability. It defines a format for exporting the compute graph of a machine learning model, and consumers can implement the main operators for a specific architecture.

In our case we use the excellent EfficientAT model, pre-trained on AudioSet and compressed into smaller CNN architectures. Exporting a model to an ONNX file is simple enough.

We first load the model mn10_as, which offers acceptable performance,

from hear_mn import mn01_all_b_mel_avgs
from hear_mn.helpers.utils import NAME_TO_WIDTH
from hear_mn.models.MobileNetV3 import get_model

mn10_as = get_model(width_mult=NAME_TO_WIDTH("mn10_as"), pretrained_name="mn10_as",
                    collect_component_ids=tuple(range(16))).cuda()

and then use the ONNX export of PyTorch to save our compute graph to efficientat.onnx:

import torch

# `wrapper` is the model wrapped to take mel spectrograms directly;
# `torch_input` is a matching dummy tensor, e.g. torch.randn(1, 1, 128, 1000)
torch.onnx.export(wrapper, torch_input, "efficientat.onnx",
        input_names=["melspec"], output_names=["logits"],
        dynamic_axes={
            "melspec": {0: "batch_size", 3: "time_axis"},
            "logits": {0: "batch_size"}})

This names the mel spectrogram input melspec and the output logits, the 527 values we later use for classification. It further defines two dynamic axes so that the batch size and the number of frames can vary at inference time (the number of channels and features is fixed, though, and corresponds to the mel filterbanks).

Standalone inference

With the exported model we can hop over to a new Rust binary and install the x86_64 MUSL toolchain:

$ cargo new --bin audio-inference
$ rustup target install x86_64-unknown-linux-musl

To actually load our ONNX model for inference we use the excellent tract crate provided by Sonos.

cargo add tract-onnx@0.21

and do a test load of our ONNX model:

use std::io::Cursor;
use tract_onnx::prelude::*;

fn main() {
    // embed the ONNX file into the binary at compile time
    let mut model_cursor = Cursor::new(include_bytes!("../efficientat.onnx"));
    let model = tract_onnx::onnx()
        .model_for_read(&mut model_cursor).unwrap()
        .into_runnable().unwrap();

    dbg!(&model);
}
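As an aside, if you can pin the input shape up front (giving up the dynamic time axis we exported), tract can optimize the graph before making it runnable. A minimal sketch, where the shape of one batch, one channel, 128 mel bins and 1000 frames is an assumption for illustration:

// same load, but with a fixed input shape so tract can optimize the graph
let mut model_cursor = Cursor::new(include_bytes!("../efficientat.onnx"));
let model = tract_onnx::onnx()
    .model_for_read(&mut model_cursor).unwrap()
    .with_input_fact(0, f32::fact([1, 1, 128, 1000]).into()).unwrap()
    .into_optimized().unwrap()
    .into_runnable().unwrap();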

We then compile with the MUSL target

$ cargo build --release --target x86_64-unknown-linux-musl

and, voilà, have a standalone binary without any dynamic dependencies

$ ldd ./target/x86_64-unknown-linux-musl/release/audio-inference
	statically linked

Feeding and Filterbank

The remaining part is mostly diligence work. We need to implement a Mel filterbank that produces the same features as the PyTorch implementation and feed them to our model.

To match the reference implementation exactly, I first export the Mel filterbank and the STFT window from PyTorch to npy files.
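On the Rust side, these arrays can then be read back, for instance with the ndarray-npy crate. A minimal sketch, with hypothetical file names:

use std::fs::File;
use ndarray::{Array1, Array2};
use ndarray_npy::ReadNpyExt;

// read the exported filterbank (mel bins x frequency bins) and STFT window;
// the file names are assumptions, not from the original project
let fbank = Array2::<f32>::read_npy(File::open("fbank.npy").unwrap()).unwrap();
let window = Array1::<f32>::read_npy(File::open("window.npy").unwrap()).unwrap();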

Then I implement a struct which first performs the STFT with our custom window and then converts the complex frequency coefficients into Mel features with our custom filterbank.

pub struct Features {
    // ..
}

impl Features {
    // ..

    pub fn preprocess(&mut self, inp: &[f32]) -> Array2<f32> {
        // apply pre-emphasis filter
        // ..

        // pad with reflected samples to center the frames
        // (the elided steps above produce `samples` from `inp`)
        // ..

        // calculate power of each frequency bin
        let spec = self.stft2(samples.view());
        let spec = spec.mapv(|x| x.norm().powf(2.0));

        // project to log-space, normalized Mel coefficients
        self.fbank.dot(&spec.t()).mapv(|x| (x + 0.00001).ln()).mapv(|x| (x + 4.5) / 5.)
    }
}

For the full implementation take a look here.
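To sanity-check the extractor, a hypothetical smoke test could feed it one second of silence at the 32 kHz input rate the EfficientAT models expect and verify the feature shape:

// one second of silence should yield a (128, frames) mel feature matrix
let mut features = Features::new(1024 /* FFT size */, 800 /* window size */, 320 /* hop */);
let silence = vec![0.0f32; 32_000];
let mel = features.preprocess(&silence);
assert_eq!(mel.shape()[0], 128); // 128 mel bins, frames along axis 1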

Improving memory allocation performance

I found that the model performed poorly when built for the MUSL target. My immediate suspect was the memory allocator, and indeed MUSL's malloc implementation can be pretty slow.

So I added the mimalloc allocator by Microsoft to my project

cargo add mimalloc

and registered it as the global allocator at the top of my main file

use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

and the performance problems were gone.

Putting it all together

What remains to be done? We need to read samples from an input stream and associate the resulting logits with the pre-defined classes of our model.

As the dataset is multi-label, I simply sort the classes by their logits and print the top 10 for the input (see the sketch at the end of this post).

Add the hound crate for reading WAV files into f32 vectors

cargo add hound

and combine it with our feature extractor and model:

// load audio file and convert to mel features
let path = std::env::args().nth(1).unwrap();
let mut reader = hound::WavReader::open(&path).unwrap();
let samples: Vec<f32> = reader.samples::<f32>().map(|s| s.unwrap()).collect();
let features = Features::new(1024 /* FFT size */, 800 /* window size */, 320 /* hop */)
    .preprocess(&samples);

// reshape to the (batch, channels, mel bins, frames) layout the model expects
let nframes = features.shape()[1];
let features = features.into_shape((1, 1, 128, nframes)).unwrap();

// perform inference with the ONNX model
let input: Tensor = features.into();
let result = model.run(tvec!(input.into())).unwrap();

// get the logits from the output
let res = result[result.len() - 2].to_array_view::<f32>().unwrap();
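The final association with class names could then look like the following minimal sketch, assuming labels is a Vec&lt;String&gt; holding the 527 AudioSet class names in model output order (e.g. parsed from the AudioSet class-map CSV, not shown here):

// rank classes by logit and print the ten strongest; `labels` is a
// hypothetical Vec<String> with the 527 class names in model order
let mut ranked: Vec<(usize, f32)> = res.iter().cloned().enumerate().collect();
ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
for (idx, logit) in ranked.iter().take(10) {
    println!("{:>8.3}  {}", logit, labels[*idx]);
}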