← raphael mukondiwa
allocator lab: building memory allocators from scratch
written
status living document · updated as new allocators are implemented
c++ memory systems repo ↗

// why

Memory allocation is one of those things that is easy to take for granted. You call malloc, you get a pointer, you move on. But when you are writing performance-sensitive systems; trading infrastructure, game engines, embedded firmware; the allocator is not background noise. It is a design decision with real latency consequences.

I wanted to understand what is actually happening beneath malloc. Not just conceptually, but at the implementation level: what does the allocator track, how does it handle alignment, what are the real tradeoffs between speed and flexibility. This project is my way of working through that; one allocator at a time, starting simple and getting progressively more complex.

// the bump allocator

The bump allocator is the simplest possible allocator. You reserve a fixed block of memory upfront, keep a pointer to the current position, and every allocation just bumps that pointer forward by the requested size. No free list. No headers. No bookkeeping per allocation. Just a pointer and an offset.

c++class BumpAllocator {
    char*       m_buffer;  // start of memory pool
    std::size_t m_size;    // total capacity
    std::size_t m_offset;  // current position

    void* bump_alloc(std::size_t size, std::size_t alignment) {
        std::size_t aligned = (m_offset + alignment - 1) & ~(alignment - 1);
        if (aligned + size > m_size) return nullptr;
        void* ptr = m_buffer + aligned;
        m_offset  = aligned + size;
        return ptr;
    }

    void reset() { m_offset = 0; }
};

The alignment logic is the only non-obvious part. Most hardware requires that objects live at addresses that are multiples of their size; a 64-bit double needs to be at an 8-byte-aligned address, for example. If you ignore alignment, the CPU either silently fixes it at a performance cost, or on stricter architectures, throws a bus error.

The expression (m_offset + alignment - 1) & ~(alignment - 1) rounds m_offset up to the next multiple of alignment using a bitmask. It only works when alignment is a power of two; which is always true for natural alignment requirements. The idea: alignment - 1 gives you a mask of the low bits, flipping it gives you a mask that zeros them out, and adding alignment - 1 before masking ensures you round up rather than down.

"allocation in O(1). deallocation in O(1). individual frees: impossible."

// tradeoffs

The bump allocator is fast because it does almost nothing. Allocation is two arithmetic operations and a bounds check. There is no fragmentation tracking, no free list traversal, no coalescing. In a tight loop allocating many small objects, this is as fast as an allocator gets.

The cost is that you cannot free individual allocations. The only reclaim mechanism is reset(), which drops the offset back to zero and reuses the entire buffer. This makes the bump allocator useless for general-purpose allocation; but exactly right for specific patterns: per-frame allocations in a game engine, per-request allocations in a web server, temporary scratch space in a hot path.

The other limitation is fixed capacity. The buffer is allocated upfront and does not grow. If you run out, bump_alloc returns nullptr and you have to handle it. In practice this means you need to know your upper bound ahead of time, which is a constraint that rules out a lot of use cases.

property value note
allocation speedO(1)two ops and a bounds check
deallocation speedO(1)reset only; no individual frees
fragmentationnonelinear layout, no gaps
memory overheadnoneno per-allocation headers
flexibilitylowno individual deallocation
use caseper-frame, per-request, scratch buffers

// what i learned

Alignment is not an afterthought. Before implementing this I understood alignment conceptually; after implementing it I understand why the bitmask trick works and why alignment must be a power of two for it to be valid. That is a different kind of understanding.

The simplest allocator is also the most honest one. It does not hide its constraints. You give it memory, it gives it back to you in order, and when it is full it tells you. More complex allocators add flexibility but they also add hidden costs: fragmentation, lock contention, cache pressure from metadata. The bump allocator has none of those costs, and none of that flexibility. Understanding that tradeoff at the implementation level changes how you think about allocator choice.

// coming next

This project is ongoing. The bump allocator is the foundation; subsequent implementations will add progressively more capability and complexity:

allocator description
pool allocatorfixed-size blocks; O(1) alloc and free; no fragmentation
free list allocatorvariable-size blocks; tracks freed regions for reuse
slab allocatorobject caching; used in Linux kernel; reduces init overhead
arena allocatorregion-based; multiple bumps with selective region reset
stack allocatorLIFO discipline; O(1) free with ordering constraint

Each one is a different answer to the same question: given a constraint on allocation patterns, how fast can you make the allocator without sacrificing correctness. Notes for each implementation will be added here as they are finished.