// why
Memory allocation is one of those things that is easy to take for granted. You call
malloc, you get a pointer, you move on. But when you are writing
performance-sensitive systems; trading infrastructure, game engines, embedded firmware;
the allocator is not background noise. It is a design decision with real latency
consequences.
I wanted to understand what is actually happening beneath malloc. Not just
conceptually, but at the implementation level: what does the allocator track, how does
it handle alignment, what are the real tradeoffs between speed and flexibility. This
project is my way of working through that; one allocator at a time, starting simple and
getting progressively more complex.
// the bump allocator
The bump allocator is the simplest possible allocator. You reserve a fixed block of memory upfront, keep a pointer to the current position, and every allocation just bumps that pointer forward by the requested size. No free list. No headers. No bookkeeping per allocation. Just a pointer and an offset.
c++class BumpAllocator {
char* m_buffer; // start of memory pool
std::size_t m_size; // total capacity
std::size_t m_offset; // current position
void* bump_alloc(std::size_t size, std::size_t alignment) {
std::size_t aligned = (m_offset + alignment - 1) & ~(alignment - 1);
if (aligned + size > m_size) return nullptr;
void* ptr = m_buffer + aligned;
m_offset = aligned + size;
return ptr;
}
void reset() { m_offset = 0; }
};
The alignment logic is the only non-obvious part. Most hardware requires that objects live at addresses that are multiples of their size; a 64-bit double needs to be at an 8-byte-aligned address, for example. If you ignore alignment, the CPU either silently fixes it at a performance cost, or on stricter architectures, throws a bus error.
The expression (m_offset + alignment - 1) & ~(alignment - 1) rounds
m_offset up to the next multiple of alignment using a bitmask. It only
works when alignment is a power of two; which is always true for natural alignment
requirements. The idea: alignment - 1 gives you a mask of the low bits,
flipping it gives you a mask that zeros them out, and adding alignment - 1
before masking ensures you round up rather than down.
// tradeoffs
The bump allocator is fast because it does almost nothing. Allocation is two arithmetic operations and a bounds check. There is no fragmentation tracking, no free list traversal, no coalescing. In a tight loop allocating many small objects, this is as fast as an allocator gets.
The cost is that you cannot free individual allocations. The only reclaim mechanism is
reset(), which drops the offset back to zero and reuses the entire buffer.
This makes the bump allocator useless for general-purpose allocation; but exactly right
for specific patterns: per-frame allocations in a game engine, per-request allocations
in a web server, temporary scratch space in a hot path.
The other limitation is fixed capacity. The buffer is allocated upfront and does not
grow. If you run out, bump_alloc returns nullptr and you have
to handle it. In practice this means you need to know your upper bound ahead of time,
which is a constraint that rules out a lot of use cases.
| property | value | note |
|---|---|---|
| allocation speed | O(1) | two ops and a bounds check |
| deallocation speed | O(1) | reset only; no individual frees |
| fragmentation | none | linear layout, no gaps |
| memory overhead | none | no per-allocation headers |
| flexibility | low | no individual deallocation |
| use case | per-frame, per-request, scratch buffers | |
// what i learned
Alignment is not an afterthought. Before implementing this I understood alignment conceptually; after implementing it I understand why the bitmask trick works and why alignment must be a power of two for it to be valid. That is a different kind of understanding.
The simplest allocator is also the most honest one. It does not hide its constraints. You give it memory, it gives it back to you in order, and when it is full it tells you. More complex allocators add flexibility but they also add hidden costs: fragmentation, lock contention, cache pressure from metadata. The bump allocator has none of those costs, and none of that flexibility. Understanding that tradeoff at the implementation level changes how you think about allocator choice.
// coming next
This project is ongoing. The bump allocator is the foundation; subsequent implementations will add progressively more capability and complexity:
| allocator | description |
|---|---|
| pool allocator | fixed-size blocks; O(1) alloc and free; no fragmentation |
| free list allocator | variable-size blocks; tracks freed regions for reuse |
| slab allocator | object caching; used in Linux kernel; reduces init overhead |
| arena allocator | region-based; multiple bumps with selective region reset |
| stack allocator | LIFO discipline; O(1) free with ordering constraint |
Each one is a different answer to the same question: given a constraint on allocation patterns, how fast can you make the allocator without sacrificing correctness. Notes for each implementation will be added here as they are finished.