Crashing Industrial Control Systems at Pwn2Own Miami 2022


Earlier this year, the JFrog Security research team competed in the Pwn2Own Miami 2022 hacking competition, which focuses on Industrial Control Systems (ICS) security. We were proud to take part in this competition and join other researchers in the effort to make mission-critical industrial environments safe and secure.

During the Pwn2Own Miami competition we competed and won in two categories –

  • Denial of Service on the OPC UA C++ server
  • Denial of Service on the OPC UA .NET server (CVE-2022-29866)

Denial-of-service (DoS) vulnerabilities in industrial control systems can potentially allow attackers to disrupt critical operations. For example, on controllers used for motion control, exploiting a denial-of-service vulnerability can interrupt a factory’s production line or cause physical damage because of misalignments between moving parts.

In this blog post, we will share some details about the Open Platform Communications Unified Architecture (OPC UA) protocol and take a technical deep dive into the denial-of-service (DoS) vulnerabilities we discovered, which allowed us to crash these servers.

What is OPC UA?

OPC UA is a protocol that allows both industrial devices and endpoint clients to send and receive data in a secure and reliable way. The biggest advantage of this protocol is its unified API, which allows an endpoint client to access the data that was collected by all sorts of industrial devices with a single standard API, instead of learning and implementing the specific API for each industrial device.

In the example figure below, any of the devices (either the OPC UA Client or the OPC UA Server) could be replaced transparently by a different OPC UA compliant device (even from a different manufacturer) and communications would persist without any further major changes.

What is OPC UA

OPC UA is widely used by major ICS vendors such as Siemens, Bosch, Rockwell Automation, ABB, and more.

To clarify – these vendors and their products are just used as OPC UA examples and we have not tested whether they are susceptible to the DoS vulnerabilities described in this blog.

For example, the Siemens S7 series automation controllers use OPC-UA:

Siemens S7 series automation controllers use OPC-UA

What OPC UA Vulnerabilities Were Found?

OPC UA .NET Standard Stack Exhaustion (CVE-2022-29866)

One of the OPC UA servers we targeted is the OPC UA .NET Standard server, an implementation of an OPC UA server in C# provided by the OPC Foundation.

We were able to crash the server with a stack exhaustion bug that led to Denial of Service.

According to GitHub, this server is currently used by 286 other repositories, including Microsoft’s Industrial-IoT repository, which discovers industrial assets on-site and automatically registers them in the cloud.

Vulnerability Summary

There is a recursion in the code that handles the TranslateBrowsePathsToNodeIds request. The recursion might cause the server to crash due to a StackOverflowException, which is thrown once the recursion exhausts the thread’s stack. This exception cannot be caught by a try..catch block.

All versions before 1.4.368.58 are affected.

Vulnerability Details

The recursion lies in the function MasterNodeManager.cs::TranslateBrowsePath():

private void TranslateBrowsePath(
    OperationContext           context,
    INodeManager               nodeManager,
    object                     sourceHandle,
    RelativePath               relativePath,
    BrowsePathTargetCollection targets,
    int                        index)
{
    ...
    // check for end of list.
    if (index < 0 || index >= relativePath.Elements.Count)
    {
        return;
    }
 
    ...
 
    // process next hops.
    for (int ii = 0; ii < targetIds.Count; ii++)
    {
        ...
 
        // recursively follow hops.
        TranslateBrowsePath(
            context,
            nodeManager,
            sourceHandle,
            relativePath,
            targets,
            index + 1);
    }
}

This function handles the TranslateBrowsePathsToNodeIds request specified in
https://reference.opcfoundation.org/v104/Core/docs/Part4/5.8.4/:

TranslateBrowsePathsToNodeIds

The function is first called with the startingNode and subsequently gets called (recursively) according to the elements in relativePath, where each element’s format is:

startingNode

For example, given a browsePath with startingNode = “NodeA” and relativePath defined as:

TargetName = “NodeB”

IsInverse = False

referenceTypeId = HasComponent

TargetName = “NodeC”

referenceTypeId = HasComponent

The server will resolve the browsePath as:

browsePath

And return NodeC’s nodeId.

The bug is that the code places no limit on the number of recursive calls, so given a long enough relativePath, the code will throw a StackOverflowException once the recursion exhausts the stack.

As we mentioned, the StackOverflowException cannot be caught with a try..catch block. From MSDN:

“Starting with the .NET Framework 2.0, you can’t catch a StackOverflowException object with a try/catch block, and the corresponding process is terminated by default”
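One common way to bound recursion of this kind is to reject over-long paths before walking them. The sketch below illustrates that idea only and is not the OPC Foundation’s actual fix: the PathElement type, the kMaxHops cap, and the BadTooManyHops status are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one relativePath element.
struct PathElement {
    std::string targetName;
    bool isInverse;
};

// Hypothetical status codes for this sketch.
enum class TranslateStatus { Good, BadTooManyHops };

// Hypothetical cap; a real server would likely make this configurable.
constexpr std::size_t kMaxHops = 256;

// Recursive walk with the same shape as TranslateBrowsePath:
// one stack frame per element in the path.
std::size_t followHops(const std::vector<PathElement>& path, std::size_t index) {
    if (index >= path.size()) {
        return 0;  // end of list
    }
    return 1 + followHops(path, index + 1);
}

// Rejects over-long paths up front, so the recursion can never go
// deeper than kMaxHops frames regardless of what the client sends.
TranslateStatus translate(const std::vector<PathElement>& path,
                          std::size_t& hopsFollowed) {
    hopsFollowed = 0;
    if (path.size() > kMaxHops) {
        return TranslateStatus::BadTooManyHops;
    }
    hopsFollowed = followHops(path, 0);
    assert(hopsFollowed <= kMaxHops);
    return TranslateStatus::Good;
}
```

With this guard, a path of 65534 elements is refused before the first recursive call, while short paths are resolved as before.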

Exploiting the bug

While trying to build a long relativePath and exploit this bug, we faced the following obstacles:

  1. The path must eventually point to an existing node
  2. The TCP-based OPC protocol doesn’t allow a packet larger than 64KB

Pointing to an existing node

In order to overcome the first obstacle we used a method similar to path/directory traversal (we use the equivalent of “../Directory/../Directory/../Directory”). When the server translates a relativePath, it starts with the startingNode and walks through the relativePath’s elements, which tell the server how to walk between nodes. In addition, each relativePath element has a flag called isInverse, which tells the code whether to traverse to one of the node’s parents (when set to True) or to one of its children (when False).

In order to make sure that we are using an existing node, we used a node with namespace 0. A namespace defines a set of types; this particular namespace is built in and contains all of the server’s standard types. We used one of the standard types of namespace 0 – RelativePath.

We built the malicious request to include a single browsePath, which has a startingNode named “RelativePath” and a relativePath that looks like this:

browsePath

Notice that “HasEncoding” and “EncodingOf” are actually the same referenceTypeId, but they are shown under different names because of the IsInverse field.

Sending this request will result in the following recursion, allowing us to send a large packet while always pointing to an existing node –

recursion
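The alternating path described above can be sketched as a small builder. This is a simplified model rather than our exploit code: PathElement is a hypothetical struct, and the name of the encoding node reached by the forward HasEncoding hop (passed in as encodingName, for example “Default Binary”) is an assumption, since the original figure is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one relativePath element.
struct PathElement {
    std::string referenceType;  // always "HasEncoding" in this sketch
    bool isInverse;             // false: walk to a child, true: walk back
    std::string targetName;     // node expected at the end of this hop
};

// Builds a path that bounces between the starting node and its encoding
// node: forward hop, inverse hop, forward hop, inverse hop, and so on.
std::vector<PathElement> buildPingPongPath(std::size_t count,
                                           const std::string& startName,
                                           const std::string& encodingName) {
    // An even count guarantees the path ends back on the starting node.
    assert(count % 2 == 0);
    std::vector<PathElement> path;
    path.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const bool inverse = (i % 2 == 1);
        path.push_back({"HasEncoding", inverse,
                        inverse ? startName : encodingName});
    }
    return path;
}
```

Because every hop lands on a node that exists on any compliant server, the path can be made as long as the transport allows without ever failing to resolve.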

Avoiding the TCP size limit

In order to overcome the second obstacle we used the HTTP-based OPC protocol instead of the (raw) TCP-based one. The HTTP-based OPC protocol has a larger max size limit – allowing a max packet size of 4MB, much more accommodating compared to the 64KB limit of the TCP-based OPC protocol.

To successfully crash the OPC server in our environment, we needed at least 7000 elements in the relative path. Our exploit uses 65534 elements (one less than the default config’s max array size) to make sure the payload crashes as many configurations as possible.
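A rough size estimate shows why the 64KB limit was a blocker while 4MB is enough. The sketch below assumes the OPC UA binary encoding of a relativePath element (a two-byte NodeId, two booleans, and a QualifiedName made of a 2-byte namespace index, a 4-byte length, and the name bytes), ignores all message headers, and assumes the target names “Default Binary” and “RelativePath”. Treat the numbers as an approximation, not a measurement of our actual payload.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed encoded size of one relativePath element, in bytes:
// NodeId (2) + IsInverse (1) + IncludeSubtypes (1)
// + QualifiedName (2-byte namespace index + 4-byte length + name bytes).
constexpr std::size_t elementSize(std::size_t targetNameLength) {
    return 2 + 1 + 1 + 2 + 4 + targetNameLength;
}

// Estimated size of a path that alternates between two target names.
std::size_t estimatePathBytes(std::size_t elementCount,
                              const std::string& nameA,
                              const std::string& nameB) {
    assert(elementCount % 2 == 0);  // elements come in forward/inverse pairs
    return (elementCount / 2) *
           (elementSize(nameA.size()) + elementSize(nameB.size()));
}

// Transport limits as stated in the text.
constexpr std::size_t kTcpLimit = 64 * 1024;         // 64KB
constexpr std::size_t kHttpLimit = 4 * 1024 * 1024;  // 4MB
```

Under these assumptions, each pair of elements takes 46 bytes, so 7000 elements need about 161,000 bytes (well over 64KB) and 65534 elements need about 1.5MB (comfortably under 4MB).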

Our exploit can crash both the ConsoleReferenceServer and the ReferenceServer (a GUI version of the server). After sending the exploit, the ConsoleReferenceServer will crash with a StackOverflowException:

StackOverflowException

The ReferenceServer, by contrast, will crash without printing anything.

Unified Automation C++ Demo Server UaF DoS Vulnerability

One of the OPC UA servers we targeted is Unified Automation’s OPC UA C++ demo server, an implementation of an OPC UA server in C++, provided by Unified Automation to serve as a reference for an OPC UA server implementation. We were able to crash the server with a use-after-free bug that led to Denial of Service.

Vulnerability Summary

In Unified Automation’s C++ Demo Server there exists a UaF (use-after-free) that can cause denial of service and possibly remote code execution.

This UaF is caused by a lack of synchronization primitives when calling two specific methods that the C++ demo server exports to the user.

Vulnerability Details

OPC UA allows the server to export methods that can be called by the client, in a similar fashion to RPC, by sending the Call message. The exported methods can be called asynchronously.

In the OPC UA C++ demo server there is a race condition that can cause a UaF which leads to several impacts, including a denial of service.

Demo_DynamicNodes_CreateDynamicNode and Demo_DynamicNodes_DeleteDynamicNode are methods that are exported by the OPC UA server for the client to call.

Consider the following code from these two functions, taken from the “C++ based OPC UA Client/Server SDK + Pub/Sub Bundle”, which contains the demo server as an example:

UaStatus NodeManagerDemo::Demo_DynamicNodes_CreateDynamicNode(const ServiceContext& serviceContext)
{
    UaStatus ret;
 
    if ( m_pDynamicNode != NULL ) // m_pDynamicNode is a property of the node manager
    {
        // The node is already created
        return OpcUa_BadInvalidState;
    }
    ...
    m_pDynamicNode = new OpcUa::FolderType(
        UaNodeId(UaGuid::create(), getNameSpaceIndex()),
        "New Dynamic Node",
        getNameSpaceIndex(),
        this);
    ret = addNodeAndReference(pDynamicNodes->getUaReferenceLists(), m_pDynamicNode, OpcUaId_Organizes);
    if ( ret.isNotGood() )
    {
        // Adding the node failed
        // Release the node since the NodeManager did not take over ownership
        m_pDynamicNode->releaseReference();
        m_pDynamicNode = NULL;
        return ret;
    }
    else
    {
        UaVariant defaultValue;
        OpcUa::DataItemType* pVariable = NULL;
 
        defaultValue.setUInt32(0);
        pVariable = new OpcUa::DataItemType(
            UaNodeId("Demo.DynamicNodes.Variable1", getNameSpaceIndex()),
            "Variable1",
            getNameSpaceIndex(),
            defaultValue,
            OpcUa_AccessLevels_CurrentReadOrWrite,
            this);
        ret = addNodeAndReference(m_pDynamicNode, pVariable, OpcUaId_HasComponent);
 
        ...
    }
    return ret;
}

UaStatus NodeManagerDemo::Demo_DynamicNodes_DeleteDynamicNode(const ServiceContext& serviceContext)
{
    UaStatus ret;
 
    if ( m_pDynamicNode == NULL )
    {
        // The node is not created yet
        return OpcUa_BadInvalidState;
    }
 
    ...
 
    ret = deleteUaNode(m_pDynamicNode, OpcUa_True, OpcUa_True, OpcUa_True);
    // Release our reference to the node
    m_pDynamicNode->releaseReference();
    m_pDynamicNode = NULL;
 
   
    return ret;
}

Neither function uses a mutex to ensure that the object pointed to by m_pDynamicNode isn’t changed in the middle of the function’s execution.

When releaseReference() is called and the object’s refcount drops to 0, the object’s destructor is called –

int __thiscall ReferenceCounter::releaseReference(ReferenceCounter *this)
{
  int currentCount;
  ReferenceCounter *thisa;

  thisa = this;
  currentCount = ua_atomic_decrement(&this->m_refCount);
  if ( !currentCount && thisa )
    (thisa->~ReferenceCounter)(thisa, 1);
  return currentCount;
}

This free operation, together with the fact that the Demo_DynamicNodes_CreateDynamicNode and Demo_DynamicNodes_DeleteDynamicNode functions can be called in parallel, can cause a UaF.

An attacker may simultaneously and repeatedly call Demo_DynamicNodes_CreateDynamicNode and Demo_DynamicNodes_DeleteDynamicNode using OPC UA’s Call message. These invocations can happen in the middle of the execution of addNodeAndReference. While one thread is adding Demo.DynamicNodes.Variable1, another thread can run the Demo_DynamicNodes_DeleteDynamicNode function and execute releaseReference, which calls the destructor. The FolderType object is then used after it was freed, which causes a crash due to a virtual function call through a garbage pointer. For example, inside the NodeManagerUaNode::addUaReference function, at the call to addTargetNode:

UaStatus *__thiscall NodeManagerUaNode::addUaReference(NodeManagerUaNode *this, UaStatus *result, UaReferenceLists *pSourceNode, UaReferenceLists *pTargetNode, const UaNodeId *referenceTypeId)
{
  ...
  pNewReference = 0;
  ...
  pNewReference = some_reference; // pNewReference gets assigned
  if ( pNewReference )
  {
    ...
    pSourceNode->addTargetNode(pSourceNode, pNewReference); // virtual call on the freed object
    ...
  }
  ...
  return result;
}

Here, pSourceNode is our m_pDynamicNode. This will either cause a crash due to a dereference of an illegal address (for example, when the vptr is 0), or it can be used to achieve remote code execution if the attacker can reallocate the freed object and point its vptr to their own fake vtable.

In addition, the lack of a locking primitive can be exploited in further ways, such as simultaneously calling Demo_DynamicNodes_DeleteDynamicNode from two threads. Depending on the timing, both threads may pass the NULL check of m_pDynamicNode, causing a similar UaF or causing the slower thread to crash due to a NULL pointer dereference when calling releaseReference() with our m_pDynamicNode as NULL.
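To illustrate how a lock closes both races described above, here is a minimal sketch that guards the shared pointer with a mutex. It is a simplified model and not Unified Automation’s code or fix: DynamicNodeManager, Node, and the Status values are hypothetical, and std::shared_ptr stands in for the SDK’s own reference counting.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the demo server's node manager.
class DynamicNodeManager {
public:
    enum class Status { Good, BadInvalidState };

    Status createDynamicNode() {
        // The lock covers the NULL check, the allocation, and the follow-up
        // work on the node, so a concurrent delete cannot free it midway.
        std::lock_guard<std::mutex> lock(mutex_);
        if (node_) {
            return Status::BadInvalidState;  // the node is already created
        }
        node_ = std::make_shared<Node>();
        node_->children += 1;  // stands in for adding Variable1 to the node
        assert(node_->children == 1);
        return Status::Good;
    }

    Status deleteDynamicNode() {
        // Check and release happen atomically with respect to other callers,
        // so two deleters cannot both pass the NULL check.
        std::lock_guard<std::mutex> lock(mutex_);
        if (!node_) {
            return Status::BadInvalidState;  // the node is not created yet
        }
        node_.reset();
        return Status::Good;
    }

    bool hasNode() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return static_cast<bool>(node_);
    }

private:
    struct Node {
        int children = 0;
    };

    mutable std::mutex mutex_;
    std::shared_ptr<Node> node_;
};
```

The key design point is that the check of the pointer and every later use of it sit inside the same critical section; locking only the assignment would still leave a window for the race.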

Exploitation

This race condition can be triggered easily by spawning two threads where one calls Demo.DynamicNodes.CreateDynamicNode and the other Demo.DynamicNodes.DeleteDynamicNode.

This method will reliably crash the server.
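The two-thread approach can be sketched as a generic harness that runs two operations concurrently. This is an illustration of the pattern only: hammerConcurrently is a hypothetical helper, and the two callables are placeholders for whatever client code issues the OPC UA Call requests, which is not shown here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

// Runs two callables concurrently, each `iterations` times, releasing both
// threads at the same moment to maximize the overlap between them.
void hammerConcurrently(const std::function<void()>& first,
                        const std::function<void()>& second,
                        std::size_t iterations) {
    assert(first && second);
    std::atomic<bool> go{false};

    auto worker = [&go, iterations](const std::function<void()>& call) {
        // Spin until the start signal so both threads begin together.
        while (!go.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        for (std::size_t i = 0; i < iterations; ++i) {
            call();
        }
    };

    std::thread a(worker, std::cref(first));
    std::thread b(worker, std::cref(second));
    go.store(true, std::memory_order_release);
    a.join();
    b.join();
}
```

In a test setup, one callable would invoke the create method and the other the delete method; the more iterations run, the more likely the two calls interleave inside the unprotected window.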

Abusing the UaF for remote code execution

Within the boundaries of the Pwn2Own competition, we could not exploit this UaF vulnerability to cause remote code execution, due to time and stability constraints.

Here, we will elaborate on the strategy we took and what blocked us from achieving a stable-enough RCE exploit.

The crash is caused when the Demo C++ server tries to access a virtual function from the vtable of said object and fails on trying to access an unmapped address. Thus our exploit strategy would have been:

  1. Spawn another thread that tries to make an allocation of the same size as m_pDynamicNode in order to “grab” the freed m_pDynamicNode object from the allocator.
  2. Spray a fake vtable in the heap. The vtable will contain a stack pivot gadget, which will allow us more control over the execution of the eventual ROP chain.
    We could use the fact that the C++ server allows us to write an arbitrarily sized array of strings with our input; each of those strings can serve as the fake vtable with the gadgets.
  3. Pop a shell using the ROP chain.

Due to ZDI’s desire to make the environment as close to reality as possible, the targets were run on the latest Windows 10 64-bit with every mitigation technique enabled (ASLR, NX and so on).

The userspace frontend heap allocator in Windows 10 is called segment heap; for our allocation size, it uses a heap allocator called LFH (Low Fragmentation Heap). There’s a great lecture by Saar Amar on the topic.

This makes exploitation difficult due to the randomization of free chunks we get in every malloc operation (as compared to Glibc’s default ptmalloc2 which would get us the last freed block).

This means that if we don’t get to rewrite the freed object before the vtable call, we crash the process. This prevents us from just trying to allocate the right object over and over until we succeed. Our success rate for grabbing the freed object was too low for Pwn2Own (which allows only 3 retries), even with heap shaping methods.

Summary

The JFrog Security Research team had a great time researching this target and learning more about modern Windows exploitation. Next year we intend to return to Pwn2Own and target many more categories! So stay tuned.

Check out our related blog post on achieving an RCE on OPC UA by exploiting a different vulnerability that we discovered during this research. We also encourage you to follow the latest discoveries and technical updates from the JFrog Security Research team in our security research blog posts and on Twitter @JFrogSecurity.