Yesterday Never Dies: Securing Stateless Systems Against Downgrade Attacks
As discussed in the Turnkey Whitepaper, Turnkey’s application of Trusted Execution Environments (“TEEs”) provides enormous security benefits to our customers, and is in fact one of the core differentiators that sets us apart from the competition. However, the very statelessness that makes TEEs such a powerful and secure primitive also poses its own set of security challenges. Chief among these is a vulnerability to downgrade attacks, wherein a malicious adversary with privileged access attempts to trick a TEE by replaying an outdated, historical Organization
in place of the current revision.
In this post, we’ll trace the evolution of Turnkey’s defense against downgrade attacks—from the crude beginnings of our Refresher system to the present-day elegance of our Merkle tree solution—exploring what worked, what didn’t, and what we learned along the way.
But First: Understanding the Threat Model
Before moving on, let’s establish an appropriate working definition for what we mean when we talk about a downgrade attack:
downgrade attack: An attack in which a stateful input is maliciously reverted to a historical state in order to elicit desired output.
In our case, the “stateful input” in question is the Organization
. In essence, this represents a full and complete snapshot of an Organization. Among other things, the Organization
contains collections of Users
as well as their public keys, and also a collection of Policies
which govern what types of Activities
the Organization
allows. The Organization
is also timestamped and cryptographically signed by the TEE which produced it (the Notarizer
), which means that its authenticity can be cryptographically verified.
Now, let’s consider one of the core flows in Turnkey’s backend: policy evaluation.
In this flow, the Coordinator
retrieves the most recent revision of an Organization
and passes it along to the Policy Engine
to determine whether a given Activity
should be allowed to proceed. The Policy Engine
determines that, based on the Policies
present in the Organization
, this Activity
should in fact be denied, and thus correctly elicites a DENIED
response. In short, things are working as intended (no funny business here!).
Now, let’s demonstrate what a downgrade attack by introducing a malicious Coordinator
(i.e. one that is executing compromised code):
NOTE: In order to understand the scenario above, it is important to understand Turnkey’s threat model. Specifically: TEEs (like our
Policy Engine
) are treated as “trusted” components due to their verifiablity, but all other components (like ourCoordinator
) are treated as “untrusted”. This is why we must consider scenarios like the one above, wherein one of our “untrusted” components becomes compromised, despite the fact that such a scenario would necessitate an attacker with privileged access to our infrastructure. For more information, see the Threat model section of the Turnkey Whitepaper.
In this scenario, our compromised Coordinator
deliberately retrieves an outdated Organization
instead of the most recent one, and submits it to the Policy Engine. The Policy Engine
determines that, based on the Policies
of the old Organization
, this Activity
should be allowed, and thus an incorrect APPROVED
response is returned.
Age-Based Repudiation: Our First Solution
Fortunately, Turnkey was aware of this attack vector from day one, and even the earliest versions of our product have had a robust defense against it. Our initial defense was simple but effective: We’ll have our TEEs enforce a “maximum age” requirement on each and every Organization
, and reject any Organization
exceeding this age.
Continuing with the policy evaluation flow discussed above, we can clearly see how this would be effective in preventing downgrade attacks:
Huzzah! Our Policy Engine
correctly flags the old Organization
as being “too old”, and thus successfully thwarts an attempted downgrade attack.
But hang on… Let’s imagine now that, a few days later, the following scenario occurs:
*Gasp*: Our Policy Engine
has incorrectly flagged the most recent revision of an Organization
as being “too old”, despite the fact that it was the right Organization
for the job.
Whatever value we choose for our “maximum age” threshold will be completely arbitrary with respect to the organic evolution of an Organization
: While one Organization
may change state every hour, another Organization
may remain unchanged for days, months, or even years. It isn’t safe to make assumptions about the cadence at which an Organization
evolves, because this is based on factors entirely outside of our control.
That is, unless we introduce a Service solely responsible for ensuring that, at any given moment, each and every legitimate Organization
fulfills our “maximum age” requirement.
Enter–The Refresher
The sole purpose of our Refresher Service was to constantly iterate through our datastore, locate Organizations
nearing the “maximum age” threshold, and then “refresh” them by using a Notarizer
TEE. The Refresh
operation itself functions as you might expect: The timestamp of the Organization
is updated, a new Signature is applied, and the resulting payload is returned.1
Functionally speaking, this solution is perfectly adequate: By ensuring that all valid Organizations
in our datastore do not exceed the “maximum age” threshold, our Refresher has made it safe to enforce said “maximum age” in our TEEs. It would seem that we’ve solved the problem of downgrade attacks once and for all!
… But How Does it Scale?
Simply put: Not well at all. Unfortunately, the workload of our Refresher scales O(n)
with respect to the number of Organizations
in our database. While this poor scalability was acceptable when n
was of the order of 10k or 100k, it became less so as n
exceeded 1m. As n
began numbering in the 10’s of millions, the Refresher system became a serious burden: At this point, the vast majority of Notarizer
requests were Refresh
operations, as opposed to Notarize
operations (which represent actual Organization
updates)!
We decided our system needed an overhaul.
Introducing the Merkle Tree: A Scalable Solution!
We were able to address the poor scalability of the above solution through the introduction of a useful cryptographic primitive: The Merkle tree!
Structural Characteristics
The Merkle tree depicted above is much smaller than our production tree, but it displays the relevant structural characteristics:
- Leaf Nodes
- Each
Leaf
is assigned a uniqueOrganization
; Leaf.Val
is set to hash of theOrganization
;Leaf.T
is set to the timestamp of theOrganization
;- Each
Leaf
is cryptographically signed;
- Each
- Parent Nodes
- The
Val
of each parent node is the hash of the concatenation of its children (i.e.hash(L || R)
);
- The
- Root
- The
Val
of the root is the hash of the concatenation of its children (i.e.hash(L || R)
); - The
Root
contains the timestampT
of its last update; - The
Root
is cryptographically signed;
- The
Conceptually, we can summarize the structure of our tree as follows:
Each
Leaf
is a verifiable snapshot of one Organization state, while theRoot
is a verifiable snapshot of all Organization states.
See Appendix to learn more about some interesting and unique aspects of our Merkle tree data representation!
Putting our Merkle Tree to Work
Now that we understand the basic structure of our Merkle tree, let’s talk about how this data structure can be used to defend against downgrade attacks, and why it may represent a more scalable solution that our previous approach.
Previously, our “Age-Based Repudiation” defense against downgrade attacks boiled down to the enforcement of the following condition:
An
Organization
must have a sufficiently recent timestamp.
But now, with the introduction of our Merkle tree, we can introduce the following condition:
An
Organization
must prove inclusion in a sufficiently recent version of our Merkle tree.
To understand how this new revised condition might help us, let’s examine a specific inclusion proof from our example tree:
NOTE: For those less familiar with Merkle tree mechanics, you can think of a “Merkle inclusion proof” as the minimal set of leaves & nodes required to reproduce and verify the
Root
.
The graphic above shows the inclusion proof for Organization_8
, and also displays an exciting quality which hints at the power of this solution: The timestamp of Organization_8
lags many days behind the timestamp of our Root
!
To appreciate the power of this observation, let’s perform a simple thought experiment: Imagine that several days pass, during which the state of Organization_8
remains the same. In fact, let’s go one step further: Let’s imagine that none of our Organizations have changed state over this period of time, because everyone in the world is enjoying a long weekend. In this case, would our aforementioned Refresher
get to similarly enjoy some much needed R&R? Absolutely not!! In fact, our poor, overworked Refresher
would be grinding through the millions of unchanged Organizations
to ensure that each one is nice and fresh on Monday morning.
Now let’s replay this scenario in our Merkle tree world. Imagine that, in place of our Refresher
, we instead have a Merkle Tree Updater
Service whose job it is to update our Merkle tree with Organization
hashes as Organizations
change state. What would the long weekend have looked like for our Merkle Tree Updater
? Something like this:
While, sadly, our Merkle Tree Updater
would be obliged to work over the long weekend, the only task it would need to perform is a periodic refresh of the Root
to ensure it is kept sufficiently fresh. That’s it! One single piece of data!
To put this in more explicit terms, the work of our Merkle Tree Updater
scales O(1)
with respect to the number of Organizations
in our database, which is a massive improvement over the O(n)
scalability of our Refresher
system!
Updating our Merkle Tree
So now that we understand the basics of how our Merkle tree works, let’s talk about Merkle tree updates. Continuing with our thought experiment, we can imagine that Monday morning has arrived, and there’s now a steady trickle of Organization
state changes occurring as Turnkey customers return to work.
The good news is that Merkle tree updates are straightforward and very efficient; however, they must be performed synchronously with respect to the tree itself (i.e. you cannot perform two Merkle tree updates concurrently). This presents a design challenge: If performed in-line with our Notarization
flow, these Merkle tree updates would act as a bottleneck, effectively prohibiting Turnkey from performing Notarizations
concurrently.
To eliminate this bottleneck, we have designed our Merkle Tree Updater
as a background process: It periodically scans our DB for Organizations
which have changed state, and then performs a single, highly-efficient batch update operation. Effectively, the evolution of our Merkle tree is eventually consistent with respect to our Organizations
, rather than immediately consistent.
However, in order to accommodate our “eventually consistent” Merkle tree, we must refine our conditional statement, which we previously summarized as follows:
An
Organization
must prove inclusion in a sufficiently recent version of our Merkle tree.
The problem with the statement above is that, immediately after an Organization
is notarized, the condition above will fail until the update has “eventually” been incorporated into the Merkle tree. To account for this, we instead use a coallescing condition, with the prefix being the condition from our original “Age-Based Repudiation” system:
An
Organization
must have a sufficiently recent timestamp –OR– must prove inclusion in a sufficiently recent version of our Merkle tree.
The former threshold (a.k.a. the max_organization_age
) must be kept larger than the maximum amount of time it might take for an Organization
update to show up in our Merkle Tree, while the latter threshold (a.k.a. the max_merkle_root_age
) must be kept larger than the cadence at which our Merkle tree is updated.
Bringing it All Together
The importance of defending our stateless systems against downgrade attacks has been on Turnkey’s radar since day one; however, so has the necessity of ensuring that those defense systems are able to scale with our product. While our infrastructure was able to absorb the cost of the poorly-scaling “Age-Based Repudiation” in the early days of Turnkey, the need to overhaul this system became more and more pressing with each onboarded customer. Thankfully, necessity truly is the mother of innovation, and we were ultimately able to implement a robust and highly scalable solution built around a fascinating and powerful cryptographic primitive: The Merkle tree.
I hope you’ve enjoyed this deep-dive into downgrade attack prevention half as much as I’ve enjoyed building these solutions, and that perhaps I’ve even sparked some ideas about how you can adapt similar Merkle tree designs in your own stateless systems. If that’s the case, please don’t hesitate to reach out and share your thoughts and ideas!
Appendix: Merkle Tree Data Representation
We made the decision to build our Merkle tree implementation in-house rather than using a 3rd party library, and this afforded us the opportunity to make several interesting design optimizations. Grokking these implementation details is not required to grok the overall design, which is why I’ve moved them to an appendix; however, they are both interesting and instructive!
Deferred Node Creation
The first optimization is a relatively obvious one, but worth mentioning nonetheless, as it is a prerequisite for the following optimizations: We designed our tree such that parent nodes with zero children are not written to the database, but rather exist “implicitly”.
Fixed Depth & Structure
The second optimization we chose is to have a Merkle Tree with a fixed depth and structure. Specifically: Our production Merkle tree is binary, and has a fixed depth of 35, which means it can support just over 17 billion Organizations.2
Note that using this type of Merkle tree would have been prohibitively difficult without the aforementioned Deferred Node Creation, as it would have required pre-populating our datastore with over 34 billion null nodes. However, with deferred node creation in place, we only needed to populate our datastore with a single node: The empty Root
!
Implicit Relational Structure
If you’ve gotten this far, you’re in for a treat, because this one is a real gem!
If you’re considering what the DB schema might look like for a Merkle tree node in, say, a SQL datastore, you might be tempted to model the relationships explicitly (e.g. with each node having parent
, left
, and right
FK’s). However, given the aforementioned fixed depth and structure of our tree, we can instead encode these relationships directly into our indexing scheme!
As you can see in the graphic above, we’ve chosen to assign integer ID’s to our nodes, and we’ve adopted a left-to-right, top-to-bottom scheme. Given this choice of indexing scheme, and also taking advantage of the fact that this is a binary tree, we can find the parent_id
for a given child_id
via an elegant bit-shifting operation:
parent_id = child_id >> 1
In English: The ID of the parent is the ID of the child bit shifted to the right. Try it out if you don’t believe me!
But this isn’t just elegance for the sake of elegance. It actually provides some very real advantages:
- Reduced memory footprint due to elimination of FK fields.
- Lightning-fast proof generations which leverage bit-shift operations rather than joining across FK’s.
- DB implementation agnosticism thanks to the fact that we’re not explicitly modeling relationships.
Further Reading
- Turnkey Whitepaper - This is the most comprehensive overview of Turnkey to date. Learn more about our unique architecture, design philosphy, and security posture.
- Compressing Authority - An fantastic publication by John Driscoll describing the use of Accumulators to defend stateless systems against downgrade attacks in non-distributed systems. While the solution described here is technically distinct, Compressing Authority was nevertheless a huge inspiration for me when I first set out to solve this problem using cryptographic primitives.
-
An uber-knowledgable reader might recognize that this statement is not technically accurate: Rather than applying signatures directly to
Organizations
, Turnkey uses a dedicated object called aNotarization
to encapsulate the signature, timestamp, and other relevant metadata. However, I’ve chosen to removeNotarizations
from this design description for the sake of simplicity. ↩︎ -
If you’ve read this far, I’m sure you might be wondering what our plan is once we start approaching this massive number of Organizations? The answer is that we will horizontally scale our Merkle trees by creating one (or more) additional trees, and then assigning new
Organizations
with atree_index
at the time of creation. ↩︎