Jump Crypto / Certus One was one of the many teams "all hands on deck" during the Solana incident on Sept 14, working through the night alongside other core contributors like Chorus One, Neodyme, Solana Labs, and P2P, as well as more than a thousand dedicated validator and RPC operators, to bring mainnet-beta back online.

In this blog post, we'll recount the events of the day and explain what happened and how it relates to the inner workings of Solana's consensus and block propagation.

Note that while most of the root causes were known shortly after the incident, some of the specifics of how the network behaved are still under investigation, and a detailed technical root cause analysis as well as further software improvements are being worked on.

Solana Labs already shared their preliminary findings:

9-14 Network Outage Initial Overview
On September 14th, the Solana network was offline for 17 hours. No funds were lost, and the network returned to full functionality in under 24 hours.

Timeline

  • Sept 14 12:00 UTC — Raydium launches the GRAPE IDO
  • Sept 14 12:12 UTC — Mainnet-Beta stops producing rooted slots
  • Sept 14 14:10 UTC — Solana v1.6.23 released with performance improvements
  • Sept 14 15:02 UTC — Agreement reached on restarting with slot 96542804
  • Sept 14 18:26 UTC — Solana v1.6.24 released with more fixes
  • Sept 14 19:21 UTC — Instructions posted for 1st restart
  • Sept 14 20:19 UTC — Restart attempt reaches 80% quorum, but fails due to a bug
  • Sept 15 01:20 UTC — Solana v1.6.25 release that fixes the bug
  • Sept 15 02:00 UTC — Instructions posted for successful 2nd restart
  • Sept 15 05:30 UTC — Mainnet-Beta awakens 🐙

The incident

On Sept 14 at 12:00 UTC, the GRAPE protocol launched an on-chain token offering on Solana's mainnet-beta network on a first-come, first-served basis. Parameters were already known beforehand, and some people really, really wanted some of those tokens.

Right around 12:00 UTC, a massive flood of transactions sent by multiple bots hit the network, trying to win the race and outcompete each other—effectively, and likely unwittingly, executing a distributed denial of service attack on mainnet-beta.

Unfortunately, this caused mainnet-beta to first slow down and then halt. The flood of transactions did not stop even after the network stalled.

Slots per second
Finalized transactions per second
Peak transactions received by validator's banking stages across the network

Some nodes were receiving upwards of 300k transactions per second.

Most likely, this is only the tip of the iceberg: These statistics only include data from validators who opted into sending data to Solana's public metrics server. It also does not include transactions that were dropped in the operating system's network stack before the node could process them. Network forensics data for Certus One's primary validator (using our NetMeta tool) indicates peaks of raw transaction data of >1 Gbps and 120k packets per second, essentially indistinguishable from a volumetric DDoS attack.

The flood came closer to overwhelming the hardware and network than Solana's node software itself, which is a testament to the throughput of the transaction ingestion stages:

Throughput by destination port on the main Certus validator (in/out). TPU traffic is transactions sent directly to the validator, TPUfwd is transactions bounced off another validator.

Indeed, according to monitoring data provided by the hosting company Marbis (thanks!), the volume even exceeded physical interface capacity at times, causing some traffic to be dropped on the switch port before it even reached our validator.

Looking at source port distributions, it is evident that this is client traffic evenly distributed across the >30k port range typically reserved for client applications, confirming that the majority of the traffic was sent by one or many clients rather than forwarded by cluster nodes:

NetMeta heatmap of source port distribution of TPU traffic on Certus' primary validator

What happened?

Most production incidents are the result of an unlucky combination of different circumstances, and this one is no exception.

A flood of transactions alone shouldn't be able to stall any blockchain, especially one as performant as Solana, and Solana indeed survived many large floods on the Tour de SOL incentivized testnet (after all, it's easy money if you can break it).

So, why was this one different?

The jury on some of the details is still out while the final report is being worked on, but here are a number of factors that contributed:

Write Locks

One of Solana's distinguishing features is its Sealevel runtime, which can execute non-conflicting transactions in parallel. Transactions are conflicting when one depends on the output of another—specifically, when one transaction wants to write an account that another transaction wants to read or write as well. Those transactions cannot be executed in parallel, and are instead executed in a boringly sequential fashion.
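As a rough sketch (this is an illustration of the rule, not Sealevel's actual scheduler), the conflict check boils down to intersecting each transaction's read and write sets:

```python
def conflicts(a, b):
    """Two transactions conflict -- and must run sequentially -- if
    either one writes an account the other reads or writes.
    (Illustrative sketch, not the actual Sealevel implementation.)"""
    return bool(a["writes"] & (b["writes"] | b["reads"]) or
                b["writes"] & (a["writes"] | a["reads"]))

# Both write-lock the same account: must execute sequentially.
swap = {"reads": {"pool_state"}, "writes": {"alice_usdc"}}
pay  = {"reads": set(),          "writes": {"alice_usdc"}}

# Disjoint account sets: safe to execute in parallel.
mint = {"reads": {"mint_authority"}, "writes": {"bob_tokens"}}

print(conflicts(swap, pay))   # True
print(conflicts(swap, mint))  # False
```

The account names here are made up; the point is only that write sets, not read sets, determine whether two transactions can be scheduled concurrently.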

One of the goals when writing Solana programs is to optimize account locking behavior to make sure that a program's transactions do not write-lock the same set of accounts:

One of the bot authors clearly didn't know that, and their transactions write-locked a whopping 18 accounts. Unfortunately, this included the global program managing SPL tokens, as well as the Serum program:

This prevented all transactions touching these accounts from being executed in parallel, including the bot transactions as well as any other transactions using these programs, significantly reducing the network's ability to cope with the flood of transactions.

Allowing this was a bug— there's no point in write-locking global programs. Bad locking behavior should only be able to affect a single program rather than the entire chain!

This was a known issue, and a fix that would ignore write locks on programs was about to be released. The network reboot flipped the switch which enabled the new behavior, fixing this attack vector (and unexpectedly improving overall network performance!).

Forwarding

Solana has no mempool. Instead, clients submit transactions directly to the current leader's Transaction Processing Unit (TPU). Any leftovers, or transactions sent to a node that isn't the current leader, are automatically forwarded to the next leader in line—one of Solana's core innovations, called Gulfstream.

This forwarding mechanism has a number of advantages over a mempool like preventing many kinds of front- and backrunning. But in this case, while it wasn't the cause, it also wasn't helping—in addition to receiving transactions by the bots themselves, leaders were now also forwarding surplus transactions to each other, adding to the deluge.

One of the changes introduced in the network reboot rate limits this forwarding behavior to prevent such amplification of floods.
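The idea behind such a limit can be sketched as a token bucket; this is a hypothetical model of the concept, not how the rate limit in solana-core is actually implemented:

```python
import time

class ForwardLimiter:
    """Token bucket limiting how many surplus transactions a node
    forwards to the next leader. Hypothetical sketch of the idea;
    the real limit in the node software differs in detail."""

    def __init__(self, rate_per_s, burst):
        self.rate, self.burst = rate_per_s, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow_forward(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over budget: drop rather than amplify the flood
```

During a flood, transactions beyond the budget are simply dropped instead of being bounced from leader to leader, which removes the amplification effect.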

Application-level retries

In a similar vein, Solana RPC nodes retry failed transactions in their queue—a helpful feature during normal network conditions, but when the chain is congested, it exacerbates the problem. Similarly, many applications like wallets or bots retry on their own when they observe that transactions failed to land, assuming the problem's with their own network connection or RPC node rather than a chain-wide issue.

This is not a bug, but an inherent fundamental issue with any lossy network. It's quite similar to what happens with TCP retransmissions when a network link on the internet is congested, and the subject of a whole field of research into flow control—essentially, the art of differentiating between packet loss and congestion to respond appropriately to each, avoiding bufferbloat and amplification of overload.

The Solana 1.8.x release will include tunables to fine-tune the RPC retry behavior, allowing applications to be more intelligent about retrying transactions, like by expiring ephemeral transactions more quickly or implementing exponential back-off.
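Exponential back-off, one of the strategies mentioned above, is simple to sketch; the function name and parameters here are invented for illustration and are not part of any Solana RPC interface:

```python
import random

def retry_schedule(base=0.5, factor=2.0, cap=30.0, attempts=5, jitter=False):
    """Exponential back-off delays for re-submitting a transaction:
    wait base, base*factor, base*factor^2, ... seconds, capped at `cap`.
    (A client-side sketch, not an actual Solana API.)"""
    delays = []
    for i in range(attempts):
        d = min(cap, base * factor ** i)
        if jitter:
            d = random.uniform(0, d)  # full jitter spreads retries apart
        delays.append(d)
    return delays

print(retry_schedule(base=1, factor=2, cap=8, attempts=5))
# [1.0, 2.0, 4.0, 8.0, 8.0]
```

Backing off (ideally with jitter) means that when the chain itself is congested, clients collectively reduce their send rate instead of piling on.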

Vote Congestion

Solana uses on-chain vote transactions for consensus—leaders include votes they received via the gossip network into blocks they produce. This allows consensus to take advantage of Solana's super-fast turbine block propagation.

While votes propagate using a privileged mechanism (gossip) rather than the TPU, they're eventually included in blocks, like any other transaction. When the transaction processing pipeline was stalled by the flood for the reasons detailed above, leaders failed to include vote transactions, likely leading to the loss of consensus.

Solana has since gained a mechanism that prioritizes vote transactions, which prevents regular transactions from "drowning out" vote transactions.
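The effect of such prioritization can be shown with a toy scheduler (a simplified model of the behavior described above, not the actual banking stage code):

```python
def fill_block(pending, capacity):
    """Give consensus votes strict priority over regular transactions
    when filling a block's budget, so a flood of regular traffic
    cannot drown votes out. (Simplified illustration.)"""
    votes = [tx for tx in pending if tx["is_vote"]]
    regular = [tx for tx in pending if not tx["is_vote"]]
    return (votes + regular)[:capacity]

pending = [
    {"id": "bot1",   "is_vote": False},
    {"id": "bot2",   "is_vote": False},
    {"id": "vote_a", "is_vote": True},
    {"id": "bot3",   "is_vote": False},
]
print([tx["id"] for tx in fill_block(pending, capacity=2)])
# ['vote_a', 'bot1']
```

Even with the block budget almost entirely consumed by bot traffic, the vote still lands, which is exactly what keeps consensus alive during a flood.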

Nodes OOMing after the stall

One of the reasons Solana is so fast is its forking behavior. Now, you might ask—forking? Isn't forking a thing eventually consistent (AP) Proof of Work systems do?

Solana is indeed special—it actually uses two mechanisms: fork selection for short-term concurrency, and full PoS consensus for finality.

One of Solana's key innovations is the use of Proof of History—a technique closely related to verifiable delay functions—to establish a cluster-wide clock, allowing the network to fast-forward over slots which belong to slow or unresponsive leaders, without having to wait for a synchronous round of consensus. When voting on a fork, validators commit to it for a specific period of time (called the lockout period, currently 32 slots).

These loose ends get "tied up" whenever a block reaches the max lockout and becomes the so-called root slot the chain agreed upon. If you request data from the chain with the Finalized commitment level, you'll only ever see slots rooted by 2/3+ of the cluster.

(this is a simplified explanation—there's more to it that wouldn't fit this article, like fork switching and more— see the docs)

When the chain stalls, no new root blocks are made for lack of consensus, so there are never any universally-agreed-upon root blocks, and nodes keep forking and accumulating everyone's fork state in memory in case they need to revert to any of them:

This caused some nodes running with less than the recommended specs to run out of memory. In retrospect, this occurred when the network was already hosed, but it caused unnecessary confusion at the time.

Certus One's hilariously oversized fleet of Solana RPC nodes going from ~5 TiB of RAM usage to ~7 TiB. All of our nodes remained online and functional during the incident.

Second time's the charm

Unfortunately, the first restart attempt failed due to a very unlucky bug. During the first restart, it quickly became clear that something was amiss. Some operators reported that the amount of active stake shown by their nodes was jumping around wildly.

During chain startup, the chain waits for 80% (rather than 66.6%) of the stake to come online before it resumes consensus. Verifying that there's at most 20% of offline stake when the cluster restarts ensures that there's enough safety margin to stay online in case nodes fork or go back offline right after the restart.

This procedure had been tested many times before... except that so much new SOL had been created by inflation in the meantime that it ended up overflowing the maximum value of an unsigned 64-bit variable. This was unexpected, especially given that it was discussed before:

We're thousands of years away from minting 18B SOL... so why did it fail? Well, it turns out the value is multiplied by 100 to compare the percentage of online stake against the threshold, and since stake is denominated in lamports (10⁻⁹ SOL), that product no longer fit into 64 bits. The bug was promptly fixed.
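The arithmetic is easy to reproduce. Python integers don't overflow, so we just compare against the u64 range; the 400M SOL total stake figure below is illustrative, not the actual value at the time:

```python
U64_MAX = 2**64 - 1          # largest value a Rust u64 can hold
LAMPORTS_PER_SOL = 10**9     # stake is counted in lamports

# An illustrative total stake of ~400M SOL:
total_stake = 400_000_000 * LAMPORTS_PER_SOL

# The raw lamport amount fits in a u64 with room to spare
# (u64::MAX is roughly 18.4B SOL worth of lamports)...
assert total_stake < U64_MAX

# ...but multiplying by 100 for the percentage comparison
# exceeds the u64 range, which is exactly where it overflowed:
assert total_stake * 100 > U64_MAX
```

In Rust, that multiplication wraps (or panics in debug builds) instead of silently widening, producing the wildly jumping stake numbers operators observed.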

Picking the restart slot

Restarting the network mainly entails figuring out what slot to restart from and booting nodes from a snapshot made from that slot using their trusted local state (we can't just tell validators to download a snapshot we made from a server—trust nobody!).

Right now, determining a safe restart slot involves off-chain consensus by humans. However, automated recovery from situations like this one is feasible—the node simply doesn't know how to handle this situation yet, and halts to let the humans handle it.

A brief intro to practical Byzantine Fault Tolerance (pBFT)

Remember root slots and the Finalized commitment level? Most applications actually use the Confirmed commitment instead, which uses a mechanism called optimistic confirmation and is much faster—almost the same as block time—with only marginally lesser finalization guarantees.

Both require 2/3+ of the network to vote on a block, the difference being how much stake would be slashed to unwind a block in case some of these 2/3+ decide to also vote on a different fork in order to cause a rollback—also called double signing or equivocation. Slashing is a common property of byzantine fault tolerant Proof of Stake protocols, and Solana is no exception (with some additional twists beyond the scope of this article).

Let's assume that the Solana network is network-partitioned into two groups, one >2/3 supermajority and one <1/3 minority:

Once the 2/3+ supermajority voted on a block, this block is provably final.

... but is it?

If 1/3+ of the network were evil (the "byzantine" in Byzantine Fault Tolerance) and able to censor network packets, they could sign two blocks for the same slot and show one block to one half of the network and the other block to the other half, leading to two separate 2/3+ majorities and two valid conflicting blocks, causing double spends (and possibly the awakening of Cthulhu). This must never, ever happen!
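A quick back-of-the-envelope check shows why this takes 1/3+ byzantine stake: two quorums of 2/3 each, drawn from a total of 1, must overlap in at least 2/3 + 2/3 − 1 = 1/3 of the stake, and that overlapping stake is by definition equivocating.

```python
from fractions import Fraction

quorum = Fraction(2, 3)  # stake fraction required to confirm a block

# If two conflicting blocks each gather a >=2/3 quorum, the stake
# that voted for BOTH of them (i.e. equivocated) is at least:
min_equivocating = quorum + quorum - 1

print(min_equivocating)  # 1/3
```

This quorum-intersection argument is the standard BFT bound; it is what makes the "provably final" claim above hold as long as less than 1/3 of the stake misbehaves.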

This scenario is a variant of the so-called nothing-at-stake problem, a central design consideration for Proof of Stake game theoretical incentive models.

Preventing this is one of the main purposes of PoS consensus, and this is where the "at stake" part of PoS comes in! Equivocation needs to be punished such that node operators have too much to lose if they ever tried to double sign. This is commonly called slashing—a node that double signs will be punished by burning some of the funds staked to it, incentivizing delegators to delegate to trustworthy operators who won't do that.

While Solana has no automated slashing yet, a double signing attempt would halt the network and almost certainly lead to manual slashing by network governance when restarting.

Optimistic confirmation

With Finalized, a malicious rollback would require at least 1/3+ of the stake to be slashed (assuming 100% slashing), but it is quite slow since it has to wait for rooted blocks to ensure there's full finality.

But as it turns out, for many use cases, we don't actually need the guarantee of 1/3+ of active stake being slashed—a ridiculous amount—and instead accept a slightly less ridiculous percentage in exchange for faster finality by not having to wait for roots. Instead, once 2/3+ of the stake agrees on a slot, we immediately consider the block finalized. This is particularly attractive on a real-world network where such attacks would be very impractical to execute and are considered highly unlikely to ever occur.

This is called optimistic confirmation, or the Confirmed commitment level. The current implementation requires at least ~4.6% of total active stake to be slashed for finality violations to be possible.

Whether to use Finalized or Confirmed depends on the level of risk an application is willing to accept. For cross-chain or exchange transfers of large sums, Finalized may be used, while trading bots or other user-facing applications may prefer Confirmed.

The restart

In the case of a network halt, this means the network needs to roll back to the last agreed-upon optimistically confirmed slot, rather than the last root slot, to avoid rolling back transactions that may have triggered external side effects like exchange transfers.

It is considered safe to roll back to the last optimistically confirmed slot, since it's possible to establish off-chain consensus on what the highest such slot was.

Third parties like exchanges relying on optimistic confirmation would have a very strong incentive to ensure that the latest block they saw isn't lost.

In the Sept 14 restart, the teams and validators used many different sources to determine the latest slot to restart from—telemetry data sent to the public metrics.solana.com server as well as data gathered by various validators, RPC node operators, exchanges and others checking their node's latest optimistically confirmed slot.

All of these agreed that 96542804 was the correct slot to use, and instructions were posted on how to create a local snapshot for this block, introduce a hard fork and restart.

Reflections

Safety vs. liveness

Distributed systems are fickle beasts, and Solana is one of the most complex consensus engines ever built, right up there alongside Spanner and friends. All distributed systems—be they centralized payment processors, databases, or blockchains—have one thing in common: a single implementation bug is all it takes to bring them crashing down. Solana runs extensive simulated network tests, an incentivized testnet, and a bug bounty program for quality assurance, but no amount of testing and process can possibly prevent every single bug. Only lessons learned from real-world production usage will, over time.

This is why it's called mainnet-beta — Solana is still in active development and already very usable and safe for real world applications, but some features related to liveness like slashing, fine-grained fees, or transaction prioritization simply aren't implemented yet. Downtime is still a (remote) possibility. In fact, mainnet-beta has been performing much, much better in its early days than anyone expected—all the extensive testing paid off!

The famous CAP theorem also applies to blockchains—a system can only pick two of consistency (every read sees all previous writes), availability (requests always succeed), and partition tolerance (continue working when there's packet loss between nodes).

In a real distributed system, the choice is only between AP and CP — partition tolerance is non-negotiable (any network can, and will, partition).

Solana, like most fast-finality PoS chains, is a CP system.

If in doubt, it will sacrifice availability (i.e. stop) rather than serve stale data or allow unsafe writes.

This is what happened during yesterday's stall or the one in December—the node software entered an irrecoverable state which needed human intervention to resolve, but was never at any risk of losing funds.

Why did this not happen after the Star Atlas launch?

Mainnet-Beta briefly suffered after the Star Atlas IDO a couple weeks earlier, which saw a similar flood of transactions, but swiftly recovered on its own.

So, why did the network survive then, and crash now?

We'll have to wait for the full root cause analysis before we know for sure, but chances are it is related to a supermajority of nodes going offline, leading to an unrecoverable stall.

This is the node count during the Star Atlas launch:

Number of nodes (not weighted by stake)

Compare that to the Sept 14 stall:

Number of nodes (not weighted by stake)

What didn't happen?

Let's clear up a couple of misconceptions floating around:

  • The chain was not "stopped" by anyone, not even the validators. It halted on its own due to loss of consensus. No single entity has the power to do so, anyway—it would require a coordinated effort by at least 1/3+ of the voting power.
  • The stall was not related to the chain's degree of centralization. Adding more nodes does not magically fix a denial of service vulnerability equally affecting all nodes.

Validator responsiveness

The sudden burst of activity of hundreds of operators and community members was truly a sight to behold. Many showed up in #mb-validators on Discord minutes after the chain stalled (most have alerting systems to wake them up), some staying awake for more than 18 hours or working in shifts to make it through the two restarts.

Unfortunately, a small minority of validators weren't quite as responsive, causing unnecessary delays in bringing the network back online. Some of this is an unavoidable price to pay for decentralization, but it also included a small number of large validators with millions of SOL staked to them who have less of an excuse. We don't want to point fingers, but delegators should do their own research and consider improving the network's robustness and decentralization by delegating to smaller and highly engaged validators.

Overall, responsiveness was excellent considering that 400 (!) of more than 1000 node operators across every timezone had to manually bring their nodes back online, twice. The first restart would've succeeded within hours if it weren't for that pesky bug!

Bystander syndrome

Solana's ecosystem has grown at a massive pace, to the point where it's hard to keep track of all the new projects and launches popping up left and right. Sometimes, the core developers hear about a project on its launch day. Far from the cries of some crypto pundits, there's no single coordinating entity—with all the challenges that entails.

So, when Raydium announced the GRAPE launch for Sept 14, 12:00 UTC, with the Star Atlas launch fresh in everyone's minds, nobody thought to ask them to maybe wait until the 1.6.23 release, assuming that surely, "someone else" has already considered this—after all, this is much too obvious of a thing to miss!

...oops

Not like anybody could've seen it coming! 🔮

Incident response is hard

Outages are an unfortunate reality of operating distributed systems. Everything fails, eventually—decentralized blockchains just like centralized exchanges, big internet service providers, or even Google (despite all their decades of experience operating one of the world's most complex systems). It's not a question of if, but "when".

Most incident response systems have their roots in real-world emergency response

Effectively managing incidents significantly affects the outcome, and knowing how to do so is an academic discipline of its own. The distributed nature of decentralized ecosystems adds a whole set of coordination challenges.

Incidents are incredibly stressful for everyone involved and rarely practiced, and without regular practice, most teams are pretty terrible at it.

Solana Labs and the other teams involved weren't new to incident response, though, and the overall response felt calm, focused and effective (perhaps even a little too calm).

That being said, there's still room for improvement, and we believe we could benefit from formalizing some of this ad-hoc incident management process.

For instance, it was quite hard for our team to maintain situational awareness during the incident beyond our immediate area of focus. It was at times unclear to us who was working on what, and what to focus on next.

Some suggestions to help the core teams make faster decisions during an incident:

  • Have a set of clearly assigned roles during an incident, including an incident manager and someone responsible for communicating the current status to the public, with formal hand-off across time zones. This is particularly helpful with multiple teams being part of the incident response.
  • Maintain a single shared document collecting all findings in a single place to avoid duplicating investigative work.
  • Have pre-established playbooks to handle common situations. A playbook was already in place for the cluster restart procedure, which was very helpful. There are other scenarios—like figuring out whether or not the cluster entered an unrecoverable fork—where a documented decision tree could have sped things up.
  • Perhaps even run regular cross-team "mock" incidents to get familiar with the process.

Google's SRE books have multiple chapters on the topic of effective incident response, and while they're targeted at internal teams, many of the lessons they learned over time are also applicable to cross-team incident response.

Antifragility

It would be a shame to waste a perfectly good incident by not learning from it! Real-world performance data collected during the stall already led to a number of improvements in the node software, some even already deployed on mainnet-beta.

While painful in the short term, systems become robust and antifragile by being pushed to their limits. Even the most sophisticated simulations cannot replicate the chaotic and byzantine nature of a real-world distributed system involving thousands of nodes, billions in total value locked, hundreds of protocols and a complex framework of incentives.

The incident did not uncover any fundamental issues with Solana's consensus and reaffirmed the importance of many upcoming areas of improvement on the engineering roadmap. While the downtime was unfortunate, Solana has emerged from the incident stronger than before—and the experience gained by the core teams and validators is invaluable.