Reading “The Evolution of SRE at Google” | Cloud and Distributed Systems Laboratory

The following article was shared on my company's Slack, so I read it.

Tim Falzone, Ben Treynor Sloss, “The Evolution of SRE at Google”, Dec. 18, 2024

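The article describes how Google's SRE organization has applied STAMP/STPA to dealing with system failures. STAMP stands for Systems-Theoretic Accident Model and Processes, and STPA stands for System-Theoretic Process Analysis. STAMP is an accident model proposed by Nancy Leveson of the Massachusetts Institute of Technology, and STPA is the safety analysis technique built on top of it. The approach is not limited to computer science; it applies broadly across engineering, including automobiles and robotics. See the following article for details.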

"What is STAMP/STPA: Learning STAMP/STPA from the Basics (1)" (page 1/4) – MONOist

In the security field, extensions of STPA such as STPA-SafeSec and STPA-Sec have reportedly been proposed. If you are interested, see the following article.

Keishi Okamoto, Kozo Okano, “Introducing overseas STAMP case studies: STPA-SafeSec”, SEC journal Vol.13 No.4, Mar. 2018

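Below I quote directly the parts of the original article, written by Google SRE engineers, that I found especially interesting. As the first passage shows, Google also explained outages as a linear sequence of failures, without being consciously aware of using that model. I understand a linear sequence of failures to mean a cascading failure in which, like falling dominoes, a failure in one system propagates to another system and then on to yet another. According to the article, this model has limitations when it is used to analyze system safety.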

A prevalent way of explaining the cause of an outage at Google is as a linear sequence of failures. As we’ll show, this type of causality model has limitations when analyzing system safety. Sentences like, “a bug combined with insufficient rate limits, caused thousands of servers to go offline” abound in our postmortems. We don’t explicitly call out our use of a linear chain causality model, but as Leveson writes, “accident models explain why accidents occur, and they determine the approaches we take to prevent them. While you might not be consciously aware you are using a model when engaged in these activities, some (perhaps subconscious) model of the phenomenon is always part of the process.” (Leveson 2012, 15)
(Ref. Tim Falzone, Ben Treynor Sloss, “The Evolution of SRE at Google”, Dec. 18, 2024)

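STAMP defines a hazard state as a state of the system from which a loss (an outage) can occur. In other words, the path from normal operation to an outage is modeled as normal state → hazard state → loss. To prevent an outage, the article explains, we must either keep the system from entering a hazard state or, once it has entered one, detect that and bring it back to normal operation. What struck me most is that this way of thinking does not focus on individual parts of the system; it emphasizes keeping the system as a whole out of hazard states.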

Hazard states are not discrete events. They do not describe anything at the individual system component-level. A hazard state is a property of the system as a whole, and the system can be in a hazard state for a long period of time before an accident occurs. That gives engineers a much larger target to aim at when trying to prevent outages. Rather than trying to eliminate any single failure that could occur anywhere in the system, we work to prevent the system from entering a hazard state. And if we do enter a hazard state, if we can detect it and take action to transition from the hazard state back to normal operations, we can prevent any accident from occurring.
(Ref. Tim Falzone, Ben Treynor Sloss, “The Evolution of SRE at Google”, Dec. 18, 2024)
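To make the normal state → hazard state → loss progression concrete, here is a minimal sketch of it as a state machine. It is only my own illustration, not code from the Google article: the names `SystemState` and `step`, and the flags `hazard_present` and `worst_case_event`, are assumptions introduced for this example.

```python
from enum import Enum


class SystemState(Enum):
    """System-level states (illustrative names, not from the article)."""
    NORMAL = "normal"
    HAZARD = "hazard"
    LOSS = "loss"


def step(state, hazard_present, worst_case_event=False):
    """Return the next system-level state.

    hazard_present: whether the hazardous condition currently holds
    for the system as a whole (hypothetical input).
    worst_case_event: whether a triggering event occurs while the
    system is in the hazard state (hypothetical input).
    """
    if state is SystemState.LOSS:
        # In this sketch a loss is terminal; recovery from an outage
        # is outside the model.
        return SystemState.LOSS
    if not hazard_present:
        # Either the hazard never appeared, or it was detected and
        # mitigated, so the system is back to normal operation.
        return SystemState.NORMAL
    if state is SystemState.HAZARD and worst_case_event:
        # Only a system already in a hazard state can proceed to a loss.
        return SystemState.LOSS
    return SystemState.HAZARD


state = SystemState.NORMAL
state = step(state, hazard_present=True)   # enters the hazard state
state = step(state, hazard_present=True)   # can stay there for a long time
state = step(state, hazard_present=False)  # detected and mitigated
print(state.value)
```

The point of the sketch is that the loss is reachable only through the hazard state, so every step spent in `HAZARD` is a chance to detect the condition and return to `NORMAL`.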

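The passage below groups the unsafe control actions (UCAs) of STPA into four types. As the name suggests, a UCA is a control action that makes the system unsafe. The original article explains it with an example from Google's internal infrastructure: reducing a service's resource quota below what the service actually needs is a UCA.

Let me go through the four types of UCA in turn. The first is when a required control action is not provided. This seems to correspond to cases where errors and abnormal conditions were not sufficiently considered. The second is when a control action is provided but is incorrect or inadequate. This seems to correspond, for example, to a problem in the control logic. The third is when a control action is provided at the wrong time or in the wrong sequence. I felt this is likely to happen when network problems cause retries. The fourth is when a control action is stopped too soon or applied for too long. If an action is stopped too soon, I imagine problems such as a low cache hit rate because too little data has been cached yet. If an action goes on for too long, I imagine problems when more data than usual has accumulated in a queue.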

An example of this phenomenon occurred at Google in 2021. We set and enforce resource quotas for some kinds of internal software running on our infrastructure. To maximize efficiency, we also monitor how much of its quota each software service uses. If a service consistently uses less resources than its quota, we automatically reduce the quota. In STPA terms, this quota rightsizer has a control action to reduce a service’s quota. From a safety perspective, we then ask when this action would be unsafe. As one example, if the rightsizer ever reduced a service’s quota below the actual needs of that service, it would be unsafe—the service would be resource-starved. This is what STPA calls an unsafe control action (UCA).

STPA analyzes each interaction in a system to determine comprehensively how the interaction must be controlled in order for the system to be safe. Unsafe control actions lead to the system entering one or more hazard states. There are only four possible types of UCA:

A required control action is not provided.

An incorrect or inadequate control action is provided.

A control action is provided at the wrong time or in the wrong sequence.

A control action is stopped too soon or applied for too long.

(Ref. Tim Falzone, Ben Treynor Sloss, “The Evolution of SRE at Google”, Dec. 18, 2024)
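The quota rightsizer example can be sketched as a control action guarded by a safety check. This is a hedged illustration only and does not reflect how Google's rightsizer is implemented: `Service`, `propose_quota`, `is_unsafe_reduction`, `rightsize`, and the `headroom` factor are all hypothetical names and parameters, and the sketch assumes the service's actual need is known, which is the hard part in practice.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical view of a service as seen by the rightsizer."""
    name: str
    quota: float       # currently enforced resource quota
    peak_usage: float  # highest usage the monitor has observed


def propose_quota(service, headroom=1.5):
    """Control action: shrink the quota toward observed usage.

    headroom is an assumed safety factor; the quota is never raised here.
    """
    return min(service.quota, service.peak_usage * headroom)


def is_unsafe_reduction(service, new_quota, actual_need):
    """UCA from the quoted example: the quota is reduced below the
    actual needs of the service, which would starve it of resources."""
    return new_quota < service.quota and new_quota < actual_need


def rightsize(service, actual_need):
    """Apply the reduction only when it is not an unsafe control action."""
    proposed = propose_quota(service)
    if is_unsafe_reduction(service, proposed, actual_need):
        return service.quota  # keep the current quota
    return proposed


svc = Service(name="demo", quota=100.0, peak_usage=40.0)
print(rightsize(svc, actual_need=50.0))  # reduction is safe
print(rightsize(svc, actual_need=80.0))  # reduction would be unsafe
```

With observed peak usage of 40 and the assumed headroom of 1.5, the proposed quota is 60. That is fine for a service that really needs 50, but for a service that needs 80 (say, because its peak was not observed during monitoring) the same action is unsafe and is skipped.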

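In this post I shared my impressions of, and the points that caught my attention in, an article about the STAMP/STPA approach adopted by Google's SRE. I would like to keep catching up on ways to carry out Root Cause Analysis efficiently. If you have any articles or papers to recommend, I would be glad if you contacted me.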
