Editors Prof. Scope and Purpose Software is an integral part of our lives today. Camera-ready: Camera-ready version of the accepted chapters incorporating revisions if any is expected to be submitted by July 30, Book Publication: The book is anticipated to appear in print by the end of Further info: Inquiries and submissions can be forwarded to Prof.
Additionally, research in this field is informed by related areas such as control systems, machine learning, artificial intelligence, agent-based systems, and biologically inspired computing. The objective of SEAMS is to bring together researchers and practitioners from academia, industry and government, to investigate, discuss, examine and advance the fundamental principles, the state of the art, and the solutions addressing critical challenges of engineering self-adaptive and self-managing systems.
The idea is to have a vote at the end from the audience to decide who was more convincing. Members of the audience can ask questions, and both sides need to answer those questions.
Empirical study that evaluates or compares existing techniques or derives relevant findings using a research method (experiment, survey, case study, grounded theory, …). Literature review on a research topic in the field. Authors of long papers are encouraged to submit their supplementary material for recognition of artifacts that are functional, reusable, available, replicated, or reproduced.
Accepted long papers whose supplementary material has been evaluated positively will receive the corresponding artifact badges. To submit supplementary material, authors should attach an extra one-page abstract (not included in the proceedings) to the submitted long paper. The abstract describes the material, explains how to access it, supports its evaluation, and justifies why it deserves the badges the authors are applying for.
Different organizations handle on-call compensation in different ways; Google offers time-off-in-lieu or straight cash compensation, capped at some proportion of overall salary. The compensation cap represents, in practice, a limit on the amount of on-call work that will be taken on by any individual. This compensation structure gives engineers an incentive to take part in on-call duties as the team requires, while also promoting a balanced distribution of on-call work and limiting the drawbacks of excessive on-call work, such as burnout or inadequate time for project work.
Being an SRE on-call typically means assuming responsibility for user-facing, revenue-critical systems or for the infrastructure required to keep these systems up and running. SRE methodology for thinking about and tackling problems is vital for the appropriate operation of services. Modern research identifies two distinct ways of thinking that an individual may, consciously or subconsciously, choose when faced with challenges [Kah11]: intuitive, automatic, and rapid action; or rational, focused, and deliberate cognitive functions. When one is dealing with outages in complex systems, the second of these options is more likely to produce better results and lead to well-planned incident handling.
The importance and the impact of the services and the consequences of potential outages can create significant pressure on the on-call engineers, damaging the well-being of individual team members and possibly prompting SREs to make incorrect choices that can endanger the availability of the service. Stress hormones like cortisol and corticotropin-releasing hormone (CRH) are known to cause behavioral consequences, including fear, that can impair cognitive functions and cause suboptimal decision making [Chr09].
Under the influence of these stress hormones, the more deliberate cognitive approach is typically subsumed by unreflective and unconsidered but immediate action, leading to potential abuse of heuristics. Heuristics are very tempting behaviors when one is on-call. For example, when the same alert pages for the fourth time in the week, and the previous three pages were initiated by an external infrastructure system, it is extremely tempting to exercise confirmation bias by automatically associating this fourth occurrence of the problem with the previous cause.
While intuition and quick reactions can seem like desirable traits in the middle of incident management, they have downsides. Intuition can be wrong and is often less supportable by obvious data.
Thus, following intuition can lead an engineer to waste time pursuing a line of reasoning that is incorrect from the start. Quick reactions are deep-rooted in habit, and habitual responses are unconsidered, which means they can be disastrous.
The ideal methodology in incident management strikes the perfect balance of taking steps at the desired pace when enough data is available to make a reasonable decision while simultaneously critically examining your assumptions.
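The most important on-call resources are clear escalation paths, well-defined incident-management procedures, and a blameless postmortem culture. The appropriate escalation of outages is generally a principled way to react to serious outages with significant unknown dimensions.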
Google SRE uses the protocol described in Managing Incidents, which offers an easy-to-follow and well-defined set of steps that help an on-call engineer rationally pursue a satisfactory incident resolution with all the required help. This protocol is internally supported by a web-based tool that automates most of the incident management actions, such as handing off roles and recording and communicating status updates.
This tool allows incident managers to focus on dealing with the incident, rather than spending time and cognitive effort on mundane actions such as formatting emails or updating several communication channels at once.
SRE teams must write postmortems after significant incidents and detail a full timeline of the events that occurred. By focusing on events rather than the people, these postmortems provide significant value. Rather than placing blame on individuals, they derive value from the systematic analysis of production incidents.
Mistakes happen, and software should make sure that we make as few mistakes as possible. Recognizing automation opportunities is one of the best ways to prevent human errors [Loo10].
What happens if operational activities exceed this limit? The SRE team and leadership are responsible for including concrete objectives in quarterly work planning in order to make sure that the workload returns to sustainable levels.
Temporarily loaning an experienced SRE to an overloaded team, discussed in Embedding an SRE to Recover from Operational Overload, can provide enough breathing room so that the team can make headway in addressing issues. Ideally, symptoms of operational overload should be measurable, so that the goals can be quantified e.
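One way to make overload symptoms measurable is to record per-shift counts and compare them against agreed limits. The sketch below is only an illustration: the `ShiftStats` record, the two metrics it tracks, and the threshold values in the usage note are all hypothetical, and a team would choose its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftStats:
    """Operational load observed during one on-call shift (hypothetical record)."""
    tickets: int        # tickets handled during the shift
    paging_events: int  # pages received during the shift


def overloaded_shifts(shifts, max_tickets, max_pages):
    """Return the shifts that exceed either caller-supplied limit."""
    return [
        s for s in shifts
        if s.tickets > max_tickets or s.paging_events > max_pages
    ]


def overload_ratio(shifts, max_tickets, max_pages):
    """Fraction of shifts over a limit; 0.0 when there are no shifts."""
    if not shifts:
        return 0.0
    return len(overloaded_shifts(shifts, max_tickets, max_pages)) / len(shifts)
```

With illustrative limits of 6 tickets and 3 pages per shift, `overload_ratio([ShiftStats(3, 1), ShiftStats(9, 1), ShiftStats(2, 4), ShiftStats(1, 0)], max_tickets=6, max_pages=3)` flags two of the four shifts. A ratio like this can be tracked across quarters as a concrete planning objective.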
Misconfigured monitoring is a common cause of operational overload. All paging alerts should also be actionable. Low-priority alerts that bother the on-call engineer every hour (or more frequently) disrupt productivity, and the fatigue such alerts induce can also cause serious alerts to be treated with less attention than necessary.
See Dealing with Interrupts for further discussion. It is also important to control the number of alerts that the on-call engineers receive for a single incident.
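A common way to limit alerts per incident is to group related alerts so that only the first one pages. The following is a minimal sketch, not a description of any particular monitoring system. It assumes each alert already carries a `cause_key` identifying the suspected shared cause, and that alerts with the same key within `window_seconds` of the group's first alert belong to one incident; both are hypothetical simplifications.

```python
def group_alerts(alerts, window_seconds):
    """Collapse alerts that share a cause into one group per time window.

    `alerts` is an iterable of (timestamp_seconds, cause_key, message) tuples.
    Returns a list of (first_timestamp, cause_key, messages) groups in the
    order their first alert fired. Each group would produce a single page.
    """
    groups = []
    latest_group = {}  # cause_key -> index of the most recent group for it
    for ts, cause, message in sorted(alerts, key=lambda a: a[0]):
        idx = latest_group.get(cause)
        if idx is not None and ts - groups[idx][0] <= window_seconds:
            # Same cause, still inside the window: attach, do not page again.
            groups[idx][2].append(message)
        else:
            # New cause, or the window has expired: start a new group.
            latest_group[cause] = len(groups)
            groups.append((ts, cause, [message]))
    return groups
```

For example, four alerts at 0 s and 30 s (`"db-primary"`), 45 s (`"frontend"`), and 400 s (`"db-primary"`) with a 300-second window yield three groups, so the engineer receives three pages instead of four. The window is measured from the first alert in a group, which keeps a long-running incident from suppressing pages indefinitely.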