Good evening, everyone! We are glad to share the news that in February we are launching a new stream of the course
“Devops - Practices and Tools”, which means it is time to finish what we started and publish the third part of the article
“Why SRE Documentation is Important”. Let's go!
Documents for managing SRE teams
SRE teams need reliable and accessible documentation to work efficiently.
Team site
Note: instead of a site, you can use a separate space or section in Confluence or a wiki.
The team site is important because it brings together the information and documentation related to the SRE team and its projects. At Google, for example, many SRE teams use g3doc (Google's internal documentation platform, where docs live in source code alongside the associated code), and some teams use both g3doc and Google Sites; in that case, the g3doc pages stay close to the code and implementation details.
Team charter
SRE teams should maintain a published charter that explains why the team exists and documents its current engagements. The charter establishes the team's identity, main goals, and purpose across the company.
The charter usually contains the following elements:
- A high-level description of the team's responsibilities, including the types of services the team supports (and how), related systems, and examples.
- A brief description of a couple of the most important services supported by the team. This section also highlights key technologies and the difficulties involved in their use, the benefits of involving SRE, and what SRE is responsible for.
- Key principles and values of the team.
- References to the team site and documentation.
The charter is also expected to include a vision statement (an inspiring description of the team's long-term goals) and a roadmap covering several quarters.
Documentation for integrating new SREs
Investing in training tools and materials for new employees speeds up their integration into the team's workflows. SRE teams benefit from giving newcomers all the skills needed for on-call shifts as early as possible. Zoe's story clearly shows how a lack of comprehensive training for a new employee turns a minor incident into a serious failure.
Many SRE teams prepare new employees for on-call shifts with the help of checklists. An on-call checklist usually covers high-level areas (divided into subsections) that team members must understand. Examples of such areas include production concepts, front end and back end, automation and tools, and monitoring and logging. The checklist may also include instructions for preparing for a shift and tasks to perform during the shift.
SRE teams also use role-playing exercises to train new members (at Google these are called Wheel of Misfortune). Such an exercise is a failure scenario with a specific set of data and signals that an SRE might need in order to solve a problem during a shift. Team members take turns playing the on-call engineer to hone their skills in mitigating a failure and debugging the system. Wheel of Misfortune checks whether every member of the team knows where to find the documentation needed to fix the problem and how to deal with the failure.
Repository management
An SRE team's information can end up scattered across multiple sites, a local repository, and Google Drive folders, which makes it very difficult to find the right document. In the example described earlier, a critical operational tool and the instructions for using it were not available to Zoe (the on-call SRE) because they were hidden in her technical lead's personal directory, and the inability to find them significantly prolonged the service failure. To avoid such problems, structure all the information and make sure that team members know where to look for it, where to store it, and how to maintain it. A well-designed structure helps the team find information faster: new team members get up to speed sooner, and on-call engineers solve problems more quickly.
Here are some guidelines on how to create and maintain a documentation repository:
- Identify key stakeholders and conduct brief interviews to identify all needs.
- Find as much documentation as possible and analyze the gaps in the content.
- Set up a basic site structure so that new documentation is created in the right places.
- Move existing documentation to a new location.
- Archive and retire outdated documentation.
- Perform regular checks to ensure the quality / consistency of supported documentation.
- Make sure that standard search queries produce the necessary documents at the very top of the search results list.
- Use signals, such as Google Analytics, to gauge how the documentation is actually used.
Note on repository maintenance: it is important to check and update documentation regularly. The owner's name and the date of the last review should be visible on each document, because this information helps readers judge how reliable the document is. In Zoe's story, she was able to find only outdated documentation for a critical tool and so lost the chance to solve the problem quickly. Unreliable and outdated documentation makes SREs less efficient, which in turn hurts the reliability of the services they manage.
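The regular check described above can be partly automated. Below is a minimal sketch, assuming each document carries an owner and a last-review date as metadata; the `DocRecord` type, its field names, and the 90-day review window are hypothetical choices for illustration, not something the article prescribes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class DocRecord:
    """Hypothetical metadata for one document in the repository."""
    title: str
    owner: Optional[str]
    last_reviewed: Optional[date]


def find_stale_docs(docs: List[DocRecord], today: date,
                    max_age_days: int = 90) -> List[str]:
    """Return titles of docs with no owner, no review date,
    or a review date older than max_age_days (an assumed window)."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for doc in docs:
        if not doc.owner or doc.last_reviewed is None or doc.last_reviewed < cutoff:
            stale.append(doc.title)
    return stale


docs = [
    DocRecord("Failover playbook", "zoe", date(2024, 1, 10)),
    DocRecord("Old tool guide", None, date(2023, 1, 1)),
]
print(find_stale_docs(docs, today=date(2024, 3, 15)))
```

A report like this could feed the regular quality checks listed earlier, so that documents without an owner or a recent review are surfaced before an incident rather than during one.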
Repository availability
SRE teams must ensure that documentation remains available even when the standard repository fails or is inaccessible. At Google, each SRE has a personal copy of critical documentation. This copy is kept on an encrypted compact storage device or some other removable but secure physical medium that each on-call SRE carries.
Documentation for service decommissioning
When a service reaches the end of its life cycle, SRE decommissions it in a predictable manner. This section provides recommendations on documentation for service decommissioning.
It is important to announce the decommissioning to users in advance and to provide a schedule and the steps involved. Your announcement should explain when registration of new users ends, how existing and future bugs will be handled, and when the service finally stops working. Clearly mark all important dates and the reduction in SRE support, and send out interim announcements as you progress.
An email announcement alone is not enough: you also need to update the main page of the documentation, the playbooks, and the codelabs. If possible, add a note to the header files as well. Describe the details of the announcement in a document (in addition to the email) that users can refer to. The email should be as short as possible while still being informative and covering all the main points. In the document, describe additional details: the business motivation for turning off the service, the tools users can use to migrate to another service, and the support available during migration. It is also worth creating a FAQ page and adding to it over time as users ask new questions.
The Role of Technical Documentation Editors
Technical editors (or technical writers) provide services that make SRE more efficient and productive. The range of tasks is not limited to writing individual documents on the requirements specified by the SRE team.
Here are some practical recommendations for technical editors for working with SRE teams.
- Technical editors cooperate with SRE to create documentation for the operation of the launched services and production documentation for SRE products and tools.
- They create and update documentation repositories, structure and reorganize them in accordance with the needs of users, improve individual documents as part of the overall management of the repository.
- Editors advise on assessing and improving documentation and information management. This includes evaluating documentation to gather requirements, improving documents and sites created by engineers, and advising teams on how to create, organize, redesign, search, and maintain documentation.
- Editors should evaluate and improve documentation tools to provide better solutions for SRE.
Templates
Technical editors also provide templates that simplify the creation and use of SRE documentation. Templates do the following:
- Simplify the creation of documentation, giving engineers a clear structure for creating new documents.
- Supply all the sections a document needs, which keeps the documentation complete.
- Help the reader to quickly understand the topic of the document, the type of information and how it is organized.
The book Site Reliability Engineering contains several sample documentation templates. In this section, we provide a few more examples to show how templates give engineers a structure and a guide to fill in with content.
Service overview
Overview
What is it? What does it do? Describe at a high level the functionality provided to clients (end users, components, etc.).
Architecture
Explain how the architecture works. Describe the movement of data between components. Consider adding a system diagram that shows critical dependencies and the flow of requests and data.
Customers and Dependencies
List all clients (belonging to other teams) that depend on the service and all services (belonging to other teams) on which it depends. (This can also be shown as a system diagram.)
Code and Configuration
Explain the production setup. Where does the service run? List binaries, jobs, data centers, and configuration file settings, or indicate where they are all located. Also provide the location of the code and, if necessary, information about the build.
List and describe the configuration files, changes, and ports required to operate this product or service.
Describe the following: Which configuration files are modified for this product or service? How are they set up?
Processes
Describe the following: What daemons and other processes must be running for the service to work? What control scripts were created to manage the service?
Output
List and describe the log files created by the component and what is monitored. Describe the following: What logs does this component generate? What is in each file? What are the recommendations for examining these files? What aspects of the component should be monitored for reliable service operation?
Dashboards and Tools
Insert links to the relevant dashboards and tools.
Capacity
Specify the capacity of a single instance, of a data center, and globally: QPS, bandwidth, and latency values.
SLA
Provide availability targets.
Standard Procedures
Add links to procedures, including load testing, updates / pushes / flag changes, and so on. Add links to the alert documentation in the alert playbook.
References
Add links to design documentation for the component or related components (usually written by the development team), as well as other related information.
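A template like the one above lends itself to an automated completeness check. The sketch below assumes documents are plain text with each section heading on its own line; the function name and the particular subset of section names being checked are illustrative assumptions.

```python
# Section names taken from the service overview template;
# which of them to treat as required is an assumption of this sketch.
REQUIRED_SECTIONS = (
    "Overview",
    "Architecture",
    "Code and Configuration",
    "Processes",
    "Dashboards and Tools",
    "SLA",
    "References",
)


def missing_sections(doc_text, required=REQUIRED_SECTIONS):
    """Return the required section headings that do not appear
    as a standalone line in doc_text (case-insensitive)."""
    headings = {line.strip().lower() for line in doc_text.splitlines()}
    return [name for name in required if name.lower() not in headings]


draft = """Overview
Routes user requests to the nearest backend.
Architecture
Front end, cache, and storage layer.
SLA
Availability target agreed with the product team.
"""
print(missing_sections(draft))
```

Run against a draft, the function lists the sections an engineer still has to fill in, which is one way a template can act as a guide rather than only a layout.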
Playbook
Title
In the title, specify the name of the alert (for example, Generic Alert_AlertTooGeneric).
Overview
Describe the following: What does this alert mean? Does it page someone or only send an email? What factors trigger the alert? What parts of the service are affected? What alerts are associated with it? Who needs to be notified?
Alert Severity
Explain the severity of the alert and the impact of the affected parts on the overall health of the service.
Verification
Provide clear instructions on how to verify and confirm the alert's status.
Troubleshooting
List and describe debugging methods and related sources of information. Do not forget to link to the corresponding dashboards. Include warnings. Describe the following: What appears in the logs when the alert fires? What debug handlers are available? Are there any useful scripts and commands? What output do they generate? Are there any additional tasks to complete after the alert has been resolved?
Solution
List and describe all possible solutions to the problem causing the alert. Describe the following: How do you solve the problem and clear the alert? Which commands should be run to reset things? Who should be notified if the alert fired because of user actions? Who has experience debugging a similar problem?
Escalation
List and describe the escalation path. Indicate the person or team to be notified and when to notify them. If escalation is not necessary, say so.
Related Links
Provide links to related alerts, procedures, and overview documentation.
Quarterly service report
Introduction
Describe the service for which the team is responsible.
Capacity Planning
Include:
- The actual demand for the service over the last 6-8 quarters, expressed in the metrics most relevant to the service (for example, QPS or DAU).
- Forecast of demand for the next 8 quarters.
- A capacity plan that satisfies the projected demand at the required level of redundancy; specify any deficits and / or risks in the plan.
We also recommend including the forecasts that were made for the past 2-4 quarters so that the reader can evaluate the stability and accuracy of forecasting.
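To make the deficit and forecast-accuracy items concrete, here is a small sketch. It assumes demand and capacity are expressed in the same unit (QPS here) and that redundancy is modeled as a headroom fraction on top of forecast demand; the function names, the 25% headroom, and the sample numbers are all hypothetical.

```python
def capacity_deficits(forecast_qps, planned_capacity_qps, headroom=0.25):
    """Per-quarter shortfall: forecast demand plus an assumed
    redundancy headroom, minus planned capacity (never negative)."""
    deficits = []
    for demand, capacity in zip(forecast_qps, planned_capacity_qps):
        required = demand * (1 + headroom)
        deficits.append(max(0.0, required - capacity))
    return deficits


def mean_abs_pct_error(actual, forecast):
    """Mean absolute percentage error of past forecasts,
    one simple way to judge forecast accuracy."""
    errors = [abs(a - f) / a for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)


# Two upcoming quarters: the second needs 1500 QPS but only 1400 is planned.
print(capacity_deficits([1000, 1200], [1500, 1400]))
# Two past quarters, each forecast off by 10 percent.
print(mean_abs_pct_error(actual=[1000, 2000], forecast=[900, 2200]))
```

Reporting the shortfall per quarter, next to the error of earlier forecasts, lets the reader see both where capacity is at risk and how much trust the forecast deserves.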
SLA Compliance / Availability
All services supported by SRE must have a written SLA against which performance is evaluated each quarter.
The SLA section should contain the parameters of the main service components used to measure quarterly compliance with the SLA, as well as a link to the team's written SLA.
Related Incidents (Optional)
List 3-5 major incidents or failures for the quarter.
Achievements (Optional)
List the main achievements for the quarter.
SLA Changes (Preferred)
Describe recent changes to the SLA.
Service Details (Preferred)
This section may include growth figures, latency statistics, and so on.
Team Information (Optional)
This section may include information about team members, status, projects, and on-call shift statistics.
Data Sources (Required)
Describe the sources used to obtain the availability values and the calculation methods, and provide links to the corresponding dashboards.
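As an example of documenting a calculation method, the sketch below assumes availability is measured as the share of successful requests over the quarter. Both that definition and the 99.9% target are assumptions made for illustration; a real report should state whatever definition the team's written SLA uses.

```python
def availability(successful_requests, total_requests):
    """Request-based availability: successful / total.
    This definition is an assumption of the sketch."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def meets_target(successful_requests, total_requests, target=0.999):
    """True if measured availability reaches the (hypothetical) target."""
    return availability(successful_requests, total_requests) >= target


quarter_ok = meets_target(999_500, 1_000_000)   # 99.95% measured
quarter_bad = meets_target(998_000, 1_000_000)  # 99.80% measured
print(quarter_ok, quarter_bad)
```

Writing the formula down next to the data source makes the reported number reproducible by anyone who reads the quarterly report.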
Team Charter
Who we are
In one sentence (about one line), describe the team's technology area, its clients and what the team offers them, as well as the degree of SRE involvement and any special expertise.
Supported Services
To further clarify the scope of work, describe the services (or groups of services) that the team supports.
How We Distribute Time
Scoping helps to create a roadmap and to achieve and sustain long-term goals.
Team Values
Clearly describe the team's values. They affect how team members interact with each other and how your team is perceived by others.
Conclusion
Whether you are an SRE, an SRE manager, or a technical editor, you now understand the critical importance of documentation in the life of an effective SRE team. Good documentation allows the SRE team to grow and to follow a clear methodology for managing new and existing services.
This concludes the final part of the article. The first and second parts can be read by following the hyperlinks, and you can get even more useful information at our open lesson, which will be held on February 19th. We look forward to seeing everyone!