Audit Site Downtime Playbook #9134

Merged
merged 2 commits into from
Jul 30, 2024
4 changes: 2 additions & 2 deletions source/content/guides/disaster-recovery/01-introduction.md
Expand Up @@ -4,7 +4,7 @@ subtitle: Introduction
description: Address emergency downtime situations on the Pantheon platform
tags: [webops]
contributors: [joshlieb, joan-ing]
reviewed: "2021-07-26"
reviewed: "2024-07-25"
type: guide
permalink: docs/guides/disaster-recovery
editpath: disaster-recovery/01-introduction.md
Expand All @@ -17,6 +17,6 @@ product: [--]
integration: [--]
---

This guide is focused on the immediate actions a Pantheon customer should take in the event of a catastrophic site failure. In all cases, the first step should be to file an emergency downtime on-call ticket. Filing a ticket will immediately escalate the incident and ensure the fastest and most effective service.
This guide is focused on the immediate actions a Pantheon customer should take in the event of a catastrophic site failure. In all cases, the first step should be to file an emergency downtime on-call ticket, file a regular ticket, or open a support chat, depending on your support tier. Filing an emergency Pantheon On-Call ticket will immediately escalate the incident and ensure the fastest and most effective service.

This guide does not cover all potential post-disaster recovery processes. Such processes will depend on the nature of the incident and the impact that it has on your site. You must engage with Pantheon support during the incident, and the support team will help determine what remediation steps are required.
14 changes: 7 additions & 7 deletions source/content/guides/disaster-recovery/02-planning-ahead.md
Expand Up @@ -4,7 +4,7 @@ subtitle: Planning Ahead
description: Avert potential site failures
tags: [webops, workflow]
contributors: [joshlieb, joan-ing]
reviewed: "2021-07-26"
reviewed: "2024-07-25"
permalink: docs/guides/disaster-recovery/planning-ahead
editpath: disaster-recovery/02-planning-ahead.md
contenttype: [guide]
Expand All @@ -21,29 +21,29 @@ Disasters are sometimes unavoidable, but steps can be taken to ensure that these

## Monitor and Optimize Performance

Keep ahead of performance issues by regularly reviewing performance with the [New Relic Application Performance Monitor](/guides/new-relic) (APM), included with all non-Basic Site plans. For more information, refer to the [Pantheon New Relic documentation](/guides/new-relic).
Keep ahead of performance issues by regularly reviewing performance with the New Relic Application Performance Monitor, included with all non-Basic Site plans. For more information, refer to the [Pantheon New Relic documentation](/guides/new-relic).

New Relic also provides a performance monitoring service that can send notification of downtime or degraded performance by email and other channels. Refer to the documentation on [New Relic Alerts](https://docs.newrelic.com/docs/alerts-applied-intelligence/overview/) for more information.

If you have been assigned a Customer Success Manager, you will receive periodic technical reviews. These sessions include training on how to use New Relic to proactively address performance issues.
A dedicated Customer Success Manager (CSM) is included for all Enterprise (contract) Accounts. Dedicated CSMs will meet with you regularly to provide site performance audits. These sessions include training on how to use New Relic to proactively address performance issues.

All sites are different, and there are many different performance issues that can emerge. Review Pantheon's [Optimizing Performance](/performance) documentation for tips and troubleshooting techniques for all layers of the platform.

## Optimize Your Cache Hit Rate
## Optimize Your Cache Hit Ratio

The Pantheon Global CDN delivers pages directly to users from the Varnish edge page cache, and this layer serves both as insulation against unexpected traffic spikes as well as against application and infrastructure issues.

### Process

* Determine the extent to which your site is using the edge cache by requesting a cache hit rate report from support. This report shows the cache hit rate for the full site on a daily basis.
* Determine the extent to which your site is using the edge cache by reviewing your cache hit ratio report from the Metrics tab of the Live environment in the Site Dashboard. For more information on metrics in the Site Dashboard, see [Measuring Site Traffic](/guides/account-mgmt/traffic).

* Test the cacheability of individual pages by examining the page headers using CURL or developer tools. Refer to [Testing Global CDN Caching](/guides/global-cdn/test-global-cdn-caching) for more information.
* Test the cacheability of individual pages by examining the page headers using curl or developer tools. Refer to [Testing Global CDN Caching](/guides/global-cdn/test-global-cdn-caching) for more information.

* Optimize your caching strategy by checking cookies, application configurations, and session management. Refer to our [Debug Caching Issues](/debug-cache) documentation.

* The platform average for site caching is ~80%.

* Persistent uptime only works with cacheable content. The higher the caching rate the more protection it will automatically provide you.
* Experience Protection only works with cacheable content. The higher the caching rate, the more protection it automatically provides. Refer to [Confirm Experience Protection](/guides/global-cdn/experience-protection) for more information.
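
The header check and the hit ratio comparison described in the process above can be sketched in Python. This is a minimal illustration, not Pantheon tooling: the function names `looks_cacheable` and `cache_hit_ratio` are hypothetical, the rules are a simplified reading of standard `Cache-Control` and `Set-Cookie` behavior, and the actual edge cache may apply additional rules that this sketch does not model.

```python
from typing import Mapping

PLATFORM_AVERAGE = 0.80  # the "~80%" platform average quoted in this guide


def _directives(cache_control: str) -> dict:
    """Split a Cache-Control value into {directive_name: value_or_empty}."""
    parsed = {}
    for part in cache_control.lower().split(","):
        name, _, value = part.strip().partition("=")
        if name:
            parsed[name] = value
    return parsed


def looks_cacheable(headers: Mapping[str, str]) -> bool:
    """Heuristic (assumption): True if the headers suggest a shared cache may store the page."""
    lowered = {key.lower(): value for key, value in headers.items()}
    directives = _directives(lowered.get("cache-control", ""))

    # Directives that normally stop a shared cache from reusing a response.
    if any(name in directives for name in ("no-store", "no-cache", "private")):
        return False
    # A Set-Cookie header commonly causes an edge cache to bypass the response.
    if "set-cookie" in lowered:
        return False
    # Require a positive freshness lifetime from max-age or s-maxage.
    lifetimes = [directives.get("s-maxage", ""), directives.get("max-age", "")]
    return any(value.isdigit() and int(value) > 0 for value in lifetimes)


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0
```

The headers can be collected with `curl -I <url>` or browser developer tools and passed in as a dictionary. A ratio from `cache_hit_ratio` that sits well below `PLATFORM_AVERAGE` is a hint to review cookies, application configuration, and session handling.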

## Notify Support of Potential Risks

Expand Down
62 changes: 20 additions & 42 deletions source/content/guides/disaster-recovery/03-site-goes-down.md
Expand Up @@ -4,7 +4,7 @@ subtitle: What to Do If Your Site Goes Down
description: Working with Pantheon support during emergencies
tags: [webops]
contributors: [joshlieb, joan-ing]
reviewed: "2021-07-26"
reviewed: "2024-07-25"
type: guide
permalink: docs/guides/disaster-recovery/site-goes-down
editpath: disaster-recovery/03-site-goes-down.md
Expand All @@ -17,37 +17,33 @@ product: [--]
integration: [--]
---

## Open an Emergency Downtime Ticket
## Open a Support Ticket

In all cases, the first step is to file an emergency downtime on-call ticket. Even if you escalate the incident to your Account Manager or Customer Success Manager, the support engineers will be the ones diagnosing the cause of downtime and working to get your site back up, and a ticket is the fastest way to get them up to speed and engaged.
In cases of downtime or significant functional failure in the Live environment, the first step is to open a support ticket. Even if you escalate the incident to your dedicated Customer Success Manager (included for all Enterprise contract customers), our support engineers will be the ones diagnosing the cause of downtime and working to get your site back up, and a ticket is the fastest way to get them up to speed and engaged.

These tickets can be filed from the Site or Organizational Dashboard. To create a ticket, navigate to the **Support** tab in the Dashboard and click **Trigger Pantheon On-Call** in the Escalate with Pantheon On-Call box. Note that these tickets should only be reserved for downtime or significant functional failures on the Live environment.
Diamond and Platinum Account customers can report and escalate site downtime by clicking **Trigger Pantheon On-Call** from the Support tab. In cases where the dashboard is inaccessible, a ticket can be filed using a telephone ticketing service, accessible at **1(866)415-7624**. Note that this is strictly for filing a ticket, and you will not reach a support engineer by using this method.

In cases where the dashboard is inaccessible, a ticket can be filed using a telephone ticketing service, accessible at **1(866)415-7624**. Note that this is strictly for filing a ticket, and you will not reach a support engineer by using this method.
All other account types should click **Open Ticket** to open a support interaction to report site downtime.

Please include as much information as possible. A support engineer will work with you to diagnose the cause, and any information that you can provide will shorten the investigation time.
![Show platinum support features in the site dashboard](../../../images/dashboard/new-dashboard/platinum-support-site-dashboard.png)

Note that these tickets should be reserved for downtime or significant functional failure in the Live environment, only.
Please include as much information as possible. A support engineer will work with you to diagnose the cause, and any information that you can provide will shorten the investigation time.

## Check for Ongoing Platform Incidents

Sites can go down for various reasons, and although the support team aims to diagnose the cause of downtime, customers can perform their own diagnostics.

Pantheon platform status is tracked at https://status.pantheon.io/, and all customers are encouraged to sign up for status page updates. Although a site can be taken down by isolated platform issues that are not systemic enough to warrant a platform status alert, these are rare. Tickets should still be filed even when the downtime is caused by an identified platform incident - we need to know who has been affected, and how it is affecting their sites.

If you receive a notification ahead of discovering that the incident has affected your site, you can still file a ticket - even if we are already working to fix an identified issue, we need to know who has been affected and how it is affecting their sites.
Pantheon platform status is tracked at [status.pantheon.io](https://status.pantheon.io/), and all customers are encouraged to sign up for status page updates. Although a site can be taken down by isolated platform issues that are not systemic enough to warrant a platform status alert, these are rare. Tickets should still be filed even when the downtime is caused by an identified platform incident - we need to know who has been affected, and how it is affecting their sites.

Because incidents are declared when a platform issue meets a minimal downtime or service degradation threshold, it is possible that you will receive a notification for an incident that is not affecting your site. Conversely, there are cases where a site is affected by an issue with the platform, but this issue is isolated to resources specific to the site, and a platform incident is not declared.

## Incident Escalation

Although filing an emergency on-call ticket will escalate your downtime incident within the support team and ensure you receive the fastest response, you may also want to alert your broader Pantheon account team. Depending on the situation, your escalation path may differ.

Note that tickets and chat have tier-specific response time objectives, while email, phone, and Slack channels do not. Refer to [Support Features and Response Times](/guides/support/#support-features-and-response-times) for details.

#### Support Channels
### Support Channels

* **Ticketing**: If your Elite site is suffering downtime on the Live environment, your first step should be to open an emergency on-call ticket. Chat normally has a quicker response time, but emergency on-call tickets are absolutely escalated and response times to these tickets should be comparable.
* **Ticketing**: If your site is suffering downtime on the Live environment, your first step should be to open a support ticket. Chat normally has a quicker response time, but emergency on-call tickets are always escalated, and response times to these tickets should be comparable.

* **Slack**: Diamond tier accounts can have access to a dedicated Slack channel in which customers can interact directly with their CSM, AM, and primary support resources. This is primarily intended as a means for quick communication and collaboration, and should not be used in lieu of the ticketing system, as there are no SLOs associated with Slack channels.

Expand All @@ -59,41 +55,23 @@ Note that tickets and chat have tier-specific response time objectives, while em

</Alert>

* **Diamond and Platinum Account customers** can call Pantheon's premium technical support line directly for any technical issues, escalations, site, billing, or overages queries. The phone number can be found in your Workspace, in the Support tab.
* **Diamond and Platinum Account customers** can call Pantheon's premium technical support line directly for any technical issues, escalations, site, billing, or overages queries. The phone number can be found in your Workspace, in the Support tab. Note that this is strictly for filing a ticket, and you will not reach a support engineer by using this method.

#### Escalation Paths
### Escalation Paths

Depending on the account tier, your escalation path may differ. Escalation paths include the following:

* Dedicated Customer Success Engineer: All Diamond tier accounts have a named senior support engineer, and tickets and issues are routed preferentially to them.

* Customer Success Manager: Serves as a coordinator when support involves multiple teams, or when additional subject matter experts need to be brought into the process. The CSM is also responsible for any post-incident RCA or performance reviews.

* Account Manager: Also serves as a coordinator of support efforts.

* Managed Updates: If the issue arises from a Managed Updates deployment, the first point of escalation is the MU Engagement Manager currently involved in deploying the updates, and secondarily the Manager of the Managed Updates team.

### Phone and Teleconference Support
* Dedicated Support Team: All Diamond tier accounts have a dedicated support team, and tickets and issues are routed preferentially to them. This escalation is an automatic part of the intake process once a ticket is opened.

A phone call or teleconference can be requested for emergency support. This can be done while filing the emergency ticket, or in the open ticket thread. The available resources depend on the current staffing situation, but named resources can be requested and can join when available.
* Customer Success Manager (CSM): Included for all Enterprise (contract) customers, CSMs serve as coordinators when support involves multiple teams, or when additional subject matter experts need to be brought into the process. The CSM is also responsible for any post-incident Customer Incident Analysis or performance reviews.

### Escalation to Your Dedicated Customer Success Engineer

All Diamond tier accounts have a named Senior Support Engineer, and tickets and issues are routed preferentially to them. If the designated engineer is not available (i.e. if the incident happens during non-business hours), there are fallback assignment paths to ensure that Diamond tickets are quickly assigned to senior support staff. This escalation should be an automatic part of the intake process.

### Account Team Escalation

During an active incident, the support engineers will be the primary parties engaged in bringing the site back to health. The Account Manager and Customer Success Manager are escalation points for the following scenarios:

* **Pantheon Support is not responsive**: In the event that Pantheon support is either unresponsive or unhelpful, your account team can escalate the issue internally and use additional resources.

* **Coordination across multiple teams is required**: The support engineers will loop in resources from other teams as needed, but where there are multiple workstreams required to remedy an issue, the Account Manager or Customer Success Manager can assist in coordinating across teams.

* **Post-incident review**: Where a formal review is required to produce a Root Cause Analysis, your Customer Success Manager can produce a customer-specific version in addition to any RCA published on the Status portal.
* Managed Updates: If the issue arises from a [Managed Updates](/guides/professional-services/managed-updates) deployment, the first point of escalation is the MU Engagement Manager currently involved in deploying the updates, and secondarily the Manager of the Managed Updates team.

### Professional Services Escalation

Incidents may involve managed services like the Advanced Global CDN, Signal Sciences Integration, and Managed Updates. Support for these layers is handled by the core Support team, and escalation to the appropriate Professional Services team is at the discretion of the support engineers. The support engineers have been trained to handle many AGCDN issues and have tooling that gives them access directly to edge configurations, but there are aspects that may need to be handled by Professional Services.
Incidents may involve managed services like the Advanced Global CDN, WAF Integration, and Managed Updates. Support for these layers is handled by the core Support team, and escalation to the appropriate Professional Services team is at the discretion of the support engineers. The support engineers have been trained to handle many AGCDN issues and have tooling that gives them access directly to edge configurations, but there are aspects that may need to be handled by Professional Services.

Dedicated CSMs (included for all Enterprise contract customers) can escalate these issues and have access to resources that help expedite triage and remediation. If you need to speak with additional Pantheon teammates who work on our Professional Services team, CSMs can facilitate those conversations.

### Executive Escalation

Expand All @@ -105,10 +83,10 @@ Incident management is a collaboration between Pantheon Support and the customer

Key tools that you can use for ongoing diagnosis of issues include:

* New Relic gives you real-time insight into application performance, and the slowest transactions are profiled with full stack traces that can isolate specific code, query, or external services bottlenecks.The New Relic Application Performance Monitor (APM) can be used to track current-state performance and dig into transaction traces to isolate bottlenecks and break points. Refer to the [New Relic](/guides/new-relic) documentation for more information.
* New Relic gives you real-time insight into application performance, and the slowest transactions are profiled with full stack traces that can isolate specific code, query, or external services bottlenecks. The New Relic Application Performance Monitor can be used to track current-state performance and dig into transaction traces to isolate bottlenecks and break points. Refer to the [New Relic](/guides/new-relic) documentation for more information.

* MySQL, PHP, and Nginx logs provide forensic data for incident review. Refer to [Log Files on Pantheon](/guides/logs-pantheon).

* ACDN logs can be piped directly into customer-managed log management applications. Setup by Professional Services is required.
* AGCDN logs can be piped directly into customer-managed log management applications. Setup by Professional Services is required.
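
As one illustration of using logs as forensic data, the sketch below counts server errors (5xx responses) per request path from web server access log lines. It assumes a "combined"-style line in which the quoted request is followed by the status code; that layout is an assumption, so check it against the actual log format before relying on it. The function name `server_errors_by_path` is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed layout: ... "METHOD /path HTTP/x.y" STATUS ...
REQUEST_RE = re.compile(r'"(?:[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def server_errors_by_path(lines: Iterable[str]) -> Counter:
    """Count 5xx responses per request path; lines that do not match are skipped."""
    counts: Counter = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status").startswith("5"):
            counts[match.group("path")] += 1
    return counts
```

Passing an open log file to `server_errors_by_path` and reading `most_common()` on the result shows which paths fail most often, which can be useful detail to include in the support ticket.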

The Customer Success Engineering team will work with you through the existing emergency ticket. If additional issues are uncovered you may want to open a new ticket to allow for a cleaner set of interactions, especially if additional Pantheon resources are brought in for review and assistance.