CSE review of downtime procedure
jms-pantheon committed Jul 30, 2024
1 parent d717863 commit e857d98
Showing 4 changed files with 16 additions and 25 deletions.
2 changes: 1 addition & 1 deletion source/content/guides/disaster-recovery/01-introduction.md
@@ -17,6 +17,6 @@ product: [--]
integration: [--]
---

This guide is focused on the immediate actions a Pantheon customer should take in the event of a catastrophic site failure. In all cases, the first step should be to file an emergency downtime on-call ticket. Filing a ticket will immediately escalate the incident and ensure the fastest and most effective service.
This guide is focused on the immediate actions a Pantheon customer should take in the event of a catastrophic site failure. In all cases, the first step should be to file an emergency downtime on-call ticket, regular ticket, or open a support chat depending on support tier. Filing an emergency Pantheon On-Call ticket will immediately escalate the incident and ensure the fastest and most effective service.

This guide does not cover all potential post-disaster recovery processes. Such processes will depend on the nature of the incident and the impact that it has on your site. You must engage with Pantheon support during the incident, and the support team will help determine what remediation steps are required.
13 changes: 6 additions & 7 deletions source/content/guides/disaster-recovery/02-planning-ahead.md
@@ -21,30 +21,29 @@ Disasters are sometimes unavoidable, but steps can be taken to ensure that these

## Monitor and Optimize Performance

Keep ahead of performance issues by regularly reviewing performance with the [New Relic Application Performance Monitor](/guides/new-relic) (APM), included with all non-Basic Site plans. For more information, refer to the [Pantheon New Relic documentation](/guides/new-relic).
Keep ahead of performance issues by regularly reviewing performance with the New Relic Application Performance Monitor, included with all non-Basic Site plans. For more information, refer to the [Pantheon New Relic documentation](/guides/new-relic).

New Relic also provides a performance monitoring service that can send notification of downtime or degraded performance by email and other channels. Refer to the documentation on [New Relic Alerts](https://docs.newrelic.com/docs/alerts-applied-intelligence/overview/) for more information.

A dedicated CSM is included for all Enterprise (contract) Accounts. Dedicated CSMs will meet with you regularly to provide site performance audits. These sessions include training on how to use New Relic to proactively address performance issues.
A dedicated Customer Success Manager (CSM) is included for all Enterprise (contract) Accounts. Dedicated CSMs will meet with you regularly to provide site performance audits. These sessions include training on how to use New Relic to proactively address performance issues.

All sites are different, and there are many different performance issues that can emerge. Review Pantheon's [Optimizing Performance](/performance) documentation for tips and troubleshooting techniques for all layers of the platform.

## Optimize Your Cache Hit Rate
## Optimize Your Cache Hit Ratio

The Pantheon Global CDN delivers pages directly to users from the Varnish edge page cache, and this layer serves both as insulation against unexpected traffic spikes as well as against application and infrastructure issues.

### Process

* Determine the extent to which your site is using the edge cache by reviewing your cache hit rate report from the Metrics tab of the Live environment in the Site Dashboard. For more information on metrics in the Site Dashboard, see [Measuring Site Traffic](/guides/account-mgmt/traffic).
* Determine the extent to which your site is using the edge cache by reviewing your cache hit ratio report from the Metrics tab of the Live environment in the Site Dashboard. For more information on metrics in the Site Dashboard, see [Measuring Site Traffic](/guides/account-mgmt/traffic).

* Test the cacheability of individual pages by examining the page headers using CURL or developer tools. Refer to [Testing Global CDN Caching](/guides/global-cdn/test-global-cdn-caching) for more information.
* Test the cacheability of individual pages by examining the page headers using curl or developer tools. Refer to [Testing Global CDN Caching](/guides/global-cdn/test-global-cdn-caching) for more information.

* Optimize your caching strategy by checking cookies, application configurations, and session management. Refer to our [Debug Caching Issues](/debug-cache) documentation.

* The platform average for site caching is ~80%

<!--TODO: SME input wanted on the following line. Is "persistent uptime" a feature we're referencing here? Rephrase this item using the same language as product and marketing for whatever we're talking about here (e.g., Experience Protection via the GCDN or Multizone Failover or some feature of AGCDN or AGCDN + WAF/IO)-->
* Persistent uptime only works with cacheable content. The higher the caching rate the more protection it will automatically provide you.
* Experience Protection only works with cacheable content. The higher the caching rate the more protection it will automatically provide you. Refer to [Confirm Experience Protection](/guides/global-cdn/experience-protection) for more information.
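
The header checks described in the list above can be sketched in Python. This is a minimal sketch, not Pantheon tooling: the function names are hypothetical, and the heuristics (an `age` above zero or `HIT` in `x-cache`, no `set-cookie`, no `no-cache`, `no-store`, or `private` directive in `cache-control`) are assumptions about what a cached edge response commonly looks like. Refer to Testing Global CDN Caching for the authoritative checks.

```python
from typing import Mapping


def looks_cached(headers: Mapping[str, str]) -> bool:
    """Heuristic guess at whether a response was served from an edge cache."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}

    # Assumption: these directives prevent shared caching of the page.
    cache_control = h.get("cache-control", "").lower()
    if any(d in cache_control for d in ("no-cache", "no-store", "private")):
        return False

    # Assumption: a response that sets a cookie is not cached at the edge.
    if "set-cookie" in h:
        return False

    try:
        age = int(h.get("age", "0"))
    except ValueError:
        age = 0

    # A positive age or a HIT marker suggests the page came from cache.
    return age > 0 or "HIT" in h.get("x-cache", "").upper()


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests answered from cache (0.0 when there is no traffic)."""
    total = hits + misses
    return hits / total if total else 0.0
```

The headers can be copied from `curl -I <url>` output or taken from any HTTP client. The ratio can then be compared against the ~80% platform average mentioned above.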

## Notify Support of Potential Risks

16 changes: 6 additions & 10 deletions source/content/guides/disaster-recovery/03-site-goes-down.md
@@ -33,18 +33,14 @@ Please include as much information as possible. A support engineer will work wit

Sites can go down for various reasons, and although the support team aims to diagnose the cause of downtime, customers can perform their own diagnostics.

Pantheon platform status is tracked at https://status.pantheon.io/, and all customers are encouraged to sign up for status page updates. Although a site can be taken down by isolated platform issues that are not systemic enough to warrant a platform status alert, these are rare. Tickets should still be filed even when the downtime is caused by an identified platform incident - we need to know who has been affected, and how it is affecting their sites.

If you receive a notification ahead of discovering that the incident has affected your site, you can still file a ticket - even if we are already working to fix an identified issue, we need to know who has been affected and how it is affecting their sites.
Pantheon platform status is tracked at [status.pantheon.io](https://status.pantheon.io/), and all customers are encouraged to sign up for status page updates. Although a site can be taken down by isolated platform issues that are not systemic enough to warrant a platform status alert, these are rare. Tickets should still be filed even when the downtime is caused by an identified platform incident - we need to know who has been affected, and how it is affecting their sites.

Because incidents are declared when a platform issue meets a minimal downtime or service degradation threshold, it is possible that you will receive a notification for an incident that is not affecting your site. Conversely, there are cases where a site is affected by an issue with the platform, but this issue is isolated to resources specific to the site, and a platform incident is not declared.

## Incident Escalation

Although filing an emergency on-call ticket will escalate your downtime incident within the support team and ensure you receive the fastest response, you may also want to alert your broader Pantheon account team. Depending on the situation, your escalation path may differ.

Note that tickets and chat have tier-specific response time objectives, while email, phone, and Slack channels do not. Refer to [Support Features and Response Times](/guides/support/#support-features-and-response-times) for details.

### Support Channels

* **Ticketing**: If your site is suffering downtime on the Live environment, your first step should be to open a support ticket. Chat normally has a quicker response time, but emergency on-call tickets are absolutely escalated and response times to these tickets should be comparable.
@@ -59,21 +55,21 @@ Note that tickets and chat have tier-specific response time objectives, while em

</Alert>

* **Diamond and Platinum Account customers** can call Pantheon's premium technical support line directly for any technical issues, escalations, site, billing, or overages queries. The phone number can be found in your Workspace, in the Support tab.
* **Diamond and Platinum Account customers** can call Pantheon's premium technical support line directly for any technical issues, escalations, site, billing, or overages queries. The phone number can be found in your Workspace, in the Support tab. Note that this is strictly for filing a ticket, and you will not reach a support engineer by using this method.

### Escalation Paths

Depending on the account tier, your escalation path may differ. Escalation paths include the following:

* Dedicated Support Team: All Diamond tier accounts have a dedicated support team, and tickets and issues are routed preferentially to them. This escalation is an automatic part of the intake process once a ticket is opened.

* Customer Success Manager (CSM): Included for all Enterprise (contract) customers, CSMs serve as a coordinator when support involves multiple teams, or when additional subject matter experts need to be brought into the process. The CSM is also responsible for any post-incident RCA or performance reviews.
* Customer Success Manager (CSM): Included for all Enterprise (contract) customers, CSMs serve as a coordinator when support involves multiple teams, or when additional subject matter experts need to be brought into the process. The CSM is also responsible for any post-incident Customer Incident Analysis or performance reviews.

* Managed Updates: If the issue arises from a [Managed Updates](/guides/professional-services/managed-updates) deployment, the first point of escalation is the MU Engagement Manager currently involved in deploying the updates, and secondarily the Manager of the Managed Updates team.

### Professional Services Escalation

Incidents may involve managed services like the Advanced Global CDN, Signal Sciences Integration, and Managed Updates. Support for these layers is handled by the core Support team, and escalation to the appropriate Professional Services team is at the discretion of the support engineers. The support engineers have been trained to handle many AGCDN issues and have tooling that gives them access directly to edge configurations, but there are aspects that may need to be handled by Professional Services.
Incidents may involve managed services like the Advanced Global CDN, WAF Integration, and Managed Updates. Support for these layers is handled by the core Support team, and escalation to the appropriate Professional Services team is at the discretion of the support engineers. The support engineers have been trained to handle many AGCDN issues and have tooling that gives them access directly to edge configurations, but there are aspects that may need to be handled by Professional Services.

Dedicated CSMs (included for all Enterprise contract customers) have the ability to escalate these issues and have access to resources that can assist with expediting the triaging and remediation of issues. If you need to speak with additional teammates at Pantheon that work in our professional services team, CSMs can facilitate those conversations.

@@ -87,10 +83,10 @@ Incident management is a collaboration between Pantheon Support and the customer

Key tools that you can use for ongoing diagnosis of issues include:

* New Relic gives you real-time insight into application performance, and the slowest transactions are profiled with full stack traces that can isolate specific code, query, or external services bottlenecks.The New Relic Application Performance Monitor (APM) can be used to track current-state performance and dig into transaction traces to isolate bottlenecks and break points. Refer to the [New Relic](/guides/new-relic) documentation for more information.
* New Relic gives you real-time insight into application performance, and the slowest transactions are profiled with full stack traces that can isolate specific code, query, or external services bottlenecks. The New Relic Application Performance Monitor can be used to track current-state performance and dig into transaction traces to isolate bottlenecks and break points. Refer to the [New Relic](/guides/new-relic) documentation for more information.

* MySQL, PHP, and Nginx logs provide forensic data for incident review. Refer to [Log Files on Pantheon](/guides/logs-pantheon)

* ACDN logs can be piped directly into customer-managed log management applications. Setup by Professional Services is required.
* AGCDN logs can be piped directly into customer-managed log management applications. Setup by Professional Services is required.

The Customer Success Engineering team will work with you through the existing emergency ticket. If additional issues are uncovered you may want to open a new ticket to allow for a cleaner set of interactions, especially if additional Pantheon resources are brought in for review and assistance.
Original file line number Diff line number Diff line change
@@ -21,7 +21,7 @@ Bringing a site back from downtime and remediating the cause of downtime to ensu

## External Threats

The Signal Sciences and Advanced GCDN layers are primarily managed by Pantheon’s Professional Services team, with some updates and support tasks performed by the Customer Success Engineering team, which also performs intake on the initial request. This ensures that responses can meet the contracted SLA, and Pantheon aims to escalate/reassign to Professional Services on an as needed basis.
The WAF and Advanced GCDN layers are primarily managed by Pantheon’s Professional Services team, with some updates and support tasks performed by the Customer Success Engineering team, which also performs intake on the initial request. This ensures that responses can meet the contracted SLO, and Pantheon aims to escalate/reassign to Professional Services on an as needed basis.

In the event of an attack, exploit, or other issue related to the global edge, file a ticket via the normal support channels, with an on-call emergency ticket filed in cases where downtime or serious service degradation occurs, and notify the Pantheon account team via Slack.

@@ -40,7 +40,6 @@ A set of static pages can be hosted directly and when certain failure conditions
## Infrastructure Failover
In cases where the Google Cloud Platform infrastructure becomes compromised, Pantheon support can trigger a Multizone failover to redirect traffic at the load-balancing layer to a backup cluster of application servers on an alternate zone. For more information, refer to the [Multizone Failover](/multizone-failover) documentation.


Multizone failover is not designed to protect against issues on the Global CDN, on the load balancing layer, or at the application level. The automated monitoring that triggers a failover condition is focused on infrastructure issues. The zonal redundancy has an identical codebase, a continually replicated database, and uses a common filesystem, application issues would cause the same failure conditions regardless of zone.

Failover has an impact on the Object Cache service - the cache will be automatically rebuilt in the new zone on failover, but this is transactionally heavy, and the site should be tested to determine the performance impact of a mass cache rebuild. This test can be scheduled by filing a support ticket.
@@ -61,20 +60,17 @@ In cases where the site code, database, or assets have become corrupted or compr
#### Managed Updates Deployment Issue
As part of the Managed Updates deployment process, a Multidev will be cloned from the Live environment. It will be used primarily for testing, but also as a backup. If the Live deployment fails, results in a regression, or compromises the site, this Multidev will be used as the source to restore Live to a pre-deploy state.


#### Codebase is Unrecoverable
The codebase can be restored from a selected backup via Terminus - the Dashboard **Restore Tools** restore all aspects of the site, and cannot be used to selectively restore. For more information, refer to the [Backup Restore](/terminus/commands/backup-restore) documentation information.

#### Reverting a Bad Commit to Pantheon
If a bad commit has been deployed to your Pantheon site, you can roll back the commit using Git. The process depends on the nature of the change and whether it involves core updates or upstream updates, etc. For more information, refer to the [Undo Commits](/undo-commits) documentation.

#### Database and Filesystem Issues
The **Database/files** tools on the Site Dashboard can be used to clone either the files or database from a different environment (Test to Live, for example). For more information, refer to the [Database Workflow](/database-workflow) docuemntation.

The **Database/files** tools on the Site Dashboard can be used to clone either the files or database from a different environment (Test to Live, for example). For more information, refer to the [Database Workflow](/database-workflow) documentation.

#### Restoring a Database from a Backup
The database can be restored from a selected backup via Terminus. The Dashboard **Restore** tools restore all aspects of the site, and cannot be used to restore selectively. For more information, refer to the [Backup Restore](/terminus/commands/backup-restore) documentation.
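
A selective restore through Terminus, as described for the codebase and the database above, could be scripted roughly as follows. This is a sketch under stated assumptions: it relies on the `--element` option of `terminus backup:restore` as covered in the linked Backup Restore documentation, the helper names are hypothetical, and `my-site` is a placeholder site name. Verify the command against your installed Terminus version before relying on it during an incident.

```python
import subprocess
from typing import List

# Assumption: the elements that a selective restore can target.
ELEMENTS = ("code", "files", "database")


def restore_command(site: str, env: str, element: str) -> List[str]:
    """Build the Terminus argv that restores one element of an environment."""
    if element not in ELEMENTS:
        raise ValueError(f"element must be one of {ELEMENTS}, got {element!r}")
    return ["terminus", "backup:restore", f"{site}.{env}", f"--element={element}"]


def run_restore(site: str, env: str, element: str) -> None:
    """Run the restore; Terminus may ask for confirmation on the terminal."""
    # check=True raises CalledProcessError if Terminus exits with an error.
    subprocess.run(restore_command(site, env, element), check=True)
```

For example, `restore_command("my-site", "live", "database")` builds the command for restoring only the database on Live. Keeping command construction separate from execution lets the command be reviewed before anything changes on the site.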


#### Restoring a Database from an External Dump
The database can be restored from an external dump using the **Database/files** tools on the Site Dashboard. An archive file can be uploaded, or a MySQL archive accessed on a remote location. For more information, refer to the [Database Workflow](/database-workflow) docuemntation.
The database can be restored from an external dump using the **Database/files** tools on the Site Dashboard. An archive file can be uploaded, or a MySQL archive accessed on a remote location. For more information, refer to the [Database Workflow](/database-workflow) documentation.
