Respond Recover

Response and recovery is a fundamental cyber security domain. Its objective is to classify the severity of an incident, manage and mitigate its impact, and restore normal operations as quickly as possible. Response and recovery also includes several preparatory steps to ensure well-coordinated actions across multiple stakeholders.

Key response and recovery topics include:

  • Incident Response Program
  • Incident Classification
  • Incident Response Playbooks
  • Exercises and Training
  • Disaster Recovery Plans (System)
  • Backup and Recovery (Asset)
  • Security Operations and Incident Classification

Incident Response Program

An incident response program enables timely and effective business restoration through coordinated, strategic, and pre-planned processes. Operational technology (OT) response and recovery requires unique considerations: availability is prioritized within the traditional Confidentiality, Integrity, and Availability (CIA) triad, and plans must accommodate limited resources in hardware, software, and expertise.

Current incident response program practices and recommendations include:

  • Organizations document their overarching OT Cyber Security Incident Response in an incident response plan, which provides information, guidance, and structure to support response and recovery activities. In North America, NERC CIP-003 and CIP-008 address incident reporting and response requirements, which help shape the incident response plan.
  • The Cyber Security Incident Response Team (CSIRT) is the core group of multidisciplinary resources responsible for quickly and efficiently returning impacted cyber systems to normal operations. The CSIRT involves multiple internal players across OT, IT, operations, facilities, legal, HR, compliance, and communications. The team coordinates with external partners such as OEM vendors, law enforcement, and information-sharing forums. It is a best practice to document the roles and responsibilities of all potential stakeholders in the incident response plan.
  • Various policies, plans, and procedures serve different purposes and often interact together to create an effective response and recovery program. Establishing clear definitions and purposes, handover points, and relationships is crucial for a smooth transition between response and recovery phases and collaborating teams.
  • Ensure the plan's users and stakeholders can easily access necessary policies, procedures, contact information, and other documents when needed. Offline copies at applicable sites are advisable in case of communication loss.

Relevant EPRI Resources

Incident Classification

Incident classification refers to categorizing incidents into unique groups using specific incident attributes. Classifying incidents helps incident management direct newly identified incidents to the correct responders with the necessary skills to respond effectively. It also helps direct incident response personnel to specific playbooks containing low-level procedures for responding to specific incidents.
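
The routing idea described above can be sketched as a simple lookup table. This is a minimal illustration only: the category names, responder groups, and playbook identifiers are hypothetical and would come from an organization's own incident response plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    responder_group: str  # who should handle the incident
    playbook: str         # which playbook the responders should open


# Hypothetical routing table. Categories, groups, and playbook names are
# illustrative assumptions, not taken from any standard.
ROUTES = {
    "malware": Route("OT security analysts", "playbook-malware-containment"),
    "unauthorized_access": Route("IT/OT identity team", "playbook-account-compromise"),
    "loss_of_view": Route("control system engineers", "playbook-hmi-recovery"),
}

# Fallback for incidents that do not match a known category.
DEFAULT_ROUTE = Route("CSIRT duty lead", "playbook-general-triage")


def route_incident(category: str) -> Route:
    """Return the responder group and playbook for an incident category."""
    return ROUTES.get(category, DEFAULT_ROUTE)
```

Keeping a default route means an unclassified incident still reaches a responder instead of being dropped.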

Current incident classification key practices include:

  • NERC CIP-003 and CIP-008 require a cyber security incident response plan to include processes to classify incidents.
  • CSIRT analysts benefit from maintaining familiarity with Indicators of Compromise (such as system alerts, error reports, anomalous network behavior, low activity, unexpected configuration changes, and system crashes), their sources, and the mechanisms for analyzing them, including using automated tools.
  • Incident priority is determined from severity and urgency levels, which are based on business, public, and safety impact, recoverability, and possible future impacts. Escalation criteria are evaluated on an ongoing basis to determine whether the priority level should be raised.
  • Incident classification categories can be utilized to develop specific playbooks and reporting procedures and perform exercises.
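
One way to picture the prioritization practice above is a small severity-by-urgency matrix. The sketch below is an assumption-laden example: the three-level scale, the multiplication rule, and the thresholds are invented for illustration, and real programs define their own criteria.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(severity: Level, urgency: Level) -> int:
    """Map severity and urgency to a priority, where 1 is the most urgent.

    The scoring rule and thresholds are illustrative assumptions.
    """
    score = severity * urgency  # ranges from 1 to 9
    if score >= 6:
        return 1
    if score >= 3:
        return 2
    return 3


def should_escalate(current_priority: int, severity: Level, urgency: Level) -> bool:
    """Re-evaluate an open incident and report whether its priority should rise."""
    return priority(severity, urgency) < current_priority
```

Calling `should_escalate` each time new information arrives mirrors the ongoing evaluation of escalation criteria.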

Relevant EPRI Resources

Incident Response Playbooks

Incident response playbooks provide a repeatable, step-by-step process for minimizing the impact of specific incidents that are complex to respond to and recover from or pose the most risk to the organization. The playbooks provide tailored details for a specific incident type, facility, cyber system, or responsible group.

Current playbook key practices and considerations:

  • The playbook integrates existing cyber security incident response processes and procedures, enhances communication channels and interaction between playbook stakeholders, and uncovers potential gaps in baseline cyber security controls.
  • Incident response playbooks can be an effective tool for training cyber security personnel during tabletop exercises.
  • The playbook steps must not conflict with other emergency plans. Establishing priorities between steps included in various emergency plans may be necessary. For instance, any steps to ensure personnel safety and health should be prioritized over steps to restore cyber systems or operations.
  • Effective playbook development requires a few prerequisites: an established Incident Response Plan, a defined Cyber Security Incident Response Team (CSIRT) role, the generation of fleet risk analysis and assessment, and the implementation of baseline cyber security controls.
  • Before beginning work on a playbook, it is crucial to determine the scope of the playbook, the audience for which the playbook is written, and the level of detail for the various components of the playbook to minimize the amount of rework or wasted effort.
  • It is practically impossible to develop a playbook for every possible incident that could occur. This is particularly true in the case of detailed, granular playbooks designed for specific systems. Organizations should prioritize incident scenarios that are of high risk.
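
Prioritizing which scenarios receive a playbook can be sketched as a risk ranking. The example below assumes a simple likelihood-times-impact score on a 1 to 5 scale; the scenario names and scores are hypothetical, and an organization's own fleet risk assessment would supply the real inputs.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)


def rank_for_playbooks(scenarios: list[Scenario], limit: int) -> list[str]:
    """Return the names of the `limit` highest-risk scenarios."""
    ranked = sorted(scenarios, key=lambda s: s.likelihood * s.impact, reverse=True)
    return [s.name for s in ranked[:limit]]


# Hypothetical candidate scenarios with illustrative scores.
candidates = [
    Scenario("ransomware on engineering workstation", 3, 5),
    Scenario("vendor remote access misuse", 2, 4),
    Scenario("USB-borne malware on HMI", 4, 3),
    Scenario("historian data tampering", 1, 2),
]

print(rank_for_playbooks(candidates, 2))
```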

Relevant EPRI Resources

Exercises and Training

Incident response exercises and training are crucial in ensuring the successful implementation of the developed incident response program. They serve as the opportunity to review and improve the pre-determined strategies and activities and raise awareness among the players and stakeholders.

Current exercise and training key practices include:

  • Incident response exercises are an effective form of training. The exercise may be tabletop, hands-on, or hybrid. Distributed play exercises such as GridEx and Cyber Europe serve as an opportunity to participate in multi-stakeholder, large-scale exercises.
    • In North America, NERC E-ISAC hosts the biennial grid security and resilience exercise, GridEx, which provides organizations a forum to practice how they would respond to and recover from coordinated cyber and physical security threats and incidents.
    • In Europe, ENISA organizes the biennial Cyber Europe exercise to test and improve the collaboration and response capabilities of private companies, government agencies, and other relevant stakeholders. Cyber Europe 2024 focuses on the energy sector, inspired by a record number of attacks in recent years and ongoing geopolitical tension threatening the energy infrastructure.
  • NERC CIP-003, CIP-008, and CIP-009 require periodic “testing” of incident response and recovery plans by responding to real cyber security incidents or performing tabletop or operational exercises.

Relevant EPRI Resources

Incident Response Scenarios

Incident response scenarios are sequential, narrative accounts of a hypothetical incident that provide the catalyst for the exercise and introduce situations that will inspire responses. They can be developed to demonstrate specific exercise objectives focused on a particular combination of cyber incidents, cyber systems, processes, players, or standards.

Current scenario key facts and practices:

  • Scenarios can be applied hands-on, hands-off, or as a combination of the two. Each application has advantages and disadvantages, and the purpose and limitations of the exercise determine which one is used.
    • Scenario limitations include any logistical and organizational constraints such as the availability of participating facilities and personnel, and limitations to the exercise environment.
  • Effective incident response scenarios clearly define the goals, address the organizational risk, and facilitate learnings, discussions, and post-exercise debriefing for continuous improvement.
  • A foundational scenario provides an overall frame for the detailed scenario. It can be identified by considering the risks of the OT environment, the attack pathway, the impact, and observable indicators.
  • Optional injects are independent events such as loss of corporate communication, loss of power, erroneous alerts, or false positives and can be used in combination with the planned scenario to exercise specific topics of interest, manage difficulty depending on maturity, and respond to changing conditions.

Relevant EPRI Resources

Disaster Recovery Plans (System)

A Disaster Recovery Plan (DRP) is a written process or set of strategies for recovering and protecting cyber systems in case of a major hardware or software failure caused by cyber events. It prioritizes restoring crucial assets, provides predefined courses of action, coordinates responses, and promotes collective decision-making without the pressure of hurried choices. This plan ensures consistency in recovery efforts, aligning actions with the organization’s overarching security policies.

Current Disaster Recovery key practices and considerations include:

  • Various policies, plans, and procedures serve different purposes and often interact together for an effective response and recovery program. Establishing clear definitions and purpose, handover points, and relationships is crucial for a smooth transition between response and recovery phases and collaborating teams.
  • Before developing a DRP, an organization should have developed a cyber security incident response plan. The DRP supplements the incident response plan for eradication and recovery phases of major cyber security incidents causing system-wide disruptions.
    • The organization should have a good understanding and documentation of the network, system configuration, and backup and restoration plan for applicable cyber systems. These will be utilized to establish the recovery process and referenced in the DRP.
  • A DRP is activated once the cyber event has been confirmed, declared, classified, and contained, and evidence has been gathered. The condition for activation can leverage existing incident classifications to provide further clarifications.
  • The recovery prioritization strategy may be determined by considering the recovery objectives, critical functions and systems/assets, dependencies or interdependencies, the eradication and recovery strategy, and continuous damage analysis.
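
The activation condition and the role of dependencies in recovery prioritization can be illustrated with a short sketch. It assumes Python 3.9 or later (for the standard-library `graphlib` module). The system names and the dependency map are hypothetical; a real DRP would draw them from documented network and system configurations.

```python
from graphlib import TopologicalSorter

# Preconditions for activating a DRP, following the list in the text.
REQUIRED_STEPS = {"confirmed", "declared", "classified", "contained", "evidence_gathered"}


def drp_can_activate(completed_steps: set[str]) -> bool:
    """Return True only when every precondition has been completed."""
    return REQUIRED_STEPS <= completed_steps


# Hypothetical dependency map: each system lists the systems that must be
# restored before it. Names are illustrative assumptions.
DEPENDS_ON = {
    "domain_controller": set(),
    "control_network_switch": set(),
    "historian": {"domain_controller"},
    "hmi": {"domain_controller", "control_network_switch"},
    "engineering_workstation": {"hmi"},
}


def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one restoration order in which prerequisites always come first."""
    return list(TopologicalSorter(depends_on).static_order())
```

A topological sort only respects dependencies. Recovery objectives and critical functions would still be needed to choose among systems that have no ordering constraint between them.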

Relevant EPRI Resources

Backup and Recovery (Asset)

Backup and recovery refers to the activities that maintain a secure copy of a system's current state and return the system to normal operations, whether by restoring from backups, rebuilding, or replacing hardware or software.

Current backup and recovery key practices include:

  • It is best practice to employ a graded approach to backup and recovery based on the risk and capability associated with the device and the organizational maturity.
  • Backup and recovery tools may be embedded in a system or purchased from third-party vendors, but no single tool can back up, restore, and manage all backups. OT asset owners must carefully evaluate available tools to manage and automate the backup and recovery process.
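
A graded approach can be pictured as a policy table keyed by risk tier, with the backup method chosen from the device's capability. The tiers, intervals, and method labels below are invented for illustration and are not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPolicy:
    interval_days: int           # how often a backup is taken
    offline_copy: bool           # whether an offline copy is also kept
    restore_tests_per_year: int  # how often restoration is rehearsed


# Hypothetical graded policies. All values are illustrative assumptions.
POLICIES = {
    "high": BackupPolicy(interval_days=1, offline_copy=True, restore_tests_per_year=4),
    "medium": BackupPolicy(interval_days=7, offline_copy=True, restore_tests_per_year=2),
    "low": BackupPolicy(interval_days=30, offline_copy=False, restore_tests_per_year=1),
}


def backup_plan(risk_tier: str, automated_backup_supported: bool) -> dict:
    """Combine the risk-based policy with a method the device can support."""
    policy = POLICIES[risk_tier]
    method = "scheduled tool export" if automated_backup_supported else "manual export"
    return {
        "method": method,
        "interval_days": policy.interval_days,
        "offline_copy": policy.offline_copy,
        "restore_tests_per_year": policy.restore_tests_per_year,
    }
```

For example, `backup_plan("high", False)` pairs the most demanding policy with a manual method for a device that cannot be backed up automatically.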

Relevant EPRI Resources

Security Operations and Incident Classification

Security operations refer to continuous infrastructure monitoring for cyber security anomalies and trends. The Security Operations Center (SOC) team is often the first to identify and respond to cyber security incidents.

Key security operations facts and considerations in power generation facilities include:

  • OT security operations capabilities have expanded as advanced tools such as security information and event management (SIEM) and network monitoring have become available for OT systems. Advanced analytics techniques utilizing artificial intelligence and machine learning are rapidly progressing and offer opportunities for automation. Maintenance requirements of installed security tools should be considered for effective implementation.
  • An Integrated Security Operations Center (ISOC) centralizes data feeds from physical security systems and from IT and OT network cyber security systems so that they can be monitored by ISOC staff who have advanced incident response training and are familiar with the systems. ISOC capabilities can be further advanced by integrating data for Monitoring and Diagnostics.
  • Identifying key cyber security metrics is becoming critical for security operations to communicate effectively and consistently and to prioritize the overwhelming amount of security-related data.
  • Workforce development for OT security operations is a persistent challenge due to a shortage of skilled OT security professionals.
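
As an illustration of the metrics practice above, the sketch below computes two commonly used measures, mean time to detect and mean time to recover, from incident timestamps. The choice of these two metrics, the record layout, and the sample data are assumptions made for the example; the source does not prescribe specific metrics.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (occurred, detected, recovered).
incidents = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 10, 0)),
    (datetime(2024, 2, 1, 0, 0), datetime(2024, 2, 1, 4, 0), datetime(2024, 2, 1, 16, 0)),
]


def mean_hours(intervals: list[tuple[datetime, datetime]]) -> float:
    """Average length, in hours, of a list of (start, end) intervals."""
    return mean((end - start).total_seconds() for start, end in intervals) / 3600


def mean_time_to_detect(records) -> float:
    """Average hours between occurrence and detection."""
    return mean_hours([(occurred, detected) for occurred, detected, _ in records])


def mean_time_to_recover(records) -> float:
    """Average hours between detection and recovery."""
    return mean_hours([(detected, recovered) for _, detected, recovered in records])
```

Tracking a small, consistently defined set of such figures over time is one way to communicate trends without presenting raw security data.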

Relevant EPRI Resources