Disaster recovery is the planning and implementation of a process whereby a company can recover from a catastrophic information technology failure. The three main categories of disaster exposure include natural threats and hazards (including hurricanes, flooding, earthquakes and fire), technical and mechanical hazards (such as power outages, gas leaks, accidental or deliberate Halon discharges, or chemical spills) and human activities and threats (like computer error, loss of records, vandalism, sabotage or epidemic) (Rike, 2003).
The goal of disaster recovery planning in information technology is to restore access to business data and system resources as quickly as possible, as well as to minimize data loss and physical resource loss.
Disaster recovery must address each of the main categories of threat, assess the likely impact and the chance of occurrence of each one and plan reactions and facilities accordingly. Disaster recovery is not only important for the IT-based company, but for any company which is vulnerable to natural disaster or malicious attack.
Proper planning of a disaster recovery framework will shorten response time, minimize data loss and speed both recovery and the restoration of access to data and computing resources. Disaster recovery planning for information technology includes: data assurance with a proper backup and restore procedure; network continuity; intrusion detection and response; proper facilities planning including air conditioning, fire detection and control and environmental sensors; and personnel training in order to ensure proper response.
A business’s disaster recovery framework may extend beyond its information technology into facilities management, human resources and other operations. Disaster recovery is a relatively new facet of information technology planning which has rapidly become more important as businesses have become more dependent on technology resources. Many modern businesses come to a standstill without their technology base, and this can be devastating to the business. Rike (2003) noted that 93% of companies which suffer a major data loss go out of business within five years following that loss.
However, according to Rike, many companies are unprotected from this danger: two surveys noted that only 35% of small and midsize businesses have a disaster recovery framework in place, while only 36% of all businesses and government offices have such a framework.

Disaster Recovery Case Studies

One of the first discussions of disaster recovery in information technology occurred after the 1995 Kobe earthquake in Japan. Garland and Morimoto (1996) provide an account of the outcome of the Kobe University disaster recovery framework on their IT infrastructure, as well as the effects of the earthquake itself.
The Kobe earthquake, referred to as the “Great Hanshin Earthquake Disaster”, struck the Kobe area in the early morning hours of January 17, 1995. Aftershocks and fires worsened the damage caused by the earthquake, cutting off communications and electricity to the region. Transportation routes were completely blocked due to collapsed roadways and damaged rail lines. The earthquake, which measured 7.2 on the Richter scale and left almost 5,400 dead as well as 400,000 homeless in its wake, was one of the worst disasters to have occurred in modern Japan.
The university, where the authors were teaching at the time, lost two professors and thirty-nine students, as well as all its laboratory animals. Data loss was extensive, and computing equipment loss was exacerbated by physical damage caused by falling furniture and books. The university’s telephone and fax connections were completely cut off. However, despite the damage to the university’s infrastructure and community, Internet connectivity was restored within a few hours of the earthquake.
The resulting email access (there were no extensive Web-based resources at the time) gave students and staff a channel to the outside world, a means to reassure loved ones and a connection to government disaster recovery resources. University personnel also used cellular phones, a then-nascent technology, to connect to the outside world. Kobe University was using the best available technology at the time, which allowed for quick recovery of the lightweight machines.
The IT personnel at the university noted specifically that the hardest-hit IT resources were the older-style, stationary, heavyweight servers and storage units, rather than the newer equipment which was designed to be moved and handled. Specific successes of the Kobe University disaster recovery included: use of alternate routes of communication, broadcast communication to all personnel involved (including students and staff), fast restoration of outside connectivity, setup of alternate email access points and gateways to continue to provide communication and the use of more robust, newer hardware resources.
Some of the problems with the university’s disaster recovery were the lack of a system-wide backup plan, which led to widespread data loss; insecure physical premises, which led to damage including fall damage to computer equipment placed inappropriately close to other hazards; and environmental system failure, which led to the death of the lab animals. Because Kobe University is the first instance of formalized study of disaster recovery in information technology, there are a number of questions which arise from the planning and execution of the recovery.
What are the priorities of the business or organization when planning? How do you put into place organization-wide policies, such as data backup, which reduce the risk of failure? How do you deal with facilities and functions (such as public utility infrastructure) that are out of your control? A more recent demonstration of the importance of disaster preparedness and recovery was Hurricane Katrina, in 2005.
Chenoweth, Peters and Naremore (2006) analyzed the disaster preparedness and recovery response of a New Orleans hospital during the hurricane and the flooding that followed. East Jefferson General Hospital, located in Jefferson Parish, was one of three hospitals in New Orleans to remain open during and after the storm. The hospital planned for a two to three day emergency situation; staffers brought appropriate supplies for only a few days.
There were over 3,000 people, including staff, patients and community members, as well as a handful of pets, sheltering at the hospital by the time the storm hit New Orleans on August 28. The hospital’s IT staff worked quickly to move critical equipment out of harm’s way – they moved data center equipment to upper floors and PCs and other equipment away from windows, printed out hard copies of patient records, contact information and other vital data, and set up a hospital command post with PCs, telephones and fax machines for outside connectivity.
The hospital itself did not sustain a high degree of physical damage in the storm, in contrast with Kobe University. However, the infrastructure of the city itself was virtually destroyed, with electricity, telephone and water cut off, roads blocked and food and drinking water supplies tight. The hospital was isolated from the rest of the world for over a week as external recovery crews worked. East Jefferson General Hospital did have a written disaster recovery framework in place prior to Hurricane Katrina.
According to Chenoweth et al (2006), the IT department had a hot site arrangement with SunGard; weekly backups of the hospital’s data were stored in a local tape vault, occasionally retrieved for safe storage in SunGard’s offsite facility in New Jersey. Unfortunately, the evacuation of the vault’s staff left the tapes inaccessible. During the storm, the hospital lost first grid power and then generator power; communications were lost as the Bell South CO, then the onsite CO, and finally the hospital’s Cox internet cable connection went down.
The rapidly changing situation, according to the authors, forced a reprioritization of IT resources and efforts from internal systems maintenance to restoring and maintaining communication with the outside world. The IT staff found a usable dialup line and set up email access using some of the PCs on-site; they also leveraged spotty cellular service and messaging services to maximize communications, which allowed them to coordinate with rescue teams and officials and arrange for food, water and generator deliveries. The internal telephone system was also utilized to maintain communication throughout the hospital.
A secondary concern to the hospital, according to Chenoweth et al (2006), was its employees; particularly, circumventing the normal payroll system, which was inaccessible, in order to provide funds to employees who were suffering high expenses due to evacuation. This was accomplished by using the Internet to provide a funds transfer to each employee approximating their last paycheck. Similar workarounds were created for accounts receivable, with employees manually entering charges and emailing them to the system provider for processing.
The hospital’s outsourced IT provider also had its own issues to deal with. It had to locate missing employees, which it accomplished within three days by using a broadcast approach of Internet connections and message boards and by contacting family and friends of the staffers; by contrast, many other companies were still struggling to locate employees in November. It also had to prevent employee burnout by arranging for relief staffers. East Jefferson General Hospital’s IT infrastructure was back up and running only a week after the storm hit, and the hospital began providing patient services immediately.
Its disaster recovery framework, as well as quick thinking in repositioning the framework when it became clear that it did not match the profile of the disaster it was supposed to counter, was a clear factor in the hospital’s fast recovery and return to service. Following the experience during Katrina, the hospital’s IT staff investigated its disaster recovery framework and cited a number of changes which should be made, including increased emergency communications capacity, maintaining high-speed Internet access and implementing an automatic switching mechanism should one generator go down again.
Disaster Recovery Framework Design

The experiences of Kobe University and East Jefferson General Hospital clearly indicate the need for robust disaster recovery planning. While disaster recovery is not always a matter of life and death as it was in these two cases, it can often mean the difference between a company that recovers successfully and one that is driven out of business by a critical failure. How can a company begin to develop a disaster recovery framework, and how extensive does this framework need to be?
Benton (2007) suggested that the disaster recovery framework must begin with a formal business impact assessment. This assessment draws on the knowledge and experience of the IT staff and the CIO to determine what the critical pieces of IT infrastructure are for a given company. A business impact analysis (BIA) is a way in which the contribution or importance of a given business resource can be analyzed and expressed in dollars and cents terms, in order to allow corporate officers to determine the correct emphasis during disaster recovery.
The BIA also includes subjective observations of the resource’s importance, giving an overall view of the organization to the decision makers. The second piece of the decision-making process is the risk analysis. What kinds of disasters are likely, Benton asked, and how much damage are they likely to cause should they occur? Exactly how likely is a disaster to happen? Benton urged caution on this question; as he pointed out, the risk of being unprepared is potentially far greater than the cost of preparedness.
Rike (2003) discussed the risk analysis that should be performed before beginning a business impact analysis and disaster recovery planning. Risks should be analyzed in three different dimensions: the type of risk, the likelihood of the risk and the magnitude of the risk. Rike divided risk types into three general categories: natural threats and hazards, technical and mechanical hazards and human activities and threats. Rike noted that some types of disasters, such as those caused by human activities, cannot always be predicted, while others, such as common weather phenomena, can be planned for in advance.
The third dimension of risk analysis is the magnitude of the potential risk. Rike identified three categories of magnitude: community-wide disasters, such as the Kobe earthquake and Hurricane Katrina as discussed above; localized disasters confined to a building or a group of buildings, such as a water leak or electricity outage; and individual disasters affecting only a single organization, department or worker. A disgruntled worker sabotaging data exemplifies the last category. Rike (2003) outlined a proposed schedule and method for designing a disaster recovery framework.
The first step, obtaining top management buy-in and support, is critical in order to fund and implement the disaster recovery framework. It is also necessary for top staff to be informed of disaster recovery procedures because they will be ultimately responsible for its implementation. The second step Rike suggested was to establish a planning committee staffed with personnel from facilities, information technology and other critical departments who will be responsible for planning and implementing the policy. The third step in Rike’s method is to perform a risk assessment and conduct a BIA.
The risk assessment should include determining the type of risk the business is subject to and its likelihood, the consequences of each scenario, the estimated cost of each scenario, the replacement cost of data, equipment and staff recovery versus the cost of disaster framework implementation, and the potential risk of the worst-case scenario occurring. Rike’s fourth step is determination of critical business facilities: business equipment, connectivity through Internet and phone lines, internal phone system, fire and fumigant systems and other facilities required to continue to operate.
This step also includes the determination of disaster recovery procedures and documentation, vital records and personnel. Step five is the procurement and preparation of disaster recovery facilities, including offsite storage facilities, inventory of critical documents, policy and procedure manuals, master lists of staff contact information, vendor information, account numbers and other vital information, and a review of security and environmental systems. Step six is preparation of a written framework, taking into account the information gathered in steps one through five.
Rike recommended that a standard format and software package should be used to write the framework, rather than a customized solution. The framework should then be reviewed on a frequent basis to ensure continued alignment with company business and goals as well as changes to potential risk. The final step in Rike’s methodology is to test the written framework in order to make sure it is feasible. In order to begin developing a disaster preparedness framework, Benton suggested a company-wide IT inventory, detailing application, storage and server assets.
These assets could then be ranked into categories depending on the importance of the business application and replacement cost of the equipment. There are two main ranking criteria. Recovery time objective (RTO) is the maximum acceptable amount of time between disaster and service resumption. Recovery point objective (RPO) is the maximum allowable amount of data loss, usually expressed as the span of time before the disaster for which data may be lost. Benton recommended a multi-tier system; at the top level should be no data loss and minimal downtime, or an RTO and RPO of close to 0, reserved for mission-critical services and business units that provide immediate revenue for the company.
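A multi-tier scheme of this kind can be sketched in a few lines of Python. This is only an illustration: the tier names, the number of tiers and the hour thresholds below are hypothetical and are not taken from Benton. Each tier promises a worst-case RTO and RPO, and a service is placed in the most relaxed (and presumably cheapest) tier that still meets its requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_rto_hours: float  # longest outage this tier allows
    max_rpo_hours: float  # longest window of lost data this tier allows


# Hypothetical tiers, ordered from strictest to most relaxed.
TIERS = [
    Tier("tier-1 mission critical", max_rto_hours=1, max_rpo_hours=0),
    Tier("tier-2 important", max_rto_hours=24, max_rpo_hours=4),
    Tier("tier-3 deferrable", max_rto_hours=96, max_rpo_hours=24),
]


def assign_tier(required_rto_hours: float, required_rpo_hours: float) -> Tier:
    """Return the most relaxed tier that still satisfies both requirements."""
    for tier in reversed(TIERS):  # try the most relaxed tier first
        if (tier.max_rto_hours <= required_rto_hours
                and tier.max_rpo_hours <= required_rpo_hours):
            return tier
    raise ValueError("no tier is strict enough for these requirements")


if __name__ == "__main__":
    # Hypothetical services with (RTO, RPO) requirements in hours.
    for service, rto, rpo in [("order entry", 1, 0), ("email", 24, 8), ("wiki", 120, 48)]:
        print(service, "->", assign_tier(rto, rpo).name)
```

Searching from the most relaxed tier upward keeps expensive top-tier capacity for the services that genuinely need it.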
Business units should then be ranked in descending order according to their revenue generating potential and criticality. At its lowest level, Benton suggested that the RTO could be extended out to 72-96 hours. Rike (2003) identified key questions to use when conducting the BIA, including “how would the department in question operate if online systems were not available?” and “what is the minimum space required for the department to operate?” Benton prioritized two critical preplanning steps for disaster recovery.
The first was data consolidation, or optimizing the protection of data by assembling all critical data in a single location for ease of backup and recovery. This can be established by use of a centralized file server in a small organization or use of a SAN or NAS scheme in a larger one. The second prerequisite, which can be more complicated than storage consolidation, is server consolidation. This step can be complicated because the performance profile of servers can vary, and processing and network access can vary between them. Benton further discussed the complexities of disaster recovery of data.
Among the problems he noted are difficulties with logical consistency and order of recovery. If standard file backup technologies are used, these backups may not be logically consistent when they are recovered because they will be recovered to a slightly different point in time. Newer snapshot technologies can alleviate this problem, however. Another inconsistency issue is data replication, which may be interrupted when the write heads lose power. Finally, order of recovery will be important because some applications and servers will be dependent on other servers being restored first in order to maintain logical consistency.
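The order-of-recovery problem is a dependency-ordering problem, which can be sketched with a topological sort. The example below is an illustration only; the system names and their dependencies are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), which takes a mapping from each node to the nodes that must come before it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be restored first.
DEPENDS_ON = {
    "storage": [],
    "directory": ["storage"],
    "database": ["storage"],
    "app-server": ["database", "directory"],
    "web-frontend": ["app-server"],
}


def recovery_order(depends_on):
    """Return a restore sequence in which every dependency comes first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, system in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(step, system)
```

A circular dependency raises an error rather than producing an order, which is worth discovering during planning and not during an actual recovery.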
Benton also noted that disaster recovery should be maintained separately from periodic backups and archival procedures, because data storage procedures for periodic backups and archival procedures may not be adequate or appropriate for disaster recovery. Finally, Benton remarked that hardware designated for disaster recovery should be exercised in a non-emergency situation in order to ensure that it is properly configured and connected. Rike (2003) recommended a course of action in the event that the disaster recovery framework needs to be put into action following a physical disaster.
The first step in Rike’s method is to perform a damage assessment in order to determine the scope and type of damage, the size of the area affected and what assets have been damaged. Rike’s second step is damage control by environment stabilization. In the event of physical damage, the damage can become permanent very quickly. Rike suggested that the physical environment must be stabilized by drying the air, removing water and soot particles, restoring air conditioning and whatever other cleanup can be performed.
She suggested that material such as power generators, sump pumps to remove standing water, high-powered fans, plastic sheeting, absorbent materials and other cleanup equipment should be kept on hand in order to speed environmental stabilization. Once the environment is stable, Rike prioritized activation of the emergency team as defined in the disaster recovery framework, and then restoration and cleanup; this cleanup can in some cases be performed by business staff, but in some cases, such as a toxic spill or mold contamination, should be handled by specially trained professionals.
While Rike discussed physical disaster recovery resulting from primarily natural or mechanical threats, Patnaik and Panda (2003) discussed data recovery from a malicious attack, addressing the human threat perspective. Malicious attack on data and application resources can come either from within the business (most often from a disgruntled employee) or outside the business (hackers or industrial spies). As Patnaik and Panda noted, it is not necessarily possible to distinguish a malicious attack from a legitimate data transaction.
According to the authors, requirements for protecting data from malicious attack include protection from unauthorized users, detection of hostile activities and damage recovery. Unfortunately, as the authors noted, in the case of a database storage system it is not always possible, even with these precautions in place, to catch all potential malicious transactions. This is particularly problematic when the malicious actor is someone who has trusted access to a system. If a malicious transaction is committed to the database, it is then seen as legitimate and may be propagated to other areas of the database through normal interactions.
In order to prevent this spread, a quick recovery is required. Unfortunately, the authors noted, the size of database logs often precludes a fast recovery, due to extended periods of time spent accessing and applying the logs. In order to remedy this, Patnaik and Panda proposed a partitioned or segmented log solution which allows recovery from a malicious transaction to access only one of the log segments, rather than the full logs. This reduces recovery time by an order of magnitude compared with applying the full redo log, according to the authors.
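The text above does not spell out how Patnaik and Panda partition the log, so the following Python sketch should not be read as their algorithm. It only illustrates why segmenting helps: recovery scans one segment instead of the whole log. Everything in it is an assumption made for illustration: keys are assigned to segments by a caller-supplied function `segment_of`, every transaction writes within a single segment, and no later transaction has read the damaged values.

```python
from collections import defaultdict
from typing import NamedTuple


class LogRecord(NamedTuple):
    txn_id: str
    key: str
    before: object  # value prior to the write
    after: object   # value written


class SegmentedLog:
    """Write log split into segments so that recovery scans only one of them."""

    def __init__(self, segment_of):
        self._segment_of = segment_of       # maps a key to a segment name
        self._segments = defaultdict(list)  # segment name -> records in write order
        self._txn_segment = {}              # transaction id -> segment holding it

    def append(self, record):
        segment = self._segment_of(record.key)
        # Assumption: a transaction never writes to more than one segment.
        if self._txn_segment.setdefault(record.txn_id, segment) != segment:
            raise ValueError("transaction spans segments")
        self._segments[segment].append(record)

    def undo(self, txn_id, database):
        """Roll back txn_id in place; return how many log records were scanned."""
        records = self._segments[self._txn_segment[txn_id]]
        for record in reversed(records):
            # Restore only values that still hold what the transaction wrote,
            # so later legitimate overwrites are left alone.
            if record.txn_id == txn_id and database.get(record.key) == record.after:
                database[record.key] = record.before
        return len(records)
```

With keys such as `payroll:alice` and `orders:1` and a `segment_of` that returns the prefix before the colon, undoing a bad payroll transaction scans only the payroll segment, however large the orders segment has grown.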
Disaster recovery is a relatively inexpensive method of assuring business continuity in the wake of a natural, physical or human event or attack. The cost of not having a disaster recovery framework is, as Rike (2003) noted, extremely high: 93% of businesses which suffer a major data loss go out of business within five years. The experiences of Kobe University and East Jefferson General Hospital demonstrate the value of a disaster recovery framework, as well as the importance of examining priorities when deciding on the framework.
While physical premises may be covered by insurance in some cases, the same is not typically true for data, institutional knowledge, continued business and personnel. In order to implement a disaster recovery framework, one can follow Rike’s (2003) methodology, beginning with gaining the support of senior staff and appointing a disaster recovery planning committee, then performing a risk analysis and a BIA, determining and putting in writing a disaster recovery framework, and finally testing the framework to ensure its viability.
These steps will help to protect the business in the event of a disaster, whether it is natural, mechanical or human in origin, and whether it is localized or community-wide.

Research Proposal

A business needs to determine whether a disaster recovery framework is appropriate for it, and to weigh the relative risks and costs of implementing such a framework against those of replacing lost business assets and personnel in the event of a disaster. Following steps three and four of Rike’s methodology will provide a determination of the utility of a disaster preparedness framework for a given business.
In order to perform this analysis, the assent of senior staff members should be obtained. This analysis can be conducted in the following manner. First, perform Rike’s third step, that of risk analysis and assessment. This assessment should evaluate the potential threat to the business and its effects in three dimensions: type of threat (natural, mechanical or human), magnitude of threat (individualized, localized, community-wide), and likelihood (certain, likely, unlikely, extremely unlikely). Questions that should be asked during this risk assessment include:
• What is the natural environmental pattern of the geographic area? Is the area subject to earthquakes, flooding, hurricanes or other natural phenomena?
• Are current environmental control provisions such as Halon systems and fire detection systems up to date?
• How likely is attack by a human threat? Does the company tend to have disgruntled workers, or not? How much access does any individual worker have to the data and application servers?
• What is the replacement cost of data, equipment and staff versus the cost of disaster recovery framework implementation? What is the potential for the worst-case scenario to occur?
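One way to record the answers to these questions is a simple risk register, sketched below in Python. The three dimensions (type, magnitude, likelihood) and their labels come from the text above; the numeric probabilities attached to the likelihood labels, the example risks and the cost figures are hypothetical placeholders that a planning committee would have to replace with its own estimates.

```python
from dataclasses import dataclass

# Hypothetical yearly probabilities for the four likelihood labels.
LIKELIHOOD = {
    "certain": 1.0,
    "likely": 0.5,
    "unlikely": 0.1,
    "extremely unlikely": 0.01,
}


@dataclass
class Risk:
    name: str
    kind: str        # "natural", "mechanical" or "human"
    magnitude: str   # "individualized", "localized" or "community-wide"
    likelihood: str  # one of the LIKELIHOOD labels
    loss: float      # estimated replacement cost if the event occurs

    def expected_annual_loss(self) -> float:
        return LIKELIHOOD[self.likelihood] * self.loss


def framework_pays_off(risks, annual_framework_cost: float) -> bool:
    """Compare total expected yearly loss with the yearly cost of the framework."""
    return sum(r.expected_annual_loss() for r in risks) > annual_framework_cost


if __name__ == "__main__":
    register = [
        Risk("hurricane flooding", "natural", "community-wide", "unlikely", 2_000_000),
        Risk("insider sabotage", "human", "individualized", "extremely unlikely", 500_000),
    ]
    print(framework_pays_off(register, annual_framework_cost=50_000))
```

An expected-loss comparison of this kind is deliberately crude. It says nothing about whether the business would survive the worst case, which is the question the last bullet asks separately.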
After the risk analysis is complete, step four of Rike’s methodology, determination of critical business resources, should be implemented. This step includes asking the following questions:
• What is the minimum number of servers and the minimum Internet connectivity, communications capacity, space, documentation, data and staff the company can continue to operate on?
• Who is the critical staff? What is the critical data? How many single points of failure are there?
The business impact analysis or BIA, the second part of Rike’s third step, is the final method of analysis in determining the benefit of the disaster recovery framework to an individual organization. The BIA examines each aspect of a business’s function and determines which functions are critical to the business’s continued operation, as well as which functions can be brought back online after the most critical operations are stabilized. This examination should include all facets of a business, including seemingly unimportant functions such as facilities management, janitorial access and human resources records access.
Business functions should be ranked on a matrix of direct and immediate benefit to the business, determined by their immediate monetary value as well as subjective perceptions of importance. Using a combination of a risk and cost analysis to determine the likelihood of risk occurring and the cost of implementation versus non-implementation, a business needs analysis to determine critical business requirements, and a BIA to determine critical business functions, it will be possible to determine whether a disaster recovery framework makes sense for a given business, as well as what type of disaster recovery framework should be implemented.
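The ranking of business functions by monetary value and subjective importance can be sketched as a weighted score. The Python example below is a minimal illustration under stated assumptions: the 1-to-5 importance scale, the 70/30 weighting and the sample figures are hypothetical and do not come from Rike or Benton.

```python
from dataclasses import dataclass


@dataclass
class BusinessFunction:
    name: str
    hourly_revenue_impact: float  # money lost per hour of downtime
    importance: int               # subjective rating, 1 (low) to 5 (high)


def rank_functions(functions, money_weight=0.7):
    """Order functions from most to least critical."""
    # Fall back to 1.0 so that an all-zero column does not divide by zero.
    top = max(f.hourly_revenue_impact for f in functions) or 1.0

    def score(f):
        money = f.hourly_revenue_impact / top  # scaled to the range 0..1
        opinion = (f.importance - 1) / 4       # scaled to the range 0..1
        return money_weight * money + (1 - money_weight) * opinion

    return sorted(functions, key=score, reverse=True)


if __name__ == "__main__":
    ranked = rank_functions([
        BusinessFunction("janitorial access", 0, 1),
        BusinessFunction("order entry", 5000, 5),
        BusinessFunction("HR records", 200, 2),
        BusinessFunction("payroll", 0, 4),
    ])
    print([f.name for f in ranked])
```

Keeping the subjective rating as a separate input lets functions with no direct revenue, such as payroll, still rank above functions that staff consider less important.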
It is the author’s contention that disaster recovery planning makes sense for every business, and should be implemented at a level that will ensure business continuity and hasten recovery should a disaster occur. Customization of disaster recovery planning should be done using the risk, cost and business needs analysis to create a framework that will allow the business to secure its own interests in the event of a small or large disaster.
No disaster recovery framework is perfect, and there can always be situations that remain unconsidered, as East Jefferson General Hospital’s experience showed. However, having an initial disaster recovery plan in place made it easier to reprioritize resource allocation when there were unexpected issues. As von Moltke remarked, “no plan survives contact with the enemy”; but that is no reason not to plan.