Ola gringos.
This week we performed the usual installation of Windows Server 2003 and all the trimmings, as well as adding an FTP site and testing it.
Class notes as below.
Disaster Recovery
Introduction
What is a disaster? Most organisations now rely on their IT and network infrastructure for their usual business processes to continue. Certainly on September 11, 2001 when the World Trade Centre was attacked, the disaster was truly catastrophic for all the organisations housed in the World Trade Centre towers – staff were killed, computer systems and data were destroyed.
Not all disasters are so catastrophic – for example, the failure of the hard drive in a domain controller would be considered a disaster, as users would be unable to gain access to network resources. For the process of establishing a disaster recovery plan, a disaster may be defined as any unplanned interruption to normal business procedures that results from an interruption to the IT and network infrastructure that supports these business processes. This can include the system hardware and software components, the data, the staff that help maintain these systems and the buildings that house those systems.
A disaster recovery plan (DRP) is a plan that an organisation develops and maintains to reduce the impact of a disaster and reduce the amount of time taken to recover from the disaster. The DRP will have two main goals:
• Firstly, to prevent disruption from events that can be anticipated
• Secondly, to reduce the impact of disruptive events that cannot be avoided by documenting the steps to follow in the event of a disaster
Disaster recovery planning is often mentioned along with another term – business continuity planning and the development of a business continuity plan (BCP). The differences between disaster recovery planning and business continuity planning are not that clearly defined and vary between organisations. For the purposes of this Unit, we will work with the definition that disaster recovery planning focuses on the recovery of the IT system infrastructure to the state that it was in before the disaster struck, to support the recovery of the business. Business continuity planning usually has a broader focus, being concerned with ensuring that the organisation can continue all business activities after a disaster.
For example, think of a local council public library. The core business of this library is to provide the service of lending books, along with providing reference and resource materials to the general public. If there was a disaster in the library such as a fire, the DRP would detail strategies to manage the restoration of the IT functions of the library, such as the membership and catalogue details stored electronically. The BCP would detail strategies to ensure that the library could resume its core business of lending books to the public, possibly from a new location and with new books. The two plans are interrelated but have separate outcomes.
In this lesson you will look at disaster recovery planning and how as a network systems administrator you can provide input into the planning of the DRP and provide information on this DRP to the system users.
Why plan for disaster
Organisations are now very dependent on their IT services for the conduct of their daily business. Some of the disasters that can occur include:
• Natural disasters such as fire, flood, earthquake, lightning, landslide, severe windstorm, hurricane or tsunami
• Failure of resources internal to the organisation such as equipment or network failure
• Failure of a resource external to the organisation such as power failure or telecommunication failure over which the organisation has no control
• User errors
• Criminal activities by people either internal to the organisation or external to the organisation such as hackers
• Attacks from viruses or other malware
It is necessary to plan how the organisation will recover from a disaster for several reasons including:
• Interruption to business: The service or goods provided by the organisation will be disrupted, with potential loss of customers who are inconvenienced by the lack of service or late supply of goods
• Financial loss: If the organisation can’t fulfil its normal business processes, its income can be affected very quickly
• Legal responsibility: Organisations have legal responsibilities such as the maintenance of records
Developing the disaster recovery plan
Organisations need to take every possible measure to ensure efficient and effective recovery in the event of a disaster. The following steps provide a guideline for developing the disaster recovery plan.
Develop the planning policy statement
Developing a DRP will require the input of different groups within the organisation, including the network administrator, management and other interested parties. It is important that the DRP is developed with the support of the organisation’s management, as an effective plan requires input from many sources and will also require management approval for expenditure.
During this step the scope of the DRP will be established. Large organisations may develop separate disaster recovery plans for different subsystems of their IT services.
Conduct a risk analysis
The business processes of the organisation will be identified at this stage with input from users, department managers and senior management.
As a network administrator, your input will be required to identify the IT resources that support these business processes. At this stage all the IT systems and components need to be identified, usually through an auditing process. The possible threats or risks are identified and the likelihood of that event occurring is assessed.
The impact of the loss then needs to be identified and allowable outage times defined. For example, consider the event of a domain controller failing and users not being able to connect to the network to use the organisation's accounting application to send out the monthly invoices. This will have a lot more impact on the organisation than the failure of a switch with five users attached.
Identify preventative controls
Preventative controls are measures that can be taken to reduce the effects of system disruptions and increase system availability. The risk assessment can help identify areas where the impact of a risk can be removed or lessened by preventative controls. For example, the risk assessment may have identified that the impact of the domain controller failing would be a lot less if another domain controller was installed.
Develop recovery strategies
Recovery strategies ensure that the system may be recovered quickly and effectively following a disruption. Strategies that can be considered here include data recovery strategies such as backup and the use of an alternate site in the event of a catastrophic disaster.
Develop the disaster recovery plan
The disaster recovery plan is a formal document that contains detailed guidance and procedures for restoring a damaged system.
Testing the disaster recovery plan
It is important that the DRP is tested, both to identify any gaps in the planning and to allow staff to be trained in the DRP procedures.
Maintaining the disaster recovery plan
As systems are upgraded, the DRP should be updated.
Assessing the risk
The process of risk assessment is a series of steps:
• identify components
• identify threats
• assess likelihood
• consider the impact
After the risks are assessed, preventive controls need to be identified and disaster recovery strategies developed.
Identify components
The first step in assessing the risk is to identify and document the major components of the network required to support the business processes. As a network administrator, your input will be required to help complete this documentation. Details will be required of:
• computer hardware
• software
• network hardware
• data
• people
The documentation should include network diagrams and building floor plans showing the location of equipment.
Identify threats
The next step is to identify the threats and risks that exist for each of the components. Some examples include:
• theft
• vandalism
• fire
• flood
• power loss
• unauthorised access
Assess likelihood
The next step is to assess how likely the occurrence of the threat might be. This assessment is usually based on previous history – look at whether this event has occurred before. A ranking scheme will be used similar to the scheme below:
• low – unlikely to happen
• medium – may happen
• high – threat is likely to occur
Consider the impact
The next step is to consider the impact of each event. A ranking scheme would also be used here:
• very serious – critical business functions cannot be performed
• serious – normal operations are disrupted
• non-critical – the disruption can be dealt with by other methods
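The likelihood and impact rankings above can be combined into a single priority for each threat. The sketch below is one possible way to do that, not a method from these notes: the numeric weights, the `risk_score` function and the sample threats are all assumptions made for illustration.

```python
# Illustrative only: the numeric weights and the sample threats are assumptions.
LIKELIHOOD = {"low": 1, "medium": 2, "high": 3}
IMPACT = {"non-critical": 1, "serious": 2, "very serious": 3}


def risk_score(likelihood, impact):
    """Multiply the two rankings to get a simple priority score (1 to 9)."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]


# Hypothetical register: (component, threat, likelihood, impact)
register = [
    ("domain controller", "disk failure", "medium", "very serious"),
    ("five-user switch", "hardware failure", "medium", "non-critical"),
    ("server room", "flood", "low", "very serious"),
]

# Highest score first, so the most urgent risks are dealt with first.
ranked = sorted(register, key=lambda r: risk_score(r[2], r[3]), reverse=True)

for component, threat, likelihood, impact in ranked:
    print(f"{risk_score(likelihood, impact)}  {component}: {threat}")
```

Multiplying the two rankings is only one convention; an organisation might equally use a lookup table that treats any "very serious" impact as top priority regardless of likelihood.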
Preventive Controls
The table below details some of the preventive controls that might be identified. As administrator, you will be able to advise what procedures are already in place and identify areas where preventive measures will need to be implemented.
Preventative measure | Protects against
RAID disk array | Disk failure, depending on the level of RAID installed
Surge protector | Power fluctuations
UPS | Short power outages and power fluctuations
Generator | Long power outages
Installation of antivirus software | Virus threats
Redundancy | Component failure
User security | Unauthorized access to the system
Access control | Unauthorized access to data
Encryption | Unauthorized access to data
Redundancy
Redundancy is the duplication of information or hardware components to ensure that should a primary resource fail, a secondary resource can take over its function. By introducing redundancy to a system, the fault tolerance of the system is increased. Fault tolerance is the ability of a computer system to withstand and recover from the failure of one or more components and continue operating correctly.
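To see why redundancy raises fault tolerance, the availability of duplicated components can be estimated. The sketch below assumes that the components fail independently of each other and that one working component is enough to keep the service running. The 99% figure is a made-up example, not a measurement.

```python
def combined_availability(single, copies):
    """Chance that at least one of `copies` identical components is working.

    Assumes failures are independent, which real systems only approximate
    (a fire or power loss can take out every copy at once).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    return 1.0 - (1.0 - single) ** copies


# Hypothetical figure: one domain controller that is up 99% of the time.
for n in (1, 2, 3):
    print(n, "controller(s):", round(combined_availability(0.99, n), 6))
```

The independence assumption is the weak point: it is why the notes go on to cover power protection and offsite backups, which deal with events that hit every copy together.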
RAID (Redundant Array of Independent Disks)
The concept of RAID was developed by researchers at the University of California at Berkeley in the 1980s and was originally known as Redundant Arrays of Inexpensive Disks. The fundamental principle of RAID is to combine two or more hard drives into a single logical unit providing fault tolerance and/or improved performance. RAID technology uses three techniques:
• mirroring
• parity
• striping
Mirroring – the system writes the same data to different disks at the same time. If one disk fails, the system can operate from the working drive. Mirroring provides data redundancy but does not improve write performance. It has a high overhead cost as 50% of the disk capacity in the array is reserved for duplicate data.
Parity – a technique for detecting whether data has been lost or overwritten by storing additional check information alongside the data. In RAID, parity calculated across the drives also allows the contents of a failed drive to be rebuilt.
Striping – a technique where bytes or groups of bytes are distributed across multiple drives, so more than one disk is reading and writing simultaneously, which improves the data transfer performance. Striping on its own provides no fault tolerance.
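The way striping and parity work together can be shown with a few bytes. This is a minimal sketch, not how any particular RAID controller is implemented: it uses XOR parity (the scheme generally used by parity-based RAID levels), treats Python byte strings as "drives", and the helper names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, drive_count):
    """Deal the bytes out round-robin, one byte per drive in turn."""
    return [data[i::drive_count] for i in range(drive_count)]


data = b"DISASTER"                 # 8 bytes split evenly over 2 data drives
drives = stripe(data, 2)           # [b"DSSE", b"IATR"]
parity = xor_blocks(drives)        # stored on a third drive

# Simulate losing drive 0, then rebuild it from the survivor and the parity.
rebuilt = xor_blocks([drives[1], parity])
assert rebuilt == drives[0]
print("rebuilt drive 0:", rebuilt)
```

The rebuild works because XOR-ing a value with itself cancels out, so combining the surviving drive with the parity leaves only the missing drive's bytes.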
RAID can be implemented as either:
• hardware RAID which includes a set of disks and a separate dedicated RAID disk controller and will appear to the operating system as one hard drive
• software RAID which uses software (usually provided by the operating system) to implement and control RAID over two or more disks
The more expensive RAID systems support hot swapping, which means that a drive can be replaced while the rest of the system is still functioning. These drives are known as hot swappable.
RAID levels
There are various levels of RAID that can be implemented, which offer different levels of fault tolerance, performance, reliability and cost. These levels are summarised in the table below:
RAID Level | Description | Comments
0 | Data is striped across multiple drives | Faster performance, no fault tolerance
1 | Data is mirrored across multiple disks (usually two) | High fault tolerance
3 | Data is striped across three or more drives at a byte level with the parity information written to a dedicated parity disk | Tolerates the failure of one drive; all parity is written to the one dedicated drive
4 | Similar to RAID 3, but the striping is implemented at a block level | Not commonly used
5 | Data and parity are striped across three or more drives | Widely used; if one drive fails, the data from the failed drive can be rebuilt from the data stored on the other drives in the array
Further levels of RAID are available by using arrays that combine the techniques defined above. For example, RAID 10 is RAID 1 + 0: the drives are mirrored in pairs for fault tolerance (RAID 1), and data is striped across the mirrored pairs for performance (RAID 0).
Implementing RAID systems does not replace having backup procedures. Although most levels of RAID offer a degree of fault tolerance, they do not protect against disasters such as several hard disks failing at once, failures of supporting hardware or physical damage.
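The cost side of each level can be compared by working out how much of the raw disk space is left for data. The function below is a rough sketch that assumes all drives in the array are the same size; the name `usable_capacity` and the 500 GB drives are just for illustration.

```python
def usable_capacity(level, drives, size_gb):
    """Approximate usable space in GB for an array of identical drives."""
    if level == 0:                       # striping only, nothing reserved
        if drives < 2:
            raise ValueError("RAID 0 needs at least 2 drives")
        return drives * size_gb
    if level == 1:                       # every drive holds the same data
        if drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return size_gb
    if level in (3, 4, 5):               # one drive's worth of parity
        if drives < 3:
            raise ValueError(f"RAID {level} needs at least 3 drives")
        return (drives - 1) * size_gb
    if level == 10:                      # mirrored pairs, then striped
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, 4 or more")
        return drives // 2 * size_gb
    raise ValueError(f"level {level} not covered by this sketch")


# Hypothetical array of four 500 GB drives.
for level in (0, 5, 10):
    print(f"RAID {level}: {usable_capacity(level, 4, 500)} GB usable")
```

With the same four drives, RAID 5 gives up one drive's worth of space to parity while RAID 10 gives up half, which is the 50% mirroring overhead mentioned earlier.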
Power Protection
Without power none of the computer or network systems will work. Problems in the power system include:
• Blackout: total loss of power. This may be for a few minutes, a few hours or, in the case of a severe natural disaster such as a bushfire, a few days
• Brownout: a condition where the voltage of the electrical supply is below the standard level. For computer equipment to work, the voltage level must remain in a specified range
• Surge: a condition where there is a momentary spike in the electrical supply which can be harmful to computer equipment. The most common time for this to occur is during a thunderstorm
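As a rough illustration of the three conditions, the sketch below labels a single voltage reading. The 230 V nominal supply and the 10% tolerance are assumptions chosen for the example; the real limits depend on the local supply standard and the equipment. A true surge is also momentary, so a monitoring tool would look at how long a reading lasts, which this sketch ignores.

```python
NOMINAL_VOLTS = 230.0   # assumed nominal supply; differs between countries
TOLERANCE = 0.10        # assumed acceptable deviation (10% either way)


def classify_supply(volts):
    """Label one voltage reading using the assumed thresholds above."""
    low = NOMINAL_VOLTS * (1 - TOLERANCE)
    high = NOMINAL_VOLTS * (1 + TOLERANCE)
    if volts <= 0:
        return "blackout"
    if volts < low:
        return "brownout"
    if volts > high:
        return "surge"
    return "normal"


for reading in (0, 190, 231, 400):
    print(reading, "V ->", classify_supply(reading))
```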
Preventive measures that organizations can take to protect against power problems include the installation of surge protectors (sometimes called surge arresters), uninterruptible power supplies (UPS) or generators.
Surge protector
A surge protector is a device that filters the incoming electrical supply by removing any surges. This is the cheapest option, but it does not provide any protection against a brownout or blackout.
Uninterruptible Power Supply (UPS)
A UPS is a device designed to provide a backup power supply from batteries for a short period of time in the event of a blackout. It is usually used to allow a proper shutdown of the equipment that it is protecting, ensuring that files and data are not corrupted. Depending on the configuration of the device it may also offer protection against a brownout. Most UPS devices include surge protection in their design.
There are two common types of UPS systems available – standby UPS and continuous UPS. The standby UPS runs the computer from the normal power source. Any drop in voltage is detected by the UPS, which switches over to battery power automatically. A standby UPS is cheaper and is suitable for home or small business use.
A continuous (sometimes called online) UPS runs the computer from the battery supply at all times while the battery is kept charged from the normal power source, which means that there is no switchover delay when the power fails.
Standby generator
A standby generator will allow for the provision of power to a site for an extended period. For critical operations, UPS systems will be backed up by a standby generator to ensure no downtime.
Recovery Strategies
Recovery strategies provide a means to restore IT operations quickly and effectively following a service disruption. The table below details some of the recovery strategies that might be identified. As a network administrator, you will be able to advise what procedures are already in place and identify areas where a recovery strategy needs to be developed.
Recovery strategy | Why used
Backup and recovery strategy | To restore data that has been lost or corrupted
Use of hot, warm or cold sites | To recover from a major disaster
Equipment replacement strategy | To replace stolen, damaged or faulty equipment
Backup and recovery strategy
The backup and recovery strategy needs to record details of the organization's backup and restore procedures including:
• what data is to be backed up
• backup frequency and method (full, differential or incremental)
• who is responsible for performing the backup
• location of onsite backup media
• location of offsite backup media and contact details if any
• backup test procedures
• restoration procedure
Offsite storage of backup media is important because a disaster such as a fire or flood could destroy both the original data and any backup kept onsite. The offsite location should be secure, and it should be possible to retrieve the media from it quickly.
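The choice between full, differential and incremental backups determines which media are needed for a restore. The sketch below works that out for a hypothetical week of backups, using the usual definitions: a differential holds everything changed since the last full backup, and an incremental holds only what changed since the previous backup. The function name and the schedule are invented for the example, and schedules that mix differentials and incrementals are deliberately not handled.

```python
def restore_chain(backups):
    """Return the backups needed, in order, to restore to the latest state.

    `backups` is a list of (label, kind) tuples in the order they were taken,
    where kind is "full", "differential" or "incremental". The sketch assumes
    the schedule combines full backups with only one of the other two kinds.
    """
    kinds = [kind for _, kind in backups]
    if "full" not in kinds:
        raise ValueError("a restore needs at least one full backup")

    # Start from the most recent full backup.
    last_full = len(kinds) - 1 - kinds[::-1].index("full")
    later = backups[last_full + 1:]
    later_kinds = {kind for _, kind in later}
    if later_kinds == {"differential", "incremental"}:
        raise ValueError("mixed schedules are not handled by this sketch")

    if "differential" in later_kinds:
        # Each differential contains every change since the full backup,
        # so only the newest one is needed.
        return [backups[last_full], later[-1]]
    # Each incremental only contains changes since the previous backup,
    # so every one of them is needed, in order.
    return [backups[last_full]] + later


# Hypothetical week: full backup on Sunday, incrementals on weekdays.
week = [("Sun", "full"), ("Mon", "incremental"), ("Tue", "incremental")]
print(restore_chain(week))
```

This is the trade-off behind the backup method recorded in the strategy: incrementals are quicker to take each night but need more media at restore time, while differentials grow through the week but need only two sets to restore.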
Use of hot, warm or cold sites
If the organization suffers a large-scale disaster such as the loss of a building to fire, the disaster recovery plan may include a strategy to perform the system operations at an alternate site.
Alternate sites are categorized by the level of readiness for system operation that they have, ranging from a cold site to warm site to hot site.
A cold site is the cheapest solution and provides only basic services. It has no IT equipment or infrastructure such as network cabling or office equipment and will need to be equipped before operations can resume.
A warm site is equipped with some or all of the equipment and services needed to begin operations. Before operations can be resumed computers and software will need to be installed and configured.
A hot site is a fully equipped site that is available 24 hours a day, 7 days a week. These sites allow an organization to continue normal operations within a very short period of time.
Equipment replacement strategy
The equipment replacement strategy will detail how IT equipment will be replaced if required. It will include details of any preferred suppliers or service level agreements and contact details. It will also include details of insurance policies and contact details.
Disaster recovery plan format
The disaster recovery plan will be a formal document that includes all of the details that we have discussed in this lesson. A typical plan contains the following sections:
• Title page
• Table of contents
• Version information – the version number, author and revision history
• Introduction – including what the major goals and objectives of the plan are
• Disaster recovery scope – details of what the plan covers
• Identification of the major business processes and associated IT resources
• Emergency notification procedures
• Disaster recovery team members – names, contact numbers, roles and responsibilities
• Procedure to declare a disaster
• Risk assessment
• Risk analysis
• Recovery priorities
• Preventative measures and procedures
• Recovery strategies and procedures
• Insurance details – details and contacts
• Vendor details
• Disaster recovery plan maintenance – who is responsible to update the plan
• Disaster recovery plan testing – what is the test plan and who will conduct the test
• Training – policies and procedures for training your organization's employees
• Conclusion
• Bibliography
• Appendices – copies of all the procedures that have been referenced
Testing the disaster recovery plan
The disaster recovery plan should be tested to verify the completeness of the plan. As a network engineer you will do part of this testing as you implement the recovery strategies developed.
Testing ensures that these strategies and procedures are correct and are understood by all the staff. Testing can identify problem areas that need to be rectified, so that the disaster recovery plan or procedure can be modified.
The test plan should identify how the tests will be conducted. For example, it may not be possible to test the failure of a domain controller or email server during working hours. The test may instead be scheduled out of hours, or the failure may be simulated by applying the procedure to an identical server that is offline.
As data and system backup is an integral part of disaster recovery, it should be tested regularly.
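One simple way to test a backup is to restore a file and confirm that it is identical to the original. The sketch below compares SHA-256 checksums using Python's standard `hashlib` module; the function names are made up for the example, and the demo uses temporary files rather than real backup media.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """True if the restored copy is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)


# Demo with temporary files standing in for an original and a restored copy.
with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "invoices.dat")
    restored = os.path.join(folder, "invoices_restored.dat")
    for path in (original, restored):
        with open(path, "wb") as handle:
            handle.write(b"monthly invoices")
    print("restore verified:", restore_matches(original, restored))
```

A check like this only proves that the files match at the moment of the test, so it complements rather than replaces a full restore rehearsal.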
Informing users about the disaster recovery plan
The disaster recovery plan is not normally available to all members of staff, but users need to be informed about the contents of the disaster recovery plan as it affects them.
Two areas of main concern for users are backup and viruses.
Backup
Users need to be informed where data should be stored so that it is backed up as part of the disaster recovery plan. They should also be informed of the frequency of the backup and the procedure to follow if they want to recover files from the backup media.
Viruses
Users need to be informed of the procedure to follow if the antivirus software detects a virus.
This information may be included in an organization's IT policy or published on the organization's IT intranet site.
No revision questions this week.
That's all for now folks......toodlepip geezers.
Monday, October 13, 2008