Definition

reliability, availability and serviceability (RAS)

By Paul Kirvan

What is reliability, availability and serviceability (RAS)?

Reliability, availability and serviceability (RAS) is a set of related attributes that must be considered when designing, manufacturing, purchasing and using a computer product or component. The term was first used by IBM to define specifications for its mainframes and originally applied only to hardware. Today, RAS is relevant to software as well and can be applied to networks, applications, operating systems (OSes), personal computers, servers and even supercomputers.

The three components of the term mean different things. Together they describe the level at which a user can expect a computer component or software to perform.

How does RAS work?

Each part of the term reliability, availability and serviceability describes a specific type of performance for computer components and software.

Reliability

The term reliability refers to the ability of computer hardware and software to consistently perform according to certain specifications. More specifically, it measures the likelihood that a specific system or application will meet its expected performance levels within a given time period.

In theory, a reliable product is free of technical errors. In practice, vendors commonly express product reliability as a percentage. The IEEE sponsors the IEEE Reliability Society (IEEE RS), an organization devoted to reliability in engineering.

Chart showing how the 9s translate to network downtime — The nines are used to calculate the percentage of network availability guaranteed in a service-level agreement or other contract. They can be translated into quantifiable hours, minutes and seconds of allowable network services downtime.

Mean time between failures (MTBF) is one metric used to measure reliability. For most computer components, the MTBF is thousands or tens of thousands of hours between failures. The longer the uptime between system outages, the more reliable the system is. MTBF is calculated by dividing the total uptime hours by the number of outages during the observation period.
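The MTBF calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function name `mtbf_hours` and the sample figures (8,700 uptime hours, 3 outages) are assumptions chosen for the example.

```python
def mtbf_hours(total_uptime_hours: float, outages: int) -> float:
    """Mean time between failures: total uptime divided by the number of outages."""
    if outages <= 0:
        # With no outages in the observation period, MTBF cannot be computed this way.
        raise ValueError("MTBF needs at least one outage in the observation period")
    return total_uptime_hours / outages


# Hypothetical observation period: 8,700 hours of uptime interrupted by 3 outages.
print(mtbf_hours(8700, 3))  # 2900.0
```

A longer observation period generally gives a more trustworthy figure, since a handful of outages can skew the average.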

Service-level agreements and other contracts often use the nines to describe guaranteed levels of reliability and availability. For instance, five nines means a level of 99.999% is being promised: the system or component in question will be available 99.999% of the time. Such a system can be down for only a little more than five minutes a year, so five nines is a high level of reliability. Organizations relying on high-availability systems often require a minimum of four nines, or less than an hour of downtime per year.
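To see how the nines translate into allowable downtime, the arithmetic can be sketched as follows. The function name `allowed_downtime_minutes` is hypothetical, and the sketch assumes a 365-day year with availability measured around the clock.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Print the yearly downtime budget for two through five nines.
for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                   ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {allowed_downtime_minutes(pct):.2f} minutes per year")
```

Under these assumptions, five nines works out to roughly 5.26 minutes per year and four nines to roughly 52.56 minutes, which matches the "five minutes" and "less than an hour" figures above.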

Availability

Availability is the ratio of the time a system or component is functional to the total time it is required or expected to function. This can be expressed as a proportion, such as 9/10 or 0.9, or as a percentage, which in this case would be 90%.

To calculate the availability of a component or software program, divide the actual operating time by the amount of time it was expected to operate. For example, if a device is working for 50 minutes out of an hour, it has 83.3% availability. MTBF can be used to describe availability as well as reliability: for a given repair time, a higher MTBF means higher availability.
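The availability calculation can be sketched as shown here. The first function follows the operating-time ratio described above. The second uses the commonly cited steady-state formula MTBF / (MTBF + MTTR), which is not stated in this article and is included as an assumption to show how MTBF and repair time combine. Both function names and the sample figures are hypothetical.

```python
def availability(operating_time: float, expected_time: float) -> float:
    """Availability as the ratio of actual operating time to expected operating time."""
    if expected_time <= 0:
        raise ValueError("expected_time must be positive")
    return operating_time / expected_time


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Assumed formula: availability = MTBF / (MTBF + MTTR), both in the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr must not be negative")
    return mtbf / (mtbf + mttr)


# The article's example: a device working for 50 minutes out of an hour.
print(f"{availability(50, 60):.1%}")  # 83.3%

# Hypothetical figures: 1,000 hours between failures, 1 hour to repair.
print(f"{steady_state_availability(1000, 1):.3%}")  # 99.900%
```

The second function makes the dependence explicit: raising MTBF or lowering repair time both push availability up.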

Sometimes availability is expressed in qualitative terms. For instance, it might measure the extent to which a system can continue to work when a significant component or set of components is unavailable or not operating.

List of four availability management metrics — System and software availability are measured by several different metrics. See four important ones here.

Serviceability

Serviceability is the ease with which a component, device or system can be maintained and repaired. Early detection of potential problems is a critical factor in serviceability. In determining serviceability, it's important to consider how easy it is to do the following:

Diagnose issues.
Repair problems.
Obtain parts.
Take a system down to effect repairs.
Return it to operation.

Mean time to repair (MTTR) is a metric used to measure serviceability. It's calculated by taking the total amount of time spent on repairs in a given time period and dividing it by the number of repairs. For example, if 20 minutes of time is spent on repairs resulting from two outages, the MTTR is 10 minutes.
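The MTTR calculation can be sketched the same way. The function name `mttr_minutes` is hypothetical; the figures come from the example above.

```python
def mttr_minutes(total_repair_minutes: float, repairs: int) -> float:
    """Mean time to repair: total repair time divided by the number of repairs."""
    if repairs <= 0:
        raise ValueError("MTTR needs at least one repair in the period")
    return total_repair_minutes / repairs


# The article's example: 20 minutes of repair time across two outages.
print(mttr_minutes(20, 2))  # 10.0
```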

Some systems are self-monitoring and use diagnostics to automatically identify and correct software and hardware faults before more serious trouble occurs. For example, OSes such as Microsoft Windows include built-in features that automatically detect and fix computer issues, and the autoprotect features of antivirus and antispyware software include detection and removal programs. Ideally, maintenance and repair operations cause as little downtime or disruption as possible.

Descriptions of data center uptime tiers — Data centers use uptime tiers to ensure the right levels of availability are tied to specific components, systems and software.

Important RAS features and design elements

There are many ways to improve RAS, and availability and reliability in particular. These include deploying computer systems and subsystems with more powerful CPUs, multiple processors and memory modules, component redundancy, error detection firmware and error-correcting code.

Some of the key ways that RAS is designed into hardware and software are the following:

Overengineering. Systems are designed beyond the minimum specifications.
Duplication. Extensive use of redundant systems and components eliminates single points of failure and improves RAS.
Recoverability. Fault-tolerant engineering methods let systems keep operating, or recover quickly, when a fault occurs.
Automatic updating. These systems keep OSes and critical applications current without user intervention.
Data backup. Effective data backup prevents catastrophic loss of critical information and maintains data integrity.
Data archiving. Archiving systems ensure older data is available when needed for audits and recovery needs.
Power-on replacement. This is the ability to hot swap components or peripherals, making upgrades and repairs easier.
Virtual machines (VMs). The use of VMs minimizes the impact of OS and software issues.
Surge suppressors. These minimize the risk of component damage resulting from power anomalies.
Continuous power. An uninterruptible power supply lets systems remain operational when there is an interruption in the regular power supply.
Backup power sources. Batteries and generators keep systems operational during extended power interruptions.
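As a rough illustration of why duplication improves availability, the sketch below estimates the availability of several redundant copies of a component. It rests on a simplifying assumption not made in this article: the copies fail independently, and the system stays up as long as at least one copy is working. The function name `redundant_availability` and the 99% figure are hypothetical.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures.

    The system is down only when every copy is down at the same time.
    """
    if not 0 <= component_availability <= 1:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


# Hypothetical component that is available 99% of the time.
for n in (1, 2, 3):
    print(f"{n} copies: {redundant_availability(0.99, n):.4%}")
```

Under the independence assumption, two 99% components in parallel reach about 99.99%. Real systems often share power, cooling or software, so correlated failures make the actual gain smaller.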

The RAS concept is particularly important when designing a data center.

This was last updated in April 2023

