Resilience in IT

Introduction

Disruptive events in IT now occur more often. They cannot be prevented, and most companies have to deal with them eventually. The essence of resilience is to build a decisive advantage by being prepared for disruptive events. Companies today need to pay attention to their environments, generate unique strategic options, and align their resources appropriately to gain advantages over competitors (Hamel and Välikangas, 2003, Pettit et al., 2010). The concept of resilience supports companies in having the right capabilities for their vulnerabilities. Without becoming more resilient, companies face either excessive risk (high vulnerabilities and low capabilities) or eroded profitability (low vulnerabilities and high capabilities) (Pettit et al., 2010). Resilience also makes companies aware of the connections and interdependencies that exist between vulnerabilities. The root lies in understanding disruptive events. Sheffi and Rice (2005) defined eight phases of a disruptive event: (1) preparation, (2) occurrence of a disruptive event, (3) first response, (4) initial impact, (5) time of full impact, (6) preparation for recovery, (7) recovery, (8) long-term impact (see Figure 1).

Figure 1 – Chronology of disruptive events (Sheffi and Rice, 2005)

The emphasis is on recovering from a disruptive event and going “back to normal”. Although this might be the objective for companies, it can be the objective for IT departments only in the short term (tactically). In the long run, IT departments need to explore the nature of the disruptive event (strategically). Disruptive events in IT often occur due to technological innovations. In this case, it would be fundamentally wrong to bounce back to normal. The IT department needs to evaluate technological innovations and decide whether and how to use them. Otherwise it loses its competitive advantage.

This paper defines resilience in the next section. After that, the challenges of becoming resilient are explored and applied to IT. In the fourth section, frameworks for becoming resilient are evaluated. In the fifth section, measurement factors for resilience are described.

Defining Resilience

There are many different definitions for the term ‘resilience’ (Hamel and Välikangas, 2003, Pettit et al., 2010). Many definitions include the words “vulnerability” and “redundancy”. Acting against vulnerabilities implies knowing what the threat is. Resilience, however, is the ability to be prepared for unforeseeable events. Therefore, dependability is only a part of resilience.

In general, the definitions can be separated into two groups. The first group understands resilience as withstanding disruptions to “quickly resume production by redistributing resources” (Hu et al., 2009, Erol et al., 2009, Holling, 1973, Haimes et al., 2008). In other words, bouncing back to normal. In this case every stage of Figure 1 would be run through. The second group sees resilience as adapting and evolving in the face of change (Erol et al., 2009, Pettit et al., 2010, Laprie, 2008). This understanding of resilience is more appropriate for IT, as it allows competitive advantages to be gained through disruptive events. Certain words recur across the definitions. The words “resume”, “redundancy”, “bounce back”, and “vulnerabilities” can be assigned to the first group of definitions (Hu et al., 2009, Hu et al., 2008, Pettit et al., 2010, Erol et al., 2009, Erol et al., 2010). The words “flexibility”, “anticipation”, “adaptive capacity”, and “capabilities” can be assigned to the second group of definitions (Hu et al., 2009, Pettit et al., 2010, Erol et al., 2009, Laprie, 2008).

The focus of resilience should be on anticipation, adaptive capacities, flexibility and the interdependence of vulnerabilities. This points to a relationship with the concept of agility. Agility means being able to react to change quickly, and flexibility is a part of agility. Further, agility can be understood as acting on or reacting to change. For resilience it is important to anticipate change (e.g. disruptive events) and be ready to act quickly to decrease the impact of disruptive events. Therefore, these two concepts could inform each other. The evaluation of this idea, however, is not part of this paper.

Resilience within the IT department will be needed in two cases: unexpected internal events and unexpected external events. An unexpected internal event could be a lightning strike. In this case the first group of resilience definitions should be applied, where the main objective is to get back to normal. Within IT, this area is covered by dependability. An unexpected external event could be the development of disruptive technologies or changing user requirements based on technological innovation. Unexpected external events have consequences for the IT department in two areas: users of IT and customers of the company. Users of IT are other business units that make use of the services offered by the IT department. An unexpected event could be the demand by users to bring their own devices to work. The IT department faces many challenges in this area, e.g. security, but ultimately can’t stop employees from bringing their own devices. Customers, on the other hand, are the end consumers of the products and services produced by the company. An example of an unexpected event in this area could be the rise of social media, e.g. customers expect the company to be represented on a social media platform.

By taking into account “open” socio-technical systems a distinction can be drawn between “disruptive technologies” and “technological discontinuities”. The openness and complexity of today’s companies increase their dependence on “a global financial, operational, and trade infrastructure”, e.g. cloud computing (Starr et al., 2003). Resilience supports a company in facing risks it didn’t face in national markets and as a vertically integrated company. In many companies today the IT department is vertically integrated. It owns the whole supply chain, from hardware to software licenses and the creation of services (e.g. support), within the company. In the future the IT department is likely to become horizontally integrated, concentrating on core functions without owning other parts of the supply chain. In other words, it won’t own all the hardware anymore and is more likely to concentrate on the business value of services while sourcing necessary supplies from outside.

The definition of resilience by Hollnagel, Woods and Leveson will be adopted because it is suitable for IT and belongs to the second group of definitions: Resilience is the characteristic of managing the organisation’s activities to anticipate and circumvent threats to its existence and primary goals (Hollnagel et al., 2006).

Challenges to become resilient

The most important challenges companies face while becoming more resilient are related to risk management, control and management issues, and properties of technologies. This section will explain the challenges and draw on information and connectivity as two important aspects to become resilient from an IT perspective.

Organisational structures and processes can provide a foundation for resilience (Riolli and Savicki, 2003). By defining structures and processes the company defines how work is carried out. From an IT perspective, the objective is to keep the systems secure, protect the company’s data and prevent unintended use of the systems (Lindvall et al., 2004). From this point of view, resilience and control don’t seem to work together. Too much control by IT means that employees can’t appropriate the methods and tools they need to respond to change (Ignatiadis and Nandhakumar, 2007). In addition, Ignatiadis and Nandhakumar (2007) found that flexibility increases resilience, while centralisation of control and knowledge decreases it. If there is not enough control by the IT department, however, three problems arise. The company faces security risks because employees might use insecure software. Cooperation between employees may suffer if too many of them use different software. And supporting the different employees may tie up the IT department to the extent that it has no resources left for anticipating change. According to Riolli and Savicki (2003) employees are at the forefront of technological development. Therefore, it is an organisational matter to enable employees to progress in a way that is suited to tomorrow’s challenges. If they are controlled too much, they can’t develop in the right direction at the right point in time. From that point of view “open” socio-technical systems may be a better way to support resilience than controlling IT too much.

Enterprise risk management attempts to identify vulnerabilities and develop appropriate capabilities or countermeasures (Starr et al., 2003). While pursuing strategic objectives, trade-offs have to be made, meaning that one risk is chosen over another. In a networked company this exercise becomes difficult because vulnerabilities have interdependencies and outcomes that are unforeseeable. Starr et al. (2003) explain three steps for becoming resilient: (1) diagnose company-wide risks and interdependencies, (2) adapt the company’s strategy and align resources, (3) endure risk and complexity. From that point of view risk management prepares a company for the expected risks. Resilience, however, means being prepared for unexpected events (unknown unknowns). Risk management is important for becoming resilient because it shapes the thinking of employees to discover potential disruptive events (e.g. through war games), but risk management alone is not enough to be resilient.

Management challenges that stop a company from becoming more resilient are related to general strategic thinking: the cognitive challenge, the strategic challenge, the political challenge, and the ideological challenge (Hamel and Välikangas, 2003). The cognitive challenge means becoming free of denial, nostalgia and arrogance. Further, it means being aware of the fact that strategy loses effectiveness over time. The strategic challenge means that companies need to have alternatives, e.g. for processes, and need to be aware of their environment in order to be resilient. The political challenge means being able to divert resources from current activities to opportunities. The ideological challenge means being aware that today’s business models won’t be suitable for tomorrow (Hamel and Välikangas, 2003). Although it is important to be aware of these biases in management thinking, they are not specific to resilience.

From a technology perspective evolvability, assessability, usability, and diversity are important attributes. Laprie (2008) tried to apply these to resilience. However, the author only explored evolvability in more detail by explaining that it is the ability to successfully accommodate changes. The problem with the rest of the attributes is that they are more related to software and hardware issues. From this point of view they might enable resilience. For processes and organisational issues, however, they can’t be applied. It is not possible to address connections and interdependencies with these attributes and apply them to socio-technical systems.

Information and connectivity are two important factors for being resilient, as explained by Horne III (1997). Erol et al. (2009) developed a two-step framework which incorporates these factors to guide companies in becoming resilient. The framework does not provide unique insights into becoming resilient but focuses on general issues of connectivity and information, and therefore provides a starting point. The first step of the framework is to connect people, processes and information. This enables the company to become more flexible and responsive. The second step is to align IT with business goals. Being connected has both advantages and disadvantages. If a company is too connected, a disruptive event can affect every area of the company. On the other hand, when business parts are tightly connected, countermeasures can be initiated more easily (e.g. before a disruptive event reaches phase 5 “time of full impact”, see Figure 1).

The challenges explained in this section try to apply concepts from other areas to resilience. This is unlikely to provide the full benefits of being resilient as this section made clear. However, it can be concluded that resilience has to incorporate different areas of companies (e.g. control and risk) due to the interdependencies of disruptive events.

Frameworks to become resilient

Most of the frameworks investigated understand resilience as a means of resuming the operations that existed prior to a disruptive event. Other frameworks make the mistake of looking at vulnerabilities from a single point of view without paying attention to their interdependencies. Only one framework could be identified that is appropriate from an IT perspective, fits the definition adopted in this paper, and incorporates the connectivity and information issues raised in the previous section. First the two other frameworks are evaluated.

Madni and Jackson (2009) developed a framework which relates resilience to safety, reliability and survivability. It consists of disruptions, system attributes, methods and metrics (see Figure 2). In the first part of the framework disruptive events are indexed and catalogued. Then the system attributes on which disruptions can have an impact are defined. Attributes can be organisational infrastructure, system functionality, etc. Thereafter methods are defined to achieve resilience. Their main purpose is to enable trade-offs between production and safety. Finally metrics are defined (e.g. time/cost to restore operation). The objective of the framework is to help individual employees and organisations to know when and how to make trade-offs.

Figure 2 – Framework by Madni and Jackson (2009)

The framework has substantial shortcomings. The authors don’t elaborate on why disruptive events should be indexed and catalogued. Disruptive events are disruptive because they differ from other events in a way that one can’t be fully prepared for. This element of the framework contradicts other arguments made by the authors. They pointed out that being resilient means not relying on past success in the face of disruptive events. This raises the question of why disruptive events should be indexed. Further, the authors propose several methods to achieve resilience. They don’t explain, however, when to use which method. The best approach would be to connect the methods to the system attributes that are affected by disruptive events. A good element of the framework is the focus on trade-offs, as resilience always comes with costs. Being aware of the trade-offs that need to be made from an efficiency perspective supports employees in making their own decisions in a period of disruption. This can eventually lead to increased agility and a faster response to change. From an IT point of view, the framework pays too much attention to restoring configurations, operations, functionality, etc. The focus for IT needs to be on adapting to the changing environments caused by disruptive events.

The second framework, developed by McManus et al. (2007), focuses more on vulnerabilities. A 5-step management model was developed. The first step aims to build awareness of resilience issues through interviews, surveys, and brainstorming with employees. In step two, organisational components that are critical for ongoing operations from an internal and external view are selected. In the following step these components are assessed for their preparedness for disaster. In the fourth step the result of the assessment is plotted onto matrices to visualise the components that present the greatest risks. In the final step adaptive capacities are increased through practice and tests to evaluate the company’s crisis preparation. The testing aims to develop leadership, decision making and communication skills that are important during disruption. The framework is designed as an iterative process and is not suitable as a one-off crisis management tool.
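Steps two to four of this model can be illustrated with a small sketch. It is not the authors' method: the two axes (criticality and preparedness), the 1 to 5 scoring scale, the threshold and the component names are all assumptions made only for illustration.

```python
def risk_matrix(components, threshold=3):
    """Group components into a 2x2 matrix keyed by (criticality, preparedness).

    `components` maps a component name to a (criticality, preparedness)
    pair, each scored 1-5. The scale and the threshold are hypothetical.
    """
    matrix = {
        ("high", "low"): [],   # critical but poorly prepared: greatest risk
        ("high", "high"): [],
        ("low", "low"): [],
        ("low", "high"): [],
    }
    for name, (criticality, preparedness) in sorted(components.items()):
        key = (
            "high" if criticality >= threshold else "low",
            "high" if preparedness >= threshold else "low",
        )
        matrix[key].append(name)
    return matrix


# Hypothetical assessment results for three IT components.
assessment = {
    "email": (4, 2),
    "payroll": (5, 4),
    "intranet wiki": (2, 1),
}
greatest_risks = risk_matrix(assessment)[("high", "low")]
```

Here `greatest_risks` contains only the components that are both critical and poorly prepared. Note that each component is scored in isolation, so the sketch says nothing about how components affect each other.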

The major issue with this framework is its fixation on vulnerabilities. Although identifying vulnerabilities is important for becoming more resilient, identifying and being aware of their interdependencies is more important. The authors didn’t include this in their framework. Further, by practising and testing for crises one runs the risk of being prepared for a specific crisis but not for disruptive events in general. Finally, the authors point out several times that their framework prepares a company for crisis. This shows their single-minded view of disruptive events. Instead of being seen as a crisis, disruptive events can also be seen as an opportunity. Their framework is designed to restore operations in a crisis, similar to the previous framework. From an IT perspective the framework isn’t suitable, as the connection of vulnerabilities should play an essential role. A failure in one system can lead to failures in other systems. Therefore being prepared for one vulnerability might give a false feeling of resilience, as vulnerabilities can rapidly affect each other in IT.

The third framework was developed by Hollnagel et al. (2006) and promises to be a suitable candidate for application in IT. It originates in systems thinking, where a system, e.g. a company, consists of subsystems, e.g. business units, and can be in one of three states: healthy, unhealthy, or catastrophic. In the healthy state, business goals are met and risks of operation are understood. In the unhealthy state, business goals are no longer met and/or risks of incurring losses are high. In the catastrophic state, one or more subsystems or the entire system is destroyed. For an illustration of the transitions see Figure 3.

Figure 3 – Different states a system can be in by Hollnagel et al. (2006)
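The three states described above can be sketched as a small classification. The field names are hypothetical, and the rule that one unhealthy subsystem makes the whole system unhealthy is an assumption of this sketch rather than part of the framework.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    CATASTROPHIC = "catastrophic"


@dataclass
class Subsystem:
    # All field names are illustrative, not taken from the framework.
    name: str
    goals_met: bool
    risk_of_loss_high: bool
    destroyed: bool = False


def subsystem_state(subsystem):
    """Classify a single subsystem into one of the three states."""
    if subsystem.destroyed:
        return State.CATASTROPHIC
    if not subsystem.goals_met or subsystem.risk_of_loss_high:
        return State.UNHEALTHY
    return State.HEALTHY


def system_state(subsystems):
    """Classify the whole system from the states of its subsystems."""
    states = {subsystem_state(s) for s in subsystems}
    if State.CATASTROPHIC in states:
        # One destroyed subsystem is enough for the catastrophic state.
        return State.CATASTROPHIC
    if State.UNHEALTHY in states:
        # Assumption of this sketch: the worst subsystem state propagates.
        return State.UNHEALTHY
    return State.HEALTHY
```

A company with a healthy sales unit and an IT unit that carries a high risk of losses would be classified as unhealthy under these assumptions.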

Business systems, e.g. a company, are dynamic open systems with subsystems on multiple levels (e.g. a hierarchical structure). The business system’s behaviour is defined by goals, policies, processes, etc. They define the decision-making frame within which the subsystems stay in a healthy state. Subsystems then develop their own goals (defined or perceived) that drive the interactions with other subsystems, the business system, and/or the environment. In order to achieve their goals, subsystems can adopt different types of control modes: strategic, tactical, opportunistic, scrambled, or a mixture of them (depending on the time frame). Finally, resilience emerges as a result of the subsystems’ ability to transition from one state, e.g. the unhealthy state, into another, e.g. the healthy state, or not to be affected by transitions of other subsystems (e.g. due to a disruptive event). This is achieved through reorganisation of system boundaries and/or a change of control mode. For an example of a business system and its influences from outside see Figure 4.

Figure 4 – Business system with control mode by Hollnagel et al. (2006)

In order to better predict the impact of behaviour from one subsystem on another or the whole system a feedforward based strategic control mode was adopted. This requires the following elements:

  • Defined system goals, including the acceptable level of risk
  • Continuous monitoring of state variables, which are the ones that could change the state of a subsystem
  • Continuous monitoring of state variables related to components of operational risk: people, systems, processes
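The monitoring elements above can be sketched as a simple early-warning check. This is only an illustration of the feedforward idea, acting on a projected value before a limit is actually crossed. The linear projection, the variable names and the limits are assumptions and are not taken from Hollnagel et al. (2006).

```python
def project(history, steps_ahead=1):
    """Linearly extrapolate the latest trend of one state variable."""
    if len(history) < 2:
        return history[-1]
    slope = history[-1] - history[-2]
    return history[-1] + slope * steps_ahead


def early_warnings(readings, limits, steps_ahead=3):
    """Return the state variables that are still within their acceptable
    limit now but are projected to exceed it within `steps_ahead` periods."""
    warnings = []
    for name, history in readings.items():
        within_limit_now = history[-1] <= limits[name]
        projected = project(history, steps_ahead)
        if within_limit_now and projected > limits[name]:
            warnings.append(name)
    return sorted(warnings)


# Hypothetical state variables of an IT subsystem, one reading per week.
readings = {
    "open_incidents": [10, 14, 18],
    "unpatched_servers": [5, 5, 5],
}
# Hypothetical acceptable levels of risk, derived from the system goals.
limits = {"open_incidents": 25, "unpatched_servers": 10}

flagged = early_warnings(readings, limits)
```

In this example only `open_incidents` is flagged, because its trend would cross the limit within three weeks even though the current value is still acceptable. A feedback-based check would react only after the limit had been exceeded.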

In other words, companies need to be aware that subsystems are connected and have interdependencies. Not being aware of this interweaving has implications for the control mode: the ability to react to change is limited because the implications of a decision in one subsystem for the others can’t be understood. In a similar way, subsystems are also connected to the external environment (e.g. technological discontinuities).

In general, the framework incorporates interdependencies and connections between different parts of the company. It therefore doesn’t try to prepare for specific events but, through feedforward loops, to spot disruptive events and their potential consequences early. By dividing a company into subsystems, the consequences that a disruptive event in one subsystem has for others are easier to spot. In addition, it pays attention to the differing goals that subsystems and/or the business system as a whole can have. Goals can be decisive for disruptive events as they determine how these events are approached and dealt with. From an IT perspective, separating the business system makes sense, as different IT systems can be separated into different subsystems, according to their purpose, and connected to each other. Interdependencies become easily visible, which is especially important for IT systems because a failure in one system can have consequences for other systems. Through feedforward loops the IT department is not viewed in isolation, and alignment with the business can therefore be considered.
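The point that a failure in one IT system can have consequences for others can be made concrete with a dependency graph. The following is a minimal sketch that assumes the dependencies between systems are known and can be written down as a mapping. The system names are hypothetical.

```python
from collections import deque


def affected_by(failed, dependents):
    """Return every system that is directly or indirectly affected when
    `failed` goes down, using a breadth-first search.

    `dependents` maps a system to the systems that rely on it.
    """
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in seen:
                seen.add(system)
                queue.append(system)
    seen.discard(failed)  # report only the knock-on effects
    return seen


# Hypothetical dependencies between IT subsystems.
dependents = {
    "directory service": ["email", "vpn"],
    "email": ["ticketing"],
    "vpn": [],
    "ticketing": [],
}

impact = affected_by("directory service", dependents)
```

Here a failure of the directory service reaches email and the VPN directly and the ticketing system indirectly, whereas a failure of the VPN affects nothing else. Looking at each system on its own would miss this difference.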

Measuring resilience

Research lacks methodologies to measure resilience (Erol et al., 2010). Resilience is often measured by the level of vulnerability and recovery of an organisation. I believe this is fundamentally wrong. Regarding vulnerability, one measures the level of vulnerability to a specific risk (Erol et al., 2009). By this argument, reducing vulnerabilities increases resilience. The problem with this argument is that one doesn’t know what the disruptive event will be before it occurs. In order to decrease the level of vulnerability one has to know what the threat is or have the resources to start from scratch. In addition, the connections and interdependencies between vulnerabilities are the bigger threat. By decreasing the level of vulnerability to one threat, one looks at disruptive events from a single point of view. In other words, the connections between vulnerabilities can vary between disruptive events, meaning that a protection developed for one vulnerability might work only in specific situations. This approach to resilience is too narrow, and the trade-off between efficiency of operation and risk is too high, as a company can’t be protected against all vulnerabilities and all possible interdependencies. The second frequently used measure, recovery, poses a specific problem for IT. Recovery in relation to resilience is often measured by recovery time and level of recovery (Erol et al., 2010). Recovery time means the time it takes a company to overcome a disruption and return to its normal state. The level of recovery can be lower than, the same as, or higher than the original one (Erol et al., 2010, Madni and Jackson, 2009). By stating that the level of recovery can be higher than the original one, the authors contradict the notion of recovery time, which is about returning to the normal state. Recovery, however, means getting back to normal. This shouldn’t be the objective for IT, as this approach stops the IT department from evolving and incorporating technological developments.
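To make the two recovery measures concrete, the following sketch computes them from a series of performance readings. It is an illustration under assumptions, not the method of the cited authors: the performance scale, the baseline (which must be greater than zero), the function name and the readings are hypothetical.

```python
def recovery_metrics(performance, baseline, disruption_index):
    """Return (recovery_time, recovery_level) for one disruptive event.

    recovery_time: number of periods after the disruption until the
        performance is back at or above the baseline (None if never).
    recovery_level: final performance relative to the baseline, so a
        value below, equal to, or above 1.0 is possible.
    """
    recovery_time = None
    for i in range(disruption_index + 1, len(performance)):
        if performance[i] >= baseline:
            recovery_time = i - disruption_index
            break
    recovery_level = performance[-1] / baseline
    return recovery_time, recovery_level


# Hypothetical service performance per week; the disruption hits in week 2.
weekly_performance = [100, 100, 40, 60, 85, 100, 110]
time_to_recover, level = recovery_metrics(
    weekly_performance, baseline=100, disruption_index=2
)
```

In this example the baseline is reached again after three weeks, while the final level lies above the original one. The recovery time is tied to the old normal state, whereas the recovery level allows for something other than that state, which is the tension discussed above.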

McManus et al. (2007) and Pettit et al. (2010) propose generic resilience indicators. They separate them into three groups: situation awareness, management of keystone vulnerabilities, and adaptive capacity. Although they pay much attention to vulnerabilities, they provide a starting point for suggesting suitable metrics. From the words that are often used in resilience definitions, four could be identified that should inform metrics: adaptive capacity, anticipation, flexibility, and agility. Being resilient is one approach to being prepared for disruptive events. As one doesn’t know what the disruptive event will be and where it will happen, a company can’t be prepared for particular events and should instead be prepared for disruption in general. This means being able to adapt to changing circumstances and environments in such a way that disruptive events are not perceived as disruption. The above-mentioned metrics help a company to prepare in the proposed way. If these metrics are not fulfilled, a company is not resilient.

Conclusion

In the literature, resilience is often related to bouncing back from a disruptive event. Whether this approach is appropriate depends on how bouncing back is understood. If it means restoring previous operations and configurations, it is generally not appropriate, as a similar disruptive event would then have the same consequences. If bouncing back means restoring the level of performance prior to the disruptive event, it is generally an appropriate approach, including for IT. The measures often used, however, relate resilience, and with it “bouncing back”, to restoring operations, configurations, states, etc. IT poses another challenge to resilience, as it needs to incorporate the results of disruptive events into future developments quickly. In the future, technological discontinuities are likely to challenge companies’ resilience more frequently and more heavily.

This article was written as part of my PhD progress with the School of Computer Science at the University of St Andrews.

References

EROL, O., HENRY, D., SAUSER, B. & MANSOURI, M. 2010. Perspectives on measuring enterprise resilience. Systems Conference 2010 4th Annual IEEE, 587-592.

EROL, O., MANSOURI, M. & SAUSER, B. 2009. A Framework For Enterprise Resilience Using Service Oriented Architecture Approach. 2009 International Conference on Management and Service Science, 127-132.

HAIMES, Y. Y., CROWTHER, K. & HOROWITZ, B. M. 2008. Homeland security preparedness: Balancing protection with resilience in emergent systems. Systems Engineering, 11, 287-308.

HAMEL, G. & VÄLIKANGAS, L. 2003. The quest for resilience. Harvard Business Review, 81, 52-63, 131.

HOLLING, C. S. 1973. Resilience and Stability of Ecological Systems. Annual Review of Ecology and Systematics, 4, 1-23.

HOLLNAGEL, E., WOODS, D. D. & LEVESON, N. 2006. Resilience engineering : concepts and precepts, Aldershot, England ; Burlington, VT, Ashgate.

HORNE III, J. F. 1997. A New Direction: The Coming Age of Organizational Resilience. Business Forum, 22, 24-28.

HU, Y. H. Y., LI, J. L. J. & HOLLOWAY, L. E. 2008. Towards modeling of resilience dynamics in manufacturing enterprises: Literature review and problem formulation. 2008 IEEE International Conference on Automation Science and Engineering, 279-284.

HU, Y. H. Y., LI, J. L. J. & HOLLOWAY, L. E. 2009. A modeling and aggregation approach for analyzing resilience of manufacturing enterprises. 2009 IEEE International Conference on Systems Man and Cybernetics, 692-697.

IGNATIADIS, I. & NANDHAKUMAR, J. 2007. The impact of enterprise systems on organizational resilience. Journal of Information Technology, 22, 36-43.

LAPRIE, J. C. 2008. From Dependability to Resilience. International Conference on Dependable Systems & Networks (DSN 2008), Anchorage, Alaska, G8-G9.

LINDVALL, M., MUTHIG, D., DAGNINO, A., WALLIN, C., STUPPERICH, M., KIEFER, D., MAY, J. & KAHKONEN, T. 2004. Agile software development in large organizations. Computer, 37, 26-34.

MADNI, A. M. & JACKSON, S. 2009. Towards a Conceptual Framework for Resilience Engineering. IEEE Systems Journal, 3, 181-191.

MCMANUS, S., SEVILLE, E., BRUNSDON, D. & VARGO, J. 2007. Resilience Management – A framework for assessing and improving the resilience of organisations. 79.

PETTIT, T. J., FIKSEL, J. & CROXTON, K. L. 2010. ENSURING SUPPLY CHAIN RESILIENCE : DEVELOPMENT OF A by. Journal of Business, 31, 1-22.

RIOLLI, L. & SAVICKI, V. 2003. Information system organizational resilience. The International Journal of Management Science, 31, 227-233.

SHEFFI, Y. & RICE, J. J. B. 2005. A supply chain view of the resilient enterprise. MIT Sloan Management Review, 47, 41-48.

STARR, R., NEWFROCK, J. & DELUREY, M. 2003. Enterprise Resilience: Managing Risk in the Networked Economy. strategy+business, 1-10.
