Consider resilience by its classic definitions: the ability to bounce back quickly, toughness, the capacity to recover. By any of them, it is no wonder that this pandemic has tested every organization. The test came both from initial preparedness issues and from endurance issues as the situation stretched on for months. The sudden pivot to remote work at the outset exposed many organizations, either through a lack of technical ability or through in-person expectations and processes grounded in decades of tradition and belief. If you still have concerns about how well your organization is coping, we are here to help. Through our decades of experience and our diligent approach, we can help you improve your team’s resiliency so that you are better prepared to handle both the critical issues that are now known and the likely issues that have yet to materialize.
The 4 Levels of Resiliency: A Short Primer
How resilient an organization is depends on a variety of factors, but all minimally resilient organizations fall within one of the following four levels. In our experience, most organizations still struggle with disaster recovery and have not yet reached the business continuity level of resiliency. By getting a good idea of how prepared your organization is, you are better positioned to address shortcomings and prepare for whatever life throws at you.
Level 1: Fault Tolerance
Fault tolerance refers to how well a system (such as a computer network or cloud cluster) can continue to operate without interruption if one or more components fail. A fault-tolerant organization possesses the ability to keep working even if a mission-critical application or system experiences downtime or suffers some level of compromise.
At its simplest, fault tolerance may be achieved by focusing on four core areas: hardware, software, data, and support infrastructure (e.g. power and cooling). Each of these areas involves multiple considerations for improving your organization’s overall resiliency. Here are a few critical ones:
- You can make your hardware systems more resilient using redundant configurations. Whether you use a failover configuration to an identical or equivalent system (such as a secondary server) or a load-balanced and/or clustered configuration, redundancy can help ensure minimal disruption if your primary system or a component of your load-balanced configuration goes offline.
- Similarly, you can make your software implementations more resilient by running redundant software instances. For example, you might make a database management system that contains critical information more resilient by running two copies of the software on two different servers. In these critical situations, an appropriate disaster-recovery (DR) deployment requires two instances per location, whether in an active/passive or an active/active configuration.
- Your critical data stores may be made fault tolerant by employing the tried-and-true 3-2-1 backup rule:
  - Have at least three copies of your data
  - Store the copies on two different media
  - Keep one backup copy offsite
  In our current climate, it is also important to ensure that one of those copies is immutable, so that it is resilient to the replication of corrupted data and to the very topical threat of ransomware.
- For on-premises critical systems, software, and data, power remains a critical factor. As obvious as that point is, many have “lost” the knowledge of appropriately redundant power configurations. It is essential that every component that delivers power (e.g. power supply, UPS, generator) is redundant. Simple mistakes, like plugging your redundant power supply into the same UPS and/or generator as your primary, are common and can have terrible repercussions.
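To make the data point above concrete, here is a minimal sketch of how the 3-2-1 rule, plus the immutable copy, could be checked against an inventory of your data copies. The `DataCopy` fields and the `check_3_2_1` function are hypothetical names chosen for illustration and do not come from any particular backup product. The sketch also assumes the production copy counts as one of the three copies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of a data set. Fields are illustrative, not a product schema."""
    name: str
    media: str        # e.g. "disk", "tape", "object-storage"
    offsite: bool     # stored away from the primary site
    immutable: bool   # cannot be altered or deleted during its retention period


def check_3_2_1(copies: list[DataCopy]) -> list[str]:
    """Return a list of rule violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than two media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable copy")
    return problems


# Assumed example inventory: production data plus two backups.
copies = [
    DataCopy("production", "disk", offsite=False, immutable=False),
    DataCopy("local-backup", "disk", offsite=False, immutable=False),
    DataCopy("cloud-backup", "object-storage", offsite=True, immutable=False),
]
print(check_3_2_1(copies))  # ['no immutable copy']
```

In this example the classic 3-2-1 rule is satisfied, yet the check still flags the missing immutable copy, which is exactly the gap ransomware tends to exploit.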
Level 2: Disaster Recovery
The second level on the way to true resiliency is disaster recovery. Now that each of your key technological components is fault tolerant, you need to take a hard look at the eventuality that you will still need to recover. Despite the best preparation, there will still be situations that were unforeseen or that were too costly to fully protect against. Whether they come from supplier/partner failures, complex cyberattacks, or an unforeseen problem with a change made in the environment, it is crucial to be prepared for such eventualities. Disaster recovery is a multidisciplinary function that spans people, process, and technology.
- People: Every critical component should have a clear, RACI-level understanding of which roles, mapped to individuals, are required for three things: recovering from disastrous scenarios, maintaining business operations in a degraded state while recovery is in progress, and the final and often overlooked task of restoring business functionality to a fully realized state after basic recovery is achieved. The last two belong more to business continuity, covered later, but need to be mentioned here because the two disciplines are intertwined.
- Process: An overall disaster recovery strategy has no realized merit without appropriate tactical execution. That execution comes from process, which in this context includes both process and procedure. Knowing each step required to recover a single component, a larger service, or an entire environment is the baseline for having a shot at a predictable recovery. Further, breaking each step down into a repeatable, tested procedure is the only way to ensure that details are not missed, especially when experienced resources are under strain or when more junior resources are following a “script” that more senior personnel developed.
- Technology: Beyond the fault-tolerant areas already accounted for, it is imperative to understand all of your capabilities for recovery and restoration of service. Having the appropriate recovery technologies (e.g. backup and recovery management systems, data management platforms) can significantly lower the risk of long-term impact to your organization, and often reduces the strain on your personnel and the overall cost of ownership. As a note, understanding how your capabilities cover your specific requirements (e.g. recovery point objective (RPO) and recovery time objective (RTO)) makes a significant difference, compared with keeping your capabilities analysis at a general technological level.
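As a rough illustration of checking capabilities against specific requirements, the sketch below compares a system’s backup schedule and tested restore time with its RPO and RTO targets. The function name `meets_objectives` and the example figures are assumptions made for illustration. The sketch also simplifies by treating worst-case data loss as one full backup interval.

```python
from datetime import timedelta


def meets_objectives(backup_interval, measured_restore_time, rpo, rto):
    """Compare a backup schedule and a tested restore time to RPO/RTO targets.

    Worst-case data loss is roughly one full backup interval, so the interval
    must not exceed the RPO. The restore time should come from an actual
    recovery test rather than an estimate.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": measured_restore_time <= rto,
    }


# Assumed example: nightly backups, a 3-hour tested restore,
# and targets of 4 hours (RPO) and 8 hours (RTO).
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

Here the restore is fast enough, but nightly backups could lose up to a day of data against a four-hour target. A general "we have backups" analysis would miss this kind of gap.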
Level 3: Business Continuity
Now that you have plans in place to ensure your critical technology-related infrastructure and operations can weather a disaster, you need to ensure business continuity for all systems and services. While disaster recovery can help minimize the impact of anything from significant issues up to genuine disasters, it does not ensure the continuation of business operations on its own.
As an example, when stay-at-home orders first came into effect, many organizations scrambled to move their workers to remote work as quickly and smoothly as possible. While disaster recovery can ensure that the core systems of your business weather the immediate storm, business continuity focuses on getting everything (people, process, and technology), not just critical systems, back to a fully realized, business-as-normal state.
Business continuity is more important than ever as organizations grapple with critical employees falling ill, and workers who need to care for sick relatives or supervise their children’s remote learning.
As highlighted in the disaster recovery level, people are a key consideration, both as a potential gap to plan for and as part of the strategy for dealing with issues. People-focused processes such as cross-training, key task/initiative mapping, and bench strength analysis can help ensure that if an employee falls ill or is otherwise unavailable for work, their critical tasks can be temporarily handed off to other employees.
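A simple form of key task mapping can be sketched in a few lines. The example below assumes you maintain a mapping from each critical task to the people trained to perform it, and flags tasks that fewer than two people can cover. The task names, people, and the `single_points_of_failure` function are all hypothetical.

```python
from __future__ import annotations


def single_points_of_failure(coverage: dict[str, set[str]]) -> list[str]:
    """Return critical tasks that fewer than two people are trained to perform."""
    return sorted(task for task, people in coverage.items() if len(people) < 2)


# Assumed example mapping of critical tasks to trained staff.
coverage = {
    "payroll run": {"Avery", "Jordan"},
    "firewall changes": {"Sam"},
    "backup restore test": {"Sam", "Avery"},
    "vendor invoicing": set(),
}
print(single_points_of_failure(coverage))  # ['firewall changes', 'vendor invoicing']
```

The flagged tasks are natural first candidates for cross-training.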
Level 4: Resiliency
True resiliency builds on all three of the previous levels and essentially means that your organization is able to adapt quickly, avoid impact to operations, or resume normal operations almost immediately. This is much harder to achieve than the other three levels and involves having contingency plans for every probable scenario, and for every known improbable scenario whose impact would be critical. Additionally, as mentioned earlier, it also means being able to deal with situations that arise from issues your partners/suppliers are experiencing.
A very relevant example from the non-technology world is that many healthcare organizations have reported issues with their PPE suppliers. To be truly resilient, a healthcare organization would need to have a relationship with a backup supplier in place so that it can get enough PPE on short notice to keep its frontline workers safe.
Thinking Outside the Box Can Improve Resiliency
Government bodies and the education and healthcare sectors have been hit particularly hard by COVID and the logistical challenges it brought with it. Shifting to a mostly, and sometimes entirely, virtual workforce has been incredibly challenging for many organizations. This is particularly true for large and essential organizations in sectors like food service, manufacturing, and especially healthcare, where service disruptions can have serious consequences for the health and safety of citizens.
A key example of an industry adapting to serve everyone is healthcare providers becoming more resilient by investing in telemedicine. While telemedicine will likely never fully replace in-person appointments, it has been a very effective way to provide care to patients in remote locations or to those at heightened risk who must self-isolate at home.
Telemedicine can also allow healthcare providers to stretch their finite resources further, making them more efficient while also expanding their reach.
Ensuring Compliance & Security While Your Employees are Remote
Ensuring security and compliance, whether your workers are at their desks or at their kitchen tables, is key to organizational resilience. Unfortunately, too many organizations view security and compliance as an afterthought: something to be addressed once more important tasks have been dealt with.
Keeping a remote workforce secure is often more complicated than keeping an on-site workforce secure because your organization’s security perimeter now extends into your employees’ homes. Even if you are confident that your organization’s network and on-site endpoints are appropriately secure, you cannot assume your employees have the knowledge or drive to be as diligent in this regard. When employees access sensitive data and systems over their home internet connections and from shared devices, they may be inadvertently exposing your organization to more security risks.
Additionally, depending on the nature of your business, you may need to comply with regulations such as HIPAA, CCPA, and COPPA, which all have complex privacy and related security requirements. A key concern in this vein is the man-in-the-middle attack. Unsecured home environments and systems raise both the likelihood of compromise and its impact, because a malicious actor who intercepts traffic between your employees and your network may be able to view sensitive data, such as patient healthcare records.
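One common mitigation for interception is to make sure that client connections verify the server’s certificate and hostname and refuse outdated protocol versions. The sketch below shows one way to do that with Python’s standard `ssl` module. The helper name `strict_tls_context` and the choice of TLS 1.2 as a floor are assumptions for illustration, not compliance guidance for any specific regulation.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server's identity."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumed policy for this sketch: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)  # True True
```

A context like this can be passed to standard-library clients, for example as the `context` argument of `urllib.request.urlopen`, so that a connection fails rather than silently proceeding when a certificate does not match.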
Steps Your Organization Can Take to Become More Resilient (& Save Money Doing It)
Becoming resilient is a culture and a process, and expert advice can go a long way towards preparing your team for whatever life throws at you.
Our team can engage through expert assessment or through interactive advisory to determine where you are today and where you want to go. There is no such thing as a one-size-fits-all solution when it comes to organizational resiliency, so determining your exact needs is critical.
Once you have an idea of your current situation and what areas need to be improved, we will work with you to determine your short, medium, and long-term goals and create a tailored solution that:
- Minimizes or eliminates overlap. Using several products that solve the same problems is not only inefficient; it is unnecessarily costly. Replacing a patchwork of products with a few carefully chosen options that meet all your needs can reduce costs and help eliminate blind spots.
- Is architected and built with our years of experience, supported by proven methodologies.
- Relies on high-quality hardware and other supporting solutions that can be configured to suit your needs.
- Improves your resiliency through people, process, and technology, and explores further steps you can take to make your organization more resilient.
At EVOTEK, we consistently focus on improving and building our teams with an eye towards helping our clients be more resilient. I hope these considerations and high-level recommendations have helped you reflect on your own organization’s resilience as well.