The Signature of Error is Change
Jan 4th, 2007
By Stephen Northcutt
Good Morning Everyone,
Summary: It worked yesterday, so why doesn't it work today? Try this Google search for yourself: "it worked yesterday". On Jan 03 2007, it yielded 17 million results. Let's look at one of the results and see what we can learn as computer security managers:
I swear this exact set of code below worked yesterday. Environment is VB6,
latest MDAC, and PG odbc of 7.02.00.01
I made zero, zip, no changes to anything, just turned on my workstation and
blam, it quit working. Has anyone else had something like this happen?
In this case the user feels they had not made a change, but we know from computer science that a deterministic program run on the same input data yields the same results. If the results differ, either the program or its input data (which includes configuration and the surrounding environment) has changed. The problem this user illustrates is: how do you know something has changed? Let's look at one more.
Please ignore that last post, the programer was suffering from imadoofus syndrome
adOpenDynamic doesn't return a record count, but adOpenKeyset does *sigh*
Now, to find out who changed my code ;).
Our point exactly! Much of the activity that leads to outages like these comes from unseen changes to networked devices such as servers, routers, switches, and firewalls. Changes in configuration, new software and services, access control changes, and even intrusive troubleshooting techniques can all contribute to functionality failures that cost businesses money. Consider the research below from meganet; the .pdf contains even more compelling data, and it is great ammunition for convincing senior management to run IT in a rational manner:
- 80% of all data is held on PCs (Source, IDC)
- 70% of companies go out of business after a major data loss (Source, DTI)
- 32% of data loss is due to user error (Source, Gartner Group)
- 10% of laptops are stolen annually (Source, Gartner Group)
- 15% of laptops suffer hardware failure annually (Source, Gartner Group)
These figures are staggering; human change causes more downtime than equipment failure, malicious activity, and faulty applications combined. This is a sign that authorized changes are improperly planned and/or tested. They also often lack sufficient roll-back plans. The result is low availability of computing resources when our goal is the opposite. "High availability involves both planned and unplanned outages," says Charles Garry, senior program director of infrastructure services at Meta Group, Inc. "Planned outages occur for things like application upgrades, hardware upgrades, patches, and basic maintenance. Unplanned outages can include user error, operator error, and actual hardware failure."
Though the statistics vary, they always point to the same thing: the human factor has to be addressed to achieve improved uptime. Zaffos, a Gartner analyst, "argues that 80 percent of downtime today is caused by user error and software failures, not hardware failures. He says that the failures resulting from software are created by complexity and that there is an almost infinite number of failures that can occur in a complex system."
An article by Dell notes that the "Infonetics Research study, The Costs of Downtime: North American Medium Businesses 2006, found that mid-market enterprises (101 to 1,000 employees) lose an average of 1 percent of their annual revenues to downtime. Even more startling, another study has found that up to 40-50 percent of businesses that have suffered major service interruptions never recover completely, and fail within two to five years." The article goes on to say the answer is virtualization. While we are big fans of virtualization for disaster recovery and high availability, it does not prevent user or operator errors related to change. It does, however, allow you to recover the base operating environment rapidly by simply loading another virtual copy of the machine.
Despite significant effort toward deploying change management systems, many organizations are unaware of the changes being made to resources that are critical to the support and continued operation of revenue-generating business applications. This is sometimes unintentional: administrators are unaware of policy, or no clear policy defines the procedures for change control. It can also be the result of "authorized" changes that bypass change control procedures. This is often the case when an administrator is instructed to respond to a security incident, or to apply a patch to resolve a security or functionality problem with software. When faced with the risk of system compromise due to a new vulnerability, or with a broken business process that is holding up a project, administrators are often pressured to violate policy and "hurry up and FIX this problem!" Unfortunately, the FIX may be the root cause of larger problems that contribute even more to unplanned IT downtime.
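One way to notice changes that bypass change control is to compare the current state of monitored files against a known-good baseline and then reconcile any differences with the approved change records. The following is a minimal sketch in Python, not a substitute for a real file integrity or configuration management product. The function names and the idea of representing approved changes as a simple set of file paths are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(baseline: dict[str, str],
                   current: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified since the baseline."""
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(
            name for name in baseline.keys() & current.keys()
            if baseline[name] != current[name]
        ),
    }


def unapproved(changes: dict[str, list[str]], approved: set[str]) -> list[str]:
    """Return changed paths that no approved change record covers.

    `approved` is a hypothetical stand-in for whatever the change
    management system records (here, just a set of relative paths).
    """
    changed = set().union(*changes.values())
    return sorted(changed - approved)
```

In use, a baseline would be taken with `snapshot()` after a known-good deployment and stored somewhere administrators cannot silently alter it. A later `snapshot()` of the same directory is passed to `diff_snapshots()`, and anything returned by `unapproved()` is a change nobody signed off on.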
A common example of this type of error is the implementation of patches and software upgrades to fix problematic software. There is an obvious driver to install the new software: the resolution of bugs, added functionality, mitigation of security vulnerabilities, and so on. However, this should not be the only consideration before jumping into a software upgrade on production resources. Organizations should take the time to analyze the other factors that bear on the upgrade:
- Is my present hardware sufficient to meet the requirements of this new software?
- What dependencies am I going to encounter with this and other production software?
- What is the impact on the business if this software does not act as intended, or is problematic and bug-ridden?
- What kind of regression testing has been performed on this software? Does the testing environment mirror my production environment?
- Is there a backout procedure that can be applied if the new software is problematic?
- Do I have a complete backup of the system before applying the software upgrade?
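The questions above can be turned into an explicit gate that a change request must pass before it reaches production. The sketch below is one hypothetical way to record the answers in Python; the class, the field names, and the functions are illustrative assumptions and do not come from any particular change management tool.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UpgradeReview:
    """Answers to the pre-upgrade questions; field names are illustrative."""
    hardware_sufficient: bool
    dependencies_reviewed: bool
    business_impact_assessed: bool
    regression_tested_in_mirror_env: bool
    backout_procedure_ready: bool
    full_backup_taken: bool


def open_items(review: UpgradeReview) -> list[str]:
    """Return the checklist items that have not been satisfied, in order."""
    return [name for name, ok in asdict(review).items() if not ok]


def ready_for_production(review: UpgradeReview) -> bool:
    """An upgrade is ready only when every checklist item is satisfied."""
    return not open_items(review)
```

Listing the unsatisfied items by name, instead of returning a bare yes or no, gives the administrator under pressure something concrete to show the person asking for the quick fix.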
Organizations may discover that applying software upgrades as soon as they are released puts them at a higher risk of unplanned downtime than the risk of encountering malicious activity. This is no excuse for skipping security patches, though; patching is still a critical component of systems management and risk mitigation. Information security managers should use caution and identify the risks of installing software upgrades before deployment.
As leaders in information assurance, we have a responsibility to understand and communicate the financial losses due to unmanaged, change-related downtime. We need to revisit the organization's change management process and invest the time and energy required to gain control, to the extent possible, of change in our organizations.
Links and research were valid at the time this article was updated, Jan 03, 2007. If you cannot directly access a link, try using an Internet cache, such as the Internet Archive Project.
1. Google search "it worked yesterday" Jan 03 2007, http://www.google.com/search?q=it+worked+yesterday