Fix the Automation, Not the Symptom

Intent

Avoid repeated failures caused by fixing only the symptom of poor automation (a failed deployment) rather than the cause (limitations of the automated process).

Motivation

Many who embrace DevOps work hard to automate processes such as setting up a test environment or deploying an application. Those processes will inevitably fail at times, or leave some work undone. One way to reduce the failure rate is to bar the team, by policy, from logging into the affected servers and manually fixing the installation, especially in test environments. Instead, the automation should be updated to account for the scenario that led to the failure, so that future deployments become more robust.

In environments where manual fixing is tolerated or, worse yet, seen as heroic, the limitations and errors of the adopted automation process persist. Where manual fixing is discouraged, the success rate of automated deployments typically climbs steadily to very high levels.

Applicability

Use the Fix the Automation, Not the Symptom pattern when

  • automated deployments have high failure rates
  • the failures you see recur from time to time. For instance, if a deployment failed because application configuration changes made by developers were not promoted alongside the application, you may identify repeated manual configuration changes as a problematic failure mode.
  • a change is communicated from environment to environment on an as-broken basis. A change breaks the dev-test environment, and that environment is fixed manually. Then the same change is applied to a manual "QA" test environment, which is again fixed manually only after asking development about it. When this anti-pattern repeats on the way to (or even in) the production deployment, the need for this pattern is high.
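
The configuration example in the second bullet can be sketched as a check that the automation runs before deploying, so that a missing configuration change stops the deployment instead of being patched by hand afterwards. This is only a minimal illustration under stated assumptions: the function names `missing_config_keys` and `preflight`, the key names, and the idea of representing configuration as plain dictionaries are hypothetical and do not come from the pattern itself.

```python
def missing_config_keys(required_keys, target_config):
    """Return the required keys that are absent from the target environment's config."""
    return sorted(key for key in required_keys if key not in target_config)


def preflight(required_keys, target_config):
    """Block the deployment up front if configuration was not promoted with the build."""
    missing = missing_config_keys(required_keys, target_config)
    if missing:
        raise RuntimeError(
            "Deployment blocked; configuration not promoted: " + ", ".join(missing)
        )


if __name__ == "__main__":
    # Hypothetical data: keys the new build reads, and what the QA environment holds.
    required = {"db.url", "cache.ttl", "feature.new_checkout"}
    qa_config = {"db.url": "jdbc:postgresql://qa-db/app", "cache.ttl": "300"}
    print(missing_config_keys(required, qa_config))
```

The point of the sketch is where the fix lives: the check becomes part of the automated process, so the same omission is caught in every later environment without anyone logging into a server.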

Consequences

  1. Teams adopting this strategy will, over time, enhance their automation to account for more classes of change and will see significantly fewer failures.
  2. Because effort goes into fixing the automation rather than applying a manual "quick fix", short-term downtime may be higher for any given incident. For that reason, this pattern may not be applicable to the production environments of some critical systems.
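
To judge whether the first consequence is materializing, a team might tally the causes of failed deployments and see which ones keep recurring; those are the candidates for an automation fix. The sketch below assumes a hypothetical log format (a list of dictionaries with `environment` and `cause` fields) and a hypothetical function name, `recurring_causes`; neither is part of the pattern.

```python
from collections import Counter


def recurring_causes(failure_log, threshold=2):
    """Return (cause, count) pairs seen at least `threshold` times, most frequent first."""
    counts = Counter(entry["cause"] for entry in failure_log)
    return [(cause, n) for cause, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Hypothetical failure records, one per failed deployment.
    log = [
        {"environment": "dev-test", "cause": "config not promoted"},
        {"environment": "qa", "cause": "config not promoted"},
        {"environment": "qa", "cause": "disk full"},
        {"environment": "staging", "cause": "config not promoted"},
    ]
    for cause, count in recurring_causes(log):
        print(f"{cause}: {count} failures")
```

A cause that shows up in several environments in a row, as in the sample data, is the as-broken propagation described under Applicability.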

 


© 2012   Created by Kit Corry.
