Cut Out Debilitating Outages and Build Production Resiliency With DataOps

By Kobi Korsah posted 13 days ago

  
While technology failure and system downtime are inevitable, SRE teams that invest in modernizing their data operations will be best positioned to deal with unplanned outages.

This article was originally published on the Delphix website on September 8, 2020.

Like data breaches, IT outages are a terrifying threat to companies. Service interruptions are a leading cause of reputational damage, customer dissatisfaction, and financial losses. On average, a single hour of downtime costs an organization $126,000.

Take TSB Bank’s IT meltdown in 2018: the retail and commercial bank achieved infamy for the largest banking system failure in UK history, which cost the business $430 million.

Software bugs and usability glitches make up the majority of production operation challenges. When you trace back to the cause of those defects, the majority are due to stale and invalid data in non-production environments. While technology failure and system downtime are inevitable, SRE teams that invest in modernizing their data operations will be best positioned to deal with unplanned outages.

As it turns out, findings from a new analyst report by IDC indicate that adopting an API-driven data operations platform mitigates business risks, accelerates time to market, and positively contributes to ongoing digital transformation initiatives.  

Study participants reported that their use of a data operations (DataOps) platform enabled automated data delivery that streamlined end-to-end integration testing. Accelerating and improving the testing process reduced the number of errors per application by 73%. 

“One of the biggest things is being able to manage data in minutes instead of hours or days,” a participant stated. “For example, if we want to do parallel testing or regression testing, we can set bookmarks as project teams do their work and they can rewind or fast forward to those bookmarks. We are able to do different kinds of testing within the same environment.”
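
To make the bookmark idea in that quote concrete, here is a minimal sketch of how rewinding and fast-forwarding between named bookmarks might behave. It is a toy in-memory model, not the Delphix API: the class `VirtualDataset` and its methods `bookmark`, `jump_to`, and `apply` are hypothetical names chosen for illustration, and a real platform would work on virtualized database copies rather than Python lists.

```python
from copy import deepcopy


class VirtualDataset:
    """Toy in-memory stand-in for a test dataset with named bookmarks.

    Hypothetical illustration only; not a real platform API.
    """

    def __init__(self, rows):
        self._rows = deepcopy(rows)
        self._bookmarks = {}  # bookmark name -> saved copy of the rows

    def bookmark(self, name):
        """Record the current state under a name."""
        self._bookmarks[name] = deepcopy(self._rows)

    def jump_to(self, name):
        """Rewind or fast-forward: restore the state saved under `name`."""
        self._rows = deepcopy(self._bookmarks[name])

    def apply(self, change):
        """Apply a test's mutation (a function that edits the rows in place)."""
        change(self._rows)

    @property
    def rows(self):
        """Return a copy so callers cannot alter the stored state by accident."""
        return deepcopy(self._rows)


ds = VirtualDataset([{"id": 1, "balance": 100}])
ds.bookmark("before-regression")
ds.apply(lambda rows: rows[0].update(balance=0))  # a destructive test run
ds.bookmark("after-regression")

ds.jump_to("before-regression")  # rewind to the clean state
rewound = ds.rows
ds.jump_to("after-regression")   # fast forward to the post-test state
forwarded = ds.rows
```

In this sketch, two project teams could share one environment: each jumps to the bookmark it needs before a test run instead of waiting for a fresh copy of the data.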

Application development teams also decreased the number of errors reaching UAT by 55%, which dramatically reduced the need to retest and recode. More importantly, the number of errors leaking into production was cut by 70%, which significantly drove down business risk.

As a result, organizations saw unplanned downtime incidents drop by 65%, and deploying the platform enabled users to reclaim 76% of the time they had been losing to application downtime. IDC also determined that these organizations were clawing back about $72,000 in lost revenue through improved availability.

“Our application performance has improved by 10–15%,” another participant added. “Delphix impacts performance because the data is all virtual layer. Employees are more productive because they are getting their changes to the application faster and the features give them efficiency. And if there is a bug, the resolution is quicker.”

These benefits were driven by two capabilities enabled by the Delphix DataOps Platform: 

  • Automating data delivery with APIs: Reducing the wait time for secure copies of non-production data, so that teams can provision, refresh, and integrate development environments within CI/CD workflows.
  • Higher-quality data: Improving the quality of testing made it possible to run additional testing cycles, which in turn led to fewer data-related defects and errors.
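
As a rough illustration of the first capability, the sketch below shows how a CI/CD stage might refresh its test data through an API call before running integration tests. Everything here is an assumption made for the example: `DataPlatform`, `FakePlatform`, `integration_stage`, and the `refresh` method are hypothetical names, not the Delphix API, and the fake client stands in for whatever HTTP or SDK call a real platform exposes.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class DataPlatform(Protocol):
    """Hypothetical interface: anything that can refresh an environment's data."""

    def refresh(self, environment: str) -> str:
        """Refresh the environment and return an identifier for the new copy."""
        ...


@dataclass
class FakePlatform:
    """In-memory stand-in so the sketch runs without a network or a real service."""

    calls: list = field(default_factory=list)

    def refresh(self, environment: str) -> str:
        self.calls.append(environment)
        return f"{environment}-snapshot-{len(self.calls)}"


def integration_stage(
    platform: DataPlatform,
    environment: str,
    tests: list[Callable[[str], bool]],
) -> tuple[str, dict[str, bool]]:
    """Refresh the data first, then run every test against the fresh copy."""
    snapshot_id = platform.refresh(environment)
    results = {test.__name__: test(snapshot_id) for test in tests}
    return snapshot_id, results


def smoke_test(snapshot_id: str) -> bool:
    """Placeholder test: only checks that it received data for the CI environment."""
    return snapshot_id.startswith("ci-env")


platform = FakePlatform()
snapshot, results = integration_stage(platform, "ci-env", [smoke_test])
```

The design choice in this sketch is that the refresh is part of the pipeline stage itself, so every test cycle starts from current data rather than from whatever stale copy was last provisioned by hand.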

Final Thoughts

Brands that deliver superior customer experience bring in 5.7 times more revenue, according to research. Ensuring production resiliency is critical to responding to turbulent changes in customer demand and dealing with an unanticipated outage. 

Businesses often look to disaster recovery (DR) as the immediate and obvious solution when production data is corrupt. But DR is a massive undertaking that slows the search for the root cause: a full database restore into a non-prod environment could take weeks, especially for a multi-terabyte, data warehouse-like environment. In short, traditional backup and recovery solutions delay releases and inhibit the investigation of issues post-release.

Mission-critical systems demand substantially higher levels of test quality and coverage than non-prod systems. To maintain service availability and perform real-time incident response, SREs need an API-driven data operations platform that supports thorough testing, both to determine whether a deployed system is working correctly and to better understand the reliability of that system.
