Is provisioning data taking days, creating severe delays in your highly automated Continuous Integration and Continuous Delivery pipelines? You probably have a data friction problem.
This piece is an abbreviated version of the original article on sdarchitect.blog published May 6, 2019 - see the full blog post here.
Almost a year and a half ago, I started seeing a new trend in DevOps adoption. At the time, I was with IBM as global CTO of DevOps Adoption, leading the DevOps adoption practice across IBM business units. Clients were beginning to ask for help with bottlenecks in their delivery pipelines caused by the difficulty of delivering data to non-production environments in a timely manner.
Provisioning data was taking days, creating severe delays in their highly automated Continuous Integration and Continuous Delivery (CI/CD) pipelines. Data friction had become the bottleneck. It’s important to note these were clients who were mature DevOps adopters.
They had already addressed the first few key technology, process and organizational bottlenecks, especially those related to application delivery and infrastructure provisioning. They could provision any environment on demand and deploy their application or (micro)service to it via self-service by the practitioner needing to deploy. But they were struggling to get the right data to the right environment when they needed it.
Most organizations start their CI/CD journey of achieving flow in their delivery pipeline by addressing deployment automation of application code and the provisioning automation of environments. Some are even adopting push-button, full-stack provisioning for every deployment. Data provisioning as a part of the environment is typically the least automated set of steps.
There is, of course, no need to provision a database with production-like test data in a test environment within minutes if provisioning the infrastructure, configuring the middleware, or deploying the application takes days or even weeks. But even when we speak of an automated full-stack deployment, the stack typically includes the database instance as part of the middleware and not the automation that provisions data into it.
Provisioning test data is typically a separate step owned by database administrators (DBAs), who control, manage, secure and provision the data to the databases deployed. That separation is a huge problem. The DBA team cannot be treated as a silo, apart from the rest of the teams. You cannot achieve true CI/CD flow without addressing data provisioning, which means getting the right data to the right environment and making it accessible to the right practitioners at the right time in a secure and compliant manner.
Breaking Down the Data Silo
There are several reasons for data silos. The first and foremost reason has to do with data security and compliance. As developers adopt CI/CD, more and more builds get delivered and testers and QA practitioners need to run more and more tests. They all need more dev and test environments with every individual needing his or her own instance of test data, which means the amount of data that needs to be stored and secured in these non-production environments is now exponentially higher.
For a typical organization that is mature on the DevOps adoption curve, it is not unusual to see dozens of non-prod data instances for each production database. I recently engaged with a customer who has adopted CI/CD and is actively addressing the data challenges we are discussing.
For the 52 production databases they have for one of their business units, they had 3,092 non-prod data instances provisioned. That is approximately 60 non-prod data instances for each production database, and they expect this to go up at least 10x as they continue to improve the automation in their application delivery pipeline, giving each Dev/Test practitioner their own data instance to work against while allowing them to branch datasets in these data instances in ways that align with their development branching.
Think one test data instance for each Git branch for every developer and tester. All these instances need to be stored and secured. For this reason, organizations cannot achieve this without introducing the right technology, processes, and culture, also called DataOps. If each of the production databases in this scenario had an average of just 100 GB of a subset of data that was needed for testing, the 3000+ non-production data instances would require more than 300 TB of storage.
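As a rough check on these figures, here is a minimal Python sketch of the arithmetic. The variable names are illustrative, and the calculation assumes every non-prod instance is a full physical copy of a 100 GB subset, with decimal units (1 TB = 1,000 GB).

```python
# Figures quoted in the article for one business unit.
production_dbs = 52
non_prod_instances = 3_092
avg_subset_gb = 100  # assumed average size of the test subset per instance

# Ratio of non-prod data instances to production databases.
instances_per_prod_db = non_prod_instances / production_dbs

# Storage needed if every instance is a separate physical copy (decimal TB).
full_copy_storage_tb = non_prod_instances * avg_subset_gb / 1_000

# The customer expects the instance count to grow at least 10x.
projected_instances = non_prod_instances * 10

print(f"{instances_per_prod_db:.1f} non-prod instances per production database")
print(f"{projected_instances:,} instances after a 10x increase")
print(f"{full_copy_storage_tb:.1f} TB of storage for full physical copies")
```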
Obviously, that’s not an option, and it’s even worse when you consider real database sizes and need to scale that across business units, to hundreds of databases within the enterprise with 100s of TB of data each. Then, you’ve got to consider the time and DBA resources required to keep all these databases refreshed and up-to-date with current data to make tests viable. These data instances will also need to be regularly rewound to the initial state after every destructive test that is executed against them.
That’s a far cry from where most organizations are today, with multiple Dev/Test teams sharing one single test database and stumbling over each other as they use it. In parallel, there is also the need to manage and maintain the various versions of schemas as they evolve with code.
In addition, the security and compliance challenges are amplified as the number of data instances grows, and the need to secure the data in non-prod environments becomes a real blocker. Who can have access to which data set? Which environments can they provision the data to? Are there regulatory compliance constraints that vary by the classification or risk sensitivity of the data being provisioned? Or what about the geographical location of the environment the data is being provisioned to?
To compound the challenge further, non-prod environments typically have lower levels of security hardening than production environments. Even when hardened against intrusion, they provide the proverbial ‘keys to the kingdom’ to the Dev/Test practitioners who need these environments and need access to the data. When the exposed surface area is, say, 60x larger, it presents a massive challenge.
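These questions can be expressed as a policy check that runs before any data is provisioned. The following is a minimal sketch, not any product's actual policy model: the roles, classifications, environments, regions, and the `POLICY` table are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvisionRequest:
    user_role: str            # who is asking for the data
    data_classification: str  # risk sensitivity of the data set
    target_environment: str   # where the data would be provisioned
    target_region: str        # geographical location of that environment


# Hypothetical policy table: one rule per data classification.
# A "regions" value of None means no geographical restriction.
POLICY = {
    "public": {
        "roles": {"developer", "tester", "dba"},
        "environments": {"dev", "test", "staging"},
        "regions": None,
    },
    "confidential": {
        "roles": {"tester", "dba"},
        "environments": {"test", "staging"},
        "regions": {"eu"},
    },
}


def is_allowed(request: ProvisionRequest) -> bool:
    """Return True only if every policy dimension permits the request."""
    rule = POLICY.get(request.data_classification)
    if rule is None:
        return False  # deny by default for unknown classifications
    if request.user_role not in rule["roles"]:
        return False
    if request.target_environment not in rule["environments"]:
        return False
    regions = rule["regions"]
    return regions is None or request.target_region in regions


print(is_allowed(ProvisionRequest("developer", "public", "dev", "us")))
print(is_allowed(ProvisionRequest("developer", "confidential", "test", "eu")))
```

Denying by default for unknown classifications keeps a newly added data set from being provisioned before someone has decided how it should be governed.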
Architecting the Next Frontier of Your DevOps Journey
The solution is to bring all the practitioners who manage, govern and secure data, including data analysts, DBAs and security admins, into the DevOps fold. Make them a part of the DevOps adoption initiatives and help them adopt DevOps practices. This is no different from what has already been done for the Dev/Test – Sec – Ops practitioners.
From the process and culture perspective, this will get them on the path to becoming more agile and reducing the impedance mismatch between their processes and those of other teams already adopting DevOps practices. They’ll start thinking in terms of addressing data friction, and they’ll become a core part of the team of practitioners all jointly achieving flow in the delivery pipeline.
But in order to successfully achieve this, they’ll also need a toolset of modern data management practices in their arsenal, designed with DevOps practices in mind. This toolset needs to have these core capabilities:
- The ability to take data from multiple data sources and database types and provision virtual instances of data to any non-production environment in the delivery pipeline, via self-service by practitioners who need the data, when and wherever they need it.
- The ability to mask and secure the data, so they can be compliant with regulatory and corporate controls through a policy or rule-based governance approach.
- The ability to fully integrate into the CI/CD pipeline automation framework of choice through an API-driven, self-service interface, allowing practitioners to manage, control, and collaborate similar to how they do for code.
- The ability to manage and govern all data instances through a single pane of control.
- Lastly, the ability to move fast. No more waiting for data provisioning that takes days or hours. They should be able to operate at the speed of the CI/CD tools in the pipeline.
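To make the masking capability in the list above concrete, here is a minimal sketch of rule-based masking in Python. It is a generic illustration, not how any particular data platform implements masking: the field names, the `MASKING_RULES` table, and the key are hypothetical. It uses a keyed hash (HMAC-SHA256) so that the same input always maps to the same masked value, which keeps joins across masked tables consistent.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would come from a secret store.
MASKING_KEY = b"example-only-key"


def pseudonymize(value: str) -> str:
    """Replace a value with a deterministic, keyed token."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def mask_email(value: str) -> str:
    """Produce a syntactically valid but non-routable email address."""
    return f"user_{pseudonymize(value)}@example.invalid"


def redact(value: str) -> str:
    """Hide the value entirely while preserving its length."""
    return "*" * len(value)


# Hypothetical rule table: which masking function applies to which field.
MASKING_RULES = {
    "name": pseudonymize,
    "email": mask_email,
    "ssn": redact,
}


def mask_record(record: dict) -> dict:
    """Apply the matching rule to each field; pass other fields through unchanged."""
    return {
        field: MASKING_RULES[field](str(value)) if field in MASKING_RULES else value
        for field, value in record.items()
    }


row = {"id": 42, "name": "Ada Lovelace", "email": "ada@example.com", "ssn": "123-45-6789"}
print(mask_record(row))
```

Because the rules live in a table instead of in application code, the people who govern the data can review and change them without touching the pipeline itself.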
Organizations that are leveling up on the DevOps maturity curve are now addressing the friction that data causes in delivery. Data access can no longer be the bottleneck that slows down the delivery pipeline. In today’s data-driven world, addressing data friction is table stakes for achieving CI/CD flow in the application delivery pipeline.