Six tips to control spiraling cloud costs
Ravi Julapalli, Director Enterprise Solutions
Insights from the field on cloud cost optimization
Public cloud services such as AWS, Azure, and GCP provide unprecedented agility that businesses are only beginning to tap as adoption rates rise. And while organizations are not necessarily scaling back on cloud transformation projects given recent economic headwinds, they have prioritized controlling spend through cloud cost optimization.
The term FinOps is often thought of as synonymous with cloud cost management, but FinOps is more than that. Cloud FinOps is a holistic discipline that focuses on collaboration among engineering, finance, technology, and business teams to make data-driven spending decisions: decisions that reflect the business value of cloud expenditures and governance policies.
UST embraces best practices of the FinOps.org framework when developing comprehensive assessments and solutions for our customers. Our FinOps Transformation practice ensures a cloud operating model that meets the needs of the business without disrupting cloud efficiencies, security, or the speed of innovation you have worked so hard to achieve.
Here I'm sharing my top tips, which have proven to save our clients money on cloud costs and can be applied at any stage of your cloud journey. Note that although I use AWS-centric language in this post, the same holds true for the other major cloud providers such as Azure, GCP, Alibaba, and OCI.
1. “Lift” architecture is different from the “Shifted” architecture in a “Lift & Shift” project
We’ve seen over and over that the ‘lift and shift’ model for apps and databases ends up being very expensive in the long run. While you may be avoiding the upfront CapEx budget hit when you move to the cloud, keep in mind that those on-prem hardware costs were typically amortized over three years. It may seem like the monthly OpEx charge saves you money on those lift and shift apps, but after three years, cloud infrastructure resources can cost much more than a legacy hardware solution.
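To make the amortization argument concrete, here is a minimal sketch of the break-even arithmetic. The function names and all figures are hypothetical, chosen only for illustration. A real comparison would also need to account for power, staffing, hardware refresh, and cloud discounts.

```python
def cumulative_cost_on_prem(months: int, capex: float, monthly_run: float) -> float:
    """Upfront hardware purchase plus a flat monthly running cost (assumed model)."""
    return capex + monthly_run * months


def cumulative_cost_cloud(months: int, monthly_charge: float) -> float:
    """Pure OpEx: no upfront cost, a flat monthly charge (assumed model)."""
    return monthly_charge * months


def break_even_month(capex: float, monthly_run: float, monthly_charge: float):
    """First month in which cumulative cloud spend exceeds on-prem spend.

    Returns None when the cloud charge never overtakes the on-prem cost.
    """
    if monthly_charge <= monthly_run:
        return None
    month = 1
    while cumulative_cost_cloud(month, monthly_charge) <= cumulative_cost_on_prem(
        month, capex, monthly_run
    ):
        month += 1
    return month
```

With an illustrative 360,000 hardware purchase, 2,000 per month to run it, and a 12,000 monthly cloud charge for the same unmodified workload, `break_even_month(360_000, 2_000, 12_000)` returns 37: the cloud bill overtakes the on-prem cost just after the three-year mark.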
That’s because lift and shift apps can’t fully leverage the flexibility of cloud architecture, which leaves a lot of resources sitting idle for extended periods. Batch applications, for instance, only need resources at specific times and lie idle for the rest of the day. Costs add up quickly when you factor in all the environments needed to support the application pipeline, such as Dev, UAT, and Prod. By automating resource provisioning and tear-down, you can take full advantage of the programmability of cloud infrastructure and realize significant savings.
Transforming your application to a cloud-native design may be a heavy lift, but the pay-off is not only cost savings but also scalability and flexibility for the business. Rearchitecting legacy apps to a distributed microservices architecture facilitates rapid deployments and, in turn, faster time to market for innovation. While redesigning applications for the cloud, adding FinOps principles to non-functional requirements like performance and security will also greatly reduce costs.
Rearchitecting data workloads can save a tremendous amount on cloud storage costs, which can be as much as 10x higher than on-prem storage. For one of our customers, we redesigned data workflows by landing the data from their RDBMS in an S3 bucket, then running an ETL transformation so the data could be stored in a less expensive cloud-native NoSQL database.
2. Develop a cloud governance model based on FinOps principles
Effective cloud cost management isn’t just about optimizing expenditures. A robust FinOps governance model provides transparency, deployment controls, and audit mechanisms to avoid surprises in your monthly cloud bill.
Organizational mapping of cloud resources to business units becomes the foundation for visibility and control. Accountability through cost chargebacks to the line of business re-establishes the financial governance often lost in the shift to an OpEx model. That means standard financial levers such as budgeting, forecasting, and cost/benefit analysis are again available to business leaders. Even shared costs like Jira licenses or application gateways can be split across lines of business to avoid unplanned black box infrastructure costs.
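As an illustration of splitting a shared cost across lines of business, the sketch below allocates a bill in proportion to a usage measure such as seat count or request volume. The function name, the proportional rule, and the choice to give any rounding remainder to the largest consumer are my assumptions for this example. Your finance team may prefer an even split or fixed percentages.

```python
from decimal import Decimal, ROUND_HALF_UP


def allocate_shared_cost(total: Decimal, usage_by_unit: dict) -> dict:
    """Split a shared cost across business units in proportion to usage.

    Shares are rounded to cents. Any rounding remainder goes to the largest
    consumer so that the shares always sum exactly to the total.
    """
    total_usage = sum(usage_by_unit.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")

    cent = Decimal("0.01")
    shares = {
        unit: (total * usage / total_usage).quantize(cent, rounding=ROUND_HALF_UP)
        for unit, usage in usage_by_unit.items()
    }

    # Rounding can leave a cent or two over or under; settle it on one unit.
    remainder = total - sum(shares.values())
    largest = max(usage_by_unit, key=usage_by_unit.get)
    shares[largest] += remainder
    return shares
```

For example, a 1,000.00 license bill split over hypothetical units with 50, 30, and 20 seats is charged back as 500.00, 300.00, and 200.00.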
While we want to avoid standardizing on legacy approval processes that may slow development velocity, governance rules are still necessary. A modern approach is to provide engineers access to pre-approved self-service portals to independently deploy resources in a compliant manner without the overhead of manual processing of service tickets.
3. Replace outdated SOPs
Updating CI/CD processes to include infrastructure as code and automated environment build-out leveraging cloud APIs also eliminates waste due to idle or overprovisioned resources. Automating the spin-up and tear-down of test environments once the testing is complete or at the end of the day should become a standard operating procedure.
Having all the automation in the world won't help, though, if someone needs to shut down an environment and the request goes unaddressed for 24 hours. Artificial queues created by ticket-based systems not only create barriers to development flow but also result in charges for unused resources. This is a common problem, especially in large enterprises and in regulated industries that have created precise but inflexible processes that weren’t designed to take advantage of ephemeral infrastructure.
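A tear-down policy of this kind can be sketched in a few lines. The `Environment` record, the two-hour idle limit, and the helper names below are hypothetical. In practice the inventory would come from your cloud provider's APIs or your infrastructure-as-code state, and the selected environments would be passed to your provisioning tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Environment:
    name: str
    stage: str               # e.g. "dev", "uat", "prod"
    last_activity: datetime  # last deployment or test run seen
    hourly_cost: float


def environments_to_tear_down(envs, now, idle_limit=timedelta(hours=2)):
    """Select non-production environments idle for longer than idle_limit."""
    return [
        env for env in envs
        if env.stage != "prod" and now - env.last_activity > idle_limit
    ]


def cost_of_delay(env: Environment, delay: timedelta) -> float:
    """Charges accrued while a shutdown request waits in a ticket queue."""
    return env.hourly_cost * delay.total_seconds() / 3600
```

The second helper puts a number on the ticket queue: an environment costing 12.50 per hour that waits 24 hours for a shutdown accrues 300.00 in charges for resources nobody is using.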
4. Determine acceptable performance thresholds in production and DR environments
Businesses should have the freedom to create the cloud experience they want while weighing the cost versus the benefit of design choices. A crucial aspect to consider is acceptable application performance metrics. For instance, with one of our clients, we estimated that relaxing the acceptable latency threshold from 2 seconds to 2.5 seconds during peak load periods would save more than 10% in cloud costs.
To ensure an overall cost-effective solution, it is important to consult business teams and take program budgets into consideration during the design process. For example, the general tendency is to have a DR environment that replicates the production environment. Consider, though, whether your DR environment really needs to be the same size as production. Depending on its tolerance, the business might accept slightly lower performance only when disaster strikes. Your DR performance strategy could be 95% of the time at 2-second latency and 5% at 5-second latency. You are still insuring against outages if failover occurs, but serving at a lower performance threshold. If that is acceptable, then your DR environment can be half the size of your production environment.
5. Examine default configurations and associated costs
Many companies are opting for multi-region deployments for applications or PaaS databases like AWS RDS or Azure SQL. Cloud providers can sometimes create up to seven read replicas of your managed database services by default, which significantly impacts storage costs. But not all apps and databases need the same load-balanced, high-availability, multi-region configuration. This default design also creates a tremendous amount of traffic between regions, resulting in higher data transfer costs.
It's important to consider your HA and DR policies when deciding on the number of regions. Going with a single region, or just two regions for geo-redundancy, may increase RPO and RTO slightly. Will that be too slow for the business, or is it an acceptable compromise for that application? If the compromise is acceptable, you can eliminate as much as 60% of the unnecessary replication cost.
It is also important to review the advanced configuration settings of your PaaS database services. Non-production environments don't always require the same high availability/disaster recovery configuration as production environments.
6. Smart rightsizing
Underutilization is the most common issue that we see with our customers. Engineering teams tend to be cautious when determining the resources they need, which applies not only to compute resources like VMs, but also to containers and serverless functions. Built-in services like AWS Compute Optimizer can analyze the utilization of resources such as VM instances and Lambda functions to provide optimization recommendations.
It's a good idea to continuously monitor the utilization of resources to identify and remove any unused resources, or resize them appropriately, while ensuring that doing so won't affect any production workloads. Validation rules should be put in place so that you are not terminating resources that have something live running on them. Rightsizing non-prod environments is just as important. Among the most common contributors to cost overruns we see are underutilized or idle performance testing environments that are sized identically to production.
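The monitoring and validation rules described above might look something like the following sketch. The `ResourceUsage` fields and the 5% and 40% CPU thresholds are assumptions for illustration. Real thresholds depend on the workload, and a production rule set would look at memory, I/O, and network as well.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    name: str
    peak_cpu_pct: float      # highest CPU utilization over the review window
    active_connections: int  # live sessions or requests observed right now


def recommend(resource: ResourceUsage,
              idle_peak_pct: float = 5.0,
              downsize_peak_pct: float = 40.0) -> str:
    """Return 'terminate', 'review', 'downsize', or 'keep' for one resource."""
    if resource.peak_cpu_pct < idle_peak_pct:
        # Validation rule: never terminate something with live traffic on it,
        # however idle the CPU looks. Flag it for a human to check instead.
        if resource.active_connections > 0:
            return "review"
        return "terminate"
    if resource.peak_cpu_pct < downsize_peak_pct:
        return "downsize"
    return "keep"
```

Keeping the live-traffic check inside the same function as the termination rule means no caller can reach a "terminate" result without passing the validation first.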
Other areas to consider for infrastructure cost optimization include:
- Shifting workloads from high-cost to lower-cost regions
- Leveraging cost-saving reserved instances or spot instances
- Deleting old back-ups or unattached storage volumes
When implementing cost optimization measures, some choices, such as rightsizing infrastructure, are straightforward, while others are more challenging. My view of the savings vs. effort spectrum is shown in the diagram below:
Of course, cost optimization is only one aspect of a FinOps strategy. UST’s FinOps Transformation practice ties business goals to cloud architecture, development and build processes, operations, and governance policies. We’ll help you develop a transformation roadmap and support you with implementation across your cloud estate.
See how UST helped a global tech company implement governance policies, optimize utilization, and create forecasts for cloud usage.