Steps towards reducing Data Lake costs



 By Ashish Dubey, VP Solutions Architecture, Qubole

In an unpredictable world, cost savings are worth their weight in gold. Organizations navigating an unstable market and gloomy market drivers face a Catch-22: persist with current spending to outlast the dip, or cut costs drastically and risk compromising their competitiveness. Nowhere is this truer than for data lakes, which power the machine learning and data analytics that help businesses deliver a better customer experience, yet make it easy for seemingly infinite cloud-based resources to spiral out of hand quickly.

There is a better way: financial governance that gives enterprises the best of both worlds. A well-planned approach that keeps data demand and cloud resources in balance creates an environment that breeds innovation and keeps costs optimized with minimal wastage, all without compromising critical client SLAs.

Here’s a ready reckoner for CIOs looking to keep harnessing the power of the cloud and data analytics without letting burgeoning data lake costs hurt the bottom line:

  1. Encourage user diligence – Unrestrained resource usage is the hallmark of disjointed data teams that do not understand the larger business implications of spiraling infrastructure costs. Instilling responsible, frugal data practices within teams helps everyone make the right decisions on resource usage for each application being run. The compounding benefits can be massive.
  2. Monitoring shall set you free – If you do not understand the where, what and how of your data resource consumption, you will not be able to plug wastage or identify areas of improvement at the user or cluster level. Course correction is easier when your radar is working. So monitor, monitor, monitor.
  3. Autoscaling on tap – Aggressively scaling clusters up when they are needed and shutting them down when they are not drastically reduces costs, especially when budgets are tight. System engineers can decide how aggressively to apply this technique depending on the SLA criticality of the application being run. Proof-of-concept and development projects can benefit tremendously from this option.
  4. Lifecycle management timetable – Analyze your business usage patterns to understand when peaks occur and when clusters run at full capacity. This allows businesses to move non-critical tasks to lower-usage time slots in the day.
  5. Be inclusive of all clusters – Every application needs a certain type of infrastructure and workload resilience based on its patterns of data inflow. Heterogeneous clusters provide nodes of various instance types (on-demand or spot, for example), and the mix can be automated with DIY scripts without compromising the ability to meet load demand.
  6. Engine based testing – Faster queries mean lower resource utilization. By running queries on commonly deployed engines like Presto, Hive and Spark, we can gauge where each runs fastest. The result is better end-user performance.
  7. Tighten under-utilized resources – It is common to overestimate system criticality when a new application is deployed, because system stability usually takes precedence over resource utilization. However, businesses should adopt sound iterative principles so that allocations are adjusted as data patterns become clear. This requires precise drafting of data usage policies and contingency planning.
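
To make the monitoring and under-utilization points above concrete, here is a minimal sketch in Python of how a team might flag clusters that cost money while sitting idle. Everything in it is an assumption made for illustration: the `ClusterUsage` record, its field names, the sample figures and the 50% utilization threshold are hypothetical and do not come from any particular cloud provider or monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ClusterUsage:
    """One monitoring record for a cluster (hypothetical schema)."""
    name: str
    hourly_cost: float  # cost per hour while the cluster is up
    hours_up: float     # hours the cluster was running in the period
    hours_busy: float   # hours it was actually executing workloads


def utilization(usage: ClusterUsage) -> float:
    """Fraction of uptime spent doing useful work (0.0 if never up)."""
    if usage.hours_up <= 0:
        return 0.0
    return usage.hours_busy / usage.hours_up


def idle_cost(usage: ClusterUsage) -> float:
    """Money spent on hours when the cluster was up but idle."""
    return max(usage.hours_up - usage.hours_busy, 0.0) * usage.hourly_cost


def flag_wasteful(records, min_utilization=0.5):
    """Return (name, idle cost) for clusters below the threshold, costliest first."""
    flagged = [
        (r.name, idle_cost(r))
        for r in records
        if utilization(r) < min_utilization
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only; real figures would come from your monitoring data.
    sample = [
        ClusterUsage("etl-prod", hourly_cost=12.0, hours_up=100, hours_busy=90),
        ClusterUsage("dev-poc", hourly_cost=4.0, hours_up=100, hours_busy=10),
        ClusterUsage("adhoc-bi", hourly_cost=8.0, hours_up=50, hours_busy=20),
    ]
    for name, cost in flag_wasteful(sample):
        print(f"{name}: {cost:.2f} spent on idle hours")
```

A report like this is only a starting point. Clusters that show up at the top of the list are natural candidates for more aggressive autoscaling, scheduled shutdown or a smaller footprint, subject to the SLA criticality of the workloads they serve.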

A set of training wheels lets new cyclists learn to balance much sooner and without too much damage. Strong financial governance does the same in the cloud. With an uncertain macroeconomic climate looming, organizations need to apply these principles stringently and cyclically to allow fast-paced development and innovation without budgetary surprises.

