By Viswanath C, Sr. Director of Engineering, Fivetran India
More businesses in India are turning to public cloud services to digitalise their operations and gain a competitive advantage. According to IDC, the local public cloud market will grow at a 23.1 percent CAGR until 2026, with sectors like manufacturing, banking, and insurance leading the way.
Combined with the Digital India initiative, which is increasing smartphone and Internet access and usage among the population, there is a wealth of data for companies to analyse and base their decision-making on. Real-time data streaming may seem like the intuitive way to make the most of this phenomenon. But there are instances when it is not optimal from either an operational or a financial standpoint.
For one, tools with near real-time data streaming cost more, both financially and in the work hours needed to set them up. There is also the question of maintaining the data pipelines. Unless the data connectors are fully automated, data engineers have to modify them whenever an app or any other data source is upgraded or otherwise changed.
Data teams can minimise spending on data integration systems while maximising their impact by pursuing real-time data access selectively. There are cases where real-time access is not necessary and can even compromise the integrity of an enterprise’s data.
Differentiating between the need for real-time data and batched data
A common misconception about machine learning is that all data needs to be instantaneous. The data that triggers a decision needs to be real time, but the data used to build and refine models does not. Teams can look at how often their executives make decisions based on model outputs and use this cycle time to inform how they manage their data and how fast they need it.
Some use cases — like credit card fraud algorithms and automated stock trading algorithms — require updates within seconds. These examples utilise computer-based decision models with little human input needed.
On the other end of the spectrum, manufacturing IoT data for product optimisation, defect density management, and predictive maintenance need not be updated within such a short time frame. Examples in retail include sifting through customer feedback, purchase histories, and online behaviour to create customer personas. For finance, there is credit risk analysis and investment risk modelling. In these situations, humans have more input on the data.
With the first category of use cases, decisions are automated and real-time data streaming is necessary. But with the second, companies can save significant costs and resources by batch processing data at regular intervals such as hourly, daily, or even weekly.
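The dividing line described above can be sketched as a simple rule of thumb. The function and thresholds below are illustrative assumptions, not part of any product: the point is that machine-speed, unattended decisions justify streaming, while human-paced decisions can be served by scheduled batch loads.

```python
# Hypothetical decision rule sketching the streaming-vs-batch split
# described above. Names and the 60-second threshold are illustrative.

def recommend_sync_mode(decision_latency_seconds: float, human_in_loop: bool) -> str:
    """Suggest a data delivery mode for a given use case."""
    if not human_in_loop and decision_latency_seconds < 60:
        # e.g. fraud scoring or automated trading: a machine acts in seconds
        return "streaming"
    # e.g. persona building or risk modelling: people act over hours or days
    return "batch"

print(recommend_sync_mode(2, human_in_loop=False))      # prints "streaming"
print(recommend_sync_mode(86400, human_in_loop=True))   # prints "batch"
```

In practice the threshold would come from the audit steps discussed below: how fast the downstream action actually happens, not how fast the data could move.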
Identify whether real-time data or batched data is needed
Data teams can determine how data is being used by their organisations, and whether real-time streaming is necessary, through the following:
- Go through the whole data integration process: Analyse the process of data ingestion and determine how often decisions are made and whether they are made by a computer, an individual, or a group of people. If humans are part of the downstream actions, the whole process will take anywhere from hours to weeks, and making the data move a few minutes faster won’t have a noticeable impact on the quality of decisions.
- Evaluate the complete latency/cycle time for processing data: Latency in data movement is only one factor in how long it takes to get results back. Track the total time between logging an event, processing and potentially transforming that data, running the analytics model, and presenting the data back. Then use the length of this cycle to evaluate how quickly executives make decisions.
- Identify tools for both real-time and batched data: Which tools work well for each function? What are the organisation’s requirements in terms of familiarity, features, cost, and reliability? This review should point to two or three systems that cover the needs for both real-time and batched data. Then look at how these tasks correlate with the needs of different teams and the capabilities of different tools.
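The latency audit in the second step amounts to simple accounting: sum every stage from event to decision, then see how small a share pure data movement represents. The stage names and durations below are made-up figures for one hypothetical pipeline.

```python
# Illustrative cycle-time accounting for the latency audit above.
# Stage durations (in minutes) are assumed figures, not measured data.
stages = {
    "event_logged_to_ingested": 5,     # data movement: the part streaming speeds up
    "transformation": 20,
    "model_run": 30,
    "dashboard_refresh": 10,
    "human_review_and_decision": 24 * 60,  # a day of human turnaround
}

total = sum(stages.values())
pipeline_share = stages["event_logged_to_ingested"] / total

print(f"total cycle time: {total} minutes")          # prints "total cycle time: 1505 minutes"
print(f"data movement is {pipeline_share:.1%} of the cycle")  # prints "data movement is 0.3% of the cycle"
```

When data movement is a fraction of a percent of the cycle, as in this sketch, shaving minutes off ingestion cannot meaningfully shorten the decision loop.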
Managing all the requirements of a data science and analytics program takes work, especially as more departments within a company depend on the outputs of machine learning and AI. If companies can take a more analytical approach to defining how “real time” looks to them, or when each team or department actually needs the data, they can meet business goals while minimising costs – all while proving the value of data investments to decision-makers.
Optimise data latency without breaking the bank
Ultimately, there is no “one-size-fits-all” approach to making the most of your data. Engineering skills, analysts, computing power, and storage are all resources that should be used judiciously and effectively. For once, time is the one resource organisations have readily available.
Restructuring an enterprise’s data infrastructure around a real-time streaming benchmark is complex and requires multiple transformations across the entire pipeline. The expense of such a system may not be feasible for all organisations, especially when their workflows are built around routine updates.
If a company looks at what is most valuable in its data, which data needs immediate action and which is more valuable in the aggregate, and how often it acts on that data, it can maximise the scarce resources of people and systems and save time through alignment with the business.