High Rise: How data infrastructure is the key to successful AI/ML projects


By Viswanath C, Sr. Director of Engineering, Fivetran

Emerging technologies like artificial intelligence (AI) and machine learning (ML) are all the rage today, and with good reason. They offer businesses the potential to scale new heights. A report by McKinsey shows that the share of Indian organizations that have adopted AI rose to 65 percent between 2020 and 2021.

However, many organizations have struggled to realize the full potential of AI/ML. Many stumble by focusing on assembling large teams of data scientists at the expense of creating a foundation that provides reliable information across systems and enables easy accessibility for analysis. 

Without the latter, businesses often find their data scientists spending most of their time categorizing, validating, and preparing data instead of working on it to deliver insights. Thus, an effective AI/ML program needs a strong foundation in the form of high-quality data infrastructure.

The real cost of outdated data integration processes

According to our survey with Wakefield Research, businesses worldwide spend an average of USD 520,000 annually for a data engineering team with a median size of 12 members. Furthermore, these teams spend 44 percent of their time building and maintaining data pipelines. Despite the large consumption of resources, 71 percent of the survey respondents said their data pipelines were unreliable and led to the use of outdated or error-prone data in decision-making.
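To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. It assumes that team cost is spread evenly across working time, which the survey does not state, so the result is an illustration rather than a reported number.

```python
# Figures quoted from the survey cited above.
ANNUAL_TEAM_COST_USD = 520_000   # average annual spend on the data engineering team
PIPELINE_TIME_PERCENT = 44       # share of time spent building and maintaining pipelines

# Assumption (not from the survey): cost is spread evenly across working time.
pipeline_cost_usd = ANNUAL_TEAM_COST_USD * PIPELINE_TIME_PERCENT // 100
remaining_cost_usd = ANNUAL_TEAM_COST_USD - pipeline_cost_usd

print(f"Implied annual pipeline cost: USD {pipeline_cost_usd:,}")
print(f"Left for other engineering work: USD {remaining_cost_usd:,}")
```

Under that assumption, roughly USD 228,800 of the annual spend goes to pipeline work alone.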

Automating data integration processes gives companies a one-two punch of cutting costs and maximizing efficiency in their data science programs, while at the same time freeing up data engineers to bring more value to the business.

Limited data leads to limited results

For ML to deliver on its potential, data sets must be complete and properly formatted. If any gaps exist in the data or if manual labeling is required as part of supervised ML, data science teams are usually the ones assigned to this time-intensive task at the expense of what they are meant to do. Worse, there is the risk that trimmed-down training data will lead to the ML model only telling us what we already know.

A better way forward involves building an expansive central repository of information drawn from a variety of sources. This will aid in producing reliable results, leading to greater returns from the ML models. Ultimately, access to comprehensive data sets gives organizations accurate insights as well as confidence in their decision-making.

Breaking down silos means better data sources

Finding the balance between the volume and value of data when making predictions is one of the most critical challenges for data science programs. For example, a social media company with billions of interactions per day can use relatively low-value actions, such as selecting a reaction to content or viewing a video, to make reliable predictions.

But if it is trying to determine which customers will renew their contracts at year’s end, the company will require more specific information and end up working with smaller data sets. This can seriously degrade the quality of the derived insights. In addition, since it would take months to see whether its decisions were correct, the impact of the data science program is severely reduced.

To avoid these scenarios, enterprises have to break down data silos across their organizations. By increasing the number of data sources and types of data being utilized, the company in the above example gains more clues, so to speak, on whether a customer will renew their contract.

Trust the process

To make sure that an AI/ML program is successful, companies should take the following steps: First, they need to clarify the program’s contribution towards their organization, as well as their infrastructure availability. 

Next, they need to collect all their data into one source like a cloud service or data lake so that any discrepancies can be found easily. Afterwards, businesses must follow the proper sequence of operations for developing a data science program: data analytics and business intelligence, followed by data engineering, and then data science. 
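Once records from different systems sit side by side, spotting discrepancies can be as simple as comparing them key by key. The following is a minimal sketch, not a definitive implementation: the `crm` and `billing` sources, the customer IDs, and the `find_discrepancies` helper are all hypothetical and stand in for whatever systems feed the central store.

```python
def find_discrepancies(source_a, source_b):
    """Compare two {customer_id: {field: value}} mappings and list mismatches."""
    issues = []
    for customer_id in sorted(set(source_a) | set(source_b)):
        record_a = source_a.get(customer_id)
        record_b = source_b.get(customer_id)
        # A record present in only one source is itself a discrepancy.
        if record_a is None or record_b is None:
            missing_from = "source_a" if record_a is None else "source_b"
            issues.append((customer_id, "missing record", missing_from))
            continue
        # Compare only the fields that both sources share.
        for field in sorted(set(record_a) & set(record_b)):
            if record_a[field] != record_b[field]:
                issues.append((customer_id, field, (record_a[field], record_b[field])))
    return issues


# Hypothetical sample data standing in for two source systems.
crm = {
    "c001": {"email": "a@example.com", "plan": "pro"},
    "c002": {"email": "b@example.com", "plan": "basic"},
}
billing = {
    "c001": {"email": "a@example.com", "plan": "basic"},
    "c003": {"email": "c@example.com", "plan": "pro"},
}

issues = find_discrepancies(crm, billing)
for issue in issues:
    print(issue)
```

In this toy data, the check flags a conflicting plan for one customer and two customers that appear in only one system. These are the kinds of gaps that stay hidden while the data lives in separate silos.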

Last but not least, care must be taken to avoid neglecting the “housekeeping” work needed to maintain the foundation of the data science program. This consists of tasks such as cataloging, ensuring data hygiene, identifying the metrics that will make an actual impact on customer experience, and maintenance of data connections between systems.

Set up your data science program for success

Companies can gain valuable insights into their business by prioritizing the right infrastructure for data science. Not only does this approach yield significant ROI, but it also lays the foundation for their data scientists’ long-term success. While infrastructure tasks are not as flashy as other work, they are the bedrock on which the most successful data science programs are built.

