Davit Buniatyan is the Founding CEO at Activeloop, the company behind the fastest-growing dataset format specifically designed for AI. 

Almost every decade brings a new epoch in machine learning. The past 10 years have been about improving machine learning models and making them accessible to everyday data scientists and developers.

Today, deep learning frameworks like PyTorch and TensorFlow have made cutting-edge advances in model architecture accessible to the tech community. The marginal returns on model optimization have diminished significantly. As Andrew Ng has pointed out, our focus now must shift to the data on which we train our models. Let’s see how we can enable data-centricity in AI for the next decade.

Data-centricity involves many distinct processes: data augmentation, labeling, cleaning, pre-processing and more. All of these processes depend on data scientists being able to take the first step: creating a dataset. This is something data scientists struggle with even today, in organizations ranging from the smallest startups to the largest enterprises.

Industry experts estimate that 31% of ML projects die because of the lack of access to production-ready data. For more than 68% of companies, it takes weeks (in some cases months) to deploy machine learning models to production. This is happening because the process of creating production-ready datasets is inefficient.

Data inefficiency means a lot of tedious, manual work for data science teams and a lot of decision-making overhead about which tools to use. On top of that, creating large datasets from unstructured data such as images, video, text and audio requires a high level of expertise from data science teams that are normally used to working with tabular data. In a world with an ever-increasing amount of streaming data, that is a massive challenge in itself.

More importantly, after a team goes through the exhausting process of creating a dataset for one task, that process can’t easily be reproduced. If the team wants to generate a dataset for another task, it has to start all over again!

Currently, there’s no established framework for making datasets accessible to machine learning models. Companies waste millions of dollars on infrastructure, and their data science teams work inefficiently as a result.

Why is it so hard to create a dataset? Let me explain with an example. 

Imagine that you meet an alien (yes, a real alien) who speaks their own language. If you learn how to communicate with that alien, despite your mutual language barriers, that would speed up the technological advancement of humanity by a factor of 20. Wouldn’t that be great? Only one problem: How would you translate your ideas into a language the alien could understand? How would you translate the alien’s language into a language that you understand? You would need a universal “translator” that goes back and forth between your language and that of the alien.

The same problem is experienced by data scientists in any field who deal with large amounts of unstructured data. That data comes in different types and formats (different “languages”), and all of it needs to be “translated” into a unified “language” that machine learning models (the aliens) can understand. The language of ML models is tensors: mathematical representations of your data.

It takes a Ph.D. level of knowledge, weeks of tedious manual work and hundreds of lines of code to turn unstructured data into these mathematical representations and feed them to ML models.
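To make that concrete, here is a minimal sketch in Python of what even the simplest version of this “translation” looks like for a single data type, using only PIL and NumPy. The folder path, file extension and 224x224 target size are illustrative assumptions, not a prescribed pipeline.

```python
# A minimal sketch of the manual "translation" step: turning a folder of
# images (unstructured data) into a single tensor. The path, the .jpg
# extension and the 224x224 size are illustrative assumptions.
from pathlib import Path

import numpy as np
from PIL import Image


def images_to_tensor(folder, size=(224, 224)):
    """Load every .jpg in `folder`, resize it and stack the results into one tensor."""
    arrays = []
    for path in sorted(Path(folder).glob("*.jpg")):
        img = Image.open(path).convert("RGB").resize(size)
        arrays.append(np.asarray(img, dtype=np.float32) / 255.0)  # scale pixels to [0, 1]
    return np.stack(arrays)  # shape: (num_images, height, width, 3)


# batch = images_to_tensor("raw_data/images")  # hypothetical local folder of photos
```

Even this toy version ignores labels, corrupt files, datasets that don’t fit in memory, sharding and versioning, and that is where the weeks of work and the hundreds of lines of code actually go.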

This is a waste of time for your best people. It is costly and ineffective and blocks the machine learning development process of the entire organization.

How can we address this situation? In my view, we should establish a unified framework that creates and manages datasets natively for ML models and is simple enough that any data scientist or developer, with or without a Ph.D., can create a dataset in two lines of code. This is the only viable step toward enabling data-centric AI.
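To give a sense of the target ergonomics, consider one narrow, well-trodden case where two-line dataset creation already exists: a folder of class-labeled images in PyTorch. The `data/train` path below is an assumption for illustration; the argument is that a unified framework should make any unstructured data type, not just this special case, this simple.

```python
# One narrow case where two-line dataset creation already works: a folder of
# class-labeled images in PyTorch. The "data/train" path is an illustrative
# assumption; the goal is to extend this simplicity to all unstructured data.
from torchvision import datasets, transforms

ds = datasets.ImageFolder("data/train", transform=transforms.ToTensor())
```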

If enterprises adopted a universal framework, they could significantly cut infrastructure and storage costs (by at least 30% for the companies we surveyed). Most importantly, they would reduce weeks of tedious work by their best people to just a few hours of reproducible work, resulting in faster delivery of machine learning products and features, and better business outcomes overall.

ML experts think along the same lines. For example, in our recent panel discussion at CVPR (the largest conference in computer vision), Siddhartha Sen, a researcher at Microsoft Research, said, “I would like to see democratization of dataset management, so that researchers and developers don’t have to worry about building infrastructure to handle it.”

I’m glad that machine learning luminaries like Andrew Ng are bringing attention to the importance of systematic work on data. In the next few years, we will witness a new paradigm of data-centric machine learning infrastructure. Where the 2010s were about improving models, the 2020s are going to be all about data.

