MAX, or the Model Asset eXchange, is an online open-source repository of trainable and deployable AI models. You don’t necessarily have to be an AI expert to use it (there’s even a tutorial that walks you through developing an AI that can write image captions) but some of the available models will probably appeal only to enterprise developers.

CODAIT also launched the Data Asset eXchange (DAX). Where MAX hosts full AI models, DAX contains datasets that can be used to train your own. Open-source training datasets for AI aren’t exactly rare, but well-curated ones are, so TNW reached out to Fred Reiss, the Chief Architect at CODAIT, to find out what’s so special about DAX.

What does IBM mean when it says the datasets will be “carefully curated”? Are they checked for bias or accuracy? What other kinds of datasets are planned for DAX?

Here at IBM’s CODAIT lab, we spend a good part of our time contributing to the open-source software that underlies today’s AI systems: projects like Kubeflow, TensorFlow, PyTorch, Apache Spark, and Jupyter notebooks. One of the main functions of our organization is to help ensure that the code governance and quality of these open-source AI software components are up to IBM’s standards. We wanted to bring the same level of quality to the open-source data that you run through this open-source software.

So we’re following a much more controlled approach with DAX, compared with other repositories of datasets you might find online. We experienced frustration with that lack of vetting firsthand while training models for the Model Asset eXchange, DAX’s sister site on developer.ibm.com with state-of-the-art deep learning models. For example, we had to expend a great deal of effort to obtain a usable dataset to train our Named Entity Tagger model.

Every dataset in DAX is shepherded by a member of our team and reviewed by multiple other people within IBM. We start by collecting detailed information about the origins of the dataset and what kinds of problems it would be a good fit for. When possible, we reach out to the original creator of the data. We collect detailed metadata about where the data comes from, familiarize ourselves with the research papers behind the datasets, and even look at the actual data items themselves to check for potential legal and data quality issues. Every dataset then goes through IBM’s own internal legal review process. Only then does a dataset go “live” on the site.

And we don’t stop with just posting this vetted data. There are additional steps we plan to take after datasets go up on DAX to create parallel content, and you should start seeing the results of these efforts soon. We’re creating Jupyter notebooks that show how to read and analyze the contents of each dataset, either on your own laptop or on the IBM Cloud. And we’re writing ready-made scripts for training deep learning models on the data. Users will be able to try these scripts for free on IBM Watson Machine Learning, taking advantage of our GPU-accelerated cloud.
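To give a concrete picture of what a “read and analyze” notebook like that might contain, here is a minimal sketch of downloading and inspecting a dataset archive with pandas. The archive URL and the CSV file name are hypothetical placeholders, not real DAX locations; each DAX dataset page documents its own download link and file layout.

```python
# Minimal sketch of the "read and analyze" step described above.
# ARCHIVE_URL and the CSV file name are hypothetical placeholders;
# each DAX dataset page documents its own archive location and contents.
import tarfile
import urllib.request

import pandas as pd

ARCHIVE_URL = "https://example.com/some-dax-dataset.tar.gz"  # placeholder URL
ARCHIVE_PATH = "dataset.tar.gz"

# Download and unpack the dataset archive.
urllib.request.urlretrieve(ARCHIVE_URL, ARCHIVE_PATH)
with tarfile.open(ARCHIVE_PATH, "r:gz") as tar:
    tar.extractall("data")

# Load one file from the archive (assuming CSV here) and take a first look.
df = pd.read_csv("data/example.csv")  # hypothetical file name
print(df.shape)       # rows x columns
print(df.dtypes)      # column types
print(df.describe())  # summary statistics for numeric columns
```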
The current offerings on DAX are pretty eclectic; the double pendulum videos dataset in particular stands out. What do you see developers using that for? Will developers be able to upload datasets to DAX?

Some of the datasets on DAX are for advancing core science, while others have more immediate business applications. The double pendulum dataset is more in the former category, and it has a number of interesting scientific uses.

The proposed challenge from the researchers who produced the dataset is a time series prediction task: create a model that predicts the future state of the chaotic pendulum system (a minimal sketch of this kind of next-state prediction appears at the end of this article). Predicting chaotic systems is a useful task for validating new kinds of models for numeric time series prediction and natural language analysis (natural language text being a sequence of words).

You could also use the video as a sanity check for deep pose estimation algorithms. The physical configuration of the pendulum is designed such that the parts of the pendulum can be localized with subpixel accuracy without using machine learning. A generic machine learning algorithm that doesn’t have that domain knowledge should still be able to approach the same level of precision.

As for uploads, our current focus is on enabling consumption by developers worldwide. Having this collection of vetted datasets opens up some exciting possibilities for other related parts of developer.ibm.com. Now we can add new Code Patterns that show how to use these datasets to cover end-to-end use cases. For example, the Financial Proposition Bank dataset has some really cool applications for analyzing public companies’ quarterly reports. Also, we can use DAX datasets as a starting point for developers to train customized versions of our Model Asset eXchange models by mixing the DAX data with a little bit of their own local data.

For more information on IBM’s DAX, read the company’s blog post here and check out the datasets here. You can view the models available on MAX here.
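For the curious, here is the promised sketch of the double pendulum prediction task: given a window of past pendulum states, predict the next one. It assumes the video has already been reduced to one numeric state vector per frame (for example, marker coordinates; the dimensions and window size below are assumptions), and it substitutes random placeholder data for the real dataset. The small PyTorch LSTM is just one reasonable model choice, not the researchers’ proposed method.

```python
# Sketch of next-state prediction for a chaotic system, under the
# assumptions stated above. Random tensors stand in for real pendulum data.
import torch
import torch.nn as nn

STATE_DIM = 6   # e.g., x/y coordinates of three tracked markers (assumption)
WINDOW = 32     # number of past frames fed to the model (assumption)
frames = torch.randn(5000, STATE_DIM)  # placeholder for real per-frame states

# Build (window of past states -> next state) training pairs.
inputs = torch.stack([frames[i : i + WINDOW] for i in range(len(frames) - WINDOW)])
targets = frames[WINDOW:]

class NextStateLSTM(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, state_dim)

    def forward(self, x):
        out, _ = self.lstm(x)         # (batch, window, hidden)
        return self.head(out[:, -1])  # predict the state right after the window

model = NextStateLSTM(STATE_DIM)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Tiny training loop on one small batch, just to show the shape of the task.
for step in range(100):
    pred = model(inputs[:256])
    loss = loss_fn(pred, targets[:256])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because a chaotic system amplifies small errors, even a well-trained model’s predictions will diverge from the true trajectory after a short horizon, which is exactly what makes the dataset a demanding benchmark for time series models.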