A short introduction to Data Driven Discovery (D3M)

Mitar Milutinovic

Data Driven Discovery is a large DARPA project involving many organizations

In this talk I will present some of the technologies developed in the context of that program and how they are used

  • A general pipeline language specification using inputs, outputs, and abstract steps
  • Specification for primitives (a type of a step): methods, state, hyper-parameters
  • Hyper-parameter space definition
  • Standardized value types passed between steps: data + metadata
  • Metadata specification for every value passed between steps
  • Reference runtime
  • Specification for a record of a pipeline run
  • Metalearning database storing pipeline runs, pipelines, inputs and outputs

Given data and a problem definition, an AutoML system automatically creates a reasonable pipeline to preprocess this data and build a machine learning model solving the problem

Terminology: a pipeline is a DAG of steps leading from input data all the way to the outputs (e.g., predictions) of the model

Terminology: a primitive is a reusable implementation of a step
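
To make the two terms concrete, here is a minimal sketch of a pipeline as a DAG of steps executed in dependency order. The primitives, the step names, and the `run_pipeline` helper are hypothetical illustrations, not the D3M API.

```python
from graphlib import TopologicalSorter

# Hypothetical primitives: plain functions standing in for reusable steps.
def impute(rows):
    # Replace missing values with 0.0.
    return [[0.0 if v is None else v for v in row] for row in rows]

def scale(rows):
    # Divide every value by the largest absolute value in the data.
    largest = max(abs(v) for row in rows for v in row) or 1.0
    return [[v / largest for v in row] for row in rows]

def predict(rows):
    # Toy "model": predict 1 when the row sum is positive, else 0.
    return [1 if sum(row) > 0 else 0 for row in rows]

# Each step names the primitive it runs and where it reads its data from.
steps = {
    "impute": (impute, "input"),
    "scale": (scale, "impute"),
    "predict": (predict, "scale"),
}

def run_pipeline(steps, data):
    values = {"input": data}
    # Map every step to the set of nodes it depends on.
    graph = {name: {source} for name, (_, source) in steps.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in steps:  # "input" is a node in the graph, but not a step
            primitive, source = steps[name]
            values[name] = primitive(values[source])
    return values

outputs = run_pipeline(steps, [[1.0, None], [-4.0, 2.0]])
print(outputs["predict"])  # [1, 0]
```

This example happens to be a chain, but the same executor works for any DAG in which each step reads from one earlier node.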

Reasonable pipeline

How good is the pipeline/model?

Generally we compute a score for the pipeline using one of the many available metrics (the metric is part of the problem definition)

Best score is not always what you want

  • Complexity
  • Interpretability
  • Generalizability
  • Resources & data requirements

How good is an AutoML system?

  • Resources needed
  • How quickly it creates a reasonable pipeline
  • How far is this pipeline from the best pipeline
  • How clean or structured the input data has to be
  • Which problem types it supports
  • How well it utilizes the available primitives

Comparison of AutoML systems is hard

  • They use different sets of primitives for their pipelines
  • They use different datasets for evaluation

AutoML can be seen as a search problem over the space of possible pipelines, optimizing for the best score
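
As a rough illustration of the search view, the sketch below runs random search over a tiny, made-up pipeline space and compares it with exhaustive enumeration. The space, the option names, and the `score` function are assumptions for illustration; a real system would fit each candidate pipeline and compute the metric from the problem definition.

```python
import itertools
import random

# Hypothetical search space: one choice per pipeline slot.
SPACE = {
    "imputer": ["mean", "median"],
    "scaler": ["none", "standard"],
    "model": ["tree", "linear", "knn"],
}

def score(pipeline):
    # Stand-in for training and evaluating a pipeline (made-up numbers).
    table = {"mean": 0.1, "median": 0.2, "none": 0.0, "standard": 0.3,
             "tree": 0.3, "linear": 0.2, "knn": 0.1}
    return sum(table[choice] for choice in pipeline.values())

def random_search(space, budget, seed=0):
    # Sample `budget` candidate pipelines and keep the best-scoring one.
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = {slot: rng.choice(options) for slot, options in space.items()}
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def exhaustive_search(space):
    # Feasible only because this toy space has 2 * 2 * 3 = 12 pipelines.
    slots = list(space)
    candidates = (dict(zip(slots, combo))
                  for combo in itertools.product(*space.values()))
    return max(candidates, key=score)

found, found_score = random_search(SPACE, budget=5)
optimum = exhaustive_search(SPACE)
print(found, found_score)
print(optimum, score(optimum))
```

With a small budget, random search may return a pipeline that scores below the optimum, which is the gap that "how far is this pipeline from the best pipeline" refers to.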

One big family of approaches to AutoML is centered around metalearning

Metalearning approaches AutoML itself as an ML problem

What would data for such a problem look like?

A regular tabular dataset

sepal length | sepal width | petal length | petal width | species

A metalearning dataset

dataset | problem description | pipeline | score

Problem: how to represent data in each of those columns in a general way

At D3M we designed a pipeline language

  • The pipeline language utilizes an open ecosystem of primitives
  • Addresses comparison between AutoML systems
  • Addresses representation of pipelines for metalearning

Existing languages are designed for humans

  • Calling conventions & special cases
  • Implicit or ad-hoc metadata (less typing), e.g., variable names, context
  • Repeated slicing and merging of data to select data

Documented in a human language

Our language:

  • A data-flow programming language
  • JSON-compatible structure, serialized as JSON or YAML
  • Extendable execution semantics
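
A minimal sketch of what a JSON-compatible data-flow description could look like, written as a Python dictionary and round-tripped through JSON. The field names, the primitive names, and the reference format (`inputs.0`, `steps.0.produce`) are illustrative assumptions, not the exact D3M pipeline schema.

```python
import json

# Hypothetical pipeline description: steps are connected by data references.
pipeline = {
    "inputs": [{"name": "dataset"}],
    "steps": [
        {"primitive": "example.impute",
         "arguments": {"inputs": "inputs.0"},
         "outputs": ["produce"]},
        {"primitive": "example.classifier",
         "arguments": {"inputs": "steps.0.produce"},
         "outputs": ["produce"]},
    ],
    "outputs": [{"name": "predictions", "data": "steps.1.produce"}],
}

def check_references(pipeline):
    # Every reference must point at a pipeline input or at an output of an
    # earlier step, which also guarantees that the description is a DAG.
    available = {f"inputs.{i}" for i in range(len(pipeline["inputs"]))}
    for index, step in enumerate(pipeline["steps"]):
        for reference in step["arguments"].values():
            if reference not in available:
                raise ValueError(f"step {index}: unknown reference {reference!r}")
        available.update(f"steps.{index}.{name}" for name in step["outputs"])
    for output in pipeline["outputs"]:
        if output["data"] not in available:
            raise ValueError(f"unknown output reference {output['data']!r}")
    return True

serialized = json.dumps(pipeline, indent=2)
restored = json.loads(serialized)
print(restored == pipeline, check_references(restored))
```

Because the structure contains only dictionaries, lists, and strings, the same description could be serialized as YAML without changes.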

Our language:

  • Interoperability between building blocks from different existing libraries through wrapping
  • This is a labor-intensive process which requires human input to translate from a human-centric to a computer-centric interface, i.e., to describe all arguments and hyper-parameters

Our language:

  • Currently the primitives library consists of around 280 primitives wrapping sklearn, Keras, Pandas, and many custom implementations

Our language:

  • Fixed data types allowed between primitives
  • A special data type for input datasets

"Standard" pipeline

Linear pipeline

Example pipeline

Example pipeline (2)

Execution semantics are independent of the pipeline

Basic execution in two phases:

  • Fit (train on given attributes and known targets)
  • Produce (compute predictions given attributes)
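
The two phases can be sketched with a toy primitive. The class below is hypothetical and only mirrors the fit/produce split named above; it is not the actual primitive interface.

```python
from collections import Counter

class MajorityClassPrimitive:
    """Toy primitive that always predicts the most frequent training target."""

    def __init__(self):
        self._majority = None  # state learned during fit

    def fit(self, attributes, targets):
        # Train on given attributes and known targets.
        # This toy model ignores the attributes and only counts targets.
        self._majority = Counter(targets).most_common(1)[0][0]

    def produce(self, attributes):
        # Compute one prediction per row of attributes.
        if self._majority is None:
            raise RuntimeError("produce called before fit")
        return [self._majority for _ in attributes]

primitive = MajorityClassPrimitive()
primitive.fit(attributes=[[5.1], [4.9], [6.3]],
              targets=["setosa", "setosa", "virginica"])
predictions = primitive.produce(attributes=[[5.0], [6.5]])
print(predictions)  # ['setosa', 'setosa']
```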

Primitive interface

Many mixins available for primitives to expose additional capabilities

  • Sampling compositionality
  • Probabilistic compositionality
  • Gradient compositionality


  • End-to-end optimization
  • Batching
  • Neural network representation

Every primitive defines hyper-parameters controlling the behavior of the primitive

Primitives can be passed as a hyper-parameter as well
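
A minimal sketch of a hyper-parameter space definition with defaults, sampling, and validation, including one hyper-parameter whose value is itself a primitive. The classes `UniformInt` and `Choice`, the hyper-parameter names, and the two imputer functions are assumptions made for this example.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class UniformInt:
    lower: int
    upper: int  # inclusive bound
    default: int

    def sample(self, rng):
        return rng.randint(self.lower, self.upper)

    def validate(self, value):
        return isinstance(value, int) and self.lower <= value <= self.upper

@dataclass(frozen=True)
class Choice:
    options: tuple
    default: object

    def sample(self, rng):
        return rng.choice(self.options)

    def validate(self, value):
        return value in self.options

# Hypothetical primitives that can be passed as hyper-parameter values.
def mean_imputer(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def zero_imputer(values):
    return [0.0 if v is None else v for v in values]

HYPERPARAMS = {
    "n_estimators": UniformInt(1, 100, default=10),
    "criterion": Choice(("gini", "entropy"), default="gini"),
    "imputer": Choice((mean_imputer, zero_imputer), default=mean_imputer),
}

def defaults(space):
    return {name: hp.default for name, hp in space.items()}

def sample(space, rng):
    return {name: hp.sample(rng) for name, hp in space.items()}

def validate(space, values):
    return all(space[name].validate(value) for name, value in values.items())

configuration = sample(HYPERPARAMS, random.Random(0))
print(validate(HYPERPARAMS, configuration))
print(configuration["imputer"]([1.0, None, 3.0]))
```

Declaring the space explicitly is what lets a search procedure enumerate or sample configurations without knowing anything else about the primitive.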

Data flowing through the pipeline has metadata attached

  • Primitives can use and modify metadata together with data
  • Metadata is in a JSON-compatible standardized structure
  • Metadata can be attached with selectors to any part of data

Metadata selectors
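
One way to picture selectors is as paths into the data, with a wildcard for "all elements" at a level. The sketch below is an assumption-laden illustration: the `Metadata` class, the `ALL` wildcard, and the metadata keys are hypothetical, and the merge rule (more specific selectors override more general ones) is a design choice made for this example.

```python
ALL = "__ALL__"  # hypothetical wildcard matching any index at one level

class Metadata:
    def __init__(self):
        self._entries = {}

    def update(self, selector, values):
        # Attach metadata to the part of the data addressed by the selector.
        self._entries.setdefault(tuple(selector), {}).update(values)

    @staticmethod
    def _matches(stored, selector):
        return len(stored) == len(selector) and all(
            part == ALL or part == wanted
            for part, wanted in zip(stored, selector)
        )

    def query(self, selector):
        selector = tuple(selector)
        matches = [s for s in self._entries if self._matches(s, selector)]
        # Apply general (wildcard) entries first so specific entries win.
        matches.sort(key=lambda s: sum(part != ALL for part in s))
        result = {}
        for stored in matches:
            result.update(self._entries[stored])
        return result

metadata = Metadata()
metadata.update((), {"structural_type": "table"})     # the whole value
metadata.update((ALL, ALL), {"unit": "cm"})           # every cell
metadata.update((ALL, 0), {"name": "sepal length"})   # every row, column 0
metadata.update((2, 0), {"unit": "mm"})               # one specific cell

print(metadata.query(()))      # {'structural_type': 'table'}
print(metadata.query((5, 0)))  # {'unit': 'cm', 'name': 'sepal length'}
print(metadata.query((2, 0)))  # {'unit': 'mm', 'name': 'sepal length'}
```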

Semantic types

  • Constants (URIs) describing the meaning of data
  • Part of metadata (attached to any data)

Example semantic types

  • .../Attribute
  • .../PrimaryKey
  • .../PredictedTarget
  • .../TrueTarget
  • .../Time
  • .../FileName

Semantic types enable linear pipelines

  • We use semantic types to select data to operate on instead of slicing
  • E.g., a supervised learner primitive can determine which columns are attributes and which targets automatically
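
A minimal sketch of selecting columns by semantic type instead of by position. The column metadata layout and the helper functions are hypothetical; the semantic type strings are kept in the elided form used on the previous slide.

```python
ATTRIBUTE = ".../Attribute"
TRUE_TARGET = ".../TrueTarget"
PRIMARY_KEY = ".../PrimaryKey"

# Hypothetical per-column metadata for a small table.
columns = [
    {"name": "id", "semantic_types": [PRIMARY_KEY]},
    {"name": "sepal length", "semantic_types": [ATTRIBUTE]},
    {"name": "sepal width", "semantic_types": [ATTRIBUTE]},
    {"name": "species", "semantic_types": [TRUE_TARGET]},
]
rows = [
    [0, 5.1, 3.5, "setosa"],
    [1, 6.3, 3.3, "virginica"],
]

def columns_with(columns, semantic_type):
    # Indices of all columns tagged with the given semantic type.
    return [index for index, column in enumerate(columns)
            if semantic_type in column["semantic_types"]]

def take(rows, indices):
    # Keep only the selected columns of every row.
    return [[row[index] for index in indices] for row in rows]

attribute_indices = columns_with(columns, ATTRIBUTE)
target_indices = columns_with(columns, TRUE_TARGET)
attributes = take(rows, attribute_indices)
targets = take(rows, target_indices)
print(attributes)  # [[5.1, 3.5], [6.3, 3.3]]
print(targets)     # [['setosa'], ['virginica']]
```

A learner written this way needs no arguments saying which columns to use, so the pipeline can pass the whole table from step to step and stay linear.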

The pipeline language is now used by 10 research groups

Those research groups are building AutoML systems

Language trade-offs

  • Interoperability with existing tools
  • Performance
  • Expressiveness & generality
  • Uniform API

Pick two

Interoperability with existing tools

  • Written in Python
  • Using Pandas DataFrames with nested objects

Runtime performance

We allow lower runtime performance (train + test time) for individual pipelines

Lower runtime performance does make pipeline search explore less of the space

Expressiveness & generality

  • Wrapping existing tools
  • Organizing data structure to have similar properties across problem types
  • Use of semantic types

Uniform API

API with focus on automatic introspection and uniform calling conventions