August 3, 2022

Towards AI-ready data for computational science

Considering FAIR data principles and the progression towards data-driven science, it is only logical to strive for "machine-actionable" and fully reproducible scientific records.

Introduction

For computational R&D, a machine-actionable record includes a machine-readable description of the research conditions and parameters that is sufficient to reproduce the published results exactly (or within a stated tolerance). Ideally, this record would be generated automatically during the research process, so that producing it becomes part of everyday work.

In the following, we examine how data associated with computational models and methods can be arranged to create a machine-readable record. We present three types of data structures together with their respective schemas, defined using JSON Schema and based on the definitions available here.

Example use case

A very common type of calculation in computational chemistry and materials science is the optimization of a molecular structure or unit cell with respect to the nuclear coordinates, as shown in the flowchart below. This type of calculation entails two problems that are solved iteratively: calculating the energy and minimizing the energy with respect to the nuclear coordinates. When density functional theory is used, the model gives rise to a set of non-linear equations, which must be solved iteratively until self-consistency is achieved (self-consistent field, SCF). Since we want to focus on method data here, the model specification is omitted for clarity.

Simplified flowchart for structural optimization. Parameters associated with a step are shown in orange boxes.

Although some essential parameters are given here, others, such as the SCF algorithm, are only implied. Without further context - for instance, the specification of the simulation software - we do not know which default parameters were assumed. Thus, for a data record to be truly reusable, it should contain the information required to reproduce the full configuration, not just the part that was set by the user. This could be achieved by specifying the simulation software precisely (version, commit, compiler, etc.) or by referencing a public external data record (e.g. a list of all default values for software X at version Y).
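As a hedged sketch of the latter idea, the record below extends the user-set parameters with a provenance section that identifies the software build and points to an external defaults record. All names and the URL are illustrative assumptions, not part of any standard:

```python
import json

# Hypothetical sketch: a method record extended with provenance metadata
# so that implied defaults can be recovered later. The software name,
# field names, and "defaultsRecord" URL are illustrative assumptions.
record = {
    "opt_algorithm": "bfgs",
    "opt_convergence": 1e-8,
    "provenance": {
        "software": "exampleDFTCode",   # assumed software name
        "version": "7.1",
        "commit": "a1b2c3d",
        "compiler": "gcc 12.2",
        "defaultsRecord": "https://example.org/defaults/exampleDFTCode/7.1",
    },
}

# The record survives a JSON round-trip unchanged, so it can be archived
# alongside the published results.
assert json.loads(json.dumps(record)) == record
```

The external defaults record lets the archived data stay compact while still making every implied parameter recoverable.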

All-in-one

A straightforward approach is to collect all method information in a single object. This arrangement often resembles the input-file structure of the simulation software packages. Name conflicts are typically resolved by prefixing the variable name with a descriptive name or abbreviation. The following data structure contains the parameters of the structural-optimization example:

{
    "opt_algorithm": "bfgs",
    "opt_convergence": 1e-8,
    "opt_maxiter": 65,
    "scf_convergence": 1e-5,
    "scf_maxiter": 55
}

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "all in one",
  "type": "object",
  "properties": {
    "opt_algorithm": {
      "description": "Optimization algorithm.",
      "type": "string",
      "enum": ["bfgs", "gdiis", "other"]
    },
    "opt_convergence": {
      "description": "Convergence criterion for optimization algorithm.",
      "type": "number",
      "exclusiveMinimum": 0,
      "default": 1e-7
    },
    "opt_maxiter": {
      "description": "Maximum number of iterations for optimization",
      "type": "integer",
      "exclusiveMinimum": 0,
      "default": 50
    },
    "scf_convergence": {
      "description": "Convergence criterion for the self-consistent field algorithm.",
      "type": "number",
      "exclusiveMinimum": 0,
      "default": 1e-6
    },
    "scf_maxiter": {
      "description": "Maximum number of SCF cycles",
      "type": "integer",
      "exclusiveMinimum": 0,
      "default": 50
    }
  }
}
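In practice, one would validate such a record against the schema with a full JSON Schema validator (e.g. the Python `jsonschema` package). The following standard-library sketch checks just the three keyword kinds the schema above uses - `enum`, `type`, and `exclusiveMinimum` - to illustrate what validation enforces:

```python
import json

# Minimal stdlib sketch of validating the "all-in-one" record against
# the constraints expressed in the schema above. A real implementation
# would use a JSON Schema validator such as the `jsonschema` package.
def check_all_in_one(data):
    errors = []
    if data.get("opt_algorithm") not in {"bfgs", "gdiis", "other"}:
        errors.append("opt_algorithm: not in enum")
    for key in ("opt_convergence", "scf_convergence"):
        value = data.get(key)
        if not isinstance(value, (int, float)) or value <= 0:
            errors.append(f"{key}: must be a number > 0")
    for key in ("opt_maxiter", "scf_maxiter"):
        value = data.get(key)
        if not isinstance(value, int) or value <= 0:
            errors.append(f"{key}: must be a positive integer")
    return errors

record = json.loads("""{
    "opt_algorithm": "bfgs",
    "opt_convergence": 1e-8,
    "opt_maxiter": 65,
    "scf_convergence": 1e-5,
    "scf_maxiter": 55
}""")
assert check_all_in_one(record) == []          # valid record passes
assert check_all_in_one({"opt_algorithm": "sd"}) != []  # invalid fails
```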

Divide & conquer

Here we identify conceptually related elements and isolate them into separate data structures. This division allows us to aggregate parameters that are meant to be stored together (for instance, optimization threshold, optimization algorithm, etc.). Since each extracted data structure is associated with a certain concept, we may also enrich the data structure with contextual information such as a classification type or hierarchy.

Using the example above, a first attempt to divide consists of isolating attributes related to the optimization on the one hand and attributes related to the SCF on the other. Consequently, the data structure for the optimization may look like the following:

{
    "tier1": "optimization",
    "tier2": "differentiable",
    "tier3": "mixed-order",
    "type": "bfgs",
    "gradientNorm": 1e-8,
    "maxIter": 65
}

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "divide & conquer: Optimization",
  "type": "object",
  "properties": {
    "tier1": {
      "description": "1st level of classification tree: Optimization algorithms",
      "type": "string",
      "const": "optimization"
    },
    "tier2": {
      "description": "2nd level of classification tree: optimization algorithms for differentiable functions",
      "type": "string",
      "const": "differentiable"
    },
    "tier3": {
      "description": "3rd level of classification tree: mixed-order algorithms (e.g. quasi-Newton methods)",
      "type": "string",
      "const": "mixed-order"
    },
    "type": {
      "description": "Broyden-Fletcher-Goldfarb-Shanno algorithm",
      "type": "string",
      "const": "bfgs"
    },
    "gradientNorm": {
      "description": "Threshold for gradient norm (convergence criterion)",
      "type": "number",
      "exclusiveMinimum": 0,
      "default": 1e-7
    },
    "maxIter": {
      "description": "Maximum number of iterations",
      "type": "integer",
      "exclusiveMinimum": 0,
      "default": 50
    }
  }
}
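One benefit of carrying `default` values in the schema is that a complete configuration can be materialized from a sparse user input, which directly supports the reproducibility goal discussed earlier. A minimal sketch, using a hand-written fragment that mirrors the optimization schema above:

```python
# Sketch: filling schema defaults into a record so the stored data
# captures the full configuration, not only the user-set values.
# `schema_properties` is a hand-copied fragment of the schema above.
schema_properties = {
    "type": {"const": "bfgs"},
    "gradientNorm": {"default": 1e-7},
    "maxIter": {"default": 50},
}

def apply_defaults(instance, properties):
    """Return a copy of `instance` with missing defaulted fields filled in."""
    filled = dict(instance)
    for name, spec in properties.items():
        if name not in filled and "default" in spec:
            filled[name] = spec["default"]
    return filled

user_input = {"type": "bfgs", "gradientNorm": 1e-8}
full = apply_defaults(user_input, schema_properties)
# The user-set threshold is kept; the omitted maxIter gets its default.
assert full == {"type": "bfgs", "gradientNorm": 1e-8, "maxIter": 50}
```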

From the "divide & conquer" strategy as applied here, it follows that a method is described not by a single data structure but by a collection of them. There are multiple ways to represent this collection, for instance as a simple list or as a directed acyclic graph.
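A directed-acyclic-graph representation could be sketched as follows, with an edge meaning "contains/depends on": the outer optimization loop depends on the inner SCF solver, as in the flowchart. The component names and edge semantics here are illustrative assumptions:

```python
# Sketch: a method as a DAG of component data structures. Edges point
# from an outer component to the inner components it depends on.
components = {
    "optimization": {"type": "bfgs", "gradientNorm": 1e-8, "maxIter": 65},
    "scf": {"type": "diis", "threshold": 1e-5, "maxIter": 55},
}
edges = {"optimization": ["scf"], "scf": []}

def topological_order(edges):
    """Kahn-style ordering of the components; raises on a cycle."""
    indegree = {node: 0 for node in edges}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1
    queue = [node for node, deg in indegree.items() if deg == 0]
    order = []
    while queue:
        node = queue.pop()
        order.append(node)
        for target in edges[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)
    if len(order) != len(edges):
        raise ValueError("cycle detected")
    return order

# Outer loop comes before the solver it invokes.
assert topological_order(edges) == ["optimization", "scf"]
```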

Adding Ontological Context

Context can also be added by mapping attributes to types and properties defined in an ontology. A suitable format for this is JSON-LD (JSON for Linked Data), which provides a @context attribute to semantically annotate the data (the National Information Exchange Model project page has an excellent tutorial on this).

A JSON-LD data structure for the above example could look like this:

{
  "@context": "https://example.mat3ra.com/definitions/",
  "method": [
    {
      "@type": "Optimization_BFGS",
      "name": "bfgs",
      "gradientNorm": 1e-8,
      "maxIter": 65
    },
    {
      "@type": "SCF_DIIS",
      "name": "diis",
      "threshold": 1e-5,
      "maxIter": 55,
      "diisMaxSubspace": 12
    }
  ]
}

The fictional ontology in this case defines a general method type from which Optimization_BFGS and SCF_DIIS are derived. The derived types, in turn, contain general attributes inherited from a parent type (e.g. name) as well as attributes specific to each type (e.g. diisMaxSubspace).
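To illustrate what the annotation buys us, the sketch below resolves each @type against the context to obtain a full ontology IRI. This treats the context as a simple vocabulary base IRI, which is an assumption for illustration; real JSON-LD expansion (e.g. via the `pyld` package) handles far more general contexts:

```python
# Sketch: resolving @type values against the @context to produce full
# ontology IRIs. Treating the context as a plain base-IRI prefix is a
# simplifying assumption; real JSON-LD processors are more general.
context = "https://example.mat3ra.com/definitions/"
method = [
    {"@type": "Optimization_BFGS", "name": "bfgs", "gradientNorm": 1e-8},
    {"@type": "SCF_DIIS", "name": "diis", "threshold": 1e-5},
]

def expand_types(items, base):
    """Map each item's @type to an absolute IRI under `base`."""
    return [base + item["@type"] for item in items]

iris = expand_types(method, context)
assert iris[0] == "https://example.mat3ra.com/definitions/Optimization_BFGS"
```

With absolute IRIs, each component can be looked up in the ontology to recover its parent types and attribute definitions.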

The advantage of this strategy is that it adds definitions of, and relations between, entities without substantially changing the data structure, which remains close to the first approach. On the other hand, designing and using an ontology is unfortunately not a trivial task.

Summary

Achieving reproducibility and AI-readiness through data standards is a complex task that is often overlooked. Here we outline some of our thoughts on how to approach the topic, in order to facilitate further discussion within the community.

You can read more about our efforts at:

and reach out to us at info@mat3ra.com to continue the conversation!

Further Reading