
In-memory dataset

1. Purpose

The in-memory dataset is a small embedded RDF store that keeps all data in memory and exposes it via SPARQL. It is intended as a temporary working graph inside workflows, not as large-scale or persistent storage.

Typical use cases:

  • Collecting intermediate results during a workflow run.
  • Storing small lookup graphs used by downstream operators.
  • Testing or prototyping workflows without configuring an external RDF store.

2. Behaviour and lifecycle

  • The dataset maintains a single in-memory RDF model.
  • All read and write operations go through a SPARQL endpoint over this model.
  • Data exists only in memory:
    • It is not persisted to disk by this dataset.
    • After an application restart, the dataset contents are empty again.

Within a workflow:

  • The dataset can be used as both input and output:
    • Upstream operators can write triples/entities/links into it.
    • Downstream operators can read from it via SPARQL-based mechanisms.

3. Reading data

  • When used as a source, the dataset exposes its data as a SPARQL endpoint.
  • Queries and retrievals behave as they would against a regular SPARQL dataset:
    • Entity retrieval, path/type discovery, sampling, etc. are executed via SPARQL.
  • There is no file backing this dataset; everything comes from what has been written into the in-memory model during the lifetime of the process.
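The read path can be pictured as a pattern match over an in-memory triple set. The following is a minimal sketch of that idea in plain Python; it assumes nothing about the product's internal API, and the wildcard `match` function only loosely stands in for a SPARQL basic graph pattern evaluated against the in-memory model.

```python
# Illustrative only: an in-memory set of triples, queried by pattern.
# In the real dataset, reads go through a SPARQL endpoint instead.

Triple = tuple  # (subject, predicate, object), all plain strings here

graph: set[Triple] = set()

def add(s, p, o):
    """Write one triple into the in-memory model."""
    graph.add((s, p, o))

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard,
    roughly like a variable in a one-triple SPARQL graph pattern."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

add("urn:alice", "rdf:type", "schema:Person")
add("urn:alice", "schema:name", "Alice")

print(match("urn:alice"))  # both triples written about urn:alice
```

Everything `match` can return was written into the model earlier in the same process; there is no file or external store to fall back on.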

4. Writing data

The in-memory dataset accepts RDF data through:

  • Entity sink

    • Entities written by upstream components are converted to RDF triples and stored in the in-memory model.
  • Link sink

    • Links are written as RDF triples in the same model.
  • Triple sink

    • Triples are directly added to the in-memory model via SPARQL operations.

All three sinks ultimately write into the same in-memory graph; there is no separate physical storage per sink type.
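The shared-graph point can be sketched as three sink functions feeding one triple set. The function names and the entity-flattening step below are illustrative assumptions, not the product's actual sink API:

```python
# Sketch: entity, link, and triple sinks all write into the same
# in-memory graph; there is no separate store per sink type.

graph: set[tuple[str, str, str]] = set()

def triple_sink(s, p, o):
    # Triples are added directly to the shared model.
    graph.add((s, p, o))

def entity_sink(uri, properties):
    # Hypothetical flattening: each property value becomes one triple.
    for prop, value in properties.items():
        graph.add((uri, prop, value))

def link_sink(source, link_type, target):
    # A link is stored as an ordinary triple in the same model.
    graph.add((source, link_type, target))

entity_sink("urn:a", {"schema:name": "A"})
link_sink("urn:a", "owl:sameAs", "urn:b")
triple_sink("urn:b", "schema:name", "B")

print(len(graph))  # all three writes landed in the same graph
```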

5. Configuration

Clear graph before workflow execution

  • Parameter: Clear graph before workflow execution (boolean)
  • Default: true

Behaviour:

  • If true:

    • Before the dataset is used in a workflow execution, the graph is cleared (for writes via this dataset).
    • The workflow sees a fresh, empty in-memory graph at the start of the run.
  • If false:

    • Existing data in the in-memory graph is preserved when the workflow starts.
    • New data is added on top of whatever is already stored in the model.

This parameter controls whether the dataset behaves as a fresh scratch graph per workflow run or as a longer-lived in-memory graph within the lifetime of the running application.
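The two modes can be sketched as a flag checked at the start of each run. The `run_workflow` wrapper below is a made-up stand-in for a workflow execution, used only to show the difference in graph state between the two settings:

```python
# Sketch of the clear-before-execution behaviour; names are
# illustrative, not product API.

graph: set[tuple[str, str, str]] = set()

def run_workflow(writes, clear_graph_before_execution):
    if clear_graph_before_execution:
        graph.clear()      # fresh scratch graph for this run
    graph.update(writes)   # new data added on top of what remains

run_workflow({("urn:x", "p", "1")}, clear_graph_before_execution=True)
run_workflow({("urn:x", "p", "2")}, clear_graph_before_execution=False)
print(len(graph))  # 2: the second run preserved the first run's triple

run_workflow({("urn:x", "p", "3")}, clear_graph_before_execution=True)
print(len(graph))  # 1: the graph was cleared before the third run
```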

6. Limitations and recommendations

  • Memory-bound

    • All data is kept in memory; large graphs will increase memory usage and may impact performance.
    • For large or production RDF graphs, use an external RDF store and a SPARQL dataset instead.
  • No persistence

    • Contents are lost when the application/server is restarted.
    • Do not treat this dataset as long-term storage.
  • Scope

    • Best suited for:
      • small to medium intermediate results,
      • testing and prototyping,
      • temporary data that can be regenerated by re-running workflows.

7. Example usage scenarios

  • Use as a temporary integration graph:

    • Multiple sources write into the in-memory dataset.
    • A downstream SPARQL-based operator queries the combined graph.
  • Use as a scratch area for experimentation:

    • Quickly test mapping or linking logic by writing output into the in-memory dataset.
    • Inspect the result via SPARQL without configuring an external endpoint.
  • Use as a small lookup store:

    • Preload a small set of reference triples (e.g. codes or mappings).
    • Let workflows query these during execution.
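The lookup-store scenario above can be sketched as preloading a few reference triples and resolving codes against them. All URIs and the `resolve` helper are hypothetical; in a real setup the reference triples would be written via a workflow sink and queried via SPARQL:

```python
# Sketch: a small in-memory lookup graph of country-code mappings.

lookup = set()

# Preload reference triples (e.g. codes mapped to labels).
for code, label in [("DE", "Germany"), ("FR", "France")]:
    lookup.add((f"urn:country:{code}", "skos:prefLabel", label))

def resolve(code):
    """Look up a code's label in the in-memory graph."""
    for s, p, o in lookup:
        if s == f"urn:country:{code}" and p == "skos:prefLabel":
            return o
    return None

print(resolve("DE"))  # Germany
```

Because the data lives only in memory, the preload step must be repeated after every application restart, which is acceptable for reference data that can be regenerated cheaply.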

Parameter

None

Advanced Parameter

Clear graph before workflow execution (deprecated)

This parameter is deprecated; use the ‘Clear dataset’ operator instead to clear a dataset in a workflow. If set to true, this dataset is cleared before it is used in a workflow execution.

  • ID: clearGraphBeforeExecution
  • Datatype: boolean
  • Default Value: false

Switching to another dataset type changes more than where the data is stored:

  • sparqlEndpoint — Data in the in-memory dataset does not persist beyond the running process. The SPARQL endpoint dataset connects to an external store that persists independently, so switching between them changes not just where the data lives but whether it survives execution at all.
  • file — Switching from the in-memory dataset to the RDF file dataset is not just adding persistence. The RDF file dataset loads the entire file into memory at read time and constrains output to N-Triples, neither of which the in-memory dataset does.
