DPU Overview

Data Processing Units

A data processing unit (DPU) is a unit of work performed on a data payload that is defined by a Concordia schema and identified by a schema ID and a schema version.
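As an illustration, a payload carrying those three pieces might look like the sketch below; the field names here are assumptions for illustration, not part of the specification.

```python
# A hypothetical DPU input payload. The field names below are illustrative
# assumptions, not part of the Open mHealth specification.
payload = {
    "schema_id": "omh:open-mhealth:blood-pressure",  # identifies the Concordia schema
    "schema_version": 1,                             # version of that schema
    "data": {
        "systolic": 120,
        "diastolic": 80,
    },
}
```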

DPUs have a less restrictive specification than DSUs (data storage units) because their potential scope is much broader and more variable.

DPUs must be stateless, anonymous processors of data: they cannot store third-party data, and they cannot support authentication.
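One way to read this constraint is that a DPU behaves like a pure function over a payload. The sketch below is a minimal illustration under that assumption, with hypothetical schema IDs; it is not a prescribed interface.

```python
def smooth_dpu(payload: dict) -> dict:
    """A hypothetical stateless DPU: a pure function of its input payload.
    It stores nothing and requires no authentication."""
    values = payload["data"]["values"]
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]  # up to 3 neighboring points
        smoothed.append(sum(window) / len(window))
    return {
        "schema_id": "omh:example:smoothed-series",  # hypothetical output schema ID
        "schema_version": 1,
        "data": {"values": smoothed},
    }
```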

A DPU may be open or closed source, and it may be implemented either as a library or as a RESTful web service.

Closed Source

While it is preferred that developers who adopt Open mHealth open source their software, this is not a practical requirement in all cases.

If you have a closed source DPU that you would like to make available for community use, Open mHealth asks that you expose your algorithm as a web service, register your DPU at registry.openmhealth.org (forthcoming), and implement a registry call that advertises the Open mHealth schema IDs your DPU consumes and produces.
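The registry call has not yet been specified; as a sketch, it might return a metadata document like the one below, where the endpoint path and field names are assumptions.

```python
# Hypothetical response body for a registry/metadata call such as
# GET /dpu/metadata -- the path and all field names are assumptions.
dpu_metadata = {
    "name": "blood-pressure-classifier",
    "consumes": [
        {"schema_id": "omh:open-mhealth:blood-pressure", "schema_version": 1},
    ],
    "produces": [
        {"schema_id": "omh:example:blood-pressure-category", "schema_version": 1},
    ],
}
```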

Open Source

Great! You have options. It may not be practical for some DPUs to be invoked over the network, so you may make your DPU available as a library. However, the general Open mHealth approach distributes units of work over HTTP, so where it makes sense your DPU should be available via HTTP. As with closed source DPUs, Open mHealth asks that you expose your algorithm as a web service, register your DPU at registry.openmhealth.org (forthcoming), and implement a registry call that advertises the Open mHealth schema IDs your DPU consumes and produces. If your DPU is a library, make sure it is well documented, especially regarding the schema IDs it consumes and produces, as shown in the sketch below.
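For a library DPU, that documentation can live right next to the code. A minimal sketch, assuming Python and hypothetical schema IDs:

```python
def categorize_blood_pressure(payload: dict) -> dict:
    """Classify a blood pressure reading.

    Consumes: omh:open-mhealth:blood-pressure, version 1
    Produces: omh:example:blood-pressure-category, version 1
    (Both schema IDs are hypothetical examples.)
    """
    systolic = payload["data"]["systolic"]
    category = "elevated" if systolic >= 120 else "normal"
    return {
        "schema_id": "omh:example:blood-pressure-category",
        "schema_version": 1,
        "data": {"category": category},
    }
```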

Composing a Set of DPUs

A common use case we have heard from our community is that DPUs must be composable. For DPUs f, g, and h, it should be possible to invoke them as f(x) → g(x') → h(x''), where the schema IDs on the data (each version of x) change as the pipeline executes. Some use cases may not benefit from this approach, but if each DPU is a unit of work, it makes sense to spread common processing operations across a set of DPUs. In thinking about DPUs, analogies to mathematical and SQL operations are useful, e.g., ordering, smoothing, statistical analysis, and grouping.
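Under the pure-function reading above, composition is ordinary function chaining. A minimal sketch, where f, g, and h stand for hypothetical DPUs:

```python
from functools import reduce

def compose_dpus(*dpus):
    """Chain DPUs so each consumes the payload the previous one produced.
    Each DPU rewrites the schema ID/version as the pipeline executes."""
    return lambda payload: reduce(lambda p, dpu: dpu(p), dpus, payload)

# Hypothetical pipeline: f(x) -> g(x') -> h(x'')
# pipeline = compose_dpus(f, g, h)
# result = pipeline(payload)
```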