Updated for release version: 1.0.4-alpha
This page is an introduction to processing pipelines in modelGUI. It describes the concept of processing pipelines, their particular design in mgui, and the GUI modules designed for creating, modifying, and executing them.
Introduction
Software pipelines are a long-established concept. ModelGUI pipelines are currently intended as "pseudo-pipelines", whose functioning is illustrated in the figure below:
In essence, the modelGUI pipeline "pipes" together in series a number of software processes, the first of which accepts some data file as input, and the last of which produces a file as output. Intermediate to each process is a temporary data file which acts as a buffer between the processes. Processes are described next.
Processes and Parameters
A pipeline process in modelGUI is a specification for a command-line process, which can be either a Java Process or a Native Process. A Java Process is one which can be called by means of a main function, which accepts an array of String parameters as an argument. A Native Process is compiled code that runs directly from the operating system command line; many software tools operate in this way. The modelGUI process is specified by the following elements:
- a name
- a command string (either a main Java class or a native command)
- a set of Parameters that can be passed to the command
- which parameter (if any) is to be used to specify the input file
- which parameter (if any) is to be used to specify the output file
A parameter is a value that a process uses to specify its input and output, or to control its behaviour. Parameters are passed to a process as a space-separated set of String values, with each parameter name prefixed by a "-" character and followed by its value, if it has one. For example, the following command specifies an input file called "input.dat" and an output file called "output.dat". The third parameter, "-some_flag", has no value following it; this is known as a "flag" parameter, which specifies a boolean value: its presence sets the value to true, and its absence sets it to false. Note that "some_process" may accept a number of additional parameters, but their inclusion is optional. If these parameters are not specified, the process will typically assign them default values.
some_process -inputfile input.dat -outputfile output.dat -some_flag
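To make the convention concrete from the receiving side, here is a minimal sketch of how a process might interpret such an argument list. It is an illustration only, not modelGUI code: the function name parse_parameters is hypothetical, and the sketch assumes that a value never begins with "-" (so a negative number, for example, would be misread as a parameter name).

```python
from typing import Dict, List, Union


def parse_parameters(args: List[str]) -> Dict[str, Union[str, bool]]:
    """Interpret '-name value' pairs and bare '-flag' tokens (hypothetical helper)."""
    parsed: Dict[str, Union[str, bool]] = {}
    i = 0
    while i < len(args):
        token = args[i]
        if not token.startswith("-"):
            raise ValueError(f"expected a parameter name, got: {token}")
        name = token[1:]
        # Assumption: the next token is this parameter's value unless it
        # looks like another parameter name.
        if i + 1 < len(args) and not args[i + 1].startswith("-"):
            parsed[name] = args[i + 1]
            i += 2
        else:
            parsed[name] = True  # flag parameter: presence means true
            i += 1
    return parsed


parsed = parse_parameters(
    "-inputfile input.dat -outputfile output.dat -some_flag".split()
)
# A flag that is absent is simply missing, so it reads as false.
other_flag = parsed.get("other_flag", False)
```

Here parsed maps "inputfile" and "outputfile" to their file names and "some_flag" to True, while other_flag is False because that flag was not supplied.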
Parameters are specified in modelGUI by the following elements:
- a name
- whether the parameter is optional
- whether the parameter has a value (if not, it is considered a flag parameter)
- whether the parameter's name should be included (in some cases, when the order of parameters is known, only the parameter's value is expected)
- a default value
Processes and parameters in modelGUI define the structure of a pipeline process. A pipeline itself consists of a series of instances of processes and parameters, which hold the actual parameter values (including input and output) for a particular instance of data processing. This organization is illustrated in the figure above.
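The distinction between a definition and an instance can be sketched as follows. This is a hedged illustration rather than the modelGUI data model: the class names ParameterDef and ProcessDef, their field names, and the function build_command are all hypothetical, chosen to mirror the elements listed above. The sketch also assumes that a definition's default value is passed explicitly when an instance supplies no value.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParameterDef:
    """Hypothetical parameter definition mirroring the listed elements."""
    name: str
    optional: bool = True
    has_value: bool = True   # False marks a flag parameter
    use_name: bool = True    # False: only the value is emitted
    default: Optional[str] = None


@dataclass
class ProcessDef:
    """Hypothetical process definition."""
    name: str
    command: str
    parameters: List[ParameterDef] = field(default_factory=list)
    input_parameter: Optional[str] = None
    output_parameter: Optional[str] = None


def build_command(process: ProcessDef, values: Dict[str, object]) -> List[str]:
    """Combine a definition with one instance's values into an argument list."""
    args = [process.command]
    for p in process.parameters:
        if not p.has_value:
            if values.get(p.name, False):
                args.append("-" + p.name)  # flag: emitted only when true
            continue
        value = values.get(p.name, p.default)
        if value is None:
            if not p.optional:
                raise ValueError(f"missing required parameter: {p.name}")
            continue  # optional and unset: leave it to the process
        if p.use_name:
            args.append("-" + p.name)
        args.append(str(value))
    return args


some_process = ProcessDef(
    name="Some process",
    command="some_process",
    parameters=[
        ParameterDef("inputfile", optional=False),
        ParameterDef("outputfile", optional=False),
        ParameterDef("some_flag", has_value=False),
    ],
    input_parameter="inputfile",
    output_parameter="outputfile",
)
command = build_command(
    some_process,
    {"inputfile": "input.dat", "outputfile": "output.dat", "some_flag": True},
)
```

Joining command with spaces reproduces the example command line shown earlier. The same definition can be reused with a different dictionary of values for each process instance in a pipeline.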
Input and Output
Input and output for a process depend on its position in the pipeline and the type of processing it performs. The first process in the pipeline is typically assigned a single input file, although this is not always the case (a process may generate its own data, or read from multiple files which are specified in its parameters). The input and output for a process are specified using particular parameters, which are in turn designated by the process itself. The pipeline uses these parameters to direct the flow of data, by writing to and reading from temporary files which act as buffers (i.e., the pipeline actively manipulates these parameters to direct the input and output of the underlying processes). If the parameters for an intermediate process already specify an output file, that value is still honoured: the temporary file is copied to the specified location, producing intermediate output, which is often desirable in a pipeline. The final output of a pipeline can be a single file, specified by the last of the processes; it can also be any number of files, depending on the nature of the process.
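The parameter manipulation described above can be sketched as a pure function that computes, for each stage, the values it will actually run with. This is an assumption-laden illustration, not the modelGUI implementation: the function wire_stages, the dictionary keys, and the buffer file naming are hypothetical. It further assumes that a stage without an output parameter passes nothing on to the next stage.

```python
import os
from typing import Dict, List, Optional, Tuple


def wire_stages(
    stages: List[dict],
    initial_input: Optional[str],
    work_dir: str,
) -> Tuple[List[Dict[str, str]], List[Tuple[str, str]]]:
    """Return per-stage parameter values plus (buffer, destination) copies.

    Each stage is a dict with the hypothetical keys 'input_parameter',
    'output_parameter' and 'values'.
    """
    wired: List[Dict[str, str]] = []
    copies: List[Tuple[str, str]] = []
    current_input = initial_input
    for index, stage in enumerate(stages):
        values = dict(stage["values"])
        in_name = stage.get("input_parameter")
        out_name = stage.get("output_parameter")
        if in_name and current_input is not None:
            values[in_name] = current_input  # read the previous buffer
        if out_name:
            buffer_path = os.path.join(work_dir, f"stage_{index}.tmp")
            requested = values.get(out_name)
            if requested:
                # The user-specified output is honoured by copying the buffer.
                copies.append((buffer_path, requested))
            values[out_name] = buffer_path   # write to the buffer instead
            current_input = buffer_path
        else:
            current_input = None  # assumption: nothing to pass on
        wired.append(values)
    return wired, copies


stages = [
    {"input_parameter": "inputfile", "output_parameter": "outputfile",
     "values": {"outputfile": "smoothed.dat"}},
    {"input_parameter": "inputfile", "output_parameter": "outputfile",
     "values": {}},
]
wired, copies = wire_stages(stages, "raw.dat", "work")
```

In this example the first stage reads "raw.dat" and writes to a buffer file, the second stage reads that same buffer, and copies records that the first buffer should also be copied to "smoothed.dat".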
Execution
A pipeline is executed by (optionally) setting its initial input file, and launching it; this initiates the execution of the first process in the list, and blocks (waits) until this process completes before executing the next one. At the termination of each process, if it was successful, the pipeline stores its output as a temporary file, which it sets as the input to the next process. In this way, an arbitrarily large (and potentially complex) sequence of processing can be performed on a data set, with only one execution command required. This elementary pipeline sequence can be enhanced using forks and projects, described below.
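The blocking, stage-by-stage behaviour can be sketched with Python's standard subprocess module. This is a minimal sketch under stated assumptions, not how modelGUI itself launches processes: run_pipeline, python_stage, and the two toy stage scripts are hypothetical, and each stage is modelled as a function that turns an input path and a buffer path into an argument list.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, List, Optional

# A stage maps (input path, buffer path) to a command-line argument list.
Stage = Callable[[Optional[str], str], List[str]]


def run_pipeline(
    stages: List[Stage], initial_input: Optional[str], work_dir: str
) -> Optional[str]:
    """Run stages in series, buffering each output in a temporary file."""
    current = initial_input
    for index, make_command in enumerate(stages):
        buffer_path = os.path.join(work_dir, f"stage_{index}.tmp")
        command = make_command(current, buffer_path)
        result = subprocess.run(command)  # blocks until the process exits
        if result.returncode != 0:
            raise RuntimeError(
                f"stage {index} failed with exit code {result.returncode}"
            )
        current = buffer_path  # this output becomes the next input
    return current


# Toy stand-ins for real command-line tools (hypothetical).
UPPER = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
REVERSE = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read()[::-1])"


def python_stage(script: str) -> Stage:
    return lambda src, dst: [sys.executable, "-c", script, src, dst]


with tempfile.TemporaryDirectory() as work_dir:
    source = os.path.join(work_dir, "input.dat")
    with open(source, "w") as handle:
        handle.write("abc")
    final_path = run_pipeline(
        [python_stage(UPPER), python_stage(REVERSE)], source, work_dir
    )
    with open(final_path) as handle:
        final_text = handle.read()
```

A single call to run_pipeline drives the whole sequence; a failing stage stops the run rather than feeding an incomplete buffer to the next process.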
Forks
A fork, as its name suggests, is a divergence of the data pipeline into multiple independent pipelines, which can execute either in parallel or in serial. This is desirable for segments of a pipeline which are separable, or trivially parallel, meaning that their processes do not depend upon the output of any of the other pipelines. A fork can in principle be used to apply an identical processing pipeline to multiple instances of the same sort of data (e.g., subject data from an experimental design); however, modelGUI is designed to handle such identical parallel pipelines by means of projects (see below), which are the preferred mechanism for this kind of processing. A fork is intended instead for applying different processing pipelines to the same data set, in parallel, and waiting until all of its child pipelines have finished before passing on the output and resuming execution of the calling pipeline. The data flow of a pipeline fork is illustrated below:
In this figure, the fork is shown as another process in the pipeline, which receives an input and produces an output. The input to the fork is clearly the output of the previous process; the output, however, is less straightforward, considering that there will be as many distinct outputs as there are pipelines in the fork. Indeed, there is no clear way to converge these multiple outputs. ModelGUI addresses this by allowing a single pipeline to be assigned as the "output pipeline" (i.e., the pipeline highlighted in red above); its output becomes the output of the fork itself, and is thus the only part of the fork which is piped through to the next process. In practice, this means that only one of the outputs can be buffered; the others can continue through the pipeline only by being written to persistent storage and passed as ancillary parameters to later processes.
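The fork semantics (run every child pipeline, wait for all of them, forward only the designated output) can be sketched as follows. This is an illustrative sketch rather than modelGUI code: run_fork and the child pipeline callables are hypothetical, and a thread pool from Python's standard concurrent.futures module stands in for whatever parallel mechanism is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# A child pipeline maps the fork's input path to its own output path.
ChildPipeline = Callable[[Optional[str]], Optional[str]]


def run_fork(
    children: Dict[str, ChildPipeline],
    output_pipeline: str,
    fork_input: Optional[str],
    parallel: bool = True,
) -> Optional[str]:
    """Run all child pipelines; return only the output pipeline's result."""
    if output_pipeline not in children:
        raise KeyError(f"unknown output pipeline: {output_pipeline}")
    if parallel:
        with ThreadPoolExecutor(max_workers=len(children)) as pool:
            futures = {
                name: pool.submit(child, fork_input)
                for name, child in children.items()
            }
            # result() blocks, so the fork waits for every child to finish.
            results = {name: f.result() for name, f in futures.items()}
    else:
        results = {name: child(fork_input) for name, child in children.items()}
    return results[output_pipeline]


# Hypothetical children: each would write its own file in a real pipeline.
completed = []


def make_child(label: str) -> ChildPipeline:
    def child(path: Optional[str]) -> Optional[str]:
        completed.append(label)
        return f"{path}.{label}"
    return child


children = {"surface": make_child("surface"), "volume": make_child("volume")}
fork_output = run_fork(children, "volume", "subject01.dat")
```

Only the "volume" result is returned to the calling pipeline; the "surface" pipeline still runs to completion, but its result would have to be written to persistent storage to be used later.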
Projects
A project is an organizational framework for a data set consisting of a number of instances, each of which has the same form of data associated with it. The simplest example is an experimental setup having several subjects for which data are obtained using an identical protocol. Projects in modelGUI can be assigned to pipelines, such that an entire set of instances can be processed using the same pipeline, either in series or in parallel. Parallel processing is best suited to parallelized systems such as computing clusters; its implementation is planned for future releases of modelGUI and is not yet available.
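Serial processing of a project can be sketched as one pipeline applied to each instance in turn. This is a hedged illustration only: run_project, the instance names, and the stand-in pipeline function are hypothetical, and the sketch does not reflect how modelGUI stores projects.

```python
from typing import Callable, Dict


def run_project(
    instances: Dict[str, str], pipeline: Callable[[str], str]
) -> Dict[str, str]:
    """Apply the same pipeline to every instance, one after another."""
    outputs: Dict[str, str] = {}
    for name, input_path in instances.items():
        outputs[name] = pipeline(input_path)  # same pipeline, different data
    return outputs


# Hypothetical project: one input file per subject.
subjects = {
    "subject01": "subject01/raw.dat",
    "subject02": "subject02/raw.dat",
}


def example_pipeline(input_path: str) -> str:
    # Stand-in for a real pipeline run; returns where the output would go.
    return input_path.replace("raw.dat", "processed.dat")


project_outputs = run_project(subjects, example_pipeline)
```

Because each instance is processed independently, the same loop could later be distributed across workers without changing the pipeline itself.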