Job configuration XML files are used for each job to specify the stages, tasks, and channels for the job, including inputs and outputs and other assorted settings related to a job.
It is almost never necessary to manually create or modify a job configuration XML file, as it is preferable to use the JobBuilder
or JobConfiguration
classes.
Job configuration files are created by serializing the JobConfiguration
class to XML, and read by deserializing.
Job
elementJob/AssemblyFileNames
elementJob/AssemblyFileNames/string
elementJob/Stages
elementJob/Stages/Stage
elementJob/Stages/Stage/TaskType
elementJob/Stages/Stage/DataInputType
elementJob/Stages/Stage/DataOutputType
elementJob/Stages/Stage/ChildStage
elementJob/Stages/Stage/ChildStagePartitionerType
elementJob/Stages/Stage/StageSettings
elementJob/Stages/Stage/StageSettings/Setting
elementJob/Stages/Stage/OutputChannel
elementJob/Stages/Stage/OutputChannel/MultiInputRecordReaderType
elementJob/Stages/Stage/OutputChannel/OutputStage
elementJob/Stages/Stage/OutputChannel/PartitionerType
elementJob/Stages/Stage/MultiInputRecordReaderType
elementJob/Stages/Stage/DependentStages
elementJob/Stages/Stage/DependentStages/string
elementJob/AdditionalProgressCounters
elementJob/AdditionalProgressCounters/AdditionalProgressCounter
elementJob/AdditionalProgressCounters/AdditionalProgressCounter/TypeName
elementJob/AdditionalProgressCounters/AdditionalProgressCounter/DisplayName
elementJob/SchedulerOptions
elementJob/JobSettings
elementJob/JobSettings/Setting
elementSchedulingMode
typeChannelType
typePartitionAssignmentMethod
typeStageId
typeJob
element
The <Job>
element specifies the configuration of a job. This is the root element of a job configuration XML file.
<Job name=xs:string> </Job>
Attribute | Use | Description |
---|---|---|
name | optional | A friendly name for the job. |
Element | Min occurs | Max occurs |
---|---|---|
<AssemblyFileNames> | 1 | 1 |
<Stages> | 1 | 1 |
<AdditionalProgressCounters> | 1 | 1 |
<SchedulerOptions> | 0 | 1 |
<JobSettings> | 0 | 1 |
Job/AssemblyFileNames
element
The <AssemblyFileNames>
element contains a collection of assembly file names that must be loaded for the job's tasks.
<AssemblyFileNames> </AssemblyFileNames>
Element | Min occurs | Max occurs |
---|---|---|
<string> | 0 | unbounded |
Job/AssemblyFileNames/string
element
The <string>
element for the <AssemblyFileNames>
element specifies the file name (without path information)
of an assembly that should be loaded for the tasks in this job.
<string>xs:string<string>
Job/Stages
element
The <Stages>
element contains a collection of all the stages (except child stages) in the job. The
definition order of the stages does not matter, as they will be ordered by dependencies for scheduling.
<Stages> </Stages>
Element | Min occurs | Max occurs |
---|---|---|
<Stage> | 0 | unbounded |
Job/Stages/Stage
element
The <Stage>
element specifies a processing stage in a Jumbo Jet job.
<Stage id=StageId taskCount=xs:int> </Stage>
Attribute | Use | Description |
---|---|---|
id | required | The identifier of this stage. For root stages, this must be unique within the job. For child stages, this value does not need to be unique. |
taskCount | required | The number of tasks in this stage. For stages with data input this attribute is informational, since the number of splits defined by the data input determines the actual number of tasks. |
Element | Min occurs | Max occurs |
---|---|---|
<TaskType> | 1 | 1 |
<DataInputType> | 0 | 1 |
<DataOutputType> | 0 | 1 |
<ChildStage> | 0 | 1 |
<ChildStagePartitionerType> | 0 | 1 |
<StageSettings> | 0 | 1 |
<OutputChannel> | 0 | 1 |
<MultiInputRecordReaderType> | 0 | 1 |
<DependentStages> | 0 | 1 |
Job/Stages/Stage/TaskType
element
The <TaskType>
element specifies the assembly qualified type name of a type implementing the ITask<TInput, TOutput>
interface that provides the data processing operation for this stage.
<TaskType>xs:string<TaskType>
Job/Stages/Stage/DataInputType
element
The <DataInputType>
element specifies the assembly qualified type name of a type implementing the IDataInput
interface
that determines the input for the task, such as a DFS file. This element is not used for stages that have an input channel or have no input.
<DataInputType>xs:string<DataInputType>
Job/Stages/Stage/DataOutputType
element
The <DataOutputType>
element specifies the assembly qualified type name of a type implementing the IDataOutput
interface
that determines the output for the task, such as a DFS file. This element is not used for stages that have an output channel or have no output.
<DataOutputType>xs:string<DataOutputType>
Job/Stages/Stage/ChildStage
element
The <ChildStage>
element specifies a child stage for the current stage; that is, a stage connected to the current stage using a pipeline channel.
See the <Stage> element.
Job/Stages/Stage/ChildStagePartitionerType
element
The <ChildStagePartitionerType>
element specifies the assembly qualified type name of a type implementing the IPartitioner<T>
interface that is used to partition records for the child stage of this stage. This element is not used if the stage has no child stage or the taskCount
attribute of the <ChildStage>
element is 1, in which case no partitioning occurs. Note that
partitioning may happen only once in a compound task.
<ChildStagePartitionerType>xs:string<ChildStagePartitionerType>
Job/Stages/Stage/StageSettings
element
The <StageSettings>
element specifies arbitrary settings that are available during task execution that apply only to the current stage.
<StageSettings> </StageSettings>
Element | Min occurs | Max occurs |
---|---|---|
<Setting> | 0 | unbounded |
Job/Stages/Stage/StageSettings/Setting
element
The <Setting>
element specifies a setting.
<Setting key=xs:string value=xs:string />
Attribute | Use | Description |
---|---|---|
key | required | The key that can be used to retrieve the setting. |
value | required | The value of the setting. |
Job/Stages/Stage/OutputChannel
element
The <OutputChannel>
element specifies a file or TCP output channel for the stage. This element is not used if the stage has data output
or no output, or if the stage has a pipeline output channel (in which case the <ChildStage>
element
is used instead.
<OutputChannel type=ChannelType partitionsPerTask=xs:int disableDynamicPartitionAssignment=xs:boolean partitionAssignmentMethod=PartitionAssignmentMethod forceFileDownload=xs:boolean> </OutputChannel>
Attribute | Use | Description |
---|---|---|
type | required | The type of the channel: file or TCP. |
partitionsPerTask | required | The number of partitions that each task in the receiving stage will process (setting this higher than 1 will create a stage with more partitions than tasks, allowing for the use of dynamic partition assignment for load balancing). |
disableDynamicPartitionAssignment | required |
Indicates whether to disable dynamic partition assignment if the partitionsPerTask attribute is higher than 1. Use for debugging purposes.
|
partitionAssignmentMethod | required |
Specifies the partition assignment method to use if the partitionsPerTask attribute is higher than 1.
|
forceFileDownload | required | Indicates whether to force file download even for local files with a file channel. Use for debugging purposes. |
Element | Min occurs | Max occurs |
---|---|---|
<MultiInputRecordReaderType> | 1 | 1 |
<OutputStage> | 1 | 1 |
<PartitionerType> | 1 | 1 |
Job/Stages/Stage/OutputChannel/MultiInputRecordReaderType
element
The <MultiInputRecordReaderType>
element specifies the assembly qualified name of a type that derives from
MultiInputRecordReader<T>
that is used as the multi-input record reader to combine input data from multiple
tasks in the tasks of the receiving stage.
<MultiInputRecordReaderType>xs:string<MultiInputRecordReaderType>
Job/Stages/Stage/OutputChannel/OutputStage
element
The <OutputStage>
element specifies the ID of the receiving stage of this channel. This may not be a compound stage ID.
<OutputStage>StageId<OutputStage>
Job/Stages/Stage/OutputChannel/PartitionerType
element
The <PartitionerType>
element specifies the assembly qualified type name of a type that implements the
IPartitioner<T>
interface that is used to partition records across tasks in the receiving stage.
<PartitionerType>xs:string<PartitionerType>
Job/Stages/Stage/MultiInputRecordReaderType
element
The <MultiInputRecordReaderType>
element specifies the assembly qualified type name of a type that derives from the
MultiInputRecordReader<T>
class that is used as the stage multi-input record reader to combine input from multiple channels.
This element is only used if the stage has more than one input channel.
<MultiInputRecordReaderType>xs:string<MultiInputRecordReaderType>
Job/Stages/Stage/DependentStages
element
The <DependentStages>
element specifies a collection of stage IDs of stages that may not be scheduled until all tasks in this
stage have completed. This is used to specify stages that depend on the output of this stage but are not directly or indirectly connected to this stage
via a channel. The specified stages may not be child stages (so the ID may not be a compound stage ID).
<DependentStages> </DependentStages>
Element | Min occurs | Max occurs |
---|---|---|
<string> | 0 | unbounded |
Job/Stages/Stage/DependentStages/string
element
The <string>
element for the <DependentStages>
element specifies the ID of a stage that may not be scheduled
until all the tasks in this stage have completed.
<string>StageId<string>
Job/AdditionalProgressCounters
element
The <AdditionalProgressCounters>
element contains a collection of additional progress counters for the job.
<AdditionalProgressCounters> </AdditionalProgressCounters>
Element | Min occurs | Max occurs |
---|---|---|
<AdditionalProgressCounter> | 0 | unbounded |
Job/AdditionalProgressCounters/AdditionalProgressCounter
element
The <AdditionalProgressCounter>
element specifies a friendly name for an additional progress counter used by the job.
<AdditionalProgressCounter> </AdditionalProgressCounter>
Element | Min occurs | Max occurs |
---|---|---|
<TypeName> | 1 | 1 |
<DisplayName> | 1 | 1 |
Job/AdditionalProgressCounters/AdditionalProgressCounter/TypeName
element
The <TypeName>
element specifies the assembly qualified type name of a type implementing the IHasAdditionalProgress
interface that provides additional progress information about the tasks in this job.
<TypeName>xs:string<TypeName>
Job/AdditionalProgressCounters/AdditionalProgressCounter/DisplayName
element
The <DisplayName>
element specifies a friendly name for the additional progress counter used by the Jet web portal.
<DisplayName>xs:string<DisplayName>
Job/SchedulerOptions
element
The <SchedulerOptions>
element specifies custom scheduler settings for a job. Custom schedulers are not required
to obey these settings (the default scheduler does).
<SchedulerOptions maximumDataDistance=xs:int dfsInputSchedulingMode=SchedulingMode nonInputSchedulingMode=SchedulingMode />
Attribute | Use | Description |
---|---|---|
maximumDataDistance | required | The maximum allowed distance between a task and its input data. The value 0 means the data must be local, 1 allows rack-local data, and 2 or higher allows any distance. This attribute only applies to stages with data input that is provided by a file system that supports locality (such as the Jumbo DFS). |
dfsInputSchedulingMode | required | The scheduling mode for tasks with data input. |
nonInputSchedulingMode | required | The scheduling mode for tasks without data input. |
Job/JobSettings
element
The <JobSettings>
element specifies arbitrary settings available during task execution that apply to the job as a whole.
<JobSettings> </JobSettings>
Element | Min occurs | Max occurs |
---|---|---|
<Setting> | 0 | unbounded |
Job/JobSettings/Setting
element
The <Setting>
element specifies a setting.
<Setting key=xs:string value=xs:string />
Attribute | Use | Description |
---|---|---|
key | required | The key that can be used to retrieve the setting. |
value | required | The value of the setting. |
SchedulingMode
typeIndicates the scheduling strategy to use by a task scheduler.
Value | Description |
---|---|
Default |
Use the strategy specified in the |
MoreServers | Favor TaskServers with a large amount of free task slots, spreading a job over as many nodes as possible. |
FewerServers | Favor TaskServers with a small amount of free task slots, spreading the job over as few nodes as possible. |
OptimalLocality | Do not schedule non-local tasks on a TaskServer even if there are no other tasks that could be assigned to that TaskServer. |
ChannelType
typeIndicates the type of a channel between stages.
Value | Description |
---|---|
File | Intermediate data is materialized in local files which are shuffled over the network. |
Tcp | Data is transfered directly between tasks via TCP connection without materializing it. It must be possible to run all tasks in the receiving stage simultaneously, and task level fault tolerance will not be available for this job. |
PartitionAssignmentMethod
typeIndicates how to assign partitions to tasks during initial partition assignment if the stage uses more than one partition per task.
Value | Description |
---|---|
Linear | Each task gets a linear sequence of partitions, e.g. task 1 gets partitions 1, 2 and 3, task 2 gets partitions 4, 5 and 6, and so forth. |
Striped | The partitions are striped across the tasks, e.g. task 1 gets partitions 1, 3, and 5, and task 2 gets partitions 2, 4, and 6. |
StageId
typeA string defining a non-compound stage ID. May not contain the characters '.', '_' or '-'.