The jet.config file configures the Jumbo Jet distributed execution engine.
ookii.jumbo.jet element
ookii.jumbo.jet/jobServer element
ookii.jumbo.jet/taskServer element
ookii.jumbo.jet/fileChannel element
ookii.jumbo.jet/tcpChannel element
ookii.jumbo.jet/mergeRecordReader element
SchedulingMode type
BinarySize type
CompressionType type

ookii.jumbo.jet element
The <ookii.jumbo.jet> element provides configuration for Jumbo clients that need to connect to the Jumbo Jet distributed execution engine, for the Jumbo Jet JobServer and TaskServers, and for Jumbo Jet jobs.
<ookii.jumbo.jet> </ookii.jumbo.jet>
Element | Min occurs | Max occurs |
---|---|---|
<jobServer> | 0 | 1 |
<taskServer> | 0 | 1 |
<fileChannel> | 0 | 1 |
<tcpChannel> | 0 | 1 |
<mergeRecordReader> | 0 | 1 |
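
As a rough sketch of how these child elements fit together, the fragment below shows an <ookii.jumbo.jet> element with each child present once. The taskDirectory path is a hypothetical placeholder. This section does not describe what wrapper, if any, the complete jet.config file needs around this element, so it is left out here.

```xml
<!-- Sketch only: the section element, not a complete jet.config file. -->
<ookii.jumbo.jet>
  <jobServer hostName="localhost" port="9500" />
  <!-- taskDirectory is the only required attribute; this path is a made-up example. -->
  <taskServer taskDirectory="/tmp/jumbo/tasks" />
  <!-- The remaining children are optional and fall back to their defaults when omitted. -->
  <fileChannel />
  <tcpChannel />
  <mergeRecordReader />
</ookii.jumbo.jet>
```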
ookii.jumbo.jet/jobServer element
The <jobServer> element provides information for clients to access the Jumbo Jet execution engine, and configuration for the Jumbo Jet JobServer. For client applications, only the hostName and port attributes of this element are used.
<jobServer hostName=xs:string port=xs:int jetDfsPath=xs:string archiveDirectory=xs:string scheduler=xs:string maxTaskAttempts=xs:int maxTaskFailures=xs:int taskServerTimeout=xs:int taskServerSoftTimeout=xs:int dataInputSchedulingMode=SchedulingMode nonDataInputSchedulingMode=SchedulingMode schedulingThreshold=xs:float broadcastAddress=xs:string broadcastPort=xs:int listenIPv4AndIPv6=xs:boolean />
Attribute | Use | Description |
---|---|---|
hostName | optional | The host name of the Jumbo Jet JobServer. The default value is "localhost". |
port | optional | The port number for the Jumbo Jet JobServer's RPC service. The default value is "9500". |
jetDfsPath | optional | The path on the Jumbo DFS (or other file system configured in dfs.config) where files related to Jumbo Jet jobs are stored. The default value is "/JumboJet". |
archiveDirectory | optional | The local directory where job configuration and statistics are archived for future retrieval via the Jumbo Jet web portal. If not specified, jobs are not archived and their information cannot be retrieved after they are removed from the current list of completed jobs. |
scheduler | optional | The assembly-qualified type name of a type that implements Ookii.Jumbo.Jet.Scheduling.ITaskScheduler, which is used to schedule task execution on the Jumbo Jet cluster. The default value is "Ookii.Jumbo.Jet.Scheduling.DefaultScheduler, Ookii.Jumbo.Jet". |
maxTaskAttempts | optional | The maximum number of times a single task may be re-executed before the job that contains the task is failed. The default value is "5". |
maxTaskFailures | optional | The maximum number of failures across all tasks that a job may experience before the job is failed. The default value is "20". |
taskServerTimeout | optional | The time in milliseconds after which a TaskServer is considered dead if it has not sent a heartbeat. The default value is "600000". |
taskServerSoftTimeout | optional | The time in milliseconds after which a TaskServer will no longer be considered for new tasks during scheduling if it has not sent a heartbeat. The default value is "60000". |
dataInputSchedulingMode | optional | The scheduling mode to use for tasks that have data input (e.g. a DFS file). This setting may not be used by all schedulers (the default scheduler does use it). The default value is "MoreServers". |
nonDataInputSchedulingMode | optional | The scheduling mode to use for tasks that do not have data input (no input or channel input). This setting may not be used by all schedulers (the default scheduler does use it). The OptimalLocality value is not applicable to this setting. The default value is "MoreServers". |
schedulingThreshold | optional | The fraction of tasks (between 0 and 1) of a stage using a file output channel that must be finished before tasks of the receiving stage of the channel are eligible to be scheduled. This setting is likely to only have significant impact if the stage has a number of tasks close to or less than the cluster's capacity. The default value is "0.4". |
broadcastAddress | optional | The UDP broadcast address for task completion notification. This should be set to your network's broadcast address if task completion notification broadcasts are enabled. For example, if your network address is 192.168.1.x with a network mask of 255.255.255.0, your broadcast address is 192.168.1.255. The default value is "255.255.255.255". |
broadcastPort | optional | The port number to use for UDP broadcast for task completion notification. If this value is set to 0, task completion notification broadcasts are disabled. The standard Jumbo port number to use for this is 9550. The default value is "0". |
listenIPv4AndIPv6 | optional | Indicates whether the JobServer's RPC service should listen on both IPv6 and IPv4 addresses. On Windows, it is required to explicitly listen on both addresses if both are supported; on Linux, listening on IPv6 will automatically listen on the corresponding IPv4 address, so attempting to manually bind to that address will fail. If this setting is not specified, it defaults to "true" on Windows and "false" on Unix (which is correct for Linux, but you may need to manually set it for other Unix variants like FreeBSD). If either IPv6 or IPv4 connectivity is not available on the system, this setting has no effect. |
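
To illustrate the split between client and server use described above, here is a hedged sketch of two <jobServer> elements. The host name "jobserver01" and the archive path are hypothetical. The broadcast values follow the 192.168.1.x example and the standard port given in the table.

```xml
<!-- Client machine: only hostName and port are read. -->
<jobServer hostName="jobserver01" port="9500" />

<!-- JobServer machine: archives finished jobs and enables task completion
     broadcasts (broadcastPort other than 0) on a 192.168.1.x/255.255.255.0 network. -->
<jobServer hostName="jobserver01" port="9500"
           archiveDirectory="/var/jumbo/archive"
           taskServerSoftTimeout="60000"
           taskServerTimeout="600000"
           broadcastAddress="192.168.1.255"
           broadcastPort="9550" />
```

With these timeouts, a TaskServer that stops sending heartbeats is skipped for new tasks after one minute and considered dead after ten.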
ookii.jumbo.jet/taskServer element
The <taskServer> element configures the Jumbo Jet TaskServers.
<taskServer taskDirectory=xs:string taskSlots=xs:int port=xs:int fileServerPort=xs:int fileServerMaxConnections=xs:int fileServerMaxIndexCacheSize=xs:int processCreationDelay=xs:int runTaskHostInAppDomain=xs:boolean logSystemStatus=xs:boolean progressInterval=xs:int heartbeatInterval=xs:int taskTimeout=xs:int immediateCompletedTaskNotification=xs:boolean listenIPv4AndIPv6=xs:boolean />
Attribute | Use | Description |
---|---|---|
taskDirectory | required | The local directory where temporary and intermediate data files for tasks of running jobs are stored on a TaskServer. |
taskSlots | optional | The maximum number of simultaneous tasks to execute per TaskServer. The default value is "2". |
port | optional | The port number for the TaskServer's RPC service (used for clients and the task umbilical protocol). The default value is "9501". |
fileServerPort | optional | The port number for the file server used for file channel shuffling operations. The default value is "9502". |
fileServerMaxConnections | optional | The maximum number of connections that the file server will accept. If a task attempts to connect to the file server while the maximum number of connections is already in use, the connection is refused so the task knows to connect to a different server first. This helps balance the load of the shuffling operation by preventing all tasks from reading data from the same TaskServer simultaneously. A reasonable guideline for this value is twice the number of task slots. The default value is "10". |
fileServerMaxIndexCacheSize | optional | The maximum number of entries in the file channel server's index cache before entries get evicted by new entries. The default value is "25". |
processCreationDelay | optional | The delay, in milliseconds, to apply before creating a new TaskHost process. This setting was used to work around a bug with rapid process creation in older versions of Mono, and should be left at 0 unless you are experiencing problems. The default value is "0". |
runTaskHostInAppDomain | optional | Indicates whether to use an AppDomain rather than a process for running tasks. This is not recommended except for debugging purposes. Tasks will always execute in an AppDomain regardless of this setting if a debugger is attached to the TaskServer. The default value is "false". |
logSystemStatus | optional | Indicates whether tasks should periodically log processor and memory status information to the task log file. System status is logged based on the progress interval. The default value is "false". |
progressInterval | optional | The interval in milliseconds at which tasks report progress to the TaskServer. The default value is "3000". |
heartbeatInterval | optional | The interval in milliseconds at which the TaskServer sends heartbeats to the JobServer. The default value is "3000". |
taskTimeout | optional | The time in milliseconds after which a task is failed and scheduled for re-execution if it has not reported progress. The default value is "600000". |
immediateCompletedTaskNotification | optional | Indicates whether to send immediate out-of-band heartbeats to the JobServer when a task completes. If set to false, the JobServer is not notified of task completion until the next heartbeat interval. The default value is "true". |
listenIPv4AndIPv6 | optional | Indicates whether the TaskServer's RPC service and file channel server should listen on both IPv6 and IPv4 addresses. On Windows, it is required to explicitly listen on both addresses if both are supported; on Linux, listening on IPv6 will automatically listen on the corresponding IPv4 address, so attempting to manually bind to that address will fail. If this setting is not specified, it defaults to "true" on Windows and "false" on Unix (which is correct for Linux, but you may need to manually set it for other Unix variants like FreeBSD). If either IPv6 or IPv4 connectivity is not available on the system, this setting has no effect. |
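
As a minimal sketch of the sizing guideline above, the following <taskServer> element runs four tasks per server and sets the file server connection limit to twice that number. The directory path is a hypothetical placeholder, and the slot count is an arbitrary choice for illustration.

```xml
<!-- taskDirectory is required; the path shown is a made-up example. -->
<taskServer taskDirectory="/data/jumbo/tasks"
            taskSlots="4"
            fileServerMaxConnections="8"
            heartbeatInterval="3000"
            progressInterval="3000"
            taskTimeout="600000" />
```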
ookii.jumbo.jet/fileChannel element
The <fileChannel> element configures default settings for file channels between stages in a Jumbo Jet job.
<fileChannel readBufferSize=BinarySize writeBufferSize=BinarySize deleteIntermediateFiles=xs:boolean memoryStorageSize=BinarySize memoryStorageWaitTimeout=xs:int compressionType=CompressionType spillBufferSize=BinarySize spillBufferLimit=xs:float spillSortMinSpillsForCombineDuringMerge=xs:int enableChecksum=xs:boolean />
Attribute | Use | Description |
---|---|---|
readBufferSize | optional | The buffer size to use when reading intermediate files. The default value is "64KB". |
writeBufferSize | optional | The buffer size to use when writing intermediate files. The default value is "64KB". |
deleteIntermediateFiles | optional | Indicates whether to delete intermediate files after they are no longer needed or when the job finishes or fails. Set this to false to preserve the files to debug a failing job. The default value is "true". |
memoryStorageSize | optional | The maximum size of the in-memory storage to use for shuffled segments. Setting this to a high value can improve performance, but may lead to problems if the task itself uses a lot of memory. When the MergeRecordReader is used on a file channel and the purgeMemoryBeforeFinalPass setting on the <mergeRecordReader> element is set to true, you can use a high value even if the task uses a large amount of memory. Because this value is task dependent, it can be more useful to override it using the job settings. The default value is "100MB". |
memoryStorageWaitTimeout | optional | The time in milliseconds to wait for memory storage to become available for a shuffled segment before falling back to disk storage. The default value is "60000". |
compressionType | optional | The type of compression to apply to the file channel's intermediate data files. Enabling file channel compression will reduce intermediate file size and network load, but may significantly increase the CPU load of the tasks. Only use this if your data is highly compressible, the network is slow, or disk space for intermediate files is low. The default value is "None". |
spillBufferSize | optional | The size of the in-memory buffer in which to collect intermediate data produced by a task with a file output channel. The default value is "100MB". |
spillBufferLimit | optional | The threshold (between 0 and 1) at which a spill is triggered (the contents of the in-memory buffer holding intermediate output data are written to disk). The default value is "0.8". |
spillSortMinSpillsForCombineDuringMerge | optional | If using a SpillSort operation with a combiner for the channel, the minimum number of spills that must have occurred in order to re-apply the combiner during merging. The default value is "3". |
enableChecksum | optional | Indicates whether to compute and verify checksums for intermediate data. The default value is "true". |
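
The sketch below shows one way these attributes might be combined when debugging a failing job on a slow network. The choice of values is illustrative only. The spill arithmetic in the comment assumes spillBufferLimit is applied as a fraction of spillBufferSize, which the descriptions above suggest but do not state outright.

```xml
<!-- Keep intermediate files for inspection and compress them with gzip.
     Under the stated assumption, a spill starts once the buffer holds
     about 0.8 x 100MB = 80MB of intermediate data. -->
<fileChannel deleteIntermediateFiles="false"
             compressionType="GZip"
             spillBufferSize="100MB"
             spillBufferLimit="0.8"
             enableChecksum="true" />
```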
ookii.jumbo.jet/tcpChannel element
The <tcpChannel> element configures default settings for TCP channels between stages in a Jumbo Jet job.
<tcpChannel spillBufferSize=BinarySize spillBufferLimit=xs:float reuseConnections=xs:boolean />
Attribute | Use | Description |
---|---|---|
spillBufferSize | optional | The size of the in-memory buffer in which intermediate data is collected before sending it to the receiving stage's tasks. The default value is "20MB". |
spillBufferLimit | optional | The threshold (between 0 and 1) at which the contents of the intermediate data buffer are sent to the receiving stage's tasks. The default value is "0.6". |
reuseConnections | optional | Indicates whether the TCP channel keeps the connections to the receiving stage's tasks open in between spills. The default value is "false". |
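
For comparison with the defaults, a hedged example of a <tcpChannel> element that keeps connections open between spills is sketched here. As with the file channel, the threshold arithmetic assumes spillBufferLimit is a fraction of spillBufferSize.

```xml
<!-- Under the stated assumption, data is sent to the receiving tasks once the
     buffer holds about 0.6 x 20MB = 12MB; connections stay open between spills. -->
<tcpChannel spillBufferSize="20MB"
            spillBufferLimit="0.6"
            reuseConnections="true" />
```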
ookii.jumbo.jet/mergeRecordReader element
The <mergeRecordReader> element configures default settings for the Ookii.Jumbo.Jet.MergeRecordReader<T> class, which is used when sorting using e.g. the JobBuilder.SpillSort method.
<mergeRecordReader maxFileInputs=xs:int memoryStorageTriggerLevel=xs:float mergeStreamReadBufferSize=BinarySize purgeMemoryBeforeFinalPass=xs:boolean />
Attribute | Use | Description |
---|---|---|
maxFileInputs | optional | The maximum number of on-disk segments that may be merged in a single merge pass. The merge record reader will merge the input data in multiple passes until the remaining number of on-disk segments is below this value. During shuffling, a background disk merge is triggered if the number of on-disk segments exceeds twice this value. The default value is "100". |
memoryStorageTriggerLevel | optional | The threshold (between 0 and 1) of memory storage usage at which a background merge is started to merge all current in-memory segments to disk. The default value is "0.6". |
mergeStreamReadBufferSize | optional | The buffer size to use per segment for reading on-disk segments during a merge. The default value is "1MB". |
purgeMemoryBeforeFinalPass | optional | Indicates whether to merge all in-memory segments to disk before the final pass (the final pass will use only on-disk segments). This is useful if the task that is consuming the merged records requires a lot of memory. The default value is "false". |
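
The interaction between purgeMemoryBeforeFinalPass and the file channel's memoryStorageSize can be sketched as follows. The 1GB figure is an arbitrary illustration, not a recommendation.

```xml
<ookii.jumbo.jet>
  <!-- A large in-memory store for shuffled segments... -->
  <fileChannel memoryStorageSize="1GB" />
  <!-- ...is flushed to disk before the final merge pass, leaving the memory
       to the task that consumes the merged records. With maxFileInputs="100",
       a background disk merge starts during shuffling once more than 200
       on-disk segments exist. -->
  <mergeRecordReader maxFileInputs="100"
                     memoryStorageTriggerLevel="0.6"
                     purgeMemoryBeforeFinalPass="true" />
</ookii.jumbo.jet>
```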
SchedulingMode type
Indicates the scheduling strategy used by a task scheduler.
Value | Description |
---|---|
Default | Use the default strategy, which is |
MoreServers | Favor TaskServers with a large number of free task slots, spreading a job over as many nodes as possible. |
FewerServers | Favor TaskServers with a small number of free task slots, spreading the job over as few nodes as possible. |
OptimalLocality | Do not schedule non-local tasks on a TaskServer even if there are no other tasks that could be assigned to that TaskServer. |
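
As an illustration of where these values are used, the fragment below sets both scheduling mode attributes on <jobServer>. The combination is only an example. OptimalLocality appears on dataInputSchedulingMode only, because the attribute table states that it is not applicable to nonDataInputSchedulingMode.

```xml
<!-- Tasks that read data are only scheduled where their input is local;
     tasks without data input are packed onto as few nodes as possible. -->
<jobServer dataInputSchedulingMode="OptimalLocality"
           nonDataInputSchedulingMode="FewerServers" />
```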
BinarySize type
A quantity expressed using a binary scale suffix such as B, KB, MB, GB, TB or PB. The B is optional. Also allows IEC suffixes (e.g. KiB, MiB). Examples of valid values include "5KB", "7.5M" and "9GiB". Suffixes are not case sensitive. Scale is based on powers of 2, so K = 1024, M = 1048576, G = 1073741824, and so forth.
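
A few BinarySize values in context, with the byte counts worked out from the powers-of-2 scale described above. The attribute values are chosen to show the different suffix forms, not as tuning advice.

```xml
<ookii.jumbo.jet>
  <!-- "64KB" and "64kb" are equivalent: 64 x 1024 = 65536 bytes.
       "7.5M" omits the optional B: 7.5 x 1048576 = 7864320 bytes. -->
  <fileChannel readBufferSize="64KB"
               writeBufferSize="64kb"
               memoryStorageSize="7.5M" />
  <!-- IEC suffix: "1MiB" = 1048576 bytes, the same as "1MB" here. -->
  <mergeRecordReader mergeStreamReadBufferSize="1MiB" />
</ookii.jumbo.jet>
```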
CompressionType type
The type of compression to use.
Value | Description |
---|---|
None | The data will not be compressed. |
GZip | The data will be compressed using the gzip compression algorithm. |