PFPGrowth Class
Namespace: Ookii.Jumbo.Jet.Samples.FPGrowth
The PFPGrowth type exposes the following members.
Name | Description
---|---
AccumulatorTaskCount | Gets or sets the number of feature count accumulator tasks.
AggregateTaskCount | Gets or sets the aggregate task count.
BinaryOutput | Gets or sets a value indicating whether the output format is binary.
BlockSize | Gets or sets the block size of the job's output files. (Inherited from BaseJobRunner.)
ConfigOnly | Gets or sets a value indicating whether the job runner will only create and print the job configuration, instead of running the job. (Inherited from JobBuilderJob.)
DfsConfiguration | Gets or sets the configuration used to access the Distributed File System. (Inherited from Configurable.)
FileSystemClient | Gets the DFS client. (Inherited from BaseJobRunner.)
FPGrowthTaskCount | Gets or sets the FP growth task count.
Groups | Gets or sets the number of groups.
InputPath | Gets or sets the input path.
IsInteractive | Gets or sets a value that indicates whether the job runner should wait for user input before starting the job and before exiting. (Inherited from BaseJobRunner.)
JetClient | Gets the jet client. (Inherited from BaseJobRunner.)
JetConfiguration | Gets or sets the configuration used to access the Jet servers. (Inherited from Configurable.)
JobOrStageProperties | Gets or sets the property values that will override predefined values in the job configuration. (Inherited from BaseJobRunner.)
JobOrStageSettings | Gets or sets additional job or stage settings that will be defined in the job configuration. (Inherited from BaseJobRunner.)
MinSupport | Gets or sets the minimum support.
OutputPath | Gets or sets the output path.
OverwriteOutput | Gets or sets a value that indicates whether the output directory should be deleted, if it exists, before the job is executed. (Inherited from BaseJobRunner.)
PartitionsPerTask | Gets or sets a value indicating the number of partitions per task for the MineTransactions stage.
PatternCount | Gets or sets the pattern count.
ReplicationFactor | Gets or sets the replication factor of the job's output files. (Inherited from BaseJobRunner.)
TaskContext | Gets or sets the configuration for the task attempt. (Inherited from Configurable.)

Name | Description
---|---
AccumulateFeatureCounts | Accumulates the feature counts.
AggregatePatterns | Aggregates the patterns.
ApplyJobPropertiesAndSettings | Adds the values of properties marked with the JobSettingAttribute to the JobSettings dictionary, applies properties set by the JobOrStageProperties property, and adds settings defined by the JobOrStageSettings property. (Inherited from BaseJobRunner.)
BuildJob | Constructs the job configuration using the specified job builder. (Overrides JobBuilderJob.BuildJob(JobBuilder).)
CheckAndCreateOutputPath | If OverwriteOutput is true, deletes the output path and then re-creates it; otherwise, creates the output path if it doesn't exist and fails if it does. (Inherited from BaseJobRunner.)
CountFeatures | Counts the features.
Equals | Determines whether the specified object is equal to the current object. (Inherited from Object.)
Finalize | Allows an object to try to free resources and perform other cleanup operations before it is reclaimed by garbage collection. (Inherited from Object.)
FinishJob | Called after the job finishes. (Inherited from BaseJobRunner.)
GenerateGroupTransactions | Generates the group transactions.
GetHashCode | Serves as the default hash function. (Inherited from Object.)
GetInputFileSystemEntry | Gets a JumboFileSystemEntry instance for the specified path, or throws an exception if the input doesn't exist. (Inherited from BaseJobRunner.)
GetType | Gets the Type of the current instance. (Inherited from Object.)
MemberwiseClone | Creates a shallow copy of the current Object. (Inherited from Object.)
MineTransactions | Mines the transactions.
NotifyConfigurationChanged | Indicates the configuration has been changed. ApplyConfiguration(Object, DfsConfiguration, JetConfiguration, TaskContext) calls this method after setting the configuration. (Inherited from BaseJobRunner.)
OnJobCreated | Called when the job has been created on the job server, but before running it. (Inherited from JobBuilderJob.)
PromptIfInteractive | Prompts the user to start or exit, if IsInteractive is true. (Inherited from BaseJobRunner.)
RunJob | Starts the job. (Inherited from JobBuilderJob.)
ToString | Returns a string that represents the current object. (Inherited from Object.)
WriteOutput | Writes the result of the operation to the DFS using this instance's settings for BlockSize and ReplicationFactor. (Inherited from JobBuilderJob.)

This job is an implementation of the Parallel FP Growth algorithm described in the paper "PFP: Parallel FP-Growth for Query Recommendation" by Li et al., 2008.
This algorithm calculates the top-K frequent patterns for each item in the database, considering only patterns that have at least the specified minimum support.
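To make "top-K frequent patterns for each item" concrete, here is a minimal Python sketch of that selection. It is an illustration only, not the Jumbo implementation: the function name `top_k_per_item`, the `(items, support)` pattern representation, and the tie-breaking rule are all assumptions made for this example.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pattern = Tuple[Tuple[str, ...], int]  # (items in the pattern, support count)


def top_k_per_item(patterns: Iterable[Pattern], k: int,
                   min_support: int) -> Dict[str, List[Tuple[int, Tuple[str, ...]]]]:
    """For every item, keep the k highest-support patterns that contain it.

    Patterns below min_support are ignored. Ties in support are broken by
    comparing the sorted item tuples (an arbitrary choice for this sketch).
    """
    candidates = defaultdict(list)
    for items, support in patterns:
        if support < min_support:
            continue  # pattern is not frequent enough
        key = tuple(sorted(items))
        for item in key:
            candidates[item].append((support, key))
    return {item: heapq.nlargest(k, found) for item, found in candidates.items()}


# Hypothetical mined patterns with their support counts.
patterns = [
    (("a",), 3), (("b",), 3), (("a", "b"), 2),
    (("a", "c"), 2), (("c",), 2), (("d",), 1),
]
result = top_k_per_item(patterns, k=2, min_support=2)
```

With `k=2` and `min_support=2`, item `d` has no entry at all, because its only pattern falls below the minimum support.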
The algorithm has three steps: first, it counts how often each item occurs in the input database, filters out the infrequent features, and divides the resulting feature list into groups. Next, it generates group-dependent transactions from the input and runs the FP-Growth algorithm on each group. Finally, the results from each group are aggregated to form the final result.
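The first two steps can be sketched in a few lines of Python. This follows the general scheme described in the PFP paper rather than Jumbo's actual code; the function names, the contiguous group assignment, and the sample data are assumptions made for illustration. The FP-Growth mining itself is omitted.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def frequent_items(transactions: Sequence[Sequence[str]],
                   min_support: int) -> List[Tuple[str, int]]:
    """Step 1: count items, drop infrequent ones, order by descending count."""
    counts: Counter = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count an item once per transaction
    kept = [(item, n) for item, n in counts.items() if n >= min_support]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by item name
    return kept


def assign_groups(frequent: List[Tuple[str, int]], num_groups: int) -> Dict[str, int]:
    """Split the ordered item list into contiguous groups (assumed scheme)."""
    per_group = max(1, -(-len(frequent) // num_groups))  # ceiling division
    return {item: rank // per_group for rank, (item, _) in enumerate(frequent)}


def group_transactions(transaction: Sequence[str], rank: Dict[str, int],
                       group_of: Dict[str, int]) -> Dict[int, List[str]]:
    """Step 2: emit one group-dependent transaction per group that occurs.

    Items are ordered by frequency rank; each group receives the prefix that
    ends at the last item belonging to that group.
    """
    ordered = sorted({i for i in transaction if i in rank}, key=rank.__getitem__)
    emitted: Dict[int, List[str]] = {}
    for j in range(len(ordered) - 1, -1, -1):  # scan from least frequent item
        group = group_of[ordered[j]]
        if group not in emitted:
            emitted[group] = ordered[: j + 1]
    return emitted


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c", "d"], ["b", "e"]]
frequent = frequent_items(transactions, min_support=2)
rank = {item: r for r, (item, _) in enumerate(frequent)}
group_of = assign_groups(frequent, num_groups=2)
```

Each group then mines only its own transactions, which is what lets the mining step run as independent tasks.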
The number of groups should be carefully selected so that the number of items per group is not too large. Ideally, each group should have 5-10 items at most for a large database.
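As a rough aid for picking a value for Groups, the guideline above reduces to a ceiling division. The helper below is a hypothetical sketch, not part of Jumbo; it assumes you already know (or can estimate) how many items survive the minimum-support filter.

```python
import math


def suggest_group_count(frequent_item_count: int, max_items_per_group: int = 10) -> int:
    """Smallest group count that keeps every group at or below the given size."""
    if frequent_item_count <= 0:
        return 1  # nothing to split; one group is enough
    return math.ceil(frequent_item_count / max_items_per_group)


# 25,000 frequent items at no more than 10 items per group.
groups_for_ten = suggest_group_count(25000)
# The same database at no more than 5 items per group.
groups_for_five = suggest_group_count(25000, max_items_per_group=5)
```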
The input for this job should be a plain text file (or files) where each line represents a transaction containing a space-delimited list of items.
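A minimal Python sketch of reading that format is shown below. The function name and the handling of blank lines and repeated spaces are assumptions for illustration; the job's own record reader may behave differently.

```python
from typing import Iterable, List


def parse_transactions(lines: Iterable[str]) -> List[List[str]]:
    """Parse one transaction per line, with items separated by spaces."""
    transactions = []
    for line in lines:
        # Drop empty tokens caused by leading, trailing, or repeated spaces.
        items = [token for token in line.strip().split(" ") if token]
        if items:  # skip blank lines (assumed behavior)
            transactions.append(items)
    return transactions


sample = ["bread milk", "bread diapers beer eggs", "", "milk  diapers beer cola"]
parsed = parse_transactions(sample)
```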
This example demonstrates a more complicated Jumbo job, with several stages including more than one stage with file input. It uses scheduling dependencies, group aggregation, partition-based grouping using multiple partitions per task, dynamic partition assignment, and custom progress providers.