
The job client is a Java program that carries out the whole process of interacting with Hadoop.
1. Client
Job Submission:
- The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job’s progress once per second and, if the progress has changed since the last poll, reports it to the console. When the job completes successfully, the job counters are displayed; otherwise, the error that caused the job to fail is logged to the console (step 1). The job submission process implemented by JobSubmitter does the following (a driver sketch follows this list):
- As soon as job.waitForCompletion(true) executes, it triggers the job client to start the job submission process: it connects to the JobTracker, using the address from mapred-site.xml, and asks for a new job ID (step 2).
- After getting a new job ID, the job client submits the job to the JobTracker queue, where some verification is done, such as whether the output directory already exists and whether the input files exist. If the output is already there, or an input file doesn’t exist, the respective error message is shown.
- Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID (step 3).
- Tells the JobTracker that the job is ready for execution (by calling submitJob() on the JobTracker) (step 4).
- The job JAR is copied with a high replication factor, controlled by the mapreduce.client.submit.file.replication property, so that there are plenty of copies across the cluster for the tasktrackers to access when they run tasks for the job.
- (In MapReduce 2 on YARN, the equivalent step is submitting the application to the resource manager by calling submitApplication().)
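To make the client-side flow concrete, here is a minimal driver sketch using the standard org.apache.hadoop.mapreduce API; the class name and identity mapper/reducer are illustrative, but Job.getInstance(), the setter methods, and waitForCompletion() are the real calls described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pass-through");
        job.setJarByClass(PassThroughDriver.class);     // this JAR is what gets copied to the cluster (step 3)
        job.setMapperClass(Mapper.class);               // identity mapper; a real job sets its own classes
        job.setReducerClass(Reducer.class);             // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input must exist, or submission fails
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output must NOT already exist
        // waitForCompletion(true) submits the job (steps 1-4) and then polls
        // its progress once per second, printing changes to the console.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```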
2. JobTracker
Job Initialization:
- When the JobTracker receives a call to its submitJob() method, it puts the job into an internal queue from which the job scheduler will pick it up and initialize it.
- Initialization involves creating an object to represent the job being run (step 5).
- To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client from the shared filesystem (step 6). It then creates one map task for each split.
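As an illustration of the split computation described in step 6, the sketch below asks an input format for the splits of a job’s input; TextInputFormat.getSplits() is the standard API, while the input path is hypothetical.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical input path
        // The client computes splits like these before submission; the job
        // scheduler later creates one map task per split.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split + " -> one map task");
        }
    }
}
```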
3. TaskTracker
Task Assignment:
- Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive. As part of the heartbeat, a tasktracker indicates whether it is ready to run a new task; if it is, the jobtracker allocates it a task, which it communicates to the tasktracker using the heartbeat return value (step 7).
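The heartbeat protocol is internal to Hadoop, but a simplified sketch of the pattern looks like the following; JobTrackerStub, heartbeat(), and hasFreeTaskSlot() are hypothetical stand-ins, not the real InterTrackerProtocol interface.

```java
// Hypothetical sketch of the tasktracker heartbeat loop; the real
// TaskTracker and InterTrackerProtocol classes are far more involved.
public class HeartbeatLoopSketch {
    interface JobTrackerStub {                       // stand-in for the jobtracker RPC proxy
        Runnable heartbeat(boolean readyForNewTask); // may return a task to run
    }

    static void run(JobTrackerStub jobTracker) throws InterruptedException {
        while (true) {
            // The heartbeat says "I am alive" and whether a new task can be accepted.
            Runnable allocatedTask = jobTracker.heartbeat(hasFreeTaskSlot());
            if (allocatedTask != null) {
                allocatedTask.run();                 // task delivered via the heartbeat return value (step 7)
            }
            Thread.sleep(3000);                      // heartbeats go out every few seconds
        }
    }

    static boolean hasFreeTaskSlot() { return true; } // placeholder slot check
}
```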
Task Execution:
- Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s filesystem. It also copies any files needed by the application from the distributed cache to the local disk (step 8). (A distributed-cache sketch follows this list.)
- TaskRunner then launches a new Java Virtual Machine (step 9) and runs each task in it (step 10).
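Since localization covers the distributed cache mentioned above, here is a brief sketch of placing a file in the cache and reading it in a task; job.addCacheFile() and context.getCacheFiles() are the standard API, while the file path is illustrative, and reading by base name assumes the default symlinking of cached files into the task’s working directory.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
    public static class CacheMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // Files added to the distributed cache have already been copied
            // to the task's local disk when the task starts (step 8).
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                String localName = new Path(cacheFiles[0].getPath()).getName();
                try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
                    System.out.println("first cached line: " + reader.readLine());
                }
            }
        }
    }

    static void configure(Job job) throws Exception {
        // Hypothetical lookup file; it is shipped to each tasktracker's local disk.
        job.addCacheFile(new URI("/data/lookup.txt"));
    }
}
```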
Progress and Status Updates:
- MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run.
- Because this is a significant length of time, it’s important for the user to get feedback on how the job is progressing. A job and each of its tasks have a status.
- When a task is running, it keeps track of its progress, that is, the proportion of the task completed.
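Inside a task, the standard Context methods surface this progress; context.setStatus(), context.progress(), and counters are the real API, while the counter group and names below are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProgressAwareMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Custom counters roll up into the job's status and are printed
        // to the console when the job completes.
        context.getCounter("App", "RecordsSeen").increment(1); // illustrative counter names
        if (key.get() % 10000 == 0) {
            context.setStatus("processing offset " + key.get()); // human-readable task status
        }
        context.progress(); // tells the framework the task is still alive and progressing
        context.write(key, value);
    }
}
```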
Job Completion:
- When the jobtracker receives a notification that the last task for a job is complete (this will be the special job cleanup task), it changes the status of the job to “successful”.
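From the client side, completion can be observed by polling the Job object; isComplete(), isSuccessful(), and getCounters() are the standard API, and the one-second interval mirrors the progress polling described earlier.

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CompletionWatcher {
    // Blocks until the given (already submitted) job finishes, then reports
    // whether the jobtracker marked it "successful".
    static void awaitCompletion(Job job) throws Exception {
        while (!job.isComplete()) {  // asks the cluster for the job's current status
            Thread.sleep(1000);      // poll once per second, as waitForCompletion(true) does
        }
        if (job.isSuccessful()) {
            Counters counters = job.getCounters();
            System.out.println("Job succeeded; counters:\n" + counters);
        } else {
            System.err.println("Job failed");
        }
    }
}
```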