
The job client is a Java program that carries out the whole process of interacting with Hadoop.
1. Client
Job Submission:
- The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job’s progress once per second and, if the progress has changed since the last poll, reports it to the console. When the job completes successfully, the job counters are displayed; otherwise, the error that caused the job to fail is logged to the console (step 1). The job submission process implemented by JobSubmitter does the following (a driver sketch follows this list):
- As soon as job.waitForCompletion(true) executes, it triggers the job client to start the job submission process: it connects to the JobTracker, using the address from mapred-site.xml, and asks for a new job ID (step 2).
- After getting a new job ID, the job client submits the job to the JobTracker queue, where some verification is done, such as whether the output directory already exists and whether the input files exist. If the output is already there, or an input file doesn’t exist, the respective error message is shown.
- Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID (step 3).
- Tells the JobTracker that the job is ready for execution (by calling submitJob() on the JobTracker) (step 4).
- The job JAR is copied with a high replication factor, controlled by the mapreduce.client.submit.file.replication property, so that there are plenty of copies across the cluster for the tasktrackers to access when they run tasks for the job.
- (In MapReduce 2 on YARN, the equivalent step is submitting the application to the resource manager by calling submitApplication().)
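To make the client-side flow concrete, here is a minimal driver sketch using the standard org.apache.hadoop.mapreduce API; the class name and identity mapper/reducer are illustrative, but Job.getInstance(), the setter methods, and waitForCompletion() are the real calls described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pass-through");
        job.setJarByClass(PassThroughDriver.class);     // this JAR is what gets copied to the cluster (step 3)
        job.setMapperClass(Mapper.class);               // identity mapper; a real job sets its own classes
        job.setReducerClass(Reducer.class);             // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input must exist, or submission fails
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output must NOT already exist
        // waitForCompletion(true) submits the job (steps 1-4) and then polls
        // its progress once per second, printing changes to the console.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```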
2. JobTracker
Job Initialization:
- When the JobTracker receives a call to its submitJob() method, it puts the job into an internal queue from which the job scheduler will pick it up and initialize it.
- Initialization involves creating an object to represent the job being run (step 5).
- To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client from the shared filesystem (step 6). It then creates one map task for each split.
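As an illustration of the split computation described in step 6, the sketch below asks an input format for the splits of a job’s input; TextInputFormat.getSplits() is the standard API, while the input path is hypothetical.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical input path
        // The client computes splits like these before submission; the job
        // scheduler later creates one map task per split.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split + " -> one map task");
        }
    }
}
```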
3. TaskTracker
Task Assignment:
- Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive. As part of the heartbeat, a tasktracker indicates whether it is ready to run a new task; if it is, the jobtracker allocates it a task, which it communicates to the tasktracker using the heartbeat return value (step 7).
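The heartbeat protocol is internal to Hadoop, but a simplified sketch of the pattern looks like the following; JobTrackerStub, heartbeat(), and hasFreeTaskSlot() are hypothetical stand-ins, not the real InterTrackerProtocol interface.

```java
// Hypothetical sketch of the tasktracker heartbeat loop; the real
// TaskTracker and InterTrackerProtocol classes are far more involved.
public class HeartbeatLoopSketch {
    interface JobTrackerStub {                       // stand-in for the jobtracker RPC proxy
        Runnable heartbeat(boolean readyForNewTask); // may return a task to run
    }

    static void run(JobTrackerStub jobTracker) throws InterruptedException {
        while (true) {
            // The heartbeat says "I am alive" and whether a new task can be accepted.
            Runnable allocatedTask = jobTracker.heartbeat(hasFreeTaskSlot());
            if (allocatedTask != null) {
                allocatedTask.run();                 // task delivered via the heartbeat return value (step 7)
            }
            Thread.sleep(3000);                      // heartbeats go out every few seconds
        }
    }

    static boolean hasFreeTaskSlot() { return true; } // placeholder slot check
}
```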
Task Execution:
- Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s filesystem. It also copies any files needed by the application from the distributed cache to the local disk (step 8). (A distributed-cache sketch follows this list.)
- TaskRunner then launches a new Java Virtual Machine (step 9) and runs each task in it (step 10).
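Since localization covers the distributed cache mentioned above, here is a brief sketch of placing a file in the cache and reading it in a task; job.addCacheFile() and context.getCacheFiles() are the standard API, while the file path is illustrative, and reading by base name assumes the default symlinking of cached files into the task’s working directory.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
    public static class CacheMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // Files added to the distributed cache have already been copied
            // to the task's local disk when the task starts (step 8).
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                String localName = new Path(cacheFiles[0].getPath()).getName();
                try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
                    System.out.println("first cached line: " + reader.readLine());
                }
            }
        }
    }

    static void configure(Job job) throws Exception {
        // Hypothetical lookup file; it is shipped to each tasktracker's local disk.
        job.addCacheFile(new URI("/data/lookup.txt"));
    }
}
```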
Progress and Status Updates:
- MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run.
- Because this is a significant length of time, it’s important for the user to get feedback on how the job is progressing. A job and each of its tasks have a status.
- When a task is running, it keeps track of its progress, that is, the proportion of the task completed.
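Inside a task, the standard Context methods surface this progress; context.setStatus(), context.progress(), and counters are the real API, while the counter group and names below are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProgressAwareMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Custom counters roll up into the job's status and are printed
        // to the console when the job completes.
        context.getCounter("App", "RecordsSeen").increment(1); // illustrative counter names
        if (key.get() % 10000 == 0) {
            context.setStatus("processing offset " + key.get()); // human-readable task status
        }
        context.progress(); // tells the framework the task is still alive and progressing
        context.write(key, value);
    }
}
```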
Job Completion:
- When the jobtracker receives a notification that the last task for a job is complete (this will be the special job cleanup task), it changes the status of the job to “successful”.
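From the client side, completion can be observed by polling the Job object; isComplete(), isSuccessful(), and getCounters() are the standard API, and the one-second interval mirrors the progress polling described earlier.

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CompletionWatcher {
    // Blocks until the given (already submitted) job finishes, then reports
    // whether the jobtracker marked it "successful".
    static void awaitCompletion(Job job) throws Exception {
        while (!job.isComplete()) {  // asks the cluster for the job's current status
            Thread.sleep(1000);      // poll once per second, as waitForCompletion(true) does
        }
        if (job.isSuccessful()) {
            Counters counters = job.getCounters();
            System.out.println("Job succeeded; counters:\n" + counters);
        } else {
            System.err.println("Job failed");
        }
    }
}
```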