Tuesday, February 1, 2011

MURPA-Lin Wei-Week 4



Progress of the project

During the Skype chat with Colin, I was advised that Nimrod/K has both parameter sweep and meta-scheduling capabilities. Therefore, instead of deploying a separate scheduler or Nimrod/G in the Opal server to make scheduling decisions, we can use the existing scheduler inside Nimrod/K. There are two reasons for this choice. First, researchers at the MeSsAGE Lab are focusing on developing Nimrod/K, so building on Nimrod/K lets us leverage its new features and functions. Second, end users have the freedom to configure the Nimrod/K actors in Kepler to fulfil their specific resource requirements. For instance, users can specify the number of CPUs and the amount of memory to be assigned to execute a job.

The initial design of how the Nimrod/K scheduler will be used with the Opal web services is detailed as follows:

Overview of the OpalResource Class

In order to utilize Nimrod/K's meta-scheduling capability, a new class called OpalResource needs to be introduced to the Opal actor. The OpalResource class handles three major tasks:

1. It synchronizes the metadata about the Opal resource (number of total CPUs, number of free CPUs, free memory, etc.) with the Opal server. An Opal resource can be a cluster, grid, or cloud resource that runs a scientific application. At the initial stage of the project, the scheduler makes allocation decisions based only on the number of CPUs required to execute a job.
2. It provides the Opal resource metadata to the scheduler.
3. It monitors and maintains the job submission and resource allocation requests to the Opal server.

Both the scheduler and the Opal actor have a reference to the OpalResource class so that they can interact with the Opal resources, as sketched below.
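To make the three responsibilities concrete, here is a minimal Java sketch of what the OpalResource class could look like. It is only an illustration of the design under my own assumptions: apart from getFreeCPUs(), which is referred to in the allocation section further down, the field and method names (refreshMetadata, reserveCpus, etc.) are placeholders and do not come from the actual Opal or Nimrod/K code base.

// Hypothetical sketch of the OpalResource class; only getFreeCPUs() is taken
// from the design notes above, everything else is an assumed placeholder.
public class OpalResource {

    private final String resourceName;   // e.g. an Autodock cluster, grid, or cloud resource
    private int totalCpus;
    private int freeCpus;
    private long freeMemoryMb;

    public OpalResource(String resourceName) {
        this.resourceName = resourceName;
    }

    /** Task 1: synchronize the cached metadata with the Opal server. */
    public synchronized void refreshMetadata(int totalCpus, int freeCpus, long freeMemoryMb) {
        this.totalCpus = totalCpus;
        this.freeCpus = freeCpus;
        this.freeMemoryMb = freeMemoryMb;
    }

    /** Task 2: expose the metadata to the Nimrod/K scheduler. */
    public synchronized int getFreeCPUs() {
        return freeCpus;
    }

    /** Task 3: track an allocation request against this resource until the Opal server confirms it. */
    public synchronized boolean reserveCpus(int requestedCpus) {
        if (requestedCpus > freeCpus) {
            return false;              // not enough free CPUs on this resource
        }
        freeCpus -= requestedCpus;     // mirror the allocation locally
        return true;
    }

    public String getResourceName() {
        return resourceName;
    }

    public synchronized int getTotalCPUs() {
        return totalCpus;
    }

    public synchronized long getFreeMemoryMb() {
        return freeMemoryMb;
    }
}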

Creating OpalResource Instances

This diagram illustrates how the OpalResource class creates instances of Opal resources. When end users click the run button in Kepler to execute the Autodock workflow, the OpalResource class is triggered. The OpalResource class requests the metadata of the resources that can run the Autodock application. Since the Opal server has records of all the Autodock resources, it can simply fulfil the request by transferring the Autodock resource metadata to the OpalResource class. Then, for each Autodock resource, the OpalResource class creates an OpalResource instance to keep track of that resource's metadata. It is important to note that the entire process is synchronized to ensure that the resource metadata in the Opal server is identical to the metadata held in the OpalResource instances. For instance, if 10 CPUs become available in an Autodock resource, this is also reflected in its corresponding OpalResource instance.
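A rough sketch of this instance-creation step is shown below, assuming a hypothetical OpalClient interface that can list the metadata of every resource registered for an application. OpalClient, ResourceMetadata, and their members are placeholders for whatever the Opal server actually exposes, not the real Opal API.

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing how one OpalResource instance could be created
// per Autodock resource known to the Opal server.
public class OpalResourceFactory {

    public static List<OpalResource> createInstances(OpalClient client, String application) {
        List<OpalResource> instances = new ArrayList<OpalResource>();
        // Ask the Opal server for the metadata of every resource that can run the application.
        for (ResourceMetadata meta : client.listResources(application)) {
            OpalResource resource = new OpalResource(meta.name);
            // Keep the local copy of the metadata in step with the server's records.
            resource.refreshMetadata(meta.totalCpus, meta.freeCpus, meta.freeMemoryMb);
            instances.add(resource);
        }
        return instances;
    }

    /** Placeholder for whatever interface the Opal server exposes for metadata queries. */
    interface OpalClient {
        List<ResourceMetadata> listResources(String application);
    }

    /** Placeholder metadata record for a single Autodock resource. */
    static class ResourceMetadata {
        final String name;
        final int totalCpus;
        final int freeCpus;
        final long freeMemoryMb;

        ResourceMetadata(String name, int totalCpus, int freeCpus, long freeMemoryMb) {
            this.name = name;
            this.totalCpus = totalCpus;
            this.freeCpus = freeCpus;
            this.freeMemoryMb = freeMemoryMb;
        }
    }
}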


Creating Copies of Opal Actor

Nimrod/K performs a parameter sweep by generating multiple copies of the Opal actor. Each copy of the Opal actor is supplied with one set of parameters. It is up to the end user to define the number of parameter sets and the values of the parameters in each set. The parameter sets are prepared in GridTokenFiles, and each GridTokenFile is tagged with a different color to distinguish it from the others. Although the copies of the Opal actor are not shown in Kepler, they are created and run in the background.
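The snippet below is only a conceptual illustration of that idea, not Nimrod/K's actual tagging mechanism: each parameter set carries its own tag (standing in for the color on a GridTokenFile), and each copy of the Opal actor would receive exactly one tagged set. The class names and the example Autodock file names are invented for the sketch.

import java.util.Arrays;
import java.util.List;

// Conceptual illustration only: one tagged parameter set per Opal actor copy.
public class ParameterSweepSketch {

    static class ParameterSet {
        final int tag;               // stands in for the color on a GridTokenFile
        final List<String> values;   // the parameter values handed to one Opal actor copy

        ParameterSet(int tag, List<String> values) {
            this.tag = tag;
            this.values = values;
        }
    }

    public static void main(String[] args) {
        // Two hypothetical parameter sets for an Autodock sweep.
        List<ParameterSet> sweep = Arrays.asList(
                new ParameterSet(0, Arrays.asList("ligand1.pdbqt", "receptor.pdbqt")),
                new ParameterSet(1, Arrays.asList("ligand2.pdbqt", "receptor.pdbqt")));

        // In Nimrod/K each set would drive an independent, hidden copy of the
        // Opal actor; here we only print what each copy would receive.
        for (ParameterSet set : sweep) {
            System.out.println("Opal actor copy (tag " + set.tag + ") receives " + set.values);
        }
    }
}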


Resource Allocation

Each copy of the Opal actor for the Autodock experiment may have different resource requirements. The diagram takes one copy of the Autodock Opal actor to demonstrate how resources are allocated for job execution. First, the copy of the Opal actor provides its resource requirement to the OpalResource class. Then, the OpalResource class passes the resource requirement, along with references to the Autodock OpalResource instances, to the scheduler. The scheduler calls getFreeCPUs() to access the Autodock OpalResource metadata and invokes the getSlots(int) method to make the allocation decision. After the allocation decision has been made, it is passed back to the OpalResource class. Finally, the OpalResource class submits the Autodock job request, along with the resource allocation decision, to the Opal server. The Opal server executes the job on the Autodock resources according to the allocation decision and notifies the OpalResource class when the allocation request has been fulfilled. Again, the communication in the entire resource allocation process is synchronized; it is broken down into steps here only for the convenience of describing the process.
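A hedged sketch of the scheduler's part of this flow, reusing the OpalResource sketch from above, is shown here. Only getFreeCPUs() and getSlots(int) are named in the design; since the owner and semantics of getSlots(int) are not spelled out in these notes, the sketch simply makes a first-fit decision over the free-CPU counts, and the SchedulerSketch class and allocate method are assumed names.

import java.util.List;

// Sketch of the allocation decision described above; first-fit over free CPUs.
public class SchedulerSketch {

    /**
     * Pick the first Autodock resource with enough free CPUs for the job.
     * Returns null if no resource can currently satisfy the request.
     */
    public OpalResource allocate(List<OpalResource> autodockResources, int requiredCpus) {
        for (OpalResource resource : autodockResources) {
            if (resource.getFreeCPUs() >= requiredCpus && resource.reserveCpus(requiredCpus)) {
                return resource;   // this decision is handed back to the OpalResource class
            }
        }
        return null;
    }
}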



Some Thoughts About the Opal.xml Workflow

What is the Nimrod/K director's dynamic parallelism?

Why do users have to type the input as an array in the Constant actor? Is there a more convenient way for users to provide input?

How are sequences converted to tags? Why do files have to be pre-staged manually? Can we design a mechanism to handle the files automatically?

What does the Expression actor do?

Can we get rid of the Array to Sequence actor and the Expression actor so that Nimrod/K and the Opal actor can interact directly?


