iSGTW - International Science Grid This Week

iSGTW - 4 March 2009

Feature - Turbocharge your job submissions!


Plot of Falkon usage from December 2007 to October 2008 across various systems (ANL/UC TG 316-processor cluster, SiCortex 5832-processor machine, IBM Blue Gene/P 4K- and 160K-processor machines). Over the past year, Falkon has seen wide deployment and usage across a variety of systems, including the TeraGrid, the SiCortex at Argonne National Laboratory, the IBM Blue Gene/P supercomputer at ALCF ANL, and the Sun Constellation supercomputer on the TeraGrid.

Each blue dot in the figure represents a 60-second average of allocated processors, and the black line denotes the number of completed tasks.

In summary, usage peaked at 163K processors, with 1.4 million CPU hours consumed across 164 million tasks, for an average task execution time of 31 seconds.
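
The 31-second average can be checked from the other two figures in the summary. The short sketch below assumes that all of the consumed CPU hours went to task execution, which the caption does not state explicitly.

```python
# Figures quoted in the summary above.
cpu_hours = 1.4e6   # CPU hours consumed
tasks = 164e6       # tasks completed

# Assumption: every consumed CPU hour was spent executing tasks.
avg_seconds = cpu_hours * 3600 / tasks
print(f"average task time: {avg_seconds:.1f} s")
```

The result rounds to the 31 seconds quoted in the summary.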

Image courtesy of Ioan Raicu.

(Editor's note: Ioan Raicu and Ian Foster, both of the University of Chicago and Argonne National Laboratory, contributed this article.)

Applications that run thousands of jobs can cause headaches. Huge numbers of job submissions to a site often cause bottlenecks, make system administrators grumpy and, worse, bring down remote gateway nodes, rendering the resources useless and losing jobs in the process. Traditional techniques commonly used in the scientific community do not scale to today’s largest grids and supercomputers, let alone tomorrow’s. But a new class of applications called Many Task Computing, discussed in the recent article “Many Task Computing: Bridging the performance-throughput gap,” has spawned development of a new framework, called Falkon, that enables applications to scale up quite painlessly and use these large systems efficiently.

Minutes to milliseconds

Falkon (Fast And Light-weight tasK executiON) is designed to help restructure applications so that job wait times and job submission overheads drop from minutes to milliseconds and network bandwidth use is reduced. It leaves many of the higher-overhead features, such as accounting and persistence, for the local resource managers or the applications to handle. Falkon focuses on efficient handling of many independent tasks on large-scale distributed systems with many processors.
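
To see why per-task overhead matters so much for short tasks, consider the fraction of time a processor spends on useful work. The sketch below is illustrative only: the 31-second task length is the average quoted in the figure caption, while the two overhead values (one minute and one millisecond) are assumed round numbers standing in for "minutes" and "milliseconds", not measurements of any particular scheduler.

```python
def efficiency(task_seconds: float, overhead_seconds: float) -> float:
    """Fraction of wall-clock time spent on the task itself."""
    return task_seconds / (task_seconds + overhead_seconds)

task = 31.0  # average task time quoted in the figure caption

# Assumed overheads, chosen only to illustrate the two regimes.
for label, overhead in [("1 minute", 60.0), ("1 millisecond", 0.001)]:
    print(f"{label:>13} overhead: {efficiency(task, overhead):.1%} useful work")
```

With a one-minute overhead, roughly a third of the time goes to useful work. With a one-millisecond overhead, nearly all of it does.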

Falkon has demonstrated vast improvements in performance and scalability for a wide variety of tasks — tasks with execution times ranging from milliseconds to hours, compute- and data-intensive tasks, and tasks with varying arrival rates. The improvements extend across diverse applications from astronomy to medicine, economic modeling and beyond, and to scales of billions of tasks on hundreds of thousands of processors.

One researcher who adopted Falkon is Andrew Binkowski at the Midwest Center for Structural Genomics at Argonne National Laboratory. Binkowski and his team model three-dimensional protein structures in their basic research towards drug design. Since proteins with similar structures tend to behave in similar ways, the team compares the modeled structures to existing, known proteins in order to predict their functions -- a computationally intensive task.

 “As the Protein Data Bank (a repository of known proteins) expands almost exponentially, it becomes more difficult to coax desktop machines to do the types of analysis required,” says Binkowski. “We turned to Falkon as a way to utilize our existing software applications.”

Falkon’s distributed architecture. Task dispatchers can be distributed over many nodes, partitioning the compute resources into smaller pools to improve overall system throughput and scalability. For example, on the IBM Blue Gene/P supercomputer at the full 160K-processor scale, the typical configuration is 1 client and 640 dispatchers, each managing 256 executors, for a total of 160K executors.

Image courtesy of Ioan Raicu.  
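
The arithmetic behind that configuration can be sketched as follows. The dispatcher and executor counts come from the caption. The contiguous-block mapping in `dispatcher_for` is a hypothetical illustration of partitioning into pools, not a description of how Falkon actually assigns executors.

```python
DISPATCHERS = 640                 # from the caption
EXECUTORS_PER_DISPATCHER = 256    # from the caption
TOTAL_EXECUTORS = DISPATCHERS * EXECUTORS_PER_DISPATCHER

def dispatcher_for(executor_id: int) -> int:
    """Map a global executor index to the dispatcher managing its pool.

    Assumption: pools are contiguous blocks of executor indices.
    """
    if not 0 <= executor_id < TOTAL_EXECUTORS:
        raise ValueError(f"executor_id out of range: {executor_id}")
    return executor_id // EXECUTORS_PER_DISPATCHER

print(TOTAL_EXECUTORS)           # total executors across all pools
print(TOTAL_EXECUTORS // 1024)   # the same total in units of 1024
```

The total is 163,840 executors, or 160 × 1024, which is the "160K" in the caption. Each dispatcher only ever tracks 256 executors, however large the machine.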

What makes Falkon fly faster

The Falkon framework uses three novel techniques to enable rapid and efficient job execution and to improve application performance and scalability. First, multi-level scheduling, in which resource allocation for a job is separated from job dispatch, enables on-the-fly resource allocation and minimizes queue wait times. Second, Falkon’s distributed, streamlined task dispatcher achieves ten to a thousand times the dispatch rates of conventional centralized schedulers. Third, Falkon’s data-aware scheduler coordinates tasks and data so that data transfer from shared or parallel file systems and across the network is minimized.
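
The idea behind data-aware scheduling can be illustrated with a minimal sketch. It is not Falkon's actual algorithm or API: the `Executor` and `Task` classes, the file names, and the "most inputs already cached, then shortest queue" rule are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field

@dataclass
class Executor:
    name: str
    cached: set = field(default_factory=set)  # files already on local storage
    queued: int = 0                           # tasks waiting on this executor

@dataclass
class Task:
    name: str
    inputs: frozenset                         # files the task needs to read

def pick_executor(task, executors):
    """Prefer the executor caching the most inputs; break ties by shortest queue."""
    return max(executors, key=lambda e: (len(task.inputs & e.cached), -e.queued))

def dispatch(task, executors):
    """Assign the task and return (executor, files that must still be transferred)."""
    chosen = pick_executor(task, executors)
    missing = task.inputs - chosen.cached
    chosen.cached |= task.inputs              # inputs are cached locally after the run
    chosen.queued += 1
    return chosen, missing

# Hypothetical example: node-a already holds one of the two input files.
executors = [Executor("node-a", {"pdb1.dat"}), Executor("node-b")]
task = Task("compare", frozenset({"pdb1.dat", "query.dat"}))
chosen, missing = dispatch(task, executors)
print(chosen.name, sorted(missing))
```

In this example the task goes to `node-a`, and only `query.dat` has to be fetched from the shared file system. A later task that needs the same files would find both already cached there.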

We can ask bigger questions

“Falkon has allowed us to ask bigger questions and perform experiments on a scale never before attempted, or even thought possible,” says Binkowski. “This is the difference between comparing a newly determined protein structure to a family of related proteins versus comparing it to the entire protein universe.”

The team has done all of this with existing software packages that were not designed for high-throughput or many-task computing. Falkon coordinates and drives the execution of many loosely coupled computations, which are treated as “black boxes” and need no application-specific code modifications.

“Whereas identifying similarities in protein binding pockets (for protein structure analysis) is characterized by millions of discrete jobs taking seconds to complete, docking and scoring a small-molecular compound (for drug discovery) can require several hours to converge on a solution. In both cases, we are able to tailor our workflows to achieve the best possible scientific results and still get the throughput and efficiency we need to take advantage of the large computing resources we have available.”

Ioan Raicu and Ian Foster
