iSGTW - International Science Grid This Week

iSGTW, 2 September 2009

Feature - Jamie Shiers on the STEP'09 postmortem

Image courtesy Jamie Shiers 

Recently, CERN conducted the STEP’09 test, a full-scale assessment of all the computing resources that will be used to process experiment data from the Large Hadron Collider. Those computing resources include sites all over the world, which are organized into different levels, or “tiers.”

Jamie Shiers leads the grid support group in the IT department at CERN, and is responsible for the coordination of the worldwide LHC Computing Grid (wLCG) service. He organized the postmortem for STEP’09 at CERN. iSGTW caught up with him to find out how it went.

iSGTW: What is the STEP’09 postmortem?

Shiers: It’s the workshop to review the results from the STEP (Scale Testing for the Experiment Program) test. The big difference between STEP and other tests we’ve done is that it’s supposed to emphasize that we are in full production. It’s a test of readiness, oriented toward what we need this year at CERN.

It’s not the first postmortem workshop we’ve held, but this was specifically about STEP’09 and it was bigger in terms of the number of attendees.

iSGTW: What were the aims?

Shiers: What we were trying to do was demonstrate that, from a computing point of view, we are ready. And if we aren’t, to understand exactly where there were problems so we can come up with a plan to correct them. It was a very thorough postmortem – we looked at things from the experiment point of view, from a service point of view and from a site point of view.

iSGTW: What were the main findings?

Shiers: We’ve come up with a list of things to do, which fortunately is rather short and certainly shorter than it could have been. In some areas it looks like we are more ready than a pessimist might have feared, particularly in the Tier-1 layer where reprocessing of the data occurs. The networking for Tier-1 is well under control.

The analysis of the data is mainly done at the Tier-2s, and the results here were less positive. But this is to be expected because the analysis involves many more people and is much more chaotic. The other things you can organize and schedule much more easily, but in Tier-2s you don’t know how many users there are, which datasets they have, which order they’ll read them. This is something we need to work on.

As we learn more about grid computing, we understand what can sensibly be provided at the generic level and what can only be provided when you are deep inside the computing model.

Graphic designed by Daniele Bonacorsi

iSGTW: What was most surprising about it?

Shiers: Probably the most surprising thing is how well we did! Many things could have gone wrong, and in fact many things did. Sites had both scheduled and unscheduled downtime. For example, motorway construction in Switzerland took out a large fraction of the dedicated optical network; you don’t foresee motorway construction doing this. But despite these things we were able to continue and recover.

Maybe “surprising” isn’t the right word; I would say “pleasing,” because we’ve spent many, many years preparing for the unexpected. I think we’ve shown we can handle it.

iSGTW: What have you done since the postmortem?

Shiers: We’ve been working through our list. There are a small number of sites with some issues which are being retested, with largely positive results. Most of the problems with the Tier-1 sites have been addressed, but we have to do more work at the level of Tier-2s.

The target is to have a follow-up by the end of September. At the EGEE’09 conference we will have a session to work through the main improvements since the STEP’09 postmortem workshop, so hopefully by then the main issues will have been proven to be solved in production.

iSGTW: How confident are you about the LHC restart?

Shiers: In terms of computing readiness, we demonstrated much more than we’ve ever done before, and in a more sustainable fashion. Most sites said that for them this was business as usual. I would say that for everything we know about, we are well prepared, and we are also well prepared for problems we guess might happen. We found that, even for problems that seem insurmountable at first, a workaround to keep us going can be developed within a couple of hours or days.

iSGTW: What excites you most about it all?

Shiers: We’re looking forward to doing something as challenging with real data.

Bonus: Jamie Shiers also talks about the importance of grid computing, people, collaboration, and barbecues for the LHC in an audio clip available on the Gridcast website.

Seth Bell, iSGTW