Project Management in High Performance Computing (HPC) projects doesn’t get enough attention. But it should. HPC is one of the most challenging software development environments. Extremely hard problems, very smart people. It’s hard to produce high quality code on time and on budget.
Success comes down to project management.
We’ve tried numerous methods over the last several years to manage the custom software development process for Lustre, the open source massively parallel file system, in an effort to harness the unique environment and people that develop Lustre. Many have not worked well.
There are multiple challenges to managing software development in HPC. HPC products are essentially R&D products. That means we are constantly pushing capability boundaries for speed, performance, or functionality, always for “the first time.” Each version goes beyond the known into the unknown.
Estimates on doing anything for the first time are poor – and I mean really poor – worse than the average bad estimate you get when you multiply by two behind an engineer’s back before you tell the customer how long it will take. Without strong, clear project management, your estimates will just be guesses.
Methodology and Why It Matters
Large R&D projects that are afforded the time to be managed with a spiral model will likely have more success than those limited to a waterfall method. The spiral model is built on repeated cycles: define objectives, weigh alternatives, execute, analyze the results and the resulting risks, then rinse and repeat. Spiraling in to close on an answer is fundamentally different from being expected to step down, as in a waterfall, in a known way from A to B to C to an answer in a determinate amount of time.
Even though Lustre often needs to be developed like an R&D product, it is sold more like a commercial product with Non-Recurring Engineering (NRE) development projects committed to with fixed timelines. Customers, reasonably, won’t accept an estimate that goes something like, “Since this has never been done before, we aren’t sure exactly how long it will take and so can’t tell you how much it will cost you in the end.” As a result the team is required to make risky fixed-fee estimates and deliver on large R&D projects with incomplete information.
Contributing to making this more difficult is that Whamcloud has globally distributed engineering teams, and I mean globally. US, Canada, England, Scotland, France, Russia, China… I hope I didn’t forget anyone. This makes for an unbelievably challenging project management environment. There is no daily ‘walk-around’ to check on how tasks are going or an easy way to drop by and ask a quick question. I have to find a better way that is not disruptive and doesn’t require me to set an alarm for 2AM to Skype around the world with project team members to get project status.
I won’t go into how our team has developed a successful process for doing up-front estimations. That is a heady mix of alchemy, talented engineers with a long history of Lustre, talented project managers, and a bit of special sauce. What I do want to share is how we stay on track once we start a project.
Agile
First off is getting the right tool for the job. We use the GreenHopper plugin for Jira. This is perfect for Whamcloud for several reasons:
- Browser-based so it works for a geographically distributed team
- Browser-based so I don’t care if you like OS X, Linux, or Windows
- Engages engineers in the process of updating so I am not chasing remote engineers for status, which isn’t easy in any environment
- Great administrative controls that allow us to manage customer access as well
The beauty of using Agile for large development projects is that we are splitting our major work into small definable chunks that allow us to see if we veer off the path before it is too late. It’s so important to me I’ll say it again. We can see if we veer off before it is too late. The visibility into the project that GreenHopper and Jira give us allows us to make course corrections after each sprint.
Most of our teams use three- or four-week sprints and have to show something demonstrable at the end – be it a design doc, some testable code, test cases, etc. It has to be something that moves the bar forward. During the sprint we mark work as in progress and close cards as the work is completed. This tackles the previously daunting task of tracking down engineers to check on status.
Each work unit is clearly defined – a key part of the process – as a unit of work that lasts from one day to no more than 5-7 days, so there is no opportunity for long stretches of ‘in progress’ work where the project manager is at a loss for the true status.
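As a rough illustration of that sizing rule, here is a minimal Python sketch. It is not how GreenHopper or Jira work internally and it does not use their APIs; the `Card` class, the `EX-` keys, and the helper names are hypothetical, made up for this example. It flags cards estimated at more than seven days so they can be split before the sprint starts, and does a crude capacity check against the available engineer-days.

```python
from dataclasses import dataclass

# Upper bound taken from the text ("no longer than 5-7 days").
MAX_CARD_DAYS = 7


@dataclass
class Card:
    key: str              # hypothetical issue key, e.g. "EX-1"
    summary: str
    estimate_days: float  # engineer's estimate in working days


def cards_needing_split(cards):
    """Return the cards whose estimate exceeds the maximum card size."""
    return [c for c in cards if c.estimate_days > MAX_CARD_DAYS]


def fits_in_sprint(cards, sprint_days, engineers):
    """Crude capacity check: total estimated days vs. engineer-days available."""
    return sum(c.estimate_days for c in cards) <= sprint_days * engineers


# Illustrative backlog; the cards and numbers are invented.
backlog = [
    Card("EX-1", "Write design doc", 4),
    Card("EX-2", "Implement feature prototype", 12),
    Card("EX-3", "Add test cases", 2),
]

for card in cards_needing_split(backlog):
    print(f"{card.key} is estimated at {card.estimate_days} days; split it before the sprint")
```

In this invented backlog only `EX-2` would be flagged. The point is the habit rather than the tooling: any card too big to close within about a week gets broken down during planning, not discovered mid-sprint.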
At the end of the sprint, it’s very easy to measure the how and why. If we missed the mark, we investigate, we revisit, we analyze, we re-plan. We don’t just plow forward. In the past, when we used a waterfall method, there was a tendency to just plow forward without much additional planning, “trying to make up the time.” It never worked.
Agile requires us to plan and clearly define cards before a sprint begins. That structure forces us to think through what is being accomplished and to hold off on work until it is understood. It also lets us work through as-yet-unanswered questions in offline conversations and, once they are answered, include the work in a future sprint without causing project delays.
Conclusion
I haven’t seen much literature on Agile in environments other than traditional GUI development. I’m not sure why. Splitting work into demonstrable units seems applicable to so many environments. Whamcloud’s experience shows that it can work very effectively in HPC. Every customer I have ever worked with appreciates seeing regular results that show their project moving forward and their money being well spent. Agile for HPC at Whamcloud has meant stopping and understanding as we go along – making corrective actions in the moment – and reaping a much greater rate of on-time delivery.
Jessica is passionate about project management, process efficiencies, and working with amazing teams. She lives with her husband in Boulder, CO, and is most likely to be spotted around town trail running on the weekends.

Full Story posted in insideHPC under HPC, HPC Software by Rich Brueckner
Hi Jessica,
I did publish your project management articles a long time ago (about 3 years ago) on PM Hut, the latest one is this one.
I was wondering when you are going to write again on your blog and whether you have another one right now.
Thanks as always for your interest and support around project management issues. I don’t have a ton of extra time for writing but intend to contribute to this Whamcloud blog periodically. Please keep your eye on Whamcloud for more articles from me in the future.
Thanks,
Jessica
Excellent article. I keep visiting the website for updates on Lustre and this time, I ran into your article. Thank you for sharing your experience. I think the same challenges you mentioned apply equally to any Enterprise systems project(s) that have complex designs and HW/SW interactions. Thanks again!
Mallik,
Good to hear that you continue to follow Lustre. I agree with you that these principles could apply to any complex design interactions that greatly increase the overall unknowns in the project.
-Jessica