There comes a point when the dataset you’re trying to work with is simply too big. For me, this was heralded by my laptop making an ungodly amount of noise and its temperature increasing to a concerning level. Oh, and I would have to leave it running all night. And then there would be an error. Rinse and repeat.
Eventually I tired of this and went searching for a solution. What I discovered was that my university, like many others, operated an HPC (high-performance computing) cluster for tasks like mine. This meant that I could submit my analysis job to the cluster and simply turn my laptop off and walk away – revolutionary! (Yes, I was naïve at the time.)
What does the cluster have that my laptop doesn’t?
There are benefits to the cluster beyond just outsourcing your computing needs. Firstly, the cluster can almost certainly offer you more storage than your laptop (or even desktop PC) ever could. It’s also likely to offer quite a bit more RAM, meaning that it can handle more complicated operations. And the GPUs are probably better, too. All of this means that you can analyse bigger data with more complex methods on an HPC cluster than you can on your local machine.
What’s the catch?
There are also reasons not to use a cluster, though. Most (if not all) HPC clusters run on Linux and require some level of shell scripting ability to use them, which presents a bit of a learning curve. If there isn’t a huge advantage to using the cluster, it may not be worth going through the effort of learning how to use it and preparing your code for it.
There are other drawbacks too. Running code faster on more powerful hardware means using more energy, so it might be best not to run straightforward jobs on a cluster. There are also data protection considerations – for example, it is often not appropriate to upload sensitive personal data to an HPC cluster. These concerns grow even more important when using cloud-based HPC solutions, such as Amazon Web Services, Microsoft Azure or Google Cloud Platform, where a third party hosts the servers.
Parallelisation
Something that I feel is very underused by researchers (at least in social science), and that HPC clusters can facilitate, is parallelisation. Parallelisation is the art of running parts of your data analysis at the same time so that it finishes sooner. Pretty much all computers can do this to some extent, as modern processors have multiple “cores” which can work on different tasks simultaneously – and, again, HPC cluster nodes’ processors usually have more of them. This is very useful when you need to perform a lot of tasks (or the same task a lot of times) on the same data, or on different parts of the data, and less useful when you have a procedure with many steps where each must wait for the last to complete. Parallelisation also comes with some overhead, though – starting worker processes and shuttling data between them takes time – so it’s not as simple as “running my analysis in 20 parallel processes will make it 20 times faster”. The 20 parallel processes will probably also use quite a lot of memory, which makes it easier to run out entirely.
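To make that concrete, here is a minimal sketch in R using the built-in parallel package. The slow_task function is just a placeholder for whatever work you are actually doing, and mclapply() relies on process forking, so it parallelises on Linux (as on most HPC nodes) and macOS but not on Windows.

    # Compare running the same placeholder task sequentially and in parallel
    library(parallel)

    slow_task <- function(i) {
      Sys.sleep(1)  # stand-in for a computation that takes about a second
      i^2
    }

    n_cores <- detectCores()

    # Sequential: roughly 8 seconds for 8 tasks
    system.time(results_seq <- lapply(1:8, slow_task))

    # Parallel: closer to 8 / min(8, n_cores) seconds,
    # at the cost of extra memory for each worker process
    system.time(results_par <- mclapply(1:8, slow_task, mc.cores = min(8, n_cores)))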
A natural candidate for parallelisation is multiple imputation, since each imputed dataset is derived from the same original data, completely independently of the other imputations. If you need 50 imputations, you could write R code to compute all 50 simultaneously in separate processes, rather than waiting for whatever multiple imputation package you are using to compute them sequentially.
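As a rough sketch of what that could look like with the mice package (nhanes is a small example dataset bundled with mice, standing in for your own data; recent versions of mice also ship their own helpers for parallel imputation, which may suit you better):

    library(parallel)
    library(mice)

    # Each worker produces a single imputed dataset, with its own seed for reproducibility
    run_one_imputation <- function(seed, data) {
      mice(data, m = 1, seed = seed, printFlag = FALSE)
    }

    seeds <- 1:50  # one seed per imputation

    imp_list <- mclapply(seeds, run_one_imputation, data = nhanes,
                         mc.cores = detectCores())

    # Stitch the 50 single-imputation objects back into one mids object
    imputations <- Reduce(ibind, imp_list)

Whether this actually saves time depends on how many cores you have and how expensive each imputation is, and all 50 worker processes hold their own copy of the data in memory at once – which is exactly where a cluster node with plenty of RAM comes in handy.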
To sum up
HPC clusters are great, but not to be used for everything!
Of course, researchers are always finding new ways to break their institutions’ HPC clusters. So I just wanted to finish with a shoutout to the people whose job it is to maintain these clusters, because without them, I would be really screwed.
CODE NOTES: high-performance computing by Tom Metherell is licensed under CC BY 4.0