An OpenBLAS-based Rblas for Windows 64

One of the more important pieces of software that powers R is its BLAS, which stands for Basic Linear Algebra Subprograms. This is the suite of programs which, as its name implies, performs basic linear algebra routines such as vector copying, scaling, and dot products; linear combinations; and matrix multiplication. It is the engine on which LINPACK, EISPACK, and the now industry-standard LAPACK and its variants are built. As a lower-level software suite, the BLAS can be fine-tuned to different computer architectures, and a tuned BLAS can create a dramatic speedup in many fundamental R routines. There are a number of projects developing tuned BLAS suites, such as the open-source ATLAS, GotoBLAS2, and OpenBLAS, and the closed-source Intel MKL and AMD ACML libraries.

In R itself, the BLAS is stored in the shared library file Rblas.dll, and a properly compiled tuned version can simply be “dropped in” to the bin subdirectory and used immediately. For people working in Linux, there is significant support for specially tuned BLAS (and LAPACK) files, covered in great detail in the R Installation and Administration manual; the support for Windows is somewhat less robust, to be charitable. For many years I was constrained to working in a 32-bit Windows environment, and took advantage of the 32-bit Windows compiled versions that existed in R’s repository of tuned BLAS files. However, at that time the most recent version was for a Pentium 4, so over the next few years I struggled through compiling ATLAS-based Rblas’s for 32-bit Windows. I did this for the Core 2 Duo (C2D) and the quad-core SandyBridge (Core i7, called C2i7 in the repository). The speedup was dramatic (see this blog, which uses the Core 2 BLAS I compiled, for an example). However, I found ATLAS difficult to compile and subject to any number of problems, and, not really being a programmer, I often ran into issues I could not solve. Moreover, I was never able to successfully compile a 64-bit BLAS which passed the comprehensive make check-all test suite.
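In practice, the swap amounts to backing up the existing Rblas.dll and copying the tuned one over it. A minimal sketch, with purely illustrative paths (Windows locks Rblas.dll while that installation of R is running, so run this from a different R installation or copy the files by hand):

```r
# Illustrative paths only -- adjust to the target R installation
r_bin <- "C:/Program Files/R/R-3.0.2/bin/x64"

# Back up the reference BLAS before overwriting it
file.copy(file.path(r_bin, "Rblas.dll"),
          file.path(r_bin, "Rblas.dll.reference"))

# Drop in the tuned BLAS under the name R expects
file.copy("C:/Downloads/Rblas_tuned.dll",
          file.path(r_bin, "Rblas.dll"), overwrite = TRUE)
```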

Recently, I made the complete switch to 64-bit Windows at both work and home, so finding a 64-bit Rblas became much more important. There are some pre-compiled 64-bit BLAS files for R, graciously compiled by Dr. Ei-ji Nakama, which can be found here. Using these binaries, I found a dramatic increase in speed over the reference BLAS. However, the most recent processor-specific BLAS in that repository is for the Nehalem architecture, and cannot take advantage of the AVX instructions built into the newer SandyBridge and IvyBridge cores. Much to my frustration, I had many failures trying to compile a 64-bit ATLAS-based BLAS which, when built into R, would pass all the checks. With each attempt taking between six and twelve hours, I eventually gave up trying to use ATLAS and resigned myself to living with the GotoBLAS-based files, which honestly was not much of a resignation.

A bit more recently, I came across the OpenBLAS project, which claims to have reached near-MKL speeds on the Sandy/IvyBridge architecture due to some elegant hand-coding and optimization, and I was hoping to take advantage of this in R. Unfortunately, there are no pre-compiled binaries to make use of, so I had to attempt the compilation on my own. What made this more difficult is that, officially, R for Windows does not support using GotoBLAS- or ACML-based BLAS routines, and even ATLAS has limited support (see this thread, specifically Dr. Ripley’s response). This called for a lot of trial and error, originally resulting in dismal failure.

Serendipitously, around the time of the R 3.0.1 release there was an OpenBLAS update as well. Trying again, I was finally successful in compiling a single-threaded BLAS, based on OpenBLAS v0.2.8, for the SandyBridge architecture on 64-bit Windows that, when used in the R compilation, created an Rblas that passed make check-all! For those interested in compiling their own: once OpenBLAS is compiled, it can be treated as an ATLAS BLAS in R’s MkRules.local, with the only additional change being pointing src/extra/blas/Makefile.win to the compiled .a file.
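For illustration, a hedged sketch of those two edits, assuming OpenBLAS was built as a static libopenblas.a (the variable names follow R’s MkRules.dist; the library path is a hypothetical location, so adjust both to your own tree):

```make
# --- src/gnuwin32/MkRules.local: treat the OpenBLAS build as an ATLAS BLAS ---
# (ATLAS_PATH below is a hypothetical location for the compiled library)
USE_ATLAS  = YES
ATLAS_PATH = /c/OpenBLAS/lib

# --- src/extra/blas/Makefile.win: point the ATLAS link step at OpenBLAS ---
# Where the ATLAS case links -lf77blas -latlas, substitute the OpenBLAS
# archive instead, e.g.:
#   -L"$(ATLAS_PATH)" -lopenblas
```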

Once I was successful with the SandyBridge-specific file, I compiled an Rblas that was not SandyBridge-dependent but could be used on any i386 machine. I plan on submitting both to Dr. Uwe Ligges at CRAN, and hope that, like the other BLAS’s I submitted, they will eventually be posted.

To demonstrate the increase in speed that using a tuned BLAS can provide in R, I ran a few tests. First, I created two 1000×1000 matrices populated with random normal variates.
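A minimal sketch of that setup (the seed shown is an illustrative stand-in, not the original value):

```r
# Two 1000 x 1000 matrices of standard normal draws
set.seed(1)  # illustrative seed, not the original
A <- matrix(rnorm(1000 * 1000), nrow = 1000, ncol = 1000)
B <- matrix(rnorm(1000 * 1000), nrow = 1000, ncol = 1000)
```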

I can provide the specific matrices for anyone interested. I then compiled a basically vanilla R 3.0.2 for 64-bit Windows, using only -mtune=corei7-avx -O3 for optimization, so the code should run on any i386 machine (a sketch of where these flags go appears after the list below). I followed the compilation steps for a full installation, so it included base, bitmapdll, cairodevices, recommended, vignettes, manuals, and rinstaller for completeness. Using a Dell M4700 (i7-3740QM @ 2.7GHz, Windows 7 Professional 64-bit, 8GB RAM), I tested the following BLAS’s:

  • Reference
  • GotoBLAS Dynamic
  • GotoBLAS Nehalem
  • OpenBLAS Dynamic
  • OpenBLAS SandyBridge
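
As referenced above, the optimization flags feed into the build through the local make rules. A minimal sketch, assuming the MkRules.local mechanism described in the R for Windows build documentation (EOPTS is the variable MkRules.dist provides for extra optimization flags):

```make
# src/gnuwin32/MkRules.local -- extra optimization flags for the R build
EOPTS = -mtune=corei7-avx -O3
```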

I updated all packages and installed the microbenchmark package (all from source). To test the effects of the different BLAS’s, I renamed and copied the appropriate BLAS to Rblas.dll each time. The test ran multiple repetitions of crossprod, solve, qr, svd, eigen, and lu (the last requires the Matrix package, but that is a recommended package). The test code is below.
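In outline, it was along these lines (a minimal sketch assuming the A and B matrices created above; the repetition counts are illustrative, apart from eigen, which was run only 20 times as noted below):

```r
library(microbenchmark)
library(Matrix)  # provides lu(); Matrix is a recommended package

# Benchmark the core linear-algebra routines
res <- microbenchmark(
  crossprod = crossprod(A, B),
  solve     = solve(A),
  qr        = qr(A),
  svd       = svd(A),
  lu        = lu(A),
  times     = 100L
)

# eigen is benchmarked separately with fewer repetitions, since it was
# drastically slower under one of the tested BLAS's
res_eigen <- microbenchmark(eigen = eigen(A), times = 20L)

print(res)
print(res_eigen)
```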

The results are illuminating.

All the tuned BLAS results are much better than the reference, with the exception of the eigenvalue decomposition under the GotoBLAS-based Rblas. I do not know why that is the case, but the difference was so severe that I had to run it only 20 times to get results in a reasonable amount of time. For the other routines, the OpenBLAS-based version is sometimes quicker and sometimes not. I personally will use it exclusively, as the minor lag compared with some of the GotoBLAS timings is more than compensated for by the eigenvalue speedup, and overall it is still anywhere between 3 and 10 times faster than the reference BLAS. Regardless, it is clear that using a tuned BLAS can speed up calculations considerably.

Using a tuned BLAS is not the only way to increase R’s speed. For anyone compiling R for themselves, throwing the proper flags in its compilation can squeeze a bit more speed out of it, and activating byte-compilation can do a bit more. In my next post, I hope to show similar timing numbers, but this time using an R compiled for my specific machine (IvyBridge) in concert with the tuned BLAS.
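As a small illustration of what byte-compilation does, here is a hedged sketch using the base compiler package (the function below is a hypothetical example, not code from the post):

```r
library(compiler)

# A deliberately loop-heavy function, where byte-compilation helps most
sum_sq <- function(x) {
  s <- 0
  for (i in seq_along(x)) s <- s + x[i]^2
  s
}
sum_sq_c <- cmpfun(sum_sq)  # byte-compiled version of the same function

x <- rnorm(1e5)
identical(sum_sq(x), sum_sq_c(x))  # identical results; the compiled
                                   # version is typically faster in loops
```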

Comments

  1. I’m looking forward to something like this becoming generally available. I usually use R on a very similar system (R 3.0, Dell, Intel i7 CPU, 16GB RAM, Windows 64-bit), though I’m not sure how much it would really help me because most of my wait time (which can be hours or days) is spent in the gbm, earth, and nnet packages, which I think do their calculations in their own C code.

    • I’m not that familiar with those packages, Andrew, but it stands to reason. The nnet source package has C code containing functions like “sigmoid” and “Build_Net”, and gbm has a slew of C++ files, so it is likely that a faster BLAS will not help too much, although it cannot hurt to try. Have you considered porting any specific routines you have built, for example using the Rcpp package?

    • According to Wikipedia, the Westmere is the architecture between the Nehalem and the SandyBridge, and it has the AES instruction set but not the AVX instruction set, so I would suggest the GotoBLAS compiled for Nehalem for now. If (hopefully when) the dynamic OpenBLAS I submitted to CRAN gets approved, that would be another option as well.

  2. Pingback: An OpenBLAS-based Rblas for Windows 64: Step-by-step | Strange Attractors

  3. Pingback: R with GotoBLAS on Windows 10 | Matt Moores