In the first post of this series, we saw what a financial option is. We also discussed the pricing problem for an option and looked at how mathematics and statistics can be used to formulate a sensible way to calculate such a price. We finally saw how to implement that calculation using F# as programming language.

In this publication we take our discussion more on the technical programming side, and put our focus more on the performance, meaning “how can we use our tools to calculate that price as fast as possible”?

We’ll explore some scenarios and compare the results, getting interesting information out of it. In each scenario, we will calculate only the price for a Call option, with a fixed set of parameters. What we are interested here is not to calculate the strict CPU time, but the average time over a finite and big enough number of executions in a standard OS environment. There is still room to stretch the results, for example

- naked hardware, without any OS overhead
- using specialized processors. It’s becoming more common, nowadays, to use the almost always available graphics processors, whose hardware has been optimized into parallel/vectorized structures and are able to perform specific type of computations much faster than regular general purpose processors
- explicitly intervene in the normal scheduling of the OS, like forcing our task to real-time or uninterruptible priority, including disabling interrupts to ensure that nothing will stop our computation

We won’t push our tests that far here. We will, though, use our test computer with only our application running, so to avoid inconsistencies that would derive by having several application competing for user level CPU resources.

# Test code

The main test code is executed in F# interactive, all code is 64 bits

let optionsPricesStress principal strike rate volatility timeToMaturity repeat mode = let n, calculator = match mode with | FSharp -> repeat, fun() -> BlackScholes.callPrice1 principal strike rate volatility timeToMaturity | MathNET -> repeat, fun() -> callPrice principal strike rate volatility timeToMaturity | CPP -> repeat, fun() -> cpp.CallPrice(principal, strike, rate, volatility, timeToMaturity) | CPPREP -> 1, fun() -> cpp.CallPriceStress(principal, strike, rate, volatility, timeToMaturity, repeat) | ASM -> repeat, fun() -> CallPrice(principal, strike, rate, volatility, timeToMaturity) | ASMREP -> 1, fun() -> CallPriceRep(principal, strike, rate, volatility, timeToMaturity, repeat) let rec loop i = match i with | 1 -> calculator() | j -> calculator() |> ignore; loop <| j - 1 loop n #time optionsPricesStress 100. 100. 0.05 0.2 1.0 (1<<<26) FSharp optionsPricesStress 100. 100. 0.05 0.2 1.0 (1<<<26) MathNET optionsPricesStress 100. 100. 0.05 0.2 1.0 (1<<<26) CPP optionsPricesStress 100. 100. 0.05 0.2 1.0 (1<<<26) CPPREP optionsPricesStress 100. 100. 0.05 0.2 1.0 (1<<<26) ASM optionsPricesStress 100. 100. 0.05 0.2 1.0 (1<<<26) ASMREP

# Math.NET

Math.Net.Numerics contains functions that calculate the normal cumulative distribution. In the original implementation, we implemented the calculation directly in F#, using a well-know approximation method (Zelen & Severo).

open MathNet.Numerics.Distributions let private normalCDF x = Normal.CDF(0., 1., x) let internal callPrice principal strike rate volatility timeToMaturity = let [nd1; nd2] = BlackScholes.d12 principal strike rate volatility timeToMaturity |> List.map normalCDF let KexpMinusRT = BlackScholes.KexpMinusRT strike rate timeToMaturity principal * nd1 - KexpMinusRT * nd2

# C++

In this scenario, we create a C++ CLI DLL and implement the same calculations as we did in F#. The loop is executed in F#, which means we will incur in the typical call overhead when calling .NET DLLs

open QuanTifCppCLI let private cpp = BlackScholesCppCLI() cpp.CallPrice(principal, strike, rate, volatility, timeToMaturity)

see github for the C++ code.

# C++ with loop

The difference in this case is that the loop is executed in C++, not F#; we remove the DLL call overhead

# ASM

In this case the calculation is made in assembly (more later). The code sits in an unmanaged DLL, and the call from F# uses PInvoke (unmanaged binary call)

[<DllImport(@"C:\Root\Project\QuanTif\x64\Debug\QuanTif.x64.exe", CallingConvention = CallingConvention.Cdecl) extern float CallPrice(float principal, float strike, float rate, float volatility, float timeToMaturity) CallPrice(principal, strike, rate, volatility, timeToMaturity)

# ASM with loop

As before, we manage the loop DLL side, avoiding possible PInvoke call overhead

[<DllImport(@"C:\Root\Project\QuanTif\x64\Debug\QuanTif.x64.exe", CallingConvention = CallingConvention.Cdecl) extern float CallPriceRep(float principal, float strike, float rate, float volatility, float timeToMaturity, int repeat) CallPriceRep(principal, strike, rate, volatility, timeToMaturity, repeat)

For the C++ and ASM code in github see here and here.

# Async.Parallel

All the previous scenarios are repeated by using the Async.Parallel pattern, where we split the batch of calculation into 4 different parallel thread. Async.Parallel will schedule these 4 calculation streams to 4 different cores that will execute them in parallel.

let f1 n r m = async { return optionsPricesStress 100. 100. 0.05 0.2 1.0 n m } |> Seq.replicate r |> Async.Parallel |> Async.RunSynchronously |> Seq.head let f2 = f1 (1<<<24) 4 f2 FSharp f2 MathNET f2 CPP f2 ASM f2 CPPREP f2 ASMREP

# Results

The results below refer to 2^26 calls (67,108,864); the tests have been executed in a Windows 10 VM with 4 cores (host has 8 cores).

Real Time [s] | Single Core | 4 Cores |

FSharp | 29.625 | 8.345 |

Math.NET | 25.943 | 7.389 |

C++ | 27.392 | 6.697 |

C++ Loop | 25.371 | 6.317 |

ASM | 6.753 | 3.179 |

ASM Loop | 3.343 | 2.753 |

The reader can infer whatever conclusions she likes, by my part I would highlight

- The .NET parallelism is very easy to use in F# through Async.Parallel, and it delivers a quasi-perfect-dividing performance for higher level language implementations
- Math.NET looks quite good in performance optimization, considering that the plain FSharp implementation is using a fast approximation
- Using C++ to implement the calculation gives some improvement, but not an astonishing one; it looks like the F# compiler and the .NET JIT compiler do a quite good job.
- To reach a remarkable improvement in performance we need to abandon the comfort of high level languages and deal with our hardware directly. The assembly implementations presented here, in fact (more later), do not simply translate the high level implementation, but explicitly try to get rid of whatever is not strictly required (calling convention) and break down the calculation to leverage as most as possible on the features of the underlying processor. Compilers do what they can to optimize code, but there is still quite a bunch of work to make them available to automatically discover parallel patterns from a non trivial sequence of high level instructions.
- The DLL-call overhead (difference between Loop and non Loop) is not big but neither negligible, especially in the ASM version where its weight is the same as the calculation itself. Also, we note that there is not much difference between a .NET and a binary P-Invoke DLL call.
- The reduction factor from the slowest to the fastest is greater than 10. This is obtained acting only on the software implementation. It’s not big thing compared to the much greater impact hardware improvements have brought, but it reminds us that a software implementation can quickly invalidate any efforts in the hardware side. Another way to renew the timeless wisdom of “With great power (powerful programming languages) comes great responsibility”

# ASM+AVX2 implementation

There are basically two areas where the assembly implementation optimizes our calculation against high level languages (F# and C++):

#### Function call conventions overhead

High level programming languages can have many specific and unique features, but one common thing they all have is a call mechanism. This means that we can identify computations that need to be repeatedly executed in our program and codify them just once. Then, we call them when needed. This is a basic abstraction feature and one of the first inventions in the history of programming.

Processors, though, offer a much lower level of abstraction, hence the compiler or interpreter of our high level script needs to translate the high level call. Functions (or subroutines as we could more generally call them), have parameters and return values. They execute the same sequence of instructions, what changes among instances of these executions is the data context. The data context a subroutine operates on is composed (roughly) by global data (allocated independently of the subroutine) and instance data (allocated only when the subroutine is executed). In turn, instance data can be divided in interface data (a.k.a. arguments or parameters) and local data (data which is used by the subroutine code and doesn’t serve anything else outside of it).

To efficiently manage these requirements, language translators produce some well-defined code segments, usually executed before (header) and after (footer) the subroutine user code. These code segments follow precise protocols, which are called calling-conventions.

Introducing protocols, though, has a cost, which is paid whenever we call a subroutine. In this implementation, therefore, I try to avoid these costs, by implementing as much as possible inline code. For example, functions like logarithm and exponential are not called from existing libraries, but implemented in code (see here and here respectively).

In the early days of the first 8088/8086 8 and 16 bits processors, these were implemented using basic arithmetic instructions on 16 bits integers. Then we began to see the FPUs (Floating point units) 8087, a co-processor physically distinct from the main CPU. This gave us an extended instruction set with transcendental function instructions, much faster both in execution and codification (only one call instead than an algorithm). Behind the scenes, though, the FPU continued to implement the same polynomial approximation, in hardware or micro-code.

Later, the FPU was integrated in the main processor, boosting performance thanks to the fact that exchange between the main unit and the FP unit did not involve the motherboard’s bus.

Nowadays, the AVX instructions make the FPU obsolete, and we are back in writing our transcendental functions in the polynomial way.

The moral is: processors do not calculate transcendental functions directly and intrinsically. The building blocks of any calculations are always the good old arithmetic operations (+-*/). From those we derive the rest as more or less complex algorithms.

Hence, the absence of transcendental instructions in the AVX set may seem a step back from the FPUs, but in reality it is not. By potentiating the basic arithmetic instructions, we get better performance also in transcendental function algorithms as a consequence.

#### Parallel (vector) execution

Modern processors offer built-in vector capable instructions. These instructions perform operations on vectors instead of scalars, which means, for example, that we can codify multiple multiplications in a single CPU instruction. The computer I used to prepare this material is equipped with an Intel(R) Core(TM) i7-7820HQ CPU. This processor offers AVX2 (Advanced Vector Extensions 2) level instruction set. These instructions allow to operate on 256 bit registers (mnemonic YMM1 to YMM15). In double precision floating point (64 bits scalar), this means that in a single instruction we can perform 4 basic arithmetic operations (+,-, *, /). If our computation workflow happens to be such that we can identify parallel independent streams, then we can leverage on this feature. At a higher level, parallel patterns can be damn complex to identify by an automatic translator (compiler or interpreter). That’s why, in this implementation, I spent some time trying to identify those patterns, and I found some; the following block shows a few.

*C *

double sigmaSqrHalved = volatility * volatility * 0.5;

double sigmaSqrT = volatility * sqrt(timeToMaturity);

*ASM*

MULPD XMM8, XMM3 ;XMM8 *= (volatility, volatility) = (sigmaSqrHalved, sigmaSqrT)

*C *

d1 = (logSdivK + (rate + sigmaSqrHalved) * timeToMaturity) / sigmaSqrT;

d2 = (logSdivK + (rate – sigmaSqrHalved) * timeToMaturity) / sigmaSqrT;

*ASM*

VADDSUBPD XMM8, XMM2, XMM9 ;XMM8 = (rate + sigmaSqrHalved, rate – sigmaSqrHalved)

MULPD XMM8, XMM4 ;XMM8 *= (timeToMaturity, timeToMaturity)

ADDPD XMM8, XMM5 ;XMM8 += (logSdivK, logSdivK)

VDIVPD XMM6, XMM8, XMM10 ;XMM6 /= (sigmaSqrT, sigmaSqrT) -> (d1, d2)

*C (3 separate exp)*

nd1 = normalCDF(get(_d12));

nd2 = normalCDF(get(_d12));

KexpMinusRT = strike * exp(-rate * timeToMaturity);

*ASM (3 parallel exp)*

**;YMM11 input = (-rate*timeToMaturity, -rate*timeToMaturity, -0.5*d1^2, -0.5*d2^2)**

VMULPD YMM11, YMM11, L2E ;YMM11 = [log2(e) * x = t]

VROUNDPD YMM4, YMM11, 1 ;YMM4 = [floor(t)]

VMOVUPD YMM12, CNVI

VADDPD YMM13, YMM12, YMM4

VPSUBQ YMM12, YMM13, YMM12 ;YMM12 = [i = (int)floor(t)]

VSUBPD YMM13, YMM11, YMM4 ;YMM13 = [f = t – floor(t)]

VMOVUPD YMM4, expC0 ;YMM4 = [c0]

VFMADD213PD YMM4, YMM13, expC1 ;YMM4 = [c0*f+c1]

VFMADD213PD YMM4, YMM13, expC2 ;YMM4 = [(c0*f+c1)*f+c2 ~ 2^f]

VPSLLQ YMM12, YMM12, 52 ;position exponent in double IEEE representation (=> * 2^i)

VPADDQ YMM11, YMM12, YMM4 ;blend in 2^f => YMM11 = [2^f * 2^i]

**;YMM11 output = (exp(-rate*timeToMaturity),exp(-rate*timeToMaturity),exp(-0.5*d1^2),exp(-0.5*d2^2))**

*C (3 arithmetic operations)*

callValue = principal * nd1 – _KexpMinusRT * nd2;

*ASM (single vector dot product instruction)*

VDPPD XMM0, XMM13, XMM14, 031h ;XMM0 = principal * nd1 – KexpMinusRT * nd2 = RESULT

#### Intrinsics

AVX instructions can be called directly by C/C++, using intrinsics. However, while using intrinsics simplifies codifying AVX code, it still implies some overhead as it uses memory to keep data, with significant more move operations in and out of registers. In my implementation, I do almost everything using only CPU registers.

Looking at the results table above, the performance improvement achieved (C++ Loop vs. ASM Loop) is almost by a factor of 8 for the single core execution, which I think is worth the effort spent.

#### Approximation

Having implemented everything in code, I chose specific approximations for the required functions (natural logarithm and exponential); my choice was dictated by the objective to reach the maximum speed in calculation. As a result, the final result is not as precise as we get from the C++ or F# implementations (the high level implementations give a call option price of 10.45058357, the ASM gives 10.39619611).

Of course the ASM result can be made more precise as we want, losing in performance. I think, however, that a greater level of precision can be reached (e.g. 10.45****) with still a significant performance advantage over the high level implementations.

And this is one of the advantages of this way of doing things. We are free to tune precision and performance. The high level implementations, using library functions, are constrained by their performance and precision.

# Calling ASM from F#

In the ASM implementation, we import (through DllImport) the function CallPrice from a DLL. That function is implemented in assembly and exported by the DLL project’s DEF file. In other words, we call assembly code directly from F#, without passing through an intermediate C function. Nothing special actually, but it may be worth noticing it.

# Conclusion

In this two posts series, we’ve seen how the Black-Scholes model for calculate option prices can be translated into F# code. Secondly, we put our focus on performance, and sought ways to make the calculation faster and faster, acting only on code.

I haven’t touched an important corollary of the Black-Scholes model, the so called greeks. The greeks are partial derivatives of the main price equations and their importance is due to the fact that they tell us how sensitive the option price is in relation of changes in the market parameters used in the price equation, and so are well suited to help in risk analytics.

Maybe I’ll write further about them in a future publication.

I wish you all my warmest Happy Coding.