Non-Linear Feedback Simulation

CyberUnits Bricks: An Implementation Study of a Class Library for Simulating Nonlinear Biological Feedback Loops

Which programming language is fit for high-performance computer #simulations in biomedical #cybernetics? We found that #Free_Pascal generates faster code than S/R and #Python. Interestingly, Object #Pascal even outperforms #Swift and C++.

https://doi.org/10.14201/adcaij.31762

(Editor's note: FWIW, if they’ve outperformed C++, maybe they didn’t write very good C++. :wink: )

Delphi is good as raw code, but it slows dramatically once objects are added and execution keeps moving into, out of, or through objects.
Delphi's containment of data in objects is very convenient for the programmer, but not always for the microprocessor. We have to remember that the languages above were mainly written in C++, so how can they be faster than C++? Delphi is another language that today is written mainly in C++, with critical parts in assembly.
My conclusion is that Delphi, C# and C++ are all much the same; it is more about how you write your code than anything else. C++ is a good way to discipline yourself not to use objects when you need speed. And what about your assembly skills? Naturally, 32-bit can help speed because of its smaller pointers, so are you thinking of a *.DLL as the final product? In the end, your fluency in the language you use can have the biggest impact on speed with these three languages.

If you want to push things to the limit, you need to start looking at the use of the L1 cache and, to a lesser extent, the L2 cache. As an exercise, I worked out how to hit the limit of performance when doing Monte Carlo simulations of tennis matches based on players’ service percentages. I use C#, but the same applies to any language. The largest performance benefit had nothing to do with code: I managed to pin ALL the memory I needed inside the L1 cache, so it never got swapped out. The ultimate result was that I could simulate (and record) a point in around 13 clock cycles, or 3 nanoseconds (sustained). By pinning the lookups in L1 cache I could access them as read-only memory across multiple threads simultaneously, and I did this running 40 concurrent threads on my server. Granted, this was a limited example, but I never knew how much performance you lose when memory gets swapped out of L1 cache.

Misha,

Out of curiosity, were there any specific APIs to do this explicitly? Or did you just need to limit the sizes of your data and/or code, so that it was done implicitly by the OS/CPU?

Alex

I’m not entirely sure how we got here, but this is a graph of the CPU cycle cost of a range of operations.

I just used read-only arrays, limited the data used to those arrays, and referenced all data through array offsets. I kept the total size under the L1 cache. Anything that could be pre-calculated was. For random numbers I used a pre-initialised array and started at a variable offset that wrapped around the array. The code is super tight and small.
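
For anyone wanting to try the same idea, here is a minimal sketch of a pre-initialised random table with a wrapping index, plus a simple size check against a cache budget. Everything in it is an assumption for illustration: the class and method names are hypothetical, the 16-bit element type is my choice, and the 32 KiB figure is only a common L1 data cache size that varies by CPU. There is no pinning API involved; the sketch only keeps the data small.

```csharp
using System;

// Hypothetical helper; names and sizes are illustrative, not from the post above.
public static class RandomTable
{
    // Assumed L1 data cache budget. 32 KiB is common, but check your own CPU.
    public const int AssumedL1Bytes = 32 * 1024;

    // Fill the table once, up front, so the hot loop never calls the generator.
    public static ushort[] Build(int length, int seed)
    {
        var rng = new Random(seed);
        var table = new ushort[length];
        for (int i = 0; i < table.Length; i++)
            table[i] = (ushort)rng.Next(0, 65536); // upper bound is exclusive
        return table;
    }

    // True if `count` elements of `elementBytes` each fit within `budgetBytes`.
    public static bool Fits(int count, int elementBytes, int budgetBytes)
        => (long)count * elementBytes <= budgetBytes;

    // Read the next value, wrapping the index back to 0 at the end of the table.
    public static ushort Next(ushort[] table, ref int index)
    {
        if (index == table.Length)
            index = 0;
        return table[index++];
    }
}
```

Starting each simulation at a different `index` gives the variable offset described above. The budget has to cover every table the loop touches, not just the random values.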

Here is the code for the tight loop:

for (nint i = 0; i != MAX_SIMULATIONS; i++) {
  nint stateIndex = initStateIndex;
  nint randomIndex = i;
  nint isMatchFinished = 0;

  while (isMatchFinished == 0) {
    pointCount++;
    if (randomIndex == MAX_SIMULATIONS)
      randomIndex = 0;

    if (randomValues[randomIndex++] < (scoreStates[stateIndex].IsServing == 0 ? pointWinCutoff2 : pointWinCutoff1)) {
      if (scoreStates[stateIndex].IsServing == 1)
        serviceWonCount++;
      if (scoreStates[stateIndex].NextWonScoreState == 0) {
        wonCount++;
        isMatchFinished = 1;
      } else
        stateIndex = scoreStates[stateIndex].NextWonScoreState;
    } else {
      if (scoreStates[stateIndex].IsServing == 0)
        serviceWonCount++;
      if (scoreStates[stateIndex].NextLostScoreState == 0)
        isMatchFinished = 1;
      else
        stateIndex = scoreStates[stateIndex].NextLostScoreState;
    }
  }
}
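
The loop above relies on a state table that isn't shown, so here is a rough sketch of how such a table could be laid out. This is a guess based only on the field names in the loop: the `ScoreState` struct, the `ToyMatch` class and the toy "first to N points" scoring with alternating serve are all hypothetical, not the real tennis table. The one convention taken from the loop is that a next-state index of 0 means the match is over, so slot 0 is reserved and live states start at 1.

```csharp
// Hypothetical layout inferred from the field names used in the loop above.
public struct ScoreState
{
    public int IsServing;          // 1 if the tracked player serves this point
    public int NextWonScoreState;  // index after winning the point; 0 = match over
    public int NextLostScoreState; // index after losing the point; 0 = match over
}

public static class ToyMatch
{
    // Build a table for a toy "first to `target` points" match.
    // Slot 0 is the terminal sentinel; state (won, lost) lives at 1 + won * target + lost.
    public static ScoreState[] BuildTable(int target)
    {
        var states = new ScoreState[1 + target * target];
        for (int won = 0; won < target; won++)
        {
            for (int lost = 0; lost < target; lost++)
            {
                int index = 1 + won * target + lost;
                states[index].IsServing = (won + lost) % 2 == 0 ? 1 : 0; // serve alternates every point
                states[index].NextWonScoreState =
                    won + 1 == target ? 0 : 1 + (won + 1) * target + lost;
                states[index].NextLostScoreState =
                    lost + 1 == target ? 0 : 1 + won * target + (lost + 1);
            }
        }
        return states;
    }

    // Same idea as the loop above, reduced to counting matches won.
    public static int Simulate(ScoreState[] states, double[] randomValues,
                               int simulations, double cutoffServing, double cutoffReceiving)
    {
        int wonCount = 0;
        for (int i = 0; i < simulations; i++)
        {
            int stateIndex = 1;                        // score 0-0
            int randomIndex = i % randomValues.Length; // variable starting offset
            while (true)
            {
                if (randomIndex == randomValues.Length)
                    randomIndex = 0;                   // wrap around the random table
                ScoreState s = states[stateIndex];
                double cutoff = s.IsServing == 1 ? cutoffServing : cutoffReceiving;
                bool wonPoint = randomValues[randomIndex++] < cutoff;
                int next = wonPoint ? s.NextWonScoreState : s.NextLostScoreState;
                if (next == 0)
                {
                    if (wonPoint) wonCount++;          // match ended on a won point
                    break;
                }
                stateIndex = next;
            }
        }
        return wonCount;
    }
}
```

With `target = 2` the table has five slots, which makes it easy to trace by hand before scaling up to real scoring rules.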

Branching can also affect performance … messing up speculative execution.

Maybe it’s not a big deal if one path is much more likely than the other?
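
On the branching point: a heavily biased branch is usually predicted well, but a branch decided by a random value is about as unpredictable as the win probability makes it. One thing that could be tried for the cutoff choice in the loop above is to replace the ternary on `IsServing` with a two-element lookup. The sketch below is only an illustration with hypothetical names (`CutoffSelect`, `Branching`, `Lookup`); the JIT may already compile the ternary without a jump, and the array access carries its own bounds check, so whether it helps at all would have to be measured.

```csharp
// Hypothetical helper; not taken from the code posted above.
public static class CutoffSelect
{
    // Branching form, equivalent to the ternary in the loop above.
    public static int Branching(int isServing, int cutoffServing, int cutoffReceiving)
        => isServing == 0 ? cutoffReceiving : cutoffServing;

    // Table form: isServing (0 or 1) is used directly as the array index,
    // so the choice is expressed as a load instead of a conditional.
    // cutoffs[0] is the receiving cutoff, cutoffs[1] the serving cutoff.
    public static int Lookup(int[] cutoffs, int isServing)
        => cutoffs[isServing];
}
```

Both forms return the same value for `isServing` of 0 or 1, so they can be swapped in and out of the loop while benchmarking.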