## NSCA Resources

### Designing, Manufacturing and Putting into operation Avitohol – Bulgaria’s new supercomputer

The new supercomputer Avitohol has been in operation and available to users since the end of 2015.

The system is a prototype of the heterogeneous petaFLOPS supercomputers that will be produced in the European Union. Its architecture and computational organization were proposed and justified by the National Center for Supercomputing Applications and approved by the Jülich Supercomputing Centre.

The proposed architecture is a loosely coupled system with strongly integrated nodes. The system was designed with Intel multi-core processors and Intel Many Integrated Core (MIC) coprocessors.

The other prototype of the European heterogeneous supercomputers is DEEP-EP.

Each Avitohol computing node contains two Intel Xeon E5-2600 v2/v3 processors and two Intel Xeon Phi 7120P coprocessors. The system is built from modules consisting of two processor-coprocessor pairs. The minimum memory for each pair is 32 gigabytes.

The computing nodes are interconnected by a non-blocking communication network, that is, one able to connect every pair of computing nodes simultaneously, with a data exchange rate of at least 56 gigabits per second.

Without changing the architecture and design of the system, the number of nodes can be increased only by increasing the number of input and output channels of the communication network and respectively its communication capacity. This is most easily achieved by an InfiniBand non-blocking switch system, built on several levels.
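As a rough illustration of how a multi-level non-blocking switch system scales, the sketch below uses the standard capacity formula for a fat tree built from identical switches. The 36-port switch size and the node counts are assumptions chosen for the example; they are not taken from the Avitohol specification.

```python
def fat_tree_max_nodes(switch_ports: int, levels: int) -> int:
    """Maximum number of end nodes in a non-blocking fat tree of identical switches.

    Switches below the top level use half of their ports downward and half
    upward; top-level switches use all of their ports downward.
    """
    if switch_ports < 4 or switch_ports % 2 or levels < 1:
        raise ValueError("need an even port count >= 4 and at least one level")
    return 2 * (switch_ports // 2) ** levels


def levels_needed(node_count: int, switch_ports: int) -> int:
    """Smallest number of switch levels that can attach `node_count` nodes."""
    levels = 1
    while fat_tree_max_nodes(switch_ports, levels) < node_count:
        levels += 1
    return levels


# Hypothetical example: 36-port switches.
for nodes in (30, 150, 1000):
    print(nodes, "nodes ->", levels_needed(nodes, 36), "level(s)")
```

With these assumed numbers, one level covers up to 36 nodes, two levels up to 648, and three levels up to 11,664, which shows why adding switch levels is the natural way to grow the node count.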

Weakly connected (loosely coupled) systems are the only architecture that allows parallel computations to be organized on three levels and that achieves the maximal performance of algorithms for solving very large problems or modeling complex objects.

Data transfer between each Intel Xeon E5 and its corresponding Intel Xeon Phi takes place at a rate of 6.4 gigabytes per second. This allows the node to use two types of parallel multi-threaded computation, organized with MPI and OpenMP: one with distributed memory and one with shared memory (global array). Combining the two makes it possible to implement Partitioned Global Address Space (PGAS) programming models, so that software can reconfigure the hardware node to match the structure of the computational algorithm.

Consider the following examples:

**Simulating aircraft aerodynamics, including the calculation of turbulence and large vortices in the layer of air flowing around the body.**

One needs to numerically solve the following system of partial differential equations:
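A common dimensionless conservative form that is consistent with the symbols defined below, given here as an assumed reconstruction rather than the authors' exact formulation, is:

$$
\frac{\partial \mathbf{Q}}{\partial t}
+ \frac{\partial \mathbf{F}}{\partial x}
+ \frac{\partial \mathbf{G}}{\partial y}
+ \frac{\partial \mathbf{H}}{\partial z}
= \frac{1}{Re}\left(
\frac{\partial \mathbf{F}_v}{\partial x}
+ \frac{\partial \mathbf{G}_v}{\partial y}
+ \frac{\partial \mathbf{H}_v}{\partial z}
\right)
$$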

where: **Q** is the vector of full or linearized conservative variables; **F**, **G**, **H** are the vectors of full or linearized conservative fluxes; **F**_v, **G**_v, **H**_v are the dissipative forces; *Re* is the Reynolds number.

Dissipative forces are forces whose energy decreases when they interact with a mechanical body and is transformed into another kind of energy, such as heat. One example is the drag force that air exerts on a flying airplane.

To solve the given system of equations, a group of Euler and Navier-Stokes equations must be added:

- LEE – linearized Euler equations;
- NLDE – nonlinear disturbance equations (Euler equations that are nonlinear with respect to external disturbances);
- EE – full Euler equations;
- LNSE – linearized Navier-Stokes equations;
- NSE – full Navier-Stokes equations.

Partial differential equations are transformed into a system of algebraic equations by using the Finite Element Method.

To compute the aerodynamics, the fuselage is covered with a dense mesh (network) of finite elements (small squares).

**Figure 1: Simulating aircraft aerodynamics, panels (a) and (b).**

The generated finite element mesh is so fine that the individual squares cannot be distinguished. In general, it can consist of several million or even tens of millions of squares. Such a mesh generates a system of tens of millions of algebraic equations. The mesh must therefore be divided into several hundred pieces, which breaks the system of equations down into several hundred subsystems. This process is called **decomposition**.

Consider a part of an airplane consisting of the rear fuselage, the two engines, and the horizontal and vertical tail stabilizers. This is still a large unit, which may give rise to several million algebraic equations. To organize several parallel computation flows, the module is divided into elements:

- left engine
- right engine
- horizontal stabilizer
- vertical stabilizer
- rear part of the fuselage.

One has to respect the rule that the boundaries between elements should be as narrow as possible.
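The effect of this rule can be seen in a small sketch that cuts the same square mesh into the same number of pieces in two different ways and counts the mesh edges shared between neighbouring pieces. The 64 x 64 mesh and the 16 pieces are hypothetical values chosen only for illustration.

```python
def interface_edges(n: int, owner) -> int:
    """Count edges between adjacent cells of an n x n mesh owned by different pieces."""
    cut = 0
    for i in range(n):
        for j in range(n):
            if i + 1 < n and owner(i, j) != owner(i + 1, j):
                cut += 1
            if j + 1 < n and owner(i, j) != owner(i, j + 1):
                cut += 1
    return cut


def strip_owner(n: int, parts: int):
    """Cut the mesh into `parts` horizontal strips."""
    height = n // parts
    return lambda i, j: i // height


def block_owner(n: int, parts_per_side: int):
    """Cut the mesh into parts_per_side x parts_per_side square blocks."""
    side = n // parts_per_side
    return lambda i, j: (i // side, j // side)


n = 64
strips = interface_edges(n, strip_owner(n, 16))  # 16 strips of 4 x 64 cells
blocks = interface_edges(n, block_owner(n, 4))   # 16 blocks of 16 x 16 cells
print(strips, blocks)  # 960 384
```

Both layouts give each piece the same number of cells, but the compact blocks share far fewer edges, so less data has to be exchanged between the pieces at every step of the solution.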

Similarly, the wings, the fuselage nose and cockpit, and the external tanks (if any) are also considered separately.

The system of algebraic equations generated by each element is solved by a group of computing nodes, called a *supercomputing module*.

Thus, by applying the well-known principle of “divide and conquer”, we get two levels of parallel computing:

- Dividing the object into modules and sending each resulting system of equations to one computational module;
- Decomposing the system of equations using the domain decomposition method (DDM).

The algorithms used for solving the system by each node reduce the problem to operations with vectors and matrices, which are divided between processor cores and lead to a multitude of parallel threads. This is the third parallelization level.
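A minimal sketch of this third level is a matrix-vector product whose rows are divided among worker threads. The function names and the worker count are assumptions made for the example. In CPython the threads will not speed up pure-Python arithmetic, so the code illustrates only how the work is partitioned, not the performance of a compiled OpenMP implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(matrix, vector, start, stop):
    """Multiply rows start..stop-1 of `matrix` by `vector`."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix[start:stop]]


def parallel_matvec(matrix, vector, workers=4):
    """Split the rows into contiguous blocks and process each block in its own thread."""
    n = len(matrix)
    if n == 0:
        return []
    chunk = -(-n // workers)  # ceiling division
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: matvec_rows(matrix, vector, *b), bounds))
    # Concatenate the per-thread results in row order.
    return [y for part in parts for y in part]


A = [[1, 2], [3, 4], [5, 6]]
x = [1, 1]
print(parallel_matvec(A, x, workers=2))  # [3, 7, 11]
```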

This approach is successfully applied to modeling complex objects described by other systems of partial differential equations, such as the three-dimensional Laplace, Poisson, Poisson-Boltzmann and Helmholtz equations, wave equations, the equation of air distribution in any given environment, the equations of heat and mass transfer, etc.
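To make the domain decomposition idea concrete, the sketch below applies the alternating Schwarz method to a one-dimensional Poisson problem, -u'' = 1 on (0, 1) with u(0) = u(1) = 0. It is a simplified stand-in, not the solver used on Avitohol: it uses a finite-difference stencil instead of finite elements, only two overlapping subdomains, and hypothetical grid and overlap sizes.

```python
def solve_poisson_segment(f, h, left, right):
    """Solve -u'' = f on one segment with Dirichlet values `left` and `right`.

    Uses the Thomas algorithm on the tridiagonal stencil (-1, 2, -1).
    """
    n = len(f)
    rhs = [h * h * fi for fi in f]
    rhs[0] += left
    rhs[-1] += right
    diag = [2.0] * n
    for k in range(1, n):  # forward elimination
        m = -1.0 / diag[k - 1]
        diag[k] += m
        rhs[k] -= m * rhs[k - 1]
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for k in range(n - 2, -1, -1):  # back substitution
        u[k] = (rhs[k] + u[k + 1]) / diag[k]
    return u


def schwarz_poisson(n=99, overlap=10, sweeps=30):
    """Alternating Schwarz iteration on two overlapping subdomains."""
    h = 1.0 / (n + 1)
    f = [1.0] * n
    u = [0.0] * n
    mid = n // 2
    a_end = mid + overlap    # subdomain A holds unknowns [0, a_end)
    b_start = mid - overlap  # subdomain B holds unknowns [b_start, n)
    for _ in range(sweeps):
        # Each subdomain takes its interface value from the latest global iterate.
        u[:a_end] = solve_poisson_segment(f[:a_end], h, 0.0, u[a_end])
        u[b_start:] = solve_poisson_segment(f[b_start:], h, u[b_start - 1], 0.0)
    return u, h


u, h = schwarz_poisson()
# Compare with the exact solution u(x) = x (1 - x) / 2.
error = max(abs(ui - (i + 1) * h * (1 - (i + 1) * h) / 2) for i, ui in enumerate(u))
print(f"max error after 30 sweeps: {error:.2e}")
```

Each sweep solves two small problems instead of one large one and exchanges only a single interface value per subdomain. In the additive variant of the method, the subdomain solves within a sweep are independent and can run in parallel.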

Obviously, the proper separation of modules, elements and domains dramatically reduces the number of communications between the processor pairs in the system as compared to classical architectures and computational organization in clusters. Naturally, this significantly reduces the number of processor interruptions and total time for information exchange between them.

The same methodology can successfully be applied to solve a completely different task - modeling the dynamics of very large molecules and chains of molecules that may contain a few million to several hundred million atoms.

**Modeling the dynamics of very large molecules**

The molecule is divided into blocks and each block is sent to one computing module. The block is divided into elements, the number of which is equal to the number of computational nodes of the module. In each compute node, the pieces of the molecule are divided into groups of atoms and distributed among the processor cores using a hybrid parallel MPI / OpenMP model. The computational algorithm generates parallel running threads in each core.
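The hierarchy described above can be sketched as a nested split of atom indices: blocks go to modules, elements to nodes, and groups of atoms to cores. The function names and the counts are hypothetical, and splitting by contiguous index is an assumption that stands in for a real spatial decomposition of the molecule.

```python
def split_evenly(items, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def distribute_atoms(atom_ids, modules, nodes_per_module, cores_per_node):
    """Return plan[module][node][core], a list of the atom ids handled by that core."""
    return [
        [
            split_evenly(element, cores_per_node)
            for element in split_evenly(block, nodes_per_module)
        ]
        for block in split_evenly(atom_ids, modules)
    ]


# Hypothetical sizes: 1000 atoms, 4 modules, 5 nodes per module, 10 cores per node.
plan = distribute_atoms(list(range(1000)), modules=4, nodes_per_module=5, cores_per_node=10)
print(len(plan), len(plan[0]), len(plan[0][0]))  # 4 5 10
print(plan[0][0][0])                             # [0, 1, 2, 3, 4]
```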

**Figure 2: Modeling the dynamics of very large molecules - dividing the molecule into blocks.**

**The actual performance of the weakly connected systems is one order of magnitude higher than the actual performance of clusters with the same number of processors.**

The architecture of the weakly connected Avitohol system, which ensures efficient use of its real computational power through three hierarchical levels of parallel calculations, is shown in Figure 3.

**Figure 3: An architecture of the weakly connected Avitohol system.**

The ensemble is divided into modules, and each module is divided into domains. Each domain is sent to a computing module consisting of several nodes, each with several central processors (CPUs) and coprocessors (GPUs). The last column represents the threads for vector and matrix calculations (SIMD: single instruction, multiple data).

Avitohol was designed and manufactured by:

- Prof. Stoyan Markov - Chief Designer
- Dr. Scott Misage - Senior Consultant; Vice President and General Manager of HPC, Hewlett Packard Enterprise, Austin, Texas

**Hewlett Packard Enterprise High Performance Computing Team, Hardware design group**

- Jean Marie Huguenin
- Drasko Tonic
- Volodimir Saviac
- Nathalie Violette

**Intel Europe High Performance Computing Laboratory, System Software Design group**

- Andrey Semin
- Dmitry Nemirov
- Victor Gamayonov

**The supercomputer was manufactured by Hewlett-Packard Bulgaria under the following management team:**

- Iravan Hira - Head
- Ivaylo Stoyanov - Production Leader
- Krasimir Gelev - Chief of Production and Assembling Team

### Handbooks of supercomputer Avitohol: