authors: Libor Bukata, Jan Kůrka, and Přemysl Šůcha

from: Industrial Informatics Research Center

Image:  Intel Xeon Phi Cards (Source: Intel Newsroom)

 

Introduction

Intel Xeon Phi is a coprocessor for high-performance computing based on the Intel Many Integrated Core (MIC) architecture, an x86-compatible multiprocessor architecture (Source: Developer Zone). Intel Xeon Phi has over 50 cores with multiple hardware threads per core and 512-bit SIMD (IMCI-512) instructions. Official Linux support is limited to two distributions: SUSE Linux Enterprise Server (SLES) and Red Hat Enterprise Linux (RHEL). The installation procedure on other distributions is not always straightforward, so we are going to show you how to install Xeon Phi on Gentoo Linux. We used Gentoo Linux with kernel version 3.12.49, Intel Manycore Platform Software Stack (MPSS) 3.5.2, and two Xeon Phi 31S1P coprocessors.

We have built upon the work of Anselm Busse, who managed to run the Xeon Phi on Gentoo Linux. He has published an overlay on his GitHub as well as a brief tutorial. We've extended his overlay by adding new ebuilds (now under review; until they are accepted, download them from here), updating it to the newer MPSS 3.5.2, fixing some dependencies, and embedding the Python 2.7 dependency into the ebuilds so that they behave correctly even when multiple Python versions are installed. The new ebuilds mainly consist of packages supporting offloading. Since Intel supports only SLES and RHEL with a very limited number of kernel versions (2.6, 3.0, 3.10, 3.12), a few patches had to be written to make the ebuilds compilable. The changes usually deal with the OS distribution check or a slightly different kernel API.

This how-to has the following structure. First we show you how to set up the host system and coprocessor, then how to install Intel Parallel Studio XE; after that you should be able to compile code using the Intel compiler (ICC) and use offloading, described in the Offloading chapter. Finally, we present a comparison of the GNU compiler, the Intel compiler (CPU version), and the Intel compiler (offloaded version) on a synthetic micro-benchmark.

Packages to Install

The following MPSS packages need to be installed in order to set up the host system to run and administer the coprocessor.

  • sys-apps/mpss-daemon - Daemon for starting/stopping Xeon Phi coprocessors + the micctrl control utility
  • sys-apps/mpss-micmgmt - Various tools to manage Intel Xeon Phi coprocessors (e.g. miccheck, micinfo, micflash)
  • sys-firmware/mpss-flash - Bootloader and firmware images for flashing Intel Xeon Phi coprocessors
  • sys-kernel/mic-image - Boot image for the Xeon Phi card
  • sys-kernel/mic-rasmm-kernel - RASMM kernel for the Intel Xeon Phi card
  • sys-kernel/mpss-modules - Kernel modules for the host
  • sys-libs/libmicmgmt - C library to access and update Intel Xeon Phi coprocessor parameters
  • sys-libs/libscif - SCIF library for Intel MIC coprocessors
  • sys-libs/mpss-headers - Header files for the Many Integrated Core Architecture
  • sys-devel/mpss-sdk-k1om - SDK for Intel Xeon Phi
  • dev-util/gen-symver-map - Utility for generating maps of symbols (System.map)

The following MPSS packages provide offloading support.

  • sys-libs/mpss-coi - Library for offloading support for Intel Xeon Phi coprocessor
  • sys-libs/mpss-myo - Shared memory library for MPSS stack

The ebuilds for the packages mentioned above will be published in cooperation with Anselm Busse in his Git repository (now download them here).

After Installation

When all the necessary packages are installed, we can load the 'mic' kernel module. The kernel may also provide a 'mic_host' module, but we can't use it; make sure that 'mic_host' is not loaded to avoid possible collisions. You can load the kernel module with the following command:
modprobe mic
It is useful to add the mic module to /etc/conf.d/modules so that it is loaded automatically during the boot process. The relevant snippet of this file can look like this:
modules="nvidia-uvm msr nf_conntrack_ftp mic"
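Before loading the module, you can check /proc/modules to make sure mic_host is absent; a minimal sketch:

```shell
# Refuse to proceed if the conflicting mic_host module is loaded.
if grep -q '^mic_host ' /proc/modules 2>/dev/null; then
    echo "mic_host is loaded; remove it first with: modprobe -r mic_host"
else
    echo "ok to load mic"
fi
```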

In the next sections we will use the micctrl control utility, which allows us to control and administer the coprocessors. Note that the last argument of this tool is a list of Xeon Phi cards, which lets us select only some of the cards installed in the system. This argument is optional; when it is not specified, the command applies to all available Xeon Phi cards. For example
micctrl -s mic0 mic1
will check the status of coprocessors 0 and 1. The coprocessors are denoted "micX", where X stands for the coprocessor number (e.g. mic0); they are numbered from zero.

We can generate default configuration files in /var/mpss
micctrl --initdefaults
and start the daemon.
/etc/init.d/mpss start
The micctrl utility lets us control and configure the coprocessor. First we use
micctrl -s
to check the coprocessor status. The card should be online. If it indicates the ready state, try to boot it:
micctrl -b
Note that booting takes a moment because the Xeon Phi runs a Linux microkernel that has to be loaded to the card from the host system.

After that we need to add users, set up the network, and possibly update the coprocessor flash, as shown in the next sections.

Adding Users

Users can be added to the /etc/passwd and /etc/shadow files on the coprocessor file system with the micctrl --useradd command. The syntax is as follows:
micctrl --useradd=<username> --uid=<uid> --gid=<gid> [--home=<dir>] [--comment=<string>] [--app=<exec>] [--sshkeys=<keyloc>] [MIC list]

It is necessary to specify correct user and group IDs; these can be obtained with the id command. Users should have valid RSA keys in their .ssh directory on the host in order to establish an SSH connection with the coprocessor. To generate an SSH key, use ssh-keygen.

If the SSH keys are not added automatically, or they are stored at a different location, you can use the --sshkeys=<keyloc> switch to specify the path.
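As a sketch, the uid and gid can be taken straight from the id command; the snippet below only prints the resulting invocation (run the printed command as root, substituting the desired username for the current one):

```shell
# Compose a micctrl --useradd invocation for the current host user.
CMD="micctrl --useradd=$(id -un) --uid=$(id -u) --gid=$(id -g)"
echo "${CMD}"
```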

Network Configuration

The mic kernel module has to be loaded before the network setup; see the After Installation chapter. Communication between the host and the coprocessor is done via a virtual TCP/IP network over the PCIe bus. The Static Pair topology is the simplest configuration, usually used in single-host installations:

  • The top two quads have the default value "172.31".
  • The third quad indicates the coprocessor number (0, 1, ...).
  • The last quad: the coprocessor gets "1", the host "254".

More complex configurations are also possible. For example it is possible to connect the card to the Internet. For more information see System Administration for the Intel Xeon Phi Coprocessor and Configuring Intel Xeon Phi coprocessor inside a cluster guides.

Each coprocessor is assigned to a separate subnet; the first three quads, which must match, define the subnet of the particular coprocessor. Example IP addresses for coprocessor number X:

  • Assigned address to the host for communication with coprocessor number X: 172.31.X.254
  • Assigned address to the coprocessor number X for the communication with the host: 172.31.X.1
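The addressing scheme above can be expressed as a tiny shell sketch (X is the coprocessor number; X=0 here):

```shell
# Static Pair addresses for coprocessor number X.
X=0
HOST_IP="172.31.${X}.254"   # host side of the point-to-point link
MIC_IP="172.31.${X}.1"      # coprocessor side
echo "host=${HOST_IP} mic=${MIC_IP}"
```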

Running micctrl --initdefaults does not correctly initialize the micX network interfaces on Gentoo, so this has to be done manually. The host side of the coprocessors is configured in /etc/conf.d/net (note that we don't use the RHEL configuration file /etc/sysconfig/network-scripts/ifcfg-micX). In /etc/conf.d/net you should define a network interface micX for each installed coprocessor. A snippet of the file which defines the network interface for coprocessor number X follows.
config_micX=null # this line ensures that a network manager is not used
config_micX="172.31.X.254 netmask 255.255.255.0"
mtu_micX="64512"
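With several cards installed, these per-card entries can be generated mechanically. The sketch below writes them to a scratch file for review instead of modifying /etc/conf.d/net directly:

```shell
# Generate /etc/conf.d/net entries for two coprocessors into a scratch file.
OUT=$(mktemp)
for X in 0 1; do
    printf 'config_mic%d="172.31.%d.254 netmask 255.255.255.0"\n' "$X" "$X" >> "$OUT"
    printf 'mtu_mic%d="64512"\n' "$X" >> "$OUT"
done
cat "$OUT"   # review, then merge into /etc/conf.d/net
```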


After that, create a symbolic link from net.lo to net.micX.
cd /etc/init.d
ln -s net.lo net.micX


Add the network service to the default runlevel so that OpenRC starts it automatically at boot (to bring the interface up immediately, run /etc/init.d/net.micX start):
rc-update add net.micX default

The coprocessor configuration is located in the file /etc/mpss/micX.conf. There you can assign a hostname to the coprocessor and set its network configuration. The part of the file that we are interested in could look like this:

# Hostname to assign to MIC card
HostName your.domain.com

Network class=StaticPair micip=172.31.X.1 hostip=172.31.X.254 mtu=64512 netbits=24 modhost=yes modcard=yes


After editing this file, run micctrl --resetconfig to apply the changes to the configuration files.

Check with ifconfig that all host network interfaces are correctly configured. Here is a snippet of what you might get:

micX: flags=67<UP,BROADCAST,RUNNING>  mtu 64512
        inet 172.31.X.254  netmask 255.255.255.0  broadcast 172.31.X.255
        inet6 fe80::4e79:baff:fe1c:1b4f  prefixlen 64  scopeid 0x20<link>
        ether 4c:79:ba:1c:1b:4f  txqueuelen 1000  (Ethernet)
        RX packets 6  bytes 468 (468.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 648 (648.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

 

Update Flash 

MPSS has specific firmware requirements, which are stated in the MPSS readme. The micinfo utility provides detailed information about the coprocessor hardware and software and allows us to check whether the flash, SMC firmware, and bootloader versions correspond to the required ones.
For MPSS 3.5.2, the following versions are required:
Flash Version: 2.1.02.0391
SMC Firmware Version: 1.17.6900
SMC Boot Loader Version: 1.8.4326

Before flashing, it is very important to read Flash Issues & Remedies and the MPSS readme. They describe critical combinations of versions that do not allow the standard way of flashing; in rare cases these combinations can even leave a card stuck in a non-operational state.

Image files (with the suffix .rom.smc) are stored in the default location /usr/share/mpss/flash/. If they are elsewhere, you need to pass the path as an argument to the flash tool. A few preconditions should be satisfied before flashing.

  • The currently running flash version must be >= 375; otherwise see Flash Issues & Remedies.
  • Coprocessors must be in the ready state. You can achieve this by running micctrl -rw.

The micflash utility can also check the compatibility of the image file or save the current flash image into a file.

Use this command to update device X:
micflash -update -device X 
When an SMC boot loader update is necessary, use this command to update both the flash and the SMC of device X:
micflash -update -device X -smcbootloader

After the flashing is complete, perform a cold boot.

Installation of Intel Parallel Studio

Intel Parallel Studio XE is available free of charge for students, educators, academic researchers, and open-source contributors. After downloading, the install script install.sh will guide you through the process. For more details, see the Intel Parallel Studio XE Installation Guide. The installer depends on Python 2.7; you can select the correct Python version and run the installer with the following command.
EPYTHON=python2.7 ./install.sh
The installer will warn you that the operating system is unsupported; however, the installation should proceed successfully and the Intel compiler should work. So far we have not noticed any problems with the Intel compiler.

To initialize the environment variables for the Intel Parallel Studio XE tools in the current shell, run
source <install-dir>/bin/compilervars.sh intel64
You can add this command to your ~/.bashrc file to make it permanent for the current user.

For more information see the Intel Parallel Studio XE Installation Guide.

Offloading

The offload programming model refers to running the main program on the host and offloading heavy work to the coprocessor. This model is suitable for programs in which many parts are both parallelizable and vectorizable, while the sequential part benefits from a complex CPU core (e.g. out-of-order execution, bigger caches). When designing a parallel algorithm, the communication overhead of the data transfers between the host and the coprocessor should be taken into account to develop efficient algorithms.

On each coprocessor there is an automatically created special user called 'micuser', which executes the work offloaded to the coprocessor.

For more information see Offload Compiler Runtime for the Intel Xeon Phi Coprocessor and Native and Offload Programming Models.

Benchmark

We have created a synthetic micro-benchmark to test our system and to make some comparisons. Since version 2.22, the GNU C library (glibc) includes libmvec, a vector math library which contains vector variants of scalar math functions implemented using SIMD instructions (e.g. SSE or AVX on the 64-bit x86 platform). Therefore, we were able to compare the performance with and without vectorization using the GNU Compiler (GCC). The second objective was to compare the offloaded and CPU versions compiled with the Intel Compiler (ICPC). The third objective was to compare ICPC and GCC in terms of vectorization and code optimization for the CPU only.

The benchmark randomly initializes an array a of length N and performs a specified number of attempts A to compute sin^2(a_i) + cos^2(a_i), i = 0, ..., N-1. The input is N, the size of the array, and A, the number of trials (repetitions of the computation). A snippet of the benchmark source code follows; the whole source code is available for download from this link.

#pragma omp parallel
...
#pragma omp for
for (uint32_t j = 0; j < A; ++j){
    #pragma omp simd aligned(a,b:AVX_ALIGNMENT)
    for (uint32_t i = 0; i < N; ++i) {
        // note: suffix 'f' to function names for floats to enable vectorization
        b[i] = powf(sinf(a[i]), 2.0f);
        b[i] += powf(cosf(a[i]), 2.0f);
    }
}


The following optimization flags were used:
g++ -march=native -ffast-math -fopenmp -O3 -std=c++11
icpc -march=native -openmp -O3 -std=c++11

The number of attempts distinguishes the small and heavy workloads. The hardware configuration was as follows: two Intel Xeon E5-2620 CPUs and two Xeon Phi 31S1P coprocessors, with 64 GB of RAM.
 

                              GCC 4.9.3          GCC 4.9.3       ICPC 16.0.0  ICPC 16.0.0  ICPC 16.0.0  ICPC 16.0.0
                              (no vectorization) (vectorization) (CPU)        (offloaded)  (CPU)        (offloaded)
Number of attempts A          2^20               2^20            2^20         2^20         2^30         2^30
Array size N                  2^10               2^10            2^10         2^10         2^10         2^10
Average time over 10 runs     2.310 s            0.602 s         0.381 s      0.686 s      355 s        39.6 s


From the table it is obvious that, for a small workload, the offloading overhead significantly reduces the speedup, resulting in a worse time than the pure CPU version compiled by ICPC. However, for the heavy workload the offloaded version is almost 9 times faster than the ICPC CPU version. Comparing ICPC with GCC with vectorization enabled, ICPC is around 1.6 times faster on this benchmark. Focusing on GCC, vectorization improved performance 3.8 times.
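The quoted speedups follow directly from the measured times in the table:

```shell
# Ratios from the table: offload vs ICPC CPU (heavy workload),
# GCC vectorized vs non-vectorized, and ICPC vs vectorized GCC (CPU).
awk 'BEGIN { printf "%.1f %.1f %.1f\n", 355/39.6, 2.310/0.602, 0.602/0.381 }'
# prints: 9.0 3.8 1.6
```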

Conclusion

We have successfully installed and used Intel Xeon Phi coprocessors on a host system running Gentoo Linux. In this how-to, we described the installation and configuration of both the host system and the coprocessor, since Intel officially supports only two enterprise Linux distributions (SLES and RHEL) and installing Intel Xeon Phi on other Linux distributions is usually not straightforward. The main obstacles to overcome were package incompatibilities, their dependencies, and the network setup between the host and the coprocessor. To install the packages we had to patch some of them and create or modify ebuilds, which will be uploaded to GitHub in cooperation with Anselm Busse (in review; for now, download them here). We also installed Intel Parallel Studio XE, which includes the Intel Compiler (ICPC) needed to compile code for the Intel Xeon Phi. That allowed us to run our synthetic micro-benchmark, on which we demonstrated the performance of the coprocessors in comparison to the CPU. When offloading was used for the heavy workload, it was almost 9 times faster than the CPU version; however, a smaller amount of work resulted in a huge performance regression. We also tested the GNU Compiler (GCC), which has recently gained support for vectorization of the math functions. Enabling vectorization made GCC 3.8 times faster compared to the version without vectorization, but it was still around 1.6 times slower than ICPC on the CPU on our benchmark.
 

References:

Intel Newsroom: Intel Delivers New Architecture for Discovery with Intel® Xeon Phi™ Coprocessors. Intel Newsroom [online]. [cit. 2016-02-16]. Available from: https://newsroom.intel.com/press-kits/intel-xeon-phi-coprocessor-5110p30...

Developer Zone: Intel Xeon Phi X100 Family Coprocessor - the Architecture. Developer Zone [online]. [cit. 2016-02-16]. Available from: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-cod...