# CHAPTER 1: Multicore Systems and Multi-Threading

Now that our Raspberry Pi is set up, let’s learn a bit more about the type of processor that is on it, and what it means for parallelism.

Type the following command to learn more about the Raspberry Pi CPU (it doesn’t matter what directory you are in):

```
lscpu
```


Similar to the ls command, the lscpu command lists information; in this case, the information is specifically about the system’s CPU architecture.

Below is some sample output from a Raspberry Pi 4:

```
Architecture:        armv7l
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
```


The version of Raspbian that the CSinParallel image is built on is a 32-bit OS. Therefore, even though the physical chip supports 64-bit instructions, the instruction set architecture associated with the image is ARMv7 (32-bit ARM). A unique feature of ARM is its support for both Big Endian and Little Endian byte ordering; however, the default byte ordering on ARM devices is Little Endian (as you can see above).

Note

The CSinParallel image is built using Raspbian, which historically was the default operating system for the Raspberry Pi. However, several 64-bit operating systems are available for the Pi. Most notably, Ubuntu Mate offers a full 64-bit environment for the Raspberry Pi 2, 3, and 3B+. A 64-bit version of Ubuntu Server (which does not include a desktop interface) is available for the Raspberry Pi 4. In June 2020, the Raspberry Pi Foundation announced the release of Raspberry Pi OS, which is the new default operating system for the Raspberry Pi. As part of the announcement, the Raspberry Pi Foundation released a beta 64-bit version of Raspberry Pi OS. All of this is cutting-edge stuff; we anticipate full support for 64-bit operations soon!

## 1.1 Cores, Processes and Threads

The lscpu command tells us a LOT of useful information, including the number of available cores. In this case we know that there is one socket (or chip) with 4 physical cores, where each core can support 1 thread. On larger systems, it is common to see multiple threads supported per core. This is an example of simultaneous multi-threading (SMT, or Hyper-Threading on Intel systems). In the case of the Raspberry Pi, each core supports only one thread. Running the top command and pressing the 1 key will confirm that the Pi supports 4 total cores. A core can be thought of as the compute unit of the CPU. It includes registers, an ALU, and a control unit.
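As a quick cross-check, the `nproc` command (part of GNU coreutils, so available on Raspbian) reports the same core count, as does counting the `processor` entries in `/proc/cpuinfo`:

```shell
# print the number of logical cores the OS can schedule onto
nproc

# the same figure, counted from the per-core entries in /proc/cpuinfo
grep -c ^processor /proc/cpuinfo
```

On the Raspberry Pi 4 shown above, both commands print `4`.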

Before we can discuss what a thread is, we must first discuss what a process is. A process can be thought of as an abstraction of a running program. When you type a command into the command line and press Enter, the Bash shell launches a process associated with that program executable. Each process contains a copy of the code and data of the program executable, and its own allocation of the stack and heap.

A thread is a light-weight process. While each thread gets its own stack allocation, it shares the heap, code and data of the parent process. As a result, all the threads in a multi-threaded process can access a common pool of memory. This is why multi-threading is commonly referred to as shared memory programming. A single-threaded process is also referred to as a serial process or program.

## 1.2 How the CPU Executes Threads and Processes

A single CPU core is capable of executing exactly one thread or process (collectively called tasks) at any given time. Before multicore CPUs became commercialized in the mid-2000s, all CPUs contained one compute core. However, anyone who used computers prior to that time can remember running multiple applications on their computers seemingly simultaneously. How did that work?

To create the illusion of multiple programs running simultaneously, the operating system employs a clever strategy known as concurrency, where it interleaves the execution of multiple processes. Consider the scenario where there are four processes (P1, P2, P3, P4). An operating system may choose to run P1 for a period of time before switching over to executing P2, then P3, and so on. The CPU’s act of switching between tasks is known as a context switch. It is also important to note that the operating system chooses the order in which to execute tasks; this is not something the user can typically control.

### Process Execution

A multicore CPU allows multiple processes to execute simultaneously, or in parallel. While the terms concurrency and parallelism are related, it is useful to think of concurrency as a software/OS-level concept, and parallelism as a hardware/execution concept. A multi-threaded program, while capable of parallel execution, runs merely concurrently on a system with only a single CPU core.

The primary goal of creating multi-threaded programs is to decrease a program’s running time. In a program that is perfectly parallelizable (that is, all of its components are parallelizable), it is usually possible to distribute the work equally among all the threads. A program that takes time $$p$$ to run serially, with its work distributed equally among $$t$$ threads, will take roughly $$p/t$$ time when executed on $$t$$ cores. For example, a perfectly parallelizable program that runs in 60 seconds serially should take roughly $$60/4 = 15$$ seconds with 4 threads on 4 cores.

## 1.3 Leveraging Multiple Cores

While multicore processors are ubiquitous in today’s world, most popular programming languages were designed to support single-threaded execution. However, several libraries are available for supporting multi-threading in popular languages like C/C++ and Fortran.

The rest of this tutorial will discuss Open Multi-Processing (OpenMP), a popular API for shared memory programming, and a standard since 1997. A key benefit of OpenMP over explicit threading libraries like POSIX threads is the ability to incrementally add parallelism to a program. With explicit threading libraries, it is usually necessary to write a lot of extra code to add multi-threading to a program. Instead, OpenMP employs a series of pragmas, or special compiler directives, that tell the compiler how to parallelize the code.