Multithreading Architectures

Superscalar architectures

A scalar architecture issues a single instruction per clock cycle, while a superscalar architecture can issue and execute multiple instructions per clock cycle. Remember that ILP $\neq$ parallel processing:

ILP improves throughput by pipelining and overlapping operations, transparently to the programmer.
Parallel processing is a non-user-transparent way of executing programs: the parallelism has to be exposed explicitly.


Superscalar architectures have limitations. ILP has an upper bound:

Dynamic scheduling is expensive and difficult to design and verify.
Register renaming has its limits.
Jump and branch predictions are not always accurate.
Memory latency is high compared with the clock cycle, so cache misses stall dependent instructions.
Hazards prevent too many instructions from being issued simultaneously.
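
To make the upper bound concrete, here is a minimal sketch that estimates the best possible ILP of a tiny program from its data dependencies alone. It assumes unit latency for every instruction, unlimited issue width and perfect branch prediction. The function name `ilp_upper_bound` and the example program are hypothetical, chosen only for illustration.

```python
def ilp_upper_bound(deps):
    """Return (number of instructions) / (critical-path length).

    deps maps each instruction index to the indices it depends on.
    Assumptions: dependencies point to earlier instructions only,
    and every instruction takes exactly one cycle.
    """
    depth = {}
    for i in sorted(deps):
        # An instruction can start one cycle after its latest producer.
        depth[i] = 1 + max((depth[d] for d in deps[i]), default=0)
    return len(deps) / max(depth.values())


# i0: a = x + y   i1: b = a * 2   i2: c = z - 1   i3: d = b + c
program = {0: [], 1: [0], 2: [], 3: [1, 2]}
print(ilp_upper_bound(program))  # 4 instructions over a 3-step chain
```

Even with unlimited hardware, the chain i0, i1, i3 forces three steps, so this program cannot average more than 4/3 instructions per cycle. Widening the issue logic does not help once the dependency chain is the bottleneck.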

Temporal Multithreading

Using temporal multithreading, it is possible to further increase parallelism by alternately switching between tasks. Temporal multithreading has a high hardware cost, because each thread needs its own context (such as its own program counter and registers), and there are different techniques to achieve it:

Fine-grained multithreading switches between threads at a very fine level of granularity, such as after every instruction. This allows for maximum utilization of resources but can also lead to increased overhead due to frequent context switching.
Coarse-grained multithreading switches between threads at a coarser level of granularity, such as after completing a group of instructions from one thread before moving on to another thread. This reduces the overhead associated with context switching but may not fully utilize available resources.
Simultaneous multithreading (SMT) is similar to fine-grained multithreading but goes further by allowing instructions from different threads to be issued within the same clock cycle.
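
The three policies can be compared with a toy issue-slot simulator. This is only a sketch under stated assumptions, not a model of any real core: the issue width of 4, the one-cycle switch penalty, and the thread encoding are all hypothetical. Each thread is a list in which a positive number `n` means "n independent instructions that may issue together" and a negative number `-k` means "stall for k cycles" (for example a cache miss). Each thread is assumed to start with a positive entry.

```python
from collections import deque


def simulate(threads, policy, width=4, switch_penalty=1):
    """Return (cycles, instructions issued) for a toy multithreaded core."""
    queues = [deque(t) for t in threads]
    n = len(queues)
    ready_at = [0] * n          # first cycle in which each thread may issue
    cycle = issued_total = 0
    current, last = 0, -1       # coarse: running thread; fine: last issuer

    def issue(i, slots, now):
        """Issue up to `slots` instructions from the head bundle of thread i."""
        q = queues[i]
        take = min(slots, q[0])
        q[0] -= take
        if q[0] == 0:
            q.popleft()
            if q and q[0] < 0:  # a stall follows the bundle just issued
                ready_at[i] = now + 1 + (-q.popleft())
        return take

    while any(queues):
        ready = [i for i in range(n) if queues[i] and ready_at[i] <= cycle]
        if policy == "smt":
            # Fill the issue slots from every ready thread in the same cycle.
            slots = width
            for i in ready:
                if slots:
                    slots -= issue(i, slots, cycle)
            issued_total += width - slots
        elif policy == "fine":
            # One thread per cycle, chosen round-robin among ready threads.
            if ready:
                tid = min(ready, key=lambda i: (i - last - 1) % n)
                issued_total += issue(tid, width, cycle)
                last = tid
        elif policy == "coarse":
            # Keep the current thread until it stalls, then pay a flush.
            if ready:
                if current not in ready:
                    current = ready[0]
                    cycle += switch_penalty
                issued_total += issue(current, width, cycle)
        else:
            raise ValueError(f"unknown policy: {policy}")
        cycle += 1
    return cycle, issued_total


threads = [[2, 1, -3, 2], [1, 2, 2]]   # hypothetical workloads, 5 instructions each
for policy in ("coarse", "fine", "smt"):
    cycles, issued = simulate(threads, policy)
    print(policy, cycles, issued, f"utilization={issued / (cycles * 4):.2f}")
```

With these two made-up threads, the coarse-grained policy needs 8 cycles, the fine-grained one 7 and SMT 6 to issue the same 10 instructions. The numbers depend entirely on the chosen workload and penalty, so they illustrate the mechanism rather than measure anything.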


There are drawbacks in each approach:

Fine-grained multithreading leaves issue slots empty, because only one thread can issue in a given cycle, causing idle time.
Coarse-grained multithreading is not commonly used, since it requires flushing the pipeline on every switch and can result in starvation.
SMT requires more complex hardware than either fine- or coarse-grained multithreading but can provide higher levels of parallelism and better resource utilization.


ARM Cortex-A53 pipeline example

What actually happens in the real world? Different kinds of processors are combined to get the most performance. For example, a CPU alone is not enough for tasks such as playing games in 4K or streaming on Twitch, so GPUs are a common addition to systems. This is called a heterogeneous architecture, since it combines different types of processors. An example of heterogeneous architecture is the big.LITTLE architecture. The Cortex-A53 (also the Raspberry Pi 3 CPU) can be used alone as an energy-efficient alternative to the Cortex-A57, or in a big.LITTLE configuration alongside a more powerful microarchitecture. A big.LITTLE configuration is multicore and heterogeneous with a shared ISA, combining battery-efficient (LITTLE) and power-hungry (big) cores. Typically, only one side is active at a time, but all cores can access the same memory.
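
The "only one side active" behaviour can be sketched as a simple switching rule with hysteresis. This is a hypothetical illustration, not the policy of any real scheduler: the function name `choose_cluster` and the thresholds `up` and `down` are assumptions made for the example.

```python
def choose_cluster(loads, up=0.8, down=0.3, start="LITTLE"):
    """Return which cluster is active for each load sample (0.0 to 1.0).

    Hypothetical rule: move to the big cores when load exceeds `up`,
    move back to the LITTLE cores when load drops below `down`.
    """
    active = start
    history = []
    for load in loads:
        if active == "LITTLE" and load > up:
            active = "big"
        elif active == "big" and load < down:
            active = "LITTLE"
        history.append(active)
    return history


print(choose_cluster([0.1, 0.5, 0.9, 0.6, 0.4, 0.2, 0.1]))
```

The gap between the two thresholds keeps the system from bouncing between clusters when the load hovers around a single cut-off. Because both clusters share the ISA and the memory, the same task can continue on either side.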
