# Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing

**Changwoo Min**, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim

Virginia Tech, eBay, Georgia Tech, Chonbuk National University

April 26, 2018

### Cambrian Explosion of Processor Architecture

#### Intel Unveils Plans for Knights Mill, a Xeon Phi for Deep Learning

Michael Feldman | August 18, 2016 01:33 CEST

At the Intel Developer Forum (IDF) this week in San Francisco, Intel revealed it is working on a new Xeon Phi processor aimed at deep learning applications. Diane Bryant, executive VP and GM of Intel's Data Center Group, unveiled the new chip, known as Knights Mill, during her IDF keynote address on Wednesday.



Specialization of general-purpose processors




#### Nvidia expands new GPU cloud to HPC applications

With more than 500 high-performance computing applications that incorporate GPU acceleration, Nvidia is aiming to make them easier to access

Stephanie Condon for Between the Lines | November 13, 2017 -- 23:00 GMT (15:00 PST) | Topic: Processors






Touting the large number of high-performance computing (HPC) applications that incorporate GPU acceleration, Nvidia on Monday announced new software and tools on the Nvidia GPU Cloud (NGC) container registry that allow scientists to quickly deploy scientific computing applications and HPC visualization tools.

#### Generalization of co-processors




#### Intel Announces 'Nervana' Neural Network Processor

By Ryan Whitwam on October 18, 2017 at 7:30 am


Science-fiction authors and modern technology mega-corporations agree on one thing: artificial intelligence is the future. Everyone from Google to Facebook is designing artificial neural networks to tackle big problems like computer vision and speech synthesis. Most of these projects are using existing computer hardware, but Intel has something big on the way. The chip maker has announced the first dedicated neural network processor, the Intel Nervana Neural Network Processor (NNP).

### Specialization of co-processors

Changwoo Min

Solros: Data-Centric OS

April 26, 2018 2 / 21

## Blazingly fast IO Devices

#### Intel's new Optane memory drives will turn your PC into a beast

Jamie McKane | 10 April 2017





Intel's cutting-edge Optane memory boasts massive performance increases over solid state drives.

#### Blazingly fast storage/memory




#### November 9, 2017

In the run-up to the annual supercomputing conference SC17 next week in Denver, Mellanox made a series of announcements today, including a scalable switch platform based on its HDR 200G InfiniBand technology and the first deployment of a 100Gb/s Linux kernel-based Ethernet switch.

The company touts its HDR (High Data Rate) 200G InfiniBand Quantum, which offers up to 800 ports of 200Gb/s or 1,600 ports of 100Gb/s in one chassis, as the most scalable switch platform available.

The platform family includes:

- Quantum QM8700: 40-port 200Gb/s or 80-port 100Gb/s
- Quantum CS8510: modular 200-port 200Gb/s or 400-port 100Gb/s
- Quantum CS8500: modular 800-port 200Gb/s or 1,600-port 100Gb/s

Mellanox said the Quantum product line's switch density will enable space and power consumption optimization, reducing network equipment cost by 4X, electricity costs by 2X and improving data transfer time by 2X.

#### Blazingly fast network






How to exploit the full potential of such hardware devices without pain?

- System-wide performance
- Ease of programming


### 1 Heterogeneous Computing Architectures

### 2 Solros: Split-Kernel Approach

- Solros Architecture
- Operating System Services






### Host-Centric Approach

- Host OS controls co-processors and IO devices
- Examples: OpenCL, CUDA



#### Problem

- Redundant data communication
- Complex to program and hard to optimize


### Co-Processor-Centric Approach

- Co-processors control IO devices
- Examples: Xeon Phi (Linux), GPUfs [ASPLOS'13], GPUnet [OSDI'14]






#### Problem

- Significant effort required to port the IO stack to the co-processor
- Does not fully exploit powerful host processors


### 1 Heterogeneous Computing Architectures

### 2 Solros: Split-Kernel Approach

- Solros Architecture
- Operating System Services



- Ease of programming
- Best use of processor architecture
- System-wide optimization


### Challenge

- Co-processor needs IO abstraction
- IO stack is branch-divergent and difficult to parallelize
- It needs system-wide information

#### Split-Kernel Architecture

#### Data-plane OS

- Runs on a co-processor
- Provides IO abstraction
- Delegates actual IO operations to a control-plane OS

#### Control-plane OS

- Runs on a host processor
- Runs actual IO stack
- Performs system-wide coordination
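The division of labor above can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the names `ReadRequest`, `DataPlaneStub`, and `ControlPlaneProxy` are hypothetical, and a direct method call stands in for the transport between co-processor and host. It is not the Solros implementation, only the delegation pattern the slide describes.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class ReadRequest:
    """Hypothetical message a data-plane stub sends to the control plane."""
    path: str
    offset: int
    length: int


class ControlPlaneProxy:
    """Stands in for the host side: it runs the actual IO stack."""

    def handle(self, req: ReadRequest) -> bytes:
        with open(req.path, "rb") as f:
            f.seek(req.offset)
            return f.read(req.length)


class DataPlaneStub:
    """Stands in for the co-processor side: IO abstraction only, no IO stack."""

    def __init__(self, proxy: ControlPlaneProxy):
        # Assumption: a real system would use a transport over PCIe here,
        # not an in-process object reference.
        self._proxy = proxy

    def read(self, path: str, offset: int, length: int) -> bytes:
        # The stub only packages the request and delegates the real work.
        return self._proxy.handle(ReadRequest(path, offset, length))


def demo() -> bytes:
    """Write a small file, then read part of it through the stub."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello solros")
        path = f.name
    try:
        stub = DataPlaneStub(ControlPlaneProxy())
        return stub.read(path, offset=6, length=6)
    finally:
        os.remove(path)
```

The application on the co-processor sees an ordinary `read` call, while all file system logic lives in one place on the host.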

- Control-plane OS: actual OS service + system-wide coordination
- Data-plane OS: thin communication layer to host processor






- Co-processor has OS abstraction with minimal effort
- Best use of each of the fat and lean processors
- Efficient global coordination among devices (policy)


1. Transport service
2. Filesystem service
3. Network service

### Transport Service

High-performance data transfer among devices is challenging:

- Uniform data transfer among devices
- High contention in massively-parallel co-processor
- Asymmetric performance between host processor and co-processor


### Our approach

- Uniform data transfer ⇒ system-mapped PCIe window
- High contention ⇒ combining, replication, interleaving, etc.
- Asymmetric performance ⇒ flexibly configurable (host DMA engine vs. co-processor DMA engine)
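One way to picture the "flexibly configurable" DMA choice is a per-direction setting selected from measurements. The sketch below is an assumption for illustration only: `Engine`, `Direction`, `TransportConfig`, and `configure` are hypothetical names, the selection rule (highest measured throughput wins) is not taken from the paper, and the numbers in `fake_measure` are arbitrary placeholders rather than results.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Engine(Enum):
    HOST_DMA = "host DMA engine"
    COPROC_DMA = "co-processor DMA engine"


class Direction(Enum):
    HOST_TO_COPROC = "host -> co-processor"
    COPROC_TO_HOST = "co-processor -> host"


@dataclass(frozen=True)
class TransportConfig:
    """Hypothetical per-direction choice of which side drives the DMA."""
    engine_for: Dict[Direction, Engine]


def configure(measure: Callable[[Engine, Direction], float]) -> TransportConfig:
    """Pick, for each direction, the engine with the highest measured throughput."""
    choice = {
        direction: max(Engine, key=lambda engine: measure(engine, direction))
        for direction in Direction
    }
    return TransportConfig(engine_for=choice)


def fake_measure(engine: Engine, direction: Direction) -> float:
    # Placeholder values in arbitrary units, NOT measurements from the paper.
    table = {
        (Engine.HOST_DMA, Direction.HOST_TO_COPROC): 2.0,
        (Engine.COPROC_DMA, Direction.HOST_TO_COPROC): 1.0,
        (Engine.HOST_DMA, Direction.COPROC_TO_HOST): 1.0,
        (Engine.COPROC_DMA, Direction.COPROC_TO_HOST): 3.0,
    }
    return table[(engine, direction)]


cfg = configure(fake_measure)
```

Keeping the choice in a configuration object, rather than hard-coding one initiator, is what lets the same transport code adapt to hardware where host and co-processor DMA engines perform differently.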


#### See details in the paper

### Filesystem Service

- Peer-to-peer operation
- Buffered operation






### Implementation

- Host: 2-socket Xeon processor (12 cores each)
- Co-processor: 4 Xeon Phi (KNC, 61 cores, Linux, PCIe Gen 3x16)
- Storage device: 4 NVMe SSD
- NIC: 100 Gbps Ethernet

| Module              | Component | Added lines | Deleted lines |
|---------------------|-----------|-------------|---------------|
| Transport service   |           | 1,035       | 365           |
| File system service | Stub      | 5,957       | 2,073         |
|                     | Proxy     | 2,338       | 124           |
| Network service     | Stub      | 2,921       | 79            |
|                     | Proxy     | 5,609       | 34            |
| NVMe device driver  |           | 924         | 25            |
| SCIF kernel module  |           | 60          | 14            |
| Total               |           | 18,844      | 2,714         |

Questions:

- Performance of Solros services
- Impact on real-world applications

### Performance of Solros Services



- File IO performance: 19x faster than stock Linux on Xeon Phi
- TCP latency (99th percentile): 7x shorter than stock Linux on Xeon Phi




- Significant performance gain in data transport
- Running IO stack on co-processor is slower


### Real-world Application - Image Search

- Image search engine is running on Xeon Phi
- Image database is on NVMe SSD (shared read-only)
- Image search queries are from network



Solros performs 2x faster than stock Linux on Xeon Phi

- Increase the number of Xeon Phi co-processors to 4

Solros load balancing mechanism achieves near-linear scaling

- Solros, a new operating system architecture for co-processors and fast IO devices
- The control-plane/data-plane architecture allows:
  - Supporting high-level OS abstractions on co-processors
  - Efficient global coordination among devices
  - Near-ideal IO performance from co-processors
- We will release source code soon

- Control-plane/data-plane OS: Arrakis [OSDI'14], IX [OSDI'14]
- OS for heterogeneous systems: Helios [SOSP'09], M3 [ASPLOS'16], Hydra [ASPLOS'08]
- IO support for GPU: PTask [SOSP'11], GPUfs [ASPLOS'13], GPUnet [OSDI'14]



### Network Service (TCP)

- Outbound operation
- Inbound operation



A load balancer on the host distributes incoming TCP connections to one of the least-loaded co-processors.
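A least-loaded placement policy of this kind can be sketched in a few lines of Python. The sketch assumes that "load" means the number of active connections per co-processor, which the slide does not specify; `LoadBalancer`, `assign`, and `release` are hypothetical names, not Solros APIs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LoadBalancer:
    """Host-side sketch: track active connections per co-processor."""
    active: Dict[str, int] = field(default_factory=dict)

    def add_coprocessor(self, name: str) -> None:
        self.active.setdefault(name, 0)

    def assign(self) -> str:
        """Place a new incoming connection on a least-loaded co-processor."""
        if not self.active:
            raise RuntimeError("no co-processor registered")
        # Ties are broken by registration order (dict insertion order).
        target = min(self.active, key=self.active.get)
        self.active[target] += 1
        return target

    def release(self, name: str) -> None:
        """Record that a connection on the given co-processor has closed."""
        if self.active.get(name, 0) > 0:
            self.active[name] -= 1


lb = LoadBalancer()
for name in ("phi0", "phi1"):
    lb.add_coprocessor(name)
placements = [lb.assign() for _ in range(3)]
```

Because the host sees every connection, it can keep this bookkeeping in one place, which is the kind of system-wide coordination a single co-processor could not do on its own.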

See details in the paper.

- Hardware support other than Xeon Phi
  - Two atomic instructions: transport service
  - MMU: isolation among co-processor applications
- Scalability of control-plane OS
  - Limited by scalability of OS service, PCIe interconnect, and performance of IO devices

### Real-world Application - Text Search

- CLucene text indexing engine running on Xeon Phi
- Text data is on NVMe SSD



Solros performs 19x faster than stock Linux (ext4/virtio) on Xeon Phi