Heterogeneous GPU reallocation

Wednesday July 5th, 1-2PM @ BA5205

Speaker: James Gleeson

Title:
Heterogeneous GPU reallocation

Abstract:
Emerging cloud markets like spot markets and batch computing services scale up services at the granularity of whole VMs. In this paper, we observe that GPU workloads underutilize GPU device memory, leading us to explore the benefits of reallocating heterogeneous GPUs within existing VMs. We outline approaches for upgrading and downgrading GPUs for OpenCL GPGPU workloads, and show how to minimize the chance of cloud operator VM termination by maximizing the heterogeneous environments in which applications can run.

Bio:
James is a PhD student under Prof. Eyal de Lara.  He has done research in mobile security, covering both physical and software attacks on Android phones.  His current research interests are in heterogeneous computing in data centers.

Crane: Fast and Migratable GPU Passthrough for OpenCL applications

Wednesday May 17th, 12-1PM @ BA5205

Speaker: James Gleeson

Title:
Crane: Fast and Migratable GPU Passthrough for OpenCL applications

Abstract:
General purpose GPU (GPGPU) computing in virtualized environments leverages PCI passthrough to achieve GPU performance comparable to bare-metal execution. However, GPU passthrough prevents service administrators from performing virtual machine migration between physical hosts.
Crane is a new technique for virtualizing OpenCL-based GPGPU computing that achieves within 5.25% of passthrough GPU performance while supporting VM migration. Crane interposes a virtualization-aware OpenCL library that makes it possible to reclaim and subsequently reassign physical GPUs to a VM without terminating the guest or its applications. Crane also enables continued GPU operation while the VM is undergoing live migration by transparently switching between GPU passthrough operation and API remoting.
 

Bio:
James is a PhD student under Prof. Eyal de Lara.  He has done research in mobile security, covering both physical and software attacks on Android phones.  His current research interests are in heterogeneous computing in data centers.

The Game of Twenty Questions: Do You Know Where to Log?

Thursday May 4th, 12-1PM @ BA5205

Speaker: Xu Zhao

Title:
The Game of Twenty Questions: Do You Know Where to Log?

Abstract:
A production system’s printed logs are often the only source of runtime information available for postmortem debugging, performance profiling, security auditing, and user behavior analytics.  Therefore, the quality of this data is critically important. Recent work has attempted to enhance log quality by recording additional variable values, but logging statement placement, i.e., where to place a logging statement, which is the most challenging and fundamental problem for improving log quality, has not been adequately addressed. This position paper proposes automating the placement of logging statements by measuring how much uncertainty about a program's execution can be eliminated. Guided by ideas from information theory, the authors describe a simple approach that automates logging statement placement. Preliminary results suggest that the algorithm can effectively cover, and further improve on, the logging statements placed by developers, and that it can compute an optimal placement that disambiguates the entire function call path with only 0.218% slowdown.
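
The information-theoretic idea in the abstract can be sketched in a few lines: treat each possible execution path as a hypothesis, and greedily pick logging points whose presence or absence in the log removes the most uncertainty about which path ran. The sketch below is illustrative only; the path sets, block names, and greedy criterion are our assumptions, not the paper's algorithm.

```python
import math

# Each candidate execution path is the set of basic blocks it traverses.
# Logging a block tells us whether that block executed, which splits the
# set of still-possible paths into "hit" and "miss" groups.
paths = {
    "p1": {"A", "B", "D"},
    "p2": {"A", "C", "D"},
    "p3": {"A", "C", "E"},
}

def entropy(groups):
    """Entropy of the partition induced by the logging decisions so far."""
    total = sum(len(g) for g in groups)
    return -sum(len(g) / total * math.log2(len(g) / total) for g in groups)

def split(groups, block):
    """Refine every group by whether each path contains `block`."""
    out = []
    for g in groups:
        hit = [p for p in g if block in paths[p]]
        miss = [p for p in g if block not in paths[p]]
        out += [x for x in (hit, miss) if x]
    return out

blocks = set().union(*paths.values())
groups, chosen = [list(paths)], []
while max(len(g) for g in groups) > 1:
    # Greedily log the block whose outcome removes the most uncertainty.
    best = max(blocks - set(chosen), key=lambda b: entropy(split(groups, b)))
    chosen.append(best)
    groups = split(groups, best)

print(chosen)  # a small set of blocks whose log output identifies the path
```

With these three toy paths, two well-chosen logging points suffice to tell every path apart.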

Bio:
Xu Zhao is a 2nd year PhD student at the University of Toronto, under the supervision of Prof. Ding Yuan. His research interests lie in the area of performance of distributed systems and failure diagnosis. His current work focuses on automated placement of logging statements and non-intrusive performance profiling for distributed systems.

Challenges and Solutions to Secure Internet Geolocation

Wednesday May 3rd, 12-1PM @ BA5205

Speaker: AbdelRahman Abdou

Title:
Challenges and Solutions to Secure Internet Geolocation

Abstract:
The number of security-sensitive location-aware services over the Internet continues to grow; examples include location-aware authentication, location-aware access policies, fraud prevention, compliance with media licensing, and the regulation of online gambling and voting.
An adversary can evade existing geolocation techniques, e.g., by faking GPS coordinates or employing a non-local IP address through proxy and virtual private networks. In this talk, I will present parts of my PhD work, including Client Presence Verification (CPV), which is a measurement-based technique designed to verify an assertion about a device’s presence inside a prescribed geographic region. CPV does not identify devices by their IP addresses. Rather, the device’s location is corroborated in a novel way by leveraging geometric properties of triangles, which prevents an adversary from manipulating network delays to its favor. To achieve high accuracy, CPV mitigates Internet path asymmetry using a novel method to deduce one-way application-layer delays to/from the client’s participating device, and mines these delays for evidence supporting/refuting the asserted location. I will present CPV’s evaluation results, including the granularity of the verified location and the verification time, and summarize some lessons we learned throughout the process.
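
The geometric intuition behind CPV's triangular verification can be illustrated with a plain point-in-triangle test. The real protocol operates on one-way delays measured by three verifiers rather than on coordinates, so everything below, including the coordinates and function names, is an illustrative assumption:

```python
# Three verifiers form a triangle; an asserted client position is accepted
# only if it falls inside that triangle. Intuitively, an adversary outside
# the region cannot fake membership by inflating delays, since delays can
# be added but not removed.

def sign(p, a, b):
    # Signed area test: which side of segment a->b the point p lies on.
    return (p[0] - b[0]) * (a[1] - b[1]) - (a[0] - b[0]) * (p[1] - b[1])

def inside_triangle(p, v1, v2, v3):
    d1, d2, d3 = sign(p, v1, v2), sign(p, v2, v3), sign(p, v3, v1)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)  # all on one side => inside

verifiers = ((0.0, 0.0), (10.0, 0.0), (5.0, 8.0))
print(inside_triangle((5.0, 3.0), *verifiers))   # True: inside the region
print(inside_triangle((20.0, 3.0), *verifiers))  # False: outside it
```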

Bio:
AbdelRahman Abdou is a Post-Doctoral Fellow in the School of Computer Science at Carleton University. He received his PhD (2015) in Systems and Computer Engineering from Carleton University. His research interests include location-aware security, SDN security, authentication, SSL/TLS and using Internet measurements to solve problems related to Internet security.

Consistency Oracle

Friday April 28th, 1-2PM @ BA5205

Speaker: Beom Heyn Kim

Title:
Consistency Oracle

Abstract:
Many modern distributed storage systems emphasize availability and partition tolerance over consistency, leading to many systems that provide weak data consistency. However, weak data consistency is difficult for both system designers and users to reason about. Formal specifications may offer precise descriptions of consistency behavior, but they are difficult to use and usually require expertise beyond that of the average software developer. In this paper, we propose and describe the consistency oracle, a novel instantiation of a formal specification. A consistency oracle takes the same interface calls as a distributed storage system, but returns all possible values that may be returned under a given consistency model. Consistency oracles are easy to use and can be applied to test and verify both distributed storage systems and the client software that uses those systems.
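
A minimal sketch of the idea, assuming a simple key-value interface and two consistency models; the paper's actual interface and supported models may differ:

```python
# A toy consistency oracle: it mirrors a key-value store's API, but read()
# returns the SET of values a real store could legally return under the
# chosen consistency model, rather than a single value.

class ToyOracle:
    def __init__(self, model="eventual"):
        self.model = model
        self.history = {}  # key -> list of written values, oldest first

    def write(self, key, value):
        self.history.setdefault(key, []).append(value)

    def read(self, key):
        writes = self.history.get(key, [])
        if self.model == "linearizable":
            # Only the most recent write is admissible.
            return {writes[-1]} if writes else {None}
        # Eventual consistency: any past write, or the initial value,
        # may still be visible to a stale replica.
        return set(writes) | {None}

oracle = ToyOracle("eventual")
oracle.write("x", 1)
oracle.write("x", 2)
print(oracle.read("x"))  # any of 1, 2, or the initial None is legal
```

A test harness can then check that every value an implementation returns is a member of the oracle's set.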

Bio:
Ben is a PhD student under Prof. David Lie. His research primarily focuses on Consistency Verification for Distributed Systems.

Semantic Aware Online Detection of Resource Anomalies on the Cloud

Wednesday Nov 23rd, 12-1PM @ BA5205

Speaker: Stelios Sotiriadis

Title:
Semantic Aware Online Detection of Resource Anomalies on the Cloud

Abstract:
As cloud based platforms become more popular, it becomes an essential task for the cloud administrator to efficiently manage the costly hardware resources in the cloud environment.
Prompt action should be taken whenever hardware resources are faulty, or configured and utilized in a way that causes application performance degradation and hence poor quality of service. In this paper, we propose a semantic-aware technique based on neural network learning and pattern recognition in order to provide automated, real-time support for resource anomaly detection.
We incorporate application semantics to narrow down the scope of the learning and detection phases, thus enabling our machine learning technique to work at very low overhead when executed online. As our method runs “life-long” on monitored resource usage in the cloud, we can, in case of a wrong prediction, leverage administrator feedback to improve predictions on future runs.
This feedback directed scheme with the attached context helps us to achieve an anomaly detection accuracy of as high as 98.3% in our experimental evaluation, and can be easily used in conjunction with other anomaly detection techniques for the cloud.

Bio:
Stelios Sotiriadis is a research fellow under Prof. Cristiana Amza. His research focuses on the Inter-Cloud Meta-Scheduling (ICMS) framework.

Non-intrusive Performance Profiling of Entire Software Stacks based on the Flow Reconstruction Principle

Thursday October 27th, 12-1PM @ BA5205

Speaker: Xu Zhao

Title:
Non-intrusive Performance Profiling of Entire Software Stacks based on the Flow Reconstruction Principle

Abstract:
Understanding the performance behavior of distributed server stacks at scale is non-trivial. The servicing of just a single request can trigger numerous sub-requests across heterogeneous software components; and many similar requests are serviced concurrently and in parallel. When a user experiences poor performance, it is extremely difficult to identify the root cause, as well as the software components and machines that are the culprits. This work describes Stitch, a non-intrusive tool capable of profiling the performance of an entire distributed software stack solely using the unstructured logs output by heterogeneous software components. Stitch is substantially different from all prior related tools in that it is capable of constructing a system model of an entire software stack without building any domain knowledge into Stitch. Instead, it automatically reconstructs the extensive domain knowledge of the programmers who wrote the code; it does this by relying on the Flow Reconstruction Principle which states that programmers log events such that one can reliably reconstruct the execution flow a posteriori.
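
The Flow Reconstruction Principle can be illustrated with a toy example: if programmers log enough identifiers, per-request flows can be rebuilt from unstructured lines alone by linking lines that share an identifier value. The log format, identifiers, and union-find grouping below are illustrative assumptions, not Stitch's actual algorithm:

```python
import re
from collections import defaultdict

logs = [
    "frontend: accepted req=42 user=alice",
    "scheduler: req=42 assigned task=7",
    "worker3: task=7 started",
    "worker3: task=7 done",
    "frontend: accepted req=43 user=bob",
    "scheduler: req=43 assigned task=8",
]

# Extract key=value identifiers from each unstructured line.
ids = [set(re.findall(r"\w+=\w+", line)) for line in logs]

# Union-find over lines: two lines belong to the same flow if they
# share any identifier (req=42 links frontend to scheduler, task=7
# links scheduler to the worker, and so on transitively).
parent = list(range(len(logs)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

by_id = defaultdict(list)
for i, s in enumerate(ids):
    for ident in s:
        by_id[ident].append(i)
for lines in by_id.values():
    for other in lines[1:]:
        parent[find(other)] = find(lines[0])

flows = defaultdict(list)
for i, line in enumerate(logs):
    flows[find(i)].append(line)
for flow in flows.values():
    print(flow)  # one reconstructed request flow per line
```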

Bio:
Xu is a second year PhD student under Prof. Ding Yuan. His research focuses on performance failure debugging and log analysis in distributed systems.

Don’t Get Caught In the Cold, Warm-up Your JVM: Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems

Wednesday October 26th, 12-1PM @ BA5205

Speaker: David Lion

Title:
Don’t Get Caught In the Cold, Warm-up Your JVM:
Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems

Abstract:

Many widely used, latency-sensitive, data-parallel distributed systems, such as HDFS, Hive, and Spark, choose to use the Java Virtual Machine (JVM), despite debate on the overhead of doing so. This paper analyzes the extent and causes of the JVM performance overhead in the above-mentioned systems. Surprisingly, we find that the warm-up overhead, i.e., class loading and interpretation of bytecode, is frequently the bottleneck. For example, even an I/O-intensive, 1GB read on HDFS spends 33% of its execution time in JVM warm-up, and Spark queries spend an average of 21 seconds in warm-up.

The findings on JVM warm-up overhead reveal a contradiction between the principle of parallelization, i.e., speeding up long running jobs by parallelizing them into short tasks, and amortizing JVM warm-up overhead through long tasks. We solve this problem by designing HotTub, a new JVM that amortizes the warm-up overhead over the lifetime of a cluster node instead of over a single job by reusing a pool of already warm JVMs across multiple applications. The speed-up is significant. For example, using HotTub results in up to 1.8X speed-ups for Spark queries, despite not adhering to the JVM specification in edge cases.
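
HotTub's core idea, amortizing one cold start over many jobs by recycling warm runtimes, can be sketched abstractly. HotTub itself modifies the JVM; the pool class and names below are an illustrative stand-in, not its real interface:

```python
import queue

class WarmPool:
    """Hand out already-warm workers instead of cold-starting per job."""
    def __init__(self, warm_up, capacity=4):
        self.warm_up = warm_up           # expensive one-time initialization
        self.pool = queue.SimpleQueue()  # already-warm workers
        self.capacity = capacity

    def checkout(self):
        try:
            return self.pool.get_nowait()  # reuse a warm worker: no warm-up
        except queue.Empty:
            return self.warm_up()          # cold start only when pool is empty

    def checkin(self, worker):
        if self.pool.qsize() < self.capacity:
            self.pool.put(worker)          # keep it warm for the next job

cold_starts = 0
def start_worker():
    global cold_starts
    cold_starts += 1                # stands in for class loading + JIT warm-up
    return {"warm": True}

pool = WarmPool(start_worker)
for _ in range(10):                 # ten sequential jobs...
    w = pool.checkout()
    pool.checkin(w)
print(cold_starts)                  # ...but only one cold start: 1
```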

Bio:
David is a first year PhD student under Prof. Ding Yuan. His research primarily focuses on Java Virtual Machine performance in data-parallel applications.

Accelerating Complex Data Transfer for Cluster Computing

Friday June 10th, 12-1PM @ BA5205

Speaker: Alexey Khrabrov

Title:
Accelerating Complex Data Transfer for Cluster Computing

Abstract:
The ability to move data quickly between the nodes of a distributed system is important for the performance of cluster computing frameworks, such as Hadoop and Spark. We show that in a cluster with modern networking technology data serialization is the main bottleneck and source of overhead in the transfer of rich data in systems based on high-level programming languages such as Java. We propose a new data transfer mechanism that avoids serialization altogether by using a shared cluster-wide address space to store data. The design and a prototype implementation of this approach are described. We show that our mechanism is significantly faster than serialized data transfer, and propose a number of possible applications for it.
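
The serialization overhead the abstract targets is easy to see even in a toy setting. This is a pure-Python stand-in; the actual work concerns JVM objects and a cluster-wide shared address space:

```python
import pickle
import time

# A "rich" object graph: many small records, as a cluster framework
# might shuffle between nodes.
records = [{"id": i, "payload": "x" * 64} for i in range(100_000)]

t0 = time.perf_counter()
blob = pickle.dumps(records)   # sender encodes every object...
restored = pickle.loads(blob)  # ...and the receiver decodes every object
elapsed = time.perf_counter() - t0

# With a shared address space, the "transfer" would instead be passing a
# reference, skipping both the encode and the decode pass entirely.
print(f"serialized+deserialized {len(blob)} bytes in {elapsed:.3f}s")
```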

Bio:
Alexey Khrabrov is a 1st year PhD student at the University of Toronto, under the supervision of Prof. Eyal de Lara. His research interests lie in the area of performance of distributed systems. His current work focuses on leveraging modern network technologies and designing new programming models to improve data transfer performance in cluster computing systems.