A Systematic View of Model Leakage Risks in Deep Neural Network Systems

As deep neural networks (DNNs) continue to find applications in ever more domains, the exact nature of the neural network architecture becomes an increasingly sensitive subject, due to both intellectual property protection and the risk of adversarial attacks. While prior work has explored aspects of the risk associated with model leakage, two questions remain open: which parts of the model are most sensitive, and how the full DNN architecture can be inferred when nothing about its structure is known a priori. In this paper we address this gap, first by presenting a schema for reasoning about model leakage holistically, and then by proposing and quantitatively evaluating DeepSniffer, a novel learning-based model extraction framework that requires no prior knowledge of the victim model. DeepSniffer is robust to the architectural and system noise introduced by the complex memory hierarchy and diverse run-time system optimizations. Taking GPU platforms as a showcase, DeepSniffer performs model extraction by learning both the architecture-level execution features of kernels and the inter-layer temporal association information introduced by common DNN design practice. We demonstrate experimentally that DeepSniffer works on an off-the-shelf Nvidia GPU platform running a variety of DNN models and that the extracted models significantly improve attempts at crafting adversarial inputs. The DeepSniffer project has been released at https://github.com/xinghu7788/DeepSniffer.
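To make the extraction idea concrete, the following is a minimal sketch (not DeepSniffer's actual model): each observed GPU kernel is classified into a layer type from side-channel-style features, and a transition prior captures the inter-layer temporal association. The feature centroids and transition probabilities below are illustrative assumptions, not measured values.

```python
# Hypothetical sketch: classify kernels into layer types from execution features
# (latency, read/write volume), then re-rank with a layer-transition prior.
import numpy as np

LAYER_TYPES = ["conv", "relu", "pool", "fc"]

# Illustrative per-layer-type feature centroids: (latency_us, rd_MB, wr_MB).
CENTROIDS = np.array([
    [120.0, 8.0, 4.0],   # conv
    [10.0,  4.0, 4.0],   # relu
    [15.0,  4.0, 1.0],   # pool
    [60.0,  6.0, 0.1],   # fc
])

# Illustrative transition prior P(next | prev): conv is usually followed by relu, etc.
TRANSITION = np.array([
    [0.1, 0.7, 0.1, 0.1],   # after conv
    [0.4, 0.1, 0.4, 0.1],   # after relu
    [0.6, 0.1, 0.1, 0.2],   # after pool
    [0.1, 0.5, 0.1, 0.3],   # after fc
])

def extract_layer_sequence(kernel_features):
    """Greedy decode: per-kernel feature similarity combined with the transition prior."""
    seq, prev = [], None
    for feat in kernel_features:
        dist = np.linalg.norm(CENTROIDS - feat, axis=1)
        score = 1.0 / (1e-6 + dist)              # closer centroid -> higher score
        if prev is not None:
            score = score * TRANSITION[prev]     # inter-layer temporal association
        prev = int(np.argmax(score))
        seq.append(LAYER_TYPES[prev])
    return seq

# Example trace: three kernels observed during inference.
trace = np.array([[118.0, 7.5, 4.2], [11.0, 4.1, 4.0], [14.0, 4.0, 1.1]])
print(extract_layer_sequence(trace))             # e.g. ['conv', 'relu', 'pool']
```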

End-to-End Synthesis of Dynamically Controlled Machine Learning Accelerators

Edge systems are required to autonomously make real-time decisions based on large quantities of input data under strict power, performance, area, and other constraints. Meeting these constraints is only possible by specializing systems through hardware accelerators purposefully built for machine learning and data analysis algorithms. However, data science evolves at a rapid pace, and the manual design of custom accelerators incurs high non-recurring engineering costs: general solutions are needed to automatically and rapidly transition from the formulation of a new algorithm to the deployment of a dedicated hardware implementation. Our solution is the SOftware Defined Architectures (SODA) Synthesizer, an end-to-end, multi-level, modular, extensible compiler toolchain providing a direct path from machine learning tools to hardware. The SODA Synthesizer frontend is based on the multi-level intermediate representation (MLIR) framework; it ingests pre-trained machine learning models, identifies kernels suited for acceleration, performs high-level optimizations, and prepares them for hardware synthesis. In the backend, SODA leverages state-of-the-art high-level synthesis techniques to generate highly efficient accelerators, targeting both field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). In this paper, we describe how the SODA Synthesizer can also assemble the generated accelerators (based on the finite state machine with datapath model) into a custom system driven by a distributed controller, building a coarse-grained dataflow architecture that does not require a host processor to orchestrate parallel execution of multiple accelerators. We show the effectiveness of our approach by automatically generating ASIC accelerators for layers of popular deep neural networks (DNNs). Our high-level optimizations result in up to 74x speedup on isolated accelerators for individual DNN layers, and our dynamically scheduled architecture yields an additional 3x performance improvement when combining accelerators to handle streaming inputs.
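The coarse-grained dataflow idea can be pictured with a small sketch (purely conceptual, not the SODA toolchain): two accelerator stages are chained by handshaking queues, and each stage starts as soon as its input token arrives, with no host processor orchestrating execution. The stage functions and queue-based handshake below are illustrative stand-ins for the FSMD accelerators and the distributed controller.

```python
# Conceptual model of a two-stage, dynamically scheduled dataflow pipeline.
import threading, queue

def make_stage(fn, in_q, out_q):
    def run():
        while True:
            item = in_q.get()
            if item is None:              # end-of-stream token
                out_q.put(None)
                return
            out_q.put(fn(item))
    return threading.Thread(target=run, daemon=True)

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    make_stage(lambda x: [v * 2 for v in x], q0, q1),     # e.g., a convolution-like stage
    make_stage(lambda x: [max(v, 0) for v in x], q1, q2), # e.g., an activation stage
]
for s in stages:
    s.start()

for tile in ([1, -2, 3], [4, -5, 6]):     # streaming inputs
    q0.put(tile)
q0.put(None)

while (out := q2.get()) is not None:
    print(out)
```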

vTrust: Remotely Executing Mobile Apps Transparently With Local Untrusted OS

Increasingly, many security- and privacy-sensitive applications are running on mobile platforms. However, as mobile operating systems become increasingly sophisticated, they are vulnerable to various attacks. To address the need to run high-assurance mobile apps in a secure environment even when the operating system is untrusted, this paper presents vTrust, a new mobile app trusted execution environment, which offloads the general execution and storage of a mobile app to a trusted remote server (e.g., a VM running in a cloud) and secures the I/O between the server and the mobile device with the aid of a trusted hypervisor on the mobile device. Specifically, vTrust establishes an encrypted I/O channel between the local hypervisor and the remote server. In this way, any sensitive data flowing through the mobile OS, which the hypervisor hosts, is encrypted from the perspective of the local mobile OS. To enhance the performance of vTrust, we have also designed multiple optimizations, such as output data compression and selective sensor data transmission. We have implemented vTrust, and our evaluation shows that it has limited impact on both user experience and application performance.
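The encrypted I/O channel can be illustrated with a minimal sketch: input sealed by the trusted hypervisor is opaque to the untrusted mobile OS it traverses, and only the remote server can open it. This is an assumption-laden illustration, not vTrust's implementation; it uses AES-GCM from the third-party `cryptography` package, and key provisioning and attestation are out of scope.

```python
# Illustrative encrypted I/O channel between a local hypervisor and a remote server.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

session_key = AESGCM.generate_key(128)   # hypothetically provisioned between hypervisor and server
channel = AESGCM(session_key)

def hypervisor_send(plaintext: bytes) -> bytes:
    """What the local hypervisor emits: nonce || ciphertext, opaque to the mobile OS."""
    nonce = os.urandom(12)
    return nonce + channel.encrypt(nonce, plaintext, b"vtrust-io")

def server_receive(blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return channel.decrypt(nonce, ciphertext, b"vtrust-io")

sealed = hypervisor_send(b"touch event: (120, 480)")
print(server_receive(sealed))
```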

Architecting a Flash-Based Storage System for Low-Cost Inference of Extreme-Scale DNNs

The size of deep neural network (DNN) models has been growing rapidly, demanding a colossal amount of memory capacity. For example, Google has recently scaled its Switch Transformer to a parameter size of up to 6.4 TB. However, today's HBM DRAM-based memory systems for GPUs and DNN accelerators are suboptimal for these extreme-scale DNNs, as they fail to provide enough capacity while their massive bandwidth is poorly utilized. Thus, we propose Leviathan, a DNN inference accelerator that instead integrates a cost-effective flash-based storage system. We carefully architect the storage system to provide enough memory bandwidth while preventing the performance drop caused by read-disturbance errors. Our evaluation of Leviathan demonstrates an 8.3× throughput gain compared to the iso-FLOPS DNN accelerator with conventional SSDs and up to 19.5× higher memory cost-efficiency than the HBM-based DNN accelerator.
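A back-of-envelope sketch helps show why capacity, not bandwidth, dominates at this scale. All figures below are illustrative assumptions, not the paper's measurements: holding a 6.4 TB model in HBM would require hundreds of stacks whose aggregate bandwidth goes largely unused, while a handful of flash packages provides the capacity at far lower cost.

```python
# Hypothetical capacity/bandwidth/cost provisioning for a 6.4 TB model.
MODEL_TB = 6.4

HBM_STACK_GB, HBM_STACK_GBPS, HBM_COST_PER_GB = 24, 600, 15.0       # assumed figures
FLASH_PKG_GB, FLASH_PKG_GBPS, FLASH_COST_PER_GB = 1024, 2.4, 0.10   # assumed figures

def provision(pkg_gb, pkg_gbps, cost_per_gb):
    n = -(-int(MODEL_TB * 1024) // pkg_gb)      # ceil division: devices needed for capacity
    return n, n * pkg_gbps, n * pkg_gb * cost_per_gb

for name, spec in [("HBM", (HBM_STACK_GB, HBM_STACK_GBPS, HBM_COST_PER_GB)),
                   ("flash", (FLASH_PKG_GB, FLASH_PKG_GBPS, FLASH_COST_PER_GB))]:
    n, bw, cost = provision(*spec)
    print(f"{name}: {n} devices, {bw:.0f} GB/s aggregate, ~${cost:,.0f}")
```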

Dynamic Sparse Attention for Scalable Transformer Acceleration

Transformers are the mainstream model for NLP applications and are becoming increasingly popular in other domains such as computer vision. Despite their improvements in model quality, the enormous computation costs make Transformers difficult to deploy, especially when sequence lengths are large in emerging applications. The attention mechanism, the essential component of the Transformer, is the execution bottleneck due to its quadratic complexity. Prior art explores sparse attention patterns to support long-sequence modeling, but these works rely on static or fixed patterns. We demonstrate that the sparse patterns are dynamic, depending on the input sequence. Thus, we propose Dynamic Sparse Attention (DSA), which can efficiently exploit dynamic sparse patterns in attention. Compared with other methods, our approach achieves better trade-offs between accuracy and model complexity. Moving forward, we identify challenges and provide solutions to implement DSA on existing hardware (GPUs) and specialized hardware in order to achieve practical speedup and efficiency improvements for Transformer execution.
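The dynamic-sparsity idea can be sketched as follows (a minimal illustration, not DSA's actual predictor): estimate attention scores cheaply, here with a low-rank projection standing in for a low-cost prediction, keep only the strongest entries per query, and evaluate exact attention on that input-dependent mask. Dimensions and the top-k budget are assumptions for illustration.

```python
# Minimal numpy sketch of input-dependent (dynamic) sparse attention.
import numpy as np

rng = np.random.default_rng(0)
seq, d, r, k = 128, 64, 8, 16            # sequence length, model dim, sketch rank, keep-k

Q = rng.standard_normal((seq, d))
K = rng.standard_normal((seq, d))
V = rng.standard_normal((seq, d))

# Cheap score estimate: project Q and K to r dims before the dot product.
P = rng.standard_normal((d, r)) / np.sqrt(d)
approx = (Q @ P) @ (K @ P).T             # O(seq^2 * r) instead of O(seq^2 * d)

# Input-dependent mask: per query, keep the k largest estimated scores.
mask = np.zeros_like(approx, dtype=bool)
topk = np.argpartition(-approx, k, axis=1)[:, :k]
np.put_along_axis(mask, topk, True, axis=1)

# Exact attention restricted to the predicted positions (computed densely here for
# simplicity; specialized hardware would skip the masked entries entirely).
scores = np.where(mask, Q @ K.T / np.sqrt(d), -np.inf)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
out = weights @ V
print(out.shape, mask.mean())            # (128, 64) and the realized density (~k/seq)
```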

EGCN: An Efficient GCN Accelerator for Minimizing Off-Chip Memory Access

As Graph Convolutional Networks (GCNs) have emerged as a promising solution for graph representation learning, designing specialized GCN accelerators has become an important challenge. An analysis of GCN workloads shows that the main bottleneck of GCN processing is not computation but the memory latency of intensive off-chip data transfer. Therefore, minimizing off-chip data transfer is the primary challenge in designing an efficient GCN accelerator. To address this challenge, we begin by modeling GCN processing as tiled matrix multiplication and optimize off-chip memory access from both the in-tile and out-of-tile perspectives. From the out-of-tile perspective, we find the optimal tile configuration for a given dataset and on-chip buffer capacity, and then examine the dataflow across phases and layers; an inter-layer phase-fusion dataflow with the optimal tile configuration reduces the data transfer of intermediate outputs. From the in-tile perspective, because tiles are sparse, they contain redundant data that does not participate in computation; this redundant data load is eliminated with hardware support. Finally, we introduce an efficient GCN inference accelerator, EGCN, specialized for minimizing off-chip memory access. EGCN achieves 41.9% off-chip DRAM access reduction, 1.49× speedup, and 1.95× energy efficiency improvement on average over state-of-the-art accelerators.
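The in-tile idea can be illustrated with a small sketch: process the (sparse adjacency × dense feature) product tile by tile and skip tiles that contain no non-zeros, so their feature data is never fetched. The tile size and matrices below are illustrative, and the paper realizes this with hardware support rather than in software.

```python
# Tiled sparse-dense multiplication that skips empty (redundant) tiles.
import numpy as np

rng = np.random.default_rng(1)
n, f, T = 256, 32, 64                                   # nodes, feature width, tile size
A = (rng.random((n, n)) < 0.0002).astype(np.float32)    # sparse adjacency (illustrative)
X = rng.standard_normal((n, f)).astype(np.float32)      # node features

out = np.zeros((n, f), dtype=np.float32)
skipped = total = 0
for i in range(0, n, T):
    for j in range(0, n, T):
        total += 1
        tile = A[i:i+T, j:j+T]
        if not tile.any():           # redundant tile: no fetch of X[j:j+T] needed
            skipped += 1
            continue
        out[i:i+T] += tile @ X[j:j+T]

print(f"skipped {skipped}/{total} tiles")
assert np.allclose(out, A @ X, atol=1e-4)
```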

CloudChain: A Cloud Blockchain Using Shared Memory Consensus and RDMA

Blockchain technologies can enable secure computing environments among mistrusting parties. Permissioned blockchains are particularly favored by companies, enterprises, and government agencies due to their efficiency, customizability, and governance-friendly features. Seamlessly fusing blockchain and cloud computing can significantly benefit permissioned blockchains; nevertheless, most blockchains implemented on clouds were originally designed for loosely coupled networks where nodes communicate asynchronously, and thus fail to take advantage of the closely coupled nature of cloud servers. In this paper, we propose an innovative cloud-oriented blockchain, CloudChain, a modularized three-layer system composed of the network layer, consensus layer, and blockchain layer. CloudChain is based on a shared-memory model where nodes communicate synchronously through direct memory accesses. We realize the shared-memory model with Remote Direct Memory Access (RDMA) technology, based on which we propose a shared-memory consensus algorithm to ensure persistence and liveness, the two crucial blockchain security properties countering Byzantine nodes. We also implement a CloudChain prototype on a RoCEv2-based testbed to experimentally validate our design, and the results verify the feasibility and efficiency of CloudChain.
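As a toy illustration of the shared-memory flavor of consensus (not CloudChain's actual protocol): with RDMA, a node can read peers' vote slots directly from their memory instead of exchanging messages. Below, a plain dict stands in for the registered memory region, and a block commits once 2f+1 nodes have written matching digests, tolerating up to f Byzantine voters; all names and parameters are assumptions for illustration.

```python
# Toy shared-memory voting round with a quorum check.
import hashlib

N, F = 4, 1                                    # nodes, tolerated Byzantine nodes
QUORUM = 2 * F + 1

vote_region = {i: None for i in range(N)}      # stand-in for RDMA-readable vote slots

def propose(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def write_vote(node_id: int, digest: str) -> None:
    vote_region[node_id] = digest              # a one-sided RDMA write in the real system

def committed(digest: str) -> bool:
    return sum(1 for v in vote_region.values() if v == digest) >= QUORUM

digest = propose(b"block #42: tx1, tx2")
for node in (0, 1, 2):
    write_vote(node, digest)
write_vote(3, "deadbeef")                      # a faulty/Byzantine vote
print(committed(digest))                       # True: 3 >= 2f+1 matching votes
```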

WBMatrix: An Optimized Matrix Library for White-Box Block Cipher Implementations

The white-box block cipher (WBC) was proposed by Chow et al. to prevent the secret key from being extracted from an implementation running in an untrusted context. A pivotal technique behind WBC is to convert the iterated round functions into a series of look-up tables (LUTs) with encodings. The construction of encoded LUTs consists of matrix operations, such as multiplication and inversion. Widely used matrix libraries, such as the open-source NTL and M4RI, are primarily designed for large-dimensional matrix operations and are therefore not well suited to WBC implementations, which mainly operate on small-scale matrices and vectors. In this paper, we propose a new matrix library named WBMatrix for the optimization of WBC implementations. WBMatrix reduces the operating steps of multiplication and simultaneously generates pairwise invertible matrices as encodings. A performance comparison shows that WBMatrix improves the table construction and encryption phases on Intel x86 and ARMv8 platforms. Moreover, WBMatrix also boosts the initialization and encryption phases of the LowMC/LowMC-M block ciphers and enhances the performance of key-dependent S-box generation.
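The encoding primitive can be sketched in a few lines: generate an invertible matrix over GF(2) together with its inverse and check that applying the encoding and then its inverse recovers the input. Rows are stored as integer bitmasks; the dimension is illustrative, and this sketch makes no claim about WBMatrix's internal algorithm.

```python
# Generate a pairwise-invertible GF(2) matrix (encoding) and its inverse.
import random

def gf2_inverse(rows, n):
    """Gauss-Jordan over GF(2); returns inverse rows, or None if singular."""
    aug = [(rows[i] << n) | (1 << (n - 1 - i)) for i in range(n)]   # [A | I]
    for col in range(n):
        pivot_bit = 1 << (2 * n - 1 - col)
        pivot = next((r for r in range(col, n) if aug[r] & pivot_bit), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r] & pivot_bit:
                aug[r] ^= aug[col]
    return [row & ((1 << n) - 1) for row in aug]                    # right half = A^-1

def mat_vec(rows, vec, n):
    """Matrix-vector product over GF(2) with rows and vec as bitmasks."""
    return sum(((bin(rows[i] & vec).count("1") & 1) << (n - 1 - i)) for i in range(n))

n = 8
while True:
    A = [random.getrandbits(n) for _ in range(n)]
    A_inv = gf2_inverse(A, n)
    if A_inv is not None:
        break

x = 0b10110011
assert mat_vec(A_inv, mat_vec(A, x, n), n) == x   # the decoding undoes the encoding
print("generated an invertible 8x8 GF(2) encoding and its inverse")
```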

Efficient and Scalable FPGA Design of GF(2^m) Inversion for Post-Quantum Cryptosystems

Post-quantum cryptosystems based on QC-MDPC codes are designed to mitigate the security threat posed by quantum computers to traditional public-key cryptography. Polynomial inversion is the core operation of key generation in such cryptosystems, and the adoption of ephemeral keys requires key generation to be executed for each session. There is thus a need for efficient and scalable hardware implementations of binary polynomial inversion to support the key generation primitive across a wide range of computational platforms. This manuscript proposes an efficient and scalable architecture implementing binary polynomial inversion in hardware. Our solution can deliver a performance-optimized implementation for the large polynomials used in post-quantum code-based cryptosystems and for each FPGA of the mid-range Xilinx Artix-7 family. The effectiveness of the proposed solution was validated by means of the BIKE and LEDAcrypt post-quantum QC-MDPC cryptosystems as representative use cases. Compared to the C11- and optimized AVX2-based software implementations of LEDAcrypt, instances of the proposed architecture targeting the Artix-7 200 FPGA show average performance improvements of 31.7 and 2.2 times, respectively. Moreover, the proposed architecture delivers performance improvements of up to 18.1 and 21.5 times for the AES-128 and AES-192 security levels, respectively, compared to the BIKE hardware implementation.
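For readers unfamiliar with the underlying operation, here is a small-scale software sketch of binary polynomial inversion via the extended Euclidean algorithm, with polynomials packed into Python integers (bit i holds the coefficient of x^i). The hardware design targets polynomials thousands of bits long; the AES field GF(2^8) modulus is used here only to keep the example readable, and this is not the architecture's algorithm.

```python
# Binary polynomial inversion in GF(2)[x] via the extended Euclidean algorithm.
def poly_mul(a, b):
    """Carry-less multiplication of two GF(2) polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_divmod(a, b):
    """Polynomial division over GF(2): returns (quotient, remainder)."""
    q, db = 0, b.bit_length()
    while a.bit_length() >= db:
        shift = a.bit_length() - db
        q ^= 1 << shift
        a ^= b << shift
    return q, a

def poly_inverse(a, mod):
    """Returns a^-1 mod `mod`, or raises if a is not invertible."""
    r0, r1 = mod, a
    s0, s1 = 0, 1
    while r1:
        q, rem = poly_divmod(r0, r1)
        r0, r1 = r1, rem
        s0, s1 = s1, s0 ^ poly_mul(q, s1)
    if r0 != 1:
        raise ValueError("not invertible")
    return poly_divmod(s0, mod)[1]

MOD = 0x11B                      # x^8 + x^4 + x^3 + x + 1 (irreducible over GF(2))
a = 0x53
inv = poly_inverse(a, MOD)
assert poly_divmod(poly_mul(a, inv), MOD)[1] == 1
print(hex(inv))                  # 0xca, the inverse of 0x53 in the AES field
```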

Blockchain-Cloud Transparent Data Marketing: Consortium Management and Fairness

Data generated by Internet of Things (IoT) devices and centralized at a cloud server can later be traded with third parties, i.e., data marketing, to enable various data-intensive applications. However, this centralized approach has recently come under debate due to the lack of (1) transparent and distributed marketplace management and (2) marketing fairness for both IoT users (data sellers) and third parties (data buyers). In this paper, we propose Blockchain-Cloud Transparent Data Marketing (Block-DM) with consortium management and executable fairness. First, we introduce a hybrid data-marketing architecture, where the cloud acts as an efficient data management unit and a consortium blockchain serves as a transparent marketing controller. Under this architecture, consent-based secure data trading and identity privacy for data owners are achieved with distributed credential issuance and threshold credential opening. Second, with a consortium committee, we design a fair on/off-chain data marketing protocol. Through financial incentives and succinct "commitments" of marketing operations, the protocol achieves marketing fairness and effective detection of unfair marketing operations. We demonstrate the security of Block-DM with a thorough analysis and conduct extensive experiments with a consortium blockchain network on Hyperledger Fabric to show the feasibility and practicality of Block-DM.
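A minimal sketch of the succinct-commitment idea (an illustration, not Block-DM's protocol): the cloud anchors a digest of each off-chain marketing operation on the consortium chain, so a seller or buyer can later check the revealed operation against the anchored digest and detect tampering. The field names and the flat JSON encoding are assumptions made for this example.

```python
# Hash commitments to off-chain marketing operations, checkable against an on-chain digest.
import hashlib, json, os

def commit(operation: dict, salt: bytes) -> str:
    encoded = json.dumps(operation, sort_keys=True).encode() + salt
    return hashlib.sha256(encoded).hexdigest()

def verify(operation: dict, salt: bytes, anchored_digest: str) -> bool:
    return commit(operation, salt) == anchored_digest

op = {"seller": "user-17", "buyer": "analytics-co", "dataset": "d42", "price": 30}
salt = os.urandom(16)
on_chain = commit(op, salt)                   # only the digest goes to the consortium chain

tampered = dict(op, price=3)                  # an unfair, altered off-chain record
print(verify(op, salt, on_chain))             # True
print(verify(tampered, salt, on_chain))       # False: the mismatch is detectable
```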