Engineering Tracks
IP
DescriptionCurrent vehicle systems must process data from a wide variety of sensors, such as radars and cameras, at high speed and in real time; the Ethernet switch embedded in the vehicle system therefore needs to communicate at a high throughput of 50 Gbps. In next-generation vehicle systems, where autonomous driving technology becomes increasingly sophisticated, high-speed communication at 100 Gbps will be required. Furthermore, in-vehicle ECUs are required to consume even less power in order to prevent heat generation in the in-vehicle environment and to optimize battery efficiency.
Conventionally, Ethernet switches have used a hash-based method for search processing in the switch processing block, which has low power consumption but limited throughput. A TCAM-based method is essential to achieve a high throughput of 100 Gbps, but it suffers from high power consumption. Architectural optimization of the search processing block is therefore also required to achieve high throughput with low power consumption.
We have realized an Ethernet switch that achieves high throughput with low power consumption by adopting a pipelined search method and a phase-shift search method on a TCAM base. This Ethernet switch fulfills the requirements of next-generation autonomous driving cars.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith the growing demand for heterogeneous chip interconnect, there is a dire need for a unified EDA design environment that effectively handles complex logical interconnects, physical layout design, and electrical, mechanical, and thermal simulations.
Intel's embedded multi-die interconnect bridge (EMIB) is an approach to in-package, high-density interconnect of heterogeneous chips. With increasing demand from Intel's internal and IFS customer base, tools face bigger challenges in handling highly complex designs with tens of complex chiplets: managing their connectivity, producing low-latency, high-bump-count layouts, and delivering reliable interconnects that can be seamlessly simulated with EDA tools.
Intel's collaboration with Cadence on automating 2.5D design is a significant step toward making EMIB technology a more widely adopted and efficient solution for high-performance chip design. This will significantly benefit other companies and researchers working in this field.
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionEnvironmental sustainability is a critical concern for Integrated Circuits (ICs) throughout their entire life cycle, particularly in manufacturing and use. Meanwhile, ICs using 3D/2.5D integration technologies have emerged as promising solutions to meet the growing demands for computational power. However, there is a distinct lack of carbon modeling tools for 3D/2.5D ICs. Addressing this, we propose 3D-Carbon, an analytical carbon modeling tool designed to quantify the carbon emissions of 3D/2.5D ICs throughout their life cycle. 3D-Carbon factors in both potential savings and overheads from advanced integration technologies, considering practical deployment constraints like bandwidth. We validate 3D-Carbon's accuracy against established baselines and illustrate its utility through case studies in autonomous vehicles. We believe that 3D-Carbon lays the initial foundation for future innovations in developing environmentally sustainable 3D/2.5D ICs.
Research Panel
Design
DescriptionAt the end of 2D scaling of Moore's law, 3D integrated circuits that take advantage of advanced packaging and heterogeneous integration offer many prospects for extending chip density scaling and system performance improvements over the next decade. Much of the 3DIC design activity in the industry today is done by different teams within the same chipmaker. 3DICs hold the potential not only to make the chip architecture heterogeneous but also to make chiplet sourcing highly diversified. Moreover, 3DICs themselves have several avenues toward commercial success, ranging from truly disaggregated chiplets to sequentially stacked processing. This presses us to answer a few key questions:
1. Technology:
a. How will heat dissipation be managed, and what new cooling techniques are being pursued to mitigate the thermal challenge?
b. How to design the power delivery network from the board to the substrate to the multi-tier of 3D stack with minimal voltage drop and high-power conversion efficiency? How to design the backside power delivery in leading edge node CMOS with 3D stacking?
c. How to ensure signal integrity, yield and reliability between multiple tiers of 3D stacking, and what testing and standardization efforts are needed to embrace the heterogeneous dies from different designers and foundries?
2. EDA flows and interoperability
a. Will the ecosystem extend the same standards-based interoperability of design tools, flows and methodologies to 3DIC, as enjoyed by monolithic system designers today?
b. How can the EDA industry help system designers in planning, managing, and tracking their complex 3DIC projects through implementation, analysis, and signoff?
3. Roadmap:
a. Is the roadmap to sequential monolithic stacked 3DIC an inevitability? What factors lead the industry to it?
b. What are the boundaries between monolithic 3D integration (with sequential processing at BEOL) and heterogeneous 3D integration (with die stacking or bonding)?
Are we as an industry able to apply lessons from past struggles with monolithic chip design and interoperability to this emerging challenge? This panel will discuss the need, the scope of solutions, and potential candidate efforts already in motion.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description3DIC design can reduce the length of interconnections and secure gains in power and performance by stacking multiple dies vertically.
However, design complexity increases, and more resources are required to modify the design compared to a single-die design.
In the early stages of design, we need to be able to prototype the design quickly and easily.
Early thermal analysis is an important key to determining the design floorplan, and a high correlation with final results is required after the design is complete.
When we performed thermal analysis on the prototype design and on the two designs after actual P&R was completed, we confirmed that the thermal maps showed similar heat distributions and hot spots.
When we performed thermal analysis for the three power-scenario steps, the largest error rate between the prototype and the real design was 8.34%, found near the chip boundary at 5.8 s.
We confirmed that the temperature difference was less than 10% and that the hot-spot trend was very similar.
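The prototype-versus-signoff comparison above boils down to a simple metric. As a hedged illustration (the grid values and the choice of worst-cell relative error are our own, not the poster's actual data or methodology), one can compare two thermal maps cell by cell:

```python
# Hypothetical sketch: quantify prototype vs. post-P&R thermal correlation
# as the largest per-cell relative temperature error. Grid values below are
# illustrative only.

def max_relative_error(prototype, signoff):
    """Largest per-cell relative error (%) between two flattened thermal maps."""
    assert len(prototype) == len(signoff)
    worst = 0.0
    for t_proto, t_real in zip(prototype, signoff):
        err = abs(t_proto - t_real) / t_real * 100.0
        worst = max(worst, err)
    return worst

# Flattened 2x2 thermal grids in degrees Celsius (illustrative values).
proto = [85.0, 92.0, 78.0, 88.0]
real = [83.0, 95.0, 80.0, 87.0]
print(max_relative_error(proto, real))  # the worst cell decides the error rate
```

A sub-10% result from such a comparison is the kind of evidence the poster cites for trusting early prototype-stage thermal analysis.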
Research Manuscript
4-Transistor Ternary Content Addressable Memory Cell Design using Stacked Hybrid IGZO/Si Transistors
Design
Emerging Models of Computation
DescriptionIn this paper, we propose a 4T-based paired orthogonally stacked transistors for random access memory (POST-RAM) cell structure and suggest ternary content addressable memory (TCAM) applications. POST-RAM cells feature vertically stacked read and write transistors, maximizing area efficiency by utilizing only two transistors' footprint.
POST-RAM employs InGaZnO (IGZO) channels for write transistors and single-crystal silicon channels for read transistors, which results in both extremely long memory retention and fast read performance. A comprehensive 3D-TCAD simulation is conducted to validate the procedural design of the proposed device structure. Furthermore, we introduce a self-clamped searching scheme (SC2S) designed to enhance the efficiency of TCAM operations. The results conclusively demonstrate that operating a TCAM based on the proposed POST-RAM architecture can lead to a 20% improvement in energy-delay product (EDP). Notably, the delay performance can be enhanced by up to 40% compared to a 16T SRAM-based TCAM. Additionally, the proposed scheme enables a more than sixfold reduction in cell area, demonstrating efficient use of space.
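The ternary match at the heart of any TCAM can be illustrated in software. A minimal sketch of the match semantics only (the entry format and names are our own illustration, not the paper's cell-level scheme): each stored word pairs data bits with a care-mask, and a search key matches when all "cared" bits agree.

```python
# Illustrative sketch of ternary match semantics in a TCAM: don't-care bits
# are encoded in a per-entry care mask. Names and values are hypothetical.

def tcam_match(key, entries):
    """Return indices of entries whose cared bits equal the key's bits."""
    hits = []
    for i, (data, care_mask) in enumerate(entries):
        if (key ^ data) & care_mask == 0:  # differences only on don't-care bits
            hits.append(i)
    return hits

# Entry 0 cares about all 4 bits; entry 1 ignores the low 2 bits (ternary 10**).
entries = [(0b1010, 0b1111), (0b1000, 0b1100)]
print(tcam_match(0b1010, entries))  # matches both entries
print(tcam_match(0b1011, entries))  # matches only the don't-care entry
```

In hardware every entry evaluates this comparison in parallel, which is exactly why TCAM search is fast but power-hungry.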
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn today's high-speed AMS designs, as processes shrink and design complexity increases, layout parasitics have become more and more important, often dominating over devices in their impact on design performance. At the same time, as the magnitude of parasitics grows, it becomes harder and harder to debug complex parasitic issues through traditional methods such as post-layout simulation: designers must spend more post-simulation and sign-off runtime, and the experience-based manual debugging and iteration needed to identify the real bottleneck can put the design schedule out of control.
To improve design efficiency, a "shift-left" parasitic analysis flow for AMS layouts becomes necessary and important, helping designers identify parasitics-caused design problems earlier, more quickly, and more easily.
Before the sign-off stage, we first use ParagonX to perform quick parasitic analysis of R, C, RC delay, net matching, etc. in the early design stage, and debug the results by element, by layer, and by layout location to identify and optimize the real layout bottlenecks, reducing layout iterations from weeks to hours. This flow improvement makes parasitic debugging and layout optimization easy and efficient, significantly improving design efficiency.
Research Manuscript
Embedded Systems
Time-Critical and Fault-Tolerant System Design
DescriptionParallel real-time systems often rely on the shared cache for dependent data transmissions across cores. Conventional shared caches and their management techniques suffer from intensive contention and are markedly inflexible, leading to significant transmission latency for shared data. In this paper, we provide a Virtually Indexed, Physically Tagged, Selectively-Inclusive Non-Exclusive L1.5 cache, offering way-level control and versatile sharing capabilities. Focusing on a common parallel task model, the Directed Acyclic Graph (DAG), we construct a novel scheduling method that exploits the L1.5 cache to reduce data transmission latency, achieving improved timing performance. As a systematic solution, we build a real system, from the SoC and ISA to the drivers and the programming model. Experiments show that the proposed solution significantly improves the real-time performance of DAG tasks with negligible hardware overhead.
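The scheduling idea can be sketched abstractly: when a producer and consumer can exchange data through a fast shared cache, the edge latency charged by the scheduler shrinks, and the DAG's makespan with it. The sketch below is our own toy (latencies, task graph, and the greedy policy are invented for illustration; it is not the paper's algorithm):

```python
# Toy list scheduler for a DAG on two cores: each dependency edge costs a
# data-transfer latency, smaller when a fast shared cache (like the paper's
# L1.5) carries the data. All numbers are illustrative.
from collections import defaultdict

FAST_XFER, SLOW_XFER = 1, 5  # cycles via shared cache vs. conventional path

def schedule(tasks, edges, use_shared_cache):
    """Greedy topological schedule on 2 cores; returns the overall makespan."""
    xfer = FAST_XFER if use_shared_cache else SLOW_XFER
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    finish, core_free = {}, [0, 0]
    for t, cost in tasks:  # tasks listed in topological order
        ready = max([finish[p] + xfer for p in preds[t]], default=0)
        core = core_free.index(min(core_free))  # earliest-available core
        start = max(ready, core_free[core])
        finish[t] = start + cost
        core_free[core] = finish[t]
    return max(finish.values())

tasks = [("a", 3), ("b", 2), ("c", 2), ("d", 1)]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(schedule(tasks, edges, use_shared_cache=True))   # shorter makespan
print(schedule(tasks, edges, use_shared_cache=False))  # slower transfers dominate
```

Even this toy shows how transfer latency on the DAG's edges, not just per-task compute time, drives end-to-end timing.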
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionTraditionally, the budgeting of STA and IR-drop limits was done separately, with each converging to its respective limit without much interaction. Recently, there have been attempts to incorporate IR drop into STA analysis for a more informed timing signoff. However, the reverse, incorporating timing-critical paths into IR signoff, has not been as thoroughly investigated.
This work proposes a methodology for IR-drop signoff with awareness of timing-critical paths. It utilizes the latest features of the RedHawk-SC EDA tool to incorporate timing analysis results into IR voltage-drop signoff. This IR voltage-drop data can subsequently be incorporated into an incremental timing analysis to pinpoint potential waivers for IR violations. Evaluation data from real design blocks in advanced nodes demonstrate that the methodology can improve design coverage and enhance silicon robustness and system performance.
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
DescriptionThe k-clique counting problem plays an important role in graph mining, which has seen a growing number of applications. However, current k-clique counting accelerators cannot meet the performance requirements, mainly because they struggle with the heavy data transfer incurred by intensive set intersection operations and with the inability to balance load. In this paper, we propose to solve this problem with a hybrid framework of content addressable memory (CAM) and processing-in-memory (PIM). Specifically, we first utilize CAM for binary induced-subgraph generation in order to reduce the search space; then we use PIM to implement in-place parallel k-clique counting through an iterative Boolean "AND"-like operation. To take full advantage of this combined CAM and PIM framework, we develop dynamic task scheduling strategies that achieve near-optimal load balancing among the PIM arrays. Experimental results demonstrate that, compared with state-of-the-art CPU and GPU platforms, our approach achieves speedups of 167.5× and 28.8×, respectively. Meanwhile, energy efficiency is improved by 788.3× over the GPU baseline.
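The "AND"-like set intersection the abstract refers to is easy to see in software: with adjacency rows stored as bitmaps, the common neighbours of two adjacent vertices are one bitwise AND away. The pure-software toy below counts triangles (k = 3) this way; it is our own illustration of the primitive, not the accelerator's design.

```python
# Sketch of bitmap-AND set intersection for clique counting: for every edge
# (u, v), the AND of the two adjacency bitmaps yields their common
# neighbours, each completing a triangle. The paper maps such ANDs onto
# PIM arrays; this toy just shows the arithmetic.

def triangle_count(adj_bits):
    """adj_bits[v] is a bitmask of v's neighbours; graph is undirected."""
    n = len(adj_bits)
    total = 0
    for u in range(n):
        for v in range(u + 1, n):
            if adj_bits[u] >> v & 1:                     # edge (u, v) exists
                common = adj_bits[u] & adj_bits[v]       # the "AND"-like step
                total += bin(common >> v + 1).count("1")  # count w > v only
    return total

# 4-vertex graph with edges (0,1), (0,2), (1,2), (2,3): exactly one triangle.
adj = [0b0110, 0b0101, 0b1011, 0b0100]
print(triangle_count(adj))  # 1
```

Larger k works the same way, ANDing one more adjacency row per level of the clique, which is why intersection bandwidth dominates these accelerators.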
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWith interconnect spacing shrinking in advanced technology nodes, the precision of existing timing predictions worsens because crosstalk-induced delay is hard to quantify. During the routing process, the crosstalk effect is usually modeled by predicting coupling capacitance from congestion information. However, the resulting timing estimation is overly pessimistic, since crosstalk-induced delay depends not only on the coupling capacitance but also on the signal arrival time. In this work, a crosstalk-aware timing estimation method is presented using a two-step machine learning approach. First, interconnects that are physically adjacent and overlap in their signal timing windows are filtered. Second, crosstalk delay is predicted by integrating physical topology features and timing features, without requiring the post-routing result or the parasitic extraction flow. Experimental results demonstrate that the match rate of identified crosstalk-critical nets is over 99% compared to a commercial tool, and the delay prediction results are more accurate than other state-of-the-art methods.
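The first filtering step can be sketched in a few lines: an aggressor can only disturb a victim if the two nets are physically adjacent and their switching windows intersect. A hedged illustration (net names, windows, and the adjacency list are invented; the real flow derives these from layout and STA data):

```python
# Illustrative sketch of crosstalk candidate filtering: keep only
# physically adjacent net pairs whose signal timing windows overlap.
# All data below is hypothetical.

def windows_overlap(w1, w2):
    """True if two (start, end) switching windows intersect."""
    return w1[0] < w2[1] and w2[0] < w1[1]

def crosstalk_candidates(nets, adjacency):
    """nets: {name: (start, end)}; adjacency: iterable of adjacent net pairs."""
    pairs = []
    for a, b in adjacency:
        if windows_overlap(nets[a], nets[b]):
            pairs.append((a, b))
    return pairs

nets = {"clk": (0.0, 0.2), "data1": (0.1, 0.5), "data2": (0.6, 0.9)}
adjacency = [("clk", "data1"), ("clk", "data2"), ("data1", "data2")]
print(crosstalk_candidates(nets, adjacency))  # only clk/data1 can interact
```

Only the surviving pairs are handed to the second step's delay predictor, which is what keeps the estimate from being uniformly pessimistic.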
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionVarious custom cells are used in DRAM and NAND flash memories to optimize power, performance, and area. Liberty model characterization of custom cells becomes a time-consuming manual task when an automation tool is unable to extract the timing arcs and SPICE input decks (called the configuration for characterization in this paper) from them. The conventional approach is to enhance the tool's capabilities so that it can accommodate custom cells that were not previously taken into consideration. However, as the majority of cell types remain unchanged across projects, the configurations can be reused once manually crafted and verified. This study presents a data-driven approach that automates the Liberty model characterization process by mapping a cell to its corresponding configuration with a neural network. We employ graph neural networks (GNNs) to establish relationships between cell topologies and configurations. We implemented supervised classifiers based on widely used GNNs such as GCN, GraphSAGE, GAT, and GIN, and compared their classification accuracies and parameter counts. With GNNs, our method reached over 94% accuracy, while traditional rule-based methods using naming conventions or ad-hoc connectivity rules scored below 75%.
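The intuition for why graph structure beats naming rules can be shown with a much simpler stand-in: a Weisfeiler-Lehman-style relabelling summarizes transistor-connectivity topology, so two cells with different instance names but identical topology map to the same key. This is our own analogy only; the poster's classifiers are trained GNNs (GCN/GraphSAGE/GAT/GIN), not this hash.

```python
# Illustrative topology signature: iterated neighbourhood relabelling over a
# cell's device graph. Node labels are device types; instance names never
# enter the signature, so renamed-but-identical cells collide on purpose.

def wl_signature(node_labels, edges, rounds=2):
    """Topology signature from Weisfeiler-Lehman-style relabelling."""
    labels = dict(node_labels)
    neigh = {v: [] for v in labels}
    for u, v in edges:
        neigh[u].append(v)
        neigh[v].append(u)
    for _ in range(rounds):
        labels = {v: (labels[v], tuple(sorted(labels[w] for w in neigh[v])))
                  for v in labels}
    return tuple(sorted(map(str, labels.values())))

# Two "differently named" inverters with identical PMOS/NMOS topology.
inv_a = wl_signature({"P1": "pmos", "N1": "nmos"}, [("P1", "N1")])
inv_b = wl_signature({"Px": "pmos", "Nx": "nmos"}, [("Px", "Nx")])
print(inv_a == inv_b)  # same topology -> same configuration key
```

A learned GNN generalizes this idea: instead of an exact signature match, it classifies unseen-but-similar topologies into the right configuration class.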
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionNearly a decade ago, in July 2015, we released the 1st edition of our book "Formal Verification: An Essential Toolkit for Modern VLSI Design". This book was well received in the industry, being essentially the first practical modern guidebook on the topic of Formal Verification (FV) aimed at active engineers designing and validating RTL models, rather than theoretical researchers. However, we are part of a rapidly evolving field, and our notion of best practices for FV has undergone many changes in the years since the initial release. We have also gained a variety of different experiences: while all three authors had worked together at Intel when beginning the first edition, since then one author has moved to academia, and another has moved from Intel to EDA vendor Cadence. It is the gradual accumulation of these changes and varied new learnings that eventually motivated us to put out a heavily revised 2nd edition, released in June 2023. Since not every FV practitioner has purchased our 2nd edition or has kept completely up to date with FV methodology at other companies in the industry, we think it will be useful to summarize some of the major areas in which FV practice has changed and improved in the years leading up to our 2nd edition. This information will help current designers, validators, and FV specialists to improve their practices and enable them to better incorporate the industry's latest learnings.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionDeep Learning, particularly Deep Neural Networks (DNNs), has emerged as a powerful tool for addressing intricate real-world challenges. Nonetheless, the deployment of DNNs presents its own set of obstacles, chiefly stemming from substantial hardware demands. In response to this challenge, Domain-Specific Accelerators (DSAs) have gained prominence as a means of executing DNNs, especially within cloud service providers offering DNN execution as a service. For service providers, managing multi-tenancy and ensuring high-quality service delivery, particularly in meeting stringent execution time constraints, assumes paramount importance, all while endeavoring to maintain cost-effectiveness. In this context, the utilization of heterogeneous multi-accelerator systems becomes increasingly relevant. This paper presents RELMAS, a low-overhead deep reinforcement learning algorithm designed for the real-time scheduling of DNNs in multi-tenant environments, taking into account the dataflow heterogeneity of accelerators and memory bandwidth contention. By doing so, service providers can employ the most efficient scheduling policy for user requests, optimizing Service-Level-Agreement (SLA) satisfaction rates and enhancing hardware utilization. The application of RELMAS to a heterogeneous multi-accelerator system composed of various instances of Simba and Eyeriss sub-accelerators resulted in up to a 173% improvement in SLA satisfaction rate compared to state-of-the-art scheduling techniques across different workload scenarios, with less than a 1.5% energy overhead.
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionIn this talk, we present Cross Testbench (XTB), a distributed co-simulation environment that enables co-simulation across two simulation approaches: event-driven and cycle-based. Event-driven and cycle-based simulation are two commonly utilized verification approaches in the industry. The former takes delays and timing into account, is versatile, and works well with asynchronous systems, which makes it ideal for achieving highly accurate simulations; however, simulation speed depends on model size and activity, making it slower for large designs. Cycle-based simulation, in contrast, is faster, scales better, and supports hardware acceleration, but does not include timing information, which makes it more suitable for large designs such as server microprocessors. Each approach has distinct benefits, and leveraging both ensures reliable and precise verification while maintaining rapid execution and extensive test coverage. We leveraged XTB to achieve chip-level verification, allowing interplay between the parts of the design that had to be simulated with event-driven simulation (such as vendor-delivered Verification IPs for physical parts) and the rest, which used cycle-based simulation to achieve high throughput. We highlight the successful use of XTB to verify IBM's memory buffer chip, which integrates external IPs such as DDR5 and PCIe. In addition, we outline XTB's capability to save and restart a distributed co-simulation to significantly improve performance in a production environment.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe state-of-the-art method for oracle synthesis in quantum computing is based on logic networks, where each node corresponds to an output or an intermediate state requiring uncomputation cleanup. The order in which we compute and uncompute these nodes, sometimes referred to as the reversible pebble game, is a key factor influencing the number of qubits and the circuit length in the final result. In this paper, we introduce a novel pebbling strategy based on divide-and-conquer that aims at reducing the number of qubits while maintaining a reasonable circuit length. Our results show that our algorithm beats the previous heuristic method in both number of qubits and circuit length, showing its potential for tackling large-scale oracle synthesis problems.
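The qubit-versus-length trade-off in the pebble game can be felt with a classic toy: on a chain of n dependent nodes, a Bennett-style divide-and-conquer computes the midpoint, recurses on the right half, then uncomputes the midpoint. The sketch below (our own illustration; the paper's strategy is different and more general) counts total (un)compute steps and peak live pebbles, i.e. qubits:

```python
# Toy divide-and-conquer pebbling of a dependency chain of n nodes:
# compute the midpoint, pebble the right half with the midpoint held live,
# then uncompute the midpoint. Peak pebbles ~ log2(n), steps ~ n^1.585.

def pebble(n, live=0):
    """Return (steps, peak_pebbles) to pebble the last node of a chain of n."""
    if n == 0:
        return 0, live
    if n == 1:
        return 1, live + 1
    mid = n // 2
    s1, p1 = pebble(mid, live)          # compute midpoint from the left
    s2, p2 = pebble(n - mid, live + 1)  # right half, midpoint kept live
    s3, _ = pebble(mid, live)           # uncompute the midpoint
    return s1 + s2 + s3, max(p1, p2)

for n in (4, 16, 64):
    steps, qubits = pebble(n)
    print(n, steps, qubits)  # qubits grow ~log n while steps grow much faster
```

Trading a logarithmic qubit count for a polynomially longer circuit is exactly the tension a good pebbling strategy has to navigate.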
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionTransformer-based language models have demonstrated tremendous accuracy in multiple natural language processing (NLP) tasks. Transformers use self-attention, in which matrix multiplication is the dominant computation. Moreover, their large size makes data movement a latency and energy-efficiency bottleneck in conventional von Neumann systems. Processing-in-memory (PIM) architectures, with compute elements in the memory, have been proposed to address this bottleneck. This paper presents PACT-3D, a PIM architecture with novel computing units interfaced with DRAM banks that perform the required computations, achieving a 1.7× reduction in latency and an 18.7× improvement in energy efficiency over the state-of-the-art PIM architecture.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn recent years, In-RRAM Computing (IRC) has become a promising technique for deep neural network (DNN) applications. Combined with proper pruning techniques, the cost and energy of DNN computation can be further reduced. However, IRC often suffers from various non-ideal effects in RRAM arrays, such as sneak paths and IR drop, which greatly affect computation accuracy. Therefore, accurate error injection is required for verification at an early design stage. Conventional random-disturbance and equation-based approaches do not consider the data-allocation issue, which may incur larger errors for the sparse matrices generated by data-pruning techniques. In this paper, a fast and accurate IR-drop model is proposed to reflect data-dependent effects, offering accurate error injection in the DNN training phase with sparse matrices. As shown in the experimental results, the proposed model matches HSPICE results well even when the data allocation becomes non-uniform. With the proposed simple model, the accuracy degradation of real NN applications can be well observed, even for large RRAM arrays.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn analog or mixed-signal in-memory computing (IMC) applications, the focus is typically on the bit cell, particularly during the inference period. However, to transmit multiplication-and-accumulation (MAC) results to subsequent layers, IMC macros must convert analog signals into the digital domain using analog-to-digital converters (ADCs), often the most power- and area-intensive components in IMC systems. Addressing this, we present an efficient training/inferencing algorithm tailored for specific IMC applications, introducing an ADC-less IMC macro design suitable for practical memory systems. This novel architecture eliminates the need for power-intensive ADCs, opting for reconfigurable conventional memory structures with sense amplifiers, like DRAM or SRAM arrays. This study introduces an algorithm that integrates sense amplifiers into both the training and inference processes without additional hardware.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionCircuit knitting emerges as a promising technique to overcome the limited number of physical qubits in near-term quantum hardware by cutting large quantum circuits into smaller subcircuits. Recent research in this area has been primarily oriented toward reducing subcircuit sampling overhead. Unfortunately, these works neglect hardware information during circuit cutting, posing significant challenges to the follow-on stages. In fact, direct compilation and execution of these partitioned subcircuits yields low-fidelity results, highlighting the need for a more holistic optimization strategy.
In this work, we propose a hardware-aware framework aiming to advance the practicability of circuit knitting. In contrast with prior methodologies, the presented framework innovatively designs a cutting scheme that concurrently optimizes the number of gate cuttings and SWAP insertions during circuit cutting. In particular, we leverage the graph similarity between the qubit-interaction graph and the chip layout as a heuristic guide to reduce potential SWAPs in the subsequent qubit-routing step. Building upon this, the circuit knitting framework we developed can reduce total subcircuit depth by up to 64% (48% on average) compared to the state-of-the-art approach, and enhance relative fidelity by up to 2.7×.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs the technology node shrinks, routing in memory devices is becoming a challenging problem. Advanced commercial routing solutions have been introduced to deal with more complex design rules and fewer routing resources; however, routing results are still far from satisfactory. Complex routing patterns from those solutions do not meet customers' specific expectations and instead make it more difficult for engineers to modify them manually. In this paper we explore whether a simpler, heuristic-based routing methodology can be a better option for improving routability. Our methodology simplifies the entire routing process into two stages, global routing and local routing, with a heuristic-based algorithm applied in each stage. With our routing methodology, we achieve a routing success rate that is higher by 43% on average, with 13% less routing resource usage and 68% fewer DRC errors on average.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWireless baseband processing (WBP) is a key element of wireless communications, with a series of signal processing modules to improve data throughput and counter channel fading. Conventional hardware solutions, such as digital signal processors (DSPs) and, more recently, graphics processing units (GPUs), provide various degrees of parallelism, yet both fail to take into account the cyclical and consecutive character of WBP. Furthermore, the large amount of data in WBP cannot be processed quickly in symmetric multiprocessors (SMPs) due to the unpredictability of memory latency. To address this issue, we propose a hierarchical dataflow-driven architecture to accelerate WBP. A \textit{pack-and-ship} approach is presented under a non-uniform memory access (NUMA) architecture to allow the subordinate tiles to operate in a bundled access-and-execute manner. We also propose a multi-level dataflow model and the related scheduling scheme to manage and allocate the heterogeneous hardware resources. Experimental results demonstrate that our prototype achieves $2\times$ and $2.3\times$ speedup in terms of normalized throughput and single-tile clock cycles compared with GPU and DSP counterparts in several critical WBP benchmarks. Additionally, a link-level throughput of $288$ Mbps can be achieved with a $45$-core configuration.
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
Description3D ICs promise increased logic density and reduced routing congestion over conventional monolithic 2D ICs.
High level synthesis (HLS) tools promise reduced design complexity by approaching the design from a higher abstraction level and allow for more optimization flexibility.
We propose improving timing closure of 3D ICs by co-designing the architecture and physical design by integrating HLS and 3D IC macro placement into the same holistic loop.
On average, our method reduces estimated total negative slack (TNS) by 62% and 92% compared to a traditional binding and placement technique for 2D and 3D ICs, respectively.
Research Manuscript
Design
Quantum Computing
DescriptionIsing model-based computers have recently emerged as high-performance solvers for combinatorial optimization problems (COPs). For the Ising model, a simulated bifurcation (SB) algorithm searches for the solution by solving pairs of differential equations. The SB machine benefits from massive parallelism but suffers from high energy consumption. Dynamic stochastic computing implements accumulation-based operations efficiently. This article proposes a high-performance stochastic SB machine (SSBM) for solving COPs with efficient hardware. To this end, we develop a stochastic SB (sSB) algorithm in which the multiply-and-accumulate (MAC) operation is converted to multiplexing and addition, while the numerical integration is implemented using signed stochastic integrators (SSIs). Specifically, sSB stochastically ternarizes the position values used for the MAC operation. A stochastic computing SB cell (SC-SBC) is constructed using two SSIs for area efficiency. Additionally, a binary-stochastic computing SB cell (BSC-SBC) uses one binary integrator and one SSI to achieve reduced delay. Based on sSB, an SSBM is then built using the SC-SBC or BSC-SBC as the basic building block. The designed and synthesized SSBMs with 2000 fully connected spins require at least 1.13x less area than state-of-the-art designs.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionFully Homomorphic Encryption (FHE) enables unlimited computation depth, allowing privacy-enhanced neural network inference tasks directly on ciphertext. However, existing FHE architectures suffer from a memory access bottleneck due to significant data consumption. This work proposes a High-throughput FHE engine for private inference (PI) based on 3D stacked memory (H3). H3 adopts a software-hardware co-design that dynamically adjusts the polynomial decomposition during the PI process to minimize computation and storage overhead at a fine granularity. With 3D hybrid bonding, H3 integrates a logic die with multi-layer embedded DRAM, routing data efficiently to the processing unit array through an efficient broadcast mechanism. H3 consumes 192mm$^2$ of area when implemented in a 28nm logic process. H3 achieves a throughput of 1.36 million LeNet-5 or 920 ResNet-20 PIs per minute, surpassing existing 7nm accelerators by 52%. This demonstrates that 3D memory is a promising technology for improving the performance of FHE.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper presents a high-throughput, energy-efficient, and constant-time in-SRAM Advanced Encryption Standard (AES) engine. The proposed in-memory AES ensures high-throughput operation by exploiting column-wise single instruction multiple data (SIMD) processing of compact round functions for both electronic-codebook (ECB) and counter (CTR) modes of operation. Moreover, we propose a processor-assisted key loading strategy and a prudent memory management scheme to minimize the memory footprint needed for AES, improving the peak operating frequency and energy efficiency of the underlying compute SRAM hardware. The bit-serial processing further guarantees constant-time execution of AES, providing strong resistance to side-channel timing attacks. Experimental results show that our proposed AES ECB design achieves 2.4x (149x) throughput, 2.4x (270x) throughput per area, and 2.3x (7.7x) per-block energy improvement compared to state-of-the-art non-constant-time (constant-time) designs, respectively. The resulting AES counter (CTR) mode design achieves a 1.9x per-block energy improvement compared to state-of-the-art reconfigurable IMC AES CTR designs.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionAs deep learning empowers various fields, many new operators have been proposed to improve the accuracy of deep learning models. Researchers often use imperative programming paradigms (e.g., PyTorch) to express these new operators, leaving their fusion optimization to deep learning compilers. Unfortunately, the inherent side effects introduced by imperative tensor programs, especially tensor-level mutations, often make optimization extremely difficult. We present a holistic functionalization approach (TensorSSA) to optimizing imperative tensor programs beyond control-flow boundaries. We achieve a 1.79X (1.34X on average) speedup over state-of-the-art works in representative deep learning tasks.
Research Manuscript
Autonomous Systems
Autonomous Systems (Automotive, Robotics, Drones)
DescriptionIn this paper, we introduce DLAPID, a novel decoupled parallel hardware-software co-design architecture for real-time video dehazing. From a software point of view, DLAPID isolates the atmospheric light operation from the initial transmission estimation to take full advantage of the parallelization features of hardware accelerators. For the hardware implementation, we deploy DLAPID on both FPGA and GPU platforms and validate its effectiveness. Using both real-world driving-scenario test sets and ground-truth datasets, we quantitatively and qualitatively assess the proposed method against several SOTA (state-of-the-art) video dehazing models. The outcomes of our experiments demonstrate that our approach achieves better dehazing performance with lower power consumption and has real-time processing capabilities, thereby helping prevent potential accidents in autonomous vehicles.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionBackside Power Delivery Network (BSPDN) is a Design Technology Co-Optimization (DTCO) method aimed at sustaining Moore's Law. It relocates the Power Delivery Network (PDN) inside the silicon, transitioning from the front side to the back side and thereby freeing up routing resources for improved signal routing. The improvement in IR drop compared to the traditional Frontside Power Delivery Network (FSPDN) is also noteworthy. Traditional IR drop analysis takes months, spanning PDK release, P&R, and IR analysis. In this paper, we propose a methodology to estimate the IR drop improvement of BSPDN at an early stage. We first model the PDN using a simplified resistance and current model. Based on this simplified model, we derive a formula to calculate the IR drop, applicable to both BSPDN and FSPDN. This method allows us to estimate IR drop before the actual Place and Route (P&R) tasks are completed, thereby speeding up the DTCO iteration. To demonstrate the correlation, we implement a real design and analyze IR drop using Electronic Design Automation (EDA) tools. The results indicate that this methodology is effective in estimating IR drop before the design is implemented, thereby benefiting DTCO.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn this paper we propose an easy, module-bind-based automation for AXI protocol violation checking and performance extraction from any AXI-3 based bus. The proposed automation infrastructure reduces the manual effort, time, and human error involved in extracting performance indices, and it also flags any AXI protocol violations in the design. Its major capabilities include reporting of AXI protocol violations, per-transaction latency, bytes transferred, average latency, peak latency, total accumulated latency, average outstanding transactions, number of address requests, number of data requests, and net bandwidth. The infrastructure also generates an independent RTL-hierarchical performance summary log with the previously mentioned parameters, which enables the user to obtain performance information without any waveform. The infrastructure was tested on various AXI-3 masters with different address, data, and ID widths, resulting in a reduction in design verification time and higher confidence in the quality of the design. Producing a performance and protocol-check report is effortless with this infrastructure, requiring very minimal input. Being parameterized and bind-based, the infrastructure exhibits significant reusability, whether at the SoC or IP level.
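As an illustration of the reported metrics (the record format and helper below are hypothetical stand-ins, not the infrastructure's actual interface), the per-transaction and aggregate figures can be derived from (start, end, bytes) tuples:

```python
# Illustrative post-processing of the metrics such an infrastructure reports,
# given hypothetical (start_cycle, end_cycle, bytes) records per transaction.
def axi_metrics(transactions):
    latencies = [end - start for start, end, _ in transactions]
    total_bytes = sum(b for _, _, b in transactions)
    span = max(e for _, e, _ in transactions) - min(s for s, _, _ in transactions)
    return {
        "per_txn_latency": latencies,
        "avg_latency": sum(latencies) / len(latencies),
        "peak_latency": max(latencies),
        "total_latency": sum(latencies),
        "bytes_transferred": total_bytes,
        "net_bandwidth": total_bytes / span,   # bytes per cycle over the window
    }

txns = [(0, 4, 64), (2, 10, 64), (5, 9, 32)]
m = axi_metrics(txns)
```

The real infrastructure would capture these records in SystemVerilog via the bound checker module; the arithmetic per metric is the same.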
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLiquid State Machine (LSM), a spiking neural network model, has shown superiority in various applications due to its inherent spatiotemporal information processing property and low training complexity. Traditional hyperparameter optimization methodologies for LSM usually focus on the single criterion of accuracy while ignoring the trade-off among accuracy, parameter size, and hardware overhead (e.g., power consumption) when deployed on neuromorphic processors, which hinders LSM's wider application in resource-restricted scenarios (e.g., embedded systems). Thus, co-considering the performance of LSM algorithms and hardware constraints is critical for real-world applications and still requires further exploration. This work treats the optimization of LSM as a Multi-objective Optimization Problem (MOP) and proposes a general hardware-aware multi-objective optimization framework. In light of the vast design space and time-consuming function evaluations of spiking neural networks, a decomposition-based Multi-objective Optimization Algorithm (MOA) aimed at computationally expensive problems, MOTPE/D, is proposed within this framework. Experiments are conducted on two typical case studies, i.e., N-MNIST classification and DVS-128 classification. The experiments support that the proposed framework outperforms peer solutions in terms of different performance indicators. This work is open sourced for reproducibility and further study, and can be accessed through: https://anonymous.4open.science/r/MOTPE-D.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionGraph Neural Networks (GNNs) demand extensive fine-grained memory access, which leads to inefficient use of bandwidth resources. This issue is more serious when dealing with large-scale graph training tasks. Near-data processing emerges as a promising solution for data-intensive computation tasks; however, existing GNN acceleration architectures do not integrate the near-data processing approach. To address this gap, we conduct a comprehensive analysis of GNN operation characteristics, taking into consideration the requirements for accelerating the aggregation and combination processes. In this paper, we introduce a near-data processing architecture tailored for GNN acceleration, named NDPGNN. NDPGNN offers different operational modes, catering to the acceleration needs of various GNN frameworks, while ensuring system configurability and scalability. In comparison to previous approaches, NDPGNN brings a 5.68x improvement in system performance while reducing energy consumption overhead by 8.49x.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn the fast-paced semiconductor world, rapid time-to-market is crucial. Traditional SoC development, which waits for fully developed IPs, hinders speed and competitiveness. This presentation introduces the concept of preliminary IP CAD views, generated as soon as IP specifications are defined. This allows SoC developers to start design work (flow setup and cleanup) and provide feedback earlier, significantly reducing overall cycle time. We propose an optimized approach for generating these preliminary views, achieving up to 40% faster runtime and minimizing delays caused by human intervention. This streamlined technique allows for faster iterations and feedback, increasing development speed and competitiveness.
DAC Pavilion Panel
Design
DescriptionRISC-V and a growing open-source ecosystem have moved from hype to reality. Consequently, the semiconductor industry is at an inflection point as architectural paradigms require early power and performance metrics, creating demand for new design, verification and validation technologies and methodologies.
Engineers now have the ability to design a specific rather than generic open-source instruction set, easily customizable to an application in a vertical market. It's an era where RISC-V design starts are not just starts but are used in volume production.
The status quo has been upended, and with it comes the challenge of a new open-source ecosystem versus the trust of a traditional, well-established, and rich ecosystem. The new open-source instruction set and software lack the legacy, experience, and domain-knowledge sharing of the incumbents, particularly the usage and experience of software validation.
It could also become an exciting era for design verification as it becomes the chief enabler for the new ecosystem and architecture, especially hardware-assisted verification that can serve as a risk mitigation tool.
A panel of design and verification users and experts, all of whom have studied the open-source ecosystem and its requirements and deficiencies, will be part of the DAC Pavilion Panel. DAC attendees are invited to listen in as they discuss where emphasis should be placed for the next-generation design verification flow. Audience participation will be encouraged.
Back-End Design
Back-End Design
Design
Engineering Tracks
Description5G downlink datapath designs contain repeated structures (the same design instantiated multiple times), which make it very difficult to identify and place the macros in a way that is optimal for routability and performance. Traditional macro placement, however, has been a very manual and iterative endeavor for these and all types of complex designs, where the number of macros has grown dramatically, the sizes vary widely, and the interconnectivity between them is increasingly intricate.
In this paper, we set out to test whether an AI-driven P&R macro placement capability could mimic the QoR (floorplan quality and design metrics) achieved by expert engineers on this design, but in a fraction of the time, lessening the burden of manually placing the macros and running the full-flow iterations required by our traditional flow.
In addition, we further investigated the benefits of the feature's Bayesian optimization flow for design exploration on the same block, analyzing whether generating various floorplans, each of which alone could meet the required metrics, could yield an optimal solution unique to the needs of the design. By providing comparison results at post-placement, designers can then choose which option to push through the full P&R flow, reducing the total number of iterations and the overall turnaround time.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRF circuit analyses such as periodic AC (PAC) and periodic noise (PNoise) simulation are very computationally demanding, especially when the number of frequency points is large. In this paper, we propose a new iterative method with Krylov-subspace recycling for large-scale PAC and PNoise analysis, which reuses the Krylov subspace generated during the solutions at previous frequencies to accelerate the convergence of the iterative solution at subsequent frequencies. In particular, we derive the recycling method based on the GMRES formulation, which is more efficient and robust than the previous recycling method based on the GCR formulation. In addition, we study the effect of the frequency sweeping order on reducing the total number of iterations in the subspace recycling process. Numerical results show that the proposed method achieves a speedup of 4.7X-21.2X compared to non-recycling GMRES and up to a 24.5% improvement compared to the traditional recycling GCR.
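True Krylov-subspace recycling is considerably more involved, but the underlying intuition, that adjacent frequency points yield nearby linear systems so work done at one frequency helps the next, can be sketched with a much simpler warm-start scheme (our illustration, not the paper's method):

```python
import numpy as np

def jacobi_solve(A, b, x0, tol=1e-10, max_iter=10000):
    """Plain Jacobi iteration; returns (solution, iteration count)."""
    D = np.diag(A)
    x, iters = x0.copy(), 0
    while np.linalg.norm(A @ x - b) > tol and iters < max_iter:
        x = x + (b - A @ x) / D
        iters += 1
    return x, iters

rng = np.random.default_rng(0)
n = 50
base = 0.1 * rng.standard_normal((n, n))
base += np.diag(5.0 + np.abs(base).sum(axis=1))   # make it diagonally dominant
b = rng.standard_normal(n)

cold_total = warm_total = 0
x_prev = np.zeros(n)
for w in np.linspace(1.0, 2.0, 10):               # the "frequency sweep"
    A = base + w * np.eye(n)                      # system shifts with frequency
    _, it_cold = jacobi_solve(A, b, np.zeros(n))  # solve from scratch
    x_prev, it_warm = jacobi_solve(A, b, x_prev)  # reuse the previous solution
    cold_total += it_cold
    warm_total += it_warm
```

Recycling a whole subspace rather than a single vector generalizes this: the directions built up at one frequency deflate the spectrum of the next system, so GMRES converges in fewer iterations, and the sweep order controls how "nearby" consecutive systems are.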
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Descriptionarm has long been exploring the advantages of the cloud and is motivated to become fully cloud enabled. On the cloud, spot instances have always been the cost-effective option, but not many EDA tools can leverage this advantage. Spot instances offer a cost-effective solution by taking advantage of unused cloud resources, and arm has already adopted them for small/short workloads like APL characterizations. The runtime of Redhawk-SC (RHSC) EMIR runs is high, leading to a higher susceptibility to failure due to extended durations and larger resource requirements. The goal is for RHSC to harness this capability for large workloads, providing a viable option for optimizing the user's cloud expenses.
The new DataLake feature offers a more cost-effective solution by dividing workers into two categories. Execution workers are launched on spot instances and are responsible solely for the execution of jobs; with micro-resiliency, the eviction of spot instances is handled gracefully. DataLake workers, on the other hand, are launched on reserved instances to ensure reliability, since they are file servers. By dividing workers into these two categories and leveraging the capabilities of reserved and spot instances, this approach enables a highly scalable and cost-efficient system with robust micro-resiliency.
DataLake runs on aarch64 machines completed successfully in spite of spot instance evictions. A cost reduction is seen on DataLake runs compared to reserved-instance runs, with minimal impact on runtime and no change in QoR.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionOne of the critical requirements for any embedded application is FuSA (Functional Safety), because it is essential that all embedded devices function correctly and safely under any fault or failure scenario. In automotive, per the ISO 26262 standard, any failure, be it systematic or random, needs to be addressed during development itself.
This paper focuses on two safety strategies widely used in automotive designs (TMR: Triple Module Redundancy and DCLS: Dual Core Lock Step) and on how, by using the new USF (Unified Safety Format), these safety mechanisms can be implemented with minimal user effort and reduced run time.
Earlier, both strategies were achieved using custom-coded scripts, and the user was required to manually create bounds for the Safety Main and Shadow modules. Also, the TMR solution with a single voter cell was not supported.
With USF format support, the TMR conversion can be achieved using a single voter cell and an effective physical separation can be achieved for the Main and Shadow modules.
This paper will also highlight the run-time gain with the new USF-based approach.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPrior to product market launch, it is critical to have a cost-effective post-silicon validation program. Currently, post-silicon validation requires tremendous resources to constantly stress-test silicon by running a list of internal and external tools across a cluster of systems. This effort involves a high number of stress-test cases and consumes thousands of stress hours. However, the question remains: are the parts really being stressed by running those stress tests? How thorough is stress coverage across the silicon? Moreover, does the probability of identifying a bug increase with higher stress? What about the case of lower stress? The answers to these questions can teach us how to create and improve an effective validation stress-test plan. This paper describes a novel approach to extracting the stress map from a stress tool, applying the stress map to correlate with a stress-induced failure (bug), and assessing stress coverage across the entire validation test plan. It also discusses how the current validation stress-test plan can be improved using lessons from previous stress-induced failure studies.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionFormal verification is widely applied at the IP level. FPV and its apps (linting, register check, coverage) are widely used, and IPs are often signed off with formal verification alone. Our aim is to use FPV at the SoC level.
Our top-level verification tasks:
1. IP integration:
Check that all the IPs are correctly connected on the bus and accessible by the masters.
2. IP operation:
Check that all the IPs are functionally working in SoC.
3. System behavior:
Check that the main application is working.
The tests are usually developed in C code and executed by a CPU in a UVM test bench.
The paper focuses on step 1; the idea is to use Formal Property Verification to prove the IP integration. An internally developed Python utility generates specific SVA assertions from a simple SoC-description Excel file. It produces read-write properties that check the accessibility of the peripheral registers and memory spaces from the CPU bus master.
This approach verifies the SoC integration early in the flow, with no UVM; the bugs commonly discovered are:
- Wrong memory map
- Wrong data bus connection
- IP clock and/or reset stuck-at
- Wrong peripheral's reset value
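A miniature of such a generator might look like the following (the address map, template, and property shape are hypothetical stand-ins; the real utility reads an Excel description and emits production-grade SVA):

```python
# Hypothetical miniature of an SVA generator: emit one register-accessibility
# property per peripheral from a simple address-map description. A dict stands
# in for the Excel file the real utility reads.
SOC_MAP = {
    "UART0": {"base": 0x4000_0000, "size": 0x1000},
    "GPIO":  {"base": 0x4001_0000, "size": 0x0400},
}

TEMPLATE = """property p_{name}_rw;
  @(posedge clk) disable iff (!rst_n)
  (wr_addr inside {{[32'h{base:08x}:32'h{top:08x}]}} && wr_en)
    |-> ##[1:8] (rd_data == wr_data);
endproperty
assert property (p_{name}_rw);"""

def gen_assertions(soc_map):
    # One read-write accessibility property per peripheral address range.
    return [
        TEMPLATE.format(name=name, base=ip["base"],
                        top=ip["base"] + ip["size"] - 1)
        for name, ip in soc_map.items()
    ]

sva = gen_assertions(SOC_MAP)
```

A wrong memory map or a mis-connected data bus would make the generated property fail in FPV, which is exactly the class of integration bugs listed above.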
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionCurrently, formal verification techniques fall short when it comes to the verification of system-level behavior; only a handful of properties are converged by state-of-the-art SMT solvers. Moreover, the current state-of-the-art frameworks do not address three formal verification aspects as the design scales beyond the component level: consistency (whether the design is over-constrained), completeness (whether the set of properties considered is exhaustive), and correctness (whether the properties describe the correct behavior). Our proposed approach to system-level verification addresses all of these concerns. Our experiments show that all properties either converge with a better bound or reach higher bounds compared to legacy techniques. This leads us to say with confidence that our solution works well for subsystem-level design verification. As we submit this work, experiments are still ongoing to check the viability of this solution as designs are scaled up further.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAs semiconductor manufacturing technology has advanced rapidly, conventional approaches cannot classify new wafer defect patterns without training. To overcome this, our study proposes an image-matching-based search algorithm to analyse wafer defect patterns. The proposed algorithm finds the correlation of wafer defect patterns by determining the feature-based similarity between Wafer Bin Maps (WBMs). In addition, we propose a new metric called the Match of Defects (MoD) score to perform robust searching by considering the size and location of defect patterns. Experimental results show that our method is effective on the industrial WBM datasets WM811K and MixedWM38.
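A simple stand-in for a size- and location-aware WBM similarity (the actual MoD score is more elaborate; this IoU-style measure is our own illustration) could be:

```python
# Illustrative WBM similarity: intersection-over-union of defective dies,
# which is sensitive to both the size and the location of a defect pattern.
# This is a stand-in for the paper's MoD score, not its definition.
def defect_iou(wbm_a, wbm_b):
    a = {(r, c) for r, row in enumerate(wbm_a)
                for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(wbm_b)
                for c, v in enumerate(row) if v}
    return len(a & b) / len(a | b) if a | b else 1.0

edge_ring  = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]   # toy 3x3 bin maps
edge_ring2 = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
center_dot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(defect_iou(edge_ring, edge_ring2))   # similar patterns -> high score
print(defect_iou(edge_ring, center_dot))   # disjoint patterns -> 0.0
```

A search over a WBM library would then rank historical wafers by such a score to retrieve the closest known defect signature without any training.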
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionLLE (Local Layout Effect) refers to the mutual influence of adjacent layout elements in semiconductor design. When measuring the characteristics of standard cells, LLE context assumptions are stored together in the design kit to be utilized for block-level analysis. To minimize LLE impact on design, conventional library characterization relies on fixed overlay patterns that assume the worst or best context based on multiple experiments. But the actual context and the characterized context can differ, and such situations introduce uncertainty skew on the clock path, causing pessimism and optimism in the design. The proposed characterization and modeling method resolves the gap between the actual context and the design kit caused by these fixed overlay-pattern assumptions. It removes redundant pessimism and optimism in cell delay modeling, achieving PPA improvement and a higher sign-off frequency in the design.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionTo achieve the highest power savings, modifications should be made as early as possible in the design cycle, which requires RTL power optimization flows. A major challenge with RTL power optimization is the lack of an ecosystem to validate the power impact of the changes: to capture the power saving of a modification, a new waveform must be generated, requiring re-simulation. In most cases the simulation setup is available at the SoC level, so re-simulating the modified RTL becomes a resource- and time-consuming process.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionAnalog/mixed-signal IPs and products have a large number of custom bus routes, which have historically been routed manually to meet many requirements (various width/space/layer constraints for matching, IR drop, EM, noise, etc.). Turnaround time (TAT) keeps increasing due to the complexity of advanced-node DRC, growing product size, design changes, and the lack of automated solutions. In this paper, we analyze the challenges of custom bus automation and propose a new custom bus routing solution that enables the fast generation of a large number of varied, high-quality bus routes by copying user-defined reference wire information and applying segmented combinations of pre-defined bus options. The proposed solution was developed in collaboration between SLSI and Cadence and achieved a 63% TAT reduction in a pilot test.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAnnealing processors have attracted attention as domain-specific computers to solve combinatorial optimization problems (COPs) efficiently. Furthermore, their performance can be enhanced by the merge method that enables updating multi-variables simultaneously. However, directly implementing the merge method on an annealing processor requires large-scale computational and memory resources.
In this paper, we propose a parallel-trial double-update annealing (PDA) algorithm that integrates the merge method into the annealing computation flow. Moreover, its processor can be realized with a minor extension to an existing near-memory architecture. Simulation results for several COPs demonstrate that PDA can find higher-quality solutions than the conventional annealing algorithm.
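As a rough illustration of updating more than one variable per trial, the following sketch runs simulated annealing on a small Ising-style problem where each proposal flips either one spin or a pair of spins together. It is a generic toy, not the paper's PDA algorithm or its near-memory implementation; the function name, cooling schedule, and best-state tracking are all our assumptions.

```python
import math
import random

def anneal_double_update(J, h, steps=2000, t0=2.0, t1=0.05, seed=1):
    """Toy annealer for the Ising energy sum(h[i]*s[i]) + sum(J[i][j]*s[i]*s[j], i<j).
    Each trial proposes flipping one spin or a random pair (a "double update")."""
    rng = random.Random(seed)
    n = len(h)
    s = [rng.choice((-1, 1)) for _ in range(n)]

    def energy(state):
        e = sum(h[i] * state[i] for i in range(n))
        e += sum(J[i][j] * state[i] * state[j]
                 for i in range(n) for j in range(i + 1, n))
        return e

    e = energy(s)
    best_s, best_e = list(s), e
    for step in range(steps):
        t = t0 * (t1 / t0) ** (step / steps)              # geometric cooling
        flips = rng.sample(range(n), rng.choice((1, 2)))  # single or double update
        for i in flips:
            s[i] = -s[i]
        e_new = energy(s)
        if e_new < best_e:
            best_s, best_e = list(s), e_new               # remember best state seen
        if e_new <= e or rng.random() < math.exp((e - e_new) / t):
            e = e_new                                     # accept the move
        else:
            for i in flips:                               # reject: undo the flips
                s[i] = -s[i]
    return best_s, best_e
```

On a two-spin antiferromagnetic instance (J[0][1] = 1, zero fields) the ground state has opposite spins with energy -1, which the toy annealer finds quickly.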
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe propose PRADA, a practical DRAM-based analog PIM architecture. Unlike existing proposals, PRADA implements the NOT operation without any change to the cell area: it introduces two states in the bitline sense amplifier, requiring no additional circuitry. We also introduce sequential row activation to enhance throughput without modifying the row decoder. Compared to state-of-the-art analog PIM architectures, PRADA demonstrates 2.67-4.79x higher throughput for 8-bit integer multiply. For vector-ADD, PRADA achieves 3.09-3.13x speedups over the baseline, which compares favorably to the 1.04-2.07x speedups of the other architectures, while maintaining superior compatibility and reliability.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHewlett Packard Labs has been researching high-speed, low-power dense wavelength division multiplexing (DWDM) Silicon Photonics (SiPh) systems for post-exascale high-performance computing. We propose a process/temperature/voltage (PVT) variation analysis for SiPh designs leveraging an electronic-photonic co-design engine. In particular, the corner extremes of the electronics distort the signal integrity of the SiPh link in both the voltage and time domains, so we exploit novel adjustable tuning techniques in the electronic transceiver to improve system performance.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSolving the Boolean Matching Problem (BMP) is one of the fundamental tasks in EDA: it allows components from a technology library to be matched for functional equivalence against portions of a digital design. Checking the equivalence of two Boolean functions under negation-permutation-negation requires exploring a super-exponential number of possible negations and permutations of input and output bits. Current solutions address the BMP via approximate methods, which still have a worse-than-exponential worst-case time complexity.
In this work, we propose a quantum solver for the BMP achieving an exponential speedup in the exploration of the input negations, and devise a quantum sorting network to perform custom input permutations at runtime. We provide a fully detailed quantum circuit implementing our proposal, showing its costs in terms of the number of qubits and quantum gates.
We experimentally validated our solution both with a quantum circuit simulator and a physical quantum computer, a Rigetti ASPEN-M-2, employing the ISCAS benchmark suite, a de-facto standard for classical EDA.
Research Manuscript
Embedded Systems
Time-Critical and Fault-Tolerant System Design
DescriptionMultimodal transformers excel in various applications but face great challenges, such as high memory consumption and limited data reuse, that hinder real-time performance. To address these issues, we propose a processing-in-memory (PIM)-GPU collaboration-oriented compiler that optimizes the acceleration of multimodal transformers. The PIM-GPU synergy adapts well to multimodal transformers and improves execution time through dynamic programming algorithms. In addition, we introduce a PIM allocation algorithm tailored to variable-length inputs to further increase efficiency. Experimental results show an average end-to-end speedup of 15x.
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
DescriptionVirtual reality (VR) wearable devices achieve immersive entertainment by fusing multi-modal tasks from various senses. However, constrained by the short battery life and limited hardware resources of VR devices, it is difficult to run multiple tasks with different modalities simultaneously. To address this, we propose MTVR, an energy-efficient accelerator that supports multi-modal tasks for VR devices. We present a multi-task computing solution based on a flexible multi-task computing core design and an efficient computing-unit allocation strategy, which together enable multi-modal tasks to run efficiently. We have designed an early-exit detector to skip invalid calculations, which saves considerable energy. In addition, a fine-grained tiny-value skip method at the multiplier and adder levels is proposed to save further energy. We also provide a hybrid RRAM and SRAM memory access scheme that reduces external memory access (EMA). In our experimental evaluation, the multi-task computing core achieves an average computational utilization of 95%. When the invalid input ratio is 90%, the early-exit detector saves up to 88% of energy, and the tiny-value skip method saves a further 13%. The hybrid memory access scheme obtains a 98.9% EMA reduction. We deployed the MTVR accelerator on an FPGA with self-designed RRAM, achieving an energy efficiency of 3.6 TOPS/W, higher than other single-task accelerators.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe semiconductor industry faces a significantly higher proportion of third-party IP, and the number of Control and Status Registers (CSRs) can now grow to 5M+. Hardware/software interfaces (HSIs) are critical, yet users write and maintain homegrown scripts and solutions, spending significant manual effort to generate accurate designs from many different forms of definition, such as IP-XACT, SystemRDL, and spreadsheets.
We will introduce a unified single-source approach to CSR development that automates the generation of all outputs for hardware and software interface implementation, eliminates time-consuming and error-prone manual scripting and editing of design data, and provides a scalable infrastructure that promotes a rapid, highly iterative design environment and scales to the most complex designs.
The CSRSpec domain-specific language specifies all aspects of the HSI and generates RTL, firmware headers, verification class instances, documentation outputs, register behavior, and the address map hierarchy description. It provides a broad set of configurations and behaviors with over 200 unique properties and 6,000 register behavior combinations. The resulting methodology is repeatable, scalable, and supports legacy data reuse while supporting industry standards. Our examples show a significant reduction in manually maintained CSR specifications, fewer source-code copy-paste errors, and the elimination of file coherency issues.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionVolume imaging (a 3D model with inner structure) is widely applied in various areas, such as medical diagnosis and archaeology. Especially during the COVID-19 pandemic, there has been great demand for lung CT. However, it is quite time-consuming to generate a 3D model by reconstructing the internal structure of an object. To make things worse, due to the poor data locality of the reconstruction algorithm, researchers are pessimistic about accelerating it with ASICs. Besides the locality issue, we find that complex synchronization is also a major obstacle for 3D reconstruction. To overcome these problems, we propose a holistic solution using software-hardware co-design. We first provide a unified programming model to cover various 3D reconstruction tasks. Then, we redesign the dataflow of the reconstruction algorithm to improve data locality. In addition, we remove unnecessary synchronizations by carefully analyzing the data dependencies. After that, we propose a novel near-memory acceleration architecture, called Waffle, for further improvement. Experimental results show that Waffle in a package can achieve 3.51× ∼ 3.96× speedup over a cluster of 10 GPUs with 9.35× ∼ 10.97× energy efficiency.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs the proportion of memories in designs increases, the MMB (Multi-Memory Bus) interface is widely used in HPC cores for memory test. It is a predefined bus in the functional RTL that provides access to multiple memory arrays without the need for memory wrappers. Applying the MMB interface reduces test area, timing impact, and routing congestion. However, it also brings challenges: the memories inside an MMB interface can only be tested serially, which increases test time, test cost, and the chip's time-to-market.
In this paper, we propose solutions to the above challenges.
The memory subgroups of one MMB interface are tested in parallel, and the outputs of every two adjacent subgroups are compared in situ. To ensure the accuracy of the compare results, the output data of one subgroup is also fed into the processor for comparison.
The repair logic is shared between the parallel test subgroups, and a common repair solution is applied across the test groups.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionDomain-specific systems, consisting of custom hardware accelerators, improve the performance of a specific set of applications compared to general-purpose processing systems. These hardware accelerators are generated using high-level synthesis (HLS) tools. The HLS tools often ignore the challenges of implementing a complex system of parallel accelerators, particularly regarding the way accelerators access memory. Our work proposes a buffering system design that improves accelerators' memory accesses by intelligently employing burst transactions to prefetch useful data from external memory to on-chip local buffers. Our design is dynamic, parametric, and transparent to the accelerators generated by HLS tools. We derive the parameters using appropriate compiler-based analysis passes and memory channel latency constraints. The proposed buffering system design results in, on average, 8.8x performance improvements while lowering memory channel utilization on average by 53.2% for a set of PolyBench kernels.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionAs wafer cost continues to increase at a rapid pace, there is a growing demand to convert more of our 2D SoCs into 3D System-in-Package designs. Furthermore, as individual IPs get larger and more complex, we see a need to disaggregate these designs along arbitrary boundaries, or "cutlines", rather than along standard fabric interfaces as has been done in the past. This results in large numbers of high-speed ad hoc interfaces on the die boundaries and creates a need for cross-die optimization techniques. Silicon architects and floorplanners need robust and intuitive methods to rapidly create and assess different configurations in the early planning phase of the design, so that they can deliver the best mix of performance, power, area, and cost for the product. This paper presents these construction and analysis techniques on two different designs: a low-power crypto core that explores several cutlines, and a high-speed compute module that explores different bump pitch and floorplan options. We present exhaustive studies and KPIs that can support cutline decisions, including 2D/3D PPA comparison, 3D IR/thermal plots, 2D vs. 3D QoR (e.g., buffer/inverter count and routing length), D2D bump-to-flop distance monitoring, D2D timing path analysis, and 2D vs. 3D metal layer usage.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper presents the results of tests evaluating the quality of 4G and 5G signals in Brazil, aimed mainly at cities that build wind farms in mountainous regions, a typical scenario in Brazil. The results obtained between November 2021 and October 2023 show an increase in signal coverage of 835%. However, this does not mean the quality issues or oscillation have ended; on the contrary, we found and list four serious problems: failures in closed environments such as hospitals; loss of signal on roads and highways; slowness in settings with a large circulation of people and vehicles; and, as a consequence, disruption of applications that use two-factor authentication and of banking and credit-card applications.
Analyst Presentation
DescriptionWe will examine the financial performance and key business metrics of the EDA industry through 2023, as well as the material technical and market trends and requirements that have influenced EDA business performance and strategies. Among the trends, we will again examine the progression of semiconductor R&D spending and how the market value of the publicly held EDA companies has evolved. Lastly, we will provide our updated financial projections for the EDA industry for 2024 through 2026.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionOne of the primary challenges impeding the progress of Neural Architecture Search (NAS) is its extensive reliance on exorbitant computational resources. NAS benchmarks aim to simulate runs of NAS experiments at zero cost, obviating the need for extensive compute. However, existing NAS benchmarks use synthetic datasets and model proxies that make simplified assumptions about the characteristics of these datasets and models, leading to unrealistic evaluations. We present a technique for searching for training proxies that reduce the cost of benchmark construction by significant margins, making it possible to construct realistic NAS benchmarks for large-scale datasets. Using this technique, we construct an open-source bi-objective NAS benchmark for the ImageNet2012 dataset combined with the on-device performance of accelerators, including GPUs, TPUs, and FPGAs. Through extensive experimentation with various NAS optimizers and hardware platforms, we show that the benchmark is accurate and allows searching for state-of-the-art hardware-aware models at zero cost.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe Keysight ADS RF board design automation tool has significantly improved the efficiency of bill-of-materials simulation, bringing the work needed for an engineer to validate a typical RF board down from 14 days to 1.5 days. The approach and tooling are used by a major smartphone developer, and have been built so they can be leveraged to benefit more RF module and RFIC customers.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe amount of data generated in 2025 is estimated at 181 zettabytes (181,000,000,000,000,000,000,000 bytes). To accommodate this, data centers keep expanding, putting different servers of the same data center several miles away from each other. Optical fibers between servers become a necessity, and this is where Silicon Photonics comes into play. With only about 15 years of learning ("All-silicon active and passive guided-wave components for λ = 1.3 and 1.6 µm": https://ieeexplore.ieee.org/document/1073057), Silicon Photonics doesn't have as much legacy information as CMOS (~75 years: https://en.wikipedia.org/wiki/History_of_the_transistor). We can't afford to wait another 50 years, so how do we accelerate this learning pace?
To face this challenge, we will discuss strategies such as: anticipating design constraints based on FMEA analysis to accelerate the design timeline, design compaction to support higher packaging density, minimizing wafer scrap, and improving wafer yield.
This presentation will cover our research approach, the hurdles we encountered and how we handled them, as well as the current limits and our future steps.
FMEA: Failure Mode and Effect Analysis
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith the semiconductor industry's push to newer process nodes and shorter time to market, analog and custom IC layout creation is turning out to be the bottleneck as it has historically been a highly manual process. Since Analog IPs often stay the same across nodes, the ability to automatically recreate the designs can reduce costly iterations and help designs converge faster.
When design methodology requirements vary across process nodes, layout porting based on mapping objects and scaling sizes and coordinates fails to produce high-quality, design-rule-correct layout. Our innovative approach of auto-inferring design intent from the source layout and driving automated layout creation in the target node solves the layout migration challenge with upwards of a 2X boost in productivity.
The schematics on the target node are generated by mapping devices and parameters from the source schematic and optimizing them for the target node using customizable machine learning (ML)-based engines. Schematic-driven layout generates node- and design-specific grids to ensure DRC-correct placement and routing, while the migration functionality seeds the target layout with relative placement information from the source layout, including device groups captured as scalable templates that take updated parameters and instance counts into account. An incremental placer legalizes the placement, followed by guard-ring and fill-cell generation specific to the target process node. In the last step, routing topology information from the source layout is used to generate routing in the target layout, helping meet electrical and parasitic requirements through a combination of automation and migration. The final LVS- and DRC-clean layout on the target node is generated in significantly less time than manual creation, boosted by the reuse of the existing layout footprint and patterns.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionBalancing accuracy and hardware efficiency remains a challenge for traditional pruning methods. N:M sparsity is a recent approach offering a compromise, allowing up to N non-zero weights in each group of M consecutive weights.
However, N:M pruning enforces a uniform sparsity level of N/M across all layers, which does not align well with the sparse nature of deep neural networks (DNNs). To achieve a more flexible sparsity pattern and a higher overall sparsity level, we present JointNF, a novel joint N:M and structured pruning algorithm that enables fine-grained structured pruning with adaptive sparsity levels across DNN layers. Moreover, we show for the first time that N:M pruning can also be applied to the input activations for further performance enhancement.
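As background, plain N:M pruning can be sketched in a few lines: keep the N largest-magnitude weights in each group of M consecutive weights and zero the rest. This illustrates the uniform N/M sparsity the paper improves upon; it is not the JointNF algorithm, and the function name is our own.

```python
def nm_prune(weights, n, m):
    """Keep at most n largest-magnitude weights in each group of m
    consecutive weights, zeroing the rest (plain N:M sparsity)."""
    pruned = list(weights)
    for start in range(0, len(pruned), m):
        group = pruned[start:start + m]
        # indices of the n entries with largest magnitude in this group
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]),
                      reverse=True)[:n]
        for i in range(len(group)):
            if i not in keep:
                pruned[start + i] = 0.0   # zero out the pruned weight
    return pruned

# 2:4 sparsity: two survivors per group of four consecutive weights
w = [0.9, -0.1, 0.4, -0.7, 0.2, 0.05, -0.3, 0.6]
print(nm_prune(w, 2, 4))  # → [0.9, 0.0, 0.0, -0.7, 0.0, 0.0, -0.3, 0.6]
```

Every group of four keeps exactly its two largest-magnitude entries, which is what makes the pattern hardware-friendly: the per-group non-zero count is fixed and known in advance.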
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionDesign-Technology Co-Optimization (DTCO) can be significantly accelerated by employing Neural Compact Models (NCMs). However, the effective deployment of NCMs requires a substantial amount of training data for accurate device modeling. This paper introduces an Active Learning (AL) framework designed to enhance the efficiency of both device modeling and process optimization, particularly addressing the challenges of time-intensive Technology Computer-Aided Design (TCAD) simulations. The framework employs a ranking algorithm that assesses metrics such as the expected variance from the neural tangent kernel (NTK), TCAD simulation time, and the complexity of I-V curves. This strategy considerably reduces the number of required simulations while maintaining high accuracy. Demonstrating the effectiveness of our AL framework, we achieved a 28.5% improvement in MSE within a 30-minute time budget for device modeling, and an 86.7% reduction in the data points required for process optimization of a 51-stage ring oscillator (RO). These results offer a streamlined, adaptable solution for rapid device modeling and process optimization in various DTCO applications.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAs a small step towards a general-purpose CIM paradigm, we propose in this paper a heterogeneous-workload-centric compute-in-memory (HWCCIM) architecture. In particular, we present a design that compiles essential algorithmic operations into an address table for in-memory computing circuits. Leveraging a reconfigurable address generation unit to guide data movement within different in-memory-computing-based operator arrays, the architecture completes calculations and produces the corresponding results. We further illustrate the construction of the HWCCIM architecture in a behavioral-level circuit model and evaluate it using two classical algorithms, the Fast Fourier Transform (FFT) and the Multilayer Perceptron (MLP). Compared to conventional approaches, HWCCIM achieves a maximum latency acceleration of 1.5x and an average latency acceleration of 1.3x.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionTime-to-market is a crucial factor in today's competitive chip design landscape. Accurate timing and power analysis are essential for successful tapeout, demanding fast and precise Liberty characterization data (.libs). Traditional methods, heavily reliant on SPICE simulations, are often time-consuming and resource intensive. This presentation investigates the application of AI to revolutionize library characterization in two different chip design scenarios.
Scenario 1 leverages ML to analyze existing PVT data and build accurate models for timing, power, and noise across various Liberty formats (NLDM, CCS, CCSN, and LVF). This dramatically reduces characterization time for new PVT additions, offering up to a 100x runtime saving. Importantly, the generated .libs maintain high accuracy, with deviations from SPICE simulations within 5% for timing and within 10% for leakage power and internal power energy.
Scenario 2 optimizes the characterization flow by identifying a critical subset of .libs from existing libraries and generating the remaining .libs within a target accuracy range. This significantly reduces the need for recharacterization, saving over 50% of time and resources during SPICE model updates or minor design changes.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSparse LU factorization is the indispensable building block of circuit simulation and dominates simulation time, especially for large-scale circuits. RF circuits have been increasingly emphasized with the evolution of ubiquitous wireless communication (e.g., 5G and WiFi). RF simulation matrices show a distinctive pattern of structured dense blocks, a pattern that has been overlooked by prior works, leading to underutilization of computational resources. In this paper, by exploiting this block structure, we propose a novel blocked format for the L and U factors and re-design large-scale sparse LU factorization accordingly, leveraging the data locality inherent in RF matrices. The data format transformation is streamlined, strategically eliminating redundant data movement and costly indirect memory accesses. Moreover, vector operations are converted into matrix operations, enabling efficient data reuse and enhancing data-level parallelism. Experimental results show that our method achieves superior performance to the state-of-the-art implementation.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRange join-based variant annotation is an essential stage in genomic big data analysis, often requiring complex conditional joins with databases spanning terabytes in size. However, its performance on multi-threaded CPUs/GPUs has been bottlenecked by both memory-access bandwidth and instruction/data dependencies. Furthermore, the massive data accesses involved in range joins for variant annotation drastically affect energy efficiency and pose serious challenges to commercial adoption of fast-evolving genomic big data analysis. In this work, we present an efficient hardware-software co-design for range join-based variant annotation on clusters of HBM-enabled FPGAs. Our highly scalable in-memory processing system achieves up to 1.98x/6.51x/38.1x speedup/energy improvements/memory access reductions compared to a state-of-the-art CPU solution, while being highly extensible to other big data applications of range join.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionRegular path queries (RPQs) in graph databases are bottlenecked by the memory wall. Emerging processing-in-memory (PIM) technologies offer a promising solution to dispatch and execute path matching tasks in parallel within PIM modules. We present Moctopus, a PIM-based data management system for graph databases that supports efficient batch RPQs and graph updates. Moctopus employs a PIM-friendly dynamic graph partitioning algorithm, which tackles graph skewness and preserves graph locality with low overhead for RPQ processing. Moctopus enables efficient graph updates by amortizing the host CPU's update overhead to PIM modules. Evaluation of Moctopus demonstrates superiority over the state-of-the-art traditional graph database.
IP
Engineering Tracks
IP
DescriptionWith the number of computing and peripheral building blocks in modern System-on-Chip (SoC) designs rapidly rising into the hundreds, the interconnect between these blocks can become the long pole for timing analysis and a significant contributor to power consumption. Networks-on-Chip (NoCs) have emerged as the critical solution for on-chip communication; protocol complexity has risen rapidly for both coherent and non-coherent designs, and flows for automated RTL generation of configurable NoC IP from high-level topology descriptions have emerged.
With the transport delay increasingly dominated by RC wiring delay, changes in the NoC topology caused by difficulties in timing closure during the Place and Route (P&R) phase can add significant project delays.
This presentation will outline a flow and methodology that uses early, abstracted technology information to efficiently guide NoC development: .lef/.def-based import of floorplan information informs NoC-topology development, and exported constraint and placement information guides standard digital implementation flows, avoiding late surprises in timing closure.
Research Manuscript
Embedded Systems
Embedded System Design Tools and Methodologies
DescriptionSimulink has been widely used in embedded software development, supporting simulation to validate the correctness of the constructed models. However, as the scale and complexity of models in industrial applications grow, it is time-consuming for Simulink's simulation engine to achieve high coverage and detect potential errors, especially accumulative errors.
In this paper, we propose AccMoS, an accelerated model simulation method for Simulink models via code generation. AccMoS generates simulation functionality code for Simulink models through simulation-oriented instrumentation, including runtime actor information collection, coverage collection, and calculation diagnosis. The final simulation code is constructed by composing all the instrumentation code with actor code generated from a predefined template library and integrating test data import. After compiling and executing the code, AccMoS produces simulation results that include coverage and diagnostic information. We implemented AccMoS and evaluated it on several benchmark Simulink models. Compared to Simulink's simulation engine, AccMoS shows a 215.3× improvement in simulation efficiency, significantly reducing the time required to detect errors. AccMoS also achieves greater coverage within equivalent time.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAutomotive-grade multiprocessor System-on-Chips (SoCs) operating in advanced FinFET nodes demand unparalleled reliability and quality. Ensuring power integrity signoff for these SoCs is crucial, necessitating extensive coverage of local switching noise for EMIR analyses. Conventional vectorless EMIR and Gate-VCD based methods are increasingly inadequate in identifying critical noise conditions affecting timing. This study introduces a novel aggressor-based EMIR analysis using SigmaDVD, delivering exceptional local noise coverage for robust power integrity sign-off. Comparative analyses of conventional vectorless EMIR and Gate-VCD EMIR against SigmaDVD on two automotive SoCs reveal significantly heightened local noise coverage with SigmaDVD. This innovative approach provides a foundation for confident power integrity signoff on automotive SoCs, addressing the stringent requirements of extreme reliability in advanced FinFET nodes.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLarge neural networks, especially transformer-based models, present two critical challenges that exacerbate the memory wall issue in AI accelerator designs. First, the increased dynamic range of the weights requires higher-precision quantization formats, leading to higher memory capacity requirements. Second, the exponential growth in model parameters incurs more data movement, leading to increased latency and power consumption. In this study, we propose two novel approaches to address these problems. First, based on Posit, we introduce a new format called adaptive Posit (AdaP), which dynamically extends the dynamic range of its representation at run time with minimal hardware overhead. AdaP, utilizing two exponent encoding schemes, accommodates the data distribution with lower quantization error compared to regular Posit. Second, we propose a compute-in-memory (CIM) architecture to implement AdaP multiply-and-accumulate (MAC) computation and reduce weight data movement. Traditional CIM designs proposed for floating-point-like MAC computation use a comparator tree (CT) to compute the maximum exponent, enabling the CIM to focus on integer MAC. However, the CT-based design scales poorly as the number of inputs increases. To address this, we propose a speculative input alignment design that significantly reduces the delay, area, and power consumption of the max-exponent computation. Software evaluations show that 8-bit AdaP incurs a negligible 0.25% F1-score reduction on the XLM language identification benchmark compared to the full-precision baseline. Hardware synthesis and simulation results further illustrate that our approach achieves a 55% energy-efficiency and 2.4x area-efficiency improvement compared to the state-of-the-art Posit processing element.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn thermal analysis of a chiplet system, conventional numerical methods or machine learning-based surrogate models face tremendous challenges in computation cost and accuracy, especially in the presence of process and material variations. We propose Graph Neural Networks (GNNs) as a mathematical framework for efficient and robust thermal analysis with composite materials. By modeling each region and their thermal interactions as a graph, we continually adapt the GNN model under thermal interface variations. We validate our approach with numerical solutions and real thermal images from a crossbar unit, and demonstrate its speedup and accuracy in a 2.5D chiplet system.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe pervasive integration of deep neural networks (DNNs) within smart devices has significantly increased computational workloads, consequently intensifying pressure on real-time performance and device power consumption. Offloading segments of DNNs to the edge has emerged as an effective strategy for reducing latency and device power usage. Nonetheless, determining the workload to offload presents a complex challenge, particularly in the face of fluctuating device workloads and varying wireless signal strengths. This paper introduces a streamlined approach aimed at swiftly and accurately forecasting the computing latency of a DNN. Building upon this, an adaptive neurosurgeon framework is proposed to dynamically select the optimal partition point of a DNN during runtime, effectively minimizing computing latency. Through experimental validation, our proposed adaptive neurosurgeon demonstrates superior performance in reducing computing latency amidst changing DNN workloads across devices and varying wireless communication capabilities, outperforming existing state-of-the-art approaches, such as the autodidactic neurosurgeon.
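The partition-point search at the heart of a neurosurgeon-style framework can be sketched in a few lines: pick the split that minimizes on-device compute, uplink transfer, and edge compute. Everything below (layer latencies, activation sizes, link bandwidth) is hypothetical illustration, not data from the paper:

```python
def best_split(dev_ms, edge_ms, act_kb, bw_kb_per_ms):
    """Choose the DNN partition point: layers [0, k) run on-device,
    the activation of size act_kb[k] is uplinked, and layers [k, n)
    run on the edge. Returns (k, end-to-end latency in ms)."""
    n = len(dev_ms)
    best_k, best_t = 0, float("inf")
    for k in range(n + 1):  # k = 0: all on edge; k = n: all on device
        t = sum(dev_ms[:k]) + act_kb[k] / bw_kb_per_ms + sum(edge_ms[k:])
        if t < best_t:
            best_k, best_t = k, t
    return best_k, best_t
```

An adaptive framework would re-run this cheap search whenever the forecast device latencies or the measured bandwidth change, which is why fast latency prediction matters.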
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionAlthough Federated Learning (FL) is promising for enabling collaborative learning among Artificial Intelligence of Things (AIoT) devices, it suffers from low classification performance due to various heterogeneity factors (e.g., computing capacity, memory size) of devices and uncertain operating environments. To address these issues, this paper introduces an effective FL approach named AdaptiveFL based on a novel fine-grained width-wise model pruning strategy, which can generate various heterogeneous local models for heterogeneous AIoT devices. Using our proposed reinforcement learning-based device selection mechanism, AdaptiveFL can adaptively dispatch suitable heterogeneous models to corresponding AIoT devices on the fly based on their available resources for local training. Experimental results show that, compared to state-of-the-art methods, AdaptiveFL achieves up to 16.83% inference improvements for both IID and non-IID scenarios.
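A width-wise pruning step of the kind described might look like the following sketch, which shrinks each hidden layer to a fraction of its units while keeping the input and output dimensions intact (the layer format and keep-first-channels policy are illustrative assumptions, not AdaptiveFL's actual mechanism):

```python
import math

def width_prune(layers, ratio):
    """layers: list of (W, b) with W stored as out x in nested lists.
    Keep the first ceil(ratio * width) units of every hidden layer;
    the final layer keeps all output units but its inputs are sliced
    to match the previous (pruned) layer's width."""
    pruned, prev_keep, n = [], None, len(layers)
    for idx, (W, b) in enumerate(layers):
        out_dim = len(W)
        keep = out_dim if idx == n - 1 else max(1, math.ceil(ratio * out_dim))
        Wp = [(W[o][:prev_keep] if prev_keep is not None else list(W[o]))
              for o in range(keep)]
        pruned.append((Wp, b[:keep]))
        prev_keep = keep
    return pruned
```

Varying `ratio` per device yields a family of nested sub-models of the same network, so each heterogeneous device can train the largest model its resources allow.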
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
DescriptionEmerging proposals, such as AdderNet, exploit efficient arithmetic alternatives to the Multiply-Accumulate (MAC) operations in convolutional neural networks (CNNs). AdderNet adopts an ℓ1-norm based feature extraction kernel, which shows nearly identical model accuracy compared to its CNN counterparts and can achieve considerable hardware savings due to simpler Sum-of-Absolute-Difference (SAD) operations. Nevertheless, existing AdderNet-based accelerator designs still face critical implementation challenges, such as inefficient model quantization, excessive feature memory overheads, and sub-optimal resource utilization. This paper presents AdderNet 2.0, an optimized AdderNet-based accelerator architecture with a novel Activation-Oriented Quantization (AOQ) strategy, a Fused Bias Removal (FBR) scheme for on-chip feature memory bitwidth reduction, and an improved PE design for better resource utilization. The proposed AdderNet 2.0 accelerator designs were implemented on a Xilinx Kria KV-260 FPGA. Experimental results show that the INT6 accelerator design achieves up to a 3.78× DSP density improvement, along with 24% LUT, 40% FF, and 2.1× BRAM savings compared to the baseline CNN design.
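The ℓ1-norm (SAD) kernel that replaces the convolutional dot product can be illustrated as follows (a toy 1-D sketch of the published AdderNet kernel, not the accelerator's implementation):

```python
def sad_response(patch, kernel):
    """AdderNet-style l1 feature response: the negative sum of absolute
    differences replaces the dot product -- no multiplier needed."""
    return -sum(abs(x - w) for x, w in zip(patch, kernel))

def sad_conv1d(signal, kernel):
    """Slide the SAD kernel over a 1-D signal."""
    k = len(kernel)
    return [sad_response(signal[i:i + k], kernel)
            for i in range(len(signal) - k + 1)]
```

A perfect patch/kernel match scores 0 and scores grow more negative with mismatch; only subtraction, absolute value, and addition are needed, which is the source of the hardware savings over multiply-based convolution.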
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe compute-in-memory (CIM) paradigm holds great promise to efficiently accelerate machine learning workloads. Among memory devices, static random-access memory (SRAM) stands out as a practical choice due to its exceptional reliability in the digital domain and balanced performance. Recently, there has been growing interest in accelerating floating-point (FP) deep neural networks (DNNs) with SRAM CIM due to their critical importance in DNN training and highly accurate inference. This paper proposes an efficient SRAM CIM macro for FP DNNs. To achieve the design, we identify a lightweight approach that decomposes conventional FP mantissa multiplication into two parts: mantissa sub-addition (sub-ADD) and mantissa sub-multiplication (sub-MUL). Our study shows that while mantissa sub-MUL is compute-intensive, it contributes to only a minority of FP products, whereas mantissa sub-ADD, although compute-light, accounts for the majority of FP products. Recognizing that "Addition is Most You Need", we develop a hybrid-domain SRAM CIM macro that accurately handles mantissa sub-ADD in the digital domain while improving the energy efficiency of mantissa sub-MUL using analog computing. Experiments with the MLPerf benchmark demonstrate a remarkable improvement in energy efficiency of 8.7×∼9.3× (7.3×∼8.2×) in inference (training) compared to a fully digital FP baseline without any accuracy loss, showcasing its great potential for FP DNN acceleration.
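The mantissa decomposition the abstract describes rests on a simple identity. Writing each normalized mantissa as 1 + f with f in [0, 1), the product splits into an additive part and a small cross term; the sketch below illustrates the stated idea, not the macro's circuit:

```python
def fp_mul_decomposed(ma, mb):
    """Mantissas in [1, 2): m = 1 + f.  The product decomposes as
      (1 + fa) * (1 + fb) = 1 + (fa + fb) + fa * fb
    where (fa + fb) is the compute-light sub-ADD term and fa * fb is
    the compute-heavy sub-MUL cross term.  In the described macro the
    sub-ADD part is handled exactly in the digital domain while the
    sub-MUL part is evaluated with energy-efficient analog computing."""
    fa, fb = ma - 1.0, mb - 1.0
    sub_add = fa + fb
    sub_mul = fa * fb
    return 1.0 + sub_add + sub_mul
```

Because fa and fb are both below 1, the cross term fa·fb is always smaller than either factor, which is what makes relegating it to a lower-precision analog path plausible.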
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionDNN accelerators, significantly advanced by model compression and specialized dataflow techniques, have marked considerable progress. However, frequent access to high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing weight/input-stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, a gap recently explored in compute-in-memory research; moreover, those approaches mainly target reducing the Analog-to-Digital Converter (ADC) overhead, neglecting the critical issue of intensive memory access. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. We further propose a grouping strategy that combines APSQ with PSQ, enhanced by a floating-point regularization technique, to boost accuracy. The experiments indicate that APSQ can efficiently compress PSUMs to INT-8 with negligible accuracy degradation for Segformer-B0 and EfficientViT-B0 on the challenging Cityscapes dataset, leading to a notable 30~45% reduction in energy costs.
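One way to picture folding accumulation into quantization is a toy scheme in which the running partial sum is re-quantized after every addition, so the value stored in memory is always a narrow integer. This is an illustrative reading of the problem setup, not the paper's actual APSQ formulation:

```python
def quantize(x, scale, bits=8):
    """Symmetric uniform quantizer to a signed `bits`-bit integer."""
    qmax = 2 ** (bits - 1) - 1
    q = round(x / scale)
    return max(-qmax - 1, min(qmax, q))

def accumulate_quantized(psums, scale, bits=8):
    """Keep the running partial sum in INT-8: after each addition the
    accumulator is re-quantized, so only a narrow value is ever stored
    or moved, instead of a wide high-precision PSUM."""
    acc = 0  # stored accumulator is always a narrow integer code
    for p in psums:
        acc = quantize(acc * scale + p, scale, bits)
    return acc * scale
```

The memory saving comes from the storage format of the accumulator (8-bit codes instead of wide fixed-point words); choosing the scale well is what keeps the accumulated rounding error small.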
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAI algorithms are increasingly diverse, ranging from dense to sparse and from regular to irregular. To efficiently manage such diversity in hardware, we propose a programmable heterogeneous accelerator that dynamically balances the computation requirements across different design levels. It comprises two types of processing elements (PEs) customized for dense (e.g., DNNs) and sparse (e.g., graphs) workloads, respectively. These PEs are integrated into a programmable architecture, enabling support for various memory access and computation patterns. Based on 16nm design data, the new accelerator achieves an 11x improvement in latency compared to state-of-the-art homogeneous accelerators.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRecent decades have seen extensive research on Analog Design Automation. The most recent approaches are based on Reinforcement Learning (RL) rather than heuristic optimizers such as ant colony, particle swarm, or differential evolution algorithms. This paper describes a new learning strategy enhancing the recent Proximal Policy Optimization (PPO) RL approach, applied to analog design. The solution is compared to the more classical heuristic methods mentioned above, using an electrical-simulator-based environment under equivalent computation conditions. The paper highlights convergence properties and demonstrates the ability of RL to avoid local-minimum traps.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionMultiple Input Switching (MIS) effects commonly induce undesired glitch pulses at the output of CMOS gates, potentially leading to circuit malfunction and significant power consumption. Thus, accurate and efficient glitch modeling is crucial for the design of high-performance, low-power, and reliable ICs. In this work, we present a new gate-level approach for modeling glitch effects under MIS. Unlike previous studies, we leverage efficient Machine Learning (ML) techniques to accurately estimate the glitch shape characteristics, propagation delay, and power consumption. To this end, we evaluate various ML engines and explore different Artificial Neural Network (ANN) architectures. Moreover, we introduce a seamless workflow to integrate our ANNs into existing standard cell libraries, striking an optimal balance between model size and accuracy in gate-level glitch modeling. Experimental evaluation on gates implemented in 7 nm FinFET technology demonstrates that the proposed models achieve an average error of 2.19% against SPICE simulation while maintaining a minimal memory footprint.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThis presentation covers how to account for LLE impact in the timing signoff flow. At advanced nodes, LLE impact has grown compared to earlier nodes, so it has become an essential item to consider.
Because existing methods could not account for this LLE impact in the timing signoff flow, we introduce an advanced timing signoff methodology that fully considers it.
LLE impact can be calculated from the vth and u0 parameters. Using a library characterized with the sensitivities of these parameters, the shift in these parameters induced by whichever cell is placed adjacent is measured and reflected in the delay.
Additionally, the verification method and the design gains that can be obtained from this methodology are also described.
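The described sensitivity-based adjustment amounts to a first-order delay model: arc delay = nominal delay + (d(delay)/d(vth))·Δvth + (d(delay)/d(u0))·Δu0, with the parameter shifts looked up per neighboring cell. The sketch below uses made-up cell names, sensitivities, and shift values purely for illustration:

```python
def lle_delay(d_nom_ps, s_vth, s_u0, neighbor_shifts):
    """First-order LLE delay model (a sketch of the described flow, not
    a vendor implementation).  The library provides the nominal delay
    and its sensitivities to vth and u0; `neighbor_shifts` maps a
    neighboring cell name (hypothetical names here) to the
    (delta_vth, delta_u0) it induces on this cell."""
    def arc_delay(neighbor):
        dvth, du0 = neighbor_shifts[neighbor]
        return d_nom_ps + s_vth * dvth + s_u0 * du0
    return arc_delay

# Illustrative numbers only: 10 ps nominal, 50 ps/V vth sensitivity,
# -20 ps per unit u0 shift.
arc = lle_delay(10.0, 50.0, -20.0,
                {"FILLER": (0.0, 0.0), "INVX8": (0.02, -0.01)})
```

A signoff tool following this scheme would evaluate `arc` per instance from the actual placement, so the same cell gets different delays depending on its neighbors.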
Research Manuscript
EDA
Design Verification and Validation
DescriptionGiven the increasing complexity of integrated circuits, the utilization of machine learning in simulation-based hardware design verification (DV) has become crucial to ensure comprehensive coverage of hard-to-hit states. Our paper proposes a deep deterministic policy gradient (DDPG) algorithm combined with prioritized experience replay (PER) to determine the stimulus settings that result in the highest average FIFO depth in a modified exclusive shared invalid (MESI) cache controller architecture. This architecture includes four FIFOs, each corresponding to a distinct CPU.
Through extensive experimentation, DDPG coupled with PER (DDPG-PER) proves to be more effective than DDPG with uniform experience replay in enhancing average FIFO depth and coverage within the DV process. Furthermore, our proposed DDPG-PER framework significantly increases the occurrence of higher FIFO depths, thereby addressing the challenges associated with reaching hard-to-hit states in DV. The proposed DDPG-PER and DDPG algorithms also demonstrate a larger average FIFO depth over four CPUs, requiring considerably less execution time than Bayesian Optimization (BO).
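A minimal proportional prioritized-experience-replay buffer, the PER component paired with DDPG above, can be sketched as follows (hyperparameters and structure are illustrative; the paper's implementation may differ):

```python
import random

class PrioritizedReplay:
    """Minimal proportional PER sketch: transitions are sampled with
    probability proportional to p_i^alpha, where p_i tracks the
    magnitude of the transition's TD error, so surprising stimuli
    (e.g., those driving FIFO depth higher) are replayed more often."""

    def __init__(self, alpha=0.6, eps=1e-3):
        self.alpha, self.eps = alpha, eps
        self.data, self.prio = [], []

    def add(self, transition, td_error=1.0):
        self.data.append(transition)
        self.prio.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, k):
        idx = random.choices(range(len(self.data)), weights=self.prio, k=k)
        return idx, [self.data[i] for i in idx]

    def update(self, idx, td_errors):
        # Refresh priorities after the critic re-evaluates the samples.
        for i, e in zip(idx, td_errors):
            self.prio[i] = (abs(e) + self.eps) ** self.alpha
```

Uniform replay corresponds to constant priorities; the reported gain of DDPG-PER over plain DDPG comes from biasing replay toward high-error transitions while refreshing priorities as training progresses.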
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs we move towards lower technology nodes, the challenges in design implementation intensify, and enhancing design methodologies and algorithms becomes crucial. By encouraging the integration of different stages, we can significantly improve the implementation process. We present an innovative methodology for implementing a source-synchronous design by integrating an extra stage into our conventional APR flow, strategically situated between the floorplan and placement stages. Our solution utilizes a source-synchronous design topology, [SSD Flow], comprising two distinct stages: first, we traverse the critical signal nets; then we execute tailored clock routing that adheres to specified rules and constraints. This approach systematically navigates timing intricacies while proactively mitigating crosstalk and noise issues, ultimately optimizing the design. The main objective is to devise a methodology that simplifies the implementation process and achieves enhanced Quality of Results (QoR). Our proposed methodology has significantly streamlined the design implementation process, reducing Turnaround Time (TAT) by 2 weeks. From the implementation perspective, it has delivered noteworthy outcomes, including a 48% decrease in latency, a 59.20% reduction in data path delay, a 39.6% improvement in dynamic power, a 50% reduction in data path depth, and a 55.5% decrease in clock path depth.
DAC Pavilion Panel
Security
DescriptionSemiconductor security is increasingly crucial due to the growing number of chip vulnerabilities and the initiatives regulating cybersecurity assurance for electronic products and systems. Various industry and regulatory bodies have implemented standards and regulations to address cybersecurity concerns across both software and hardware, such as the ISO/SAE 21434 cybersecurity standard for automotive and the recently released European Union (EU) Cyber Resilience Act. This panel of industry experts will delve into the current state of cybersecurity assurance for semiconductor chips and how emerging security standards and the growing threat landscape will continue to accelerate the need for more rigorous cybersecurity measures across all sectors.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn the current, dynamically changing landscape of computing, the growth of artificial intelligence (AI) applications has caused an exponential increase in energy consumption, re-emphasizing the need to manage the power footprint in chip design. To manage this escalating energy footprint and enable true system-level low-power design, modeling standards play a key role in facilitating interoperability and re-use. The IEEE 2416 system-level power modeling standard, introduced in 2019, offers a unified framework spanning system level to detailed design, facilitating comprehensive low-power design for entire systems. The standard also enables efficiency through contributor-based Process, Voltage, and Temperature (PVT) independent power modeling.
The IEEE 2416 standard is currently undergoing several extensions slated for release in 2024. Noteworthy among these extensions is the comprehensive modeling of multiple voltage blocks and precise representations of analog and mixed-signal blocks. We present these upcoming extensions for the first time, highlighting their potential value through a complete system example with processor cores, accelerators, analog and mixed-signal IP.
This presentation offers insights into the practical implementation of forthcoming extensions with examples. We believe that sharing these advancements, coupled with real-world examples, will help the audience gain valuable early details in using the standard for designing low power systems.
IP
Engineering Tracks
IP
DescriptionContinuous Time Delta Sigma Modulators (CTDSMs) are a critical part of various RF receiver chains. These ADCs must accommodate wider signal bandwidths with high dynamic range, which requires higher sampling rates and leads to increased power consumption, making successful power and signal integrity sign-off a challenging task.
In EMIR analysis, a circuit is simulated together with the parasitic resistor and capacitor network, which models the IR drop and Electromigration (EM) effects for both power and signal nets. Advanced-node designs have more complex EM rules, and with the exponential increase in parasitics (RCs) for such designs, EM simulation becomes more costly.
To address these challenges, we used the Virtuoso-ADE and SpectreX-EMIR solution, which handles high-capacity designs and provides exceptional performance. With this flow, a new two-stage iterated method of Spectre-X is used for EMIR analysis to achieve golden accuracy with a high performance gain.
In this paper, using this new two-stage iterated method of Spectre-X EMIR, we achieved accuracy close to that of the golden direct (single-stage) method, accelerating EMIR signoff analysis closure with a 2.5X performance gain. Seamless integration of the Voltus-Fi solution with the easy visualization and post-processing features of ADE provides a productivity gain of 30%.
Research Manuscript
AI
Security
AI/ML Security/Privacy
DescriptionThe paper introduces AdvHunter, a novel strategy to detect adversarial examples (AEs) in Deep Neural Networks (DNNs). AdvHunter operates effectively in practical black-box scenarios, where only hard-label query access is available, a situation often encountered with proprietary DNNs. This differentiates it from existing defenses, which usually rely on white-box access or need to be integrated during the training phase - requirements often not feasible with proprietary DNNs. AdvHunter functions by monitoring data flow dynamics within the computational environment during the inference phase of DNNs. It utilizes Hardware Performance Counters to monitor microarchitectural activities and employs principles of Gaussian Mixture Models to detect AEs. Extensive evaluation across various datasets, DNN architectures, and adversarial perturbations demonstrate the effectiveness of AdvHunter.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionProcessing-in-Memory (PIM) enables efficient computation of heavy workloads. Motivated by its capabilities, we investigate its potential for accelerating Fully Homomorphic Encryption (FHE), a domain known for its colossal computational demands. We present affinity-based optimizations that confront the challenges of optimizing FHE's extensive data processing within PIM's unique architectural constraints, focusing on the balance between parallelism and data affinity. Our novel scheduling methodology minimizes remote data access while reducing the penalty from lost parallelism. We evaluate our solution on an existing PIM-HBM system, achieving 4.55x-216.56x speedup/energy improvements/memory access reductions when computing real-world workloads over TFHE, compared to previous works.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionHeterogeneous systems-on-chips (SoCs) for real-time applications integrate CPUs and/or GPUs with accelerators to meet application deadlines under strict power/area constraints. The large design space of these systems necessitates efficient SoC-level design space exploration (DSE). Existing static approaches struggle to find SoCs that satisfy all constraints, rendering them unsuitable for real-time applications. We propose the use of dynamic scheduling techniques to significantly reduce the design space and navigate it efficiently. Our proposal outperforms existing methodologies with 5.3-12.8x faster DSE times for autonomous vehicle and augmented/virtual reality domains, yielding designs with 1.2-3x better throughput (iso-area) and up to 2.4x lower area (iso-throughput).
Keynote
Special Event
AI
Design
Description
Artificial intelligence is changing the world around us, but most of the focus has been on large models running on immense compute servers. There is a critical need for AI in edge applications to decrease latency and power consumption. Fulfilling this need requires new approaches to meet the constraints of future industrial, automotive, and consumer platforms at the intelligent edge.
Front-End Design
AI
Design
Engineering Tracks
Front-End Design
Description
Generative AI is everywhere, but it is still taking its first steps in chip design.
In this session, we'll invite representatives from the design community to review the challenges and present working solutions for using AI in front-end chip design, with an emphasis on sharing "how-to" ideas.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description
This presentation discusses how an AI-assisted design optimization methodology provides a verified optimal solution for two circuits: metal-option switches and charge pumps. By exploring the entire design space, up to 260,000 design combinations in this case, it results in a faster design cycle, improved capacity, and reduced CPU time.
Micron uses metal-layer switches in its circuits to adjust for changes. These switches need tuning multiple times during a product design cycle; they might also require adjustment post-tapeout if the process varies, causing poor circuit performance. Charge pumps, on the other hand, are widely used in memory design to convert a supply voltage to a higher or lower value.
The traditional tuning methods of the above-mentioned circuits involve an iterative manual process to explore as many of the design combinations as possible. This process is time-consuming and may lead to sub-optimal solutions.
This presentation covers the motivation behind the work, the methodology used, and the results obtained by the design team. We also discuss the algorithm behind the AI-powered solution that helped achieve these results.
IP
Engineering Tracks
IP
Description
The paper addresses the challenge of validating Process, Voltage, and Temperature (PVT) corners in semiconductor design, highlighting the increasing complexity of design technology and the impact of process variables and device interference. Recognizing the limitations of traditional brute-force methods and the impracticality of validating all PVT corners due to runtime constraints, the paper proposes an AI-based approach. The authors introduce a statistical verification method that combines a scaling method with an Artificial Intelligence (AI)-based, brute-force-accurate method.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description
We examine an advanced IMC NPU design through a detailed case study, illustrating the application of a novel design space exploration methodology. This method integrates adaptive body bias for PVT pruning with a Cerberus-guided routing and floorplanning design space exploration. The synergy of these techniques culminates in a substantial enhancement of compute density and energy efficiency in the IMC NPU. Our findings reveal a tenfold improvement in vital performance metrics when compared to conventional digital NPUs. This work not only underscores the viability of IMC NPU designs in high-efficiency applications but also exemplifies the use of AI/ML in refining hardware design processes to achieve unprecedented performance gains.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description
With the rise of generative AI applications, there is a growing demand for high-bandwidth memory in AI/GPU chips, and interposer designs such as UCIe for D2D and SoC-to-HBM interconnects are increasingly popular for chiplet interconnection. Interposer designs face unique challenges such as small trace width, high interconnect density, and the absence of a solid plane. These challenges make the traditional SI flow time-consuming and leave silicon-based material effects unaccounted for. An efficient and accurate pre-layout analysis flow is urgently needed.
This paper proposes an efficient interposer high-speed design simulation and optimization flow. This flow is driven by optiSLang, allowing for the configuration of design parameters and objectives. By leveraging various AI/ML algorithms, the solution space is explored to identify the optimal design. This flow operates as a closed-loop automatic iterative optimization process.
In summary, this paper presents an automated interposer pre-layout design simulation and optimization flow. The proposed flow enhances accuracy, speed, and realism compared to traditional manual approaches, and the validation results demonstrate its effectiveness and applicability.
Front-End Design
AI
Design
Engineering Tracks
Front-End Design
Description
This paper addresses the critical challenge in chip design scalability, where standard cells are replicated in the millions, resulting in designs with tens of billions of transistors. Traditional methods of constraining Process, Voltage, and Temperature (PVT) corners based on past experiences and conducting Monte Carlo simulations on worst-case scenarios prove unreliable. Incorrectly predicting worst-case PVT can lead to schedule delays and design robustness issues. The brute-force Monte Carlo methods for high sigma verification are both costly and impractical.
To overcome these challenges, we present an AI-powered automated methodology for detecting and verifying worst-case yield. Our single-pass PVT + variation high-sigma solution, exemplified by the Solido PVTMC Verifier, achieves the fastest runtime, while the brute-force accurate high-sigma solution, demonstrated by Solido High-Sigma Verifier, ensures the highest accuracy.
The results on latch-based D flip-flop circuits showcase the effectiveness of our approach. Solido High-Sigma Verifier verified bimodality failure occurrences with 4,000 simulations, delivering a staggering 2,500,000X faster runtime than brute-force methods. Furthermore, the yield for this cell at the target PVT was verified to 6.322 sigma, accompanied by a remarkable 30X runtime speedup compared to the previous methodology. This signifies not only improved performance but also better accuracy and coverage rates.
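For intuition on why the brute-force route is impractical at high sigma: a plain Monte Carlo estimator, sketched below for a standard-normal margin (purely illustrative, unrelated to the Solido tools), needs on the order of 1/p samples to observe even one failure, i.e. billions of simulations at 6 sigma:

```python
import random

def mc_fail_rate(sigma_target, n_samples, seed=1):
    """Brute-force Monte Carlo estimate of P(X > sigma_target) for a
    standard-normal margin X.  At 6 sigma the true rate is ~1e-9, so
    billions of samples are needed just to observe a single failure."""
    rng = random.Random(seed)
    fails = sum(rng.gauss(0.0, 1.0) > sigma_target for _ in range(n_samples))
    return fails / n_samples

# Feasible at 3 sigma (true rate ~1.35e-3); hopeless by brute force at 6.
print(mc_fail_rate(3.0, 100_000))
```

High-sigma methods avoid this cost by steering samples toward the failure region instead of sampling the nominal distribution blindly.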
Work-in-Progress Poster
AiDAC: A Low-Cost In-Memory Computing Architecture with All-Analog Multibit Compute and Interconnect
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description
Analog in-memory computing (AiMC) is an emerging technology that shows outstanding performance for neural network acceleration. However, as the computational bit-width and scale increase, high-precision data conversion and long-distance data routing result in unacceptable energy and latency overheads in the AiMC system. In this work, we focus on the potential of in-charge computing and in-time interconnection and present an innovative AiMC architecture, named AiDAC, with three key contributions: (1) AiDAC enhances multibit computing efficiency and reduces data conversion times through capacitor-grouping technology; (2) AiDAC is the first to adopt row drivers and column time accumulators to achieve large-scale AiMC array integration while minimizing the energy cost of data movement; (3) AiDAC is the first work to support large-scale all-analog multibit vector-matrix multiplication (VMM) operations. The evaluation shows that AiDAC maintains high-precision calculation (less than 0.79% total computing error) while also delivering excellent performance, including high parallelism (up to 26.2TOPS), low latency (<20ns/VMM), and high energy efficiency (123.8TOPS/W), for 8-bit VMM with 1024 input channels.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
Description
The emergence of diffusion models has gained significant attention in the field of Artificial Intelligence Generated Content. While diffusion models demonstrate impressive image generation capability, they face hardware deployment challenges due to their unique model architecture and computation requirements. In this paper, we present a hardware accelerator design, AIG-CIM, which incorporates tri-gear heterogeneous digital compute-in-memory to address the flexible data reuse demands of diffusion models. Our framework offers a collaborative design methodology for large generative models, from the computational circuit level to the multi-chip-module system level. We implemented and evaluated the AIG-CIM accelerator in TSMC 22nm technology. For several diffusion inference workloads, scalable AIG-CIM chiplets achieve 21.3× latency reduction, up to 231.2× throughput improvement, and three orders of magnitude energy efficiency improvement compared to the NVIDIA RTX 3090 GPU.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
Description
The use of cross-scheme fully homomorphic encryption (FHE) in privacy-preserving applications challenges hardware accelerator design. Existing accelerator architectures fail to efficiently handle hybrid FHE schemes due to the mismatch between computational demands and hardware resources. We propose Alchemist, a novel architecture built around a hardware-friendly, versatile low-level operator, the Meta-OP. Our slot-based data management efficiently handles the memory access patterns of the Meta-OP across diverse operations. Alchemist accelerates both arithmetic and logic FHE with high hardware utilization rates. Compared to existing ASIC accelerators, Alchemist delivers a 29.4× performance-per-area improvement for arithmetic FHE and a 7.0× overall speedup for logic FHE.
Research Manuscript
AI
AI/ML Algorithms
Description
Traditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training. In this study, we introduce Logarithmic Posits (LP), an adaptive, hardware-friendly data type inspired by posits that dynamically adapts to DNN weight/activation distributions by parameterizing LP bit fields. We also develop a novel genetic-algorithm-based framework, LP Quantization (LPQ), to find optimal layer-wise LP parameters while reducing representational divergence between quantized and full-precision models through a novel global-local contrastive objective. Additionally, we design a unified mixed-precision LP accelerator (LPA) architecture comprising processing elements (PEs) that incorporate LP in the computational datapath. Our algorithm-hardware co-design demonstrates on average a <1% drop in top-1 accuracy across various CNN and ViT models, and achieves ~2x improvements in performance per unit area and 2.2x gains in energy efficiency compared to state-of-the-art quantization accelerators using different data types.
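The LP format itself parameterizes posit-style bit fields. As a much-simplified illustration of the log-domain idea only (quantizing magnitudes to powers of two within a clipped exponent range, not the actual LP encoding), consider:

```python
import math

def log_quantize(x, max_exp=0, min_exp=-7):
    """Quantize |x| to the nearest power of two within a clipped exponent
    range, keeping the sign.  A toy log-domain quantizer only: the actual
    LP data type also parameterizes posit-style regime/exponent fields."""
    if x == 0.0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    e = round(math.log2(abs(x)))          # log-domain rounding
    e = max(min_exp, min(max_exp, e))     # clip to the representable range
    return sign * (2.0 ** e)

print(log_quantize(0.3))    # 0.25
print(log_quantize(-0.1))   # -0.125
```

Log-domain formats concentrate representable values near zero, which is where most trained DNN weights cluster; LP additionally tunes its bit fields per layer.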
Research Manuscript
Embedded Systems
Embedded Software
Description
Regular Expression (RE) matching enables the identification of patterns in data streams across heterogeneous fields ranging from proteomics to computer security. These scenarios require massive data analysis that, combined with the high data dependency of REs, leads to long computation times and high energy consumption. Current RE engines offer either (1) flexibility in run-time RE changes and broad operator support at the cost of performance, or (2) fixed high-performance accelerators implementing only a few simple RE operators. To overcome these limitations, we propose ALVEARE: a hardware-software approach combining a Domain-Specific Language (DSL) with an embedded Domain-Specific Architecture. We exploit REs as a DSL by translating them into flexible executables through our RISC-based Instruction Set Architecture, which expresses everything from simple to advanced primitives. We then design a speculation-based microarchitecture to execute real benchmarks efficiently.
ALVEARE provides RE-domain flexibility and broad operator support, and achieves up to 34x speedup and 57x energy efficiency improvements over the state-of-the-art RE2 and the BlueField-2 DPU with its RE accelerator.
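ALVEARE's ISA is not described in the abstract, but the general technique of compiling REs into instructions for a matching VM can be sketched for a toy dialect with literals, '.', and postfix '*' (the opcodes here are hypothetical and far simpler than the paper's primitives):

```python
def compile_re(pattern):
    """Compile a tiny RE dialect (literals, '.', postfix '*') into a linear
    program for a small matching VM.  The opcodes are a toy analogue of
    translating REs into RISC-like instructions, not ALVEARE's actual ISA."""
    prog = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        atom = ("any",) if c == "." else ("char", c)
        if i + 1 < len(pattern) and pattern[i + 1] == "*":
            start = len(prog)
            prog.append(("split", start + 1, start + 3))  # take atom or skip
            prog.append(atom)
            prog.append(("jmp", start))
            i += 2
        else:
            prog.append(atom)
            i += 1
    prog.append(("match",))
    return prog

def run(prog, text):
    """Thompson simulation: advance all VM threads over text in lockstep."""
    def add(threads, pc):
        op = prog[pc]
        if op[0] == "jmp":
            add(threads, op[1])
        elif op[0] == "split":
            add(threads, op[1])
            add(threads, op[2])
        else:
            threads.add(pc)
    threads = set()
    add(threads, 0)
    for ch in text:
        nxt = set()
        for pc in threads:
            op = prog[pc]
            if (op[0] == "char" and op[1] == ch) or op[0] == "any":
                add(nxt, pc + 1)
        threads = nxt
    return any(prog[pc][0] == "match" for pc in threads)

print(run(compile_re("ab*c"), "abbbc"))  # True
print(run(compile_re("a.c"), "abd"))     # False
```

Compiling to a linear instruction stream is what lets run-time RE changes be handled by reloading a program rather than resynthesizing hardware.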
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description
Researchers and industries are increasingly drawn to quantum computing solutions, attracted by their potential computational advantages over classical systems. However, validating new quantum algorithms faces challenges due to limited qubit availability and noise in current quantum devices. Software simulators offer a solution but are time-consuming. Hardware emulators are emerging as an attractive alternative.
This article introduces AMARETTO (quAntuM ARchitecture EmulaTion TechnOlogy), an architecture designed for quantum computing emulation on low-tier Field Programmable Gate Arrays (FPGAs) supporting Clifford+T and rotational gate sets. AMARETTO accelerates and simplifies the functional verification of quantum algorithms using a Reduced-Instruction-Set-Computer (RISC)-like structure and efficient handling of sparse quantum gates. A dedicated compiler translates OpenQASM 2.0 into RISC-like instructions. Our results, validated against the Qiskit state vector simulator, demonstrate successful emulation of 16 qubits on a Xilinx Kria KV260 System on Module (SoM). This approach rivals other works in the literature, offering similar emulated qubit capacity on a smaller, more accessible FPGA.
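As a flavor of what the emulated datapath computes: applying a single-qubit gate touches only pairs of amplitudes whose indices differ in the target-qubit bit, the kind of sparsity such emulators exploit. A software sketch (not AMARETTO's hardware; real-valued amplitudes for brevity):

```python
import math

def apply_1q_gate(state, gate, target):
    """Apply a 2x2 gate to the target qubit of a state vector, touching
    only amplitude pairs whose indices differ in the target bit."""
    stride = 1 << target
    out = state[:]
    for i in range(len(state)):
        if i & stride == 0:
            a, b = state[i], state[i | stride]
            out[i] = gate[0][0] * a + gate[0][1] * b
            out[i | stride] = gate[1][0] * a + gate[1][1] * b
    return out

s = 1.0 / math.sqrt(2.0)
H = [[s, s], [s, -s]]                              # Hadamard (a Clifford gate)
state = apply_1q_gate([1.0, 0.0, 0.0, 0.0], H, 0)  # H on qubit 0 of |00>
print([round(x, 3) for x in state])                # [0.707, 0.707, 0.0, 0.0]
```

A full emulator streams these pair updates through fixed-point datapaths and complex amplitudes; the memory footprint (2^n amplitudes) is what bounds the emulated qubit count on a given FPGA.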
IP
Engineering Tracks
IP
Description
A new time-skew mismatch correction IP with the lowest known convergence time has been developed for a TI-ADC (Time-Interleaved Analog-to-Digital Converter) in a communications receiver system. The proposed design greatly relieves the communication link budget by reducing time-skew estimation and correction time by at least two orders of magnitude. The proposed non-iterative calibration technique is purely deterministic, uses contemporary signal processing blocks, and is not based on any correlational or statistical approaches. Numerical simulation results demonstrate a significant improvement in TI-ADC performance with the proposed calibration method. In next-generation 5G/6G, radar, and space communication domains, the low latency of the proposed TI-ADC will enable applications where fast response times are needed. As the correction converges at a very fast rate, time-skew changes due to rapid temperature changes will be tracked and compensated.
IP
Engineering Tracks
IP
Description
An on-chip all-digital transient filter IP is proposed as a replacement for an off-chip RC circuit for glitch filtering. This is mandatory for EMC compliance and to filter out transient artifacts due to impedance mismatches. The proposed filter's area is very small, so it can be accommodated in existing serial-link PHY receiver designs. It has low insertion latency, does not vary the signal transition width, and preserves the width/duty cycle of the received signal. The high-figure-of-merit all-digital filter completely replaces the conventional RC low-pass filter. The prior analog RC filter not only adds inertia to the system but also occupies physical board space. Multi-channel systems benefit greatly and become less cumbersome. The proposed design is all-digital and thus highly technology-independent, enabling very short development times. It has been deployed successfully in a MIPI I3C controller and tested with glitchy transitions on both the SCL & SDA signals.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description
Quantum computers based on superconducting qubits require classical radio-frequency (RF) electronic circuits to control and read out the quantum states. As the complexity of quantum computers scales up, the classical circuitry part becomes increasingly important, calling for high-quality models for its design and optimization. In this paper, we derive an analytical model to quantify the impact of circuit non-idealities on the readout fidelity for superconducting quantum computing hardware. Such a model considers a comprehensive set of non-idealities commonly found in the readout chain, such as frequency, amplitude and phase inaccuracies, impedance mismatch, quantum noise, and amplifier noise, and predicts the joint effects of these non-idealities on the final fidelity. The model's accuracy and effectiveness are verified by numerical quantum-classical co-simulation. The availability of such a model can facilitate the design and optimization of practical quantum computers.
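The paper's model is not reproduced in the abstract; the baseline it refines is the textbook two-Gaussian discrimination estimate, in which readout fidelity follows from the separation of the two IQ clouds relative to the noise:

```python
import math

def readout_fidelity(separation, sigma):
    """Assignment fidelity for two equal-variance Gaussian IQ clouds a
    distance `separation` apart with noise std `sigma`, using the optimal
    midpoint threshold: F = (1 + erf(d / (2*sqrt(2)*sigma))) / 2."""
    return 0.5 * (1.0 + math.erf(separation / (2.0 * math.sqrt(2.0) * sigma)))

# Non-idealities that shrink the separation or inflate the noise
# (impedance mismatch, amplifier noise, ...) degrade fidelity directly.
print(round(readout_fidelity(4.0, 1.0), 4))   # ~0.977
print(round(readout_fidelity(8.0, 1.0), 6))
```

The paper's contribution is to express how each listed non-ideality maps onto this effective separation and noise, so their joint effect on fidelity can be predicted analytically.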
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description
While outsourcing hardware designs using FPGAs (Field Programmable Gate Arrays) enables cost optimization in manufacturing, hardware-Trojan insertion becomes a potential threat to industrial fields. In this paper, we propose a system that applies IFT (Information Flow Tracking) to detect hardware Trojans inserted into a DUT (Design Under Test) written in HDL (Hardware Description Language). Unlike existing IFT techniques for DUTs, our implementation tracks the information flow of multiple variables in simulation. This allows flexible assertion policies to be used for testing. By checking whether a DUT violates any given policy, our system detects a Trojan, extracting the HDL statement and its execution condition related to the Trojan. These are useful for understanding the location of the Trojan and its trigger condition.
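The paper's tracker operates on HDL simulation; the core IFT mechanism (taint flows from sources to destinations through assignments, and a policy asserts that tainted data never reaches certain sinks) can be sketched in a few lines, with a toy straight-line model and hypothetical signal names:

```python
def propagate_taint(statements, sources):
    """Toy information-flow tracking over straight-line assignments.
    Each statement is (dest, [operands]); dest becomes tainted iff any
    operand currently is.  The paper's HDL-level tracker is far richer."""
    tainted = set(sources)
    for dest, operands in statements:
        if any(v in tainted for v in operands):
            tainted.add(dest)
        else:
            tainted.discard(dest)
    return tainted

# Hypothetical policy: data derived from `key` must never reach `leak_wire`.
stmts = [("tmp", ["key", "mask"]), ("out", ["data"]), ("leak_wire", ["tmp"])]
tainted = propagate_taint(stmts, {"key"})
print("leak_wire" in tainted)  # True -> policy violated, possible Trojan path
```

Tracking multiple variables at once, as the paper does, amounts to running this propagation with several independent taint sets and checking each policy against the simulation trace.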
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description
The integration of multiple dies and substrates into a unified 3D-IC package presents a compelling solution to the limitations posed by scaling and the challenges of SoC migration, making it a focal point in semiconductor advancement. Despite its prominence, diverse fabrication methodologies, teams, and formats introduce complexities to seamless integration. This underscores the critical need for innovative approaches to ensure cohesive connectivity, and emphasizes the imperative role of automation in generating 3D-IC rule decks for swift and precise qualification. Efficient qualification demands automated systems capable of synthesizing rule decks while adhering to design specifications and manufacturing methods. This accelerates system netlist generation, layout assembly, and LVS (Layout vs. Schematic) rule deck creation, expediting physical verification to mitigate challenges and promote seamless integration across diverse substrates in semiconductor design and manufacturing.
Back-End Design
Back-End Design
Design
Engineering Tracks
Description
Full-chip STA is a mandatory step in the design closure cycle. With extensive market requirements for high computational workloads, design sizes are growing along with demands on performance and area. With chips fabricated on shrinking technology nodes, the window for design accuracy and pessimism tightens further. To cater to these needs, designs are highly modularized at the architecture level, but when it comes to STA, performing full-chip flat STA is the only option for computing exact design performance.
Flat STA on big designs comes with a high cost in runtime and memory, which makes flat STA best reserved for the final design closure stage. Hence, there are methodologies for faster STA, e.g., distributed chip timing analysis and hierarchical timing analysis, which save on runtime and memory but can impact accuracy.
For flexibility in STA methodologies, multiple hierarchical STA flows, including ETMs, boundary models (for bottom-up analysis), and timing contexts (for top-down analysis), are supported by EDA vendors.
The SmartScope flow discussed in this paper provides a method to bridge the gap between timing-model-based flows and flat STA, with the goal of providing the accuracy of flat STA at the runtime/memory cost of timing-model-based flows.
This paper will showcase:
i) Quantitative analysis of full hierarchical flows, and
ii) A detailed correlation of runtime, memory, and accuracy among different hierarchical STA flows, with flat STA as the anchor point for comparison.
1. Hierarchical STA with ETMs: Uses extracted timing models for blocks and netlist/SPEF for the top level. Best in runtime/memory, but accuracy at the top-block interface can suffer.
2. Bottom-up analysis with boundary models: A hybrid of ETM and full Verilog, this flow uses a trimmed-down netlist model for sub-blocks, offering faster TAT while still exposing interface timing inaccuracies, if any.
a. Comprehensive QoR comparison (memory consumption, runtime, accuracy) across hierarchical and flat methodologies
b. Extended interface-model netlist reduction techniques with similar accuracy
c. Debug techniques to handle clock-mapping issues
3. Top-down analysis with timing contexts: Creates context timing for blocks in a full-chip flat STA and performs block STA with actual top-level latencies as constraints. Both SIM and MIM blocks are supported.
4. SmartScope flows: Close the loop between the bottom-up and top-down approaches by creating a hand-shake between the two flows.
This paper provides data points on a ~150M-instance design in terms of timing correlation and runtime/memory benefits compared to flat STA.
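For reference, the computation that flat STA performs exhaustively at full-chip scope is the worst-case arrival-time recurrence over the timing graph, sketched below on a hypothetical four-node netlist (vendor flows add constraints, derating, and far more):

```python
from collections import defaultdict, deque

def arrival_times(arcs, primary_inputs):
    """Propagate worst-case arrival times through a DAG of timing arcs
    (src, dst, delay) in topological order: AT(v) = max over fanin of
    AT(u) + delay(u, v)."""
    adj = defaultdict(list)
    indeg = defaultdict(int)
    for src, dst, d in arcs:
        adj[src].append((dst, d))
        indeg[dst] += 1
    at = {pi: 0.0 for pi in primary_inputs}
    queue = deque(primary_inputs)
    while queue:
        u = queue.popleft()
        for v, d in adj[u]:
            at[v] = max(at.get(v, float("-inf")), at[u] + d)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return at

arcs = [("in", "u1", 1.0), ("in", "u2", 2.0),
        ("u1", "out", 3.0), ("u2", "out", 1.5)]
print(arrival_times(arcs, ["in"])["out"])  # 4.0 via in -> u1 -> out
```

Hierarchical flows replace sub-blocks of this graph with compact models (ETMs, boundary models) or constrain them with context latencies; the accuracy gap discussed in the paper comes from how faithfully those models reproduce the flat recurrence at block boundaries.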
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description
This study introduces an innovative approach for closing large SoC designs efficiently through an effective hierarchical EM flow. The methodology leverages a hierarchical analysis framework, integrating both top-level and block-level EM considerations to address the complexities of large-scale SoC designs. This approach uniquely combines the granularity of block-level analysis with the holistic perspective of top-level integration, enabling precise identification and mitigation of EM issues without compromising accuracy.
Key elements of this methodology include advanced EM modeling at various hierarchical levels, strategic partitioning of the SoC into manageable blocks, and the use of boundary models to accurately assess EM effects at interconnects.
The results demonstrate a significant reduction in the time and memory footprint required to close large SoC designs. This methodology not only enhances the reliability and performance of the SoC but also offers a scalable solution applicable to a wide range of complex integrated circuit designs. The hierarchical top-scope signal EM flow represents a substantial advancement in SoC design methodologies, setting a new benchmark for efficiently addressing electromigration challenges in large, complex SoC designs.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description
With the continuous advancement of chip packaging technology, the performance of package core power delivery plays a pivotal role in the operation of the entire chip. Especially for high-performance 2.5D and 3D large-scale ICs, efficient simulation of core power poses significant challenges.
Many IP power-noise targets are specified at M0 within the die. Back-end engineers can use a package subckt model to simulate dynamic and static IR drop and verify whether the power noise at M0 meets these targets, but package engineers, limited by tools and methods, typically only simulate power noise at the bumps.
This paper introduces a fast method for evaluating chip power noise using iCPM. The iCPM is generated by RedHawk-SC with several probe points on M0. Subsequently, package engineers can construct a circuit using iCPM + package model + PCB model. Simulating power noise at M0 via spice simulation only takes a few minutes. This method significantly improves simulation efficiency.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description
Timing closure is a critical but effort-intensive task in VLSI design. In this paper, we focus on timing-driven placement by considering two important factors: an accurate sign-off timing predictor and a corresponding placement optimization method. For accurate timing analysis, we propose an innovative timing prediction model that can be transformed into a differentiable function, serving as a replacement for the conventional Elmore delay. While maintaining model accuracy, the overall model complexity is thereby reduced. To evaluate the effectiveness of our timing model, we seamlessly integrate it into the open-source placer DreamPlace. In addition, a pin-to-pin weighting approach based on the differentiable timing model is given for timing optimization. Experimental results show that our differentiable timing prediction model significantly reduces the maximum and mean timing errors compared to the Elmore delay, and exhibits accuracy equivalent to the non-differentiable timing prediction model. The timing performance after placement optimization is better than the result using Elmore delay, i.e., smaller TNS and WNS, with wirelength decreased by 15% on average.
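For context, the conventional Elmore model that the proposed differentiable predictor replaces computes each sink's delay as a sum, over the root-to-sink path, of each resistance times the capacitance it drives downstream. A sketch on a small hypothetical RC tree:

```python
def elmore_delay(tree, caps, sink):
    """Elmore delay of an RC tree: for each resistor on the root-to-sink
    path, add R times the total capacitance of the subtree it feeds.
    `tree` maps node -> (parent, R_to_parent); the root's parent is None."""
    children = {}
    for node, (parent, _) in tree.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)

    def subtree_cap(node):
        return caps[node] + sum(subtree_cap(c) for c in children.get(node, []))

    delay, node = 0.0, sink
    while tree[node][0] is not None:
        parent, r = tree[node]
        delay += r * subtree_cap(node)
        node = parent
    return delay

# root --2ohm-- a --3ohm-- b, with a side branch a --1ohm-- c
tree = {"root": (None, 0.0), "a": ("root", 2.0), "b": ("a", 3.0), "c": ("a", 1.0)}
caps = {"root": 0.0, "a": 1.0, "b": 2.0, "c": 1.0}
print(elmore_delay(tree, caps, "b"))  # 2*(1+2+1) + 3*2 = 14.0
```

Elmore is already differentiable in R and C, but it is a coarse first-moment estimate; the paper's model aims to keep differentiability while tracking sign-off timing much more closely.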
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionA key distinguishing feature of single flux quantum (SFQ) circuits is that each logic gate is clocked. This feature forces the introduction of path-balancing flip-flops to ensure proper synchronization of inputs at each gate. This paper proposes a polynomial-time approximation algorithm for clocking assignments that minimizes the insertion of path-balancing buffers for multi-threaded multi-phase clocking of SFQ circuits. Existing SFQ multi-phase clocking solutions have been shown to effectively reduce the number of inserted buffers while maintaining high throughput; however, the associated clock assignment algorithms have exponential complexity and can have prohibitively long runtimes for large circuits, limiting the scalability of this approach. Our proposed algorithm is based on a linear program (LP) that leads to solutions experimentally within 5% of the optimum on average and helps accelerate convergence towards the optimal integer linear program (ILP) based solution. The improved LP and ILP runtimes permit multi-phase clocking schemes to scale to larger SFQ circuits than previous state-of-the-art clocking assignment methods. We further extend the algorithm to support fanout sharing of the added buffers, saving, on average, an additional 10% of the inserted DFFs. Compared to traditional full path balancing (FPB) methods across 10 benchmarks, our enhanced LP saves 79.9%, 87.8%, and 91.2% of the inserted buffers for 2, 3, and 4 clock phases, respectively. Finally, we extend this approach to the generation of circuits that completely mitigate potential hold-time violations at the cost of either adding on average less than 10% more buffers (for designs with 3 or more clock phases) or, more generally, adding a clock phase and thereby reducing throughput.
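The buffer overhead that the full path balancing (FPB) baseline incurs can be made concrete with a small calculation. Under single-phase clocking, every gate input must arrive exactly one logic level early, so an edge spanning k levels needs k-1 path-balancing DFFs. The toy counter below illustrates that baseline cost (it is not the paper's LP formulation):

```python
from collections import defaultdict

def fpb_buffers(edges):
    """DFFs inserted by full path balancing on a gate-level DAG.

    edges: (driver, sink) pairs; nodes with no fan-in are primary inputs.
    Every SFQ gate is clocked, so each input must arrive exactly one
    logic level before its sink fires: an edge spanning k levels needs
    k - 1 path-balancing DFFs.
    """
    fanin = defaultdict(list)
    for u, v in edges:
        fanin[v].append(u)
    level = {}

    def lv(n):  # longest-path logic level, memoized
        if n not in level:
            level[n] = 1 + max((lv(p) for p in fanin[n]), default=0)
        return level[n]

    return sum(lv(v) - lv(u) - 1 for u, v in edges)

# Diamond with a shortcut: a->b->c->d plus a->d.
# The shortcut spans 3 levels, so FPB inserts 2 DFFs on it.
n = fpb_buffers([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")])
```

In the diamond example the shortcut edge alone accounts for both inserted DFFs; level imbalance of exactly this kind is what multi-phase clock assignment reduces.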
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionMulti-die designs (2.5DIC and 3DIC) have risen in popularity over the last decade as they offer tremendously increased levels of integration, a smaller footprint, performance gains, and more. While attractive for many applications, they also create more stringent design bottlenecks in thermal management and power delivery. For 3DICs, in addition to the complex SoC/PCB interactions seen in their 2D counterparts, we must account for electrical and thermal coupling between dies as well.
For these advanced package designs, such as 2.5D/3DICs and chiplets, power, thermal, electromagnetic, and mechanical effects, together with their highly coupled interactions, are the primary limiters of entitled performance, yield, and cost. Rising temperature increases device leakage power and cooling costs. It can also severely degrade overall design performance: interconnect resistance rises, device performance degrades, and thermally induced noise can shift the light-wave phase in optical designs.
Higher temperatures also cause reliability issues such as electromigration failure, aging, and stress-related failures, so thermal management is essential to avoid thermal runaway. However, full 3DIC system thermal analysis with a detailed CTM takes too long at the sign-off stage, and once thermal issues arise there is no room left to adjust the SoC die. In most cases, upgrading the cooling equipment is then almost the only option, at high cost. We therefore seek a shift-left method that manages chip thermals in the early stages. Early thermal management can more efficiently avoid thermal runaway, reduce thermal-management costs, and give designers more confidence during design sign-off analysis.
Thermal-aware floorplanning and power planning with preliminary collateral in RedHawk-SC-Electrothermal can analyze and predict power-thermal reliability issues at an early stage; identifying thermal issues early enables fixes and changes that profoundly reduce failures with minimal design effort. Through early-stage thermal-stress analysis, we can avoid the warpage and solder-joint reliability issues caused by thermal expansion.
Keywords: 3DIC, thermal-aware floorplan, power plan, early-stage thermal management
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionExterior design plays an important role in the automotive design industry and usually takes laborious work by designers. Image editing, a fundamental image manipulation task, has been revolutionized by denoising diffusion models thanks to their great productivity and creativity. However, the application of denoising diffusion models to image editing in automotive design is still limited by ambiguous editing instructions and uncontrollable output, leading to undesirable, low-quality results. Moreover, training and inference require substantial resources. In this work, we propose a novel image editing framework for automotive design that precisely comprehends human instructions and produces high-fidelity exterior renderings. Meanwhile, it needs only 6.5 GPU hours and 16GB of VRAM for training and 8GB of VRAM for inference, making it more accessible.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAt Renesas, we develop compact and low-power SRAMs for our products. For our SRAM library development, we produce and verify all 10,000+ memory instances generated by our Memory Compiler.
All SRAM IPs must be validated across a wide range of process, voltage, and temperature (PVT) conditions, as well as multiple views and formats for consistency and correctness, including logical, physical, timing, SPICE, and other views. This requires significant time and effort.
To enhance the IP QA process in terms of efficiency and coverage, Renesas has built an SRAM IP QA methodology in collaboration with Siemens' Solido Crosscheck. This methodology includes several custom checks from Renesas, in addition to standard SRAM and IP checks. It covers all relevant front-end and back-end design views for IP production and integration workflows, and enables Renesas to fully validate IPs in significantly less time than before.
In this paper, we will discuss Renesas' efficient SRAM IP QA methodology. Within this methodology, we will also highlight key QA checks for SRAM validation, the importance of such rules, and provide insight into QA efficiency and coverage of the flow.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionTransformer models equipped with the multi-head attention (MHA) mechanism have demonstrated promise in computer vision tasks, i.e., vision transformers (ViTs). Nevertheless, the lack of inductive bias in ViTs leads to substantial computational and storage requirements, hindering their deployment on resource-constrained edge devices. To this end, multi-scale hybrid models have been proposed to combine the advantages of transformers and CNNs. However, existing domain-specific architectures usually focus on optimizing either convolution or MHA at the expense of flexibility. In this work, an in-memory computing (IMC) accelerator is proposed to efficiently accelerate ViTs with a hybrid MHA-and-convolution topology by introducing pipeline reordering. An SRAM-based digital IMC macro is utilized to mitigate the memory access bottleneck while avoiding analog non-ideality. Reconfigurable processing engines and interconnections are investigated to enable adaptable mapping of both convolution and MHA. Under typical workloads, experimental results show that our proposed IMC architecture delivers 2.20× to 2.52× speedup and 40.6% to 74.8% energy reduction compared with the baseline design.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper proposes a layout automation method for area-compact memory design that considers modification strategies for the X-peripheral and Y-peripheral circuits in a memory. Traditional template-based methods are hindered by the manual effort required for template creation. To eliminate the need to create a template manually for each circuit, we propose a novel method of reforming the layout based on target locations. In the TSMC 28nm process, the layout automation reduces the peripheral circuit area by 1.79% to 4.08%, decreases dynamic power by 0.76% to 12.86%, and reduces access time by 0.75% to 7.23%.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionShrinking technologies have paved the way for complex devices that integrate various IPs and functionalities in a single SoC; consequently, complex clocking structures and efficient power management in AMS IPs are increasingly common. The same design complexity is reflected in the HDL behavioral model: timing from internal clocks, real-number modeling, power-aware modeling, and so on. Robust behavioral modeling of these complex IPs is needed to enable accurate and efficient functional checks along with timing.
In this paper, the challenges and shortcomings associated with modeling complex AMS IPs for timing simulations are discussed, along with the proposed methodology. We also demonstrate how this methodology handles the data-latching issue correctly when negative timing checks are present in the design, without compromising any advanced feature supported in the model.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionGrowing IC manufacturing complexity and reliance on third-party fabrication create supply chain fragility, contributing to chip shortages and IP security risks. General-purpose ICs can mitigate manufacturing security risks but rely on software-based configurations, which are not optimal for high-consequence applications.
Our work proposes a novel IP-agnostic Foundational Cell Array (FC-Array) platform to overcome these challenges. Built on only verified standard cells and industry-standard EDA tools, this platform preserves many advantages of an ASIC. By incorporating 3D split manufacturing, we provide semantically secure IP protection and a base wafer that can be stockpiled. Our tests demonstrate both power-efficient (100 MHz) and high-performance (1 GHz) options. In a post-place-and-route simulated 28nm design, our FC-Array shows a worst-case 1.85x increase in power consumption and a 2.61x increase in area compared to standard cell ASICs for equivalent timing performance.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionAs a core arithmetic operation and security guarantee of Fully Homomorphic Encryption (FHE), the Number Theoretic Transform (NTT) of a large degree is the primary source of computational and time overhead. In this paper, we propose a scalable, conflict-free memory mapping algorithm that breaks the memory bound and releases a large amount of on-chip resources. A flexible, no-stall hardware/software pipeline architecture is designed to boost the throughput of NTT/INTT for $N=2^{16}$ to over 48,543 operations per second with high area efficiency, a 4× and 10× speedup over the FPGA-based (HPCA'23) and GPU-based (HPCA'23) schemes, respectively.
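As background on the kernel being accelerated: an NTT is a DFT over a prime field, replacing complex roots of unity with a primitive n-th root of unity modulo p, so polynomial products become pointwise products with no rounding error. A toy O(n²) reference follows; p = 17 and ω = 9 are illustrative small choices, whereas the accelerator itself targets pipelined transforms at N = 2^16:

```python
def ntt(a, omega, p):
    """Naive O(n^2) number-theoretic transform of a over Z_p."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]

def intt(A, omega, p):
    """Inverse NTT: forward transform with omega^-1, scaled by n^-1."""
    n = len(A)
    inv_n = pow(n, -1, p)
    inv_w = pow(omega, -1, p)
    return [x * inv_n % p for x in ntt(A, inv_w, p)]

# n=8, p=17, omega=9: 9 is a primitive 8th root of unity mod 17
# (9^4 = 16 = -1 mod 17, 9^8 = 1 mod 17), so the round trip is exact.
a = [1, 2, 3, 4, 0, 0, 0, 0]
assert intt(ntt(a, 9, 17), 9, 17) == a
```

Production implementations replace this with O(n log n) butterfly networks; the memory-mapping problem the paper addresses is keeping those butterflies fed without bank conflicts.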
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionResearchers have previously developed advanced analysis tools that identify fault-causing inputs in complex digital designs. One contributing factor to the success of these tools is the availability of publicly available digital designs and open-source execution flows. We observe that the field of analog/mixed-signal (AMS) circuit verification currently lacks an open-source execution flow that targets AMS system designs. We present VerA, an analysis-friendly open-source AMS modeling and simulation framework that works with open-source digital simulators. VerA's compiler employs optimizations to reduce the state space of the digitized analog model and seamlessly integrates digital and analog blocks, enabling easier analysis of the AMS system.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAn explosion of automotive applications in the form of driverless cars, complex space exploration, and aviation advancements has made it mandatory for the design and EDA community to come up with solutions focused on the safety of such devices. Earlier attempts treated safety-based features as a separate or add-on requirement addressed after the design was fabricated for integration purposes. This required long turnaround times and endless back-and-forth iterations of design modifications to cater to system-level requirements. Moreover, these safety requirements always came at a cost to PPA, which was considered a non-negotiable aspect of safety implementation. This was due to the lack of an industry-standard approach to passing the information for proper specification, implementation, and modeling of safety-critical systems through the implementation flow, from synthesis to routing.
This paper discusses the industry-standard solution from Cadence Design Systems using the Unified Safety Format (USF) of the Midas Safety Platform, which can be seamlessly passed on to implementation tools such as Genus, Innovus, and Conformal to provide best-in-class, PPA-aware, safety-intent-driven implementation and verification of chips targeted at automotive devices.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn this paper, we provide the first thorough analysis of 64-bit parallel prefix adders (PPAs) and 32-bit matrix multiply units (MMUs) implemented using 7-nm carbon nanotube field effect transistors (CNFETs). Unlike many previous studies in which researchers performed the analysis of CNFET circuits at the SPICE level, we focus on netlists placed and routed using the state-of-the-art CNFET cell library. This approach enables us to analyze a more complex and wider range of CNFET circuits (i.e., various architectures of parallel prefix adders and matrix multiply units) than previous studies, while considering various effects of the circuits' physical layout. Our experimental results show that 7-nm CNFET improves energy-delay products (EDPs) by 90× and 44× on average for PPAs and MMUs, respectively, compared to 7-nm FinFET. In addition, our analysis shows that the impact of wires, particularly on power consumption, is more substantial in CNFET circuits than in FinFET circuits, and wire savings are therefore crucial for optimizing the EDP of CNFET circuits. This study opens up a new opportunity to develop wire-aware design for CNFET circuits.
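The headline metric here, the energy-delay product, is simply the product of the two quantities, which lets a comparison reward designs that improve either energy or speed without gaming the other. A trivial sketch with hypothetical numbers (not figures from the study):

```python
def edp(energy_j, delay_s):
    """Energy-delay product: the figure of merit compared in the study."""
    return energy_j * delay_s

# Hypothetical operation costs: halving energy while cutting delay 3x
# improves EDP by a factor of 6.
baseline = edp(2.0e-12, 3.0e-9)   # 2 pJ at 3 ns
improved = edp(1.0e-12, 1.0e-9)   # 1 pJ at 1 ns
ratio = baseline / improved
```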
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHigh-sigma analysis is an important topic in circuit design and analysis: it predicts the probability of rare circuit/device failure events in VLSI circuits, such as SRAM arrays. EDA start-ups such as Solido and MunEDA are specifically dedicated to rare-failure-event problems, and importance sampling, tail sampling methods, and related techniques have been used in this area for many years. More recently, the Scaled Sigma Sampling (SSS) method by Prof. X. Li et al. at Carnegie Mellon, an extrapolation method, greatly advanced the analysis of rare failure events, and the EDA industry has welcomed it. However, we have not seen a comparison of the SSS method against a set of known, exact failure probabilities. Without such a benchmark, the validity range and expected accuracy of the SSS method are not very clear. In this work, we fill this gap and also present an improved SSS method.
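The benchmarking idea, checking an estimator against known, exact failure probabilities, is easy to reproduce for a Gaussian, where the exact tail probability has a closed form. The sketch below is a plain Monte-Carlo illustration of why sigma scaling helps, not the SSS extrapolation model itself: a 5-sigma failure is essentially invisible to a modest MC run at nominal sigma, but readily observable once sigma is inflated.

```python
import math
import random

def exact_tail(t, sigma=1.0):
    """Exact P(X > t) for X ~ N(0, sigma): the 'known' failure rate."""
    return 0.5 * math.erfc(t / (sigma * math.sqrt(2.0)))

def mc_tail(t, sigma, n, seed=0):
    """Plain Monte-Carlo estimate of P(X > t) at a (scaled) sigma."""
    rng = random.Random(seed)
    return sum(rng.gauss(0.0, sigma) > t for _ in range(n)) / n

# At nominal sigma a 5-sigma failure is ~2.87e-7, far too rare for a
# 10k-sample run to observe; at scaled sigma s=2.5 the same threshold
# is only 2 sigma away and MC sees it easily.
print(exact_tail(5.0))            # ~2.87e-7
print(mc_tail(5.0, 1.0, 10_000))  # almost certainly 0.0
print(mc_tail(5.0, 2.5, 10_000))  # near exact_tail(5.0, 2.5), ~0.0228
```

SSS's contribution is the extrapolation from such scaled-sigma estimates back to the nominal-sigma failure rate; the benchmark proposed in the abstract checks that extrapolation against tails like `exact_tail`.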
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionMonolithic designs face significant fabrication costs and data movement challenges, especially when dealing with complex and diverse AI models. Advanced 2.5D/3D packaging promises high bandwidth and density to overcome these challenges but also introduces new electro-thermal constraints. This paper presents a suite of analytical performance models to enable efficient benchmarking of a 2.5D/3D AI system. These models cover various metrics of computing units, network-on-chip, and network-on-package. The results are summarized into a new tool, HISIM. Benefiting from the accuracy and efficiency of HISIM, we evaluate the potential of 2.5D/3D heterogeneous integration on representative AI algorithms under thermal constraints.
Research Manuscript
Annotating Slack Directly on Your Verilog: Fine-Grained RTL Timing Evaluation for Early Optimization
EDA
Timing and Power Analysis and Optimization
DescriptionIn digital IC design, the early register-transfer level (RTL) stage offers greater optimization flexibility than post-synthesis netlists or layouts. Some recent machine learning (ML) solutions propose to predict the overall timing of a design at the RTL stage, but the fine-grained timing information of individual registers remains unavailable. In this work, we introduce RTL-Timer, the first fine-grained general timing estimator applicable to any given design. RTL-Timer explores multiple promising RTL representations and customizes loss functions to capture the maximum arrival time at register endpoints. RTL-Timer's fine-grained predictions are further applied to guide optimization in a standard logic synthesis flow.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionApproximate computing is emerging as a promising approach to devising energy-efficient IoT systems by exploiting the inherent error tolerance of various applications. In this work, we present Approx-T, a novel design methodology that conducts an in-depth study of Approximate Multiplication Units (AMUs) via Taylor expansion. This paper makes three key contributions: (1) pioneering the incorporation of Taylor's theorem into the design of approximate units; (2) leveraging the inherently symmetrical error distribution of the Taylor series to construct unbiased AMUs; (3) presenting a runtime-configurable error compensation architecture with low-complexity arithmetic operations. We implemented both approximate integer and floating-point multiplication units and compared them with state-of-the-art approximations; experimental results show that Approx-T outperforms them in all aspects, including precision, area, and power consumption. We also deployed the AMUs on an embedded FPGA for various edge computing tasks; Approx-T achieves up to 5.7× energy efficiency in a CNN application with negligible impact on accuracy.
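The flavor of truncation-based approximate multiplication can be conveyed with a classic first-order scheme (an illustration of the general idea only, not Approx-T's AMU design): writing each positive operand as a power of two times (1 + f) and dropping the second-order f_a*f_b cross term turns the multiply into an add.

```python
import math

def approx_mul(a, b):
    """First-order approximate multiply for positive floats.

    Write a = 2^ea * (1 + fa) and b = 2^eb * (1 + fb), fa, fb in [0, 1).
    Exactly, (1+fa)*(1+fb) = 1 + fa + fb + fa*fb; dropping the
    second-order fa*fb term leaves only an addition, the same
    truncation idea as a first-order series expansion.
    """
    ma, ea = math.frexp(a)            # a = ma * 2^ea, ma in [0.5, 1)
    mb, eb = math.frexp(b)
    fa, fb = 2 * ma - 1, 2 * mb - 1   # mantissa fractions in [0, 1)
    return math.ldexp(1 + fa + fb, ea + eb - 2)

# 3*5: exact 15; truncation drops fa*fb = 0.125, giving 2^3 * 1.75 = 14.0
```

Note the dropped fa*fb term is always non-negative, so this particular truncation systematically underestimates; shaping such error distributions to be symmetric and unbiased is what contribution (2) above addresses.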
Research Manuscript
AI
AI/ML Algorithms
DescriptionLarge language models (LLMs) have greatly advanced the natural language processing paradigm. However, their high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization. Experiments show APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bitwidth of 3.8 on LLaMa-7B and LLaMa-13B, respectively.
Research Manuscript
EDA
Physical Design and Verification
DescriptionThis paper presents a novel reinforcement-learning-trained router for building a multi-layer obstacle-avoiding rectilinear Steiner minimum tree (OARSMT). The router is trained by our proposed combinatorial Monte-Carlo tree search to select a proper set of Steiner points for the OARSMT with only one inference. By using a Hanan-grid graph as the input and a 3D UNet as the network architecture, the router can handle layouts of any dimensions and any routing costs between grid cells. Experiments on both random cases and public benchmarks demonstrate that our router significantly outperforms previous algorithmic routers and other RL routers using AlphaGo-like or PPO-based training.
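The Hanan grid the router takes as input is straightforward to construct: by Hanan's theorem, an optimal rectilinear Steiner tree needs Steiner points only at intersections of the horizontal and vertical lines through the terminals. A minimal sketch of the candidate-point generation (the grid construction only, not the RL router itself):

```python
def hanan_points(terminals):
    """Candidate Steiner points for a rectilinear Steiner tree.

    Hanan's theorem: an optimal RSMT exists whose Steiner points all
    lie at intersections of horizontal and vertical lines drawn
    through the terminals, so these are the only candidates needed.
    """
    xs = sorted({x for x, _ in terminals})
    ys = sorted({y for _, y in terminals})
    return [(x, y) for x in xs for y in ys]

# Three terminals with distinct coordinates yield a 3x3 candidate grid.
pts = hanan_points([(0, 0), (2, 1), (5, 4)])
```

Restricting the search to this grid is what makes a learned point-selection policy tractable: the router only has to choose a subset of these candidates rather than arbitrary coordinates.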
Exhibitor Forum
DescriptionBursting EDA workloads from on-prem to cloud is a challenge for most on-prem environments, which are increasingly running out of capacity due to the growing complexity of advanced-node designs. For massively parallelized workloads, such as library characterization, implementation, and physical verification, engineers currently need to split their designs between on-prem and cloud execution if they want to leverage scalable cloud compute capacity. Depending on the design, this is a tedious activity that eats away at precious engineering productivity. And once job execution is complete, the process of transferring output data back from cloud to on-prem and aggregating it with output generated on-prem adds to this overhead. In this session, we will discuss a unique approach to enabling a true hybrid cloud environment architected specifically for EDA workloads, which enables engineers to submit a large job exclusively on-prem, automatically splitting the job, routing selective worker traffic through a secure network for cloud execution, and syncing data generated on the cloud back to on-prem storage for further processing in the flow. Along with license management automation, hybrid cloud optimization can radically improve engineering productivity and enhance the overall cloud experience for SoC design.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe von Neumann architecture has played a fundamental role in advancing state-of-the-art computing platforms. Despite its contributions, its heavy reliance on data movement between memory and processor elements poses a significant challenge. The evolving compute-in-memory (CiM) paradigm offers a promising solution to the memory-wall bottleneck by facilitating simultaneous processing and storage within static random-access memory (SRAM) elements. Numerous design decisions taken at different levels of the hierarchy affect the figures of merit (FoMs) of SRAM, such as power, performance, area, and yield. The absence of a rapid mechanism for assessing the impact of changes at different hierarchy levels on global FoMs makes it hard to accurately evaluate innovative SRAM designs. This paper presents an automation tool designed to optimize the energy and latency of SRAM designs incorporating diverse implementation strategies for executing logic operations within the SRAM. The tool structure allows easy comparison across different array topologies and design strategies to arrive at energy-efficient implementations. Our study comprehensively compares over 6,900 distinct design implementation strategies for the EPFL combinational benchmark circuits on the energy-recycling resonant compute-in-memory (rCiM) architecture designed in TSMC 28 nm technology. Given a combinational circuit, the tool generates an energy-efficient implementation strategy tailored to the specified input memory and latency constraints. By exploiting the parallel processing capability of rCiM cache sizes ranging from 4 KB to 192 KB, the tool reduces energy consumption by 80.9% on average across all benchmarks compared to the baseline single-macro topology.
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionSecurity practices in the field of machine learning (ML) encompass a range of measures, one notable strategy being to conceal the architecture of ML models from users, thereby adding an extra layer of protection. This proactive strategy serves multiple key purposes, including safeguarding intellectual property, mitigating model vulnerabilities, and preventing adversarial attacks. In this work, we propose a novel fingerprinting attack that identifies a given ML model's architecture family from among the latest categories. To this aim, we are the first to leverage a frequency-throttling side-channel attack, a method that enables us to convert power side-channel information into timing variations at the user-space level. We utilize the timing information of crafted adversary kernels combined with a supervised machine learning classifier to identify the ML model architecture. In particular, our proposed method captures timing information by monitoring an adversary kernel's execution time while a specific ML model runs, unveiling distinctive timing patterns. This process involves initiating the frequency-throttling side-channel effect and transforming it into timing information. Subsequently, we employ a specialized machine learning classifier trained on this timing data to precisely identify the victim's ML model architecture. With this approach, we achieve 98% accuracy in correctly classifying a known ML model into its corresponding architecture family. Furthermore, our attack demonstrates transferability by assigning the correct family to unseen models with 90.6% accuracy on average. Additionally, for thorough analysis, we have reproduced this attack across 3 different platforms, with comparable results underscoring the attack's platform portability. Finally, we intend to publicly release our work, making it accessible to the research community for reproducibility.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAAET (Architecture Area Evaluation Tool) is designed to address the pressing need for accurate and unified area estimation for future devices. Precise estimation plays an important role in determining the approximate area cost of a device so that it meets market requirements.
AAET aims to enable seamless data consolidation by efficiently processing information from various legacy devices, which can be configured by frequency, memory, processor features, etc., in order to adapt to the changing requirements of future devices.
The Dashboard provides a user-friendly and efficient tool for estimating area (synthesis and PD) through interactive GUI features, serving as a data-analysis tool, with the goal of reducing workload, preventing manual errors, and facilitating data-driven decision-making for competitive advantage.
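The abstract does not disclose AAET's estimation model, but scaling a characterized legacy block's area to a new frequency/memory configuration might be sketched as below. The field names, the logic/memory split, and the square-root frequency scaling are illustrative assumptions, not AAET's actual method.

```python
def estimate_area(legacy, target, freq_exp=0.5):
    """First-order area scaling from a characterized legacy block to a target
    configuration. 'logic_frac' splits the legacy area into logic vs. memory;
    logic is assumed to grow ~freq**freq_exp, memory linearly with capacity."""
    logic = (legacy["area_um2"] * legacy["logic_frac"]
             * (target["freq_mhz"] / legacy["freq_mhz"]) ** freq_exp)
    mem = (legacy["area_um2"] * (1 - legacy["logic_frac"])
           * target["mem_kb"] / legacy["mem_kb"])
    return logic + mem
```

A dashboard like AAET would presumably calibrate such per-component factors against many legacy designs rather than fix them by hand.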
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionMost existing works reveal that deep learning systems are extremely susceptible to adversarial examples (AEs), a finding that continues to reverberate through the DL-testing community. Consequently, adversarial attacks are exploited to test the robustness of DL models, especially optimized gradient-based techniques in white-box testing. Although AEs have achieved competitive fault-revealing and coverage-improving performance in DL testing, little research has analyzed the phenomenon theoretically. In this work, we give a formal analysis of the relationship between gradient-based attacks and the minima of the loss function, proving that powerful adversaries will, with high probability, share similar feature representations. Our extensive evaluation and theoretical analysis reveal (1) that optimized gradient-based techniques cover only a limited portion of the decision logic, which plainly contradicts the diversity expected of test suites, (2) the reasons why adversarial examples can increase test coverage, and (3) the weaknesses of AEs in comparison with search-based and fuzz-based test-suite generation techniques. Finally, our results prove that AEs can efficiently discover vulnerabilities of a DL model but are not suitable as test suites for exploring more of its inner logic.
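A minimal instance of the optimized gradient-based attacks discussed above is FGSM, where the perturbation direction is the sign of the input gradient of the loss. The sketch below applies one FGSM step to a plain logistic model; it is a generic textbook example, not tied to the paper's experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y, w, eps):
    """One FGSM step on a logistic model p(y=+1|x) = sigmoid(w.x), y in {-1,+1}.

    d(loss)/d(x_i) = -y * sigmoid(-y * w.x) * w_i, so the step moves x by
    eps in the gradient's sign direction, pushing it toward the boundary."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y * sigmoid(-margin)          # common factor of the input gradient
    grad = [scale * wi for wi in w]
    return [xi + eps * (1 if g > 0 else -1 if g < 0 else 0)
            for xi, g in zip(x, grad)]
```

Each step shrinks the classification margin, which is exactly why such AEs reliably reveal faults yet, as the paper argues, concentrate on a narrow slice of decision logic.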
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs semiconductor technology has advanced, IR-Drop challenges have grown considerably in recent years. Dynamic IR-Drop in particular has become a major factor causing functional failure, and this will remain true for advanced process nodes of 5nm and below. Post-VCD files covering various realistic scenarios are needed to find out whether IR-Drop issues exist, but these files are not available until the end of the design cycle, which is too late to fix IR-Drop issues. Fixing them at the final stage of the design cycle, when post-VCD files finally become obtainable, is time-consuming, painful, and sometimes almost impossible.
The only way to resolve this situation is to find out where the design is weak to dynamic IR-Drop as early as possible, which is why we have proposed the Areal and Time decomposed Phalanx based DNN (Deep Neural Network) methodology. Using this methodology, we choose the Phalanx that best fits DNN modeling and predict IR-Drop on the new design. We can find out where the PDN (Power Distribution Network) is weak even without the layout routing information that is essential in the traditional flow, and can therefore fix issues and strengthen the PDN at a very early stage of the design cycle.
This method shows an IR-Drop accuracy of over 95% and reduces the iteration time needed to fix IR-Drop violations by roughly 40%.
This Areal and Time decomposed Phalanx based DNN methodology has been verified using the commercial tool Cadence Voltus.
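The Phalanx decomposition itself is not specified in the abstract, but the spatial ("areal") step such a methodology relies on can be sketched as tiling a per-cell power map and extracting per-tile features to feed a downstream DNN. The function names and the three-feature summary are illustrative assumptions.

```python
def decompose(power_map, tile_size):
    """Split the die area into fixed-size tiles (the spatial decomposition)."""
    rows, cols = len(power_map), len(power_map[0])
    return [(r, c, min(r + tile_size, rows), min(c + tile_size, cols))
            for r in range(0, rows, tile_size)
            for c in range(0, cols, tile_size)]

def tile_features(power_map, tile):
    """power_map: 2D list of per-cell switching power; tile: (r0, c0, r1, c1).
    Returns (total, peak, mean) power for one tile as simple DNN inputs."""
    cells = [power_map[r][c]
             for r in range(tile[0], tile[2])
             for c in range(tile[1], tile[3])]
    return (sum(cells), max(cells), sum(cells) / len(cells))
```

A time decomposition would add the analogous windowing over switching-activity waveforms, giving the DNN areal-and-temporal feature tensors without needing routed layout data.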
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionThis paper presents Artisan, an automated operational amplifier design framework using large language models. We develop a bidirectional representation to align abstract circuit topologies with their structural and functional semantics. We further employ Tree-of-Thoughts and Chain-of-Thoughts approaches to model the design process as a hierarchical question-answer sequence, implemented by a mechanism of multi-agent interaction. A high-quality opamp dataset is developed to enhance the design proficiency of Artisan. Experimental results demonstrate that Artisan outperforms state-of-the-art optimization-based methods and benchmark LLMs, in success rate, circuit performance metrics, and interpretability, while accelerating the design process by up to 50.1x. Artisan will be released for public access.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionLarge language models (LLMs), like ChatGPT, have been shown to be quite effective at information retrieval. By leveraging conversational AI, we have extended the functionality of our in-house Stack Overflow-like system. We provide a virtual assistant capable of answering the questions about design, technology, and tools that our design team needs. We ingest design manuals, methodology and tool documentation, and education materials, and use retrieval-augmented generation with an LLM to respond to queries. We have built a private, on-premises system that keeps our confidential data in-house. We'll show early progress on this project.
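The retrieval-augmented generation loop described above can be sketched in a few lines: rank the ingested documents against the query, then splice the top hits into the LLM prompt. The bag-of-words cosine retriever below is a toy stand-in for the embedding-based retrieval a production system would use, and the LLM call itself is omitted.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two Counter term-frequency vectors.
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, docs, k=2):
    # Rank ingested documents by similarity to the query, keep the top k.
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    # The "augmentation" step: ground the LLM in retrieved context only.
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Keeping both the document store and the LLM on-premises, as the abstract notes, is what lets confidential design manuals flow through this loop safely.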
Research Panel
AI
EDA
DescriptionAccording to data provided by the World Health Organization, it is a grim reality that
more than 1.3 million people lose their lives annually due to the tragic outcomes of
road traffic accidents, further exacerbating the situation with a staggering 20 to 50
million individuals being left with non-fatal injuries. These disheartening statistics serve
as a stark reminder of the urgent need for improved safety measures in the automotive
industry.
Historically driven by the pursuit of creating vehicles that captivate and exhilarate
consumers, the automotive sector has increasingly shifted its focus toward fostering a
robust safety culture. This transformation has not always been an organic
process, as governments worldwide have often found themselves leading the charge
in pushing for greater vehicular safety through stringent regulations. These
regulatory frameworks, which initially took root in Europe and China, have now been
rapidly disseminated globally. Consequently, automakers have found themselves
compelled to make safety an integral and non-negotiable facet of their automotive
solutions.
The impending European Safety Regulations, set to become a standard in the
industry, have been significantly motivated by the rapid evolution of automotive
technology and an unwavering commitment to ensuring the safety of both drivers and
passengers. A pivotal component of this technological revolution in the automotive
realm is interior sensing. It plays a critical role in monitoring drivers for distractions and
fatigue, as well as tracking the movements of vehicle occupants.
This distinguished panel of experts brings together some of the foremost sensor and
System-on-chip (SoC) suppliers and in-cabin monitoring specialists who are pivotal in
driving the burgeoning interior sensing market. Their collective aim is to deliberate on
various topics, ranging from emerging technology trends to innovative packaging
options, seamless connectivity, and integration points for Advanced Driver Assistance
Systems (ADAS), including the transformative Driver and occupant Monitoring System
technology.
Recognizing that human drivers are inherently prone to errors, safety technology
providers adopt a holistic systems approach to assist, enhance, and even assume
control of the driving task when necessary. In-cabin monitoring emerges as a crucial
element within this overarching strategy. Overcoming challenges related to cost,
packaging constraints, and system complexity, hardware and application vendors
continually push the boundaries of innovation, seeking novel ways to optimize their
designs to support efficient and cost-effective in-cabin monitoring solutions.
The panel discussion, featuring prominent industry and university leaders
such as Seeing Machines, Qualcomm, Texas Instruments, Ambarella, OmniVision,
and TU Braunschweig, will delve deep into the dynamic sensor and SoC market for
in-cabin monitoring. They will explore critical issues, including how in-cabin
monitoring technology underpins the global safety agenda, the preferences of
suppliers regarding packaging locations, the pros and cons of various integration approaches, and the implications for Original Equipment Manufacturers (OEMs), who must ensure that safety and convenience remain
paramount in their offerings. There are a variety of differing opinions, and it is
these differing opinions that will be brought forth in this panel.
The panel is aimed at students, researchers, and practitioners. Students will
understand the state of the art and the challenges. Researchers will be able to
examine industrial problems which are still open, and industry practitioners will
be able to understand the available solutions and the industry trends.
The panel aims to engage in a comprehensive discussion surrounding critical
questions, including but not limited to:
How can we best support a low-cost and low-power consumption market?
Which aspect or component of Sensor and SoC design should we prioritize for future
advancements?
What are the foremost challenges associated with Artificial Intelligence (AI) in
designing sensors and SoCs?
Where should we channel our Research and Development (R&D) efforts?
Which packaging configurations are poised to dominate the automotive market?
How vital is cybersecurity in this context?
What obstacles do we face in implementing AI techniques for in-cabin monitoring?
How are these cutting-edge designs rigorously tested to ensure their efficacy and
safety?
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe increasing prevalence of device aging significantly complicates the timing analysis of digital circuits, especially because current methodologies are time-consuming and struggle with the variety of standard cells and diverse input conditions. Addressing this challenge, this work proposes a novel, design-friendly framework for efficient and rapid aging-aware timing analysis. The framework harnesses hybrid Graph Neural Networks to capture cell structural details and extract delay-related information, enabling a straightforward mapping from operating conditions to specific cell aging delays. It incorporates a Relational Graph Convolutional Network (R-GCN) for modelling the complex relationships between nodes and a Graph Attention Network (GAT) for assessing the relative importance of each node based on its type. This integrated approach significantly streamlines aging-aware timing analysis, offering a substantial improvement in both speed and accuracy for digital circuit design. Our framework has 5% to 28% higher average prediction accuracy and better generalization to new cells than other benchmark networks. Compared with the conventional method, it greatly reduces time consumption, achieving an average acceleration ratio of 600 on prediction tasks spanning a large number of cell structures and input conditions.
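At the core of any such GNN is neighbourhood aggregation over the cell's graph. The sketch below shows one plain mean-aggregation update on scalar node features, without the relation types of R-GCN or the learned attention weights of GAT, purely to illustrate the message-passing step the framework builds on.

```python
def message_pass(node_feats, edges, rounds=2):
    """Mean aggregation over an undirected graph: each node averages its
    neighbours' features and mixes them 50/50 with its own. node_feats maps
    node -> scalar feature; edges is a list of (u, v) pairs."""
    feats = dict(node_feats)
    for _ in range(rounds):
        nxt = {}
        for n, f in feats.items():
            nbrs = ([feats[v] for u, v in edges if u == n]
                    + [feats[u] for u, v in edges if v == n])
            agg = sum(nbrs) / len(nbrs) if nbrs else 0.0
            nxt[n] = 0.5 * f + 0.5 * agg
        feats = nxt
    return feats
```

In the paper's setting, node features would be transistor/pin attributes and the mixing weights learned, with R-GCN distinguishing edge types and GAT replacing the uniform mean with attention scores.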
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPerforming per-packet Neural Network (NN) inference on the network data plane is required for high-quality and fast decision-making in computer networking. However, data plane architectures like the Reconfigurable Match Tables (RMT) pipeline have limited support for NNs. Previous efforts have utilized Binary Neural Networks (BNNs) as a compromise, but the accuracy loss of BNNs is high. Inspired by the accuracy gain of two-bit models, this paper proposes Athena. Athena can deploy sparse low-bit quantization (two-bit and four-bit) models on RMT. Compared with the BNN-based state of the art, Athena is cost-effective regarding accuracy-loss reduction, inference latency, and chip area overhead.
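As a rough illustration of low-bit weight quantization like the two-bit models Athena targets, the sketch below maps weights onto a uniform symmetric codebook ({-3, -1, 1, 3} × scale for two bits). The codebook is an assumption for illustration; the paper's actual quantizer and its sparsity handling are not described in the abstract.

```python
def quantize(weights, bits=2):
    """Uniform symmetric quantization to 2**bits levels. For bits=2 the
    integer levels are [-3, -1, 1, 3]; each weight snaps to the nearest
    level after normalizing by a per-tensor scale."""
    levels = [2 * i - (2 ** bits - 1) for i in range(2 ** bits)]
    scale = max(abs(w) for w in weights) / max(levels)
    q = [min(levels, key=lambda lv: abs(w / scale - lv)) * scale
         for w in weights]
    return q, scale
```

With only four weight values, dot products reduce to a handful of shifts and adds, which is what makes such models plausible to stage onto RMT match-action tables.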
Front-End Design
AI
Design
Engineering Tracks
Front-End Design
DescriptionIn the ever-evolving semiconductor technology landscape of complex SoCs and systems, integrating ChatGPT-like AI transformers into IP/SoC design verification could unleash a transformative wave of verification automation, thereby contributing to more robust designs.
IPs and SoCs underpin many modern electronic systems, such as HPC/AI and automotive SoCs. While functional correctness is crucial, it no longer suffices for real-world applications and usage. In this paper we explore harnessing the power of a lightweight generative AI BER transformer model in verification, as it redefines how we interact with textual data, including hardware design specifications, and drives verification toward completeness by suggesting extra scenarios for performance and security. It bridges the gap between 'what' a system does, 'how well' it performs, and 'how securely' it operates, and addresses the grey areas in system-level verification that cannot be captured at the IP or sub-system level.
We can scale this model to the SoC level to address verification challenges for miscellaneous SoC IPs such as GPIO, DFT mux, low-power elements, and safety elements.
This paper highlights the power of using generative AI in verification; augmenting verification with AI can help us catch bugs and issues early in the verification life cycle.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn our 2.5D/3D System on Chip (SoC) designs that are being developed at lower (< 10nm) technology nodes, it is crucial to ensure that the IR drop is within the signoff threshold limits in order to achieve the targeted PPA goals.
Traditionally this involves multiple iterations of IR simulations, after which the engineer identifies the IR- and timing-critical areas in the design that need to be improved. Manually identifying even a handful of regions poses a significant bandwidth impact.
Utilizing the k-means clustering algorithm, we have developed an end-to-end pipeline where the engineer can:
• Provide the IR threshold limit, and the algorithm will provide the list of regions where instances having a drop higher than the threshold are clustered.
• Provide the type of cell which is resistance-critical, and the algorithm will provide the list of regions where instances of the specified cell type are clustered (example: level shifters).
• Provide the instance toggle-rate data, and the algorithm will cluster regions based on a user-given high-toggle-rate threshold.
The regions are provided in the form of bounding boxes, which can then be incorporated into PnR flows such as PG grid reinforcement, VT swap to downsize cells, etc.
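The clustering-plus-bounding-box step above can be sketched directly: run k-means over the (x, y) locations of the violating instances, then emit one box per cluster for the PnR flow to consume. The first-k centroid initialization below is a simplification for brevity; a production pipeline would use a proper seeding scheme.

```python
def kmeans(points, k, iters=20):
    """Plain k-means over (x, y) instance locations; first-k initialization."""
    centers = points[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            groups[j].append(p)
        centers = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
                   if g else centers[i] for i, g in enumerate(groups)]
    return groups

def bounding_boxes(groups):
    """Per-cluster (xmin, ymin, xmax, ymax) boxes for PG reinforcement, etc."""
    return [(min(x for x, _ in g), min(y for _, y in g),
             max(x for x, _ in g), max(y for _, y in g))
            for g in groups if g]
```

The same loop serves all three use cases by changing only which instances are fed in: IR-drop violators, resistance-critical cell types, or high-toggle-rate instances.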
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionImage Signal Processor (ISP) is widely used in intelligent edge devices across various scenarios. The intricate and time-consuming tuning process demands substantial expertise. Current AI-based auto-tuning operates discretely offline, relying on predefined scenes with human intervention, leading to inconvenient manipulation, with potentially fatal impacts on downstream tasks in unforeseen scenes. We propose a real-time automatic hyperparameter optimization ISP hardware system to address real-world scenarios. Our design features a tri-step framework and a hardware accelerator, demonstrating superior performance in human and computer vision tasks, even in real-time unforeseen scenes. Experiments showcase its practicality, achieving 1080P@75FPS/240FPS in FPGA/ASIC, respectively.
Exhibitor Forum
AI
DescriptionArtificial Intelligence (AI), particularly Large Language Models (LLMs), has revolutionized the landscape of Hardware Description Language (HDL) generation in digital design. This breakthrough technology holds immense promise for streamlining design processes and accelerating innovation. However, the probabilistic nature of LLMs poses unique challenges in HDL generation, frequently leading to inaccurate code predictions. This is a crucial concern in hardware design, where precision is paramount.
To address this critical challenge, we introduce AutoDV, an innovative LLM-based architecture designed to enhance the precision and reliability of AI-generated HDL code. At its core lies a system of interconnected, specialized, and compact LLMs, each meticulously crafted to handle specific aspects of the HDL generation process. This approach not only enables AutoDV to leverage the collective strengths of individual LLMs, but also fosters synergistic interactions among them.
AutoDV's groundbreaking capabilities stem from its two key components: the capability of automatically interfacing with external verification tools and a comprehensive library of pre-defined IPs. By seamlessly interfacing with established verification tools, AutoDV ensures rigorous Design Verification (DV), minimizing the risk of propagating errors to subsequent design stages. Additionally, AutoDV's IP library empowers LLMs to directly access and utilize these well-established and rigorously verified design components, significantly elevating the accuracy of the generated HDL code.
In this presentation, we will explore the technical underpinnings of AutoDV, beginning with an overview of its architecture and then examining the synergism between its components. The presentation will conclude with a practical demonstration.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper proposes a novel method for automatically inferring message flow specifications from the communication traces of a system-on-chip (SoC) design, which capture the messages exchanged among the components during a system execution.
The inferred message flows characterize the communication and coordination of components in a system design for realizing various system functions, and they are essential for SoC validation and debugging.
The proposed method relieves the burden of manual development and maintenance of such specifications on human designers.
Our method also develops a new accuracy metric, acceptance ratio, to evaluate the quality of the mined specifications instead of the specification size often used in previous work, enabling more accurate specifications to be mined.
The effectiveness of the proposed method is evaluated on both synthetic traces and traces generated from executing several system models in GEM5.
In both cases, the proposed method achieves superior accuracies compared to a previous approach.
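The acceptance-ratio idea can be stated concisely: a mined flow accepts a trace if the flow's messages occur in the trace in order, and the metric is the fraction of traces accepted by at least one mined flow. The subsequence-based acceptance check below is an illustrative reading of the metric, not the paper's exact definition.

```python
def accepts(flow, trace):
    """A mined flow (ordered tuple of messages) accepts a trace if its
    messages appear in the trace in order, i.e. as a subsequence."""
    it = iter(trace)
    return all(msg in it for msg in flow)  # 'in' consumes the iterator

def acceptance_ratio(flows, traces):
    """Fraction of traces accepted by at least one mined flow."""
    hit = sum(1 for t in traces if any(accepts(f, t) for f in flows))
    return hit / len(traces)
```

Unlike specification size, this rewards mined flows for actually explaining observed executions, which is why it can steer mining toward more accurate specifications.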
Tutorial
Autonomous Systems
DescriptionThe contemporary era struggles with the intricate challenge of designing "complex systems".
These systems are characterized by intricate webs of interactions that interlace their components, giving rise to multifaceted complexities, springing from at least two sources.
First, the co-design of complex systems (e.g., a large network of cyber-physical systems) demands the simultaneous selection of components arising from heterogeneous natures (e.g., hardware vs. software parts), while satisfying system constraints and accounting for multiple objectives.
Second, different components are interconnected through interactions, and their design cannot be decoupled (e.g., within a mobility system).
Navigating this complexity necessitates innovative approaches, and this tutorial responds to this imperative by focusing on a monotone theory of co-design.
Our exploration extends from the design of individual platforms, such as autonomous vehicles, to the orchestration of entire mobility systems built upon such platforms.
In particular, we will delve into the theoretical foundations of a monotone theory of co-design, establishing a robust mathematical framework and its application to a diverse array of real-world problems, revolving around the domain of embodied intelligence.
The presented toolbox empowers efficient computation of optimal design solutions
tailored to specific tasks and, through its novelty, paves the way for several directions of future research.
This tutorial will focus on the particular application of computational design of autonomous systems, featuring both a technical and a practical session.
Participants will have the opportunity to explore dedicated demos and ``learn by doing'' through guided exercises.
The tutorial provides participants with an introduction to robot co-design and aims to connect multiple communities to enable the development of composable models, algorithms, fabrication processes, and hardware for embodied intelligence.
It is intended to be accessible from any background and seniority level and will present applications to a wide array of topics of interest to the design automation and robotics communities.
IP
Engineering Tracks
IP
DescriptionIn recent times, the increasing size of SoCs has made static verification time- and memory-consuming. In an SoC containing billions of design elements, a few cases of missing/false violations or long-runtime issues get reported by customers on any static tool. When such issues are reported at the final sign-off stage of the chip, they become a gating issue for the static tool, and tool vendors are expected to provide a fix on urgent priority.
To fix any issue in the tool, an R&D engineer first needs to identify its root cause. The traditional methods of root-cause identification in a big design are:
1. Using debug prints
2. Applying a debugger to the code execution
3. Code profiling tools
4. Reducing the size of the design by turning unrelated portions of the design into blackbox models
Finding the root cause with the above methods and providing a quick fix in the tool takes time because:
R&D and AEs may not have direct access to the design.
Shipping the design to a secure network is difficult or takes time.
A high number of debug prints makes it difficult to find the root cause.
Attaching debuggers to a large design is cumbersome and slow.
From the debug fields in violations and other reports, R&D or the field has only limited knowledge of the design scenarios, making it difficult to create a unit reproducer.
It has often been observed that having a small reproducer in hand significantly reduces the turnaround time for delivering the fix. To overcome this challenge, we have developed a utility in our tool that can create a small reproducer out of the big design.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionConventional hierarchical design planning flows are neither runtime efficient nor resource efficient for a) quick floorplan porting during process node evaluation and library bring up with minimal dependency or b) what-if exploration to hasten block convergence with improved local FP optimization and identify critical limiters for different partition layout topologies. The scaling framework is a one-stop solution capable of operating on bare minimum baseline floorplan information to port floorplans even without any netlist or memory collaterals. The Framework can generate basic floorplanning compatible netlist and scaled library memory collateral from baseline floorplans on a different node/library. The framework can also enable evaluation of block convergence recipes and floorplan utilization or frequency sweeps through macro placement techniques including ML macro placement suitably augmented with additional algorithmic pin placement intelligence to retain global context. The framework has evolved to be the de facto early floorplan execution flow, scaling and porting floorplans between libraries, nodes and even foundries, and improving the work model execution efficiency by 16X and resource efficiency by 3X for each partition. The framework has also been a key pillar in block optimization exploration, during later execution milestones, saving 2-4 weeks of convergence efforts on 80% of blocks with pre-configured techniques and strategies.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionASIPs are attractive for their high energy efficiency. However, designing an ASIP is time-consuming and error-prone. We present an automatic design framework that generates out-of-order ASIPs from ISA documents via a nano-operator (nOP) abstraction. The key insight is that the proposed nOPs are semantically aligned and functionally complete. Therefore, we first leverage LLMs to generate nOP graphs from ISA documents, then propose an nOP fusion algorithm to optimize them, and generate the corresponding OoO ASIPs. Experiments show that, compared with SOTA LLM-assisted methods, our approach generates a processor with 5818x larger area without HDL modification. Furthermore, our processor achieves a 3.33x speedup compared with a general-purpose CPU.
Embedded Systems and Software
AI
Embedded Systems
Engineering Tracks
DescriptionReinforcement learning has demonstrated optimization performance in various simulation environments, yet there has been limited evidence of its effectiveness in real-world scenarios.
In this study, we applied offline reinforcement learning in an SSD simulator with real product-level complexity. Attempting to design test cases that impose high loads on the SSD, we confirmed a reduction of over 50% in test input quantity compared to random testing.
To overcome the high complexity, we transformed the extensive input range supported by the product into an optimal range, reflecting product characteristics. We effectively represented internal information using a Graph Neural Network.
We propose an automated test generation framework that reuses the trajectories generated during the agent training process for training.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionMost SoCs today have analog or mixed-signal blocks, such as SerDes cores, DACs, ADCs, PLLs, and other transceivers. Many analog blocks have digital control logic. As such, an increasing amount of analog IP is mixed-signal, and with rapidly increasing SoC capacity, a single IP block might represent an extremely complex mixed-signal function. Currently, a sizable part of mixed-signal design implementation is done manually, a slow and laborious process that can lead to design errors and numerous iterations. The blocks are placed and routed using a semi-manual process, without the aid of design-rule-correct automation. In this paper, we introduce a methodology to automate the placement and routing of such digital/mixed-signal blocks with LVS and DRC awareness. Within a few clicks, the digital block is placed and routed with the addition of boundary cells, tap cells, and fills. The solution is capable of reading user constraints and enhancing the quality of routing.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionTI supports various packaging technologies, which brings forth the challenge of thermal modeling and analysis: design teams grapple with the intricacies of mastering thermal modeling tools for diverse package families. The current process involves time-consuming manual effort in creating intricate package geometry and PCB setups with CAD tools, often resulting in errors.
The collaborative dance between design teams and centralized units prolongs the thermal modeling iteration cycle to 2+ weeks. In response, an automated solution is proposed to streamline this process, reducing the timeline to around 2 days. This automation liberates design teams from the need for extensive CAD/modeling tool familiarity, empowering them to conduct thermal modeling independently without overreliance on centralized teams.
This shift toward automation not only addresses efficiency but also marks a practical evolution in product development. It promises a smoother journey through the complexities of thermal modeling and analysis, reflecting a commitment to innovation while maintaining a grounded approach to practical implementation.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionThe identification of layout constraints in analog circuits, such as symmetry and matching, has become a crucial task to meet increasingly aggressive design specifications, especially in new process nodes where parasitic effects can have a severe impact on circuit performance and lifetime. However, the manual annotation of such constraints requires design expertise and is a challenging and error-prone task. In this paper, we propose an unsupervised node embedding method on the circuit netlist graph to capture topological similarities between nodes. We evaluate our method on open-source and in-house analog circuit designs to validate the ability of this new approach to identify symmetry constraints. Compared to other solutions based on machine learning (ML) techniques recently proposed in the literature that rely on annotated netlist datasets, this unsupervised solution does not need any prior knowledge, which is usually extracted during a computationally expensive machine-learning phase.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionThis paper presents RTLFixer, a novel framework enabling automatic fixing of syntax errors in Verilog code with Large Language Models (LLMs). Despite LLMs' promising capabilities, our analysis indicates that approximately 55% of errors in LLM-generated Verilog are syntax-related, leading to compilation failures. To tackle this issue, we introduce a novel debugging framework that employs Retrieval-Augmented Generation (RAG) and ReAct prompting, enabling LLMs to act as autonomous agents that interactively debug the code with feedback. This framework demonstrates exceptional proficiency in resolving syntax errors, successfully correcting about 98.5% of compilation errors in our debugging dataset, comprising 212 erroneous implementations derived from the VerilogEval benchmark. Our method leads to 32.3% and 8.6% increases in pass@1 success rates on the VerilogEval-Machine and VerilogEval-Human benchmarks, respectively.
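The interactive retrieve-then-fix loop described in the abstract can be sketched schematically. Everything below is a hypothetical stand-in: the error table, the compiler check, and `llm_fix` are stubs for the real Verilog compiler, the RAG database of expert guidance, and the ReAct-prompted LLM agent.

```python
# Hypothetical retrieval database: compiler error pattern -> expert guidance.
GUIDANCE = {"missing ';'": "Terminate the statement with a semicolon."}

def compile_verilog(src):
    """Stub compiler: reports an error unless the statement ends with ';'."""
    return None if src.rstrip().endswith(";") else "missing ';'"

def llm_fix(src, error, hint):
    """Stub for the LLM agent: applies the retrieved hint to the source."""
    return src.rstrip() + ";" if "semicolon" in hint else src

def repair(src, max_iters=3):
    """Compile, retrieve guidance for the error, let the agent edit, repeat."""
    for _ in range(max_iters):
        err = compile_verilog(src)
        if err is None:
            return src                    # compiles cleanly: done
        hint = GUIDANCE.get(err, "")      # RAG step: fetch guidance for error
        src = llm_fix(src, err, hint)     # ReAct step: agent acts on feedback
    return src
```

The point of the loop is that compiler feedback plus retrieved human guidance, rather than one-shot generation, drives the agent's edits.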
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe characterization of input/output (IO) devices is a complex and time-consuming process due to the multiple supplies involved, such as VDD and VDDE, which ramp up at different rates and in different orders. This is particularly important in the context of modern complex IO designs, which often require rigorous validation to ensure reliable and robust operation.
This complexity can be addressed with automation scripts that enable the efficient generation of various validation scenarios in the characterization process. In this way, designers can save significant time and effort while also improving the accuracy and completeness of the validation process.
To achieve this, the automation scripts are designed to automatically generate a series of tests that cover a range of supply ramp rates and orders. The scripts can be customized to the specific requirements of the IO device being characterized and, when added to the Solido Design Environment, can incorporate a variety of available simulation and analysis techniques, such as Monte Carlo analysis and sensitivity analysis.
The addition of an automation script for IO device characterization to the Solido Design Environment represents a significant technical advance in the design and verification of analog and mixed-signal ICs, with important implications for efficiency, accuracy, and reliability.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionDesign synthesis flows are not aware of Clock Domain Crossings (CDC). Thus, synthesis optimizations built to enhance power, performance, and area (PPA) may corrupt CDC paths, and the netlist generated by synthesis tools can therefore introduce new CDC errors even after CDC signoff at the RTL.
The synthesis optimizations may also cause functional glitch issues due to retiming, self-gating, and mux decomposition, which can result in silicon escapes.
Currently, designers use ad hoc methods such as manual synthesis constraints, full CDC re-verification at gate level, or reliance on Gate-Level Simulation (GLS) to overcome these challenges. However, these methods are error-prone due to over-constraining, high noise levels during re-verification, or low GLS coverage.
Using the VC SpyGlass CDC-aware Fusion Compiler flow, correct-by-construction synthesis is performed so that CDC bugs are avoided during netlist transformation.
The automated flow runs in the following steps:
• After RTL CDC signoff using VC SpyGlass CDC, a Static database is generated to guide the synthesis
• Fusion Compiler generates synthesis constraints using the Static database to ensure that no corruption happens to CDC paths and no functional glitches are introduced
Integrating this technology into the flow mitigates the risk of introducing new CDC violations in a netlist that was previously qualified at the RTL.
Work-in-Progress Poster
B-Ring: An Efficient Interleaved Bidirectional Ring All-reduce Algorithm for Gradient Synchronization
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe prevailing Ring all-reduce technique in distributed computing comprises communication establishment, data transmission, and data processing phases in each step. However, as nodes increase, it suffers from excessive communication overhead due to underutilized bandwidth during communication establishment and data processing. To address this, we introduce a bidirectional ring all-reduce (B-Ring) approach, employing asynchronous communication to alleviate communication establishment and data processing impact. Extensive experiments demonstrate B-Ring's effectiveness, reducing average communication overhead by 8.4% and up to 23.6%.
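The baseline Ring all-reduce that the abstract improves upon proceeds in a reduce-scatter phase followed by an all-gather phase. The toy Python simulation below illustrates that data movement; it is a generic sketch of the baseline, not the paper's B-Ring, which additionally interleaves two rings in opposite directions with asynchronous communication.

```python
def ring_allreduce(node_chunks):
    """Simulate ring all-reduce: node_chunks[i][j] is node i's j-th gradient
    chunk; afterwards every node holds the sum of each chunk over all nodes."""
    n = len(node_chunks)
    chunks = [list(c) for c in node_chunks]
    # Phase 1: reduce-scatter -- in n-1 steps, each node forwards one chunk
    # to its right neighbour, which accumulates it into its own copy.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n])
                 for i in range(n)]
        for i, ci, val in sends:
            chunks[(i + 1) % n][ci] += val
    # Now node i holds the fully reduced chunk (i + 1) % n.
    # Phase 2: all-gather -- circulate the reduced chunks so every node
    # ends up with all of them.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, ci, val in sends:
            chunks[(i + 1) % n][ci] = val
    return chunks
```

Each of the 2(n-1) steps pays the communication-establishment and processing costs the abstract targets; running a second ring in the reverse direction lets those costs overlap with transfers on the other ring.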
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn the rapidly evolving landscape of technology, the pursuit of high-performance systems has become increasingly essential. With the growing complexities in chip design, achieving a harmonious balance between Power, Performance, and Area (PPA) – the foundational pillars of contemporary chip architecture – presents formidable challenges. Traditional clock methodologies such as clock tree synthesis, clock mesh, and multi-source clock tree synthesis have proven inadequate in addressing the intricacies of modern chip design. Recognizing these limitations, we introduce the innovative Hybrid Clock Network technique, a customized approach designed to construct robust clock networks within Network On Chips (NoC).
Our technique has yielded remarkable improvements in clock quality when compared to conventional clock tree methodologies. Notably, our results showcase a 41.66% reduction in latency, a 43.75% enhancement in skew, a 14.22% decrease in clock power consumption, and an overall 12.46% reduction in total power consumption. Additionally, our approach has conserved 11.55% of routing resources, reduced the clock buffer count by 16.2%, and streamlined the clock depth from 23 to 19 levels. These compelling findings underscore the efficacy of our proposed technique in significantly enhancing critical PPA metrics. The Hybrid Clock Network technique represents a breakthrough in addressing the challenges of contemporary chip design, offering a promising path forward in the pursuit of high-performance systems.
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
DescriptionThis paper proposes Balloon-ZNS, which enables transparent compression in emerging ZNS SSD storage devices to enhance cost efficiency. ZNS SSDs require data pages to be stored and aligned in logical zones and flash blocks, conflicting with the management of variable-length compressed pages. Motivated by the observation that compressibility locality widely exists in data streams, Balloon-ZNS performs compressibility-adaptive, slot-aligned storage management to address the conflict. Evaluation with RocksDB shows that Balloon-ZNS can reap more than 80% of the compression gain while achieving -7% to 14% higher throughput than a vanilla ZNS SSD, on average, when data compressibility is not poor.
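The core conflict, variable-length compressed pages versus zone and block alignment, can be illustrated with a minimal slot-rounding sketch. The slot size and placement routine below are hypothetical stand-ins, not Balloon-ZNS's compressibility-adaptive algorithm.

```python
SLOT = 1024  # hypothetical fixed slot size in bytes

def place_pages(compressed_sizes, slot=SLOT):
    """Return (offset, slots_used) for each compressed page within a zone.
    Rounding each page up to whole slots keeps every page slot-aligned,
    trading a little capacity for simple offset arithmetic."""
    layout, offset = [], 0
    for size in compressed_sizes:
        slots = -(-size // slot)          # ceiling division: slots needed
        layout.append((offset, slots))
        offset += slots * slot            # next page starts slot-aligned
    return layout
```

Because every offset is a slot multiple, a page's location can be computed without per-page variable-length metadata, which is the alignment property the zone/block model demands.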
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper proposes a Bayesian-learning-driven automated embedded memory design methodology that aims to minimize power consumption and/or maximize performance while meeting predefined constraints. To achieve this objective effectively, we present an automatic tool that leverages a reference initial circuit design to generate a diverse set of schematic and layout options for logic-equivalent circuit variants. Subsequently, leveraging the range of circuit options generated, Bayesian optimization is employed not only to identify optimal circuit parameters but also to select the most appropriate circuit topology to attain the desired design objectives. TSMC 28nm process simulation results demonstrate that the proposed methodology reduces dynamic power by 21.59%-39.02% and access time by 29.45%-38.21% compared to the compiler-generated design, with a runtime of 10-40 hours.
DAC Pavilion Panel
DescriptionThis panel will explore, with leading software companies, a phenomenon that has long been anticipated: the business, market and technical convergences of the two halves of Engineering Software (EDA and "industrial" software). These convergences are increasingly evident in the companies' product and acquisition strategies.
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionThis research investigates the vulnerability of ML-enabled Hardware Malware Detection (HMD) methods to adversarial attacks. We introduce proactive and robust adversarial learning and defense based on Deep Reinforcement Learning (DRL). First, highly effective adversarial attacks are employed to circumvent detection mechanisms. Subsequently, an efficient DRL technique based on Advantage Actor-Critic (A2C) is presented to predict adversarial attack patterns in real time. Next, ML models are fortified through adversarial training to enhance their defense capabilities against both malware and adversarial attacks. To achieve greater efficiency, a constraint controller using the Upper Confidence Bounds (UCB) algorithm is proposed that dynamically assigns defense responsibilities to specialized RL agents.
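The constraint controller's use of Upper Confidence Bounds can be illustrated with the classic UCB1 rule. This is a generic sketch with made-up, deterministic payoff numbers, not the paper's reward model or agent design.

```python
import math

def ucb1_select(counts, rewards, t):
    """UCB1: pick the agent maximizing mean reward plus exploration bonus."""
    for a, c in enumerate(counts):
        if c == 0:
            return a                      # try every agent once first
    return max(range(len(counts)),
               key=lambda a: rewards[a] / counts[a]
                             + math.sqrt(2 * math.log(t) / counts[a]))

def assign_defense(payoffs, rounds=500):
    """Simulate repeatedly assigning defense duty to one of several agents,
    where payoffs[a] is agent a's (toy, deterministic) defense reward."""
    counts = [0] * len(payoffs)
    rewards = [0.0] * len(payoffs)
    for t in range(1, rounds + 1):
        a = ucb1_select(counts, rewards, t)
        counts[a] += 1
        rewards[a] += payoffs[a]
    return counts
```

With hypothetical payoffs like [0.1, 0.9, 0.2], the controller concentrates assignments on the second agent while still probing the others occasionally, which is the exploration/exploitation balance the abstract relies on.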
IP
Engineering Tracks
IP
DescriptionInterface IPs are an important part of any integrated circuit design that needs to communicate with the outside world or other integrated circuits. Out of the many design views of IO libraries (e.g., GPIO, I2C, I3C, etc.), the logical views have special importance, as they define the basic function of the design. The functionality in these views should be verified to the best possible extent, as broken functionality leads to one of the heaviest costs a design house may pay: silicon failures. Symbolic simulation provides unique and powerful solutions to the plethora of technical challenges faced by logic verification engineers of interface IPs. Synopsys ESP uses symbolic simulation technology to offer high-quality equivalence checking for full-custom designs.
In this paper, Synopsys ESP has been explored to validate complex interface IPs. ESP is well known for equivalence checking of standard cells and memories, which mostly comprise digital blocks. Interface IPs, on the other hand, consist of a number of analog blocks along with digital logic, which makes them more complex for equivalence checking. Resolving analog blocks is difficult for ESP and sometimes resolves to incorrect logic, so we showcase the challenges faced with the analog blocks of interface IPs along with their proven solutions, and the advantages this brought to ESP by broadening its analog design validation coverage.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionDeep neural network (DNN) inference has become an important part of many data-center workloads. This has prompted focused efforts to design ever-faster deep learning accelerators such as GPUs and TPUs. However, an end-to-end vision application contains more than just DNN inference, including input decompression, resizing, sampling, normalization, and data transfer. In this paper, we perform a thorough evaluation of computer vision inference requests performed on a throughput-optimized serving system. We quantify the performance impact of server overheads such as data movement, preprocessing, and message brokers between two DNNs producing outputs at different rates. Our empirical analysis encompasses many computer vision tasks including image classification, segmentation, detection, depth-estimation, and more complex processing pipelines with multiple DNNs. Our results consistently demonstrate that end-to-end application performance can easily be dominated by data processing and data movement functions (up to 56% of end-to-end latency in a medium-sized image, and ∼80% impact on system throughput in a large image), even though these functions have been conventionally overlooked in deep learning system design. Our work identifies important performance bottlenecks in different application scenarios, achieves 2.25× better throughput compared to prior work, and paves the way for more holistic deep learning system design.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionThough using multi-bit flip-flop (MBFF) cells provides the benefit of saving dynamic power, their large cell size with many D/Q-pins inherently entails two critical limitations: (1) the loss of full flexibility in optimizing the wires connecting to the D/Q-pins in MBFFs, and (2) the loss of selective resizing, i.e., controlling the output driving strength of the internal flip-flops.
Research Manuscript
Design
Emerging Models of Computation
DescriptionHyperdimensional computing (HDC), a powerful paradigm for cognitive tasks, often demands hypervectors of high dimensions (e.g., 10,000) to achieve competitive accuracy. However, processing such large-dimensional data poses challenges for performance and energy efficiency, particularly on resource-constrained devices. In this paper, we present a framework to terminate bit-serial HDC inference early when sufficient confidence is attained in the prediction. This approach integrates a Naive Bayes model to replace the conventional associative memory in HDC. This transformation allows for a probabilistic interpretation of the model outputs, steering away from mere similarity measures. We reduce more than 70% of the bits that need to be processed while maintaining comparable accuracy across diverse benchmarks. In addition, we show the adaptability of our early termination algorithm during on-the-fly learning scenarios.
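The key mechanism, replacing associative-memory similarity with a Naive Bayes posterior so that bit-serial inference can stop once confidence is high, can be sketched as follows. The per-bit Bernoulli class model and the 0.99 threshold are illustrative assumptions, not the paper's exact formulation.

```python
import math

def predict_early(query_bits, class_probs, threshold=0.99):
    """Consume query hypervector bits serially; stop as soon as the leading
    class's posterior (under a per-bit Bernoulli Naive Bayes model) passes
    the threshold. class_probs[c][i] is P(bit i == 1 | class c)."""
    n_classes = len(class_probs)
    log_post = [0.0] * n_classes              # uniform prior over classes
    best = 0
    for i, b in enumerate(query_bits):
        for c in range(n_classes):
            p = class_probs[c][i]
            log_post[c] += math.log(p if b else 1.0 - p)
        m = max(log_post)                     # numerically stable softmax
        z = [math.exp(l - m) for l in log_post]
        s = sum(z)
        post = [v / s for v in z]
        best = max(range(n_classes), key=post.__getitem__)
        if post[best] >= threshold:
            return best, i + 1                # early exit: bits consumed
    return best, len(query_bits)
```

For well-separated classes, only a handful of a hypervector's bits need to be processed before the posterior saturates, which is the source of the bit-count savings the abstract reports.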
DAC Pavilion Panel
Design
DescriptionSoCs designed for compute-intensive workloads, such as AI training and inferencing, continue to grow, and power budgets are increasing geometrically. Handling these power budgets from an SoC and system perspective requires rigorous tools, flows, and methodologies. The question that remains is how these burgeoning power budgets impact broader systems and system-of-system effects, and what role silicon IP plays in shaping these outcomes.
2.5D and 3D solutions are emerging as potential mitigators for the expanding power budgets, but the extent of their effect is yet to be fully understood. Additionally, with the constant evolution and growth in technology, there is a looming question: will power budgets level off or continue on a path of exponential growth? The influence of silicon IP in directing this trajectory is a topic of keen interest.
A significant player in this dynamic is the role of next-generation VRMs. With their potential to regulate voltage and hence influence power, they might hold the answer to managing the surge in power budgets. This conference seeks to explore their impact, dissect the role of silicon IP, and generate insightful discussions on the future of power consumption within technology. Together, we will answer some of the following questions from an EDA, system, IP, and SoC design perspective:
o What are the primary factors driving the immense leaps in on-die power?
o What tools, flows, and methodologies are required to manage SoC and system power budgets?
o What are the system and system-of-system effects of ballooning power budgets?
o What effect will 2.5D and 3D solutions have on growing power budgets?
o Will we see a leveling off in power budgets or will they keep growing exponentially? And why?
o What is the role of next-generation VRMs?
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionYield estimation and optimization is ubiquitous in modern circuit design but remains elusive for large-scale chips. This is largely due to the mounting cost of transistor-level simulation and one's often limited resources. In this study, we propose a novel framework to estimate and optimize yield using a Bayesian neural network (BNN-YEO). By coupling a machine-learning method with a Bayesian network, our approach can effectively integrate prior knowledge and is unaffected by the overfitting problem prevalent in most surrogate models. With the introduction of a smooth approximation of the indicator function, it incorporates gradient information to facilitate global yield optimization. We examine its effectiveness via numerical experiments on 6T SRAM and find that BNN-YEO provides a 100x speedup (in terms of SPICE simulations) over standard Monte Carlo in yield estimation, and is 20x faster than the state-of-the-art method for total yield estimation and optimization with improved accuracy.
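The smooth-indicator trick the abstract mentions can be sketched as follows; the sigmoid relaxation, the temperature `tau`, and the pass/fail margin `g` are illustrative assumptions rather than the paper's actual formulation:

```python
import numpy as np

def smooth_yield(g_values, tau=0.05):
    """Differentiable surrogate for a Monte-Carlo yield estimate.

    Yield is the fraction of samples whose spec margin g satisfies
    g <= 0 (pass).  The hard indicator 1[g <= 0] has zero gradient
    almost everywhere, so it is replaced by a sigmoid with temperature
    `tau`; as tau -> 0 the surrogate approaches the true yield, while
    for tau > 0 it admits gradients w.r.t. design parameters that
    shift g, enabling gradient-based yield optimization.
    """
    return float(np.mean(1.0 / (1.0 + np.exp(g_values / tau))))
```

Clearly passing samples contribute ~1, clearly failing samples ~0, and samples near the spec boundary contribute intermediate values through which gradients flow.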
Research Manuscript
Design
Quantum Computing
DescriptionBoolean matching is an important problem in logic synthesis and verification. Despite being well-studied for conventional Boolean circuits, its treatment for reversible logic circuits remains largely, if not completely, missing. This work provides the first such study. Given two (black-box) reversible logic circuits that are promised to be matchable, we check their equivalences under various input/output negation and permutation conditions subject to the availability/unavailability of their inverse circuits. Notably, among other results, we show that the equivalence up to input negation and permutation is solvable in quantum polynomial time, while the classical complexity is exponential. This result is arguably the first demonstration of quantum exponential speedup in solving design automation problems. Also, as a negative result, we show that the equivalence up to both input and output negations is not solvable in quantum polynomial time unless UNIQUE-SAT is, which is unlikely. This work paves the theoretical foundation of Boolean matching reversible circuits for potential applications, e.g., in quantum circuit synthesis.
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionDot-product compute engines are pivotal to AI/ML hardware accelerators. Multi-term and floating-point dot-product engines increase datapath complexity due to the added logic for rounding, normalization, and alignment of significands per maximum exponent. To formally verify such dot-product compute engines, a C/C++ vs. RTL formal check tool (e.g., Synopsys's VC Formal DPV) is used. The datapath complexity of a multi-term, floating-point dot-product engine for a complex AI/ML chip, along with the different dataflow graph (DFG) structures of the corresponding C/C++ and RTL models, often makes it difficult for the formal tool to converge. This research describes various techniques (assume-guarantee, lemma partitioning, DFG optimization, maximizing equivalence points, case splitting, and using optimized solvers) that are adopted to obtain formal convergence across several floating-point types. Moreover, we enable helper lemmas after deriving adder-tree expressions that match the RTL and C-model adder-tree structures. The results demonstrate that a formal run for a multi-term FP32-based dot-product operation can converge within 30 minutes. We recommend a new feature for the VC Formal DPV tool to streamline detection of adder trees and automatically resolve them in the flow, which Synopsys is currently working on.
IP
Engineering Tracks
IP
DescriptionThe methods and tools we use for digital hardware design today are deeply antiquated and little changed from the 1990s when IP Reuse was in its infancy. Software design, on the other hand, has undergone explosive changes since that time. We have now reached the inflection point where a combination of new open-source software EDA tools and modern software development environments can change the way we design hardware. In this paper, we present our work showing a complete digital design flow that can produce high-quality, professional-grade IP built entirely with open-source software and EDA tools. We also share early results of how generative AI may become a powerful tool in the designer's toolbox for creating ever more complex IP.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHigh bandwidth memory (HBM) consists of several memory chips and a dedicated buffer die that serializes and de-serializes data for processing and transferring. One major parameter deciding the performance of a buffer die is the number of parallel signal buslines spanning half the die between signal IO circuitry (e.g., PHY) and the input/output ports (i.e., through-silicon vias (TSVs)) of the buffer die. The speed of the signal buses is also important for smooth signal transitions within the clock cycle time. This transition time, which ensures a full signal swing, determines the maximum clock frequency of the HBM. The faster the device and the larger the number of buslines, the higher the performance an HBM can deliver. The busline bit count is expected to exceed several tens of thousands in the next HBM generation. The busline delay difference must be minimized for correct signal transfer of all bits within a very narrow available time slot for signal transition. Until now, the bus design has been done by iterative manual layout and simulation, since no good automated solutions exist. This work seeks an automated layout and optimization methodology for the many signal buslines of a next-generation HBM. We formulate the design constraints from custom layouts and develop a novel bus delay optimization algorithm based on a commercial P&R tool. This automated solution demonstrates a bus layout for an HBM buffer die within seconds, while satisfying all metric requirements.
Research Manuscript
Design
Emerging Models of Computation
DescriptionThe concept of Nash equilibrium (NE), pivotal within game theory, has garnered widespread attention across numerous industries.
However, verifying the existence of NE poses a significant computational challenge, classified as an NP-complete problem.
Recent advancements introduced several quantum Nash solvers aimed at identifying pure strategy NE solutions (i.e., binary solutions) by integrating slack terms into the objective function, commonly referred to as slack-quadratic unconstrained binary optimization (S-QUBO).
However, incorporation of slack terms into the quadratic optimization results in changes of the objective function, which may cause incorrect solutions.
Furthermore, these quantum solvers only identify a limited subset of pure strategy NE solutions, and fail to address mixed strategy NE (i.e., decimal solutions), leaving many solutions undiscovered.
In this work, we propose C-Nash, a novel ferroelectric computing-in-memory (CiM) architecture that can efficiently handle both pure and mixed strategy NE solutions.
The proposed framework consists of
(i) a transformation method that converts quadratic optimization into a MAX-QUBO form without introducing additional slack variables, thereby avoiding objective function changes;
(ii) a ferroelectric FET (FeFET) based bi-crossbar structure for storing payoff matrices and accelerating the core vector-matrix-vector (VMV) multiplications of QUBO form;
(iii) a winner-takes-all (WTA) tree implementing the MAX form and two-phase simulated annealing (SA) logic for searching NE solutions.
Evaluations demonstrate that C-Nash has up to 68.6% increase in the success rate for identifying NE solutions, finding all pure and mixed NE solutions rather than only a portion of pure NE solutions, compared to D-Wave based quantum approaches.
Moreover, C-Nash boasts a reduction up to 157.9X/79.0X in time-to-solutions in comparison to D-Wave 2000 Q6 and D-Wave Advantage 4.1, respectively.
Research Manuscript
Embedded Systems
Embedded Software
DescriptionEnergy harvesting offers a scalable and cost-effective power solution for IoT devices, but it introduces the challenge of frequent and unpredictable power failures due to the unstable environment.
To address this, intermittent computing has been proposed, which periodically backs up the system state to non-volatile memory (NVM), enabling robust and sustainable computing even in the face of unreliable power supplies.
In modern processors, a write-back cache is extensively utilized to enhance system performance.
However, it poses a challenge during backup operations as it buffers updates to memory, potentially leading to inconsistent system states.
One solution is to adopt a write-through cache, which avoids the inconsistency issue but incurs increased memory access latency for each write reference.
Some existing work enforces a cache flushing before backups to maintain a consistent system state, resulting in significant backup overhead.
In this paper, we point out that although cache delays updates to the main memory, it may preserve a recoverable system state in the main memory.
Leveraging this characteristic, we propose a cache-aware task decomposition method that divides an application into multiple tasks, ensuring that no dirty cache lines are evicted during their execution.
Furthermore, the cache-aware task decomposition maintains an unchanged memory state during the execution of each task, enabling us to parallelize the backup process with task execution and effectively hide the backup latency.
Experimental results with different power traces demonstrate the effectiveness of the proposed system.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith the exponential growth in design complexity, stringent timelines for chip design cycle closure, process advancements, and increased runtimes in both physical design sign-off verification and quality analysis, there is a constant need for faster and more efficient physical verification (PV) strategies.
Early PV analysis enables designers to quickly and easily analyze critical issues, and to find and fix the root cause of errors in an efficient, accurate, and fast manner. Fixing critical DRC and DFM issues later in the project cycle becomes more challenging. Our paper describes some efficient techniques that enable faster chip design sign-off convergence.
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionOptical proximity correction (OPC) is a vital step to ensure printability in modern VLSI manufacturing. Various OPC approaches have been proposed, which are typically data-driven and hardly involve particular considerations of the OPC problem, leading to potential performance bottlenecks. In this paper, we propose CAMO, a reinforcement learning-based OPC system that integrates important principles of the OPC problem. CAMO explicitly involves the spatial correlation among the neighboring segments and an OPC-inspired modulation for movement action selection. Experiments are conducted on via patterns and metal patterns. The results demonstrate that CAMO outperforms state-of-the-art OPC engines from both academia and industry.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionRange search is a key part of the point cloud processing pipeline. CAM has proven its efficiency for search tasks on switches. In this work, we propose CAMPER, aiming to explore the potential of CAM for point cloud range search. We developed a ripple-comparison 13T CAM cell for distance comparison, designed a spatial approximation search algorithm based on Chebyshev distance, and discussed the flexibility and scalability of the architecture. The results show that in the 64k@64k task, CAMPER achieves a latency of 0.83ms and a power consumption of 114.6mW, improvements of 10.4x and 228x, respectively.
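The Chebyshev-distance formulation is what makes range search CAM-friendly: an L-infinity ball decomposes into independent per-coordinate window comparisons, the kind of match a CAM row can evaluate in parallel. A minimal software sketch of that reduction (illustrative only, not the CAMPER hardware):

```python
import numpy as np

def chebyshev_range_search(points, query, radius):
    """Return indices of points within Chebyshev (L-infinity) distance
    `radius` of `query`.

    Because max(|dx|, |dy|, |dz|) <= r  iff  every coordinate delta is
    within [-r, r], the search is just a per-coordinate window test,
    which a CAM can apply to all stored points simultaneously.
    """
    diff = np.abs(np.asarray(points) - np.asarray(query))  # (N, D) deltas
    return np.flatnonzero(np.max(diff, axis=1) <= radius)
```

The Chebyshev ball over-approximates the Euclidean ball of the same radius, which is why the abstract describes the algorithm as a spatial approximation search.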
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionDemands for efficient computing under the memory wall have led to computation-in-memory (CIM) accelerators that leverage memory structure to perform in-situ computing. Content addressable memory (CAM) processing is a CIM paradigm that accomplishes general-purpose functions via sequences of search and update operations. However, conventional CAM-based CIM is customized for inter-vector operations and requires long search-update iterations for computing.
To mitigate the drawbacks of prior works, this work proposes a content addressable processor (CAP), improving both functionality and performance. CAP supports general-purpose inter-vector and intra-vector operations, and shortens the latency of both the search and update steps. CAP relaxes the strict ordering of search-update pairs to achieve parallel search and update. CAP is implemented in 22nm CMOS technology with 0.6 mm2 area. By integrating all these techniques, CAP achieves a 2.68x performance improvement over the baseline, and also realizes 11.37 TOPS/W energy efficiency and 1.376 TOPS/mm2 area efficiency.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionAsynchronous Graph Neural Networks (AGNNs) have attracted much research attention because they enable faster convergence than synchronous GNNs. However, existing software/hardware solutions suffer from redundant computation overhead and excessive off-chip communication for AGNNs due to irregular state propagations along the dependency chains between vertices. This paper proposes a chain-driven asynchronous accelerator, CDA-GNN, for efficient AGNN inference. Specifically, CDA-GNN integrates a chain-driven asynchronous execution approach into a novel accelerator design to regularize vertex state propagations for fewer redundant computations and off-chip communications, and also designs a chain-aware data caching method to improve data locality for AGNN. We have implemented and evaluated CDA-GNN on a Xilinx Alveo U280 FPGA card. Compared with the state-of-the-art software solutions (i.e., Dorylus and AMP) and hardware solutions (i.e., BlockGNN and FlowGNN), CDA-GNN improves the performance of AGNN inference by an average of 1,173x, 182.4x, 10.2x, and 7.9x and saves energy by 2,241x, 242.2x, 12.4x, and 8.9x, respectively.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSmart sensing is currently a developing topic. To further explore its potential, we propose a centralized-distributed computing architecture. Under this architecture, the end, edge, and center deploy networks suitable for their respective scales and improve system performance and energy efficiency through collaborative computing. The proposed architecture uses a visual model deployed at the near-sensor end to complete feature extraction, and uses a transformation model deployed at the edge to bridge the visual model and the large language models (LLMs) deployed at the center. Then, to deploy neural networks at the resource-constrained near-sensor end, this work focuses on making the neural networks lightweight. Our experimental results show up to 5799x parameter reduction and up to 2.8x time and energy savings with at most a 26.9% accuracy decline on the Visual Question Answering (VQA) task.
IP
Engineering Tracks
IP
DescriptionSystem on Chip (SoC) designs constitute multiple modes to deliver desired configurability.
The requirement is to sign off CDC for each mode individually to ensure the design is bug-free.
Challenges:
- Single mode CDC signoff methodology incomplete
- CDC signoff for each mode manually inefficient
- Individual mode level runs are time consuming
- Duplicate efforts for reviewing common CDC violations require additional designer bandwidth
Proposed Solution:
- Enabling CDC Multimode signoff helps us to segregate common violations among different modes and violations unique to modes
- Common violations among modes can be analyzed in one go
- Unique violations to modes can be reviewed/analyzed to complete exhaustive multi-mode analysis
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn the structural sign-off of metastability issues associated with Clock Domain Crossings (CDC), several assumptions are made, one of which is the presence of static signals. Static signals are typically categorized into two types: stable and constant. During the CDC structural sign-off process, these static signals can obscure numerous asynchronous boundaries within data paths because they are presumed to be inherently safe.
However, the validity of such assumptions can be questionable, potentially leading to discrepancies when compared to functional sign-off. This underscores the necessity for a robust methodology to authenticate these presumptions.
Conventionally, the disparity between structural and functional sign-offs is mitigated by validating assertions that focus on constrained signals. Yet, this traditional approach has limited analytical reach and fails to elucidate the issues clearly when assertions do not hold true. Its reliability is compromised as it does not adequately address the metastability concerns that arise when a static signal undergoes alterations at the fanout receiving registers across different clock domains.
This paper proposes a novel method for verifying data stability in static signals, aiming to enhance the reliability of CDC sign-offs by providing a more comprehensive analysis of potential metastability issues. We also explain how the simulation overhead of the embedded checkers can be overcome by algorithmically re-organizing the checker architecture.
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
DescriptionIn advanced process technology nodes, analog circuit performance is intrinsically linked to layout parasitics and layout-dependent effects (LDE). In contrast to digital designs, layout generation for analog mixed-signal circuits remains a predominantly slow, manual task, impeding rapid design convergence. To address this bottleneck, we introduce CDLS - a Constraint-Driven Generative AI Framework for Analog Layout Synthesis. CDLS is fundamentally a constraint-driven framework that enables analog circuit designers to auto-generate simulation-ready layout. Unlike traditional algorithmic approaches, CDLS uses generative AI and machine learning techniques to generate key design constraints that drive the quality of auto-generated placement and routing. Using CDLS, we reduce layout iteration time by 2-3X on average on industrial designs. By reducing the turn-around time on layout iterations, we estimate a 30% reduction in the overall design convergence cycle. We also demonstrate that the quality of results achieved through CDLS is on par with manually drawn layout, on state-of-the-art analog designs developed on an Intel sub-10nm process technology node.
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionIn this paper, we present CDS, a delay chain based digital sensor that exploits timing variations of both detector and protected object for detecting multiple types of fault injection attacks. To demonstrate its capability, we use CDS to protect the hardware accelerator of PRESENT cryptographic algorithm against multiple glitching attacks. Simulation results show that (1) CDS can detect 100% of voltage and temperature coordinated glitching attacks with 4.1% early warning; (2) CDS can detect 100% of laser glitching attacks with 9.1% early warning; (3) CDS maintains outstanding aging resistance with only 1.1% false alarm rate after 7 years of use.
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
DescriptionLarge-format single-photon avalanche diode (SPAD)-based direct time-of-flight (dToF) sensors are expected to be widely applied in future L5 full driving automation. However, the high-power in-pixel TDCs and the huge amount of data generated by multi-frame histogram sampling impose limitations on the pixel format of SPAD-based dToF sensors. To tackle this challenge, we propose the Computing-in-pixel Edge-aware Detection and Reconstruction (CEDAR) architecture. In this architecture, edge pixels are recognized by charge-domain convolution (CDC) computing, and noise pixels are eliminated by in-memory denoising (IMD). Only the few TDCs in these edge pixels are activated, resulting in significant power and data savings. Afterward, the full-format image is reconstructed by a U-Net using the depth information obtained from these edge pixels. For the first time, we propose a high-resolution 512 × 512 SPAD-based dToF sensor with a low power of 83.3 mW, a distance accuracy of 0.9 cm, and a frame rate of 60 fps. The high-resolution 3D image can be reconstructed from only 3.5% sparse edge pixels, achieving a PSNR of 35.2 dB. The CEDAR architecture achieves a 16× improvement in pixel format and image resolution under the same power dissipation constraint.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description3D NAND flash memory is one of the most important storage technologies in modern computer systems because of its non-volatile nature and excellent data access performance. However, it suffers from aging and reliability issues due to its inherent property. In contrast to the previous research that tried to recover the data with additional encoding techniques, we propose a novel reprogramming technique, called CellRejuvo, to improve the reliability of NAND Flash cells. To the best of our knowledge, CellRejuvo is the first data recovery technique that reprograms the most error-prone state to extend the SSD lifetime. We implement CellRejuvo on a 3D NAND flash-based SSD and evaluate its capability on various realistic workloads. The extensive experimental results show that CellRejuvo successfully reduces the error rate of SSD by an average of 38.28% under various retention times.
Research Manuscript
EDA
Design Verification and Validation
DescriptionSimulink is extensively utilized in system design for its ability to facilitate modeling and synthesis of embedded controllers. It provides automatic test case generation to assist testers in inspecting the model. However, with the continuous increase in the model's scale, the control logic and internal states of the model are becoming more and more complex. Mainstream test case generation methods based on constraint solving and model simulation face challenges in achieving high coverage metrics.
In this paper, we propose CFTCG, a fuzzing-based test case generation method for Simulink models. First, CFTCG generates the fuzzing code, which includes a fuzz driver based on the model's input information and fuzz code with model-level branch instrumentation. These are then compiled together to execute the model-oriented fuzzing loop. During this loop, we make use of the field information of the model inports and the coverage difference between iterative executions, allowing for more targeted input mutation. We evaluated CFTCG on several benchmark Simulink models. In comparison to the built-in Simulink Design Verifier and the state-of-the-art academic work SimCoTest, CFTCG demonstrates an average improvement of 47.2% and 100.8% on Decision Coverage, 38.3% and 44.6% on Condition Coverage, and 144.5% and 232.4% on Modified Condition/Decision Coverage, respectively.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionDesigning an analog block involves more human intervention and expertise than its digital counterpart. The analog design flow is still largely manual, which makes it a time-consuming and error-prone process. This paper first reviews the difficulties faced by analog designers in manual schematic design. This is followed by a feasibility analysis and a preliminary evaluation of various proposals, keeping backward compatibility in mind. The improvement points discussed in this paper will help analog designers design and debug blocks more efficiently.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionMetastability injection (MSI) in the N-stage synchronisers of synchronised paths is essential for verifying a design's robustness against the uncertainty in the stability of synchroniser outputs. One way to do this is the Jasper MSI utility, which can be exported to inject metastability at the injection points identified during the CDC/RDC analysis cycle. However, this solution makes it difficult to fully close the coverage metrics: as the number of synchronisers increases, the number of such cover points also increases. If we use this method to prove that the convergence problems seen in the analysis phase are safe, we must achieve 100% cross coverage of the synchronised paths, which explodes exponentially as the number of converging synchroniser paths in the design grows. We propose combating this CDC tool problem with an alternative approach that uses the FPV and SEC apps for better sign-off.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionClock domain crossing (CDC) and reset domain crossing (RDC) signoff poses several challenges in digital design, and addressing them is crucial for ensuring the reliability and correctness of complex SoCs. While static analysis tools play a critical role in CDC/RDC analysis, functional verification through simulation is equally necessary to validate the architectural assumptions that underpin the static analysis.
Current CDC/RDC constraints signoff challenges:
Accuracy of constraints: CDC/RDC constraints used for static analysis are written based on certain design assumptions, but what if those assumptions are incorrect?
Thoroughness of constraints: what if those assumptions are incomplete?
Validation of constraints: the existing flow does not ensure the validity of these constraints.
The fundamental goal of this presentation is to provide a holistic methodology for signing off CDC/RDC constraints that were written from design assumptions, using SystemVerilog Assertions (SVA) in functional simulations.
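The accuracy check above can be made concrete. A common static-analysis assumption is that a multi-bit crossing is gray-coded (at most one bit changes per source-clock cycle); in the proposed methodology this would be expressed as an SVA and checked in functional simulation. SystemVerilog is the natural language here, so the following Python stand-in over a recorded trace is only an illustration of the same check, not the presentation's actual code:

```python
# Python stand-in for an SVA-style check of a CDC constraint assumption:
# verify over a recorded trace that a bus crossing is gray-coded
# (consecutive samples differ in at most one bit).

def is_gray_coded(trace):
    """Return True if consecutive samples differ in at most one bit."""
    for prev, curr in zip(trace, trace[1:]):
        if bin(prev ^ curr).count("1") > 1:
            return False
    return True

# A plain binary counter violates the assumption (1 -> 2 flips two bits),
# while a reflected Gray sequence satisfies it.
binary_counter = [0, 1, 2, 3, 4]
gray_counter = [0b00, 0b01, 0b11, 0b10]

print(is_gray_coded(binary_counter))  # False: the static-analysis waiver is unsafe
print(is_gray_coded(gray_counter))    # True: the assumption holds on this trace
```

If the check fails in simulation, the corresponding static-analysis waiver was built on an incorrect assumption, which is exactly the accuracy gap the presentation targets.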
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
DescriptionThe increasing complexity of semiconductor designs necessitates agile hardware development methodologies to keep pace with rapid technological advancements. Following this trend, Large Language Models (LLMs) emerge as a potential solution, providing new opportunities in hardware design automation. However, existing LLMs exhibit challenges in HDL design and verification, especially for complicated hardware systems. Addressing this need, we introduce ChatCPU, the first end-to-end agile hardware design and verification platform built on an LLM. ChatCPU streamlines the ASIC design and verification process, guiding it from initial specifications to the final RTL implementations with enhanced design agility. Incorporating LLM fine-tuning and a processor description language designed for CPU design automation, ChatCPU significantly enhances the hardware design capability of the LLM. Using ChatCPU, we developed a 6-stage in-order RISC-V CPU prototype, achieving a successful tape-out via the SkyWater 130nm MPW project with Efabless, which is currently the largest CPU design generated by an LLM. Our results demonstrate a remarkable improvement in CPU design efficiency, accelerating the design iteration process by an average of 3.81X, peaking at 12X and 9.33X in the HDL implementation and verification stages, respectively. ChatCPU also enhances the design capability of the LLM by 2.63X compared to the base LLama2. These advancements position ChatCPU as a significant milestone in LLM-driven ASIC design and verification.
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionExisting works focus on fixed-size layout pattern generation, while the more practical free-size pattern generation receives limited attention. In this paper, we propose ChatPattern, a novel Large-Language-Model (LLM)-powered framework for flexible pattern customization. ChatPattern utilizes a two-part system featuring an expert LLM agent and a highly controllable layout pattern generator. The LLM agent can interpret natural language requirements and operate design tools to meet specified needs, while the generator excels in conditional layout generation, pattern modification, and memory-friendly pattern extension. Experiments on a challenging pattern generation setting show the ability of ChatPattern to synthesize high-quality large-scale patterns.
IP
Engineering Tracks
IP
DescriptionWith all the discussion about Moore's Law, one thing is for sure: memories aren't scaling as much as logic. On the other hand, AI applications, so popular these days, require increasing amounts of memory. Add to that the need to extend the use of available fabs, and you get a great reason to explore new memory paradigms.
In this session we'll explore cutting-edge technologies transforming the landscape of memory design. The expert speakers will share real-world applications in AI, machine learning, and edge computing, exploring new technologies and optimization strategies.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionFully Homomorphic Encryption (FHE) is one of the most promising privacy-preserving techniques, and it has drawn increasing attention from both academia and industry due to its ideal security. Chiplet-based designs integrate multiple dies (chiplets) into a package to deliver high performance and are therefore embraced by the resource-hungry FHE. Yet despite hosting various specialized accelerators, existing chiplet-based systems fall short in supporting FHE because of its novel polynomial operations. For a chiplet-based system that is not tailored for FHE, one common approach to make it FHE-capable is to design a new dedicated accelerator. However, this full design-and-build approach overlooks the abundant accelerator resources already present in the system and thereby incurs repeated customization and resource waste.
In this paper, we propose Chiplever, a framework that enables effortless extension of a chiplet-based system for FHE. We aim to fully harness the resources already available in the system for efficient FHE. To achieve this, Chiplever (1) introduces a specialized extension in the I/O chiplet guided by semantics matching, (2) proposes an efficient allocator featuring specialized dataflow scheduling, and (3) provides a three-step mapping to achieve compiler-level to hardware-level support for FHE while optimizing the data communications.
SKYTalk
DescriptionThe semiconductor industry has always moved fast, but changes in the industry over recent years have been historic. We will discuss a variety of challenges in the ecosystem, and how the NSTC can unite the community to address those challenges. Some of these are technical – logic, mixed signal, memory, photonics, design / co-design, and architecture all need new breakthroughs to continue to advance the state of technology. Others are ecosystem challenges. Access to design tools, IP, and collaboration environments, as well as increasing use of AI in the design and verification flow will all transform the way that the industry does its work. Access to advanced R&D facilities and leading-edge shuttles can accelerate the pace of research. The traditional venture model has been mismatched with hardware investments for decades, and this has been a drag on innovation, but there are new ideas for how this can work better. In closing, we will provide updates on the priorities for this year and show how the NSTC can change the long-term trajectory for innovation.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn the field of large language model (LLM) inference, the high computational demand and extensive memory requirements for weights and key-value (KV) cache storage present significant challenges. This issue becomes especially problematic when relying exclusively on GPUs, as they often lack the capacity to accommodate the entire KV cache, particularly in larger LLMs. In the absence of direct communication links such as NVLink among multiple GPUs, LLMs typically require offloading the KV cache to the CPU for storage and computation, followed by transferring the multi-head attention results back to the GPU for subsequent transformer computations. Given that attention score computation is computationally demanding on the CPU and requires substantial data movement between KV caches and memory, directly computing the attention scores and even the feed-forward layers on Compute-in-Memory (CIM) systems emerges as a viable alternative. This paper is at the forefront of integrating CIM technology into LLM inference, and proposes an innovative architecture that leverages this emerging technology to enhance inference efficiency. Specifically, we present a tailored CIM-based dataflow and hierarchy design to optimize the computation of attention scores and feed-forward layers using CIMs. The results show improvements in performance, with 0.021× inference latency and 1.23 · 10^{−4}× energy compared to a CPU-based implementation.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe irregularly distributed zeros in pruned CNN networks are still loaded and processed by CIM, which causes inefficient usage of memory arrays. We propose CIMAP, a CIM data mapping methodology for unstructured sparse CNNs. CIMAP flexibly rearranges the non-zero weights in unstructured sparse CNN models by strategically swapping rows within the weight map. CIMAP also introduces a CIM processor with switchable CIM macros to efficiently handle sparse weight maps. The experimental results show that CIMAP attains up to 1.97× and 1.8× better performance, and 7.45× and 5.02× better energy efficiency, than conventional CIM solutions for unstructured sparse VGG16 and ResNet50, respectively.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRecent advances in large language models (LLMs) have computationally mastered human language through predictive modeling. Extending this concept to electronic design, we explore the idea of a "circuit model" trained on circuits to predict the next logic gate, addressing structural complexities and equivalence constraints. By encoding circuits as memory-less trajectories and employing equivalence-preserving decoding, our trained "Circuit Transformer" with 88M parameters demonstrates impressive performance in end-to-end logic synthesis. With the aid of Monte-Carlo tree search, it significantly outperforms resyn2 in ABC on small circuits while retaining strict equivalence, showcasing the potential of generative AI in conquering electronic design challenges.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionAutomatically designing fast and space-efficient digital circuits is challenging because circuits are discrete, must exactly implement the desired logic, and are costly to simulate. We address these challenges with CircuitVAE, a search algorithm that embeds computation graphs in a continuous space and optimizes a learned surrogate of physical simulation by gradient descent. By carefully controlling overfitting of the simulation surrogate and ensuring diverse exploration, our algorithm is highly sample-efficient, yet gracefully scales to large problem instances and high sample budgets. We test CircuitVAE by designing binary adders across a large range of sizes, IO timing constraints, and sample budgets. Our method excels at designing large circuits, where other algorithms struggle: compared to reinforcement learning and genetic algorithms, CircuitVAE typically finds 64-bit adders which are smaller and faster using less than half the sample budget. We also find CircuitVAE can design state-of-the-art adders in a real-world chip, demonstrating that our method can outperform commercial tools in a realistic setting.
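The core loop of the approach above can be sketched in a few lines. This is a toy illustration, not CircuitVAE's actual model: the quadratic "surrogate", the latent dimension, and the learning rate are stand-ins chosen only to show gradient descent on a learned cost model in a continuous latent space.

```python
import numpy as np

def surrogate_cost(z):
    # Stand-in for a learned, differentiable cost model
    # (e.g., predicted adder area + delay at a latent point z).
    return float(np.sum((z - 1.5) ** 2))

def surrogate_grad(z):
    # Analytic gradient of the toy surrogate above.
    return 2 * (z - 1.5)

z = np.zeros(8)  # start from an arbitrary point in latent space
for _ in range(200):
    z -= 0.1 * surrogate_grad(z)  # gradient step in the continuous space

# z has converged near the surrogate's optimum; in CircuitVAE the
# decoded design would then be validated with the real physical simulator.
print(round(surrogate_cost(z), 6))  # → 0.0
```

The point of the continuous embedding is exactly this: discrete circuits cannot be differentiated through, but a latent point can, so cheap gradient steps replace most of the expensive simulations.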
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionCircuit stability (sensitivity) analysis aims at estimating the overall performance impact due to the perturbations of underlying design parameters (e.g. gate sizes, capacitance variations, etc), which remains a challenging task since many time-consuming circuit simulations are typically required. On the other hand, graph neural networks (GNNs) have proven to be effective in solving many chip design automation problems, including circuit timing prediction, parasitics prediction, gate sizing, and device placement. This paper presents a novel approach (CirSTAG) that exploits GNNs to analyze the stability (robustness) of modern integrated circuits (ICs). CirSTAG is based on a spectral framework for analyzing the stability of GNNs leveraging input/output graph-based manifolds: when two nearby nodes on the input manifold are mapped (through a GNN model) to two distant nodes (data samples) on the output manifold, it implies a large distance mapping distortion (DMD) and thus poor GNN stability. CirSTAG computes a stability score equivalent to the local Lipschitz constant for each node/edge considering both graph structural and node feature perturbations, which immediately allows for identifying the most critical (sensitive) circuit elements that would significantly alter the circuit performance. Our empirical evaluations on a variety of timing prediction tasks with realistic circuit designs show that CirSTAG can truthfully estimate each circuit element's stability under various parameter perturbations, offering a scalable approach for assessing the stability of large IC designs.
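To give intuition for the distance mapping distortion (DMD) idea described above, the following sketch estimates a per-node score as the largest ratio of output-space to input-space distance over each node's nearest input-space neighbours, a simple proxy for a local Lipschitz constant. This is illustrative only; CirSTAG's actual spectral, manifold-based computation is more involved, and the neighbour count and example data are assumptions.

```python
import numpy as np

def local_dmd(x_in, x_out, k=2):
    """For each node, estimate DMD as the max ratio of output-embedding
    distance to input-embedding distance over its k nearest input-space
    neighbours (a proxy for a local Lipschitz constant)."""
    n = len(x_in)
    scores = np.zeros(n)
    for i in range(n):
        d_in = np.linalg.norm(x_in - x_in[i], axis=1)
        d_out = np.linalg.norm(x_out - x_out[i], axis=1)
        nbrs = np.argsort(d_in)[1:k + 1]  # skip the node itself
        scores[i] = np.max(d_out[nbrs] / np.maximum(d_in[nbrs], 1e-12))
    return scores

# Sanity check: if the GNN mapping were a pure scaling by 2, every
# node's distortion score would be exactly 2.
x_in = np.array([[0.0], [1.0], [2.0], [3.0]])
print(local_dmd(x_in, 2 * x_in))  # [2. 2. 2. 2.]
```

A node with a score far above its peers is one where small input perturbations move the output embedding a long way, which is the paper's criterion for a sensitive circuit element.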
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionClock parameter tuning and optimization of clock networks at lower technology nodes is one of the main challenges in back-end design. Every design has its own unique challenges, and arriving at a fixed recipe without understanding the design seems impractical. Identifying the best recipe for timing, along with additional parameters that may improve it, cannot be done in a short amount of time; it is a time-consuming process that requires significant effort from the design engineer. An intelligent adaptive-learning tool capability gives us an efficient way to sweep multiple parameters and tune recipes for better performance and power, freeing the design engineer from the drudgery of manual searching, comparing, and learning. This approach provides the user with an explanation that justifies its recommendations, decisions, or actions; the user decides based on that explanation, and it helps the designer identify the right knobs that give the best results in terms of timing and power. The adaptive-learning approach allows the tool to explore all possibilities within predefined user inputs/objectives to achieve the best QoR. The clock methodology used for evaluation is Multisource Clock Tree Synthesis (MSCTS). This presentation investigates comparative studies of conventional CTS, MSCTS, and MSCTS with the adaptive-learning tool, and summarizes the improvements achieved in clock QoR.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionCoarse-grained reconfigurable architectures (CGRAs) have gained popularity as accelerators for compute-intensive kernels. Complex CGRA architectures that support key features such as multi-context and predication are being developed to support a wider range of kernels. However, mapping applications on these complex architectures poses significant challenges. In this paper, we provide an architecture-agnostic clustered mapping technique and a new cost function tailored for simulated-annealing placement. The mapper simplifies placement and routing phases, demonstrating significant speedup for popular CGRA architectures: HyCUBE and ADRES. Additionally, our method demonstrates an increase in mapping success for the ADRES architecture.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionVideo Frame Interpolation (VFI) aims to generate intermediate frames between consecutive frames. Recent DNN-based VFI offers superior quality but suffers from performance issues. However, very few studies have focused on VFI hardware acceleration, and existing work overlooks the temporal information in compressed video bitstreams. In this paper, we propose a novel compressed-VFI workflow and an accelerator, Co-Via. Co-Via exploits codec information reuse to reduce complex DNN computations and alleviate hardware pressure. FPGA-based Co-Via outperforms an RTX 4090 GPU by 10.31×, offering a 43.08× energy-efficiency boost. Its ASIC version achieves 2.4× higher throughput and 3.6× better energy efficiency than the state-of-the-art solution.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThere are limitations to measuring code coverage at runtime in ARM-architecture-based firmware, such as (1) limited available memory and (2) multi-core systems.
In this paper, we propose and apply a new technique that uses binary modification technology to measure code coverage for ARM-architecture-based firmware.
We confirmed that code coverage can be measured for a Samsung NVMe SSD.
In addition, a process configuration that takes the development process into account has been set up efficiently for development and testing through DevOps.
As a result of applying this technique, mostly the same level of performance was measured when measuring code coverage for over 7000 targets in under 128K of memory, as well as the performance of four storage units.
Research Manuscript
Design
Quantum Computing
DescriptionWe explore the integration of parameterized quantum pulses with the contextual subspace method. The advent of parameterized quantum pulses marks a transition from quantum gates to a more efficient approach. Working with pulses allows us to potentially access areas of the Hilbert space that are inaccessible with a CNOT-based circuit decomposition. Compared to the traditional Variational Quantum Eigensolver (VQE), the computation of the contextual correction generally requires fewer qubits and measurements, thus improving computational efficiency. Together with a Pauli grouping strategy, our framework can minimize the quantum resource cost of the VQE and enhance the potential for processing larger molecular structures.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionContent addressable memory (CAM) has attracted a lot of attention for data-intensive applications due to its highly parallel pattern-searching capability. Most state-of-the-art works focus on reducing the hardware cost of CAM by exploiting various emerging non-volatile memory (NVM) technologies. However, existing CAM designs still mainly follow the conventional encoding scheme, which requires two complementary storage nodes and search signals for each bit of the entry and query respectively, along with separate precharging and evaluation phases for bit-vector searching, limiting further improvement of area and energy efficiency. In this work, a compact and efficient CAM architecture is proposed through two techniques: (1) a combinatorial encoding scheme for CAM that encodes entry/query states with permutations and combinations of multiple storage nodes as a group, which can significantly improve the encoding efficiency and thus greatly reduce the hardware implementation cost of CAM compared with the conventional encoding scheme; (2) a one-step self-terminating search scheme for CAM that detects the matching condition during the precharging phase and terminates precharging once a match is detected, which can further reduce the search delay and energy. Experiments and evaluations of the proposed CAM architecture with co-optimization of combinatorial encoding and self-terminating searching are carried out based on ferroelectric FETs (FeFETs), reducing the area-energy-delay product (AEDP) by 1182× over a conventional CMOS-based CAM in data searching tasks and showing great potential for area- and energy-efficient in-memory-searching accelerators.
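For intuition about why group encoding raises encoding efficiency, the sketch below compares the information density of the conventional two-nodes-per-bit scheme against a group scheme in which m storage nodes represent one of C(m, k) states by setting exactly k of them. The specific (m, k) values are illustrative assumptions, not the paper's parameters.

```python
from math import comb, log2

def bits_per_node_conventional():
    # Conventional scheme: each entry bit costs two complementary
    # storage nodes, i.e. 0.5 bits of information per node.
    return 0.5

def bits_per_node_combinatorial(m, k):
    # Group ("combinatorial") scheme: m nodes encode one of C(m, k)
    # states, i.e. log2(C(m, k)) bits spread over m nodes.
    return log2(comb(m, k)) / m

print(bits_per_node_conventional())       # 0.5
print(bits_per_node_combinatorial(4, 2))  # log2(6)/4 ≈ 0.646
```

Even this small group beats the conventional density, and larger groups do better still, which is the source of the reduced hardware cost the abstract claims.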
Embedded Systems and Software
AI
Embedded Systems
Engineering Tracks
DescriptionToday's embedded system application architects face the challenge of mapping to increasingly diverse compute resources including CPUs, AIEs, and FPGA accelerators. The architect must manage the mapping of the application to these compute resources while also considering details like data movement, memory structures, and data types. This results in a complex trade-space analysis of how to optimally map an application to a heterogeneous target such as the Versal FPGA SoC.
This technical talk will outline the existing system architect workflows and show a gap in today's SoC tools for supporting the system architect in evaluating the application-mapping trade space. The proposed "application explorer" tool supports the system architect in early analysis of application mapping to the major compute resource types, based on a system-level stochastic model simulation. This model-based systems engineering tool lets the system architect iterate over different design mappings and ultimately provide the downstream detailed implementation teams with a definition of the scope of their functionality. The talk will then present a prototype of the concept, implemented as an extension to the Mirabilis VisualSim Architect tool, for a signal processing algorithm that targets a Versal FPGA SoC.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRehearsal-based Continual Learning (CL) has been investigated for Deep Neural Networks but is lacking for Spiking Neural Networks (SNNs). We present the first memory-efficient implementation of Latent Replay (LR)-based CL for SNNs, targeting resource-constrained devices. LRs combine new samples with latent representations of previous data to mitigate forgetting. Experiments on the Heidelberg SHD dataset with Sample- and Class-Incremental tasks reach 92% Top-1 accuracy on average, without forgetting. Furthermore, we minimize the LRs with a time-domain compression, reducing their memory by 140× with a 4% accuracy drop. On a Multi-Class-Incremental task, our SNN learns 10 new classes with 78.4% accuracy on the SHD test set.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe propose a novel operation called Personal Self-Attention (PSA). It is designed specifically to learn non-linear 1-D functions faster than existing architectures. We show that by stacking and combining these non-linear functions with linear transformations, we can achieve the same accuracy as a larger model but with a hidden dimension that is 2-6x smaller. Further, by quantizing our non-linear function, the PSA can be mapped to a simple lookup table, allowing for very efficient translation to FPGA hardware attaining an accuracy of 86% on CIFAR-10 with a throughput of 29k FPS.
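The final step described above, mapping a quantized non-linear 1-D function to a lookup table, can be sketched directly. The table size, input range, and the use of tanh as a stand-in for the learned PSA non-linearity are all assumptions for illustration; the actual FPGA mapping is not published here.

```python
import numpy as np

def build_lut(fn, n_bits=8, x_min=-4.0, x_max=4.0):
    """Quantize a 1-D non-linearity into a 2**n_bits-entry lookup table."""
    xs = np.linspace(x_min, x_max, 2 ** n_bits)
    return fn(xs)  # one table entry per quantized input level

def lut_apply(lut, x, x_min=-4.0, x_max=4.0):
    """Evaluate via nearest-entry lookup instead of arithmetic."""
    idx = np.round((x - x_min) / (x_max - x_min) * (len(lut) - 1))
    idx = np.clip(idx, 0, len(lut) - 1).astype(int)
    return lut[idx]

# 256-entry table standing in for a learned non-linearity.
lut = build_lut(np.tanh)
x = np.array([-2.0, 0.0, 2.0])
print(np.max(np.abs(lut_apply(lut, x) - np.tanh(x))))  # small quantization error
```

Once the function is a table, evaluating it on an FPGA costs one block-RAM read per activation, which is what makes the reported throughput plausible.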
Research Manuscript
Autonomous Systems
Autonomous Systems (Automotive, Robotics, Drones)
DescriptionConnected Autonomous Vehicles have great potential to improve automobile safety and traffic flow, especially in cooperative applications where perception data is shared between vehicles. However, this cooperation must be secured from malicious intent and unintentional errors that could cause accidents. In this paper, we propose Conclave -- a tightly coupled authentication, consensus, and trust scoring mechanism that provides comprehensive security and reliability for cooperative perception. Overall, Conclave shows huge promise in preventing security flaws, detecting even relatively minor sensing faults, and increasing the robustness and accuracy of cooperative perception in CAVs while adding minimal overhead.
Research Manuscript
EDA
Physical Design and Verification
DescriptionPin access has become one of the most significant challenges in large-scale full-chip routing due to the continuous reduction in feature sizes and the increasing complexity of designs. The conventional standard cell layout synthesis approaches usually optimize pin accessibility by maximizing pin lengths and access points. However, these pre-determined pin patterns greatly occupy routing resources and may contrarily degrade routability. To address this problem, this paper proposes the first work of concurrent detailed routing with pin pattern re-generation to achieve ultimate pin access optimization. A pseudo-pin extraction and routing technique is proposed that can secure one access point for each input/output pin while allowing the remaining access points to be routable by other nets. The experimental results demonstrate that the proposed method can resolve 89% of local regions that are unroutable with original layout patterns without compromising power and timing performances.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionCloud-centric workloads are increasingly moving towards leveraging domain-specific accelerators (DSAs) such as GPUs, NPUs, and FPGAs to achieve massive speedup over general-purpose CPUs. These workloads compute on sensitive data; furthermore, the programs themselves can be proprietary business secrets, such as high-performance AI models. Therefore, several confidential cloud solutions have recently emerged to protect not only against an attacker-controlled software stack (OS/VMM) but also against the cloud service providers (CSPs) themselves. CPU-centric trusted execution environments (TEEs) have been around for some time and are deployed commercially. However, despite some recent proposals, most DSA nodes do not have any TEE capability and are therefore unprotected against a malicious CSP and software stack.
In this paper, we address this gap by proposing a new dedicated hardware module, which we call the security controller (SC) that acts as the TEE proxy for the legacy non-TEE DSA nodes in a data center rack. SC enforces access control and attestation mechanisms and protects the non-TEE nodes even from a physical attacker. We implement and synthesize SC hardware and evaluate it with real-world cloud-centric workloads with heterogeneous DSAs. Our evaluation shows that on average, SC introduces 1.5-4.5% overhead while running AI, Redis, and file system workloads and scales well with an increasing number of DSA nodes (up to 2236 concurrent NPUs running CNNs).
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionIn this work, we propose a new attack called Conjuring that exploits one of the main features of CPU front-ends: speculative fetch of instructions. We show that the Pattern History Table (PHT) in modern CPUs is an effective channel for learning and leaking the control flow of victim applications. Unlike prior work, Conjuring does not require priming the PHT or interfering with the victim's execution, enabling a realistic, unprivileged attacker to leak control-flow information. As branch predictors improve, our attack becomes even more serious and practical. We demonstrate the feasibility of our attack on different existing Intel, AMD, and Apple CPUs.
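For readers unfamiliar with the structure the abstract targets, the following is a minimal textbook-style model of a Pattern History Table with 2-bit saturating counters indexed by a global branch-history register. This is an illustrative sketch only; it is not the paper's attack code, and all names and sizes are assumptions.

```python
# Textbook model of a Pattern History Table (PHT): 2-bit saturating counters
# indexed by recent branch outcomes. Illustrative only, not the Conjuring attack.

class PatternHistoryTable:
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0                           # global branch-history register
        self.counters = [1] * (1 << history_bits)  # start at "weakly not taken"

    def predict(self):
        # Counter value >= 2 means "predict taken".
        return self.counters[self.history] >= 2

    def update(self, taken):
        idx = self.history
        if taken:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)
        # Shift the actual outcome into the history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

pht = PatternHistoryTable()
for outcome in [True] * 8:   # train on a branch that is always taken
    pht.update(outcome)
print(pht.predict())         # → True
```

Because the counters persist across context switches, a structure like this retains a trace of a victim's branch outcomes, which is the property such side-channel attacks build on.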
IP
Engineering Tracks
IP
DescriptionCustom memories are used in a wide spectrum of applications and hence support many features that are useful at the SoC level. The number of combinations for verifying such functional behaviors, physical parameters, and operating variations is huge and can impact the development turn-around time due to high simulation run-time.
For large IPs like memories, leafcell extraction is done using the Cc (capacitance-only) methodology for supply lines and the RCc (both resistance and capacitance) methodology for signals, to reduce extraction size and gain simulation run-time at the cost of accurately accounting for the voltage drop on the supply lines. With advancing technologies, the cumulative effect of lowered supply voltages and increased voltage drop due to contact resistance leads to higher device sensitivity and lower noise margins. If ignored, this can lead to a parametric yield loss.
For accurate characterization and robustness, we propose a methodology using layer information and the StarReducer tool that effectively accounts for the voltage drop on supply lines due to contact resistance. The timing penalty of around 4% between the current and the fully accurate methodology reduces to 1% using the proposed methodology, providing a fine balance between accuracy and simulation run-time that helps in the design and validation phases.
Research Manuscript
Embedded Systems
Embedded Software
DescriptionKernels are scheduled on Graphics Processing Units (GPUs) at the granularity of a warp, a group of concurrently executing threads. When executing kernels with conditional branches, threads within a warp may execute different branches sequentially, resulting in considerable utilization loss and unpredictable execution time, a problem known as control flow divergence. This paper proposes a novel method to predict threads' execution paths before kernel launch by deploying a branch prediction network on the GPU's tensor cores, which can run in parallel with the CUDA cores. Combined with a well-designed thread data reorganization algorithm, this solution can mitigate the GPU's control flow divergence problem.
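To illustrate why reorganizing thread data helps, the sketch below (plain Python, not CUDA, and not the paper's algorithm) counts warps whose threads disagree on a branch outcome, before and after sorting the inputs by that outcome. The warp size, branch, and data are illustrative assumptions.

```python
# Why thread-data reorganization reduces control-flow divergence: a warp is
# divergent if its threads take different sides of a branch; sorting inputs by
# the predicted outcome groups same-path threads into the same warp.

WARP_SIZE = 32

def divergent_warps(data, branch):
    """Count warps whose 32 threads disagree on the branch outcome."""
    count = 0
    for i in range(0, len(data), WARP_SIZE):
        outcomes = {branch(x) for x in data[i:i + WARP_SIZE]}
        if len(outcomes) > 1:
            count += 1
    return count

branch = lambda x: x % 2 == 0    # the kernel's conditional, illustrative
data = list(range(1024))         # alternating outcomes: worst-case layout

before = divergent_warps(data, branch)
after = divergent_warps(sorted(data, key=branch), branch)
print(before, after)             # → 32 0: every warp diverges before, none after
```

The hard part in practice, which the paper addresses with a tensor-core branch predictor, is knowing the branch outcome cheaply enough before launch to do this sort.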
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionCryogenic CMOS is a promising technology for high performance computing due to its improved subthreshold slope, higher carrier mobilities, and reduced wire resistance. The threshold voltage (Vth) increase at 77K can be mitigated by metal gate work function (PHIG) engineering to achieve matched off current (IOFF), further enhancing device performance and allowing operation at very low supply voltage, thereby reducing the Energy Delay Product (EDP). However, the effect of variation on the noise margins of static random access memories (SRAMs) deploying these matched-IOFF devices is very prominent, especially at low supply voltages (VDD), limiting its scaling. In this work, we propose a framework to perform Vth retargeting for improving noise margins in high performance cryogenic SRAM cells under variation. The proposed framework comprises a Monte Carlo engine, which performs statistical analysis and DC characterization, and a backend processing engine to analyze noise margins and tune the PHIG. To demonstrate the framework, we use calibrated 14nm FinFET models at 300K and 77K. First, we analyze the logic blocks using iso-IOFF devices, which yield up to 3x improvement in delay at iso-energy and a 4.5x reduction in energy at iso-delay. Next, we study the effect of Vth variation on the device currents. Finally, the framework is deployed to tune PHIG, and results show that it can enhance the noise margins by 23%, 31% and 19% for hold, read and write operations, respectively, at 77K compared to iso-IOFF devices. Further, a 1kb SRAM array has been simulated using iso-IOFF-tuned peripherals and framework-tuned SRAM cells, which shows a 5.4x reduction in read/write energies along with a 1.2x delay reduction and better noise margins at 77K compared to 300K.
IP
Engineering Tracks
IP
DescriptionAn IP that is clean in the IP environment, with its waivable errors, is often reported dirty in an SoC environment. Negotiating waivers with the foundry team and manually reviewing such waivers within the SoC environment takes weeks or months. Such manually documented waivers are often lost or cannot be re-used for the following project because of changes in the PDKs. Also, critical DRC fixes in the late PDKs can only be intercepted by some IP providers within their timeline, as these IPs often come from different sources, which creates a definite need to use waivers for DRC checks. In this paper, we discuss a new waiver method that enables the design team to deliver all the waivers as part of the design collateral. The block-level waivers can be easily embedded into an OASIS file, which contains the required waivers for any IP design. The solution also takes geometries and cell hierarchies into consideration while applying waivers. The discussed solution invokes all the paranoid checks in the internal tool to ensure the waiver database is consumed correctly, and only for the targeted rule. We have used the proposed solution to identify and differentiate IP-level and SoC-level DRCs while IPs are still under development. This has proven to be a solid methodology for our SoC design team to enable parallel execution.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionOn-chip variation (OCV) is a significant factor affecting timing sign-off for digital designs at 20nm and below. At lower technology nodes, timing measurements such as propagation delay, setup time, and hold time may change by 50%-100% due to statistical variation. In order to capture these variation effects accurately, timing .libs include variation modeling information defined by the Liberty Variation Format (LVF).
LVF requires a statistical simulation/Monte Carlo analysis at each timing data point in order to capture the full distribution of behavior, so each data point is no longer just a single additional table entry per timing arc. For each timing arc and nominal measurement (e.g. delays, transitions, and constraints), there are up to 5 additional measurements used for statistical analysis: early and late 3-sigma values, mean shift, standard deviation, and skewness. This increases the runtime for SPICE characterization dramatically.
In this paper, we discuss a methodology for reducing SPICE characterization runtime by identifying the critical corners to characterize and generating the remaining LVF data using AI.
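The five extra per-arc statistics the abstract lists can be sketched from Monte Carlo delay samples as below. This is a stdlib-only illustration of what LVF characterization measures, not the paper's AI-based corner-reduction method; the sample distribution and nominal value are invented for the example.

```python
# Computing the five LVF statistics named in the abstract (early/late 3-sigma,
# mean shift, standard deviation, skewness) from Monte Carlo delay samples.
import random
import statistics

random.seed(0)
nominal_delay = 100.0  # ps, nominal SPICE measurement for one timing arc (illustrative)
samples = [random.gauss(nominal_delay + 2.0, 5.0) for _ in range(10_000)]

mean = statistics.fmean(samples)
sigma = statistics.stdev(samples)
mean_shift = mean - nominal_delay         # shift of the distribution mean
early_3sigma = mean - 3 * sigma           # early (fast) 3-sigma value
late_3sigma = mean + 3 * sigma            # late (slow) 3-sigma value
skewness = statistics.fmean(((x - mean) / sigma) ** 3 for x in samples)

print(f"mean_shift={mean_shift:.2f} sigma={sigma:.2f} skew={skewness:.3f}")
```

Since every timing arc needs a full Monte Carlo run like this at every corner, the motivation for characterizing only critical corners in SPICE and inferring the rest is clear.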
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionLearning-based cost models used for tensor compiler auto-tuning often suffer from poor performance when trained on one hardware platform and applied to another. This issue necessitates collecting performance data for each potential platform during model deployment, incurring significant overhead.
We propose Crop, a comprehensive and universal analytical cost model designed for cross-platform performance prediction of tensor programs. Crop decouples program features from hardware features: it gathers hardware-independent program features on one platform and predicts their performance on other platforms from parametric hardware features. Crop achieves prediction accuracy comparable to that of a learning-based cost model while ensuring portability.
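The decoupling idea can be illustrated with a roofline-style analogue (this is not Crop itself; the features, platform parameters, and numbers are all assumptions for illustration): hardware-independent program counts are gathered once, then combined with per-platform parameters to predict runtime on each target.

```python
# Illustrative analogue of decoupling program features from hardware features:
# hardware-independent counts for a tensor program are combined with
# per-platform parameters in a simple roofline-style runtime estimate.

program_features = {"flops": 2 * 1024**3, "bytes": 64 * 1024**2}  # gathered once

platforms = {
    "gpu_a": {"peak_flops": 10e12, "bandwidth": 600e9},   # 10 TFLOPS, 600 GB/s
    "gpu_b": {"peak_flops": 30e12, "bandwidth": 1500e9},  # a faster target
}

def predict_runtime(prog, hw):
    """Runtime is bounded by compute or memory traffic, whichever dominates."""
    compute_time = prog["flops"] / hw["peak_flops"]
    memory_time = prog["bytes"] / hw["bandwidth"]
    return max(compute_time, memory_time)

for name, hw in platforms.items():
    print(f"{name}: {predict_runtime(program_features, hw) * 1e3:.3f} ms")
```

The point of the structure, as in the abstract, is that only the small hardware-parameter dictionary changes per platform; no per-platform training data is needed.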
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionIn-sensor computing has emerged as a promising approach to mitigating the huge data transmission costs between sensors and processing units. Recently, emerging application scenarios have placed greater demands on sensor technology for large-area and flexible integration. However, with thin-film technologies, which can provide flexible and large-area integration support, the implementation of in-sensor computing is strongly restricted by low device performance, large-area integration variation, and the costly interface between sensors and CMOS processors. To address this challenge, we propose an in-sensor computing architecture to facilitate high-parallelism NN pre-processing and effective data compression. The boundaries of computing parallelism are expanded by adopting a compact ROM-based compute-in-memory scheme next to the sensing array. Differential-frame computing provides not only excellent robustness but also high data sparsity. A bio-inspired data compression method with residual recovery caches and zero-skip circuits further enhances output sparsity without accumulated error. Based on the proposed cross-layer design optimization, an LTPS TFT-based ROM CiM chip has been fabricated and experimentally measured. The system-level evaluation demonstrates a 3.85× speedup and a 5.10× energy efficiency improvement compared with a traditional architecture with separated sensors and processors, outperforming existing in-sensor computing works in large-area thin-film technology scenarios.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionIn this work, we propose CSTrans-OPU, an FPGA-based overlay processor with full compilation for transformer networks via sparsity exploration. Specifically, we customize a multi-precision processing element (PE) array with DSP packing for a unified computation format with full resource utilization. Additionally, the introduced sorting and computation mode selection modules make it possible to exploit token sparsity. Moreover, equipped with a user-friendly compiler, CSTrans-OPU enables model parsing, operation fusion, model quantization, and instruction generation and reordering directly from model files. To the best of our knowledge, CSTrans-OPU is the first overlay processor for transformer networks that considers sparsity.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionLogic rewriting is a critical and time-consuming task in logic synthesis that determines the area and delay of the synthesized circuit. However, existing parallel solutions for this task suffer from limitations in runtime or quality on large-scale complex circuits. In this paper, we propose a divide-and-conquer parallel approach named DACPara for high-quality logic rewriting in large-scale circuits. Specifically, after the AIG nodes are divided by level, dynamic global information is used to divide and conquer rewriting into three stages for parallel processing. Experiments show that DACPara using 40 physical CPU cores can be 34.36x and 1.96x faster than logic rewriting in ABC and the state-of-the-art CPU-parallel method on large benchmarks, respectively, with comparable quality of results. Also, for large-scale complex benchmarks, ours achieves a 1.1% quality improvement compared with the state-of-the-art GPU-accelerated method.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith the increasing complexity and size reduction of System-on-Chips (SoCs), evaluating diverse design rules becomes crucial in early-stage Design-Technology Co-Optimization (DTCO) and initial Performance, Power, and Area (PPA) assessments.
In response to these challenges, this paper presents a novel workflow for problem detection at early nodes, specifically tailored for Samsung Foundry. This task is relatively straightforward for experienced engineers but poses challenges for beginners, leading to time-consuming and error-prone processes.
This innovative workflow leverages DesignDash, an advanced data visualization and machine intelligence-based design optimization solution by Synopsys. DesignDash facilitates efficient data collection and visualization.
The proposed early node DTCO/PPA workflow focuses on problem detection, outlining key parameters to assess when conducting an evaluation. In the initial stages of Foundry projects, various issues can arise in design kits (DK), technology files, libraries, and enablement tools, such as Fusion Compiler (FC). Addressing these challenges swiftly is imperative to shorten the schedule required for PPA forecasting.
The workflow enables engineers to assess the feasibility of library cells and implementation flows through floor-planning, placement and routing analysis. By consolidating checklist items and providing actionable insights, this approach enhances visibility and significantly reduces turnaround time.
To further streamline the process, a customized interface is integrated into the existing DesignDash framework, empowering users to swiftly identify and address issues.
This paper not only presents an optimized workflow for early node DTCO/PPA but also emphasizes the importance of knowledge sharing, encouraging the exchange of success stories.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionRecent advances in large language models have demonstrated their potential for automated generation of Verilog code from high-level prompts. Researchers have utilized fine-tuning to enhance the ability of these large language models (LLMs) in the field of chip design. However, the lack of Verilog data hinders further improvement in the quality of Verilog generation by LLMs. Additionally, the absence of a Verilog and EDA script data augmentation framework significantly increases the time required to prepare the training dataset for LLM trainers. In this paper, we propose an automated design-data augmentation framework that generates high-quality natural language descriptions of Verilog/EDA scripts. To evaluate the effectiveness of our data augmentation method, we fine-tune Llama2-13B and Llama2-7B models. The results demonstrate a significant improvement in the Verilog generation task when compared to the general data augmentation method. Moreover, the accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% on the same benchmark, and outperforms GPT-3.5 in Verilog repair and EDA script generation with only 13B parameters.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionHigh-Level Synthesis (HLS) has played a pivotal role in making FPGAs accessible to a broader audience by facilitating high-level device programming and rapid microarchitecture customization through the use of directives. However, manually selecting the right directives can be a formidable challenge for programmers lacking a hardware background. This paper introduces an ultra-fast, knowledge-based HLS design optimization method that automatically extracts and applies the most promising directive configurations to the original source code. This optimization approach is entirely data-driven, offering a generalized HLS tuning solution without reliance on Quality of Result (QoR) models or meta-heuristics. We design, implement, and evaluate our methodology using over 100 applications sourced from well-established benchmark suites and GitHub repositories, all running on a Xilinx ZCU104 FPGA.
The results are promising, including average geometric mean speedups of 1.35× and 7.2× compared to over-provisioning and designer-optimized designs, respectively. Additionally, it demonstrates a high design feasibility score and maintains an average inference latency of 38 ms. Comparative analysis with traditional genetic algorithm-based Design Space Exploration (DSE) methods and State-of-the-Art (SoA) approaches reveals that it produces designs of similar quality but at speeds 2-3 orders of magnitude faster. This suggests that it is a highly promising solution for ultra-fast and automated HLS optimization.
Research Manuscript
EDA
Test, Validation and Silicon Lifecycle Management
DescriptionAccurate minimum operating voltage (Vmin) prediction is a critical element in manufacturing tests. Conventional methods lack coverage guarantees in interval predictions. Conformal Prediction (CP), a distribution-free machine learning approach, excels in providing rigorous coverage guarantees for interval predictions. However, standard CP predictors may fail due to a lack of knowledge of process variations. We address this challenge by providing principled conformalized interval prediction in the presence of process variations with high data efficiency, where a few additional chips are utilized for calibration. We demonstrate the superiority of the proposed method on industrial 16nm chip data.
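The coverage guarantee the abstract refers to comes from split conformal prediction, which can be sketched as follows. The "model" and data here are toy stand-ins, not the paper's Vmin predictor or industrial chip data; only the quantile construction is the standard CP recipe.

```python
# Split conformal prediction sketch: residuals on a held-out calibration set
# give a quantile that yields ~90% coverage with no distributional assumptions.
import math
import random

random.seed(1)
model = lambda x: 0.5 + 0.01 * x                  # stand-in Vmin predictor (volts)
truth = lambda x: model(x) + random.gauss(0, 0.02)  # noisy "measured" Vmin

# Calibration set: a few extra measured chips, as in the abstract.
calib_x = [random.uniform(0, 10) for _ in range(200)]
scores = sorted(abs(truth(x) - model(x)) for x in calib_x)  # nonconformity scores

alpha = 0.1                                        # target 90% coverage
k = math.ceil((len(scores) + 1) * (1 - alpha))     # conformal quantile index
q = scores[k - 1]

x_new = 5.0
print(f"Vmin interval: [{model(x_new) - q:.3f}, {model(x_new) + q:.3f}] V")
```

The guarantee holds whenever calibration and test chips are exchangeable; the paper's contribution is making this work under process variations, where that assumption is strained.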
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRecent advances in DRAM technologies and large-dataset applications in data centers make both academic and industrial researchers eager to explore DRAM's novel usages and cross-disciplinary DTCO (design and technology co-optimization) spaces, as illustrated by recent studies of PIM or the RowHammer effect. This evolving landscape has created a pressing need for systematic testing and validation of these emerging DTCO studies. To meet this demand, we introduce DATIS (DRAM Architecture and Technology Integrated Simulator), a tool that effectively connects architectural design and the complexities of DRAM technology. DATIS addresses two critical challenges: abstracting technology intricacies and establishing connections between architectural activities and device-level process structures. This versatile tool empowers researchers to unlock the latent capabilities of DRAM and provides manufacturers with a platform to experiment with new process and architecture co-designs. We build DATIS upon Ramulator, a well-known open-source DRAM simulator for architecture-level modeling, and thus can support a wide range of DRAM specifications, including DDRx, LPDDR5, GDDR6, and HBM2/3. Our experiments demonstrate DATIS's efficacy and precision through three compelling case studies addressing pivotal facets of DRAM technology: storage, reliability, and computation.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionCommercial FPGA simulation and verification tools such as Xilinx's Vivado aid developers in swiftly identifying and rectifying bugs and issues in FPGA designs through a robust built-in debugger. The debug process ensures the correctness and development efficiency of FPGA designs. Hardening such FPGA debugger tools through testing is crucial, since engineers might misinterpret code and introduce incorrect fixes, leading to security risks. To address this issue, we propose DB-Hunter, which performs program and debug-action transformations to generate diverse and complex Verilog design files and debug actions, thoroughly testing the Vivado debugger via differential testing to detect bugs. In three months, DB-Hunter reported 18 issues, including 10 confirmed as bugs by Xilinx Support, 6 of which have been fixed.
Workshop
AI
DescriptionIn the ever-evolving domain of computational technologies, the profound impact of artificial intelligence (AI) is indisputable. The DCgAA 2024 Workshop stands at the forefront of this revolution, offering an essential platform for synergizing deep learning (DL) models with advanced hardware system designs. This second iteration of our workshop is dedicated to exploring and fortifying the symbiotic relationship between DL and hardware innovation, especially in the context of generative AI applications. Deep learning's integration across various computing sectors necessitates robust hardware solutions to amplify model performance and efficiency. However, current DL research often overlooks critical real-world computational constraints such as power efficiency, memory usage, and scalability of model sizes. This oversight limits the practical deployment of AI innovations, particularly in scenarios requiring high computational efficiency like mobile devices, AR/VR technologies, and other edge computing environments. Our workshop aims to bridge this gap by fostering discussions and research on optimizing hardware designs specifically tailored for generative AI applications. We will delve into the unique computational demands of these models and the necessity of hardware systems that can adapt to their complex requirements. This approach is pivotal for realizing the full potential of DL innovations and ensuring their effective application in real-world scenarios.
Tutorial
EDA
DescriptionA Process Design Kit (PDK) serves as the fundamental building block for integrated circuit (IC) design, playing a crucial role in transforming chip designs into silicon reality. In this exploration, we delve into the intricate world of PDKs, examining their development, quality assurance processes, and multifaceted applications.
PDK Overview:
A PDK encompasses a collection of files that meticulously describe the specifics of a semiconductor process. These files serve as essential inputs for Electronic Design Automation (EDA) tools during chip design.
Clients engage with a foundry's PDKs before production to ensure that their chip designs align with the foundry's capabilities and intended functionality.
PDK Components and Usage:
We dissect each part of the PDK and explore its role in IC design. From technology files defining design rules to parameterized cells (PCells) customizing transistors, PDKs provide critical guidance.
PDKs act as the vital link between design and fabrication, enabling seamless communication between designers and foundries.
Semiconductor Process Variations:
We investigate different semiconductor processes such as FinFET, SOI, GAA/Back metal, and Silicon photonics. Each process has unique requirements, and PDKs tailor their contents accordingly.
The respective PDKs support these technologies by providing essential information for successful chip fabrication.
EDA Tool Ecosystem and PDK Integration:
We briefly explore the EDA tool landscape, discussing tools used at various design stages. These tools rely on accurate PDK data to generate layouts, verify designs, and simulate performance.
Standardized interfaces across diverse technology platforms enhance PDK usability.
Effective PDK Utilization:
Tips and tricks for maximizing PDK features and utilities are shared. Designers can leverage these insights to streamline their workflows and achieve optimal results.
Case Studies and Impact:
We delve into real-world case studies, examining how new devices and metal stack enablement influence different PDK components.
By understanding these impacts, designers can make informed decisions during the design process.
Tutorial
EDA
DescriptionA Process Design Kit (PDK) serves as the fundamental building block for integrated circuit (IC) design, playing a crucial role in transforming chip designs into silicon reality. In this exploration, we delve into the intricate world of PDKs, examining their development, quality assurance processes, and multifaceted applications.
PDK Overview:
A PDK encompasses a collection of files that meticulously describe the specifics of a semiconductor process. These files serve as essential inputs for Electronic Design Automation (EDA) tools during chip design.
Clients engage with a foundry's PDKs before production to ensure that their chip designs align with the foundry's capabilities and intended functionality.
PDK Components and Usage:
We dissect each part of the PDK and explore its role in IC design. From technology files defining design rules to parameterized cells (PCells) customizing transistors, PDKs provide critical guidance.
PDKs act as the vital link between design and fabrication, enabling seamless communication between designers and foundries.
Semiconductor Process Variations:
We investigate different semiconductor processes such as FinFET, SOI, GAA/Back metal, and Silicon photonics. Each process has unique requirements, and PDKs tailor their contents accordingly.
The respective PDKs support these technologies by providing essential information for successful chip fabrication.
EDA Tool Ecosystem and PDK Integration:
We briefly explore the EDA tool landscape, discussing tools used at various design stages. These tools rely on accurate PDK data to generate layouts, verify designs, and simulate performance.
Standardized interfaces across diverse technology platforms enhance PDK usability.
Effective PDK Utilization:
Tips and tricks for maximizing PDK features and utilities are shared. Designers can leverage these insights to streamline their workflows and achieve optimal results.
Case Studies and Impact:
We delve into real-world case studies, examining how new devices and metal stack enablement influence different PDK components.
By understanding these impacts, designers can make informed decisions during the design process.
Research Manuscript
AI
AI/ML Algorithms
DescriptionWe present a method, referred to as Deep Harmonic Finesse (DHF), for separation of non-stationary quasi-periodic signals when limited data is available. The problem frequently arises in wearable systems in which, a combination of quasi-periodic physiological phenomena give rise to the sensed signal, and excessive data collection is prohibitive. Our approach utilizes prior knowledge of time-frequency patterns in the signals to mask and in-paint spectrograms. This is achieved through an application-inspired deep harmonic neural network coupled with an integrated pattern alignment component. The network's structure embeds the implicit harmonic priors within the time-frequency domain, while the pattern-alignment method transforms the sensed signal, ensuring a strong alignment with the network. The effectiveness of the algorithm is demonstrated in the context of non-invasive fetal monitoring using both synthesized and in vivo data. When applied to the synthesized data, our method exhibits significant improvements in signal-to-distortion ratio (26% on average) and mean squared error (80% on average), compared to the best competing method. When applied to in vivo data captured in pregnant animal studies, our method improves the correlation error between estimated fetal blood oxygen saturation and the ground truth by 80.5% compared to the state of the art.
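DHF itself is a learned deep network, but the underlying time-frequency masking idea can be illustrated with a crude, purely spectral sketch (function name, parameters, and the fixed-bandwidth mask are all illustrative assumptions, not taken from the paper):

```python
import numpy as np

def harmonic_filter(x, fs, f0, n_harm=5, width_hz=4.0):
    """Keep only narrow bands around the harmonics of f0 -- a hand-built
    stand-in for the learned harmonic masking that DHF performs."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = np.zeros_like(freqs, dtype=bool)
    for k in range(1, n_harm + 1):
        mask |= np.abs(freqs - k * f0) <= width_hz
    return np.fft.irfft(X * mask, n=len(x))
```

A quasi-periodic component at f0 survives the mask while off-harmonic interference is suppressed; the actual method additionally in-paints the masked regions with a neural network rather than simply zeroing them.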
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionDesigning intelligent, tiny devices with limited memory is immensely challenging, exacerbated by the additional memory requirement of residual connections in deep neural networks. In contrast to existing approaches that eliminate residuals to reduce peak memory usage at the cost of significant accuracy degradation, this paper presents DERO, which reorganizes residual connections by leveraging insights into the types and interdependencies of operations across residual connections. Evaluations were conducted across diverse model architectures designed for common computer vision applications. DERO consistently achieves peak memory usage comparable to plain-style models without residuals, while maintaining the accuracy of the original models with residuals.
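The memory pressure that residual connections create can be seen with a toy peak-activation model (the sizes and the liveness rule below are my simplifications for illustration, not DERO's actual analysis):

```python
def peak_memory(layer_sizes, residual_spans=()):
    """Rough peak-activation estimate: at each layer boundary the input and
    output buffers coexist; a residual connection additionally keeps its
    source buffer live until the join. Sizes are in arbitrary units."""
    peak = 0
    for i in range(len(layer_sizes) - 1):
        live = layer_sizes[i] + layer_sizes[i + 1]
        for (s, e) in residual_spans:
            if s < i and i + 1 <= e:  # skip tensor from layer s still live
                live += layer_sizes[s]
        peak = max(peak, live)
    return peak
```

In this model a plain chain of equal-sized layers peaks at two buffers, while a residual spanning the chain peaks at three, which is the gap DERO's reorganization aims to close.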
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
DescriptionWe present DeepRIoT, a continuous integration and continuous deployment (CI/CD) based architecture that accelerates the learning and deployment of a Robotic-IoT system trained from deep reinforcement learning (RL). We adopted a multi-stage approach that agilely trains a multi-objective RL controller in the simulator. We then collected traces from the real robot to optimize its plant model, and used transfer learning to adapt the controller to the updated model. We automated our framework through CI/CD pipelines, and finally, with low cost, succeeded in deploying our controller in a real F1tenth car that is able to reach the goal and avoid collision from a virtual car through mixed reality.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionMulti-scale deformable attention (MSDeformAttn) has emerged as a key mechanism in various vision tasks, demonstrating explicit superiority attributed to multi-scale grid-sampling. However, this newly introduced operator incurs irregular data access and enormous memory requirement, leading to severe PE under-utilization. Meanwhile, existing approaches for attention acceleration cannot be directly applied to MSDeformAttn due to lack of support for this distinct procedure. Therefore, we propose a dedicated algorithm-architecture co-design dubbed DEFA, the first-of-its-kind method for MSDeformAttn acceleration. At the algorithm level, DEFA adopts frequency-weighted pruning and probability-aware pruning for feature maps and sampling points respectively, alleviating the memory footprint by over 80%. At the architecture level, it explores the multi-scale parallelism to boost the throughput significantly and further reduces the memory access via fine-grained layer fusion and feature map reusing. Extensively evaluated on representative benchmarks, DEFA achieves 10.1-31.9× speedup and 20.3-37.7× energy efficiency boost compared to powerful GPU platforms. It also rivals the related accelerators by 2.2-3.7× energy efficiency improvement while providing pioneering support of MSDeformAttn.
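As a rough illustration of feature-map pruning, plain magnitude top-k is used here only to show the memory effect; DEFA's frequency-weighted and probability-aware criteria are more elaborate:

```python
import numpy as np

def prune_topk(fmap, keep=0.2):
    """Keep only the top `keep` fraction of activations by magnitude,
    zeroing the rest -- an 80% memory-footprint reduction at keep=0.2."""
    flat = np.abs(fmap).ravel()
    k = max(1, int(keep * flat.size))
    thresh = np.partition(flat, -k)[-k]  # k-th largest magnitude
    mask = np.abs(fmap) >= thresh
    return fmap * mask, mask
```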
Research Manuscript
AI
Security
AI/ML Security/Privacy
DescriptionAdversarial patch-based attacks pose a serious threat to the reliable use of machine learning models. These attacks involve the strategic modification of localized patches or specific image areas to deceive trained machine learning models. In this paper, we propose DefensiveDR, a practical mechanism using a dimensionality reduction technique to thwart such patch-based attacks. Our method involves projecting the sample images onto a lower-dimensional space while retaining essential information or variability for effective machine learning tasks. We perform this using two techniques, Singular Value Decomposition and t-Distributed Stochastic Neighbour Embedding, and experimentally tune the variability to be preserved as a hyper-parameter for optimal performance. This dimension reduction substantially mitigates adversarial perturbations, thereby enhancing the robustness of the given machine learning model. Our defense is model-agnostic and operates without assumptions about access to model decisions or model architectures, making it effective in both black-box and white-box settings. Furthermore, it maintains accuracy across various models and remains robust against several unseen patch-based attacks. The proposed defensive approach improves the accuracy from 38.8% (without defense) to 66.2% (with defense) under LaVAN and GoogleAp attacks, surpassing prominent state-of-the-art defenses such as LGS (53.86%) and Jujutsu (60%).
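The SVD branch of the idea can be sketched in a few lines: project each image onto its top-k singular components and feed the rank-k reconstruction to the model. Here k plays the role of the tuned variability hyper-parameter; the function name and the claim in the comment are mine, not the paper's:

```python
import numpy as np

def svd_reduce(img, k):
    """Rank-k approximation of a 2D image: small, localized adversarial
    patches tend to live largely in the discarded low-energy components."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```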
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPrivacy concerns arise from malicious attacks on Deep Neural Network (DNN) applications during sensitive data inference on edge devices. Our proposed defense method addresses limitations in existing Trusted Execution Environments (TEEs) by employing depth-wise layer partitioning for large DNNs and a model quantization strategy. This enhances protection against both white-box and black-box Membership Inference Attacks (MIAs) while accelerating computation. Experiments on Raspberry Pi 3B+ demonstrate significant reductions in white-box MIA accuracy (up to 35.3%) and black-box MIA accuracy (up to 29.6%) for popular DNN models (AlexNet, VGG-16, ResNet-20) on CIFAR-100 dataset.
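The quantization half of the defense can be illustrated generically; the paper's exact scheme is not given here, so this is plain uniform affine quantization:

```python
import numpy as np

def affine_quantize(w, bits=8):
    """Uniform affine quantization of a tensor to `bits` bits -- a generic
    sketch, not necessarily the paper's scheme."""
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo
```

Round-to-nearest bounds the per-element reconstruction error by half a step (scale/2), which is why error-tolerant DNN inference survives the precision loss.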
IP
Engineering Tracks
IP
DescriptionIt took sixty years for the semiconductor industry to reach $500 billion, but it is widely expected that we will hit the $1 trillion mark toward the end of this decade. This explosive growth needs to come with several shifts in the way we design chips, including attracting, educating, and enabling a whole new generation of designers. Learn from industry luminaries on their perspectives on how we can successfully hit the trillion dollar mark.
IP
Engineering Tracks
IP
DescriptionSilicon is inherently unreliable, and silicon at advanced nodes is most susceptible. DFX at advanced nodes calls for new strategies. This invited session will explore failure-mechanism driven techniques to enable reliable advanced node silicon. Key topics will include safety, security, reliability and SLM (Silicon Lifecycle Management) methodologies and frameworks that are necessary to ensure reliable silicon.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWith the rapid advances in Deep Neural Networks (DNNs), the GPU's role as a hardware accelerator becomes more and more important. Due to the GPU's significant power consumption, developing high-performance and power-efficient GPU systems is a critical challenge. DNN applications need to move a large amount of data between memory and the processing cores, which consumes a great amount of power in the on-chip network. Prior data compression techniques have been proposed for networks-on-chip to reduce the size of data being moved and thus save power. However, these techniques are usually lossless because they target general-purpose applications that are not resilient to errors. DNN applications, on the contrary, are well known to be error-resilient, which makes them good candidates for lossy compression.
In this work, we propose an NoC architecture that can reduce power consumption without compromising performance or accuracy. Our technique takes advantage of the error resilience of DNNs as well as the data locality in the exponent field of DNNs' floating-point data. Each data packet is reorganized by grouping data with similar exponents, and redundant exponents are sent only once. We further compress the mantissa fields by appropriately selecting deputy values for data sharing the same exponent. Our evaluation results show that the proposed technique can effectively reduce data transmission and lead to better performance and power tradeoffs without losing accuracy.
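The exponent-grouping idea maps directly onto the IEEE-754 single-precision layout (1 sign bit, 8 exponent bits, 23 mantissa bits). A back-of-the-envelope packer shows the saving; the bit accounting and function names are mine, not the paper's packet format:

```python
import numpy as np

def pack_by_exponent(values):
    """Group float32 values by their 8-bit exponent field; each exponent
    would be transmitted once per group instead of once per value."""
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    exps = (bits >> 23) & 0xFF                # 8-bit exponent field
    rest = bits & ~np.uint32(0xFF << 23)      # sign + 23-bit mantissa
    groups = {}
    for e, r in zip(exps.tolist(), rest.tolist()):
        groups.setdefault(e, []).append(r)
    return groups

def packed_bits(groups):
    # 8 bits per shared exponent + 24 bits (sign + mantissa) per value
    return sum(8 + 24 * len(v) for v in groups.values())
```

With four values spread over two exponent groups, the packed size drops from 128 bits to 112 even before any mantissa compression.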
Tutorial
AI
DescriptionThe advent of Large Language Models (LLMs) and generative AI has introduced uncertainty into the operation of autonomous systems, with significant implications for safe and secure operation. This has led to the US government directive on assurance and testing of the trustworthiness of AI. This tutorial aims to introduce the audience to the arising safety issues of AI-enabled autonomous systems (AAS) and how they affect dependable and safe design for real-life deployments. With the advent of LLMs and deep AI methods, AAS are becoming vulnerable to uncertainties. It will introduce a new human-in-the-loop, human-in-the-plant design philosophy geared towards assured certifiability in the presence of human actions and AI uncertainties while reducing data sharing between the AAS manufacturer and certifier. We will provide a landscape of informal and formal approaches to ensuring AI-based AAS safety at every phase of the design lifecycle, defining the gaps, current research to fill those gaps, and tools for detection of commonly occurring software failures such as doping. This tutorial also aims to emphasize the need for operational safety of AI-based AAS and highlight the importance of explainability at every stage for enhancing trustworthiness. There has been significant research in the domain of model-based engineering attempting to solve this design problem. Observations from the deployment of an AAS are used to: a) ascertain whether the AAS used in practice matches the proposed safety-assured design, b) explain reasons for a mismatch between AAS operation and the safety-assured design, c) generate evidence to establish the trustworthiness of an AAS, and d) generate novel practical scenarios where an AAS is likely to fail.
- Relevance, target audience, and interest for the DAC community
AI has been widely adopted in different domains, including autonomous vehicles and IoT medical devices. In a competitive environment, engineers and researchers are focused on developing innovative applications, while minimal attention is paid to safety engineering techniques that cope with the fast pace of technological advances. As a result, recent failures and operational accidents of AI-based systems highlight a pressing need for the development of suitably stringent safety monitoring techniques. We advocate for a change in the linear AAS development lifecycle of design, validation, implementation, and verification by incorporating feedback from the field of operation. This will result in a circular AAS development lifecycle, where operational data can be used to identify novel states and fed back into the design. This will enable an agile, proactive redesign policy that can predict failures and propose techniques to circumvent safety risks. The tools used in this circular lifecycle will provide interpretable reports to the appropriate stakeholders, such as certification agencies, developers, and users, at different stages. This tutorial directly relates to the Autonomous Systems and ML topics of DAC.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionToday's high-performance, low-power, state-of-the-art designs are getting more complex, with a significant presence of analog and mixed-signal blocks. As a result, continuous innovation in the analog design automation space is becoming crucial. This session focuses on various aspects of that domain. The first talk covers the power/signal/thermal/reliability integrity challenges for three emerging 3D heterogeneous designs. The second talk describes recent developments in an open-source analog layout automation flow that has been applied to various design types and technology nodes. The third presentation will attempt to predict if and when analog chip design can become wholly autonomous. The final speaker will review Siemens EDA's latest production-proven methods for transistor-level verification, used to create consistently accurate AI-derived answers for measuring chip variability and generating timing models.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIntel's latest microprocessors are built by chiplet assembly over a Foveros passive silicon interposer. The second generation based on this technology targeted more aggressive cost optimization, featuring a reduced interposer layer count and decoupling density. Stronger design automation flows were developed, minimizing manual layout labor and tuned to address the inherent challenges of a reduced-layer-count interposer. The design automation flow consists of a few key stages, some logistical and some carrying the algorithms for layout synthesis. Among those synthesis algorithms is automatic voltage-area generation, which sets the regions for power delivery grid stenciling and decoupling spread. Another is pad-to-pad robust via connectivity, built to withstand slight offsets between the interposer bumps and to mediate the connectivity of the pad with the rest of the power delivery grid. Although some manual user interventions are allowed, and some manual or semi-manual layout editing is recommended, all manual steps are registered and archived to allow automatic re-run iterations. Finally, the full database can be built within a few hours of uninterrupted flow. The database meets all design criteria (manufacturing, reliability, timing), minimizes the need for manual layout design, and meets schedule thanks to efficient run times.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWe present an efficient methodology for reducing overall turnaround time during the physical design implementation phase of SoC design, implemented via automated stagewise checkers and an easy-to-review dashboard. Smooth project execution involves data gathering during initial design planning. The idea is to closely inspect the QoR of every physical design implementation stage and rework to achieve the best possible results before passing on to the next stage.
This helps achieve easier design closure at the final stage and effectively reduces the huge runtimes usually seen toward the last stages, as more and more design components stack up on the SoC.
The pain point has been clearly identified, and the proposed solution produces the desired outcomes. The stagewise data can be viewed in a user-friendly dashboard, which further simplifies tracking and review and eliminates the scope for human error. During the initial design phase, working with dirty data leads to multiple check failures, which can be reviewed and waived initially but must be attended to during final closure. All of this is captured and tracked. The dashboard is quite helpful in reviewing final design closure: it clearly shows the reviewer what passed, failed, or was waived at each stage, and every waiver carries its justification along with approver details.
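A minimal data model for such a pass/fail/waive dashboard might look like the following; this is entirely illustrative, as the paper's internal tooling is not public:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool
    waived: bool = False
    justification: str = ""  # required for every waiver, with approver details

def stage_summary(results):
    """Bucket one stage's checker results the way a review dashboard would:
    passed / failed / waived."""
    out = {"passed": [], "failed": [], "waived": []}
    for r in results:
        if r.passed:
            out["passed"].append(r.name)
        elif r.waived:
            out["waived"].append(r.name)
        else:
            out["failed"].append(r.name)
    return out
```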
Front-End Design
AI
Design
Engineering Tracks
Front-End Design
DescriptionMultiple and cascaded clock MUX structures are common in complex RTL/digital designs. Existing CDC static verification, synthesis, and static timing analysis (STA) tools have a hard time analyzing the operating modes of clock multiplexers (clock MUXes), especially glitch-free switching circuits that may not resemble a simple multiplexer to the tools. For such clock MUX structures, relevant design constraints (SDCs) must be added to guide the static tools. In this proposal we present a series of well-defined generated-clock and logically exclusive constraints that cover all combinations of clock propagation through any complex clock MUX structure in the design. This enables smooth and accurate CDC, synthesis, and timing verification of the design, considering all clock inputs to the MUX. The technique yields accurate results from an STA and synthesis perspective and reduces the number of case-analysis modes needed (each of which essentially propagates only one fixed clock through a clock MUX), doing the heavy lifting of all analysis modes in one go and thereby reducing turnaround time considerably. Covering all clock combinations is now possible in a single CDC static verification run, which also eliminates the noise in CDC reports previously caused by overlapping violations across analysis modes. With such promising results, there is also scope to automate this technique, provided the relevant data is available as input.
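For a 2:1 glitch-free MUX, the constraint pattern described above typically combines `-add` generated clocks on the MUX output with a logically exclusive clock group. The instance and clock names below are illustrative assumptions, not taken from the proposal:

```tcl
# Hypothetical glitch-free clock MUX u_ckmux: inputs CK0/CK1, output CKOUT,
# with clk0/clk1 already defined on the fan-in.
create_generated_clock -name ckmux_clk0 -divide_by 1 -add \
    -source [get_pins u_ckmux/CK0] -master_clock clk0 [get_pins u_ckmux/CKOUT]
create_generated_clock -name ckmux_clk1 -divide_by 1 -add \
    -source [get_pins u_ckmux/CK1] -master_clock clk1 [get_pins u_ckmux/CKOUT]
# Only one of the two can physically propagate at any time:
set_clock_groups -logically_exclusive \
    -group {ckmux_clk0} -group {ckmux_clk1}
```

Because both generated clocks propagate concurrently, a single run covers the modes that would otherwise each need their own `set_case_analysis` configuration.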
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn recent years, CMOS scaling has become slower than it used to be, and following Moore's law has become more challenging. One of the proposed scaling boosters is system-level 3D-IC, where the vertical dimension is used by stacking dies on top of each other. Fine-pitch 3D interconnects such as wafer-to-wafer hybrid bonding (10–1 µm pitch) leverage the benefits of 3D-IC by reducing wire-length connections, hence improving PPA compared to the 2D counterparts.
Thermal hotspots become more challenging with 3D-IC for two reasons. First, reducing die footprint in 3D stacks leads to increasing the power density. Second, bringing two dies in close proximity leads to heat confinement and poor heat dissipation.
In this work, a thorough 3D thermal analysis is performed on the MemPool design. A face-to-face 3D stack of the MemPool design shows a maximum temperature increase of 30°C compared to the 2D counterpart configuration under static power conditions. The increased temperature escalates static leakage power and resistance, causing total grid resistance and maximum IR drop to rise by 2.8% and 4.7%, respectively.
Cadence Celsius thermal solver and Voltus™ IC power integrity are used in the electrical-thermal co-simulation presented in this work.
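The leakage-temperature feedback at the heart of such an electrical-thermal co-simulation can be captured by a toy fixed-point iteration; all coefficients below are made up for illustration, whereas the paper uses full field solvers:

```python
import math

def thermal_fixed_point(p_dyn, p_leak0, alpha, r_th,
                        t_amb=25.0, t_ref=25.0, iters=50):
    """Toy electrical-thermal loop: leakage grows exponentially with
    temperature, temperature grows with total power through a lumped
    thermal resistance; iterate to the self-consistent operating point."""
    t = t_amb
    p = p_dyn
    for _ in range(iters):
        p = p_dyn + p_leak0 * math.exp(alpha * (t - t_ref))
        t = t_amb + r_th * p
    return t, p
```

The iteration converges whenever the leakage-thermal loop gain is below one; runaway (no fixed point) is the lumped-model analogue of a thermal hotspot the solvers must flag.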
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn advanced technology nodes, the difference in the ratio of cell-height scaling to interconnect scaling has resulted in local routing congestion in the low-level metal layers. This congestion is one of the bottleneck factors in node scaling. In this paper, we address two approaches to alleviate local routing congestion in the low-level metal layers: (1) increasing pin access points by utilizing the middle-of-line (MOL) layer as a pin of the standard cell, and (2) minimizing local interconnections by merging repetitive logic combinations. We propose an efficient method for preparing standard cells that offer routability gains, as well as case-by-case equivalent-cell swapping expected to enhance routability during the placement and routing (P&R) stage. Our experiments show a 1.82% and 0.6% block area gain for MOL pin routing and merged logic cells, respectively. We demonstrate that alleviating local routing congestion in lower-level metal layers is key to interconnect scaling.
Research Manuscript
Design
Quantum Computing
DescriptionSearch algorithms based on quantum walks have emerged as a promising approach to solve computational problems across various domains, including combinatorial optimization and cryptography. Stating a generic search problem in terms of a (quantum) search over a graph makes the efficiency of the algorithmic method depend on the structure of the graph itself. In this work, we propose a complete implementation of a quantum walk search on Johnson graphs, speeding up the solution of the subset-sum problem. We provide a detailed design of each sub-circuit, quantifying their cost in terms of gate number, depth, and width. We additionally compare our solution against a Grover quantum search approach, showing a reduction of the T-depth cost compared to it. The proposed design provides a building block for the construction of efficient quantum search algorithms that can be modeled on Johnson graphs, filling the gap with the existing theoretical complexity analyses.
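A Johnson graph J(n, k) has the k-subsets of an n-element set as vertices, with edges between subsets that share k−1 elements; a direct construction makes the structure the search walks over concrete:

```python
from itertools import combinations

def johnson_graph(n, k):
    """Adjacency map of the Johnson graph J(n, k): vertices are k-subsets of
    {0..n-1}; two vertices are adjacent iff their intersection has size k-1."""
    verts = [frozenset(c) for c in combinations(range(n), k)]
    return {v: [u for u in verts if len(v & u) == k - 1] for v in verts}
```

J(n, k) is regular of degree k(n−k). In the subset-sum setting, vertices correspond to candidate subsets and the quantum walk mixes over this graph.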
Analyst Presentation
DescriptionThe presentation will cover what is required to design an ASIC for the Generative AI Era. It will cover the compute, networking, and memory constraints of generative AI as well as what companies are doing to push beyond it with optics, packaging, and system level design.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionCeva SensPro AI & vision DSPs are embedded in automotive chips and in chips for many other markets that require functional safety. Showing that the DSP meets the safety metrics set by ISO 26262 for the relevant ASIL is critical across these markets as they move toward advanced ADAS and autonomous driving solutions. The STL (software test library) is a state-of-the-art SW-based safety mechanism that allows the DSP to address ASIL KPIs specifically targeting permanent HW faults.
The DSP cores are developed as SEooC (generic off-the-shelf IP) with ASIL B integrity level. This paper discusses the challenges in developing the STL with RTL design and fault injection analysis, achieving the SPFM for identified modules based on the Technical Safety Requirements. Also highlighted are challenges involving collaboration between multiple teams: SW, HW, PM, and Safety. An ISO 26262-certified fault injection analysis tool and methodology is used to generate results and customized reports to iterate effectively and improve the STL toward the required SPFM targets.
Customized reports from the tool have helped clearly communicate the improvements needed for the STL to reach the target diagnostic coverage and SPFM: faults blocked by various configuration registers, which constant flops need to be toggled to make the STL more effective, the time to detect a fault after injection, and the SPFM for individual modules and all combined.
Embedded Systems and Software
AI
Embedded Systems
Engineering Tracks
DescriptionDue to the difficulty of supporting open-source projects in the embedded-system environment and limitations in applying fuzzing technology, we used a SystemC-based full-path SSD VP (virtual platform), but it was not easy to apply owing to performance issues and many functions unnecessary for SED verification.
Accordingly, we developed a Security VP by removing the unnecessary parts and introduced a state-machine-based libFuzzer harness to address the problem.
As a result, the execution time of the Security VP was reduced to 82.5% of the full-path VP's, and code coverage improved by 15.2%. In particular, the number of commands needed to reach specific coverage was reduced by 98.7%, and the number of commands needed to reach overall coverage was reduced by 85.4%.
Research Manuscript
EDA
Physical Design and Verification
DescriptionModern VLSI design flows necessitate fast and high-quality global routers. In this paper, we introduce DGR, a GPU-accelerated, differentiable global router capable of concurrent optimization for millions of nets, which we aim to open-source. Our innovation lies in the development of a routing Directed Acyclic Graph (DAG) forest to represent the 2D pattern routing space for all nets, enabling coordinated selection of Steiner trees and 2-pin routing paths from a global perspective. For efficient search within the DAG forest, we relax the discrete search space to be continuous and develop a differentiable solver accelerated by deep learning toolkits on GPUs. Experimental results demonstrate that DGR substantially mitigates routing overflow while concurrently reducing total wirelengths from 0.95% to 4.08% and via numbers from 1.28% to 2.54% in congested testcases compared to state-of-the-art academic global routers. Additionally, DGR exhibits favorable scalability in both runtime and memory with respect to the number of nets.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionAs a vital security primitive, the true random number generator (TRNG) is a mandatory component for building roots of trust in any encryption system. However, existing TRNGs suffer from bottlenecks of low throughput and high area-energy consumption. In this work, we propose DH-TRNG, a dynamic hybrid TRNG circuitry architecture with ultra-high throughput and area-energy efficiency. Our DH-TRNG exhibits portability to FPGAs of distinct processes and passes both NIST and AIS-31 tests without any post-processing. The experiments show it incurs only 8 slices with the highest throughput of 670 Mbps and 620 Mbps on Xilinx Virtex-6 and Artix-7, respectively. Compared to state-of-the-art TRNGs, our proposed design has the highest Throughput/(Slices·Power), with a 2.63× increase.
Back-End Design
Back-End Design
Design
Engineering Tracks
Description- Microcontroller designs are undergoing optimizations on multiple scales to win market share, be it performance, MIPS, feature set, more peripheral access, or more systems on chip supporting more applications.
- Simultaneously supporting varied applications and winning customers mandate a higher count of pins/GPIOs compared to predecessors and competitors, which indirectly means sacrificing the count of power/ground pins.
- For our design, we set a target of reducing power pins by 40% on the core and IO power supplies, thereby providing more pins for GPIOs. Along with this, the supply tolerance was increased.
- The reduced count of power/ground pins, increased functionality, and higher supply tolerance on the IO supply have negative impacts on IR drop, timing, signal integrity (SI), and hence overall design performance.
- Industrial solutions such as PVT compensation circuits and programmable drive cells would solve the SI problem; however, they bring higher area overhead.
- In this paper we present cost-effective (area) design strategies that were incorporated to keep all the above design vectors within reasonable limits without area penalties. We showcase the signal-integrity aspects of the GPIO pads and how they were addressed.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs silicon scaling continues into the Angstrom domain, dimension scaling slows down; however, silicon features and compute power still continue to increase at a rate comparable to Moore's Law. Power integrity is now becoming a key challenge for sub-nanometer processes. Die-level power-integrity signoff is normally done at the final stage of IC design, whereas power-plan design and synthesis are done before automated place-and-route (APR). A key conundrum for power-plan design and synthesis is the lack of reference information, especially for new IP such as a latest high-performance CPU. A tapeout-quality IR signoff for APR requires a post-routed (final-stage) APR database, a few post-layout simulation patterns that exercise the logic to draw current from the on-die power grid in a near-realistic worst-case manner, and an optimized package model that describes the package ball-to-bump impedance. All three critical inputs for IR analysis become available only at the end of the IC implementation process, posing risk to the tapeout schedule and possible IC failures due to severe IR drop. In this presentation, we demonstrate how Sigma-DVD resolves this conundrum, allowing our engineers to identify dynamic-IR hotspots without end-of-stage functional patterns, and hence "shift left" to strengthen the power plan on potential weak spots before they are identified too late in the implementation process.
Research Manuscript
Design
Emerging Models of Computation
DescriptionCombinatorial optimization problems (COP) are NP-hard and intractable to solve using conventional computing. The Ising model-based annealer has gained increasing attention recently due to its efficiency and speed in finding approximate solutions. However, Ising solvers for travelling salesman problems (TSP) usually suffer from a scalability issue due to the quadratically increasing number of spins. In this paper, we propose a digital computing-in-memory (CIM) based clustered annealer to solve TSPs at the scale of tens of thousands of cities with only a few megabytes (MB) of static random access memory (SRAM), using hierarchical clustering to handle input sparsity and digital CIM flexibility to handle weight sparsity. The intrinsic process variations between SRAM devices are utilized to generate noisy bit errors during pseudo-read under reduced supply voltage, realizing the annealing process. The design space of cluster size and programmability is explored to understand the trade-offs between solution quality and hardware cost, for TSP scales ranging from 3080 to 85900 cities. The proposed design speeds up convergence by >10^9× with <25% solution quality overhead compared with the CPU baseline. The comparison with state-of-the-art scalable annealers shows a >10^13× improvement in functionally normalized area and power.
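The annealing mechanism the abstract describes (noisy bit errors acting as thermal noise) can be sketched with a plain simulated-annealing loop on a tiny Ising instance. The couplings, cooling schedule, and problem size here are invented for illustration and are unrelated to the paper's SRAM-based hardware.

```python
import math
import random

def ising_energy(spins, J):
    # E = -sum_{i<j} J[i][j] * s_i * s_j  (no external field)
    n = len(spins)
    return -sum(J[i][j] * spins[i] * spins[j]
                for i in range(n) for j in range(i + 1, n))

def anneal(J, steps=2000, t0=2.0, seed=0):
    # Random spin flips play the role the paper assigns to SRAM pseudo-read
    # bit errors: accept uphill moves with a temperature-dependent probability.
    rng = random.Random(seed)
    n = len(J)
    spins = [rng.choice([-1, 1]) for _ in range(n)]
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-3      # linear cooling schedule
        i = rng.randrange(n)
        flipped = spins[:]
        flipped[i] = -flipped[i]
        dE = ising_energy(flipped, J) - ising_energy(spins, J)
        if dE <= 0 or rng.random() < math.exp(-dE / t):
            spins = flipped
    return spins

# Ferromagnetic couplings: the ground state is all spins aligned.
n = 6
J = [[1.0 if i != j else 0.0 for j in range(n)] for i in range(n)]
spins = anneal(J)
print(spins)  # expected to settle into an all-aligned configuration
```

A TSP mapping would replace this toy ferromagnet with couplings that encode tour validity and distance, which is where the quadratic spin blow-up the abstract mentions comes from.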
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAbstract :
Digital continuity plays a crucial role in the modern semiconductor industry. Semiconductor design and manufacturing complexity includes the need for sophisticated process control, rigorous quality assurance, and seamless integration with upstream and downstream operations. This topic explores the concept of digital continuity from design to manufacturing through a model-based approach, showing its impact on optimizing the entire product development and manufacturing lifecycle, end-to-end traceability, and resource optimization. In this paper, we ensure seamless information flow from the engineering BOM to the manufacturing BOM to the Bill of Process (BOP) and resources. The integration of these components is imperative for efficient design and product lifecycle management. The paper also explores the role of Product Lifecycle Management (PLM), planning optimization, and lifecycle management.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionIn VLSI design, accurate pre-routing timing prediction is paramount. Traditional machine learning-based methods require extensive data, posing challenges for advanced technology nodes due to the time-consuming data preparation. To mitigate this issue, we propose a novel transfer learning framework that uses data from previous nodes for learning on the target node. Our method initially disentangles and aligns timing path features across different nodes, then predicts each path's arrival time employing a Bayesian-based model capable of handling highly variable arrival time and generalizing to new designs. Experimental results on transfer learning from 130nm to 7nm nodes validate our method's effectiveness.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe remarkable advancements in neural networks' precision have ignited a revolution in their architecture, demanding ever-expanding memory and computational resources. As we confront the limitations posed by current hardware, such as memory and processing capabilities, one innovative solution emerges: the distribution of neural network model inference across multiple devices. Most prior efforts have focused on optimizing single-device inference or on partitioning models to enhance inference throughput. This work proposes a framework that searches for optimal model splits and distributes the partitions across a given set of devices, taking both throughput and energy into consideration. Participating devices are strategically grouped into homogeneous and heterogeneous clusters consisting of general-purpose CPU and GPU architectures, as well as emerging compute-in-memory (CIM) accelerators. The framework simultaneously optimizes inference throughput and energy consumption with a weighting control parameter. Compared to the performance of a single GPU, it achieves up to 4× speedup with approximately 4× per-device energy reduction in a heterogeneous setup. The algorithm also finds a smooth Pareto-like curve in the throughput-energy space for CIM devices.
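The weighted throughput-energy objective mentioned in the abstract can be illustrated with a toy scorer over hypothetical candidate splits. The device names, throughput, and energy numbers below are invented, and the real framework searches over model partitions rather than scoring a fixed list.

```python
# Hypothetical per-split estimates for distributing a model across devices:
# (name, throughput in inferences/s, energy in J/inference). Values invented.
candidates = [
    ("all-on-GPU",          120.0, 8.0),
    ("CPU+GPU split",       150.0, 6.5),
    ("GPU+CIM split",       200.0, 3.0),
    ("three-way CIM split", 160.0, 2.0),
]

def score(throughput, energy, alpha):
    # alpha in [0, 1] weights throughput against energy, mirroring the
    # abstract's single weighting control parameter. Both terms are
    # normalized so they are on comparable scales.
    t_max = max(t for _, t, _ in candidates)
    e_min = min(e for _, _, e in candidates)
    return alpha * (throughput / t_max) + (1 - alpha) * (e_min / energy)

def best_split(alpha):
    return max(candidates, key=lambda c: score(c[1], c[2], alpha))[0]

print(best_split(1.0))   # pure throughput objective
print(best_split(0.0))   # pure energy objective
```

Sweeping `alpha` from 0 to 1 traces out the kind of Pareto-like throughput-energy front the abstract reports.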
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn recent years, the emphasis on model fairness in AI applications for edge devices has grown. However, traditional AI model optimization has revolved around accuracy and efficiency, formulating it as a two-objective problem. Consequently, the fairness of the model often goes unaddressed, potentially leading to unjust treatment of minorities. To rectify this oversight, it is imperative to include fairness in model optimization, making it a three-objective problem in terms of accuracy, efficiency, and fairness. Examining existing methods, we found that the weight distribution affects both efficiency and fairness, but these two metrics are always considered separately. Confronting this obstacle, we propose a novel optimization framework, FAIST, which calibrates a fair model by controlling the weight distribution to optimize fairness, efficiency, and accuracy simultaneously. We first devise an optimization algorithm that guides training to generate model weights following a desired distribution. Then, we integrate the optimizer into a reinforcement learning process to identify distribution hyperparameters that yield high performance. Evaluation on dermatology and face-attribute datasets demonstrates FAIST's simultaneous improvements, with a notable 27.24% fairness improvement on the ISIC2019 dataset.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionThis study introduces a refined Flooding Injection Rate-adjustable Denial-of-Service (DoS) model for Networks-on-Chip (NoCs) and, more importantly, presents DL2Fence, a novel framework utilizing Deep Learning (DL) and Frame Fusion (2F) for DoS detection and localization. Two convolutional neural network models, for classification and segmentation, were developed to detect and localize DoS respectively. The framework achieves detection and localization accuracies of 95.8% and 91.7%, and precision rates of 98.5% and 99.3%, in a 16x16 NoC. Its hardware overhead notably decreases by 76.3% when scaling from 8x8 to 16x16, and it requires 42.4% less hardware compared to the state of the art. This advancement demonstrates DL2Fence's effectiveness in balancing outstanding detection performance in large-scale NoCs with extremely low hardware overhead.
Research Manuscript
AI
Security
AI/ML Security/Privacy
DescriptionWith deep learning deployed in many security-sensitive areas, machine learning security is becoming progressively important. Recent studies demonstrate that attackers can use system-level techniques exploiting the RowHammer vulnerability of DRAM to deterministically and precisely flip bits in deep neural network (DNN) model weights and degrade inference accuracy. Existing defense mechanisms are software-based, such as weight reconstruction, which requires expensive training overhead or incurs performance degradation. On the other hand, generic hardware-based victim-/aggressor-focused mechanisms impose expensive hardware overheads and preserve the spatial connection between victim and aggressor rows. In this paper, we present the first DRAM-based victim-focused defense mechanism tailored for quantized DNNs, named DNN-Defender, which leverages the potential of in-DRAM swapping to withstand targeted bit-flip attacks with a priority protection mechanism. Our results indicate that DNN-Defender can deliver a high level of protection, downgrading targeted RowHammer attacks to the level of random attacks. In addition, the proposed defense has no accuracy drop on the CIFAR-10 and ImageNet datasets without requiring any software training or incurring hardware overhead.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionData rotation across spatial engines is a common strategy for reusing data from neighboring on-chip engine buffers. However, the existing data rotation approach often results in inadequately sized data tiles. In this paper, we introduce DNNPhaser, a framework designed to implement a multiphase approach to 2D co-rotation. By preserving a phase difference between the rotation rings, DNNPhaser facilitates the coherent rotation of both input data rings and kernel data rings, ensuring the maintenance of appropriately sized data tiles. Experimental results demonstrate that DNNPhaser achieves a geometric mean EDP reduction of 24.8% for DNNs compared to Tangram on a spatial accelerator.
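A toy model of the phase-shifted co-rotation idea: rotate the input ring every step and the kernel ring once per full input revolution, so every engine eventually pairs each input tile with each kernel tile. Tile contents, ring sizes, and the `phase` parameter below are invented for illustration, not DNNPhaser's actual dataflow.

```python
from collections import deque

def co_rotate(inputs, kernels, phase):
    # Rotate the input ring every step and the kernel ring every `phase`
    # steps, so each engine eventually sees every (input tile, kernel tile)
    # pairing. Real tile contents and sizes are abstracted away.
    n = len(inputs)
    in_ring, k_ring = deque(inputs), deque(kernels)
    pairings = [set() for _ in range(n)]
    for step in range(n * len(kernels)):
        for engine in range(n):
            pairings[engine].add((in_ring[engine], k_ring[engine]))
        in_ring.rotate(1)
        if (step + 1) % phase == 0:
            k_ring.rotate(1)
    return pairings

pairings = co_rotate(["i0", "i1", "i2"], ["k0", "k1", "k2"], phase=3)
# With a phase difference of one full input revolution, every engine
# observes all 9 input/kernel combinations.
print(all(len(p) == 9 for p in pairings))
```

If the two rings rotated in lockstep (phase 1 on both), each engine would keep seeing the same pairings, which is the coverage problem the phase shift avoids.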
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPhotonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads, offering unparalleled speed and energy efficiency, especially in resource-limited, latency-sensitive edge computing environments. However, the deployment of analog photonic tensor accelerators encounters reliability challenges due to hardware noise and environmental variations. While off-chip noise-aware training and on-chip training have been proposed to enhance the variation tolerance of optical neural accelerators under moderate, static noise, we observe a notable performance degradation over time due to temporally drifting variations, which requires a real-time, in-situ calibration mechanism. To tackle these challenging reliability issues, we propose, for the first time, a lightweight dynamic on-chip remediation framework, dubbed DOCTOR, providing adaptive, in-situ accuracy recovery against temporally drifting noise. The DOCTOR framework intelligently monitors the chip status using adaptive probing and performs fast, in-situ, training-free calibration to restore accuracy when necessary. Recognizing nonuniform spatial variation distributions across devices and tensor cores, we also propose a variation-aware architectural remapping strategy to avoid executing critical tasks on noisy devices. Extensive experiments show that our proposed framework guarantees sustained performance under drifting variations, with 34% higher accuracy and 164x lower overhead compared to state-of-the-art on-chip training methods.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAddress translation using a logical-to-physical (L2P) mapping table is essential for NAND flash-based SSDs. Unfortunately, as SSD capacity increases, the L2P mapping table size also increases. Ac
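The L2P translation this (truncated) abstract refers to can be sketched as a simple mapping from logical to physical page numbers with out-of-place updates. The class and its methods below are an illustrative toy, not an actual flash translation layer.

```python
class L2PTable:
    """Toy logical-to-physical mapping table for a flash SSD.

    A real FTL keeps this table in DRAM and must cope with its growth as
    capacity scales; this sketch only illustrates the translation step.
    """
    def __init__(self):
        self.table = {}           # logical page number -> physical page number
        self.next_phys = 0

    def write(self, lpn):
        # Out-of-place update: each write lands on a fresh physical page
        # and the mapping entry is redirected to it.
        self.table[lpn] = self.next_phys
        self.next_phys += 1
        return self.table[lpn]

    def read(self, lpn):
        return self.table[lpn]    # raises KeyError for unmapped pages

ftl = L2PTable()
ftl.write(10)        # LPN 10 -> PPN 0
ftl.write(11)        # LPN 11 -> PPN 1
ftl.write(10)        # overwrite: LPN 10 redirected to PPN 2
print(ftl.read(10))  # 2
```

One dict entry per mapped page is exactly why the table grows linearly with SSD capacity, which is the scaling problem the abstract points at.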
Research Manuscript
Embedded Systems
Embedded Software
DescriptionThanks to the evolving network depth, convolutional neural networks (CNNs) have achieved impressive performance across various intelligent embedded scenarios towards embedded intelligence. Nonetheless, this trend also leads to degraded hardware efficiency as the network evolves deeper and deeper. In contrast, shallow networks exhibit superior hardware efficiency, which, unfortunately, suffer from inferior accuracy. To tackle this dilemma, we establish the first deep-to-shallow transformable neural architecture search (NAS) paradigm, namely Double-Win NAS (DW-NAS), which is dedicated to automatically exploring deep-to-shallow transformable networks to marry the best of both worlds. Extensive experiments on two NVIDIA Jetson intelligent embedded systems clearly show the superiority of DW-NAS over previous state-of-the-art methods.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionQuantization is one of the most hardware-efficient ways to reduce inference costs for deep neural network (DNN) models. Nevertheless, with the continuous growth of DNN model size, existing static quantization methods fail to utilize the sparsity of models sufficiently. Motivated by the pervasive dynamism in data tensors across DNN models, we propose a dynamic precision quantization algorithm to further reduce computational costs. Furthermore, to address the shortcomings of existing precision-flexible accelerators, we design a novel accelerator, Drift, and achieve online scheduling to efficiently support dynamic precision execution. Evaluation results show that Drift achieves 2.85x speedup and 3.12x energy saving over existing precision-flexible accelerators.
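As a hedged sketch of what dynamic precision quantization means in principle (not Drift's actual algorithm): pick the cheapest bit-width whose reconstruction error on the current tensor stays within a tolerance. The `choose_bits` policy, candidate widths, and tolerance below are invented for illustration.

```python
def quantize(values, bits):
    # Symmetric uniform quantization to a signed `bits`-bit grid.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def choose_bits(values, candidates=(4, 8), tol=0.05):
    # A toy stand-in for dynamic precision selection: use the cheapest
    # bit-width whose reconstruction error stays within tolerance.
    for bits in sorted(candidates):
        q, scale = quantize(values, bits)
        err = max(abs(v - qi * scale) for v, qi in zip(values, q))
        if err <= tol * max(abs(v) for v in values):
            return bits
    return max(candidates)

grid_friendly = [0.1, 0.2, 0.3, 0.7]     # values sit on the 4-bit grid
irregular = [0.26, 0.7, 0.11, 0.05]      # rounding error forces 8 bits
print(choose_bits(grid_friendly), choose_bits(irregular))
```

The point of making this decision dynamically, per tensor at runtime, is that the right precision depends on the data actually flowing through the model, which static quantization fixes once and for all.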
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis study presents a Deep Reinforcement Learning (DRL) framework for optimizing the routing of multiple droplets in Digital Microfluidic Biochips (DMFBs). Our approach significantly reduces computational costs by optimizing voltage-applied cell positions, enabling the parallel movement of multiple droplets with a single neural network pass. Experimental results on various DMFB sizes and droplet counts demonstrate a reduction of up to 98% in neural network parameters and 95% in memory usage for a 10x10 grid with two droplets. Additionally, the proposed method enhances success rates of routing as the number of droplets increases, surpassing existing multi-agent DRL techniques.
IP
Engineering Tracks
IP
DescriptionSoCs designed for compute-intensive workloads, such as AI training and inferencing, continue to grow, and power budgets are increasing geometrically. Power consumption comprises dynamic and static elements. The latter is generally fixed and determined by process technology and design techniques, while the former depends on workloads and frequency. This variability of workloads can drive rapid changes in current draw, which causes voltage droop, a rapid drop on the power rails that can lead to timing glitches and system failures. For example, sudden changes in models or weights can drive these sudden shifts in workloads, causing voltage droops.
Silicon design teams have attempted to address droop in various ways, but all methods have significant downsides. The typical options employed are increasing voltage margins, reducing operating frequencies, scheduling workloads through software, or using active droop mitigation methods that may be fully custom or tailored to their needs. Each of these solutions has advantages and drawbacks regarding power, performance, and implementation effort.
This discussion will explore the root causes of droop, its impact on power, and the increasing challenges in advanced nodes. It will also delve into modern droop mitigation techniques, highlighting the advantages of a tightly-coupled, synthesizable solution.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionFPGA-based accelerators have emerged as an effective solution for GPT inference, given their inherent flexibility and capacity for domain-specific customization. Despite their potential, two primary challenges have impeded their efficient use: the disparate compute-to-memory access ratios in GPT's encoding and generation stages, and the rapid increase in hardware resource demands for nonlinear operations due to longer text lengths and larger embedding dimensions.
To overcome these obstacles, we introduce DTrans, an FPGA accelerator tailored for GPT, based on dataflow transformation and featuring nonlinear-operator fusion. DTrans takes a two-pronged approach: a two-stage dataflow transformation to align with the unique computational and access needs of GPT's different stages, and a sequence-length decoupling method for nonlinear operators. This approach allows computational delays in operations like softmax and layer normalization to overlap with matrix operations in tasks involving long sentences. Furthermore, DTrans uses a two-level alternating input pipeline, which efficiently manages GPT's computing flow, including residual connections and variable inter-layer delays.
Our comparative analyses reveal that DTrans outperforms a V100 GPU in throughput and energy efficiency, achieving improvements of 11.99x and 11.7x, respectively. Compared with state-of-the-art GPT inference accelerators, DTrans demonstrates more than 5.64x and 5.22x enhancements in these metrics.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionDRAM has evolved across generations by increasing its transfer rates in response to computer systems' growing demand for bandwidth. However, higher transfer rates have increased the likelihood of errors known as link errors, occurring during the data transmission process. Because the existing rank-level ECC (RL-ECC) employed by system companies is not sufficient to cope with these new threats, CRC (cyclic redundancy check) has been adopted in recent memory architectures to address the issue. But CRC comes with the drawback of requiring additional transfers, which degrade performance. Moreover, since CRC can only detect errors, it triggers re-transmission in the system for correction, which adds further overhead. This paper proposes a novel RL-ECC, Dual-Axis ECC, that also provides CRC's detection capability by exploiting unused syndromes of QPC, ensuring no performance degradation from additional transfers and mitigating re-transmission while still fulfilling RL-ECC's original purpose. Our evaluation shows that, compared to QPC with CRC, Dual-Axis ECC without CRC provides the same reliability level. Moreover, it speeds up applications by 2.52% on average and up to 4.88%.
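The CRC detect-then-retransmit behavior the abstract contrasts against can be demonstrated with a minimal CRC-8 check over a payload. The polynomial and data below are illustrative and unrelated to the CRC actually used on DDR memory links.

```python
def crc8(data, poly=0x07):
    # Bitwise CRC-8 (polynomial x^8 + x^2 + x + 1, CRC-8/SMBus style, init 0).
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

# Sender appends the checksum; a single-bit link error on the wire is caught
# at the receiver, which then requests re-transmission; that extra round trip
# is the overhead the abstract's Dual-Axis ECC aims to avoid.
payload = bytes([0xDE, 0xAD, 0xBE, 0xEF])
checksum = crc8(payload)

corrupted = bytes([payload[0] ^ 0x04]) + payload[1:]   # flip one bit in transit
print(crc8(payload) == checksum)       # True: clean transfer passes
print(crc8(corrupted) == checksum)     # False: link error detected
```

Because the CRC is linear, any single-bit flip changes the checksum, which is why it detects link errors reliably but still needs a retry to correct them.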
Research Manuscript
Embedded Systems
Embedded System Design Tools and Methodologies
DescriptionRecommendation systems are the backbone for numerous user applications on edge devices. However, the compute and memory-intensive nature of recommendation models renders them unsuitable for edge devices. Nevertheless, by decoupling the model fraction related to user history (e.g., past visited pages, liked posts) and user attributes (such as age, gender), we can offload partial recommendation models onto local edge devices. Hence, we present Duet, a novel collaborative edge-cloud recommendation system that intelligently decomposes the recommendation model into two smaller models – user and item models -- that execute simultaneously on the edge device and cloud before coming together to deliver final recommendations. Further, we propose a lightweight Duet architecture to support user models on resource-constrained edge devices. Overall, Duet reduces the average latency by 6.4x and improves energy efficiency by 4.6x across five recommendation models.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs technology scales down, metal resistances have increased, resulting in potentially greater voltage drop. Therefore, dynamic voltage drop (DVD) significantly affects performance in recent process technologies. Moreover, transistor density has increased, resulting in higher power density. Thus, power-integrity and timing checks must be done simultaneously. In this work, a DVD-aware STA flow based on Cadence's Tempus PI is proposed. To show its effectiveness on real silicon, the proposed method is evaluated by correlating with a 10nm test chip that was specially designed for DVD-aware STA and implemented in Samsung's 10nm process.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionComputing-in-memory (CIM) has demonstrated great energy efficiency by integrating computing units into memory. However, previous research on CIM has rarely utilized sparsity in activations and weights concurrently. Thus, we implemented an accelerator called Dyn-Bitpool, which innovates on two fronts: 1) a balanced working scheme called "pool first and cross-lane sharing" to maximize the performance benefit from bit-level sparsity in activations; 2) a dynamic topology of CIM arrays to effectively handle the low hardware utilization stemming from value-level sparsity in weights. Together, these contributions speed up Dyn-Bitpool by 1.89x and 2.64x on average compared with two state-of-the-art CIM-based accelerators.
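Bit-level sparsity in activations, which Dyn-Bitpool exploits, can be illustrated with a bit-serial multiply that does work only for the set bits of the activation. The function below is a toy software analogue, not the paper's CIM scheme.

```python
def bit_serial_mul(activation, weight):
    # Multiply by iterating over the set bits of the activation and
    # shift-adding the weight; zero bits cost no work, which is the
    # bit-level sparsity a CIM scheme like Dyn-Bitpool exploits.
    acc, work, bit = 0, 0, 0
    a = activation
    while a:
        if a & 1:
            acc += weight << bit
            work += 1              # one add per set bit only
        a >>= 1
        bit += 1
    return acc, work

# An 8-bit activation with only 2 set bits needs only 2 shift-adds,
# not 8, even though the value itself (130) is not zero.
product, adds = bit_serial_mul(0b1000_0010, 77)
print(product, adds)  # 10010 2
```

The hardware benefit comes from the same observation: the number of partial-product cycles tracks the popcount of the operand, not its bit-width.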
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionSparse matrix-matrix multiplication (SpMM) is a key operator in many fields, showing dynamic features in terms of sparsity, element distribution, and data dependency. Previous studies have proposed FPGA-based SpMM accelerators with fixed configurations, leaving three major challenges unsolved: 1) Partitioning matrices with a fixed sub-matrix size leads to performance loss, because the optimal feasible sub-matrix size that minimizes memory access varies with dynamic sparsity. 2) The fixed row-based allocation scheme of streaming architectures leads to unbalanced workloads because of the dynamic element distribution across sparse matrix rows. 3) Data conflicts prevent the elements in one row from being processed consecutively; architectures with a fixed execution order rely on time-consuming pre-processing to deal with dynamic data dependency.
Motivated by the observation that fixed configurations lead to performance loss, we propose DySpMM, which introduces a dynamic design methodology to SpMM architectures. A configurable data-distribution datapath enables dynamic sub-matrix sizes, achieving up to 3.43× speedup. An element-wise allocation unit is introduced into hardware for dynamic workload balancing, improving utilization by up to 3.74×. An interleaved reorder unit automatically reorders sparse elements and dynamically avoids conflicts, completely eliminating the pre-processing overhead. Evaluation of DySpMM on FPGA shows that it achieves 1.42× the geomean throughput of the state-of-the-art accelerator Sextans and 1.78× the energy efficiency of a V100S GPU.
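For readers unfamiliar with the workload-imbalance problem this abstract targets, a minimal CSR-based SpMM makes it visible: per-row work equals the row's nonzero count, so row-based allocation inherits whatever imbalance the sparsity pattern has. The matrices below are invented for illustration.

```python
def spmm_csr(values, col_idx, row_ptr, dense, n_cols):
    # Sparse (CSR) x dense matrix multiply. The per-row work is
    # row_ptr[i+1] - row_ptr[i], which varies with the element
    # distribution: the imbalance that fixed row-based allocation
    # suffers from and element-wise allocation smooths out.
    n_rows = len(row_ptr) - 1
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            v, j = values[k], col_idx[k]
            for c in range(n_cols):
                out[i][c] += v * dense[j][c]
    return out

# A = [[2, 0, 0],        row nonzero counts: 1, 0, 2 (unbalanced)
#      [0, 0, 0],
#      [1, 0, 3]]
values, col_idx, row_ptr = [2.0, 1.0, 3.0], [0, 0, 2], [0, 1, 1, 3]
dense = [[1.0, 2.0], [0.0, 0.0], [4.0, 5.0]]
print(spmm_csr(values, col_idx, row_ptr, dense, 2))
# [[2.0, 4.0], [0.0, 0.0], [13.0, 17.0]]
```

Assigning one hardware lane per row here would leave the middle lane idle while the last does two-thirds of the work, which is exactly the effect an element-wise allocation scheme avoids.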
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionDynamic graph neural networks (DGCNs) have been proposed to extend machine learning techniques to applications involving dynamic graphs. Typically, a DGCN model includes a graph convolutional network (GCN) followed by a recurrent neural network (RNN) to capture both spatial and temporal information. To efficiently execute distinct neural network models while maximizing data reuse and hardware utilization, customized hardware designs for such applications require a reconfigurable computing engine, flexible dataflow, and efficient data locality exploitation. We propose an efficient DGCN accelerator named E-DGCN. Specifically, E-DGCN includes modified Processing Elements (PEs) with a flexible interconnection design to support diverse computation patterns and various dataflows. Additionally, a lightweight vertex caching algorithm is proposed to exploit data locality, enabling E-DGCN to selectively load required vertices during DGCN inference. These implementations provide benefits in managing data computation and communication.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionLogic synthesis plays a crucial role in the digital design flow and has a decisive influence on the final Quality of Results (QoR) of circuit implementations. However, existing multi-level logic optimization algorithms often employ greedy approaches with a series of local optimization steps. Each step breaks the circuit into small pieces (e.g., k-feasible cuts) and applies incremental changes to individual pieces separately. These local optimization steps can limit the exploration space and may miss opportunities for significant improvements. To address this limitation, this paper proposes using e-graphs in logic synthesis. The new workflow, named E-Syn, makes use of the well-established e-graph infrastructure to efficiently perform logic rewriting. It explores a diverse set of equivalent Boolean representations while allowing technology-aware cost functions to better support delay-oriented and area-oriented logic synthesis. Experiments over a wide range of benchmark designs show that our proposed logic optimization approach reaches a wider design space compared to the commonly used AIG-based logic synthesis flow. It achieves on average 15.29% delay saving in delay-oriented synthesis and 6.42% area saving in area-oriented synthesis.
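As a rough intuition for why exploring many equivalent forms helps, here is a toy rewrite-space search that enumerates equivalent Boolean expressions and picks the cheapest; a real e-graph stores such sets compactly via equivalence classes, and the representation, rules, and cost function here are ours, not E-Syn's:

```python
# Toy exploration of equivalent Boolean forms via rewrite rules.
# Expressions are nested tuples: ('and', a, b), ('or', a, b), ('not', a).

def rewrites(e):
    """Yield expressions one rewrite step away from e."""
    if not isinstance(e, tuple):
        return
    if e[0] == 'not' and isinstance(e[1], tuple):
        if e[1][0] == 'not':
            yield e[1][1]                               # double negation
        if e[1][0] in ('and', 'or'):
            dual = 'or' if e[1][0] == 'and' else 'and'  # De Morgan
            yield (dual, ('not', e[1][1]), ('not', e[1][2]))
    for i, sub in enumerate(e[1:], 1):                  # rewrite subterms
        for r in rewrites(sub):
            yield e[:i] + (r,) + e[i + 1:]

def cost(e):
    """Gate count: every operator node costs one gate."""
    return 1 + sum(cost(s) for s in e[1:]) if isinstance(e, tuple) else 0

def optimize(e, steps=4):
    """Breadth-first search of the rewrite space; return the cheapest form."""
    seen, frontier = {e}, [e]
    for _ in range(steps):
        frontier = [r for f in frontier for r in rewrites(f) if r not in seen]
        seen.update(frontier)
    return min(seen, key=cost)

expr = ('not', ('not', ('and', 'a', 'b')))
print(optimize(expr))   # -> ('and', 'a', 'b'): the two inverters are removed
```

A technology-aware flow would replace `cost` with a delay- or area-driven function, which is the knob E-Syn exposes.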
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionClock trees significantly contribute to the overall power consumption of a design, accounting for approximately 30-40% of the total power. Effectively estimating and analyzing clock power at the System-on-Chip (SoC) level is crucial for identifying and optimizing weak areas in the design. The identification of power bugs prompts the exploration of various Clock Gating Strategies to enhance power efficiency.
Existing methods for clock tree power estimation at the gate level exhibit dependencies on processes like clock tree synthesis (CTS). However, these dependencies, occurring late in the cycle, hinder design optimization within the strict timelines of the SoC. Close to Base Tape-out (BTO), attempting design optimization becomes more challenging, as changes can disrupt established timelines.
This paper introduces a pioneering workflow for early clock power estimation, providing feedback to cores/IPs at the Register Transfer Level (RTL) stage. This approach aims to address the limitations of current methods and emphasizes a proactive strategy for optimizing clock power in the early stages of design, thus overcoming the constraints imposed by late-cycle dependencies and stringent timelines.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionLow Power design is now required to satisfy the current global market demand for reducing ASIC power consumption. An incorrect power-aware description can compromise the original design's functional behavior, for example through the propagation of corrupted signals due to an incorrect isolation control-signal protocol. Low Power structural checks ensure that the design is structurally safe but do not guarantee functional correctness. Low Power functional simulations depend heavily on simulation scenarios, which may result in non-exhaustive verification when test cases are lacking.
This paper details our experiences in establishing a robust power-aware verification flow to catch low-power bugs early in the design cycle, reducing the overall sign-off time. We present how power-aware formal verification, combined with custom automatic property extraction, helped us obtain a simulation-scenario-independent analysis of power-aware design functionality. The flow allows fast and specific LP checks without requiring any verification scenario setup. We share the results of our analysis, which highlight the bugs found using this methodology, and show how you can adopt this flow to make your power-aware signoff comprehensive.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionFunctional coverage serves as a metric for measuring the completeness of verification efforts, often requiring a significant time investment. The testbench (TB) employed to achieve verification coverage may involve complex constraints and incomplete scenarios, potentially causing over-constraint or under-constraint issues due to its randomized nature. Such flawed or incomplete random TBs can lead to unnecessary time and effort spent on checking UNR or re-running regressions after a verification engineer's review. This paper presents a methodology for validating TBs under a simulation environment using formal technology to mitigate the existing validation turnaround time (TAT). Leveraging formal technology and the internally developed C2A (Constraint to Assume) tool, the constraint and coverage model in a random simulation TB can be verified early to eliminate over/under-constraints. Furthermore, additional functional coverage can be generated by applying the internally developed R2C (RTL to Coverage) and C2C (Counter to Coverage) tools to the RTL. As a result, the improved TB can be applied to the simulator from the early stages, confirming a reduction in validation TAT. This approach facilitates the creation of high-quality coverage-based TBs and helps the early detection of hard-to-find bugs in a simulator-based environment.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionAnalog Computing-in-Memory (ACIM) is an emerging architecture to perform efficient AI edge computing. However, current ACIM designs usually have unscalable topology and still heavily rely on manual efforts. These drawbacks limit the ACIM application scenarios and lead to an undesired time-to-market. This work proposes an end-to-end automated ACIM based on a synthesizable architecture (EasyACIM). With a given array size and customized cell library, EasyACIM can generate layouts for ACIMs with various design specifications end-to-end automatically. Leveraging the multi-objective genetic algorithm (MOGA)-based design space explorer, EasyACIM can obtain high-quality ACIM solutions based on the proposed synthesizable architecture, targeting versatile application scenarios. The ACIM solutions given by EasyACIM have a wide design space and competitive performance compared to the state-of-the-art (SOTA) ACIMs.
Embedded Systems and Software
AI
Embedded Systems
Engineering Tracks
DescriptionThe convergence of the Internet of Things (IoT), Heterogeneous Computing Architectures, Artificial Intelligence (AI), Machine Learning (ML), and Generative AI (GenAI) is ushering in a new era of computation and analysis. Our panelists will explore a deeper understanding of the intricate interplay between Edge Intelligence and GenAI, with a focus on the technical hurdles and ethical considerations.
By processing data closer to its source, edge computing can harness the power of AI-ML in real-time. This paradigm shift is redefining the capabilities of IoT and computational architectures. Join us as we explore the practical challenges involved in integrating GenAI into edge computing such as limited computational resources, latency reduction, and the development of lightweight AI models.
Edge computing, fortified by GenAI, is changing the game in critical sectors like healthcare, manufacturing, automotive, smart cities, and semiconductor design and manufacturing. Real-time data processing is enhancing decision-making, improving efficiency, and even saving lives. Through case studies and examples, we'll discuss how engineers and researchers are at the forefront of developing solutions that drive these innovations.
While the technical aspects are fascinating, with great power comes great responsibility. The ubiquity of edge computing and GenAI raises crucial ethical questions. How can we ensure data privacy and security at the edge? What safeguards can be put in place to mitigate bias in AI algorithms? Who is accountable when autonomous systems make critical decisions?
Our panel comprises seasoned experts who have grappled with these questions in academic research, policy making, product and infrastructure design and deployment as well as investing and mentoring. We invite you to be a part of the conversation that is shaping the future of technology.
Research Manuscript
AI
AI/ML Algorithms
DescriptionEfficiently adapting Large Language Models (LLMs) on resource-constrained devices, such as edge devices, is vital for applications requiring continuous and privacy-preserving adaptation. However, existing solutions fall short due to the high memory and computational overhead associated with LLMs. To address this, we introduce an LLM tuning framework, Edge-LLM, that features three core components: (1) a unified compression method offering cost-effective layer-wise pruning ratios and quantization policies, (2) an adaptive tuning and voting scheme that selectively adjusts a subset of layers during each iteration and then adaptively combines their outputs for the final inference, thus reducing backpropagation depth and memory overhead during adaptation, and (3) a complementary search space that optimizes device workload and utilization. Experimental results demonstrate that Edge-LLM achieves efficient on-device adaptation with performance comparable to vanilla tuning methods.
Research Manuscript
Embedded Systems
Embedded System Design Tools and Methodologies
DescriptionFull-waveform inversion (FWI) plays a vital role in geoscience to explore the subsurface. It utilizes the seismic wave to image the subsurface velocity map. As the machine learning (ML) technique evolves, the data-driven approaches using ML for FWI tasks have emerged, offering enhanced accuracy and reduced computational cost compared to traditional physics-based methods. However, a common challenge in geoscience --- the unprivileged data --- severely limits ML effectiveness. The issue becomes even worse during model pruning, a step essential in geoscience due to environmental complexities. To tackle this, we introduce the EdGeo toolkit, which employs a diffusion-based model guided by physics principles to generate high-fidelity velocity maps. The toolkit uses the acoustic wave equation to generate corresponding seismic waveform data, facilitating the fine-tuning of pruned ML models. Our results demonstrate significant improvements in SSIM scores and reduction in both MAE and MSE across various pruning ratios. Notably, the ML model fine-tuned using data generated by EdGeo yields superior quality of velocity maps, especially in representing unprivileged features, outperforming other existing methods.
Research Manuscript
Design
Quantum Computing
DescriptionIn the noisy intermediate-scale quantum era, mid-circuit measurement and reset operations facilitate novel circuit optimization strategies by reducing a circuit's qubit count in a method called resizing. This paper introduces two such algorithms. The first one leverages gate-dependency rules to reduce qubit count by 61.6% or 45.3% when optimizing depth as well. Based on numerical instantiation and synthesis, the second algorithm finds resizing opportunities in previously unresizable circuits via dependency rules and other state-of-the-art tools. This resizing algorithm, implemented in BQSKit, reduces qubit count by 20.7% on average for these previously impossible-to-resize circuits.
Research Manuscript
Security
Embedded and Cross-Layer Security
DescriptionEmbedded operating systems, despite their widespread use in security-critical applications, are not effectively tested with sanitizers to root out bugs. Sanitizers provide a means to detect bugs that are not directly visible through exceptional or erroneous behaviors, thus uncovering more potent bugs during testing.
In this paper, we propose EmbSan, an embedded-systems sanitizer for a diverse range of embedded operating system firmware, built on dynamic instrumentation of sanitizer facilities and decoupled on-host runtime libraries. This allows us to perform sanitization for multiple embedded OSs during fuzzing, such as many Embedded Linux-based and FreeRTOS firmware, and detect actual bugs within them. We evaluated EmbSan's effectiveness on firmware images based on Embedded Linux, FreeRTOS, LiteOS, and VxWorks. Our results show that EmbSan can detect the same classes of actual bugs found in the Embedded Linux kernel as the reference implementations of KASAN, and exhibits a slowdown of 2.2× to 3.2× and 5.2× to 5.7× for KASAN and KCSAN, respectively, which is on par with established kernel sanitizers. EmbSan and embedded OS fuzzers also found a total of 41 new bugs in Embedded Linux, FreeRTOS, LiteOS, and VxWorks.
Research Manuscript
Design
Emerging Models of Computation
DescriptionComputing with memory is an energy-efficient computing approach. It pre-computes a function and stores its values in a lookup table (LUT), from which they can be retrieved at runtime. Approximate Boolean decomposition has recently been proposed to reduce the LUT size for implementing complex functions, but it takes a long time to find a decomposition with minimized error. As a parallel algorithm developed based on the Ising model, simulated bifurcation (SB) promises to be a high-performance approach for combinatorial optimization. In this paper, we propose an efficient SB-based approximate function decomposition approach. Specifically, a new approximate disjoint decomposition method, called column-based approximate disjoint decomposition, is first proposed to fit the Ising model. Then, it is adapted to the Ising model-based optimization solver. Moreover, two improvement techniques are developed for an efficient search of the approximate disjoint decomposition when using SB. The experimental results show that, compared to the state-of-the-art work, our approach achieves an 11% smaller mean error distance with an average 1.16× speedup when approximately decomposing 16-input 16-output Boolean functions.
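For readers unfamiliar with SB, a minimal ballistic-SB iteration for a tiny Ising instance looks roughly like this; the constants, pump schedule, and two-spin problem are illustrative only, and the paper's column-based decomposition encoding is not shown:

```python
import math
import random

# Minimal ballistic simulated-bifurcation (SB) loop for a small Ising
# problem. Illustrative sketch only, not the paper's solver.

def simulated_bifurcation(J, steps=2000, dt=0.1, c0=0.5):
    n = len(J)
    rng = random.Random(0)
    x = [rng.uniform(-0.1, 0.1) for _ in range(n)]  # oscillator positions
    y = [0.0] * n                                   # oscillator momenta
    for t in range(steps):
        a = t / steps                               # pump ramps 0 -> 1
        for i in range(n):
            force = (a - 1.0) * x[i] + c0 * sum(J[i][j] * x[j] for j in range(n))
            y[i] += force * dt
            x[i] += y[i] * dt
            if abs(x[i]) > 1.0:                     # inelastic walls
                x[i] = math.copysign(1.0, x[i])
                y[i] = 0.0
    return [1 if v >= 0 else -1 for v in x]         # read out spin signs

# Two ferromagnetically coupled spins: any aligned state is optimal.
spins = simulated_bifurcation([[0, 1], [1, 0]])
print(spins)
```

Each oscillator update depends only on a matrix-vector product, which is why SB parallelizes well compared to sequential annealing.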
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionResolution Enhancement Techniques (RETs) are critical to meet the demands of advanced technology nodes. Among RETs, Source Mask Optimization (SMO) is pivotal, concurrently optimizing both the source and the mask to expand the process window. Traditional SMO methods, however, are limited by sequential and alternating optimizations, leading to extended runtimes without performance guarantees. This paper introduces a unified SMO framework utilizing accelerated Abbe forward imaging to enhance precision and efficiency. Further, we propose the innovative BiSMO framework, which reformulates SMO through a bilevel optimization approach, and present three gradient-based methods to tackle the challenges of bilevel SMO. Our experimental results demonstrate that BiSMO achieves a remarkable 40% reduction in error metrics and an 8× increase in runtime efficiency, signifying a major leap forward in SMO.
Research Manuscript
Embedded Systems
Embedded Software
DescriptionSimulink has emerged as the fundamental infrastructure that supports modeling, simulation, verification, and code generation for embedded software development. To improve the performance of the code generated from Simulink models, state-of-the-art code generators employ various optimization techniques, such as expression folding, variable reuse, and parallelism. However, they overlook the presence of redundant calculations within data-intensive models widely used to perform substantial data processing in embedded scenarios, which can significantly undermine the efficiency and performance of the generated code.
This paper proposes Frodo, an efficient code generator for data-intensive Simulink models via redundancy elimination. Frodo first conducts model analysis to construct the dataflow graph and derive the I/O mapping of each block. Then, for each block within the dataflow graph, Frodo recursively determines its calculation range by leveraging the I/O mapping of its subsequent blocks. After that, Frodo generates concise code for optimizable blocks in accordance with the precise calculation range. We implemented and evaluated Frodo on benchmark Simulink models. Compared with the state-of-the-art code generators Simulink Embedded Coder, DFSynth, and HCG, the code generated by Frodo is 1.17x - 8.55x faster in terms of execution duration across different compilers and architectures, without incurring additional overhead of memory usage.
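The backward derivation of calculation ranges can be sketched on a toy two-block pipeline; the block names and I/O mappings are illustrative, not Frodo's internal representation:

```python
# Toy backward propagation of required index ranges through a dataflow
# pipeline -- the essence of computing only what downstream blocks need.

def needed_inputs(block, out_range):
    """I/O mapping: which input index range a block needs to produce
    the requested output index range (lo, hi), inclusive."""
    lo, hi = out_range
    if block == "take_prefix":     # output i comes from input i
        return (lo, hi)
    if block == "shift_by_2":      # output i comes from input i + 2
        return (lo + 2, hi + 2)
    raise ValueError(block)

def calc_ranges(pipeline, final_range):
    """Walk the pipeline backward, recording each block's calculation
    range and returning the range of source elements actually read."""
    ranges, need = {}, final_range
    for block in reversed(pipeline):
        ranges[block] = need
        need = needed_inputs(block, need)
    return ranges, need

# A long array flows through shift_by_2 then take_prefix, but only
# outputs 0..9 are consumed downstream.
ranges, src = calc_ranges(["shift_by_2", "take_prefix"], (0, 9))
print(ranges)   # each block computes just 10 elements
print(src)      # only source elements 2..11 are ever read
```

Code generated from these ranges iterates over 10 elements per block regardless of the source array's length, which is the redundancy elimination the abstract describes.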
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionIn this paper, we present an optimized methodology for performing state-space-based equivalence checking of nonlinear analog circuits by using a gradient-ascent-based search algorithm to efficiently traverse a common state space. Essentially, the method searches for critical regions where the functional behaviors of two circuit designs show the greatest divergence. The key challenges in this approach are the mapping of both designs onto a common canonical state space, the computation of the gradient, and the exclusion of unreachable regions within the state space. To address the first challenge, we use locally linearized systems and leverage the Kronecker Canonical Form (KCF). To facilitate the computation of the gradient, we employ a purpose-built target function, and to exclude unreachable regions, we utilize vector projection techniques. Through experiments with nonlinear analog circuits and a scalability analysis, we demonstrate the successful and efficient computation performed with the proposed methodology, achieving speedups of up to 468 times.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHigh Bandwidth Memory (HBM) in 2.5D interposers addresses the need for increased memory bandwidth in AI and HPC applications. HBM channel design is crucial for achieving high-speed data transfers. However, routing such a channel is challenging due to the tight interconnections and the need to manage signal integrity (SI) in a compact space. It commonly takes months to route an HBM channel and run multiple iterations to meet the SI requirements. This paper proposes an efficient flow including steps to quickly explore routing patterns during the pre-layout stage with the Xpeedic Metis tool, auto-route the HBM channel with Synopsys 3DIC Compiler, and run post-layout SI analysis with the integrated Xpeedic Metis. The demo example shows tremendous time savings with the new flow.
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionInverse lithography technology (ILT) is one of the most powerful resolution enhancement technologies (RETs) used in chip manufacturing. Due to the high computational requirements of ILT, large layouts are often split into smaller tiles and then assembled to obtain the final result. This paper states the challenges that may emerge during layout assembly and proposes using the multigrid Schwarz method to address these issues. Experimental results show that our method achieves quality comparable to full-chip correction while exhibiting better efficiency.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionWith the rapid development of Deep Neural Networks (DNNs), numerous Process-In-Memory (PIM) designs have emerged to accelerate DNN models with exceptional throughput and energy efficiency. PIM accelerators based on Non-Volatile Memory (NVM) or volatile memory offer distinct advantages for computational efficiency and performance. NVM-based PIM accelerators, despite demonstrated success in DNN inference, face limitations in on-device learning due to high write energy, latency, and instability. Conversely, fast volatile memories, like SRAM, offer rapid read/write operations for DNN training, but suffer from significant leakage currents and large memory footprints. In this paper, for the first time, we present a fully digital sparse-processing hybrid NVM-SRAM design that synergistically combines the strengths of NVM and SRAM, tailored for on-device continual learning. Our NVM- and SRAM-based PIM circuit macros support both storage and processing of the N:M structured sparsity pattern, significantly improving storage and computing efficiency. Exhaustive experiments demonstrate that our hybrid system effectively reduces area and power consumption while maintaining high accuracy, offering a scalable and versatile solution for on-device continual learning.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionMemory bandwidth, types of compute elements and the NoC play key roles in designing a chiplet-based accelerator. In this work, we investigate the strategic placement of memory chiplets to ensure efficient data access, optimized throughput, and maximal utilization of hardware resources. We model an architecture with 64 compute chiplets, 16 memory chiplets, and 16 I/O chiplets. We evaluate six architectures with different memory chiplet placements and propose a clustered-memory configuration which results in an 8% reduction in average latency, 22% reduction in packet latency, and 20% gain in average throughput compared to a baseline architecture.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionOpen Modification Search (OMS) is a promising algorithm for mass spectrometry analysis that enables the discovery of modified peptides. However, OMS encounters challenges as it exponentially extends the search scope. Existing OMS accelerators either have limited parallelism or struggle to scale effectively with growing data volumes. In this work, we introduce an OMS accelerator utilizing multi-level-cell (MLC) RRAM memory to enhance storage capacity by 3x. Through in-memory computing, we achieve 1.7x to 76.7x faster data processing with two to three orders of magnitude energy efficiency improvement. The functionality is tested on a fabricated MLC RRAM chip. To address errors from memory, we leverage hyperdimensional computing, providing robustness by tolerating up to 10% memory errors while delivering massive parallelism in hardware.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe propose neural network models that predict read access time (RAT) and read access yield (RAY) in SRAM, considering a wide range of design variables. Using transfer learning, the RAT model reduces post-layout simulation time and training costs, achieving a prediction time of 0.18 ms, 1.2 million times faster than HSPICE, with a 2.14% error rate. The RAY model leverages a transformer architecture to enhance accuracy, with a prediction time of 0.27 s, 11,000 times faster than HSPICE, and a 1.31% error rate. Both models save time across the entire design process and enhance accuracy by considering macro-level interactions and employing regularization methods specifically designed to effectively capture nonlinearities.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionShort-circuit or cross-current power is the component of dynamic power that occurs when both the pull-up and pull-down networks are on at the same time. As we move to smaller technologies, the cross-current contribution to overall chip power is increasing significantly, so modeling it accurately has become important. Following the industry standard IEEE 2416, contributor-based modeling enhances power modeling efficiency by using PVT (Process, Voltage, and Temperature) independent modeling. We present a new approach to model cross-current power that is both accurate and efficient, fitting into the contributor modeling paradigm. This is a first in the industry, and it is used in the tape-out of IBM microprocessors. The presented techniques also allow for hierarchical separation of the cross-current power component, for effective management of its consumption in dynamic-power-dominated high-performance microprocessors. This approach overcomes the limitations of existing methods (.libs etc.), which, though accurate, are inefficient. Experimental results on cells from the standard cell library used in IBM microprocessors demonstrate the accuracy of the proposed model to be within ~5% of detailed circuit simulations. The model size increase is negligible, and the computation cost is minimal compared to existing approaches.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSynaptic delay parameterization of neural network models has remained largely unexplored, but recent literature has been showing promising results, suggesting that delay-parameterized models are simpler, smaller, sparser, and more energy efficient than similarly performing non-delay-parameterized ones. We introduce the Shared Circular Delay Queue (SCDQ), a novel hardware structure for supporting synaptic delays on digital neuromorphic accelerators. Our analysis and hardware results show that it scales better in terms of memory than current commonly used approaches, and is more amenable to algorithm-hardware co-optimizations, where in fact memory scaling is modulated by model sparsity and not merely network size.
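As background, the general idea of a circular delay queue (though not SCDQ's shared, sparsity-aware hardware design) can be sketched as a ring of future time slots:

```python
class CircularDelayQueue:
    """Toy ring buffer of time slots: a spike with delay d is placed
    d slots ahead of the current pointer and delivered when the pointer
    reaches it. Illustrative only -- not the SCDQ design itself."""

    def __init__(self, max_delay):
        self.slots = [[] for _ in range(max_delay + 1)]
        self.now = 0

    def schedule(self, spike, delay):
        assert 0 < delay <= len(self.slots) - 1
        self.slots[(self.now + delay) % len(self.slots)].append(spike)

    def tick(self):
        """Advance one timestep and return the spikes due now."""
        self.now = (self.now + 1) % len(self.slots)
        due, self.slots[self.now] = self.slots[self.now], []
        return due

q = CircularDelayQueue(max_delay=4)
q.schedule("n0->n3", delay=2)
q.schedule("n1->n3", delay=4)
print(q.tick())   # nothing due at t=1
print(q.tick())   # the delay-2 spike arrives at t=2
```

In this naive form memory scales with the maximum delay times occupancy; per the abstract, SCDQ's contribution is sharing the structure so that memory scaling tracks model sparsity rather than raw network size.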
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWith the continuous evolution of large models, large-model training has become increasingly critical. However, large-scale model training typically requires substantial energy consumption, which adds to the cost of training these models. We present EffiPipe, an energy-efficient GPU scheduling system for large-scale model training tasks. EffiPipe conducts fine-grained scheduling of operators, incorporating dynamic frequency adjustment for both computing and memory, and taking into account distributed model training scenarios. Compared to existing works, we can reduce power consumption by 20-30% while ensuring performance is maintained.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionMessage Passing-based Graph Neural Networks (GNNs) have been widely used to analyze graph data, in which complex vertex and edge operations are performed via the exchange of information between connected vertices. Such complex GNN operations are highly dependent on the graph structure and can no longer be characterized as general sparse-dense matrix multiplications. Consequently, current data reuse and workload balance optimizations have limited applicability to Message Passing-based GNN acceleration. In this paper, we leverage mathematical insights from the Gram matrix to simultaneously exploit data reuse and workload balancing opportunities for GNN acceleration. Building upon this, we further propose a novel accelerator, termed EGMA, that can efficiently facilitate a wide range of GNN models with much-improved data reuse and workload balance. Consequently, EGMA achieves performance speedups of 1.57×, 1.72×, and 1.43× and energy reductions of 38.19%, 34.02%, and 24.54% on average compared to Betty, FlowGNN, and ReGNN, respectively.
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
DescriptionModern mobile devices adopt two-level memory swapping consisting of ZRAM and storage devices to relieve memory pressure.
In the swap subsystem, ZRAM can improve application responsiveness and reduce write traffic to storage devices while consuming physical memory and additional CPU cycles.
To better utilize ZRAM and improve system performance, we propose ElasticZRAM, an elastic ZRAM to redesign the traditional memory swapping with full awareness of the characteristics of applications and NAND flash-based storage devices on mobile devices.
Experimental results on Google Pixel 6 demonstrate that ElasticZRAM improves application response time by up to 24.8% with negligible overhead compared with state-of-the-art approaches.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn high-speed SerDes design, understanding the electromagnetic (EMag) coupling between the various elements of a high-frequency semiconductor device is very important; these EMag interactions involve not only the silicon chip but also extend to the package that encloses it. At the sign-off phase, it is common to find that block-level pre-LVS EMag simulation results differ significantly from measurement data, so it is necessary and important to perform EMag simulation at the sign-off phase to reduce the gap.
Traditional EMag simulation methods consider only chip coupling and not the packaging layers together with the on-chip metal model, which may lead to design-specification violations. Traditional EMag flows extract only the layout with passive devices; if EMag coupling is not fully considered, there will be a large mismatch between post-LVS simulation results and measurement.
In high-speed SerDes design, high-precision and high-efficiency electromagnetic modeling and simulation are required to minimize the associated EMag risks. The RaptorH die+package modeling flow can predict the impact of EMag coupling with package layers at the block stage; Exalto post-LVS EMag simulation can resolve the mismatch between post-layout simulation and measurement at the sign-off stage; together they increase confidence in the performance of high-speed SerDes designs.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionBus Functional Models (BFMs) are commonly used in digital design and verification processes. They serve as abstract representations of the behavior of buses or communication interfaces within a system. People use BFMs for several reasons, such as early system-level simulation, verification of communication protocols, speeding up verification, modular verification, test-bench development, and debugging.
Companies may choose to use proprietary Bus Functional Models (BFMs) instead of standard-defined BFMs for various reasons, such as customization for specific requirements, competitive advantage, protection of intellectual property, optimized performance, integration with in-house tools, and industry-specific standards. For example, proprietary Ethernet BFMs cater to automotive, IoT, security, networking, and cloud-based application SoCs.
While going forward with a proprietary BFM for their IP, companies also face functional verification challenges: they cannot directly connect third-party verification IPs (discrepancies in connections and frame fields) and must define their own protocol checks and packet scoreboard. If they instead choose a third-party verification IP, it must support flexible topologies and extra proprietary fields.
This paper focuses on understanding the problem and the solution of proprietary BFM verification using the use case of an Ethernet frame BFM carrying upper-layer proprietary frames. We explain the problem and solution using topology diagrams and all the required pointers.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn electronic design automation, logic optimization operators play a pivotal role in minimizing the gate count of logic circuits. However, their computational demands are high. Operators such as refactor conventionally form iterative cuts for each node, striving for a more compact representation, a task that fails 98% of the time on average. Prior research has sought to mitigate the computational cost through parallelization. In contrast, our approach leverages a classifier to prune unsuccessful cuts preemptively, thus eliminating unnecessary resynthesis operations. Experiments on the refactor operator using the EPFL benchmark suite and 10 large industrial designs demonstrate that this technique can speed up logic optimization by 3.9× on average compared with the state-of-the-art ABC implementation.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSpMV is a critical kernel in multiple application domains. The performance of SpMV on SIMD devices suffers greatly from control divergence. This paper proposes an in-SRAM-computing-based SpMV optimization framework. We divide SpMV into two stages: a compute-intensive stage and a control-intensive stage. The first stage is efficiently accelerated on most current SIMD devices. To optimize the second stage, we convert the control divergence into memory divergence and utilize the multi-bank feature of SRAM to eliminate the memory divergence overheads. Experimental results indicate that our solution achieves significant performance speedups over highly optimized vector SpMV kernels.
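As a rough illustration of the two-stage split this abstract describes, the sketch below separates a CSR SpMV into a branch-free multiply stage (compute-intensive, SIMD-friendly) and a per-row segmented reduction (control-intensive, where irregular row lengths cause divergence). This is a NumPy sketch of the general idea only, not the paper's in-SRAM implementation.

```python
import numpy as np

def spmv_two_stage(indptr, indices, data, x):
    """CSR SpMV split into the two stages the abstract describes."""
    # Stage 1: compute-intensive -- one multiply per nonzero, no branching.
    products = data * x[indices]
    # Stage 2: control-intensive -- segmented sum over rows of varying length.
    n_rows = len(indptr) - 1
    row_of_nnz = np.repeat(np.arange(n_rows), np.diff(indptr))
    y = np.zeros(n_rows)
    np.add.at(y, row_of_nnz, products)
    return y

# 3x3 sparse matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 2])
data    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x       = np.array([1.0, 1.0, 1.0])
print(spmv_two_stage(indptr, indices, data, x))  # [3. 3. 9.]
```

Stage 1 maps uniformly over nonzeros; all of the row-length irregularity is confined to the reduction in stage 2, which is the part the paper targets.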
Exhibitor Forum
DescriptionMany hardware and silicon designers consider chiplets a critical enabler for more capable and cost-efficient systems. Chiplets are well established amongst large players that control all components/aspects of a design (i.e., single vendor), and the allure of a "plug and play" chiplet market has garnered significant attention and investment from the industry. Although the industry needs to address some technical and business hurdles before that vision comes to fruition, OEMs and chipmakers can realize most of the benefits of chiplet-based designs today. Specifically, small groups of companies with aligned product strategies and (typically) complementary expertise are forming multi-vendor ecosystems. Within these ecosystems, the companies can coordinate the functionality, requirements, and interfaces of each chiplet (and, of course, the die-to-die interconnects that glue them together) to meet the needs of a specific product or product family. This talk describes chiplet interconnect solutions—die-to-die PHY and link layer—that support all three use cases (single-vendor, multi-vendor ecosystem, and plug-and-play). It outlines how OEMs and chip makers can successfully navigate a multi-vendor ecosystem approach to implement chiplet-based designs today.
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionLayout pattern generation via deep generative models is a promising methodology for building practical large-scale pattern libraries.
However, although improving optical proximity correction (OPC) is a major target of existing pattern generation methods, they are neither explicitly trained for OPC nor integrated into OPC methods.
In this paper, we propose EMOGen to enable the co-evolution of layout pattern generation and learning-based OPC methods.
With the novel co-evolution methodology, we achieve up to 39% enhancement in OPC and 34% improvement in pattern legalization.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionA CDC static verification tool with machine learning (ML) capability helps engineers perform apt root cause analysis (RCA) to bring down the noise in results. An improperly constrained design may leak a bug into silicon, while an over-constrained design may not reach verification closure. Static methodologies often require an automated data analysis or ML solution that can detect issues in the CDC setup stage of the design. Using causality analysis, EDA tools suggest better constraints to report real issues and eliminate noise.
This paper investigates how improper or missed constraints can affect CDC analysis results and subsequently devises a methodology for effective use of the ML feature in the form of causality reports. Firstly, RCA can start by fine-tuning the clock constraints with the constraints suggested for the clock tree, maintaining optimal pessimism in CDC analysis. Secondly, engineers should focus on other constraints, such as stables, constants, etc., on data paths as suggested by RCA. Lastly, engineers should investigate other miscellaneous constraints through causality reports to reduce noise. Thus, RCA should be done in a progressive, iterative manner for effective CDC analysis with real issues.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn the dynamic landscape of electronics design, the escalating market demand for new devices has led to increased complexity in evaluating and comparing configurations and feature requirements based on customer needs and packages. This intricacy poses a challenge for designers, making decision-making in this domain a laborious task. In the initial stages of product development, designers endeavor to assess the cost of the new device by estimating its die size (silicon area) and exploring various configuration possibilities. Furthermore, a strategic focus on optimizing PPA (power, performance, and area) at a given process node involves increasing performance (MHz) and adding memories, leading to higher power consumption and larger die sizes. Ensuring compatibility with a target package (e.g., QFP: Quad Flat Package) introduces complexities like complex ground rings and down-bondings. Addressing these challenges necessitates highly efficient, predictable, and fast solutions. Presently, there is a lack of automated tools to tackle this problem. In this paper, we propose an automated solution to address the aforementioned challenges, facilitating informed decisions at the outset of the design process. This work aims to prevent late surprises and enhance overall predictability.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLarge language models (LLMs) have demonstrated impressive abilities in various domains, but their inference cost is expensive. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing storage and accelerating computation. State-of-the-art methods use 2-bit quantization for mainstream LLMs (e.g., Llama2-7b). However, challenges still exist in reducing LLM inference cost with 2-bit quantization: (1) Non-negligible accuracy loss for 2-bit quantization. Weights are quantized by groups, and in some groups the range of weights is large, resulting in large quantization errors and non-negligible accuracy loss (e.g., >3% for Llama2-7b with 2-bit quantization in GPTQ and Greenbit). (2) Limited accuracy improvement from adding 4-bit weights. Increasing the average bit width by 10% with more 4-bit weights yields only <0.5% accuracy improvement on a quantized Llama2-7b model. (3) Time-consuming dequantization operations on GPUs. Mainstream methods require a dequantization operation to perform computation on the quantized weights, and a second-order dequantization operation is applied because the scales of the groups are also quantized. These dequantization operations account for >50% of execution time, hindering the potential of reducing LLM inference cost.
To tackle these challenges and enable fast and low-cost LLM inference on GPUs, we propose the following techniques. (1) Range-aware quantization with memory alignment. We point out that the range of weights varies by group; thus, we quantize only a small fraction of groups, those with the larger ranges, using 4 bits, with memory-alignment considerations on GPUs. (2) Accuracy-aware sparse outliers. We point out that the distribution of sparse outliers with larger weights differs between 2-bit and 4-bit groups, and only a small fraction of outliers require 16-bit quantization. This design yields >0.5% accuracy improvement with <3% average bit increase for Llama2-7b. (3) Asynchronous dequantization. We point out that calculating the scales of each group is independent of loading the weights of each group; thus, we design asynchronous dequantization on GPUs, leading to up to 3.92× speedup. We conduct extensive experiments on different model families and model sizes. We achieve 2.85 bits per weight, considering all scales/zeros, for different models. The end-to-end speedup for Llama2-7b is 1.74× over the original model, and we reduce both runtime cost and hardware cost by up to 2.70× and 2.81× with fewer GPU requirements.
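The range-aware idea in point (1) of this abstract can be sketched roughly as follows: weights are quantized in groups with a per-group scale, and only the small fraction of groups with the widest ranges receive 4 bits while the rest get 2 bits. The group size, the 10% fraction, and all function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize_group(w, bits):
    """Uniform affine (de)quantization of one group; returns dequantized values."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((w - lo) / scale)          # integer codes in [0, levels]
    return q * scale + lo                   # dequantize for error inspection

def range_aware_quantize(weights, group_size=8, frac_4bit=0.1):
    """Give 4 bits only to the widest-range groups, 2 bits to everything else."""
    groups = weights.reshape(-1, group_size)
    ranges = groups.max(axis=1) - groups.min(axis=1)
    n4 = max(1, int(frac_4bit * len(groups)))
    wide = set(np.argsort(ranges)[-n4:])    # indices of widest-range groups
    out = np.empty_like(groups)
    for i, g in enumerate(groups):
        out[i] = quantize_group(g, 4 if i in wide else 2)
    return out.reshape(weights.shape)
```

With 2-bit groups the worst-case error per weight is half the quantization step, i.e. (group range)/6, which is why spending extra bits only where the range is large is an attractive trade.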
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
DescriptionCycle Queuing and Forwarding (CQF) configures the same cycle length along the flow path, rendering certain flows unschedulable. Enhanced CQF (ECQF)-based flow aggregation utilizes variable cycle lengths to address this issue. However, it remains a conceptual model without a concrete implementation. In this paper, we propose a mechanism that jointly optimizes the aggregation cycle and flows' offsets (JACO) to achieve ECQF-based flow aggregation. We also design an incremental heuristic algorithm for JACO. Finally, we evaluate the performance of JACO in different scenarios using the OMNeT++ simulation platform. Compared with ECQF, the results show that JACO reduces latency and improves resource utilization.
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionIn transformer models, data reuse within an operator is insufficient, which prompts more aggressive fusion of multiple tensor-wise operators (multi-tensor fusion). Due to the complexity of tensor-wise operator dataflow, conventional fusion techniques often fall short, with limited dataflow options and short fusion lengths. In this study, we first identify three challenges in multi-tensor fusion that result in inferior fusions. We then propose dataflow adaptive tiling (DAT), a novel inter-operator dataflow that enables efficient fusion of multiple operators connected in any form and chained at any length. We then broaden dataflow exploration from intra-operator to inter-operator and develop an exploration framework to quickly find the best dataflow on spatial accelerators with a given on-chip buffer size. Experimental results show that DAT delivers 2.24X and 1.74X speedup and 35.5% and 15.5% energy savings on average for edge and cloud accelerators, respectively, compared to the state-of-the-art dataflow explorer FLAT. In addition, the DAT exploration framework will be open-sourced.
Research Manuscript
Embedded Systems
Embedded System Design Tools and Methodologies
DescriptionAfter a large language model (LLM) is deployed on edge devices, it is desirable for these devices to learn from user-generated conversation data to generate user-specific and personalized responses in real-time. However, user-generated data usually contains sensitive and private information, and uploading such data to the cloud for annotation is not preferred if not prohibited. While it is possible to obtain annotation locally by directly asking users to provide preferred responses, such annotations have to be sparse to not affect user experience. In addition, the storage of edge devices is usually too limited to enable large-scale fine-tuning with full user-generated data. It remains an open question how to enable on-device LLM personalization, considering sparse annotation and limited on-device storage. In this paper, we propose a novel framework to select and store the most representative data online in a self-supervised way. Such data has a small memory footprint and allows infrequent requests of user annotations for further fine-tuning. To enhance fine-tuning quality, multiple semantically similar pairs of question texts and expected responses are generated using the LLM. Our experiments show that the proposed framework achieves the best user-specific content-generating capability (accuracy) and fine-tuning speed (performance) compared with vanilla baselines. To the best of our knowledge, this is the very first on-device LLM personalization framework.
IP
Engineering Tracks
IP
DescriptionEver-increasing demand for higher data-transfer speeds leads to the evolution of new serial-link protocols and the advancement of existing ones. Implementing and supporting these serial-link protocols requires evolution of the PHY and controller. PHY IPs need to be tested up front on a silicon chip (PHY IPs are analog/mixed-signal IPs) for every technology node/foundry. Data from this PHY chip needs to be transferred to an FPGA for validation with the controller. Increasing bit rates pose a significant challenge due to the need for a large number of GPIOs in the PHY chip. This increased number of GPIOs increases the size of the pad-limited PHY chip, which increases cost. To address this problem, the proposed solution is to use lanes of a lower-speed SerDes to transfer data between the PHY chip and the FPGA instead of multiple parallel GPIOs.
Keynote
Special Event
Design
DescriptionImmersive computing (including virtual, augmented, mixed, and extended reality, metaverse, digital twins, and spatial computing) has the potential to transform most industries and human activities to create a better world for all. Delivering on this potential, however, requires bridging an orders of magnitude gap between the power, performance, and quality-of-experience attributes of current and desirable immersive systems. With a number of conflicting requirements - 100s of milliwatts of power, milliseconds of latency, unbounded compute to realize realistic sensory experiences – no silver bullet is available. Further, the true goodness metric of such systems must measure the subjective human experience within the immersive application. This talk calls for an integrative research agenda that drives codesigned end-to-end systems from hardware to system software stacks to foundation models spanning the end-user device/edge/cloud, with metrics that reflect the immersive human experience, in the context of real immersive applications. I will discuss work pursuing such an approach as part of the IMMERSE Center for Immersive Computing which brings together immersive technologies, applications, and human experience, and in the ILLIXR project based on an open-source end-to-end system to democratize immersive systems research.
Research Manuscript
Design
Emerging Models of Computation
DescriptionDeep neural networks (DNNs) have advanced significantly over the past decade, embracing diverse artificial intelligence (AI) tasks. In-memory computing (IMC) architecture emerges as a promising paradigm, improving the energy efficiency of multiply-and-accumulate (MAC) operations within DNNs by integrating parallel computations within the memory arrays. Various high-precision analog IMC array designs have been developed based on both SRAM and emerging non-volatile memories (NVMs). These designs perform MAC operations on partial inputs and weights, with the corresponding partial products then fed into shift-add circuitry to produce the final MAC results. However, existing works often involve an intricate shift-add process for weights. The traditional digital shift-add process is limited in throughput due to time-multiplexing of ADCs, and advancing the shift-add process to the analog domain necessitates customized circuit implementations, resulting in compromises in energy and area efficiency. Furthermore, the joint optimization of MAC operations and the weight shift-add process is rarely explored. In this paper, we propose novel, energy-efficient dual designs of ferroelectric FET (FeFET) based high-precision analog IMC featuring inherent shift-add capability. We introduce a FeFET-based IMC paradigm that performs partial MAC in each column and inherently integrates the shift-add process for 4-bit weights by leveraging FeFET's analog storage characteristics. This effectively eliminates the need for additional dedicated shift-add circuitry in multi-bit weight processing. The paradigm supports MAC in both 2's complement mode (2CM) and non-2's complement mode (N2CM), thereby offering flexible support for 4-/8-bit weight data in 2's complement format. Building upon this paradigm, we propose novel FeFET-based dual designs, CurFe for the current mode and ChgFe for the charge mode, to accommodate the high-precision analog-domain IMC architecture.
Evaluation results at circuit and system levels indicate that the circuit/system-level energy efficiency of the proposed FeFET-based analog IMC is 1.56X/1.37X higher when compared to the state-of-the-art analog IMC designs.
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
DescriptionThere is an increasing demand for ultra-low power in Edge AI devices, such as smartphones, wearables, and Internet-of-Things sensor systems, with constrained battery budgets. Current AI computation units face challenges, primarily from the memory-wall issue, limiting overall system-level performance. In this paper, we propose a new SRAM-based Compute-In-Memory (CIM) accelerator optimized for Spiking Neural Networks (SNNs) inference. Our proposed architecture employs a multiport SRAM design with multiple decoupled read ports to enhance the throughput and transposable read-write ports to facilitate online learning. Furthermore, we develop an Arbiter circuit for efficient data processing and port allocations during the computation. Results for a 128x128 array in 3nm FinFET technology demonstrate a 3.1x improvement in speed and a 2.2x enhancement in energy efficiency with our 5R1W SRAM design compared to the traditional single-port SRAM design. At the system level, a throughput of 44 MInf/s at 607 pJ/Inf and 29 mW is achieved.
Research Manuscript
EDA
Design Verification and Validation
DescriptionGiven a formula F, the problem of model counting is to compute the number of solutions (also known as models) of F. Over the past decade, model counting has emerged as a key building block of quantitative reasoning in design automation and artificial intelligence. Given the wide-ranging applications, scalability remains the major challenge in the development of model counters. Motivated by the observation that formula simplification can dramatically impact the performance of state-of-the-art exact model counters, we design a new state-of-the-art preprocessor, Puura, that relies on a tight integration of techniques. The design of Puura is motivated by our observation that it is often beneficial to employ preprocessing techniques whose overhead may be prohibitive for the task of SAT solving but not for model counting: accordingly, we rely on a specifically tailored SAT solver design for redundancy detection, sampling-boosted backbone detection, and the storing of redundancy information for the purpose of improving propagation within top-down model counters. Our detailed empirical evaluation demonstrates that Puura achieves significant performance improvements over prior model counting preprocessors, both in the instance-size reductions achieved and in the runtime improvements of the downstream model counters.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionPre-silicon IR-drop analysis accuracy depends on the logic-toggling scenarios covered, along with other factors. Vector-based IR-drop analysis is more accurate because real stimuli are used, but it generally cannot cover the whole design or all toggling scenarios; hence, vectorless analysis is conventionally used. The key aspect of vectorless IR analysis is coverage of cell-toggling scenarios. The state-propagation-based vectorless analysis in Cadence Voltus is the right approach to ensure that silicon behavior is mimicked in pre-silicon IR analysis, as it assigns activity at sources and then propagates it through the downstream logic cone. But it is lacking in a few aspects: how different event generators can be enforced to toggle together, how more unique toggling scenarios can be created, and how to control the analysis based on target power. Three major enhancements are proposed to address these: back-propagation to enable concurrent clock- and data-network handling, increased scenario coverage by reshuffling flops, and target-power-based controls. The enhancements significantly improve the vectorless analysis through realistic toggling activities, increased coverage, and granular power controls. This in turn helps make pre-silicon IR analysis closure more robust, enhancing the chances of first-pass silicon success.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionThe complicated dielectric profile under advanced process technologies challenges the accuracy of floating random walk (FRW) based capacitance extraction, as the latter pre-computes the surface Green's functions for a finite set of multi-dielectric transition cubes and makes approximations of transition cubes during the FRW process. In this work, we derive analytic surface Green's functions for transition cubes with arbitrary stratified dielectrics and propose a fast algorithm named AGF to compute them. A capacitance solver named FRW-AGF is then proposed to incorporate AGF into the FRW process to accurately model realistic transition cubes. Experimental results show that the proposed AGF is over 100x faster than the state-of-the-art, and FRW-AGF largely improves the accuracy of RWCap4 [3, 17] (reducing all errors against golden values to below 5%) without degrading computational speed or parallel scalability.
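FRW extraction belongs to the family of random-walk solvers for Laplace-type problems. A minimal walk-on-spheres sketch conveys the flavor (a generic textbook technique, not the paper's AGF method; the disk domain and boundary data are arbitrary choices):

```python
import math, random

def walk_on_spheres(x, y, boundary_value, eps=1e-3, walks=20000, rng=None):
    """Estimate the harmonic function u at (x, y) inside the unit disk by
    walk-on-spheres: repeatedly jump to a uniform point on the largest
    circle centered at the current point that stays inside the domain,
    until the walk is within eps of the boundary."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(walks):
        px, py = x, y
        while True:
            r = 1.0 - math.hypot(px, py)   # distance to the unit-circle boundary
            if r < eps:
                break
            theta = rng.uniform(0.0, 2.0 * math.pi)
            px += r * math.cos(theta)
            py += r * math.sin(theta)
        total += boundary_value(px, py)
    return total / walks

# Boundary data u = x on the circle; its harmonic extension is u(x, y) = x,
# so the estimate at (0.5, 0) should come out close to 0.5.
est = walk_on_spheres(0.5, 0.0, lambda bx, by: bx)
print(round(est, 2))
```

In capacitance extraction the jump distribution is governed by surface Green's functions of multi-dielectric transition cubes rather than the uniform circle used here, which is exactly what AGF computes analytically.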
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThis paper presents innovative approaches to reduce the runtime of complex System on Chip (SoC) verification, particularly in the context of Analog Mixed-Signal (AMS) co-simulation. The methodologies discussed can be applied to broaden the scope of AMS simulations and improve their quality and coverage. The primary focus is on a SystemVerilog EEnet methodology tailored for Analog Test Bus (ATB) test cases, aimed at shrinking the scope of AMS co-simulation and enhancing Analog Behavioral Models (ABMOD) for Digital Mixed-Signal (DMS) test cases.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe objective of this paper is to show how it is possible to enhance and accelerate verification by exploiting Python scripting.
Digital verification engineers are very often required to write a huge amount of repetitive code. This is particularly evident for verification structures used to verify memory elements, such as register maps, OTPs or NVMs. The main risk in this task is the likelihood of making mistakes while writing the same thing, slightly changed, over and over. Moreover, small changes in the specification may lead to a large amount of code being rewritten.
The solution found lies in a proficient use of the Python language to analyze and elaborate specification files to produce all the code needed for verification, both UVM and Formal.
Two cases are presented: the first one is a specific application for register maps. The second use case is a generic approach to covergroups and assertions writing automation.
Results can be summarized as follows: both the initial development time and the maintenance time are reduced across the duration of a single project. Additionally, by sharing the scripts among different team members, the script-writing effort becomes negligible. Lastly, having your own scripts allows you to customize them to your own needs.
Overall, the use and reuse of Python scripts reduces verification time, leading to a shorter time-to-market.
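The spec-to-code flow described above can be sketched in a few lines of Python; the CSV register-map format, the generated covergroup style, and all names here are hypothetical, not the authors' actual scripts:

```python
import csv, io

# A hypothetical register-map spec: name, offset, width (not the paper's format).
SPEC = """name,offset,width
CTRL,0x00,8
STATUS,0x04,4
"""

COVERGROUP = """covergroup cg_{name};
  cp_{name}: coverpoint reg_{name} {{ bins all[] = {{[0:{maxval}]}}; }}
endgroup
"""

def generate_covergroups(spec_text):
    """Turn each spec row into a SystemVerilog covergroup skeleton."""
    out = []
    for row in csv.DictReader(io.StringIO(spec_text)):
        out.append(COVERGROUP.format(name=row["name"].lower(),
                                     maxval=(1 << int(row["width"])) - 1))
    return "\n".join(out)

print(generate_covergroups(SPEC))
```

When the specification changes, regenerating the covergroups is a single script run, which is the maintenance-time saving the abstract claims.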
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionIn hardware designs, complex datapath algorithms are initially realized using high-level languages like C/C++ or SystemC to establish the correctness of the architectural intent. The same algorithm is then implemented in SystemVerilog, considering hardware aspects like power, performance, and area. The high-level abstract model developed and verified initially is used as a golden specification to verify the RTL. It has been established that simulation alone is not sufficient for such complex algorithms. However, when performing Formal Equivalence Verification (FEV) between these two models, FEV tools often struggle to establish equivalency due to significant differences in data handling between the specification (spec) and implementation (impl). Currently, there is no automated or solver support from industry tools that can internally handle these abstraction differences and provide full confidence in the design. An incomplete proof in algorithm design can lead to missing deep corner-case bugs. Since we have many such algorithms which, if left unconverged, pose a threat to the accuracy of the blocks, we have developed a strategy for understanding the abstraction difference and reducing it by creating/suggesting intermediate models to the tools. These models are proven in steps to help the tools reach complete confidence. This presentation describes one such use case, explaining the rationale behind identifying potentially unverifiable abstraction differences and the techniques used to create intermediate models and assist the tools. Such a methodology can vastly enhance the ability of the FV community to converge on designs that are otherwise deemed impossible to verify through traditional FEV.
Exhibitor Forum
DescriptionMicroelectronics security is the last bastion of cybersecurity. It is an expansive attack surface and is increasingly susceptible to malicious inclusions and in-field attack. Vulnerabilities include hardware trojans, embedded functions, takeover attacks, kill switches, performance degradation, and unauthorized infiltration and exfiltration of instructions or data. To address the growing landscape of semiconductor device exposure, cybersecurity solutions are incorporating AI and ML technology to enhance system monitoring, threat detection, and in-system mitigation. This presentation will explore ML applications to expose design risks, and ensure these components function as intended and are cyber-hardened against vulnerabilities – throughout the life cycle.
There are three essential considerations for the development of an intrusion-detecting ML model: 1) next-generation ability to expose (and instrument) design vulnerabilities pre-silicon; 2) the ability to insert and emulate cyber intrusions into the semiconductor device, and 3) the ability to observe the design behavior at crucial design points on-chip, at-speed, and in-system.
In this presentation, a methodology will be presented to identify vulnerable nodes, add instrumentation that makes them observable, and collect data for ML training.
Using functional vectors, a golden model of expected behavior at these crucial nodes is created from data collected from instrumentation. Next, cyber-attacks are emulated and the response at these crucial nodes is monitored. Using data collected from normal device operation, an ML model is trained to detect the intrusion and initiate corrective actions for remediation or recovery.
This methodology has been effectively demonstrated in a prototype application.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionDeep Neural Network (DNN) applications demanding high memory bandwidth present a significant challenge. DNNs contain weight and activation data. Weight data are only read during inference, whereas activation data are modified to store intermediate results. We propose a reduced retention-time MRAM-based main memory, where the MRAM is divided into two partitions with different retention times. In this scheme, DNN weights are mapped to the long retention-time partition, while activation data can be mapped to the short retention-time partition. Two circular-buffer mapping schemes demonstrate an average bandwidth improvement of up to 14.4% over DRAM.
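The partitioned mapping can be sketched as a toy memory model (all sizes and names are illustrative; this is not the paper's hardware design):

```python
class PartitionedMemory:
    """Toy model of a two-partition MRAM: read-only weights go to the
    long-retention partition; activations cycle through a circular buffer
    in the short-retention partition (all sizes here are illustrative)."""
    def __init__(self, long_words, short_words):
        self.long = [None] * long_words     # long retention: weights
        self.short = [None] * short_words   # short retention: activations
        self.next_long = 0
        self.head = 0                       # circular-buffer write pointer

    def store_weight(self, w):
        addr = self.next_long
        self.long[addr] = w
        self.next_long += 1
        return ("long", addr)

    def store_activation(self, a):
        addr = self.head
        self.short[addr] = a                           # stale data may simply decay
        self.head = (self.head + 1) % len(self.short)  # wrap around the buffer
        return ("short", addr)

mem = PartitionedMemory(long_words=4, short_words=2)
print(mem.store_weight(0.5))        # ('long', 0)
print(mem.store_activation(1.0))    # ('short', 0)
print(mem.store_activation(2.0))    # ('short', 1)
print(mem.store_activation(3.0))    # ('short', 0)  wraps around
```

The point of the circular buffer is that intermediate activations are overwritten before the short retention time expires, so no refresh energy is spent on them.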
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionThe complexity of modern System on Chip (SoC) designs has driven the adoption of advanced verification techniques, which still face the challenge of catching functional failures that would otherwise go undetected.
In the area of wireless and wireline communication products, designers initially model Digital Signal Processing (DSP) blocks using high-level languages like MATLAB® or C/C++. Subsequently, they translate these models into Register Transfer Level (RTL) implementations. Traditional verification methods involve months-long Universal Verification Methodology (UVM) dynamic simulations, yet subtle bugs in critical DSP blocks can remain hidden.
To address this, Formal Equivalence Verification (FEV) emerges as a powerful complement to dynamic simulations. Thanks to recent improvements in the capabilities of formal solvers, FEV offers a unique advantage by mathematically checking the functional correctness of an RTL implementation (timed) against its high-level C/C++ model (untimed), drastically reducing verification time, and ensuring exhaustive coverage of the design state space.
This paper presents an in-depth exploration of complex FSM-with-datapath verification, namely a Floating-Point MAC, a Tone Generator, and an Automatic Gain Control, utilizing the Formal Equivalence methodology. Although these DSP blocks had been validated through months of UVM dynamic simulations, subtle hidden corner-case bugs were spotted within a few weeks by adopting the proposed flow, thereby reducing the verification effort.
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionThis paper presents an innovative approach to "no-harm" verification using Formal Sequential Equivalence Checking (FSEC) for the comprehensive comparison of original and customized designs, specifically targeting equivalence without the introduction of new instructions. The discussion delves into the challenges inherent in FSEC and proposes a set of customization rules designed to produce a new Register-Transfer Level (RTL) design with minimal modifications. In instances where automatic proof strategies prove insufficient, the paper advocates a manual, divide-and-conquer verification approach. Notably, it provides insights into the verification of the entire design with a black-boxed execute stage. By showcasing the ongoing relevance of formal verification, this work underscores its effectiveness in identifying issues beyond the scope of contemporary verification methodologies, establishing its pivotal role in ensuring design integrity.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionServerless computing has gained widespread attention, and Trusted Execution Environments (TEEs) are well-suited for safeguarding user privacy. However, the additional startup procedure introduced by TEEs imposes considerable performance overhead on confidential serverless workloads. This paper introduces a novel parallelized enclave startup design, EnTurbo, which eliminates the integrity dependence of the enclave startup procedure, accelerating it while ensuring its security. Additionally, EnTurbo parallelizes the measurement procedure, enabling multi-thread measurement for acceleration with provable security. We evaluate EnTurbo by running confidential serverless workloads on SGX simulation mode. Results show that EnTurbo effectively speeds up enclave serverless by 1.42x-6.48x (SGXv1) and 1.33x-3.76x (SGXv2).
Research Manuscript
AI
Security
AI/ML Security/Privacy
DescriptionSpiking neural networks (SNNs) are emerging as energy-efficient alternatives to conventional artificial neural networks (ANNs). Their event-driven information processing significantly reduces computational demands while maintaining competitive performance.
However, as SNNs are increasingly deployed in edge devices, various security concerns have emerged. While significant research efforts have been dedicated to addressing the security vulnerabilities stemming from malicious input, often referred to as adversarial examples, the security of SNN parameters remains relatively unexplored.
This work introduces a novel attack methodology for SNNs known as Energy-Oriented SNN attack (EOS). EOS is designed to increase the energy consumption of SNNs through the malicious manipulation of binary bits within their memory systems (i.e., DRAM), where neuronal information is stored.
The key insight of EOS lies in the observation that energy consumption in SNN implementations is intricately linked to spiking activity.
EOS employs bit-flip operations based on the well-known Row Hammer technique. It identifies the most robust neurons in the SNN based on their spiking activity, particularly those related to the firing threshold, which is stored as binary bits in memory. EOS combines spiking activity analysis with a progressive search strategy to pinpoint the target neurons for bit-flip attacks. The primary objective is to incrementally increase the energy consumption of the SNN while ensuring that accuracy remains intact.
With the implementation of EOS, successful attacks on SNNs can lead to an average of $43\%$ energy increase with no drop in accuracy.
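A minimal sketch of the underlying idea, that a single flipped bit in a stored firing threshold can sharply increase spiking activity and hence energy, is shown below (the integrate-and-fire model and the bit chosen are illustrative; this is not the EOS search procedure):

```python
import struct

def flip_bit(value, bit):
    """Flip one bit of a float32's in-memory representation (a stand-in
    for a Row-Hammer-style fault on a stored neuron parameter)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return out

def spikes(inputs, threshold):
    """Count how often a toy integrate-and-fire neuron crosses threshold."""
    v, n = 0.0, 0
    for x in inputs:
        v += x
        if v >= threshold:
            n += 1
            v = 0.0   # reset membrane potential after a spike
    return n

inputs = [0.4] * 20
thr = 2.0
low = flip_bit(thr, 30)  # clearing the top exponent bit: 2.0 -> 0.0
print(spikes(inputs, thr), spikes(inputs, low))  # prints: 4 20
```

One fault multiplies the spike count fivefold while the neuron still fires on every input burst, which mirrors the attack goal of raising energy without an accuracy drop.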
Research Manuscript
AI
AI/ML Algorithms
DescriptionThe exploration of Processing-In-Memory (PIM) accelerators has garnered significant attention within the research community. However, the utilization of large-scale neural networks on PIM accelerators encounters challenges due to constrained on-chip memory capacity. To tackle this issue, current works explore model compression algorithms to reduce the size of Convolutional Neural Networks (CNNs). Most of these algorithms either aim to represent neural operators with reduced-size parameters (e.g., quantization) or search for the best combinations of neural operators (e.g., neural architecture search). Designing neural operators to align with PIM accelerators' specifications is an area that warrants further study. In this paper, we introduce the Epitome, a lightweight neural operator offering convolution-like functionality, to craft memory-efficient CNN operators for PIM accelerators (EPIM). On the software side, we evaluate epitomes' latency and energy on PIM accelerators and introduce a PIM-aware layer-wise design method to enhance their hardware efficiency. We apply epitome-aware quantization to further reduce the size of epitomes. On the hardware side, we modify the datapath of current PIM accelerators to accommodate epitomes and implement a feature map reuse technique to reduce computation cost. Experimental results reveal that our 3-bit quantized EPIM-ResNet50 attains 71.59% top-1 accuracy on ImageNet, reducing crossbar areas by 30.65x. EPIM surpasses the state-of-the-art pruning methods on PIM accelerators.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPlacement is crucial in physical design, as it greatly affects power, performance, and area metrics. Recent advancements in analytical methods, such as DREAMPlace, have demonstrated impressive performance in global placement. However, DREAMPlace has some limitations; e.g., it may not guarantee legalizable placements under the same settings, leading to fragile and unpredictable results. This paper identifies the main issue as getting stuck in local optima and proposes a hybrid optimization framework that efficiently escapes local optima by perturbing the placement result iteratively. The proposed framework achieves significant improvements compared to state-of-the-art methods on two popular benchmarks.
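The perturb-and-reoptimize idea can be sketched as a basin-hopping-style search on a toy one-dimensional objective (purely illustrative; the paper's framework operates on actual placements, not scalar functions):

```python
import random

def local_descent(f, x, step=0.01, iters=2000):
    """Crude local search: accept a small random move only if it improves f."""
    fx = f(x)
    for _ in range(iters):
        cand = x + random.uniform(-step, step)
        fc = f(cand)
        if fc < fx:
            x, fx = cand, fc
    return x, fx

def perturb_and_reoptimize(f, x0, rounds=20, kick=3.0, seed=1):
    """Escape local optima by alternating local search with random kicks,
    keeping the best solution seen (the flavor of the hybrid framework)."""
    random.seed(seed)
    best_x, best_f = local_descent(f, x0)
    x = best_x
    for _ in range(rounds):
        x, fx = local_descent(f, x + random.uniform(-kick, kick))
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# A double-well objective: a poor local minimum near x = 2,
# and the global minimum near x = -2.
f = lambda x: (x * x - 4) ** 2 + x
x, fx = perturb_and_reoptimize(f, x0=2.0)
print(round(x, 1), round(fx, 2))
```

Pure local descent from x0 = 2 stays trapped near the poor minimum; the kicks give the search a chance to cross the barrier into the better basin.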
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe fast Fourier transform (FFT) is widely used to convert a time-domain signal into its frequency-domain representation in various fields. Previous works have demonstrated efficient FFT implementation on various accelerators. The emergence of AI Engines (AIE) on AMD Xilinx's Versal ACAP brings the possibility of further improvement in computing efficiency. However, previous solutions have been restricted to a single-AIE manner, which limits the FFT size and neglects the potential of employing multiple AIEs. This paper proposes the ESFA framework, which can efficiently and automatically implement a scalable FFT on the Versal ACAP with multiple AIEs. The framework includes an analytical model to report quality-of-results (QoR) estimates for legal FFT partition modes, comprehensively covering the throughput-resource trade-off choices across the design space. In addition, an automatic code generator is developed in the framework to enable agile implementation of the desired design. Our experiments on the VCK190 board show that we achieve 9,226 MS/s throughput on the 1K-point FFT with a data width of 32, which obtains up to 12.3x speedup compared with AMD Xilinx's library targeting AIE, as well as 17.5x, 5.1x, and 10.1x speedups compared to state-of-the-art designs based on ASIC, CGRA, and FPGA, respectively.
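The divide-and-conquer structure that FFT partitioning schemes map onto parallel compute tiles can be seen in a textbook radix-2 sketch (generic Cooley-Tukey in Python, not AIE code from the ESFA framework):

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT (length must be a power of two). The
    even/odd split below is the recursive structure that scalable FFT
    implementations distribute across parallel processing elements."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

# Impulse input: the spectrum of [1, 0, 0, 0] is flat (all ones).
print([round(c.real) for c in fft([1, 0, 0, 0])])   # [1, 1, 1, 1]
```

Each recursion level is independent work of half the size, which is why the partition-mode design space (how many tiles per stage) is rich enough to need an analytical QoR model.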
Research Manuscript
Autonomous Systems
Autonomous Systems (Automotive, Robotics, Drones)
DescriptionEvent-based vision sensors have demonstrated great promise in applications like autonomous UAVs. However, deploying event-based algorithms on heterogeneous edge platforms is inefficient due to mismatch between irregular nature of event streams and diverse characteristics of algorithms (mixture of spiking and conventional neural networks) on one hand and the underlying hardware platform on the other. We introduce Ev-Edge, a framework that contains three key optimizations to boost performance of event-based vision systems on edge platforms. Ev-Edge achieves 1.28x-2.05x latency and 1.23x-2.15x energy improvements over an all-GPU implementation and 1.42x-1.98x latency improvements over round-robin scheduling methods in multi-task execution scenarios with negligible accuracy loss on the NVIDIA Jetson Xavier platform.
IP
Engineering Tracks
IP
DescriptionStandard cell libraries are the foundation to implementing the largest, most advanced digital designs today. Selecting the correct standard cell library for a design is an important step that has lasting implications that will impact final power, performance, and area (PPA) metrics of the chip, as well as tapeout schedule.
Liberty (.lib) models encapsulate PPA characteristics of standard cell libraries, but profiling the PPA of .libs from different sources is difficult because of varying cell types, pins, timing arcs, and structural differences between libraries. Synthesizing a test design with different .libs to measure PPA incurs schedule overhead, and the results will be heavily influenced by the type of test design used.
This paper discusses a methodology for comparing PPA at the library level, where a library analysis tool is utilized to correctly align the different cells, pins, timing arcs, and other library information between libraries, so that apples-to-apples comparisons can be made between cell types of interest. We also cover the different visualization and analysis templates that are useful for benchmarking libraries.
This approach enables users to quickly and correctly profile different .libs, select the correct library for the use case, to improve chip-level PPA and design closure schedule.
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionDeep neural networks are susceptible to model piracy and adversarial attacks when malicious end-users have full access to the model parameters. Recently, a logic locking scheme called HPNN has been proposed. HPNN utilizes a hardware root-of-trust to prevent end-users from accessing the model parameters. This paper investigates whether logic locking is secure on deep neural networks. Specifically, it presents a systematic I/O attack that combines algebraic and learning-based approaches. This attack incrementally extracts key values from the network to minimize sample complexity. In addition, it employs a rigorous procedure to ensure the correctness of the extracted key values. Our experiments demonstrate the accuracy and efficiency of this attack on large networks with complex architectures. Consequently, we conclude that HPNN-style logic locking and the variants we can foresee are insecure on deep neural networks.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionAutomated analog circuit design migration significantly alleviates the burden on designers in circuit sizing under various operating conditions. Conventional methods model the migration problem as black-box optimization, requiring excessive iterations of costly simulations to converge. Reinforcement learning exhibits significant promise in transfer learning, as it enables the generation of circuits that fulfill specifications efficiently. The paper proposes a novel value decomposition-based multi-agent reinforcement learning framework, aiming to model complex analog circuits and eliminate the need for manually defined specifications of sub-circuits under new operating conditions. Additionally, it incorporates generalized domain randomization techniques to leverage the varying information across diverse domains. Experiments demonstrate that our algorithm can efficiently generate circuits meeting specifications under new operating conditions in a small number of steps, outperforming state-of-the-art methods.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe pervasive proliferation of computing infrastructure in recent decades has led to an increased fraction of worldwide energy consumption and greenhouse gas (GHG) emissions associated with computing. Such contributions are projected to increase quickly. Traditionally, computing research has been primarily focused on performance, power, and area optimization, with a much lower emphasis on the carbon footprint (CF) associated with computations. Hence, more holistic techniques are needed to mitigate Information Communication Technology's GHG emissions. To address this need, we propose Evergreen, a three-part approach comprised of (1) a holistic model of operational and embodied emissions of compute hardware, transmission infrastructure, and battery energy storage systems, (2) a CF predictor based on this model, and (3) a user-driven, carbon-aware scheduler to minimize GHG emissions of workloads on cloud environments. To the best of our knowledge, this work proposes the most holistic model and corresponding scheduler so far.
Using a case study, we demonstrate that Evergreen can reduce emissions by 19.6x with carbon-optimal scheduling compared to latency-optimal scheduling in data centers with only a 2.1% latency overhead.
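The carbon-aware scheduling idea can be sketched as choosing a start time that minimizes total emissions under a latency bound (the forecast values and interface here are made up, not Evergreen's model):

```python
def best_start(intensity, job_hours, deadline_hours):
    """Pick the start hour that minimizes total emissions for a job of
    job_hours, subject to finishing within deadline_hours. `intensity`
    is a per-hour gCO2/kWh forecast (values below are invented)."""
    best = None
    for start in range(deadline_hours - job_hours + 1):
        cost = sum(intensity[start:start + job_hours])
        if best is None or cost < best[1]:
            best = (start, cost)
    return best

forecast = [450, 430, 300, 120, 110, 140, 380, 420]  # gCO2/kWh, hypothetical
print(best_start(forecast, job_hours=2, deadline_hours=8))  # (3, 230)
```

Deferring the two-hour job to the low-carbon window costs a few hours of latency but less than a third of the emissions of starting immediately, which is the trade-off the scheduler exposes to the user.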
Research Manuscript
EDA
Physical Design and Verification
DescriptionYield estimation and optimization have become increasingly important for circuit design as technology nodes scale down. Simple yet well-established minimal norm importance sampling (MNIS) still serves as an industrial standard due to its robustness and reliability. In this study, we generalize the classic MNIS and propose Every Failure Is A Lesson (EFIAL) to utilize every failure sample (instead of one in MNIS) to construct the proposal distribution. EFIAL is completely tuning-free and the update computation complexity is only $\mathcal{O}(M)$ ($M$ is the number of failure samples) by utilizing the blessing of dimensionality.
The idea of EFIAL is then extended to the state-of-the-art (SOTA) pre-sampling method, onion sampling, to significantly boost efficiency, by up to 9.08x (4.68x on average). Extensive evaluations against SOTA yield estimation methods reveal that EFIAL achieves a speedup of up to 13.54x (5.16x on average) and an accuracy improvement of up to 24.91\%.
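The MNIS idea the paper generalizes, centering an importance-sampling proposal on a failure point to estimate rare failure probabilities, can be sketched in one dimension (a generic illustration, not EFIAL itself):

```python
import math, random

def failure_prob(fails, shift, samples=100000, seed=0):
    """Estimate P(fail) for x ~ N(0,1) by importance sampling: draw from
    the shifted proposal N(shift, 1) and reweight by the likelihood
    ratio (the MNIS idea of centring the proposal on a failure point)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(shift, 1.0)
        if fails(x):
            # weight = standard-normal pdf / shifted-proposal pdf
            total += math.exp(-0.5 * x * x + 0.5 * (x - shift) ** 2)
    return total / samples

t = 4.0   # failure threshold; P(N(0,1) > 4) is about 3.2e-5
est = failure_prob(lambda x: x > t, shift=t)
true = 0.5 * math.erfc(t / math.sqrt(2))
print(f"{est:.2e}  (exact {true:.2e})")
```

Plain Monte Carlo would need millions of draws to see a single failure here; the shifted proposal makes roughly half the draws fail, so a modest sample budget already gives a tight estimate. EFIAL's contribution is building the proposal from every observed failure sample rather than a single shift point.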
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThere has been a long-standing trend to present information in tabular form in all kinds of documents, regardless of application, since tables convey more information in fewer words. However, extracting and processing data from these tables can be very challenging. In the Very Large-Scale Integration (VLSI) domain, much useful information is encapsulated in tabular form in data sheets, specifications, and similar documents; the same holds for other fields such as AI, ML, and DBMS. The Double Data Rate Generation 5 (DDR5) specification alone has more than 300 tables, contributing roughly 20% or more of the entire specification, and more than 50 specification iterations were needed to get there. Manual extraction of this data and information is error-prone and requires significant effort. We therefore propose a solution for automatically extracting tabular information and processing it into executable code. The demonstrated solution is generic enough to be extended to any other application where information in tabular form needs to be extracted and processed.
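The extract-and-generate flow can be sketched on a small plain-text table (the table format, parameter names, and values are hypothetical, not actual DDR5 data):

```python
# A hypothetical timing table, as it might appear in a plain-text spec
# (illustrative data only, not actual DDR5 values).
TABLE = """
Parameter | Min | Unit
tCK       | 0.8 | ns
tRCD      | 14  | ns
"""

def parse_table(text):
    """Split a pipe-delimited table into a list of row dictionaries."""
    lines = text.strip().splitlines()
    header = [c.strip() for c in lines[0].split("|")]
    return [dict(zip(header, (c.strip() for c in line.split("|"))))
            for line in lines[1:]]

def emit_checks(rows):
    """Turn each row into an executable assertion, emitted as Python source."""
    return "\n".join(
        f'assert {r["Parameter"]} >= {r["Min"]}  # {r["Unit"]}' for r in rows)

rows = parse_table(TABLE)
print(emit_checks(rows))
```

Running the sketch prints one generated assertion per table row; the same parse-then-emit pattern extends to any target format the executable code needs to take.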
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionComputation using processing in-memory (PIM) is performed by breaking down computationally expensive operations into in-memory kernels that can be efficiently executed using non-volatile memory. Logic styles such as MAGIC require that each output memory cell be prepared for evaluation before executing the functional logic operation. State-of-the-art synthesis algorithms perform the preparation immediately after memory cells have expired. Unfortunately, this results in columns of cells being prepared one by one instead of leveraging efficient parallel data preparation instructions. In this paper, we propose the PREP framework, which maximizes the opportunities for parallel column preparation using execution sequence optimization.
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionWith the continuous advancement of processors, modern micro-architecture designs have become increasingly complex. The vast design space presents significant challenges for human designers, making design space exploration (DSE) algorithms an essential tool for micro-architecture design. In recent years, efforts have been made in the development of DSE algorithms, and promising results have been achieved. However, the existing DSE algorithms, e.g., Bayesian Optimization and ensemble learning, suffer from poor interpretability, hindering designers' understanding of the decision-making process.
To address this limitation, we propose utilizing Fuzzy Neural Networks to induce and summarize knowledge and insights from the DSE process, enhancing the interpretability and controllability of DSE results.
Furthermore, to improve efficiency, we introduce a multi-fidelity reinforcement learning approach, which primarily conducts exploration using inexpensive but imprecise data, thereby substantially diminishing the reliance on costly data.
Experimental results show that our method achieves excellent results with a very limited sample budget and surpasses the current state of the art.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe inverse design of distributed circuits refers to generating circuits that tightly meet desirable specifications. Previous methods assume either over-restricted candidate templates or the differentiability of the evaluation procedure. These assumptions are contrary to real design practice, which uses non-restrictive template types and non-differentiable evaluation procedures. In this paper, we propose the Distributed Circuit Design Agent (DCDA), which generates distributed circuits to meet desirable transfer functions without any assumption regarding the design template types. Our agent trains a neural network that produces a near-optimal joint distribution as a set of conditional distributions to sample all design dimensions in a single step. We map sampled design dimensions to physical properties of resonators in order to establish physical evaluation feedback, which helps the agent adjust its sampling policy. Our experimental results show that, without any assumption regarding template types, most distributed circuits generated by our method achieve better performance than those generated by the state-of-the-art approach to inverse design.
Research Manuscript
Autonomous Systems
Autonomous Systems (Automotive, Robotics, Drones)
DescriptionFederated learning (FL) enables massive numbers of edge devices to collaboratively train object detection models in mobile computing scenarios. However, the distributed nature of FL exposes significant security vulnerabilities. Existing attack methods either require considerable cost to compromise the majority of participants or suffer from poor attack success rates. Motivated by these weaknesses, we devise an efficient fake-node-based perception poisoning attack strategy (FNPPA). In particular, FNPPA poisons local data and injects multiple fake nodes into the aggregation, making the local poisoning model more likely to overwrite clean updates. Moreover, it achieves greater malicious influence on target objects at a lower cost without affecting the normal detection of other objects. We demonstrate through exhaustive experiments that FNPPA exhibits superior attack impact compared to the state of the art in terms of average precision and aggregation effect.
IP
Engineering Tracks
IP
DescriptionWith ever-shrinking CMOS technology, particularly in the nanometer regime where devices operate at ultra-low voltages, device variation poses a major challenge for SRAM designs that use the smallest-feature devices. Achieving good yield on silicon requires very high sigma qualification (>6-sigma), depending on the application and the total capacity used in the SoC. Existing CAD solutions mostly rely on methodologies such as importance sampling or extreme value distribution (EVD). These methodologies suffer from inaccuracies and are not feasible for larger circuits. Further, the non-Gaussian nature of variations puts a hard limitation on such methodologies.
In this work, we demonstrate precise yield estimation using HSMC and DSVC, available in the Synopsys AVA suite. This methodology uses a machine-learning algorithm to precisely evaluate n-sigma measurements even for larger circuits (~2k MOS). We identify the blocks in memory that dominate variation, such as the bitcell, sense amplifier, and wordline underdrive circuit. HSMC and DSVC capture the n-sigma tail behaviour of these blocks. Replacing these blocks with their m-sigma equivalent SPICE models in the full memory instance allows n-sigma full-entity qualification to be analyzed even in a nominal simulation, avoiding the need for statistical simulation at the full memory instance. This methodology reduces the turn-around time for the analysis from ~2 weeks to ~1 day.
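The sigma bookkeeping behind such high-sigma flows can be sketched in a few lines of generic statistics. This is not the HSMC/DSVC algorithm itself, and `failures_to_sigma` is a hypothetical helper name; the sketch only shows why plain Monte Carlo is infeasible at 6-sigma:

```python
from statistics import NormalDist

# Convert a failure probability into an n-sigma figure via the inverse
# standard-normal CDF (the usual yield convention).
def failures_to_sigma(fail_prob):
    return -NormalDist().inv_cdf(fail_prob)

# A 6-sigma bitcell fails on the order of 1e-9 of the time, so naive
# Monte Carlo would need billions of samples per corner to observe even
# a handful of failures -- the motivation for high-sigma methods.
p6 = NormalDist().cdf(-6.0)
print(f"6-sigma fail prob ~ {p6:.2e}")
print(f"round trip: {failures_to_sigma(p6):.2f} sigma")
```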
Research Manuscript
Design
Quantum Computing
DescriptionSilicon quantum dot devices stand as promising candidates for large-scale quantum computing due to their extended coherence times, compact size, and recent experimental demonstrations of sizable qubit arrays. Despite the great potential, controlling these arrays remains a significant challenge. This paper introduces a new virtual gate extraction method to quickly establish orthogonal control on the potentials for individual quantum dots. Leveraging insights from the device physics, the proposed approach significantly reduces the experimental overhead by focusing on crucial regions around charge state transition. Furthermore, by employing an efficient voltage sweeping method, we can efficiently pinpoint these charge state transition lines and filter out erroneous points. Experimental evaluation using real quantum dot chip datasets demonstrates a substantial 5.84x to 19.34x speedup over conventional methods, thereby showcasing promising prospects for accelerating the scaling of silicon spin qubit devices.
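The virtual-gate concept the paper builds on can be sketched as follows. This is the generic textbook construction with hypothetical coupling values, not the paper's extraction method: invert the measured cross-coupling matrix so each "virtual" gate shifts exactly one dot potential.

```python
# coupling[i][j]: response of dot i to physical gate j (hypothetical values).
def virtual_gate_matrix(coupling):
    n = len(coupling)
    # Gauss-Jordan inversion, kept dependency-free for illustration.
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(coupling)]
    for col in range(n):
        piv = aug[col][col]
        aug[col] = [x / piv for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

M = [[1.0, 0.3], [0.2, 1.0]]       # strong diagonal, weak cross-talk
Minv = virtual_gate_matrix(M)
# A unit step on virtual gate 0 (column 0 of Minv) moves only dot 0:
step = [sum(M[i][k] * Minv[k][0] for k in range(2)) for i in range(2)]
print(step)  # ~[1.0, 0.0]
```

The paper's contribution is obtaining the entries of this coupling matrix quickly from sparse measurements around charge transitions; the matrix inversion itself is standard.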
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionMulti-voltage SoCs with uncorrelated supplies are becoming increasingly common as many low-power devices come to market. Here, non-timing-critical blocks are designed at a lower voltage (power saving) and high-performance blocks at a higher voltage (desired performance). In such low-power SoCs, timing closure poses a bigger challenge under tight schedules, since predictable results are needed before tape-out and timing signoff of the chip must be done across multiple corners and multiple modes (MCMM). Single-voltage timing analysis is easier. But with multi-level supply voltages and dynamic scaling features, the timing analysis complexity increases, because timing signoff must additionally cover cross-voltage paths, which are not guaranteed to show worst-case timing at either voltage corner. Multi-voltage designs need exhaustive analysis of cross-voltage-domain paths to make sure all worst-case paths are identified under all voltage combinations. With numerous operating PVT corners, timing analysis across corners becomes even more challenging. Synopsys PrimeTime-based simultaneous multi-voltage aware analysis (SMVA) helps attain this: it analyzes all cross-domain paths under all voltage scenarios in a single run, without the margining that can add pessimism. This paper describes the PrimeTime-based SMVA methodology for predictable and faster timing closure of multi-power-domain designs.
Research Manuscript
AI
Security
AI/ML Security/Privacy
DescriptionWith the fast evolution of large language models (LLMs), privacy concerns with user queries arise as they may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, private embedding table query has to be formulated as a HE-based matrix-vector multiplication problem and hence, suffers from enormous computation and communication overhead. We observe the overhead mainly comes from the neglect of 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low-precision quantization noise. Hence, in this paper, we propose a private embedding table query optimization framework, dubbed FastQuery. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication costs. Compared to prior-art HE-based frameworks, e.g., CrypTFlow2, Iron, Cheetah, and CHAM, FastQuery achieves 2.7 ∼ 4.5× computation and 75.1 ∼ 84.4× communication reduction on both LLAMA-7B and LLAMA-30B.
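The one-hot structure FastQuery exploits can be seen in plaintext (no homomorphic encryption here; the toy table and the `query_as_matvec` helper are illustrative only): an embedding query formulated as a matrix-vector product touches exactly one row of the table, so almost all of the products a naive HE matvec computes are products with zero.

```python
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy 3-token, dim-2 table

def query_as_matvec(token_id, table):
    # The "query vector" is one-hot over the vocabulary.
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][d] for i in range(len(table)))
            for d in range(dim)]

print(query_as_matvec(1, table))   # [0.3, 0.4] -- identical to table[1]
```

Under HE the server cannot simply index into the table (the token id is encrypted), which is why the operation is a matvec at all; FastQuery's contribution is packing and quantizing so that the one-hot structure still cuts the cost.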
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionTraining Graph Neural Networks (GNNs) on a large monolithic graph presents unique challenges, as the graph cannot fit within a single machine and cannot be decomposed into smaller disconnected components. Distributed sampling-based training distributes the graph across multiple machines and trains the GNN on small parts of the graph that are randomly sampled every training iteration. We show that in a distributed environment, the sampling overhead is a significant component of the training time for large-scale graphs. We propose FastSample, which is composed of two synergistic techniques that greatly reduce the distributed sampling time: 1) a new graph partitioning method that eliminates most of the communication rounds in distributed sampling, and 2) a novel, highly optimized sampling kernel that reduces memory movement during sampling. We test FastSample on large-scale graph benchmarks and show that FastSample speeds up distributed sampling-based GNN training by up to 2x with no loss in accuracy.
Research Manuscript
Design
Quantum Computing
DescriptionMeasurement-based quantum computing (MBQC) is a promising quantum computing paradigm that carries out computation through one-way measurements on entangled photon qubits. Practical photonic hardware first generates a 2D mesh of resource states with each being a small number of entangled photon qubits and then exploits fusion operations to connect resource states to scale up the computation. Given that the fusion operation is highly error-prone, it is important to reduce the number of fusions for an MBQC circuit.
In this paper, we propose FCM, a fusion-aware scheme that exploits wire cutting to improve the fidelity of MBQC. By cutting a large MBQC circuit into several smaller subcircuits, FCM effectively reduces the number of fusions in each subcircuit and thus improves the computation fidelity. Given that circuit cutting requires classical post-processing to combine the results of subcircuits, FCM strives to achieve the best cutting strategy under different settings. Experimental evaluation on representative benchmarks demonstrates that, when cutting a large circuit into two subcircuits, FCM reduces the maximum number of fusions across all subcircuits by 59.6% on average (up to 69.1%).
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionDigital-Compute-In-Memory (DCIM) has demonstrated significant energy and area efficiency in convolutional neural network (CNN) accelerators, particularly for high-precision applications. However, to mitigate parasitic effects on word and bit lines, most DCIMs employ fine-grained multiply-accumulate operations, which creates new challenges and opportunities that have not been widely explored. This paper proposes FDCA: a Fine-grained Digital-CIM based CNN Accelerator with hybrid quantization and weight-stationary dataflow. The key contributions are: 1) a hybrid quantization approach for CNNs leveraging the Hessian trace and approximation, which incorporates the ratio of computation time to storage time into quantization, achieving high efficiency while maintaining accuracy; 2) a Cartesian Genetic Programming based approximate shift-and-accumulate unit with error compensation, in which an approximate adder tree is generated to compensate for errors introduced by the DCIM; 3) a weight-stationary dataflow optimized for fine-grained DCIM that improves the utilization of the CIM macro and eliminates dataflow stalls. Experimental results demonstrate that, under a 28-nm process, when running VGG16 and ResNet50 on CIFAR100, the proposed FDCA achieves 17.1 TOPS/W and 18.79 TOPS/W with accuracy losses of 0.71% and 0.98%, respectively. Compared to previous works, this work achieves 1.76× and 1.29× improvements in energy efficiency with less accuracy loss.
Research Manuscript
Design
Emerging Models of Computation
DescriptionIn scenarios with limited training data or where explainability is crucial, conventional neural network-based machine learning models often face challenges.
In contrast, Bayesian inference-based algorithms excel in providing interpretable predictions and reliable uncertainty estimation in these scenarios.
While many state-of-the-art in-memory computing (IMC) architectures leverage emerging non-volatile memory (NVM) technologies to offer unparalleled computing capacity and energy efficiency for neural network workloads, their application in Bayesian inference is limited.
This is because the core operations in Bayesian inference, i.e., cumulative multiplications of prior and likelihood probabilities, differ significantly from the multiplication-accumulation (MAC) operations common in neural networks, rendering them generally unsuitable for direct implementation in most existing IMC designs.
In this paper, we propose FeBiM, an efficient and compact Bayesian inference engine powered by multi-bit ferroelectric field-effect transistor (FeFET)-based IMC.
FeBiM effectively encodes the trained probabilities of a Bayesian inference model within a compact FeFET-based crossbar.
It maps quantized logarithmic probabilities to discrete FeFET states.
As a result, the accumulated outputs of the crossbar naturally represent the posterior probabilities, i.e., the Bayesian inference model's output given a set of observations.
This approach enables efficient in-memory Bayesian inference without the need for additional calculation circuitry.
As the first FeFET-based in-memory Bayesian inference engine, FeBiM achieves an impressive storage density of 26.32 Mb/mm² and a computing efficiency of 581.40 TOPS/W in a representative Bayesian classification task.
These results demonstrate 10.7x/43.4x improvement in compactness/efficiency compared to the state-of-the-art hardware implementation of Bayesian inference.
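The log-domain mapping described above can be sketched with a generic naive Bayes classifier (plain Python with hypothetical probabilities, not the FeFET circuit): multiplying prior and likelihood probabilities becomes summing log-probabilities, which is exactly the accumulation a crossbar performs naturally.

```python
import math

def log_posterior_scores(log_prior, log_likelihoods, observations):
    scores = {}
    for cls in log_prior:
        s = log_prior[cls]
        for feat in observations:
            s += log_likelihoods[cls][feat]   # accumulation, not multiplication
        scores[cls] = s
    return scores

log_prior = {"a": math.log(0.5), "b": math.log(0.5)}
log_lik = {"a": {"x": math.log(0.9)}, "b": {"x": math.log(0.1)}}
scores = log_posterior_scores(log_prior, log_lik, ["x"])
print(max(scores, key=scores.get))  # -> a
```

In FeBiM the quantized log-probabilities are stored as discrete FeFET states, so the column sums of the crossbar directly yield these posterior scores without extra arithmetic circuitry.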
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn the natural world, energy and information are deeply entwined, mutually constraining and complementing each other. To exploit this natural merit, this paper proposes an FEI strategy: fusion processing of sensing energy and information for infrared smart vision systems. The proposed Information-Power-Coupler (IPCp) performs simultaneous energy harvesting and low-power in-pixel computing, using the in-situ coupled energy to process the corresponding information on the same focal plane. Furthermore, a self-adaptive Intelligent-Power-Controller (IPCtrl) is introduced that schedules the harvested energy to complete low-power neural network inference. The implementation of the (IPC2) system uses a software-hardware co-design strategy to exploit the layer-wise characteristics of the computation process and circuit topology, achieving energy-efficient, self-sustainable fusion processing of sensing energy and information. Simulation results show that the intelligent power controller can supply 594.68nW with a power conversion efficiency of 93.38% when the harvested energy from the information power coupler is 636.84nW. This performance validates the self-sustainability of the system, with self-powered image recognition of a complete network running at 4fps with an accuracy of 99.4%.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionFully Homomorphic Encryption (FHE) is a privacy-preserving technique that allows computation directly on encrypted data. In this work, we investigate the execution of FHE machine learning (ML) applications. We show that runtime hardware reconfigurability of the underlying execution units for homomorphic operations is highly desirable for efficient hardware resource utilization. Based on this observation, we propose FHE-CGRA, a coarse-grained reconfigurable architecture (CGRA) acceleration framework for end-to-end homomorphic applications. Experiments show that FHE-CGRA achieves up to 8.15x speedup over a conventional CGRA for accelerating FHE-encrypted convolutional neural network (FHE-CNN) models, and 16.48x power efficiency w.r.t. the state-of-the-art FPGA.
Research Manuscript
Design
Quantum Computing
DescriptionNeutral atom arrays, particularly the field programmable quantum array (FPQA) with atom movement, show promise for quantum computing. FPQA has a dynamic qubit connectivity, facilitating cost-effective execution of long-range gates, but also poses new challenges in compilation. Inspired by FPGA compilation strategy, we develop a router, \name, that leverages flying ancillas to implement 2-Q gates between data qubits mapped to fixed atoms. Equipped with domain-specific routing techniques, \name achieves 1.4$\times$, 27.7$\times$, and 6.7$\times$ reductions in circuit depth for 100-qubit random, quantum simulation, and quantum approximate optimization algorithm circuits, respectively, compared to alternative fixed architectures.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionMost RTL designs originate from behavioral descriptions specified in C or C++, often written by software (SW) designers. Hardware (HW) designers then manually build an efficient hardware implementation of that application using a Hardware Description Language (HDL) like Verilog or VHDL. Although it has been shown that High-Level Synthesis (HLS) provides a direct path to synthesizing these behavioral descriptions into RTL, the quality of the generated RTL is often still unacceptable, hence requiring manual RTL design. This is nevertheless time consuming and error prone. In particular, finding bugs introduced in the manual design is very tedious, as HW designers typically rely on long simulations that generate large waveforms that must be thoroughly scrutinized.
To address this, in this work we present an automated method to accurately point to where in an RTL description a bug is located by using HLS. In particular we leverage the ability of HLS to generate a variety of different micro-architectures to automatically find a design architecturally `similar' to the manually optimized one in order to help locate the bug.
Research Manuscript
Embedded Systems
Embedded Software
DescriptionData deduplication promises to extend the lifetime and capacity of storage on mobile devices. However, existing data deduplication schemes incur high memory consumption and indexing costs to maintain a fingerprint for each data block, especially since the duplicate ratio of data blocks on mobile systems is only about 10% to 30%. In this paper, we propose a novel approach called FinerDedup to optimize the memory costs and retrieval efficiency of data deduplication. FinerDedup drastically reduces the number of fingerprints by screening out the duplicate data blocks via a random forest and a Bloom filter. We implement FinerDedup on real mobile devices with Android 10 and evaluate it with real workloads. Extensive experimental results show that FinerDedup can reduce fingerprints by 85% and I/O latency by 20% over the widely used DmDedup.
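The fingerprint-based core of such dedup schemes can be sketched as follows. This is a plain dictionary index over SHA-256 fingerprints; FinerDedup's random-forest and Bloom-filter prefiltering, which avoids fingerprinting blocks unlikely to be duplicates, is not modeled here:

```python
import hashlib

class Dedup:
    def __init__(self):
        self.fingerprints = {}     # fingerprint -> block address

    def write(self, addr, block):
        # Content fingerprint identifies duplicate blocks regardless of address.
        fp = hashlib.sha256(block).hexdigest()
        if fp in self.fingerprints:
            return self.fingerprints[fp]   # duplicate: remap, skip the write
        self.fingerprints[fp] = addr
        return addr

d = Dedup()
print(d.write(0, b"hello"))   # 0 (new block)
print(d.write(1, b"hello"))   # 0 (duplicate remapped to block 0)
print(d.write(2, b"world"))   # 2 (new block)
```

The memory problem the paper targets is visible here: every unique block costs a stored fingerprint, even though only 10-30% of writes are ever duplicates, which is why screening out unlikely duplicates before fingerprinting pays off.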
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionThe MoE (Mixture-of-Experts) mechanism has been widely adopted in transformer-based models to facilitate further expansion of model parameter size and enhance generalization capabilities. However, the practical deployment of the MoE mechanism for transformers on resource-constrained platforms, such as FPGAs, remains challenging due to the heavy memory footprint and impractical runtime costs introduced by the MoE mechanism. Diving into the MoE mechanism, we make two key observations: (1) Expert weights are heavy but cold, making it ideal to leverage expert weight sparsity. (2) MoE layers in transformer-based models exhibit highly skewed expert activation paths, making expert prediction and prefetching feasible. Motivated by these two observations, we propose FLAME, the first algorithm-hardware co-optimized MoE acceleration framework designed to fully leverage MoE sparsity for efficient transformer deployment on FPGAs. First, to leverage expert weight sparsity, we integrate an N:M pruning algorithm, allowing expert weights to be pruned without significantly compromising model accuracy. Second, to handle expert activation sparsity, we propose a circular expert prediction (CEPR) strategy. CEPR prefetches expert weights from external storage to the on-chip cache before the activated expert index is determined. Last, we co-optimize both forms of MoE sparsity through an efficient pruning-aware expert buffering (PA-BUF) mechanism. Experimental results demonstrate that FLAME achieves 84.4% expert prediction accuracy with merely two expert caches on-chip. In comparison with CPU and GPU, FLAME achieves 4.12× and 1.49× speedup, respectively.
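The N:M pruning step can be illustrated generically (hypothetical weight values; this is the standard structured-sparsity pattern, not FLAME's accuracy-aware selection): in every group of M consecutive weights, keep only the N largest by magnitude and zero the rest.

```python
def prune_n_m(weights, n=2, m=4):
    out = []
    for i in range(0, len(weights), m):
        group = weights[i:i + m]
        # Indices of the n largest-magnitude weights in this group.
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]),
                      reverse=True)[:n]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
print(prune_n_m(row))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.8, 0.0]
```

The fixed 2-of-4 structure is what makes the sparsity hardware-friendly: every group has the same storage and compute footprint, so the pruned expert weights pack densely into on-chip buffers.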
Tutorial
Design
DescriptionIn the semiconductor industry, floating gate (flash) transistors have exclusively been used for non-volatile memory such as USB memory and solid-state drives (SSDs). This tutorial will present work on circuit design and design automation approaches demonstrating that flash can be used to design high-quality general-purpose VLSI ICs, both digital and analog. In particular, we will cover:
A) Flash-based realizations of both digital [2-9] and secure digital [1] ICs. These realizations have shown significantly improved power, delay and area compared to CMOS standard-cell based designs. The approaches in [2-7] use a PLA-based design flow. In contrast, [8-9] utilize a standard-cell based design approach augmented with flash cells, thereby leveraging many decades of EDA development in the standard-cell based design flow. The approach of [1] provides significant security against foundry-based reverse engineering, without a penalty in power, delay or area compared to CMOS designs. In all these approaches, we have developed logic synthesis flows to automate the insertion of flash-based cells in the design.
B) Flash-based realizations of analog circuits such as low-dropout voltage regulators (LDOs) [10-11], digital-to-analog converters [13], FIR filters [12], and other DSP engines. Flash-based designs provide many benefits for these circuits [12-13], including reduced area, power, and energy. In [10-11], flash-based design enables the same design to achieve several LDO specifications, thereby resulting in a significant saving in manufacturing costs.
C) Flash-based mixed-signal designs such as convolutional neural network accelerators (both analog [14-16] and digital [17] variants), and other flash-based in-memory computing designs [18]. With flash-based mixed-signal current-mode CNN realizations [14-16], several common CNN architectures can be realized on the same die, resulting in 50X lower energy, and a latency improvement of 15X to 490,000X over [17], which is a state-of-the-art BNN.
A common theme of the above designs is that flash-based designs demonstrate several advantages over conventional CMOS designs, such as performance tunability, the ability to counteract circuit aging due to effects such as NBTI, the control of speed binning, and the ability to mitigate the effects of process variations. For secure designs, we show that if an adversary illegally gains possession of the IC, our approach can allow the functionality of a "kill switch", whereby the circuit operator can erase the flash transistors in the secure design, rendering it non-functional. We further demonstrate that scalability in the 3rd dimension can be leveraged for all these designs, using emerging 3D NAND and NOR flash technologies that are widely available for flash memory applications. Even though flash transistors do not scale to the feature sizes of traditional CMOS designs, we show that by using 3D flash fabrication techniques, a similar chip-level density (compared to traditional CMOS designs) in terms of transistors/area can be achieved.
Based on our findings, we posit that the programmability, robustness, stability, and maturity of flash give it a significant edge over CMOS in many ways, making it a practical alternative to CMOS in many applications.
A) Flash-based realizations of both digital [2-9] and secure digital [1] ICs. These realizations have shown significantly improved power, delay and area compared to CMOS standard-cell based designs. The approaches in [2-7] use a PLA-based design flow. In contrast, [8-9] utilize a standard-cell based design approach augmented with flash cells, thereby leveraging many decades of EDA development in the standard-cell based design flow. The approach of [1] provides significant security against foundry-based reverse engineering, without a penalty in power, delay or area compared to CMOS designs. In all these approaches, we have developed logic synthesis flows to automate the insertion of flash-based cells in the design.
B) Flash-based realizations of analog circuits such as low-dropout voltage regulators [10-11], Digital-to-Analog converters [13], FIR filters [12], and other DSP engines. Many benefits are availed by using flash-based designs for these [12-13] circuits, including reduced area, power, energy. In [10-11], flash-based design enables the use of the same design to achieve several LDO specifications, thereby resulting in a significant saving in manufacturing costs.
C) Flash-based mixed-signal designs such as convolutional neural network accelerators (both analog [14-16] and digital [17] variants), and other flash-based in-memory computing designs [18]. With flash-based mixed-signal current-mode CNN realizations [14-16], several common CNN architectures can be realized on the same die, resulting in 50X lower energy, and a latency improvement of 15X to 490,000X over [17], which is a state-of-the-art BNN.
A common theme of the above designs is that flash-based designs demonstrate several advantages over conventional CMOS designs, such as performance tunability, the ability to counteract circuit aging due to effects such as NBTI, the control of speed binning, and the ability to mitigate the effects of process variations. For secure designs, we show that if an adversary illegally gains possession of the IC, our approach can allow the functionality of a "kill switch", whereby the circuit operator can erase the flash transistors in the secure design, rendering it non-functional. We further demonstrate that scalability in the 3rd dimension can be leveraged for all these designs, using emerging 3D NAND and NOR flash technologies that are widely available for flash memory applications. Even though flash transistors do not scale to the feature sizes of traditional CMOS designs, we show that by using 3D flash fabrication techniques, a similar chip-level density (compared to traditional CMOS designs) in terms of transistors/area can be achieved.
Based on our findings, we posit that the programmability, robustness, stability, and maturity of flash give it a significant edge over CMOS in many ways, making it a practical alternative to CMOS in many applications.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWhen deleting data in storage, the host system creates I/O requests for the data and sends them to the storage. If the host system also sanitizes the data, it must handle additional I/O processes. Since these deletion processes can burden the host system, the file system may store only its small metadata updates in the storage to minimize the overhead. This kind of data management improves performance, but the data that was supposed to be deleted remains in the storage. In our flash-based storage system, we propose a novel scheme that eliminates all host-side I/O resources for deletion processes. Moreover, the new flash-based storage thoroughly sanitizes the data without additional specialized I/O commands. We implemented the new flash-based storage, called PaFS, and compared its overall performance with legacy storage.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionTransformer models have become popular in various AI applications due to their exceptional performance. However, this performance comes with significant computing and memory costs, hindering efficient deployment of Transformer-based applications. Many solutions leverage sparsity in the weight matrix and attention computation, but previous studies fail to exploit a unified sparse pattern to accelerate all three modules of the Transformer (QKV generation, attention computation, FFN). In this paper, we propose FNM-Trans, an adaptable and efficient algorithm-hardware co-design aimed at optimizing all three Transformer modules by fully harnessing N:M sparsity. At the algorithm level, we fully explore the interplay of dynamic pruning with static pruning under high N:M sparsity. At the hardware level, we develop a dedicated hardware architecture featuring a custom computing engine and a softmax module, tailored to support varying levels of N:M sparsity. Experimental results show that our algorithm improves accuracy by 11.03% under 2:16 attention sparsity and 4:16 weight sparsity compared to other methods. Additionally, FNM-Trans achieves speedups of 27.13x and 21.24x over an Intel i9-9900X and an NVIDIA RTX 2080 Ti, respectively, and outpaces current FPGA-based Transformer accelerators by 1.88x to 36.51x.
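N:M structured sparsity keeps at most N nonzero weights in every group of M consecutive weights. As an illustration of the pattern only (a magnitude-based static pruning pass, not the FNM-Trans algorithm itself; the function name is hypothetical):

```python
def prune_n_m(weights, n, m):
    """Keep the n largest-magnitude entries in every group of m; zero the rest."""
    assert len(weights) % m == 0, "length must be a multiple of m"
    pruned = []
    for i in range(0, len(weights), m):
        group = weights[i:i + m]
        # indices of the n largest-magnitude entries in this group
        keep = set(sorted(range(m), key=lambda j: abs(group[j]), reverse=True)[:n])
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned
```

For example, `prune_n_m(row, 2, 4)` enforces 2:4 sparsity: every 4-element group retains its two largest-magnitude weights, which is what lets hardware skip the zeroed positions with a fixed, predictable layout.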
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionModern System-on-Chips (SoCs) are complex structures that integrate multiple clocks and Clock Domain Crossing (CDC) paths.
Glitch propagation is a critical aspect of CDC, as glitches can cause metastability, leading to unpredictable behavior and data corruption.
A combination of process technology trends coupled with increased intervention by synthesis tools in logic generation can lead to cases in which a design that is CDC-clean at RTL stage fails in the post-synthesis gate-level netlist.
This paper presents such a case along with the tools that were used to analyze potential CDC violations.
We will describe a formal glitch qualification engine that was used for the final verdict.
Additionally we will examine the root cause that led to this scenario and provide recommendations to prevent similar occurrences in the future.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionFormal verification is a major focus area these days for minimizing project cycle times, reaching coverage faster, and enabling easy plug-and-use models. With the increase in computing power and the maturity of formal tools, the use of formal in mainstream projects is growing rapidly. Formal verification setup involves two major steps: 1. learning and setting up the specific application of a given formal tool and writing its specific input files; 2. learning tool commands, writing assertions, and analyzing and debugging reports. Step one requires prior knowledge of each application, its use case, and the specific commands required to run it, which change for each app. Step two requires the user to write a new tool-specific input file for every application to be run. Here we present a methodology, Formal Tool Kit (FTK), for direct and easy setup that introduces a custom input format which can be reused for every app and tool the user intends to run. This greatly reduces the user's setup time. FTK also provides additional tools for analyzing and organizing reports for easier debug and reruns.
Research Manuscript
EDA
Design Verification and Validation
DescriptionEfficient verification of ALUs has always been a challenge. Traditionally, they are verified at a low level, leading to state-space explosion for larger bit widths. We can symbolically verify ALUs for all bit widths at once.
Chisel is a hardware description language embedded in Scala. Our key idea is to transform arithmetic Chisel designs into Scala software programs that simulate their behavior, then apply Stainless, a deductive formal verification tool for Scala. We validate the approach's effectiveness by verifying dividers and multipliers in two open-source RISC-V processors, and conclude that it requires less manual guidance than others.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionFormal verification plays a crucial role in today's design verification flows, handling tasks that are extremely difficult in simulation-based verification. One such task is checking forward progress in a system: ensuring that the system eventually reaches a desired state or completes its intended behavior. Forward progress verification guarantees that a system continuously makes progress and avoids getting stuck in an undesirable or deadlocked state.
This problem is especially acute for critical design components in high-performance computing systems. These systems interact with multiple internal and external controllers, and it is not always guaranteed that the drivers will strictly adhere to the spec. Returning unexpected data is tolerable for systems that run for months, but if bad input causes a hang, that is far worse: in most cases a full system reset is needed to recover.
This paper focuses on the different techniques and methodologies employed to analyze and prove forward progress and its significance in the context of formal verification.
We will discuss various approaches for ensuring forward progress in the design using both liveness and safety assertions, examine how and where each assertion type should be used, and present case studies that demonstrate the application of forward progress checks and the types of critical issues they have found in designs.
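As a rough software analogy of a forward-progress check (a sketch on an explicit toy state machine, not the paper's formal methodology), one can ask whether any reachable state has no remaining path to the completion state; in formal tools this is what a liveness assertion establishes symbolically:

```python
def violates_forward_progress(transitions, start, done):
    """True if some state reachable from `start` has no path to `done`,
    i.e. the machine may get stuck and never make forward progress."""
    # forward reachability from the start state
    reach, stack = {start}, [start]
    while stack:
        for nxt in transitions.get(stack.pop(), []):
            if nxt not in reach:
                reach.add(nxt)
                stack.append(nxt)
    # backward reachability: which states can still reach `done`?
    rev = {}
    for state, succs in transitions.items():
        for succ in succs:
            rev.setdefault(succ, []).append(state)
    can_finish, stack = {done}, [done]
    while stack:
        for prev in rev.get(stack.pop(), []):
            if prev not in can_finish:
                can_finish.add(prev)
                stack.append(prev)
    return any(state not in can_finish for state in reach)
```

A machine with a `stuck` state that only loops on itself fails this check, which is exactly the "garbage in causes a hang" scenario described above.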
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLarge Language Models (LLMs) are popular and widely used in creative ways because of their powerful capabilities. However, their substantial model size and complexity prevent LLMs from being implemented efficiently on resource-constrained computing devices, making it challenging to sustain their exciting task performance. Field-Programmable Gate Arrays (FPGAs), which suit low-latency processing tasks but contain finite resources in terms of logic elements, memory size, and bandwidth, have become an intriguing choice for implementing LLMs. In this paper, we propose FOTA-Quant, an FPGA-Oriented Token-Adaptive Quantization framework that achieves LLM acceleration on resource-constrained FPGAs. On the algorithm level, to fit the memory of a single FPGA, we minimize the model size by quantizing the model weights into INT4. Then, to further reduce model complexity while maintaining task performance, we utilize a mixed-precision scheme with error-regularized pruning for the activations. On the hardware level, we propose a general-precision matrix multiplication supporting 8x8, 4x8, and 4x4 operand widths, and optimize resource utilization for multi-die (chiplet-based) FPGAs to improve overall quantized-LLM performance. Experiments show that FOTA-Quant quantizes model weights and activations simultaneously while maintaining task performance comparable to existing weight-only quantization methods. Moreover, FOTA-Quant achieves an on-FPGA speedup of up to 5.21x compared to its FP16 counterparts, marking a pioneering advancement in this domain.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionNeural networks demand increasing computational power and memory access due to growing parameter sizes. A solution is low bit-width quantization, but conventional uniform quantization suffers from distribution mismatches, leading to accuracy loss. We introduce Fibonacci Quantization, which closely aligns with neural network data distributions using Fibonacci numbers. The Fibonacci Quantization Processor (FQP) features two multiplication-free computing units: the Dualistic-Transformation Adder for large-number multiplication and the Bit-Exclusive Adder for small-number multiplication. Additionally, Topological-Order Routing optimizes data mapping onto these units. FQP demonstrates either a 0.98% accuracy improvement or 2.17x higher energy efficiency for ResNet50 on ImageNet1k compared to uniform quantization.
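The core idea of non-uniform Fibonacci levels can be sketched as rounding each value to the nearest Fibonacci number (a minimal illustration; the function names are hypothetical and the paper's actual codebook and hardware mapping may differ):

```python
def fib_levels(limit):
    """Fibonacci quantization levels up to `limit`: 1, 2, 3, 5, 8, ..."""
    levels = [1, 2]
    while levels[-1] + levels[-2] <= limit:
        levels.append(levels[-1] + levels[-2])
    return levels

def fib_quantize(x, limit=128):
    """Map |x| to the nearest Fibonacci level, keeping the sign."""
    if x == 0:
        return 0
    best = min(fib_levels(limit), key=lambda lvl: abs(abs(x) - lvl))
    return best if x > 0 else -best
```

Because consecutive levels are sums of the previous two, the spacing between levels grows with magnitude, which tracks the bell-shaped weight distributions of trained networks better than evenly spaced uniform levels do.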
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionInverse lithography technology (ILT) is vital in optical proximity correction, tending to generate curvilinear masks for optimal process windows. Traditional curvilinear mask manufacturing involves fracturing into rectangles, requiring expensive mask write times. A novel E-beam mask writer that writes variable radius circles per shot significantly reduces the shot count for curvilinear masks. We present two methods to generate circular fracturing-aware masks. The first one converts pixel-based masks from existing ILT methods into circle-based masks using predefined rules. The second one integrates circular constraints into the ILT process, generating circle-based masks directly via optimization. Extensive experimental results validate both approaches' effectiveness.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionComputing in memory (CIM) realizes energy-efficient neural network algorithms by implementing highly parallel multiply-and-accumulate (MAC) operations. However, the MAC delay of CIM increases sharply with computing precision, which restricts its development. In this work, we propose a full-digital recursive MAC (FRM) operation based on a spin-transfer-torque magnetic random access memory (STT-MRAM) CIM system to enable fast and energy-efficient image recognition. First, the fast FRM scheme utilizes recursive read and addition operations in a segmented bit-line array, effectively reducing the MAC delay to 3.5 ns and 4 ns for 8-bit and 16-bit input and weight precision, respectively. Second, we design an image recognition system using the FRM-CIM architecture as the processing element (PE), with an adaptive layer pruning method to improve its compatibility with the neural network. On image recognition for the MNIST and CIFAR-10 datasets, the throughput and energy efficiency of the FRM-CIM system are 58.51 TOPS/mm² and 11.3-56.72 TOPS/W under 8-16-bit precision, improvements of 4.3x and 2.6x over state-of-the-art works. Finally, the recognition accuracy reaches 96.65% on MNIST and 82.7% on CIFAR-10.
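The general principle of a multiplier-free, read-and-add MAC can be illustrated in software by decomposing the inputs into bit planes and combining partial sums with shifts (a generic simplification for intuition only, not the exact FRM scheme):

```python
def bitplane_mac(inputs, weights, in_bits=8):
    """Dot product without a multiplier: for each input bit position,
    read out and sum the weights whose input bit is set, then combine
    the per-bit partial sums with shifts."""
    acc = 0
    for b in range(in_bits):
        plane_sum = sum(w for x, w in zip(inputs, weights) if (x >> b) & 1)
        acc += plane_sum << b
    return acc
```

Each iteration is just a selective read plus an addition, which is why the delay of such schemes scales with the input bit width rather than with a full multiplier array.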
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRTL verification has existed for decades and is crucial for identifying potential bugs before chip tape-out. However, hand-crafting test cases is time-consuming and error-prone, even for experienced test engineers. Prior work has attempted to lighten this burden with rule-guided random generation, but this does not eliminate the manual effort of writing rules about detailed hardware behavior. Motivated by the increased need for RTL verification in the era of Domain-Specific Architectures (DSAs) and the advances in large language models (LLMs), we set out to explore whether LLMs can capture RTL behavior and generate test cases automatically, introducing three distinct prompt approaches to enhance the LLM's ability to generate tests. We use GPT-3.5 to verify a 12-stage, multi-issue, out-of-order RV64GC processor, achieving a 14% increase in block coverage and an 11% increase in expression coverage compared to randomization. Moreover, combining the LLM with handcrafting greatly reduces human effort, demonstrating a potential methodology for future processor verification. In addition, we provide an open-source prompt library integrated with GPT-3.5, offering a standardized set of prompts that caters to a diverse range of processor verification scenarios. The prompt library is available at https://github.com/From-RTL-to-Prompt/LLM-prompt-library.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionFormal property verification (FPV) has existed for decades and has been shown to be effective at finding intricate RTL bugs. However, formal properties, such as those written as SystemVerilog Assertions (SVA), are time-consuming and error-prone to write, even for experienced users. Prior work has attempted to lighten this burden by raising the abstraction level so that SVA is generated from high-level specifications. However, this does not eliminate the manual effort of reasoning and writing about the detailed hardware behavior. Motivated by the increased need for FPV in the era of heterogeneous hardware and the advances in large language models (LLMs), we set out to explore whether LLMs can capture RTL behavior and generate correct SVA properties. First, we design an FPV-based evaluation framework that measures the correctness and completeness of SVA. Then, we evaluate GPT4 iteratively to craft the set of syntax and semantic rules needed to prompt it toward creating better SVA. We extend the open-source AutoSVA framework by integrating our improved GPT4-based flow to generate safety properties, in addition to facilitating their existing flow for liveness properties. Lastly, our use cases evaluate (1) the FPV coverage of GPT4-generated SVA on complex open-source RTL and (2) using generated SVA to prompt GPT4 to create RTL from scratch. Through these experiments, we find that GPT4 can generate correct SVA even for flawed RTL, without mirroring design errors. Particularly, it generated SVA that exposed a bug in the RISC-V CVA6 core that eluded the prior work's evaluation.
Work-in-Progress Poster
Fully Automated Implementation of Reservoir Computing Models on FPGAs for Nanosecond Inference Times
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe propose an efficient and generic Field-Programmable Gate Array (FPGA) implementation of Reservoir Computing using Linear Cellular Automata models for time series processing. Our implementation results from a fully automated design process, from model definition to FPGA bitstream, which significantly reduces design time and complexity. It can be clocked at 475 MHz or more on a Xilinx Zynq UltraScale+ FPGA
while performing one prediction in every clock cycle. Since our implementation only uses lookup tables and registers, it is platform independent: it can run not only on high-performance but also on low-cost FPGAs without special hardware components. Being up to six orders of magnitude faster than other Reservoir Computing implementations, ours enables intelligent real-time sensor signal processing for applications requiring MHz sampling rates, such as structure-borne noise monitoring or high-frequency oscillation analysis.
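A linear elementary cellular automaton such as Rule 90, where each cell becomes the XOR of its two neighbors, is one example of the reservoir class described here; a software sketch of the update with periodic boundaries follows (illustrative only; the paper's exact CA model and readout may differ):

```python
def ca_step(state, rule=90):
    """One update of an elementary cellular automaton (periodic boundary).
    Rule 90 is linear: each cell becomes left XOR right."""
    n = len(state)
    nxt = []
    for i in range(n):
        l, c, r = state[(i - 1) % n], state[i], state[(i + 1) % n]
        nxt.append((rule >> (l * 4 + c * 2 + r)) & 1)  # look up rule bit
    return nxt

def reservoir_trace(init, steps):
    """Evolve the CA and concatenate the states as the reservoir features."""
    features, s = [], list(init)
    for _ in range(steps):
        s = ca_step(s)
        features.extend(s)
    return features
```

Because the update is a pure bitwise lookup on three neighbor bits, it maps directly onto FPGA lookup tables and registers, consistent with the platform-independence claim above.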
Embedded Systems and Software
AI
Embedded Systems
Engineering Tracks
DescriptionIn conventional embedded SoC development, the SW development schedule is determined by HW prototype availability (e.g., FPGA). SW development based on a virtual platform (VP) has been used in the industrial field. This presentation describes how to apply an IP model provided by a third party to an in-house VP to improve the VP's functional accuracy. The proposed method supports shared-memory-based IPC to synchronize the two models' simulations. It also translates data from the application-specific I/F (VP side) into the SystemC-TLM I/F (EDA IP side), and vice versa. The experimental results show that functional accuracy increases up to 100% with only a small fraction (i.e., 8%) of simulation-time increase. The proposed work helps not only with shift-left of SW development, but also with improving SW quality through VP-based CI/CD thanks to the improved functional accuracy.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionPoint-based deep neural networks have demonstrated remarkable ability in analyzing point clouds. However, challenges arise in the sampling and grouping layers, particularly in terms of time and energy consumption. In this paper, we introduce a Morton-code-based data structure that stores point data sharing upper bits together. We also propose a fused sampling and grouping approach with a reduced search space, which reuses the point data and the calculated distances. Additionally, dedicated hardware supporting the proposed method is introduced. Experimental results show that our approach effectively reduces the number of calculations and data accesses with negligible accuracy loss.
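Morton (Z-order) codes interleave the bits of each coordinate so that points sharing upper code bits are spatially close, which is why sorting by code groups neighbors together. A standard 10-bit-per-axis 3D encoder, as commonly implemented with bit-spreading masks (illustrative; not necessarily the paper's exact layout):

```python
def part1by2(x):
    """Spread the bits of a 10-bit integer so they occupy every third bit."""
    x &= 0x3FF
    x = (x | (x << 16)) & 0x30000FF
    x = (x | (x << 8)) & 0x300F00F
    x = (x | (x << 4)) & 0x30C30C3
    x = (x | (x << 2)) & 0x9249249
    return x

def morton3d(x, y, z):
    """Interleave x, y, z bits into one Morton code: ...z1y1x1 z0y0x0."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)
```

Sorting points by `morton3d` then lets a sampler restrict neighbor search to a window of the sorted array instead of scanning the whole cloud.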
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionChiplets aren't just any building blocks; they're the superheroes of the semiconductor world. Their promise of adaptability and scalability compels a cost-effective and time-efficient approach to super-architecture testing, demanding sustainability of the overall design verification process.
Our paper unveils a state-of-the-art modularized architecture that accommodates multiple protocols, seamlessly overlaid onto the UCIe framework. This can be tailored to suit a variety of design verification topologies, adapting to the unique needs of each project.
The vision is simple: Chiplets, whether from today or tomorrow, should seamlessly integrate into a cohesive whole.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionGraph partitioning is fundamental for many CAD algorithms because it divides a large circuit into smaller pieces of manageable complexity. As circuit graphs continue to grow, graph partitioning becomes increasingly time-consuming. Recent research has introduced parallel graph partitioners using either multi-core CPUs or GPUs. However, their performance is limited by the available CPU cores and GPU memory. We therefore propose G-kway, an efficient multilevel GPU-accelerated k-way graph partitioner. Experimental results show that G-kway outperforms both state-of-the-art CPU-based and GPU-based parallel partitioners, with average speedups of 8.6x and 3.8x, respectively.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionStatic timing analysis (STA) is an important stage in the modern EDA design flow, but it becomes time-consuming as circuit sizes grow. Recent research has leveraged task dependency graph (TDG) parallelism to accelerate STA. Despite this speedup, performance can be further enhanced by reducing the scheduling cost. A common solution is TDG partitioning. However, the runtime of existing TDG partitioning algorithms grows rapidly as the TDG enlarges, and TDG partitioning is invoked frequently during STA, so its runtime adds up to a significant portion of the entire STA runtime. As a result, it is important to optimize the runtime performance of TDG partitioning.
In this paper, we propose G-PASTA, a GPU-accelerated TDG partitioning algorithm by harnessing the computation power of modern GPU architectures. We evaluate the performance of G-PASTA on a set of TDGs from large designs. Compared to the state-of-the-art TDG partitioner, G-PASTA is up to 41.8× faster, while improving TDG runtime by 2×.
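To see why partitioning a TDG cuts scheduling cost, consider the simplest grouping: levelization, which batches all tasks whose predecessors are already done so the scheduler is invoked once per batch rather than once per task (a sequential sketch for intuition, not the GPU-parallel G-PASTA algorithm):

```python
from collections import deque

def levelize(tdg):
    """Group a task dependency graph (task -> successor list) into levels:
    every task's predecessors sit in earlier levels, so each level can be
    dispatched to workers as one batch, reducing scheduling invocations."""
    indeg = {t: 0 for t in tdg}
    for succs in tdg.values():
        for t in succs:
            indeg[t] = indeg.get(t, 0) + 1
    frontier = deque(t for t, d in indeg.items() if d == 0)
    levels = []
    while frontier:
        levels.append(sorted(frontier))  # one scheduling batch
        nxt = deque()
        for t in frontier:
            for succ in tdg.get(t, []):
                indeg[succ] -= 1
                if indeg[succ] == 0:
                    nxt.append(succ)
        frontier = nxt
    return levels
```

A graph with thousands of tasks but only tens of levels then pays tens of scheduler calls instead of thousands, which is the cost the partitioner is optimizing.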
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionPerformance estimation is a crucial component in the optimization processes of accelerator development on the Versal ACAP architecture.
However, existing approaches present limitations: they are either too slow to facilitate efficient iterations, or they lack the necessary accuracy due to the specific AIE array architecture and two-level programming model of Versal ACAP.
To tackle this challenge, we propose G$^2$PM, a performance modeling technique based on a hierarchical graph representation centered on the AIE array.
More specifically, we employ a hierarchical graph neural network to identify features of both kernel programs and dataflow programs, taking into account the hardware and software characteristics of the Versal ACAP architecture.
In our evaluations, our method demonstrates significant improvements, achieving a mean error rate of less than 1.6\% and providing a speed-up factor of 4165$\times$ compared to the simulation-based method.
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionAdversarial ensemble defense is one of the most effective techniques for defending against adversarial attacks, which constructs ensembles of multiple DNNs to improve the model's robustness. However, deploying ensemble defense methods on existing DNN inference systems is inefficient and impractical due to their dynamics and randomness. To this end, we propose an inference system for adversarial ensemble defense called Garrison, which can deliver robust and low-latency predictions using Multi-Instance GPUs. Our evaluations show that Garrison can improve adversarial robustness by up to 24.5% while accelerating ensemble inference by up to 6.6x compared to the state-of-the-art inference framework.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionA heterogeneous system-in-package (SiP) integrates chiplets outsourced from different vendors onto the same substrate for better performance. However, during post-integration testing, sensitive testing data designated for a specific chiplet can be blocked, tampered with, or sniffed by other malicious chiplets. This paper proposes GATE-SiP, an authenticated partial-encryption protocol that enables secure testing. Within GATE-SiP, the sensitive testing pattern is only sent to the authenticated chiplet. In addition, partial encryption of the sensitive data prevents sniffing threats without causing significant timing overhead. Extensive simulation results show that the GATE-SiP protocol incurs only 6.74% area and 14.31% timing overhead.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionComposite Current Source (CCS) timing model plays an important role in modern static timing analysis (STA) because it precisely captures the timing behavior of a design at advanced nodes. However, CCS is extremely time-consuming due to its accurate but complicated timing models. To overcome this challenge, we introduce GCS-Timer, a GPU-accelerated CCS-based timing analysis algorithm. Unlike existing methods that perform model order reduction to trade accuracy for speed, GCS-Timer achieves high accuracy through a fast simulation-based analysis using GPU computing. Experimental results show that GCS-Timer can complete CCS analysis with better accuracy and achieve 3.2X faster runtime compared with a 16-threaded industrial standard timer.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionHeterogeneous Graph Neural Networks (HGNNs) have broadened the applicability of graph representation learning to heterogeneous graphs. However, the irregular memory access pattern of HGNNs leads to the buffer thrashing issue in HGNN accelerators.
In this work, we identify an opportunity to address buffer thrashing in HGNN acceleration through an analysis of the topology of heterogeneous graphs. To harvest this opportunity, we propose a graph restructuring method and map it into a hardware frontend named GDR-HGNN.
GDR-HGNN dynamically restructures the graph on the fly to enhance data locality for HGNN accelerators.
Experimental results demonstrate that, with the assistance of GDR-HGNN, a leading HGNN accelerator achieves an average speedup of 14.6$\times$ and 1.78$\times$ compared to the state-of-the-art software framework running on A100 GPU and itself, respectively.
Tutorial
EDA
DescriptionFor efficient design, verification, and validation of integrated circuits and components, it is important to have a workflow that is easy to customize and extend. Python has become the standard programming language for machine learning, scientific computing, and engineering.
Gdsfactory is a Python library for building chips (photonics, analog, quantum, MEMS, …) that provides a common syntax for design (KLayout, Ansys, tidy3d, MEEP, MPB, DEVSIM, SAX, …), verification (KLayout DRC, LVS, netlist extraction, connectivity checks, fabrication models), and validation (JAX neural network model extraction, pandas, SQL database).
In this tutorial we will cover the gdsfactory design automation tool. Gdsfactory provides an end-to-end workflow that combines layout, verification, and validation using an extensible, open-source, Python-driven flow for turning your chip designs into validated products.
https://gdsfactory.github.io/gdsfactory-photonics-training/notebooks/10_layout_full.html
Research Panel
AI
EDA
DescriptionGenerative AI (GenAI) technologies for modalities including text, image, speech, etc., are poised for huge practical impact in a range of industries. How will GenAI impact the EDA business, and conversely, does EDA have a role to play in advancing GenAI? Recent results suggest GenAI can indeed play a transformative role across the design flow, from chip specification and verification to pre- and post-silicon test, physical design, and design for manufacturability, thereby improving designer productivity, time-to-market, and design quality. Conversely, EDA can play a crucial role in addressing the massive training and inference costs of state-of-the-art trillion-parameter-or-more GenAI models via pruning, specialization, and acceleration. The panel will seek to address several key questions about the roles of GenAI and EDA, namely:
(1) Can GenAI design a full chip? Intentionally provocative, panelists will be asked whether GenAI methods alone, or with limited supervision, can translate natural language design intent to high-quality GDSII, along with test and verification procedures? What role will human expertise, experience and intuition play in a GenAI driven flow, and which parts can be truly automated? In sum, what are the killer applications for GenAI in chip design?
(2) Specialized vs. general-purpose foundation models for chip design? Generalized foundation models like GPT-4, Bard, etc. have shown exceptional abilities to generalize to unseen tasks, including potentially RTL code and EDA script generation. Will these massive foundation models suffice, or do we need smaller and specialized foundation models for hardware design? Specialized models can improve performance on hardware-specific tasks, in addition to having manageable training and inference costs.
(3) Open- vs. closed-sourced datasets and models for hardware? Many semiconductor companies have massive internal datasets that can be used to train foundation models for hardware, but these models will likely not be released publicly due to IP issues. Unlike for software, open datasets of hardware are scarce—for example, Verilog is only 0.004% of the code on GitHub. Are there avenues for training large open-source GenAI models for chip design, or do we expect these models to be internal and/or black-boxed?
(4) Regulatory, legal, safety and robustness issues? The recent Executive Order by the Biden administration requires, amongst other things, the development of "standards, tools, and tests to help ensure that AI systems are safe, secure, and trustworthy." What does this mean for GenAI models in the EDA context? Models trained on open-source datasets must additionally worry about copyright and IP violation issues, user privacy and the "right to be forgotten," and concerns about inadvertent or malicious backdoors in ML models.
Exhibitor Forum
DescriptionThe emergence of generative AI presents tremendous opportunities for advancing technical and business processes in the high-tech and semiconductor industries. From optimizing complex system design processes and accelerating time-to-market for new products to augmenting human capabilities, applying generative AI to engineering and manufacturing methodologies and processes has enormous potential. Generative design methodologies powered by AI can automatically design chips and electronic subsystems given the right prompts, desired parameters, and constraints, without intensive engineering effort, freeing up resources. Generative Engineering Assistants can help new engineers become up to 2X more productive by letting them interact with design tools using natural language. For process improvements that directly impact project timelines and business outcomes, generative AI can facilitate rapid development of product datasheets, technical manuals, and associated documentation customized to target audiences and markets. Further efficiency gains can be realized by using engineering assistants for research and providing engineers contextual recommendations, thereby helping human teams quickly address critical research problems. We will discuss generative AI services on AWS and how some of these services can be leveraged to build a generative AI Engineering Assistant for semiconductor design.
Research Manuscript
AI
AI/ML Algorithms
DescriptionNon-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT), but most of them require hardware-unfriendly high-precision arithmetic such as FP/INT 32 and lack consideration of integer-only INT quantization. This paper proposes a genetic LUT-approximation algorithm, GQA-LUT, that can automatically determine the parameters with quantization awareness. The results demonstrate that GQA-LUT achieves negligible degradation on the challenging semantic segmentation task for both vanilla and linear Transformer models. Moreover, the proposed GQA-LUT enables the employment of INT8-based LUT approximation, achieving area savings of 81.3~81.7% and a power reduction of 79.3~80.2% compared to the high-precision FP/INT 32 alternatives.
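The genetic, quantization-aware search itself is beyond a short sketch, but the LUT-based piece-wise linear approximation the paper builds on can be illustrated in a few lines; the target function, breakpoints, and segment count below are arbitrary choices, not the paper's.

```python
import math

# Piece-wise linear LUT approximation of tanh on [-4, 4]. One (slope,
# intercept) pair is stored per uniform segment; GQA-LUT instead learns
# non-uniform, quantization-aware parameters.
LO, HI, SEGMENTS = -4.0, 4.0, 16
STEP = (HI - LO) / SEGMENTS

LUT = []
for i in range(SEGMENTS):
    x0, x1 = LO + i * STEP, LO + (i + 1) * STEP
    slope = (math.tanh(x1) - math.tanh(x0)) / (x1 - x0)
    LUT.append((slope, math.tanh(x0) - slope * x0))

def tanh_lut(x: float) -> float:
    """Approximate tanh(x) with one table lookup and one multiply-add."""
    if x <= LO:
        return math.tanh(LO)   # clamp outside the approximated range
    if x >= HI:
        return math.tanh(HI)
    slope, intercept = LUT[min(int((x - LO) / STEP), SEGMENTS - 1)]
    return slope * x + intercept

max_err = max(abs(tanh_lut(x / 100) - math.tanh(x / 100))
              for x in range(-400, 401))
print(f"max abs error over [-4, 4]: {max_err:.4f}")
```

In hardware, the table entries would additionally be quantized (e.g. to INT8), which is exactly the degradation the paper's search keeps negligible.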
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionModern out-of-order processors are increasingly expanding resources such as reorder buffer (ROB) and instruction queue (IQ) for memory-level parallelism (MLP). While this expansion effectively addresses the memory wall challenge, it also incurs notable cost and energy trade-offs. To tackle this, we propose Geneva, a microarchitecture that improves performance and energy efficiency. Geneva reallocates a portion of the ROB to serve as a dynamic queue (DQ), used as the ROB, IQ, or both depending on operational needs. Geneva saves energy by 15.6% and improves performance by 2.6% compared to the out-of-order core baseline.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe present GL0AM, a GPU-accelerated logic simulator that performs delay-annotated gate-level simulation, supporting a wide range of sequential gate types and simulation scenarios, including SRAMs. We propose a methodology that splits the simulation into two portions to increase parallelism in the application: the first portion performs 0-delay cycle simulation, and the second performs re-simulation. We use netlist graph partitioning to minimize synchronization overhead during the 0-delay simulation, increasing the speedup of this difficult-to-parallelize simulation process. GL0AM achieves simulation speedups of 15-448X compared to a commercial simulator across a diverse set of benchmarks, and we aim to open-source it.
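As a rough illustration of the first portion, a 0-delay cycle simulation evaluates combinational gates to completion each cycle and then clocks the flip-flops; the two-gate netlist below is invented for illustration and is not from the paper.

```python
# Toy 0-delay cycle simulation (a simplified stand-in for GL0AM's first
# portion): combinational logic settles instantly, then the DFF captures.
NETLIST = [              # (output, gate, inputs) in topological order
    ("n1", "AND", ("a", "b")),
    ("n2", "XOR", ("n1", "q")),
]
GATES = {"AND": lambda x, y: x & y, "XOR": lambda x, y: x ^ y}

def simulate(stimulus, cycles):
    nets = {"q": 0}                      # DFF output starts at 0
    trace = []
    for cyc in range(cycles):
        nets.update(stimulus(cyc))       # drive primary inputs
        for out, gate, ins in NETLIST:   # 0-delay combinational settle
            nets[out] = GATES[gate](*(nets[i] for i in ins))
        nets["q"] = nets["n2"]           # DFF captures at the clock edge
        trace.append(nets["q"])
    return trace

# With a=b=1 held constant, q toggles every cycle.
print(simulate(lambda c: {"a": 1, "b": 1}, 4))
```

The delay-annotated re-simulation portion would then replay only the windows where timing-accurate waveforms are needed.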
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionTechnology scaling results in smaller die sizes with an increased bulk-biasing penalty. In this paper, we propose a physical synthesis methodology for digital circuits that minimizes this overhead. Considering the bulk-biasing constraints of mutual distance and density, we first place global bulk-biasing active regions before standard-cell placement. Then, to guarantee the biasing constraints, we transform the placement using a sliding-window algorithm. For a 3D V-NAND digital design [1], measurement results showed a 7.2% area reduction compared to the conventional methodology.
[1] M. Kim et al., "A 1Tb 3b/Cell 8th-Generation 3D-NAND Flash Memory with 164MB/s Write Throughput and a 2.4Gb/s Interface," ISSCC 2022:136-137.
Research Manuscript
EDA
Physical Design and Verification
DescriptionThe back-side metal layers exhibit lower parasitics than the front-side layers in advanced technologies, making them suitable for clock-net distribution. In this study, we explore the advantages of using back-side metal layers for clock routing, shared with a power delivery network. Our Graph Neural Network (GNN) based framework effectively distributes the clock tree between the front and back sides. We address the creation of back-side clock nets by incorporating back-side buffers. Our results demonstrate better clock and full-chip metrics, represented by an increase of up to 13% in effective frequency with equivalent power consumption, using 3 nm technology.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe focus on the high learning efficiency of the critic networks in "DNN-Opt", an automated op-amp sizing algorithm using reinforcement learning, and propose "GNN-Opt", which replaces the critic with a GNN architecture that can work across different topologies. When sizing an NMOS basic differential pair, GNN-Opt learned its critic and produced sizings with FoM as high as DNN-Opt's. Moreover, when the trained model was transferred to a PMOS differential pair and run in inference only, it obtained a high FoM from the beginning, without further learning, exceeding the performance of a model not trained on the NMOS design.
Research Manuscript
AI
AI/ML Algorithms
DescriptionGraph Neural Networks (GNNs) have recently achieved significant success in many applications. However, balancing GNN training runtime cost, memory consumption, and attainable accuracy for various applications is non-trivial. Previous training methodologies suffer from inferior adaptability and lack a unified training optimization solution. To address this problem, this work proposes GNNavigator, an adaptive GNN training configuration optimization framework. GNNavigator meets diverse GNN application requirements through our unified software-hardware co-abstraction, proposed GNN training performance model, and practical design space exploration solution. Experimental results show that GNNavigator can achieve up to 3.1X speedup and 44.9% peak memory reduction with accuracy comparable to state-of-the-art approaches.
Research Manuscript
AI
AI/ML Algorithms
DescriptionRecent research has shown the potential of Model-based Reinforcement Learning (MBRL) to enhance energy efficiency of Heating, Ventilation, and Air Conditioning (HVAC) systems. However, existing methods rely on black-box thermal dynamics models and stochastic optimizers, lacking reliability guarantees and posing risks to occupant health. We address this by redesigning HVAC controllers using decision trees extracted from thermal models and historical data, providing deterministic, verifiable, and interpretable policies. Extensive experiments show that our method saves 68.4% more energy and increases human comfort gain by 14.8% compared to the state-of-the-art method, plus a 1127x reduction in computation overhead. Code: https://github.com/30363/Veri-HVAC.
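A decision-tree policy of the kind the paper extracts might look like the following sketch; the thresholds and actions are invented, but the point stands that such a policy is deterministic, interpretable, and cheap to verify.

```python
# Hypothetical HVAC controller as a small decision tree over
# (indoor temp, outdoor temp, occupancy). Thresholds are illustrative,
# not taken from the paper or its extracted trees.
def hvac_action(indoor_c: float, outdoor_c: float, occupied: bool) -> str:
    if not occupied:
        # Unoccupied: widen the comfort band to save energy.
        return "cool_low" if indoor_c > 28.0 else "off"
    if indoor_c > 26.0:
        # Hot outdoors means more cooling effort is needed.
        return "cool_high" if outdoor_c > 30.0 else "cool_low"
    if indoor_c < 20.0:
        return "heat"
    return "hold"

# Because the policy is a finite tree, properties such as "never heat
# above 20 C" can be checked exhaustively rather than stochastically.
print(hvac_action(23.0, 15.0, occupied=True))
```

Contrast this with a black-box neural policy, where such guarantees require separate verification machinery.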
IP
Engineering Tracks
IP
DescriptionThe increasing dynamics of automotive use-cases lead to novel challenges given the rigid dependability requirements of these systems. While Advanced Driver Assistance use-cases call for more and more compute performance, other use-cases raise the need for flexible and frequent integration of updates and upgrades. In this context, we see contradicting paradigms: on one hand, to optimize for specific dedicated hardware targets, thus rendering the application dependent on this hardware, and on the other hand, to make hardware a commodity that is easy to change, with a proclaimed decoupling of hardware and software.
At the same time, the multi-year turnaround cycles from OEMs down to Tier2 suppliers make it harder to predict the required functionality and optimal split between general purpose and dedicated hardware.
This leads to significant challenges along the vertical supply chain:
• Where are the right places to introduce abstractions to decouple functional layers? Where shall we avoid that to allow for optimizations?
• Which technology layers need what grade of standardization to warrant successful adoption and still foster competition through USPs?
• How can we shorten the turnaround cycles by integrating tools through standardized interfaces?
New technologies and advancements seem to be promising, e.g., chiplet technology, but must stand the test of automotive qualification which raises additional open questions.
In this session we will go along this vertical and bring together esteemed industry leaders to give insight into the challenges at each level.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHarmonic balance (HB) simulation is a method to calculate the frequency-domain steady-state response of non-linear circuits. For multi-tone, high-speed circuits in advanced FinFET processes, the runtime of HB analysis becomes particularly challenging, and the analysis may encounter convergence issues for large netlists. In this paper, we propose a new method to accelerate HB analysis on GPUs. The runtime can be reduced from 20 days to 2 hours for a high-speed LC-VCO design in a FinFET process while still meeting accuracy requirements. This gives the circuit designer the flexibility to either run more circuit simulations before tape-out or reduce the total SPICE simulation time.
IP
Engineering Tracks
IP
DescriptionThis presentation is about tracking dynamic power using real-world emulation use-cases/workloads at the GPU IP level rather than bottom-up projections based on unit-level tests. This enables exploration of biases in real-world data patterns and design scenarios while also tracking relative power trends and quantifying the impact of RTL, physical design, and software changes on IP power throughout project execution. We use SAIF from emulation for the average power estimation flow and FSDB for the power optimization flow. Our approach resulted in the discovery of significant power optimization opportunities at the GFX IP level. Furthermore, it helped us successfully identify power regressions between successive versions of the GFX IP using a relative RTL trend tracking technique. Overall, this helps left-shift the GFX IP-level power analysis and optimization effort and avoid late RTL surprises, which directly translates into better perf-per-watt for the product.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe breadth-first-search (BFS) algorithm serves as a fundamental building block for graph traversal with a wide range of applications, spanning from the electronic design automation (EDA) field to social network analysis. Many contemporary real-world networks are dynamic and evolve rapidly over time. In such cases, recomputing the BFS from scratch after each graph modification becomes impractical. While parallel solutions, particularly for GPUs, have been introduced to handle the size complexity of static networks, none have addressed the issue of work-efficiency in dynamic networks. In this paper, we propose a GPU-based BFS implementation capable of processing batches of network updates concurrently. Our solution leverages batch information to minimize the total workload required to update the BFS result. Additionally, we introduce a technique for relabeling nodes, enhancing locality during dynamic BFS traversal. We present experimental results on a diverse set of large networks with varying characteristics and batch sizes.
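The work-saving idea behind batched dynamic BFS can be sketched in a few lines: after applying a batch of edge insertions, only vertices whose BFS depth actually improves are re-relaxed. This is a simplified CPU-side sketch, not the paper's GPU implementation, and the graph representation is invented.

```python
from collections import deque

def bfs(adj, src):
    """Full BFS from scratch: the baseline the dynamic update avoids."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def insert_batch(adj, dist, batch):
    """Apply a batch of directed edge insertions and repair dist in place."""
    q = deque()
    for u, v in batch:                    # seed the frontier from the batch
        adj.setdefault(u, []).append(v)
        if u in dist and dist.get(v, float("inf")) > dist[u] + 1:
            dist[v] = dist[u] + 1
            q.append(v)
    while q:                              # propagate only improved depths
        u = q.popleft()
        for v in adj.get(u, []):
            if dist.get(v, float("inf")) > dist[u] + 1:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

adj = {0: [1], 1: [2], 2: [3]}
dist = bfs(adj, 0)                        # chain: 0 -> 1 -> 2 -> 3
dist = insert_batch(adj, dist, [(0, 3)])  # shortcut: 3 now at depth 1
print(dist[3])
```

The GPU version processes the whole frontier in parallel and adds the relabeling step for locality, but the work bound comes from the same "touch only improved vertices" principle.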
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSparse Matrix-Matrix Multiplication (SpMM) is widely used in many scientific and engineering applications, such as numerical simulations and graph neural networks. Previous researchers have proposed numerous sparse formats and corresponding algorithms to enhance performance on GPUs. However, no single SpMM solution consistently outperforms the others due to the complexity of sparse patterns. In this paper, we propose using a Graph Attention Network (GAT) to learn these patterns and select an optimal sparse format for SpMM acceleration on GPUs. First, a sparse matrix can inherently be treated as the adjacency matrix of a graph, which intuitively transforms the task of format selection into a graph classification problem. Second, we employ GAT to learn the intricate relationships between the characteristics of sparse matrices and the performance of GPU kernels. Our approach preserves most of the matrices' structural information and incorporates performance-related statistics as node embeddings, enabling the attention mechanisms and message-passing capabilities of GAT to effectively focus on potential latency bottlenecks. Extensive experiments show that our method outperforms state-of-the-art SpMM GPU kernels, delivering an average 1.3x to 1.6x GFLOPS speedup across a diverse set of over 1700 sparse matrices derived from real applications.
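The GAT selector itself requires training data, but the decision it automates can be illustrated with a hand-written heuristic over the same kind of row statistics; the formats and thresholds below are arbitrary examples, not the paper's learned policy.

```python
# Toy format chooser: ELL pads every row to the longest one, so it wastes
# memory and work when row lengths are skewed; CSR has no padding but
# irregular access. The 2x-average cutoff is purely illustrative.
def choose_format(rows):
    """rows: list of lists of column indices, one per matrix row."""
    lengths = [len(r) for r in rows]
    max_len, avg_len = max(lengths), sum(lengths) / len(lengths)
    return "ELL" if max_len <= 2 * avg_len else "CSR"

balanced = [[0, 1], [1, 2], [0, 2]]          # uniform row lengths
skewed = [[0], [0], list(range(100))]        # one dense row dominates
print(choose_format(balanced), choose_format(skewed))
```

The paper's contribution is replacing such brittle hand-tuned cutoffs with a learned classifier over the full matrix structure.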
Research Manuscript
EDA
Test, Validation and Silicon Lifecycle Management
DescriptionThe increasing complexity of Electrical and Electronic (E/E) systems underscores the need for protective measures to ensure functional safety (FuSa) in high-assurance environments. This entails the identification and fortification of vulnerable nodes to enhance system reliability during mission-critical scenarios. Traditionally, the assessment of E/E system reliability has relied on fault injection (FI) techniques and simulations. However, FI faces challenges in coping with escalating design complexity, including resource demands and timing overheads, and it falls short in identifying critical components that may lead to functional failures. To address these challenges, we propose a Machine Learning (ML)-based framework for predicting critical nodes in hardware designs. The process begins with constructing a graph from the design netlist, forming the foundation for training a Graph Convolutional Network (GCN). The GCN model utilizes graph node attributes, node labels, and edge connections to learn and predict critical nodes in the circuit. The model furnishes up to 93.7% accuracy in identifying vulnerable circuit nodes during evaluation on diverse designs such as a Synchronous Dynamic Random Access Memory (SDRAM) controller and OpenRISC 1200 (OR1200) modules. Furthermore, we incorporate an explainability analysis to interpret individual node predictions. This analysis discerns the design factors influencing fault criticality. Moreover, to the best of our knowledge, we are the first to perform a regression analysis to generate node criticality scores, quantifying degrees of criticality, which can enable prioritizing resources towards critical nodes.
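As a sketch of the model family involved, a single GCN-style message-passing layer can be written in plain Python; the learned weight matrix and nonlinearity of a real GCN are omitted, and the tiny "netlist" graph and features are invented.

```python
# One message-passing layer: each node's new feature vector is the mean
# of its own and its neighbours' features. A trained GCN would follow
# this with a learned linear transform and activation per layer.
def gcn_layer(adj, feats):
    """adj: {node: [neighbours]}, feats: {node: [float, ...]}."""
    out = {}
    for u, fu in feats.items():
        nbrs = [fu] + [feats[v] for v in adj.get(u, [])]
        out[u] = [sum(col) / len(nbrs) for col in zip(*nbrs)]
    return out

# Tiny netlist graph; features might encode gate type, fan-in, toggle
# rate, etc. in the real framework.
adj = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
feats = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [1.0, 1.0]}
print(gcn_layer(adj, feats)["g2"])
```

Stacking such layers lets information about a node's multi-hop neighbourhood inform its criticality prediction, which is what distinguishes this approach from per-node heuristics.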
Research Manuscript
AI
AI/ML Algorithms
DescriptionThe key to the device-edge co-inference paradigm is partitioning models into computation-friendly and computation-intensive parts across the device and edge, respectively. However, for Graph Neural Networks (GNNs), partitioning without architecture exploration is ineffective due to the varying computation-communication overheads of GNN operations over heterogeneous devices. We present GCoDE, the first automatic framework that co-designs the GNN architecture and operation mapping. GCoDE abstracts the communication process into an explicit operation and fuses architecture search and operation mapping in a joint optimization space. In addition, its performance-aware approach enables effective evaluation of architecture efficiency. Experiments show GCoDE achieves up to 44.9x speedup and 98.2% energy reduction across various systems.
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
DescriptionUnlike circuit parameter and sizing optimizations, the automated design of analog circuit topologies poses significant challenges for learning-based approaches. One challenge arises from the combinatorial growth of the topology space with circuit size, which limits the topology optimization efficiency. Moreover, traditional circuit evaluation methods are time-consuming, while the presence of data discontinuity in the topology space makes the accurate prediction of circuit performance exceptionally difficult for unseen topologies. To tackle these challenges, we design a novel Graph-Transformer-based Network (GTN) as the surrogate model for circuit evaluation, offering a substantial acceleration in the speed of circuit topology optimization without sacrificing performance. Our GTN model architecture is designed to embed voltage changes in circuit loops and current flows in connected devices, enabling accurate performance predictions for circuits with unseen topologies. Taking the power converter circuit design as an experimental task, our GTN model significantly outperforms an analytical approach and baseline methods directly utilizing graph neural networks. Furthermore, GTN achieves less than 5% relative error and 196× speed-up compared with high-fidelity simulation. Notably, our GTN surrogate model empowers an automatic circuit design framework to discover circuits of comparable quality to those identified through high-fidelity simulation while reducing the time required by up to 97.2%.
Workshop
AI
DescriptionLarge language models (LLMs) have been a significant breakthrough in artificial intelligence, demonstrating remarkable success in solving various real-world problems. These models, trained on vast amounts of text data, have shown an uncanny ability to generate human-like text, understand context, answer questions, and write code. They have been successfully deployed in numerous applications, including customer service, content creation, and language translation, to name a few. The versatility and robustness of LLMs have made them an invaluable tool in the AI toolkit, opening up avenues for exploration and innovation. Microsoft, Google, Meta, and Amazon have invested in generative AI technologies like LLMs.
One such avenue that has garnered attention is using LLMs in chip design. Digital chip design, a complex and intricate process, involves the creation of integrated circuits used in various electronic devices. LLMs are expected to aid designers during design and concept development, verification, validation, and security checks. For instance, Synopsys has recently developed an LLM-based framework to aid chip design and development.
With these revolutionary developments and interest from leading electronic design automation companies (Cadence, Synopsys, Siemens) and chip design companies (Intel, Nvidia, Qualcomm, IBM, etc.), there is an increasing need for a greater understanding of LLMs' roles in EDA. We will organize the first "Workshop on Gen-AI for Chip Design." This workshop will comprise four types of knowledge sharing:
(i) Two-hour session on visionary talks from industry and government leaders (10:00-Noon),
(ii) Two-hour lunch poster session from academic/industry researchers (Noon-2:00).
(iii) Two-hour prompt engineering tutorial/competition (e.g., RTL generation) open to the EDA, AI, and Design community (students, academics, practitioners, hobbyists) (2:00-4:00).
(iv) One-hour closing session on the next steps.
Together, these sessions will highlight the trends and needs in industry, research adventures in academia, and bridge the gap between the two.
Research Manuscript
EDA
Test, Validation and Silicon Lifecycle Management
DescriptionGrowing global concerns about climate change highlight the need for environmentally sustainable computing. The ecological impact of computing, both operational and embodied, is a key consideration. Field Programmable Gate Arrays (FPGAs) stand out as promising sustainable computing platforms due to their reconfigurability across various applications. This paper introduces GreenFPGA, a tool estimating the total carbon footprint (CFP) of FPGAs over their lifespan, considering design, manufacturing, reconfigurability (reuse), operation, disposal, and recycling. Using GreenFPGA, the paper evaluates scenarios where the ecological benefits of FPGA reconfigurability outweigh operational and embodied carbon costs, positioning FPGAs as an environmentally sustainable choice for hardware acceleration compared to Application-Specific Integrated Circuits (ASICs). Experimental results show that FPGAs have a lower CFP than ASICs, particularly for multiple distinct, low-volume applications or short application lifespans.
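The trade-off GreenFPGA quantifies can be sketched with a toy model in which one reconfigurable FPGA amortizes its embodied carbon over several applications, while each low-volume ASIC pays its own embodied cost (including amortized mask/NRE). All numbers below are invented for illustration.

```python
# Toy total-carbon-footprint model: embodied carbon plus operational
# carbon from energy use. GreenFPGA's actual model also covers design,
# disposal, and recycling phases.
def total_cfp(embodied_kg, power_w, hours, carbon_kg_per_kwh=0.4):
    return embodied_kg + power_w / 1000 * hours * carbon_kg_per_kwh

n_apps, hours_per_app = 5, 10_000

# One FPGA serves all five applications via reconfiguration.
fpga = total_cfp(embodied_kg=30, power_w=25, hours=n_apps * hours_per_app)

# Each low-volume ASIC carries a large per-unit embodied cost because
# mask/NRE carbon is amortized over few units (hypothetical figure).
asics = n_apps * total_cfp(embodied_kg=200, power_w=10, hours=hours_per_app)

print(f"FPGA {fpga:.0f} kg CO2e vs {n_apps} ASICs {asics:.0f} kg CO2e")
```

With these made-up numbers the FPGA wins despite its higher operating power, matching the paper's low-volume scenario; at high volume the ASIC embodied term shrinks and the conclusion can flip.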
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionThis work proposes GSPO, an automatic unified framework that jointly applies graph substitution and parallelization for DNN inference. GSPO uses joint optimization computation graph (JOCG) to represent both graph substitution and parallelization at the operator level. Then, a novel cost model customized for joint optimization is used to quickly evaluate the computation graph execution time. Combined with backtracking search algorithm, GSPO is able to find the optimal joint optimization solution within acceptable search time. Compared to existing frameworks applying equivalent graph substitution or parallelization, GSPO can achieve up to 27.1% end-to-end performance improvement and reduce search time by up to 94.3%.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionMulti-Scalar Multiplication (MSM) is a fundamental cryptographic primitive, which plays a crucial role in zero-knowledge proof systems. In this paper, we optimize the single MSM Process Element (PE) utilizing buckets with fewer conflicts, enhanced by greedy-based scheduling, to achieve higher efficiency. The evaluation results show our optimized single MSM PE achieving a speedup of over two times on average, peaking at 3.63 times compared to previous works. Furthermore, we introduce Gypsophila, a scalable and bandwidth-optimized architecture for implementing multiple MSM PEs. Leveraging the characteristics of the bucket method, we optimize the data flow by balancing the throughput of bucket classification, bucket aggregation, and result aggregation in MSM. Simultaneously, multiple PEs with different data access patterns share a universal point input channel and post-processing unit, which improves module utilization and mitigates bandwidth pressure. Gypsophila with 16 PEs accomplishes 16 MSM tasks in a mere 1.01% additional time, showcasing an approximately 7.8% reduction in area with only about 1/16 of the bandwidth requirement, compared with 16 PEs without input-channel and post-processing-unit sharing.
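The bucket method the abstract refers to (Pippenger-style MSM) can be sketched in a few lines; this is an illustrative software model where plain integers stand in for elliptic-curve points and left shifts stand in for repeated point doublings, not the paper's hardware design:

```python
def msm_bucket(scalars, points, window=4):
    # Pippenger-style MSM: split each scalar into `window`-bit digits,
    # classify points into buckets per digit, then aggregate buckets.
    num_windows = (max(scalars).bit_length() + window - 1) // window
    result = 0
    for w in reversed(range(num_windows)):
        result <<= window  # for EC points: `window` doublings
        buckets = [0] * (1 << window)
        for s, p in zip(scalars, points):
            digit = (s >> (w * window)) & ((1 << window) - 1)
            if digit:
                buckets[digit] += p  # bucket classification
        # Bucket aggregation: sum_b b * buckets[b] via a running suffix sum.
        running, acc = 0, 0
        for b in range(len(buckets) - 1, 0, -1):
            running += buckets[b]
            acc += running
        result += acc
    return result
```

The three phases named in the abstract (bucket classification, bucket aggregation, result aggregation) correspond to the inner loop, the suffix-sum loop, and the per-window accumulation into `result`.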
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLow-Latency and Low-Power Edge AI is essential for Virtual Reality and Augmented Reality applications. Recent advances show that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can pose system challenges for latency and energy-efficiency due to their diverse nature in dataflow and memory access patterns.
In this work, we leverage the architecture heterogeneity of Neural Processing Units (NPU) and Compute-In-Memory (CIM) and perform diverse execution schemas to efficiently execute these hybrid models. We also introduce H4H-NAS, a Neural Architecture Search framework to design efficient hybrid CNN/ViT models for heterogeneous edge systems with both NPU and CIM. Our H4H-NAS approach is powered by a performance estimator built with NPU performance results measured on real silicon, and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN/ViT models with fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on the ImageNet dataset. Moreover, results from our Algo/HW co-design reveal up to 56.08% overall latency and 41.72% energy improvements by introducing such heterogeneous computing over baseline solutions. The framework guides the design of hybrid network architectures and system architectures of NPU+CIM heterogeneous systems.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionNear-data processing (NDP), a solution to reduce data movement overhead between host and memory, should not interfere with host access to ensure system fairness. We propose a cost-effective and energy-efficient LRDIMM-based NDP architecture (HAIL-DIMM) that can seamlessly interleave NDP and regular memory access and is a drop-in replacement for existing main memory modules. The proposed NDP exploits the interleaving capability of the memory controller to interleave NDP and host access naturally. To take advantage of bank interleaving, an atomic operation of the proposed NDP, which consists of data movement and computation, is recognized by the memory controller as a DDR READ/WRITE but by the HAIL-DIMM as NDP based on the request's address. We implement a prototype of the proposed NDP architecture on an FPGA platform as proof of concept. The evaluation results show that the NDP system achieves up to 2.19x speedup in latency and up to 45.4% energy saving for data movement over the baseline system in memory-bound workloads.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSubstantially different from an ordinary differential equation (ODE), a partial differential equation (PDE) contains partial derivatives over multiple independent variables. As a result, the computational complexity of PDEs increases dramatically. However, since PDEs are widely used in modeling natural and biological phenomena, such as thermodynamics and fluid dynamics, it is necessary to compose efficient hardware PDE solvers while maintaining high accuracy at the same time. In this paper, dynamic stochastic computing (DSC) is considered to implement PDE solvers with reduced circuit complexity. In a DSC-based implementation, a varying signal is encoded by a dynamic stochastic sequence (DSS) consisting of 0's and 1's. The numerical solutions are then obtained by operations on the stochastic bits from multiple DSSs instead of processing complex fixed-/floating-point numbers as in a conventional arithmetic circuit, thus significantly reducing the circuit area and power consumption. Basic stochastic circuits are proposed to provide unbiased estimates of the solutions for the heat and Burgers equations. When these basic circuits are connected in an array, they can solve a 2-D heat equation and a 1-D Burgers equation, respectively. The quality of the results produced by the proposed circuits is high: the RMSE is lower than 4.99×10^-3 when solving the heat equation and lower than 1.840×10^-4 when solving the Burgers equation, while up to 93.90% hardware and 97.09% power savings are achieved compared to fixed-point implementations.
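The encoding idea behind stochastic computing can be illustrated in software; this sketch (with hypothetical helper names, and a simple AND-gate multiplier rather than the paper's PDE solver circuits) shows how a value becomes a random bit sequence whose mean is an unbiased estimate:

```python
import random

def to_dss(x, n, rng):
    # Encode a value x in [0, 1] as a stochastic sequence of n bits,
    # each 1 with probability x; the bit mean is an unbiased estimate of x.
    return [1 if rng.random() < x else 0 for _ in range(n)]

def sc_multiply(a_bits, b_bits):
    # For independent sequences, a bitwise AND multiplies the encoded values:
    # P(a & b = 1) = P(a = 1) * P(b = 1).
    return [a & b for a, b in zip(a_bits, b_bits)]

def decode(bits):
    return sum(bits) / len(bits)

rng = random.Random(42)
a = to_dss(0.6, 10_000, rng)
b = to_dss(0.5, 10_000, rng)
```

Replacing a fixed-point multiplier with a single AND gate is the source of the area and power savings the abstract reports; the cost is sequence length (accuracy improves only as the square root of the bit count).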
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAs four-level pulse-amplitude-modulation (PAM-4) signaling becomes widely adopted for high-speed wireline communication, achieving robust equalization is crucial due to reduced eye-opening compared to PAM-2. Utilizing analog-to-digital converters and digital signal processing with feed-forward equalizer (FFE) and decision-feedback equalizer (DFE), conventional equalizer adaptation may result in suboptimal bit-error-rate (BER) performance. This paper introduces an improved hill-climbing algorithm to obtain the optimal main tap position in FFE and the DFE tap coefficient. Implemented on an FPGA with a 12-tap FFE and 1-tap DFE, experimental results on a real channel model demonstrate superior BER performance compared to the conventional approaches.
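Generic hill climbing over discrete equalizer settings can be sketched as follows; the cost function stands in for a measured BER, and all names are illustrative, not the paper's improved algorithm or FPGA implementation:

```python
def hill_climb(cost, start, neighbors, max_iters=100):
    # Greedy descent on the cost surface: move to the best neighboring
    # setting (e.g. an FFE main-tap position or DFE tap coefficient)
    # until no neighbor improves the measured cost.
    best, best_cost = start, cost(start)
    for _ in range(max_iters):
        improved = False
        for cand in neighbors(best):
            c = cost(cand)
            if c < best_cost:
                best, best_cost, improved = cand, c, True
        if not improved:
            break
    return best, best_cost
```

Plain hill climbing like this can stall in local minima of the BER surface; handling that is precisely where an improved variant such as the paper's earns its keep.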
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionThe increasing deployment of artificial intelligence (AI) for critical decision-making amplifies the necessity for trustworthy AI, where uncertainty estimation plays a pivotal role in ensuring trustworthiness. Dropout-based Bayesian Neural Networks (BayesNNs) are prominent in this field, offering reliable uncertainty estimates. Despite their effectiveness, existing dropout-based BayesNNs typically employ a uniform dropout design across different layers, leading to suboptimal performance. Moreover, as diverse applications require tailored dropout strategies for optimal performance, manually optimizing dropout configurations for various applications is both error-prone and labor-intensive. To address these challenges, this paper proposes a novel neural dropout search framework that automatically optimizes both the dropout-based BayesNNs and their hardware implementations on FPGA. We leverage one-shot supernet training with an evolutionary algorithm for efficient dropout optimization. A layer-wise dropout search space is introduced to enable the automatic design of dropout-based BayesNNs with heterogeneous dropout settings. Extensive experiments demonstrate that our proposed framework can effectively find design configurations on the Pareto frontier. Compared to manually designed dropout-based BayesNNs on GPU, our search approach produces FPGA designs that can achieve up to 33× higher energy efficiency. Compared to state-of-the-art FPGA designs of BayesNNs, the solutions from our approach achieve higher algorithmic performance. Our designs and tools will be open-source upon paper acceptance.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionNon-invasive sensing of deep tissue, a key task in many medical cyber-physical systems, is inherently challenged by low signal-to-noise ratio (SNR), and unpredictable anatomical and physiological tissue dynamics, which render a particular sensor design sub-optimal. The use of multiple sensors can conceptually enable the system to operate more robustly under such dynamics, assuming that the data acquired by different sensors can be adaptively integrated to form a coherent view of the tissue.
In this paper, we present an algorithm for data fusion at several levels of information abstraction (raw-data, feature, and decision levels) to meet this need. We validate the proposed technique via non-invasive fetal heart rate tracking using in-vivo data collected in gold-standard pregnant ewe experiments. The root-mean-squared error of our three-level hierarchical data fusion improved by over 31% and 19% compared to single-level and two-level fusion, respectively. This underscores the robustness of our approach in overcoming deep tissue sensing challenges.
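The three abstraction levels named in the abstract can be sketched as simple fusion rules; these helpers (averaging, confidence weighting, majority voting) are illustrative stand-ins, not the authors' hierarchical algorithm:

```python
from statistics import mean

def fuse_raw(sensor_streams):
    # Raw-data level: average time-aligned samples across sensors.
    return [mean(samples) for samples in zip(*sensor_streams)]

def fuse_features(features, weights):
    # Feature level: confidence-weighted combination of per-sensor features.
    return sum(w * f for w, f in zip(weights, features)) / sum(weights)

def fuse_decisions(decisions):
    # Decision level: majority vote over per-sensor decisions.
    return max(set(decisions), key=decisions.count)
```

A hierarchical scheme in this spirit would apply all three in sequence, so that a sensor degraded by tissue dynamics is down-weighted at each stage rather than discarded outright.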
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionLarge-scale transformers with millions of weights achieve great success in multiple natural language processing (NLP) tasks. To relieve the memory bottleneck of multi-task model deployment, transfer learning tunes part of the weights, with parameters shared among tasks. Moreover, computing-in-memory (CIM) emerges as an efficient solution for neural network (NN) acceleration. With higher storage density, RRAM-CIM can store the large-scale model without costly weight loading, compared with the other mainstream option, SRAM-CIM. However, the RRAM rewrites for tuning and the dynamic weight matrix-vector multiplication (MVM) in transformers require high-cost RRAM writes in RRAM-CIM. Current hybrid CIM can compensate for the weakness of RRAM-CIM by adding SRAM-CIM with independent MVM operation. However, the tuned weights in transfer learning cannot be implemented due to the demand for cooperative addition of MVM results from shared weights and tuned weights. In this paper, a hybrid three-dimensional RRAM-CIM and SRAM-CIM architecture (HEIRS) is proposed for multi-task transformer acceleration, with monolithic 3D integration of high-density RRAM-CIM and high-performance SRAM-CIM. The 3D RRAM-CIM with ultra-high density stores the whole NN model with mitigated off-chip weight loading. The SRAM-CIM is employed to efficiently perform dynamic weight MVM without RRAM write operations. Moreover, a novel hybrid-CIM paradigm is proposed with an input-selective adder tree to support cooperative addition in transfer learning. Experiments show that, compared with RRAM-CIM and SRAM-CIM, the proposed HEIRS improves energy efficiency by up to 7.83x and 2.29x on BERT, respectively. Meanwhile, latency is also reduced by up to 85.5% and storage density is enhanced by 7.2x, compared to RRAM-CIM.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionCross-die paths in 3DIC require many additional signoff corner analyses compared to conventional 2DIC signoff corners, owing to the different possible conditions at each die level. In the case of a multi-voltage 3DIC interface, signoff corners need to be coupled with 3DIC voltage scenarios in order to create complete multi-voltage signoff scenarios. 3DIC simultaneous multi-voltage analysis compresses the voltage scenarios per unique 3DIC process/temp/BEOL combination, which in turn reduces the number of analysis corners and helps reduce the compute requirement. A dominant-corner selection approach helps further limit the analysis corners and reduce the overall compute requirement. Context derived from 3DIC multi-voltage timing analysis can be used as a voltage-scenario-specific I/O budget (min/max) for die-level 2DIC timing analysis in order to optimize the setup/hold timing of the 3DIC interface.
Configurable delay cells added on 3DIC interface paths can be used for silicon tuning of 3DIC interface paths
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLarge matrix multiplications are crucial in transformers, especially in self-attention. We propose a heterogeneous vector systolic accelerator where each processing element (PE) has a varying vector lane width, diverging from homogeneous lane widths across all PEs. We partition input matrices into sub-matrices for efficient mapping onto PEs, optimizing resource utilization and minimizing latency. We implement the design on an AMD-Xilinx ZCU104 FPGA. The heterogeneous architecture reports 1.68x better throughput and latency compared to a homogeneous architecture, with 23% better resource utilization. While using heterogeneous vector tiles, we prefer tiles with larger lane widths for optimal throughput.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe manufacturing of semiconductor designs passes through many complex steps, among which is process-variability compensation, applied to the layout geometries as selective edge bias to improve product yield. Biasing layout polygons impacts the resistances and capacitances of the layout, so the parasitic extraction step needs to be aware of these bias values; moreover, the final polygons after biasing must not result in a design rule violation and must not change the circuit topology of the design. A biasing algorithm typically involves modifying the dimensions and/or positions of the polygons to ensure that they meet the design rules and are manufacturable. Implementing the bias algorithm correctly is critical to ensure correct compensation and manufacturability. This paper presents an automated QA method to assure bias is implemented correctly, thus ensuring downstream manufacturing processes are applied to a correct layout. The solution introduces the LVS Retarget Checker designed for this purpose. The proposed method provides high coverage, enhancing the reliability of the PDK (Process Design Kit) and elevating the overall quality of the design.
Tutorial
EDA
DescriptionThis half-day tutorial aims to impart a comprehensive understanding of the theory and application of superconductor electronics, spanning from the foundational principles of superconductivity to the operational intricacies of superconductor logic cells and digital circuits. The tutorial will explore diverse applications, ranging from neuromorphic computing and signal processing to homomorphic computing and quantum computing.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionThe usability and popularity of high-level synthesis (HLS) tools are still limited due to the lack of support for dynamic memory management (DMM). Though HLS-compatible DMM solutions have been proposed recently, based on our investigation, none of them achieves both high performance (i.e., minimal memory (de-)allocation latency) and resource efficiency (i.e., managing arbitrarily sized memory with minimal FPGA resource consumption), seriously limiting their practicality. In response, we propose HeroDMM, a high-performance and resource-efficient dynamic memory manager for HLS. Results show that HeroDMM outperforms state-of-the-art HLS-compatible DMM solutions by 61.69%--99.99% in performance improvement and 23.79%--97.22% in resource consumption savings.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionKey-Value Stores (KVStores) have been widely used in real-world production environments. To guarantee data durability, traditional KVStores suffer from high write latency mainly caused by long networking and remote data-persisting times. To solve this problem, this paper presents a novel remote data-persisting path for KVStores with low latency. The novelty of this study is two-fold. First, we present PMRDirect, which utilizes the Persistent Memory Region (PMR) in the NVMe standard to construct a remote data-persisting path from RDMA NICs to the PMR region inside an SSD. Second, to showcase PMRDirect in KVStores, we develop a new accessing stack called PMRAccess, enabling remote clients to persist data to existing KVStores quickly. We conduct extensive experiments to compare PMRDirect with several remote data-persisting paths and evaluate PMRAccess on LevelDB. The results show that PMRDirect achieves the lowest write latency and the highest write bandwidth. Moreover, PMRAccess outperforms the SSD-based accessing stack by up to 6.1× in write throughput and 36× in write tail latency, and it achieves 1.7× higher write throughput and 0.59× the write tail latency of the PMEM-based accessing stack.
Research Manuscript
Design
Quantum Computing
DescriptionIn this paper, we introduce HiLight, an optimization framework designed for enhancing SC communication. HiLight integrates qubit-mapping strategies with program- and hardware-level optimizations, providing high-performance and lightweight scalable solutions. Featuring SWAP-less initial placement, HiLight utilizes qubit-proximity and pattern matching to minimize path congestion. In its routing strategy, HiLight employs fast gate-ordering and braiding path-finding to maximize gate parallelism and expedite optimal path selection. The combined optimizations improve latency and resource utilization. Compared with the state-of-the-art approach, HiLight achieves a remarkable reduction in latency and runtime by 43.5% and 91.9%, respectively, signifying its potential to advance the FTQC era.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionThis study presents a novel high-dimensional multi-objective optimization method via adaptive gradient-based subspace sampling for analog circuit sizing. To handle constrained multi-objective optimization, we exploit promising regions from a non-crowded Pareto front, with lightweight Bayesian optimization (BO) based on a novel approximate constrained expected hypervolume improvement. This lightweight BO is computationally efficient, with constant complexity with respect to the number of simulations. To tackle high-dimensional challenges, we reduce the effective dimensionality around promising regions by sampling candidates in an adaptive subspace. The subspace is constructed from gradients and previous successful steps, with their significance decaying over iterations. The gradients are approximated by sparse regression without additional simulations. Experiments on synthetic benchmarks and analog circuits illustrate the advantages of the proposed method over Bayesian and evolutionary baselines.
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionWith the rise of tiny IoT devices powered by machine learning (ML), many researchers have directed their focus toward compressing models to fit on tiny edge devices. Recent works have achieved remarkable success in compressing ML models for object detection and image classification on microcontrollers with small memory, e.g., 512kB SRAM. However, there remain many challenges prohibiting the deployment of ML systems that require high-resolution images. Due to fundamental limits in memory capacity for tiny IoT devices, it may be physically impossible to store large images without external hardware. To this end, we propose a high-resolution image scaling system for edge ML, called HiRISE, which is equipped with selective region-of-interest (ROI) capability leveraging analog in-sensor image scaling. Our methodology not only significantly reduces the peak memory requirements, but also achieves up to 17.7x reduction in data transfer and energy consumption.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionStatic Timing Analysis (STA) has evolved dramatically in the last 20 years. From a fairly simple extraction and back-annotation of a single flat entity, analyzed in a couple of bounding timing corners, it has grown exponentially as technology nodes have advanced: it now considers many other physical factors and handles design data sizes and STA engineering team sizes almost unthinkable 20 years ago. Join us for this historical retrospective, a brief check-in on current STA techniques and requirements, and a glimpse into what may be coming in the near future, and what EDA can do to help. Speakers from Marvell, Synopsys, and IBM will cover the full history, present, and future of STA in this exciting and entertaining presentation.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description3DIC design planning has been one of the most interesting and challenging areas of research and development in recent times, owing to growing demand and technological improvements that can cater to the latest requirements in IC design.
This piece of work has been carried out to explore the different ventures in this space that can result in efficient optimization of performance and area parameters thereby overcoming different challenges in the process.
The key is to establish proper die to die connections and effectively develop a three-dimensional structure that is fully functional.
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionArtificial intelligence (AI) edge devices often feature numerous storage units and sequential logic circuits, making them vulnerable to soft errors. For reliable and critical edge AI applications, assessing System-on-Chip (SoC) reliability in advance is essential. Here, there are two cases: a self-designed SoC (white-box), or a commercial off-the-shelf (COTS) chip (black-box). This study uses alpha particle irradiation results on our 22nm AI SoC as a golden reference to estimate soft error impacts, injecting faults across the entire chip in the white-box case and into the accessible memory and registers in the black-box case. The results demonstrate a high degree of consistency between the white-box case and golden reference, meaning that pre-silicon reliability assessment is feasible. As for the black-box case, the proportion of memory in the SoC remains unchanged and is still significantly larger than that of registers, and hence the simulation results between black-box and white-box are not substantially different.
Research Manuscript
Embedded Systems
Embedded Software
DescriptionThe advent of ultra-low-latency storage devices has narrowed the performance gap between storage and CPU in computing platforms, facilitating synchronous I/O adoption. Yet, this approach introduces substantial busy-waiting time and underutilizes computing units. To address this, we propose a lightweight Idle-Time-Stealing (ITS) design. This involves a self-improving thread conducting pre-fetching for high-priority processes during synchronous I/O, and an I/O-waiting process continuing subsequent instruction executions when justifiable. Another thread, the self-sacrificing thread, proactively switches low-priority process I/O requests from synchronous to asynchronous mode, prioritizing high-priority executions. Experimental results demonstrate the effectiveness of our ITS design in reducing CPU idle time.
Tutorial
EDA
DescriptionRISC-V is an industry wide ISA (Instruction Set Architecture) standard used for developing embedded processors that target Semiconductor products of any type. PSS (Portable Stimulus) is an Accellera standard verification language used by EDA companies to develop tools, that given a PSS Model, generates coverage driven scenarios to enable meeting verification goals with less effort, taking advantage of portability, abstraction, and automation capabilities enabled by the language. In this tutorial we teach how to code PSS Models needed for the verification of any RISC-V platform (e.g. RISC-V embedded core platform, RISC-V multi-core coherent platform, RISC-V SOC (System on Chip) with external interfaces, etc.).
Synopsys, as a RISC-V developer, provides reference methodologies for the verification and debugging of RISC-V system designs, available now along with Synopsys EDA flows, emulation and virtual prototyping solutions, and methodologies to further support RISC-V SoC verification. Collaborative efforts include a RISC-V verification methodology cookbook for Bluespec cores, the "Understanding UVM Coverage for RISC-V Processor Designs" white paper, RISC-V and processor verification using ImperasDV verification solutions, and the industry-leading Synopsys VCS® simulation and Verdi® debug tools for improved efficiency (see news release).
As PSS usage grows together with the incoming requests to better enable PSS for RISC-V platforms, we endeavor to expand on a methodology cookbook with the addition of PSS. In this tutorial we enable the RISC-V PSS eco community with some fresh ideas on how to use PSS to get started. We introduce the PSS modeling patterns below that can be used to get started and hopefully provide an appetite to use and create more.
For each modeling pattern, we give a name and a short explanation of what the pattern consists of:
(1) Basic: PSS modeling techniques that can be used to generate basic RISC-V assembly code sequences. (2) Integration: PSS modeling techniques that can be used to generate RISC-V assembly code that interacts with generated traffic scenarios consisting of embedded C and SV testbench generated code. (3) Nested loops and routines: PSS modeling techniques that can be used to generate legal assembly code with nested loops and nested routine calls. (4) Memory sharing: PSS modeling techniques that can be used to generate blocks of assembly code that share memory, with exclusive and non-exclusive access. (5) Runtime parameterization: PSS modeling techniques that can be used to generate parameterized assembly code run post-silicon, where a host device can change parameters on the fly. (6) Validating the scenario: PSS modeling techniques to create a reference model in PSS that can be used as an executable specification to debug and validate PSS-generated scenarios.
The expectation is that this 3-hour tutorial will provide any RISC-V platform developer with a good enough tool kit to be able to perform all verification requirements needed for a RISC-V platform.
Synopsys as a RISC-V developer is providing reference methodologies for the verification and debugging of RISC-V system designs are available now, along with Synopsys EDA flows, emulation and virtual prototyping solutions, and methodologies to further support RISC-V SoC verification. Collaborative efforts include RISC-V verification methodology cookbook for Bluespec cores, "Understanding UVM Coverage for RISC-V Processor Designs" white paper, RISC-V and processor verification using ImperasDV verification solutions, and the industry-leading Synopsys VCS® simulation and Verdi® debug tools for improved efficiency (see news release).
As PSS usage grows together with the incoming requests to better enable PSS for RISC-V platforms, we endeavor to expand on a methodology cookbook with the addition of PSS. In this tutorial we enable the RISC-V PSS eco community with some fresh ideas on how to use PSS to get started. We introduce the PSS modeling patterns below that can be used to get started and hopefully provide an appetite to use and create more.
For each modeling pattern, we give a name and a short explanation of what the pattern consists of:
(1) Basic: PSS modeling techniques that can be used to generate basic RISC-V assembly code sequences. (2) Integration: PSS modeling techniques that can be used to generate RISC-V assembly code that interacts with generated traffic scenario's consisting of embedded C and SV testbench generate code.(3) Nested loops and routines: PSS modeling techniques that can be used to generate legal assembly code with nested loops and nested routine calls.(4) Memory sharing: PSS modeling techniques that can be used to generate blocks of assembly code that share memory, with exclusive and non-exclusive access. (5) Runtime parameterization: PSS modeling techniques that can be used to generate parameterized assembly code run on a post-silicon, where a host device can change parameters on-the-fly.(6) Validating the scenario: PSS modeling techniques to create a reference model in PSS that can be used as an executable specification to debug and validate PSS generated scenarios.
The expectation is that this 3-hour tutorial will provide any RISC-V platform developer with a tool kit good enough to perform all the verification tasks a RISC-V platform requires.
Analyst Presentation
AI
EDA
IP
DescriptionWith the advent of generative AI, the landscape of the semiconductor industry is changing. AI is accelerating the in-sourcing of semiconductor design by systems companies. Vertical integration from silicon to systems is on the rise. One consequence of this changing landscape is that chip design appears to be capturing a greater share of the semiconductor value chain. This trend represents a major opportunity for the EDA and IP industries, which are the pick-and-shovel providers to the AI gold rush. In this presentation, we review the latest trends in semiconductors and explain why we think the EDA and IP industry is the place to be in the AI super cycle.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe Processing-In-Memory (PIM) architecture has become a promising candidate for deep learning accelerators by integrating computation and memory. Most PIM-based studies improve performance and energy efficiency by using the weight-stationary (WS) data flow. However, in some neural networks the activation movement is larger than the weight movement; in other words, activation movements may become the bottleneck for reducing latency and power consumption, so there is great potential to improve performance and energy efficiency by reducing them. In this paper, we propose a Hybrid data flow PIM Architecture (HPA) that realizes a flexible combination of the Input-Stationary (IS) and WS data flows. To the best of our knowledge, this is the first hybrid data flow design for PIM architectures. The IS data flow replaces convolution unrolling with selecting activations according to convolution windows. We also propose a parallel computing method and optimize the pipeline. Our experimental results and analysis demonstrate the potential of the HPA. The performance and energy efficiency of the HPA reach 1.64 GFLOPS to 63 GFLOPS and 2.1 TOPS/W to 151 TOPS/W, respectively. Compared to the state-of-the-art design, NEBULA, the HPA improves energy efficiency and performance by 22.1× and 7.8×, respectively, when deploying MobileNet V1.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWavelength-routed optical networks-on-chip (WRONoCs) attract ever-increasing attention for supporting high-speed communications with low power and latency. Among all WRONoC routers, optical ring routers attract much interest for their simple structures. However, current designs of ring routers have overlooked the customization problem. When adapting to applications that have specific communication requirements, current designs suffer high propagation loss caused by long worst-case signal paths and high splitter usage in power distribution networks (PDNs). To address those problems, we propose a novel customization method, Hierarchical Ring (HRing), to synthesize application-specific ring routers. Instead of sequentially connecting all nodes, we cluster the nodes and connect them with sub-ring waveguides to reduce the path length. Besides, we propose a mixed integer linear programming model for wavelength assignment to reduce the number of PDN splitters. We compare HRing to three state-of-the-art ring router design methods for six applications. Experimental results show that HRing can greatly reduce the length of the longest signal path, the worst-case insertion loss, and the number of splitters in the PDN, which contributes to a significant improvement in power efficiency.
Research Manuscript
Embedded Systems
Time-Critical and Fault-Tolerant System Design
DescriptionEmbedded Neural Networks (NNs) face significant challenges due to Single-Event Upsets (SEUs), which compromise their reliability. To address this challenge, previous works study the layer-wise SEU sensitivity of AI models. In contrast to these techniques, which remain at a high level, we propose a more accurate analysis, highlighting that, except for the last layer, faults transitioning from 0 to 1 significantly impact classification outcomes. Based on this specific behavior, we propose a simple hardware block able to detect and mitigate the SEU impact. The results show that HTAG protection efficiency is near 96.85% for the LeNet-5 CNN inference model, suitable for an embedded system; this result can be further improved by combining other protection methods for the classification layer. Additionally, our approach significantly reduces area overhead and critical path compared to existing approaches.
Research Manuscript
Hyb-Learn: A Framework for On-Device Self-Supervised Continual Learning with Hybrid RRAM/SRAM Memory
Design
Emerging Models of Computation
DescriptionWhile RRAM crossbar-based In-Memory Computing (IMC) has proven highly effective in accelerating Deep Neural Network (DNN) inference, RRAM-based on-device training is less explored due to the high energy consumption of weight re-programming and the cells' low endurance. Besides, emerging trends indicate a need for on-device continual learning, which sequentially acquires knowledge from multiple tasks to enhance the user's experience and eliminate data privacy concerns. However, learning each new task leads to forgetting the knowledge learned on prior tasks, a phenomenon known as catastrophic forgetting. To address these challenges, we are the first to propose a novel training framework, Hyb-Learn, for enabling on-device continual learning with a hybrid RRAM/SRAM IMC architecture design. Specifically, when training each newly arriving task, our approach first partitions the model into two groups, to be frozen or retrained, based on the proposed task-correlated PE-wise correlation, and maps them to RRAM and SRAM, respectively. In practice, the RRAM stores frozen weights with strong correlation to prior tasks, eliminating RRAM's high weight-reprogramming cost, while the SRAM stores the remaining weights that will be updated. Furthermore, to maximize the freezing ratio for improving training efficiency while maintaining accuracy and mitigating catastrophic forgetting, we incorporate self-supervised learning algorithms initialized from a pre-trained model for training each new task.
Research Manuscript
Design
Quantum Computing
DescriptionQuantum computing based on Neutral Atoms (NAs) provides a wide range of computational capabilities, encompassing high-fidelity long-range interactions with native multi-qubit gates, and the ability to shuttle arrays of qubits.
While previously these capabilities have been studied individually, we propose the first approach of a fast hybrid compiler to perform circuit mapping and routing based on both high-fidelity gate interactions and qubit shuttling.
We delve into the intricacies of the compilation process when combining multiple capabilities and present effective solutions to address resulting challenges.
The final compilation strategy is then showcased across various hardware settings, revealing its versatility, and highlighting potential fidelity enhancements achieved through the strategic utilization of combined gate- and shuttling-based routing.
With the additional multi-qubit gate support for both routing capabilities, the proposed approach is able to take advantage of the full spectrum of computational capabilities offered by NAs.
While previously these capabilities have been studied individually, we propose the first approach of a fast hybrid compiler to perform circuit mapping and routing based on both high-fidelity gate interactions and qubit shuttling.
We delve into the intricacies of the compilation process when combining multiple capabilities and present effective solutions to address resulting challenges.
The final compilation strategy is then showcased across various hardware settings, revealing its versatility, and highlighting potential fidelity enhancements achieved through the strategic utilization of combined gate- and shuttling-based routing.
With the additional multi-qubit gate support for both routing capabilities, the proposed approach is able to take advantage of the full spectrum of computational capabilities offered by NAs.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSC reduces the complexity of arithmetic circuits but brings extra conversion cost and a time complexity of O(2^N), which leads to much lower efficiency than binary computing. This paper proposes a linear-time-complexity, O(N), and conversion-free hybrid stochastic computing (HSC) scheme. Moreover, a hybrid stochastic computing-in-memory method is proposed, mapping the multiplication and addition of HSC onto the memory's enable and addressing circuits. Thus, any original memory can realize HSC operations without additional circuits. Experiments show that FPGA-based block memory (BRAM) operating matrix multiplication reaches 1.152 TOPS and 17.2 TOPS/W·bit. Each 18K BRAM provides 18 GOPS of performance (INT8) at 8.34 mW and 600 MHz.
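As background on the conversion cost the abstract refers to: in conventional unipolar stochastic computing, a value in [0, 1] is encoded as the density of 1s in a random bitstream, and multiplication reduces to a bitwise AND of two streams. Below is a minimal Python sketch of that baseline scheme only (not the paper's HSC method); the stream length and seed are illustrative.

```python
import random

def encode(p, n, rng):
    # Unipolar SC encoding: each bit is 1 with probability p.
    return [1 if rng.random() < p else 0 for _ in range(n)]

def decode(bits):
    # The represented value is the fraction of 1s in the stream.
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    # In unipolar SC, multiplication is a bitwise AND of the streams:
    # P(a & b) = P(a) * P(b) for independent streams.
    return [a & b for a, b in zip(a_bits, b_bits)]

rng = random.Random(0)
n = 100_000
a, b = encode(0.5, n, rng), encode(0.4, n, rng)
prod = decode(sc_multiply(a, b))  # approximately 0.5 * 0.4 = 0.2
```

Note the accuracy depends on stream length N, which is exactly the O(2^N)-style cost the abstract's conversion-free approach seeks to avoid.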
IP
Engineering Tracks
IP
DescriptionTo accelerate convolutions and matrix multiplications for CNNs and Transformers, respectively, we propose an FPGA-based Vector Systolic Array (VSA) Accelerator. This custom IP employs an adaptable vector lane width to enable parallel data processing for enhanced throughput. We enhance this architecture by introducing a hybrid tiled vector systolic design that utilizes LUTs and DSPs in a complementary fashion via a unique data mapping strategy. Results show a 7x and 1.26x increase in throughput for single-tile and multi-tile configurations, respectively. The hybrid tile approach achieves competitive throughputs of 1165 GOPs and 1072 GOPs for Vector-6 and Vector-8, outperforming related work by 3.8x. Additionally, we designed this architecture with a novel convolution method to reduce latency and packaged it as a customizable IP targeted at an FPGA accelerator. This design reduces memory access latency while maintaining competitive throughput by reusing kernels and by partitioning image matrices to suit the different lane widths.
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionCache side-channels are a major threat to cryptographic implementations, particularly block ciphers. Traditional manual hardening methods transform block ciphers into Boolean circuits, a practice refined since the late 90s. The only existing automatic approach based on Boolean circuits achieves security but suffers from performance issues. This paper examines the use of Lookup Tables (LUTs) for automatic hardening of block ciphers against cache side-channel attacks. We present a novel method combining LUT-based synthesis with quantitative static analysis in our HyCaMi framework. Applied to seven block cipher implementations, HyCaMi shows significant improvement in efficiency, being 9.5× more efficient than previous methods, while effectively protecting against cache side-channel attacks. Additionally, for the first time, we explore balancing speed with security by adjusting LUT sizes, providing faster performance with slightly reduced leakage guarantees, suitable for scenarios where absolute security and speed must be balanced.
Research Manuscript
Design
Emerging Models of Computation
DescriptionComputationally challenging combinatorial optimization problems (COPs) play a fundamental role in various applications.
To tackle COPs, many Ising machines and Quadratic Unconstrained Binary Optimization (QUBO) solvers have been proposed, which typically involve direct transformation of COPs into Ising models or equivalent QUBO forms (D-QUBO).
However, when addressing COPs with inequality constraints, this D-QUBO approach introduces numerous extra auxiliary variables, resulting in a substantially larger search space, increased hardware costs, and reduced solving efficiency.
In this work, we propose HyCiM, a novel hybrid computing-in-memory (CiM) based QUBO solver framework, designed to overcome the aforementioned challenges.
The proposed framework consists of
(i) an innovative transformation method (the first, to our knowledge) that converts COPs with inequality constraints into an inequality-QUBO form, thus eliminating the need for expensive auxiliary variables and associated calculations;
(ii) "inequality filter", a ferroelectric FET (FeFET)-based CiM circuit that accelerates the inequality evaluation, and filters out infeasible input configurations;
(iii) a FeFET-based CiM annealer that is capable of approaching global solutions of COPs via iterative QUBO computations within a simulated annealing process.
The evaluation results show that HyCiM drastically narrows down the search space, eliminating 2^100 to 2^2536 infeasible input configurations compared to the conventional D-QUBO approach.
Consequently, the narrowed search space, reduced to 2^100 feasible input configurations, leads to a substantial hardware area overhead reduction, ranging from 88.06% to 99.96%.
Additionally, HyCiM consistently exhibits a high solving efficiency, achieving a remarkable average success rate of 98.54%, whereas the D-QUBO implementation achieves only 10.75%.
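To make the QUBO setting concrete: a QUBO solver minimizes E(x) = xᵀQx over binary vectors, and simulated annealing, as used in the framework above, is a standard search strategy for it. The sketch below is a generic textbook-style annealer on a tiny "exactly one of three" penalty QUBO, not HyCiM's CiM-based implementation; the schedule and step count are illustrative.

```python
import math
import random

def qubo_energy(Q, x):
    # E(x) = x^T Q x for a binary vector x.
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def simulated_annealing(Q, steps=2000, t0=2.0, seed=0):
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    e = qubo_energy(Q, x)
    best_x, best_e = x[:], e
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-3  # linear cooling schedule
        i = rng.randrange(n)
        x[i] ^= 1                        # propose a single bit flip
        e_new = qubo_energy(Q, x)
        if e_new <= e or rng.random() < math.exp((e - e_new) / t):
            e = e_new                    # accept (Metropolis criterion)
            if e < best_e:
                best_x, best_e = x[:], e
        else:
            x[i] ^= 1                    # reject: undo the flip
    return best_x, best_e

# "Exactly one of three" penalty (x0+x1+x2-1)^2 expands (constant dropped)
# to a QUBO with diagonal -1 and off-diagonal +2; the minimum is -1 at
# any one-hot assignment.
Q = [[-1, 2, 2],
     [0, -1, 2],
     [0, 0, -1]]
x, e = simulated_annealing(Q)
```

Inequality constraints are what inflate such Q matrices in the conventional D-QUBO flow: each constraint adds slack bits as extra variables, which is the blowup the inequality-QUBO form above avoids.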
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIntegrating hybrid memories with heterogeneous processors could leverage heterogeneity in both compute and memory domains for better system efficiency.
To ensure performance isolation, we introduce Hydrogen, a novel hardware architecture to optimize the allocation of hybrid memory resources to heterogeneous CPU-GPU systems.
Hydrogen supports efficient capacity and bandwidth partitioning between CPUs and GPUs in both memory tiers.
We propose decoupled memory channel mapping and token-based data migration throttling to enable flexible partitioning. We also support epoch-based online search for optimized configurations and lightweight reconfiguration with reduced data movements.
Hydrogen significantly outperforms existing designs by up to 1.31x.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe attention mechanism is a pivotal element within the Transformer architecture, making a substantial contribution to its exceptional performance. Within this attention mechanism, Softmax is an imperative component that enables the model to assess the degree of correlation between various segments of the input. Yet, prior research has shown that Softmax operations can significantly increase processing latency and energy consumption in the Transformer network due to their internal nonlinear operations and data dependencies.
In this work, we propose Hyft, a hardware-efficient floating-point Softmax accelerator for both training and inference. Hyft aims to reduce the implementation cost of the different nonlinear arithmetic operations by adaptively converting intermediate results into the numeric format best suited to each specific operation, leading to a reconfigurable accelerator with a hybrid numeric format.
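For reference, the Softmax operation discussed above is typically computed with a max-subtraction step so the exponentials stay in range, which is also why implementations care about the numeric format of each intermediate result. A plain floating-point sketch of that standard formulation (Hyft's internal pipeline and format conversions are not reproduced here):

```python
import math

def softmax(xs):
    # Subtract the max first so exp() never overflows: the standard
    # numerical-stability step; the result is mathematically unchanged.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # outputs sum to 1, ordering preserved
```

The exp, sum, and divide stages have very different precision and range demands, which is the motivation for converting intermediates between numeric formats per operation.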
Research Manuscript
Design
Emerging Models of Computation
DescriptionNonparametric statistics methods are a class of robust and potent statistics, which are widely used in various domains such as finance, medicine, and computer science. Such methods deliver an accurate estimation without an assumed data distribution. Moreover, they can handle discrete data with various data sources. Despite their desirable features, the calculation of large-scale nonparametric statistics is both compute- and memory-intensive, and the performance overhead hinders them from widespread usage.
This paper identifies that the key performance bottleneck lies in the rank-based operations that are intensively involved in variants of nonparametric statistics methods. These rank-based operations can thereby be fully accelerated and structurally reused among diverse statistics. We then introduce Hynify, a high-throughput and unified accelerator that facilitates a rich set of nonparametric statistics. To ensure comprehensiveness, we capture three primary computational paradigms of nonparametric statistical methods, namely, aggregation, pair-wise rank, and concordance, with the right architecture designs. To improve throughput, Hynify exploits fine-grained computation and pipelining for increased performance. We implement Hynify on an FPGA, and representative experimental results demonstrate that Hynify delivers up to 160x/21x throughput improvement over a GPU and a 64-core CPU, respectively, while achieving up to 781x/62x energy efficiency improvement.
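To illustrate why rank-based operations dominate: a representative nonparametric statistic such as Spearman's rho is simply the Pearson correlation of rank vectors, so ranking (with tie handling) is the core kernel. A small pure-Python software reference of that computation (not Hynify's architecture):

```python
def ranks(xs):
    # Average ranks for ties (1-based), the convention used by most
    # rank-based nonparametric statistics.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

The sort inside `ranks` is what becomes compute- and memory-intensive at scale, and it is shared across many such statistics, hence the structural reuse noted above.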
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionThe memory wall is a growing issue in modern computing systems due to the disparity between device computing power and data communication speed. To alleviate the memory wall, Compute Express Link (CXL) is proposed to create a shared and coherent memory space between the host and device, offering opportunities to use device DRAM as cache and device memory as primary storage for memory-intensive tasks. However, challenges arise when utilizing device DRAM as cache, including high cache miss penalties caused by data access granularity mismatches and inefficient hardware cache management mechanisms. To tackle these issues, we propose Smart DRAM-Caching, an efficient framework that employs a Gaussian Mixture Model (GMM) for intelligent caching and eviction on hardware. Compared with the traditional cache replacement strategy LRU, our on-board measurements reveal that a ?% increase in cache hit rate can result in a ?% reduction in average device memory access latency. Compared with learning-based methods like LSTM, our approach achieves ?× speedup with less hardware resource consumption.
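For context on the LRU baseline the measurements compare against, the policy can be stated in a few lines: on each access, move the entry to the most-recently-used position, and evict from the least-recently-used end when capacity is exceeded. A minimal Python sketch of that baseline only (the GMM-based policy itself is not reproduced here):

```python
from collections import OrderedDict

class LRUCache:
    # Classic least-recently-used replacement: the baseline policy that
    # learning-based cache managers are typically compared against.
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a" so "b" becomes least recently used
cache.put("c", 3)  # capacity exceeded: evicts "b"
```

LRU reacts only to recency; a learned model can instead score candidates by predicted reuse, which is the gap the GMM-based approach targets.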
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionArtificial intelligence is evolving with various algorithms such as deep neural networks (DNNs), Transformers, recommendation systems (RecSys), and graph convolutional networks (GCNs). Correspondingly, multiply-accumulate (MAC) and content search are the two main operations, which can be efficiently executed on the emerging computing-in-memory (CIM) and content-addressable memory (CAM) paradigms. Recently, the emerging Indium-Gallium-Zinc-Oxide (IGZO) transistor has become a promising candidate for both CIM and CAM circuits, featuring ultra-low leakage with >300 s data retention time and high-density BEOL fabrication.
This paper proposes IG-CRM, the first IGZO-based circuit and architecture design for CIM/CAM applications. The main contributions include: 1) at the cell level, an IGZO-based 3T0C/4T0C cell design that enables both CIM and CAM functionality while matching IGZO/CMOS voltages; 2) at the circuit level, use of the BEOL IGZO transistor to reduce digital adder tree area in CIM circuits; 3) at the architecture level, a reconfigurable CIM/CAM architecture with four macro structures based on the 3T0C/4T0C cells. The proposed IG-CRM architecture shows high area/energy efficiency on various applications including DNNs, Transformers, RecSys, and GCNs. Experimental results show that IG-CRM achieves 8.09x area savings compared with an SRAM-based non-reconfigurable CIM/CAM baseline, and 1.53E3/51.9 times speedup and 1.63E4/7.62E3 times energy efficiency improvement on average compared with a CPU and a GPU.
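To illustrate the content-search operation that CAM accelerates: every stored entry is compared against the search key in parallel, and a ternary CAM additionally allows don't-care positions in stored patterns. A behavioral Python sketch of a priority-ordered TCAM lookup (table contents are illustrative; real CAM hardware performs all comparisons in a single cycle):

```python
def tcam_lookup(table, key):
    # Ternary CAM: each stored entry is a pattern of '0', '1', and 'X'
    # (don't-care). Entries are checked in priority order; the first
    # matching index is returned, as in hardware priority encoders.
    for index, pattern in enumerate(table):
        if all(p in ('X', k) for p, k in zip(pattern, key)):
            return index
    return None

table = ["10XX",   # matches any key starting with "10"
         "1100",   # exact match only
         "XXXX"]   # catch-all, lowest priority
hit = tcam_lookup(table, "1011")  # matches entry 0
```

This sequential loop models the behavior only; the energy cost of performing all comparisons at once is precisely why low-leakage devices such as IGZO are attractive for CAM arrays.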
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThis paper explores some of the challenges encountered during the digital implementation of the world's first fully integrated SoC solution for a Direct-to-Satellite IoT connectivity chip. The chip mixes analog and digital sections and was implemented in the GF 22nm process node. Due to the stringent application requirements, the digital design involved special planning of the power grid to address IR drop, and a macro placement to support an odd-shaped block. Moreover, very tight clock latency was required to meet the timing metrics. Careful consideration of the floorplan became important to mitigate congestion, especially with the limited number of metal layers and the special power grid. The engineering team had to meet a tight tapeout timeline, which left little time for full-flow iterations. We had to choose the optimal floorplan based on P&R results at the placement stage, and therefore finding a tool that correlated well between post-placement, post-route, and signoff results was key to driving this project to completion.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAshenhurst-Curtis decomposition (ACD) is a known decomposition technique used, in particular, to map combinational logic into LUT structures when synthesizing hardware designs. However, available implementations of ACD suffer from excessive complexity and slow run time, which limits their applicability and scalability. This paper presents several simplifications leading to a fast and versatile technique of ACD suitable for delay optimization. We utilize this new formulation to enhance delay-driven LUT mapping by performing ACD on the fly. Compared to state-of-the-art technology mapping, experiments demonstrate an average delay improvement of 17.94%, with affordable run time. Additionally, our method improves heavily optimized LUT networks.
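As background on the decomposition above: Ashenhurst-Curtis decomposition asks whether f(X) can be rewritten as h(g1..gk(bound set), free set), which holds exactly when the decomposition chart has at most 2^k distinct columns. A small Python sketch of that column-multiplicity test on truth tables (a brute-force formulation for tiny functions only; avoiding this kind of excessive run time is exactly what fast ACD implementations are about):

```python
from itertools import product

def column_multiplicity(f, n, bound):
    # Decomposition-chart column multiplicity: f has a disjoint
    # decomposition h(g1..gk(bound), free) iff the number of distinct
    # columns is at most 2**k (Ashenhurst for k=1, Curtis in general).
    free = [i for i in range(n) if i not in bound]
    cols = set()
    for bvals in product([0, 1], repeat=len(bound)):
        col = []
        for fvals in product([0, 1], repeat=len(free)):
            x = [0] * n
            for i, v in zip(bound, bvals):
                x[i] = v
            for i, v in zip(free, fvals):
                x[i] = v
            col.append(f(x))
        cols.add(tuple(col))
    return len(cols)

# f = (x0 AND x1) XOR x2: bound set {x0, x1} yields 2 distinct columns,
# so a single intermediate function g = x0 AND x1 suffices (k = 1).
mu = column_multiplicity(lambda x: (x[0] & x[1]) ^ x[2], 3, [0, 1])
```

A column multiplicity of 2 means the bound variables can be collapsed into one LUT output, which is what makes ACD useful for mapping logic into LUT structures.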
Front-End Design
AI
Design
Engineering Tracks
Front-End Design
DescriptionWith the increasing demand for AI and ML applications, specialized hardware designs become imperative to achieve high performance and energy efficiency in computing. Our AI Engines (AIE) are developed to proficiently accelerate such workloads, particularly complex ML models, with competitive energy efficiency. For energy-efficient computing in AIE, we developed a workload-aware power analysis methodology to push the limits of PPA targets, and applied Shift Left at the early RTL design stage for the AIE in the Ryzen, Epyc, and Versal product families. The framework includes power vector generation, automatic workload selection, power report analysis, and creation of an RTL-level power model. In addition to early power estimation for design changes at the RTL level, it generates AIE core pipeline instruction statistics used in the AIE advanced power modeling training procedure, as well as other valuable information, such as data dependencies, that can increase the accuracy of the power model. We observed that dynamic power and the CG-related metric WCPP improved by an average of 27% and 56%, respectively, throughout AIE IP development.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionAnalog in-memory-computing (IMC) is an attractive technique with a higher energy efficiency to process machine learning workloads.
However, the analog computing scheme suffers from large interface circuit overhead.
In this work, we propose a macro with a hybrid analog-digital mode computation to reduce the precision requirement of the interface circuit.
Considering the distribution of the multiplication-and-accumulation (MAC) values, we propose a nonlinear transfer function for the computing circuits that accurately computes only low MAC values in the analog domain, with a digital mode handling the less probable high MAC values.
Silicon measurement results show that the proposed macro could achieve 160 GOPS/mm^2 area efficiency and 25.5 TOPS/W for 8b/8b matrix computation.
The architectural-level evaluation for real workloads shows that the proposed macro can achieve up to 2.92x higher energy efficiency than conventional analog IMC designs.
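The hybrid dispatch described above can be summarized behaviorally: the analog path resolves low MAC values exactly but saturates beyond its range, and saturated results are recomputed in digital mode. A toy Python model of that dispatch (the ANALOG_MAX threshold and the saturation model are illustrative assumptions, not measured silicon behavior):

```python
ANALOG_MAX = 15  # hypothetical range the analog path resolves exactly

def analog_path(value):
    # Nonlinear transfer: accurate below ANALOG_MAX, saturates above it,
    # mimicking a reduced-precision analog readout interface.
    return min(value, ANALOG_MAX)

def hybrid_mac(weights, activations):
    v = sum(w * a for w, a in zip(weights, activations))
    a = analog_path(v)
    if a < ANALOG_MAX:
        return a, "analog"   # analog result is exact in this range
    return v, "digital"      # rare high values fall back to digital mode
```

Because most MAC values fall in the low range, the expensive digital path is exercised rarely, which is the source of the interface-circuit savings claimed above.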
Workshop
Design
DescriptionToday's computer architectures, and the device technologies used to manufacture them, are facing major challenges, rendering them incapable of delivering the performance required by complex applications such as Big-Data processing and Artificial Intelligence (AI). The iMACAW workshop aims at providing a forum to discuss In-Memory-Computing (as an alternative architecture) and its potential applications. To this end, we take a cross-layer and cross-technology approach covering State-of-the-Art (SoA) works that use SRAM, DRAM, FLASH, RRAM, PCM, MRAM, or FeFET as their memory technology. The workshop also aims at reinforcing the In-Memory-Computing (IMC) community and at offering a holistic vision of this emerging computing paradigm to the design automation communities. Following the two previous editions hosted at DAC, this workshop will provide an opportunity for the audience to listen to invited speakers who are pioneers of the field, learn from them, ask questions, and interact with them. Open-submission contributors also get the opportunity to share their knowledge, present their most recent work and work in progress with the community, interact with other experts in the field, and receive feedback.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionThis work presents inGRASS, a novel algorithm designed for incremental spectral sparsification of large undirected graphs. The proposed inGRASS algorithm is highly scalable and parallel-friendly, having a nearly linear time complexity for the setup phase and the ability to update the spectral sparsifier in $O(\log N)$ time for each incremental change made to the original graph with $N$ nodes. A key component in the setup phase of inGRASS is a multilevel resistance embedding step for efficiently identifying spectrally critical edges and effectively pruning spectrally similar ones, which is achieved by decomposing the initial sparsifier into node clusters with bounded effective-resistance diameters via a low-resistance-diameter decomposition (LRD) scheme. The update phase of inGRASS exploits low-dimensional node embedding vectors for efficiently estimating the importance and uniqueness of each newly added edge. As demonstrated through extensive experiments, inGRASS achieves state-of-the-art results in incremental spectral sparsification of graphs obtained from various tasks, such as circuit simulations, finite element analysis, and social networks.
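As a toy illustration of the quantity such sparsifiers rank edges by (a sketch under simplifying unit-weight assumptions, not the inGRASS algorithm itself), the effective resistance between two nodes can be computed by grounding one node and solving the reduced Laplacian system: a bridge edge scores 1 (spectrally critical), while edges inside a well-connected cluster score much lower (candidates for pruning).

```python
def effective_resistance(n, edges, s, t):
    """R_eff(s,t): inject 1A at s, extract at t, ground node n-1, solve Lx=b."""
    m = n - 1
    # Reduced Laplacian: last row/column dropped (grounded node).
    A = [[0.0] * m for _ in range(m)]
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if a < m:
                A[a][a] += 1.0
                if b < m:
                    A[a][b] -= 1.0
    b = [0.0] * m
    if s < m: b[s] += 1.0
    if t < m: b[t] -= 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, m))) / A[r][r]
    pot = x + [0.0]  # the grounded node has potential 0
    return pot[s] - pot[t]
```

For a triangle {0,1,2} with a pendant edge (2,3), the bridge (2,3) has effective resistance 1.0, while a triangle edge such as (0,1) has only 2/3, reflecting its spectral redundancy.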
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionCritical Path Generation (CPG) is crucial for static timing analysis applications to validate timing constraints. Recent years have witnessed CPG algorithms that rank critical paths efficiently and accurately. However, they all lack incrementality, which is the ability to quickly update critical paths after the circuit is incrementally modified. To solve this, we introduce Ink, an efficient incremental CPG algorithm. Ink identifies reusable paths for the next query and effectively prunes the path search space. Ink is up to 22.4× faster and consumes up to 31% less memory than a state-of-the-art timer when generating one million paths on a large design.
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionDeep Neural Network (DNN) inference consumes significant computing resources and development efforts due to the growing model size. Quantization is a promising technique to reduce the computation and memory cost of DNNs. Most existing quantization methods rely on fixed-point integers or floating-point types, which require more bits to maintain model accuracy. In contrast, variable-length quantization, which combines high precision for values with significant magnitudes (i.e., outliers) and low precision for normal values, offers algorithmic advantages but introduces significant hardware overhead due to variable-length encoding and decoding. Also, existing quantization methods are less effective for both (dynamic) activations and (static) weights due to the presence of outliers.
In this work, we propose INSPIRE, an algorithm/architecture co-designed solution that employs Index-Pair (INP) quantization and handles outliers globally with low hardware overhead and high performance gains. The key insight of INSPIRE lies in identifying typical features associated with important values, encoding them as indexes, and precomputing the corresponding results for efficient storage in a lookup table. During inference, the results for inputs with a paired index can be retrieved directly from the table, eliminating any computational overhead. Furthermore, we design a unified processing-element architecture for INSPIRE and highlight its seamless integration with existing DNN accelerators. As a result, an INSPIRE-based accelerator surpasses state-of-the-art quantization accelerators with a remarkable $9.31\times$ speedup and $81.3\%$ energy reduction, while maintaining superior model accuracy.
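The index-pair lookup idea can be sketched as follows; the codebook and table sizes are illustrative assumptions, not the paper's actual INP encoding.

```python
# Sketch of index-pair lookup (illustrative): quantize values to a small
# index set, precompute all index-pair products once, and replace runtime
# multiplications with table retrievals. Codebook below is hypothetical.

CODEBOOK = [-8, -4, -2, -1, 0, 1, 2, 4]   # hypothetical 3-bit value codebook

def quantize(x):
    """Map a value to the index of its nearest codebook entry."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x))

# Precompute products for every (weight index, activation index) pair.
LUT = [[w * a for a in CODEBOOK] for w in CODEBOOK]

def lut_dot(weights, activations):
    """Dot product via table lookups instead of multiplications."""
    wi = [quantize(w) for w in weights]
    ai = [quantize(a) for a in activations]
    return sum(LUT[i][j] for i, j in zip(wi, ai))
```

Once the table is resident, the inner loop performs only index formation, lookups, and accumulation, which is the source of the hardware savings the abstract describes.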
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWith the rise of faulty chips post-deployment, efficient in-field testing becomes crucial. This paper introduces a novel method for generating software-based self-test (SBST) programs for processor cores using reinforcement learning, employing toggle coverage as a proxy metric. Our approach, which builds test programs incrementally, was tested on two types of RISC-V cores. It outperformed random generation, achieving over 80% toggle coverage for 200 instructions. When evaluated with the stuck-at-fault model, it showed a substantial improvement in fault coverage, enhancing the coverage achieved through random methods by 1.7 times in out-of-order cores, thus demonstrating its robustness in in-field testing.
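A toy, deterministic stand-in for coverage-guided incremental test construction (the paper uses reinforcement learning; this greedy sketch and its instruction-to-toggle table are invented for illustration):

```python
# Hypothetical instruction -> set of design signals it toggles.
INSTR_TOGGLES = {
    "add":  {0, 1, 2},
    "mul":  {2, 3, 4, 5},
    "load": {5, 6},
    "jmp":  {7},
}

def build_program(length, signals=8):
    """Extend the test program one instruction at a time, each step picking
    the instruction that adds the most not-yet-toggled signals (a greedy
    proxy for the RL agent's coverage reward)."""
    covered, program = set(), []
    for _ in range(length):
        best = max(INSTR_TOGGLES, key=lambda i: len(INSTR_TOGGLES[i] - covered))
        program.append(best)
        covered |= INSTR_TOGGLES[best]
    return program, len(covered) / signals
```

The incremental structure mirrors the approach in the abstract: the program grows step by step, with each addition judged by how much new toggle coverage it contributes.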
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAccurate coupling capacitances are a key part of the design of modern image sensor cells due to their high speed requirements, large number of active devices and interconnects, and complex inter-layer dielectric structure. Automation of 3D structure creation integrated with the design flow, as well as speed and robustness of capacitance calculation, are crucial for a seamless design and optimization flow. Periodicity of image sensor arrays necessitates availability of periodic boundary conditions. High structural complexity (many layout elements, many metal interconnect levels, and many dielectric layers) demands efficient numerics for reasonable runtimes. We demonstrate the application of our capacitance simulation package CellCap3D to a PLHB (pixel-level hybrid bond) image sensor cell and discuss specifics.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn the rapidly advancing landscape of computing, hardware accelerator designs are pivotal for satisfying high-performance and low-power demands. Systolic array (SA) architectures, tailored for general matrix multiplication (GEMM) operations, are ideal for image processing workloads. In this work, an integrated MAC (IMAC) factored SA is proposed. Unlike prior work focused on standalone multipliers and adders, IMAC optimizes multiplier-and-accumulator (MAC) units. The new IMAC approach was applied to three categories of Processing Elements (PEs) that define SAs and was further evaluated against four state-of-the-art (SOTA) SA designs. IMAC-SA reported noteworthy advantages: a design footprint reduction of 17.30% to 26.40%, power savings from 5.46% to 15.85%, and a maximum critical path delay improvement of 9.47% over other SOTA designs.
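As background, the GEMM dataflow such arrays implement can be modeled behaviorally in a few lines, with each PE performing one MAC per streamed operand pair (a software sketch of an output-stationary array, not the IMAC circuit itself):

```python
def systolic_gemm(A, B):
    """Behavioral model of an output-stationary systolic array: each (i, j)
    PE holds one element of C and performs one MAC per 'cycle' t as the
    operands A[i][t] and B[t][j] stream past it."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for t in range(k):                         # one cycle per operand pair
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j]   # the PE's MAC operation
    return C
```

The MAC in the inner line is exactly the unit the abstract's IMAC factoring targets: fusing the multiply and accumulate inside each PE rather than optimizing multiplier and adder separately.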
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionTo this day, the design of analog integrated circuits is a predominantly manual task, heavily reliant upon the knowledge and intuition of human experts. Many current automation approaches aim to be holistic solutions, attempting to take the human out of the loop. This work, in turn, does not intend to replace human designers with algorithms, but to support their strengths within the established flow. Here, the performance space of analog ICs is modeled by PVT-aware neural networks and visualized with parallel coordinate plots. This responsive visualization gives insights into the relations of parameters through interactive exploration, where any parameter can be the cause while all others show the immediate effect. Thus, complex decision-making problems based on the experience of seasoned designers, such as topology selection or circuit sizing, are transformed into intuitive perceptual problems. Through the responsiveness and immediacy of the implementation, designers are encouraged to explore the entire performance space, instead of basing all decisions on previous designs, never leaving the beaten path. A data generation and training procedure for surrogate models is outlined. Models for three operational amplifiers in three different technologies illustrate the applicability and feasibility of the presented approach. Additionally, a web-based demo, including the source code, is available for review.
Research Manuscript
Design
AI/ML System and Platform Design
DescriptionIn the realm of video-oriented tasks, Video Transformer models (VidT), an evolution from vision Transformers (ViT), have demonstrated considerable success. However, their widespread application is constrained by substantial computational demands and high energy consumption. Addressing these limitations and thus improving VidT efficiency has become a hot topic. Current methodologies address this challenge by dividing a video into several features and applying intra-feature sparsity. However, they neglect the crucial point of inter-feature redundancy and often entail prolonged latency in fine-tuning phases. In response, this paper introduces InterArch, a tailored framework designed to significantly enhance VidT efficiency. We first design a novel inter-feature sparsity algorithm consisting of hierarchical deduplication and recovery. The deduplication phase capitalizes on temporal similarities at both block and element levels, enabling the elimination of redundant computations across features in both coarse-grained and fine-grained manners. To prevent long-latency fine-tuning, we employ a lightweight recovery mechanism that constructs approximate features for the sparsified data. Furthermore, InterArch incorporates a regular dataflow strategy, which consolidates sparse features and effectively translates sparse computations into dense ones. Complementing this, we develop a spatial array architecture equipped with augmented processing elements (PEs), specifically optimized for our proposed dataflow. Extensive experimental results demonstrate that InterArch can achieve satisfactory performance speedups and energy saving.
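The block-level temporal-deduplication step can be sketched as follows; the relative-distance test and tolerance are illustrative assumptions, not InterArch's actual similarity metric.

```python
import math

def dedup_blocks(prev, curr, tol=0.1):
    """For each feature block, decide whether the current frame's block is
    close enough to the previous frame's to reuse its result (True) or must
    be recomputed (False). Blocks are flat lists of floats here."""
    flags = []
    for p, c in zip(prev, curr):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
        norm = math.sqrt(sum(a * a for a in p)) or 1.0
        flags.append(dist / norm <= tol)   # small relative change -> reuse
    return flags
```

Blocks flagged for reuse skip the expensive attention/MLP computation; a lightweight recovery step (approximating the skipped output from the previous one) would then stand in for fine-tuning, as the abstract describes.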
IP
Engineering Tracks
IP
DescriptionSensors and signal converters exist at the boundaries of digital and analog components, usually with voltage barriers at those boundaries. However, sensor foundry technology hasn't scaled as fast as the logic process in its advancement toward lower voltage and power. Many MEMS and RF applications still need voltage levels >10V for normal operation, requiring additional discrete components on the system board to down-shift the higher-voltage (HV) RF and analog domains to low-voltage CMOS products.
Certus Semiconductor's HV solutions were designed in standard CMOS process nodes, without requiring extra masks, technologies, or layers. This capability has enabled PMIC, MEMS, and RFIC developers to bring directly to the die any HV signal that would otherwise have required off-chip components to downgrade the voltage levels to <5V, giving customers a direct competitive edge in the HV RF, analog, and MEMS marketplace. These IO solutions have been proven in silicon, with embedded ESD protection.
During the design process, Certus utilizes Siemens' Calibre DRC for design guidance, ensuring the monitoring of specialized layouts. Additionally, the Analog FastSPICE Platform, a foundry-certified tool across various process technologies from major foundries, is used for designing and verifying parasitic models during chip finalization.
Special Session (Research)
Design
DescriptionWith the rise of DL, our world braces for AI in every edge device, creating an urgent need for edge-AI SoCs. These SoCs need to support high-throughput and energy-efficient processing with a short time to market, being at least 100 times more energy-efficient while offering sufficient flexibility and scalability. Since the design space is huge, advanced tooling and a holistic approach with innovations at all levels of the design hierarchy are needed. Starting with an overview of SoA DL processing support and our methodology, this presentation covers several design choices impacting the energy efficiency and flexibility of DL hardware.
Special Session (Research)
Autonomous Systems
DescriptionSimultaneous Localization and Mapping (SLAM) is an important but costly computing workload in practical mobile robot applications. Recently, neural network-based SLAM has attracted a lot of attention because of its strong task performance enabled by the powerful learning capability of neural networks. On the other hand, it further increases the complexity of the SLAM module. In this work, we propose to perform algorithm and hardware co-design towards accelerating neural SLAM. By jointly optimizing the SLAM model and the micro-architecture dataflow, our approach significantly improves the computation and energy efficiency while preserving high task performance. Experiments across different scenarios demonstrate the efficacy of this co-optimization approach.
Special Session (Research)
Autonomous Systems
DescriptionAutonomous Unmanned Aerial Vehicles (UAVs) are on the rise in the industrial and academic communities. Since most UAVs are severely size, weight, and power constrained, building computing for UAVs is challenging. Current domain-specific hardware/software design for UAVs mainly focuses on a single component in the whole system, such as visual-inertial odometry (VIO), depth estimation, and control, which lacks optimization of systematic flight performance. Thus, we conduct a complete computational analysis of each component during flight, and propose a systematic flight model considering the accuracy and latency from all key components. Guided by this model, an automatic exploration method in the algorithm space is implemented to find the optimal depth estimation and VIO algorithms for high-speed flight. We conduct experiments on an embedded GPU and an ASIC chip respectively, and hardware-in-the-loop flight simulations demonstrate that this model can help us find better design than baselines.
Special Session (Research)
AI
DescriptionFinance has been identified as the first industry sector to benefit from quantum computing, due to the abundance of use cases with high complexity and the fact that, in finance, time is of the essence, which makes the case for solutions to be computed with high accuracy in real time. Typical use cases in finance that lend themselves to quantum computing are portfolio optimization, derivative pricing, risk analysis, and several problems in the realm of machine learning. Specifically, we will focus on the recent progress of quantum optimization algorithms in both the near-term hardware and the fault-tolerant realm.
Special Session (Research)
AI
Design
DescriptionAI-based sensing methods for emerging sensing modalities such as multi-spectral sensing and integrated communications and sensing (CIS) require a large amount of dedicated resources, such as data storage, frequency spectrum, hardware, processing, and energy consumption. In this industry-led talk, we demonstrate key enabling techniques to perform AI-based sensing at the edge. To do so, we first explore the capabilities/opportunities for AI-based sensing methods that use communication features extracted from a mmWave 5G standard-compliant system. As an example, we experimentally demonstrate the use of directional communication features extracted from ambient mm-Wave 5G signals to perform object classification over 230 scenes with 98% accuracy in an indoor environment with an inference time <6ms per scene. Then, we describe the key system components to enable such opportunities, including (i) intelligent data ingestion, (ii) limited spectrum usage, (iii) software-based synchronization for data storage, and (iv) the use of communication features to perform sensing tasks. Finally, we provide an outlook on the next steps for the deployment of these emerging capabilities, including the utilization of AI accelerators.
Special Session (Research)
Design
DescriptionThis talk will discuss the neuromorphic approach in NimbleAI project combining a vertebrate eye-inspired foveated DVS chip coupled with an insect eye-inspired microlens array to sense and process light fields at different resolution levels with minimal latency and energy consumption. This sensor encodes 3D visual surroundings as sparse spikes in a 4D spatiotemporal domain adding depth to the current neuromorphic representation of visual information. The proposed sensor harnesses emerging 3D silicon integration technology to squeeze foveated DVS circuitry, memory, and SNN engines into a miniature silicon volume resembling the 3D structure of eyes and brains packed with photoreceptors and neurons.
Special Session (Research)
AI
DescriptionIn recent years, the importance of quantum computing has been increasingly recognized in the field of combinatorial optimization problems. Despite this importance, applying the Quantum Approximate Optimization Algorithm (QAOA) to solve the Max-Cut problem efficiently remains challenging, primarily hindered by the limitations of available quantum computing resources. To address this challenge, we focus on optimizing initialization methods and extending techniques from basic unweighted Max-Cut problems to more intricate weighted problems. We incorporate a Graph Neural Network (GNN) as a warm-start technique for initializing QAOA parameters. This approach significantly reduces the quantum computing resource overhead, thereby enhancing QAOA's capability to tackle sophisticated weighted optimization problems.
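The warm-start idea can be illustrated classically (a sketch: a greedy heuristic stands in for the paper's GNN, and bits map to initial mixer angles as in common warm-start QAOA formulations):

```python
import math

def cut_value(side, edges):
    """Number of edges crossing the partition."""
    return sum(1 for u, v in edges if side[u] != side[v])

def greedy_maxcut(n, edges):
    """Cheap classical heuristic (stand-in for a GNN): flip any node whose
    flip improves the cut, until no single flip helps."""
    side = [0] * n
    best = cut_value(side, edges)
    improved = True
    while improved:
        improved = False
        for v in range(n):
            side[v] ^= 1                      # tentatively flip node v
            c = cut_value(side, edges)
            if c > best:
                best, improved = c, True      # keep the improving flip
            else:
                side[v] ^= 1                  # revert
    return side, best

def warmstart_angles(side):
    """Map each bit to an initial single-qubit rotation angle (0 or pi),
    so the QAOA starts near the classical solution rather than uniform."""
    return [math.pi * b for b in side]
```

Starting the variational optimization from such a classically informed point, rather than from random parameters, is what reduces the quantum resource overhead in the warm-start setting.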
Special Session (Research)
AI
DescriptionThis talk presents Hardware Description Language Generative Pre-Trained Transformers (HDL-GPT), a novel approach that leverages the vast repository of open-source Hardware Description Language (HDL) code to train superior-quality large code models. The core premise of this research is the hypothesis that high-quality HDL is all you need to create models with exceptional performance and broad zero-shot generalization abilities. The talk elucidates the methods employed for the curation and augmentation of large corpora from open-source HDL code, transforming highly variable-quality data into high-quality data through careful prompting and context maintenance. We observe that the careful selection, filtering, and augmentation of data across HDLs can yield powerful models that surpass current state-of-the-art models. We also explore the impact of different fine-tuning methods on the quality of results. We analyzed and performed experiments across a range of fine-tuned state-of-the-art LLMs. We demonstrate improvements of 50% to 200% over state-of-the-art HDL models on current benchmarks in tasks ranging from HDL circuit explanations, code generation, formal and simulation testbench creation, and bug finding and fixing, to tasks in high-speed circuit design. HDL-GPT opens new avenues for the development of advanced model training techniques for circuit design tasks.
Special Session (Research)
AI
Design
DescriptionIn the era of human-machine augmentation and coexistence, exemplified by smartphones, AI assistants, and smart devices, we're progressing towards seamless human-machine cooperation (HMC) or even symbiosis. Technologies like ChatGPT are pivotal, with new voice and image capabilities emerging. Imagine AI attuned to your senses, aiding in real-time. This AI power could reside in lightweight wearables if compute distribution hurdles are overcome, though communication energy is a challenge. Electro-Quasistatic Human Body Communication (EQS-HBC) offers a solution, enhancing security and efficiency. This talk explores Body-as-a-Wire technology's potential in IoB, outlining its role in future networks and the synergy of low-power communication with in-sensor intelligence for secure and efficient Human-Machine Cooperation.
Special Session (Research)
AI
DescriptionIn this talk, we address the longstanding challenge of automating the optimization and verification of High-level Synthesis (HLS)-based hardware accelerators. Traditional methods, including machine learning and compilation-based approaches, have been hindered by limitations in either the quality of results or their generalizability. To overcome these limitations, we introduce a novel framework utilizing Large Language Model (LLM) techniques. We begin by constructing an extensive HLS design and bug dataset, comprising 1113 real-world HLS designs sourced from 12 diverse HLS libraries and benchmark suites. This dataset is enhanced using LLM to inject complex logical HLS bugs, which cannot be captured by traditional HLS tools. Leveraging this enriched dataset, we develop and train a custom LLM specialized to not only generate optimized HLS designs (e.g., inserting optimization directives), but also to accurately identify HLS bugs in given HLS designs. Our experiments demonstrate that this model surpasses ChatGPT-4 Turbo in delivering higher quality optimizations and more accurate bug detection in HLS designs, while maintaining a smaller model size and inference latency.
Special Session (Research)
AI
Design
DescriptionDeep neural networks (DNNs), crucial in fields like autonomous vehicles, often lack the necessary transparency and reliability. This talk addresses these limitations by introducing conformal prediction as an efficient alternative to the computationally intensive Bayesian inference, particularly suited for edge computing. We focus on extracting predictive uncertainties in applications such as visual odometry (VO) and 3D object detection. Our approach involves novel loss functions and training methods that utilize mutual information from various sensor streams and are capable of predicting disjoint uncertainty bounds. Additionally, we explore the interaction of sensing noise with predictive uncertainties, offering a dynamic, information-theoretic approach to regulate sensor power in real-time applications.
Special Session (Research)
AI
DescriptionIn the rapidly evolving field of Artificial Intelligence (AI), the demand for efficient AI hardware accelerators is increasingly paramount. However, the complex and labor-intensive process of designing these accelerators presents significant challenges, hindering the pace of development in line with the evolving AI landscape. To address this, we propose the LLM4AIGChip initiative, which aims to leverage the extraordinary capabilities of Large Language Models (LLMs) to revolutionize AI accelerator design and enhance its accessibility. LLM4AIGChip consists of two key components: Data4AIGChip and GPT4AIGChip, each targeting fundamental bottlenecks in LLM-assisted AI accelerator design. Specifically, Data4AIGChip tackles the issues of dataset scarcity and quality in LLM-assisted hardware design by creating high-quality, specialized datasets, thereby augmenting the effectiveness of LLMs in AI accelerator design. In contrast, GPT4AIGChip focuses on employing LLMs to automate the design and verification processes of AI accelerators, leveraging the advanced capabilities of LLMs to streamline and simplify these tasks.
Collectively, these integrated frameworks mark a substantial advancement in AI accelerator design. They not only enhance the efficiency and accessibility of AI accelerator development but also serve as a bridge between AI algorithmic advancements and hardware innovation.
Special Session (Research)
AI
DescriptionNoisy Intermediate-Scale Quantum (NISQ) devices face limitations in qubit quantity, operational accuracy, coherence duration, and qubit connectivity within quantum processing units (QPUs). Dynamically remapping logical qubits to physical qubits in the compiler is essential for enabling two-qubit gates in algorithms. This process adds extra operations, reducing the algorithm's fidelity. Therefore, minimizing these additional gates is critical. In this work, we propose an approach to perform feature engineering on quantum circuit representations, creating detailed embeddings. This method facilitates the intricate integration of machine learning techniques. Compared with heuristic search-based algorithms, our approach lowers the overhead of quantum resources needed for adapting a logical circuit to an executable physical circuit.
Special Session (Research)
Design
DescriptionIn this talk we will discuss our vision for the development of neuromorphic accelerators based on integrated photonic technologies within the framework of the Horizon Europe NEUROPULS project. Photonic architectures that leverage phase-change and III-V materials for neuromorphic computing will be presented. A CMOS-compatible platform integrating such materials to fabricate photonic neuromorphic architectures will be discussed alongside a GEM5-based simulation platform to model the accelerator operation once interfaced with a RISC-V processor. Such a platform will allow modeling of the accelerator's scaling and its system-level behavior in terms of key metrics such as speed, energy consumption, and footprint.
Special Session (Research)
AI
Design
DescriptionIn this talk, the center director will outline key concepts in cognitive sensing. These include Analog-to-Insight (A2I) technology, which uses innovative algorithms and mixed-signal circuits for efficient feature extraction, and a closed-loop attention mechanism for real-time control that enhances feature quality while conserving power. Additionally, the talk will cover heterogeneous integration for compactly merging radar and lidar technologies and system software advancements for improved sensor-host processor collaboration. These developments aim to transform sensing-to-action in autonomous mobile robotics, such as self-driving cars and drones, by enabling more efficient navigation and mapping. Overall, CogniSense represents a significant leap in autonomous sensing, optimizing performance while minimizing resource use.
Special Session (Research)
EDA
DescriptionTo achieve the power, performance, and area (PPA) target in modern semiconductor design, the trend to go for More-than-Moore heterogeneous integration by packing various components/dies into a package becomes more obvious as the economic advantages of More-Moore scaling for on-chip integration are getting smaller and smaller. In particular, we have already encountered the high cost of moving to more advanced technology and the high fabrication cost associated with extreme ultraviolet (EUV) lithography, mask, process, design, electronic design automation (EDA), etc. Heterogeneous integration refers to integrating separately manufactured components into a higher-level assembly (in a package or even multiple packages in a PCB) that provides enhanced functionality and improved operating characteristics. Unlike the on-chip designs with relatively regular components and wirings, the physical design problem for heterogeneous integration often needs to handle arbitrary component shapes, diverse metal wire widths, and different spacing requirements between components, wire metals, and pads, with multiple cross-physics domain considerations such as system-level, physical, electrical, mechanical, thermal, and optical effects, which are not well addressed in the traditional chip design flow. In this talk, I will first introduce popular heterogeneous integration technologies and options, their layout modeling and physical design challenges, survey key published techniques, and provide future research directions for modern physical design for heterogeneous integration.
Special Session (Research)
Autonomous Systems
DescriptionThe pursuit of fully autonomous robotics is hindered significantly by computational limitations. Achieving true autonomy in robotics requires a complex interplay of fast, accurate processing, sophisticated algorithms, system-wide energy efficiency, and robust performance under varying conditions. Traditional approaches to enhancing computational efficiency in robotics have often been fragmented and ad hoc, tackling each challenge in isolation without a unified strategy. This talk introduces the concept of robomorphic systems as a novel solution to these obstacles, whereby we enhance a robot's operation by aligning robots' physical forms with their computational architectures. This innovative approach involves designing robotic systems whose physical structures align with their computational needs, leading to optimized processing speeds and reduced energy consumption. We delve into various case studies and examples to demonstrate the groundbreaking impact of robomorphic system design in forging a new era of specialized and efficient computing systems in robotics. By seamlessly merging robot morphology with computational design, robomorphic systems promise to unlock unprecedented capabilities and efficiencies, setting the stage for the future of advanced robotics.
Special Session (Research)
EDA
DescriptionAs the electronics industry pivots from Moore's Law to More-than-Moore, we witness a convergence of technologies across IC and systems design. This fundamental shift in how we design today's products requires new advanced design flows that combine tools across the spectrum of EDA tools. One of the most critical capabilities of these system-level design flows is to enable seamless cross-domain co-design and analysis, allowing designers to achieve the highest performance and lowest cost products. The days of IC and package designers ‘throwing data over the wall' are over. Heterogeneous Integration (HI) is ushering in a new era of electronic product design with collaboration at its core – one that lives or dies on the seamless interaction between analog/digital IC teams and package design teams.
Chiplet-based architectures present a solution for most semiconductor and systems companies to enable More-than-Moore. System in a package (SiP) offers a flexible form factor while coming close to matching SoC performance, but with far lower overall cost, greater yield, and perhaps most importantly, faster time to market. While SiP as a packaging concept has been around for decades, the recent tsunami of designs adopting multi-chiplet architectures is truly a disruptive change.
The use of advanced packaging technologies to combine smaller, discrete chiplets into one system-in-package (SiP) not only pushes the need for more advanced multi-die packaging but it also makes packaging part and parcel of the process. Doing so significantly reduces dependence on Moore's Law at a time when building advanced monolithic system-on-chip (SoC) is no longer the best option from a cost and technology perspective.
This presentation will describe the challenges design teams face when pivoting from monolithic IC design to 3D heterogeneous package design and how EDA addresses these challenges.
Special Session (Research)
EDA
DescriptionThermal-power challenges and increasingly expensive energy demands pose threats to the historical rate of increase in processor performance. Emerging energy-efficient computing schemes and heterogeneous 3D integration systems promise a substantial reduction in energy demand for emerging and growing computing needs. However, these conflicting trends have resulted in a substantial increase in both heat flux and power density (W/cm3), which makes it even more challenging to use conventional cooling technology solutions. High-performance, energy-efficient thermal management solutions from the device level to the chip/package level are needed to tackle this thermal challenge. During this talk, I will first present device-level thermal conduction enhancement solutions using thermal metal vias or thermal bumps for microprocessor and LED packaging applications. The second part of this talk will focus on computational modeling, numerical optimization, microfluidic heatsink fabrication, and experimental investigations of advanced chip/package-level thermal cooling solutions, including impingement jet cooling, manifold-based embedded microchannel cooling, and cryogenic cooling.
Special Session (Research)
Autonomous Systems
DescriptionCausal reasoning often guides decision-making in humans, especially in the face of imperfect knowledge about the task or the environment. If robots have access to human causal knowledge, we hypothesize that it will enable the robots to make smarter decisions more efficiently under uncertainty. In this talk, we present a process to integrate human causal reasoning into a robot's decision-making process, specifically in the context of object assembly. We will also discuss broader implications of using generalized causal knowledge from a human to aid robots in completing different tasks, including ones they have not previously encountered.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionAutomatic transistor sizing in circuit design continues to be a formidable challenge. Although Bayesian optimization (BO) has achieved significant success, it is circuit-specific, limiting the accumulation and transfer of design knowledge for broader applications. This paper proposes (1) efficient automatic kernel construction, (2) the first transfer learning across different circuits and technology nodes for BO, and (3) a selective transfer learning scheme to ensure only useful knowledge is utilized. These three novel components are integrated into BO with Multi-objective Acquisition Ensemble (MACE) to form Knowledge Alignment and Transfer Optimization (KATO), which delivers state-of-the-art performance: up to 2x simulation reduction and 1.2x design improvement over the baselines.
Exhibitor Forum
DescriptionA DAC panel in 1993 [1] emphasized the need for design engineers to have a component library management system/platform that provides software tools for rapid generation and management of up-to-date, EDA-neutral, complete, and reusable libraries. This would save time and costs in product design. However, until recently, some questions remained unanswered:
1) Should this library platform be provided by EDA tool providers, by component providers, by ODM/OEM companies, or by a third-party organization?
2) What factors enable a library platform to truly provide up-to-date, accurate, and reusable libraries?
In today's industry, different EDA tools are not compatible with each other, making it nearly impossible for a tool provider to develop a library platform that supports multiple EDA formats. Component providers and semiconductor companies are only responsible for giving components to customers and supplying specification datasheets in PDF format. Product design and manufacturing companies (ODM/OEM) can only create libraries according to their own needs. Therefore, a third-party organization would be best positioned to develop a library platform that supports all EDA formats and all new components of all vendors, and meets the design for manufacturing and assembly (DFMA) requirements of all ODM and OEM companies.
More importantly, only up-to-date, accurate, and reusable libraries can truly benefit design engineers working on different design lines and help companies save time and costs in the design process. The library platform that revolutionizes the industry must have the following features:
The platform is publicly accessible, allowing users to download libraries directly anywhere anytime.
Rapidly building highly accurate libraries that not only align with generic parts information from spec datasheets but also incorporate built-in DFMA rules into parts. This is crucial to customers of all tiers. Almost all tier-1 OEM/ODM companies require libraries to adhere to custom DFMA rules, mid-size companies need standardized DFMA to help them avoid manufacturing issues, and small companies seek generic parts that follow IPC standards.
Offering on-demand access to libraries of the latest components. Simply having a large database of old libraries is not enough. Engineers use the latest components in their new designs and thus need up-to-date libraries. Relying on outdated libraries means engineers must manually adjust part drawings to accommodate differences between old and new specs and varying DFMA requirements. On-demand services can solve this issue.
Essentially, leveraging AI and automation technologies to support rapid on-demand services, eliminating the traditional manual part creation process. AI digitization technologies can extract necessary information from spec datasheets. The EDA library automation engine with programmable DFMA functions can encode spec data and DFMA rules for part creations.
Establishing a standardized library format that can fully describe both component specifications and DFMA rules. The format can transfer DFMA knowledge between EDA formats and be compatible with most EDA software. Then, the mentioned programmable DFMA functions supported by this DFMA-enriched format can code DFMA rules for rapid parts creation and modification.
Reference
[1] "Panel: The Key to EDA Results: Component & Library Management", DAC 1993, Don Conrad, Tom Fribourg, Jim Gruneisen, Glynn Marlow, Romesh Wadhwani, Bob Wiederhold
Keynote
Special Event
AI
DescriptionJim Keller is CEO of Tenstorrent and a veteran hardware engineer. Prior to joining Tenstorrent, he served two years as Senior Vice President of Intel's Silicon Engineering Group. He has held roles as Tesla's Vice President of Autopilot and Low Voltage Hardware, Corporate Vice President and Chief Cores Architect at AMD, and Vice President of Engineering and Chief Architect at P.A. Semi, which was acquired by Apple Inc. Jim has led multiple successful silicon designs over the decades, from the DEC Alpha processors, to AMD K7/K8/K12, HyperTransport and the AMD Zen family, the Apple A4/A5 processors, and Tesla's self-driving car chip.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionAI chips are scaling rapidly in the era of large language models (LLMs). In contrast, existing chip design space exploration methods, aimed at discovering optimal yet often infeasible or unproducible Pareto-front designs, are hindered by their neglect of design specifications. In this paper, we propose a novel spec-driven transformed Bayesian optimization framework to find expected optimal RISC-V SoC architecture designs for LLM tasks. The highlights of our framework lie in a tailored transformed Gaussian process (GP) model prioritizing specified target metrics and a customized acquisition function (EHRM) for multi-objective optimization. Extensive experiments on large-scale RISC-V SoC architecture design exploration for LLMs, such as Transformer, BERT, and GPT-1, demonstrate that our method not only effectively finds designs matching the QoR values from the spec, but also outperforms the previous state-of-the-art approach by 34.59% in ADRS at 66.67% of its runtime.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLogic locking has emerged as a technique for safeguarding intellectual property (IP) in chip designs. The continuous cat-and-mouse game between logic locking techniques and attackers has led to the development of new protection mechanisms and subsequent attack methods. However, existing attack scenarios fail to capture the real-world threats that logic locking techniques face. This paper investigates a previously overlooked attack scenario against logic locking in which a knowledgeable attacker possesses the locked netlist and knowledge about chip designs (represented as a library of designs) but no working system (as assumed by SAT and other attacks). We also propose a Knowledge-guided Oracle-Less Attack (KOLA) which leverages an adversary's prior knowledge represented by a library of previously encountered designs and employs weak division to measure similarity between the locked design and each library design. Our approach is applicable both to directly locked versions of library designs and to modified/upgraded locked designs derived from the library. Experimental results demonstrate KOLA's ability to identify locked designs, even when significant changes have been made from the original library designs. This analysis sheds light on the vulnerabilities of logic locking techniques against knowledgeable attackers, such as fab houses, even when no working system is available, highlighting the need for enhanced logic locking techniques able to thwart such attacks.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionDespite the attractive performance and power-efficiency of application-specific hardware accelerators on FPGAs, emerging applications are often slow to benefit from them due to the high development overhead caused by the steep learning curve of FPGA tools and paradigms, especially for control-heavy applications with irregular computation patterns. We present Labidus, a conveniently programmable on-chip cluster of RISC-V soft processors, augmented with a pool of queue-semantic accelerator functions automatically configured for each user program via static analysis. We evaluate Labidus using four established scientific computing applications and demonstrate that it can compete with, and sometimes even outperform, manually optimized hardware accelerators on GB-scale workloads.
Research Manuscript
Design
Emerging Models of Computation
DescriptionTo meet increasingly complex experimental demands, the number of microvalves in flow-based microfluidic biochips has increased significantly, making it necessary to adopt multiplexers (MUXes) to actuate microvalves. However, existing MUX designs have limited coding capacities, resulting in an excessive chip-to-world interface. This paper proposes a novel gate structure for modifying the current MUX architecture, along with a mixed coding strategy achieving the maximum coding capacity within the modified architecture. Additionally, a synthesis tool for the mixed-coding-based MUXes (LaMUXes) is presented. Experimental results demonstrate that the LaMUX is exceptionally efficient, substantially reducing the usage of pneumatic controllers and microvalves in MUXes.
Research Manuscript
Security
Embedded and Cross-Layer Security
DescriptionAutonomous driving systems (ADS) are boosted with deep neural networks (DNNs) to perceive environments, while their security is undermined by DNNs' vulnerability to adversarial attacks. Among them, a diversity of laser attacks has emerged as a new threat due to their minimal requirements and high attack success rates in the physical world. Nevertheless, current defense methods exhibit either low defense success rates or high computation costs against laser attacks. To fill this gap, we propose Laser Shield, which leverages a polarizer along with a min-energy rotation mechanism to eliminate adversarial lasers from ADS scenes. We also provide a physical-world dataset, LAPA, to evaluate its performance. Through exhaustive experiments with three baselines, four metrics, and three settings, Laser Shield is shown to achieve state-of-the-art performance.
Late Breaking Results Poster
DescriptionAdiabatic Quantum-Flux-Parametron (AQFP) is a superconducting logic with extremely high energy efficiency. Recent research has made initial strides toward developing an AQFP-based crossbar accelerator. However, several critical challenges on both the hardware and software sides remain, preventing the design from being a comprehensive solution.
In this paper, we propose an AQFP-aware binary neural network architecture search framework that leverages software-hardware co-optimization to eventually search the AQFP-adapted neural network and the corresponding hardware configuration, providing a feasible AQFP-based solution for binary neural network (BNN) acceleration. Experimental results show that our framework can effectively search the AQFP-adapted neural network, consistently outperforming the representative AQFP-based framework.
Late Breaking Results Poster
DescriptionHuman Pose Estimation (HPE) is increasingly being adopted in a wide range of applications, from healthcare to Industry 5.0. To address the intrinsic inaccuracy of such CNN-based software, the current trend involves applying filtering models to refine and improve the inference results. However, state-of-the-art filtering models are computationally intensive, limiting their use in resource-constrained devices. To overcome this limitation, we propose a real-time filtering technique based on diffusion models designed specifically for edge devices. Through a micro-benchmarking phase, we analyze how the model responds to various levels of noise and select the optimal setup for specific application scenarios. Using a widely available edge device, we evaluated the model's performance on both synthetic and real noise generated by a state-of-the-art HPE system. Preliminary results demonstrate a significant improvement in real-time filtering performance with minimal computational overhead.
Late Breaking Results Poster
DescriptionThis paper presents a circuit-algorithm co-design framework for a learnable audio analog front-end (AFE) which includes an analog filterbank for feature extraction and a classifier based on a Depthwise Separable Convolutional Neural Network (DSCNN). Instead of the traditional approach of designing the analog filterbank and digital classifier separately, a learnable filterbank is proposed and its source-follower bandpass filter (SF-BPF) parameters are optimized together with the neural network classifier in a signal-to-noise ratio (SNR)-aware training process. A new system criterion function (LBPF) is proposed to include classification loss and filter performance in the training process. The optimized audio AFE achieves 10.6% and 11.7% reductions in BPF power and chip area, respectively. Meanwhile, the approach achieves 88.6%–94.5% accuracy on a 10-keyword classification task across a wide range of input signal SNRs from 5dB to 20dB, with only 16k trainable parameters.
Late Breaking Results Poster
DescriptionPlacement is a critical stage for VLSI routability optimization. A placement engine that does not consider layout congestion might produce poor solutions with routing failures. This paper introduces a Coulomb force-based global placement framework that addresses both global and local routing congestion. We first present a routing path-based cell padding strategy for local congestion mitigation. Then, we construct a routability-aware placement model that utilizes virtual Coulomb forces to eliminate crucial global congestion. Compared with a leading academic placer, RePlAce, and the advanced commercial tool, Innovus, experimental results on industrial benchmark suites show that our proposed algorithm achieves the best routability within the shortest runtime.
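The virtual-Coulomb-force idea described in this abstract can be illustrated with a tiny inverse-square sketch. This is our own toy formulation (function names, units, and the exact force law are assumptions, not the paper's actual model): each congested routing bin acts as a point charge that repels nearby cells.

```python
import numpy as np

def coulomb_forces(cells, charges, positions, eps=1e-6):
    """Toy inverse-square repulsion: each congested bin b with "charge" q_b
    pushes cell i away with force q_b * (r_i - r_b) / |r_i - r_b|^3."""
    cells = np.asarray(cells, dtype=float)          # (n, 2) cell coordinates
    positions = np.asarray(positions, dtype=float)  # (m, 2) congested bin centers
    charges = np.asarray(charges, dtype=float)      # (m,) congestion "charge"
    d = cells[:, None, :] - positions[None, :, :]   # (n, m, 2) displacement vectors
    r = np.linalg.norm(d, axis=2) + eps             # (n, m) distances (eps avoids /0)
    # sum the per-bin contributions for each cell
    return (charges[None, :, None] * d / r[:, :, None] ** 3).sum(axis=1)

# A cell to the right of a single unit charge is pushed further right:
f = coulomb_forces([[1.0, 0.0]], [1.0], [[0.0, 0.0]])
```

In a placer, such a force term would be added to the usual wirelength and density gradients, so heavily congested regions shed cells during the global placement iterations.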
Late Breaking Results Poster
DescriptionDiverse solutions to the Boolean satisfiability (SAT) problem are essential for thorough testing and verification of software and hardware designs, ensuring reliability and applicability to real-world scenarios. We introduce a novel differentiable sampling method, called DiffSampler, which employs gradient descent (GD) to learn diverse solutions to the SAT problem. By formulating SAT as a supervised multi-output regression task and minimizing its loss function using GD, our approach enables performing the learning operations in parallel, leading to GPU-accelerated sampling and comparable runtime performance w.r.t. heuristic samplers. We demonstrate that DiffSampler can generate diverse uniform-like solutions similar to conventional samplers.
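The formulation sketched in this abstract can be illustrated with a small toy (our own construction, not the authors' DiffSampler implementation): relax each Boolean variable to a probability via a sigmoid, define a differentiable loss that vanishes exactly when every clause is satisfied, and descend it over many randomly initialized chains in parallel.

```python
import numpy as np

def diff_sat_sample(clauses, n_vars, n_chains=64, steps=300, lr=0.5, seed=0):
    """Toy GD-based SAT sampler. clauses use DIMACS-style literals (+-(i+1)).
    Loss per chain = sum over clauses of P(clause unsatisfied)."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(n_chains, n_vars))       # one chain per row
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-logits))              # P(var = True)
        grad = np.zeros_like(logits)
        for clause in clauses:
            # P(clause unsatisfied) = product of per-literal "false" probabilities
            unsat = np.ones(n_chains)
            for lit in clause:
                i = abs(lit) - 1
                unsat = unsat * ((1.0 - p[:, i]) if lit > 0 else p[:, i])
            for lit in clause:
                i = abs(lit) - 1
                lit_false = (1.0 - p[:, i]) if lit > 0 else p[:, i]
                others = unsat / np.maximum(lit_false, 1e-9)
                dp = -others if lit > 0 else others     # d unsat / d p_i
                grad[:, i] += dp * p[:, i] * (1.0 - p[:, i])  # chain rule via sigmoid
        logits -= lr * grad                             # vectorized over all chains
    assign = (1.0 / (1.0 + np.exp(-logits))) > 0.5      # round to Boolean
    def sat(a):
        return all(any(a[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
    # distinct satisfying assignments found across the chains
    return {tuple(row) for row in assign if sat(row)}
```

Because every chain shares the same vectorized update, the loop maps naturally onto a GPU, which is the source of the acceleration the abstract refers to; diversity comes from the independent random initializations.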
Late Breaking Results Poster
DescriptionControl channels on microfluidic large-scale integration (mLSI) chips are prone to blockage and leakage defects. State-of-the-art test methods suffer from efficiency concerns. In this work, we propose a built-in self-test (BIST) method that drastically improves test efficiency. Given n to-be-tested control channels, we reduce the number of test patterns for blockage and leakage tests from up to n/2 to 1, and from up to log2(n+1) to up to log2(X(G)+1), respectively, where X(G) denotes the vertex chromatic number of a graph G consisting of n vertices. We fabricated our design and demonstrated the feasibility and efficiency of our method.
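The pattern-count arithmetic quoted in this abstract can be reproduced with a short sketch. The greedy coloring and function names below are ours (the paper's actual graph construction and coloring may differ); a greedy coloring only upper-bounds X(G), which suffices for the ceil(log2(...)) comparison.

```python
import math

def greedy_coloring(n, edges):
    """Greedy vertex coloring of a graph on vertices 0..n-1; the number of
    colors used is an upper bound on the chromatic number X(G)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in range(n):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def leakage_pattern_counts(n, edges):
    """Illustrative before/after counts: ceil(log2(n+1)) patterns without the
    BIST scheme vs. ceil(log2(k+1)) with it, where k >= X(G)."""
    k = max(greedy_coloring(n, edges).values()) + 1
    return math.ceil(math.log2(n + 1)), math.ceil(math.log2(k + 1))

# e.g. 8 channels whose conflict graph is a simple path (2-colorable):
# leakage_pattern_counts(8, [(0,1),(1,2),(2,3),(3,4),(4,5),(5,6),(6,7)])
```

For that path example the leakage test drops from 4 patterns to 2, which mirrors the kind of reduction the abstract claims when X(G) is much smaller than n.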
Late Breaking Results Poster
DescriptionRecent evolutions of recurrent neural networks (RNNs), such as S4, S4D, and LRU, have shown remarkable potential for very long-range sequence modeling tasks in vision, language, and audio, with a capacity to capture dependencies over tens of thousands of steps. Unlike transformers, which face significant memory consumption challenges with large context sizes, they are a promising alternative for operating effectively on embedded systems. While they have been evaluated on classification and segmentation tasks, no work in the literature has applied them to human pose estimation. In this work we propose an architecture that combines such state space models (SSMs) with graph attention networks (GATs) to enable their application to human action evaluation tasks on embedded systems.
Late Breaking Results Poster
DescriptionThe recent success of Quantum Neural Networks (QNNs) prompts model extraction attacks on cloud platforms, even under black-box constraints. These attacks repeatedly query the victim QNN with malicious inputs. However, existing extraction attacks tailored for classical models yield local substitute QNNs with limited performance due to NISQ computer noise. Drawing from bagging-based ensemble learning, which uses independent weak learners to learn from noisy data, we introduce a novel QNN extraction approach. Our experimental results show this quantum ensemble learning approach improves local QNN accuracy by up to 15.09% compared to previous techniques.
Late Breaking Results Poster
DescriptionThis paper proposes a fast system technology co-optimization (STCO) framework that optimizes power, performance, and area (PPA) for next-generation IC design, addressing the challenges and opportunities presented by novel materials and device architectures. We focus on accelerating the technology level of STCO using AI techniques, employing graph neural network (GNN)-based approaches for both TCAD simulation and cell library characterization, which are interconnected through a unified compact model, collectively achieving over a 100X speedup over traditional methods. These advancements enable comprehensive STCO iterations with runtime speedups ranging from 1.9X to 14.1X and support both emerging and traditional technologies.
Late Breaking Results Poster
DescriptionThis paper proposes a language-level modeling approach for HLS based on the state-of-the-art Transformer architecture. Our approach estimates the performance and resource requirements of HLS applications directly from the source code when different HLS synthesis directives are applied. Results show that the proposed architecture achieves 96.02% accuracy for predicting the feasibility class of applications and averages of 0.95 and 0.91 R^2 scores for predicting the actual performance and required resources, respectively.
Late Breaking Results Poster
Late Breaking Results: LLM-assisted Automated Incremental Proof Generation for Hardware Verification
DescriptionIn this paper, we propose a methodology for hardware verification assisted by Large Language Models (LLMs) in the incremental proof generation process. First, an LLM identifies the basic module of the Design Under Verification (DUV), followed by expanding the proof scope as more modules are added. LLMs assist in defining and verifying invariants for each module using the Z3 solver, and in formulating integration properties at module interfaces. Our case studies on a Ripple Carry Adder and a Dadda Tree multiplier demonstrate that LLMs enhance the efficiency and accuracy of hardware verification.
Late Breaking Results Poster
DescriptionThis work presents a machine learning (ML) technique to suppress reference ripple errors in a successive approximation register (SAR) analog-to-digital converter (ADC). Reference voltage ripple due to switching in the SAR ADC introduces dynamic errors that manifest as spurs in the output spectrum and limit ADC resolution. Conventional techniques to suppress reference ripple require a large decoupling capacitor and a high-speed reference voltage buffer, which consume large area and power. The proposed ML approach uses a supervised technique in which a low-speed 10MHz SAR ADC is used for learning and correcting the reference ripple error in a 200MHz SAR ADC. Simulated in 28nm CMOS technology, the proposed ML approach reduces overall ADC power consumption by 4.9x without degrading performance.
Late Breaking Results Poster
DescriptionThe majority-inverter graph (MIG) is a homogeneous logic network widely used in logic synthesis for majority-based emerging technologies. Many logic optimization algorithms have been proposed for MIGs, including rewriting, resubstitution, and graph mapping. However, unlike AIGs, research on optimization flows for MIGs is limited. In this paper, we explore combinations of well-developed MIG optimization algorithms using an on-the-fly design space exploration strategy and present the latest best results on MIG size minimization of EPFL benchmarks. Significant reductions (of 88% and 79%) are observed for two specific benchmarks and an average of 14% improvement is achieved compared to the state-of-the-art flow.
Late Breaking Results Poster
DescriptionExisting printed circuit board (PCB) placement often fails to address complex constraints (e.g., diverse wire widths and intricate spacing rules) arising from heterogeneous components in modern designs. Manual placement requires expertise and is time-consuming. Thus, automated PCB placement is desired for large-scale, complex designs considering irregular component shapes, clearance conditions, wire areas, power circuit flow, and routability. This paper proposes an efficient force-directed global placement followed by legalization to handle these constraints. We derive power circuit anchor points to guide global placement to consider the power circuit flow and present a simulated annealing-based pad alignment method to handle complex spacing constraints. After global placement, we perform window-based legalization to remove component overlaps considering various constraints. Experimental results show our placer's superior routability and efficiency.
Late Breaking Results Poster
DescriptionThe evaluation of logic locking methods has long been predicated on an implicit assumption that only the correct key can unveil the true functionality of a protected circuit. Consequently, a locking technique is deemed secure if it resists a good array of attacks aimed at finding this correct key. In this paper, we challenge this one-key premise by introducing a more efficient attack methodology, focused not on identifying that one correct key, but on finding multiple, potentially incorrect keys that can collectively produce correct functionality from the protected circuit. The tasks of finding these keys can be parallelized, which is well suited for multi-core computing environments. Empirical results show our attack achieves a runtime reduction of up to 99.6% compared to the conventional attack that tries to find a single correct key.
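The multi-key premise can be illustrated on a toy XOR-locked circuit. The circuit, key width, and helper names below are entirely our own construction (not from the paper): two XOR key gates on the same wire cancel whenever their key bits agree, so more than one key restores correct functionality, and checking candidate keys parallelizes trivially across workers.

```python
from itertools import product
from concurrent.futures import ThreadPoolExecutor

def locked(x, k):
    """Hypothetical locked circuit; intended function is AND(x0, x1).
    Key bits k0, k1 feed two XOR gates on the same wire (they cancel when
    k0 == k1); k2 selects between an AND and an OR at the output."""
    x0, x1 = x
    k0, k1, k2 = k
    a = x0 ^ k0 ^ k1
    return (a & x1) if k2 == 0 else (a | x1)

def correct_keys(n_key_bits=3, workers=4):
    """Enumerate the key space in parallel and keep every key whose locked
    circuit matches the target function on all inputs (exhaustive for a toy)."""
    target = lambda x: x[0] & x[1]
    inputs = list(product([0, 1], repeat=2))
    keys = list(product([0, 1], repeat=n_key_bits))
    def check(k):
        return k if all(locked(x, k) == target(x) for x in inputs) else None
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return [k for k in ex.map(check, keys) if k is not None]
```

Here both (0,0,0) and (1,1,0) unlock the circuit; an attacker who accepts any functionally correct key, rather than hunting for the single "intended" one, can split the search across cores, which is the parallelism argument the abstract makes.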
Late Breaking Results Poster
DescriptionThis paper proposes a power rail routing flow for advanced multi-layered printed circuit boards (PCBs) to optimize segment area and via usage while satisfying IR drop requirements. With increasing current/voltage demands in modern PCBs, ultra-wide power rails may consume most routing space and cause significant routing problems. We present an effective overlap-aware rail sizing technique to distribute routing spaces appropriately according to current/voltage demands and a resistance-aware A*-search algorithm to resolve overlapping regions by rail detouring. Experimental results show that our work significantly outperforms the state-of-the-art rail router in the metal area and runtime, achieving respective reductions of 49% and 28%, without any current/voltage violations.
Late Breaking Results Poster
DescriptionField-programmable gate array (FPGA) macro placement holds a crucial role within the FPGA physical design flow since it substantially influences the subsequent stages of cell placement and routing. In this paper, we propose an effective and efficient routability-driven macro placement algorithm for modern FPGAs with cascade shape and region constraints. To reserve adequate space for cell placement and guarantee routability, we first develop a routability-driven mixed-size analytical global placement (GP) that evenly distributes both macros and cells while considering cascade shape and region constraints. Then, we propose an integer linear programming (ILP)-based cascade shape legalization (LG) followed by matching-based macro legalization to remove macro overlaps while satisfying the region constraints. Finally, a routability-driven detailed macro placement is proposed to refine the solution. Compared with the top contestants of the MLCAD 2023 contest, experimental results show that our algorithm achieves the best overall score and routability.
Late Breaking Results Poster
DescriptionLow-cost and hardware-efficient design of trigonometric functions is challenging. Stochastic computing (SC), an emerging computing model processing random bit-streams, offers promising solutions for this challenge. The existing implementations, however, often overlook the importance of the data converters necessary to generate the needed bit-streams. While recent advancements in SC bit-stream generators focus on basic arithmetic operations such as multiplication and addition, energy-efficient SC design of non-linear functions demands attention to both the computation circuit and the bit-stream generation. This work introduces TriSC, a novel approach for SC-based design of trigonometric functions that leverages state-of-the-art (SOTA) quasi-random bit-streams. Unlike SOTA SC designs of trigonometric functions that heavily rely on delay elements in mid-stages to decorrelate bit-streams, our approach avoids delay elements while improving the accuracy of the results. TriSC yields significant energy savings of up to 92% compared to SOTA designs. As novel use cases studied for the first time in the SC literature, we employ the proposed design for 2D image transformation and forward kinematics of a robotic arm, both computation-intensive applications demanding low-cost trigonometric designs.
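For readers unfamiliar with SC, here is a minimal sketch of unipolar SC multiplication driven by quasi-random (low-discrepancy) bit-stream generation. The sequence choice (van der Corput in two different bases) and stream length are illustrative assumptions, not TriSC's actual design:

```python
def van_der_corput(n, base=2):
    # Low-discrepancy (quasi-random) sequence in [0, 1)
    q, denom = 0.0, 1
    while n:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q

def to_bitstream(value, length, base):
    # Bit i is 1 iff the i-th quasi-random sample falls below the value,
    # so the fraction of 1s encodes the value itself
    return [1 if van_der_corput(i, base) < value else 0 for i in range(length)]

def sc_multiply(a, b, length=256):
    # Different bases decorrelate the two streams; a single AND gate
    # per bit then implements multiplication of the encoded values
    sa = to_bitstream(a, length, base=2)
    sb = to_bitstream(b, length, base=3)
    return sum(x & y for x, y in zip(sa, sb)) / length
```

With quasi-random streams the estimate converges much faster than with pseudo-random ones, which is why recent SC generators favor them.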
Late Breaking Results Poster
DescriptionThe emergence of Field-coupled Nanocomputing (FCN) as a green and atomically-sized post-CMOS technology introduces a unique challenge for the development of physical design methods: unlike conventional computing, wire segments in FCN entail the same area and delay costs as standard gates. Hence, it is imperative to reconsider physical design strategies tailored for FCN to effectively address this distinctive characteristic. This paper unveils a recent breakthrough in minimizing the number of wire segments by an average of 20.13%, which, due to the high wire cost, also leads to an average decrease of 34.10% in area and 19.84% in critical path length.
Research Manuscript
AI
AI/ML Algorithms
DescriptionIn real-world neural network deployments, incoming data often contains noise and imperfections. Retraining on resource-constrained edge devices becomes essential to maintain performance. To tackle this challenge, we introduce LEAF, a hardware-efficient framework designed for adapting to degraded images. By analyzing neural network behavior on degraded images, we propose two techniques: 1) Selective Experience Replay for skipping unimportant images, reducing computation, and 2) Pseudo Noise Dithering for extremely low precision (3 or 4-bit) gradient quantization, enabling nearly full-integer training. Extensive experiments on CIFAR10 and Tiny ImageNet datasets, with various image degradations, demonstrate LEAF's ultra-low cost with minimal accuracy loss.
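The low-precision gradient idea can be sketched as follows; the scaling scheme and uniform noise distribution are illustrative assumptions rather than LEAF's exact Pseudo Noise Dithering design:

```python
import numpy as np

def dithered_quantize(grad, bits=4, rng=None):
    # Illustrative pseudo-noise dithering (not the paper's exact scheme):
    # uniform noise added before rounding makes the low-precision value
    # an unbiased estimate of the full-precision gradient
    rng = rng or np.random.default_rng(0)
    levels = 2 ** (bits - 1) - 1                            # e.g. 7 for 4-bit signed
    scale = (float(np.max(np.abs(grad))) / levels) or 1.0   # avoid /0 for all-zero grads
    noise = rng.uniform(-0.5, 0.5, size=np.shape(grad))
    q = np.clip(np.round(np.asarray(grad) / scale + noise), -levels - 1, levels)
    return q.astype(np.int8), scale
```

Because the noise averages out across training steps, the quantization error does not bias the descent direction, which is what makes nearly full-integer training feasible.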
Research Manuscript
AI
AI/ML Algorithms
DescriptionApproximate Nearest Neighbor Search (ANNS) is a classical problem in data science. ANNS is both computationally and memory intensive. As a typical implementation of ANNS, Inverted File with Product Quantization (IVFPQ) offers high precision and rapid processing. However, the traversal of non-nearest-neighbor vectors in IVFPQ leads to redundant memory accesses, which significantly impacts retrieval efficiency. A promising approach involves learned indexes, leveraging insights from data distribution to optimize search efficiency. Existing learned indexes, however, are primarily customized for low-dimensional data, and tackling ANNS over high-dimensional vectors remains a challenging issue.
This paper introduces Leanor, a learned index-based accelerator for the filtering of non-nearest neighbor vectors within the IVFPQ framework. Leanor minimizes redundant memory accesses, thereby enhancing retrieval efficiency. Leanor incorporates a dimension reduction component, mapping vectors to one-dimensional keys and organizing them in a specific order. Subsequently, the learned index leverages this ordered representation for rapid predictions. To enhance result accuracy, we conduct a thorough analysis of model errors and introduce a specialized index structure named learned index forest (LIF). The experimental results show that, compared to representative approaches, Leanor can effectively filter out non-neighboring vectors within IVFPQ, leading to a substantial enhancement in retrieval efficiency.
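A minimal sketch of the filtering idea: project vectors to 1-D keys, fit a learned model of position versus key, and only examine candidates inside the model's error bound. The random projection and single linear model below are illustrative stand-ins for Leanor's dimension-reduction component and LIF structure:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 16))

# Dimension reduction: map each vector to a 1-D key via a projection
w = rng.normal(size=16)
w /= np.linalg.norm(w)
keys = vectors @ w
order = np.argsort(keys)
sorted_keys = keys[order]

# Learned index: position ~ a*key + b, with a max-error bound
n = len(sorted_keys)
a, b = np.polyfit(sorted_keys, np.arange(n), 1)
err = int(np.max(np.abs(a * sorted_keys + b - np.arange(n)))) + 1

def candidate_range(query_vec):
    # Only positions inside the error window need exact evaluation;
    # everything outside is filtered without any memory access
    pos = int(a * (query_vec @ w) + b)
    return max(0, pos - err), min(n, pos + err + 1)

lo, hi = candidate_range(vectors[0])
self_pos = int(np.where(order == 0)[0][0])   # true rank of vector 0
```

The true nearest neighbor always survives the filter because the error bound is computed over all training keys, while vectors outside the window are skipped entirely.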
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe field of analog custom design faces significant challenges during layout generation due to its inherent complexity, slow execution, and propensity for errors. These issues are further exacerbated by aggressive technology scaling within advanced process nodes. The performance of analog designs is highly sensitive to layout parasitics, underscoring the critical need for accurate parasitic estimation during all design stages, including schematic design, placement, and routing. This necessity stems from the direct impact of parasitics on key performance metrics such as device performance, IR drop, power consumption, and node voltage stability. This paper presents a novel methodology that employs a transformer-convolution-based GNN architecture and integer linear programming (ILP) optimization techniques to predict key layout parasitics for analog circuits. Our approach is distinct in its ability to comprehensively model capacitance and resistance parasitics in a scalable hierarchical tree structure. Through our novel parasitic modeling framework, we demonstrate, on an advanced sub-10nm process technology, a mean absolute percentage error (MAPE) of 11% and 11.5% for point-to-point resistance and lumped capacitance estimation, respectively. Using the estimated RC models, we were able to reduce the gap between pre- and post-layout simulation design metrics by a factor of 3X on industrial designs.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionThis paper introduces Learn-by-Compare (LbC), a novel approach for analog performance modeling based on semi-supervised contrastive regression. LbC employs a deep neural network encoder to learn latent representations of sizing solutions by comparing the similarity/dissimilarity of the underlying performance. Leveraging two levels of transistor-level sizing data augmentation (DA), namely LS-DA and GS-DA, LbC produces new data samples by employing design knowledge. Experimental results highlight LbC's superior predictive accuracy compared to traditional regression methods. Offering a streamlined semi-supervised learning methodology, LbC effectively incorporates simple design knowledge and representation learning for efficient analog performance modeling.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIndexes in database systems, such as B-trees and hash tables, are used for fast retrieval of data. They are created on columns of a table and serve as pointers that map a key to the position of a record in the table. In recent years, much research has been conducted on faster index lookup. "Learned indexes" are one such area of research. These index models have achieved enormous performance improvements. However, query performance with learned indexes is restricted by the CPU architecture.
FPGAs, on the other hand, offer a suitable alternative, by providing programmability. In this paper, we propose a new methodology that takes into consideration the advantages of both the learned index and FPGAs. We refer to this methodology as the Selective Mathematical operation AcceleRaTion (SMART) approach with an FPGA for the end-to-end acceleration of learned indexes. Being a hybrid between a CPU approach and an FPGA approach, the SMART model of index acceleration achieves FPGA-like performance while maintaining the data structure storage on the CPU.
With our SMART approach, the radix spline learned index was accelerated using a single FPGA and without any off-chip memory resources. The resulting index, called SMART-RS, achieves an overall speedup of 5.5× as compared to a CPU-based RS index on the SOSD benchmark datasets.
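As background, a radix-spline-style learned index can be approximated in a few lines. The fixed-step segment model below is a simplified stand-in for the real spline (and for the CPU/FPGA split of the SMART approach), assuming sorted, unique keys:

```python
import bisect

def build_index(keys, step=64):
    # Sample a "model point" every `step` sorted keys -- a crude stand-in
    # for the radix spline's learned segments (keys must be sorted, unique)
    knots = list(keys[::step]) + [keys[-1]]
    return knots, step

def lookup(keys, index, q):
    knots, step = index
    seg = max(0, bisect.bisect_right(knots, q) - 1)
    lo = min(seg * step, max(0, len(keys) - 1))
    hi = min(len(keys), lo + step + 1)
    # exact search is confined to one predicted segment
    return bisect.bisect_left(keys, q, lo, hi)
```

The key property exploited for acceleration is that the model evaluation (here, the knot lookup) is pure arithmetic, while the final bounded search touches only one small segment of the data.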
Research Manuscript
AI
AI/ML Algorithms
DescriptionWhile graph neural networks (GNNs) have gained popularity for learning circuit representations in various electronic design automation (EDA) tasks, they face challenges in scalability when applied to large graphs and exhibit limited generalizability to new designs. These limitations make them less practical for addressing large-scale, complex circuit problems. In this work we propose HOGA, a novel attention-based model for learning circuit representations in a scalable and generalizable manner. HOGA first computes hop-wise features per node prior to model training. Subsequently, the hop-wise features are solely used to produce node representations through a gated self-attention module, which adaptively learns important features among different hops without involving the graph topology. As a result, HOGA is adaptive to various structures across different circuits and can be efficiently trained in a distributed manner. To demonstrate the efficacy of HOGA, we consider two representative EDA tasks: quality of results (QoR) prediction and functional reasoning. Our experimental results indicate that (1) HOGA reduces estimation error over conventional GNNs by 46.76% for predicting QoR after logic synthesis; (2) HOGA improves reasoning accuracy by 10.0% over GNNs for identifying functional blocks on unseen gate-level netlists after complex technology mapping; (3) the training time for HOGA almost linearly decreases with an increase in computing resources.
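The hop-wise precomputation can be sketched with a normalized adjacency matrix; the row normalization and hop count here are illustrative choices, not necessarily HOGA's:

```python
import numpy as np

def hopwise_features(adj, feats, hops=3):
    # Precompute [X, AX, A^2X, ...] once per node before training; the
    # attention model then never needs the graph topology itself
    # (row-normalized adjacency is an illustrative choice)
    deg = adj.sum(1, keepdims=True)
    deg[deg == 0] = 1
    norm_adj = adj / deg
    out, cur = [feats], feats
    for _ in range(hops):
        cur = norm_adj @ cur
        out.append(cur)
    return np.stack(out, axis=1)   # shape: (num_nodes, hops + 1, feat_dim)
```

Because the stacked features already carry all topology information, the model that consumes them can be trained node-wise in a distributed fashion.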
Research Manuscript
EDA
Physical Design and Verification
DescriptionNon-integer multiple cell height (NIMCH) standard-cell libraries offer promising co-optimization for power, performance and area in advanced technology nodes. However, such non-uniform design introduces new layout constraints where any sub-region can only accommodate gates of the same cell height. The existing physical design flow for NIMCH circuits handles the constraint by clustering and relocating gates according to their cell heights, inevitably causing displacement that harms circuit performance. This paper proposes a placement-aware logic resynthesis procedure that explicitly adjusts cell heights after initial placement without changing cell location. Experimental results demonstrate that our approach can reduce the maximal delay by 26.1%.
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionAs designs become increasingly complex, ensuring that corner cases are properly verified is a critical challenge. While traditional verification techniques like Constrained Random Verification (CRV) and formal methodology have been used for this process, they both have their limitations. Functional simulation is time- and resource-intensive, and it is not exhaustive. The formal verification approach, though exhaustive, requires prior knowledge for creating the SystemVerilog properties to verify. To address this challenge, we need to explore innovative verification techniques that can help effectively verify complex designs and ensure that they meet the desired specifications. Waveforms and timing diagrams are commonly used by designers to represent design behavior over multiple cycles. To help capture information from failing-scenario wave-dumps or user-defined timing scenarios, we've developed a utility that quickly converts timing data into a SystemVerilog property. This enables designers to independently reproduce and verify scenarios in a formal verification environment with ease. The proposed approach reduces scenario regeneration time by up to 180 times compared to functional simulation verification.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionBottlenecks in the design verification sign-off process during project execution:
1. Coverage closure
2. Regression management
Problems faced during coverage closure:
1. Multiple iterations of regressions
2. Covering all the bins (hundreds of thousands of bins in current RTL designs)
3. Analyzing the uncovered bins
The motivation for writing this paper is to create awareness of, and introduce, automated techniques that save iterations and execution time for DV engineers, reducing the effort required to close coverage.
The presentation explains several such techniques and, along the way, also covers best practices to follow within the testbench infrastructure for faster coverage and regression closure and for catching the maximum number of bugs.
Leveraging the Portable Stimulus Standard (PSS) for faster functional coverage closure and constraint offloading.
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionSpectre-type attacks have demonstrated a major class of vulnerabilities
arising from speculative execution of instructions, the main performance enabler of modern CPUs. These attacks speculatively leak secrets that have been either speculatively loaded (seen in sandboxed programs) or non-speculatively loaded (seen in constant-time programs). Various hardware-only defenses have been proposed to mitigate both speculative and non-speculative secrets via all potential transmission channels. However, these solutions rely on the hardware's limited knowledge of the program and conservatively restrict the execution of all instructions that can potentially leak information.
In this work, we observe that not all instructions depend on older unresolved branches, and such instructions can safely execute without leaking speculative information.
We present Levioso, a novel hardware/software co-design that provides comprehensive secure-speculation guarantees while reducing performance overhead compared to existing methodologies. Levioso informs the hardware about true branch dependencies in order to apply restrictions only when necessary. Our evaluations demonstrate that Levioso significantly reduces the performance overhead compared to two state-of-the-art defenses, from 51% and 43% to just 23%.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionNoisy intermediate-scale quantum (NISQ) computers suffer from state-dependent errors due to the short relaxation time of sensitive qubits.
This state-dependent error distorts the result distribution and makes inferring correct answers on NISQ machines challenging.
To tackle this challenge, we propose Libra (coLaboratIng with Basis-inveRted quAntum-bit), which supports generating measurement results with balanced correct answers by mitigating quantum error caused by state-dependent bias.
Instead of running a regular circuit, Libra executes the regular and basis-inverted circuits at half the number of measurements each.
The noise characteristics due to the relaxation shown from the regular and basis-inverted circuit executions are different since their defined initial bases are in opposite directions.
Finally, Libra reconstructs the final distribution by multiplying and normalizing the measured results from the regular and inverted circuits.
In our experiments, Libra achieves up to 78% higher PST and up to 55% lower L2-Norm error compared to previous quantum error mitigation techniques.
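The final reconstruction step can be sketched directly from the description: flip the basis-inverted results back, then multiply and normalize the two distributions. The bitstring-to-shot-count format below is an assumed representation, not necessarily the paper's:

```python
def reconstruct(regular_counts, inverted_counts):
    # Combine the two half-shot runs: map inverted-circuit bitstrings
    # back by flipping every bit, then multiply and normalize the two
    # empirical distributions
    def to_probs(counts):
        shots = sum(counts.values())
        return {b: c / shots for b, c in counts.items()}

    p_reg = to_probs(regular_counts)
    flipped = {''.join('1' if bit == '0' else '0' for bit in bits): c
               for bits, c in inverted_counts.items()}
    p_inv = to_probs(flipped)
    combined = {b: p_reg.get(b, 0.0) * p_inv.get(b, 0.0)
                for b in set(p_reg) | set(p_inv)}
    total = sum(combined.values()) or 1.0
    return {b: v / total for b, v in combined.items()}
```

Because relaxation biases the two runs in opposite directions, the multiplicative combination suppresses states favored by noise in only one of the runs.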
Research Manuscript
Design
Emerging Models of Computation
DescriptionThis paper proposes a high-performance and energy-efficient optical near-sensor accelerator for vision applications, called Lightator. Harnessing the promising efficiency offered by photonic devices, Lightator features innovative compressive acquisition of input frames and fine-grained convolution operations for low-power and versatile image processing at the edge for the first time. This will substantially diminish the energy consumption and latency of conversion, transmission, and processing within the established cloud-centric architecture as well as recently designed edge accelerators. Our device-to-architecture simulation results show that with favorable accuracy, Lightator achieves 84.4 Kilo FPS/W and reduces power consumption by a factor of ~24x and 73x on average compared with existing photonic accelerators and the GPU baseline.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionLinked lists provide a flexible and efficient way to share resources across multiple queues. While implementing a linked list is straightforward, its verification poses complexities. Using traditional simulation techniques, finding and debugging failures can be challenging and time-consuming. Formal verification can accelerate this effort and enhance functional coverage. However, it is limited by the scale at which it can operate. Linked List Proof Accelerator is a generic and scalable solution that uses abstraction techniques to limit the state space, with input parameterization for easy adoption. It enables quicker debugging and left-shifts the bug-finding phase to the design phase.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionIn-memory learned index has been an efficient approach supporting in-memory fast data access. However, existing learned indexes are inefficient in supporting variable-length keys. To address this issue, we propose a new in-memory learned index called LIVAK that adopts a hybrid structure involving trie, learned index, and B+-tree. Each node indexes an 8-byte slice of keys, and we use learned indexes for large nodes but B+-trees for small nodes. Also, LIVAK presents a character re-encoding mechanism to avoid performance degradation. We compare LIVAK with B+-tree, Masstree, and SIndex on various datasets and workloads, and the results suggest the efficiency of LIVAK.
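The 8-byte slicing that drives the per-node indexing can be sketched as follows (the zero-byte padding is an assumption, not necessarily LIVAK's encoding):

```python
def key_slices(key: bytes, width=8):
    # Pad a variable-length key and split it into fixed 8-byte slices;
    # each level of the hybrid trie then indexes one slice
    padded = key + b'\x00' * (-len(key) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]
```

Fixed-width slices let every node index a bounded integer domain, which is what makes it possible to plug a learned model or B+-tree into each node interchangeably.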
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionWith the rapid downscaling of technology nodes,
industrial flows are challenged by pitch reduction, patterning flexibility, and lithography processing variability.
Layout hotspot detection is one of the most challenging and critical steps, and it requires technology upgrading.
Pattern matching and learning-based detectors are proposed as quick detection methods.
However, these computer vision (CV) model-based detectors use images transformed from layout GDS files as their inputs.
This leads to loss, and even distortion, of foreground information (e.g., metal polygons) when the image is shrunk to fit the model input.
Moreover, plenty of irrelevant background information, such as non-polygon pixels, is also fed into the model,
which hinders model fitting and wastes computational resources.
Given these disadvantages of traditional CV models, we propose a new layout hotspot detection paradigm,
which directly detects hotspots on GDS files by exploiting a hierarchical GDS semantic representation scheme and a well-designed pre-trained natural language processing (NLP) model.
Compared with state-of-the-art works,
ours achieves better results both on the ICCAD2012 metal layer benchmark and the more challenging ICCAD2020 via layer benchmark, which demonstrates the effectiveness and efficiency of our approach.
Research Manuscript
Design
AI/ML System and Platform Design
DescriptionAs generative AI such as ChatGPT rapidly evolves, the increasing incidence of data misconduct, such as the proliferation of counterfeit news or unauthorized use of Large Language Models (LLMs), makes it significantly harder for consumers to obtain authentic information. While new watermarking schemes are being proposed to protect the intellectual property (IP) of LLMs, the computation cost is unfortunately too high for the targeted real-time execution on local devices. In this work, a specialized hardware-efficient watermarking computing framework is proposed, enabling model authentication on local devices. By employing the proposed hardware hashing for fast lookup and pruned bitonic sorting network acceleration, the developed architecture framework enables fast and efficient watermarking of LLMs on small local devices. The proposed architecture is evaluated on a Xilinx XCZU15EG FPGA, demonstrating a 30x computing speed-up, making this architecture highly suitable for integration into local mobile devices. The proposed algorithm-to-architecture co-design framework offers a practical solution to the immediate challenges posed by LLM misuse, providing a feasible hardware solution for intellectual property protection in the era of generative AI.
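Since the accelerator centers on a sorting network, a software model of the classic bitonic compare-exchange network is useful context. This is the standard unpruned network for power-of-two sizes; the pruning and hashing stages of the actual framework are not modeled here:

```python
def bitonic_sort(data):
    # Classic bitonic network (ascending): the compare-exchange pattern
    # is fixed and data-independent, which is what makes it easy to
    # pipeline in hardware
    a = list(data)
    n = len(a)
    assert n & (n - 1) == 0, "network is defined for power-of-two sizes"
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                l = i ^ j
                if l > i:
                    if ((i & k) == 0 and a[i] > a[l]) or \
                       ((i & k) != 0 and a[i] < a[l]):
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a
```

Because the comparison schedule never depends on the data, every compare-exchange stage can be laid out as fixed wiring, unlike data-dependent sorts such as quicksort.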
Research Manuscript
Security
Embedded and Cross-Layer Security
DescriptionNumerous embedded systems utilize firmware written in memory-unsafe C/C++. As a result, the firmware may exhibit spatial memory vulnerabilities, such as buffer overflows, which, if exploited by an attacker, can lead to various software attacks. While several studies have proposed defenses against these memory vulnerabilities, they often introduce significant performance and memory overhead or are impractical for application in embedded systems. In this paper, we introduce micro-fat pointer, a novel solution for heap memory safety in embedded systems. Notably, micro-fat leverages the TT instructions newly introduced in ARMv8-M to implement an efficient bounds-checking mechanism. Our evaluation results demonstrate that micro-fat pointer exhibits a 41% performance improvement compared to the existing state-of-the-art heap memory safety solution.
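The bounds-checking semantics of a fat pointer can be modeled in a few lines. This is a behavioral sketch of the general concept, not of micro-fat's TT-instruction-based mechanism:

```python
class BoundsCheckedPtr:
    # Behavioral model of fat-pointer semantics: bounds metadata travels
    # with the pointer and every dereference is checked against it
    def __init__(self, buf, offset=0):
        self.buf, self.offset = buf, offset

    def __add__(self, n):
        # Pointer arithmetic preserves the metadata; the check happens
        # only on access, so one-past-the-end pointers remain legal
        return BoundsCheckedPtr(self.buf, self.offset + n)

    def load(self):
        if not 0 <= self.offset < len(self.buf):
            raise MemoryError("out-of-bounds access detected")
        return self.buf[self.offset]
```

Hardware schemes differ mainly in where this metadata lives and how cheaply the check executes; the semantics of the check itself are as above.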
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionMicroarchitectural attacks represent a challenging and persistent threat to modern processors, exploiting inherent design vulnerabilities in processors to leak sensitive information or compromise systems. Of particular concern is the susceptibility of Speculative Execution, a fundamental part of performance enhancement, to such attacks.
We introduce Specure, a novel pre-silicon verification method combining hardware fuzzing with Information Flow Tracking (IFT) to address speculative execution leakages. Integrating IFT enables two significant and non-trivial enhancements over existing fuzzing approaches: i) automatic detection of microarchitectural information-leakage vulnerabilities without a golden model, and ii) a novel Leakage Path coverage metric for efficient vulnerability detection. Specure identifies previously overlooked speculative execution vulnerabilities on the RISC-V BOOM processor and explores the vulnerability search space 6.45× faster than existing fuzzing techniques. Moreover, Specure detected known vulnerabilities 20× faster.
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
DescriptionTwo-stage object detectors exhibit high accuracy and precise localization, especially for identifying small objects that are favorable for various edge applications. However, the high computation costs associated with two-stage detection methods cause more severe thermal issues on edge devices, incurring dynamic runtime frequency change and thus large inference latency variations. Furthermore, the dynamic number of proposals in different frames leads to various computations over time, resulting in further latency variations. The significant latency variations of detectors on edge devices can harm user experience and waste hardware resources. To avoid thermal throttling and provide stable inference speed, we propose LOTUS, a novel framework that is tailored for two-stage detectors to dynamically scale CPU and GPU frequencies jointly in an online manner based on deep reinforcement learning. To demonstrate the effectiveness of LOTUS, we implement it on NVIDIA Jetson Orin Nano and Mi 11 Lite mobile platforms. The results indicate that LOTUS can consistently and significantly reduce latency variation, achieve faster inference, and maintain lower CPU and GPU temperatures under various settings.
Research Manuscript
EDA
Test, Validation and Silicon Lifecycle Management
DescriptionWe propose an algorithmic test generation method for neuromorphic chips without Design-for-Testability.
Fault activation differentiates a neuron's good output and faulty output.
Fault propagation sensitizes fault effects to differentiate outputs of faulty chips and good chips.
On an L-layer Spiking Neural Network (SNN) model, we achieve 100% fault coverage using O(L) test configurations and test patterns under negligible or no weight variation.
Our results show that test effectiveness is maintained even with 4-bit weight quantization.
We incur no test escape and overkill even under 10% weight variation.
Our total test length is over 73K times shorter than in previous works.
IP
Engineering Tracks
IP
DescriptionWe present a complete on-chip built-in self-test (BIST) technique for testing a high-performance Continuous Time Sigma Delta Analog to Digital Converter (CTSD ADC). A pre-stored pulse density modulated digital bitstream is filtered and applied to an inherently linear mixed-signal FIR filter for stimulus generation. The analog sigma delta modulator output is processed by a digital filter chain and a single-bin Discrete Fourier Transform (DFT) computer to accurately determine the spectral content and related performance figures. The proposed fully on-chip architecture obviates the need for an accurate analog signal generator and off-chip post-processing. Its area overhead is less than 5% of the total ADC area. The BIST circuit is plug-and-play, with operation independent of the sigma delta modulator and ADC specification. As the presented design is mostly digital, it is highly technology independent, enabling very short development times.
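A single-bin DFT of the kind used in such BIST post-processing is commonly computed with the Goertzel algorithm; the sketch below is a generic software illustration (the record length, bin index, and test tone are invented, not taken from the presented design):

```python
import math

def goertzel_power(samples, k):
    """Squared magnitude of DFT bin k of `samples`, via the Goertzel recurrence."""
    n = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for x in samples:
        s = x + coeff * s1 - s2
        s2, s1 = s1, s
    # |X[k]|^2 recovered from the final two recurrence states
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

# A full-scale tone placed exactly on bin k=5 of a 64-point record:
N, k = 64, 5
tone = [math.sin(2.0 * math.pi * k * n / N) for n in range(N)]
print(goertzel_power(tone, k))  # ~ (N/2)^2 = 1024
```

Because only one bin is evaluated, this needs a single multiply-accumulate recurrence per sample, which is why it suits area-constrained on-chip DFT computers.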
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionFor FPGA-based neural network accelerators, digital signal processing (DSP) blocks have traditionally been the cornerstone for handling multiplications. This paper introduces LUTMUL, a transformative approach that harnesses the potential of look-up tables (LUTs) for performing these multiplications. Empirical analysis reveals that the availability of LUTs typically outnumbers DSPs by a factor of 100, offering a significant computational advantage. By exploiting this advantage of LUTs, our method demonstrates a potential boost in the performance of FPGA-based neural network computations together with a reconfigurable data-flow architecture. Our approach not only challenges the conventional compute bound on DSP-based accelerators but also sets a new benchmark for efficient neural network computation on FPGAs. Experimental results demonstrate that our design achieves the best inference speed among all FPGA-based accelerators, achieving a throughput of 1627 images per second and maintaining a top-1 accuracy of 70.95% on the ImageNet dataset. Our method showcases great scalability, efficiency, and superior performance, marking a paradigm shift in FPGA-based neural network design and optimization.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionAs transistor size continues to scale down, process variation has become an essential factor determining semiconductor yield and economic return. The Liberty Variation Format (LVF) is the current industrial standard that expresses statistical timing behaviors based on a single-Gaussian model. However, it loses accuracy when the timing distribution is non-Gaussian due to growing process variations. This paper proposes a novel LVF2 distribution model to better capture multi-Gaussian timing distributions while maintaining backward compatibility with LVF. Experiments using TSMC 22nm technology show that, compared to LVF, LVF2 reduces binning error by 7.74× in delay and 9.56× in transition, and reduces 3𝜎-yield error by 4.79× in delay and 7.18× in transition. The error reduction is smaller for path delay due to the Central Limit Theorem (CLT), but it is still 2× for a typical circuit path with 8 Fanout-of-4 (FO4) inverter delays.
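The accuracy issue can be illustrated numerically: when a delay distribution is a two-component Gaussian mixture, a single-Gaussian model that matches the mixture's mean and variance misplaces the 3σ (0.99865) quantile. The mixture parameters below are purely illustrative and unrelated to the paper's silicon data:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Illustrative bimodal delay model: 50/50 mixture of N(0,1) and N(4,1)
def mix_cdf(x):
    return 0.5 * phi(x) + 0.5 * phi(x - 4.0)

target = phi(3.0)  # 0.99865, the "3-sigma yield" probability

# True mixture quantile, found by bisection on the mixture CDF
lo, hi = -10.0, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mix_cdf(mid) < target:
        lo = mid
    else:
        hi = mid
q_true = 0.5 * (lo + hi)

# Single-Gaussian fit using the exact mixture moments: mean 2, variance 5
mu, var = 2.0, 5.0
q_single = mu + 3.0 * math.sqrt(var)

print(q_true, q_single)  # the single-Gaussian model overestimates the tail point
```

Here the moment-matched single Gaussian puts the 3σ point near 8.71 while the true mixture quantile is near 6.78, which is the kind of tail error a multi-Gaussian model is meant to remove.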
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionPower gating design is essential to save power. It is important not only to design the PDN (power delivery network) but also to place switch cells properly in terms of number and distribution.
A switch-cell ratio is needed as a design methodology to achieve robust power integrity, considering static IR drop, dynamic IR drop, and leakage from powerplan through ECO.
However, designers face difficulty estimating the required switch cells in advance from the power the switch cells must deliver.
As a result, at the ECO stage, if there are not enough switch cells, it is hard to insert additional switch cells into the remaining empty area and to create additional PDN for them.
To solve these issues, an optimal switch-cell methodology based on machine learning is necessary.
Based on linear regression, an optimal switch-cell solution can be derived from the input parameters.
Without any tradeoff, the design can be made more robust, leading to better power integrity and TAT.
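As a sketch of the linear-regression idea, a least-squares model can map early design parameters to a switch-cell count; every feature name and training value below is hypothetical and not taken from the presented flow:

```python
import numpy as np

# Hypothetical training data from past blocks:
# columns = [dynamic power (mW), leakage power (mW)], target = switch-cell count
X = np.array([[10.0, 4.0],
              [20.0, 8.0],
              [15.0, 2.0],
              [30.0, 6.0]])
y = np.array([32.0, 54.0, 41.0, 73.0])  # toy data generated as 2*dyn + 0.5*leak + 10

# Least-squares fit with an intercept column appended to the feature matrix
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_switch_cells(dyn_mw, leak_mw):
    """Predicted switch-cell count for a new block's power estimates."""
    return coef[0] * dyn_mw + coef[1] * leak_mw + coef[2]

print(predict_switch_cells(25.0, 10.0))  # ~65 switch cells under this toy model
```

In practice the feature set would include static/dynamic IR and leakage figures from the powerplan stage, and the fitted model would be re-validated against sign-off IR analysis.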
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAnalog-on-Top Analog Mixed Signal (AMS) Integrated Circuit (IC) design is a time-consuming process predominantly carried out by hand. Within this flow, usually, some area is reserved by the top-level integrator for the placement of digital blocks. Specific features of the area, such as size and shape, have a relevant impact on the possibility of implementing the digital logic with the required functionality. We propose an automated evaluation methodology to predict the feasibility of digital implementation based on a set of high-level features avoiding time-consuming Place-and-Route trials so to provide fast feedback between Digital and Analog Back-End designers during top-level placement.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionTo enable new applications such as autonomous driving or car architectures with centralized ECUs, heterogeneous systems-on-chip (SoCs) with multiple CPUs, multi-level memory hierarchies, and various co-processors and hardware accelerators are becoming a key architectural paradigm. Such highly dense automotive SoCs implemented in advanced CMOS technologies are sensitive to process-voltage-temperature variations and other physical disturbances. To mitigate the increasing sensitivity of logic gates and memories to transient supply noise, temperature effects, and process variations, robust-enough power-delivery networks (PDNs) must be implemented. However, PDN development faces its own challenges, such as late-stage sign-off during the SoC's development cycle, long simulation times, computationally intensive simulations, and late discovery of voltage-drop and electromigration violations when fixes are expensive to implement. Furthermore, PDNs are typically initially defined without considering the package. To overcome these limitations, we propose a machine-learning-driven floorplan-aware power-co-planning methodology using Ansys' OptiSlang that shifts left the PDN development to the prototyping (architecture) abstraction level and enables automation of PDN die-package co-design and verification. Our solution transforms PDN development from a process that produces a couple of simulation results using more than 10 experts in several months into one that compares more than 1000 results using a single expert in a few days.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionEfficient analog circuit design for given specifications is essential in the semiconductor industry, but it is challenging. To support this design process, various automation techniques have been proposed, but these barely utilize information gained from previous simulation data. As a result, learning-based methods utilizing neural networks have received much attention, since they have the ability to generalize and a single neural network model can learn various tasks simultaneously. Inspired by this, we propose MACO, a transformer-based unified network for model-based optimization, designed for effective bidirectional prediction between circuit parameters and specifications across various circuit types. This framework is capable of handling diverse input lengths and providing variable-scale predictions, enhancing the optimization process and helping circuit designers gain insight. We validate that MACO's learning efficiency is remarkably improved (more than 12X) compared to single-task learning.
Research Manuscript
Embedded Systems
Embedded System Design Tools and Methodologies
DescriptionWe propose MAFin, which exploits the unique temperature effect inversion (TEI) property of a FinFET-based multicore platform, where processing speed increases with temperature, in the context of approximate real-time computing. With the objective of maximizing the QoS of a FinFET-based multicore system, MAFin, our proposed real-time scheduler, first derives a task-to-core allocation while respecting system-wide constraints and prepares a schedule. During execution, MAFin further increases the achieved QoS by exploiting the TEI property of FinFET-based processors while balancing performance and temperature, respecting the imposed constraints on-the-fly by incorporating a prudent temperature-cognizant frequency management mechanism.
Research Manuscript
Embedded Systems
Time-Critical and Fault-Tolerant System Design
DescriptionAs the deployment of neural networks in safety-critical applications proliferates, it becomes imperative that they exhibit consistent and dependable performance amidst hardware malfunctions. Several protection schemes have been proposed to protect neural networks, but they suffer from huge overheads or insufficient fault coverage. This paper presents Maintaining Sanity, a comprehensive and efficient protection technique for CNNs. Maintaining Sanity extends the state-of-the-art algorithm-based fault tolerance for CNN, utilizing hamming codes and checkpointing to correct over 99.6% of critical faults with about 72% runtime overhead and minimal memory overhead compared to traditional triple modular redundancy (TMR) techniques.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionComputing-In-Memory (CIM) is considered a promising solution to address the von-Neumann bottleneck. However, in traditional CIM architectures, data conversion can take up most of the hardware resources, such as chip area and energy. To overcome previous design limitations, this work, named MAM-CIM, proposes a computational architecture utilizing multilevel analog memory, tailored for near-sensor computation of time-domain data. At the same time, employing scheduling techniques to conduct data-resilience analysis for the analog memory contributes to additional reductions in hardware overhead. A gated recurrent unit (GRU) network is implemented based on the proposed architecture for a real-time keyword spotting (KWS) application in TSMC 180nm technology. The evaluation results indicate that it achieves an accuracy of 88.51% for 10 keywords and that the memory area of the system is reduced by 21.11%.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionAdjoint sensitivity analysis is critical in modern integrated circuit design and verification, but its computational intensity grows significantly with the size of the circuit, the number of objective functions, and the accumulation of time points. This growth can impede its wider application. The intimate link between the forward integration in transient analysis and the reverse integration in adjoint sensitivity analysis allows for the retention of Jacobian matrices from transient analysis, thereby speeding up sensitivity analysis. However, Jacobian matrices across multiple timesteps are often so large that they cannot be stored in memory during the forward integration process, necessitating disk storage and incurring significant I/O overhead.
To address this, we develop a memory-efficient sensitivity analysis method that utilizes data compression to minimize memory overhead during simulation and enhance analysis efficiency. Our compression method can efficiently compress the sparse tensor that contains the Jacobian matrices over time by exploiting the spatiotemporal characteristics of the data and circuit attributes. It also introduces a shared-indices technique, a cutting-edge spatiotemporal prediction model, and robust residual encoding.
We evaluate our compression method on 7 datasets from real-world simulations and demonstrate that it can reduce the memory requirements for storing Jacobian matrices by more than 16x on average, which is significantly more efficient than other state-of-the-art compression techniques.
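The shared-indices idea exploits the fact that the Jacobian's sparsity pattern is fixed by the circuit topology, so only the nonzero values change between timesteps and can be residual-encoded against the previous step. A toy sketch with made-up values (the real method adds a learned spatiotemporal predictor and entropy-coded residuals on top of this):

```python
import numpy as np

# Fixed sparsity pattern (row, col) shared by every timestep's Jacobian
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 1, 1, 2])

# Hypothetical nonzero values at three timesteps; they drift slowly over time
vals = [np.array([1.00, -0.50, 2.00, 0.30]),
        np.array([1.01, -0.52, 2.02, 0.31]),
        np.array([1.03, -0.53, 2.05, 0.33])]

# "Compressed" form: indices stored once, the first frame stored raw,
# later frames stored as residuals against the previous frame.
store = {"rows": rows, "cols": cols,
         "base": vals[0],
         "residuals": [vals[t] - vals[t - 1] for t in range(1, len(vals))]}

def decompress(store):
    """Rebuild every timestep's value array by accumulating residuals."""
    out, cur = [store["base"].copy()], store["base"].copy()
    for r in store["residuals"]:
        cur = cur + r
        out.append(cur)
    return out

recovered = decompress(store)
print(all(np.allclose(a, b) for a, b in zip(recovered, vals)))  # lossless round-trip
```

Because adjacent timesteps are highly correlated, the residual arrays are small and near-zero, which is exactly what makes them far more compressible than the raw per-timestep Jacobians.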
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionResubstitution is a flexible algorithmic framework for circuit restructuring that has been incorporated into many high-effort logic optimization flows. It is thus important to speed up resubstitution in order to obtain high-quality realizations of large-scale designs. This paper proposes a massively parallel AIG resubstitution algorithm targeting GPUs, with effective approaches to addressing cyclic dependencies and restructuring conflicts. Compared with ABC and mockturtle, our algorithm achieves 41.9x and 50.3x acceleration on average without quality degradation. When combining our resubstitution with other GPU algorithms, a GPU-based resyn2rs sequence obtains 46.4x speedup over ABC with 0.8% and 5.8% smaller area and delay respectively.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn analog design, matching is very critical to ensure yield, as even a few millivolts of difference between neighboring devices can break the circuit. In this paper, we present a flow for matched placement and routing using Group Arrays. Group Arrays are repeated patterns of synchronized unit cells. The unit cell is repeated in patterns in the design such that the parameters (width, length, etc.) of each unit are the same as those of the others in the array. To create repeated design patterns, the number of rows and columns can be altered along with the spacing between cells and the orientation pattern of the cells. Each unit cell comprises devices and routing, individually or in combination. As Group Arrays support synchronous editing, changes made to a unit cell are replicated across all cells. If the specifications are modified, design changes can be made very quickly by working on the unit cells, simplifying ECOs and DRC corrections. This paper shows that the placement and routing of an entire block can be done efficiently using Group Arrays.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionFully homomorphic encryption (FHE) enables arbitrary computations on encrypted data without decryption, securing many emerging applications. Unfortunately, FHE computation is orders of magnitude slower than computation on plain data due to the explosion in data size after encryption. We propose a PIM-based FHE accelerator, MatHE, which exploits a novel processing in-memory technology with near-mat processing to achieve high-throughput and efficient acceleration for FHE. Our evaluation shows MatHE achieves 4.0× speedup and 6.9× efficiency improvement over state-of-the-art FHE accelerators.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionThe efficient analysis of power grids is a crucial yet computationally challenging task in integrated circuit (IC) design, given the shrinking power supply voltage of ultra deep-submicron VLSI design. Different from the conventional modified nodal analysis analytical solving technique, this paper introduces MAUnet, an innovative machine-learning model that redefines state-of-the-art full-chip static IR drop prediction. MAUnet ingeniously integrates multi-scale convolutional blocks, attention mechanisms, and a U-Net architecture to optimize prediction accuracy. The multi-scale convolutional blocks significantly enhance feature extraction from image-based data, while the attention mechanism precisely identifies hotspot regions. The U-Net architecture, on the other hand, enables scalable image-to-image prediction applicable to circuits of any size. Uniquely, MAUnet also incorporates a pioneering fusion method that synergizes power-grid and image-based data. Additionally, we introduce a low-rank approximation transfer learning technique to extend MAUnet's applicability to unseen test cases. Benchmark tests validate MAUnet's superior performance, achieving an average error of less than 6% relative to the average IR drop on three benchmarks. The performance enhancements offered by our proposed method are substantial, outperforming the current state-of-the-art method, IREDGe, by considerable margins of 29%, 65%, and 68% in three canonical benchmarks. Transfer learning is validated to enable the model to achieve effective improvement on real circuit test cases. Compared to commercial tools, which often require hours to deliver results, the proposed method provides orders of magnitude speed-up with negligible error in practice.
Research Manuscript
Autonomous Systems
Autonomous Systems (Automotive, Robotics, Drones)
DescriptionVision Transformers (ViTs) are highly accurate Machine Learning (ML) models. However, their large size and complexity increase the expected error rate due to hardware faults. Measuring the error rate of large ViT models is challenging, as conventional microarchitectural fault simulations can take years to produce statistically significant data. This paper proposes a two-level evaluation based on data collected through more than 70 hours of neutron beam experiments and more than 600 hours of software fault simulation. We consider 12 ViT models executed on 2 NVIDIA GPU architectures. We first characterize the fault model in ViT's kernels to identify the faults that are more likely to propagate to the output. We then design dedicated procedures efficiently integrated into the ViT to locate and correct these faults. We propose Maximum corrupted Malicious values (MaxiMals), an experimentally tuned low-cost mitigation solution to reduce the impact of transient faults on ViTs. We demonstrate that MaxiMals can correct 90.7% of critical failures, with execution time overheads as low as 5.61%.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionMicroarchitectural timing side-channels are known to compromise security in computing systems with shared buffers (like caches) and/or parallel execution of attacker and victim tasks. Counterintuitively, such threats exist even in simple microcontrollers lacking such features. This paper describes previously neglected SoC-wide timing side-channels and presents a new formal method for detection. In a case study on Pulpissimo, our method detected a vulnerability to a previously unknown attack variant that allows an attacker to obtain information about a victim's memory accesses. We applied a conservative fix and verified security of the SoC against the considered class of timing side-channels.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionProblem Statement: Modern applications demand memory-intensive, complex SOCs with tighter time-to-market schedules. For such designs, implementing a high-quality test and repair solution with optimum MBIST insertion effort is a unique challenge. Traditional MBIST insertion methods need multiple MBIST insertion runs, as many times as the functional RTL or netlist changes, which keeps the DFT team engaged in running MBIST insertion repeatedly.
Approach/Methodology :
In this paper, we broadly cover optimized design practices that help reduce MBIST insertion iterations. The design hierarchies are partitioned into memory clusters in such a way that MBIST insertion is required only if there are memory changes in the given cluster. The flow is created such that the functional connections remain intact and only the BIST pins of the memories get hooked up to the BIST logic. This avoids any MBIST intercepts in functional paths. Different experiments are done to achieve optimum area, power, timing closure, and BIST runtimes.
When performing the DFT insertion flow with sub-blocks, you insert MemoryBIST and pre-DFT DRCs at the sub-block level and then move up to the sub-block's next parent physical block level (where the sub-block is instantiated) to perform ICL extraction/Synthesis/Scan insertion.
Impact/Results :
Multiple instantiations — You only need to perform the DFT insertion flow once for a sub-block. Thereafter, every instantiation of the sub-block includes the inserted DFT hardware
Small size — Most sub-blocks are not big enough to be considered their own physical regions which saves run time
Readiness — Sometimes the sub-block RTL is complete before the RTL for the physical layout region, thus you can begin DFT insertion on the sub-block as soon as RTL is ready
Back-End Design
Design
Engineering Tracks
DescriptionIn memory products, IO blocks for data interfaces have been using a custom-based design methodology. We use pre-layout simulation to predict chip performance in the early stage of schematic design using Steiner-tree-based routing estimation. Based on the simulation results, we place and fix standard cells and macros, and then perform routing. For critical nets, we manually route the nets in the form of Steiner trees and route the rest of the nets automatically.
However, the existing design methodology is facing limitations in routability and turn-around time, as routing constraints increase with the continuous rise in IO speed and routing tracks shrink due to area optimization.
In this work, we propose the following methods:
1. We develop a semi-automated router to generate a single-trunk Steiner tree considering basic design rules
2. We improve the routability of critical nets by combining the semi-automated router with the auto-router of the P&R tool
3. We minimize parasitic RC overhead by optimizing wire spacing and layers
We applied the proposed methodology to a flash design and observed significant improvements in physical DRC violations and design turn-around time while maintaining a layout expert's manual routing quality.
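A single-trunk Steiner tree of the kind such a router generates uses one horizontal trunk with a vertical branch to each pin; a simple wirelength estimate can be sketched as follows (the pin coordinates are illustrative, not from the presented design):

```python
def single_trunk_wirelength(pins):
    """Wirelength of a single-trunk Steiner tree: one horizontal trunk placed at
    the median pin y-coordinate, plus a vertical branch from each pin to it."""
    xs = sorted(x for x, _ in pins)
    ys = sorted(y for _, y in pins)
    y_trunk = ys[len(ys) // 2]          # a median y minimizes total branch length
    trunk = xs[-1] - xs[0]              # trunk spans the pins horizontally
    branches = sum(abs(y - y_trunk) for _, y in pins)
    return trunk + branches

pins = [(0, 0), (4, 2), (2, 4)]
print(single_trunk_wirelength(pins))  # 8
```

A real router would additionally check spacing and layer design rules along the trunk and branches, but the trunk-at-median structure is what makes the topology predictable enough for pre-layout RC estimation.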
Research Manuscript
EDA
Test, Validation and Silicon Lifecycle Management
DescriptionDue to rapid technology scaling in recent years, computation units such as AI systems have become highly susceptible to malfunctions in the hardware. Such malfunctions, when manifested in the accelerator memory, alter the pre-trained Deep Neural Network weight parameters, thereby generating faults, which in turn reduce the inference classification accuracy. To improve the reliability of the AI system, these faults need to be detected and mitigated with a just-in-time strategy. Existing fault detection and mitigation approaches are not ideal for just-in-time incorporation, as they prevent continuous execution or add significant latency overhead. To circumvent this issue, this paper explores uncertainty quantification in deep neural networks as a means of facilitating an efficient and novel fault detection approach in AI systems. Furthermore, in order to mitigate the impact of such faults, we propose MENDNet, which leverages the properties of multi-exit neural networks, coupled with the proposed uncertainty quantification framework. By tuning the confidence threshold for inference at each exit and leveraging the energy-based uncertainty quantification metric, MENDNet can make accurate predictions even in the presence of faults in the computation units. When evaluated on state-of-the-art network-dataset configurations and with multiple fault-rate and fault-position combinations, our proposed approach furnishes up to 80.42% improvement in inference classification accuracy over a traditional DNN implementation, thereby ensuring the reliability of the AI accelerator in mission mode.
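An energy-based multi-exit rule of the kind described can be sketched as follows: each exit computes an energy score E(z) = −log Σᵢ e^{zᵢ} over its logits, and inference stops at the first exit whose energy falls below a tuned threshold. The logit values and threshold here are invented for illustration and are not MENDNet's actual parameters:

```python
import math

def energy(logits):
    """Energy score of a logit vector; very negative means a confident prediction."""
    m = max(logits)  # stabilized logsumexp
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def multi_exit_predict(per_exit_logits, threshold):
    """Return (exit_index, argmax class) of the first sufficiently confident
    exit; fall back to the final exit otherwise."""
    for i, logits in enumerate(per_exit_logits):
        if energy(logits) < threshold or i == len(per_exit_logits) - 1:
            return i, max(range(len(logits)), key=lambda c: logits[c])

# Confident sample: the first (cheapest) exit already resolves it
print(multi_exit_predict([[10.0, 0.0, 0.0], [12.0, 0.0, 0.0]], threshold=-5.0))
# Ambiguous sample: falls through to the final exit
print(multi_exit_predict([[0.1, 0.2, 0.3], [0.2, 2.5, 0.1]], threshold=-5.0))
```

Under a fault, corrupted activations tend to push the energy score up (less confident), so later exits get a chance to override the damaged early prediction.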
Research Manuscript
MERSIT: A Hardware-Efficient 8-bit Data Format with Enhanced Post-Training Quantization DNN Accuracy
AI
Design
AI/ML Architecture Design
DescriptionPost-training quantization (PTQ) models utilizing conventional 8-bit Integer or floating-point formats still exhibit significant accuracy drops in modern deep neural networks (DNNs), rendering them unreliable. This paper presents MERSIT, a novel 8-bit PTQ data format designed for various DNNs. While leveraging the dynamic configuration of exponent and fraction bits derived from Posit data format, MERSIT demonstrates enhanced hardware efficiency through the proposed merged decoding scheme. Our evaluation indicates that MERSIT yields more reliable 8-bit PTQ models, exhibiting superior accuracy across various DNNs compared to conventional floating-point formats.
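Posit-style formats encode a variable-length regime field that trades fraction bits for dynamic range. As a reference point for the dynamic exponent/fraction split that MERSIT builds on, here is a sketch of generic posit(8, es) decoding; this is standard posit decoding for illustration, not the proposed merged decoding scheme:

```python
def decode_posit8(bits, es=1):
    """Decode an 8-bit posit with `es` exponent bits to a Python float."""
    if bits == 0x00:
        return 0.0
    if bits == 0x80:
        return float("nan")  # NaR (not a real)
    sign = -1.0 if bits & 0x80 else 1.0
    if sign < 0:
        bits = (-bits) & 0xFF  # two's complement gives the magnitude's encoding
    body = bits & 0x7F  # the 7 bits after the sign
    # Regime: run length of identical leading bits, ended by an opposite bit
    lead = (body >> 6) & 1
    run, i = 0, 6
    while i >= 0 and ((body >> i) & 1) == lead:
        run += 1
        i -= 1
    regime = run - 1 if lead else -run
    rem = max(i, 0)  # bits remaining after the regime terminator
    # Exponent: up to `es` bits, zero-padded on the right if truncated
    e_bits = min(es, rem)
    exp = ((body >> (rem - e_bits)) & ((1 << e_bits) - 1)) << (es - e_bits) if e_bits else 0
    rem -= e_bits
    # Fraction: whatever bits remain, with a hidden leading 1
    frac = 1.0 + (body & ((1 << rem) - 1)) / (1 << rem) if rem else 1.0
    return sign * 2.0 ** (regime * (1 << es) + exp) * frac

print(decode_posit8(0x40))  # 1.0
print(decode_posit8(0x60))  # 4.0  (with es=1)
print(decode_posit8(0xC0))  # -1.0
```

The variable-length regime is what makes naive posit decoders costly in hardware, which motivates merged or simplified decoding schemes in accelerator datapaths.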
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe Number Theoretic Transform (NTT) has proven effective in enhancing polynomial multiplication efficiency for fully homomorphic encryption (FHE), yet lacks a universal methodology for generating NTT accelerators. In this paper, we propose a methodology for NTT accelerators that accommodates polynomials of arbitrary degrees and moduli, achieving a balance between area and performance by adjusting the number of Processing Elements (PEs). Our design employs the Residue Number System (RNS) for modulus decomposition to enhance hardware resource utilization. In addition, we introduce a data movement strategy that eliminates bit-reversal operations, addresses memory conflicts, and reduces the clock-cycle count. Finally, we develop a configurable PE capable of adapting its data path, resulting in a universal architecture. The evaluation demonstrates that our design outperforms existing work by a 40% improvement in area-time product on average and up to a 21.7× improvement in processing speed.
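For reference, the textbook iterative radix-2 NTT that such accelerators parallelize, including the bit-reversal permutation the paper's data-movement strategy is designed to eliminate; the tiny modulus and root below are chosen only so the example is checkable by hand:

```python
def ntt(a, root, mod):
    """In-place iterative radix-2 NTT of `a` (length a power of two) over Z_mod,
    where `root` is a primitive len(a)-th root of unity modulo `mod`."""
    n = len(a)
    # Bit-reversal permutation (the data-movement step hardware designs try to avoid)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Cooley-Tukey butterfly stages
    length = 2
    while length <= n:
        w_len = pow(root, n // length, mod)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % mod
                a[k] = (u + v) % mod
                a[k + length // 2] = (u - v) % mod
                w = w * w_len % mod
        length <<= 1
    return a

# 13 is a primitive 4th root of unity mod 17 (13^2 = -1 mod 17)
print(ntt([1, 2, 3, 4], root=13, mod=17))  # [10, 6, 15, 7]
```

In FHE settings the modulus is instead a large NTT-friendly prime (or an RNS set of such primes), and each butterfly stage maps naturally onto an array of modular-arithmetic PEs.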
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith advancements in the fabrication process, Integrated Circuit (IC) blocks have become intricately designed. One of the main verification methods, verifying the layout using Design Rule Check (DRC), is commonly adopted. DRC code is written by the DRC team based on its understanding of the intent of the Layout Design Rule (LDR) issued by the process engineer. One of the challenges between the LDR and the DRC is insufficient information. The link between the LDR description and the DRC code is the key factor for preventing accidents and reducing turnaround time. Automatically generating LDR-based test pattern layouts enables checking the intent match between the DRC code and the LDR. For verifying complicated derived layers, cell-based test patterns are also needed, which requires updating standardized descriptions. Analyzing the DRC errors from the test patterns strengthens the LDR description and checks the DRC code's correctness. We provide a solution for standardization and an automation flow between the LDR description and the DRC code for quality assurance with reduced time.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionBillions of MCUs drive the integration of more CPUs, DMAs, and a variety of peripherals at relatively higher performance. While SOC-level performance and throughput requirements were seldom analyzed systematically in low-end MCUs, it is impractical to ignore these aspects in mid-range to high-end MCUs due to the complexity of integration and performance requirements.
We present a methodology to address this problem:
- Peripheral model managing internal FIFO reads/writes agnostic of its function of communication/conversion/processing
- System DMA model with customization options to take care of channel priority, channel switching and R/W transfer latency
- Determine throughput at SOC level using above models by running dynamic simulation as per SOC specification
- Optimize internal memory/buffer/FIFO sizes according to model performance which in turn helps to save cost OR
- Analyze possible performance trade-offs for various scenarios with existing buffer/FIFO sizes and clock frequency of operation
In this presentation, we focus on the following method:
A holistic simulation environment using cycle-accurate C/SystemC models, primarily supporting code execution by cycle-accurate ARM CPU models. It is scalable to include or exclude other models per SoC configuration, to change peripheral and SoC configuration as needed, and, most importantly, to evaluate architecture trade-offs very early in the design-exploration stage.
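The FIFO-sizing step described above can be illustrated with a toy cycle-based model. All names, rates, and latencies here are invented for illustration and are not the presented SystemC environment: a peripheral pushes one word into its FIFO every `prod_period` cycles, and a DMA drains one word every `dma_period` cycles after an initial grant latency.

```python
# Hypothetical toy model: track worst-case FIFO occupancy over a simulation run,
# which lower-bounds the FIFO depth needed to avoid overflow.
def peak_fifo_occupancy(cycles, prod_period, dma_period, dma_latency):
    fifo, peak = 0, 0
    for t in range(cycles):
        if t % prod_period == 0:
            fifo += 1                     # peripheral write this cycle
            peak = max(peak, fifo)        # worst case seen before the drain
        if t >= dma_latency and t % dma_period == 0 and fifo > 0:
            fifo -= 1                     # DMA read this cycle
    return peak
```

A real cycle-accurate model would additionally capture channel priority, channel switching, and read/write transfer latency as the bullets above describe; this sketch only shows why dynamic simulation, rather than static rate arithmetic, exposes the peak buffering requirement.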
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWith demand for aggressive reduction in power budget, among other techniques, supply voltage scaling to the lowest possible levels still remains the simplest yet most effective solution for both dynamic and static power reduction. In this paper, we discuss the existing techniques to enable low-voltage digital designs and propose a comprehensive method to define the necessary voltage levels, helping the design team make an informed decision on the lowest supply voltage they can adopt for reliable operation. At the same time, this ensures the functional robustness of the design along with accurate timing closure at ultra-low voltage. Extensive data is presented for a design at a 65nm process node for each phase of the method proposed in this paper.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionMicrosoft addresses diverse IP challenges by prioritizing quality control for internal design teams, resolving handoff complexities, and managing 3rd party IPs with format inconsistencies. They emphasize early quality checks in design, acknowledging the rising cost of addressing IP issues later in the process, especially for intricate custom chips like Cobalt 100 and Maia 100.
To resolve these challenges, Microsoft has collaborated with Siemens to build and deploy a comprehensive IP QA framework covering database integrity, layout functionality equivalence, and validation of timing, power, noise parameters, and version-to-version IP QA. This framework integrates Siemens' Solido IP Validation into Microsoft's CAD infrastructure.
This paper discusses how Microsoft's IP handoff flow automates and streamlines the entire process.
By catching potential issues much earlier in the design flow, the handoff flow has demonstrated remarkable results, saving approximately 2 weeks of engineering time. This not only contributes to substantial cost savings but also prevents the need for costly ECOs, leading to more predictable tapeout schedules.
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionIn semiconductor manufacturing, pinpointing nanoscale wafer defects is crucial for yield and reliability. Deep learning methods for defect segmentation rely heavily on large, labor-intensive datasets and focus mainly on macroscopic wafer defects, not nanoscale morphology. Our research introduces a hybrid weakly supervised scanning electron microscope (SEM) defect segmentation system with two sub-networks: one for accurate defect localization and image cropping, another for detailed segmentation. Validated on 1,328 SEM image defects from a real facility, our model surpasses existing weakly supervised methods and equals fully supervised models in accuracy, with 10% labeling effort, providing a novel approach for high-precision defect segmentation.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSeveral approaches have been proposed over the years to automatically generate specifications of digital systems by means of dynamic techniques, which are now ripe to be applied in large-scale industrial scenarios. On the other hand, the automatic extraction of specifications for the hybrid domain, where systems express both discrete and continuous behaviours, remains mainly unexplored. Therefore, in this paper, we propose a tool for dynamically mining the specifications of hybrid systems in the form of assertions compliant with the Signal Temporal Logic (STL), which has been proven to be effective at capturing the behaviours of such systems.
Our approach takes as input a set of execution traces of the target system and mixes clustering and decision-tree algorithms to generate STL assertions that describe what has been actually implemented.
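A drastically simplified illustration of the mining idea (not the proposed tool, which combines clustering and decision-tree algorithms): given numeric execution traces of a signal, one can derive the observed bounds and emit an STL-style invariant that all recorded behaviour satisfies.

```python
# Hypothetical miniature of specification mining: produce a globally (G)
# quantified STL-style bound assertion from observed traces.
def mine_bound_assertion(traces, signal="x"):
    hi = max(v for trace in traces for v in trace)
    lo = min(v for trace in traces for v in trace)
    return f"G(({signal} >= {lo}) and ({signal} <= {hi}))"
```

The real approach learns far richer temporal templates; this only shows the input/output shape of trace-driven assertion generation.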
Late Breaking Results Poster
DescriptionAs technology scales down, multi-cell spacing constraints are imposed by modern circuit designs. In this paper, we propose a detailed placement algorithm considering multi-cell spacing constraints. First, a SAT-based multi-cell spacing violation reduction method is presented to reduce the number of violations with minimum displacement. Then, a window-based violation-eliminating method is adopted to resolve all the remaining violations. Finally, we refine the placement result with ILP to reduce the cell displacement. Compared with the state-of-the-art work, experimental results show that our algorithm achieves a 19% improvement in displacement and a 39% reduction in runtime.
Research Manuscript
Design
Quantum Computing
DescriptionQuantum computers have the potential to solve important problems which are fundamentally intractable on a classical computer.
The underlying physics of quantum computing platforms supports using multi-valued logic, which promises a boost in performance over the prevailing two-level logic.
One key element to exploiting this potential is the capability to efficiently prepare quantum states for multi-valued, or qudit, systems.
Due to the time sensitivity of quantum computers, the circuits to prepare the required states have to be as short as possible.
In this paper, we investigate quantum state preparation with a focus on mixed-dimensional systems, where the individual qudits may have different dimensionalities.
The proposed approach automatically realizes quantum circuits constructing a corresponding mixed-dimensional quantum state. To this end, decision diagrams are used as a compact representation of the quantum state to be realized.
We further incorporate the ability to approximate the quantum state to enable a finely controlled trade-off between accuracy, memory complexity, and number of operations in the circuit.
Empirical evaluations demonstrate the effectiveness of the proposed approach in facilitating fast and scalable quantum state preparation, with performance directly linked to the size of the decision diagram.
The implementation is freely available under the MIT license at redacted for double-blind submission.
Research Manuscript
EDA
Physical Design and Verification
DescriptionThis paper proposes a mixed-size 3D analytical placement framework for face-to-face stacked integrated circuits fabricated with heterogeneous technology nodes and connected by hybrid bonding technology.
The proposed framework efficiently partitions a given netlist into two dies and optimizes the positions of each macro, standard cell, and hybrid bonding terminal (HBT). A multi-technology objective function and a multi-technology density penalty calculation process are adopted to handle the heterogeneous-technology-node constraints during mixed-size 3D global placement. Furthermore, a 3D objective function is used to refine the placement result during HBT-cell co-optimization. Our placer achieves the best results for all contest test cases compared with the participating teams at the 2023 CAD Contest at ICCAD on 3D Placement with Macros.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionAs process technologies keep advancing, physical design requires more time and more iterations, which makes it hard to push PPA to the limit. DRVs (Design Rule Violations) are not known until detailed routing, the most time-consuming step, and after detailed routing they are hard to fix by routing alone; cell placement must be modified. GRC (Global Routing Congestion) no longer correlates directly with post-detailed-routing DRVs, so in advanced processes detailed routing is necessary to verify whether DRVs occur due to the current placement, even though it is a time-consuming process. To reduce physical design time, DRVs must be predicted with fast runtime and moderate accuracy so that cell placement can be modified before running detailed routing.
Research Manuscript
ML-based Physical Design Parameter Optimization for 3D ICs: From Parameter Selection to Optimization
AI
AI/ML Application and Infrastructure
DescriptionWhile various studies have shown effective parameter optimizations for specific designs, there is limited exploration of parameter optimization within the domain of 3D Integrated Circuits. We present the first comprehensive study, both qualitatively and quantitatively, comparing five state-of-the-art (SOTA) techniques for parameter optimization applied to 3D ICs. Additionally, we introduce an end-to-end machine learning-based framework, encompassing important parameter selection through optimization, all without human intervention. Extensive studies across six industrial designs under the TSMC 28nm technology node reveal that our proposed framework outperforms SOTA techniques in three different optimization objectives in both optimization quality and runtime.
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionPoint Cloud Neural Network (PCNN) plays an essential role in various 3D applications, with some of them even being time-sensitive and safety-critical. However, the large scale of unordered points with lengthy features results in heavy computational workloads, making them far from real-time processing. To address this challenge, we propose MoC, a Morton-code-based fine-grained quantization for accelerating PCNNs. Specifically, we utilize Morton code to capture the spatial locality among points. Then, we gather nearby points with similar features into a region. Considering the similarity in features of nearby points, we propose to decompose features into base and offsets, where the offsets fall within a narrow range. Building upon this, we introduce a two-level mixed-precision quantization. In the first level, we quantize offsets with low precision, while keeping the base in high precision to ensure accuracy. For the second level, noticing the different data distribution of offsets across various regions, we employ two types of low precision at the region level, which provides opportunities to further accelerate feature computations. To support our algorithm, we design a hardware architecture that parallelizes the Morton code path with the critical path. In our extensive experiments on various datasets, our algorithm-architecture co-designed method demonstrates 12x, 6.3x, 4.7x, 3.8x, 3.4x and 2.8x speedup and 19.3x, 9.7x, 6.0x, 5.2x, 4.6x and 4.1x energy savings over CPU, Server and Edge GPUs, state-of-the-art ASICs (incl. PointAcc, MARS, PRADA) with negligible accuracy loss.
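For background on the Morton-code step, the sketch below interleaves the bits of quantized 3D coordinates so that spatially nearby points receive numerically nearby codes; this is the generic encoding, not MoC's hardware or its region-level quantization scheme.

```python
# Morton (Z-order) encoding of a quantized 3D point: bit i of x, y, z lands
# at bit positions 3i, 3i+1, 3i+2 of the code, capturing spatial locality.
def morton3d(x, y, z, bits=10):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# Sorting by Morton code gathers nearby points, the prerequisite for grouping
# them into regions with similar features.
points = [(1, 2, 3), (1, 2, 2), (9, 0, 5)]
ordered = sorted(points, key=lambda p: morton3d(*p))
```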
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description
Model-based semiconductor engineering represents a paradigm shift in the design, development, and manufacturing of semiconductor devices. This topic explores the transformative impact of leveraging models at various stages, from requirements to the physical semiconductor design. Embracing a model-based approach enhances collaboration, enables virtual prototyping, and supports comprehensive analysis of complex semiconductor systems. Using the MBSE approach, we highlight the value of requirements engineering and show how to create a system-level functional architecture and system traceability links. The MBSE approach helps streamline collaboration on requirements-to-design processes throughout the semiconductor project lifecycle. It ensures end-to-end traceability and accelerates the delivery of consistent and complete semiconductor designs by capturing and managing requirements accurately.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionIn today's rapidly evolving technological landscape, the demand for higher performance in System-on-Chip (SoC) designs is paramount. As applications become more complex and data-intensive, pushing performance boundaries is essential to meet the growing requirements of diverse industries such as artificial intelligence, 5G communications, and autonomous systems.
Achieving enhanced performance in SoCs ensures improved computational capabilities, faster data processing, and overall superior functionality, aligning with the ever-increasing expectations for speed and efficiency in modern electronic devices.
Developing high-performance System-on-Chip (SoC) designs presents a myriad of challenges. Ensuring the reliability and robustness of high-performance SoCs poses a significant challenge, with potential issues such as signal integrity, thermal management, and electromigration.
Design closure becomes intricate as timing, power, and noise margins need meticulous optimization to avoid bottlenecks. Additionally, the need to meet aggressive time-to-market goals while adhering to stringent cost constraints adds another layer of complexity.
Introducing a novel model margining algorithm for high-performance System-on-Chip (SoC) closure, this paper addresses the critical need for efficient design closure in advanced semiconductor architectures. By enhancing margining techniques, our algorithm optimizes performance without compromising reliability: margins are dynamically tailored for optimal design closure. This reduces late surprises at the signoff stage, which would otherwise trigger rerunning the entire backend implementation flow due to infeasibility of design closure. The paper also discusses prior approaches and their limitations in optimal design closure for high-performance SoCs.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionClock tree synthesis is the process of connecting the clocks to all clock pins of sequential circuits by using inverters or buffers in order to balance the skew and minimize the insertion delay. There are multiple techniques available for clock distribution. Some of these methods include the single clock tree, multi-tap clock tree, clock mesh, and Flex H-Tree.
Flex H-tree is a sophisticated methodology for clock distribution that maintains electrical symmetry but is flexible in geometrical symmetry. It helps in achieving the minimum skew in a clock tree, relaxes the requirement to be geometrically symmetric, and automates synthesis even in floorplans with placement restrictions.
The Flex H-Tree enables users to make smart latency vs. timing trade-offs. The classical problem that designers face while building a Flex H-Tree is the selection of the number of tap points.
There are an infinite number of combinations of tap points.
This paper provides a novel approach to determining the optimal number of tap points per clock library cell.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionElliptic curve cryptography (ECC) is widely used in security applications such as public key cryptography (PKC) and zero-knowledge proofs (ZKP). ECC is composed of modular arithmetic, where modular multiplication takes most of the processing time. Computational complexity and memory constraints of ECC limit the performance. Therefore, hardware acceleration on ECC is an active field of research. Processing-in-memory (PIM) is a promising approach to tackle this problem. In this work, we design ModSRAM, the first 8T SRAM PIM architecture to compute large-number modular multiplication efficiently. In addition, we propose R4CSA-LUT, a new algorithm that reduces the cycles for an interleaved algorithm and eliminates carry propagation for addition based on look-up tables (LUT). ModSRAM is co-designed with R4CSA-LUT to support modular multiplication and data reuse in memory with 52% cycle reduction compared to prior works with only 32% area overhead.
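As background, the classic bit-serial interleaved modular multiplication that R4CSA-LUT builds on can be sketched as follows; the paper's radix-4 processing and LUT-based carry-free addition are not reproduced here.

```python
# Interleaved modular multiplication: scan the multiplier from MSB to LSB,
# shifting and reducing the partial result each step so it never grows
# beyond one extra bit over the modulus.
def interleaved_modmul(a, b, m, width=256):
    r = 0
    for i in reversed(range(width)):       # width must cover b.bit_length()
        r = (r << 1) % m                   # shift partial result, reduce
        if (b >> i) & 1:
            r = (r + a) % m                # conditionally add multiplicand
    return r
```

Its appeal for in-memory computing is that each step is a shift, a compare/subtract, and a conditional add, i.e. it avoids a full-width product; the paper's contribution halves the cycle count with radix-4 digits and replaces carry propagation with look-up tables.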
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionMixture-of-Experts (MoE) large language models (LLM) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of Near-Data Experts (MoNDE), a near-data computing solution that efficiently enables MoE LLM inference. MoNDE reduces the volume of MoE parameter movement by transferring only the hot experts to the GPU, while computing the remaining cold experts inside the host memory device. By replacing the transfers of massive expert parameters with the ones of small activations, MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups over the existing parameter offloading frameworks for both encoder and decoder operations.
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionThere has been a growing trend in deploying deep neural networks (DNNs) on tiny devices.
However, deploying DNNs on such devices poses significant challenges due to the contradiction between DNNs' substantial memory requirements and the stringent memory constraints of tiny devices.
Some prior works incur large latency overhead to save memory and target only simple CNNs, while others employ coarse-grained scheduling for complicated networks, leading to limited memory footprint reduction. This paper proposes MoteNN that performs fine-grained scheduling via operator partitioning on arbitrary DNNs to dramatically reduce peak memory usage with little latency overhead.
MoteNN presents a graph representation named Axis Connecting Graph (ACG) to perform operator partition at graph-level efficiently. MoteNN further proposes an algorithm that finds the partition and schedule guided by memory bottlenecks.
We evaluate MoteNN using various popular networks and show that MoteNN achieves up to 80% peak memory usage reduction compared to state-of-the-art works with nearly no latency overhead on tiny devices.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRe-Order Buffer (ROB) is a fundamental component in modern microprocessor designs. A novel design is proposed to significantly reduce the area and dynamic power of a conventional ROB design without performance loss. A novel hardware structure removes redundancies existing in the original ROB entries by storing common information shared by many such entries separately. Cycle-accurate simulation results show that the area and power are reduced by 47% and 39% respectively in a CPU configuration modelled after the Intel Skylake processor. A design methodology is proposed for the novel design considering a trade-off between performance and power/area with a quantitative approach.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionMulti-scalar multiplication (MSM) is the most computation-intensive part in proof generation of Zero-knowledge proof (ZKP). In this paper, we propose MSMAC, an FPGA accelerator for large-scale MSM. MSMAC adopts a specially designed Instruction Set Architecture (ISA) for MSM and optimizes pipelined Point Addition Unit (PAU) with hybrid Karatsuba multiplier. Moreover, a runtime system is proposed to split MSM tasks with the optimal sub-task size and orchestrate execution of Processing Elements (PEs). Experimental results show that MSMAC achieves up to 328X and 1.96X speedups compared to the state-of-the-art implementation on CPU (one core) and GPU, respectively, outperforming the state-of-the-art ASIC accelerator by 1.79X. On 4 FPGAs, MSMAC performs 1,261X faster than a single CPU core.
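For background on the bucket (Pippenger) approach that large-scale MSM accelerators typically parallelize, here is a toy sketch using addition of integers mod q in place of elliptic-curve point addition; the window width c and all values are illustrative, and none of MSMAC's ISA, PAU, or task-splitting details are reproduced.

```python
# Bucket method for sum_i (s_i * P_i): split scalars into c-bit windows,
# accumulate points into buckets per digit value, then weight bucket i by i
# using suffix sums. Integers mod q stand in for curve points.
def msm_bucket(scalars, points, q, c=4):
    maxbits = max(s.bit_length() for s in scalars)
    windows = (maxbits + c - 1) // c
    total = 0
    for w in reversed(range(windows)):
        for _ in range(c):                     # shift accumulator by c bits
            total = (total + total) % q        # "doubling" in the toy group
        buckets = [0] * (1 << c)
        for s, p in zip(scalars, points):
            idx = (s >> (w * c)) & ((1 << c) - 1)
            buckets[idx] = (buckets[idx] + p) % q
        running, window_sum = 0, 0
        for b in reversed(buckets[1:]):        # suffix sums weight bucket i by i
            running = (running + b) % q
            window_sum = (window_sum + running) % q
        total = (total + window_sum) % q
    return total
```

The bucket structure is what makes MSM amenable to hardware: bucket accumulations within a window are independent point additions that pipelined units (like the PAU above) can process back to back.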
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
DescriptionSplit Computing (SC), where a Deep Neural Network (DNN) is intelligently split with a part of it deployed on an edge device and the rest on a remote server is emerging as a promising approach. It allows the power of DNNs to be leveraged for latency-sensitive applications that do not allow the entire DNN to be deployed remotely, while not having sufficient computation bandwidth available locally. In many such embedded scenarios, such as those in the automotive domain, computational resource constraints also necessitate MultiTask Learning (MTL), where the same DNN is used for multiple inference tasks instead of having dedicated DNNs for each task, which would need more computing bandwidth. However, how to partition such a multi-tasking DNN to be deployed within a SC framework has not been sufficiently studied. This paper studies this problem and MTL-Split, our novel proposed architecture, shows encouraging results on both synthetic and real-world data. The code implementing this architecture will be made publicly available.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSpiking neural networks (SNNs) for stress detection using physiological time-series signals of electrodermal activity (EDA), body temperature, and a multi-modal signal comprised of both, are designed and evaluated in this work. Execution of the SNNs on Intel Loihi-2 (a neuromorphic research chip) showed 5× to 83× better energy-delay product (EDP) compared to equivalent artificial neural networks (ANNs) implemented on a low-power edge-GPU, and a marginal gain of 1.3× to 2.6× over a Spiking Quantized Neural Network (SQNN) equipped with Dynamic Adaptive Leaky Integrate-and-Fire (DALIF) and Dynamic Adaptive Current-based Leaky Integrate-and-Fire (DACLIF) neurons. A significant EDP gain (83×), supplemented with a fast inference rate (∼9×), was reported for the multi-modal SNN, which has ∼9× and ∼1.8× fewer parameters than the corresponding ANN run on an edge-GPU and the SQNN run on an FPGA, respectively.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionA reliable, low-power stress detector 'at the edge' is extremely beneficial for continuous monitoring of post-stroke patients. In this context, feed-forward spiking neural networks (SNNs) for stress detection using physiological time-series signals of electrodermal activity (EDA), body temperature, and a multi-modal signal comprised of both, are designed and evaluated. Execution of the SNNs on Intel Loihi-2 (a neuromorphic research chip) showed 5× to 83× and 9× to 123× better energy-delay product (EDP) compared to equivalent artificial neural networks (ANNs) executed on a low-power edge-GPU and an FPGA, respectively. A significant EDP gain (83×), supplemented with a fast inference rate (∼9×), was reported for the multi-modal SNN, which has ∼9× fewer parameters than the corresponding ANN run on an edge-GPU.
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionTechnology Computer Aided Design (TCAD) is a crucial step in the design and manufacturing of semiconductor devices. It involves solving physical equations that describe the behavior of semiconductor devices to predict various device parameters. Traditional TCAD methods, such as finite volume and finite element methods, discretize relevant physical equations to achieve numerical simulations of devices, significantly burdening the computation resources. For the first time, this paper proposes a novel method for TCAD simulation based on Physics-Informed Neural Networks (PINNs).
We propose the Multi-order Differential Neural Network (MDNN), an improved Radial Basis Function Neural Network (RBFNN) model. By training the MDNN, we achieve a coupled solution of the Poisson equation and the drift-diffusion equations under steady-state conditions, without the need for a pre-existing dataset. To the best of our knowledge, this marks the first instance of an ML-TCAD simulation that does not require any pre-existing data. For the example of a PN junction diode, the method effectively simulates the basic physical characteristics of the device, with a self-consistent solution error of less than 1×10^-5.
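The physics-informed idea behind this approach, training against the residual of the governing PDE rather than against labeled data, can be sketched on a 1D Poisson problem. The equation, source term, and finite-difference residual below are illustrative stand-ins, not the paper's coupled device equations:

```python
import numpy as np

# Physics-informed residual loss for the 1D Poisson equation u''(x) = f(x),
# evaluated with central finite differences at interior collocation points.
# A PINN would minimize this loss over network parameters; here we only
# evaluate it for two candidate solutions.
def poisson_residual_loss(u, f, x):
    h = x[1] - x[0]
    u_vals = u(x)
    u_xx = (u_vals[2:] - 2 * u_vals[1:-1] + u_vals[:-2]) / h**2
    return float(np.mean((u_xx - f(x[1:-1])) ** 2))

f = lambda x: -np.pi**2 * np.sin(np.pi * x)  # source term
exact = lambda x: np.sin(np.pi * x)          # satisfies u'' = f, u(0)=u(1)=0
wrong = lambda x: x * (1 - x)                # violates the PDE

x = np.linspace(0.0, 1.0, 201)
print(poisson_residual_loss(exact, f, x) < 1e-3)  # near-zero residual: True
print(poisson_residual_loss(wrong, f, x) > 1.0)   # large residual: True
```

The "self-consistent solution error" the abstract reports corresponds to driving such residuals toward zero simultaneously for the coupled equations.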
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionWavelength-routed optical networks-on-chip (WRONoCs) are well known for providing high-speed and collision-free communication in multi-core processors. Previous work was unable to simultaneously reduce the design complexity and total optical power consumption of WRONoCs. Moreover, in current designs, each microring resonator (MRR), the key component of WRONoCs, is configured to demultiplex one specific wavelength. This significantly increases MRR usage and insertion loss. In this work, we adapt different types of ONoC routers into the mesh-based template. To reduce MRR usage, we take advantage of an important feature of MRRs, multi-resonance, so that a single MRR can demultiplex signals on multiple wavelengths. To this end, we propose an efficient design method that synthesizes mesh-based WRONoCs using multi-resonance MRRs and existing optical routers to reduce total power consumption. The experimental results show that our method outperforms state-of-the-art design methods in significantly reducing MRR usage and optical power.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionA conditional diffusion probabilistic neural network tailored for swift, scalable multiterminal obstacle-avoiding pathfinding within VLSI systems is introduced. This method departs from conventional pathfinding strategies by leveraging the unique capabilities of diffusion models, which translate pathfinding into a graphical representation for enhanced path generation. Based on experimental results, the runtime for this diffusion-based pathfinding method remains constant as system complexity increases, resulting in a wirelength similar to that of state-of-the-art methods. The constant runtime complexity, along with the lack of scalability limitations, represents a significant improvement over traditional learning-based pathfinding methods, highlighting the potential of diffusion models to transform additional EDA applications.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAmidst the increasing complexity of computing systems, the precision and integrity of module designs, particularly the Instruction Length Decode unit (ILD) in modern processors, stand as paramount concerns. The ILD's role in identifying instruction boundaries and enabling accurate field extraction becomes more intricate with innovative Byte-Level Speculative Parallel decoding techniques. Traditional verification methods, inadequate for the dynamic nature of modern ILD designs, underscore the need for a comprehensive approach. This paper addresses this challenge by proposing a methodology, the Trilogy Assurance Paradigm (TAP), designed to rigorously validate ILD functionality. Beyond ILD, TAP extends its applicability to diverse complex IPs. Focused on the CPU pipeline, this exploration delves into the ILD's significance and the intricacies of byte-level speculative decoding. TAP's potency lies in its holistic approach, encompassing top-down control path analysis, bottom-up data-path logic scrutiny, and integration assessments for diverse architectural contexts. This paper presents a comprehensive solution to verify intricate modern ILD designs and extends its methodology's applicability to various complex IPs.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn traditional statistical fault injection, uncertainty regarding failure rates leads to adopting conservative assumptions that maximise the sample size. Consequently, fault injection experiments become excessively time-consuming for applications with minimal error margins. To mitigate the limitations of existing approaches, we investigate the potential of Bayesian sampling in minimising the sample size. Preliminary results indicate up to a 5X reduction with an error rate consistently below 1%.
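The Bayesian idea can be sketched with a Beta-Bernoulli model: update a posterior over the failure rate after every injected fault and stop as soon as the posterior is tight enough, instead of committing to a worst-case sample size up front. The stopping rule below (posterior standard deviation under a target) is an assumption for illustration, not necessarily the paper's exact criterion:

```python
import random

# Sequential Bayesian estimation of a failure rate. Each fault injection
# trial is a Bernoulli draw; the Beta(alpha, beta) posterior is updated
# in closed form, and sampling stops once it is sufficiently concentrated.
def bayesian_fault_injection(true_fail_rate, target_std, max_trials=200000, seed=0):
    rng = random.Random(seed)
    alpha, beta = 1.0, 1.0  # uniform Beta(1, 1) prior over the failure rate
    for n in range(1, max_trials + 1):
        fail = rng.random() < true_fail_rate  # simulated injection outcome
        alpha += fail
        beta += not fail
        mean = alpha / (alpha + beta)
        var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
        if var ** 0.5 < target_std:
            return mean, n  # posterior mean and number of trials used
    return mean, max_trials

est, n_used = bayesian_fault_injection(true_fail_rate=0.05, target_std=0.005)
print(f"estimated failure rate {est:.3f} after {n_used} trials")
```

For a 5% failure rate and this tolerance, stopping typically occurs after roughly two thousand trials, far below a conservative fixed budget.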
Research Manuscript
EDA
Physical Design and Verification
DescriptionIn modern IC design, routing significantly impacts chip performance, power, area, and design iteration count. Critical challenges in routing include generating a rectilinear Steiner minimum tree (RSMT) for each net and handling routing resources among nets. Due to limited resources and net scale, congestion is inevitable in VLSI circuit routing. Most competitive routers address congestion after routing without prior net guidance, leading to difficulty in managing resources among nets. To tackle routing and congestion, we suggest introducing a net resource allocation step as a potentially desirable initial routing stage. Firstly, we introduce the concept of net region probability density (NRPD) to achieve suitable net resource allocation. Using a prior NRPD, we model the resource allocation problem as quadratic programming (QP). We utilize the penalty method to solve the QP quickly and obtain a posterior NRPD for each net on each grid. Based on the posterior NRPD and the congestion map, we introduce a cost scheme to guide net routing. This cost scheme supports a weighted RSMT construction technique for better topological solutions. Additionally, we propose an iterative method for global routing and track assignment, improving detailed routing quality and optimizing design rule violations. Experimental results show the effectiveness of net resource allocation and demonstrate the superior performance of our router over OpenROAD's router across multiple metrics.
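The penalty-method step can be sketched on a toy equality-constrained QP: the constraint is folded into the objective with a growing penalty weight, and each subproblem has a closed-form solution. The matrices and penalty schedule below are illustrative, not the NRPD formulation itself:

```python
import numpy as np

# Quadratic-penalty sketch for an equality-constrained QP:
#   minimize 1/2 x^T Q x + c^T x   subject to   A x = b.
# Each penalty subproblem minimizes
#   1/2 x^T Q x + c^T x + (mu/2) ||A x - b||^2,
# whose closed-form solution is (Q + mu A^T A) x = mu A^T b - c.
# The constraint violation shrinks as mu grows.
def penalty_qp(Q, c, A, b, mus=(1.0, 10.0, 100.0, 1000.0)):
    for mu in mus:  # fixed increasing schedule; only the last solve is kept
        x = np.linalg.solve(Q + mu * A.T @ A, mu * A.T @ b - c)
    return x

Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x = penalty_qp(Q, c, A, b)
print(np.round(x, 2))  # approaches the exact minimizer [0, 1]
```

The exact KKT solution of this toy problem is x = (0, 1); with mu = 1000 the penalty solution is already within about 1e-3 of it.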
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionNeuromorphic computing has gained wide attention in recent years because of its advantages of low power consumption and high energy efficiency. As a biologically plausible unsupervised learning rule, Spike Timing Dependent Plasticity (STDP) uses the spike timing information between pre-synaptic and post-synaptic neurons to update the synaptic weights. Existing neuromorphic architectures with event-driven STDP on-chip learning can be divided into two categories: trace-based and counter-based. However, resource-friendly and fully event-based STDP on-chip learning has still not been achieved. In this paper, we propose a novel resource-friendly neuromorphic processor architecture named "NeuCore" with fully event-driven on-chip STDP learning. Combining the advantages of both trace-based and counter-based event-driven STDP on-chip learning architectures, several key techniques, such as spike-driven updating of traces and weights, are proposed and well scheduled to further improve the energy efficiency and implementation efficiency of on-chip learning. The experimental results show that the accuracy of NeuCore reaches 96.0% and 90.04% on the N-MNIST and DVS128 Gesture datasets. NeuCore achieves 4.44× and 2.5× speedup in performance and learning throughput, respectively, compared with the state-of-the-art trace-based work, and 1.37×-2.27× implementation efficiency compared with the state-of-the-art counter-based work.
Research Manuscript
EDA
Design Verification and Validation
DescriptionThere is a pressing need to ensure the safety of closed-loop systems with NN controllers. To address this issue, we propose a novel approach for generating barrier certificates, which combines counterexample-guided learning with efficient SOS-based verification. Our proposed method offers an efficient verification procedure that solves three linear matrix inequality (LMI) constraint feasibility testing problems, instead of relying on an SMT solver to verify the barrier certificate conditions. We conduct comparison experiments on a set of benchmarks, demonstrating the advantages of our method in terms of efficiency and scalability, which enable effective verification of high-dimensional systems.
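What a barrier certificate certifies can be illustrated numerically on a toy system: the certificate must be negative on the initial set, positive on the unsafe set, and non-increasing along the dynamics. The paper verifies these conditions exactly via LMI feasibility; the sampled check below (with a made-up system and certificate) only illustrates what is being certified:

```python
# Toy system x' = -x with candidate barrier B(x) = x^2 - 1,
# initial set |x| <= 0.5 and unsafe set |x| >= 2.
B = lambda x: x * x - 1.0
lie_B = lambda x: 2.0 * x * (-x)  # dB/dx * f(x), the Lie derivative along x' = -x

grid = [i / 100.0 for i in range(-300, 301)]
init_ok   = all(B(x) < 0 for x in grid if abs(x) <= 0.5)  # B < 0 on initial set
unsafe_ok = all(B(x) > 0 for x in grid if abs(x) >= 2.0)  # B > 0 on unsafe set
decrease  = all(lie_B(x) <= 0 for x in grid)              # B never increases
print(init_ok, unsafe_ok, decrease)  # -> True True True
```

Since trajectories start where B < 0 and B cannot increase, they can never reach the region where B > 0, which contains the unsafe set.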
Research Manuscript
EDA
Design Verification and Validation
DescriptionModern SAT solvers depend on conflict-driven clause learning to avoid recurring conflicts. Deleting less valuable learned clauses is a crucial component of modern SAT solvers to ensure efficiency. However, a single clause deletion policy cannot guarantee optimal performance on all SAT instances. This paper introduces a new clause deletion metric to diversify existing clause deletion approaches. Then, we propose to use machine learning to evaluate and select clause deletion policies adaptively based on the input instance. We show that our method can reduce the runtime of the state-of-the-art SAT solver Kissat by 5.8% on large industry benchmarks.
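A clause deletion pass in the CDCL style can be sketched as "score every learned clause, keep the best fraction"; the scoring function is exactly the pluggable policy the paper proposes to select adaptively. The LBD-based score and keep ratio below are common conventions used for illustration, not the paper's new metric:

```python
# Rank learned clauses by a score (here LBD, "literal block distance";
# lower is better) and delete the worst portion. The `score` callable is
# the interchangeable deletion policy.
def delete_clauses(clauses, keep_ratio=0.5, score=lambda c: c["lbd"]):
    ranked = sorted(clauses, key=score)            # best (lowest score) first
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]

learned = [
    {"id": 1, "lbd": 2}, {"id": 2, "lbd": 7},
    {"id": 3, "lbd": 3}, {"id": 4, "lbd": 9},
]
kept = delete_clauses(learned)
print(sorted(c["id"] for c in kept))  # -> [1, 3]: the low-LBD clauses survive
```

An adaptive solver would swap in a different `score` (or keep ratio) per instance, which is the selection problem the paper hands to a learned model.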
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionA core objective of physical design is to minimize wirelength (WL) when placing chip components on a canvas. Computing the minimal WL of a placement requires finding rectilinear Steiner minimum trees (RSMTs), an NP-hard problem. We propose NeuroSteiner, a neural model that distills GeoSteiner, an optimal RSMT solver, to navigate the cost–accuracy frontier of WL estimation. NeuroSteiner is trained on synthesized nets labeled by GeoSteiner, alleviating the need to train on real chip designs. Moreover, NeuroSteiner's differentiability allows placement by minimizing WL through gradient descent. On ISPD 2005 and 2019, NeuroSteiner can obtain 0.3% WL error while being 60% faster than GeoSteiner, or 0.2% error while being 30% faster.
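For context, the cheap baseline that RSMT-based estimators refine is the half-perimeter wirelength (HPWL), the perimeter of the net's bounding box divided by two. HPWL equals the optimal RSMT length for 2- and 3-pin nets but underestimates it for larger nets, which is where an RSMT solver (or a distilled model of one) earns its cost:

```python
# Half-perimeter wirelength of a net given as (x, y) pin coordinates.
def hpwl(pins):
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

net = [(0, 0), (4, 1), (2, 5)]
print(hpwl(net))  # -> 9; equals the RSMT length for this 3-pin net
```

HPWL is also differentiable almost everywhere, which is why placers have long used it as a gradient-friendly WL proxy; NeuroSteiner extends that differentiability to the tighter RSMT estimate.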
SKYTalk
AI
Design
EDA
DescriptionThis talk provides a broad, visionary perspective about the dynamic changes impacting electronic design automation tools and methodologies and their pivotal role in re-shaping engineering lifecycle management. It explains the trend toward convergence of EDA and CAE domains to address exploding system complexity and deliver multi-disciplinary solutions. Future workflows must incorporate digital threads that connect virtual prototypes and digital twins with physical systems. Predictive simulation and analysis across domains is key to accelerating engineering lifecycles.
Rapid industry adoption of AI, heterogeneous integrated circuits and chiplet technologies, software automation using scripting languages, and comprehensive data and intellectual property management tools is driving a seismic shift in design and verification methodologies. This talk covers how these EDA technologies contribute to more efficient and effective enterprise lifecycles. Application of these technologies must elevate RF, microwave, and mixed-signal design to an equal footing with digital design to achieve modernization of engineering workflows.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionSoC creation is done by integrating logical subsystems using the SystemVerilog language. Connectivity between different subsystems is defined by different specs. Some specs are defined in document and xls formats, which makes it a real challenge to keep the design up to date with an evolving spec. When moving from one SoC to another and rebuilding with semi-automation, we faced a multitude of major bugs. We understood that our current solution, based on Verilog-auto features, had reached its limits.
Through the presented methodology, we build an automated process enabling to: Extract connectivity information from an existing SoC project; Categorize the extracted connectivity, to keep only what is required for the new generation of SoC projects; and Generate the new connectivity.
A Defacto customer built this methodology based on Defacto's SoC Compiler APIs, which enabled us to generate a full top level in 5 seconds.
We estimate a global reduction of at least 40% in the overall execution and man-month effort.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn advanced technology nodes, metal pitches have not scaled with area, which has increased cross-coupling between instances: switching instances in the immediate vicinity have a significant impact on an instance's IR drop, with a measurable impact from areas outside the immediate vicinity through the shared PDN. To ensure that there are no IR drop escapes, two separate analyses are required: first, the local region around an instance is sensitized to expose all attacker-victim combinations that can happen over the course of product lifetimes, obtained with SigmaDvD; second, areas outside the local region are modeled to represent design behavior and analyzed with traditional transient Dynamic IR. The capability to combine both analyses does not exist in present tools. We present SigmaAV, a brand-new engine from Ansys that merges SigmaDvD and transient Dynamic IR to derive an IR drop that encompasses the impact of local aggressors together with IR drop events outside the local vicinity. In the presentation, we review what this technology is, the trends that we saw on our designs, and how we leverage its exhaustive nature in IR-STA for Fmax improvement and power and area reduction.
Embedded Systems and Software
AI
Embedded Systems
Engineering Tracks
DescriptionAI-driven computing demands more power, and power consumption increases exponentially. However, fine-grained hardware-level power management can mitigate this issue. Compared to software-level management, it can reduce power consumption by 20% to 60%.
Nevertheless, such management adds complexity to hardware design, which can be challenging given fixed design timelines. Within the conventional design process, multiple engineers participate at different stages of the design with only partial information, resulting in a complex and inefficient process.
To address this, we introduce a no-code design platform for power and clock management systems. This platform simplifies the design process, reduces setup and modification times, and has enabled our first customer to design their system quickly, meeting all specifications in line with automotive SOC standards.
Our no-code solution dramatically reduces design time and resource requirements compared to traditional methods. The entire system configuration process takes about one week, and design outputs are generated in just a few minutes.
With our no-code design platform, a single engineer can efficiently configure complex SOCs, significantly reducing idle power consumption and design efforts for SOCs.
IP
Engineering Tracks
IP
Description"Network on a Chip" or NoC refers to a communication subsystem that enables communication between various components or modules on the chip. It is a network-based approach to managing data transfer and communication within a microprocessor or a system-on-chip (SoC).
Research Manuscript
Design
Design for Manufacturability and Reliability
DescriptionAccurate estimation of rare failure occurrence probability is crucial for ensuring the proper and reliable functioning of integrated circuits (ICs). Conventional Monte Carlo methods are inefficient, demanding an exorbitant number of samples to achieve reliable estimates. Inspired by the exact sampling capabilities of normalizing flows, we revisit this problem and propose normalizing flow assisted importance sampling, termed NOFIS. NOFIS first learns a sequence of proposal distributions associated with predefined nested subset events by minimizing KL divergence losses. Next, it estimates the rare event probability by utilizing importance sampling in conjunction with the last proposal. The efficacy of our NOFIS method is substantiated through comprehensive qualitative visualizations, affirming the optimality of the learned proposal distribution, as well as 10 quantitative experiments (covering electronic Opamp and Charge Pump circuits, and photonic Y-branch), which highlight NOFIS's superior accuracy over baseline approaches.
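NOFIS's core primitive, importance sampling with a learned proposal, can be illustrated by estimating a Gaussian tail probability with a hand-picked proposal standing in for the normalizing flow. Samples are drawn from a distribution shifted into the failure region and reweighted by the density ratio:

```python
import random, math

# Importance-sampling estimate of the rare-event probability P(X > t)
# for X ~ N(0, 1), using a Gaussian proposal shifted to the failure
# region. In NOFIS a trained normalizing flow plays the proposal's role.
def rare_event_is(t=4.0, shift=4.0, n=20000, seed=0):
    rng = random.Random(seed)
    phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # N(0,1) pdf
    total = 0.0
    for _ in range(n):
        y = rng.gauss(shift, 1.0)             # proposal sample ~ N(shift, 1)
        if y > t:
            total += phi(y) / phi(y - shift)  # importance weight p(y)/q(y)
    return total / n

est = rare_event_is()
exact = 3.167e-5  # tabulated P(N(0,1) > 4)
print(abs(est - exact) / exact < 0.1)  # within 10% using only 2e4 samples
```

A plain Monte Carlo estimate of this ~3e-5 probability would need millions of samples for comparable accuracy; a good proposal concentrates samples where failures actually occur, which is exactly what NOFIS learns.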
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe Chop and Swap methodology was architected for noise mitigation and saved multiple designers weeks of manual chip integration work in a project.
When a noise event happens, the closer the aggressor and victim wires are, the more noise coupling occurs, causing NIOT (Noise Impact on Timing) or NIOF (Noise Impact on Function) fails. The Chop and Swap methodology identifies long parallel lines between aggressor and victim wires, then chops the wires into pieces and moves them further away from each other. Chop and Swap carefully optimizes the location of each wire so that it not only reduces NIOF fails by 10x, but also improves timing most of the time. Chop and Swap was used to solve tens of thousands of NIOF fails overnight in our latest project. Prior to automation, these fails were addressed manually, requiring several integrators working for several weeks.
This work provides a tremendous productivity and turn-around-time (TAT) boost in a modern complex chip-level design.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionThis paper proposes a neural-network-based power model, Nona, that accurately predicts the power consumption of heterogeneous CPUs on a commercial mobile device. With aggressive on-device power management in action, it becomes increasingly challenging to make accurate power predictions for diverse applications. To overcome the limitations of the existing power models based on linear regression, Nona uses a lightweight neural network with a small number of performance monitoring counters (PMCs) chosen from a system analysis and a loss function designed for power prediction.
Experiments on Google Pixel 6 show that Nona has a 3.4% average prediction error, improving on prior work by 2.6x.
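The linear-regression baseline that Nona improves on can be sketched directly: predicted power is a linear combination of performance-counter rates fitted by least squares. The counter names and coefficients below are synthetic, for illustration only:

```python
import numpy as np

# Least-squares PMC power model: power ≈ X @ coef, where each column of X
# is a performance-counter rate plus an intercept term.
def fit_power_model(pmc, power):
    X = np.column_stack([pmc, np.ones(len(pmc))])  # append intercept column
    coef, *_ = np.linalg.lstsq(X, power, rcond=None)
    return coef

# Synthetic traces: power = 0.8 * counter0 + 2.5 * counter1 + 0.1 (watts)
rng = np.random.default_rng(0)
pmc = rng.random((100, 2))
power = 0.8 * pmc[:, 0] + 2.5 * pmc[:, 1] + 0.1
coef = fit_power_model(pmc, power)
print(np.round(coef, 2))  # recovers [0.8, 2.5, 0.1]
```

Real devices break this linearity (DVFS, power gating, thermal effects), which is the gap Nona's lightweight neural network and custom loss are designed to close.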
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn engineering, the use of Large Language Models (LLMs) for domain-specific code generation presents a significant challenge and an important area of study. These models are crucial in assisting programming and development tasks, but they often require substantial computational resources and extensive datasets. Our method focuses on improving data preprocessing and optimizing prompt engineering techniques. We propose using LLMs in the data preprocessing phase to create data embeddings that more accurately reflect their contextual meanings within a semantic space. This will improve the relevance and quality of the generated code.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIncreasing functionality of automotive multiprocessor SoCs has resulted in increasing power grid complexity, leading to high voltage-ripple noise caused by simultaneous switching of multiple processor blocks in the SoC. Meeting chip-package-system (CPS) performance targets becomes daunting due to this issue. Designers grapple with the lack of accurate chip models for chip-package-system co-analysis for power integrity signoff involving microsecond-long simulations. The conventional Chip Power Model (CPM) falls short in addressing low-frequency noise (0.1 – 50 MHz) caused during chip mode changes over longer durations. Multiprocessor chips have high demand currents that require techniques like clock and power gating to deal with excessive power requirements. However, Dynamic Voltage and Frequency Scaling (DVFS) and clock gating can induce significant simultaneous switching noise (SSN) on VDD. We present here the results of our study, which utilized advanced chip power models involving time extensions, stitching of multiple models, and modulation of high-frequency chip currents over a mode-changing low-frequency current envelope, to help detect and mitigate high peak-to-peak voltage variations in our chip-package-system transient analysis with a faster turn-around time.
Research Manuscript
EDA
Design Verification and Validation
DescriptionThe efficiency of validating complex System-on-Chips (SoCs) is contingent on the quality of the security properties provided. Generating security properties with traditional approaches often requires expert intervention and is limited to a few IPs, thereby resulting in a time-consuming and non-robust process. To address this issue, we, for the first time, propose a novel and automated Natural Language Processing (NLP)-based Security Property Generator (NSPG). Specifically, our approach utilizes hardware documentation in order to propose the first hardware-security-specific language model, HS-BERT, for extracting security properties dedicated to hardware design. It is capable of parsing a significant amount of hardware specification, and the generated security properties can be easily converted into hardware assertions, thereby reducing the manual effort required for hardware verification. NSPG is trained using sentences from several SoC documents and achieves up to 88% accuracy for property classification, outperforming ChatGPT. When assessed on five untrained OpenTitan hardware IP documents, NSPG aided in identifying eight security vulnerabilities in the buggy OpenTitan SoC presented at Hack@DAC 2022.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe concurrent execution of deep neural network (DNN) models over various input modalities is a growing trend for tackling complex scenarios where user inputs and contextual features collected by sampling visual and auditory signals are combined to provide more immersive and intelligent user-computer interactions. One field in which these types of application scenarios have been shown to be pervasive is eXtended Reality (XR). One key challenge in enabling multi-modal execution is the requirement for XR systems to host and execute inference operations on a collection of different DNN models on mobile devices such as headsets. This work explores the optimization of XR memory architectures using embedded non-volatile memories (NVMs). We present nvmXR, a cross-stack evaluation framework that allows designers to compare different ReRAM-based architectures and identify optimal memory solutions based on specific workload constraints.
Exhibitor Forum
DescriptionObject-oriented software has made its mark on the industry. It has helped create large software systems. Coming from the software industry, it has been adopted by hardware verification in the form of languages like SystemVerilog and SystemC and methodologies like UVM. But what about embedded hardware?
This presentation will discuss how embedded hardware can be presented in an object-oriented way to firmware and lower levels of system software. We will review the industry standards and non-standard formats that are currently prevalent in this space.
The benefits of this approach will be presented and finally we will discuss the roadmap to get to this level of operational excellence.
Research Manuscript
EDA
Physical Design and Verification
DescriptionEmerging applications in Printed Circuit Board (PCB) routing impose new challenges on automatic length matching, including adaptability for any-direction traces with their original routing preserved for interactiveness. The challenges can be addressed through two orthogonal stages: assign non-overlapping routing regions to each trace and meander the traces within their regions to reach the target length. In this paper, mainly focusing on the meandering stage, we propose an obstacle-aware detailed routing approach to optimize the utilization of available space and achieve length matching while maintaining the original routing of traces. Furthermore, our approach incorporating the proposed Multi-Scale Dynamic Time Warping (MSDTW) method can also handle differential pairs against common decoupled problems. Experimental results demonstrate that our approach has effective length-matching routing ability and compares favorably to previous approaches under more complicated constraints.
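The primitive behind the proposed Multi-Scale Dynamic Time Warping (MSDTW) is classic DTW, which aligns two sequences by warping their time axes and sums the residual mismatch. The textbook recurrence is sketched below; the multi-scale extension itself is not reproduced:

```python
# Dynamic-time-warping distance between two numeric sequences, computed
# by the standard O(len(a) * len(b)) dynamic program.
def dtw(a, b):
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: match, insert, delete
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # -> 0.0: warping absorbs the repeat
```

Because DTW tolerates local stretching, it can compare two traces of differing lengths, which is what makes it a natural fit for matching the meandered segments of a differential pair.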
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn this paper, we propose ODILO, a new on-device incremental
learning framework for edge systems. The key part of ODILO is a new module, namely Efficient Incremental Module (EIM). EIM is composed of normal convolutions and lightweight operations. During incremental learning, EIM exploits some lightweight operations, called adapters, to effectively and efficiently learn features for new classes such that it can improve the accuracy of incremental learning while reducing model complexity as well as training overhead. The efficiency of ODILO is further bolstered by adapter fusion, prototypes, and efficient data augmentation. We conduct extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets. Experimental results show that ODILO improves the accuracy by up to 4.21% over existing methods while reducing around 50% of model complexity. In addition, evaluations on real edge systems demonstrate its applicability for on-device machine learning. The code will be available upon acceptance.
Research Manuscript
AI
Security
AI/ML Security/Privacy
DescriptionDeep Neural Networks (DNNs), such as the widely-used ChatGPT model containing billions of parameters, are often kept secret due to the high training costs and privacy concerns surrounding the data used to train them.
Previous approaches to securing DNNs typically require expensive circuit redesign, resulting in additional overheads such as increased area, energy consumption, and latency. To address these issues, we propose a novel hardware-software co-design for DNN protection that leverages the inherent aging characteristics of circuits to provide effective protection.
Hardware-side, we employ random aging to produce authorized chips. This process circumvents the need for chip redesign, thereby eliminating any additional energy and area overhead. Moreover, the authorized chips demonstrate a considerable disparity in DNN inference performance when compared to unauthorized third-party chips. Software-side, we propose a novel Differential Orientation Fine-tuning method, which allows pre-trained DNNs to maintain their original accuracy on authorized chips with minimal fine-tuning, while the model's performance on unauthorized chips is reduced to random guessing. Comprehensive experiments on MLP, VGG, ResNet, Mixer and SwinTransformer validate the efficacy of our method.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionThe recent breakthroughs in the field of large language models (LLMs) owe much of their accomplishments to the exponential growth in model size (240× every two years), creating a significant challenge in computation and memory complexity for today's hardware. Quantization has emerged as a critical technique for reducing these complexities. However, existing approaches mainly employ fixed quantization schemes, which are inefficient in that they require more bits to maintain model accuracy. In this work, we delve into the dynamics and heterogeneity present in both inter- and intra-layer distributions, focusing particularly on the highly dynamic range and composition of the extremely large values commonly referred to as outliers.
We propose Oltron, an algorithm/hardware co-design solution for outlier-aware quantization of LLMs with inter-/intra-layer adaptation. Oltron employs a holistic quantization framework with three key innovations. First, we propose a novel quantization algorithm capable of determining the optimal composition ratio of outliers among different layers and among channel groups within a layer. Second, we propose a reconfigurable architecture that adjusts its computation fabric based on inter- and intra-layer distributions. Third, we propose a tile-based dataflow optimizer that meticulously plans the complicated computation and memory-access schedules for mixed-precision tensors. Oltron surpasses the existing outlier-aware accelerator OliVe with a 1.9× performance improvement and a 1.6× energy-efficiency improvement, along with superior model accuracy.
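A minimal sketch of the general outlier-aware quantization idea (not Oltron's actual algorithm): keep the largest-magnitude fraction of a tensor at full precision and quantize the rest to a low-bit grid. The function name and ratio below are illustrative:

```python
import numpy as np

def outlier_aware_quantize(x, outlier_ratio=0.05, bits=4):
    # Keep the top-|outlier_ratio| fraction of entries (by magnitude)
    # at full precision; quantize the inliers to a signed `bits`-bit grid.
    k = max(1, int(round(len(x) * outlier_ratio)))
    idx = np.argsort(np.abs(x))[-k:]                  # the outliers
    inliers = np.delete(x, idx)
    scale = np.abs(inliers).max() / (2**(bits - 1) - 1)
    q = np.clip(np.round(x / scale), -2**(bits - 1), 2**(bits - 1) - 1)
    deq = q * scale
    deq[idx] = x[idx]                                 # outliers survive exactly
    return deq

x = np.array([0.1, -0.3, 0.2, 8.0, -0.05, 0.15])
xq = outlier_aware_quantize(x, outlier_ratio=1/6)
assert xq[3] == 8.0           # the outlier is preserved at full precision
```

Without the outlier carve-out, the single large value 8.0 would blow up the scale and destroy the resolution of all other entries.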
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe need for secured collaboration arises from the imperative to protect sensitive information and ensure the integrity of shared data. To meet strong industrial constraints, industry players must ensure that communication between the different links in their value chain is fluid. In this paper, we present a secure cloud-based collaboration platform to orchestrate, share, and deliver data between semiconductor OEMs and suppliers. It provides a collaborative space where engineers can work on anything from a small device to a complex system. It enables the exchange of large multi-format data, accelerates product development through co-innovation, and enables collaboration between OEMs and the supply chain. It helps cut excess costs and deliver products to customers faster. It ensures full security and traceability to preserve and capitalize on IP from different stakeholders. Shifting to this secure cloud platform significantly improves collaboration across the value chain by managing a multitude of data and different version levels. We will showcase how this cloud solution optimizes data exchange between OEMs and multiple suppliers.
Exhibitor Forum
DescriptionThe global semiconductor chip shortage has disrupted production in recent years, affecting a wide range of industries. Beyond the prevailing difficulties in keeping up with demand, geopolitics requires moving from a global to a sovereign supply chain. The semiconductor ecosystem is actively pursuing solutions, including establishing new manufacturing sites and preparing the upcoming workforce. All these solutions are time-consuming and costly. In this scenario, a virtual twin experience can help companies bring down cost and effort. In this session, we will explain our cloud-based semiconductor virtual twin experience to highlight how it can enable end-to-end digital continuity from chips to fab. This approach is applicable to the entire semiconductor ecosystem, from design to manufacturing, material, process, equipment, and cleanroom/fab, scaling collaborative innovation across all stages and stakeholders. The following scenarios are covered by our solutions:
• from chips to systems
• project planning, requirement, system architecture and validation
• from lab to fab
• material, manufacturing process, equipment, process flow and operation
We will present how digital continuity and traceability utilize data analytics and cutting-edge modeling techniques to enhance decision-making and minimize potential risks in the semiconductor virtual twin experience universe. In addition, we will see how a model-based innovation platform on an IP-secure cloud enables
• Virtualization of physical experimentation and tests
• Industry ecosystem collaboration, fostering private-public partnership between different IDM, OEM, Foundry, fabless, academia, research, assembly and test, to work together
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionInter- and intra-chiplet interconnection networks play a vital role in the operation of many-core systems made of multiple chiplets. However, these networks are susceptible to faults caused by manufacturing defects and to attacks resulting from the malicious insertion of hardware trojans and backdoors. Unlike conventional fault-tolerance or countermeasure methods, this paper focuses on optimizing network robustness to withstand both faults and attacks while respecting chiplet area and power budgets. To achieve this, the paper first defines network robustness as a quantifiable measure based on various network parameters, after which an optimization problem is formulated to optimize the robustness of the network topology. An efficient algorithm is proposed to solve this problem. Experimental results demonstrate that the proposed method generates inter- and intra-chiplet interconnection networks that are significantly more robust than those of existing topology generation methods. Specifically, the proposed method improves robustness over state-of-the-art methods by an average of 14.06% under random faults and 9.37% under targeted attacks.
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionWe present a new XOR-based attention function for efficient hardware implementation of transformers. While standard attention relies on matrix multiplication, we propose replacing the computation of this attention function with bitwise XOR operations. We mathematically analyze the information-theoretic properties of multiplication-based attention, demonstrating that it preserves input entropy, and then show that XOR-based attention approximately preserves the entropy of its input. Across various simple tasks, including arithmetic, sorting, translation, and text generation, we show comparable performance to baseline methods using scaled GPT models. XOR-based attention shows substantial improvement in power, latency, and area compared to the multiplication-based attention function.
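As an illustration of the general idea, one plausible XOR-based attention replaces the query-key dot product with a bit-agreement (Hamming similarity) score over bit-quantized operands. This is our own toy construction, not the paper's exact function:

```python
import numpy as np

def to_bits(x, bits=8):
    # Min-max quantize to unsigned integers so XOR is well defined
    lo, hi = x.min(), x.max()
    return np.round((x - lo) / (hi - lo) * (2**bits - 1)).astype(int)

def hamming(u, v):
    # Number of differing bits across the feature dimension
    return sum(bin(a ^ b).count("1") for a, b in zip(u, v))

def xor_attention(Q, K, V, bits=8):
    Qb, Kb = to_bits(Q, bits), to_bits(K, bits)
    d = Q.shape[1]
    # Similarity = number of agreeing bits instead of a dot product
    scores = np.array([[d * bits - hamming(q, k) for k in Kb] for q in Qb],
                      dtype=float)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 2))
out = xor_attention(Q, K, V)
assert out.shape == (3, 2)
```

The appeal for hardware is that the score path uses only XOR and popcount, which are far cheaper in power and area than multipliers.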
Research Manuscript
Design
AI/ML System and Platform Design
DescriptionTo overcome the burden on memory size and bandwidth due to the ever-increasing size of large language models (LLMs), aggressive weight quantization has been studied recently, while research on quantizing activations is lacking. In this paper, we present a hardware-software co-design method that results in an energy-efficient LLM accelerator, named OPAL, for generation tasks. First, a novel activation quantization method is proposed that leverages the microscaling data format while preserving several outliers per sub-tensor block (e.g., four out of 128 elements). Second, on top of preserving outliers, mixed precision is utilized, with 5-bit inputs to sensitive layers in the decoder block of an LLM and 3-bit inputs to less sensitive layers. Finally, we present the OPAL hardware architecture, which consists of FP units for handling outliers and vectorized INT multipliers for the dominant non-outlier operations. In addition, OPAL uses a log2-based approximation of softmax that requires only shifts and subtractions to maximize power efficiency. As a result, we improve energy efficiency by 1.6~2.2x and reduce area by 2.4~3.1x with negligible accuracy loss, i.e., <1 perplexity increase.
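The shift-and-subtract softmax idea can be sketched by replacing exp(x) with a rounded power of two, so each "exponential" becomes a bit shift. A hedged toy version (OPAL's exact scheme may differ):

```python
import numpy as np

def log2_softmax(x):
    # Subtract the max (a subtraction), scale by 1/ln 2, round to an integer
    # exponent; 2**t is then a pure exponent shift (ldexp). Toy illustration.
    t = np.round((x - x.max()) / np.log(2)).astype(int)
    p = np.ldexp(1.0, t)
    return p / p.sum()

x = np.array([1.0, 2.0, 3.0])
approx = log2_softmax(x)
exact = np.exp(x - x.max())
exact /= exact.sum()
# The power-of-two approximation tracks the true softmax closely
assert abs(approx.sum() - 1.0) < 1e-12
assert np.all(np.abs(approx - exact) < 0.1)
```

In hardware the division by the sum would also be approximated, but the key saving is already visible: no exponential unit and no multipliers on the score path.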
IP
Engineering Tracks
IP
DescriptionThe Compute Express Link (CXL) 3.1 specification introduces hardware support for cache coherence, facilitating efficient access to shared memory pools in data centers, crucial for addressing the growing memory demand from AI applications. However, challenges in Cost, Efficiency, and Sustainability impede the widespread deployment of CXL platforms at Hyperscale. In response, Meta and Google have unveiled a hardware-compressed CXL memory tier within the Open Compute Project (OCP) / Composable Memory System (CMS) framework, aiming to achieve sustainable and responsible data center operations across diverse compute platforms and memory technologies.
We present DenseMem, an OCP-compliant Memory Compression IP. This hardware-accelerated solution enhances effective capacity by 2-4x with sub-10ns latency at full bandwidth. Leveraging a novel cache-line-granularity compression algorithm, DenseMem is an area- and power-efficient IP block compatible with the latest process nodes. Currently available for evaluation, DenseMem is scheduled for production deployment in mid-2024. It seamlessly integrates into CXL Type 3 device Systems-on-Chip (SoCs) between the CXL controller and memory controller logic blocks, supporting the AXI4 and CHI specifications. The lightweight firmware of DenseMem facilitates communication via CXL.mem commands, exposing compressed memory regions for easy integration into existing Linux stacks, applications, and fabric management software.
Embedded Systems and Software
AI
Embedded Systems
Engineering Tracks
DescriptionThe AUTOSAR standard is nowadays the predominant architecture for the development of automotive industrial software. In particular, AUTOSAR Classic is widely used by most of the world's major car-makers.
Nevertheless, the size of the software development workload for automotive projects has grown exponentially in recent years, so OEMs and suppliers are struggling to find suitable developers for specific AUTOSAR-related implementations (e.g., application, BSW, integration). On the other hand, studying AUTOSAR as part of an academic curriculum has become extremely difficult due to dependence on expensive commercial AUTOSAR tools and the lack of open-source working environments that would allow university students to familiarize themselves with the AUTOSAR world.
In this work we introduce the concept of an open-source AUTOSAR Classic platform as part of the AUTOSAR University Package. This platform describes an AUTOSAR Classic development environment built with open-source tools and AUTOSAR modules that can be easily adopted by universities with minimal financial investment. The Classic Platform Demonstrator (CPD) addresses the following problems: 1) dependency on embedded hardware boards, 2) dependency on proprietary AUTOSAR software stacks, 3) complexity of creating a complete example, and 4) the size of the standard (>20,000 pages!) and the lack of "stable" training material.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe AUTOSAR standard is nowadays the predominant architecture for the development of automotive industrial software. In particular, AUTOSAR Classic is widely used by most of the world's major car-makers.
Nevertheless, the size of the software development workload for automotive projects has grown exponentially in recent years, so OEMs and suppliers are struggling to find suitable developers for specific AUTOSAR-related implementations (e.g., application, BSW, integration). On the other hand, studying AUTOSAR as part of an academic curriculum has become extremely difficult due to dependence on expensive commercial AUTOSAR tools and the lack of open-source working environments that would allow university students to familiarize themselves with the AUTOSAR world.
In this work we introduce the concept of an open-source AUTOSAR Classic platform as part of the AUTOSAR University Package. This platform describes an AUTOSAR Classic development environment built with open-source tools and AUTOSAR modules that can be easily adopted by universities with minimal financial investment. The Classic Platform Demonstrator (CPD) addresses the following problems: 1) dependency on embedded hardware boards, 2) dependency on proprietary AUTOSAR software stacks, 3) complexity of creating a complete example, and 4) the size of the standard (>20,000 pages!) and the lack of "stable" training material.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper takes a human-in-the-loop, human-in-the-plant (HIL-HIP) approach to ensuring the operational safety of autonomous systems. A three-way interaction is considered: a) through personalized inputs and biological feedback processes between HIP and HIL, b) through sensors and actuators between RWC and HIP, and c) through personalized configuration changes and data feedback between HIL and RWC. We extend control Lyapunov theory by generating control Lyapunov barrier functions (CLBFs) under human action plans and integrate them with neural architectures that can learn certificates. The synthesized HIL-HIP controller for automated insulin delivery in Type 1 Diabetes could reduce hypoglycemia by 3.8% compared to standard model predictive control.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionIn-SRAM computing promises energy efficiency, but circuit nonlinearities and PVT variations pose major challenges in designing robust accelerators. To address this, we introduce OPTIMA, a modeling framework that aids in analyzing bit-line discharge and power consumption in 6T-SRAM-based accelerators. It provides insights into limiting factors and enables fast design-space exploration of circuit configurations. Leveraging OPTIMA for in-SRAM multiplications yields a ∼100× simulation speed-up while maintaining an average modeling error of 0.56mV. Exploration yields an optimized multiplier with 1.02pJ energy consumption per 4-bit operation and, when applied in quantized DNNs, classification accuracies of 71.91% (top-1) and 90.72% (top-5) on ImageNet and 92.57% on CIFAR-10.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionEfficient quantum arithmetic circuits are commonly found in numerous quantum algorithms of practical significance. To date, logarithmic-depth quantum adders include a constant coefficient k>=2, achieving a Toffoli depth of k·log(n)+O(1). By extensively studying alternative compositions of the carry-propagation structure, we show that an exact Toffoli depth of log(n)+O(1) is achievable when no uncomputation is done. This is a reduction in Toffoli depth of almost 50% compared to the best-known quantum adder circuits to date. We demonstrate a further possible design by incorporating a different expansion of the propagate and generate forms. Our designs, both with and without uncomputation, are presented in detail. By conducting comprehensive theoretical and simulation-based studies, we firmly establish our claims of optimality. The results also mirror similar improvements recently reported in classical adder circuit complexity.
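The propagate/generate carry-propagation structure being optimized has a well-known classical analogue. The sketch below is a classical Kogge-Stone prefix adder with log2(n) combination levels, shown purely to make that structure concrete; the paper's contribution is reducing the Toffoli depth of its quantum counterpart:

```python
def kogge_stone_add(a, b, n=8):
    # Per-bit propagate (p) and generate (g) signals
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    d = 1
    while d < n:                        # log2(n) prefix-combination levels
        G = [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(n)]
        P = [P[i] & P[i - d] if i >= d else P[i] for i in range(n)]
        d *= 2
    carry = [0] + G[:n - 1]             # carry into each bit position
    return sum((p[i] ^ carry[i]) << i for i in range(n))

assert kogge_stone_add(100, 55) == 155
assert kogge_stone_add(200, 100) == (200 + 100) % 256   # wraps mod 2^n
```

Each pass of the `while` loop is one level of the prefix tree; in the quantum setting, the combine step `G | (P & G')` is what costs Toffoli gates, which is why its depth is the figure of merit.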
Research Manuscript
EDA
Physical Design and Verification
DescriptionAs VLSI technology continues to scale beyond 5nm, continuing layout reduction of standard cells is strongly demanded. However, standard cells with conventional FinFET or nanosheet-FET structures are hard-pressed to meet this requirement due to the lateral P-FET and N-FET separation. It has been widely accepted that the Complementary FET (CFET), which stacks P-FET on N-FET or vice versa, is a promising technology for achieving this objective. In comparison with synthesizing conventional FET-based standard cells, two prominent optimization tasks in CFET-based multi-row cell synthesis significantly affect cell quality in terms of area and routability: (1) determining transistor folding shapes and (2) determining the placement order of transistors with fully secured vertical (i.e., z-directional) routing space on the stacked FETs as well as the buried power rail (BPR). In this work, we propose an optimal solution to the combined problem of tasks 1 and 2. Precisely, we develop a search-tree-based, area-optimal method of transistor folding and placement, in which we accelerate the cost computation of partial solutions through dynamic programming while performing strict feasibility checking of in-cell vertical routing space by formulating and solving an instance of a network flow problem. Experiments with benchmark circuits show that the CFET cells produced by our cell synthesizer are 14% smaller on average, with 31% shorter total metal length and 52% less use of metal2 for in-cell routing, than the cells produced by the recent state-of-the-art CFET cell generator.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionProcessing convolution layers remains a huge bottleneck for private deep convolutional neural network (CNN) inference for large datasets.
To solve this issue, this paper presents a novel homomorphic convolution algorithm that provides speedup together with communication-cost and storage savings.
We first note that padded convolution provides the advantage of model storage saving, but it does not support channel packing, thereby increasing the amount of computation and communication.
We address this limitation by proposing a novel plaintext multiplication algorithm using the Walsh-Hadamard matrix.
Furthermore, we propose optimization techniques that significantly reduce the latency of the proposed convolution by selecting optimal encryption parameters and applying lazy reduction.
It achieves 1.6-3.8X speedup and reduces the weight storage by 2000-8000X compared to the conventional convolution.
When the proposed convolution is employed for networks such as VGG-16, ResNet-20, and MobileNetV1 on ImageNet, it reduces end-to-end latency by 1.3-2.6X, memory usage by 2.1-7.9X, and communication cost by 1.7-2.0X compared to the conventional method.
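To make the Walsh-Hadamard matrix mentioned above concrete, it can be built by the standard Sylvester recursion; its rows are mutually orthogonal, which is the property such packing schemes exploit. This is background, not the paper's algorithm:

```python
import numpy as np

def hadamard(n):
    # Sylvester recursion: H_{2k} = [[H_k, H_k], [H_k, -H_k]], n a power of two
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

H4 = hadamard(4)
# Rows are mutually orthogonal: H @ H.T = n * I
assert np.array_equal(H4 @ H4.T, 4 * np.eye(4))
```

Because H @ H.T is a scaled identity, multiplying by H is invertible with the same matrix (up to a factor of 1/n), which makes it convenient for packing and unpacking channel data inside a ciphertext.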
Research Manuscript
AI
AI/ML Algorithms
DescriptionPrior work has addressed the problem of confidential inference in decision trees. Both traditional order-preserving cryptography and order-preserving NTRU cryptography have been used to ensure data and model privacy in decision trees. Furthermore, FPGA architectures and implementations have been proposed for implementing such confidential inference algorithms on resource-limited edge platforms such as low-cost FPGA boards. In this paper, we address the challenging problem of scaling order-preserving confidential inference to random forests: ensembles of decision trees meant to improve classification accuracy and reduce overfitting. The paper develops a methodology and an FPGA implementation strategy for scaling up order-preserving cryptography to random forests. In particular, a framework is used to study the multifaceted tradeoffs among the number of trees in the random forest, the strength of the encryption, the accuracy of the inferences, and the resources of the edge platform. Extensive experiments are conducted using the MNIST dataset and the Intel DE10 Standard FPGA board.
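Why order-preserving cryptography suffices for tree inference can be seen with a toy strictly increasing keyed mapping: every threshold comparison at a tree node gives the same branch on ciphertexts as on plaintexts. Real OPE/NTRU schemes are far stronger; the construction below is illustrative only:

```python
import random

def make_ope(domain, key=42):
    # Keyed, strictly increasing random mapping: cumulative positive gaps
    rng = random.Random(key)
    table, acc = [], 0
    for _ in range(domain):
        acc += rng.randint(1, 10)
        table.append(acc)
    return lambda m: table[m]

enc = make_ope(256)
# Strictly increasing, so order is preserved
assert all(enc(i) < enc(i + 1) for i in range(255))
# A decision-tree node 'feature <= 100' branches identically on ciphertexts
assert (enc(37) <= enc(100)) == (37 <= 100)
assert (enc(150) <= enc(100)) == (150 <= 100)
```

The thresholds stored in the tree are encrypted with the same key as the features, so the server can traverse the tree without ever seeing plaintext values.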
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionThree-dimensional integrated circuits (3D ICs) are an important manufacturing technology. In particular, Monolithic 3D (M3D) stands out as a cutting-edge approach that provides higher integration density. However, M3D also introduces several challenges in terms of density and computational complexity. In this paper, we propose a new approach to the inter-tier via placement problem based on optimal transport, which can be efficiently implemented in parallel on GPUs and consequently achieves significant speedup. Moreover, compared with previous methods, our approach also handles high-integration-density circuits more effectively.
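Optimal transport of the kind referenced here is typically made GPU-friendly through entropy regularization and Sinkhorn iterations, which reduce to repeated row/column rescalings of a kernel matrix. A generic sketch under that assumption (not the paper's exact formulation):

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    # Entropy-regularised OT: alternate row/column rescalings of K = exp(-C/eps)
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # the transport plan

# Toy instance: move uniform mass between 3 sources and 3 sinks
cost = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
a = np.full(3, 1 / 3)
b = np.full(3, 1 / 3)
P = sinkhorn(cost, a, b)
assert np.allclose(P.sum(axis=1), a)   # row marginals match
```

Every step is a dense matrix-vector product, which is why such formulations parallelize so well on GPUs compared with combinatorial assignment solvers.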
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIntegrated circuit (IC) power management is a growing challenge for both designers and manufacturers at advanced process nodes. We introduce an analysis-based solution for the chip-finishing flow. This innovative solution provides automated, DRC-clean layout modifications that reduce IR drop without negatively impacting performance or area.
Key metrics for a PnR flow focus on design performance, power, and area (PPA) goals. Using the solution, designers first analyze a chip for EMIR hotspots, then apply automated layout modifications to reduce resistance in these specific areas. These Correct-by-Construction modifications are based on a thorough understanding of available routing tracks and signoff DRC rules, significantly reducing costly design iterations between PnR tools and the final physical verification solution.
In this presentation, we demonstrate the integration of the collaboratively developed Calibre DesignEnhancer tool with Siemens into our design flow, showcasing before-and-after EMIR results for an advanced node with a 30% reduction in IR drop. By reducing the iterations required to correct IR-drop violations and eliminating iterations between PnR and physical verification, the DRC-clean results provided by Calibre DesignEnhancer significantly reduce the time pressure of final design closure while enhancing design quality of results for EMIR improvement.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSeveral novel AI accelerators based on ASIC, FPGA, and resistive-memory devices have recently been demonstrated with promising results. Most of them target only the inference (testing) phase of deep learning. There have been very limited attempts to design a full-fledged AI accelerator capable of both training and inference in real time, due to the highly compute- and memory-intensive nature of the training phase. In this paper, we propose P-ReTi, a novel analog photonic CNN accelerator that uses silicon microdisk-based convolution, photonic phase-change memory, and dense wavelength-division multiplexing for energy-efficient and ultrafast deep learning. Compared to the state of the art, P-ReTi improves CNN throughput, energy efficiency, and computational efficiency by up to 48×, 45×, and 12×, respectively, with trivial accuracy degradation.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSecure Multi-Party Computation (MPC) enables a group of parties to collaboratively compute correct results for target functions while protecting each party's data privacy. SPDZ, a set of mature MPC protocols widely used in machine learning and other scenarios, requires a significant number of Beaver triples for secure multiplications among parties. With no Trusted Third Party (TTP) participating, triple generation constitutes over 92% of the total running time. This paper introduces MPC-PAT, a high-performance pipeline architecture designed for efficient Beaver triple generation. MPC-PAT accelerates random number generation, the hash function, and modular multiplication in two finite fields. Evaluation results from its FPGA implementation demonstrate substantial speedups, ranging from 40x to 99x for basic operations and 2x to 136x for various convolutional networks compared to existing SPDZ works.
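The Beaver-triple multiplication that dominates SPDZ's offline phase can be sketched in a few lines. This is a minimal two-party illustration with an assumed prime field; it shows the protocol whose triples MPC-PAT generates, not the MPC-PAT hardware itself.

```python
# Illustrative sketch (not the MPC-PAT design): two-party Beaver-triple
# multiplication with additive secret sharing over a prime field.
import random

P = 2**61 - 1  # prime modulus (illustrative field size)

def share(x):
    """Split x into two additive shares mod P."""
    r = random.randrange(P)
    return (r, (x - r) % P)

def beaver_mul(x_shares, y_shares, triple_shares):
    """Multiply secret-shared x and y using one Beaver triple (a, b, c = a*b)."""
    (a0, b0, c0), (a1, b1, c1) = triple_shares
    # d = x - a and e = y - b are opened (in a real protocol each party
    # contributes its masked share; here we just sum both halves).
    d = (x_shares[0] - a0 + x_shares[1] - a1) % P
    e = (y_shares[0] - b0 + y_shares[1] - b1) % P
    # Each party's share of x*y; the public d*e term is added by one party only.
    z0 = (c0 + d * b0 + e * a0 + d * e) % P
    z1 = (c1 + d * b1 + e * a1) % P
    return (z0, z1)

# Generating the triple is the expensive offline step MPC-PAT accelerates.
a, b = random.randrange(P), random.randrange(P)
c = (a * b) % P
sa, sb, sc = share(a), share(b), share(c)
triple_shares = ((sa[0], sb[0], sc[0]), (sa[1], sb[1], sc[1]))

xs, ys = share(6), share(7)
z0, z1 = beaver_mul(xs, ys, triple_shares)
# Reconstruction (z0 + z1) mod P yields 6 * 7 = 42.
```

The identity used is x*y = c + d*b + e*a + d*e with d = x - a, e = y - b, which is exactly what the pipeline's generated triples must satisfy.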
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionDetecting and mitigating parasitic leaks in layout designs is critical for ensuring the reliability and performance of integrated circuits (ICs). This presentation introduces a comprehensive approach to identify, analyze, and resolve parasitic leaks, which significantly impact circuit functionality. Our methodology, utilizing TCAD simulation to define parasitic FET conditions and employing Calibre PERC, enabled precise detection and characterization of leakage paths. The validation of this methodology involved establishing a stringent check rule, correlating it with defined parasitic FET conditions, and meticulously applying it within the layout design using Calibre PERC, thereby confirming the methodology's efficacy. Subsequently, this validated approach was applied to three full-chip layout designs, effectively uncovering and addressing over 9000 risky leak spots. Emphasizing the pivotal role of layout design in leak prevention, our integrated approach aims to identify issues early in the design phase. Through this presentation, we aim to offer valuable insights and practical solutions, empowering designers to proactively address parasitic leaks and enhance the reliability and performance of integrated circuits.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionPruning-based model compression is regarded as an essential technique for deploying recent large transformer models in practical services; however, accessing sparse transformer models cannot reach the ideal speed due to frequent memory stalls caused by irregular memory-access patterns. Based on the recent XOR-gate compression, which relaxes the amount of irregular accesses, this work presents a novel partially-structured transformer pruning method dedicated to an interface-friendly compression format. Stall-free memory access is first derived by limiting the number of patches per weight, introducing a new trade-off between model quality and effective memory bandwidth.
Then, partially-structured pruning patterns are deployed to provide a better accuracy-bandwidth trade-off by significantly reducing the number of correction patches. By aggressively adjusting the patch distribution per weight, the number of patches can be made even smaller than the number of weight bits, further increasing the effective bandwidth while achieving similar model accuracy. We demonstrate the proposed stall-free XOR-gate compression schemes on pruned DeiT/BERT models with the ImageNet/SQuAD datasets, presenting the highest effective bandwidth for accessing sparse transformers compared to existing stall-based solutions.
Research Manuscript
Embedded Systems
Time-Critical and Fault-Tolerant System Design
DescriptionPipelining on Edge Tensor Processing Units (TPUs) optimizes the deep neural network (DNN) inference by breaking it down into multiple stages processed concurrently on multiple accelerators. Such DNN inference tasks can be modeled as sporadic non-preemptive gangs with execution times that vary with their parallelism levels. This paper proposes a strict partitioning strategy for deploying DNN inferences in real-time systems. The strategy determines tasks' parallelism levels and assigns tasks to disjoint processor partitions. Configuring the tasks in the same partition with a uniform parallelism level avoids scheduling anomalies and enables schedulability verification using well-understood uniprocessor analyses. Evaluation using real-world Edge TPU benchmarks demonstrated that the proposed method achieves a higher schedulability ratio than state-of-the-art gang scheduling techniques.
Research Manuscript
EDA
Design Verification and Validation
DescriptionCoverage metrics have been widely adopted to quantify the completeness of hardware verification. Recently, coverage-guided fuzzing has emerged as a popular method for automatically creating test inputs that reach higher verification coverage. However, we observe that its effectiveness on CPUs is hindered by limited sources of seed corpora and limited mutation efficiency. To broaden the fuzzing horizons, this paper proposes the PathFuzz framework, incorporating an efficient input format for fuzzing CPUs, the footprint memory, with seed corpora drawn from real-world large-scale programs. Experiments demonstrate that PathFuzz reaches over 95% verification coverage, with four long-standing bugs newly identified in two well-known open-source CPU designs.
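The coverage-guided loop the paper builds on can be made concrete with a toy sketch. The target, coverage model, and mutators below are invented for illustration; PathFuzz's footprint-memory input format is not reproduced here.

```python
# A generic coverage-guided fuzzing loop (illustrative of the paradigm,
# not the PathFuzz framework): an input joins the corpus only when it
# reaches coverage points no earlier input has reached.
import random

random.seed(0)  # deterministic run for illustration

def target(data):
    """Toy design-under-test: returns the set of covered branch ids."""
    cov = {0}
    if len(data) >= 2:
        cov.add(1)
    if data and data[0] & 0x80:
        cov.add(2)
        if len(data) >= 2 and data[1] & 0x80:
            cov.add(3)
    return cov

def mutate(data):
    """Byte-level mutations: bit flip, insert, or delete."""
    data = bytearray(data)
    op = random.randrange(3)
    if op == 0 and data:
        data[random.randrange(len(data))] ^= 1 << random.randrange(8)
    elif op == 1:
        data.insert(random.randrange(len(data) + 1), random.randrange(256))
    elif op == 2 and len(data) > 1:
        del data[random.randrange(len(data))]
    return bytes(data)

def fuzz(seeds, rounds=5000):
    corpus, seen = list(seeds), set()
    for s in corpus:
        seen |= target(s)
    for _ in range(rounds):
        candidate = mutate(random.choice(corpus))
        cov = target(candidate)
        if cov - seen:          # new coverage: keep this input as a seed
            seen |= cov
            corpus.append(candidate)
    return seen

coverage = fuzz([b"\x00\x00"])
```

The seed alone covers branches {0, 1}; keeping only coverage-increasing inputs lets the loop climb to the guarded branches, which is the feedback mechanism the paper strengthens with better seeds and input formats.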
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRouting is a crucial step in designing a printed circuit board (PCB). However, evaluating and comparing automatic PCB routers is challenging due to the lack of common datasets. Consequently, many publications use small, undisclosed test sets to compare themselves against baselines. To address these issues, we curate PCBench, a dataset of PCB routing problems and solutions built after stringent quality checks on community-endorsed, open-source PCB designs. Our dataset is 100 times larger than those used in existing publications and can be further expanded by the community using our scripts, which support multiple input formats. To allow more computational professionals, especially those from the Machine Learning (ML) community, to advance PCB routing research, we propose a JSON-based syntax that represents PCB routing problems and their solutions using common math objects and data structures that anyone with basic computing training can understand. We hope this dataset can facilitate PCB routing research, especially research using ML.
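Since the abstract does not reproduce the proposed syntax, the field names below are purely hypothetical; they only illustrate how a routing problem and its solution can be captured with ordinary math objects (points, polygons, paths) in JSON.

```python
# Hypothetical JSON-style encoding of a PCB routing problem and solution
# (illustrative only; not the actual PCBench schema).
import json

problem = {
    "board_outline": [[0, 0], [50, 0], [50, 30], [0, 30]],   # polygon, mm
    "layers": ["F.Cu", "B.Cu"],
    "nets": [
        {"name": "CLK", "pads": [{"at": [5, 5], "layer": "F.Cu"},
                                 {"at": [40, 20], "layer": "F.Cu"}]},
    ],
    "rules": {"clearance": 0.2, "track_width": 0.25},        # mm
}
solution = {
    "CLK": [{"layer": "F.Cu", "path": [[5, 5], [5, 20], [40, 20]]}],
}
text = json.dumps(problem)          # serializes with no custom types
restored = json.loads(text)         # round-trips losslessly
```

The point is that nothing beyond lists, dicts, strings, and numbers is needed, which is what makes such a format approachable for ML researchers without EDA tooling.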
Research Manuscript
EDA
Physical Design and Verification
DescriptionWith the emergence of chiplet technology, the scale of IC packaging design has been steadily increasing, making the utilization of traditional design rule checking (DRC) methods more time-consuming. In this paper, we propose PDRC, a package-level design rule checker for non-manhattan geometry with GPU acceleration.
PDRC employs hierarchical interval lists within an iterative parallel sweepline framework to implement the geometric intersection algorithm at the core of design rule checking. Experimental results demonstrate a 30-50x speedup for PDRC compared with two CPU-based checkers.
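The sweepline-with-interval-list idea can be illustrated with a minimal sequential sketch (Manhattan rectangles only; PDRC's hierarchical interval lists, non-Manhattan support, and GPU parallelization are beyond this toy).

```python
# Minimal sweepline intersection check: sweep left edges in x order while
# keeping an active list of y-intervals whose rectangles are still "open".
def overlapping_pairs(rects):
    """rects: list of (x1, y1, x2, y2); returns index pairs whose areas overlap."""
    events = sorted(range(len(rects)), key=lambda i: rects[i][0])
    active, hits = [], []
    for i in events:
        x1, y1, x2, y2 = rects[i]
        # Retire rectangles the sweepline has fully passed.
        active = [j for j in active if rects[j][2] > x1]
        for j in active:
            if rects[j][1] < y2 and y1 < rects[j][3]:  # y-interval overlap
                hits.append((j, i))
        active.append(i)
    return hits

pairs = overlapping_pairs([(0, 0, 4, 4), (2, 2, 6, 6), (5, 0, 7, 1)])
# pairs == [(0, 1)]: only the first two rectangles overlap.
```

Design-rule checks such as spacing reduce to exactly this kind of interval-overlap query, which is why the sweepline plus interval-list structure parallelizes well.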
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionPower estimation has become a critical metric for design evaluation, and the focus is now on both average and peak power. Strategies such as clock gating that reduce average power may also reduce peak power. With the focus on average power, strategies that reduce peak power with little or no impact on average power are often not used. Neglecting high peak power can lead to increased packaging cost or even failure.
In this paper, we propose a peak-power optimization technique that re-schedules datapath operators across cycles. Cycle-accurate peak-power analysis at RTL was used to identify the peak-power region using RTL power-analysis tools such as PowerPro. Waveform reconstruction using the recon engine generated an active-operator profile for the identified region. Based on the knowledge of the active operators causing peak power, the RTL was hand-modified and checked for correctness using formal verification tools. Cycle-accurate peak-power analysis of the modified RTL was then performed to validate the impact on peak power.
The results clearly show that the same RTL functionality was achieved with lower peak power, with no noticeable impact on the design's average power.
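The rescheduling idea can be shown with a toy power profile; the operator names, costs, and cycle assignments below are invented for illustration.

```python
# Toy model of datapath-operator rescheduling: moving a slack operator out
# of the peak cycle lowers peak power while total (average) power is unchanged.
ops = {"mul1": (8, 2), "mul2": (8, 2), "add1": (2, 2), "add2": (2, 3)}  # (power, cycle)

def cycle_power(sched):
    """Sum operator power per cycle to build the power profile."""
    prof = {}
    for p, c in sched.values():
        prof[c] = prof.get(c, 0) + p
    return prof

before = cycle_power(ops)        # {2: 18, 3: 2} -> peak of 18 in cycle 2
ops["mul2"] = (8, 3)             # retime mul2 into cycle 3 (it has slack)
after = cycle_power(ops)         # {2: 10, 3: 10} -> peak drops to 10
```

The total across cycles is identical before and after, which is why the paper observes lower peak power with no noticeable change in average power.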
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionAnalog routing is crucial for performance optimization in analog circuit design, but conventionally takes significant development time and requires design expertise. Recent research has attempted to use machine learning (ML) to generate guidance to preserve circuit performance after analog routing. These methods face challenges such as expensive data acquisition and biased guidance. In this paper, we introduce AnalogFold, a new paradigm of analog routing leveraging ML-enabled performance-oriented routing guidance. Our approach learns performance-driven routing guidance and uses it to help automatic routers for performance-driven routing optimization. We propose to use a 3DGNN that incorporates cost-aware distance to make accurate predictions on post-layout performance. A pool-assisted potential relaxation process derives the effective routing guidance. The experimental results on multiple benchmarks under the TSMC 40nm technology node demonstrate the superiority of the proposed framework compared to the cutting-edge works.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionHuffman decoding is crucial in data compression, and the self-synchronization-based parallel decoding algorithm enables subsequence-level parallelism. This paper introduces PHD, the first accelerator designed for self-synchronization-based parallel Huffman decoding on the Field-Programmable Gate Array (FPGA). Designing PHD poses challenges, including managing fine-grained parallelism, addressing limited on-chip memory, and handling inter-codeword dependency. PHD incorporates bit-level, subsequence-level, and tile-level parallelism, utilizes hybrid memory to store the codebook efficiently, and introduces the ONCE MORE optimization to reduce decoding loop iterations. Experimental results demonstrate that PHD outperforms the state-of-the-art GPU-based baseline regarding latency (9.4X to 12.8X reduction) and energy consumption (12.4X to 18.2X reduction).
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionOptimizing chip Power, Performance, Area, Schedule, and Cost (PPASC) is crucial to stay competitive in today's rapidly evolving technological landscape. Optimal design PPASC, require designers to explore large design space of functional, physical and process parameters that have complex relationships amongst themselves and optimize designs for conflicting goals iteratively. The quality of results is highly dependent on engineering expertise and limited by schedule and cost priorities. AI techniques can augment physical design and optimization effort with capabilities for multi-objective design exploration, replace traditional iterative feedback cycles with data driven insights and automate manual tasks with pattern recognition capabilities. The use of AI in backend design can enhance efficiency, quality, reliability and have higher chance of reaching PPASC minima as compared to conventional methods.
In this talk we will discuss some of the application of AI in physical design for clock parameter tuning, place and route recipe generation, last mile PPASC optimization, design robustness analysis and design rule fixing and share preliminary results from design testcases. Results indicate measurable benefit in terms of PPASC, design quality and reliability. It also streamlines design process, ensure execution predictability and free-ups engineering resource for higher value tasks paving way for innovative, reliable, and improved PPASC chips that can shape the future of technology.
In this talk we will discuss some of the application of AI in physical design for clock parameter tuning, place and route recipe generation, last mile PPASC optimization, design robustness analysis and design rule fixing and share preliminary results from design testcases. Results indicate measurable benefit in terms of PPASC, design quality and reliability. It also streamlines design process, ensure execution predictability and free-ups engineering resource for higher value tasks paving way for innovative, reliable, and improved PPASC chips that can shape the future of technology.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionQuantum circuit simulations are essential for the verification of quantum algorithms. However, as the number of qubits increases, the memory requirements for performing full state vector simulations grow exponentially, leading to significant latency and energy overhead. Processing-in-Memory (PIM) can efficiently handle quantum circuit simulations.
In this paper, we propose PIANIST. PIANIST leverages UPMEM to implement quantum circuit simulations and features three optimization strategies to overcome the limitations of commercial PIM in quantum circuit simulation. PIANIST achieves average speedups of 2.3x and 16.5x, with 37.2% and 72.5% energy reductions, over the QuEST simulator on CPU for 16- and 32-qubit benchmarks, respectively.
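The kernel underneath any full state-vector simulation, and the strided memory access pattern that makes it attractive for processing-in-memory, can be sketched minimally (pure Python, real amplitudes only for brevity).

```python
# Applying a single-qubit gate touches all 2^n amplitudes in strided pairs:
# the memory-bound pattern that PIM-based simulators target.
import math

def apply_gate(state, gate, target, n_qubits):
    """Apply a 2x2 gate to `target` qubit of an n-qubit state vector, in place."""
    stride = 1 << target
    for base in range(0, 1 << n_qubits, stride << 1):
        for off in range(stride):
            i0, i1 = base + off, base + off + stride
            a0, a1 = state[i0], state[i1]
            state[i0] = gate[0][0] * a0 + gate[0][1] * a1
            state[i1] = gate[1][0] * a0 + gate[1][1] * a1

H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

state = [1.0, 0.0, 0.0, 0.0]      # |00> of a 2-qubit register
apply_gate(state, H, 0, 2)        # Hadamard on qubit 0
apply_gate(state, H, 1, 2)        # Hadamard on qubit 1
# state is now the uniform superposition: all four amplitudes equal 0.5
```

The pair stride doubles with the target qubit index, so high-qubit gates stream over the whole vector, which is exactly where near-memory compute pays off.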
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSilicon photonic networks are revolutionizing computing systems by improving the energy efficiency, bandwidth, and latency of data movement. Optical modulators, such as microresonators and Mach-Zehnder interferometers (MZIs), are the basic building blocks of silicon photonic networks. However, the simulation stage in current optical chip design is highly time-consuming, resulting in low overall design efficiency. In this paper, we propose a PINN (physics-informed neural network) based compact model for on-chip silicon optical devices, which improves simulation efficiency by more than 10 times on average compared to existing modeling methods.
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
DescriptionModern SSD firmware is continuously optimized for higher parallelism to match the growing frontend PCIe bandwidth with more backend flash channels. Although a multi-core microprocessor is typically adopted to concurrently process independent NVMe requests from multiple NVMe queues, the existing one-to-many thread-request mapping model, in which each thread serves one or more incoming I/O requests, has poor scalability due to a severe lock contention problem, especially in cache management.
In this paper, we first conduct preliminary experiments on an open-channel NVMe SSD to exhibit the lock contention problem in the one-to-many thread-request mapping model. When a thread locks a cache line and is waiting for a long-latency flash read to update this cache line, subsequent tasks on other threads that require the same cache line are all blocked to guarantee correctness. To mitigate this, we propose PipeSSD, a lock-free pipeline-based SSD firmware design with a many-to-one thread-request mapping model that assigns multiple threads to serve different stages of each I/O request in a pipelined way. Notably, PipeSSD performs cache updates only in the last pipeline stage to eliminate dependency loops in the pipeline, while maintaining a pilot for each cache line in the first pipeline stage to indicate the cache line's status. With a multi-core architecture, different pipeline stages are processed on different cores communicating via several FIFO queues, which ensures the processing sequence and data consistency without any cache line locks. We implement PipeSSD on real hardware and evaluate its performance on a multi-core NVMe SSD prototype. The evaluation results show that on an 8-core system, PipeSSD delivers a significant throughput improvement over state-of-the-art multi-core SSD firmware.
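The many-to-one pipeline model can be sketched with ordinary threads and FIFO queues. The three stages and their work are invented stand-ins for illustration; firmware stages are of course not Python threads.

```python
# Pipeline stages hand requests over FIFO queues, so no stage ever holds a
# lock across a long-latency operation; cache state is committed last.
import queue
import threading

def stage(fn, q_in, q_out):
    """One pipeline worker: pull a request, process it, pass it downstream."""
    while True:
        req = q_in.get()
        if req is None:                # shutdown sentinel, propagate it
            if q_out is not None:
                q_out.put(None)
            break
        fn(req)
        if q_out is not None:
            q_out.put(req)

done = []
def lookup(req): req["line"] = req["lba"] // 2        # map LBA to a cache line
def flash(req):  req["data"] = "blk%d" % req["lba"]   # "long-latency" flash read
def commit(req): done.append(req["lba"])              # cache update, last stage only

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
workers = [threading.Thread(target=stage, args=a) for a in
           [(lookup, q1, q2), (flash, q2, q3), (commit, q3, None)]]
for w in workers:
    w.start()
for lba in range(4):
    q1.put({"lba": lba})
q1.put(None)                           # flush the pipeline
for w in workers:
    w.join()
```

Because each queue has a single producer and consumer, request order is preserved end to end without any per-cache-line lock, which is the property the firmware design relies on.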
Research Manuscript
AI
Design
AI/ML System and Platform Design
DescriptionThe sophisticated self-attention-based spatial correlation entails a high inference delay cost in vision transformers. To this end, we propose PIVOT, a hardware-algorithm co-optimization framework for input-difficulty-aware attention skipping to optimize the attention bottleneck. The attention-skipping configurations are obtained via an iterative hardware-in-the-loop co-search method. On the ZCU102 MPSoC FPGA, PIVOT achieves 2.7× (1.73×) lower EDP at 0.2% (0.4%) accuracy reduction compared to standard LV-ViT-S (DeiT-S) ViTs. Unlike prior works that require nuanced hardware support, PIVOT is compatible with traditional GPU and CPU platforms, achieving 1.8× higher throughput at 0.4-1.3% higher accuracy compared to prior works.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper addresses two primary challenges in end-to-end object detection within artificial intelligence of things (AIoT) systems: (1) the energy-intensive analog-to-digital converters (ADCs) required to convert analog pixel arrays to digital matrices, and (2) the high volume of data transferred between the sensing unit and the computing unit. Our proposed solution implements an in-sensor binary segmentation model on analog memristive crossbars to identify the important pixels. Additionally, we propose a data transfer scheme that adaptively selects between dense and sparse data transfer formats based on the sparsity ratio measured from the segmentation mask produced by the segmentation model. Our results demonstrate that the proposed object detection system achieves significant energy savings along with a considerable 95% reduction in data transfer, all while maintaining high accuracy.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionDue to the memory wall, memory system performance significantly impacts the user experience of mobile phones. The system cache (SC) is located on the memory side, is shared by all the central processing units (CPUs) and graphics processing units (GPUs) within the mobile phone, and is the last line of defense before resorting to time-consuming off-chip memory access. However, managing the SC is challenging due to the large memory-side working set and irregular access patterns. Although the SC takes up a considerable on-chip area, its effectiveness in terms of hit rate is rather low. It is observed that neither using state-of-the-art cache replacement policies nor enlarging the cache size can significantly benefit the SC. Prefetchers designed for higher-level caches cannot be used by the SC, because the required program counter (PC) is not available on the memory side and/or the aggressive prefetch traffic violates the stringent power constraints of mobile phones. In this study, we propose Planaria, which includes two sub-prefetchers (SLP and TLP) and a coordinator (POC) to simultaneously achieve high prefetching accuracy and coverage. The two sub-prefetchers exploit intra- and inter-page regularities via self and transfer learning, respectively. The coordinator POC explicitly decouples the learning and issuing phases of the sub-prefetchers. The sub-prefetchers are directed by the full pattern but are enabled in an irreversible order. This "parallel training and serial issuing" working fashion effectively increases useful prefetches and reduces useless ones. Experimental results show that Planaria improves overall system performance in terms of instructions per cycle (IPC) by 28.9%, 21.9%, and 15.3% on average over no prefetcher, BOP, and SPP, respectively. Moreover, Planaria incurs only 0.5% power consumption overhead, while BOP and SPP increase power consumption by 13.5% and 9.7%, respectively.
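For context, the offset-based baseline (in the spirit of BOP) that Planaria is compared against can be sketched as follows; the offset set and scoring window are illustrative assumptions, and Planaria's learned sub-prefetchers are not reproduced here.

```python
# Minimal offset prefetcher: score each candidate offset by how often
# (addr - offset) was recently accessed, then prefetch with the best offset.
from collections import deque

class OffsetPrefetcher:
    def __init__(self, offsets=(1, 2, 4), window=64):
        self.offsets = offsets
        self.recent = deque(maxlen=window)     # recent miss addresses
        self.scores = {o: 0 for o in offsets}

    def access(self, addr):
        for o in self.offsets:
            if addr - o in self.recent:        # offset o would have predicted addr
                self.scores[o] += 1
        self.recent.append(addr)
        best = max(self.offsets, key=lambda o: self.scores[o])
        return addr + best                     # issue one prefetch

pf = OffsetPrefetcher()
stream = [10, 12, 14, 16, 18, 20]              # stride-2 access pattern
prefetches = [pf.access(a) for a in stream]
# After warm-up the prefetcher locks onto offset 2 and predicts each next access.
```

Note that this scheme needs only addresses, not the program counter, which is why offset-style prefetching is viable on the memory side at all.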
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionThis work proposes a new countermeasure principle to defend against Dynamic Voltage Frequency Scaling (DVFS) based fault attacks on modern Intel systems. First, we establish that the fundamental cause of DVFS fault attacks is the ability to independently control the frequency and voltage of a processor. Using this observation, we construct a partition of frequency-voltage tuples into unsafe and safe states based on whether a tuple causes timing violations according to switching circuit theoretic principles. Our countermeasure completely prevents DVFS faults on three Intel CPU generations: Sky Lake, Kaby Lake R, and Comet Lake. Further, it can be deployed either as microcode or as model-specific registers at the hardware level, unlike previous countermeasures. Finally, we evaluate our countermeasure on SPEC2017, measuring a minuscule overhead of 0.28%.
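The unsafe-safe partition can be illustrated with the textbook alpha-power delay model; the constants below are illustrative fits, not measured Intel parameters.

```python
# Classify (frequency, voltage) operating points: a pair is unsafe when the
# critical-path delay at that voltage exceeds the clock period, i.e. when a
# fault-inducing timing violation can occur.
V_TH, K, ALPHA = 0.3, 0.1e-9, 2.0   # threshold voltage, fit constant, exponent

def path_delay(v):
    """Critical-path delay (seconds) at supply voltage v, alpha-power law."""
    return K * v / (v - V_TH) ** ALPHA

def is_safe(freq_hz, v):
    return path_delay(v) <= 1.0 / freq_hz

# Partition a small grid of DVFS operating points into the safe set.
safe = {(f, v) for f in (1e9, 2e9, 3e9)
               for v in (0.6, 0.8, 1.0, 1.2)
               if is_safe(f, v)}
```

An attack such as undervolting at a fixed frequency moves the operating point across this boundary; the countermeasure amounts to refusing any tuple outside the safe set.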
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe CoreSight architecture is an integral part of any processor-based design. The ARM CoreSight architecture is a solution for debug and trace of complex SoCs. It provides a set of standard interfaces and programmer model views, enabling partners to define CoreSight components and integrate them within the CoreSight architecture. SoC DV effort increases to set up the verification stimulus whenever the ARM architecture changes. An automated verification testbench is one of the best ways to ease the DV effort for CoreSight-related test sequences. This paper presents a generic, parameterized, automated SoC DV environment that accelerates DV bring-up and reduces the verification cycle time. The plug-and-play testbench supports various ARM-architecture-based CoreSight systems. Very minimal manual intervention and only a few user inputs are required to implement the verification testbench for a target SoC. The testbench can be plugged in to verify the debug data path in an SoC design and to close verification ahead of the project deadline.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionMemory partitioning is a widely used technique to reduce access conflicts on multi-bank memory in high-level synthesis. Previous memory partitioning methods mainly focus on a given access pattern extracted from stencil applications. Restricted by the pattern shape, these methods are prone to sub-optimal bank numbers or large overhead on address generation. In this work, we propose a pattern-morphing-based memory partitioning method, PMP, that only requires reduced hyperplane families to achieve the minimal bank number. To reduce the side effect of extra data padding, an integer linear programming problem is formulated for pattern morphing. Compared to the previous hyperplane-based memory partitioning, the experimental results show that our approach could achieve the optimal partition factor while saving 22% in LUTs, 21% in FlipFlops, 10% in DSPs, and 40% in memory overhead, on average.
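The hyperplane idea behind such partitioning can be shown in a few lines: map array element (i, j) to bank (a*i + b*j) mod N and check that every placement of the access pattern hits distinct banks. The pattern and coefficients below are illustrative, not PMP's morphed patterns.

```python
# Hyperplane-based bank mapping for multi-bank memory partitioning.
def bank(i, j, a, b, n):
    """Bank assignment of element (i, j) under hyperplane (a, b) with n banks."""
    return (a * i + b * j) % n

def conflict_free(a, b, n):
    """True if a 2x2 access window hits n distinct banks at every anchor."""
    pattern = [(0, 0), (1, 0), (0, 1), (1, 1)]
    for i in range(8):
        for j in range(8):
            banks = [bank(i + di, j + dj, a, b, n) for di, dj in pattern]
            if len(set(banks)) != len(banks):
                return False
    return True

ok = conflict_free(1, 2, 4)    # (i + 2j) mod 4 separates every 2x2 window
bad = conflict_free(1, 1, 4)   # (i + j) mod 4 collides within the window
```

Finding coefficients that reach the minimal bank number for a given pattern, and morphing the pattern when no cheap hyperplane exists, is the optimization the paper formulates as an ILP.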
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionTSVs create repeated placement and routing blockages in a design.
Special care is needed to handle the floorplan and power-plan challenges that TSVs introduce.
With an increased number of blockages and TSV islands, pre-place cell insertion and power-plan runtime increase.
Module splits near TSVs make timing closure and multi-bit merging difficult.
Routes detoured around TSV areas cause timing degradation.
Larger IPs need special planning to ensure an adequate power supply in the top die.
With increased PnR runtime, handling PnR in 3D is the difficult part.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionGenerating high-quality sub-circuits for local substitution is an effective optimization technique in logic synthesis. There have been abundant works on generating area- and delay-optimal sub-circuits, greatly enhancing logic optimization quality. However, power-oriented sub-circuit generation is rarely discussed, while optimizing power consumption in this sub-15 nm era is of paramount interest. We propose PONO, an SMT-based near-optimal sub-circuit generation flow for power optimization. PONO enables power-oriented circuit library building and fills the gap in generating circuits near the Pareto frontier in PPA (Power, Performance, and Area). It manifests superiority in power reduction over the traditional approach in rewrite, a key logic optimization algorithm. We test PONO on the EPFL benchmarks, and it shows 8.7% less power consumption with comparable performance and area after placement and routing.
Tutorial
Security
DescriptionPost-Quantum Cryptography (PQC) encompasses cryptographic algorithms, typically public-key algorithms, designed to be secure against quantum and classical computers. Motivated by the threat posed by quantum computing to the security of most public-key algorithms currently in use, the National Institute of Standards and Technologies (NIST) started in December 2016 the PQC Standardization Process, a public competition for selection of public-key cryptosystems designed to resist attacks by a quantum computer. After three rounds of competition, in July 2022, NIST announced the first four proposals to be standardized, which include one key-establishment mechanism (i.e., CRYSTALS-Kyber) and three digital signatures (i.e., CRYSTALS-Dilithium, Falcon and SPHINCS+). CRYSTALS-Kyber and CRYSTALS-Dilithium are the primary algorithms recommended for most use cases, while Falcon and SPHINCS+ are proposed for use cases that require small signatures and non-lattice-based signatures, respectively. Shortly after NIST's announcement, in September 2022, the National Security Agency (NSA) published the Commercial National Security Algorithm Suite (CNSA) 2.0 advisory on protection of National Security Systems (NSS), which includes the approved PQC algorithms and the transition timeline. In August 2023, NIST requested public comments on the drafts of the standards derived from CRYSTALS-Kyber, CRYSTALS-Dilithium, and SPHINCS+.
This tutorial aims to introduce the audience to the implementation attacks published in the literature against the primary PQC algorithms to be standardized by the National Institute of Standards and Technologies (NIST) and approved by the National Security Agency (NSA) for national security systems (i.e., Kyber and Dilithium) as well as countermeasures against these implementation attacks. Other PQC standardization efforts will be mentioned. The goal is to prepare the hardware security community with the information required to do research in this field, play an active role in the remaining steps of the standardization process, and support secure deployment of PQC.
Research Panel
EDA
DescriptionDeloitte predicts that the semiconductor industry will face a significant workforce gap, with over 1 million additional jobs needed by 2030. Workforce development is critical, especially chip design talent (including logic and circuits, design verification, testing and CAD, and embedded software), an area that is typically overlooked but is a keystone in the semiconductor supply chain and essential for technological advancement. The growing demand for design engineers is fueled by industry's emphasis on custom silicon and the realization that any advancement in software is enabled by hardware. This panel will address workforce development successes, barriers, critical research areas, and government funding opportunities for cross-sector collaborations to ensure a resilient chip design workforce, with a focus on the following:
- Fundamental Research: Unprecedented demand and development of semiconductors demonstrates an opportunity to drive system-level performance improvements in computing. What are some strategies to cultivate research capabilities to address the next grand challenges in chip design and innovative ways to nurture PhD talent?
- Student Attraction & Retention: How do we enhance awareness about chip design to attract diverse talent? What strategies and resources can be implemented to build early excitement around a career in chip design and address the leaky talent pipeline through retention/retraining initiatives?
- Training & Education: What are the best ways to strategically incorporate IC design in early education? What programs or methods can be developed to enable our educators?
- Building Infrastructure: How can we overcome barriers like costly cloud services and fabrication resources to make educational infrastructure more accessible? What are some strategies to promote collaboration in education while centralizing, scaling and subsidizing infrastructure?
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionPowerdash is an innovative push-button framework designed to serve as a holistic solution for System-on-Chip (SOC) power analyses. This framework offers a range of capabilities that enable SOC designers to efficiently analyze and manage power-related data.
Powerdash excels in delivering ultra-low latency parsing tools that can effectively interpret Joules and Voltus power reports. It automates the process of collecting this data and seamlessly integrates it into a centralized Oracle database. Furthermore, it provides an intuitive visualization feature, enabling users to present power data in the form of informative charts via the Spotfire Dashboard.
Currently, Powerdash supports essential functionalities, including Library PPA Analysis and Usecase power tracking across multiple compile cycles and Process-Voltage-Temperature (PVT) scenarios. This paper discusses the framework's features, benefits, and its contribution to streamlining SOC power analysis processes, making it a valuable asset for SOC designers and engineers.
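The parse-then-load flow such a framework automates can be sketched as follows; note that the report excerpt and column names here are hypothetical assumptions for illustration, since actual Joules/Voltus report formats differ:

```python
import re

# Hypothetical report excerpt: real Joules/Voltus report formats differ;
# this only illustrates the parse-then-load flow Powerdash automates.
report = """\
Instance        Leakage(mW)  Internal(mW)  Switching(mW)
top/cpu         1.23         4.56          2.10
top/mem         0.40         1.80          0.95
"""

ROW = re.compile(r"^(\S+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")

def parse_power_report(text):
    """Turn a tabular power report into records ready for a database load."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        m = ROW.match(line)
        if m:
            inst, leak, internal, switch = m.groups()
            rows.append({"instance": inst,
                         "leakage_mw": float(leak),
                         "internal_mw": float(internal),
                         "switching_mw": float(switch)})
    return rows

rows = parse_power_report(report)
total_mw = sum(r["leakage_mw"] + r["internal_mw"] + r["switching_mw"]
               for r in rows)
```

In a deployment, the resulting records would be bulk-inserted into the central database and picked up by the dashboard layer.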
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionTo address the power management challenges in deep neural networks (DNNs), dynamic voltage and frequency scaling (DVFS) technology is garnering attention for its ability to enhance energy efficiency without modifying the structure of DNNs. However, current DVFS methods, which depend on historical information such as processor utilization and task computational load, face issues like frequency ping-pong, response lag, and poor generalizability. Therefore, this paper introduces PowerLens, an adaptive DVFS framework. Initially, we develop a power-sensitive feature extraction method for DNNs and identify critical power blocks through clustering based on power behavior similarity, thereby achieving adaptive DVFS instrumentation point settings. Then, the framework adaptively presets the target frequency for each power block through a decision model. Finally, through a refined training and deployment process, we ensure the framework's effective adaptability across different platforms. Experimental results confirm the effectiveness of the framework in energy efficiency optimization.
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionEfficient power grid analysis is critical in modern VLSI design. It is computationally challenging because it requires solving large linear equations with millions of unknowns. Iterative solvers are more scalable, but their performance relies on preconditioners. Existing preconditioning approaches suffer from either high construction cost or slow convergence rate, both resulting in unsatisfactory total solution time. In this work, we propose an efficient power grid simulator based on fast randomized Cholesky factorization, named PowerRChol. We first propose a randomized Cholesky factorization algorithm with provable linear-time complexity. Then we propose a randomized factorization oriented matrix reordering approach. Experimental results on large-scale power grids demonstrate the superior efficiency of PowerRChol over existing iterative solvers, showing 1.51X, 1.93X and 3.64X speedups on average over the original RChol [3], feGRASS [11] and AMG [14] based PCG solvers, respectively. For instance, a power grid matrix with 60 million nodes and 260 million nonzeros can be solved (at a 1E-6 accuracy level) in 148 seconds on a single CPU core.
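For orientation, the role of the preconditioner in such a solver can be illustrated with a minimal preconditioned conjugate gradient (PCG) loop. This is a generic NumPy sketch of the solver that a (randomized) Cholesky factor would plug into, not PowerRChol's factorization itself:

```python
import numpy as np

def pcg(A, b, apply_preconditioner, tol=1e-6, max_iter=1000):
    """Preconditioned conjugate gradient for a symmetric positive definite A.
    apply_preconditioner(r) approximates M^{-1} r, e.g. by solving
    L L^T z = r with a (randomized) Cholesky factor L."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_preconditioner(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = apply_preconditioner(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy SPD system; an exact Cholesky factor makes the preconditioner
# perfect, so the loop converges immediately.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
L = np.linalg.cholesky(A)
precondition = lambda r: np.linalg.solve(L.T, np.linalg.solve(L, r))
x = pcg(A, b, precondition)
```

The better the factor approximates A at low construction cost, the fewer iterations the loop needs, which is exactly the trade-off the abstract describes.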
Research Manuscript
EDA
Timing and Power Analysis and Optimization
DescriptionAccurate and efficient power analysis at early VLSI design stages is critical for effective power optimization. It is a promising yet challenging task, especially during placement stage with the clock tree and final signal routing unavailable. Additionally, optimization-induced circuit transformations like circuit restructuring and gate sizing can invalidate fine-grained power supervision. Addressing these, we introduce the first generalizable circuit-transformation-aware power prediction model at placement stage. Compared to the cutting-edge commercial IC engine Innovus, we have significantly reduced the cross-stage power analysis error between placement and detailed routing.
Research Manuscript
EDA
Physical Design and Verification
DescriptionToday's place-and-route (P&R) flows are increasingly challenged by complexity and scale of modern designs. Often, heuristics must trade off between turnaround time and quality of PPA outcomes. This paper presents a clustered placement methodology that improves both turnaround time and final-routed solution quality. Our PPA-aware clustering considers timing, power and logical hierarchy during netlist clustering, effectively reducing problem size and accelerating global placement runtime while improving post-route PPA metrics. Additionally, our machine learning (ML)-accelerated virtualized P&R (V-P&R) methodology predicts the best cluster shapes (i.e., aspect ratios and utilizations) to use in P&R of the clustered netlist. With the open-source OpenROAD tool, our methods achieve up to 47% (average: 36%) global placement runtime improvement with similar half-perimeter wirelength (HPWL) and 90% (29%) improvement in post-route total negative slack (TNS). With the commercial Cadence Innovus tool, our methods achieve up to 1.68% (0.00%) improvement in power and 94% (44%) improvement in TNS.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionGraph Neural Networks (GNNs) are increasingly used in fields like social media and bioinformatics, promoting the prosperity of cloud-based GNN inference services. Nevertheless, data privacy becomes a critical issue when handling sensitive information. Fully Homomorphic Encryption (FHE) enables computations on encrypted data, while privacy-preserving GNN inference generally necessitates ensuring the confidentiality of graph structure data and maintaining computation precision, both of which are computationally expensive in FHE. Existing FHE-based GNN inference schemes are hindered by computational overhead, accuracy degradation, or incomplete data protection. This paper presents PPGNN to address these challenges all at once. We first propose a novel privacy-preserving GNN inference algorithm that utilizes a high-accuracy arithmetic-and-logic FHE approach while requiring much smaller parameters, substantially reducing computational complexity and facilitating parallel processing. Correspondingly, a dedicated hardware architecture has been designed to implement these innovations, featuring specialized units for arithmetic and logic FHE operations in a pipelined manner. Collectively, PPGNN achieves 2.7× and 1.5× speedups over state-of-the-art arithmetic-FHE and logic-FHE accelerators while ensuring high accuracy, with about 18× energy reduction on average.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionOptical side-channel analysis poses a significant threat to the security of integrated circuits (ICs) by enabling the disclosure of secret data, such as encryption keys. In our work, we present a multiphysics simulation framework of optical side-channel analysis from the layout database of a fabricated testchip. By leveraging accurate device models and electro-photonic physics, our framework models the photon emission behavior in ICs and enables the statistical correlation of emitted photon patterns with secret keys. In our proposed solution, we begin by analyzing the device's layout under test and simulating the channel current of NMOS devices under various stimuli. By generating photon images based on pre-characterized models, we overlay individual photon images on the connected polysilicon ground metal. Through lossless image processing, we extracted photon intensity patterns from collected photon emission heatmaps and then performed correlation-based photon emission analysis (CPEA) to disclose the security key byte by byte. Our framework enables IC designers to assess the risks associated with optical side-channel attacks and develop efficient countermeasures at the pre-silicon stage.
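The correlation step of such an analysis can be sketched in the style of classical correlation power analysis. The simulation below assumes a simple Hamming-weight leakage model for the photon intensity; the key byte, trace count, and noise level are illustrative assumptions, not the paper's testchip flow:

```python
import numpy as np

rng = np.random.default_rng(0)
HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weights

# Simulated measurement campaign: the photon intensity is assumed to leak
# the Hamming weight of (plaintext XOR key) for one key byte, plus noise.
true_key = 0x3C
plaintexts = rng.integers(0, 256, size=2000)
intensity = HW[plaintexts ^ true_key] + rng.normal(0.0, 0.5, size=plaintexts.size)

def cpea_recover_byte(plaintexts, intensity):
    """Rank all 256 key hypotheses by the Pearson correlation between the
    predicted leakage and the measured photon intensity; the correct byte
    should yield the highest correlation."""
    best_key, best_corr = None, -1.0
    for k in range(256):
        predicted = HW[plaintexts ^ k]
        corr = abs(np.corrcoef(predicted, intensity)[0, 1])
        if corr > best_corr:
            best_key, best_corr = k, corr
    return best_key

recovered = cpea_recover_byte(plaintexts, intensity)
```

Repeating the ranking per key byte recovers the full key, which is the "byte by byte" disclosure the abstract refers to.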
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn response to the imperative for quantum-era data security, NIST standardized CRYSTALS-Kyber as a key-establishment algorithm. Despite mathematical robustness, potential side-channel vulnerabilities in CRYSTALS-Kyber risk data exposure. Pre-silicon evaluation of its resistance to side-channel attacks is crucial for real-world security. Current post-silicon validation practices may necessitate costly modifications. Our work presents the first pre-silicon power side-channel analysis of CRYSTALS-Kyber at the RTL level, quantifying leakage and identifying vulnerable modules. The NTT (Number Theoretic Transform) module mirrors the overall design's leakage pattern, suggesting modules in NTT multiplication contribute to the highest leakage, posing a potential threat to the secret key. These findings guide proactive measures in early design, fortifying defenses against RTL-level side-channel attacks.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionToday, the semiconductor design industry is centered around the use of EDA tools. These tools provide the necessary information and automation for a design engineer to do their work effectively. The automation of design processes is especially significant and has been key to the success of the industry. However, process automation comes at the cost of large compute resource requirements. These requirements will only increase as the industry continues to automate more processes. Therefore, the way a semiconductor design company manages their compute resources is and will continue to be essential to their success.
This presentation describes two systems developed for predicting EDA tool resource usage: the first relies on a conventionally engineered "recently used" algorithm, and the second is centered around a machine learning framework. Covered topics include comparisons of algorithm complexity and prediction accuracy for key compute resources such as memory usage.
Research Manuscript
EDA
Design Verification and Validation
DescriptionThe IC3 algorithm, also known as PDR, has made a significant impact in the field of safety model checking in recent years due to its high efficiency, scalability, and completeness. The most crucial component of IC3 is inductive generalization, which involves dropping variables one by one and is often the most time-consuming step. In this paper, we propose a novel approach to predict a possible minimal lemma before dropping variables by utilizing the counterexample to propagation (CTP). By leveraging this approach, we can avoid dropping variables when the prediction succeeds. A comprehensive evaluation demonstrates a commendable success rate in lemma prediction and a significant performance improvement achieved by our proposed method.
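The baseline literal-dropping loop that inductive generalization performs, and that a successful lemma prediction would short-circuit, can be sketched as follows. The `is_inductive` callback stands in for the relative-induction SAT query; its toy definition below is purely illustrative:

```python
def generalize(cube, is_inductive):
    """IC3-style inductive generalization: try dropping literals one at a
    time, keeping each drop whenever the shrunk cube is still inductive
    relative to the current frame.  `is_inductive` stands in for the SAT
    query "F_i AND not(c) AND T AND c' is UNSAT"."""
    lits = list(cube)
    i = 0
    while i < len(lits):
        candidate = lits[:i] + lits[i + 1:]
        if candidate and is_inductive(frozenset(candidate)):
            lits = candidate   # literal dropped; stay at index i
        else:
            i += 1             # literal is needed; move on
    return frozenset(lits)

# Toy oracle: pretend any cube containing literal "x1" is inductive.
minimal = generalize(["x1", "x2", "x3"], lambda c: "x1" in c)
```

Each iteration costs one SAT call, which is why predicting the minimal lemma up front can save substantial time.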
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionModern disaggregated data centers have grown beyond CPU nodes to provide their customers with domain-specific accelerators (DSAs) such as GPUs, NPUs, and FPGAs. Existing CPU-based TEEs such as Intel SGX or AMD SEV do not provide sufficient protection. DSA TEEs such as Nvidia CC address only tightly coupled CPU-DSA systems, with a proprietary solution. On the other hand, existing academic proposals are tailored toward specific CPU-TEE platforms.
To bridge this lack of generality, in this paper, we investigate the feasibility of \textit{enclaved} execution across multi-tenant heterogeneous nodes, extending beyond TEE-enabled CPUs. Wide-scale TEE support for accelerators seems a straightforward solution but is far from being a reality.
In this paper, we investigate the fundamental design principles for enabling hardware-backed isolated and attestable instances, a.k.a. enclaves, that provide isolation of code and data from an attacker-controlled host software stack (OS/VMM). We prototype custom TEE hardware support for two kinds of accelerators, an NPU and an SSD, with low overhead, demonstrating the feasibility of adding TEE support to existing accelerators. Moreover, we evaluated our prototype with real-world AI and storage workloads and observed 1-16% overhead.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe budget is constrained by process variations at the 5nm technology node and beyond, and the self-aligned via process will require greater attention. A probability model of via-metal open-circuit defects is proposed to quantify the main uncertainty factors accurately. CD variabilities of lithography processes and overlay-induced displacements along the critical direction are considered. Our model outperforms the Monte Carlo method, achieving an average deviation below 0.1% while being at least two orders of magnitude faster. Our probability model can lead to a more robust design, enhancing overall pattern quality with a short turn-around time.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe IO ring is a key element of any SoC design. It consists of specific cells that must adhere to all the integration guidelines in the library databooks. Building an IO ring is a challenging task for SoC designers: the ring must be assembled manually with all the required cells and rules, and then validated against all the integration guidelines.
The proposed Programmable IO Ring Builder and Checker [PRBC] tool builds the IO ring automatically, incorporating the necessary cells and adhering to the integration guidelines, with built-in validation.
The detailed workflow of the tool is described, including the input details, the cell placement strategy, and the validated output sections.
The tool gives SoC designers a unique advantage: it not only reduces the cycle time of IO ring development but also ensures compliance with all the IO integration rules.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionThis presentation explores the challenges of IO cell design and its accurate characterization in the rapidly evolving field of chip design, where the demand for smaller chips and more design functionality has made the work more complex and fast-paced. To overcome these challenges, an improved design approach customizes the last-stage flip-flop of the controller and implements it in the periphery IO architecture, resulting in decreased clock-to-data turnaround time. To address the characterization challenges of the customized design, the paper proposes a pruned-netlist technique that improves run time and the accuracy of extracted constraint values. The results demonstrate the effectiveness of the proposed approach over the conventional characterization flow in terms of run time and accuracy.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionCoarse-Grained Reconfigurable Array (CGRA) is a parallel architecture providing high energy efficiency and spatial-temporal reconfigurability. Beyond loop scheduling for throughput optimization, program transformation is also crucial in CGRA mapping to optimize overall performance and efficiency. However, existing studies on program transformation optimization face challenges in exploring the transformation space systematically and evaluating candidates efficiently, leading to sub-optimal results. To tackle these challenges, this paper introduces PT-Map, an efficient program transformation optimization framework for CGRA mapping. PT-Map defines a comprehensive transformation space and employs a CGRA-specialized top-down exploration approach. It also incorporates a bottom-up evaluation scheme using architectural parameters and a graph neural network-based predictive model. Experiments demonstrate that PT-Map achieves up to 2.95x/1.80x speedups and 59.0%/23.2% energy-delay-product (EDP) reductions over the state-of-the-art approaches MapZero and PBP, respectively.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn the context of remote sensing satellites, power consumption has always posed a significant challenge for storage systems and in-orbit computing. This research pushes computing-in-memory (CIM) towards computational storage to accelerate in-orbit remote sensing image processing. The first improvement utilizes CIM to address the energy-efficiency problem of neural network computing in computational storage. The second introduces Zoned Namespace (ZNS) Solid State Drives (SSDs) to further optimize storage bandwidth. Lastly, a semantic retrieval function is designed at the file system layer to enhance the retrieval capability of the in-orbit satellite storage system. Experimental results demonstrate that the proposed CIM-ZCSD system achieves a three-fold increase in write bandwidth and a nearly 20-fold improvement in computational energy efficiency compared to traditional systems.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionWith the CMOS technology advancing and the complexity of circuits growing, the demand for analog/mixed-signal design automation tools is increasing quickly. Although some tools have been developed to tackle this challenge, the performance degradation caused by process, voltage, and temperature (PVT) variations has been less considered. This paper presents PVTSizing, an optimization framework for PVT-robust analog circuit synthesis. PVTSizing adopts trust region Bayesian optimization (TuRBO) for high-quality initial datasets and reference points. Multi-task reinforcement learning (RL) is utilized for PVT optimization. Both TuRBO and RL are batch-friendly, allowing parallel sampling of design solutions. Meanwhile, critic-assisted pruning and zoom target metrics are proposed to improve sample efficiency and reduce runtime. In addition, this framework naturally supports sizing over random mismatch. On 4 real-world circuits with TSMC 28/180nm process, PVTSizing achieves 1.9x-8.8x sample efficiency and 1.6x-9.8x time efficiency improvements compared to prior sizing tools from both industry and academia.
Research Manuscript
AI
Security
AI/ML Security/Privacy
DescriptionThis paper introduces EmMark, a novel watermarking framework for protecting the intellectual property (IP) of embedded large language models deployed on resource-constrained edge devices. To address the IP theft risks posed by malicious end-users, EmMark enables proprietors to authenticate ownership by querying the watermarked model weights and matching the inserted signatures. EmMark's novelty lies in its strategic selection of watermark weight parameters, ensuring robustness while maintaining model quality.
Extensive proof-of-concept evaluations of models from OPT and LLaMA-2 families demonstrate EmMark's fidelity, achieving 100% success in watermark extraction with model performance preservation. EmMark also showcased its resilience against watermark removal and forging attacks.
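As a rough illustration of weight-based watermark insertion and extraction in general, the sketch below embeds a sign signature into a secret subset of weights; this is a simplified scheme for exposition, not EmMark's actual parameter-selection strategy:

```python
import numpy as np

rng = np.random.default_rng(7)

# Embed a +/-eps signature into a secret subset of weights; ownership is
# then asserted by re-reading those positions and matching the signs.
# (Illustrative scheme only, not EmMark's actual method.)
weights = rng.normal(0.0, 0.1, size=10_000)
secret_idx = rng.choice(weights.size, size=64, replace=False)
signature = rng.choice([-1.0, 1.0], size=64)
eps = 1e-3  # small enough to leave model quality essentially untouched

watermarked = weights.copy()
watermarked[secret_idx] += eps * signature

def extract_signature(w, idx, original):
    """Recover the embedded signature bits from the watermarked weights."""
    return np.sign(w[idx] - original[idx])

match_rate = np.mean(extract_signature(watermarked, secret_idx, weights) == signature)
```

The choice of which weights to perturb is what determines robustness against removal and forging, which is where EmMark's contribution lies.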
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe advantages of quantum pulses over quantum gates have attracted increasing attention from researchers.
However, while there are established workflows and processes to evaluate the performance of quantum gates, there has been limited research on profiling parameterized pulses.
To address this gap, our study proposes a set of design spaces for parameterized pulses, evaluating these pulses based on metrics such as expressivity, entanglement capability, and effective parameter dimension.
Using these design spaces, we demonstrate the advantages of parameterized pulses over gate circuits in terms of duration and performance at the same time, thus enabling high-performance quantum computing.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe quality of the Process Design Kit (PDK) is crucial for the success of any System-on-Chip (SoC) in any organization. The design rule check is one of the mandatory checks in the sign-off process of an SoC or an IP. The QAcell methodology involves exhaustively creating small layouts representing the violating and legal configurations to verify the alignment of the Design Rule Check (DRC) deck with the Design Rule Manual (DRM) separately for each rule, including device rules. To increase productivity, SKILL automation can automate the creation of layout test cases (QAcells) for device rules, allowing quick and easy customization of complex devices and layouts with varying CDF parameters. This automation reduces the validation engineer's time by 3x for validating each rule separately. The combination of the QAcell methodology and SKILL automation provides an efficient approach to verifying the quality of the DRC deck for device rules. The result is a faster and more precise validation methodology.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionFuture energy-efficient computing systems require new memory designs to overcome the challenges of transistor scaling. This paper presents a design space exploration methodology for rapid analysis of heterogeneous monolithic 3D integration for on-chip dynamic random-access memory. We develop a model for memory analysis validated with tape-out measurements and profile software applications with different memory access patterns. Using a system comprising silicon, carbon-nanotube and indium-gallium-zinc-oxide field effect transistors, we show that such designs can achieve 2.8x and 291x improvements in energy-delay-product in addition to 50% and 33% reductions in bit-cell area compared to silicon-based 6T-SRAM and 3T-eDRAM for embedded applications.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe linearity of Sigma-Delta Modulators (SDMs) is evaluated by performing a Fourier transform of the output bitstream. The presence of quantization noise complicates this task: many samples and long simulations are needed to lower the noise floor and evaluate the signal harmonics. A method is presented here to estimate the quantization noise, remove it from the SDM bitstream, and reduce the number of samples needed to evaluate SDM linearity. Applying the proposed technique to a Simulink model of a fourth-order SDM, the linearity estimation accuracy is kept unchanged using 5000x fewer bitstream samples, enabling a strong reduction of SDM verification time or improved statistical coverage of PVT variability.
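The conventional evaluation that this method improves on can be sketched as follows: simulate a modulator, window the bitstream, and examine its spectrum. A first-order modulator is used here for brevity; the paper's quantization-noise estimation and removal step is not shown:

```python
import numpy as np

def first_order_sdm(x):
    """First-order sigma-delta modulator: integrate the error between the
    input and the 1-bit feedback, then quantize the integrator state."""
    v, y_prev = 0.0, 0.0
    y = np.empty_like(x)
    for n, xn in enumerate(x):
        v += xn - y_prev
        y_prev = 1.0 if v >= 0.0 else -1.0
        y[n] = y_prev
    return y

N = 1 << 14
fin = 127 / N                      # coherent input frequency (integer bin)
x = 0.5 * np.sin(2 * np.pi * fin * np.arange(N))
bits = first_order_sdm(x)

# Windowed spectrum of the bitstream: the signal bin dominates, while the
# shaped quantization noise sets the floor that limits harmonic analysis.
win = np.hanning(N)
spec = np.abs(np.fft.rfft(bits * win))
signal_bin = int(round(fin * N))   # = 127
```

Because harmonics must be resolved above this noise floor, conventional evaluation needs very long bitstreams, which is the cost the proposed noise-removal technique attacks.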
Research Panel
Design
Quantum Computing
DescriptionQuantum computers are a reality! In recent years, the technology has received huge momentum, fueled by numerous players (including established companies, an impressive number of start-ups, and plenty of research initiatives) working on the realization of corresponding machines, design flows, and applications. At the same time, however, several questions remain: End-users and domain experts wonder for which applications (and when) quantum computing will be interesting. Designers and tool developers wonder what (physical) challenges and bottlenecks have to be addressed. And physicists wonder how they can address all these expectations while they are still trying to get decoherence times and errors under control.
Hence, it is time to discuss where we are with quantum computing. To this end, this panel brings together renowned panelists from industry and academia to discuss the current status and future promise of this technology from different perspectives. More precisely, we are going to cover:
* How should we assess the recent accomplishments in the different technologies (superconducting, ion-traps, neutral atoms, etc.)? Which technology is most promising? Are those just another step in a still long series of further steps needed or do they constitute the eventual breakthrough?
* What bottlenecks still have to be overcome: Can we re-use the established design flow for classical circuits and systems for quantum computing? How much quantum physics expertise is needed to work in that field? Do we have metrics/benchmarks that can guide us through the corresponding developments?
* What are the practically relevant ecosystems? Will quantum computing replace conventional systems in entire fields or "only" extend the conventional computational capacities? Will there be "stand-alone" quantum computing applications or only quantum-classical co-design solutions?
* What are the timelines towards practically relevant quantum computing ecosystems?
In addition, the panelists will also be available to address dedicated questions from the design automation community. This and more will be covered in the panel.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionQuantum computing has the potential to solve problems that are intractable for classical systems, yet the high error rates in contemporary quantum devices often exceed tolerable limits for useful algorithm execution. Quantum Error Correction (QEC) mitigates this by employing redundancy, distributing quantum information across multiple data qubits and utilizing syndrome qubits to monitor their states for errors. The syndromes are subsequently interpreted by a decoding algorithm to identify and correct errors in the data qubits. This task is complex due to the multiplicity of error sources affecting both data and syndrome qubits as well as syndrome extraction operations. Additionally, identical syndromes can emanate from different error sources, necessitating a decoding algorithm that evaluates syndromes collectively. Although machine learning (ML) decoders such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) have been proposed, they often focus on local syndrome regions and require retraining when adjusting for different code distances. To address these issues, we introduce a transformer-based QEC decoder which employs self-attention to achieve a global receptive field across all input syndromes. It incorporates a mixed loss training approach, combining both local physical error and global parity label losses. Moreover, the transformer architecture's inherent adaptability to variable-length inputs allows for efficient transfer learning, enabling the decoder to adapt to varying code distances without retraining.
Evaluation on six code distances and ten different error configurations demonstrates that our model consistently outperforms non-ML decoders, such as Union Find (UF) and Minimum Weight Perfect Matching (MWPM), as well as other ML decoders, thereby achieving the best logical error rates. Moreover, transfer learning saves over 10x in cost.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionQuantum computing has rapidly grown to actual devices with hundreds of qubits, showing its promise to achieve quantum advantage over classical computing; however, the unpredictable and unstable noise in quantum devices sets barriers to practically unleashing the power of quantum computing. Without understanding the impact of noise on an application, one can hardly reproduce the results or reuse the design. Although noisy quantum simulation can provide insights into performance changes under noise, it faces a scalability issue and cannot handle large circuits. To address this pressing problem, in this work we propose the very first data-driven workflow to predict the bounds of performance. It applies a decomposition method to accurately decompose a trace of historical performance under noise to generate a training dataset, which can isolate different noise sources. On top of this, we develop a novel encoder to simultaneously embed circuit and noise information, which is then processed by an LSTM. The trained model can predict performance bounds for a given noise. Experimental results show that our method can efficiently produce practical bounds for various circuits at different scales.
Research Manuscript
Design
Quantum Computing
DescriptionThe rapid advancement of quantum computing has generated considerable anticipation for its transformative potential. However, harnessing its full potential relies on identifying "killer applications". In this regard, QuGeo emerges as a groundbreaking quantum learning framework, poised to become a key application in geoscience, particularly for Full-Waveform Inversion (FWI). This framework integrates variational quantum circuits with geoscience, representing a novel fusion of quantum computing and geophysical analysis. This synergy unlocks quantum computing's potential within geoscience. It addresses the critical need for physics-guided data scaling, ensuring high-performance geoscientific analyses aligned with core physical principles. Furthermore, QuGeo's introduction of a quantum circuit custom-designed for FWI highlights the critical importance of application-specific circuit design for quantum computing. In experiments on OpenFWI's FlatVelA dataset, the variational quantum circuit from QuGeo, with only 576 parameters, achieved a significant performance improvement. It reached a Structural Similarity Image Metric (SSIM) score of 0.905 between the ground truth and the output velocity map. This is a notable enhancement over the baseline design's SSIM score of 0.800, which was achieved without the incorporation of physics knowledge.
Research Manuscript
Design
AI/ML System and Platform Design
DescriptionWhile exhibiting superior performance in many tasks, vision transformers (ViTs) face challenges in quantization. Some existing low-bit-width quantization techniques cannot effectively cover the whole inference process of ViTs, leading to an additional memory overhead (22.3%-172.6%) compared with the corresponding fully quantized models. To address this issue, we propose quadruplet uniform quantization (QUQ) to deal with data of various distributions in ViT. QUQ divides the entire data range into at most four subranges that are uniformly quantized with different scale factors, respectively. To determine the partition scheme and quantization parameters, an efficient relaxation algorithm is proposed accordingly. Moreover, dedicated encoding and decoding strategies are devised to facilitate the design of an efficient accelerator. Experimental results show that QUQ surpasses state-of-the-art quantization techniques; it is the first viable scheme that can fully quantize ViTs to 6-bit with acceptable accuracy. Compared with the conventional uniform quantization, QUQ results in not only a higher accuracy but also an accelerator with lower area and power.
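The core idea of piecewise-uniform quantization with per-subrange scale factors can be sketched as follows; the subrange edges, bit width, and even split of the code space are illustrative assumptions, since the paper's relaxation algorithm determines the actual partition and parameters:

```python
import numpy as np

def quad_uniform_quantize(x, edges, bits=6):
    """Quantize-dequantize x piecewise: each subrange [edges[i], edges[i+1])
    gets its own scale factor while sharing one bits-wide code space."""
    levels = (1 << bits) // (len(edges) - 1)      # codes per subrange
    out = np.empty_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x < hi)
        scale = (hi - lo) / levels                # subrange-specific scale
        q = np.clip(np.round((x[mask] - lo) / scale), 0, levels - 1)
        out[mask] = lo + q * scale                # dequantized value
    return out

rng = np.random.default_rng(1)
data = np.clip(rng.normal(0, 1, 10_000), -6, 5.999)
edges = [-6.0, -1.0, 0.0, 1.0, 6.0]   # four subranges, finer near zero
deq = quad_uniform_quantize(data, edges, bits=6)
err = np.mean(np.abs(deq - data))
```

Because the dense central subranges get a small scale factor and the sparse tails a large one, the 6-bit budget yields a far smaller mean error on bell-shaped data than a single uniform grid over [-6, 6) would.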
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPointer chasing becomes the performance bottleneck for in-memory indexes due to the memory wall. Prior works adopt a fixed granularity to partition the key space and maintain static heights of skiplist nodes among processing-in-memory (PIM) modules to accelerate skiplist operations, neglecting the changes in skewness and hotness. We present RADAR, an innovative PIM-friendly skiplist that dynamically partitions the key space to adapt to different skewness. An offline learning-based model is employed to catch hotness changes to adjust the heights of skiplist nodes. In multiple datasets, RADAR achieves up to 198.2x performance improvement and consumes 47.4% less memory than state-of-the-art designs.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionOur objective is to exhaustively verify static and dynamic connections, from the IPs deep inside an FPGA "core" all the way to the outer perimeter of AI-centric computational circuit boards. The circuits usually have FPGAs from different vendors, many configurations, pinouts, and implementation constraints, which lead to a high risk of connection bugs.
The simulation approach does not work for our multiple variants of large-scale, complex, FPGA-based, AI-centric cloud hardware designs within our tight schedule: it requires weeks of manual testbench development, weeks of run and turnaround time, and it is not exhaustive, so bugs can escape.
Formal connectivity verification establishes a framework for experts and non-experts alike that ensures simplicity, reusability, and scalability, from block- to system-level, for static and dynamic connectivity verification. It provides a comprehensive exposition of the design hierarchies required by backend physical tools, provides visibility into hidden cones of logic, uncovering "blind spots" that escape detection in simulation-based techniques, and exhaustively proves connections when no stimulus can violate them.
With formal verification, we successfully uncovered RTL bugs in minutes - a task that weeks of simulation-based regressions had failed to accomplish. We got a huge boost in productivity: 95% savings in engineers' time!
Research Manuscript
Design
Quantum Computing
DescriptionReversible computing has gained increasing attention as a prospective solution for energy dissipation, particularly in quantum computing. As the first practical reversible logic gate using adiabatic superconducting devices, reversible quantum-flux-parametron (RQFP) has been experimentally demonstrated in logical and physical reversibility. However, due to its unique logic function and structure, RQFP logic circuit design poses enormous challenges. Furthermore, circuit scale severely limits the existing exact logic synthesis method for RQFP logic. Therefore, this paper proposes an automatic Cartesian genetic programming-based synthesis framework to generate RQFP logic circuits. Experimental results on reversible logic benchmarks demonstrate RCGP's effectiveness.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn order to achieve the targeted PPA goals in our Single Die/2.5D/3DIC designs, accurate early analysis is crucial for IR and timing optimization from the beginning of the project.
In the beginning of the project cycle:
• RDL with BUMPs is not available, and the full flat design hierarchy is not yet available.
• Bump currents cannot be checked or optimized very early in the design cycle.
• IA/IB weaknesses cannot be caught, as power sources will be created on IB/IA pins for block-level runs.
• RDL DEF + block runs take more computational resources.
• For 3DIC designs, the TSV model and back-metal resistances are missing in early EMIR analysis.
Our approach utilizes RedHawk-SC EMIR tool design ECOs to draw RDL and BUMPs in the design. This modified design with virtual RDL and BUMPs is then used to perform IR and EM analysis. For 3DIC designs:
• Accurate block-level results by accounting for TSV and back-metal resistances in block-level runs.
• Multiple Design of Experiments can be performed.
• Accurate Top-Die results are available by accounting for Bottom-Die parasitics in block-level runs, which helps in Top-Die design planning.
Modelling these additional challenges accurately is important for accurate early analysis.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionSolving sparse linear systems is crucial in scientific computing. Sparse Conjugate Gradient (CG) is one of the most popular iterative solvers with high efficiency and low storage requirements. However, the performance of sparse CG solvers implemented on storage-compute separated architectures is greatly limited by the irregular memory access and the large amount of data transmission.
In this paper, we propose a processing-in-memory (PIM) architecture, ReCG, based on resistive random-access memory (ReRAM) to accelerate sparse CG solvers. The design of ReCG faces three major challenges: (1) how to make complex CG more suitable for acceleration with a ReRAM-based architecture, (2) how to map sparse and irregular operations to regular crossbars that are better suited to dense operations, and (3) how to coordinate the dataflow among hardware units to minimize the impact of the poor write endurance of ReRAMs on CG acceleration. To address these challenges, we (1) classify the sparse CG kernels by exploring the commonality of operations and design a flexible and dedicated architecture, (2) efficiently implement the sparse and irregular operations by utilizing both content-addressable memory (CAM) and multiply-and-accumulate (MAC) crossbars, and (3) develop a novel scheduling strategy for the dataflow. The experimental results show that ReCG improves performance by up to three orders of magnitude over PETSc on CPU, and by up to one order of magnitude over PETSc on GPU and over CALLIPEPLA on FPGA, while energy consumption is reduced by up to two, two, and one orders of magnitude, respectively.
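As a plain-software reference for the kernels such an accelerator targets (the SpMV, dot products, and AXPY-style vector updates), the textbook CG iteration can be sketched as follows; this is the standard algorithm, not ReCG's hardware mapping:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=200):
    """Textbook CG for a symmetric positive-definite A. The A @ p product
    is the dominant, irregular kernel; the rest are dots and AXPYs."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                       # SpMV: irregular memory access
        alpha = rs / (p @ Ap)            # dot products
        x += alpha * p                   # AXPY updates
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p        # new search direction
        rs = rs_new
    return x

# Small SPD test system: 1-D Poisson (tridiagonal) matrix.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = conjugate_gradient(A, b)
residual = np.linalg.norm(A @ x - b)
```

Every iteration touches A, r, p, and x, which is why a storage-compute-separated machine pays heavily in data movement and why keeping these kernels in memory is attractive.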
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionModern SoCs, with billions of transistors, pose challenges for traditional power integrity signoff due to increased node counts and process scaling. The existing methods are time-consuming, require substantial resources, and often result in systematic inaccuracies. To address this, a bottom-up hierarchical signoff methodology is proposed, which allows block-level signoff and reduces the overall turnaround time without compromising accuracy. This approach leverages the RedHawk-SC tool by Ansys to create child block models that are instantiated at the next hierarchical level. The hierarchical modeling flow has shown a performance improvement of ~45-50% during block level runs, with an accuracy within a 5% range compared to flat runs. The methodology is under continuous improvement to enhance accuracy and efficiency. The next step involves implementing a Hierarchical SignalEM methodology using the same reduced model solution.
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionIn modern advanced packaging, redistribution layers (RDLs) are often used for signal transmission among chips, and vias are used for communication among different layers. Most existing RDL routers perform via planning before routing. However, since vias can be placed at arbitrary locations under the irregular via structure, via planning limits the solution space and reduces layout flexibility. This paper proposes a new flow with a novel routing graph model for 90- and 135-degree routing, which allows dynamic via insertion during routing. The proposed algorithm enlarges the solution space by providing more choices during path-finding, achieving higher routing quality. The experimental results based on commonly used benchmark suites show that our router achieves over 10% better wirelength with over 29X speedup over the state-of-the-art work and even achieves 0.4% better wirelength with 55X speedup over the state-of-the-art any-angle router.
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
DescriptionLong DRAM access latency has a significant impact on modern system performance. However, the improvement of access latency is limited, as DRAM vendors reserve considerable timing margins against seldom-occurring worst-case conditions. To mitigate such pessimistic timing margins, we propose a temperature- and process-variation-aware timing detection and adaption DRAM (TPDA-DRAM) architecture. It equips in-situ cross-coupled detectors to monitor the voltage difference between bitline pairs, enabling estimation of timing margins caused by process and temperature variations. Moreover, TPDA-DRAM incorporates two collaborative timing adaption schemes: 1) a process-variation-aware timing adaption scheme (PVA) that selectively accelerates the access to rare weak cells and 2) a temperature-variation-aware timing adaption scheme (TVA) that precisely adjusts timing parameters by adopting temperature information. Compared to prior art, the proposed detector reduces detection deviation by 54.8% and area overhead by 88.1%. The system-level evaluation in an eight-core system shows that TPDA-DRAM improves the average performance and energy efficiency by 20.5% and 15.0%, respectively.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs DRAM devices continue to shrink, defects that are out of tolerance have become more prevalent. One such defect is interlayer misalignment, which occurs when two layers are not aligned correctly. Interlayer misalignment caused by patterns shifted due to heat and stress is called BLE (Bulk Layout Effect). It can lead to poor device yield.
In this paper, we propose two correction methods to reduce interlayer misalignment caused by BLE. The first method involves correcting the mask where BLE occurs in the opposite direction of the BLE. The second method involves correcting the other mask affected by BLE in the direction of the BLE. One of these methods should be chosen to ensure that it does not interfere with layout connections.
We evaluated the methods on two different items. The interlayer misalignment in the chip decreased by 89% and 58% in item #1 and item #2, respectively. In addition, in-chip uniformity of alignment improved by 25% in item #1 and 27.4% in item #2.
This proves that reducing interlayer misalignment caused by BLE can help improve in-chip uniformity of alignment, and it is expected to contribute to improved device yield by widening the window of manufacturing process tolerance.
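The two correction options can be sketched as simple coordinate transforms; the shift values and pattern coordinates below are hypothetical illustrations, not measured BLE data:

```python
# Minimal sketch of the two BLE correction options. `ble_shift` stands in
# for a measured/predicted displacement; values are purely illustrative.

def correct_shifted_mask(pattern_xy, ble_shift):
    """Method 1: pre-shift the affected mask opposite to the expected BLE,
    so the heat/stress-induced displacement lands it back on target."""
    dx, dy = ble_shift
    return [(x - dx, y - dy) for x, y in pattern_xy]

def correct_other_mask(pattern_xy, ble_shift):
    """Method 2: move the interfacing mask along the BLE direction so the
    two layers stay aligned with each other."""
    dx, dy = ble_shift
    return [(x + dx, y + dy) for x, y in pattern_xy]

vias = [(0.0, 0.0), (1.0, 0.5)]
shift = (0.03, -0.01)                     # hypothetical BLE shift (um)
pre_compensated = correct_shifted_mask(vias, shift)
# After BLE displaces the corrected mask by `shift`, patterns land on target:
landed = [(x + shift[0], y + shift[1]) for x, y in pre_compensated]
```

Which method to use depends on which correction avoids breaking layout connections, as the abstract notes.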
Back-End Design
Design
Engineering Tracks
DescriptionAutomotive radar sensors are mounted at the edges of the vehicle, where ambient temperatures of up to 85°C need to be supported, and providing good cooling solutions adds heavily to the sensor cost. With the growing number of radar sensors per vehicle and the entry of radar sensors into low/mid-segment cars, there is huge pressure on radar sensor cost from the OEMs. By reducing the VDD supply by 4.5%, the leakage active current of the chip reduces by ~10%. This reduces the complexity and size of the heat-sinking solution on the sensor and hence reduces the heat sink cost. Therefore, in automotive radar sensors, one of the major power objectives is to reduce the Vmin of the device to reduce the overall power consumption. For industrial radar sensors, the focus is more on leakage current reduction, as the device operates on battery and a low deep-sleep current is required for minimized energy consumption and prolonged battery life.
The power consumption of a chip is directly proportional to its switching. The typical behavior of ATPG and MBIST engines is to target as many faults as possible with as few patterns as possible. This increases the switching activity of the test patterns, and hence the power requirement for ATPG tends to be significantly higher than in functional operation. Testing at reduced Vmin is required to distinguish between functionally correct devices and devices that are defective due to Vmin issues and production abnormalities. However, Vmin reduction is an iterative process between the Product Engineering (PE) team and the DFT team, taking more than one month to close. This paper proposes a comprehensive power-aware DFT methodology for reducing the overall Vmin of Scan and MBIST patterns to enable digital supply reduction without compromising yield or test data volume (TDV); the proposed methodology also aims to reduce the turnaround time to generate power-efficient patterns.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLogic synthesis is a design phase that relies on heuristic algorithms to optimize graph representations of digital circuits. The choice of a graph representation depends on target technology, design properties, and desired optimization metrics. While logic synthesis heuristics follow general principles, their realization often entails high engineering costs due to tight dependencies on specific circuit representations. This paper proposes a representation-independent algorithm for area-oriented optimization to transcend representation-specific tweaks and enhance adaptability across diverse technologies and designs. The experimental results show that our method can achieve additional average improvements of up to 9.74% compared to state-of-the-art representation-dependent engines.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSRAM-based compute-in-memory (SRAM-CIM) is a promising architecture for efficient and accurate AI computing. However, the low memory density of SRAM makes it impractical to store all weights of large neural networks, leading to on-chip and off-chip weight loading overheads. Previous attempts to improve SRAM-CIM's memory density involved integrating multiple resistive RAM (ReRAM) cells into an SRAM cell as local weight storage. However, the current-based sensing scheme used in these approaches does not guarantee accurate data loading of weights, since it relies on the limited gain of the SRAM cell to latch data from ReRAM. The correctness of weight loading deteriorates as the number of embedded ReRAM cells increases, impeding the achievement of high-density SRAM-CIM. To address these issues, we propose ReS-CIM, a ReRAM-cached SRAM-CIM architecture employing a differential sensing scheme that provides highly scalable local ReRAM storage and robust weight loading. By amplifying the ReRAM resistance difference before SRAM latches the data, the proposed sensing scheme guarantees accurate weight loading across varying ReRAM capacities, on/off ratios, and device variations. Additionally, the voltage-based differential sensing mechanism eliminates static current flow, achieving ultra-low energy consumption and short latency. To fully leverage ReS-CIM's exceptional bandwidth data loading and energy efficiency, we introduce a CIM acceleration data flow. System-level simulations show that ReS-CIM achieves 91.7% energy savings and a 97.7% latency reduction on AlexNet when compared to the state-of-the-art all-weights-on-chip AI accelerator architectures.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionModern SoCs are equipped with complex reset architectures to meet low-power and high-performance requirements. Multiple reset domains in a design can cause reset domain crossing (RDC) issues when data from one asynchronous source reset domain propagates to a different asynchronous, synchronous, or no-reset destination domain. The data generated by RDC verification tools is very large, consisting of millions of RDC paths. Analyzing this data is a very time-consuming and challenging task for design and verification engineers, often involving many iterations. In this paper we highlight how to automate RDC results analysis using data processing and data analytics techniques to provide faster RDC verification closure.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionSeed promotion can be an issue during hierarchical LVS, leading to incorrect LVS results. The debugging methodology for finding the root cause involves many trials; one factor is the way layers are derived using node-preserving layer operations. In this paper we show the right methodology to debug seed promotion cases.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionDynamic Random Access Memory (DRAM) failures cause a significant number of server crashes in large-scale cloud centers, resulting in service interruptions and substantial economic losses. In this paper, we reframe the problem of DRAM failure prediction as a deep image classification (DIC) task. We propose a method that utilizes DIC algorithms to establish the relationship between Correctable Errors (CEs) and Uncorrectable Errors (UEs) with a post-enhancement stage. First, we encode the spatial positions of CEs into distinct blocks distributed across designated channels. Each block contains a value that represents CE counts. Then, we design an extensible post-enhancement stage to enhance those patterns that cannot be captured in the first stage. In our experiments conducted on a dataset from a real-world production cloud center, our approach demonstrates a significant improvement and achieves state-of-the-art performance. We release all source code as open source.
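One plausible sketch of the first stage, encoding CE positions into channel blocks with count values, is shown below; the grid dimensions, bank-to-channel mapping, and address ranges are assumptions for illustration, not the paper's exact scheme:

```python
import numpy as np

def encode_ce_image(ce_events, n_banks=4, rows=32, cols=32,
                    row_bits=1024, col_bits=1024):
    """Map each CE event (bank, row, col) to a coarse spatial block in the
    bank's channel and accumulate counts, yielding a (n_banks, rows, cols)
    'image' a CNN-style classifier can consume."""
    img = np.zeros((n_banks, rows, cols), dtype=np.float32)
    for bank, row, col in ce_events:
        r = row * rows // row_bits            # coarse row block
        c = col * cols // col_bits            # coarse column block
        img[bank, r, c] += 1.0                # block value = CE count
    return img

# Hypothetical CE log: two errors clustered in bank 0, one in bank 3.
events = [(0, 100, 200), (0, 110, 210), (3, 900, 50)]
img = encode_ce_image(events)
```

Clustered CEs fall into the same block and raise its count, so spatial error patterns that precede UEs become visible to an image classifier.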
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionNear-term quantum computers face the challenge of short coherence times, which significantly limits the depth of verifiable quantum programs; thus, it is essential to implement depth-efficient quantum algorithms.
We note that the linear-depth CNOT blocks observed in, e.g., the Bernstein-Vazirani (BV) algorithm and error detection codes are a major bottleneck in the quantum circuit execution process, stemming from an unparallelizable structure in which the targets of the CNOTs are concentrated on a single qubit.
In this work, we propose Retract (contRollEd gaTe RearrAngement for reduCing depTh), which redesigns these CNOT structures into tree structures whose depth grows logarithmically as the number of qubits increases.
Our experiments confirm the circuit depth reduction and fidelity improvement of Retract over the conventional linear-depth CNOT implementation through tensor network-based simulations and evaluations on IBM quantum machines.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionThe key to pipeline throughput optimization is to resolve data hazards caused by read-after-write (RAW) dependencies, which are traditionally tackled by forwarding and speculation to avoid pipeline stalls. However, existing approaches are based on high-level dataflow analysis and may lose optimization opportunities for lack of analysis of the netlist structure.
We propose an efficient method to resolve RAW dependencies with low-level netlist analysis by gate-level forwarding and speculation. With a greedy search method to detect and resolve short-delay gate-level signal paths for forwarding and an approximate circuit synthesis method with formal verification for gate-level speculation, the method efficiently utilizes the gate-level information to further improve pipeline throughput. We conduct experiments on the widely-used ISCAS/EPFL benchmark circuits and a large-scale RISC-V CPU. Experimental results show that our approach can increase the pipeline throughput. More importantly, our approach can find better designs than human experts.
Research Manuscript
EDA
Design Verification and Validation
DescriptionWe introduce RexBDDs, binary decision diagrams (BDDs) that exploit reduction opportunities well beyond those of reduced ordered BDDs, zero-suppressed BDDs, and recent proposals integrating multiple reduction rules. RexBDDs also leverage (output) complement flags and (input) swap flags to potentially decrease the number of nodes by a factor of four. We define a reduced form of RexBDDs that ensures canonicity, and use a set of benchmarks to demonstrate their superior storage and runtime requirements compared to previous alternatives.
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionArithmetic operations on multi-precision integers (MPI) are a performance-critical component of many public-key cryptosystems, including not only classical RSA and ECC, but also post-quantum isogeny-based schemes. In this paper, we analyze and compare two different MPI representations, namely full-radix versus reduced-radix, for efficient modular arithmetic implementations on 64-bit RISC-V (i.e., RV64GC). We then explore how the execution time can be further improved by designing Instruction Set Extensions (ISEs). The ISE we propose can accelerate a CSIDH-512 class group action by a factor of 1.71 compared to an ISA-only implementation on a 64-bit Rocket core. The hardware overhead introduced by our ISE is approximately 10%.
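The full-radix versus reduced-radix trade-off can be sketched in a toy model: with fewer value bits per machine word, each limb keeps spare headroom, so additions can defer carry propagation until an occasional normalization pass. The 52-bit limb width and 5-limb count below are illustrative choices, not the paper's parameters:

```python
RADIX = 2 ** 52          # reduced radix: 52 value bits per 64-bit limb
LIMBS = 5                # 5 x 52 = 260 bits, enough for 256-bit operands

def to_limbs(x):
    return [(x >> (52 * i)) & (RADIX - 1) for i in range(LIMBS)]

def from_limbs(ls):
    return sum(l << (52 * i) for i, l in enumerate(ls))

def add_lazy(a, b):
    # reduced-radix addition: no carry propagation needed yet, since each
    # limb has 12 spare bits to absorb intermediate sums
    return [x + y for x, y in zip(a, b)]

def carry(ls):
    # occasional normalization pass propagates accumulated carries
    # (final carry out is assumed zero for in-range values)
    out, c = [], 0
    for l in ls:
        t = l + c
        out.append(t & (RADIX - 1))
        c = t >> 52
    return out
```

In the full-radix alternative, every addition chains carries across all limbs, which is exactly what an ISE can otherwise accelerate.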
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWafer diagnostics plays a key role in enabling yield pull-in by providing critical data for yield improvement at the foundry. The key deliverables are defect-candidate data for failure analysis (FA) and wafer-sort analysis. Foundries encounter issues with missing, delayed, or wrong/partial wafer data for volume diagnostics. These issues stem from factors outside the diagnostics team's control, and a mini ecosystem is required to process the wafers provided by the foundry and return defect data. Management needs to prioritize wafer processing on a daily basis while managing costs. Efficient tools are available and are being built by design houses; however, they rely on multiple internal and external teams for their performance. FMEA is the framework that was utilized to provide a decision-support system for risk management. This paper discusses the implementation aspects and benefits, and also provides an implementation framework for ease of use and replication into other design flows.
Research Manuscript
AI
AI/ML Algorithms
DescriptionExisting quantization approaches incur significant accuracy loss when compressing hybrid transformers with low bit-width. This paper presents RL-PTQ, a novel post-training quantization (PTQ) framework utilizing reinforcement learning (RL). Our focus is on determining the most effective bit-width and observer for quantization configurations tailored for mixed-precision by grouping layers and addressing the challenges of quantization of hybrid transformers. We achieved the highest quantized accuracy for MobileViTs compared to the previous PTQ methods. Furthermore, our quantized model on PIM architecture exhibited an energy efficiency enhancement of 10.1× and 22.6× compared to the baseline model, on the state-of-the-art PIM accelerator and GPU, respectively.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionQuantum state preparation, a crucial subroutine in quantum computing, involves generating a target quantum state from initialized qubits. Arbitrary state preparations can be broadly categorized into arithmetic decomposition (AD) and variational quantum state preparation (VQSP). AD employs a predefined procedure to decompose the target state into a series of gates, whereas VQSP iteratively tunes ansatz parameters to approximate the target state. VQSP is particularly apt for Noisy Intermediate-Scale Quantum (NISQ) machines due to its shorter circuits. We present RobustState, a novel VQSP methodology that combines high robustness with high training efficiency. The core idea involves utilizing measurement outcomes from real machines to perform back-propagation through classical simulators, thus incorporating real quantum noise into gradient calculations. RobustState serves as a versatile, plug-and-play technique applicable for training parameters from scratch or fine-tuning existing parameters to enhance fidelity on target machines. It is adaptable to various ansatzes at both gate and pulse levels and can even benefit other variational algorithms, such as variational unitary synthesis. Comprehensive evaluation of RobustState on state preparation tasks for 4 distinct quantum algorithms using 10 real quantum machines demonstrates a coherent error reduction of up to 7.1x and state fidelity improvements of up to 96% and 81% for 4-Q and 5-Q states, respectively. On average, RobustState improves fidelity by 50% and 72% for 4-Q and 5-Q states compared to baseline approaches.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe turn-around time for analog IP layout design significantly exceeds that for digital, despite the lower quantity of analog IPs. Although automated migration solutions for analog IP have been proposed recently, their practical application is still challenging because even when the schematic is reused, layout reusability is often hindered by changes in design rules, IP boundaries, etc. In this work, we propose a row-based placement and legalization methodology, focusing on mixed-signal power-delivery IP because 1) power-delivery IP quantitatively occupies almost half of the analog circuitry, and 2) it is sensitive to parasitic RC rather than to analog constraints such as matching. Precisely, it follows these steps: 1) schematic analysis & component generation, 2) component matching with dynamic floorplan adjustment, 3) global placement, 4) legalization for analog components, 5) vertical legalization with row power assignment, and 6) horizontal legalization with well-bias alignment. Experimental results demonstrate that the proposed work can generate reasonable initial placement solutions within 1 hour for designs with up to 150 components. The generated layouts show 4.05% area overhead and a 14.67% HPWL increase on average compared with manual ones, which could be further improved by additional optimization, either manual or algorithmic.
Research Manuscript
AI
AI/ML Algorithms
DescriptionAs the application scope of DNNs executed on microcontroller units (MCUs) extends to time-critical systems, it becomes important to ensure timing guarantees for increasing demand of DNN inferences. To this end, this paper proposes RT-MDM, the first real-time scheduling framework for multiple DNN tasks executed on an MCU using external memory. Identifying execution-order dependencies among segmented DNN models and memory requirements for parallel execution subject to the dependencies, we propose (i) a segment-group-based memory management policy that achieves isolated memory usage within a segment group and sharded memory usage across different segment groups, and (ii) an intra-task scheduler specialized for the proposed policy. Implementing RT-MDM on an actual system and optimizing its parameters for DNN segmentation and segment-group mapping, we demonstrate the effectiveness of RT-MDM in accommodating more DNN tasks while providing their timing guarantees.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionTemporal Graph Neural Networks (TGNNs) have attracted much research attention because they can capture the dynamic nature of complex networks. However, existing software/hardware solutions suffer from redundant computation overhead and excessive off-chip communication because they need to recompute identical messages and unnecessarily update the vertex memory of unaffected vertices. This paper proposes a redundancy-free accelerator, RTGA, for high-performance TGNN inference. Specifically, RTGA integrates a redundancy-aware execution approach based on a temporal tree into a novel accelerator design to effectively eliminate unnecessary data processing, reducing redundant computations and off-chip communications, and also designs a temporal-aware data caching method to improve data locality for TGNN. We have implemented and evaluated RTGA on a Xilinx Alveo U280 FPGA card. Compared with state-of-the-art software solutions (i.e., TGN and TGL) and hardware solutions (i.e., BlockGNN and FlowGNN), RTGA improves TGNN inference performance by an average of 473.2x, 87.4x, 8.2x, and 6.9x and saves energy by 542.8x, 102.2x, 9.4x, and 8.3x, respectively.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionRRAM-based compute-in-memory (CIM) suffers from programming variation issues, specifically device-to-device variation (DDV) and cycle-to-cycle variation (CCV), which can have a detrimental impact on inference accuracy. To address these variation issues, we propose RWriC, a dynamic Writing scheme for Variation Compensation for RRAM-based CIM. RWriC sequentially programs the weights, implemented by multiple RRAM cells, starting from the high significance cell (HSC) and moving towards the low significance cell (LSC). This approach leverages the knowledge of current cumulative errors and the programming targets (PTs) of other RRAM cells to dynamically adjust the PT of the RRAM currently under programming. By shifting the PT of HSC, RWriC enables the LSC to compensate for the programming errors of the HSC. Moreover, when the variation is substantial, RWriC allows the magnitude of LSC to be scaled up, providing an even wider compensation range. Through the combined application of the shifting and scaling techniques, experimental results show that the inference accuracy for ResNet50 on the CIFAR-10 dataset only drops by 0.9% under 18% device variation. In comparison to the conventional writing scheme, our RWriC approach achieves a 5-11x improvement in variation robustness for ResNet50 and Yolov8 across different tasks.
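The shift-based compensation idea can be sketched with a two-cell weight (my toy model, not the paper's circuit): after the high-significance cell (HSC, weight 4) is programmed with some error, the low-significance cell's (LSC) target is shifted to absorb that error, so only the LSC's own variation remains.

```python
import random

def program_cell(target, sigma, rng):
    # an RRAM write lands near its target with Gaussian spread (variation)
    return target + rng.gauss(0.0, sigma)

def write_naive(w, sigma, rng):
    # program HSC (weight 4) and LSC (weight 1) to fixed nominal targets
    return 4 * program_cell(w / 4, sigma, rng) + program_cell(0.0, sigma, rng)

def write_rwric(w, sigma, rng):
    # RWriC-style sketch: program the HSC first, then *shift* the LSC
    # target so it compensates the HSC's observed programming error
    hsc = program_cell(w / 4, sigma, rng)
    return 4 * hsc + program_cell(w - 4 * hsc, sigma, rng)

rng = random.Random(0)
naive = sum(abs(write_naive(10, 0.2, rng) - 10) for _ in range(2000)) / 2000
comp = sum(abs(write_rwric(10, 0.2, rng) - 10) for _ in range(2000)) / 2000
```

Because the HSC's error is multiplied by 4 in the naive scheme but cancelled in the compensated one, the mean absolute error drops by roughly the HSC's weight.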
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionThe reliability of physical unclonable function (PUF) has become the biggest challenge for key generation. Existing reliability improvement technologies incur high hardware overhead or testing costs. This paper proposes S2RAM-PUF, a novel, highly reliable and energy-efficient subthreshold SRAM PUF fabricated in 65nm process, with zero bit error rate (BER) across all voltage/temperature corners from 0.5V to 0.8V and from -40℃ to 120℃. The 20480 bits generated by the fabricated 5 S2RAM PUF chips pass the NIST 800-22 randomness test and exhibit almost ideal uniqueness with a mean inter-die hamming distance of 0.5007. The total energy per bit is as low as 3.12fJ at 0.5V supply voltage. Both stabilization BER and energy outperform the two state-of-the-art SRAM-type PUFs reported in JSSC 2020 and 2021.
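The uniqueness metric reported above (mean inter-die Hamming distance, ideal value 0.5) can be computed as follows; the tiny 4-bit responses are illustrative only:

```python
def hamming_distance(a, b):
    # fraction of differing bits between two PUF responses
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_inter_die_hd(responses):
    # mean pairwise normalized Hamming distance across chips; ideal
    # uniqueness is 0.5 (the paper reports 0.5007 over 5 chips)
    n = len(responses)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(hamming_distance(responses[i], responses[j])
               for i, j in pairs) / len(pairs)

chips = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
hd = mean_inter_die_hd(chips)
```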
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
DescriptionController synthesis for nonlinear systems is an important research issue. Deep Neural Network (DNN) control policies obtained through reinforcement learning (RL), though exhibiting good performance in simulations, cannot be applied to safety-critical systems for lack of formal guarantee. To address this, this paper considers fully utilizing the advantages of RL for complex control tasks to obtain a well-performing DNN controller. Then, using PAC (Probably Approximately Correct) techniques, a polynomial surrogate controller with probabilistically controllable approximation error is obtained. Finally, the safety of the control system under the designed polynomial controller is verified using barrier certificate generation. Experiments demonstrate the effectiveness of our method in generating controllers with safety guarantees for systems with high dimensions and degrees.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn the fast-evolving landscape of modern computing, where the exponential growth of data is ubiquitous, ensuring the confidentiality and integrity of information stands as a paramount challenge. This paper undertakes a comprehensive exploration of data security, employing the lens of formal security verification to address this critical concern. The study meticulously delves into the integration of cutting-edge tools and methodologies, strategically designed to fortify data security in the face of emerging threats.
At its core, this paper centers on the examination of data paths, driven by the imperative to eliminate potential vulnerabilities that could lead to leakage or corruption. Emphasizing the application of formal methods, the research identifies and mitigates threats to data security, with particular attention to the detection of side channels within processor architectures. This approach extends beyond theoretical foundations, encompassing practical applications and an in-depth exploration of formal modeling techniques, symbolic execution, and advanced analysis methods, including static and dynamic analysis. The paper combines theoretical insights with practical applications, offering a comprehensive understanding of formal security verification.
IP
Engineering Tracks
IP
DescriptionSamsung recognizes the critical importance of maintaining consistency and reliability across multiple library IP design kits with diverse flavors and technology nodes. From the SoC designer's perspective, a uniform foundation IP is paramount for a seamless design process, while the IP integrator must ensure the reliability of incoming IP to prevent integration issues that could lead to delays and increased costs.
Samsung's commitment to customers' requirements means that all IPs, regardless of their source, meet stringent reliability and consistency standards. The complexity of IP design at smaller technology nodes necessitates a systematic QA system, to detect and rectify issues early in the design flow, contributing to a more efficient and reliable development process.
In this paper, we will discuss how Samsung has deployed a comprehensive IP QA flow, integrating Siemens' Solido Crosscheck for automated and extensive validation at both fundamental and advanced QA levels. We cover the scalability and efficiency of the automated sign-off flow that not only reduces time and engineering efforts but also results in better silicon quality and shorter production schedules, benefiting both production and integration teams.
Research Manuscript
Embedded Systems
Embedded Software
DescriptionStencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for stencil acceleration using register-mapped indirect streams. We demonstrate SARIS for various stencil codes on an eight-core RISC-V compute cluster with indirect stream registers, achieving significant speedups of 2.72x, near-ideal FPU utilizations of 81%, and energy efficiency improvements of 1.58x over an RV32G baseline on average. Scaling out to a 256-core manycore system, we estimate an average FPU utilization of 64%, an average speedup of 2.14x, and up to 15% higher fractions of peak compute than a leading GPU code generator.
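The core idea of register-mapped indirect streams can be modeled in a few lines: precompute the stencil's index stream once, then let the inner loop read through it with no per-access address arithmetic. This is a minimal 1-D 3-point sketch, not SARIS's actual ISA extension:

```python
def build_stream(n, offsets):
    # precomputed index stream for a 1-D stencil; stands in for SARIS's
    # register-mapped indirect streams that remove per-access address math
    return [[i + o for o in offsets] for i in range(1, n - 1)]

def stencil(a, stream, w=(0.25, 0.5, 0.25)):
    # inner loop: pure loads through the stream plus FMAs, no index math
    return [sum(wi * a[j] for wi, j in zip(w, idx)) for idx in stream]

a = [0.0, 1.0, 2.0, 3.0, 4.0]
s = build_stream(len(a), (-1, 0, 1))
out = stencil(a, s)
```

In hardware, the stream generator supplies the indices, which is what lifts FPU utilization toward the 81% reported above.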
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
DescriptionApproximate Computing is a design paradigm that trades off computational accuracy for gains in non-functional aspects such as reduced area, increased computation speed, or power reduction. The latter is of special interest in the field of the Internet of Things. In this paper, we present SAS, a framework for symmetry-based approximate logic synthesis. Given a Boolean multi-output function, SAS approximates it by (partially) replacing its output functions with symmetric functions of minimal Hamming distance. The framework can restrict the introduced error with respect to a parameterized error metric that covers many real-world use cases.
Experimental results on common benchmark sets as well as large bit width arithmetic Boolean functions confirm the effectiveness of the proposed framework. SAS is capable of synthesizing Boolean functions with size reductions of up to approximately 45% while, at the same time, respecting the specified threshold on the error metric. The framework is publicly available as open-source software on GitHub.
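The "nearest symmetric function" step can be sketched exactly: a symmetric function depends only on the input weight (popcount), so a majority vote of the original function within each weight class minimizes the Hamming distance. This brute-force version is a conceptual sketch of that core idea, not SAS's scalable implementation:

```python
from itertools import product

def closest_symmetric(f, n):
    """Return the symmetric function minimizing Hamming distance to f.

    For each input weight, take a majority vote of f over all inputs of
    that weight; this is optimal among symmetric functions.
    """
    votes = {}                       # weight -> [count of 0s, count of 1s]
    for x in product((0, 1), repeat=n):
        votes.setdefault(sum(x), [0, 0])[f(x)] += 1
    value = {w: int(c[1] >= c[0]) for w, c in votes.items()}
    return lambda x: value[sum(x)]

# Example: f is 3-input AND with one flipped entry; the nearest
# symmetric function recovers plain AND (output 1 only at weight 3).
f = lambda x: 1 if x == (0, 1, 0) else int(all(x))
g = closest_symmetric(f, 3)
```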
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionTransformer neural networks demonstrate high performance on various machine learning tasks, including natural language processing (NLP) and computer vision (CV). Compared to Convolutional Neural Networks (CNNs), Transformers rely more heavily on non-linear layers like softmax, which leads to greater latency and energy usage because of limited data reuse and a pronounced pipeline bottleneck. Previous research on approximating softmax has not addressed its high memory access cost and has overlooked, from a high-level perspective, how softmax impacts the attention dataflow pipeline, which matters more than softmax's own computation. We present POEM, a hardware/software co-design approach for softmax computation and subsequent layer fusion. POEM hides the delay of softmax's denominator accumulation by postponing the normalization stage and avoiding the maximum-value search. It further keeps the highly parallel computing pipeline free from congestion caused by memory-intensive operations. The expensive exponential function is replaced by a linear approximation for large values, which both saves LUT resources and improves energy efficiency. We show that attention dataflow with POEM achieves up to 1.84x speedup compared to prior state-of-the-art ASIC designs with minimal extra energy overhead, while maintaining high model accuracy.
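The postponed-normalization idea can be illustrated numerically (a simplified sketch, not POEM's hardware dataflow or its linear exp approximation): numerators and the denominator accumulate in a single pass without a prior maximum-value search, and the division happens once at the end instead of stalling the pipeline between passes.

```python
from math import exp

def softmax_deferred(scores):
    # one-pass sketch: accumulate exp() numerators while summing the
    # denominator, then normalize in a single final pass
    nums, denom = [], 0.0
    for s in scores:
        e = exp(s)           # no max-subtraction pass beforehand
        nums.append(e)
        denom += e
    return [v / denom for v in nums]

p = softmax_deferred([1.0, 2.0, 3.0])
```

In a fused attention pipeline, that final division can further be folded into the subsequent matrix multiplication.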
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionTraining large graph neural networks (GNNs) in distributed systems is quite time-consuming, mainly because of the ubiquitous aggregate operations that involve a large amount of cross-partition communication for collecting embeddings/gradients during the forward and backward propagations. To reduce the communication volume, some recent approaches focus on decaying individual connections via sampling, quantization, or delaying until a satisfactory trade-off between volume and accuracy is obtained. However, when applied to popular GNNs, those approaches are found to be bounded by a common volume/accuracy Pareto frontier, which shows that decaying individual connections cannot further accelerate the aggregation in training. In this work, SC-GNN, a semantic compression of the cross-partition communication, is proposed to condense a group of connections into a high-level semantics and transmit it to the target partition. Since it carries the overall intent of the group, the semantics can keep transferring the interactions, i.e., embeddings/gradients, between a pair of remote partitions until the GNN models converge. In addition, a connection-pattern-based differential optimization is proposed to further prune weak connections while guaranteeing training accuracy. The results show that, for multi-field datasets, the compression rate of SC-GNN is 40.8 times higher than that of SOTA methods and the epoch time is reduced to 31.77% on average.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionOn leading-node designs, we see power supply integrity becoming more important, and meeting the dynamic compression requirement becomes more difficult. We can use various techniques to fix local hotspots, but these techniques can be time-consuming and iterative. Analyzing the power supply effect on timing can be computationally expensive. To maintain schedule, we may need to leave some violations unfixed and model additional timing uncertainty on the violating instances. We used various techniques to model this and found only minor impacts. By allowing a small number of violations to be waived, we were able to improve the schedule.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionMulti-task Deep Neural Network (DNN) inference on Energy Harvesting (EH) devices has received limited attention, particularly in scenarios where energy availability fluctuates significantly. This poses substantial implementation challenges, especially in designing DNN architectures that support both varying energy and multi-tasking on resource-limited systems. Therefore, this work presents a comprehensive software/hardware co-design framework that introduces a Unified DNN model.
This Unified model has the flexibility to scale the network to the varying environment while performing multiple correlated tasks at the same time. To achieve such a model, we propose an inter-task and intra-task shared-weight design approach, where the inter-task design unifies the commonality of the extracted features, and the intra-task design unifies the shared information across multiple sparsity levels.
We further present an efficient on-device implementation scheme in which the Unified model is compressed while preserving consistent inference accuracy. A runtime weight-recollection process is also presented that guarantees dynamic DNN scalability.
Experimental results show that our multi-task Unified DNN, with the proposed hardware implementation architecture, can enable up to a 1.8x inference latency reduction, a 48% gain in energy efficiency, and 3.5x memory conservation.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionAlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its training procedure is prohibitively time-consuming and gets diminishing returns from scaling to more compute resources. In this work, we conducted a comprehensive analysis of the AlphaFold training procedure and identified that inefficient communication and overhead-dominated computation were the key factors preventing the AlphaFold training from scaling effectively. We introduced ScaleFold, a systematic training method that incorporates optimizations targeting these factors. ScaleFold successfully scaled the AlphaFold training to 2080 NVIDIA H100 GPUs with high resource utilization. In the MLPerf HPC v3.0 benchmark, ScaleFold finished the OpenFold benchmark in 7.51 minutes, showing over a 6x speedup over the baseline. For training the AlphaFold model from scratch, ScaleFold completed the pretraining in 10 hours, a significant improvement over the seven days required by the original AlphaFold pretraining baseline.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn this paper, we propose Scaler-FFT, a scalable and mixed-precision FPGA-based FFT architecture based on general matrix multiplication (GEMM). Specifically, Scaler-FFT is configurable and supports FFT computation for different point counts and word lengths.
In addition, we customize a novel data management strategy that allows multiple sets of data to be read from the RAM group simultaneously in different FFT stages, making it possible to perform the FFT via GEMM on a general architecture.
To maintain high accuracy, we also introduce a data-shift strategy to prevent data overflow and increase the signal-to-quantization-noise ratio.
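The FFT-via-GEMM idea rests on the fact that a DFT is itself a matrix-vector product. This toy model shows only that underlying identity, not Scaler-FFT's staged decomposition or fixed-point data management:

```python
from cmath import exp, pi

def dft_matrix(n):
    # n x n DFT matrix: W[k][j] = exp(-2*pi*i*k*j/n)
    return [[exp(-2j * pi * k * j / n) for j in range(n)] for k in range(n)]

def dft_gemm(x):
    # the DFT expressed as a plain matrix-vector product, i.e. a GEMM
    # with a single column -- the operation a GEMM engine can execute
    W = dft_matrix(len(x))
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# DFT of a unit impulse is flat (all ones)
X = dft_gemm([1.0, 0.0, 0.0, 0.0])
```

A staged FFT then factors this dense matrix into sparse butterfly stages, each still expressible as a (smaller) GEMM.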
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionAs a prevalent privacy-preserving technology, the Trusted Execution Environment has become widely adopted in numerous commercial processors. Nonetheless, TEEs remain susceptible to various controlled-channel attacks: untrusted operating systems can deduce enclave secrets by manipulating page tables or observing allocation- or swap-based page faults. In this paper, we propose SecPaging, a novel secure enclave paging mechanism based on hardware-enforced and microcode-supported protection to prevent these attacks. First, enclave PTEs are protected through hardware isolation, preventing privileged attackers from malicious tampering or observation. Second, an Eager-Allocation mechanism is employed to prevent allocation-based controlled-channel attacks. In addition, a Record-Reload mechanism is proposed to prevent swap-based controlled-channel attacks.
Exhibitor Forum
DescriptionModern SoCs require the integration of IP and tools from multiple vendors. Chip designers often must work with their IP vendors, tool vendors, and design service providers. This collaboration often requires enterprises to onboard third parties into their network to jointly work on a solution, requiring security exceptions. By leveraging the cloud as a secure independent collaboration platform, customers no longer need to make compromises with security to onboard third parties. The Microsoft Azure Modeling and Simulation Workbench makes it easy for customers to bring up a secure design environment and invite third parties to collaborate while keeping them isolated to the workbench.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper introduces SeGen, a tool for automatically generating sequencing element designs. Motivated by the fact that the operation of any digital circuit can be represented by a Boolean function, SeGen autonomously generates all possible Boolean functions for positive-edge-triggered sequencing elements, including the existing master-slave edge-triggered flip-flop (FF) and pulsed latch. A total of 47 resulting topologies encompass the entire spectrum of FFs, offering choices for specific applications based on SPICE simulation results. Furthermore, SeGen facilitates the implementation of other behavioral sequencing elements, such as the dual-edge-triggered FF, by adjusting its settings.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionMRAM is one of the most promising candidates for compute-in-memory (CIM). This paper proposes a series-parallel hybrid SOT-MRAM CIM macro to address the shortcomings of existing MRAM-CIM structures, such as the high energy cost and low operating frequency of traditional parallel or serial architectures. Additionally, we incorporate a multi-method modulation scheme, allowing for configurable precision (2/4/6/8-bit). We experimentally verified the performance of SOT-MRAM devices at a 180-nm process node and designed the macro at a 28-nm node based on the test parameters of the fabricated SOT devices. Simulation shows that this macro can achieve an energy efficiency of 23.7-29.6 TOPS/W and a computing frequency of 164.5 MHz/bit at 8-bit precision.
Hands-On Training Session
DescriptionModern semiconductor development requires developers to collaborate with several third parties, such as EDA vendors, IP vendors, and third-party service providers. Developers tackling advanced-node designs often need extensive infrastructure resources that are out of reach for small and medium companies. Cloud-based design environments give customers the scale needed to tackle these challenges.
This session walks customers through how they can easily stand up a turnkey, secure, and scalable engineering environment with the Azure Modeling and Simulation Workbench with secure access needing little to no IT support.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe variational quantum eigensolver (VQE) is a promising candidate for bringing practical benefits from quantum computing. However, the required bandwidth into and out of the cryostat is a limiting factor in scaling cryogenic quantum computers.
We propose a tailored counter-based module built with single flux quantum circuits in the 4-K stage that precomputes part of the VQE calculation and reduces the amount of inter-temperature communication.
The evaluation shows that our system reduces the required bandwidth by 97%; with this drastic reduction, total power consumption is reduced by 93% in the case where 277 VQE programs are executed in parallel on a 10,000-qubit machine.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionSGM-PINN is a graph-based importance sampling framework to improve the training efficacy of Physics-Informed Neural Networks (PINNs) on parameterized problems. By applying a graph decomposition scheme to an undirected Probabilistic Graphical Model (PGM) built from the training dataset, our method generates node clusters encoding conditional dependence between training samples. Biasing sampling towards more important clusters allows smaller mini-batches and training datasets, improving training speed and accuracy. We additionally fuse an efficient robustness metric with residual losses to determine regions requiring additional sampling. Experiments demonstrate the advantages of the proposed framework, achieving 3X faster convergence compared to prior state-of-the-art sampling methods.
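The cluster-biased sampling in the abstract above can be sketched as a two-level draw: pick a cluster with probability proportional to its importance weight, then pick a sample uniformly inside it. The function name, weights, and scheme below are illustrative assumptions, not SGM-PINN's actual API.

```python
import random

def sample_minibatch(clusters, weights, batch_size, rng=random):
    """Draw a mini-batch by first picking a cluster with probability
    proportional to its importance weight, then picking a sample
    uniformly inside the chosen cluster."""
    batch = []
    for _ in range(batch_size):
        cluster = rng.choices(clusters, weights=weights, k=1)[0]
        batch.append(rng.choice(cluster))
    return batch

# Three clusters of sample indices; the third gets most of the sampling mass.
clusters = [[0, 1, 2], [3, 4], [5, 6]]
weights = [0.1, 0.1, 10.0]
batch = sample_minibatch(clusters, weights, batch_size=8)
```

Because high-weight clusters dominate the draw, a smaller mini-batch still concentrates on the samples that matter most, which is the mechanism behind the claimed training speedup.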
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionProcessing-in-Memory (PIM) enhances memory with computational capabilities, potentially solving the energy and latency issues tied to data transfer between memory and processors. However, managing concurrent computation and data movement in PIM is challenging. This paper introduces Shared-PIM, an architecture for in-DRAM PIM that strategically allocates rows in memory banks, bolstered by memory peripherals, for concurrent processing and data flow. Shared-PIM enables simultaneous computation and data transfer within a memory bank. Compared to LISA, a state-of-the-art architecture that facilitates data transfers for in-DRAM PIM, Shared-PIM reduces copy latency and energy by 5x and 1.2x respectively. Furthermore, when integrated into a state-of-the-art (SOTA) in-DRAM PIM architecture (pLUTo), Shared-PIM achieves 1.4x faster addition and multiplication, thereby improving the performance of CNN, FFT, and BFS tasks by 1.3x, 1.27x, and 1.7x respectively, with an area overhead of just 7.16%.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionA Far Memory System (FMS) allows applications to access memory on remote machines (called memory nodes). However, existing FMSs cannot handle large loads and utilize remote memory inefficiently, which makes it impossible to share memory nodes among multiple processes and limits the scalability of FMS.
In this paper, we propose Sharry, an efficient sharing FMS. Sharry manages memory objects from multiple processes within a unified address space, avoiding the overhead of space switching. Sharry also optimizes the utilization of remote memory with fine-grained memory management. Additionally, Sharry offloads memory allocation to a dedicated CPU core in order to handle larger loads in the sharing scenario.
Compared to the state-of-the-art FMS, Sharry improves memory utilization by 45%, causing only a 9% performance degradation when multiple processes share a single memory node.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionBulk bitwise operations are commonplace in application domains such as databases, web search, cryptography, and image processing.
The ever-growing volume of data and processing demands of these domains often result in high energy consumption and latency in conventional systems, mainly due to extensive data movement.
Non-volatile memory (NVM) technologies, such as RRAM, PCM, and STT-MRAM, facilitate performing bulk-bitwise logic operations in memory (compute-in-memory, CIM), eliminating this data movement.
However, mapping complex real-world applications to these CIM-capable NVMs is non-trivial and can lead to sub-optimal performance. To address this, we present SHERLOCK, a novel mapping and scheduling method tailored to exploit the unique characteristics of these systems. SHERLOCK collaboratively optimizes reliability and performance, a previously overlooked aspect that significantly affects both the correctness and throughput of these systems. Our method also leverages the granularity of CIM operations to reduce the number of write operations and, hence, energy consumption. Our evaluation on three representative applications from different domains shows that SHERLOCK outperforms the state-of-the-art in terms of performance and energy consumption.
Front-End Design
AI
Design
Engineering Tracks
Front-End Design
DescriptionThe presentation focuses on bridging the gap between RTL designers and implementation engineers. The removal of registers during the synthesis stage without any root-cause analysis poses significant challenges for implementation engineers late in the cycle. At the same time, RTL designers currently have no alternative for gaining early insight into register optimizations. The presentation highlights the challenges in verifying register optimization and proposes a shift-left methodology using hyper-convergence to accurately detect and root-cause synthesis-optimized registers early in the RTL stage itself. It also outlines future work to extend these methodologies to other implementation tool problems.
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionThe Register Transfer Level (RTL) description of Digital-on-Top (DoT) SoC designs is limited to logical integration.
Conventionally, the early simulation framework involving digital RTL has no native power comprehension; the Power Intent (PI) is captured through CPF/UPF integration.
The Power and Ground (PG) IO cells are physical-only cells (no logical connectivity); they are not inferred or instantiated in RTL.
This results in a lack of:
BIASFET connectivity at the RTL stage of the design (BIASFET is the ESD trigger generated in a PGIO that drives the primary protection devices in IO cells).
Low Power (LP) checks through PGIO paths.
This causes the conventional DMS/AMS simulation setup to fail if it involves IO functionality.
In most mixed-signal embedded processing SoCs, even power-up fails, since BIASFET connectivity impacts external reset propagation.
The proposed solution is compatible with standard/semi-custom implementation flows.
It provides complete coherence across design, implementation, and verification flows.
It enables concurrent execution of LP verification and Physical Design (PD) cycles.
It enables early generation of a Power Aware (PA) netlist, and hence early verification of power intent with PG-IO-aware RTL.
It eliminates manual work-arounds in traditional LP mixed-signal/analog verification.
It verifies chip-level ESD integration across digital domains at an early stage of the design using PA-RTL, avoiding the risk of discovering ESD triggering issues late in the design.
It enables debugging of ESD protection circuit integration issues at an early stage.
As a result, overall design cycle time and RTL freeze quality are improved.
Conventionally, issues with the ESD architecture, including BIAS connectivity, can only be identified at the post-synthesis Gate Level (GL) stage, at least three months later than the RTL stage.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionSeveral silicon failures were identified in NXP SoCs due to incorrect voltage level shifting, caused either by incorrect Liberty/UPF views generated for the IPs or by issues in design practices. Incorrectness in Liberty/UPF views propagates through the design and shields the detection of discrepancies that might arise from invalid voltage level shifts at various stages of the design cycle.
A two-stage PERC-based validation mechanism, built on Calibre PERC's static voltage tracing, was added to NXP's PERC solution in the form of the following checks:
1) IP/block level - checks the sanctity of the UPF/.lib file w.r.t. the associated power and ground domains for IO (signal) pins. This check helps identify issues at the IP level.
2) SoC level - checks all nets with power level shifts and verifies the presence of a valid level shifter on each identified interface net. This is an umbrella check covering all scenarios of invalid power level shifts across the design, irrespective of hierarchy.
The efficacy of the checks was established at both the IP and SoC levels (#1 and #2 respectively), making them part of NXP's standard IP and SoC verification process. Since validation can start as early as the IP level, the checks contribute to NXP's shift-left initiative.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Description· With shrinking process nodes, dynamic voltage drop becomes an increasingly challenging issue due to increased silicon frequency and reduced voltage headroom.
· Advanced tech nodes also have a more resistive power grid (PG). The combination of PG complexity and growing design size is making IR closure more difficult.
· Turn-around time for the construction flow has also kept increasing, adding further pressure on IR closure. Given all of the above, an efficient shift-left methodology is essential to improve the productivity of IR fixing.
· Traditionally, IR fixing has been done manually at the post-route stage due to the lack of integrated in-construction automation. In this work, a methodology is proposed that utilizes the power integrity solutions in the Redhawk-Fusion EDA tool to perform in-design IR analysis and automated fixing during construction.
With the proposed solution, a satisfactory reduction in IR violations was observed (67% dynamic / 100% static in our test cases), along with a significant reduction in turn-around time and ECO iteration loops for EM/IR convergence.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionFlip chip mounting has been widely used in recent years, offering advantages such as shorter signal wires, a smaller footprint, and support for multiple chip(let)s.
However, flip chip packaging exposes the Si substrate as an attack surface, and the Si substrate voltage then becomes a source of side-channel information.
Therefore, we develop an analysis flow for the Si substrate voltage using a Chip Power Model (CPM). The CPM is built from the power library of the standard cells, the logic transitions of the digital circuit, and the design data. To analyze the Si substrate voltage accurately, the design data required to create the CPM includes the Si substrate configuration, thickness, resistance, and capacitance. A CPM is created for each dataset, changing the input vectors for side-channel leakage evaluation.
We confirm that a side-channel attack succeeds using waveforms from the CPM.
Furthermore, by analyzing the waveforms from the CPM, we find evidence of localized, chip-thickness-dependent noise propagation. Regarding locality, we also confirmed the match between measurement and simulation.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionTiming closure is a critical but effort-intensive task in VLSI design. Early design stages have relatively ample room for changes that can fix timing problems proactively. However, accurate timing prediction is very challenging at early stages due to the absence of information determined later in the design flow. At the pre-routing stage, predicting wire delay is generally believed to be more complicated than predicting gate delay, since the former is highly dependent on routing information and PVT conditions. Addressing this, this work studies a prediction model and explores the importance of multiple features, with the aim of shortening the turn-around time of physical design and reducing the performance penalty caused by worst-case assumptions. Experimental results show that the proposed timing predictor achieves a correlation over 0.98 with the Signal-Integrity (SI) sign-off timing results under multi-corner scenarios.
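As a toy illustration of pre-routing delay prediction against sign-off results: one can fit a least-squares model from an early feature (here, an estimated wirelength) to sign-off wire delay and then measure the correlation of its predictions. The feature choice and data values below are made up; the paper's actual model and features are not specified here.

```python
def fit_linear(X, y):
    """Ordinary least squares for one feature plus an intercept."""
    n = len(X)
    mx, my = sum(X) / n, sum(y) / n
    cov = sum((x - mx) * (v - my) for x, v in zip(X, y))
    var = sum((x - mx) ** 2 for x in X)
    slope = cov / var
    return slope, my - slope * mx

def pearson(a, b):
    """Pearson correlation coefficient between two sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

wirelength = [10.0, 25.0, 40.0, 80.0]     # hypothetical pre-route estimates (um)
signoff_delay = [5.1, 12.4, 19.8, 40.2]   # hypothetical sign-off wire delays (ps)
slope, intercept = fit_linear(wirelength, signoff_delay)
predicted = [slope * x + intercept for x in wirelength]
r = pearson(predicted, signoff_delay)     # near 1.0 for this toy data
```

The reported 0.98+ correlation is exactly this kind of predicted-versus-sign-off comparison, only with a far richer feature set and multi-corner SI timing.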
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionSigmaDVD is a unique simulation method that provides complete power grid noise coverage for 100% of the design instances. This novel simulation technique generates tens of thousands of unique, physically and timing-relevant switching scenarios for each instance independently, finding the statistically relevant worst-case voltage drop, or SigmaDVD value, per instance. This analysis closes a known gap in the power grid noise coverage available from other methods, such as vector-based and vectorless simulations as well as BQM. The main considerations when comparing this new IR-drop flow with other techniques are coverage (what percentage of hotspots from other IR-drop methods does SigmaDVD cover?) and how to handle the increase in hotspots/violating instances caused by the massive increase in noise coverage. We quantified the new flow's coverage capabilities through heatmap comparisons. In this presentation, we will first discuss the theory of SigmaDVD, the trials we conducted to compare SigmaDVD with other IR-drop methods, and applications of the latest tool features, such as the aggressor view, to prioritize, root-cause, and avoid power noise hotspots.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionMetal-oxide-metal capacitance (MOMCAP) is widely used in integrated circuits because of its high unit capacitance, low parasitics, and good RF characteristics. As MOMCAP becomes increasingly sensitive to manufacturability at advanced technology nodes, the foundry typically provides design rules with large margins to ensure yield. Moreover, the reliability of non-standard MOMCAPs cannot be effectively guaranteed under differentiated competition, leading to deteriorated reliability.
To solve these problems, we propose a breakdown simulation and measurement methodology. By applying a fixed potential to the MOMCAP plate and analyzing the relationship between the electric field intensity and the breakdown field intensity, we predict the breakdown risk. On the basis of the simulation, the results are further validated by real measurements: decaps are used to make samples, and a nanoprobe is used to test the BV curve. The test results show that the breakdown voltage is almost consistent with the simulation, which verifies the validity of both the simulation and the nanoprobe results. The method provides accurate guidance for predicting subsequent projects of the same type. In addition, considering the outstanding performance of TCAD, we have also performed reliability simulations such as BV and TDDB on MOSFET devices to promote the capability of DTCO.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionBinary adders are a critical building block in integrated circuit (IC) design. In addition to the widely used 32/64/128-bit adders, large (1024/2048-bit) adders are important in applications such as cryptography. However, most current adder design methods target regular bit widths and cannot efficiently generate large adders with good performance. In practice, adders are often integrated into circuits such as a multiplier-accumulator (MAC), resulting in complex non-uniform input arrival times. To address these challenges, we propose a new algorithm for efficiently generating high-quality adders under non-uniform input arrival times. It is based on a novel divide-and-conquer-friendly problem formulation and can effectively generate and maintain the most useful adder structures through dynamic programming.
Experimental results show that it outperforms the current state-of-the-art methods in both quality and runtime. The adders generated by our algorithm have 2.8%, 8.3%, and 10.3% reductions in delay, area, and power, respectively, compared to those generated by a commercial synthesis tool.
Research Manuscript
EDA
Physical Design and Verification
DescriptionPlacement is one of the most essential problems of VLSI physical design. Recently, electrostatics-based placement has achieved great success and inspired many placement algorithms. However, the recent direction of improvement misses two important problems for mixed-size placement: 1) how to initialize the placement, and 2) how to handle macros in analytical placement. In this paper, we propose a new mixed-size placer, SkyPlace, which is enhanced by a novel placement initialization using semidefinite programming relaxation and a density weighting technique. Our experimental results show that SkyPlace clearly outperforms the leading-edge placer on the MMS benchmarks.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn this innovative approach to IP validation, we address the growing complexity of designs by integrating advanced Fault Simulation (FS) and Artificial Intelligence/Machine Learning (AI/ML) techniques. Traditionally, validating I/O in complex IP designs required extensive test vectors, leading to prolonged simulation times. Recognizing these challenges, we propose a paradigm shift, leveraging AI/ML-generated models to streamline the process. The implementation flow begins with fault-list generation using Python scripts through IP design simulation, creating test vectors for the identified signals. These vectors, including GM and FM stimuli and simulation runs, are then processed through an AI/ML tool (Colab), resulting in a robust model built from 20% of the test vectors. The model is validated against the remaining 80% of the test vectors for maximum accuracy between predicted and actual results. Applying the model to the next run of IP design simulation on 20% of the test vectors enables prediction of the remaining 80% of the results. Notably, fault simulation with the reduced test vector set markedly decreases simulation time. Our methodology brings efficiency gains to I/F validation, offering continuous and reliable processes adaptable to design iterations. Through this integration of AI/ML and FS, we present a comprehensive solution that not only optimizes testing efficiency but also ensures robustness in validating modern, intricate IP designs.
Research Manuscript
EDA
Test, Validation and Silicon Lifecycle Management
DescriptionAutomatic test pattern generation (ATPG) is a critical technology in integrated circuit testing. It searches for effective test patterns to detect all possible faults in the circuit as completely as possible, thereby ensuring chip yield and improving chip quality. However, the process of searching for test patterns is NP-complete, and the large amount of backtracking generated during the search directly affects ATPG performance. In this paper, a learning-based ATPG framework, SmartATPG, is proposed to search for high-quality test patterns and reduce the amount of backtracking during the search, thereby improving ATPG performance. SmartATPG utilizes a graph convolutional network (GCN) to fully extract circuit feature information and efficiently explores the ATPG search space through reinforcement learning (RL). Experimental results show that the proposed SmartATPG performs better than traditional heuristic strategies and deep-learning heuristic strategies on most benchmark circuits.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionAs the de facto high-throughput accelerators targeting a wide spectrum of applications, graphics processing units (GPUs) keep adding computing and memory resources to meet increasing demands. However, while designed for massive parallelism, GPUs frequently suffer from low thread occupancy and limited data throughput, typically attributed to constrained on-chip resources such as shared memory and the register file. To alleviate the pressure, the last-level cache (LLC) is being substantially enlarged to support continuously growing computation and to shrink off-chip data traffic. Nevertheless, the LLC's frequently low usage leaves its space wasted, preventing it from fully unleashing its potential. To address this issue, we propose managing part of the LLC in software to expand the precious shared memory, an approach named SMILE, which helps alleviate low occupancy. SMILE splits the monolithic LLC into a normal data cache and a new software region, with the latter extending the limited SMEM. To adapt to diverse application characteristics, SMILE enables multiple splitting grades and determines the appropriate partition through online profiling among streaming multiprocessors. Experimental results show that SMILE achieves average performance improvements of 14.7% and 8.4% respectively, compared to the default baseline and the prior state-of-the-art.
Research Manuscript
AI
AI/ML Algorithms
DescriptionMany real-world applications of the Internet of Things (IoT) employ machine learning (ML) algorithms to analyze time series information collected by interconnected sensors. However, distribution shift, a fundamental challenge in data-driven ML, arises when a model is deployed on a data distribution different from the training data and can substantially degrade model performance. Additionally, increasingly sophisticated deep neural networks (DNNs) are proposed to capture intricate spatial and temporal dependencies in multi-sensor time series data, often exceeding the capabilities of today's edge devices. In this paper, we propose SMORE, a novel resource-efficient domain adaptation (DA) algorithm for multi-sensor time series classification, leveraging the ultra-efficient operations of hyperdimensional computing. SMORE dynamically customizes test-time models with explicit consideration of the domain context of each sample to provide accurate predictions when confronted with domain shifts. Our evaluation on a variety of multi-sensor time series classification tasks shows that SMORE achieves on average 1.98% higher accuracy than state-of-the-art (SOTA) DNN-based DA algorithms with 18.81× faster training and 4.85× faster inference.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionContemporary methods for hardware security verification struggle with adaptability, scalability, and availability due to the increasing complexity of the modern System-on-Chips (SoCs). In this light, we introduce SoCureLLM, a Large Language Model (LLM)-based framework that excels in identifying security vulnerabilities within SoC designs and creating a comprehensive security policy database. This scalable framework processes varied, large-scale designs, overcoming token limitation and memorization issues of existing LLMs. In evaluations, SoCureLLM detected 76.47% of security bugs across three vulnerable SoCs, outperforming the state-of-the-art security verification methods. Furthermore, assessing three additional large-scale SoC designs against various threat models led to the formulation of 84 novel security policies, enriching the security policy database.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe introduce a practically efficient algorithm for approximating maximum flows in large undirected graphs, building on recent high-performance spectral algorithms. Our approach exploits a resistor-network optimization framework that can be further accelerated by leveraging nearly linear-time graph Laplacian solvers. By iteratively sizing up highly congested edges and sizing down edges with high effective s-t resistance sensitivities, approximate maximum flows can be obtained highly efficiently. The proposed method has also been extended to solve multi-commodity flow problems. We demonstrate the effectiveness and efficiency of the proposed approach by comparing it with prior state-of-the-art methods on IBM VLSI benchmarks and other public-domain networks such as social networks.
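The abstract's iterate-solve-reweight loop can be sketched in Python. This is only an illustrative sketch: the multiplicative congestion-penalizing update below is a generic electrical-flow template, not the paper's actual edge-sizing rules, and `np.linalg.pinv` stands in for a nearly linear-time Laplacian solver. All names and parameters are assumptions.

```python
import numpy as np

def electrical_flow(edges, r, n, s, t):
    """One electrical-flow step: build the graph Laplacian from edge
    resistances, solve L x = b for node potentials, and read off the
    edge flows (x_u - x_v) / r_e."""
    L = np.zeros((n, n))
    for (u, v), re_ in zip(edges, r):
        g = 1.0 / re_
        L[u, u] += g; L[v, v] += g
        L[u, v] -= g; L[v, u] -= g
    b = np.zeros(n); b[s], b[t] = 1.0, -1.0      # unit s-t demand
    x = np.linalg.pinv(L) @ b                    # placeholder for a fast Laplacian solver
    return np.array([(x[u] - x[v]) / re_ for (u, v), re_ in zip(edges, r)])

def approx_max_flow(edges, n, s, t, iters=10, alpha=0.5):
    """Generic reweighting loop: penalize the most congested edges so
    flow spreads out (an assumed update rule, not the paper's)."""
    r = np.ones(len(edges))
    for _ in range(iters):
        f = electrical_flow(edges, r, n, s, t)
        r *= 1.0 + alpha * (np.abs(f) / np.abs(f).max())
    return electrical_flow(edges, r, n, s, t)

# Tiny 4-node example: s=0, t=3, two parallel two-edge paths
edges = [(0, 1), (1, 3), (0, 2), (2, 3)]
f = approx_max_flow(edges, 4, 0, 3)
print(round(abs(f[0]) + abs(f[2]), 2))  # flow conservation at s -> 1.0
```

By Kirchhoff's current law, the flows leaving the source always sum to the injected demand, which the example checks.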
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionDesign teams find it increasingly challenging to debug antenna violations, especially at advanced nodes, due to the increasing complexity of antenna rules. Antenna rule checks may contain multiple scenarios with different conditional constructs, which make it difficult for engineers not only to distinguish which equation was used to calculate the failure, but also to determine how to fix the issue. They typically rely on multiple runs or a trial-and-error method to fix antenna violations, both of which are inefficient, time-consuming solutions.
We present an innovative antenna debugging flow that calculates the exact number of diodes that should be added to fix antenna errors in a single run. Given the required diode area that should be added to fix an antenna violation, as well as the option to categorize violations by net, designers can now resolve antenna errors accurately and efficiently.
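The core arithmetic behind such a flow is easy to illustrate: once the required diode area for a violation is known, the diode count is a ceiling division by the area of one insertable diode. The function name, units, and unit-diode area below are illustrative assumptions, not details from the presentation.

```python
import math

def diodes_needed(required_diode_area_um2, unit_diode_area_um2):
    """Number of unit diodes whose combined area covers the required
    protection area reported for an antenna violation."""
    if required_diode_area_um2 <= 0:
        return 0  # net already passes the antenna check
    return math.ceil(required_diode_area_um2 / unit_diode_area_um2)

# Example: a violation needing 5.3 um^2 of diode area, with a
# hypothetical standard-cell diode of 1.2 um^2
print(diodes_needed(5.3, 1.2))  # -> 5
```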
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionSparse general matrix-matrix multiplication is widely used in data mining applications. Its irregular memory access patterns limit the performance of general-purpose processors, motivating many FPGA-based hardware innovations. Nevertheless, existing accelerators fail to efficiently support heterogeneous input matrix sparsity, which is universal in real-world applications. Through in-depth experimental analysis, we observe that their performance is bottlenecked by their fixed tiling mechanisms, which alleviate the irregularity of only one input matrix. Based on this observation, we propose SpaHet, a software/hardware co-design to accelerate sparse matrix multiplication with heterogeneous sparsity. Our experimental results show that SpaHet outperforms state-of-the-art FPGA-based solutions by 2.74× in performance.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionIn this paper, we propose SpARC, a sparse attention transformer accelerator that enhances throughput and energy efficiency by reducing the computational complexity of the self-attention mechanism. Our approach exploits inherent row-level redundancies in transformer attention maps to reduce the overall self-attention computation. By employing row-wise clustering, attention scores are calculated only once per cluster, achieving approximate attention without seriously compromising accuracy. To leverage the high parallelism of the proposed clustering-based approximate attention, we develop a fully pipelined accelerator with a dedicated memory hierarchy.
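The row-wise clustering idea can be sketched in a few lines of NumPy: query rows sharing a cluster label reuse a single attention computation. This is only an illustrative sketch of clustering-based approximate attention; the cluster labels, the centroid choice, and all names are assumptions, not SpARC's actual design.

```python
import numpy as np

def clustered_attention(Q, K, V, labels):
    """Approximate attention: rows of Q in the same cluster share one
    attention computation, so scores are evaluated once per cluster."""
    d = Q.shape[1]
    out = np.empty((Q.shape[0], V.shape[1]))
    for c in np.unique(labels):
        rows = labels == c
        centroid = Q[rows].mean(axis=0)        # one representative query
        scores = centroid @ K.T / np.sqrt(d)   # computed once per cluster
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[rows] = w @ V                      # shared output for the cluster
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])   # hypothetical row clusters
print(clustered_attention(Q, K, V, labels).shape)  # -> (8, 4)
```

With three clusters over eight query rows, only three score/softmax evaluations are performed instead of eight.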
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
DescriptionCurrently, most TinyML devices focus only on inference, as training requires far more hardware resources. In this paper, we introduce SPARK, an efficient hybrid acceleration architecture with run-time sparsity-aware scheduling for TinyML learning. Besides a stand-alone accelerator, an in-pipeline acceleration unit is integrated within the CPU pipeline to support simultaneous forward and backward propagation. To better utilize sparsity and improve hardware utilization, a sparsity-aware acceleration scheduler is implemented to schedule the workload between the two acceleration units. A unified memory system is also constructed to support transposable data fetches, reducing memory accesses. We implement SPARK in TSMC 22nm technology and evaluate it on different TinyML tasks. Our work is the first architecture to utilize two acceleration units for on-device learning. Compared with the baseline accelerator, SPARK achieves a 4.1x performance improvement on average with only 2.27% area overhead. SPARK also outperforms off-the-shelf edge devices by 9.4x in performance with 446.0x higher efficiency.
Research Manuscript
Security
Embedded and Cross-Layer Security
DescriptionRunahead execution is a continuously evolving microarchitectural technique for improving processor performance. This paper introduces the first transient execution attack on runahead execution, called SPECRUN, which exploits unresolved branch predictions during runahead execution. We show that SPECRUN eliminates the limitation on the number of transient instructions imposed by the reorder buffer size, enhancing the exploitability and harmfulness of the attack. We concretely demonstrate a proof-of-concept attack that leaks secrets from a victim process, validate the merit of SPECRUN, and design a secure runahead execution scheme. This paper highlights the need to consider the security of potential optimization techniques before implementing them in a processor.
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionThe identification and quantification of proteins through mass spectrometry (MS) are foundational to proteomics, offering insights into biological systems and disease states. However, current clustering tools struggle to process large-scale datasets. We propose SpectraFlux, a multi-FPGA architecture for accelerated mass spectrum clustering that outperforms existing CPU, GPU, and FPGA designs. It employs heterogeneous clustering kernels for adaptive bucket size management and optimizes memory usage by distinguishing between on-chip and high-bandwidth memory (HBM) storage solutions. SpectraFlux is built upon the TAPA-CS framework, which automatically compiles and partitions a large dataflow design across multiple chips with RDMA-based inter-FPGA communication. Our solution shows a 2.7× speedup on a quad-FPGA platform compared to a single FPGA. Additionally, we introduce a refined cost model for frame-based inter-FPGA communication to better accommodate the variable data rates inherent in proteomic data processing, which reduces inter-FPGA data movement by up to 73%. Finally, SpectraFlux achieves speedups of up to 11× and 17× over SOTA FPGA and GPU accelerators, respectively.
Research Manuscript
Autonomous Systems
Autonomous Systems (Automotive, Robotics, Drones)
DescriptionProtocols in autonomous vehicles are essential for efficient in-vehicle network communication. To ensure their security, many research efforts have been devoted to fuzz testing their implementations. However, these fuzzing optimizations often struggle to manage the protocols' complex state, resulting in low efficiency in branch coverage and vulnerability detection.
This paper introduces SPFuzz, a stateful-path-based parallel fuzzing framework that improves the testing performance of protocols in autonomous vehicles. The basic idea is to accelerate fuzzing by dividing tasks to reduce conflicts and dispatching them to different fuzzing instances. SPFuzz first leverages protocol state and data models to generate stateful paths, then divides them into discrete tasks and dispatches them based on their complexity and diversity, ensuring a balanced workload distribution across all fuzzing instances. For evaluation, we implement SPFuzz on top of the state-of-the-art protocol fuzzer Peach and conduct experiments on four prominent vehicle protocols: ZMTP, MQTT, DDS, and DoIP. The results show that, compared to the original parallel mode of Peach, SPFuzz reaches the same code coverage 2.8X-473.2X faster, with 5.52% more branch coverage within 24 hours. SPFuzz uncovered six previously unknown vulnerabilities in these heavily tested protocol implementations, with four CVEs assigned in the national vulnerability database. Additionally, SPFuzz has been adapted to ECUs from several vendors, such as NISSAN, and triggered a total of four vulnerabilities that may cause system crashes.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis work presents the first fully standard-compliant and lightweight hardware implementation of the SLH-DSA (formerly SPHINCS+) post-quantum digital signature scheme. The design is parameterizable across different security levels and performance targets. Depending on the configurable parameters, the area footprint of the presented hardware implementation on a Xilinx Artix-7 FPGA lies in the range of 3K to 14K LUTs, currently the smallest SPHINCS+ implementation in the literature. The present implementation of SLH-DSA is suitable for lightweight and area-constrained devices, for example, as IoT and other devices migrate to post-quantum cryptography.
Research Manuscript
Design
Quantum Computing
DescriptionThe current Noisy Intermediate-Scale Quantum (NISQ) era suffers from high quantum readout error, which severely reduces measurement fidelity. Matrix-based error mitigation has been demonstrated as a promising software-level technique that performs matrix-vector multiplication to calibrate the noisy probability distribution. However, this approach shows poor scalability and limited fidelity improvement, as the matrix size grows exponentially with the number of qubits. In this paper, we propose SpREM to exploit the inherent sparsity in the mitigation matrix. Inspired by the interaction mechanism between qubits, we identify structured sparsity patterns using Hamming distance. With this insight, we propose the Hamming-Distance Sparse Row (HDSR) compression method and its format, which achieves higher sparsity than threshold-based pruning while exhibiting a greater fidelity improvement. Finally, we propose the computational dataflow of the HDSR format and implement it in hardware. Experiments demonstrate that SpREM achieves 98.9% sparsity and a 27.3× reduction in fidelity loss on a real-world quantum device, compared to threshold pruning. It achieves an average 11.2×–36.4× speedup compared to the Xilinx Vitis SPARSE library and NVIDIA A100 GPU implementations.
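The structured-sparsity insight can be illustrated with a toy sketch: because readout errors are dominated by few-bit flips, for each row of the mitigation matrix only the columns whose measurement bitstrings lie within a small Hamming distance of the row's bitstring are kept. This is an assumption-laden illustration of the idea only, not the actual HDSR format or dataflow.

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two bitstrings given as integers."""
    return bin(a ^ b).count("1")

def hamming_sparse_rows(A, max_dist=1):
    """For each row i of an n x n mitigation matrix (n = 2^qubits),
    keep only entries A[i, j] with Hamming(i, j) <= max_dist."""
    n = A.shape[0]
    rows = []
    for i in range(n):
        cols = [j for j in range(n) if hamming(i, j) <= max_dist]
        rows.append((cols, A[i, cols]))
    return rows

# 2-qubit (4x4) toy mitigation matrix: near-identity plus small noise
A = np.eye(4) + 0.01 * np.ones((4, 4))
compressed = hamming_sparse_rows(A, max_dist=1)
print([len(c) for c, _ in compressed])  # -> [3, 3, 3, 3]
```

Each row keeps itself plus its single-bit-flip neighbors, so the stored entries grow roughly linearly in the number of qubits per row rather than exponentially.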
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionToward scalable implementation of superconducting quantum computers, this paper proposes a cost-effective RF pulse generator architecture. Most existing works use arbitrary waveform generators (AWGs) with huge memories and complex analog circuits to optimize gate fidelity with complex RF pulse waveforms. The proposed architecture simplifies these complex RF pulse waveforms to cost-aware square pulses, eliminating the high-circuit-cost AWGs that are an obstacle to scalable implementation. This paper also proposes a pulse tuning method to maximize gate fidelity. Dynamics simulations of transmons demonstrate that our approach can achieve gate fidelity comparable to ideal RF pulses.
Workshop
Security
DescriptionThe diminishing returns of technology scaling on performance have paved the way for innovation in computer architecture, shifting towards heterogeneous, domain-specific architectures. Modern systems incorporate domain-specific accelerators and specialized system components (buses, networks-on-chip, peripherals, sensors, etc.) to efficiently manage complex and computationally demanding workloads.
A widely adopted approach to reduce the System-on-Chip (SoC) design complexity involves a hierarchical strategy that differentiates the system design efforts for the components of the heterogeneous architecture. This encompasses: (i) expensive in-house RTL development for critical modules, (ii) leveraging the most recent high-level synthesis (HLS) tools, and/or (iii) outsourcing highly specialized third-party intellectual property (IP) modules to reduce costs and development time.
Despite its advantages, such diversified design methodology exacerbates the challenge of system integration. Moreover, recent studies have demonstrated how careless system integration can lead to dangerous conditions, impacting the security, safety, and performance of the system. This can result from a combination of factors, including development bugs, lack of specifications, superficial verifications of IP components' behavior at the system level, and a scarcity of mechanisms supporting safe and secure system execution.
Addressing these challenges requires innovative approaches in the design and verification process, especially when dealing with the stringent safety and security requirements of mission-critical systems. The research community can play a disruptive role in overcoming these challenges. The availability of the complete codebase of multiple mature open hardware architectures and reconfigurable platforms represents an unprecedented opportunity for the development, testing, and native integration of novel mechanisms, tools, and analyses supporting security, safety, and performance efficiency in the next generation of systems.
This workshop welcomes work-in-progress contributions and innovative directions aimed at addressing the challenges and profiting from the opportunities provided by open hardware designs and architectures for the development of next-generation heterogeneous SoCs. The topics of the workshop include, but are not restricted to:
Security verification for hardware designs and system architectures
Architectural aspects of secure system integration
Secure system integration of third-party hardware components
Automated firmware generation supporting secure system execution
Security aspects of reconfigurable designs
Time-predictable system execution in open-hardware designs
Performance analysis, timing analysis, and worst-case analysis supporting time-predictable system execution and/or communications in open-hardware designs
Automated firmware generation supporting time-predictable execution
Fault tolerance and execution in harsh conditions leveraging open-hardware designs
System architectures and methodologies supporting energy efficient/performant system execution in open-hardware designs
Hardware/software co-design, co-integration and co-verification of open-source processors, accelerators, and components
Open architectures for reconfigurable platforms and open CAD tools
Tools and analysis for open FPGAs and reconfigurable platforms
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs designs grow larger and more complex, more advanced Design for Test (DFT) approaches continue to be developed to keep up with the required capacity. One of these approaches is the "Streaming Scan Network" (SSN), which distributes scan test data across the entire design through a bus structure and allows for easy scalability with independent scan channels in each block/core of the design. Another feature used as part of DFT is the "Early Margin Adjust" (EMA) capability of memory macros, which allows memory timing margins to be adjusted by setting register values during test bring-up. Both functions require distribution to and through all hierarchical blocks in the design, which historically has been handled manually, with the Physical Design team determining bus traversal paths through blocks and feeding them back to the DFT team for implementation. This approach can be extremely time-consuming due to the complexities of chip floorplans introduced by rectilinear shapes and hierarchical block reuse, and is therefore often deferred, risking late-breaking issues. This presentation details a set of systems designed to automate the generation of paths through a design, providing access to optimized bus distribution orders for DFT implementation, starting from the first floorplans.
Research Manuscript
EDA
Design Verification and Validation
DescriptionThe ever-expanding scale of integrated circuits has brought about a significant rise in the design risks associated with radiation-resistant integrated circuit chips. Traditional single-particle experimental methods, with their iterative design approach, are increasingly ill-suited to the challenges posed by large-scale integrated circuits. In response, this article introduces a novel sensitivity-aware single-particle radiation effects simulation framework tailored for System-on-Chip platforms. Based on the SVM algorithm, we implement fast identification and classification of sensitive circuit nodes. Additionally, the methodology automates soft error analysis across the entire software stack. The study includes practical experiments focusing on the RISC-V architecture, encompassing core components, buses, and memory systems. It culminates in the establishment of databases for Single Event Upsets (SEU) and Single Event Transients (SET), showcasing the practical efficacy of the proposed methodology in addressing radiation-induced challenges at the scale of contemporary integrated circuits. Experimental results show up to a 12.78× speed-up while achieving 94.58% accuracy.
Research Manuscript
Design
Emerging Models of Computation
DescriptionEmerging resistive RAM (ReRAM) devices can execute vector-matrix multiplication (VMM) in situ for scientific computing. However, the separated peripheral S&Hs and ADCs used for row buffering and sensing in conventional designs are the system bottleneck. We propose an ADC-less, all-in-one subarray-VMM-sensing design that enables precharge-once, read-out-multiple-bits functionality. We propose a cascaded-feedback bitline sensing architecture and a buffering-and-sensing-collocated sense amplifier design, with the bitline and storage node fully decoupled to enable conflict-free column accesses. We further propose cross-level interleaving for successive VMM accesses. Experimental results show that our design achieves a 297% performance improvement and an 85.8% energy reduction compared with an aggressive baseline.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionOne of the most critical limitations to scalable graph mining is memory capacity, as graphs of interest continue to grow while the rate of DRAM scaling diminishes. While high-performance NVMe storage is cheap and dense enough to better support larger graphs, the relative performance limitations of secondary storage force a cost-performance trade-off. We present STING, which uses an asynchronous callback function to provide a general interface to in-storage graphs while allowing transparent near-storage acceleration. Using triangle counting, we show that with transparent filtering and sorting acceleration, STING can improve on the state of the art by 3x in cost and power efficiency.
Exhibitor Forum
DescriptionSemiconductor design complexity has increased exponentially in recent years. Teams have gone from building relatively simple designs across a handful of design centers to building platforms of staggering complexity across multiple, geographically dispersed, integrated design centers. Meanwhile, time to market pressures continue to intensify.
To face these challenges, organizations must increase efficiency across all aspects of the design lifecycle. One effective way to do this is through the Transformation Model for IP-Centric Design: a blueprint for improving IP reuse, end-to-end traceability, and collaboration at enterprise scale.
Currently, most teams employ a traditional, project-centric, "copy-and-modify" design methodology. While this approach worked well when projects and teams were smaller, today the inefficiencies of project-centric design are clear. Projects are siloed, so teams end up re-solving the same problems and manually tracking IP, project-by-project. This lack of centralized coordination makes it difficult to pool resources, meet compliance requirements, or scale design and development.
This is where an IP-centric design methodology comes into play.
IP-centric design creates a centralized system for design management across varied projects, cross-functional teams, and globally dispersed design centers. By transitioning to an IP-centric design methodology, organizations can achieve the goal of a streamlined, fully traceable, horizontally scaling, single source of truth for all design management needs across hardware, firmware, and software projects and platforms. The benefits include improved collaboration, accelerated design, more informed build vs. buy decisions, and a streamlining of efforts across design teams.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe introduce SVDE, a serverless cloud framework for video processing. SVDE effectively addresses the neglected inefficiencies introduced by modern serverless video processing, caused by advanced video encoding and imbalanced network latency. SVDE leverages a decision tree regression model to optimize scheduling decisions, while holistically considering the chunk size variability, hardware heterogeneity, node queuing status, and network imbalances. Compared to existing solutions, SVDE demonstrates a significant performance improvement, achieving up to 1.87× speedup across ten real-world video workloads.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionEfficiently supporting long context lengths is crucial for Transformer models. The quadratic complexity of the self-attention computation plagues traditional Transformers. Sliding-window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens, reducing the theoretical complexity from quadratic to linear. Although the sparsity induced by window attention is highly structured, it does not align perfectly with the microarchitecture of conventional accelerators, leading to suboptimal implementations. In response, we propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long inputs. The proposed microarchitecture is based on a design that maximizes data reuse by combining row-wise dataflow, kernel fusion optimization, and an input-stationary design that accounts for the distributed memory and computation resources of FPGAs. Consequently, it achieves up to 22x and 5.7x improvements in latency and energy efficiency compared to the baseline FPGA-based accelerator, and 15x higher energy efficiency compared to a GPU-based solution.
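The linear-complexity claim of sliding-window attention is easy to see in a reference sketch: each token attends only to a fixed window around itself, so the work per token is O(w) instead of O(n). This NumPy sketch illustrates the attention pattern only; it reflects nothing of SWAT's dataflow, fusion, or FPGA mapping, and all names are illustrative.

```python
import numpy as np

def window_attention(Q, K, V, w=2):
    """Sliding-window attention: token i attends only to tokens in
    [i-w, i+w], so total work is O(n*w) rather than O(n^2)."""
    n, d = Q.shape
    out = np.empty_like(V)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        s = Q[i] @ K[lo:hi].T / np.sqrt(d)   # scores over the window only
        p = np.exp(s - s.max())
        out[i] = (p / p.sum()) @ V[lo:hi]
    return out

rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(16, 8))
print(window_attention(Q, K, V, w=2).shape)  # -> (16, 8)
```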
Research Manuscript
EDA
Design Verification and Validation
DescriptionSymbolic quick error detection (SQED) has greatly improved efficiency in formal chip verification. However, it has a limitation in detecting single-instruction bugs due to its reliance on the self-consistency property. To address this, we propose a new variant called symbolic quick error detection by semantically equivalent program execution (SEPE-SQED), which utilizes program synthesis techniques to find sequences with meanings equivalent to the original instructions. SEPE-SQED effectively detects single-instruction bugs by differentiating their impact on the original instruction and its semantically equivalent program (instruction sequence). To manage the search space associated with program synthesis, we introduce a counterexample-guided inductive synthesis (CEGIS) procedure based on a highest-priority-first algorithm. The experimental results show that our proposed CEGIS approach reduces the time to generate the desired set of equivalent programs by 50% compared to previous methods. Compared to SQED, SEPE-SQED offers a wider variety of instruction combinations and can provide a shorter trace for triggering bugs in certain scenarios.
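The CEGIS-with-priorities idea can be illustrated with a toy instruction set. The candidate list, priorities, and input domain below are all invented; the sketch only shows the loop structure: try candidates highest-priority-first, cache counterexamples to prune cheaply, and accept a candidate that survives checking on the whole domain.

```python
CANDIDATES = [                     # (priority, name, function); higher tried first
    (3, "shl1", lambda x: x << 1),
    (2, "add_self", lambda x: x + x),
    (1, "mul3", lambda x: x * 3),
]

def cegis_equivalent(spec, domain=range(-8, 8)):
    counterexamples = []
    for _, name, cand in sorted(CANDIDATES, key=lambda c: c[0], reverse=True):
        if any(cand(x) != spec(x) for x in counterexamples):
            continue                      # pruned by a cached counterexample
        bad = next((x for x in domain if cand(x) != spec(x)), None)
        if bad is None:
            return name                   # verified equivalent on the whole domain
        counterexamples.append(bad)
    return None

equiv = cegis_equivalent(lambda x: x * 2)   # the spec: "multiply by two"
```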
Research Manuscript
Design
Quantum Computing
DescriptionThis paper proposes an efficient stabilizer circuit simulation algorithm that only traverses the circuit forward once.
We introduce phase symbolization into stabilizer generators, which allows possible Pauli faults in the circuit to be accumulated explicitly as symbolic expressions in the phases of stabilizer generators.
This way, the measurement outcomes are also symbolic expressions, and we can sample them by substituting the symbolic variables with concrete values, without traversing the circuit repeatedly.
We show how to integrate symbolic phases into the stabilizer tableau and maintain them efficiently using bit-vector encoding.
A new data layout of the stabilizer tableau in memory is proposed, which improves the performance of our algorithm (and other stabilizer simulation algorithms based on the stabilizer tableau).
We implement our algorithm and data layout in a Julia package named SymPhase.jl, and compare it with Stim, the state-of-the-art simulator, on several benchmarks.
We show that SymPhase.jl has superior performance in terms of sampling time, which is crucial for generating a large number of samples for further analysis.
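The bit-vector phase encoding mentioned above can be sketched compactly. This is an assumption-laden toy (SymPhase.jl itself is a Julia package and far more complete): each potential Pauli fault gets one bit, a symbolic phase is the XOR of the fault variables affecting it, and sampling a measurement outcome reduces to a parity evaluation under a concrete fault assignment, with no circuit re-traversal.

```python
class SymbolicPhase:
    # One bit per potential Pauli fault; the symbolic phase is the XOR of
    # the fault variables that flip it.
    def __init__(self):
        self.mask = 0

    def absorb_fault(self, var_index):
        self.mask ^= 1 << var_index        # XOR: a fault applied twice cancels

    def evaluate(self, assignment):
        # assignment: bitmask of which faults actually occurred in this sample.
        return bin(self.mask & assignment).count("1") % 2   # parity

p = SymbolicPhase()
p.absorb_fault(0)
p.absorb_fault(2)        # symbolic phase is now f0 XOR f2
```

Substituting different assignments reuses the same mask, which is why sampling does not require traversing the circuit again.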
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionProcessing in-memory has the potential to accelerate high-data-rate applications beyond the limits of modern hardware. Flow-based computing is a computing paradigm for executing Boolean logic within nanoscale memory arrays by leveraging the natural flow of electric current. Previous approaches to mapping Boolean logic onto flow-based computing circuits have been constrained by their reliance on binary decision diagrams (BDDs), which translates into high area overhead. In this paper, we introduce a novel framework called FACTOR for mapping logic functions into dense flow-based computing circuits. The proposed methodology introduces Boolean connectivity graphs (BCGs) as a more versatile representation, capable of producing smaller crossbar circuits. The framework constructs concise BCGs using factorization and expression trees. Next, the BCGs are modified to be amenable to mapping onto crossbar hardware. We also propose a time-multiplexing strategy for sharing hardware between different Boolean functions. Experimental evaluation using 14 circuits demonstrates that FACTOR reduces area, delay, and energy by 80%, 2%, and 12%, respectively, compared with the state-of-the-art synthesis method for flow-based computing.
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
DescriptionGate-level clocking, typical in traditional approaches to Single Flux Quantum (SFQ) technology, makes the effective synthesis of superconducting circuits a significant engineering hurdle. This paper addresses this challenge by employing the recently introduced xSFQ logic family. xSFQ leverages dual-rail alternating encoding to eliminate the clock dependency from the superconducting gate semantics. This obviates the need for ad hoc modifications to existing synthesis tools and avoids unnecessary circuit resource overheads, marking a significant advancement in superconducting circuit design automation. Our implementation results demonstrate an average reduction of over 80% in the Josephson junction count for circuits from the ISCAS85, EPFL, and ISCAS89 benchmark suites.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionIO integrity analysis early in the design cycle helps decouple system-level constraints from die-level constraints. Integrity challenges are more predominant with 324-529 ball packages in automotive infotainment SoCs with close to 200-400 signals, including GHz DDR, EMAC, eMMC, xSPI, etc., and total supported power ranging up to 10 W.
This paper discusses system-aware IO integrity analysis, which enables faster engagement closure among design, test, and application/customer teams.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionDFT engineers strive to deliver high-quality SDC for timing analysis and ATPG on a limited schedule. Under that schedule pressure, or through human error, false-path consistency often suffers, causing coverage loss or time wasted reviewing timing violations, and APR timing closure progress is also impacted. Moreover, functional constraints are often updated during timing closure. Functional constraints cannot be used directly for AC scan, and preparing AC scan constraints by referencing timing reports often sacrifices test coverage. Preparing AC scan constraints is time-consuming and relies on the DFT engineer's experience to ensure constraint quality.
We provide a systematic flow to generate AC scan timing and ATPG constraints that deals with clock structure differences, descriptions unsupported due to ATPG tool limitations, multiple test modes for ATPG, and add-on/redundant timing exceptions due to scan structure. The flow maps AC scan clocks to functional clocks and generates AC scan timing and ATPG constraints efficiently.
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionDFT engineers strive to deliver high-quality SDC for timing analysis and ATPG on a limited schedule. Under that schedule pressure, or through human error, false-path consistency often suffers, causing coverage loss or time wasted reviewing timing violations, and APR timing closure progress is also impacted. Moreover, functional constraints are often updated during timing closure. Functional constraints cannot be used directly for AC scan, and preparing AC scan constraints by referencing timing reports often sacrifices test coverage. Preparing AC scan constraints is time-consuming and relies on the DFT engineer's experience to ensure constraint quality.
We provide a systematic flow to generate AC scan timing and ATPG constraints that deals with clock structure differences, descriptions unsupported due to ATPG tool limitations, multiple test modes for ATPG, and add-on/redundant timing exceptions due to scan structure. The flow maps AC scan clocks to functional clocks and generates AC scan timing and ATPG constraints efficiently.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe vertical segments across IoT, data centers, AI, networking, autonomous vehicles, and cryptocurrency infrastructure are creating an explosion in data requirements. New standards, emerging at lightning speed, are battling the never-ending thirst for low power, high speed, and high throughput. With the increase in complexity, the verification effort is also increasing exponentially. Multiple DRAM memory vendors and wide varieties of memory pose growing challenges, as each vendor has its own unique timing parameter types/values and configuration register values. Ensuring the correctness of timing parameter values and registers so that the DDR Controller, DDR PHY, and the DRAM device operate in sync is a huge and error-prone task. To make this effort error-free, we have developed an automated and scalable solution in which the verification features of DFI and Memory are integrated and synced, reducing verification effort, shortening time to market, and preventing silicon escapes.
Keynote
Special Event
Design
DescriptionIn this keynote, Dr. Gary Patton will introduce the fundamental concepts driving the vision of a 'Systems Foundry', including a standards-based approach to assembling heterogeneous dies. Dr. Patton will also cover the factors driving the inevitable need for disaggregation, such as reticle limit, thermal constraints, cost, and yield, which are especially exacerbated by the demands of HPC designs in the AI era. In addition, Dr. Patton will go over the transformative journey at Intel over the last 4-5 years that has helped orient execution toward enabling the vision of a Systems Foundry, a journey that encompasses delivering a full breadth of EDA offerings and developing advanced packaging capabilities, to name a few. The work is not done, however; the EDA & IP ecosystem has a vital role to play in this vision: to enable a seamless 3DIC design platform for advanced packaging implementation & modeling, AI-driven 3D exploration, and System-Technology Co-Optimization while tackling challenges in the multi-physics domain. Intel has several collaborative projects with EDA to address these challenges, and Dr. Patton will end with a call to action to the ecosystem partners for continued partnership to realize this vision.
Work-in-Progress Poster
TACPlace: Ultrafast Thermal-Aware Chiplet Placement under Multi-Power Mode Using Feasibility Seeking
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description2.5D integration technology offers advanced design capabilities with enhanced functionality and higher performance. However, the increasing power density of modern chiplets poses significant challenges to inter-chiplet placement, which must satisfy multiple constraints introduced by routing, thermal management, and power distribution.
We propose TACPlace, a hyper-efficient thermal-aware placement framework that jointly explores these complex constraints under low whitespace. As the first feasibility-seeking 2.5D placer, TACPlace incorporates both ultrafast, accurate thermal simulation based on Green's functions and routing optimization for thermal- & wirelength-aware perturbations. TACPlace also fills the void of supporting multi-power mode, i.e., considering different working modes of the 2.5D system.
Experiments on real-world 2.5D systems demonstrate a speedup of placement from hours to seconds, an average 23.9% decrease in wirelength, and up to a 3.4°C reduction in peak temperature compared with the state of the art. Furthermore, TACPlace tailored for multiple workloads achieves an average 6.2°C lower peak temperature than single-power mode.
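A minimal perturbation-based placement loop in the spirit of this abstract might look as follows. The wirelength and peak-temperature models here are crude invented proxies (the paper uses a Green's-function thermal simulator), and the greedy accept rule stands in for the feasibility-seeking machinery.

```python
import random

def wirelength(pos, nets):
    # Half-perimeter wirelength over chiplet center points.
    total = 0.0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def peak_temp(pos, power):
    # Crude proxy for a thermal simulator: nearby hot chiplets heat each other.
    worst = 0.0
    for a in pos:
        heat = power[a] + sum(
            power[b] / (1 + abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
            for b in pos if b != a)
        worst = max(worst, heat)
    return worst

def cost(pos, nets, power, w_t=1.0, w_w=0.1):
    return w_t * peak_temp(pos, power) + w_w * wirelength(pos, nets)

def place(pos, nets, power, sites, iters=200, seed=0):
    rng = random.Random(seed)
    best, best_cost = dict(pos), cost(pos, nets, power)
    for _ in range(iters):
        trial = dict(best)
        trial[rng.choice(sorted(trial))] = rng.choice(sites)
        if len(set(trial.values())) < len(trial):
            continue                       # two chiplets on one site: infeasible
        c = cost(trial, nets, power)
        if c < best_cost:                  # greedy accept (toy stand-in)
            best, best_cost = trial, c
    return best, best_cost

sites = [(x, y) for x in range(3) for y in range(3)]
pos = {"cpu": (0, 0), "hbm": (2, 2), "io": (0, 2)}
nets = [["cpu", "hbm"], ["cpu", "io"]]
power = {"cpu": 10.0, "hbm": 3.0, "io": 1.0}
placed, final_cost = place(pos, nets, power, sites)
```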
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionDesign data management is an important concern from a chip design perspective due to the enormous amount of data generated through the design flow. This data, if not effectively managed, can lead to disks running out of space at critical times in a project, causing run crashes and increased turnaround time. Most engineers do not proactively clean up data, and cleanup is almost always done in a haphazard manner at a choke point where disks complain of space shortage. As projects progress, it becomes a time-consuming chore to recognize and delete older experiments, sometimes leading to accidental deletion of the data that led to the final design. Data is expensive from a space, cost, and power perspective, and all of the potentially unnecessary data adds up to business operating costs.
The proposal is a utility for automatic data archival tagging that enables optimal disk space management and tapeout data preservation, minimizing each chip's design development footprint with minimal effort from the end user. Analysis on a reticle-sized data-center accelerator chip determined that only 20% or less of the current 1.8PB of data used is necessary for tapeout data archival. With future enhancements, this number could go as low as 10-15%, which is significant.
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
DescriptionWith the development of chiplet technology, the architecture of Non-Uniform Memory Access (NUMA) has become increasingly intricate. The placement of memory pages significantly influences application performance in NUMA systems. We found that memory access bottlenecks occur between high-level NUMA domains consisting of multiple chiplets. In this paper, we introduce a Traffic-Aware Page Mapping Method (TAPMM) designed for multi-level NUMA systems. TAPMM conceptualizes the multi-level NUMA system as a memory access tree, utilizing hardware performance events to monitor system traffic and identify the optimal page mapping for bandwidth efficiency. Our experiments demonstrate that TAPMM achieves a speedup of up to 2.12 times on a real commodity machine compared to existing optimization tools.
Research Manuscript
Security
Embedded and Cross-Layer Security
DescriptionHardware-based tracing, being efficient, can be a good alternative to the computationally-expensive software-based instrumentation in binary-only greybox fuzzing. However, it only records all branches within a specified address range, lacking the flexibility to re-filter them. This paper introduces TATOO, a hardware platform employing tagged architectures and hardware tracing to enhance binary-only fuzzing. TATOO stands out by enabling users to tag instructions at the instruction level, significantly reducing the volume of traced data and improving fuzzing efficiency. TATOO also supports recording the dataflow information for smart mutations. Implemented on a real hardware FPGA platform, TATOO demonstrates a mere 8.7% performance overhead.
Research Manuscript
AI
Security
AI/ML Security/Privacy
DescriptionTrusted Execution Environments (TEEs) have become a promising solution for securing DNN models on edge devices. However, existing solutions either provide inadequate protection or introduce large performance overhead. This paper presents TBNet, a TEE-based defense framework that protects DNN models from a neural architectural perspective. TBNet generates a novel two-branch substitution model to exploit (1) the computational resources in the untrusted Rich Execution Environment (REE) for latency reduction and (2) the physically isolated TEE for model protection. Experimental results on a Raspberry Pi across diverse DNN model architectures and datasets demonstrate that TBNet achieves efficient model protection at a low cost.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn system-on-chip (SoC) security, thwarting information leakage is crucial for protecting sensitive data. To tackle this, we introduce a Time and Distance (TDM) based security metric tailored for evaluating the risk of information leakage in hardware Intellectual Properties (IPs) in complex SoCs. The TDM metric quantitatively measures the assets' exposure time and proximity to output ports, pinpointing the vulnerable location and time of information leakage. Leveraging graph-based analysis, we trace data flow within SoCs, identifying the potential leak points. We validated our metric against five open-source hardware designs, confirming its effectiveness in quantitatively measuring susceptibility to information leakage and enhancing SoC design security.
IP
Engineering Tracks
IP
DescriptionThis session addresses the frontiers of technology scaling, examining the interplay between cost, performance, and power as designers navigate current limitations and future trends. Discussions will range from the evolution of process technology, such as FinFET to GAA, to the strategic use of DTCO and disaggregated designs in overcoming die size and cost-per-transistor challenges. Emphasizing the critical role of packaging in adopting chiplets, advances in interconnect and 3D-IC technologies will be explored. The collective insights aim to chart a course through the complexities of scaling in the more-than-Moore era, focusing on economic and technological viability.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe generalize the problem of evaluating hardware Trojan detection methods as a decision problem, solvable by an interactive proof system, with a prover whose challenge is to find malicious signals in a design presented by a verifier. The verifier checks the validity of the proof presented by the prover, namely a candidate set of malicious signals. The proof succeeds if and only if the prover finds all malicious signals in a design, and no more. Because a prover may not be trusted, benchmarks must be obfuscated to make manual analysis of design files more difficult and to mitigate brute-force guessing of the correct set of signals. We propose an obfuscation technique that introduces randomness into a benchmark's signal names. Experimental analysis shows that obfuscation provides sufficient protection against manual source analysis, while brute-force guessing is mitigated by the interactive proof system, which limits the number of guesses for one specific, uniquely randomized benchmark to 1.
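A sketch of the name-randomization step, under stated assumptions: the paper's actual tool and naming scheme are not reproduced here, the regex below only catches simple `wire`/`reg` declarations, and the `sig_%08x` scheme is invented. Each benchmark copy gets its own seed, so every prover sees a uniquely randomized netlist.

```python
import random
import re

def obfuscate_signals(verilog_src, seed):
    # Unique seed => uniquely randomized benchmark copy (deterministic per seed).
    rng = random.Random(seed)
    # Toy declaration matcher: only simple "wire name;" / "reg name;" forms.
    names = sorted(set(re.findall(r"\b(?:wire|reg)\s+(\w+)", verilog_src)))
    mapping = {n: "sig_%08x" % rng.getrandbits(32) for n in names}
    for old, new in mapping.items():
        verilog_src = re.sub(r"\b%s\b" % old, new, verilog_src)
    return verilog_src, mapping

src = "wire trojan_trigger; reg state; assign out = trojan_trigger & state;"
obf, mapping = obfuscate_signals(src, seed=42)
```

The telltale name `trojan_trigger` no longer appears in the obfuscated source, defeating keyword-based manual inspection while preserving the netlist structure.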
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIncreasing complexity in integrated circuits (ICs) node over node results in significant growth in circuit verification time and effort. Today's tapeout sensitivities make it critical to begin checking and fixing connectivity issues in earlier design stages, since connectivity violations affect downstream flows such as reliability verification (PERC/ESD checks) and electromigration/voltage drop (EMIR) layout optimization. However, running signoff verification in early stages typically produces thousands, if not millions, of layout errors, only some of which are actionable. Addressing all of these errors is an unproductive drain on both time and resources, as many will simply disappear when the full-chip design comes together at signoff, while finding and debugging the relevant errors requires a significant number of iterations and many manual steps. Critical pain points in early-design-stage circuit verification include short isolation (SI), electrical rule checking (ERC), and soft connection checking (Softchk). The Calibre nmLVS Recon tool is specifically designed to improve early-stage layout vs. schematic (LVS) verification, providing targeted functionalities to address these issues. Earlier, focused circuit verification reduces overall IC design verification and debugging time while improving design quality and time to market, without compromising signoff quality. Real-world results demonstrate the effectiveness and efficiency of the Calibre nmLVS Recon tool.
TechTalk
AI
Design
DescriptionTo be announced
IP
AI
Engineering Tracks
IP
DescriptionSession Structure
The Open Chiplet Economy and AI
A brief introduction of how the open chiplet economy can help with AI
Bapi Vinnakota (OCP), Cliff Grossner (OCP)
A Survey of AI-related IP for the Open Chiplet Economy
High performance D2D PHY IP and
Other soft IP relevant to AI
Elad Alon (Blue Cheetah) Letizia Giuliano (Alphawave)
AI-Related Chiplets in the Open Chiplet Economy
An overview of chiplets relevant to building AI and HPC systems
John Shalf (LBL) +
AI Packaging Workflow
Basic and advanced packaging workflows for AI systems
Lihong Cao (ASE) +
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPlacement is a critical yet computationally complex task. Modern analytical placers suffer from long placement iteration times. Recent efforts to expedite this process incorporate deep learning.
However, learning-based placers require time- and data-consuming model training due to the complexity of circuit placement that involves large-scale cells and design-specific graph statistics.
This paper proposes GiFt, a parameter-free technique for accelerating placement, rooted in graph signal processing. It can be seamlessly integrated with modern analytical placers, yielding high-quality placement solutions with significantly reduced iteration time. Experimental results show that state-of-the-art placers equipped with GiFt can achieve over 50% reduction in total runtime.
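One way to picture a parameter-free graph filter for placement: treat cell coordinates as a graph signal and apply a normalized averaging pass over the netlist graph, pulling connected cells together before an analytical placer refines the result. The graph and coordinates below are invented, and this is a simplification in the spirit of GiFt, not its actual filter.

```python
def smooth_positions(pos, adj, passes=1):
    # pos: {cell: (x, y)} coordinates; adj: {cell: [neighbor cells]}.
    # Each pass replaces a cell's position with the average of itself and
    # its neighbors: a simple low-pass graph filter with no learned weights.
    for _ in range(passes):
        new = {}
        for c, (x, y) in pos.items():
            nbrs = adj[c]
            nx = (x + sum(pos[n][0] for n in nbrs)) / (1 + len(nbrs))
            ny = (y + sum(pos[n][1] for n in nbrs)) / (1 + len(nbrs))
            new[c] = (nx, ny)
        pos = new
    return pos

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
pos = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (8.0, 0.0)}
smoothed = smooth_positions(pos, adj)
```

After one pass the chain contracts from spans of 4 units to 2, shrinking total wirelength while preserving relative order, which is the kind of warm start an analytical placer can then refine.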
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionIn the thermal design of 3-D integrated circuits (ICs) and packages, numerical simulation is extensively employed to investigate the impact of model parameters on hotspot temperature. However, conventional simulation approaches usually require substantial computational resources and thus incur an expensive time cost for thermal design. In this paper, we present a novel technique to efficiently and accurately conduct thermal simulation of 3-D ICs and packages, potentially reducing the thermal design timeline from weeks to minutes. The proposed thermal resistance network derivative (TREND) model focuses the solution domain on the regions crucial for thermal design and accelerates simulation without sacrificing accuracy. The TREND model also protects the internal details of chips and packages, which makes it well suited for modular thermal design. The flexibility, accuracy, and efficiency of the proposed method are demonstrated through several numerical examples. Compared with commercial software, a speedup of 2695x is achieved in a typical thermal design case without loss of accuracy.
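At its simplest, a thermal resistance network reduces to series resistances that add, which is enough to show the modeling style (the TREND model handles full 3-D networks and parameter derivatives). The layer resistances below are illustrative values, not from the paper.

```python
def junction_temp(power_w, ambient_c, stack_k_per_w):
    # Series thermal resistances add: T_j = T_amb + P * sum(R_th).
    return ambient_c + power_w * sum(stack_k_per_w)

stack = [0.125,   # die (K/W, illustrative)
         0.250,   # thermal interface material
         0.375]   # heat sink to ambient
tj = junction_temp(power_w=20.0, ambient_c=25.0, stack_k_per_w=stack)   # 40.0 C
```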
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionWhile interrupts play a critical role in modern OSes, they have been exploited in a wide range of side channel attacks to break system confidentiality, such as keystroke interrupts, graphic interrupts, and network interrupts. In this paper, we propose ThermalScope, a new side channel that exploits thermal event interrupts, is adaptable to both native and browser scenarios, and incorporates two heat-amplifying techniques. The exploited thermal event interrupts are activated only when the CPU package temperature reaches a fixed threshold determined by the manufacturer. Our key observation is that workloads running on CPUs inevitably generate distinct heat patterns, which can be correlated with the thermal event interrupts. To demonstrate the viability of ThermalScope, we conduct a comprehensive evaluation on multiple Ubuntu OSes with different Intel-based CPUs. First, we show that the activation of thermal event interrupts correlates with the level of CPU temperature. We then apply ThermalScope to mount different side channel attacks, i.e., building covert channels with a transmission rate of 0.1 b/s, fingerprinting DNN model architectures with an accuracy of over 90%, and breaking KASLR within 8.2 hours.
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
DescriptionMemory disaggregation, facilitated by SmartNICs, has emerged as a cost-effective approach for sharing memory resources in data centers. However, current SoC-based SmartNICs face several challenges in effectively supporting near-data processing (NDP) in disaggregated memory (DM) systems. To address these challenges, we propose TIGA, an efficient NDP framework for SmartNIC-based DM systems. We propose an adaptive resource allocator to fully utilize SoC cores and a SmartNIC-CPU cooperative mechanism to schedule NDP tasks. We prototype TIGA with FPGAs and evaluate it with typical workloads. Experimental results show that TIGA significantly improves the efficiency of NDP tasks in DM systems.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHold timing violations can be challenging to fix, especially with limitations in a 5nm design that were not seen in larger nodes, such as power, crosstalk, and a narrower setup-hold window (less setup margin). Current place-and-route (PnR) and timing ECO tools struggle to address these difficult hold violations because they use limited timing views to keep runtime acceptable and tend to insert excess hold padding that can cause wiring and power issues. This presentation describes four methods that can reduce hold violations with less impact on power and wiring resources: reducing the VT of existing cells, reducing the drive strength of existing cells, placing delay cells farther away in less congested areas with available wiring resources, and manipulating the clock so that a wide clock skew causes fewer hold violations. These methods provide additional solutions beyond the traditional padding method so that timing closure is not deadlocked by power and wiring issues.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionWith increasing grid resistance, modelling the impact of voltage drop on instances has been a focus of research to improve timing robustness and yield. Furthermore, as technology nodes shrink, the RC constant is dominated by resistance, with R increasing 10x from 28nm to 7nm, and cell delays are non-linear with IR drop, accounting for higher delay variations below the nominal voltage domain.
Traditionally, and still predominantly, designers use derating techniques to account for IR drop. The limitation of this approach is that it applies a flat OCV factor: paths that are not prone to voltage sensitivity are penalized with excessive pessimism, while potential real violations may be masked. These issues paved the way toward IR-aware STA, wherein voltage drops obtained from IR tools are back-annotated onto the timing engine. This method is more robust but still depends on the coverage achieved by vectorless dynamic analysis. To analyze timing-critical and voltage-sensitive paths, both the timer and the power-grid solver must be integrated so that analysis can focus on voltage-sensitive critical timing paths. This methodology can augment regular signoff and can potentially uncover voltage-sensitive timing instances that are missed by traditional methods, leading to greater robustness and silicon success. In this paper we present a case study on the evolution and adoption of accounting for IR-drop variations, along with results and current limitations.
• Timing-power integrated flow for concurrent timing and voltage-drop analysis: the paper walks through the STA-PI integration flow with different STA and IR analysis corners under a common cockpit (SLOW-corner STA and TYP-corner IR) and evaluates the benefits of timing-aware IR and IR-aware timing analysis
• Detailed analysis on a timing/IR-critical block: the paper presents the quantitative analysis done on timing-sensitive paths with annotated EIV values for voltage drop on timing instances and its comparison with the traditional flat voltage derates and the IR-aware STA flow
• Resource and runtime evaluation for large designs: with analysis done on a 100M-instance block, the paper focuses on the resource/runtime trade-offs against PPA and robustness gains
• Timing ECOs with EIV annotation and timing fixes: timing ECOs with EIV-annotated timing paths, with a focus on fixing voltage-sensitive timing-critical paths
• Potential missed violations: the paper also covers timing paths that were not violating with flat voltage derates but appear as potential violators with the discussed flow
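The contrast between a flat derate and per-instance voltage annotation described above can be sketched with a toy path-delay calculation. All cell names, delays, droop values, and sensitivities below are hypothetical illustrations, not data from the paper:

```python
# Toy comparison (not the paper's flow): flat voltage derate vs.
# per-instance voltage annotation when computing a path delay.

FLAT_DERATE = 1.08        # flat OCV-style derate factor, assumed

# Each cell: nominal delay (ps), delay sensitivity (ps per mV of droop),
# and the instance's annotated voltage droop (mV) -- all invented numbers.
path = [
    {"cell": "INV_X1",   "delay_ps": 12.0, "sens_ps_per_mv": 0.05, "droop_mv": 45.0},
    {"cell": "NAND2_X2", "delay_ps": 18.0, "sens_ps_per_mv": 0.02, "droop_mv": 10.0},
    {"cell": "BUF_X4",   "delay_ps": 15.0, "sens_ps_per_mv": 0.08, "droop_mv": 60.0},
]

def path_delay_flat(path, derate=FLAT_DERATE):
    """Flat derate: every cell is penalized by the same factor."""
    return sum(c["delay_ps"] for c in path) * derate

def path_delay_annotated(path):
    """Per-instance annotation: each cell's delay grows with its own droop."""
    return sum(c["delay_ps"] + c["sens_ps_per_mv"] * c["droop_mv"] for c in path)

flat = path_delay_flat(path)            # 45.0 ps * 1.08
annotated = path_delay_annotated(path)  # larger here: a droop-heavy path
```

With these invented numbers, the annotated delay exceeds the flat-derated one, which is exactly the "potential missed violation" situation the last bullet describes.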
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs designs scale up in size and complexity, timing takedown can become a tedious and overwhelming task. Gathering any insightful information from the millions of paths generated by a full timing run is, in many cases, not possible. The solution to this problem is generally scripts that read in timing reports, then summarize, process, and draw conclusions from the data.
These scripts are often critical to any timing signoff. Therefore, they should be written to be customizable, lightweight, and, first and foremost, fast. By leveraging industry-standard software architecture practices, we were able to write an application that can be easily expanded, handles reports from multiple tool vendors and technology nodes, runs quickly, and is fully user-customizable. By defining our own database structure, we let anyone write their own interfaces to this database, enabling any engineer with access to the program to write custom Python scripts using the same data. Using standardized software architectures also makes it easier to add or remove developers as needed to support future development. In addition, this methodology can serve as a first stepping stone toward sanitizing data for AI/ML use cases.
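The "custom database plus user-written interfaces" pattern described above can be sketched in a few lines. The report format, field names, and record type here are invented for illustration; a real tool would handle each vendor's report syntax:

```python
# Hypothetical sketch: parse a vendor timing report into a tool-agnostic
# in-memory database that any engineer can query from Python.
from dataclasses import dataclass

@dataclass
class TimingPath:
    startpoint: str
    endpoint: str
    slack_ps: float

def parse_report(lines):
    """Parse lines like 'startpoint endpoint slack' into TimingPath records."""
    paths = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        start, end, slack = line.split()
        paths.append(TimingPath(start, end, float(slack)))
    return paths

def worst_paths(paths, n=3):
    """Summarize: the n paths with the smallest (worst) slack."""
    return sorted(paths, key=lambda p: p.slack_ps)[:n]

report = [
    "# startpoint endpoint slack_ps",
    "u_core/reg_a u_core/reg_b -12.5",
    "u_mem/reg_x  u_core/reg_c  3.0",
    "u_io/reg_p   u_io/reg_q   -1.2",
]
db = parse_report(report)
```

Because the database is a plain list of records, custom analyses (histograms, per-block summaries, AI/ML feature extraction) are ordinary Python functions over `db`.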
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionImage segmentation is one of the major computer vision tasks and is applicable in a variety of domains, such as the autonomous navigation of an unmanned aerial vehicle. However, image segmentation cannot easily be deployed on small embedded systems, because image segmentation models generally have high peak memory usage due to their architectural characteristics. This work finds that image segmentation models require unnecessarily large memory space under an existing tiny machine learning framework: the framework cannot effectively manage the memory space for image segmentation models.
This work proposes TinySeg, a new optimizing framework that enables memory-efficient image segmentation for small embedded systems. TinySeg analyzes the lifetimes of tensors in the target model and identifies long-living tensors. Then, TinySeg optimizes the memory usage of the target model with two methods: (i) tensor spilling into local or remote storage and (ii) fused fetching of spilled tensors. This work implements TinySeg on top of the existing tiny machine learning framework and demonstrates that TinySeg can reduce the peak memory usage of an image segmentation model by 39.3 percent for small embedded systems.
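The two TinySeg ideas named above, lifetime analysis to find long-living tensors and spilling them out of the memory arena, can be illustrated with a toy model. The tensor names, lifetimes, sizes, and threshold below are invented:

```python
# Conceptual sketch of tensor lifetime analysis and spilling.
# tensor name -> (first op index, last op index, size in bytes); hypothetical.
lifetimes = {
    "enc_feat": (0, 9, 40_000),   # alive across the whole model: long-living
    "tmp_a":    (1, 2, 20_000),
    "tmp_b":    (3, 4, 20_000),
}

def long_living(lifetimes, span_threshold=5):
    """Tensors whose lifetime spans at least span_threshold operators."""
    return {n for n, (f, l, _) in lifetimes.items() if l - f >= span_threshold}

def peak_memory(lifetimes, spilled=frozenset()):
    """Peak over all ops of the total size of live, non-spilled tensors."""
    last_op = max(l for (_, l, _) in lifetimes.values())
    peak = 0
    for op in range(last_op + 1):
        live = sum(s for n, (f, l, s) in lifetimes.items()
                   if f <= op <= l and n not in spilled)
        peak = max(peak, live)
    return peak

before = peak_memory(lifetimes)
after = peak_memory(lifetimes, spilled=long_living(lifetimes))
```

Spilling the long-living tensor removes it from every op's live set, so the arena only has to hold the short-lived temporaries; the spilled tensor is fetched back (fused with its consumer, in TinySeg's terms) when needed.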
Research Manuscript
Design
Quantum Computing
DescriptionTrapped-Ion (TI) technology offers potential breakthroughs for Noisy Intermediate Scale Quantum (NISQ) computing. TI qubits provide advantages like extended coherence times and high gate fidelity, making them appealing for large-scale quantum systems. Constructing such systems demands a distributed architecture connecting Quantum Charge Coupled Devices (QCCDs) via quantum matter-links and photonic switches. However, current distributed TI NISQ computers face hardware and system challenges. Entangling qubits across a photonic switch introduces significant latency, impacting performance, while existing compilers generate suboptimal mappings and schedules. In response, we introduce TITAN, a large-scale distributed TI NISQ computer. TITAN employs an innovative photonic interconnection design to reduce entanglement latency and an advanced partitioning and mapping algorithm to optimize quantum matter-link communications. Our evaluations show that TITAN significantly enhances quantum application performance and fidelity compared to existing systems.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionThe attention mechanism in text generation is memory-bound due to its sequential characteristics. Therefore, off-chip memory accesses should be minimized for faster execution. Although previous methods addressed this by pruning unimportant tokens, they fall short of selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates the probability before the softmax function, effectively removing low-probability tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach shows 2.6x fewer memory accesses, leading to an average 2.3x speedup and 2.4x better energy efficiency.
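The pre-softmax estimation idea can be sketched concretely: since softmax is monotonic in the logits, any token whose score falls far enough below the maximum is guaranteed a near-zero probability and can be dropped before the softmax (and before its value vector is fetched). The margin and logits below are invented for illustration:

```python
# Sketch: estimate near-zero attention probabilities *before* softmax.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def prune_before_softmax(scores, margin=8.0):
    """Keep only tokens whose score is within `margin` of the max score.
    exp(-margin) upper-bounds the relative weight a dropped token could get,
    so a dropped token's attention probability is provably tiny."""
    m = max(scores)
    return [i for i, s in enumerate(scores) if s >= m - margin]

scores = [5.0, -20.0, 4.5, -15.0, 6.0]   # hypothetical pre-softmax logits
kept = prune_before_softmax(scores)       # indices that survive pruning
probs = softmax(scores)                   # full softmax, for comparison
```

The hardware benefit follows from doing this test on scores alone: the off-chip accesses for the pruned tokens' values never happen.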
Research Manuscript
EDA
Physical Design and Verification
DescriptionModern System-on-Chip (SoC) design is divided into hierarchical instances using the multiply-instantiated block (MIB) technique to simplify the design process. Top-level routing aims to provide routing prototyping between those instances. It requires consideration of replicated routing paths, which can either be utilized for routing or remain as floating segments. Conventional path-searching-based algorithms often fail to find a legal solution under such a scenario. To address this, we propose an effective and efficient top-level routing framework for MIBs that hashes the topology of each net and uses a group maze routing scheme. Experimental results demonstrate promising performance compared to the winners of the MIB-aware top-level router contest 2022 organized by Synopsys.
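One way to picture the net-topology hashing step is canonicalization: nets whose pin patterns are translated replicas of each other (as happens under MIB instantiation) should hash to the same key so they can be routed once as a group. The keying scheme and coordinates below are an invented illustration, not the paper's actual hash:

```python
# Minimal sketch: a canonical key under which replicated nets collide.

def topology_key(pins):
    """Pin offsets relative to the net's bounding-box origin, sorted.
    Replication shifts all pins by the same amount, so replicas share a key."""
    ox = min(x for x, y in pins)
    oy = min(y for x, y in pins)
    return tuple(sorted((x - ox, y - oy) for x, y in pins))

net_a = [(10, 10), (14, 12), (10, 20)]      # net in instance A
net_b = [(110, 50), (114, 52), (110, 60)]   # same net, replicated instance B
net_c = [(10, 10), (15, 12), (10, 20)]      # a genuinely different topology

groups = {}
for name, pins in [("a", net_a), ("b", net_b), ("c", net_c)]:
    groups.setdefault(topology_key(pins), []).append(name)
```

Nets `a` and `b` land in one group and can share a routing solution; `c` is routed separately.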
Research Manuscript
EDA
Physical Design and Verification
DescriptionClock tree synthesis (CTS) constructs an efficient clock tree, meeting design constraints and minimizing resource usage. It serves as a bridge between placement and routing, facilitating concurrent optimization of multiple design objectives. To construct a clock tree with lower latency and load capacitance while maintaining a specified skew constraint, we introduce the skew-latency-load tree (SLLT), which combines the merits of the bound skew tree and the Steiner shallow-light tree, along with an analysis and demonstration of the boundaries of these two tree types. We propose a method for constructing the SLLT that significantly reduces both the maximum latency and the load capacitance compared to previous methods while ensuring skew control. Building on this routing topology generation method, we introduce a hierarchical CTS framework constructed by integrating partition schemes and buffering optimization techniques. We validate our solution at a 28nm process technology, demonstrating superior performance compared to the solutions of OpenROAD and an advanced commercial tool. Our approach outperforms in all metrics (max latency, skew, buffer count, clock capacitance), achieving a significant latency reduction of 29.45% compared to OpenROAD and 6.75% compared to the commercial tool.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe desire to empower resource-limited edge devices with computer vision (CV) must overcome the high energy consumption of collecting and processing vast sensory data. To address the challenge, this work proposes an energy-efficient non-von-Neumann in-pixel processing solution for neuromorphic vision sensors, employing emerging (X) magnetic domain wall magnetic tunnel junctions (MDWMTJs) for the first time, in conjunction with CMOS-based neuromorphic pixels. Our hybrid CMOS+X approach performs in-situ massively parallel asynchronous analog convolution, exhibiting low power consumption and high accuracy across various CV applications by leveraging the non-volatility and programmability of the MDWMTJ. Moreover, our developed device-circuit-algorithm co-design framework captures device constraints (low tunnel-magnetoresistance, low dynamic range) and circuit constraints (non-linearity, process variation, area considerations) based on Monte Carlo simulations and device parameters utilizing GF22nm FD-SOI technology. Our experimental results suggest we can achieve an average 45.3% reduction in backend-processor energy while maintaining similar front-end energy compared to the state-of-the-art, with high accuracies of 79.17% and 95.99% on the DVS-CIFAR10 and IBM DVS128-Gesture datasets, respectively.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionVerification with application executables is a common phase in virtual development kits (VDKs), RTL simulations, and emulation. It involves both loading and dumping memory abstractions, usually modelled as 2D arrays, with the application hex images. This is usually achieved in two ways: (i) frontdoor loading using design modules mimicking real silicon, and (ii) backdoor loading using external methods such as a simulator API to initialize the design in an "image-loaded" state. The former is slow and inefficient, since the design spends a lot of time in the loading process and needs additional design modules for support. The latter can be performed efficiently without additional design modules but requires a lot of platform-specific infrastructure with memory-dependent details (e.g., ECC, endianness, controller size). In this presentation we argue that a succinct representation of such details is possible for most memories, because the operations on the memory abstractions are stereotypical. We show that tools processing such representations dramatically reduce the maintainable code size.
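The "succinct representation" argument can be made concrete with a toy descriptor and one generic loader. The descriptor fields, layout policy (word interleaving across banks), and image bytes below are hypothetical, chosen only to show that the memory-dependent details reduce to a few declarative fields:

```python
# Sketch: a tiny memory descriptor plus a generic backdoor loader that
# scatters a flat hex image into per-bank 2D-array abstractions.

descriptor = {
    "bytes_per_word": 4,
    "endianness": "little",
    "banks": 2,               # words interleave across banks (assumed policy)
}

def backdoor_load(image_bytes, desc):
    """Split a flat byte image into per-bank word lists."""
    bpw = desc["bytes_per_word"]
    order = desc["endianness"]
    banks = [[] for _ in range(desc["banks"])]
    for i in range(0, len(image_bytes), bpw):
        word = int.from_bytes(image_bytes[i:i + bpw], order)
        banks[(i // bpw) % desc["banks"]].append(word)
    return banks

image = bytes([0x78, 0x56, 0x34, 0x12,   # word 0 -> 0x12345678 (little-endian)
               0xEF, 0xBE, 0xAD, 0xDE])  # word 1 -> 0xDEADBEEF
mem = backdoor_load(image, descriptor)
```

A dump operation is the same loop in reverse; because loading and dumping are this stereotypical, one tool reading such descriptors can replace per-platform loader code.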
Research Manuscript
Embedded Systems
Time-Critical and Fault-Tolerant System Design
DescriptionTime-Sensitive Networking (TSN) technology has been increasingly deployed in mission- and safety-critical industrial applications to achieve high-throughput and deterministic communications. To provide stringent timing guarantees, TSN requires that network devices follow a predefined communication schedule for real-time end-to-end packet processing, involving both TSN bridges and end stations. Extensive efforts have been devoted to TSN bridge design in the literature. Achieving TSN compatibility on end stations (especially on Commercial Off-The-Shelf (COTS) hardware), however, is challenging due to the inefficiencies of general-purpose CPUs and unpredictable bus contention. To fill this gap, this work presents a software-based open-source approach that i) enables nanosecond-level packet transmission accuracy based on DPDK, and ii) employs a novel multi-core scheduling algorithm to boost the throughput of real-time TSN traffic. Our proposed solution leverages existing COTS hardware and thus is more generic and cost-effective than existing hardware-centric solutions. We validate our design by developing a prototype end station and incorporating it within an eight-bridge TSN network testbed. Our extensive experiments demonstrate the efficiency and effectiveness of our design at both the device and system levels.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionBit-level sparsity in neural network models harbors immense untapped potential.
Eliminating redundant calculations of randomly distributed zero-bits significantly boosts computational efficiency.
Yet, the traditional digital SRAM-PIM architecture, limited by its rigid crossbar structure, struggles to exploit this unstructured sparsity effectively.
To address this challenge, we propose Dyadic Block PIM (DB-PIM), a novel algorithm-architecture co-design framework.
It preserves the random distribution of non-zero bits to maintain accuracy while restricting the number of non-zero bits in each weight of the filter to improve regularity.
DB-PIM improves both performance and energy efficiency, achieving a remarkable speedup of up to 6.53x and energy savings of 77.50%.
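The constraint described above, preserving where the non-zero bits fall while capping how many each weight may have, can be shown on plain integers. The cap value and the weights below are invented for illustration:

```python
# Toy illustration: cap the number of non-zero bits per weight while
# keeping the surviving bits at their original (random) positions.

def constrain_bits(weight, max_bits=2):
    """Keep only the `max_bits` most significant set bits of an 8-bit weight."""
    kept, bits = 0, 0
    for pos in range(7, -1, -1):       # scan from MSB to LSB
        if weight & (1 << pos):
            kept |= 1 << pos           # bit position is preserved as-is
            bits += 1
            if bits == max_bits:
                break
    return kept

weights = [0b10110101, 0b00000011, 0b01000000]
constrained = [constrain_bits(w) for w in weights]
```

Every constrained weight now has a bounded number of non-zero bits (the regularity the PIM macro needs), yet those bits remain wherever they originally were (the randomness that protects accuracy).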
Research Manuscript
Embedded Systems
Embedded System Design Tools and Methodologies
DescriptionSystemC TLM-2.0 is currently the industry standard for simulating full Systems-on-a-Chip (SoCs). Although SystemC is designed to simulate the behavior of complex, parallel systems, the simulation itself is by default single-threaded. We present a technique to overcome this performance limitation by parallelizing the CPU model of a SystemC-TLM-2.0-based system-level simulator, a so-called Virtual Platform (VP). Our solution is fully compliant with the SystemC standard. To further increase the performance, we developed algorithms for asynchronous DMI pointer caching and we introduced a new tunable parameter called async_rate. This parameter controls the frequency used to annotate timing information to SystemC.
Evaluation results demonstrate a significant speedup compared to sequential execution, with a maximum of 7.8x achieved for octacore VPs on fully parallelizable workloads. For the execution of the NPB suite on the SIM-V VP, an average speedup of 6.2x is achieved. This approach is a promising solution for accelerating VPs while adhering to the SystemC standard.
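The role of the `async_rate` parameter can be sketched outside SystemC: a CPU thread advances a local clock freely and only synchronizes timing with the global simulation every `async_rate` instructions. This plain-Python model, with invented instruction counts and delays, only illustrates the trade-off, not the VP's actual implementation:

```python
# Conceptual sketch of async_rate: fewer timing-annotation points means
# less synchronization overhead, at the cost of coarser interleaving.

def run_cpu(n_instructions, async_rate, ps_per_instr=100):
    """Return (local_time_ps, number of synchronization points)."""
    local_time = 0
    syncs = 0
    for i in range(1, n_instructions + 1):
        local_time += ps_per_instr        # execute one instruction locally
        if i % async_rate == 0:           # annotate timing to the kernel
            syncs += 1                    # (a sync point in the real VP)
    return local_time, syncs

fast = run_cpu(1000, async_rate=100)   # coarse annotation: 10 sync points
tight = run_cpu(1000, async_rate=10)   # tight annotation: 100 sync points
```

Both runs account for the same total simulated time; tuning `async_rate` only changes how often the parallel CPU model pays the synchronization cost.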
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe memory-intensive embedding layer in recommendation models continues to be the performance bottleneck. Prior works have attempted to improve embedding layer performance by exploiting data locality to cache the frequently accessed embedding vectors and their partial sums. However, these solutions rely on a static cache, which is invalidated in embedding training scenarios where the embedding vectors are updated frequently. To this end, this paper proposes ReFree, a redundancy-free near-memory processing (NMP) solution for embedding training. Specifically, ReFree identifies reusable data in real time for both the forward and backward propagation of embedding layer training, and leverages a lightweight NMP architecture to enable redundancy-free near-memory acceleration of the entire embedding training process. Evaluation results on real-world datasets show that ReFree outperforms state-of-the-art solutions by 10.9x and reduces energy consumption by 5.3x on average.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionSignal integrity becomes more critical to modern digital systems such as solid-state drives due to their high-speed operation. However, one of the challenges in signal integrity analysis is S-parameter modeling process for printed circuit boards (PCB). Due to increasing PCB design complexity, existing numerical methods take too long to solve governing equations for S-parameters. To overcome the issue, we present a novel deep learning framework, TraceFormer, to predict S-parameters of PCB traces. Our framework constructs a graph from PCB traces and tokenizes trace segments with geometric and topological information. A transformer encoder produces PCB representations from the tokens, followed by extraction networks which predict four different types of complex-valued S-parameters together. TraceFormer achieved above 0.99 R-squared score up to 15GHz for 4-port PCB designs, resulting in less than 3.1% and 4.2% errors in terms of the eye diagram's width and height, respectively.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionWith the evolution of network infrastructure, the patterns of network traffic have become unprecedentedly complex. Conventional machine learning algorithms struggle to cope with the high-dimensional data and real-time processing speeds required in such complex networks. Fortunately, Hyperdimensional Computing (HDC), which is power-efficient and supports parallel processing, provides a potential solution to this challenge. In this paper, we present TrafficHD, a novel classification framework that leverages HDC to analyze network traffic in real time. By transforming network traffic features into high-dimensional binary vectors, TrafficHD enables the rapid execution of recognition tasks within the constraints of real-time systems. Extensive evaluations on a wide range of network tasks show that TrafficHD runs 30.57× and 98.32× faster than state-of-the-art (SOTA) machine learning and HDC algorithms, respectively, while providing 3× higher robustness to network noise.
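The encoding step described above, mapping traffic features into high-dimensional vectors and classifying by similarity, follows the standard HDC recipe, sketched here with bipolar (±1) vectors. The dimension, feature names, and quantization levels are invented; TrafficHD's exact encoder may differ:

```python
# Minimal HDC sketch: bind feature/value hypervectors, bundle them into a
# single representation, and classify by normalized dot-product similarity.
import random

DIM = 2048
random.seed(0)

def rand_hv():
    return [random.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):                      # element-wise multiply
    return [x * y for x, y in zip(a, b)]

def bundle(hvs):                     # element-wise majority vote
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def sim(a, b):                       # normalized dot product, in [-1, 1]
    return sum(x * y for x, y in zip(a, b)) / DIM

feature_hvs = {f: rand_hv() for f in ("port", "size", "proto")}  # assumed features
level_hvs = {v: rand_hv() for v in range(4)}                     # quantized values

def encode(sample):                  # sample: {feature: quantized level}
    return bundle([bind(feature_hvs[f], level_hvs[v]) for f, v in sample.items()])

proto_web = encode({"port": 0, "size": 1, "proto": 2})   # stored class prototype
query = encode({"port": 0, "size": 1, "proto": 2})       # matching traffic
other = encode({"port": 3, "size": 3, "proto": 0})       # different traffic
```

Matching traffic encodes to a vector nearly parallel to its class prototype, while unrelated traffic stays near-orthogonal; all operations are element-wise, which is what makes HDC cheap to parallelize in hardware.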
Research Manuscript
Design
Emerging Models of Computation
DescriptionThe advancement of neuromorphic devices (NDs) for processing deep neural networks has narrowed the accuracy gap with software-trained models. To accurately assess ND performance, reliable simulation frameworks for on-chip training are crucial. We critically evaluated existing frameworks, identifying key defects in the training process. Consequently, we introduce TraiNDSim, a novel framework that addresses these issues. In refining the training process, we propose an advanced conductance normalization strategy called layer-wise normalization, which limits the weight range by taking the initial weight distribution into account. Additionally, our framework integrates three conductance models, notably refining one of the conventional models to depend solely on nonlinearity. Moreover, it features a bi-directional weight representation method with a unique conductance compensation technique. Our comprehensive analysis using TraiNDSim demonstrates its effectiveness in accurately reflecting the impact of ND parameters on training, promising more precise device performance evaluations. Our framework is available at https://anonymous.4open.science/r/TraiNDSim-FC25.
Exhibitor Forum
DescriptionSince the inception of the semiconductor industry in the 1950s, there have been continuous advancements on multiple fronts. Semiconductor chips are becoming increasingly powerful and complex.
Designing such complex chips is extremely difficult, costly, and error-prone. Design methodologies have also evolved to keep pace with the growing complexity. Historically, this evolution came in the form of raising the abstraction of chip design: from component level to gate level, and further to RTL.
Shift-left methodology allows the identification and resolution of flaws early in the design cycle. Fast simulation models allow early software development in parallel with hardware design. It is feasible to engage potential customers and validate the SoC architecture with real customer workloads quite early in the cycle. These methodologies significantly reduce cost and time to market, and overall they enhance the probability of success for a new SoC.
Two key trends in the industry today are the use of chips for AI applications and the emergence of the open-source RISC-V ISA. Effective use of ESL methodologies is necessary for the success of these trends.
The presentation will cover the virtual prototype of Core-V-MCU, an open source System on Chip from OpenHW group. This SoC has a RISC-V CPU core, an embedded FPGA, on chip SRAM, and a rich set of peripherals.
Research Manuscript
Security
Embedded and Cross-Layer Security
DescriptionAnalyzing the security of closed-source drivers and libraries in embedded systems holds significant importance, given their fundamental role in the supply chain. Unlike x86, embedded platforms lack comprehensive binary manipulating tools, making it difficult for researchers and developers to effectively detect and patch security issues in such closed-source components. Existing works either depend on full-fledged operating system features or suffer from tedious corner cases, restricting their application to bare-metal firmware prevalent in embedded environments.
In this paper, we present PIFER (Practical Instrumenting Framework for Embedded fiRmware), which enables general and fine-grained static binary instrumentation for embedded bare-metal firmware. By abusing the built-in hardware exception-handling mechanism of embedded processors, PIFER can perform instrumentation at arbitrary target addresses. Additionally, we propose an instruction-translation-based scheme to guarantee the correct execution of the original firmware after patching. We evaluate PIFER against real-world, complex firmware, including Zephyr RTOS, the CoreMark benchmark, and a closed-source commercial product. The results indicate that PIFER correctly instrumented 98.9% of the instructions. Further, a comprehensive performance evaluation demonstrates the practicality and efficiency of our work.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionFloating-point compute-in-memory (FP-CIM) is regarded as an attractive approach to enhance energy efficiency for complex neural networks. Digital-domain compute mechanisms have been widely utilized in CIM designs owing to their high robustness to PVT variations. However, the energy consumption of digital CIM is significantly influenced by the toggle rate of the compute tree. In this work, a toggle-rate-immune floating-point digital CIM (TRIFP-DCIM) design is proposed, with a 34.03% compute-energy reduction on average. Combined with the TRIFP-DCIM design, a toggle-rate gathering method is employed in the neural network training/inference process with almost no accuracy loss. Experimental results show that TRIFP-DCIM can achieve 14.51-36.83 TFLOPS/W@BF16 in a 28nm technology process.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionTrusted Execution Environment (TEE)-based Federated Learning (FL) faces significant challenges: (1) the server cannot tolerate malicious physical attacks, and (2) the ability of FL clients to install TEEs is significantly limited. Our study concentrates on developing a server-focused FL framework that reduces the need for extensive client-side TEE attestations while strengthening the server against side-channel attacks. We introduce three novel solutions: (1) employing a server-only TEE, which significantly cuts down on client-side attestation requirements; (2) implementing a TEE-supported mutual attestation and Byzantine Fault Tolerance (BFT) protocol to boost server reliability; and (3) integrating Oblivious RAM to conceal memory access patterns (MAP), safeguarding against MAP attacks. These solutions aim to reinforce the privacy, integrity, and, importantly, reliability of FL systems, striking a balance between practicality and high security. The efficiency of our approach has been validated through extensive experiments, which confirm a well-balanced trade-off among increased latency, enhanced system security, and the reliability of the FL system.
Research Manuscript
Design
Emerging Models of Computation
DescriptionWith the exponential growth of digital data, DNA is emerging as an attractive medium for storage and computing. Thus, design methods for encoding, storing, and searching digital data within DNA storage are of utmost importance. This paper introduces image classification as a measurable task for evaluating the performance of DNA encoders in similar image searches. Furthermore, we propose a novel triplet network-based DNA encoder to improve the accuracy and efficiency. The evaluation using the CIFAR-100 dataset demonstrates that the proposed encoder outperforms existing encoders in retrieving similar images, with an accuracy of 0.77, which is equivalent to 94% of the practical upper limit, and 16 times faster training time.
Research Manuscript
Design
AI/ML System and Platform Design
DescriptionAttention-based models provide significant accuracy improvements to the Natural Language Processing (NLP) and computer vision (CV) fields at the cost of heavy computational and memory demands. Previous works seek to alleviate the performance bottleneck by removing useless relations for each position. However, their attempts focus only on intra-sentence optimization and overlook the opportunity in the temporal domain. In this paper, we accelerate attention by leveraging the tempo-spatial similarity across successive sentences, given the observation that successive sentences tend to bear high similarity. This is rational owing to the many semantically similar words (namely tokens) in attention-based models. We first propose an online-offline prediction algorithm to identify similar tokens/heads. We then design a recovery algorithm so that we can skip the computation on similar tokens/heads in succeeding sentences and recover their results by copying other tokens/heads features in preceding sentences to preserve accuracy. From the hardware aspect, we propose a specialized architecture, TSAcc, that includes a prediction engine and a recovery engine to translate the computational saving in the algorithm into real speedup. Experiments show that TSAcc can achieve 8.5×, 2.7×, 14.1×, and 64.9× speedup compared to SpAtten, Sanger, a 1080TI GPU, and a Xeon CPU, with negligible accuracy loss.
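The skip-and-recover idea above can be sketched in miniature: if a token in the current sentence is near-identical to one in the previous sentence, skip its expensive computation and copy the previous result. The vectors, tolerance, and the stand-in `attend` function are all invented for illustration:

```python
# Toy sketch of tempo-spatial reuse across successive sentences.

def close(a, b, tol=0.1):
    """Element-wise similarity test between two token vectors."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def attend(token):                    # stand-in for the expensive attention op
    return [2 * x for x in token]

def process(prev_tokens, prev_results, cur_tokens):
    """Reuse previous-sentence results for similar tokens; compute the rest."""
    results, computed = [], 0
    for tok in cur_tokens:
        hit = next((r for p, r in zip(prev_tokens, prev_results)
                    if close(p, tok)), None)
        if hit is not None:
            results.append(hit)       # recovery: copy the previous result
        else:
            results.append(attend(tok))
            computed += 1
    return results, computed

prev = [[1.0, 0.0], [0.0, 1.0]]
prev_res = [attend(t) for t in prev]
cur = [[1.0, 0.05], [5.0, 5.0]]       # first token is similar, second is new
res, computed = process(prev, prev_res, cur)
```

In the real design, the prediction engine makes the `close` decision cheaply and the recovery engine performs the copy, so the saving survives in hardware.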
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionTSVs affect surrounding devices, causing device performance and reliability to vary within several micrometers of the TSV.
TSVs affect the saturation current (Idsat) of NMOS and PMOS devices; this effect can be characterized as the variation of the FET's Idsat with distance from the TSV, and the affected region can be divided into a Soft KOZ and a Hard KOZ.
In the Hard KOZ, the placement of DVCs and the routing of metal are prohibited, as the impact on DVC performance is large.
In the Soft KOZ, the placement and routing of standard cells are allowed, as the impact of TSV stress on the devices can be predicted.
The timing impact of cells within the KOZ is reflected in the timing analysis.
When the proposed KOZ method is applied, the area of the KOZ decreases by 4.58% compared to the reference. The cell area decreases by 2.86%, resulting in a 7.44% decrease in the overall block area.
The speed of the top 300 critical paths at the block level is degraded by 0.16%.
Back-End Design
Back-End Design
Design
Engineering Tracks
DescriptionThis paper presents a power distribution network (PDN) design for UCIe-A (UCIe advanced package) in an organic interposer technology. To meet the UCIe-A power integrity requirement of voltage fluctuation (Vpp) below 30mV, interposer-level decoupling capacitors are a critical design element (e.g., distributed eDTC in CoWoS-S). For organic interposers, however, solutions for localized and efficient noise decoupling remain limited, especially since the UCIe-A X64E bump map IP hardmacro has small dimensions of 1225μm x 388.8μm.
This work proposes an integrated localized decoupling capacitor solution, deploying C4-bump-side integrated passive devices (IPDs) in the UCIe-A X64E die-to-die gap. This provides efficient local decoupling paths for each UCIe-A IP macro, without degrading PDN parasitics or occupying extra area for decoupling capacitors.
This work demonstrates the design of a UCIe-A X64E testchip in tsmc 3nm technology (taped out in Nov. 2023), together with the interconnects and PDNs on a tsmc 65nm organic interposer (CoWoS-R, 8-RDL). The co-simulated PDN peak impedance (ZPDN) is suppressed by 55% (from 21.59mΩ to 11.87mΩ at 100MHz), and the peak-to-peak voltage fluctuation (Vpp) is suppressed by 78.7% (from 103.00mV to 21.98mV). With this PDN design, the power-aware SI co-simulated eye diagram achieves 0.78UI at 32GT/s.
Exhibitor Forum
DescriptionProcess variation library models in the Liberty Variation Format (LVF) have become commonplace in timing signoff for standard cells, yet the embedded memories that comprise most of the chip area still employ library modeling methodologies from several technology generations ago. The main challenge is the greatly increased amount of simulation required to extract meaningful LVF data compared to nominal timing characterization.
It is known that the effect of random process variation is local to a small portion of the circuit. LVF characterization methods based on full-macro or critical-path simulation observe these effects at a global or near-global scale. As a result, they cannot fully utilize the simulation effort, causing inefficiency. We believe a good methodology, combining strategic partitioning of the design with localized OCV models, can greatly improve the efficiency of LVF characterization without compromising accuracy.
This presentation unveils the key technologies behind Liberal-Mem, the memory characterization system from Empyrean, offering a highly efficient solution for embedded memory characterization, especially LVF extraction.
Research Manuscript
Security
Hardware Security: Attack and Defense
DescriptionTraditionally, power side-channel analysis requires physical access to the target device, as well as specialized devices to measure the power consumption with enough precision.
Recent research has shown that on x86 platforms, on-chip power meter capabilities exposed through a software interface can be used for power side-channel attacks without physical access. In this paper, we show that such software-based power side-channel attacks are also applicable on Apple silicon (e.g., M1/M2 platforms), exploiting the System Management Controller (SMC) and its power-related keys, which provide access to the on-chip power meters through a software interface to user-space software.
We observed data-dependent power consumption reporting from such SMC keys and analyzed the correlations between the power consumption and the processed data. Our work also demonstrates how an unprivileged user-mode application successfully recovers bytes of an AES encryption key from a cryptographic service supported by a kernel-mode driver in macOS. We have also studied the feasibility of performing a frequency throttling side-channel attack on Apple silicon. Furthermore, we discuss the impact of software-based power side-channels in the industry, possible countermeasures, and the overall implications of software interfaces for modern on-chip power management systems.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe computing-in-memory (CIM) architecture has demonstrated high energy efficiency on memory-intensive and computation-intensive AI workloads. Despite the high energy efficiency of CIM circuits (i.e., macros), there remains a significant gap between macro-level and processor-level energy efficiency in existing CIM chips. The key bottleneck is that non-trivial surrounding modules are still necessary to implement an end-to-end CIM processor, including SRAMs, register buffers, accumulators, etc. These surrounding modules show varied influences under different CIM configurations and workloads.
This work is motivated to explore the upper bound of energy efficiency in a CIM processor, and to explore methods to approach the theoretical limit. The main contributions of this paper include: 1) Reveal the necessary modules of a CIM processor for different scales of applications; 2) Propose a quantitative analysis of the processor-level energy efficiency for different CIM architectures, as well as the gap between actual values and upper bounds; 3) Indicate the design principles to approach the theoretical upper bound of energy efficiency. Experimental results show a widely varying processor/macro efficiency ratio (9.06%-44.3%) under varied design parameters and AI workloads. With the proposed flexible parallelism, the processor/macro efficiency ratio can be improved by up to 15.0%.
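The macro-vs-processor efficiency gap can be illustrated with a back-of-the-envelope model: if the surrounding modules draw power but add no throughput, processor-level efficiency is the macro efficiency scaled by the macro's share of total power. The function name and the figures below are hypothetical, not taken from the paper.

```python
def processor_efficiency(macro_tops_per_w, macro_power_w, surround_power_w):
    """Processor-level TOPS/W when surrounding modules (SRAMs, buffers,
    accumulators) consume power without contributing throughput."""
    tops = macro_tops_per_w * macro_power_w   # throughput delivered by the macro
    return tops / (macro_power_w + surround_power_w)

# A macro at 100 TOPS/W drawing 0.2 W next to 0.8 W of surrounding logic
# yields 20 TOPS/W, i.e. a 20% processor/macro efficiency ratio.
ratio = processor_efficiency(100.0, 0.2, 0.8) / 100.0
```

A 20% ratio of this kind lands inside the 9.06%-44.3% range the paper reports for real configurations.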
Research Manuscript
EDA
Physical Design and Verification
DescriptionAdiabatic quantum-flux-parametron (AQFP) logic, known for its remarkable energy efficiency, has emerged as a prominent superconductor-based logic family, surpassing traditional rapid single flux quantum (RSFQ) logic. In AQFP circuits, each cell operates on AC power, serving as both a power supply and a clock signal to drive data flow across clock phases. However, signal attenuation with increasing wire length may result in more potential data errors. To address this, rows of buffers are inserted as repeaters to ensure data synchronization and avoid wirelength violations. However, these inserted buffer rows significantly amplify power consumption and circuit delay in AQFP placement.
To resolve these challenges, this paper proposes an innovative, analytical method for AQFP placement that aims to minimize the need for additional buffers. The framework incorporates two key features: (1) entanglement entropy for topology initialization and (2) projection for placement and buffering. These features avoid intensive computations, such as fixed-order Lagrangian optimization in large-scale scenarios, while significantly reducing the required number of buffer rows. Experimental results validate the efficiency of the proposed framework, demonstrating an outstanding 29% reduction in the number of required buffers and a 40% reduction in runtime compared to the state-of-the-art method.
Research Manuscript
Design
Emerging Models of Computation
DescriptionSuperconductive rapid single-flux quantum (RSFQ) ICs dissipate 10-100x less power than CMOS while operating at tens of GHz. The issue of path balancing in RSFQ systems, however, incurs significant area overhead, which is particularly severe due to the limited layout density of RSFQ fabrication.
The SFQ T1-cell realizes the full adder function with 60% less area compared to the conventional implementation. This cell, however, imposes complex input timing constraints. With multiphase clocking, the T1-cell input timing can be efficiently satisfied. Here, we propose an SFQ technology mapping methodology supporting T1-cells. The area of arithmetic SFQ networks is reduced by up to 25%.
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionDeep Learning Recommendation Models (DLRMs) have gained popularity in recommendation systems due to their effectiveness in handling large-scale recommendation tasks. The embedding layers of DLRMs have become the performance bottleneck due to their intensive demands on memory capacity and bandwidth. In this paper, we propose UpDLRM, which utilizes real-world processing-in-memory (PIM) hardware, the UPMEM DPU, to boost memory bandwidth and reduce recommendation latency. The parallel nature of the DPU memory can provide high aggregated bandwidth for the large number of irregular memory accesses in embedding lookups, thus offering great potential to reduce inference latency. To fully utilize the DPU memory bandwidth, we further study the embedding table partitioning problem to achieve good workload balance and efficient data caching. Evaluations using real-world datasets show that UpDLRM achieves much lower inference time for DLRMs compared to both CPU-only and CPU-GPU hybrid counterparts.
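The table-partitioning problem resembles classic load balancing: distribute embedding tables across DPUs so that lookup traffic is even. Below is a minimal greedy sketch (the longest-processing-time heuristic) under that simplified view; `partition_tables` and its inputs are hypothetical and ignore the caching considerations UpDLRM also optimizes for.

```python
import heapq

def partition_tables(access_counts, num_dpus):
    """Greedily assign each embedding table (hottest first) to the DPU
    with the least accumulated lookup load, balancing traffic across
    DPU memory banks."""
    heap = [(0, d, []) for d in range(num_dpus)]   # (load, dpu_id, tables)
    heapq.heapify(heap)
    for table, count in sorted(access_counts.items(), key=lambda kv: -kv[1]):
        load, d, tables = heapq.heappop(heap)      # least-loaded DPU
        tables.append(table)
        heapq.heappush(heap, (load + count, d, tables))
    return {d: tables for _, d, tables in heap}
```

For example, tables with lookup counts {8, 7, 3, 2} across two DPUs split into {8, 2} and {7, 3}, giving both DPUs a load of 10.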
Front-End Design
Design
Engineering Tracks
Front-End Design
DescriptionTime-to-market and bug-free chip requirements have put a lot of pressure on the verification domain, resulting in multiple verification techniques that complement each other. UVM is a reusable and robust verification environment. AMS verification test cases can be integrally fused into UVM and the design module description at an arbitrary abstraction level, which allows an effective verification test setup. In this paper, we discuss how we have extended the existing IEEE 1687 flow to support analog signals. Then we describe the approach with which a verification engineer can effectively reuse DFT test cases described in the IEEE 1687 Procedural Description Language (PDL). PDL is suited to describing the digital setup of an AMS test and is written at the IP level. It guarantees a path to any production test system and is therefore used to describe AMS test cases as an input for DFT verification. The approach has the potential to improve the quality of DFT test cases and the overall code coverage of the AMS design.
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
DescriptionDespite recent advances in algorithms such as the use of reinforcement learning, analog circuit sizing optimization remains a challenging task that demands numerous circuit simulations, and hence extensive CPU time. This paper presents the application of Model-Based Policy Optimization (MBPO) to boost the sample efficiency of reinforcement learning for analog circuit sizing. This method leverages an ensemble of probabilistic dynamics models to generate short rollouts branched from real data for a fast, extensive exploration of the design space, thereby speeding up the learning process of the reinforcement learning agent and enhancing its convergence. Integrated into the Twin Delayed DDPG (TD3) algorithm, our new model-based TD3 (MBTD3) approach has been validated on analog circuits of different complexity, outperforming the existing model-free TD3 method by achieving power/area-optimal design solutions with up to 3x fewer simulations and half the run time. In addition, for larger analog circuits, we present a multi-agent version of MBTD3 in which multiple simultaneous agents use global probabilistic models for sizing different blocks within the circuit. Demonstrated on a complex data receiver circuit, it surpasses the model-free multi-agent TD3 method with 2x fewer simulations and half the run time. These novel methods greatly boost the efficiency of automated analog circuit sizing.
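The core MBPO loop can be sketched as: pick states from the real replay buffer, roll the policy forward a few steps through a randomly chosen ensemble model, and add the synthetic transitions to the agent's training data. This is a hedged sketch with hypothetical callables (a `policy` mapping state to action, dynamics `models` returning `(next_state, reward)`), not the paper's actual implementation.

```python
import random

def mbpo_rollouts(real_buffer, models, policy, k=3, n=100):
    """Branch n short k-step rollouts from real states using an ensemble
    of learned dynamics models; the synthetic transitions augment the
    data used to train the RL agent (e.g. TD3's replay buffer)."""
    synthetic = []
    for _ in range(n):
        state = random.choice(real_buffer)["state"]   # branch from real data
        for _ in range(k):
            action = policy(state)
            model = random.choice(models)             # random ensemble member
            next_state, reward = model(state, action)
            synthetic.append({"state": state, "action": action,
                              "reward": reward, "next_state": next_state})
            state = next_state
    return synthetic
```

Keeping the rollouts short (small `k`) limits compounding model error, which is what lets the synthetic data safely multiply the agent's effective sample count and cut the number of real circuit simulations.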
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
DescriptionHyperdimensional computing (HDC) is a bio-inspired machine learning paradigm utilizing hyperdimensional spaces for data representation. HDC significantly improves the ability to learn from sparse data, enhances noise robustness, and enables parallel computation. Despite these advantages, HDC's reliance on high dimensionality and operational simplicity can lead to increased hardware costs and potential security vulnerabilities. This paper introduces a novel HDC encoding strategy using variation-based analog entropy (VAE), aiming to reduce memory footprint, lower power/energy consumption, and enhance security with physically-unclonable entropy generation. The VAE cell, with high entropy robustness (30.23-57.76 dB SNR) and a small footprint (10 transistors), allows HDC to achieve a 14.3x reduction in vector dimensions, a 4.4x decrease in unit entropy cell area, and a 2% increase in accuracy compared to binary/multi-bit HDC. These benefits lead to a 1.3-4.4x area reduction and a 327x leakage power reduction compared to an SRAM baseline. We have designed custom low-power circuits that enable end-to-end analog entropy storage, distribution management, binding, permutation, and bundling. This analog implementation avoids data conversion during feature vector encoding, thereby significantly enhancing energy efficiency (48.5 nJ per query). Furthermore, with hardware-secured basis vectors, data security is significantly improved, as evidenced by the markedly degraded visual distinguishability of retrieved image data and a PSNR lower by up to 11dB.
Research Manuscript
AI
AI/ML Application and Infrastructure
DescriptionDetecting complex anomalies on massive amounts of data is a crucial task in Industry 4.0, best addressed by deep learning. However, available solutions are computationally demanding, requiring cloud architectures prone to latency and bandwidth issues. This work presents VARADE, a novel solution implementing a light autoregressive framework based on variational inference, which is best suited for real-time execution on the edge. The proposed approach was validated on a robotic arm, part of a pilot production line, and compared with several state-of-the-art algorithms, obtaining the best trade-off between anomaly detection accuracy, power consumption and inference frequency on two different edge platforms.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionSynopsys VC LP is a static low power verification checker which helps to verify consistency between UPF and design throughout the design flow.
Traditional use models of VC LP make sure that the UPF is correct and complete at the RTL stage and that, at the netlist stage, the inserted low power cells (multi-voltage or MV cells), such as isolation cells and level shifters, are structurally and electrically correct.
New electrical issues introduced after MV cell insertion are caught only at the netlist stage. There is an increasing demand to catch netlist-level low power issues at the RTL stage itself and to reduce noise by predicting post-synthesis behavior.
The virtual instrumentation-based flow in VC LP shifts low power verification left by virtually instrumenting the MV cells in the design based on the power intent, leading to more accurate verification of the design at the RTL stage.
VC LP has achieved ~99% accuracy in predictive checkers w.r.t. netlist runs on customer designs.
One customer has enabled this feature on 100+ sub-systems and SoCs, matching netlist behavior at the RTL stage, especially for back-to-back ISO/LS cases. This flow is also gaining traction with various elite customers.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionHyperdimensional Computing (HDC) represents an emerging paradigm within the domain of cognitive computing, drawing inspiration from the information processing mechanisms observed in the human brain.
Despite the research efforts devoted to improving and extending HDC algorithms and hardware, few studies address the challenges of handling complex datasets in image classification. In this study, we examine the performance of HDC on image data, highlighting its limitations in terms of accuracy, susceptibility to anomalies, and privacy concerns. Inspired by our observations, we aim to combine HDC with a more effective feature extraction, without compromising overall efficiency. To this end, we analyze the efficiency bottleneck of HDC and, accordingly, propose a novel vector-free encoding that shrinks energy consumption and obviates the need for vector storage. We repurpose the released resources to augment the proposed encoding with a well-crafted feature extractor. Experimental results indicate that our proposed design, VisionHD, gains a significant accuracy improvement (>22%) while its energy consumption remains 30% lower than the baseline HDC. To evaluate the privacy of VisionHD, we introduce a more effective and generic reversing technique, which reveals that VisionHD successfully obfuscates the information and improves privacy metrics by 16.9X. Furthermore, VisionHD consistently exhibits higher accuracy under different rates of perturbation.
Research Manuscript
AI
Design
AI/ML Architecture Design
DescriptionVision Transformers have demonstrated remarkable performance in various vision tasks. However, general-purpose processors, such as CPUs and GPUs, face challenges in efficiently handling the inference of Vision Transformers. To address the issue, prior works have focused on accelerating only attention, due to its high computational cost in NLP Transformers. In contrast, Vision Transformers incur a higher computational cost from linear modules, such as linear transformation, linear projection, and the Feed-Forward Network (FFN), than from attention. In this paper, we present ViT-slice, an algorithm-architecture co-design that enhances end-to-end performance and energy efficiency by optimizing not only attention but also the linear modules. At the algorithm level, we propose bit-slice compression that avoids storing the redundant most significant bits (MSBs). Additionally, we present a bit-slice dot product with early skip to efficiently compute the dot product using bit-sliced data. To enable early skip during the dot product computation, we leverage a trainable threshold. At the hardware level, we introduce a specialized bit-slice dot product unit (BSDPU) to efficiently process the bit-slice dot product with the early skip algorithm. Additionally, we present a bit-slice encoder and decoder for on-chip bit-slice compression. ViT-slice achieves 244×, 35.3×, 16.8×, 10.4×, 5.0× end-to-end speedup over a Xeon CPU, EdgeGPU, TITAN Xp GPU, Sanger accelerator, and ViTCoD accelerator, respectively.
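The bit-slice dot product can be sketched as follows: decompose the operands into bit planes, accumulate plane-by-plane from the MSB, and optionally stop once the remaining low-order contribution is judged negligible. In this toy sketch the threshold is a fixed constant standing in for the paper's trainable threshold, and all names are hypothetical.

```python
import numpy as np

def bit_slices(x, bits=8):
    """Decompose non-negative integers into bit planes, MSB first."""
    return [((x >> b) & 1) for b in range(bits - 1, -1, -1)]

def sliced_dot(a, w, bits=8, threshold=None):
    """Dot product accumulated slice by slice from the MSB. With a
    threshold set, the remaining low-order slices are skipped once the
    partial sum falls below it (the paper learns this threshold)."""
    acc = 0
    for b, plane in zip(range(bits - 1, -1, -1), bit_slices(a, bits)):
        acc += (1 << b) * int(plane @ w)        # contribution of bit plane b
        if threshold is not None and b < bits - 1 and acc < threshold:
            break                               # early skip of LSB slices
    return acc
```

Without a threshold the slice-wise accumulation reconstructs the exact integer dot product; with early skip it trades a bounded underestimate for fewer slice operations.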
Research Manuscript
Design
AI/ML System and Platform Design
DescriptionVision Transformers (ViTs) have emerged as a promising solution to enable efficient 3D Human Mesh Recovery (HMR) in augmented and virtual reality (AR/VR) applications. However, it remains a challenge to efficiently accelerate ViT-based HMR due to the high computational complexity and memory access footprint. In this paper, we propose VITA, a hardware and algorithm co-design framework for ViT-based HMR with much-improved performance and energy efficiency. On the algorithm side, we propose a pooling attention model optimized for regular memory access and reduced computational complexity. On the hardware side, we propose an accelerator architecture capable of adapting to the varied data movement caused by different pooling operations.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper introduces ViTSen, which optimizes Vision Transformers (ViTs) for resource-constrained edge devices. It features an in-sensor image compression technique to effectively reduce data conversion and transmission power costs. Further, ViTSen incorporates a ReRAM crossbar array, enabling efficient near-sensor analog convolution. This integration, together with novel pixel readout and peripheral circuitry, decreases the reliance on analog buffers and converters, significantly lowering power consumption. To make them ViTSen-compatible, several established ViT algorithms have undergone quantization and channel reduction. Circuit-to-application co-simulation results show that ViTSen maintains accuracy comparable to a full-precision baseline across various data precisions, achieving an efficiency of approximately 3.1 TOp/s/W.
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
DescriptionIn three-dimensional integrated circuits, the interconnection design among chiplets on redistribution layers (RDLs) is crucial for achieving high-performance computing systems. To optimize the inter-chip connections, most of the previous works focused on automatic signal net routing and pin assignment. The power net routing, or the power plane generation, is still a manual and time-consuming task, especially when generating the power planes of more than ten power supplies on a limited number of RDLs. This paper proposes a novel Voronoi diagram-based multiple power plane generation methodology which simultaneously optimizes the power planes of all power nets by utilizing the white space of given RDLs, while considering the signal routing blockages, power integrity, and complex design rules. Experimental results show that the proposed approach can achieve not only optimal area utilization but also the best power integrity in terms of the total number of redundant vias.
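The Voronoi idea can be illustrated by assigning each free grid cell of an RDL to the nearest bump of some power net, which partitions the white space into contiguous per-net regions. This is a toy sketch only; the paper's method additionally handles signal routing blockages, power integrity, and complex design rules, and all names here are hypothetical.

```python
def voronoi_regions(grid_w, grid_h, power_bumps):
    """Assign each RDL grid cell to the power net with the nearest bump,
    yielding Voronoi-like power plane regions over the white space."""
    regions = {}
    for x in range(grid_w):
        for y in range(grid_h):
            net = min(power_bumps,
                      key=lambda n: min((x - bx) ** 2 + (y - by) ** 2
                                        for bx, by in power_bumps[n]))
            regions.setdefault(net, []).append((x, y))
    return regions
```

For example, two nets with bumps at opposite corners of a 4x4 grid split it along the diagonal, each net's plane covering the cells closest to its own bump.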
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
DescriptionThis paper presents a versatile vertical indexing processor (VVIP) based on a single-instruction multiple-data architecture for edge computing. In VVIP, the vertical source and destination indexing instructions are customized for area-efficient computations. The proposed indexing method reorders data within a processing module by using additional registers and data-steering logic in the calculations. In particular, VVIP supports multibit-serial multiplication and sparse data operations by leveraging register files as lookup tables or accumulators. VVIP, verified on a vector processor, has an area overhead of less than 2.8%. It exhibits an average computation rate 10.1 times faster than 1-bit-serial multiplication in linear algebra benchmarks, and a 1.2x average performance improvement in unstructured sparse point-wise convolution tasks compared to conventional control sequences.
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn this presentation, we are very pleased to introduce the synergy between watsonx.ai and the Design Data Browser (DDB for short). DDB is a powerful, high-performance X-Windows-based application which serves as a central cockpit uniting our core timing and physical database, known as DD, with our web-based interactive timing triage environments, TimingVisualizer and the Timing Takedown Dashboard. IBM® watsonx.ai™ is part of the IBM watsonx™ AI and data platform that brings together new generative AI capabilities, powered by foundation models, and traditional machine learning. With watsonx.ai, you can train, validate, tune, and deploy generative AI, foundation models, and machine learning capabilities with ease, and build AI applications in a fraction of the time with a fraction of the data. In the next few charts, we illustrate how the synergy between watsonx.ai and DDB helps address key challenges in timing take-down.
IP
Engineering Tracks
IP
DescriptionThe latest generation of DDR5 synchronous dynamic random-access memory (DRAM) brings significant advancements over its predecessor, DDR4. These improvements are particularly beneficial in data-intensive applications such as cloud computing, big data analytics, and high-performance computing. DDR5 enhances performance and power efficiency, but it also introduces new technical challenges in design implementation and verification, especially from the Memory Controller (MC) perspective.
The design and verification of power management flows for DDR5 DRAM require addressing several major technical challenges. To ensure exhaustive validation of power management flows, a Formal Property Verification (FPV) based methodology is employed. This approach has yielded encouraging results, highlighting the successful optimization of DDR5 DRAM Power Management.
The verification of DDR5 Power Management using formal technology has led to a detailed formal verification framework that validates power management flows. The results have been promising in terms of bugs found and coverage achieved. This has led to improved accuracy and efficiency of the DDR5 DRAM Memory Controller.
The successful verification and validation of DDR5 DRAM's power management flows demonstrate the effectiveness of the FPV-based methodology. This advancement is crucial for the continued evolution of memory technologies in high-demand computing environments.
Front-End Design
AI
Design
Engineering Tracks
Front-End Design
Description: Are existing verification methodologies running out of steam? Are some newer technologies still finding acceptance and adoption? Are some technologies, such as AI, being over-hyped? Are there too many tools to choose from and are they only affordable to large development teams?
Industry-expert panelists will discuss their views on existing industry-standard verification methodologies (UVM, PSS, Formal, VIP) and future trends towards improved verification (AI and Beyond), and offer their perspectives on the viability of these various tools and trends. Panelists will address potential future improvements to existing tools or new technologies to accelerate verification.
Do not expect the panelists to be in complete agreement on these topics!
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: Wireless power transfer (WPT) techniques provide the possibility for rapid and efficient energy supply to sensors, and mobile charging devices further expand the application scenarios of WPT technology.
Therefore, related modeling and scheduling studies have always been a hot topic.
In this paper, we design a scheduling scheme for efficient charging with mobile chargers in wireless-powered sensor networks (WPSN).
We consider the beamforming technique and a fluctuating channel, which makes the scheduling problem more practical but too complex for linear analysis.
We show the analyzability of area discretization and then prove a constant approximation ratio for our scheduling algorithm in the offline scenario, where the channel state information is known.
In the online scenario, where the channel state information changes dynamically, a bandit algorithm is further proposed to balance exploration and exploitation in both stationary and non-stationary environments.
Simulation results validate that the performance of our algorithm in the offline scenario closely approaches the upper bound, and that the algorithm in the online scenario rapidly approaches the upper bound while effectively tracking changes in channel state information for adaptive adjustment.
Research Manuscript
Security
Hardware Security: Attack and Defense
Description: The vulnerabilities of transient execution have been exploited in many side-channel attacks (SCAs). We report Whisper, a novel transient execution timing (TET) side channel based on the execution-time difference of transient execution under different conditions. We develop TET versions of SCAs including Meltdown, Zombieload, and Spectre-RSB that use Whisper as a covert channel to leak information. We further propose TET-KASLR to break the kernel address space layout randomization (KASLR) mechanism even under the protection of KPTI and FLARE. These attacks are simple to implement and can bypass existing mitigation methods because the TET side channel relies on execution time, which can be conveniently obtained by architectural-level timing analysis. We demonstrate the correctness and effectiveness of these attacks on various x86-64 CPUs. The root cause of Whisper is analyzed with our toolset built on the performance monitor unit (PMU), and potential defenses against Whisper are also discussed.
Front-End Design
Design
Engineering Tracks
Front-End Design
Description: The increasing use of electronic components in safety-critical applications like healthcare and automobiles has made manufacturers aim for zero defects-per-billion deliveries. The onus is on design and verification engineers to deliver such high-quality products without compromising on time-to-market metrics. Dual Core Lock Step (DCLS) is a configuration widely used in Functional Safety (FuSa) applications to alert the user whenever a system deviates from its specified behaviour. Given the many possible implementations of the DCLS configuration, verification of a DCLS implementation becomes a challenging task. In this work, we present a generic DCLS verification package which uses formal and static checks to verify all aspects of a DCLS implementation. We demonstrate some of the bugs found and detail our proposed checks, which have been successfully applied on multiple in-house designs.
Research Panel
AI
EDA
Description: The panel will candidly discuss if and why EDA misses disruptive innovation and chooses incremental advancements.
The EDA industry claims to have the maximum number of PhDs. Still, the last disruptive innovations from EDA, namely Logic Synthesis, P&R, Logic Simulation, and Formal Methods, were invented decades ago.
This panel will attempt to identify the elements required to create bold disruptions. Does the EDA industry need a collective goal, like in other sectors? Does the EDA business duopoly hinder startup innovation?
The discussion will cover the following topics:
1. Current State of EDA and AI
The panel will discuss the current state of AI. They will cover topics such as LLMs, RAG, and RL, and the next steps for these technologies.
2. Efficiency versus Discovery and Innovation
The panel will discuss the tension between efficiency, discovery, and innovation in the EDA and Semi industry. They will discuss how to balance these competing goals and how to create an environment that encourages both.
3. Closed-Form Solution Mindset
The panel will discuss the challenge of getting beyond the "closed-form solution" mindset. They will explore how this mindset can limit the potential of the EDA industry and how to break free from it.
4. Impact of AI on Workforce
Will AI change the nature of work in the EDA and Semiconductor industries? If so, how can we prepare the workforce for these changes?
5. Fostering Disruptive Innovation
Finally, the panel will discuss how to foster disruptive innovation in the EDA and Semi industry. They will explore the necessary factors for disruptive innovation to occur and how to create an environment that encourages it.
IP
Engineering Tracks
IP
Description: In most complex digital architectures, multiple Masters may compete for access to shared hardware resources (e.g., buses, memories, CPUs).
Arbitration algorithms are used to determine which Master takes over control of them at any time.
Fixed priority and Round-Robin are the most widespread algorithms used to address this challenge. While they are effective in some cases, they may not be optimal for systems with varying traffic patterns.
This paper proposes a novel digital IP which implements a new starvation-free multi-master arbitration algorithm based on feedback, considering the "history" of previous requests and grants within a given timeframe. This allows efficient access to the shared resource, outperforming conventional approaches.
The proposed algorithm not only ensures starvation-freedom, meaning that no Master is unfairly blocked from gaining access to the shared resource indefinitely, but can also guarantee a maximum granting time for every Master involved.
The verification of the IP was conducted through formal analysis to prove its starvation-free property, while UVM dynamic simulations showed a significant reduction (from 30% up to 64%, depending on the scenario) in the average waiting time of every involved Master compared to the Round-Robin algorithm.
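One way to picture a history-based, starvation-free arbiter is to track how long each Master has been waiting and grant the longest-waiting requester each cycle. The paper's exact feedback algorithm is not reproduced here, so this is only an illustrative sketch:

```python
def arbitrate(requests_over_time):
    """Grant one requester per cycle, favoring the longest-waiting master.

    requests_over_time: list of per-cycle request vectors, one 0/1 flag
    per master. Returns the granted master index per cycle (None if idle).
    """
    n = len(requests_over_time[0])
    wait = [0] * n                       # accumulated waiting "history"
    grants = []
    for req in requests_over_time:
        pending = [i for i in range(n) if req[i]]
        if pending:
            # longest wait wins; ties broken by lowest index
            g = max(pending, key=lambda i: (wait[i], -i))
            grants.append(g)
            wait[g] = 0                  # reset the winner's history
        else:
            grants.append(None)
        for i in range(n):               # everyone else ages
            if req[i] and grants[-1] != i:
                wait[i] += 1
    return grants
```

With all Masters requesting every cycle this degenerates to round-robin, while a bursty requester cannot lock out the others, since accumulated waiting time eventually dominates, which bounds every Master's grant latency.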
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
Description: The convolutional neural network (CNN) has been widely adopted in computer vision tasks.
In the FPGA-based CNN accelerator design, Winograd convolution can effectively improve computation performance and save hardware resources.
However, building efficient and highly compatible IP for arbitrary Winograd convolution on FPGA remains underexplored.
To address this issue, we propose a novel and efficient reformulation of Winograd convolution, named Structured Direct Winograd Convolution (SDW).
We further develop WinoGen, a Chisel-based highly configurable Winograd convolution IP generator.
Given arbitrary input/output tile size and kernel size, it can generate optimized high-performance IP automatically.
Meanwhile, our generated IP is compatible with multiple kernel sizes and tile sizes.
Experimental results show that the IP generated by WinoGen achieves DSP efficiency up to 3.80 GOPS/DSP and energy efficiency up to 652.77 GOPS/W, showing 2.45 times and 3.10 times improvements when processing the same CNN model compared with state-of-the-art designs.
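The arithmetic trick behind Winograd convolution can be shown with the classic 1-D F(2,3) tile: two outputs of a 3-tap filter computed with four multiplications instead of six. This is the standard textbook transform, independent of WinoGen's generated hardware:

```python
def winograd_f23(d, g):
    """1-D Winograd F(2,3): two outputs of a 3-tap filter, 4 multiplies.

    d: input tile of 4 samples, g: 3-tap kernel.
    Equivalent to y0 = d0*g0 + d1*g1 + d2*g2, y1 = d1*g0 + d2*g1 + d3*g2.
    """
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]
```

In hardware, the kernel-side transforms are precomputed, so the multiply savings translate directly into fewer DSP blocks per output, which is the resource the abstract's GOPS/DSP figure measures.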
Exhibitor Forum
Description: Customization is now the way forward for increasing performance in electronic systems. By customizing the processor to the actual workload, you can gain massive improvements for power, performance, and area. Using the right tools, customization can be approached using a fast and easy iterative approach enabling rapid architecture exploration and automated RTL and SDK generation. But how can you keep control of the customizations made during the design process, and how can you ensure the design is easily verified once you have achieved the performance you need? The answer is in bounded customization. By adding custom instructions within set bounds, you can achieve a good balance of freedom and control. Because you will not need to re-verify the entire core, the verification process will be smooth. With bounded customization, there is no risk of dead silicon because the custom instructions cannot break the baseline core. By working with tools able to generate the customized RTL and SDK as well as a verification environment aiding the verification of the custom instructions, you gain the power to customize and the confidence to claim responsibility for the end result.
Workshop
Description: Contemporary microelectronic design is facing tremendous challenges in memory bandwidth, processing speed and power consumption. Although recent advances in monolithic design (e.g. near-memory and in-memory computing) help relieve some issues, the scaling trend is still lagging behind the ever-increasing demand of AI, HPC and other applications. In this context, technological innovations beyond a monolithic chip, such as 2.5D and 3D packaging at the macro and micro levels, are critical to enabling heterogeneous integration with various types of chiplets, bringing significant performance and cost benefits for future systems. Such a paradigm shift further drives new innovations in chiplet IPs, heterogeneous architectures and system mapping.
This workshop is designed to be a forum that is highly interactive, timely and informative, on the related topics:
● Roadmap and technology perspectives of heterogeneous integration
● IP definition for chiplets
● Signaling interfaces across chiplets
● Network topology for data movement
● Design solutions for power delivery
● Thermal management
● Testing in a heterogeneous system
● High-level synthesis for the chiplet system
● Architectural innovations
● Ecosystems of IPs and EDA tools
Proposed Format: The format of the workshop will consist of multiple invited presentations from industry, academia, and government funding agencies. We will also organize a panel for discussions.
Intended Audience: Industry and academic researchers, funding agencies, IP providers, EDA tool vendors, foundry
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: Most prior research in real-time schedulability analysis assumes real-time scheduling policies such as fixed priority or Earliest Deadline First (EDF). However, the Linux operating system, widely used in embedded systems, employs the Completely Fair Scheduler (CFS) by default. To ensure the safe execution of real-time applications, it is crucial to check that real-time constraints are satisfied. To our knowledge, we are the first to propose an analysis method to estimate the worst-case response time (WCRT) of tasks under the CFS, with formal proof of its reliability. To validate the proposed method, a CFS simulator is developed to simulate the scheduling behavior faithfully and efficiently. By comparison with the simulated results, we confirm that the proposed WCRT analysis method is efficient, and the percentage of false negatives resulting from overestimation is less than 7% on average in our experimental setup.
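The core CFS mechanism such an analysis must model, always running the task with the smallest virtual runtime, which advances inversely to task weight, can be sketched in a few lines. This is a toy model that ignores preemption granularity, sleeping tasks, and min_vruntime adjustments:

```python
import heapq

def simulate_cfs(tasks, quantum=1, horizon=20):
    """Toy CFS: each step, run the task with the smallest virtual runtime.

    tasks: {name: weight}; a higher weight makes vruntime advance more
    slowly, so that task receives proportionally more CPU time.
    """
    vrt = [(0.0, name) for name in tasks]
    heapq.heapify(vrt)
    schedule = []
    for _ in range(horizon):
        v, name = heapq.heappop(vrt)          # leftmost task in the rbtree
        schedule.append(name)
        # vruntime grows by quantum / weight while the task runs
        heapq.heappush(vrt, (v + quantum / tasks[name], name))
    return schedule
```

Even in this simplified form, one can see why WCRT under CFS is hard to bound analytically: a task's next grant depends on the whole history of vruntime values, not on a static priority order.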
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: Persistent memory has emerged as a groundbreaking solution for byte-addressable storage-class memory. While persistent memory is highly desirable due to its advantages in high capacity, performance, and cost-effectiveness, novel system architecture solutions are required for this memory to be adopted in data center applications that require high performance and reliability. Major challenges include issues with the persistent memory media, such as endurance, retention, and reliability, programming complexity to enable byte-addressable access and persistence, as well as hardware and software system integration within the data center platform. To address these challenges, we have developed a comprehensive system comprising a persistent memory controller and software layers that effectively mitigate persistent memory issues and are compatible with data center applications. In our FPGA implementation, we have successfully demonstrated the first CXL interface for byte-addressable persistent memory. Our approach has resulted in the creation of a resilient and high-performing persistent memory device that has been successfully tested in real-world scenarios, including in-memory databases and filesystem applications in our data center.
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: Resistive Random Access Memory (ReRAM) has emerged as a promising platform for deep neural networks (DNNs) due to its support for parallel in-situ matrix-vector multiplication. However, hardware failures, such as stuck-at-fault defects, can result in significant prediction errors during model inference. While additional crossbars can be used to address these failures, they come with storage overhead and are not efficient in terms of space, energy, and cost. In this paper, we propose a fault protection mechanism that incurs zero space cost. Our approach includes: 1) differentiable structure pruning of rows and columns to reduce model redundancy, 2) weight duplication and voting for robust output, and 3) embedding duplicated most significant bits (MSBs) into the model weight. We evaluate our method on nine tasks of the GLUE benchmark with the BERT model, and experimental results prove its effectiveness.
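The weight-duplication-and-voting step can be illustrated as a bitwise majority vote across copies of a weight: with an odd number of copies, a minority of stuck-at faults per bit position is outvoted. This is illustrative only, not the paper's exact mapping onto crossbars:

```python
def majority_vote(copies):
    """Bitwise majority vote across duplicated weight copies.

    copies: list of non-negative integer weights read from (possibly
    faulty) storage. Each output bit is the majority of that bit
    position across all copies.
    """
    width = max(c.bit_length() for c in copies)
    out = 0
    for b in range(width):
        ones = sum((c >> b) & 1 for c in copies)
        if ones * 2 > len(copies):   # strict majority of ones
            out |= 1 << b
    return out
```

For example, a single copy with a stuck-at-0 fault in its MSB is corrected as long as the other two copies agree on the true value.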
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
Description: Neural Volume Rendering (NVR), a novel paradigm for the long-standing problem of photo-realistic rendering of virtual worlds, has developed explosively in the past three years. The unique and substantial computational requirements of NVR pose challenges for deploying NVR on existing dedicated neural-network accelerators. In this work, we propose ZeroTetris, a spatial-feature-similarity-based sparse multilayer perceptron (MLP) hardware accelerator for NVR. By leveraging the unique similarity-based sparsity between adjacent sampling points in NVR models, ZeroTetris efficiently bypasses the computation of zero activations, thereby enhancing energy efficiency. Evaluation results affirm the effectiveness of the proposed design, showcasing ZeroTetris's superior performance in both area and power efficiency compared to other dedicated sparse matrix multiplication or MLP accelerator designs.
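The zero-activation bypass that such sparse-MLP accelerators exploit can be expressed in software by indexing the nonzero activations once and reusing that index for every output row. This is a functional sketch of the idea, not ZeroTetris's datapath:

```python
def sparse_matvec(weights, activations):
    """Matrix-vector product that skips all work for zero activations.

    weights: list of rows; activations: input vector. Only nonzero
    activation positions contribute, so the inner loop shrinks with
    the sparsity of the input.
    """
    nz = [(j, a) for j, a in enumerate(activations) if a != 0]
    return [sum(row[j] * a for j, a in nz) for row in weights]
```

In hardware the same index is shared across processing elements, so the multiply count scales with the number of nonzero activations rather than the layer width.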
Research Manuscript
AI
AI/ML Algorithms
Description: Optical neural networks (ONNs) have attracted great attention due to their low energy consumption and high-speed processing. The usual neural network training scheme leads to poor performance for ONNs because of their special parameterization and fabrication variations. This paper extends zeroth-order (ZO) optimization, which can be used to train such ONNs, in two ways. The first is to propose a linear-combination natural gradient, which mitigates the optimization difficulty caused by the special parameterization of an ONN. The second is to generate a guided direction vector by calibration, for better guessing than the random vectors generated in standard ZO optimization. Experimental results show that the two extensions significantly outperform the existing ZO optimization and related methods with little computational overhead.
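A baseline ZO gradient estimator, the kind of random-direction finite-difference scheme such work builds on, looks like the sketch below. It is illustrative only; the paper's linear-combination natural gradient and calibrated guided directions are refinements beyond this baseline:

```python
import random

def zo_gradient(f, x, mu=1e-3, samples=8):
    """Zeroth-order gradient estimate via random-direction differences.

    Only function evaluations of f are used (no backpropagation),
    which is what makes the scheme applicable to hardware like ONNs
    where analytic gradients are unavailable.
    """
    n = len(x)
    grad = [0.0] * n
    fx = f(x)
    for _ in range(samples):
        u = [random.gauss(0, 1) for _ in range(n)]       # random direction
        xp = [xi + mu * ui for xi, ui in zip(x, u)]
        delta = (f(xp) - fx) / mu                        # directional slope
        for i in range(n):
            grad[i] += delta * u[i] / samples            # project back
    return grad
```

The estimate is unbiased up to O(mu) but noisy, which is exactly why guided (rather than purely random) direction vectors can speed up training.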
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
Description: Compared with conventional SRAM, Spin-Transfer Torque Random Access Memory (STT-RAM) is expected to play a crucial role in future memory technologies with the increasing demands for higher storage density and lower power consumption for modern embedded systems. Moreover, Multi-Level Cell (MLC) STT-RAM outperforms Single-Level Cell (SLC) STT-RAM since it can store multiple bits per cell. However, MLC STT-RAM suffers from the occurrence of two-step state transitions (TTs) due to additional flipping of soft domains. Existing approaches mitigate this problem by reducing TTs with data coding. However, none of them can eliminate all the TTs. In this work, we propose a two-step transition avoidance scheme, referred to as zeroTT, for MLC STT-RAM. We show why the existing (2,3)-based coding methods cannot avoid TTs. Then, we refine the problem of expansion coding and present how to find zeroTT coding methods. Lastly, we propose an optimal (3,4)-based coding method considering the issues of space overhead and coding complexity. The experimental results demonstrate that zeroTT can completely avoid TTs, leading to a more efficient MLC STT-RAM in terms of latency, energy consumption, and lifetime.
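Under a simplified MLC write model, assumed here purely for illustration: flipping the hard domain drags the soft domain to the same value, so a second write step is needed exactly when the hard bit changes and the target soft bit differs from the new hard bit. One can then enumerate which cell-state transitions are two-step and check that restricting codewords to states with equal hard and soft bits avoids them, at the cost of capacity that expansion coding would then recover with extra cells:

```python
def two_step(src, dst):
    """Simplified (assumed) MLC STT-RAM write model: a cell holds
    (hard, soft); flipping the hard domain drags the soft domain to
    the same value, so a second step is needed when the hard bit
    changes and the target soft bit differs from the new hard bit."""
    (h1, s1), (h2, s2) = src, dst
    return h1 != h2 and s2 != h2

states = [(h, s) for h in (0, 1) for s in (0, 1)]
# every ordered pair of states that would require a two-step write
all_tts = [(a, b) for a in states for b in states if two_step(a, b)]
# restricted alphabet: only states whose soft bit equals the hard bit
safe = [(h, s) for (h, s) in states if h == s]
safe_tts = [(a, b) for a in safe for b in safe if two_step(a, b)]
```

In this toy model the restricted alphabet eliminates every two-step transition but halves per-cell capacity, which is the trade-off a (3,4)-style expansion code addresses by spreading data over more cells.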
Sessions
Research Manuscript
Embedded Systems
Embedded Memory and Storage Systems
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
Research Manuscript
EDA
Test, Validation and Silicon Lifecycle Management
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
Research Manuscript
AI
AI/ML Application and Infrastructure
Research Manuscript
EDA
Analog CAD, Simulation, Verification and Test
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
Research Manuscript
Design
Emerging Models of Computation
Research Manuscript
Autonomous Systems
Autonomous Systems (Automotive, Robotics, Drones)
Research Manuscript
AI
Design
AI/ML Architecture Design
Research Manuscript
AI
Design
AI/ML Architecture Design
Additional Meeting
DAC Early Career Workshop
Research Manuscript
AI
Design
AI/ML System and Platform Design
Research Manuscript
Embedded Systems
Embedded System Design Tools and Methodologies
Research Manuscript
AI
Design
AI/ML Architecture Design
Back-End Design
Back-End Design
Design
Engineering Tracks
Special Session (Research)
Design
Back-End Design
Back-End Design
Design
Engineering Tracks
Back-End Design
Back-End Design
Design
Engineering Tracks
Research Manuscript
Design
Design of Cyber-physical Systems and IoT
Research Manuscript
Embedded Systems
Time-Critical and Fault-Tolerant System Design
Research Manuscript
AI
Security
AI/ML Security/Privacy
Additional Meeting
HACK at DAC
Additional Meeting
HACK at DAC
Research Manuscript
Security
Hardware Security: Primitives, Architecture, Design & Test
Research Manuscript
Design
SoC, Heterogeneous, and Reconfigurable Architectures
Research Manuscript
EDA
RTL/Logic Level and High-level Synthesis
Additional Meeting
IEEE CEDA Distinguished Lecture Luncheon
Research Manuscript
AI
Design
AI/ML Architecture Design
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
Research Manuscript
Embedded Systems
Embedded Software
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Research Manuscript
Design
Emerging Models of Computation
Research Manuscript
EDA
Physical Design and Verification
Research Manuscript
AI
Design
AI/ML System and Platform Design
Additional Meeting
PhD Forum & University Demo
Research Manuscript
EDA
Timing and Power Analysis and Optimization
Research Manuscript
Design
Design for Manufacturability and Reliability
Research Manuscript
EDA
Timing and Power Analysis and Optimization
Special Session (Research)
AI
Design
Research Manuscript
EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
Research Manuscript
EDA
Physical Design and Verification
Front-End Design
Design
Engineering Tracks
Front-End Design
Research Manuscript
Design
In-memory and Near-memory Computing Circuits
Research Manuscript
AI
AI/ML Algorithms
Back-End Design
Back-End Design
Design
Engineering Tracks
Additional Meeting
TODAES Editorial Board Meeting
Research Manuscript
Design
AI/ML System and Platform Design
Engineering Track Poster
Back-End Design
Embedded Systems
Front-End Design
IP
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Research Manuscript
Design
Design for Manufacturability and Reliability
Research Manuscript
EDA
Design Verification and Validation
Work-in-Progress Poster
AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Research Manuscript
AI
Design
AI/ML, Digital, and Analog Circuits
Research Manuscript
Design
In-memory and Near-memory Computing Architectures, Applications and Systems
Additional Meeting
Young Fellows Closing Ceremony
Additional Meeting
Young Fellows Kick-Off and All-Day Activities
Additional Meeting
Young Fellows Posters