

Engineering Tracks
IP
Description: Current vehicle systems must process data from a wide variety of sensors, such as radars and cameras, at high speed and in real time; the Ethernet switch embedded in the vehicle system therefore needs to communicate at a high throughput of 50 Gbps. In next-generation vehicle systems, where autonomous driving technology becomes increasingly sophisticated, high-speed communication at 100 Gbps will be required. Furthermore, in-vehicle ECUs are required to consume even less power in order to prevent heat generation in the in-vehicle environment and optimize battery efficiency.
Conventionally, Ethernet switches have used the hash method for search processing in the switch processing block; it has low power consumption but limited throughput. The TCAM method is essential to achieve a high throughput of 100 Gbps, but it suffers from high power consumption. Architectural optimization of the search processing block is therefore also required to achieve high throughput with low power consumption.
We have realized an Ethernet switch that achieves high throughput with low power consumption by adopting a pipeline search method and a phase-shift search method on a TCAM basis. This Ethernet switch fulfills the requirements of next-generation autonomous driving cars.
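As background for the hash-versus-TCAM trade-off described above, the sketch below contrasts the two lookup styles in plain Python: a hash table resolves one exact key per probe, while a TCAM compares a key against all stored ternary patterns at once (emulated here with a priority-ordered scan; the table contents and names are made up for illustration, not taken from this work).

```python
# Exact-match hashing: one probe, exact keys only (low power, limited throughput).
def hash_lookup(table, key):
    return table.get(key)                  # e.g. {dst_mac: egress_port}

# TCAM-style ternary match: hardware compares the key against every stored
# pattern in parallel; 'X' bits are don't-cares. Emulated as a priority scan.
def tcam_lookup(entries, key):
    for pattern, action in entries:        # entries sorted by priority
        if all(p in ("X", k) for p, k in zip(pattern, key)):
            return action                  # first (highest-priority) hit wins
    return None

entries = [("10X1", "port2"), ("1XXX", "port7")]
assert tcam_lookup(entries, "1011") == "port2"
assert tcam_lookup(entries, "1100") == "port7"
```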
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: With the growing demand for heterogeneous chip interconnect, there is a dire need for a unified EDA design environment that effectively handles complex logical interconnects, physical layout design, and electrical, mechanical, and thermal simulations.
Intel's embedded multi-die interconnect bridge (EMIB) is an approach to in-package high-density interconnect of heterogeneous chips. With increasing demand from Intel's internal and IFS customer base, tools face bigger challenges in handling highly complex designs with tens of complex chiplets and their connectivity management, low-latency high-bump-count layout, and reliable interconnects that can be seamlessly simulated with EDA tools.
Intel's collaboration with Cadence on automating 2.5D design is a significant step toward making EMIB technology a more widely adopted and efficient solution for high-performance chip design. This will significantly benefit other companies and researchers working in this field.
Research Manuscript


EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
Description: Environmental sustainability is a critical concern for Integrated Circuits (ICs) throughout their entire life cycle, particularly in manufacturing and use. Meanwhile, ICs using 3D/2.5D integration technologies have emerged as promising solutions to meet the growing demands for computational power. However, there is a distinct lack of carbon modeling tools for 3D/2.5D ICs. Addressing this, we propose 3D-Carbon, an analytical carbon modeling tool designed to quantify the carbon emissions of 3D/2.5D ICs throughout their life cycle. 3D-Carbon factors in both potential savings and overheads from advanced integration technologies, considering practical deployment constraints like bandwidth. We validate 3D-Carbon's accuracy against established baselines and illustrate its utility through case studies in autonomous vehicles. We believe that 3D-Carbon lays the initial foundation for future innovations in developing environmentally sustainable 3D/2.5D ICs.
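To make the notion of life-cycle carbon accounting concrete, here is a minimal sketch of the kind of bookkeeping such a tool performs. All coefficients, names, and parameters are illustrative assumptions, not 3D-Carbon's actual model.

```python
# Embodied (manufacturing) carbon plus use-phase carbon for a 2.5D package.
def embodied_kgco2(die_area_cm2, yield_frac, kgco2_per_cm2):
    """Manufacturing carbon of one die, inflated by yield loss."""
    return die_area_cm2 * kgco2_per_cm2 / yield_frac

def lifecycle_kgco2(dies, packaging_kgco2, avg_power_w, lifetime_h, grid_kg_per_kwh):
    """Per-die embodied carbon + packaging/integration + use-phase emissions."""
    embodied = sum(embodied_kgco2(*d) for d in dies) + packaging_kgco2
    use_phase = avg_power_w / 1000.0 * lifetime_h * grid_kg_per_kwh
    return embodied + use_phase

# Two chiplets on different nodes: (area in cm^2, yield, kgCO2 per cm^2).
total = lifecycle_kgco2(dies=[(1.2, 0.85, 1.5), (0.8, 0.95, 0.9)],
                        packaging_kgco2=0.5, avg_power_w=15.0,
                        lifetime_h=5 * 8760, grid_kg_per_kwh=0.4)
print(f"{total:.1f} kgCO2e over the product lifetime")
```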
Research Panel


Design
Description: At the end of 2D scaling of Moore's law, 3D integrated circuits that take advantage of advanced packaging and heterogeneous integration offer many prospects for extending chip density scaling and system performance improvements over the next decade. Much of the 3DIC design activity in the industry today is done by different teams within the same chipmaker. 3DICs hold the potential not only to make chip architectures heterogeneous but also to highly diversify chiplet sourcing. Moreover, 3DICs themselves have a few avenues toward commercial success, ranging from truly disaggregated chiplets to sequential stacked processing. This presses us to answer a few key questions:
1. Technology:
a. How will heat dissipation be managed, and are new cooling techniques being pursued to mitigate the thermal challenge?
b. How to design the power delivery network from the board to the substrate to the multiple tiers of the 3D stack with minimal voltage drop and high power-conversion efficiency? How to design backside power delivery in leading-edge-node CMOS with 3D stacking?
c. How to ensure signal integrity, yield and reliability between multiple tiers of 3D stacking, and what testing and standardization efforts are needed to embrace the heterogeneous dies from different designers and foundries?
2. EDA flows and interoperability
a. Will the ecosystem extend the same standards-based interoperability of design tools, flows and methodologies to 3DIC, as enjoyed by monolithic system designers today?
b. How can the EDA industry help system designers in planning, managing, and tracking their complex 3DIC projects through implementation, analysis, and signoff?
3. Roadmap:
a. Is the roadmap to sequential monolithic stacked 3DIC an inevitability? What factors lead the industry to it?
b. What are the boundaries between monolithic 3D integration (with sequential processing at BEOL) and heterogeneous 3D integration (with die stacking or bonding)?
Are we as an industry able to apply lessons from past struggles with monolithic chip design and interoperability to this emerging challenge? This panel will discuss the need, the scope of solutions, and potential candidate efforts already in motion.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: 3DIC design can reduce the length of interconnections and secure gains in power and performance by using multiple dies stacked vertically.
However, design complexity increases, and more resources are required to modify the design compared to a single-die design.
In the early stages of design, we need to be able to prototype the design quickly and easily.
Early thermal analysis is an important key to determining the design floorplan, and a high correlation is required after the design is complete.
When we performed thermal analysis on the prototype design and the two designs after actual P&R was completed, we confirmed that they showed similar thermal maps and hot spots.
When we performed thermal analysis across the three power scenario steps, the largest error between the prototype and the real design was 8.34%, found near the chip boundary at 5.8 s.
We confirmed that the temperature difference was less than 10% and that the hot-spot trend was very similar.
Research Manuscript
4-Transistor Ternary Content Addressable Memory Cell Design using Stacked Hybrid IGZO/Si Transistors


Design
Emerging Models of Computation
Description: In this paper, we propose a 4T-based paired orthogonally stacked transistors for random access memory (POST-RAM) cell structure and also suggest ternary content addressable memory (TCAM) applications. POST-RAM cells feature vertically stacked read and write transistors, maximizing area efficiency by utilizing only two transistors' footprint.
POST-RAM employs InGaZnO (IGZO) channels for write transistors and single-crystal silicon channels for read transistors, which results in both extremely long memory retention and fast read performance. A comprehensive 3D-TCAD simulation is conducted to validate the procedural design of the proposed device structure. Furthermore, we introduce a self-clamped searching scheme (SC2S) designed to enhance the efficiency of TCAM operations. The results conclusively demonstrate that operating a TCAM based on the proposed POST-RAM architecture can lead to a 20% improvement in energy-delay product (EDP). Notably, the delay performance can be enhanced by up to 40% when compared to a 16T SRAM-based TCAM. Additionally, the proposed scheme enables a more than sixfold reduction in cell area, demonstrating an efficient use of space.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: In today's high-speed AMS designs, as processes shrink and design complexity increases, layout parasitics have become increasingly important, even more dominant than the devices themselves, and they strongly impact design performance. At the same time, as the magnitude of parasitics grows, it becomes harder to debug complex parasitic issues through traditional methods such as post-layout simulation: designers must spend more post-simulation and sign-off runtime, and the experience-based manual debugging and iteration needed to identify the real bottleneck can put the design schedule out of control.
To improve design efficiency, a "shift-left" parasitic analysis flow for AMS layout parasitics becomes necessary and important, helping designers identify parasitics-caused design problems earlier, more quickly, and more easily.
Before the sign-off stage, we first use ParagonX to perform quick parasitic analysis of R, C, RC delay, net matching, etc., early in the design stage, and debug the results by element, by layer, and by layout location to identify and optimize the real layout bottlenecks, reducing layout iterations from weeks to hours. Through this flow improvement, we make parasitic debugging and layout optimization easy and efficient, significantly improving design efficiency.
Research Manuscript


Embedded Systems
Time-Critical and Fault-Tolerant System Design
Description: Parallel real-time systems often rely on the shared cache for dependent data transmissions across cores. Conventional shared caches and their management techniques suffer from intensive contention and are markedly inflexible, leading to significant transmission latency for shared data. In this paper, we provide a Virtually-Indexed Physically-Tagged, Selectively-Inclusive Non-Exclusive L1.5 cache, offering way-level control and versatile sharing capabilities. Focusing on a commonly seen parallel task model, the Directed Acyclic Graph (DAG), we construct a novel scheduling method that exploits the L1.5 cache to reduce data transmission latency, achieving improved timing performance. As a systematic solution, we build a real system, from the SoC and ISA to the drivers and the programming model. Experiments show that the proposed solution significantly improves the real-time performance of DAG tasks with negligible hardware overhead.
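Since the abstract centers on scheduling DAG tasks across cores, here is a minimal critical-path list scheduler for a DAG. It is a generic textbook baseline, not the paper's L1.5-cache-aware method, and all names are illustrative.

```python
import heapq

def list_schedule(wcet, succ, n_cores):
    """Critical-path list scheduling of a DAG task onto n_cores.
    wcet: {node: execution time}; succ: {node: [dependent nodes]}."""
    pred_cnt = {t: 0 for t in wcet}
    for t in wcet:
        for s in succ.get(t, []):
            pred_cnt[s] += 1
    rank = {}                        # longest path to a sink: classic priority
    def up(t):
        if t not in rank:
            rank[t] = wcet[t] + max((up(s) for s in succ.get(t, [])), default=0)
        return rank[t]
    ready = [(-up(t), t) for t in wcet if pred_cnt[t] == 0]
    heapq.heapify(ready)
    core_free = [0.0] * n_cores      # next free time per core
    est = {t: 0.0 for t in wcet}     # earliest start (all inputs ready)
    while ready:
        _, t = heapq.heappop(ready)
        c = min(range(n_cores), key=core_free.__getitem__)
        finish = max(core_free[c], est[t]) + wcet[t]
        core_free[c] = finish
        for s in succ.get(t, []):
            est[s] = max(est[s], finish)
            pred_cnt[s] -= 1
            if pred_cnt[s] == 0:
                heapq.heappush(ready, (-up(s), s))
    return max(core_free)            # makespan

# Diamond DAG a -> {b, c} -> d on two cores: makespan 4.0
print(list_schedule({"a": 1, "b": 2, "c": 2, "d": 1},
                    {"a": ["b", "c"], "b": ["d"], "c": ["d"]}, 2))
```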
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: Traditionally, the budgeting of STA and IR drop limits was done separately, with each converging to its respective limit without much interaction. Recently, there have been attempts to incorporate IR drop into STA analysis for more informed timing signoff. However, the reverse, incorporating timing-critical paths into IR signoff, has not been as thoroughly investigated.
This work proposes a methodology for IR drop signoff with awareness of timing-critical paths. It utilizes the latest features of the Redhawk-SC EDA tool to incorporate timing analysis results into IR voltage drop signoff. This IR voltage drop data can subsequently be incorporated into an incremental timing analysis to pinpoint potential waivers for IR violations. Evaluation data from real design blocks in advanced nodes demonstrate that the methodology can improve design coverage and enhance silicon robustness and system performance.
Research Manuscript


Embedded Systems
Embedded Memory and Storage Systems
Description: The k-clique counting problem plays an important role in graph mining, which has seen a growing number of applications. However, current k-clique counting accelerators cannot meet the performance requirements, mainly because they struggle with the heavy data transfer incurred by intensive set intersection operations and with the inability to balance load. In this paper, we propose to solve this problem with a hybrid framework of content addressable memory (CAM) and processing-in-memory (PIM). Specifically, we first utilize CAM for binary induced-subgraph generation in order to reduce the search space; then we use PIM to implement in-place parallel k-clique counting through iterative Boolean "AND"-like operations. To take full advantage of this combined CAM and PIM framework, we develop dynamic task scheduling strategies that achieve near-optimal load balancing among the PIM arrays. Experimental results demonstrate that, compared with state-of-the-art CPU and GPU platforms, our approach achieves speedups of 167.5× and 28.8×, respectively. Meanwhile, energy efficiency is improved by 788.3× over the GPU baseline.
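For readers unfamiliar with the kernel being accelerated, the sketch below counts k-cliques in software via exactly the repeated neighbor-set intersections the abstract refers to. It is a plain CPU reference, not the CAM/PIM design.

```python
def count_k_cliques(adj, k):
    """Count k-cliques via repeated neighbor-set intersection.
    adj: {vertex: set(neighbors)} for an undirected graph."""
    order = {v: i for i, v in enumerate(adj)}   # fixed order avoids recounting
    def expand(cand, chosen):
        if chosen == k:
            return 1
        total = 0
        for v in cand:
            # candidates that extend the current clique: intersect with N(v),
            # keeping only later-ordered vertices
            later = {u for u in cand & adj[v] if order[u] > order[v]}
            total += expand(later, chosen + 1)
        return total
    return expand(set(adj), 0)

# K4 minus one edge: two triangles, no 4-clique.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
assert count_k_cliques(adj, 3) == 2 and count_k_cliques(adj, 4) == 0
```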
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: With interconnect spacing shrinking in advanced technology nodes, the precision of existing timing predictions worsens, as crosstalk-induced delay is hard to quantify. During the routing process, the crosstalk effect is usually modeled by predicting coupling capacitance with congestion information. However, the timing estimation is overly pessimistic, since the crosstalk-induced delay depends not only on the coupling capacitance but also on the signal arrival time. In this work, a crosstalk-aware timing estimation method is presented using a two-step machine learning approach. Interconnects that are physically adjacent and overlap in signal timing windows are filtered first. Second, crosstalk delay is predicted by integrating physical topology features and timing features, without the post-routing result or the parasitic extraction flow. Experimental results demonstrate that the match rate of identified crosstalk-critical nets is over 99% compared to a commercial tool. The delay prediction results are more accurate than other state-of-the-art methods.
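A minimal sketch of such a two-step flow, assuming hypothetical feature names and a stand-in regressor (the abstract does not disclose the model): step 1 filters aggressor/victim pairs by physical adjacency and timing-window overlap, and step 2 regresses crosstalk delay from combined physical and timing features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def windows_overlap(w1, w2):
    """True if two (start, end) switching windows overlap in time."""
    return w1[0] < w2[1] and w2[0] < w1[1]

def filter_pairs(nets):
    """Step 1: keep only physically adjacent net pairs with overlapping windows.
    nets: list of dicts with 'window' and 'neighbors' (adjacent net indices)."""
    return [(i, j) for i, n in enumerate(nets) for j in n["neighbors"]
            if windows_overlap(n["window"], nets[j]["window"])]

# Step 2: regress crosstalk delay from combined physical + timing features,
# e.g. rows of [coupling_length, spacing, victim_slew, aggressor_slew, overlap].
X, y = np.random.rand(200, 5), np.random.rand(200)   # stand-in training data
model = GradientBoostingRegressor().fit(X, y)
```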
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: Various custom cells are used in DRAM and NAND flash memories to optimize power, performance, and area. Liberty model characterization of custom cells becomes a time-consuming manual task when an automation tool is unable to extract the timing arcs and SPICE input decks, called the configuration for characterization in this paper, from them. The conventional approach is to enhance the tool's capabilities so that it can accommodate custom cells that were not previously taken into consideration. However, as the majority of cell types remain unchanged across projects, the configurations can be reused once manually crafted and verified. This study presents a data-driven approach that automates the Liberty model characterization process by mapping a cell to its corresponding configuration with a neural network. We employ graph neural networks (GNNs) to establish relationships between cell topologies and the configurations. We implement supervised classifiers based on widely used GNNs such as GCN, GraphSAGE, GAT, and GIN, and compare their classification accuracies and numbers of parameters. With GNNs, our method reached over 94% accuracy, while traditional rule-based methods using naming conventions or ad-hoc connectivity rules scored below 75%.
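To illustrate the idea of classifying a cell netlist graph into a configuration class, here is a toy message-passing embedding plus nearest-centroid classification in NumPy. It merely stands in for the GCN/GraphSAGE/GAT/GIN classifiers named above; every name and shape in it is an illustrative assumption.

```python
import numpy as np

def gnn_embed(adj, feats, weights):
    """Mean-neighbor aggregation layers (GCN-flavored), then a global mean
    pool, giving one embedding per cell-netlist graph.
    adj: (n, n) adjacency; feats: (n, f) node features (device type, pins...)."""
    h = feats
    deg = adj.sum(axis=1, keepdims=True) + 1e-9
    for w in weights:
        h = np.tanh(((adj @ h) / deg + h) @ w)   # aggregate neighbors + self
    return h.mean(axis=0)                        # graph-level read-out

def pick_configuration(embedding, centroids):
    """Map a cell to the configuration class whose centroid is nearest."""
    return min(centroids, key=lambda c: np.linalg.norm(embedding - centroids[c]))
```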
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: Nearly a decade ago, in July 2015, we released the 1st edition of our book "Formal Verification: An Essential Toolkit for Modern VLSI Design". This book was well received in the industry, being essentially the first practical modern guidebook on the topic of formal verification (FV) aimed at active engineers designing and validating RTL models, rather than theoretical researchers. However, we are part of a rapidly evolving field, and our notion of best practices for FV has undergone many changes in the years since the initial release. We have also gained a variety of different experiences: while all three authors had worked together at Intel when beginning the first edition, since then one author moved to academia, and another moved from Intel to EDA vendor Cadence. It is the gradual accumulation of these changes and varied new learnings that eventually motivated us to put out a heavily revised 2nd edition, released in June 2023. Since not every FV practitioner has purchased our 2nd edition, or has kept completely up to date with FV methodology at other companies in the industry, we think it will be useful to summarize some of the major areas in which FV practice has changed and improved in the years leading up to our 2nd edition. This information will help current designers, validators, and FV specialists improve their practices and enable them to better incorporate the industry's latest learnings.
Research Manuscript


AI
Design
AI/ML Architecture Design
Description: Deep learning, particularly deep neural networks (DNNs), has emerged as a powerful tool for addressing intricate real-world challenges. Nonetheless, the deployment of DNNs presents its own set of obstacles, chiefly stemming from substantial hardware demands. In response to this challenge, domain-specific accelerators (DSAs) have gained prominence as a means of executing DNNs, especially within cloud service providers offering DNN execution as a service. For service providers, managing multi-tenancy and ensuring high-quality service delivery, particularly in meeting stringent execution time constraints, assumes paramount importance, all while endeavoring to maintain cost-effectiveness. In this context, the utilization of heterogeneous multi-accelerator systems becomes increasingly relevant. This paper presents RELMAS, a low-overhead deep reinforcement learning algorithm designed for the real-time scheduling of DNNs in multi-tenant environments, taking into account the dataflow heterogeneity of accelerators and memory bandwidth contention. By doing so, service providers can employ the most efficient scheduling policy for user requests, optimizing Service-Level Agreement (SLA) satisfaction rates and enhancing hardware utilization. Applying RELMAS to a heterogeneous multi-accelerator system composed of various instances of Simba and Eyeriss sub-accelerators resulted in up to a 173% improvement in SLA satisfaction rate compared to state-of-the-art scheduling techniques across different workload scenarios, with less than a 1.5% energy overhead.
Front-End Design


Design
Engineering Tracks
Front-End Design
Description: In this talk, we present Cross Testbench (XTB), a distributed co-simulation environment that enables co-simulation across two simulation approaches: event-driven and cycle-based. Event-driven and cycle-based simulation are two commonly utilized verification approaches in the industry. The former takes into account delays and timings, is versatile, and works well with asynchronous systems, which makes it ideal for achieving highly accurate simulations; however, simulation speed depends on model size and activity, making it slower for large designs. Cycle-based simulation, in contrast, is faster, scales better, and supports hardware acceleration, but does not include timing information, which makes it more suitable for large designs such as server microprocessors. Each approach has distinct benefits, and leveraging both ensures reliable and precise verification while maintaining rapid execution and extensive test coverage. We leveraged XTB to achieve chip-level verification, allowing interplay between the parts of the design that had to be simulated with event simulation (such as vendor-delivered verification IPs for physical parts) and the rest, which utilized cycle simulation to achieve high throughput. We highlight the successful use of XTB to verify IBM's memory buffer chip, which integrates external IPs such as DDR5 and PCIe. In addition, we outline XTB's capability to save and restart a distributed co-simulation to significantly improve performance in a production environment.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: The state-of-the-art method for oracle synthesis in quantum computing is based on logic networks, where each node corresponds to an output or an intermediate state requiring uncomputation cleanup. The order in which we compute and uncompute these nodes, sometimes referred to as the reversible pebble game, is a key factor influencing the number of qubits and the circuit length in the final result. In this paper, we introduce a novel pebbling strategy based on divide-and-conquer that aims at reducing the number of qubits while maintaining a reasonable circuit length. Our results show that our algorithm beats the previous heuristic method in both number of qubits and circuit length, showing potential for tackling large-scale oracle synthesis problems.
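For intuition, below is the classic Bennett-style divide-and-conquer schedule for the special case of a linear dependency chain: it trades extra compute/uncompute steps for O(log n) live qubits. This is textbook background, not the paper's algorithm, which targets general logic networks.

```python
def pebble(lo, hi, ops):
    """Leave node hi-1 computed; all intermediates in [lo, hi-1) end uncomputed."""
    if hi - lo == 1:
        ops.append(("compute", lo)); return
    mid = (lo + hi) // 2
    pebble(lo, mid, ops)     # compute node mid-1
    pebble(mid, hi, ops)     # compute node hi-1 on top of it
    unpebble(lo, mid, ops)   # uncompute node mid-1 to free its qubit

def unpebble(lo, hi, ops):
    """Exact reverse: remove the pebble on node hi-1."""
    if hi - lo == 1:
        ops.append(("uncompute", lo)); return
    mid = (lo + hi) // 2
    pebble(lo, mid, ops)     # re-compute the value node hi-1 depends on
    unpebble(mid, hi, ops)
    unpebble(lo, mid, ops)

ops = []
pebble(0, 8, ops)            # 8-node chain: O(log n) live qubits, 27 gate steps
print(len(ops), ops[:4])
```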
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: Transformer-based language models have demonstrated tremendous accuracy on multiple natural language processing (NLP) tasks. Transformers use self-attention, in which matrix multiplication is the dominant computation. Moreover, their large size makes data movement a latency and energy-efficiency bottleneck in conventional von Neumann systems. Processing-in-memory architectures, with compute elements in the memory, have been proposed to address this bottleneck. This paper presents PACT-3D, a PIM architecture with novel computing units interfaced with DRAM banks that perform the required computations, achieving a 1.7× reduction in latency and an 18.7× improvement in energy efficiency over the state-of-the-art PIM architecture.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: In recent years, in-RRAM computing (IRC) has emerged as a promising technique for deep neural network (DNN) applications. Combined with proper pruning techniques, the cost and energy of DNN computation can be further reduced. However, IRC often suffers from various non-ideal effects in RRAM arrays, such as the sneak path and IR drop, which greatly affect computation accuracy. Therefore, accurate error injection is required for verification at an early design stage. Conventional random-disturbance and equation-based approaches do not consider the data allocation issue, which may incur larger errors for the sparse matrices generated by data pruning techniques. In this paper, a fast and accurate IR-drop model is proposed to reflect the data-dependent effects, which is able to offer accurate error injection in the DNN training phase with sparse matrices. As shown in the experimental results, the proposed model matches HSPICE results well even when the data allocation becomes non-uniform. With the proposed simple model, the accuracy degradation of real NN applications can be well observed, even for large RRAM arrays.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: In analog or mixed-signal in-memory computing (IMC) applications, the focus is typically on the bit cell, particularly during the inference period. However, to transmit multiplication-and-accumulation (MAC) results to subsequent layers, IMC macros must convert analog signals into the digital domain using analog-to-digital converters (ADCs), often the most power- and area-intensive components in IMC systems. Addressing this, we present an efficient training/inference algorithm tailored for specific IMC applications, introducing an ADC-less IMC macro design suitable for practical memory systems. This novel architecture eliminates the need for power-intensive ADCs, opting for reconfigurable conventional memory structures with sense amplifiers, like DRAM or SRAM arrays. This study introduces an algorithm that integrates sense amplifiers into both the training and inference processes without extra hardware additions.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: Circuit knitting emerges as a promising technique to overcome the limited number of physical qubits in near-term quantum hardware by cutting large quantum circuits into smaller subcircuits. Recent research in this area has been primarily oriented toward reducing subcircuit sampling overhead. Unfortunately, these works neglect hardware information during circuit cutting, posing significant challenges to the follow-on stages. In fact, direct compilation and execution of these partitioned subcircuits yields low-fidelity results, highlighting the need for a more holistic optimization strategy.
In this work, we propose a hardware-aware framework aiming to advance the practicality of circuit knitting. In contrast to prior methodologies, the presented framework designs a cutting scheme that concurrently optimizes the number of gate cuts and SWAP insertions during circuit cutting. In particular, we leverage the graph similarity between the qubit interaction graph and the chip layout as a heuristic guide to reduce potential SWAPs in the subsequent qubit routing step. Building on this, the circuit knitting framework we developed can reduce total subcircuit depth by up to 64% (48% on average) compared to the state-of-the-art approach, and enhance relative fidelity by up to 2.7×.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: As the technology node shrinks, routing in memory devices is becoming a challenging problem. Advanced commercial routing solutions have been introduced to deal with more complex design rules and fewer routing resources; however, routing results are still far from satisfactory. Complex routing patterns from those solutions do not meet customers' specific expectations and instead make it more difficult for engineers to modify them manually. In this paper we explore whether a simpler approach, a heuristic-based routing methodology, can be a better option for improving routability. Our methodology simplifies the entire routing process into two stages, global routing and local routing, and a heuristic-based algorithm is applied in each stage. With our routing methodology, we achieved a routing success rate higher by 43% on average, with 13% less routing-resource usage and 68% fewer DRC errors on average.
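The abstract does not detail its heuristics, so as generic background here is a Lee-style BFS maze router of the kind often used in a local-routing stage: the shortest grid path between two pins around blockages. The grid, pins, and encoding (0 = free, 1 = blocked) are illustrative.

```python
from collections import deque

def maze_route(grid, src, dst):
    """BFS (Lee algorithm) over a routing grid; returns the shortest path
    from src to dst as a list of (row, col) cells, or None if unroutable."""
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}               # also serves as the visited set
    q = deque([src])
    while q:
        r, c = q.popleft()
        if (r, c) == dst:            # backtrace from target to source
            path, cur = [], dst
            while cur:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    return None                      # no route with current resources

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(maze_route(grid, (0, 0), (2, 0)))   # routes around the blockage
```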
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: Wireless baseband processing (WBP) is a key element of wireless communications, with a series of signal processing modules to improve data throughput and counter channel fading. Conventional hardware solutions, such as digital signal processors (DSPs) and, more recently, graphics processing units (GPUs), provide various degrees of parallelism, yet they both fail to take into account the cyclical and consecutive character of WBP. Furthermore, the large amount of data in WBP cannot be processed quickly in symmetric multiprocessors (SMPs) due to the unpredictability of memory latency. To address this issue, we propose a hierarchical dataflow-driven architecture to accelerate WBP. A pack-and-ship approach is presented under a non-uniform memory access (NUMA) architecture to allow the subordinate tiles to operate in a bundled access-and-execute manner. We also propose a multi-level dataflow model and the related scheduling scheme to manage and allocate the heterogeneous hardware resources. Experimental results demonstrate that our prototype achieves 2× and 2.3× speedups in terms of normalized throughput and single-tile clock cycles compared with GPU and DSP counterparts on several critical WBP benchmarks. Additionally, a link-level throughput of 288 Mbps can be achieved with a 45-core configuration.
Research Manuscript


EDA
Design Methodologies for System-on-Chip and 3D/2.5D System-in-Package
Description: 3D ICs promise increased logic density and reduced routing congestion over conventional monolithic 2D ICs.
High-level synthesis (HLS) tools promise reduced design complexity by approaching the design from a higher abstraction level, allowing more optimization flexibility.
We propose improving timing closure of 3D ICs by co-designing the architecture and the physical design, integrating HLS and 3D IC macro placement into the same holistic loop.
On average, our method reduces estimated total negative slack (TNS) by 62% and 92% compared to a traditional binding and placement technique for 2D and 3D ICs, respectively.
Research Manuscript


Design
Quantum Computing
Description: Ising model-based computers have recently emerged as high-performance solvers for combinatorial optimization problems (COPs). For the Ising model, a simulated bifurcation (SB) algorithm searches for the solution by solving pairs of differential equations. The SB machine benefits from massive parallelism but suffers from high energy consumption. Dynamic stochastic computing implements accumulation-based operations efficiently. This article proposes a high-performance stochastic SB machine (SSBM) for solving COPs with efficient hardware. To this end, we develop a stochastic SB (sSB) algorithm such that the multiply-and-accumulate (MAC) operation is converted to multiplexing and addition, while the numerical integration is implemented using signed stochastic integrators (SSIs). Specifically, sSB stochastically ternarizes the position values used for the MAC operation. A stochastic computing SB cell (SC-SBC) is constructed using two SSIs for area efficiency. Additionally, a binary-stochastic computing SB cell (BSC-SBC) uses one binary integrator and one SSI to achieve reduced delay. Based on sSB, an SSBM is then built using the SC-SBC or BSC-SBC as the basic building block. Designs and syntheses of two SSBMs with 2000 fully connected spins require at least 1.13× smaller area than state-of-the-art designs.
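For orientation, here is a compact floating-point sketch of a ballistic-SB-style update loop for an Ising instance with energy E = -(1/2) x^T J x. It shows the pair of position/momentum integrations the abstract refers to, not the stochastic-computing hardware formulation, and the constants are illustrative.

```python
import numpy as np

def simulated_bifurcation(J, steps=2000, dt=0.01, c0=None):
    """Ballistic-SB-flavored solver: integrate positions x and momenta y
    while the pump amplitude a(t) ramps up; clamp |x| <= 1 (inelastic walls)."""
    n = J.shape[0]
    rng = np.random.default_rng(1)
    x = 0.01 * rng.standard_normal(n)      # positions
    y = np.zeros(n)                        # momenta
    a0 = 1.0
    if c0 is None:
        c0 = 0.5 / (np.sqrt(n) * np.abs(J).mean() + 1e-12)
    for k in range(steps):
        a = a0 * k / steps                 # pump amplitude ramps 0 -> a0
        y += (-(a0 - a) * x + c0 * (J @ x)) * dt   # the MAC the paper targets
        x += a0 * y * dt
        wall = np.abs(x) > 1               # inelastic walls
        x[wall] = np.sign(x[wall])
        y[wall] = 0.0
    return np.sign(x)                      # spin assignment

J = np.array([[0, 1, -1], [1, 0, 1], [-1, 1, 0]], float)
spins = simulated_bifurcation(J)
print(spins, "energy:", -0.5 * spins @ J @ spins)
```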
Research Manuscript


Security
Hardware Security: Primitives, Architecture, Design & Test
Description: Fully homomorphic encryption (FHE) enables unlimited computation depth, allowing privacy-enhanced neural network inference tasks directly on ciphertext. However, existing FHE architectures suffer from a memory access bottleneck due to significant data consumption. This work proposes a high-throughput FHE engine for private inference (PI) based on 3D stacked memory (H3). H3 adopts software-hardware co-design that dynamically adjusts the polynomial decomposition during the PI process to minimize computation and storage overhead at a fine granularity. With 3D hybrid bonding, H3 integrates a logic die with a multi-layer embedded DRAM, routing data efficiently to the processing unit array through an efficient broadcast mechanism. H3 consumes 192 mm² of area when implemented in a 28nm logic process. H3 achieves a throughput of 1.36 million LeNet-5 or 920 ResNet-20 PIs per minute, surpassing existing 7nm accelerators by 52%. This demonstrates that 3D memory is a promising technology for advancing the performance of FHE.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: This paper presents a high-throughput, energy-efficient, and constant-time in-SRAM Advanced Encryption Standard (AES) engine. The proposed in-memory AES ensures high-throughput operation by exploiting column-wise single instruction, multiple data (SIMD) processing of compact round functions for both electronic-codebook (ECB) and counter (CTR) modes of operation. Moreover, we propose a processor-assisted key loading strategy and a prudent memory management scheme to minimize the memory footprint needed for AES, improving the peak operating frequency and energy efficiency of the underlying compute-SRAM hardware. The bit-serial processing further guarantees constant-time execution of AES, providing strong resistance to side-channel timing attacks. Experimental results show that our proposed AES ECB design achieves 2.4× (149×) throughput, 2.4× (270×) throughput per area, and 2.3× (7.7×) per-block energy improvements compared to state-of-the-art non-constant-time (constant-time) designs, respectively. The resulting AES CTR mode design achieves a 1.9× per-block energy improvement compared to state-of-the-art reconfigurable IMC AES CTR designs.
Research Manuscript


AI
AI/ML Application and Infrastructure
Description: As deep learning empowers various fields, many new operators have been proposed to improve the accuracy of deep learning models. Researchers often use imperative programming paradigms (e.g., PyTorch) to express these new operators, leaving the fusion optimization of these operators to deep learning compilers. Unfortunately, the inherent side effects introduced by imperative tensor programs, especially tensor-level mutations, often make optimization extremely difficult. We present a holistic functionalization approach (TensorSSA) to optimizing imperative tensor programs beyond control flow boundaries. We achieve a 1.79× speedup (1.34× on average) on representative deep learning tasks over state-of-the-art works.
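To see what functionalization means here, the toy below rewrites a tensor-level mutation into SSA-style pure updates by hand, using a NumPy copy-based update and an explicit merge in place of control-flow-dependent state. It illustrates the concept only and is not TensorSSA's actual transformation.

```python
import numpy as np

# Before: the in-place write is a side effect on the argument, which blocks
# many compiler optimizations across control-flow boundaries.
def step_inplace(x, flag):
    if flag:
        x[0] += 1                  # tensor-level mutation (side effect)
    return x.sum()

# After (hand-written SSA analogue): each "mutation" yields a fresh value,
# and control flow merges values instead of mutated states (phi-node style).
def step_ssa(x, flag):
    x1 = np.concatenate(([x[0] + 1], x[1:]))   # functional update of x[0]
    x2 = x1 if flag else x                     # merge point
    return x2.sum()

x = np.arange(4.0)
assert step_ssa(x, True) == 7.0 and (x == np.arange(4.0)).all()  # x untouched
```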
Research Manuscript


Autonomous Systems
Autonomous Systems (Automotive, Robotics, Drones)
Description: In this paper, we introduce DLAPID, a novel decoupled parallel hardware-software co-design architecture for real-time video dehazing. From a software point of view, DLAPID isolates the atmospheric light operation from the initial transmission estimation to take full advantage of the hardware accelerators' parallelization features. For the hardware implementation, we deploy DLAPID on both FPGA and GPU platforms and validate its effectiveness. Using both real-world driving-scenario test sets and ground-truth datasets, we quantitatively and qualitatively assess the proposed method against several state-of-the-art (SOTA) video dehazing models. The outcomes of our experiments demonstrate that our approach achieves better dehazing performance with lower power consumption and has real-time processing capability, thereby helping prevent potential accidents in autonomous vehicles.
Back-End Design


Back-End Design
Design
Engineering Tracks
Description: Backside Power Delivery Network (BSPDN) is a Design Technology Co-Optimization (DTCO) method aimed at sustaining Moore's law. It achieves this by relocating the Power Delivery Network (PDN) inside the silicon, transitioning from the front side to the back side, thereby freeing up routing resources for improved signal routing. The improvement in IR drop compared to the traditional Frontside Power Delivery Network (FSPDN) is also noteworthy. Traditional IR drop analysis takes months, spanning PDK release, P&R, and IR analysis. In this paper, we propose a methodology to estimate the IR drop enhancement of BSPDN at an early stage. We initially model the PDN using a simplified resistance and current model. Based on this simplified model, we derive a formula to calculate the IR drop; the formula is applicable to both BSPDN and FSPDN. Utilizing this method allows us to estimate IR drop before the actual place-and-route (P&R) tasks are completed, thereby speeding up the DTCO iteration. To demonstrate the correlation, we implement a real design and analyze IR drop using Electronic Design Automation (EDA) tools. The results indicate that this methodology is effective in estimating IR drop before the design is implemented, thereby benefiting DTCO.
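The authors' formula is not given in the abstract, but the flavor of such simplified PDN models is easy to show: for a 1-D rail where N cells each draw current i through per-segment resistance r, the worst-case drop at the far end is dV = i * r * N(N+1)/2, because segment k carries the summed current of all cells at or beyond it. The sketch below checks that closed form by direct accumulation; all parameters are illustrative.

```python
def ladder_ir_drop(n_cells, i_cell, r_seg):
    """Far-end IR drop of a 1-D power rail with uniform tap currents."""
    drop, downstream = 0.0, n_cells * i_cell
    for _ in range(n_cells):
        drop += downstream * r_seg     # this segment carries all downstream current
        downstream -= i_cell
    return drop

n, i, r = 100, 1e-5, 0.05              # 100 cells, 10 uA each, 50 mohm/segment
assert abs(ladder_ir_drop(n, i, r) - i * r * n * (n + 1) / 2) < 1e-12
print(f"{ladder_ir_drop(n, i, r) * 1e3:.3f} mV worst-case drop")
```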
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: In this paper we propose an easy module-bind-based automation for AXI protocol violation checking and performance extraction from any AXI-3 based bus. The proposed automation infrastructure reduces the manual effort, time, and human error involved in extracting performance indices. It also flags any AXI protocol violations in the design. The major capabilities of the infrastructure include reporting of AXI protocol violations, per-transaction latency, bytes transferred, average latency, peak latency, total accumulated latency, average outstanding transactions, number of address requests, number of data requests, and net bandwidth. The infrastructure also generates an independent RTL-hierarchical performance summary log with the previously mentioned parameters, which enables the user to get the performance information without any waveform. The infrastructure was tested on various AXI-3 masters with different address, data, and ID widths, which resulted in a reduction in design verification time and higher confidence in the quality of the design. Producing a performance and protocol-check report is effortless using this infrastructure, with very minimal input. The infrastructure, being parameterized and bind-based, exhibits significant reusability, whether at the SoC or IP level.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: The Liquid State Machine (LSM), a spiking neural network model, has shown superiority in various applications due to its inherent spatiotemporal information processing and low training complexity. Traditional hyperparameter optimization methodologies for LSMs usually focus on the single criterion of accuracy while ignoring the trade-off among accuracy, parameter size, and hardware overhead (e.g., power consumption) when deployed on neuromorphic processors, which hinders LSM applications in resource-restricted scenarios (e.g., embedded systems). Thus, co-considering the performance of LSM algorithms and hardware constraints is critical for real-world applications, and it still requires further exploration. This work treats the optimization of LSMs as a Multi-objective Optimization Problem (MOP) and proposes a general hardware-aware multi-objective optimization framework. In light of the vast design space and time-consuming function evaluations of spiking neural networks, a decomposition-based Multi-objective Optimization Algorithm (MOA) aimed at computationally expensive problems, MOTPE/D, is proposed in this framework. Experiments are conducted on two typical case studies, N-MNIST classification and DVS-128 classification, and support that the proposed framework outperforms peer solutions in terms of different performance indicators. This work is open-sourced for reproducibility and further study and can be accessed at: https://anonymous.4open.science/r/MOTPE-D.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: Graph Neural Networks (GNNs) demand extensive fine-grained memory access, which leads to inefficient use of bandwidth resources. This issue is more serious when dealing with large-scale graph training tasks. Near-data processing emerges as a promising solution for data-intensive computation tasks; however, existing GNN acceleration architectures do not integrate the near-data processing approach. To address this gap, we conduct a comprehensive analysis of GNN operation characteristics, taking into consideration the requirements for accelerating aggregation and combination processes. In this paper, we introduce a near-data processing architecture tailored for GNN acceleration, named NDPGNN. NDPGNN offers different operational modes, catering to the acceleration needs of various GNN frameworks, while ensuring system configurability and scalability. In comparison to previous approaches, NDPGNN brings a 5.68× improvement in system performance while reducing energy consumption overhead by 8.49×.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: In the fast-paced semiconductor world, rapid time-to-market is crucial. Traditional SoC development, waiting for fully developed IPs, hinders speed and competitiveness. This presentation introduces the concept of preliminary IP CAD views, generated as soon as IP specifications are defined. This allows SoC developers to start design (flow setup and cleanup) and provide feedback earlier, significantly reducing overall cycle time. We propose an optimized approach for generating these preliminary views, achieving up to 40% faster runtime and minimizing delays caused by human intervention. This streamlined technique allows faster iterations and feedback, increasing development speed and competitiveness in a competitive industry.
DAC Pavilion Panel


Design
DAC Pavilion
Description: RISC-V and a growing open-source ecosystem have moved from hype to reality. Consequently, the semiconductor industry is at an inflection point as architectural paradigms require early power and performance metrics, creating demand for new design, verification, and validation technologies and methodologies.
Engineers now have the ability to design a specific, rather than generic, open-source instruction set, easily customizable to an application in a vertical market. It is an era where RISC-V design starts are not just starts but are used in volume production.
The status quo has been upended, and with it comes the challenge of a new open-source ecosystem versus the trust of a traditional, well-established, and rich ecosystem. The new open-source instruction set and software lack legacy, experience, and domain knowledge sharing, particularly the usage and experience of software validation.
It could also become an exciting era for design verification as it becomes the chief enabler for the new ecosystem and architecture, especially hardware-assisted verification, which can serve as a risk-mitigation tool.
A panel of design and verification users and experts, all of whom have studied the open-source ecosystem and its requirements and deficiencies, will be part of the DAC Pavilion Panel. DAC attendees are invited to listen in as they discuss where emphasis should be placed for the next-generation design verification flow. Audience participation will be encouraged.
Back-End Design


Back-End Design
Design
Engineering Tracks
Description: 5G downlink datapath designs contain repeated structures, the same design instantiated multiple times, which makes it very difficult to identify and place the macros in a way that is optimal for routability and performance. Traditional macro placement, however, has been a very manual and iterative endeavor for these and all types of complex designs, where the number of macros has grown dramatically, the sizes vary widely, and the interconnectivity between them is more intricate.
In this paper, we set out to test whether an AI-driven P&R macro placement capability could mimic the QoR (floorplan quality and design metrics) achieved by the expert engineers on this design, but in a fraction of the time, lessening the burden of manually placing the macros and running the full-flow iterations required by our traditional flow.
In addition, we investigated the benefits of the feature's Bayesian optimization flow for design exploration on the same block, analyzing whether generating various floorplans, each of which alone could meet the required metrics, could yield an optimal solution unique to the needs of the design. By providing comparison results at post-placement, the designers could then choose which option to push through the full P&R flow, reducing the total number of iterations and the overall turnaround time.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
Description: RF circuit analyses such as periodic AC (PAC) and periodic noise (PNoise) simulation are computationally very demanding, especially when the number of frequency points is large. In this paper, we propose a new iterative method with Krylov-subspace recycling for large-scale PAC and PNoise analysis, which can reuse the Krylov subspace generated during the solutions at previous frequencies to accelerate the convergence of the iterative solution at subsequent frequencies. In particular, we derive the recycling method based on the GMRES formulation, which is more efficient and robust than the previous recycling method based on the GCR formulation. In addition, we study the effect of the frequency sweeping order on the total number of iterations in the subspace recycling process. Numerical results show that the proposed method achieves a speedup of 4.7×-21.2× compared to non-recycling GMRES and up to a 24.5% improvement compared to the traditional recycling GCR.
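True subspace recycling carries the Krylov basis itself from one solve to the next; the SciPy sketch below shows only the simpler warm-start cousin of that idea on a frequency sweep, reusing each frequency's solution as the initial guess for the next. The system matrices are random stand-ins, not a real circuit.

```python
import numpy as np
from scipy.sparse import eye, random as sprandom
from scipy.sparse.linalg import gmres

n = 400
# Stand-ins for conductance (G) and capacitance (C) matrices of a circuit.
G = (sprandom(n, n, density=0.02, random_state=0) + 10 * eye(n)).tocsc()
C = (sprandom(n, n, density=0.02, random_state=1) + eye(n)).tocsc()
b = np.ones(n, dtype=complex)          # excitation vector

x0 = None
for f in np.logspace(6, 9, 16):        # sweep 1 MHz .. 1 GHz
    A = (G + 2j * np.pi * f * C).tocsc()   # frequency-domain system matrix
    iters = [0]
    x, info = gmres(A, b, x0=x0,
                    callback=lambda rk: iters.__setitem__(0, iters[0] + 1))
    x0 = x                             # warm-start the next frequency
    print(f"f = {f:9.3e} Hz: info={info}, {iters[0]} GMRES iterations")
```

Sweeping frequencies in order (rather than randomly) keeps consecutive systems similar, which is also why the sweeping order matters for the full recycling method.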
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: Arm has always been exploring cloud advantages and is quite motivated to become fully cloud-enabled. On the cloud, spot instances have always been the cost-effective option, but not many EDA tools can leverage this advantage. Spot instances offer a cost-effective solution by taking advantage of unused cloud resources. Arm has already adopted spot instances for small/short workloads like APL characterizations. The runtime of Redhawk-SC EMIR runs is high, leading to higher susceptibility to failure due to the extended durations and larger resource requirements. The goal is for RHSC to harness this capability for large workloads, providing a viable option to optimize the user's cloud expenses.
The new DataLake feature offers a more cost-effective solution by dividing workers into two categories. Execution workers are launched on spot instances and are responsible solely for the execution of jobs; with micro-resiliency, the eviction of spot instances is handled gracefully. DataLake workers, on the other hand, are launched on reserved instances to ensure reliability, since they are file servers. By dividing workers into these two categories and leveraging the capabilities of reserved and spot instances, this approach enables a highly scalable and cost-efficient system with robust micro-resiliency.
DataLake runs on aarch64 machines completed successfully in spite of spot-instance evictions. A cost reduction is seen on DataLake runs compared to reserved instances, with minimal impact on runtime and no change in QoR.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
Description: One of the critical requirements for any embedded application is FuSA (Functional Safety), because it is essential that all embedded devices function correctly and safely under any fault or failure scenario. When it comes to automotive, per the ISO 26262 standard, any failure, be it systematic or random, needs to be addressed during development itself.
This paper focuses on two safety strategies widely used in automotive designs (TMR: Triple Modular Redundancy, and DCLS: Dual Core Lock Step) and shows how, using the new USF (Unified Safety Format), these safety mechanisms can be implemented with minimal user effort and reduced runtime.
Earlier, both strategies were implemented using custom-coded scripts, and the user had to manually create bounds for the safety Main and Shadow modules. Also, a TMR solution with a single voter cell was not supported.
With USF support, the TMR conversion can be achieved using a single voter cell, and effective physical separation can be achieved for the Main and Shadow modules.
This paper will also highlight the runtime gain with the new USF-based approach.
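For context, the function a single TMR voter cell implements is a bitwise 2-of-3 majority vote; a behavioral sketch of that voting behavior (ours, not the USF flow itself):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote, the function a single TMR voter cell
    implements: the output follows any two agreeing replicas."""
    return (a & b) | (b & c) | (a & c)

# A single-event upset in one replica is masked by the other two:
golden = 0b1011
assert tmr_vote(golden, golden, golden ^ 0b0100) == golden  # one flipped bit
print(bin(tmr_vote(0b1011, 0b1011, 0b1111)))  # -> 0b1011
```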
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPrior to product market launch, it is critical to have a cost-effective post-silicon validation program. Currently, post-silicon validation requires tremendous resources to constantly stress-test silicon by running a suite of internal and external tools across a cluster of systems. This effort involves a high number of stress-test cases and consumes thousands of stress hours. However, the question remains: are the parts really being stressed by running those tests? How thorough is stress coverage across the silicon? Does the probability of identifying a bug increase with higher stress, and what about lower stress? The answers to these questions can teach us how to create and improve an effective validation stress-test plan. This paper describes a novel approach to extracting the stress map from a stress tool, applying the stress map to correlate with a stress-induced failure (bug), and assessing stress coverage across the entire validation test plan. It also discusses how the current validation stress-test plan can be improved using lessons from previous stress-induced failure studies.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe design verification (DV) phase in the chip production lifecycle is a crucial and time-consuming process.
For any new IP/SoC, creating a DV environment from scratch is a time-consuming, repetitive, and cumbersome task, usually requiring 2-3 weeks of effort from a DV engineer. Furthermore, if third-party VIPs are required, figuring out the needed configurations takes additional time and sometimes leads to unnecessary debug sessions.
This paper presents a newly developed automation flow that takes user input on the required BFMs (in-house or third-party) and generates a testbench with all the UVM components in it: the TB top, UVM env, UVM agents, scoreboard, etc. The BFMs are picked from a common location that can be made accessible to all DV work areas and are instantiated with known config sets, so the user only needs to give minimal information to the automation flow.
As of today, this flow has been implemented for IP DV; it can be extended to SoC DV in the future.
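The flow's internals are not published; as a toy illustration of template-driven testbench generation from a user's BFM list (the templates and names below are invented), consider:

```python
# Toy generator: emit a UVM environment skeleton from a list of BFM names.
# The real flow, per the abstract, also wires in scoreboards and known
# config sets from a shared BFM repository; everything below is illustrative.
AGENT_TMPL = """class {name}_agent extends uvm_agent;
  `uvm_component_utils({name}_agent)
  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction
endclass
"""

def generate_env(bfms):
    parts = [AGENT_TMPL.format(name=b) for b in bfms]
    handles = "\n".join(f"  {b}_agent m_{b}_agent;" for b in bfms)
    builds = "\n".join(
        f'    m_{b}_agent = {b}_agent::type_id::create("m_{b}_agent", this);'
        for b in bfms)
    parts.append(f"""class tb_env extends uvm_env;
  `uvm_component_utils(tb_env)
{handles}
  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction
  function void build_phase(uvm_phase phase);
    super.build_phase(phase);
{builds}
  endfunction
endclass
""")
    return "\n".join(parts)

print(generate_env(["apb_master", "axi_slave"]))
```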
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionFormal verification is widely applied at the IP level: FPV and its apps (linting, register checks, coverage) are heavily used, and IPs are often signed off with formal alone. Our aim is to use FPV at the SoC level.
Our top-level verification tasks:
1. IP integration:
Check that all the IPs are correctly connected on the bus and accessible by the masters.
2. IP operation:
Check that all the IPs are functionally working in SoC.
3. System behavior:
Check that the main application is working.
The tests are usually developed in C code and executed by a CPU in a UVM test bench.
The paper focuses on step 1: the idea is to use Formal Property Verification to prove the IP integration. An internally developed Python utility generates specific SVA assertions from a simple SoC-description Excel file. It produces read-write properties that check the accessibility of the peripheral registers and memory spaces from the CPU bus master (a sketch of such a generator follows the bug list below).
This approach verifies the SoC integration early in the flow, with no UVM; the bugs commonly discovered are:
- Wrong memory map
- Wrong data bus connection
- IP clock and/or reset stuck-at
- Wrong peripheral's reset value
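As an illustration of what such a utility might emit, here is a minimal, hypothetical sketch: a spreadsheet row (modeled as a dict) is turned into a read-write SVA accessibility property. The bus signal names and the property shape are our own assumptions, not the internal tool's output.

```python
# Minimal sketch: turn a (spreadsheet-like) memory-map row into a read-write
# SVA accessibility property. Field names and the property shape are
# illustrative; the real utility reads an Excel SoC description.
MEMORY_MAP = [
    {"ip": "uart0", "base": 0x4000_0000, "reg": "CTRL",   "offset": 0x0},
    {"ip": "spi1",  "base": 0x4001_0000, "reg": "STATUS", "offset": 0x4},
]

PROP_TMPL = """property p_{ip}_{reg}_rw;
  @(posedge clk) disable iff (!rst_n)
  (bus_write && bus_addr == 32'h{addr:08X}) |-> ##[1:4]
  (bus_read && bus_addr == 32'h{addr:08X} && bus_rdata == $past(bus_wdata));
endproperty
assert property (p_{ip}_{reg}_rw);
"""

def gen_rw_assertions(memory_map):
    return "\n".join(
        PROP_TMPL.format(ip=row["ip"], reg=row["reg"],
                         addr=row["base"] + row["offset"])
        for row in memory_map)

print(gen_rw_assertions(MEMORY_MAP))
```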
Front-End Design


Design
Engineering Tracks
Front-End Design
DescriptionCurrently, formal verification techniques struggle with the verification of system-level behavior: only a handful of properties converge with state-of-the-art SMT solvers. Moreover, current state-of-the-art frameworks do not address three formal verification aspects as designs scale beyond the component level: consistency (whether the design is over-constrained), completeness (whether the set of properties considered is exhaustive), and correctness (whether the properties describe the correct behavior). Our proposed approach to system-level verification addresses all of these concerns. Our experiments show that all properties either converge with a better bound or reach higher bounds than legacy techniques, which gives us confidence that our solution works well for subsystem-level design verification. As we submit this work, experiments are ongoing to check the viability of the solution as designs are scaled up further.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAs semiconductor manufacturing technology has advanced rapidly, conventional approaches cannot classify new wafer defect patterns without retraining. To overcome this, our study proposes an image-matching-based search algorithm to analyse wafer defect patterns. The proposed algorithm finds correlations between wafer defect patterns by determining the feature-based similarity between Wafer Bin Maps (WBMs). In addition, we propose a new metric, the Match of Defects (MoD) score, to perform robust searching by considering the size and location of defect patterns. Experimental results show that our method is effective on the industrial WBM datasets WM811K and MixedWM38.
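The MoD formula itself is not given in the abstract; as a hedged illustration of what a size- and location-aware match score between binary wafer maps could look like, consider this sketch (the formula is our own stand-in, not the paper's metric):

```python
import numpy as np

def mod_like_score(wbm_a, wbm_b):
    """Hypothetical stand-in for a size- and location-aware match score
    between two binary wafer bin maps (1 = failing die). The paper's
    actual MoD formula is not given in the abstract."""
    a, b = wbm_a.astype(bool), wbm_b.astype(bool)
    union = np.logical_or(a, b).sum()
    location = np.logical_and(a, b).sum() / union if union else 1.0  # IoU
    big = max(a.sum(), b.sum())
    size = min(a.sum(), b.sum()) / big if big else 1.0  # size-ratio penalty
    return location * size

yy, xx = np.mgrid[-16:16, -16:16]
edge_ring = (np.hypot(yy, xx) > 13).astype(int)  # edge-ring defect pattern
centre = (np.hypot(yy, xx) < 5).astype(int)      # centre-cluster pattern
print(mod_like_score(edge_ring, edge_ring))  # 1.0: identical patterns
print(mod_like_score(edge_ring, centre))     # 0.0: disjoint defects
```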
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionLLE (Local Layout Effect) refers to the mutual influence of adjacent layout elements in semiconductor designs. When the characteristics of standard cells are measured, LLE context assumptions are stored in the design kit to be used for block-level analysis. To minimize the LLE impact on designs, conventional library characterization relies on fixed overlay patterns that assume worst- or best-case context based on multiple experiments. But the actual context can differ from the characterized context, and those situations introduce uncertainty skew on the clock path, causing pessimism and optimism in the design. The proposed characterization and modeling method resolves the gap between the actual context and the design kit caused by the fixed-overlay-pattern assumption. It removes redundant pessimism and optimism in cell delay modeling, achieving PPA improvement and a higher sign-off frequency.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionTo achieve the highest power savings, it is desirable to make modifications as early as possible in the design cycle, which requires RTL power optimization flows. One of the major challenges with RTL power optimization is the lack of an ecosystem to validate the power impact of the changes. To capture the power saving of a modification, a new waveform must be generated, requiring re-simulation. In most cases the simulation setup is available at the SoC level, so re-simulating the modified RTL becomes a resource- and time-consuming process.
Back-End Design


Back-End Design
Design
Engineering Tracks
DescriptionAnalog/mixed-signal IPs and products have a large number of custom bus routes that have been routed manually to meet many requirements (various width/space/layer choices for matching, IR drop, EM, noise, etc.). TAT keeps increasing due to the complexity of advanced-node DRC, growing product sizes, design changes, and the lack of automated solutions. In this paper, we analyze the challenges of custom bus automation and propose a new custom bus routing solution that enables fast, high-quality generation of a large number of varied bus routes by copying user-defined reference wire information and applying segmented combinations of pre-defined bus options. The proposed solution was developed in collaboration between SLSI and Cadence and achieved a 63% TAT reduction in a pilot test.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAnnealing processors have attracted attention as domain-specific computers that solve combinatorial optimization problems (COPs) efficiently. Furthermore, their performance can be enhanced by the merge method, which enables updating multiple variables simultaneously. However, directly implementing the merge method on an annealing processor requires large-scale computational and memory resources.
In this paper, we propose a parallel-trial double-update annealing (PDA) algorithm that integrates the merge method into the annealing computation flow. Its processor can be realized with a minor extension to the existing near-memory architecture. Simulation results for several COPs demonstrate that PDA finds higher-quality solutions than the conventional annealing algorithm.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWe propose PRADA, a practical DRAM-based analog PIM architecture. Unlike existing proposals, PRADA does not change the cell area to implement the NOT operation; instead, it introduces two states in the bitline sense amplifier to implement NOT without additional circuitry. We also introduce sequential row activation to enhance throughput without modifying the row decoder. Compared to state-of-the-art analog PIM architectures, PRADA demonstrates 2.67-4.79x higher throughput for 8-bit integer multiply. For vector-ADD, PRADA achieves 3.09-3.13x speedups over the baseline, which compares favorably to the other architectures' 1.04-2.07x speedups, while maintaining superior compatibility and reliability.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHewlett Packard Labs has been researching high-speed, low-power dense wavelength division multiplexing (DWDM) Silicon Photonics (SiPh) systems for post-exascale high-performance computing. We propose a process/voltage/temperature (PVT) variation analysis for SiPh designs leveraging an electronic-photonic co-design engine. In particular, the electronics' corner extremes distort the signal integrity of the SiPh link in the voltage and time domains, so we exploit novel adjustable tuning techniques in the electronic transceiver to improve system performance.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSolving the Boolean Matching Problem (BMP) is one of the fundamental tasks in EDA: it allows matching components from a technology library for functional equivalence against portions of a digital design. Checking the equivalence of two Boolean functions under negation-permutation-negation requires the exploration of a super-exponential number of possible negations and permutations of input and output bits. Current solutions address the BMP via approximate methods, which still have a more-than-exponential worst-case time complexity.
In this work, we propose a quantum solver for the BMP achieving an exponential speedup in the exploration of the input negations, and devise a quantum sorting network to perform custom input permutations at runtime. We provide a fully detailed quantum circuit implementing our proposal, showing its costs in terms of the number of qubits and quantum gates.
We experimentally validated our solution both with a quantum circuit simulator and a physical quantum computer, a Rigetti ASPEN-M-2, employing the ISCAS benchmark suite, a de-facto standard for classical EDA.
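To make the search space concrete, below is a brute-force classical check of NPN equivalence for small truth tables (our own baseline sketch, enumerating the n!·2^n·2 input/output transforms that the quantum solver is designed to explore more efficiently):

```python
from itertools import permutations, product

def npn_equivalent(f, g, n):
    """Brute-force NPN check for n-input Boolean functions given as truth
    tables (tuples of length 2**n, indexed by the input bits as an integer).
    Cost: n! permutations x 2**n input negations x 2 output negations."""
    for perm in permutations(range(n)):            # input permutations
        for neg in product((0, 1), repeat=n):      # input negations
            for out_neg in (0, 1):                 # output negation
                ok = True
                for x in range(2 ** n):
                    bits = [(x >> i) & 1 for i in range(n)]
                    # permute and negate the inputs
                    y_bits = [bits[perm[i]] ^ neg[i] for i in range(n)]
                    y = sum(b << i for i, b in enumerate(y_bits))
                    if f[x] != g[y] ^ out_neg:
                        ok = False
                        break
                if ok:
                    return True
    return False

AND = (0, 0, 0, 1)   # x1 & x0
NOR = (1, 0, 0, 0)   # ~(x1 | x0)
XOR = (0, 1, 1, 0)
print(npn_equivalent(AND, NOR, 2))  # True: AND(a,b) == NOR(~a,~b)
print(npn_equivalent(AND, XOR, 2))  # False: different NPN classes
```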
Research Manuscript


Embedded Systems
Time-Critical and Fault-Tolerant System Design
DescriptionMultimodal transformers excel in various applications but face great challenges, such as high memory consumption and limited data reuse, that hinder real-time performance. To address these issues, we propose a processing-in-memory (PIM)-GPU collaboration-oriented compiler that optimizes the acceleration of multimodal transformers. The PIM-GPU synergy adapts well to multimodal transformers and improves execution time through dynamic programming algorithms. In addition, we introduce a tailored PIM allocation algorithm for variable-length inputs to further increase efficiency. Experimental results show an average end-to-end speedup of 15x.
Research Manuscript


Embedded Systems
Embedded Memory and Storage Systems
DescriptionVirtual reality (VR) wearable devices can achieve immersive entertainment by fusing multi-modal tasks from various senses. However, constrained by the short battery life and limited hardware resources of VR devices, it is difficult to run multiple tasks with different modalities simultaneously. To address these issues, we propose an energy-efficient accelerator that supports multi-modal tasks for VR devices, namely MTVR. We present a multi-task computing solution based on a flexible multi-task computing core design and an efficient computing unit allocation strategy, which together achieve efficient execution of multi-modal tasks. We have designed an early-exit detector to skip invalid calculations, which greatly saves energy. In addition, a fine-grained tiny-value skip method at the multiplier and adder levels is proposed to save energy further. We provide a hybrid RRAM and SRAM memory access scheme, reducing external memory access (EMA). Through experimental evaluation, the multi-task computing core achieves an average computational utilization of 95%. When the invalid input ratio is 90%, the energy saving brought by the early-exit detector reaches 88%. The tiny-value skip method achieves a further 13% energy saving. The hybrid memory access scheme obtains a 98.9% EMA reduction. We deployed the MTVR accelerator on an FPGA and self-designed RRAM, achieving an energy efficiency of 3.6 TOPS/W, higher than other single-task accelerators.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe semiconductor industry faces a significantly higher portion of third-party IP, and the number of Control and Status Registers (CSRs) can now grow to 5M+. Hardware/software interfaces (HSIs) are critical, yet users write and maintain homegrown scripts and solutions and spend significant manual effort to generate accurate designs from many different forms of definition, such as IP-XACT, SystemRDL, and spreadsheets.
We will introduce a unified single-source approach to CSR development that automates the generation of all outputs for hardware and software interface implementation, eliminates time-consuming and error-prone manual scripting and editing of design data, and provides a scalable infrastructure that promotes a rapid, highly iterative design environment and scales to the most complex designs.
The CSRSpec domain-specific language specifies all aspects of the HSI and generates RTL, firmware headers, verification class instances, documentation outputs, register behavior, and address map hierarchy description. It provides a broad set of configurations and behaviors with over 200 unique properties and 6,000 register behavior combinations. The resulting methodology is repeatable, scalable, and supports legacy data reuse while supporting industry standards. Our examples show a significant reduction of manually maintained CSR specifications, reduced source code copy-paste errors and coherency problems, and eliminated file coherency issues.
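To make the single-source idea concrete, here is a toy sketch in which one register description drives two of the outputs mentioned above (a C firmware header and an RTL reset block). The schema and templates are invented for illustration; CSRSpec's actual syntax is not shown in the abstract.

```python
# Toy single-source CSR spec (hypothetical schema, not CSRSpec syntax):
CSRS = [
    {"name": "CTRL",   "offset": 0x00, "reset": 0x0000_0001, "sw": "rw"},
    {"name": "STATUS", "offset": 0x04, "reset": 0x0000_0000, "sw": "ro"},
]

def c_header(block, csrs):
    """Emit firmware #defines for each register from the shared spec."""
    lines = ["/* generated -- do not edit */"]
    for r in csrs:
        lines.append(f"#define {block}_{r['name']}_OFFSET 0x{r['offset']:02X}u")
        lines.append(f"#define {block}_{r['name']}_RESET  0x{r['reset']:08X}u")
    return "\n".join(lines)

def rtl_reset_block(csrs):
    """Emit the matching RTL reset assignments from the same spec."""
    assigns = "\n".join(
        f"    {r['name'].lower()}_q <= 32'h{r['reset']:08X};" for r in csrs)
    return f"  if (!rst_n) begin\n{assigns}\n  end"

print(c_header("UART", CSRS))
print(rtl_reset_block(CSRS))
```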
Research Manuscript


AI
Design
AI/ML Architecture Design
DescriptionVolume imaging (3D models with internal structure) is widely applied in various areas, such as medical diagnosis and archaeology. During the COVID-19 pandemic in particular, there was great demand for lung CT. However, it is quite time-consuming to generate a 3D model by reconstructing the internal structure of an object. To make things worse, due to the poor data locality of the reconstruction algorithm, researchers have been pessimistic about accelerating it with ASICs. Besides the locality issue, we find that complex synchronization is also a major obstacle for 3D reconstruction. To overcome these problems, we propose a holistic solution using software-hardware co-design. We first provide a unified programming model to cover various 3D reconstruction tasks. Then, we redesign the dataflow of the reconstruction algorithm to improve data locality. In addition, we remove unnecessary synchronizations by carefully analyzing the data dependencies. After that, we propose a novel near-memory acceleration architecture, called Waffle, for further improvement. Experimental results show that Waffle in a package can achieve a 3.51×∼3.96× speedup over a cluster of 10 GPUs with 9.35×∼10.97× energy efficiency.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs the proportion of memory in designs increases, the MMB (Multi-Memory Bus) interface is widely used in HPC cores for memory test. It is a bus predefined in the functional RTL that provides access to multiple memory arrays with no need for memory wrappers, so applying the MMB interface reduces test area, timing impact, and routing congestion. However, the MMB interface brings a challenge: the memories behind one MMB interface can only be tested serially, which increases test time, test cost, and time-to-market.
In this paper, we propose solutions to the above challenges.
The memory subgroups of one MMB interface are tested in parallel, and the outputs of every two adjacent subgroups are compared in-situ. To ensure the accuracy of the compare results, the output data of one subgroup is also fed into the processor for comparison.
The repair logic is also shared between the parallel test subgroups, and a common repair solution is applied across the test groups.
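A toy behavioral sketch of the comparison scheme (our own illustration; subgroup counts and data values are invented) shows how pairwise in-situ comparison plus a single processor-checked reference can localize a failing subgroup:

```python
def locate_failing_subgroups(outputs, golden):
    """Toy model of the parallel MMB test scheme: outputs[i] is the test
    response of memory subgroup i. Adjacent pairs are compared in-situ;
    only subgroup 0's data is shipped to the processor and checked against
    the expected (golden) response."""
    adjacent_ok = [outputs[i] == outputs[i + 1]
                   for i in range(len(outputs) - 1)]
    reference_ok = outputs[0] == golden
    return adjacent_ok, reference_ok

# Subgroup 2 returns corrupted data; both comparisons around it fail,
# which localizes the defect without streaming every subgroup out.
outs = [0xA5A5, 0xA5A5, 0xDEAD, 0xA5A5]
print(locate_failing_subgroups(outs, golden=0xA5A5))
# -> ([True, False, False], True)
```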
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionDomain-specific systems, consisting of custom hardware accelerators, improve the performance of a specific set of applications compared to general-purpose processing systems. These hardware accelerators are generated using high-level synthesis (HLS) tools. The HLS tools often ignore the challenges of implementing a complex system of parallel accelerators, particularly regarding the way accelerators access memory. Our work proposes a buffering system design that improves accelerators' memory accesses by intelligently employing burst transactions to prefetch useful data from external memory to on-chip local buffers. Our design is dynamic, parametric, and transparent to the accelerators generated by HLS tools. We derive the parameters using appropriate compiler-based analysis passes and memory channel latency constraints. The proposed buffering system design results in, on average, 8.8x performance improvements while lowering memory channel utilization on average by 53.2% for a set of PolyBench kernels.
Back-End Design


Back-End Design
Design
Engineering Tracks
DescriptionAs wafer cost continues to increase at a rapid pace, there is a growing demand to convert more of our 2D SoCs into 3D System-in-Package designs. Furthermore, as individual IPs get larger and more complex, we see a need to disaggregate these designs along arbitrary boundaries, or "cutlines", rather than along standard fabric interfaces as has been done in the past. This results in large numbers of high-speed ad hoc interfaces on the die boundaries and creates a need for cross-die optimization techniques. Silicon architects and floorplanners need robust and intuitive methods to rapidly create and assess different configurations in the early planning phase of the design, so that they can deliver the best mix of performance, power, area, and cost for the product. This paper presents these construction and analysis techniques on two different designs: a low-power crypto core that explores several cutlines, and a high-speed compute module that explores different bump pitch and floorplan options. We present exhaustive studies and KPIs that can support cutline decisions, including 2D/3D PPA comparison, 3D IR/thermal plots, 2D vs. 3D QoR (e.g., buffer/inverter count and routing length), D2D bump-to-flop distance monitoring, D2D timing path analysis, and 2D vs. 3D metal layer usage.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper presents the results of tests evaluating the quality of 4G and 5G signals in Brazil, aimed mainly at cities that build wind farms in mountainous regions, a typical scenario in Brazil. The results obtained between November 2021 and October 2023 show an 835% increase in signal coverage. However, this does not imply quality or an end to oscillation; on the contrary, we found and listed four serious problems: failures in closed environments such as hospitals; loss of signal on roads and highways; slowness in settings with a large circulation of people and vehicles; and, as a consequence, impacts on applications that use two-factor authentication and on banking and credit card applications.
Analyst Presentation


DAC Pavilion
DescriptionWe will examine the financial performance and key business metrics of the EDA industry through 2023, as well as the material technical and market trends and requirements that have influenced EDA business performance and strategies. Among the trends, we will again examine the progression of semiconductor R&D spending and how the market value of the publicly held EDA companies has evolved. Lastly, we will provide our updated financial projections for the EDA industry for 2024 through 2026.
Research Manuscript


AI
AI/ML Application and Infrastructure
DescriptionOne of the primary challenges impeding the progress of Neural Architecture Search (NAS) is its extensive reliance on exorbitant computational resources. NAS benchmarks aim to simulate runs of NAS experiments at zero cost, removing the need for extensive compute. However, existing NAS benchmarks use synthetic datasets and model proxies that make simplified assumptions about the characteristics of these datasets and models, leading to unrealistic evaluations. We present a technique for searching for training proxies that reduce the cost of benchmark construction by significant margins, making it possible to construct realistic NAS benchmarks for large-scale datasets. Using this technique, we construct an open-source bi-objective NAS benchmark for the ImageNet2012 dataset combined with the on-device performance of accelerators, including GPUs, TPUs, and FPGAs. Through extensive experimentation with various NAS optimizers and hardware platforms, we show that the benchmark is accurate and allows searching for state-of-the-art hardware-aware models at zero cost.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe Keysight ADS RF board design automation tool has significantly improved the efficiency of Bill of Materials simulation, bringing the work needed for an engineer to validate a typical RF board down from 14 days to 1.5 days. The approach and tooling are used by a major smartphone developer. Furthermore, the tooling has been built so it can be leveraged to benefit more RF module and RFIC customers.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe amount of data generated in 2025 is estimated to be 181 zettabytes (181,000,000,000,000,000,000,000 bytes). To accommodate this, the size of data centers keeps expanding, putting different servers of the same data center several miles away from each other. Optical fibers become a necessity between servers, and this is where Silicon Photonics comes into play. With only about 15 years of learning ("All-silicon active and passive guided-wave components for λ = 1.3 and 1.6 µm": https://ieeexplore.ieee.org/document/1073057), Silicon Photonics doesn't have as much legacy information as CMOS (~75 years: https://en.wikipedia.org/wiki/History_of_the_transistor). We can't afford to wait another 50 years, so how do we accelerate this learning pace?
To face this challenge, we will discuss strategies such as anticipating design constraints based on FMEA analysis to accelerate the design timeline, design compaction to support higher packaging density, minimizing wafer scrap, and improving wafer yield.
This presentation will discuss our research approach, the hurdles we encountered and how we handled them, as well as the current limits and our future steps.
FMEA: Failure Mode and Effects Analysis
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith the semiconductor industry's push to newer process nodes and shorter time-to-market, analog and custom IC layout creation is turning out to be the bottleneck, as it has historically been a highly manual process. Since analog IPs often stay the same across nodes, the ability to automatically recreate the designs can reduce costly iterations and help designs converge faster.
When design methodology requirements vary across process nodes, layout porting based on mapping of objects and scaling of sizes and coordinates fails to produce high-quality, design-rule-correct layout. Our approach of auto-inferring design intent from the source layout and driving automated layout creation in the target node solves the layout migration challenge with upwards of a 2X boost in productivity.
The schematics on the target node are generated by mapping devices and parameters from the source schematic and optimizing them for the target node using customizable machine learning (ML)-based engines. Schematic-driven layout generates node and design-specific grids to ensure DRC-correct placement and routing, while the migration functionality seeds the target layout with relative placement information from the source layout including device groups, captured as scalable templates, that take updated parameters and instance counts into account. Incremental placer legalizes the placement followed by guard ring and fill cell generation that are specific to target process node. In the last step, routing topology information from the source layout is used to generate routing in the target layout to help meet electrical and parasitic requirements through a combination of automation and migration. The final LVS and DRC-clean layout on the target node is generated in a significantly shorter time compared to manual creation, boosted by the use of existing layout footprint and patterns.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionBalancing accuracy and hardware efficiency remains a challenge for traditional pruning methods. N:M sparsity is a recent approach offering a compromise, allowing up to N non-zero weights in each group of M consecutive weights.
However, N:M pruning enforces a uniform sparsity level of N/M across all layers, which does not align well with the sparse nature of deep neural networks (DNNs). To achieve a more flexible sparsity pattern and a higher overall sparsity level, we present JointNF, a novel joint N:M and structured pruning algorithm that enables fine-grained structured pruning with adaptive sparsity levels across DNN layers. Moreover, we show for the first time that N:M pruning can also be applied to the input activations for further performance enhancement.
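As a reference point for the uniform baseline that JointNF relaxes, here is a minimal numpy sketch of N:M magnitude pruning (keep the N largest-magnitude weights in each group of M; the group size is assumed to divide the tensor size):

```python
import numpy as np

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m consecutive
    weights (uniform N:M sparsity). Assumes weights.size % m == 0."""
    w = weights.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (w * mask).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(8)
print(w.round(2))
print(nm_prune(w).round(2))  # exactly 2 non-zeros per group of 4
```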
Research Manuscript


AI
AI/ML Application and Infrastructure
DescriptionDesign-Technology Co-Optimization (DTCO) can be significantly accelerated by employing Neural Compact Models (NCMs). However, the effective deployment of NCMs requires a substantial amount of training data for accurate device modeling. This paper introduces an Active Learning (AL) framework designed to enhance the efficiency of both device modeling and process optimization, particularly addressing the challenges of time-intensive Technology Computer-Aided Design (TCAD) simulations. The framework employs a ranking algorithm that assesses metrics such as the expected variance from the neural tangent kernel (NTK), TCAD simulation time, and the complexity of I-V curves. This strategy considerably reduces the number of required simulations while maintaining high accuracy. Demonstrating the effectiveness of our AL framework, we achieved a 28.5% improvement in MSE within a 30-minute time budget for device modeling, and an 86.7% reduction in the data points required for process optimization of a 51-stage ring oscillator (RO). These results offer a streamlined, adaptable solution for rapid device modeling and process optimization in various DTCO applications.
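The abstract names the ranking ingredients but not the formula; below is a hedged sketch of one plausible acquisition rule (uncertainty per unit of simulation cost, up-weighted by curve complexity — our assumption, not the paper's metric):

```python
import numpy as np

def rank_candidates(exp_variance, sim_time, curve_complexity, k=3):
    """Hypothetical acquisition rule: prefer points with high model
    uncertainty per unit of TCAD simulation cost, up-weighting devices
    with more complex I-V curves. The paper's exact metric is not shown."""
    score = exp_variance * curve_complexity / sim_time
    return np.argsort(score)[::-1][:k]  # indices of the top-k candidates

var = np.array([0.9, 0.2, 0.5, 0.8])    # e.g., NTK-based expected variance
time = np.array([30.0, 5.0, 10.0, 120.0])  # minutes of TCAD per point
cplx = np.array([1.0, 1.0, 2.0, 1.5])
print(rank_candidates(var, time, cplx))  # -> [2 1 0], best value per minute
```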
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAs a small step toward a general-purpose CIM paradigm, in this paper we propose a heterogeneous-workload-centric compute-in-memory (HWCCIM) architecture. In particular, we present a design that compiles essential algorithmic operations into an address table for in-memory computing circuits. Leveraging a reconfigurable address generation unit to guide data movement within different in-memory-computing-based operator arrays, the architecture completes calculations and produces the corresponding results. We further illustrate the construction of the HWCCIM architecture in a behavioral-level circuit model and evaluate it using two classical algorithms, the Fast Fourier Transform (FFT) and the Multilayer Perceptron (MLP). Compared to conventional approaches, HWCCIM achieves a maximum latency acceleration of 1.5x and an average latency acceleration of 1.3x.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionTime-to-market is a crucial factor in today's competitive chip design landscape. Accurate timing and power analysis are essential for successful tapeout, demanding fast and precise Liberty characterization data (.libs). Traditional methods, heavily reliant on SPICE simulations, are often time-consuming and resource-intensive. This presentation investigates the application of AI to revolutionize library characterization in two different chip design scenarios.
Scenario 1 leverages ML to analyze existing PVT data and build accurate models for timing, power, and noise across various Liberty formats (NLDM, CCS, CCSN, and LVF). This dramatically reduces characterization time for new PVT additions, offering up to 100x runtime savings. Importantly, the generated .libs maintain high accuracy, with deviations from SPICE simulations within 5% for timing and 10% for leakage power and internal power energy.
Scenario 2 optimizes the characterization flow by identifying a critical subset of .libs from existing libraries and generating the remaining .libs within a target accuracy range. This significantly reduces the need for recharacterization, saving over 50% of time and resources during SPICE model updates or minor design changes.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionSparse LU factorization is an indispensable building block of circuit simulation and dominates simulation time, especially for large-scale circuits. RF circuits have been increasingly emphasized with the evolution of ubiquitous wireless communication (e.g., 5G and WiFi). RF simulation matrices show a distinctive pattern of structured dense blocks, a pattern that prior works have inadvertently overlooked, leading to underutilization of computational resources. In this paper, by exploiting the block structure, we propose a novel blocked format for the L and U factors and redesign large-scale sparse LU factorization accordingly, leveraging the data locality inherent in RF matrices. The data format transformation is streamlined, strategically eliminating redundant data movement and costly indirect memory accesses. Moreover, vector operations are converted into matrix operations, enabling efficient data reuse and enhancing data-level parallelism. Experimental results show that our method achieves performance superior to state-of-the-art implementations.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRange-join-based variant annotation is an essential stage in genomic big-data analysis, often requiring complex conditional joins against databases spanning terabytes. However, its performance on multi-threaded CPUs/GPUs has been bottlenecked by both memory-access bandwidth and instruction/data dependencies. Furthermore, the massive data accesses involved in range joins for variant annotation drastically affect energy efficiency and pose serious challenges to commercial adoption of fast-evolving genomic big-data analysis. In this work, we present an efficient hardware-software co-design for range-join-based variant annotation on clusters of HBM-enabled FPGAs. Our highly scalable in-memory processing system achieves up to 1.98x/6.51x/38.1x speedup/energy improvement/memory-access reduction compared to a state-of-the-art CPU solution, while being highly extensible to other big-data applications of range join.
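As a point of reference for the operation being accelerated (not the paper's FPGA design), here is a minimal single-threaded interval-join sketch with a toy annotation database:

```python
from bisect import bisect_right

def annotate(variants, db):
    """Toy range join: for each variant position, find all (start, end, label)
    database intervals containing it. Sorting plus a max-interval-length
    bound keeps the leftward scan short; production systems join terabytes
    of such intervals."""
    db = sorted(db)
    starts = [s for s, _, _ in db]
    max_len = max(e - s for s, e, _ in db)
    result = {}
    for pos in variants:
        hits = []
        i = bisect_right(starts, pos) - 1  # rightmost start <= pos
        while i >= 0 and db[i][0] >= pos - max_len:
            s, e, label = db[i]
            if s <= pos <= e:
                hits.append(label)
            i -= 1
        result[pos] = hits
    return result

db = [(100, 200, "geneA"), (150, 400, "geneB"), (500, 650, "promoterC")]
print(annotate([120, 180, 520, 900], db))
# -> {120: ['geneA'], 180: ['geneB', 'geneA'], 520: ['promoterC'], 900: []}
```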
Research Manuscript


Design
In-memory and Near-memory Computing Circuits
DescriptionRegular path queries (RPQs) in graph databases are bottlenecked by the memory wall. Emerging processing-in-memory (PIM) technologies offer a promising solution to dispatch and execute path matching tasks in parallel within PIM modules. We present Moctopus, a PIM-based data management system for graph databases that supports efficient batch RPQs and graph updates. Moctopus employs a PIM-friendly dynamic graph partitioning algorithm, which tackles graph skewness and preserves graph locality with low overhead for RPQ processing. Moctopus enables efficient graph updates by amortizing the host CPU's update overhead to PIM modules. Evaluation of Moctopus demonstrates superiority over the state-of-the-art traditional graph database.
IP


Engineering Tracks
IP
DescriptionWith the number of computing and peripheral building blocks in modern System-on-Chip (SoC) designs rapidly rising into the hundreds, the interconnect between these blocks can become the long pole for timing analysis and contribute significantly to power consumption. Networks-on-Chip (NoCs) have emerged as the critical solution for on-chip communication and have seen a rapid rise in protocol complexity for coherent and non-coherent designs, and flows for automated RTL generation of configurable NoC IP from high-level topology descriptions have emerged.
With the transport delay increasingly dominated by RC wiring delay, changes in the NoC topology caused by difficulties in timing closure during the Place and Route (P&R) phase can add significant project delays.
This presentation will outline a flow and methodology that uses early, abstracted technology information to efficiently guide the development of NoCs: .lef/.def-based import of floorplan information informs NoC topology development, and constraint and placement information is exported as guidance to standard digital implementation flows to avoid late surprises in timing closure.
Research Manuscript


Embedded Systems
Embedded System Design Tools and Methodologies
DescriptionSimulink has been widely used in embedded software development, and it supports simulation to validate the correctness of the constructed models. However, as the scale and complexity of models in industrial applications grow, it is time-consuming for the Simulink simulation engine to achieve high coverage and detect potential errors, especially accumulative errors.
In this paper, we propose AccMoS, an accelerated model simulation method for Simulink models via code generation. AccMoS generates simulation functionality code for Simulink models through simulation-oriented instrumentation, including runtime actor information collection, coverage collection, and calculation diagnosis. The final simulation code is constructed by composing all the instrumentation code with actor code generated from a predefined template library and integrating test data import. After compiling and executing the code, AccMoS generates simulation results that include coverage and diagnostic information. We implemented AccMoS and evaluated it on several benchmark Simulink models. Compared to Simulink's simulation engine, AccMoS shows a 215.3× improvement in simulation efficiency, significantly reducing the time required to detect errors. AccMoS also achieved greater coverage within equivalent time.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAutomotive-grade multiprocessor System-on-Chips (SoCs) operating in advanced FinFET nodes demand unparalleled reliability and quality. Ensuring power integrity signoff for these SoCs is crucial, necessitating extensive coverage of local switching noise for EMIR analyses. Conventional vectorless EMIR and Gate-VCD based methods are increasingly inadequate in identifying critical noise conditions affecting timing. This study introduces a novel aggressor-based EMIR analysis using SigmaDVD, delivering exceptional local noise coverage for robust power integrity sign-off. Comparative analyses of conventional vectorless EMIR and Gate-VCD EMIR against SigmaDVD on two automotive SoCs reveal significantly heightened local noise coverage with SigmaDVD. This innovative approach provides a foundation for confident power integrity signoff on automotive SoCs, addressing the stringent requirements of extreme reliability in advanced FinFET nodes.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionLarge neural networks, especially transformer-based models, present two critical challenges that exacerbate the memory-wall issue in AI accelerator designs. First, the increased dynamic range of the weights requires higher-precision quantization formats, leading to higher memory capacity requirements. Second, the exponential growth in model parameters incurs more data movement, leading to increased latency and power consumption. In this study, we propose two novel approaches to address these problems. First, based on Posit, we introduce a new format called adaptive Posit (AdaP), which dynamically extends the dynamic range of its representation at runtime with minimal hardware overhead. AdaP, utilizing two exponent encoding schemes, accommodates the data distribution with lower quantization error compared to regular Posit. Second, we propose using a compute-in-memory (CIM) architecture to implement AdaP multiply-and-accumulate (MAC) computation to reduce weight data movement. Traditional CIM designs proposed for floating-point-like MAC computation use a comparator tree (CT) to compute the maximum exponent, enabling the CIM to focus on integer MAC. However, CT-based designs scale poorly as the number of inputs increases. To address this, we propose a speculative input alignment design that significantly reduces the delay, area, and power consumption of the max-exponent computation. Software evaluations show that 8-bit AdaP incurs a negligible 0.25% F1-score reduction on the XLM language identification benchmark compared to the full-precision baseline. Hardware synthesis and simulation results further illustrate that our approach achieves a 55% energy-efficiency improvement and a 2.4x area-efficiency improvement compared to the state-of-the-art posit processing element.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn thermal analysis of a chiplet system, conventional numerical methods or machine learning-based surrogate models face tremendous challenges in computation cost and accuracy, especially in the presence of process and material variations. We propose Graph Neural Networks (GNNs) as a mathematical framework for efficient and robust thermal analysis with composite materials. By modeling each region and their thermal interactions as a graph, we continually adapt the GNN model under thermal interface variations. We validate our approach with numerical solutions and real thermal images from a crossbar unit, and demonstrate its speedup and accuracy in a 2.5D chiplet system.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe pervasive integration of deep neural networks (DNNs) within smart devices has significantly increased computational workloads, consequently intensifying pressure on real-time performance and device power consumption. Offloading segments of DNNs to the edge has emerged as an effective strategy for reducing latency and device power usage. Nonetheless, determining the workload to offload presents a complex challenge, particularly in the face of fluctuating device workloads and varying wireless signal strengths. This paper introduces a streamlined approach aimed at swiftly and accurately forecasting the computing latency of a DNN. Building upon this, an adaptive neurosurgeon framework is proposed to dynamically select the optimal partition point of a DNN during runtime, effectively minimizing computing latency. Through experimental validation, our proposed adaptive neurosurgeon demonstrates superior performance in reducing computing latency amidst changing DNN workloads across devices and varying wireless communication capabilities, outperforming existing state-of-the-art approaches, such as the autodidactic neurosurgeon.
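The underlying decision the framework automates is the classic neurosurgeon-style partition choice; below is a minimal sketch (with hypothetical per-layer latency forecasts and activation sizes) of selecting the cut that minimizes device compute plus transmission plus edge compute:

```python
def best_partition(dev_lat, edge_lat, cut_bytes, bandwidth):
    """dev_lat[i]/edge_lat[i]: forecast latency (ms) of layer i on the
    device and at the edge. cut_bytes[k]: bytes crossing the wireless link
    if we cut before layer k (cut_bytes[0] = the raw input). Cutting at
    k = n keeps the whole network on-device, so nothing is transmitted."""
    n = len(dev_lat)

    def total_ms(k):
        tx = cut_bytes[k] / bandwidth * 1e3 if k < n else 0.0
        return sum(dev_lat[:k]) + tx + sum(edge_lat[k:])

    return min(range(n + 1), key=total_ms)

dev = [5.0, 9.0, 30.0, 30.0]        # forecast ms per layer on the device
edge = [1.0, 2.0, 6.0, 6.0]         # forecast ms per layer at the edge
cut = [600e3, 800e3, 200e3, 50e3]   # bytes at each possible cut point
for bw in (1e6, 50e6):              # bytes/s: weak vs strong wireless link
    k = best_partition(dev, edge, cut, bw)
    print(f"link {bw:.0e} B/s -> run layers [0,{k}) locally, rest at edge")
```

On the weak link the sketch keeps everything on-device; on the strong link it offloads everything, which is exactly the kind of adaptivity the paper targets.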
Research Manuscript


AI
Design
AI/ML System and Platform Design
DescriptionAlthough Federated Learning (FL) is promising for enabling collaborative learning among Artificial Intelligence of Things (AIoT) devices, it suffers from low classification performance due to various heterogeneity factors of devices (e.g., computing capacity, memory size) and uncertain operating environments. To address these issues, this paper introduces an effective FL approach named AdaptiveFL, based on a novel fine-grained width-wise model pruning strategy that can generate various heterogeneous local models for heterogeneous AIoT devices. Using our proposed reinforcement-learning-based device selection mechanism, AdaptiveFL can adaptively dispatch suitable heterogeneous models to the corresponding AIoT devices on the fly, based on their available resources for local training. Experimental results show that, compared to state-of-the-art methods, AdaptiveFL achieves up to 16.83% inference improvement in both IID and non-IID scenarios.
Research Manuscript


AI
Design
AI/ML, Digital, and Analog Circuits
DescriptionEmerging proposals, such as AdderNet, exploit efficient arithmetic alternatives to the multiply-accumulate (MAC) operations in convolutional neural networks (CNNs). AdderNet adopts an ℓ1-norm based feature extraction kernel, which shows nearly identical model accuracy compared to its CNN counterparts and can achieve considerable hardware savings due to simpler Sum-of-Absolute-Differences (SAD) operations. Nevertheless, existing AdderNet-based accelerator designs still face critical implementation challenges, such as inefficient model quantization, excessive feature memory overheads, and sub-optimal resource utilization. This paper presents AdderNet 2.0, an optimized AdderNet-based accelerator architecture with a novel Activation-Oriented Quantization (AOQ) strategy, a Fused Bias Removal (FBR) scheme for on-chip feature memory bitwidth reduction, and an improved PE design for better resource utilization. The proposed AdderNet 2.0 accelerator designs were implemented on a Xilinx Kria KV260 FPGA. Experimental results show that the INT6 accelerator design achieves up to 3.78× DSP density improvement, and 24% LUT, 40% FF, and 2.1× BRAM savings compared to the baseline CNN design.
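For readers unfamiliar with the kernel, here is a minimal numpy sketch (single channel, stride 1, our own illustration) of the ℓ1-norm feature extraction that replaces MAC-based convolution in AdderNet:

```python
import numpy as np

def sad_conv2d(x, w):
    """AdderNet-style feature extraction: replaces the dot product of a
    sliding window with the negative Sum of Absolute Differences (SAD),
    i.e. y = -sum(|patch - kernel|). Single channel, stride 1, no padding."""
    H, W = x.shape
    k, _ = w.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = -np.abs(x[i:i + k, j:j + k] - w).sum()
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))
y = sad_conv2d(x, w)
print(y.shape)           # (4, 4); the best-matching window has y closest to 0
print(y.max().round(3))
```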
Research Manuscript


Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe compute-in-memory (CIM) paradigm holds great promise to efficiently accelerate machine learning workloads. Among memory devices, static random-access memory (SRAM) stands out as a practical choice due to its exceptional reliability in the digital domain and balanced performance. Recently, there has been a growing interest in accelerating floating-point (FP) deep neural networks (DNNs) with SRAM CIM due to their critical importance in DNN training and high-accurate inference. This paper proposes an efficient SRAM CIM macro for FP DNNs. To achieve the design, we identify a lightweight approach that decomposes conventional FP mantissa multiplication into two parts: mantissa sub-addition (sub-ADD) and mantissa sub-multiplication (sub-MUL). Our study shows that while mantissa sub-MUL is compute-intensive, it only contributes to the minority of FP products, whereas mantissa sub-ADD, although compute-light, accounts for the majority of FP products. Recognizing "Addition is Most You Need", we develop a hybrid-domain SRAM CIM macro to accurately handle mantissa sub-ADD in the digital domain while improving the energy efficiency of mantissa sub-MUL using analog computing. Experiments with the MLPerf benchmark demonstrate its remarkable improvement in energy efficiency by 8.7×∼ 9.3× (7.3×∼8.2×) in inference (training) compared to a fully digital FP baseline without any accuracy loss, showcasing its great potential for FP DNN acceleration.
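One way to see why such a decomposition is available (a worked example under our own reading of the abstract, not necessarily the paper's exact formulation): with normalized mantissas 1+a and 1+b, the product splits into addition-only terms and a single small cross-product, which the sketch below checks numerically.

```python
import numpy as np

# Worked example (our own reading, not the paper's exact math): normalized
# mantissas are 1+a and 1+b with fractions a, b in [0, 1), so
#   (1+a)(1+b) = 1 + (a+b) + a*b
# The (a+b) term needs only adders ("sub-ADD"); a*b is the sole true
# multiplication ("sub-MUL").
rng = np.random.default_rng(0)
a, b = rng.random((2, 100_000))
exact = (1 + a) * (1 + b)
assert np.allclose(exact, 1 + (a + b) + a * b)  # the identity holds

share_mul = (a * b / exact).mean()
print(f"average share of the product carried by the sub-MUL term: "
      f"{share_mul:.1%}")  # ~9%: the addition-only terms dominate on average
```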
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionDNN accelerators have made considerable progress, significantly advanced by model compression and specialized dataflow techniques. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing weight/input-stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, a gap recently explored in compute-in-memory research. Moreover, those approaches mainly aim at reducing the Analog-to-Digital Converter (ADC) overhead, neglecting the critical issue of intensive memory access. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. We further propose a grouping strategy that combines APSQ with PSQ, enhanced by a floating-point regularization technique, to boost accuracy. The experiments indicate that APSQ can efficiently compress PSUMs to INT8, incurring negligible accuracy degradation for Segformer-B0 and EfficientViT-B0 on the challenging Cityscapes dataset. This leads to a notable reduction in energy costs of 30~45%.
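APSQ's exact mechanics are not detailed in the abstract; the sketch below only illustrates the general idea of folding quantization into partial-sum accumulation so that PSUMs can be stored at INT8 (scale and values are placeholders):

    import numpy as np

    def q8(x, s):
        # Symmetric INT8 quantizer with scale s.
        return np.clip(np.round(x / s), -128, 127) * s

    def accumulate_quantized_psums(partial_sums, s):
        # Re-quantize the running partial sum after each accumulation step,
        # so only an INT8 code (plus the shared scale) is ever stored.
        acc = 0.0
        for p in partial_sums:
            acc = q8(acc + p, s)
        return acc

    print(accumulate_quantized_psums([0.93, 1.71, -0.42], s=0.05))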
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAI algorithms are increasingly diverse, from dense to sparse and from regular to irregular. To efficiently manage such diversity in hardware, we propose a programmable heterogeneous accelerator that dynamically balances computation requirements across different design levels. It comprises two types of processing elements (PEs), customized for dense (e.g., DNNs) and sparse (e.g., graphs) workloads, respectively. These PEs are integrated into a programmable architecture, enabling support for various memory access and computation patterns. Based on 16nm design data, the new accelerator achieves an 11x improvement in latency compared to state-of-the-art homogeneous accelerators.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionRecent decades have seen extensive research on Analog Design Automation. The most recent approaches are based on Reinforcement Learning (RL) instead of heuristic optimizers such as ant colony, particle swarm, or differential evolution algorithms. This paper describes a new learning strategy enhancing the recent Proximal Policy Optimization (PPO) RL approach, applied to analog design. This solution is compared to the more classical heuristic methods mentioned above. The study is done using an electrical-simulator-based environment under equivalent calculation conditions. The paper highlights convergence properties and demonstrates RL's ability to avoid local-minimum traps.
Research Manuscript


EDA
Timing and Power Analysis and Optimization
DescriptionMultiple Input Switching (MIS) effects commonly induce undesired glitch pulses at the output of CMOS gates, potentially leading to circuit malfunction and significant power consumption. Thus, accurate and efficient glitch modeling is crucial for the design of high-performance, low-power, and reliable ICs. In this work, we present a new gate-level approach for modeling glitch effects under MIS. Unlike previous studies, we leverage efficient Machine Learning (ML) techniques to accurately estimate the glitch shape characteristics, propagation delay, and power consumption. To this end, we evaluate various ML engines and explore different Artificial Neural Network (ANN) architectures. Moreover, we introduce a seamless workflow to integrate our ANNs into existing standard cell libraries, striking an optimal balance between model size and accuracy in gate-level glitch modeling. Experimental evaluation on gates implemented in 7 nm FinFET technology demonstrates that the proposed models achieve an average error of 2.19% against SPICE simulation while maintaining a minimal memory footprint.
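As a rough illustration of the modeling setup (the paper's actual features and targets are not listed in the abstract), an ANN regressor can map MIS conditions to glitch characteristics; everything below, including the feature choice and synthetic targets, is a placeholder:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Hypothetical features per MIS event: two input slews, input arrival
    # skew, and output load; targets stand in for SPICE-characterized
    # glitch peak, width, and energy.
    rng = np.random.default_rng(0)
    X = rng.random((2000, 4))
    y = np.column_stack([X[:, 0] * X[:, 3], X[:, 1] + X[:, 2], X.sum(axis=1)])

    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500).fit(X, y)
    print(model.predict(X[:1]))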
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThis presentation describes how to account for LLE (Local Layout Effect) impact in the timing signoff flow. At advanced nodes, LLE impact has increased compared to earlier nodes, so it has become essential to consider.
Because this LLE impact could not be considered in the existing timing signoff flow, we introduce an advanced timing signoff methodology that fully accounts for it.
LLE impact can be calculated from the Vth and u0 parameters. With a library characterized for sensitivity to these parameters, the change in each parameter caused by the neighboring cell is measured and reflected in the delay.
Additionally, the verification method and the design gains obtained from this methodology are also described.
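The presentation's exact formulation is not reproduced here, but a first-order sensitivity model of the described idea might look as follows (all names illustrative):

    def lle_adjusted_delay(d0, s_vth, s_u0, d_vth, d_u0):
        # d0: nominal arc delay from the library.
        # s_vth, s_u0: characterized sensitivities of the delay to Vth and u0.
        # d_vth, d_u0: parameter shifts induced by the neighboring cell.
        return d0 * (1.0 + s_vth * d_vth + s_u0 * d_u0)

    # Example: a 50 ps arc with a +2% Vth shift and a -1% u0 shift.
    print(lle_adjusted_delay(50.0, s_vth=0.8, s_u0=-0.5, d_vth=0.02, d_u0=-0.01))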
Research Manuscript


EDA
Design Verification and Validation
DescriptionGiven the increasing complexity of integrated circuits, the utilization of machine learning in simulation-based hardware design verification (DV) has become crucial to ensure comprehensive coverage of hard-to-hit states. Our paper proposes a deep deterministic policy gradient (DDPG) algorithm combined with prioritized experience replay (PER) to determine the stimulus settings that result in the highest average FIFO depth in a modified exclusive shared invalid (MESI) cache controller architecture. This architecture includes four FIFOs, each corresponding to a distinct CPU.
Through extensive experimentation, DDPG coupled with PER (DDPG-PER) proves to be more effective than DDPG with uniform experience replay in enhancing average FIFO depth and coverage within the DV process. Furthermore, our proposed DDPG-PER framework significantly increases the occurrence of higher FIFO depths, thereby addressing the challenges associated with reaching hard-to-hit states in DV. The proposed DDPG-PER and DDPG algorithms also demonstrate a larger average FIFO depth over four CPUs, requiring considerably less execution time than Bayesian Optimization (BO).
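For context, the core of prioritized experience replay is sampling stored transitions in proportion to their TD error rather than uniformly; a generic sketch (not the paper's implementation) is:

    import numpy as np

    class PrioritizedReplay:
        def __init__(self, capacity, alpha=0.6):
            self.capacity, self.alpha = capacity, alpha
            self.data, self.prio = [], []

        def add(self, transition, td_error):
            if len(self.data) >= self.capacity:
                self.data.pop(0)
                self.prio.pop(0)
            self.data.append(transition)
            # Larger TD error -> higher replay priority (alpha tempers it).
            self.prio.append((abs(td_error) + 1e-6) ** self.alpha)

        def sample(self, k):
            p = np.asarray(self.prio)
            p = p / p.sum()
            idx = np.random.choice(len(self.data), size=k, p=p)
            return [self.data[i] for i in idx]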
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionA chip design consists of interconnected blocks providing advanced functionality. While these blocks are thoroughly verified, the integrity of connections between these pre-verified components lacks clear ownership and efficient verification processes. With growing design complexity, the number of such connections can reach millions and lead to unexpected problems, which may appear late in the design flow. Therefore, a robust methodology for early checking of connection integrity at the RTL and netlist level is crucial. Current formal, simulation, and script-based approaches for connectivity checking face challenges, such as a lack of key functionality, scalability limitations, debugging difficulties, and inefficient usability.
In contrast, this paper introduces a novel static approach to defining a comprehensive set of rules at both the block and top-level, addressing issues such as the elimination of improper connectivity that may lead to block abutment issues during physical design, clock domain identification for specific instance ports, detection of driven pins within a module and ensuring glitch-free input pins for specified instances. We successfully verified connectivity and glitches on an active SoC design with this methodology in a matter of days, as opposed to weeks of work with alternative methods.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs we move towards lower technology nodes, the challenges in design implementation intensify, and enhancing design methodologies and algorithms becomes crucial. By encouraging the integration of different stages, we can significantly improve the implementation process. We present an innovative methodology for implementing a source-synchronous design by integrating an extra stage into our conventional APR flow, strategically situated between the floorplan and placement stages. Our solution utilizes a source-synchronous design topology, [SSD Flow], comprising two distinct stages: first, we traverse the critical signal nets; then we execute tailored clock routing that adheres to specified rules and constraints. This approach systematically navigates timing intricacies while proactively mitigating crosstalk and noise issues, ultimately optimizing the design. The main objective is to devise a methodology that simplifies the implementation process and achieves enhanced Quality of Results (QoR). Our proposed methodology has significantly streamlined the design implementation process, yielding substantial improvements: a two-week reduction in Turnaround Time (TAT), a 48% decrease in latency, a 59.2% reduction in data path delay, a 39.6% improvement in dynamic power, a 50% reduction in data path depth, and a 55.5% decrease in clock path depth.
DAC Pavilion Panel


Security
DAC Pavilion
DescriptionSemiconductor security is increasingly crucial due to the growing number of chip vulnerabilities and initiatives regulating cybersecurity assurance for electronic products and systems. Various industry and regulatory bodies have introduced standards and regulations to address cybersecurity concerns across both software and hardware, such as the ISO/SAE 21434 cybersecurity standard for automotive and the recently released European Union (EU) Cyber Resilience Act. This panel of industry experts will delve into the current state of cybersecurity assurance for semiconductor chips and how emerging security standards and the growing threat landscape will continue to accelerate the need for more rigorous cybersecurity measures across all sectors.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn the current dynamically changing landscape of computing, the growth of artificial intelligence (AI) applications has caused an exponential increase in energy consumption, re-emphasizing the need for managing the power footprint in chip design. To manage this escalating energy footprint and enable true system-level low-power design, modeling standards play a key role in facilitating interoperability and reuse. The IEEE 2416 system-level power modeling standard, introduced in 2019, offers a unified framework spanning system level to detailed design, facilitating comprehensive low-power design for entire systems. This standard also enables efficiency through contributor-based Process, Voltage, and Temperature (PVT) independent power modeling.
The IEEE 2416 standard is currently undergoing several extensions slated for release in 2024. Noteworthy among these extensions is the comprehensive modeling of multiple voltage blocks and precise representations of analog and mixed-signal blocks. We present these upcoming extensions for the first time, highlighting their potential value through a complete system example with processor cores, accelerators, analog and mixed-signal IP.
This presentation offers insights into the practical implementation of forthcoming extensions with examples. We believe that sharing these advancements, coupled with real-world examples, will help the audience gain valuable early details in using the standard for designing low power systems.
IP


Engineering Tracks
IP
DescriptionContinuous Time Delta Sigma Modulators (CTDSMs) are a critical part of various RF receiver chains. These ADCs must accommodate wider signal bandwidths with high dynamic range, which requires higher sampling rates and leads to increased power consumption, making successful power and signal integrity sign-off a challenging task.
In EMIR analysis, a circuit is simulated together with the parasitic resistor and capacitor network, which models the IR drop and electromigration (EM) effects for both power and signal nets. Advanced-node designs have more complex EM rules, and with the exponential increase in parasitics (RCs) for such designs, EM simulation becomes more costly.
To address these challenges, we have used the Virtuoso-ADE and Spectre-X EMIR solution, which handles high-capacity designs and provides exceptional performance. With this flow, a new two-stage iterated method of Spectre-X is used for EMIR analysis to achieve golden accuracy with a high performance gain.
In this paper, using this new two-stage iterated method of Spectre-X EMIR, we achieve accuracy close to that of the golden direct (single-stage) method while accelerating EMIR signoff closure with a 2.5X performance gain. Seamless integration of the Voltus-Fi solution with the easy visualization and post-processing features of ADE provides a 30% productivity gain.
Research Manuscript


AI
Security
AI/ML Security/Privacy
DescriptionThe paper introduces AdvHunter, a novel strategy to detect adversarial examples (AEs) in Deep Neural Networks (DNNs). AdvHunter operates effectively in practical black-box scenarios, where only hard-label query access is available, a situation often encountered with proprietary DNNs. This differentiates it from existing defenses, which usually rely on white-box access or need to be integrated during the training phase, requirements that are often not feasible with proprietary DNNs. AdvHunter functions by monitoring data flow dynamics within the computational environment during the inference phase of DNNs. It utilizes Hardware Performance Counters to monitor microarchitectural activities and employs principles of Gaussian Mixture Models to detect AEs. Extensive evaluation across various datasets, DNN architectures, and adversarial perturbations demonstrates the effectiveness of AdvHunter.
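The detection principle can be sketched generically: fit a Gaussian Mixture Model on hardware-performance-counter traces of benign inferences and flag low-likelihood traces. The code below is an assumption-laden illustration (synthetic data, arbitrary component count), not AdvHunter itself:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical per-inference HPC feature vectors (e.g., cache misses,
    # branch mispredictions) collected while classifying benign inputs.
    rng = np.random.default_rng(0)
    benign = rng.normal(size=(5000, 6))

    gmm = GaussianMixture(n_components=4, random_state=0).fit(benign)
    threshold = np.percentile(gmm.score_samples(benign), 1)  # 1% tail cutoff

    def looks_adversarial(hpc_vector):
        # Flag traces the benign-trained mixture considers unlikely.
        return gmm.score_samples(hpc_vector.reshape(1, -1))[0] < threshold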
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionProcessing-in-Memory (PIM) enables efficient computation of heavy workloads. Motivated by its capabilities, we investigate its potential for accelerating Fully Homomorphic Encryption (FHE), a domain known for its colossal computational demands. We present affinity-based optimizations confronting the challenges of optimizing FHE's extensive data processing within PIM's unique architectural constraints, focusing on the balance between parallelism and data affinity. Our novel scheduling methodology minimizes remote data access while reducing penalties from loss of parallelism. We evaluate our solution on an existing PIM-HBM system, achieving 4.55x-216.56x speedup when computing real-world workloads over TFHE, compared to previous works.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionHeterogeneous systems-on-chips (SoCs) for real-time applications integrate CPUs and/or GPUs with accelerators to meet application deadlines under strict power/area constraints. The large design space of these systems necessitates efficient SoC-level design space exploration (DSE). Existing static approaches struggle to find SoCs that satisfy all constraints, rendering them unsuitable for real-time applications. We propose the use of dynamic scheduling techniques to significantly reduce the design space and navigate it efficiently. Our proposal outperforms existing methodologies with 5.3-12.8x faster DSE times for autonomous vehicle and augmented/virtual reality domains, yielding designs with 1.2-3x better throughput (iso-area) and up to 2.4x lower area (iso-throughput).
Keynote
Special Event


AI
Design
DescriptionArtificial intelligence is changing the world around us, but most of the focus has been on large models running on immense compute servers. There is a critical need for AI in edge applications to decrease latency and power consumption. Fulfilling this need requires new approaches to meet the constraints of future industrial, automotive, and consumer platforms at the intelligent edge.
Front-End Design


AI
Design
Engineering Tracks
Front-End Design
DescriptionGenerative AI is everywhere, but it is still taking its first steps in chip design.
In this session, we'll invite representatives from the design community to review the challenges and present working solutions for using AI in front-end chip design, with an emphasis on sharing "how-to" ideas.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThis presentation discusses how an AI-assisted design optimization methodology provides a verified optimal solution for two circuits: metal-option switches and charge pumps. By exploring the entire design space, up to 260,000 design combinations in this case, it results in a faster design cycle, improved capacity, and reduced CPU time.
Micron uses metal-layer switches in its circuits to adjust for changes. These switches need tuning multiple times during a product design cycle; they might also require adjustment post-tapeout if the process varies and causes poor circuit performance. Charge pumps, on the other hand, are widely used in memory design to convert a supply voltage to a higher or lower value.
The traditional tuning methods of the above-mentioned circuits involve an iterative manual process to explore as many of the design combinations as possible. This process is time-consuming and may lead to sub-optimal solutions.
This presentation covers the motivation behind the work, the methodology used, and the results obtained by the design team. We also discuss the algorithm behind the AI-powered solution that helped achieve these results.
IP


Engineering Tracks
IP
DescriptionThe paper addresses the challenge of validating Process, Voltage, and Temperature (PVT) corners in semiconductor design, highlighting the increasing complexity of design technology and the impact of process variables and device interference. Recognizing the limitations of traditional Brute-Force methods and the impracticality of validating all PVT corners due to runtime constraints, the paper proposes an AI-based approach. The authors introduce a statistical verification method that combines a scaling method with an Artificial Intelligence (AI)-based Brute-Force accurate method.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith the rise of generative AI applications, there is a growing demand for high-bandwidth memory in AI/GPU chips, and interposer designs like UCIe for D2D and SoC-to-HBM interconnects are increasingly popular for chiplet interconnection. Interposer designs face unique challenges such as small trace widths, high interconnect density, and the absence of a solid plane. These challenges make the traditional SI flow time-consuming and leave silicon-based material effects unaccounted for. An efficient and accurate pre-layout analysis flow is urgently needed.
This paper proposes an efficient interposer high-speed design simulation and optimization flow. This flow is driven by optiSLang, allowing for the configuration of design parameters and objectives. By leveraging various AI/ML algorithms, the solution space is explored to identify the optimal design. This flow operates as a closed-loop automatic iterative optimization process.
In summary, this paper presents an automated interposer pre-layout design simulation and optimization flow. The proposed flow enhances accuracy, speed, and realism compared to traditional manual approaches, and the validation results demonstrate its effectiveness and applicability.
Front-End Design


AI
Design
Engineering Tracks
Front-End Design
DescriptionThis paper addresses the critical challenge in chip design scalability, where standard cells are replicated in the millions, resulting in designs with tens of billions of transistors. Traditional methods of constraining Process, Voltage, and Temperature (PVT) corners based on past experiences and conducting Monte Carlo simulations on worst-case scenarios prove unreliable. Incorrectly predicting worst-case PVT can lead to schedule delays and design robustness issues. The brute-force Monte Carlo methods for high sigma verification are both costly and impractical.
To overcome these challenges, we present an AI-powered automated methodology for detecting and verifying worst-case yield. Our single-pass PVT + variation high-sigma solution, exemplified by the Solido PVTMC Verifier, achieves the fastest runtime, while the brute-force accurate high-sigma solution, demonstrated by Solido High-Sigma Verifier, ensures the highest accuracy.
The results on latch-based D flip-flop circuits showcase the effectiveness of our approach. Solido High-Sigma Verifier verified bimodality failure occurrences with 4,000 simulations, delivering a staggering 2,500,000X faster runtime than brute-force methods. Furthermore, the yield for this cell at the target PVT was verified to 6.322 sigma, accompanied by a remarkable 30X runtime speedup compared to the previous methodology. This signifies not only improved performance but also better accuracy and coverage rates.
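As a point of reference for the sigma figures quoted above, a single-sided sigma level maps to a failure probability through the standard normal tail (the tool's exact yield definition may differ, e.g., two-sided):

    from scipy.stats import norm

    sigma = 6.322
    fail_prob = norm.sf(sigma)     # P(X > sigma) for a standard normal
    print(f"{fail_prob:.2e}")      # on the order of 1e-10

    # Conversely, an observed failure rate maps back to a sigma level:
    print(norm.isf(fail_prob))     # recovers 6.322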
Work-in-Progress Poster
AiDAC: A Low-Cost In-Memory Computing Architecture with All-Analog Multibit Compute and Interconnect


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionAnalog in-memory computing (AiMC) is an emerging technology that shows remarkable performance advantages for neural network acceleration. However, as the computational bit-width and scale increase, high-precision data conversion and long-distance data routing will result in unacceptable energy and latency overheads in the AiMC system. In this work, we focus on the potential of in-charge computing and in-time interconnection and present an innovative AiMC architecture, named AiDAC, with three key contributions: (1) AiDAC enhances multibit computing efficiency and reduces data conversion times through its capacitor-grouping technology; (2) AiDAC is the first to adopt row drivers and column time accumulators to achieve large-scale AiMC array integration while minimizing the energy cost of data movement; (3) AiDAC is the first work to support large-scale all-analog multibit vector-matrix multiplication (VMM) operations. The evaluation shows that AiDAC maintains high-precision calculation (less than 0.79% total computing error) while also possessing excellent performance features, such as high parallelism (up to 26.2TOPS), low latency (<20ns/VMM), and high energy efficiency (123.8TOPS/W), for 8-bit VMM with 1024 input channels.
Research Manuscript


Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe emergence of Diffusion models has gained significant attention in the field of Artificial Intelligence Generated Content. While Diffusion demonstrates impressive image generation capability, it faces hardware deployment challenges due to its unique model architecture and computation requirement. In this paper, we present a hardware accelerator design, i.e. AIG-CIM, which incorporates tri-gear heterogeneous digital compute-in-memory to address the flexible data reuse demands in Diffusion models. Our framework offers a collaborative design methodology for large generative models from the computational circuit-level to the multi-chip-module system-level. We implemented and evaluated the AIG-CIM accelerator using TSMC 22nm technology. For several Diffusion inferences, scalable AIG-CIM chiplets achieve 21.3× latency reduction, up to 231.2× throughput improvement and three orders of magnitude energy efficiency improvement compared to the NVIDIA RTX 3090 GPU.
Research Manuscript


Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionThe use of cross-scheme fully homomorphic encryption (FHE) in privacy-preserving applications challenges hardware accelerator design. Existing accelerator architectures fail to efficiently handle hybrid FHE schemes due to the mismatch between computational demands and hardware resources. We propose a novel architecture using a hardware-friendly, versatile low-level operator, i.e., Meta-OP. Our slot-based data management efficiently handles memory access patterns of the meta-op for diverse operations. Alchemist accelerates both arithmetic and logic FHE with high hardware utilization rates. Compared to existing ASIC accelerators, Alchemist outperforms with a 29.4× performance per area improvement for arithmetic FHE and a 7.0× overall speedup for logic FHE.
Research Manuscript


AI
AI/ML Algorithms
DescriptionTraditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training. In this study, we introduce Logarithmic Posits (LP), an adaptive, hardware-friendly data type inspired by posits that dynamically adapts to DNN weight/activation distributions by parameterizing LP bit fields. We also develop a novel genetic-algorithm-based framework, LP Quantization (LPQ), to find optimal layer-wise LP parameters while reducing representational divergence between quantized and full-precision models through a novel global-local contrastive objective. Additionally, we design a unified mixed-precision LP accelerator (LPA) architecture comprising processing elements (PEs) that incorporate LP in the computational datapath. Our algorithm-hardware co-design demonstrates on average a <1% drop in top-1 accuracy across various CNN and ViT models. It also achieves ~2x improvements in performance per unit area and 2.2x gains in energy efficiency compared to state-of-the-art quantization accelerators using different data types.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionTraining Graph Neural Networks (GNNs) on a large monolithic graph presents unique challenges, as the graph cannot fit within a single machine and cannot be decomposed into smaller disconnected components. Distributed sampling-based training distributes the graph across multiple machines and trains the GNN on small parts of the graph that are randomly sampled every training iteration. We show that in a distributed environment, the sampling overhead is a significant component of the training time for large-scale graphs. We propose FastSample, which is composed of two synergistic techniques that greatly reduce the distributed sampling time: (1) a new graph partitioning method that eliminates most of the communication rounds in distributed sampling, and (2) a novel, highly optimized sampling kernel that reduces memory movement during sampling. We test FastSample on large-scale graph benchmarks and show that FastSample speeds up distributed sampling-based GNN training by up to 2x with no loss in accuracy.
Research Manuscript


Embedded Systems
Embedded Software
DescriptionRegular Expression (RE) matching enables the identification of patterns in data streams across heterogeneous fields ranging from proteomics to computer security. These scenarios require massive data analysis that, combined with the high data dependency of REs, leads to long computation times and high energy consumption. Current RE engines offer either (1) flexibility, with run-time RE changes and broad operator support, at the cost of performance, or (2) fixed high-performance accelerators implementing a few simple RE operators. To overcome these limitations, we propose ALVEARE: a hardware-software approach combining a Domain-Specific Language (DSL) with an embedded Domain-Specific Architecture. We exploit REs as a DSL by translating them into flexible executables through our RISC-based Instruction Set Architecture, which expresses everything from simple to advanced primitives. We then design a speculation-based microarchitecture to execute real benchmarks efficiently.
ALVEARE provides RE-domain flexibility and broad operator support, and achieves up to 34x speedup and 57x energy efficiency improvements against the state-of-the-art RE2 and the BlueField-2 DPU with its RE accelerator.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionResearchers and industries are increasingly drawn to quantum computing solutions, attracted by their potential computational advantages over classical systems. However, validating new quantum algorithms faces challenges due to limited qubit availability and noise in current quantum devices. Software simulators offer a solution but are time-consuming. Hardware emulators are emerging as an attractive alternative.
This article introduces AMARETTO (quAntuM ARchitecture EmulaTion TechnOlogy), an architecture designed for quantum computing emulation on low-tier Field Programmable Gate Arrays (FPGAs) supporting Clifford+T and rotational gate sets. AMARETTO accelerates and simplifies the functional verification of quantum algorithms using a Reduced-Instruction-Set-Computer (RISC)-like structure and efficient handling of sparse quantum gates. A dedicated compiler translates OpenQASM 2.0 into RISC-like instructions. Our results, validated against the Qiskit state vector simulator, demonstrate successful emulation of 16 qubits on a Xilinx Kria KV260 System on Module (SoM). This approach rivals other works in the literature, offering similar emulated qubit capacity on a smaller, more accessible FPGA.
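As a software reference for what such an emulator computes (AMARETTO's RISC-like datapath is not reproduced here), a single-qubit gate updates a state vector by pairing amplitudes whose indices differ only in the target bit:

    import numpy as np

    def apply_1q_gate(state, gate, target, n_qubits):
        # Apply the 2x2 gate to every amplitude pair that differs only in
        # the target bit of the basis-state index.
        stride = 1 << target
        for base in range(1 << n_qubits):
            if base & stride:
                continue
            a, b = state[base], state[base | stride]
            state[base] = gate[0, 0] * a + gate[0, 1] * b
            state[base | stride] = gate[1, 0] * a + gate[1, 1] * b
        return state

    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    psi = np.zeros(2 ** 3, dtype=complex)
    psi[0] = 1.0
    psi = apply_1q_gate(psi, H, target=0, n_qubits=3)  # superposition on qubit 0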
IP


Engineering Tracks
IP
DescriptionA new time-skew mismatch correction IP with the lowest known convergence time has been developed for a TI-ADC (Time-Interleaved Analog-to-Digital Converter) in a communications receiver system. The proposed design greatly relieves the communication link budget by reducing time-skew estimation and correction time by at least two orders of magnitude. The proposed non-iterative calibration technique is purely deterministic, uses contemporary signal processing blocks, and is not based on any correlational or statistical approaches. Numerical simulation results demonstrate a significant improvement in TI-ADC performance with the proposed calibration method. In next-generation 5G/6G, radar, and space communication domains, the low latency of the proposed TI-ADC will enable applications where fast response times are needed. As the correction converges very quickly, time-skew changes due to rapid temperature variations are tracked and compensated.
IP


Engineering Tracks
IP
DescriptionAn on-chip all-digital transient filter IP is proposed as a replacement for an off-chip RC circuit for glitch filtering. This is mandatory for EMC compliance and for filtering out transient artifacts due to impedance mismatches. The proposed filter's area is very small, so it can be accommodated in existing serial link PHY receiver designs. It has low insertion latency, does not alter the signal transition width, and preserves the width/duty cycle of the received signal. The high-figure-of-merit all-digital filter completely replaces the conventional RC low-pass filter. The prior analog RC filter not only adds inertia to the system but also occupies physical board space. Multi-channel systems will benefit greatly and become less cumbersome. The proposed design is all-digital and thus highly technology independent, enabling very short development times. It has been deployed successfully in the MIPI I3C controller and tested with glitchy transitions on both the SCL and SDA signals.
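The IP's actual scheme is not disclosed; for contrast, a conventional all-digital counter-based deglitcher (which, unlike the proposed filter, adds n samples of latency) behaves as follows:

    def deglitch(samples, n=3):
        # The output only changes after the input has held a new value for
        # n consecutive samples, so pulses shorter than n samples are dropped.
        out, current, count = [], samples[0], 0
        for s in samples:
            if s == current:
                count = 0
            else:
                count += 1
                if count >= n:
                    current, count = s, 0
            out.append(current)
        return out

    print(deglitch([0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0], n=3))
    # The 1-sample glitch is removed; the real edge passes, delayed by n samples.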
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionQuantum computers based on superconducting qubits require classical radio-frequency (RF) electronic circuits to control and read out the quantum states. As the complexity of quantum computers scales up, the classical circuitry part becomes increasingly important, calling for high-quality models for its design and optimization. In this paper, we derive an analytical model to quantify the impact of circuit non-idealities on the readout fidelity for superconducting quantum computing hardware. Such a model considers a comprehensive set of non-idealities commonly found in the readout chain, such as frequency, amplitude and phase inaccuracies, impedance mismatch, quantum noise, and amplifier noise, and predicts the joint effects of these non-idealities on the final fidelity. The model's accuracy and effectiveness are verified by numerical quantum-classical co-simulation. The availability of such a model can facilitate the design and optimization of practical quantum computers.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionWhile outsourcing hardware designs using FPGAs (Field Programmable Gate Arrays) enables cost optimization in manufacturing, hardware-Trojan insertion becomes a potential threat to industrial fields. In this paper, we propose a system that applies IFT (Information Flow Tracking) to detect hardware Trojans inserted into a DUT (Design Under Test) written in HDL (Hardware Description Language). Unlike existing IFT techniques for DUTs, our implementation tracks the information flow of multiple variables in simulation. This allows flexible assertion policies to be used for testing. By checking whether a DUT violates any given policy, our system detects a Trojan and extracts the HDL statement and its execution condition related to the Trojan. These are useful for understanding the Trojan's location and trigger condition.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe integration of multiple dies and substrates into a unified 3D-IC package presents a compelling solution to the limitations posed by scaling and the challenges of SoC migration, making it a focal point in semiconductor advancement. Despite its prominence, diverse fabrication methodologies, teams, and formats introduce complexities into seamless integration. This landscape underscores the critical need for innovative approaches to ensure cohesive connectivity, and it emphasizes the imperative role of automation in generating 3D-IC rule decks for swift and precise qualification. Efficient qualification demands automated systems capable of synthesizing rule decks while adhering to design specifications and manufacturing methods. This approach accelerates system netlist generation, layout assembly, and LVS (Layout vs. Schematic) rule deck creation, expediting physical verification to mitigate challenges and promote seamless integration across diverse substrates in semiconductor design and manufacturing.
Back-End Design


Back-End Design
Design
Engineering Tracks
DescriptionFull-chip STA is a mandatory step in the design closure cycle. With extensive market requirements for high computational workloads, design sizes are growing along with demands on performance and area. With chips fabricated on shrinking technology nodes, the window for design accuracy and pessimism tightens further. To cater to these needs, designs are highly modularized at the architecture level, but when it comes to STA, performing a full-chip flat STA is the only option for computing exact design performance.
Flat STA on big designs comes with a high cost in runtime and memory, which makes flat STA performed at the final design closure stage the optimal situation. Hence, there are methodologies for faster STA, e.g., distributed chip timing analysis and hierarchical timing analysis, which save runtime and memory but can impact accuracy.
For flexibility in STA methodologies, multiple hierarchical STA flows, including ETMs, boundary models (for bottom-up analysis), and timing contexts (for top-down analysis), are supported by EDA vendors.
The SmartScope flow discussed in this paper provides a method to bridge the gap between timing-model-based flows and flat STA, with a vision of providing the accuracy of flat STA at the runtime/memory cost of timing-model-based flows.
This paper will showcase:
i) a quantitative analysis of full hierarchical flows, and
ii) a detailed correlation in terms of runtime, memory, and accuracy among the different hierarchical STA flows, with flat STA as the anchor point for comparison.
1. Hierarchical STA with ETMs: uses extracted timing models for blocks and netlist/SPEF for the top level. Best in runtime/memory, but can show a hit at the top-block interface.
2. Bottom-up analysis with boundary models: a hybrid of ETM and full Verilog, this flow uses a trimmed-down netlist model for sub-blocks, which offers faster TAT along with analysis of interface timing inaccuracies, if any.
a. Comprehensive QoR comparison (memory consumption, runtime, performance/accuracy) across hierarchical and flat methodologies
b. Extended interface-model netlist reduction techniques with similar accuracy
c. Debug techniques to handle clock-mapping issues
3. Top-down analysis with timing-context flat FC-STA: creating context timing in a full-chip flat STA for blocks and performing block STA with actual top-level latencies as constraints. Both SIM and MIM blocks are supported.
4. SmartScope flows: closing the loop between the bottom-up and top-down approaches by creating a handshake between the two flows.
This paper provides data points on a ~150M-instance design, in terms of timing correlation and runtime/memory benefits in comparison to flat STA.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThis study introduces an innovative approach for efficiently closing large SoC designs through an effective hierarchical EM flow. The methodology leverages a hierarchical analysis framework, integrating both top-level and block-level EM considerations to address the complexities of large-scale SoC designs. This approach uniquely combines the granularity of block-level analysis with the holistic perspective of top-level integration, enabling precise identification and mitigation of EM issues without compromising accuracy.
Key elements of this methodology include advanced EM modeling at various hierarchical levels, strategic partitioning of the SoC into manageable blocks, and the use of boundary models to accurately assess EM effects at interconnects.
The results demonstrate a significant reduction in both the time required to close large SoC designs and the memory footprint. This methodology not only enhances the reliability and performance of the SoC but also offers a scalable solution applicable to a wide range of complex integrated circuit designs. The hierarchical top-scope signal EM flow represents a substantial advancement in SoC design methodologies, setting a new benchmark for efficiently addressing electromigration challenges in large, complex SoC designs.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith the continuous advancement of chip packaging technology, the performance of the package core power plays a pivotal role in the operation of the entire chip. Especially for high-performance 2.5D and 3D large-scale ICs, efficient simulation of core power poses significant challenges.
Many indicators of IP power noise target M0 within the die. Backend engineers can utilize a package subckt model to simulate dynamic and static IR drop to verify whether the power noise at M0 meets these indicators. But, limited by tools and methods, package engineers typically simulate power noise only at the bumps.
This paper introduces a fast method for evaluating chip power noise using iCPM. The iCPM is generated by RedHawk-SC with several probe points on M0. Package engineers can then construct a circuit from the iCPM plus package and PCB models; simulating power noise at M0 via SPICE then takes only a few minutes. This method significantly improves simulation efficiency.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionTiming closure is a critical but effort-intensive task in VLSI design. In this paper, we focus on timing-driven placement by considering two important factors: an accurate sign-off timing predictor and a corresponding placement optimization method. For accurate timing analysis, an innovative timing prediction model that can be transformed into a differentiable function is proposed, serving as a replacement for the conventional Elmore delay. While maintaining model accuracy, the overall model complexity is thereby reduced. To evaluate the effectiveness of our timing model, we seamlessly integrate it into the open-source placer DreamPlace. In addition, a pin-to-pin weighting approach based on the differentiable timing model is given for timing optimization. Experimental results show that our differentiable timing prediction model significantly reduces the max and mean timing errors compared to the Elmore delay and exhibits accuracy equivalent to the non-differentiable timing prediction model. The timing performance after placement optimization is better than the result using Elmore delay, i.e., smaller TNS and WNS, while wirelength decreases by 15% on average.
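The paper's model itself is not reproduced here, but a standard ingredient of differentiable timing objectives is replacing the hard max over path arrival times with a log-sum-exp, so every near-critical path receives gradient:

    import torch

    def soft_max_delay(arrival_times, temperature=0.05):
        # Smooth, differentiable surrogate for max(arrival_times); it
        # approaches the true max as temperature -> 0 while spreading
        # gradients across near-critical paths (unlike a hard max).
        return temperature * torch.logsumexp(arrival_times / temperature, dim=0)

    at = torch.tensor([1.00, 0.98, 0.70], requires_grad=True)
    soft_max_delay(at).backward()
    print(at.grad)  # both near-critical paths receive significant gradient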
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionA key distinguishing feature of single flux quantum (SFQ) circuits is that each logic gate is clocked. This feature forces the introduction of path-balancing flip-flops to ensure proper synchronization of inputs at each gate. This paper proposes a polynomial-time approximation algorithm for clocking assignments that minimizes the insertion of path-balancing buffers for multi-threaded multi-phase clocking of SFQ circuits. Existing SFQ multi-phase clocking solutions have been shown to effectively reduce the number of inserted buffers while maintaining high throughput; however, the associated clock-assignment algorithms have exponential complexity and can have prohibitively long runtimes for large circuits, limiting the scalability of this approach. Our proposed algorithm is based on a linear program (LP) that leads to solutions experimentally within 5% of the optimum on average and helps accelerate convergence towards the optimal integer linear program (ILP) based solution. The improved LP and ILP runtimes permit multi-phase clocking schemes to scale to larger SFQ circuits than previous state-of-the-art clocking assignment methods. We further extend the existing algorithm to support fanout sharing of the added buffers, saving, on average, an additional 10% of the inserted DFFs. Compared to traditional full path balancing (FPB) methods across 10 benchmarks, our enhanced LP saves 79.9%, 87.8%, and 91.2% of the inserted buffers for 2, 3, and 4 clock phases, respectively. Finally, we extend this approach to the generation of circuits that completely mitigate potential hold-time violations at the cost of either adding on average less than 10% more buffers (for designs with 3 or more clock phases) or, more generally, adding a clock phase and thereby reducing throughput.
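To show the shape of such a formulation (the paper's LP additionally handles multiple phases and threads), a toy path-balancing LP over a four-gate netlist can be written with difference constraints, whose relaxation is integral:

    import numpy as np
    from scipy.optimize import linprog

    edges = [(0, 1), (0, 2), (1, 3), (2, 3)]  # gate u feeds gate v
    n = 4

    # Variable l[v]: stage index of gate v. Each edge needs l[v] >= l[u] + 1,
    # and the buffers inserted on it number l[v] - l[u] - 1, so minimizing
    # total buffers means minimizing sum(l[v] - l[u]) over all edges.
    c = np.zeros(n)
    for u, v in edges:
        c[u] -= 1.0
        c[v] += 1.0

    A_ub = np.zeros((len(edges), n))
    for i, (u, v) in enumerate(edges):
        A_ub[i, u], A_ub[i, v] = 1.0, -1.0   # encodes l[u] - l[v] <= -1
    b_ub = -np.ones(len(edges))

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n)
    print(res.x)  # e.g., stages [0, 1, 1, 2]: zero balancing buffers needed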
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionMulti-die designs, 2.5DIC and 3DIC, have risen in popularity in the last decade as they offer tremendously increased levels of integration, smaller footprints, performance gains, and more. While they are attractive for many applications, they also create more stringent design bottlenecks in the areas of thermal management and power delivery. For 3DICs, in addition to the complex SoC/PCB interactions seen in their 2D counterparts, we must also account for electrical and thermal coupling between dies.
For these advanced package designs, such as 2.5D/3DIC and chiplets, power, thermal, electromagnetic, and mechanical effects, and their highly coupled interactions, are the primary limiters of entitled performance, yield, and cost. When temperature increases, device leakage power consumption rises and cooling costs grow. Temperature increases can also have a tremendous negative impact on overall design performance, such as interconnect resistance hikes, device performance degradation, and thermally induced noise that can shift the light-wave phase in optical designs.
Higher thermal effects also cause reliability issues, such as electromigration failures, aging, and stress-related failures, so thermal management becomes very important to avoid thermal runaway and reliability problems. However, full 3DIC system thermal analysis with detailed CTMs takes too much time at the sign-off stage, and once thermal issues arise, there is no space left to adjust on the SoC die; in most cases, upgrading the cooling equipment is almost the only option, and its cost is very high. We therefore seek a shift-left method to manage chip thermals in the early stages. Early thermal management can more efficiently avoid thermal runaway, reduce thermal management costs, and give designers more confidence during design sign-off analysis.
Thermal-aware floorplanning and power planning with preliminary collateral in RedHawk-SC Electrothermal at an early stage can analyze and predict power-thermal reliability issues; identifying thermal issues early enables fixes and changes that have a profound effect on reducing failures with minimal design effort. Through early-stage thermal-stress analysis, we can avoid the warpage and solder-joint reliability issues caused by thermal expansion.
Keywords: 3DIC, thermal-aware floorplan, power plan, early-stage thermal management
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionExterior design plays an important role in the automotive design industry and usually requires laborious work by designers. Image editing, a fundamental image manipulation task, has been revolutionized by denoising diffusion models thanks to their great productivity and creativity. However, the application of denoising diffusion models to image editing for automotive design is still limited due to ambiguous editing instructions and uncontrollable output, leading to undesirable, low-quality results. Moreover, training and inference require substantial resources. In this work, we propose a novel image editing framework for automotive design that precisely comprehends human instructions and produces high-fidelity exterior renderings. Meanwhile, it needs only 6.5 GPU hours and 16GB of VRAM to train and 8GB of VRAM for inference, making it more accessible.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAt Renesas, we develop compact and low-power SRAMs for our products. For our SRAM library development, we produce and verify all 10,000+ memory instances generated by our Memory Compiler.
All SRAM IPs must be validated across a wide range of process, voltage, and temperature (PVT) conditions, as well as multiple views and formats for consistency and correctness, including logical, physical, timing, SPICE, and other views. This requires significant time and effort.
To enhance the IP QA process in terms of efficiency and coverage, Renesas has built an SRAM IP QA methodology in collaboration with Siemens' Solido Crosscheck. This methodology includes several custom checks from Renesas, in addition to standard SRAM and IP checks. It covers all relevant front-end and back-end design views for IP production and integration workflows, and enables Renesas to fully validate IPs in significantly less time than before.
In this paper, we will discuss Renesas' efficient SRAM IP QA methodology. Within this methodology, we will also highlight key QA checks for SRAM validation, the importance of such rules, and provide insight into QA efficiency and coverage of the flow.
Research Manuscript


Design
In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionTransformer models equipped with the multi-head attention (MHA) mechanism have demonstrated promise in computer vision tasks, i.e., vision transformers (ViTs). Nevertheless, the lack of inductive bias in ViTs leads to substantial computational and storage requirements, hindering their deployment on resource-constrained edge devices. To this end, multi-scale hybrid models have been proposed to take advantage of both transformers and CNNs. However, existing domain-specific architectures usually focus on optimizing either convolution or MHA at the expense of flexibility. In this work, an in-memory computing (IMC) accelerator is proposed to efficiently accelerate ViTs with a hybrid MHA-and-convolution topology by introducing pipeline reordering. An SRAM-based digital IMC macro is utilized to mitigate the memory access bottleneck while avoiding analog non-ideality. Reconfigurable processing engines and interconnections are investigated to enable adaptable mapping of both convolution and MHA. Under typical workloads, experimental results show that our proposed IMC architecture delivers 2.20× to 2.52× speedup and 40.6% to 74.8% energy reduction compared with the baseline design.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper proposes a layout automation method for area-compact memory design that considers the modification strategies of X-peripheral and Y-peripheral circuits in memory. Traditional template-based methods are hindered by the manual effort required to create templates. To eliminate the need for manual template creation for each circuit, we propose a novel method that reforms the layout based on target locations. In the TSMC 28nm process, the layout automation reduces the peripheral circuit area by 1.79% to 4.08%, decreases dynamic power by 0.76% to 12.86%, and reduces access time by 0.75% to 7.23%.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionShrinking technologies have paved the way for complex devices that integrate various IPs and functionalities in a single SoC; hence, complex clocking structures and efficient power management in AMS IPs are gaining prominence. The same design complexity is reflected in HDL behavioral models, including timing from internal clocks, real-value modeling, and power-aware modeling. Robust behavioral modeling of these complex IPs is needed to enable accurate and efficient functional checks along with timing.
In this paper, the challenges and shortcomings associated with modeling complex AMS IPs for timing simulations are discussed, along with a proposed methodology. We also demonstrate how this methodology handles the correct data-latching issue in the presence of negative timing checks in the design, without compromising any advanced feature supported in the model.
Research Manuscript


Design
SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionGrowing IC manufacturing complexity and reliance on third-party fabrication create supply chain fragility, contributing to chip shortages and IP security risks. General-purpose ICs can mitigate manufacturing security risks but rely on software-based configurations, which are not optimal for high-consequence applications.
Our work proposes a novel IP-agnostic Foundational Cell Array (FC-Array) platform to overcome these challenges. Built on only verified standard cells and industry-standard EDA tools, this platform preserves many advantages of an ASIC. By incorporating 3D split manufacturing, we provide semantically secure IP protection and a base wafer that can be stockpiled. Our tests demonstrate both power-efficient (100 MHz) and high-performance (1 GHz) options. In a post-place-and-route simulated 28nm design, our FC-Array shows a worst-case 1.85x increase in power consumption and a 2.61x increase in area compared to standard cell ASICs for equivalent timing performance.
Research Manuscript


Security
Hardware Security: Primitives, Architecture, Design & Test
DescriptionAs a core arithmetic operation and security guarantee of Fully Homomorphic Encryption (FHE), the Number Theoretic Transform (NTT) of large degree is the primary source of computational and time overhead. In this paper, we propose a scalable and conflict-free memory mapping algorithm that breaks the memory bound and releases a large amount of on-chip resources. A flexible, no-stall hardware/software pipeline architecture is designed to boost the throughput of NTT/INTT for $N=2^{16}$ to over 48,543 operations per second with high area efficiency, a 4× and 10× speedup over the FPGA-based (HPCA'23) and GPU-based (HPCA'23) schemes, respectively.
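For context, the kernel being accelerated is the classic radix-2 NTT; a minimal textbook Python implementation over the common NTT-friendly prime 998244353 is sketched below (purely illustrative of the arithmetic, unrelated to the paper's pipelined hardware):

```python
MOD = 998244353  # = 119 * 2**23 + 1, a common NTT-friendly prime
G = 3            # primitive root mod MOD

def ntt(a, invert=False):
    a, n = list(a), len(a)
    # Bit-reversal permutation (iterative Cooley-Tukey ordering).
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(G, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)  # modular inverse twiddle
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a

x = [1, 2, 3, 4, 0, 0, 0, 0]
assert ntt(ntt(x), invert=True) == x  # forward/inverse round trip
```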
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionResearchers have previously developed advanced analysis tools that identify fault-causing inputs in complex digital designs. One contributing factor to the success of these tools is the availability of publicly available digital designs and open-source execution flows. We observe that the field of analog/mixed-signal (AMS) circuit verification currently lacks an open-source AMS execution flow targeting AMS system designs. We present VerA, an analysis-friendly, open-source AMS modeling and simulation framework that works with open-source digital simulators. VerA's compiler employs optimizations to reduce the state space of the digitized analog model and seamlessly integrates digital and analog blocks, enabling easier analysis of the AMS system.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAn explosion in automotive applications, in the form of driverless cars, complex space exploration, and aviation advancements, has made it mandatory for the design and EDA community to come up with solutions focused on the safety of such devices. Earlier attempts explored safety features as a separate, add-on requirement integrated once the design was fabricated. This required long turnaround times and endless back-and-forth iterations of design modifications to cater to system-level requirements. Moreover, these safety requirements always came at a cost to PPA, and this was considered a non-negotiable aspect of safety implementation. The root cause was the lack of an industry-standard approach for passing the specification, implementation, and modeling of safety-critical systems through the implementation flow from synthesis to routing.
This paper discusses the industry-standard solution from Cadence Design Systems using the Unified Safety Format (USF) in the Midas Safety Platform, which can be seamlessly passed to implementation tools such as Genus, Innovus, and Conformal to provide best-in-class, PPA-aware, safety-intent-driven implementation and verification of chips targeted for automotive devices.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionIn this paper, we provide the first thorough analysis of 64-bit parallel prefix adders (PPAs) and 32-bit matrix multiply units (MMUs) implemented using 7-nm carbon nanotube field effect transistors (CNFETs). Unlike many previous studies in which researchers performed the analysis of CNFET circuits at the SPICE level, we focus on netlists placed and routed using the state-of-the-art CNFET cell library. This approach enables us to analyze a more complex and wider range of CNFET circuits (i.e., various architectures of parallel prefix adders and matrix multiply units) than researchers in previous studies, while considering various effects of the physical layout of the circuits. Our experimental results show that 7-nm CNFET improves energy-delay products (EDPs) by 90× and 44× on average for PPAs and MMUs, respectively, compared to 7-nm FinFET. In addition, our analysis shows that the impact of wires, particularly on power consumption, is more substantial in CNFET circuits than FinFET circuits, and wire savings are therefore crucial for the optimization of the EDP of CNFET circuits. This study opens up a new opportunity to develop a wire-aware design for CNFET circuits.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHigh-sigma analysis is an important topic in circuit design and analysis; it predicts the probability of rare circuit/device failure events in VLSI circuits, such as SRAM arrays. EDA start-ups such as Solido and MunEDA are specifically dedicated to addressing rare-failure-event problems. Importance sampling, tail sampling, and related methods have been used in this area for many years. More recently, the Scaled Sigma Sampling (SSS) method by Prof. X. Li et al. at Carnegie Mellon greatly advanced the analysis of rare failure events. SSS is an extrapolation method, and the EDA industry has welcomed it. However, we have not seen a comparison of the SSS method against a set of known, exact failure probabilities. Without such a benchmark, the validity range and the expected accuracy of the SSS method are unclear. In this work, we fill this gap and also present an improved SSS method.
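A hedged sketch of the kind of benchmark the abstract calls for: Monte Carlo estimates at inflated sigmas, extrapolated back to s = 1 with an assumed SSS-style model form, compared against the exactly known Gaussian tail (the model form and all numbers are illustrative assumptions, not the published SSS formulation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
t = 5.0                      # failure threshold: fail if x > t (a ~5-sigma event)
exact = norm.sf(t)           # known exact failure probability for benchmarking

# Estimate failure probability at artificially inflated sigmas s > 1,
# where failures are frequent enough for plain Monte Carlo.
scales = np.array([1.6, 2.0, 2.5, 3.0])
n = 200_000
p_hat = np.array([(rng.normal(0, s, n) > t).mean() for s in scales])

# Assumed SSS-style model: log P(s) ~ a + b*log(s) + c/s**2 (this form is an
# assumption for illustration; the published SSS model may differ).
A = np.column_stack([np.ones_like(scales), np.log(scales), 1.0 / scales**2])
coef, *_ = np.linalg.lstsq(A, np.log(p_hat), rcond=None)
p_extrap = np.exp(coef @ [1.0, 0.0, 1.0])   # evaluate the model at s = 1

print(f"exact={exact:.3e}  extrapolated={p_extrap:.3e}")
```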
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionMonolithic designs face significant fabrication costs and data movement challenges, especially when dealing with complex and diverse AI models. Advanced 2.5D/3D packaging promises high bandwidth and density to overcome these challenges but also introduces new electro-thermal constraints. This paper presents a suite of analytical performance models to enable efficient benchmarking of 2.5D/3D AI systems. These models cover various metrics of computing units, network-on-chip, and network-on-package, and are integrated into a new tool, HISIM. Benefiting from the accuracy and efficiency of HISIM, we evaluate the potential of 2.5D/3D heterogeneous integration on representative AI algorithms under thermal constraints.
Research Manuscript
Annotating Slack Directly on Your Verilog: Fine-Grained RTL Timing Evaluation for Early Optimization


EDA
Timing and Power Analysis and Optimization
DescriptionIn digital IC design, the early register-transfer level (RTL) stage offers greater optimization flexibility than post-synthesis netlists or layouts. Some recent machine learning (ML) solutions propose to predict the overall timing of a design at the RTL stage, but the fine-grained timing information of individual registers remains unavailable. In this work, we introduce RTL-Timer, the first fine-grained general timing estimator applicable to any given design. RTL-Timer explores multiple promising RTL representations and customizes loss functions to capture the maximum arrival time at register endpoints. RTL-Timer's fine-grained predictions are further applied to guide optimization in a standard logic synthesis flow.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionApproximate computing is emerging as a promising approach to devising energy-efficient IoT systems by exploiting the inherent error-tolerant nature of various applications. In this work, we present Approx-T, a novel design methodology that conducts an in-depth study of Approximate Multiplication Units (AMUs) via Taylor expansion. This paper makes three key contributions: (1) pioneering the incorporation of Taylor's theorem into the design of approximate units; (2) leveraging the inherent symmetric error distribution of the Taylor series to construct unbiased AMUs; and (3) presenting a runtime-configurable error-compensation architecture with low-complexity arithmetic operations. We implemented both approximate integer and floating-point multiplication units and compared them with state-of-the-art approximations; experimental results show that Approx-T outperforms them in all aspects, including precision, area, and power consumption. We also deployed the AMUs on an embedded FPGA for various edge computing tasks; Approx-T achieves up to 5.7× energy efficiency in a CNN application with negligible impact on accuracy.
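As a hedged illustration of first-order-Taylor approximate multiplication (a classic Mitchell-style logarithmic multiplier, not the paper's AMU design): writing x = 2^k (1 + f) gives log2(x) ≈ k + f to first order, and the product is decoded with the same linear approximation:

```python
def mitchell_mul(x: int, y: int) -> int:
    # First-order-Taylor (Mitchell-style) approximate multiplier for
    # positive integers: log2(2**k * (1+f)) ~ k + f with f in [0, 1).
    def log2_approx(v):
        k = v.bit_length() - 1          # leading-one position
        return k, v / (1 << k) - 1.0    # integer and fractional parts
    kx, fx = log2_approx(x)
    ky, fy = log2_approx(y)
    k, f = kx + ky, fx + fy
    if f >= 1.0:                        # carry from the fractional part
        k, f = k + 1, f - 1.0
    return round((1 << k) * (1.0 + f))  # decode with the same approximation

print(mitchell_mul(100, 200), 100 * 200)  # ~18432 vs 20000 (~11% worst case)
```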
Research Manuscript


AI
AI/ML Algorithms
DescriptionLarge language models have greatly advanced the natural language processing paradigm. However, their high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization. Experiments show APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bitwidth of 3.8 on LLaMa-7B and LLaMa-13B, respectively.
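A minimal sketch of the Hessian-trace sensitivity metric the abstract mentions, using Hutchinson's stochastic estimator on a toy quadratic (the function and probe count are assumptions; real use would apply it per layer via Hessian-vector products):

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(hvp, dim, n_probes=256):
    # Hutchinson's estimator: trace(H) ~ mean of v^T (H v) over random
    # Rademacher vectors v, requiring only Hessian-vector products.
    est = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=dim)
        est += v @ hvp(v)
    return est / n_probes

# Toy layer "Hessian": for a quadratic loss 0.5 * w^T H w, hvp(v) = H v.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
print(hutchinson_trace(lambda v: H @ v, dim=2), np.trace(H))  # ~5.0 vs 5.0
```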
Research Manuscript


EDA
Physical Design and Verification
DescriptionThis paper presents a novel reinforcement-learning-trained router for building a multi-layer obstacle-avoiding rectilinear Steiner minimum tree (OARSMT). The router is trained by our proposed combinatorial Monte-Carlo tree search to select a proper set of Steiner points for the OARSMT with only one inference. By using a Hanan grid graph as the input and a 3D U-Net as the network architecture, the router can handle layouts with any dimensions and any routing costs between grids. Experiments on both random cases and public benchmarks demonstrate that our router significantly outperforms previous algorithmic routers and other RL routers using AlphaGo-like or PPO-based training.
Exhibitor Forum


DescriptionBursting EDA workloads from on-prem to cloud is a challenge for most on-prem environments, which are increasingly running out of capacity due to the growing complexity of advanced-node designs. For massively parallelized workloads, such as library characterization, implementation, and physical verification, engineers currently need to split their designs between on-prem and cloud execution if they want to leverage scalable compute capacity on the cloud. Depending on the design, this is a tedious activity that eats away at precious engineering productivity. And once job execution is complete, transferring output data back from the cloud to on-prem and aggregating it with output generated on-prem adds to this overhead. In this session, we will discuss a unique approach to enabling a true hybrid cloud environment architected specifically for EDA workloads: engineers submit a large job as if it were running exclusively on-prem, while the system automatically splits the job, routes selected worker traffic through a secure network for cloud execution, and syncs data generated on the cloud back to on-prem storage for further processing in the flow. Along with license management automation, hybrid cloud optimization can radically improve engineering productivity and enhance the overall cloud experience for SoC design.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionVon Neumann's architecture has played a fundamental role in advancing state-of-the-art computing platforms. Despite its contributions, the architecture's heavy reliance on data movement between memory and processor elements poses a significant challenge. The evolving compute-in-memory (CiM) paradigm offers a promising solution to the memory wall bottleneck by facilitating simultaneous processing and storage within static random-access memory (SRAM) elements. Numerous design decisions taken at different levels of the hierarchy affect the figures of merit (FoMs) of SRAM, such as power, performance, area, and yield. The absence of a rapid mechanism for assessing how changes at different hierarchy levels impact global FoMs makes it hard to accurately evaluate innovative SRAM designs. This paper presents an automation tool designed to optimize the energy and latency of SRAM designs incorporating diverse implementation strategies for executing logic operations within the SRAM. The tool's structure allows easy comparison across different array topologies and design strategies, yielding energy-efficient implementations. Our study comprises a comprehensive comparison of 6,900+ distinct design implementation strategies for the EPFL combinational benchmark circuits on the energy-recycling resonant compute-in-memory (rCiM) architecture designed using TSMC 28nm technology. When provided with a combinational circuit, the tool generates an energy-efficient implementation strategy tailored to the specified input memory and latency constraints. The tool reduces energy consumption by 80.9% on average across all benchmarks, compared to the baseline single-macro topology implementation, by exploiting the parallel processing capability of rCiM cache sizes ranging from 4KB to 192KB.
Research Manuscript


Security
Hardware Security: Attack and Defense
DescriptionSecurity practices in the field of machine learning (ML) encompass a range of measures; one notable strategy involves concealing the architecture of ML models from users, adding an extra layer of protection. This proactive strategy serves multiple key purposes, including safeguarding intellectual property, mitigating model vulnerabilities, and preventing adversarial attacks. In this work, we propose a novel fingerprinting attack that identifies a given ML model's architecture family from among the latest categories. To this aim, we are the first to leverage a frequency-throttling side-channel attack, a method that enables us to convert power side-channel information into timing variations at the user-space level. We utilize the timing information of crafted adversary kernels combined with a supervised machine learning classifier to identify the ML model architecture. In particular, our proposed method captures timing information by monitoring an adversary kernel's execution time while a specific ML model runs, unveiling distinctive timing patterns; this involves initiating the frequency-throttling side-channel effect and transforming it into timing information. Subsequently, we employ a specialized machine learning classifier trained on this timing data to precisely identify the victim's ML model architecture. With this approach, we achieve 98% accuracy in correctly classifying a known ML model into its corresponding architecture family. Furthermore, our attack demonstrates transferability by accurately assigning the correct family to unseen models with 90.6% accuracy on average. Additionally, for thorough analysis, we have reproduced this attack across three different platforms, with comparable results underscoring the attack's platform portability. Finally, we intend to publicly release our work, making it accessible to the research community for reproducibility.
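A hedged sketch of the final classification step (synthetic stand-in timing data plus an off-the-shelf supervised classifier; the feature construction below is invented for illustration, since real traces come from measuring the adversary kernel on hardware):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for measured adversary-kernel timings: each victim
# architecture family induces a characteristic shift/shape in the timing
# vector (purely illustrative; real traces come from the hardware).
families, n_per, dim = 4, 300, 64
X = np.concatenate([rng.normal(loc=f * 0.3, scale=1.0, size=(n_per, dim))
                    + np.sin(np.arange(dim) * (f + 1))      # family "pattern"
                    for f in range(families)])
y = np.repeat(np.arange(families), n_per)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"family classification accuracy: {clf.score(X_te, y_te):.2%}")
```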
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAAET (Architecture Area Evaluation Tool) is designed to address the pressing need for accurate and unified area estimation for future devices. A precise estimate plays an important role in determining the approximate area cost of a device so as to meet market requirements.
AAET aims to enable seamless data consolidation by efficiently processing information from various legacy devices, which can be configured by frequency, memory, processor features, etc., in order to adapt to the changing requirements of future devices.
The dashboard provides a user-friendly and efficient tool for estimating area (synthesis and PD) through interactive GUI features, serving as a data-analysis tool with the goal of reducing workload, preventing manual errors, and facilitating data-driven decision-making for competitive advantage.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionMost existing work reveals that deep learning systems are extremely susceptible to adversarial examples (AEs), a finding that continues to reverberate around the DL testing community. Consequently, adversarial attacks are exploited to test the robustness of DL models, especially optimized gradient-based techniques in white-box testing. Although AEs have achieved competitive fault-revealing ability and coverage-improvement ability in DL testing, little research analyzes the phenomenon theoretically. In this work, we give a formal analysis of the relationship between gradient-based attacks and the loss minima of the loss function to prove that powerful adversaries will share similar feature representations with high probability. Our extensive evaluation and theoretical analysis reveal (1) that optimized gradient-based techniques cover only a limited set of decision logic, which plainly contradicts the diversity required of test suites, (2) the reasons why adversarial examples can increase test coverage, and (3) the weaknesses of AEs compared with search-based and fuzz-based test-suite generation techniques. Finally, our results prove that AEs can efficiently discover vulnerabilities in a DL model but are not suitable for exploring more of its inner logic as test suites.
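To ground the discussion of optimized gradient-based attacks, here is a minimal FGSM-style sketch on a toy logistic model (the model, epsilon, and data are assumptions; the sign-of-gradient step is the standard technique):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logistic model p(y=1|x) = sigmoid(w.x + b); FGSM perturbs the input
# along the sign of the loss gradient: x' = x + eps * sign(dL/dx).
w, b = rng.normal(size=16), 0.1
x, y = rng.normal(size=16), 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For cross-entropy loss, dL/dx = (p - y) * w (standard logistic gradient).
p = sigmoid(w @ x + b)
x_adv = x + 0.25 * np.sign((p - y) * w)

print(f"clean score {sigmoid(w @ x + b):.3f} -> "
      f"adversarial {sigmoid(w @ x_adv + b):.3f}")
```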
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionAs semiconductor technology has advanced, IR-drop challenges have increased considerably in recent years. Dynamic IR-drop in particular has become a bigger factor in functional failures, and this will remain true for advanced process nodes at 5nm and below. We need post-layout VCD files covering various actual scenarios to determine whether IR-drop issues exist, but these post-VCD files are not available until the end of the design cycle, which is too late. Fixing IR-drop issues at the final stage of the design cycle, when post-VCD files finally become obtainable, is time-consuming, painful, and sometimes nearly impossible.
The only way to resolve this situation is to find out where the design is weak to dynamic IR-drop as early as possible, which is why we propose an areal and time-decomposed, Phalanx-based DNN (Deep Neural Network) methodology. Using this methodology, we choose the Phalanx most similar for DNN modeling and predict IR-drop on the new design. We can find out where the PDN (Power Distribution Network) is weak even without layout routing information, which is essential in the traditional flow, and can fix issues and strengthen the PDN at a very early stage of the design cycle.
This method shows an IR-drop accuracy of over 95% and reduces the iteration time to fix IR-drop violations by roughly 40%.
The areal and time-decomposed, Phalanx-based DNN methodology has been verified using a commercial tool, Cadence Voltus.
Research Manuscript


AI
AI/ML Application and Infrastructure
DescriptionThis paper presents Artisan, an automated operational amplifier design framework using large language models. We develop a bidirectional representation to align abstract circuit topologies with their structural and functional semantics. We further employ Tree-of-Thoughts and Chain-of-Thoughts approaches to model the design process as a hierarchical question-answer sequence, implemented through a mechanism of multi-agent interaction. A high-quality opamp dataset is developed to enhance the design proficiency of Artisan. Experimental results demonstrate that Artisan outperforms state-of-the-art optimization-based methods and benchmark LLMs in success rate, circuit performance metrics, and interpretability, while accelerating the design process by up to 50.1×. Artisan will be released for public access.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionLarge language models (LLMs), like ChatGPT, have been shown to be quite effective at information retrieval. By leveraging conversational AI, we have extended the functionality of our in-house Stack Overflow-like system. We provide a virtual assistant capable of answering the questions about design, technology, and tools that our design team needs. We ingest design manuals, methodology and tool documentation, and education materials, and use retrieval-augmented generation with an LLM to respond to queries. We have built a private, on-premises system that keeps our confidential data in-house. We'll show early progress on this project.
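A minimal sketch of the retrieval-augmented-generation pattern described above, assuming TF-IDF retrieval and a placeholder for the on-prem LLM call (the corpus, query, and generate step are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny corpus standing in for ingested design/tool documentation.
docs = ["Use set_max_delay to constrain asynchronous paths.",
        "The memory compiler supports 10,000+ SRAM instances.",
        "Run IR-drop signoff before tapeout."]

vec = TfidfVectorizer().fit(docs)
D = vec.transform(docs)

def retrieve(query, k=1):
    # Rank documents by cosine similarity to the query.
    sims = cosine_similarity(vec.transform([query]), D)[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]

def answer(query):
    context = " ".join(retrieve(query))
    # The string below stands in for the on-prem LLM call (hypothetical).
    return f"[context: {context}] -> LLM answer to: {query}"

print(answer("How do I constrain async paths?"))
```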
Research Panel


AI
EDA
DescriptionAccording to data provided by the World Health Organization, it is a grim reality that more than 1.3 million people lose their lives annually due to road traffic accidents, and a staggering 20 to 50 million more are left with non-fatal injuries. These disheartening statistics serve as a stark reminder of the urgent need for improved safety measures in the automotive industry.
Historically driven by the pursuit of creating vehicles that captivate and exhilarate consumers, the automotive sector has increasingly shifted its focus toward fostering a robust safety culture. This transformation has not always been an organic process, as governments worldwide have often led the charge, pushing for greater vehicular safety through stringent regulations. These regulatory frameworks, which initially took root in Europe and China, have now been rapidly disseminated globally. Consequently, automakers have found themselves compelled to make safety an integral and non-negotiable facet of their automotive solutions.
The impending European Safety Regulations, set to become an industry standard, have been significantly motivated by the rapid evolution of automotive technology and an unwavering commitment to the safety of both drivers and passengers. A pivotal component of this technological revolution is interior sensing, which plays a critical role in monitoring drivers for distraction and fatigue, as well as tracking the movements of vehicle occupants.
This distinguished panel of experts brings together some of the foremost sensor and system-on-chip (SoC) suppliers and in-cabin monitoring specialists who are pivotal in driving the burgeoning interior sensing market. Their collective aim is to deliberate on topics ranging from emerging technology trends to innovative packaging options, seamless connectivity, and integration points for Advanced Driver Assistance Systems (ADAS), including transformative Driver and Occupant Monitoring System technology.
Recognizing that human drivers are inherently prone to errors, safety technology providers adopt a holistic systems approach to assist, enhance, and even assume control of the driving task when necessary. In-cabin monitoring emerges as a crucial element within this overarching strategy. Overcoming challenges related to cost, packaging constraints, and system complexity, hardware and application vendors continually push the boundaries of innovation, seeking novel ways to optimize their designs to support efficient and cost-effective in-cabin monitoring solutions.
The panel discussion, featuring prominent figures from industry and university leaders such as Seeing Machines, Qualcomm, Texas Instruments, Ambarella, OmniVision, and TU Braunschweig, will delve deep into the dynamic sensor and SoC market for in-cabin monitoring. They will explore critical issues, including how in-cabin monitoring technology underpins the global safety agenda, suppliers' preferences regarding packaging locations, the pros and cons of various integration approaches, and the implications for Original Equipment Manufacturers (OEMs), who must ensure that safety and convenience remain paramount in their offerings. There are a variety of differing opinions, and it is these differing opinions that will be brought forth in this panel.
The panel is aimed at students, researchers, and practitioners. Students will understand the state of the art and the challenges. Researchers will be able to examine industrial problems which are still open, and industry practitioners will be able to understand the available solutions and the industry trends.
The panel aims to engage in a comprehensive discussion surrounding critical questions, including but not limited to:
How can we best support a low-cost and low-power-consumption market?
Which aspects or components of sensor and SoC design should we prioritize for future advancements?
What are the foremost challenges associated with Artificial Intelligence (AI) in designing sensors and SoCs?
Where should we channel our Research and Development (R&D) efforts?
Which packaging configurations are poised to dominate the automotive market?
How vital is cybersecurity in this context?
What obstacles do we face in implementing AI techniques for in-cabin monitoring?
How are these cutting-edge designs rigorously tested to ensure their efficacy and safety?
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe increasing prevalence of device aging significantly complicates the timing analysis of digital circuits, especially given the time-consuming nature of current methodologies, which struggle with the variety of standard cells and diverse input conditions. Addressing this challenge, this work proposes a novel, design-friendly framework for efficient and rapid aging-aware timing analysis. The framework harnesses hybrid graph neural networks to capture cell structural details and extract delay-related information, enabling a straightforward mapping from operational conditions to specific cell aging delays. It incorporates a Relational Graph Convolution Network (R-GCN) to model the complex relationships between nodes and a Graph Attention Network (GAT) to assess the relative importance of each node based on its type. This integrated approach significantly streamlines aging-aware timing analysis, offering a substantial improvement in both speed and accuracy for digital circuit design. Our framework achieves 5% to 28% higher average prediction accuracy and better generalization to new cells than other benchmark networks; compared with the conventional method, it greatly reduces time consumption, achieving an average acceleration ratio of 600 on prediction tasks spanning a large number of cell structures and input conditions.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionPerforming per-packet neural network (NN) inference on the network data plane is required for high-quality, fast decision-making in computer networking. However, data plane architectures like the Reconfigurable Match Tables (RMT) pipeline have limited support for NNs. Previous efforts have utilized Binary Neural Networks (BNNs) as a compromise, but the accuracy loss of BNNs is high. Inspired by the accuracy gain of the two-bit model, this paper proposes Athena. Athena can deploy sparse low-bit quantization (two-bit and four-bit) models on RMT. Compared with the BNN-based state of the art, Athena is cost-effective regarding accuracy-loss reduction, inference latency, and chip-area overhead.
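A hedged sketch of generic symmetric uniform low-bit quantization of the sort Athena deploys (two-bit and four-bit); this is a textbook quantizer, not Athena's exact scheme, and the weights below are random stand-ins:

```python
import numpy as np

def quantize(w, bits):
    # Symmetric uniform quantizer mapping weights to 2**bits signed levels.
    qmax = 2 ** (bits - 1) - 1            # e.g. 1 for 2-bit, 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(int), scale           # store ints; dequantize as q * scale

w = np.random.default_rng(0).normal(size=8)
q2, s2 = quantize(w, 2)                   # two-bit: levels {-2, -1, 0, 1}
q4, s4 = quantize(w, 4)                   # four-bit: levels {-8, ..., 7}
print(q2, q4)
```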
Front-End Design


AI
Design
Engineering Tracks
Front-End Design
DescriptionIn the ever-evolving semiconductor technology landscape of complex SoCs and systems, integrating ChatGPT-like AI transformers into IP/SoC design verification could unleash a transformative wave of verification automation, thereby contributing to more robust designs.
IPs and SoCs underpin many modern electronic systems, such as HPC/AI and automotive SoCs. While functional correctness is crucial, it no longer suffices for real-world applications and usage. In this paper we explore harnessing a lightweight generative AI BER transformer model in verification: it redefines how we interact with textual data, including hardware design specifications, and pushes verification toward completeness by suggesting extra scenarios for performance and security aspects. It bridges the gap between 'what' a system does, 'how well' it performs, and 'how securely' it operates, and addresses the grey areas in system-level verification that cannot be captured at the IP or sub-system level.
We can scale this model to the SoC level and address verification challenges for miscellaneous SoC IPs such as GPIO, DFT mux, low-power elements, and safety elements.
This paper highlights the power of using generative AI in verification; augmenting verification with AI can help us catch bugs and issues early in the verification life cycle.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn our 2.5D/3D System-on-Chip (SoC) designs developed at advanced (<10nm) technology nodes, it is crucial to ensure that IR drop stays within the signoff threshold limits in order to achieve the targeted PPA goals.
Traditionally this involves multiple iterations of IR simulations, after which the engineer identifies the IR- and timing-critical areas in the design that need to be improved. Manual identification of even a handful of regions poses a significant bandwidth impact.
Utilizing the k-means clustering algorithm, we have developed an end-to-end pipeline where the engineer can:
• Provide the IR threshold limit, and the algorithm will return the list of regions where instances with drop higher than the threshold are clustered.
• Provide the type of cell that is resistance-critical, and the algorithm will return the list of regions where instances of the specified cell type are clustered (example: level shifters).
• Provide the instance toggle-rate data, and the algorithm will cluster the regions based on a user-given high-toggle-rate threshold.
The regions are provided as bounding boxes, which can then be incorporated into PnR flows for PG-grid reinforcement, VT swap to downsize cells, etc.
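A minimal sketch of the first pipeline step under stated assumptions (instance locations and IR-drop values are random stand-ins for report data; scikit-learn's KMeans does the clustering, and a bounding box is taken per cluster):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Instance (x, y) locations and IR-drop values (stand-ins for report data).
xy = rng.uniform(0, 1000, size=(5000, 2))
drop = rng.normal(30, 8, size=5000)

threshold = 45.0                       # engineer-supplied IR threshold (mV)
hot = xy[drop > threshold]             # instances violating the threshold

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(hot)
for c in range(km.n_clusters):
    pts = hot[km.labels_ == c]
    (x0, y0), (x1, y1) = pts.min(axis=0), pts.max(axis=0)
    print(f"region {c}: bbox ({x0:.0f},{y0:.0f}) - ({x1:.0f},{y1:.0f})")
```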
Research Manuscript


AI
Design
AI/ML System and Platform Design
DescriptionImage Signal Processor (ISP) is widely used in intelligent edge devices across various scenarios. The intricate and time-consuming tuning process demands substantial expertise. Current AI-based auto-tuning operates discretely offline, relying on predefined scenes with human intervention, leading to inconvenient manipulation, with potentially fatal impacts on downstream tasks in unforeseen scenes. We propose a real-time automatic hyperparameter optimization ISP hardware system to address real-world scenarios. Our design features a tri-step framework and a hardware accelerator, demonstrating superior performance in human and computer vision tasks, even in real-time unforeseen scenes. Experiments showcase its practicality, achieving 1080P@75FPS/240FPS in FPGA/ASIC, respectively.
Exhibitor Forum


AI
DescriptionArtificial Intelligence (AI), particularly Large Language Models (LLMs), has revolutionized the landscape of Hardware Description Language (HDL) generation in digital design. This breakthrough technology holds immense promise for streamlining design processes and accelerating innovation. However, the probabilistic nature of LLMs poses unique challenges in HDL generation, frequently leading to inaccurate code predictions. This is a crucial concern in hardware design, where precision is paramount.
To address this critical challenge, we introduce AutoDV, an innovative LLM-based architecture designed to enhance the precision and reliability of AI-generated HDL code. At its core lies a system of interconnected, specialized, and compact LLMs, each meticulously crafted to handle specific aspects of the HDL generation process. This approach not only enables AutoDV to leverage the collective strengths of individual LLMs, but also fosters synergistic interactions among them.
AutoDV's groundbreaking capabilities stem from its two key components: the capability of automatically interfacing with external verification tools and a comprehensive library of pre-defined IPs. By seamlessly interfacing with established verification tools, AutoDV ensures rigorous Design Verification (DV), minimizing the risk of propagating errors to subsequent design stages. Additionally, AutoDV's IP library empowers LLMs to directly access and utilize these well-established and rigorously verified design components, significantly elevating the accuracy of the generated HDL code.
In this presentation, we will explore the technical underpinnings of AutoDV, beginning with an overview of its architecture and then examining the synergism between its components. The presentation will conclude with a practical demonstration.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper proposes a novel method for automatically inferring message flow specifications from the communication traces of a system-on-chip (SoC) design that captures messages exchanged among the components during a system execution.
The inferred message flows characterize the communication and coordination of components in a system design for realizing various system functions, and they are essential for SoC validation and debugging.
The proposed method relieves the burden of manual development and maintenance of such specifications on human designers.
We also develop a new accuracy metric, the acceptance ratio, to evaluate the quality of the mined specifications instead of the specification size often used in previous work, enabling more accurate specifications to be mined.
The effectiveness of the proposed method is evaluated on both synthetic traces and traces generated from executing several system models in GEM5.
In both cases, the proposed method achieves superior accuracies compared to a previous approach.
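Since the abstract does not define the metric precisely, here is one simplified, assumed reading of an acceptance ratio sketched in Python (greedy subsequence matching of mined flows against traces; the definition, traces, and flows are all illustrative assumptions):

```python
def acceptance_ratio(traces, flows):
    # Assumed reading of the metric: the fraction of trace messages that
    # some mined flow accepts as part of a (greedily matched) occurrence.
    accepted = total = 0
    for trace in traces:
        total += len(trace)
        for flow in flows:                  # flow = ordered message sequence
            i = 0
            for msg in trace:               # greedy subsequence matching
                if i < len(flow) and msg == flow[i]:
                    accepted += 1
                    i += 1
    return accepted / total

traces = [["req", "ack", "data", "done"], ["req", "data", "done"]]
flows = [["req", "ack", "done"]]
print(f"acceptance ratio: {acceptance_ratio(traces, flows):.2f}")
```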
Tutorial


Autonomous Systems
DescriptionThe contemporary era struggles with the intricate challenge of designing "complex systems".
These systems are characterized by intricate webs of interactions that interlace their components, giving rise to multifaceted complexities, springing from at least two sources.
First, the co-design of complex systems (e.g., a large network of cyber-physical systems) demands the simultaneous selection of components arising from heterogeneous natures (e.g., hardware vs. software parts), while satisfying system constraints and accounting for multiple objectives.
Second, different components are interconnected through interactions, and their design cannot be decoupled (e.g., within a mobility system).
Navigating this complexity necessitates innovative approaches, and this tutorial responds to this imperative by focusing on a monotone theory of co-design.
Our exploration extends from the design of individual platforms, such as autonomous vehicles, to the orchestration of entire mobility systems built upon such platforms.
In particular, we will delve into the theoretical foundations of a monotone theory of co-design, establishing a robust mathematical framework and its application to a diverse array of real-world problems, revolving around the domain of embodied intelligence.
The presented toolbox empowers efficient computation of optimal design solutions tailored to specific tasks and, in its novelty, paves the way for several possibilities for future research.
This tutorial will focus on the particular application of computational design of autonomous systems, featuring both a technical and a practical session.
Participants will have the opportunity to explore dedicated demos and "learn by doing" through guided exercises.
The tutorial provides participants with an introduction to robot co-design and aims to connect multiple communities to enable the development of composable models, algorithms, fabrication processes, and hardware for embodied intelligence.
It is intended to be accessible from any background and seniority level and will present applications to a wide array of topics of interest to the design automation and robotics communities.
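To give a flavor of the underlying idea, here is a toy Python sketch of monotone co-design composition under simplifying assumptions (the linear component models are invented for illustration; the tutorial's formalism is far more general): each design problem maps a required functionality to the minimal resource providing it, and series composition feeds one component's resource demand into the next as its functionality requirement.

# Toy sketch of monotone co-design (illustrative only, not the tutorial's
# actual formalism). A design problem maps a required functionality level to
# the minimal resource needed to provide it; monotonicity means asking for
# more functionality never requires fewer resources.

def motor(speed_required):          # functionality: speed -> resource: power [W]
    return 20.0 + 3.0 * speed_required

def battery(power_required):        # functionality: power -> resource: mass [kg]
    return 0.5 + 0.01 * power_required

def series(dp_first, dp_second):
    """Compose: the first problem's resource is the second's functionality."""
    return lambda f: dp_second(dp_first(f))

vehicle = series(motor, battery)    # speed -> minimal battery mass
for speed in (1.0, 2.0, 4.0):
    print(f"speed {speed}: mass {vehicle(speed):.2f} kg")  # non-decreasing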
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIP cores require integration into top-level subsystems and/or SoCs. Writing constraints manually for a top-level design is error-prone and difficult to verify and manage. This Synopsys webinar will cover how automated SDC constraint promotion from the IP to the SoC level provides high-quality SDC relative to traditional, time-consuming manual approaches. We will demonstrate the approach taken and the benefits observed using automated constraint promotion and generation on an early PCIe® Gen 6 design, resulting in shorter TAT and improved PPA. Lastly, we will illustrate how designers can ensure constraint correctness is maintained or bettered during the constraint promotion effort.
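The webinar's flow is tool-internal, but the core idea of promotion can be illustrated with a hypothetical sketch: IP-level SDC objects are re-scoped into the SoC hierarchy by prefixing the IP's instance path (the regex and names below are illustrative assumptions, not the Synopsys implementation).

# Hypothetical illustration of constraint "promotion": re-scoping an IP-level
# SDC constraint to the SoC level by prefixing the IP's instance path.
# This mimics the concept only; the actual Synopsys flow is tool-internal.

import re

def promote_sdc(line, instance_path):
    """Prefix hierarchical object names inside get_ports/get_pins calls."""
    def rescope(match):
        name = match.group(2)
        # An IP port becomes a pin on the IP instance at SoC level.
        return f"[get_pins {instance_path}/{name}]"
    return re.sub(r"\[get_(ports|pins)\s+([\w/\[\]]+)\]", rescope, line)

ip_sdc = "set_false_path -from [get_ports rst_n] -to [get_pins u_sync/ff0/D]"
print(promote_sdc(ip_sdc, "u_soc/u_pcie0"))
# -> set_false_path -from [get_pins u_soc/u_pcie0/rst_n]
#    -to [get_pins u_soc/u_pcie0/u_sync/ff0/D]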
IP


Engineering Tracks
IP
DescriptionIn recent times, the increased size of SoCs has made static verification time- and memory-consuming. In an SoC that contains billions of design elements, a few cases of missing/false violations or long runtime issues get reported by customers on any static tool. When such issues are reported at the final sign-off stage of the chip, they become a gating issue for the static tool, and static tool vendors are expected to provide a fix in the tool on urgent priority.
To fix any issue in the tool, an R&D engineer first needs to identify the root cause. The following are the traditional ways of root-cause identification in a big design:
1. Using debug prints
2. Applying a debugger to the code execution
3. Using code profiling tools
4. Reducing the size of the design by black-boxing unrelated portions
Finding the root cause with the above methods and providing a quick fix in the tool takes time because:
R&D and AE may not have direct access to the design.
Shipping the design to a secure network is difficult or takes time.
A high number of debug prints makes it difficult to find the root cause.
Attaching debuggers to a large design is cumbersome and slow.
From the debug fields in violations and other reports, R&D or the field has only limited knowledge of the design scenarios, making it difficult to create a unit reproducer.
It has often been observed that having a small reproducer in hand significantly reduces the turnaround time for delivering a fix. To overcome this challenge, we have developed a utility in our tool that can create a small reproducer out of a big design.
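The utility itself is proprietary, but a common foundation for such reproducers is fan-in cone extraction around the objects named in a violation, with everything outside the cone black-boxed. A toy sketch of that extraction (the netlist format and names are assumptions):

# Hypothetical sketch of the core idea behind a design reproducer: extract
# the fan-in cone of the objects named in a violation and black-box the rest.
# The real utility is tool-internal; this only illustrates the extraction.

from collections import deque

def fanin_cone(netlist, seeds, depth=3):
    """netlist: dict mapping each cell to the cells driving it."""
    keep, frontier = set(seeds), deque((s, 0) for s in seeds)
    while frontier:
        cell, d = frontier.popleft()
        if d == depth:
            continue
        for driver in netlist.get(cell, ()):
            if driver not in keep:
                keep.add(driver)
                frontier.append((driver, d + 1))
    return keep

design = {"viol_ff": ["and1"], "and1": ["ff_a", "ff_b"],
          "ff_a": ["pad_in"], "ff_b": ["big_block"], "big_block": ["pad2"]}
cone = fanin_cone(design, ["viol_ff"], depth=2)
print(sorted(cone))   # cells kept in the reproducer; all others black-boxed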
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionConventional hierarchical design planning flows are neither runtime- nor resource-efficient for a) quick floorplan porting during process node evaluation and library bring-up with minimal dependency, or b) what-if exploration to hasten block convergence with improved local FP optimization and to identify critical limiters for different partition layout topologies. The scaling framework is a one-stop solution capable of operating on bare-minimum baseline floorplan information to port floorplans even without any netlist or memory collateral. The framework can generate a basic floorplanning-compatible netlist and scaled library memory collateral from baseline floorplans on a different node/library. It can also enable evaluation of block convergence recipes and floorplan utilization or frequency sweeps through macro placement techniques, including ML macro placement suitably augmented with additional algorithmic pin placement intelligence to retain global context. The framework has evolved into the de facto early floorplan execution flow, scaling and porting floorplans between libraries, nodes, and even foundries, and improving work-model execution efficiency by 16X and resource efficiency by 3X for each partition. It has also been a key pillar in block optimization exploration during later execution milestones, saving 2-4 weeks of convergence effort on 80% of blocks with pre-configured techniques and strategies.
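As a purely illustrative sketch of the geometric core of such porting (the production framework also handles netlist and collateral generation, macro placement, and more), a baseline floorplan can be scaled by per-axis factors derived from the target node:

# Hypothetical sketch of the geometric core of floorplan porting: scale a
# baseline floorplan's macro placements by per-axis factors derived from the
# target node/library, preserving relative placement. Geometry only.

def scale_floorplan(macros, sx, sy):
    """macros: {name: (x, y, w, h)} in baseline-node microns."""
    return {name: (x * sx, y * sy, w * sx, h * sy)
            for name, (x, y, w, h) in macros.items()}

baseline = {"cpu0": (0.0, 0.0, 120.0, 80.0), "sram0": (130.0, 0.0, 60.0, 80.0)}
ported = scale_floorplan(baseline, sx=0.78, sy=0.82)   # assumed shrink factors
for name, box in ported.items():
    print(name, [round(v, 1) for v in box])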
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionASIPs are attractive for their high energy efficiency. However, the design effort for ASIPs is time-consuming and error-prone. We present an automatic design framework that generates out-of-order ASIPs from ISA documents via a nano-operator (nOP) abstraction. The key insight is that the proposed nOPs are semantically aligned and functionally complete. Therefore, we first leverage LLMs to generate nOP graphs from ISA documents, then propose an nOP fusion algorithm to optimize them, and finally generate the corresponding OoO ASIPs. Experiments show that, compared with SOTA LLM-assisted methods, our approach generates a processor with 5818x larger area without HDL modification. Furthermore, our processor achieves a 3.33x speedup compared with a general-purpose CPU.
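The nOP semantics are specific to the paper, but the flavor of a fusion pass can be sketched generically: merge linear chains of operators where each producer has a single consumer (the graph encoding below is an assumption for illustration, not the paper's algorithm).

# Generic sketch of operator fusion on a dataflow graph, in the spirit of the
# paper's nOP fusion step (the actual nOP semantics are specific to the work).
# Linear chains where each node has exactly one consumer are merged.

def fuse_chains(ops, edges):
    """ops: list of op names; edges: list of (producer, consumer)."""
    consumers, producers = {}, {}
    for p, c in edges:
        consumers.setdefault(p, []).append(c)
        producers.setdefault(c, []).append(p)

    def chain_head(op):
        ps = producers.get(op, [])
        return len(ps) != 1 or len(consumers.get(ps[0], [])) != 1

    fused = []
    for op in ops:
        if not chain_head(op):
            continue                   # op will appear inside another chain
        chain = [op]
        while len(consumers.get(chain[-1], [])) == 1:
            nxt = consumers[chain[-1]][0]
            if len(producers.get(nxt, [])) != 1:
                break
            chain.append(nxt)
        fused.append("+".join(chain))
    return fused

print(fuse_chains(["ld", "shift", "add", "st"],
                  [("ld", "shift"), ("shift", "add"), ("add", "st")]))
# -> ['ld+shift+add+st']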
Embedded Systems and Software


AI
Embedded Systems
Engineering Tracks
DescriptionReinforcement learning has demonstrated optimization performance in various simulation environments, yet there has been limited evidence of its effectiveness in real-world scenarios.
In this study, we applied offline reinforcement learning in an SSD simulator with real product-level complexity. Attempting to design test cases that impose high loads on the SSD, we confirmed a reduction of over 50% in test input quantity compared to random testing.
To overcome the high complexity, we transformed the extensive input range supported by the product into an optimal range, reflecting product characteristics. We effectively represented internal information using a Graph Neural Network.
We propose an automated test generation framework that reuses the trajectories generated during the agent training process for training.
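Trajectory reuse in this setting typically means logging transitions from agent rollouts into a dataset that later serves offline training. A minimal sketch under assumed interfaces (the SSD simulator, GNN state encoder, and reward are stand-ins):

# Minimal sketch of trajectory reuse for offline RL. Shapes and rewards are
# assumptions; the paper's SSD simulator and GNN encoder are not reproduced.
# Trajectories logged during agent runs become the offline training dataset
# instead of being discarded.

import random

class TrajectoryStore:
    def __init__(self):
        self.transitions = []          # (state, action, reward, next_state)

    def log(self, s, a, r, s2):
        self.transitions.append((s, a, r, s2))

    def sample(self, batch_size):
        return random.sample(self.transitions, batch_size)

store = TrajectoryStore()
state = 0.0
for step in range(100):               # stand-in for SSD-simulator rollouts
    action = random.choice([-1, 1])
    reward = -abs(state)              # toy reward: keep the load near a target
    next_state = state + 0.1 * action
    store.log(state, action, reward, next_state)
    state = next_state

batch = store.sample(16)              # reused later for offline training
print(len(store.transitions), len(batch))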
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionMost SoCs today have analog or mixed-signal blocks, such as SerDes cores, DACs, ADCs, PLLs, and other transceivers. Many analog blocks have digital control logic. As such, an increasing amount of analog IP is mixed-signal, and with rapidly increasing SoC capacity, a single IP block might represent an extremely complex mixed-signal function. Currently, a sizable part of mixed-signal design implementation is done manually, which is a slow and laborious process that can lead to design errors and numerous iterations. The blocks are placed and routed using a semi-manual process, without the aid of design-rule-correct automation. In this paper, we introduce a methodology to automate the placement and routing of such digital/mixed-signal blocks with LVS and DRC awareness. Within a few clicks, the digital block is placed and routed with the addition of boundary cells, tap cells, and fills. The solution can read user constraints and enhances the quality of routing.
Back-End Design


Back-End Design
Design
Engineering Tracks
DescriptionTI supports various packaging technologies, which brings forth the challenge of thermal modeling and analysis: design teams grapple with the intricacies of mastering thermal modeling tools for diverse package families. The current process involves time-consuming manual effort in creating intricate package geometry and PCB setups with CAD tools, often resulting in errors.
The collaborative dance between design teams and centralized units prolongs the thermal modeling iteration cycle to 2+ weeks. In response, an automated solution is proposed to streamline this process, reducing the timeline to around 2 days. This automation liberates design teams from the need for extensive CAD/modeling tool familiarity, empowering them to conduct thermal modeling independently without overreliance on centralized teams.
This shift toward automation not only addresses efficiency but also marks a practical evolution in product development. It promises a smoother journey through the complexities of thermal modeling and analysis, reflecting a commitment to innovation while maintaining a grounded approach to practical implementation.
Back-End Design


Back-End Design
Design
Engineering Tracks
DescriptionThe identification of layout constraints in analog circuits, such as symmetry, matching, etc., has become a crucial task to meet increasingly aggressive design specifications, especially in new process nodes where parasitic effects can have a severe impact on circuit performance and lifetime. However, the manual annotation of such constraints requires design expertise and is a challenging and error-prone task. In this paper, we propose an unsupervised node embedding method on the circuit netlist graph to capture topological similarities between nodes. We evaluate our method on open-source and in-house analog circuit designs to validate the ability of this new approach to identify symmetry constraints. Compared to other solutions based on machine learning (ML) techniques recently proposed in the literature that rely on annotated netlist datasets, this unsupervised solution does not need any prior knowledge, which is usually extracted during a computationally expensive machine learning phase.
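The paper's learned embedding is its own contribution; one simple, purely structural way to convey the topological-similarity idea is Weisfeiler-Lehman-style neighborhood hashing, where nodes that end up with identical signatures become symmetry candidates (toy netlist below; not the paper's method):

# Unsupervised structural-signature sketch (Weisfeiler-Lehman-style neighbor
# hashing) for proposing symmetry candidates in a netlist graph. The paper
# uses a learned node embedding; this only illustrates topological similarity.

def wl_signatures(adj, labels, rounds=3):
    sig = dict(labels)                            # initial label, e.g. device type
    for _ in range(rounds):
        sig = {n: hash((sig[n], tuple(sorted(sig[m] for m in adj[n]))))
               for n in adj}
    return sig

# Toy differential pair: M1/M2 are matched NMOS devices sharing a tail node.
adj = {"M1": ["tail", "out_p"], "M2": ["tail", "out_n"],
       "tail": ["M1", "M2"], "out_p": ["M1"], "out_n": ["M2"]}
labels = {"M1": "nmos", "M2": "nmos", "tail": "net",
          "out_p": "net", "out_n": "net"}

sig = wl_signatures(adj, labels)
pairs = [(a, b) for a in adj for b in adj if a < b and sig[a] == sig[b]]
print(pairs)   # [('M1', 'M2'), ('out_n', 'out_p')] -> symmetry candidates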
Research Manuscript


AI
AI/ML Application and Infrastructure
DescriptionThis paper presents RTLFixer, a novel framework enabling automatic fixing of syntax errors in Verilog code with Large Language Models (LLMs). Despite LLMs' promising capabilities, our analysis indicates that approximately 55% of errors in LLM-generated Verilog are syntax-related, leading to compilation failures. To tackle this issue, we introduce a novel debugging framework that employs Retrieval-Augmented Generation (RAG) and ReAct prompting, enabling LLMs to act as autonomous agents that interactively debug the code with feedback. This framework demonstrates exceptional proficiency in resolving syntax errors, successfully correcting about 98.5% of compilation errors in our debugging dataset, comprising 212 erroneous implementations derived from the VerilogEval benchmark. Our method leads to 32.3% and 8.6% increases in pass@1 success rates on the VerilogEval-Machine and VerilogEval-Human benchmarks, respectively.
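A skeletal version of such a compile/retrieve/repair loop is sketched below, with the compiler, retrieval database, and LLM all stubbed out (the paper's actual RAG prompts and ReAct tooling are much richer than this):

# Skeletal compile/retrieve/repair loop in the spirit of RTLFixer. The LLM
# call, Verilog compiler, and retrieval database are stubs; the real system
# prompts a model with the source, the error, and retrieved guidance.

def compile_verilog(src):
    """Stub: return None on success, else an error message."""
    return None if "endmodule" in src else "syntax error: missing 'endmodule'"

def retrieve_guidance(error, db):
    """Stub RAG step: look up human-written advice for this error category."""
    return next((hint for key, hint in db.items() if key in error), "")

def llm_repair(src, error, hint):
    """Stub LLM: a real system would prompt a model with src+error+hint."""
    return src + "\nendmodule" if "endmodule" in hint else src

db = {"endmodule": "Append the missing 'endmodule' keyword."}
src = "module top(input a, output b);\nassign b = a;"

for attempt in range(5):
    err = compile_verilog(src)
    if err is None:
        print(f"fixed after {attempt} repair(s)")
        break
    src = llm_repair(src, err, retrieve_guidance(err, db))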
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionThe characterization of input/output (IO) devices is a complex and time-consuming process due to the multiple supplies involved, such as VDD and VDDE, which ramp up at different rates and in different orders. This is particularly important in the context of modern, complex IO designs, which often require rigorous validation to ensure reliable and robust operation.
This complexity can be addressed with automation scripts that enable the efficient generation of various validation scenarios in the characterization process. In this way, designers can save significant time and effort, while also improving the accuracy and completeness of the validation process.
To achieve this, the automation scripts are designed to automatically generate a series of tests that cover a range of supply ramp rates and orders. The scripts can be customized to the specific requirements of the IO device being characterized, and as an addition to the Solido Design Environment they can incorporate a variety of available simulation and analysis techniques, such as Monte Carlo analysis and sensitivity analysis.
The addition of an automation script for IO device characterization to the Solido Design Environment represents a significant technical advance in the design and verification of analog and mixed-signal ICs, with important implications for efficiency, accuracy, and reliability.
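The scenario enumeration itself is straightforward to sketch; the following hypothetical example sweeps ramp rates and power-up orders into a testcase list (the values are assumptions, and the hookup into Solido or a SPICE deck is omitted):

# Hypothetical sketch of scenario generation for IO characterization: sweep
# supply ramp rates and power-up orders into a testcase list. Hooking these
# into Solido / a simulation deck is tool-specific and omitted.

from itertools import permutations, product

supplies = ["VDD", "VDDE"]
ramp_rates_v_per_us = [0.1, 1.0, 10.0]          # assumed sweep values

testcases = []
for order in permutations(supplies):             # which supply ramps first
    for rates in product(ramp_rates_v_per_us, repeat=len(supplies)):
        testcases.append({"ramp_order": order,
                          "ramp_rate": dict(zip(order, rates))})

print(len(testcases))        # 2 orders x 3x3 rate combinations = 18 scenarios
print(testcases[0])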
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionDesign synthesis flows are not aware of Clock Domain Crossings (CDC). Thus, synthesis optimizations built to enhance power, performance, and area (PPA) may corrupt CDC paths, and the netlist generated by synthesis tools can therefore introduce new CDC errors even after CDC signoff at the RTL.
Synthesis optimizations may also cause functional glitch issues due to retiming, self-gating, and mux decompositions, which can result in silicon escapes.
Currently, designers use ad hoc methods such as manual synthesis constraints, full CDC re-verification at the gate level, or relying on Gate-Level Simulation (GLS) to overcome these challenges. However, these methods are error-prone due to over-constraining, high noise levels during re-verification, or low GLS coverage.
Using the VC SpyGlass CDC-aware Fusion Compiler flow, correct-by-construction synthesis is performed so that CDC bugs are avoided during netlist transformation.
The automated flow runs in the following steps:
• After RTL CDC signoff using VC SpyGlass CDC, a static database is generated to guide synthesis
• Fusion Compiler generates synthesis constraints using the static database to ensure no corruption happens to CDC paths and no functional glitches are introduced
Integrating this technology into the flow mitigates the risk of introducing new CDC violations into a netlist that was previously qualified at RTL.
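Conceptually, the handoff turns CDC signoff results into protective synthesis directives. The sketch below illustrates that idea only; the actual VC SpyGlass CDC / Fusion Compiler handoff uses an internal database rather than a text report like this:

# Conceptual sketch only: turn CDC-signoff results into protective synthesis
# constraints (e.g., dont_touch on synchronizer cells). The real flow's
# database format and constraint generation are internal to the tools.

cdc_report = [
    {"crossing": "clkA->clkB", "synchronizer": "u_core/u_sync0"},
    {"crossing": "clkA->clkC", "synchronizer": "u_core/u_sync1"},
]

constraints = []
for entry in cdc_report:
    # Keep synthesis from restructuring or retiming the synchronizer chain.
    constraints.append(f"set_dont_touch [get_cells {entry['synchronizer']}]")

print("\n".join(constraints))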
Work-in-Progress Poster
B-Ring: An Efficient Interleaved Bidirectional Ring All-reduce Algorithm for Gradient Synchronization


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThe prevailing ring all-reduce technique in distributed computing comprises communication establishment, data transmission, and data processing phases in each step. However, as nodes increase, it suffers from excessive communication overhead due to underutilized bandwidth during communication establishment and data processing. To address this, we introduce a bidirectional ring all-reduce (B-Ring) approach, employing asynchronous communication to alleviate the impact of communication establishment and data processing. Extensive experiments demonstrate B-Ring's effectiveness, reducing communication overhead by 8.4% on average and by up to 23.6%.
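A functional simulation of the bidirectional idea (assuming it amounts to running two ring all-reduces on opposite halves of the gradient in opposite directions; the paper's asynchronous phase overlap is not modeled here):

# Functional sketch of a bidirectional ring all-reduce: each gradient is split
# in half, and each half is all-reduced around the ring in the opposite
# direction. Sends within a step are treated as simultaneous.

import numpy as np

def ring_all_reduce(grads, ring):
    """Simulated ring all-reduce; returns per-node fully reduced vectors."""
    n = len(ring)
    buf = {node: [c.copy() for c in np.array_split(grads[node], n)]
           for node in ring}
    for step in range(n - 1):                   # reduce-scatter phase
        moves = [(ring[(i + 1) % n], (i - step) % n, buf[node][(i - step) % n])
                 for i, node in enumerate(ring)]
        for dst, c, data in moves:
            buf[dst][c] = buf[dst][c] + data
    for step in range(n - 1):                   # all-gather phase
        moves = [(ring[(i + 1) % n], (i + 1 - step) % n,
                  buf[node][(i + 1 - step) % n])
                 for i, node in enumerate(ring)]
        for dst, c, data in moves:
            buf[dst][c] = data
    return {node: np.concatenate(buf[node]) for node in ring}

nodes = [0, 1, 2, 3]
grads = {i: np.full(8, float(i)) for i in nodes}
lo = ring_all_reduce({i: grads[i][:4] for i in nodes}, nodes)        # clockwise
hi = ring_all_reduce({i: grads[i][4:] for i in nodes}, nodes[::-1])  # counter
print(np.concatenate([lo[0], hi[0]]))   # every element is 0+1+2+3 = 6.0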
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionFederated Learning (FL) is a privacy-centric distributed learning paradigm that aims to build a highly accurate global model. In Mobile Edge IoT, FL training can drain device energy. Current optimization methods focus on reducing overall energy use, potentially causing high consumption in some devices and shortening their lifespan. To enhance the accuracy of the global model and balance the energy consumption across devices, we introduce a novel FL training approach. We propose a client selection strategy that integrates cluster partitioning and utility-driven approaches, then introduce a Sequential Least Squares Quadratic Programming scheme for effective communication resource allocation. Our approach outperforms existing methods, increasing model accuracy and reducing the energy consumption gap.
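The allocation step can be sketched with SciPy's SLSQP solver; the objective and system model below are invented stand-ins for the paper's formulation (balance per-device energy while keeping total bandwidth fixed):

# Sketch of the resource-allocation step with SciPy's SLSQP. The rate and
# energy models here are assumptions, not the paper's: allocate bandwidth
# fractions to selected clients to even out per-device energy.

import numpy as np
from scipy.optimize import minimize

data_bits = np.array([4e6, 8e6, 2e6, 6e6])        # per-client upload size
power_w = np.array([0.5, 0.7, 0.4, 0.6])          # assumed TX power

def energy(frac, total_bw=1e6):
    rate = frac * total_bw                        # toy rate model
    return power_w * data_bits / rate             # E = P * t

def objective(frac):
    e = energy(frac)
    return np.var(e) + e.mean()                   # balance + efficiency

n = len(data_bits)
res = minimize(objective, x0=np.full(n, 1.0 / n), method="SLSQP",
               bounds=[(1e-3, 1.0)] * n,
               constraints=[{"type": "eq", "fun": lambda f: f.sum() - 1.0}])
print(res.x.round(3), objective(res.x).round(2))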
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionIn the rapidly evolving landscape of technology, the pursuit of high-performance systems has become increasingly essential. With the growing complexities in chip design, achieving a harmonious balance between Power, Performance, and Area (PPA) – the foundational pillars of contemporary chip architecture – presents formidable challenges. Traditional clock methodologies such as clock tree synthesis, clock mesh, and multi-source clock tree synthesis have proven inadequate in addressing the intricacies of modern chip design. Recognizing these limitations, we introduce the innovative Hybrid Clock Network technique, a customized approach designed to construct robust clock networks within Network On Chips (NoC).
Our technique has yielded remarkable improvements in clock quality when compared to conventional clock tree methodologies. Notably, our results showcase a 41.66% reduction in latency, a 43.75% enhancement in skew, a 14.22% decrease in clock power consumption, and an overall 12.46% reduction in total power consumption. Additionally, our approach has conserved 11.55% of routing resources, reduced the clock buffer count by 16.2%, and streamlined the clock depth from 23 to 19 levels. These compelling findings underscore the efficacy of our proposed technique in significantly enhancing critical PPA metrics. The Hybrid Clock Network technique represents a breakthrough in addressing the challenges of contemporary chip design, offering a promising path forward in the pursuit of high-performance systems.
Research Manuscript


Embedded Systems
Embedded Memory and Storage Systems
DescriptionThis paper proposes Balloon-ZNS, which enables transparent compression in emerging ZNS SSDs to enhance cost efficiency. ZNS SSDs require data pages to be stored and aligned in logical zones and flash blocks, which conflicts with the management of variable-length compressed pages. Motivated by the observation that compressibility locality widely exists in data streams, Balloon-ZNS performs compressibility-adaptive, slot-aligned storage management to resolve the conflict. Evaluation with RocksDB shows Balloon-ZNS can reap more than 80% of the compression gain while achieving -7% to 14% higher throughput than a vanilla ZNS SSD, on average, when data compressibility is not poor.
Work-in-Progress Poster


AI
Autonomous Systems
Cloud
Design
EDA
Embedded Systems
IP
Security
DescriptionThis paper proposes a Bayesian-learning-driven automated embedded memory design methodology that aims to minimize power consumption and/or maximize performance while meeting predefined constraints. To achieve this objective effectively, we present an automatic tool that leverages a reference initial circuit design to generate a diverse set of schematic and layout options for logic-equivalent circuit variants. Subsequently, leveraging the range of circuit options generated, Bayesian optimization is employed not only to identify optimal circuit parameters but also to select the most appropriate circuit topology to attain the desired design objectives. TSMC 28nm process simulation results demonstrate the proposed methodology reducing dynamic power by 21.59%-39.02% and access time by 29.45%-38.21% compared to the compiler-generated design, with a runtime of 10-40 hours.
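A hypothetical, heavily reduced sketch of the Bayesian-optimization loop (Gaussian-process surrogate plus expected improvement) on a 1-D stand-in knob; the actual tool searches over generated schematic/layout variants and topology choices:

# Hypothetical BO loop sketch: GP surrogate + expected improvement on a 1-D
# stand-in for a circuit knob. The power model is invented; a real flow would
# call SPICE and search over generated circuit variants.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def power_model(x):                    # stand-in for a SPICE power measurement
    return (x - 0.35) ** 2 + 0.05 * np.sin(25 * x)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 4).reshape(-1, 1)           # initial samples
y = power_model(X).ravel()
grid = np.linspace(0, 1, 200).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = grid[np.argmax(ei)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, power_model(x_next[0]))

print("best knob:", X[np.argmin(y)].round(3), "power:", y.min().round(4))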
DAC Pavilion Panel


DAC Pavilion
DescriptionThis panel will explore, with leading software companies, a phenomenon that has long been anticipated: the business, market and technical convergences of the two halves of Engineering Software (EDA and "industrial" software). These convergences are increasingly evident in the companies' product and acquisition strategies.
Research Manuscript


Security
Hardware Security: Attack and Defense
DescriptionThis research investigates the vulnerability of ML-enabled Hardware Malware Detection (HMD) methods to adversarial attacks. We introduce proactive and robust adversarial learning and defense based on Deep Reinforcement Learning (DRL). First, highly effective adversarial attacks are employed to circumvent detection mechanisms. Subsequently, an efficient DRL technique based on Advantage Actor-Critic (A2C) is presented to predict adversarial attack patterns in real time. Next, ML models are fortified through adversarial training to enhance their defense capabilities against both malware and adversarial attacks. To achieve greater efficiency, a constraint controller using the Upper Confidence Bound (UCB) algorithm is proposed that dynamically assigns defense responsibilities to specialized RL agents.
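The UCB side of the controller can be sketched with the standard UCB1 rule; the agents and rewards below are simulated placeholders, not the paper's models:

# Sketch of the UCB-based controller idea: route each defense window to the
# specialized agent with the best UCB1 score. Rewards are simulated here.

import math
import random

agents = ["hmd_agent", "adv_agent", "hybrid_agent"]
true_skill = {"hmd_agent": 0.6, "adv_agent": 0.7, "hybrid_agent": 0.55}
counts = {a: 0 for a in agents}
totals = {a: 0.0 for a in agents}

for t in range(1, 501):
    # UCB1: mean reward + exploration bonus; unplayed agents go first.
    def ucb(a):
        if counts[a] == 0:
            return float("inf")
        return totals[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
    choice = max(agents, key=ucb)
    reward = 1.0 if random.random() < true_skill[choice] else 0.0
    counts[choice] += 1
    totals[choice] += reward

print({a: counts[a] for a in agents})   # most work routed to the best agent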
IP


Engineering Tracks
IP
DescriptionInterface IPs are an important part of any integrated circuit design that needs to communicate with the outside world or other integrated circuits. Of the many design views of IO libraries (e.g., GPIO, I2C, I3C, etc.), the logical views have special importance, as they define the basic function of the design. The functionality in these views should be verified to the best possible extent, as broken functionality leads to one of the heaviest costs a design house may pay: silicon failures. Symbolic simulation provides unique and powerful solutions to the plethora of technical challenges faced by logic verification engineers of interface IPs. Synopsys ESP uses symbolic simulation technology to offer high-quality equivalence checking for full-custom designs.
In this paper, Synopsys ESP has been explored to validate complex interface IPs. ESP is well known for equivalence checking of standard cells and memories, which mostly comprise digital blocks. Interface IPs, on the other hand, consist of a number of analog blocks along with digital logic, which makes them more complex for equivalence checking. Resolving analog blocks is complex for ESP and sometimes resolves to incorrect logic, so we showcase the challenges faced with the analog blocks of interface IPs along with their proven solutions, and the advantages this brought within ESP, broadening its analog design validation coverage.
Research Manuscript


AI
AI/ML Application and Infrastructure
DescriptionDeep neural network (DNN) inference has become an important part of many data-center workloads. This has prompted focused efforts to design ever-faster deep learning accelerators such as GPUs and TPUs. However, an end-to-end vision application contains more than just DNN inference, including input decompression, resizing, sampling, normalization, and data transfer. In this paper, we perform a thorough evaluation of computer vision inference requests performed on a throughput-optimized serving system. We quantify the performance impact of server overheads such as data movement, preprocessing, and message brokers between two DNNs producing outputs at different rates. Our empirical analysis encompasses many computer vision tasks including image classification, segmentation, detection, depth estimation, and more complex processing pipelines with multiple DNNs. Our results consistently demonstrate that end-to-end application performance can easily be dominated by data processing and data movement functions (up to 56% of end-to-end latency in a medium-sized image, and ∼80% impact on system throughput in a large image), even though these functions have been conventionally overlooked in deep learning system design. Our work identifies important performance bottlenecks in different application scenarios, achieves 2.25× better throughput compared to prior work, and paves the way for more holistic deep learning system design.
Research Manuscript


EDA
Timing and Power Analysis and Optimization
DescriptionThough using multi-bit flip-flop (MBFF) cells provides the benefit of saving dynamic power, their large cell size with many D/Q pins inherently entails two critical limitations: (1) the loss of full flexibility in optimizing the wires connecting to the D/Q pins in MBFFs, and (2) the loss of the ability to selectively resize, i.e., control the output driving strength of, the internal flip-flops.
Research Manuscript


Design
Emerging Models of Computation
DescriptionHyperdimensional computing (HDC), a powerful paradigm for cognitive tasks, often demands hypervectors of high dimensions (e.g., 10,000) to achieve competitive accuracy. However, processing such large-dimensional data poses challenges for performance and energy efficiency, particularly on resource-constrained devices. In this paper, we present a framework to terminate bit-serial HDC inference early once sufficient confidence is attained in the prediction. This approach integrates a Naive Bayes model to replace the conventional associative memory in HDC. This transformation allows for a probabilistic interpretation of the model outputs, steering away from mere similarity measures. We reduce more than 70% of the bits that need to be processed while maintaining comparable accuracy across diverse benchmarks. In addition, we show the adaptability of our early termination algorithm in on-the-fly learning scenarios.
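A simplified sketch of the stopping rule (with a softmax confidence proxy standing in for the paper's Naive Bayes posterior, and an assumed confidence scale):

# Simplified sketch of confidence-based early termination for bit-serial HDC.
# Agreement scores with each class hypervector are accumulated bit by bit and
# turned into a confidence; a softmax proxy stands in for the paper's Naive
# Bayes posterior.

import numpy as np

rng = np.random.default_rng(1)
D, classes = 10_000, 5
class_hvs = rng.integers(0, 2, size=(classes, D))
query = class_hvs[3].copy()
flip = rng.random(D) < 0.2                      # noisy version of class 3
query[flip] ^= 1

scores = np.zeros(classes)
for bit in range(D):
    scores += class_hvs[:, bit] == query[bit]   # bit-serial agreement update
    if bit % 250 == 0 and bit > 0:
        p = np.exp((scores - scores.max()) / 25.0)   # assumed confidence scale
        p /= p.sum()
        if p.max() > 0.99:
            print(f"early exit at bit {bit}: class {p.argmax()}, "
                  f"{100 * (1 - bit / D):.0f}% of bits skipped")
            break
else:
    print("processed all bits; class", scores.argmax())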
DAC Pavilion Panel


Design
DAC Pavilion
DescriptionSoCs designed for compute-intensive workloads, such as AI training and inferencing, continue to grow and power budgets are increasing geometrically. Handling these power budgets from an SoC and system perspective requires rigorous tools, flows, and methodologies. The question that remains is how these burgeoning power budgets impact broader systems and system-of-system effects, and what role does silicon IP play in shaping these outcomes.
2.5D and 3D solutions are emerging as potential mitigators for the expanding power budgets, but the extent of their effect is yet to be fully understood. Additionally, with the constant evolution and growth in technology, there is a looming question: will power budgets level off or continue on a path of exponential growth? The influence of silicon IP in directing this trajectory is a topic of keen interest.
A significant player in this dynamic is the role of next-generation VRMs. With their potential to regulate voltage and hence influence power, they might hold the answer to managing the surge in power budgets. This panel seeks to explore their impact, dissect the role of silicon IP, and generate insightful discussions on the future of power consumption within technology. Together, we will answer some of the following questions from an EDA, system, IP, and SoC design perspective:
o What are the primary factors driving the immense leaps in on-die power?
o What tools, flows, and methodologies are required to manage SoC and system power budgets?
o What are the system and system-of-system effects of ballooning power budgets?
o What effect will 2.5D and 3D solutions have on growing power budgets?
o Will we see a leveling off in power budgets or will they keep growing exponentially? And why?
o What is the role of next-generation VRMs?
Research Manuscript


Design
Design for Manufacturability and Reliability
DescriptionYield estimation and optimization is ubiquitous in modern circuit design but remains elusive for large-scale chips. This is largely due to the mounting cost of transistor-level simulation and one's often limited resources. In this study, we propose a novel framework to estimate and optimize yield using a Bayesian Neural Network (BNN-YEO). By coupling machine learning methods with Bayesian networks, our approach can effectively integrate prior knowledge and is unaffected by the overfitting problem prevalent in most surrogate models. With the introduction of a smooth approximation of the indicator function, it incorporates gradient information to facilitate global yield optimization. We examine its effectiveness via numerical experiments on 6T SRAM and find that BNN-YEO provides a 100x speedup (in terms of SPICE simulations) over standard Monte Carlo in yield estimation and is 20x faster than the state-of-the-art method for total yield estimation and optimization, with improved accuracy.
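The smooth-indicator idea in isolation can be shown in a few lines: replacing the hard pass/fail indicator with a sigmoid makes the Monte Carlo yield estimate differentiable with respect to a design parameter (the circuit margin model below is invented for illustration):

# Toy illustration of the smooth-indicator idea: replace the hard pass/fail
# indicator in a Monte Carlo yield estimate with a sigmoid so the estimate
# is differentiable w.r.t. a design parameter. The "margin" model is invented.

import numpy as np

rng = np.random.default_rng(0)
process_noise = rng.normal(0.0, 0.05, size=100_000)   # MC variation samples

def margin(design_w, noise):
    return 0.2 * design_w - 0.12 + noise               # toy spec margin

def smooth_yield(design_w, k=50.0):
    m = margin(design_w, process_noise)
    s = 1.0 / (1.0 + np.exp(-k * m))                   # sigmoid ~ indicator
    grad = np.mean(k * s * (1 - s) * 0.2)              # d/dw of mean sigmoid
    return s.mean(), grad

hard = np.mean(margin(1.0, process_noise) > 0)          # standard MC yield
soft, g = smooth_yield(1.0)
print(f"hard MC yield {hard:.4f}, smooth {soft:.4f}, dYield/dw {g:.3f}")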
Research Manuscript


Design
Quantum Computing
DescriptionBoolean matching is an important problem in logic synthesis and verification. Despite being well-studied for conventional Boolean circuits, its treatment for reversible logic circuits remains largely, if not completely, missing. This work provides the first such study. Given two (black-box) reversible logic circuits that are promised to be matchable, we check their equivalences under various input/output negation and permutation conditions subject to the availability/unavailability of their inverse circuits. Notably, among other results, we show that the equivalence up to input negation and permutation is solvable in quantum polynomial time, while the classical complexity is exponential. This result is arguably the first demonstration of quantum exponential speedup in solving design automation problems. Also, as a negative result, we show that the equivalence up to both input and output negations is not solvable in quantum polynomial time unless UNIQUE-SAT is, which is unlikely. This work paves the theoretical foundation of Boolean matching reversible circuits for potential applications, e.g., in quantum circuit synthesis.
Front-End Design


Design
Engineering Tracks
Front-End Design
DescriptionDot-product compute engines are pivotal to AI/ML hardware accelerators. Multi-term and floating-point dot-product engines increase datapath complexity due to the added logic for rounding, normalization, and alignment of significands per maximum exponent. To formally verify such dot-product compute engines, a C/C++-vs-RTL formal checking tool (e.g., Synopsys's VC Formal DPV) is used. The datapath complexity of a multi-term, floating-point dot-product engine for a complex AI/ML chip, along with the differing dataflow graph (DFG) structures of the corresponding C/C++ and RTL models, often makes it difficult for the formal tool to converge. This work presents various techniques (assume-guarantee, lemma partitioning, DFG optimization, maximizing equivalence points, case splitting, and using optimized solvers) adopted to obtain formal convergence across several floating-point types. Moreover, we enable helper lemmas after deriving adder-tree expressions to match the RTL and C-model adder-tree structures. The results demonstrate that a formal run for a multi-term FP32-based dot-product operation can converge within 30 minutes. We recommend a new feature for the VC Formal DPV tool to streamline detection of the adder trees and automatically resolve them in the flow, which Synopsys is currently working on.
IP


Engineering Tracks
IP
DescriptionThe methods and tools we use for digital hardware design today are deeply antiquated and little changed from the 1990s when IP Reuse was in its infancy. Software design, on the other hand, has undergone explosive changes since that time. We have now reached the inflection point where a combination of new open-source software EDA tools and modern software development environments can change the way we design hardware. In this paper, we present our work showing a complete digital design flow that can produce high-quality, professional-grade IP built entirely with open-source software and EDA tools. We also share early results of how generative AI may become a powerful tool in the designer's toolbox for creating ever more complex IP.
Keynote
Special Event


AI
DescriptionJim Keller is CEO of Tenstorrent and a veteran hardware engineer. Prior to joining Tenstorrent, he served two years as Senior Vice President of Intel's Silicon Engineering Group. He has held roles as Tesla's Vice President of Autopilot and Low Voltage Hardware, Corporate Vice President and Chief Cores Architect at AMD, and Vice President of Engineering and Chief Architect at P.A. Semi, which was acquired by Apple Inc. Jim has led multiple successful silicon designs over the decades, from the DEC Alpha processors, to AMD K7/K8/K12, HyperTransport and the AMD Zen family, the Apple A4/A5 processors, and Tesla's self-driving car chip.
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionHigh bandwidth memory (HBM) consists of several memory chips and a dedicated buffer die that serializes and de-serializes data for processing and transferring. One major parameter deciding the performance of a buffer die is the number of parallel signal buslines spanning half the die between the signal IO circuitry (e.g., PHY) and the input/output ports (i.e., through-silicon vias (TSVs)) of the buffer die. The speed of the signal buses is also important for making smoother signal transitions during the clock cycle time. This transition time, which ensures full signal swing, determines the maximum clock frequency of the HBM. The faster the device and the larger the number of buslines, the higher the performance an HBM can deliver. The busline bit count is expected to exceed several tens of thousands in the next HBM generation. The busline delay difference must be minimized for correct signal transfer of all bits within the very narrow time slot available for signal transition. Until now, the bus design has been done by iterative manual layout and simulation, since no good automated solutions exist. This work seeks an automated layout and optimization methodology for the many signal buslines of a next-generation HBM. We formulate the design constraints from custom layouts, and develop a novel bus delay optimization algorithm based on a commercial P&R tool. This automated solution produces a bus layout for an HBM buffer die within seconds, while satisfying all metric requirements.
Research Manuscript


Design
Emerging Models of Computation
DescriptionThe concept of Nash equilibrium (NE), pivotal within game theory, has garnered widespread attention across numerous industries.
However, verifying the existence of NE poses a significant computational challenge, classified as an NP-complete problem.
Recent advancements introduced several quantum Nash solvers aimed at identifying pure strategy NE solutions (i.e., binary solutions) by integrating slack terms into the objective function, commonly referred to as slack-quadratic unconstrained binary optimization (S-QUBO).
However, incorporation of slack terms into the quadratic optimization results in changes of the objective function, which may cause incorrect solutions.
Furthermore, these quantum solvers only identify a limited subset of pure strategy NE solutions, and fail to address mixed strategy NE (i.e., decimal solutions), leaving many solutions undiscovered.
In this work, we propose C-Nash, a novel ferroelectric computing-in-memory (CiM) architecture that can efficiently handle both pure and mixed strategy NE solutions.
The proposed framework consists of
(i) a transformation method that converts quadratic optimization into a MAX-QUBO form without introducing additional slack variables, thereby avoiding objective function changes;
(ii) a ferroelectric FET (FeFET) based bi-crossbar structure for storing payoff matrices and accelerating the core vector-matrix-vector (VMV) multiplications of QUBO form;
(iii) a winner-takes-all (WTA) tree implementing the MAX form and a two-phase simulated annealing (SA) logic for searching NE solutions.
Evaluations demonstrate that C-Nash achieves up to a 68.6% increase in the success rate of identifying NE solutions, finding all pure and mixed NE solutions rather than only a portion of the pure NE solutions, compared to D-Wave-based quantum approaches.
Moreover, C-Nash boasts reductions of up to 157.9X/79.0X in time-to-solution compared to D-Wave 2000 Q6 and D-Wave Advantage 4.1, respectively.
Research Manuscript


Embedded Systems
Embedded Software
DescriptionEnergy harvesting offers a scalable and cost-effective power solution for IoT devices, but it introduces the challenge of frequent and unpredictable power failures due to the unstable environment.
To address this, intermittent computing has been proposed, which periodically backs up the system state to non-volatile memory (NVM), enabling robust and sustainable computing even in the face of unreliable power supplies.
In modern processors, the write-back cache is extensively utilized to enhance system performance.
However, it poses a challenge during backup operations as it buffers updates to memory, potentially leading to inconsistent system states.
One solution is to adopt a write-through cache, which avoids the inconsistency issue but incurs increased memory access latency for each write reference.
Some existing work enforces a cache flushing before backups to maintain a consistent system state, resulting in significant backup overhead.
In this paper, we point out that although cache delays updates to the main memory, it may preserve a recoverable system state in the main memory.
Leveraging this characteristic, we propose a cache-aware task decomposition method that divides an application into multiple tasks, ensuring that no dirty cache lines are evicted during their execution.
Furthermore, the cache-aware task decomposition maintains an unchanged memory state during the execution of each task, enabling us to parallelize the backup process with task execution and effectively hide the backup latency.
Experimental results with different power traces demonstrate the effectiveness of the proposed system.
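A toy illustration of the decomposition criterion: walking a memory trace through a small direct-mapped write-back cache and cutting a task boundary just before any access that would evict a dirty line (the paper's analysis operates on program structure rather than a runtime trace; this runtime form keeps the sketch short):

# Toy illustration of the cache-aware decomposition criterion with a tiny
# direct-mapped write-back cache: a task boundary is placed right before an
# access that would evict a dirty line, so no dirty eviction ever occurs
# inside a task.

SETS = 4            # 4 sets, 1 word per line

def decompose(trace):
    cache = {}                       # set index -> (tag, dirty)
    tasks, current = [], []
    for op, addr in trace:
        s, tag = addr % SETS, addr // SETS
        hit = cache.get(s, (None, False))[0] == tag
        if not hit and cache.get(s, (None, False))[1]:
            tasks.append(current)    # would evict a dirty line: cut task here
            current = []
            cache = {}               # boundary: dirty state persisted to NVM
        current.append((op, addr))
        dirty = (op == "W") or (hit and cache[s][1])
        cache[s] = (tag, dirty)
    tasks.append(current)
    return tasks

trace = [("W", 0), ("R", 1), ("W", 4), ("R", 2), ("W", 8), ("R", 3)]
for i, t in enumerate(decompose(trace)):
    print(f"task {i}: {t}")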
Engineering Track Poster


Back-End Design
Embedded Systems
Front-End Design
IP
DescriptionWith the exponential growth in design complexity, stringent timelines for chip design cycle closure, process advancements, and increased runtimes in both physical design sign-off verification and quality analysis are constantly driving the need for faster and more efficient physical verification (PV) strategies.
Early PV analysis ensures that designers can quickly and easily analyze critical issues. They can find and fix the root cause of errors in an efficient, accurate, and fast manner. Fixing critical DRC and DFM issues later in the project cycle becomes more challenging. Our paper describes some of the efficient techniques that enable faster chip design sign-off convergence.
Research Manuscript


Design
Design for Manufacturability and Reliability
DescriptionOptical proximity correction (OPC) is a vital step to ensure printability in modern VLSI manufacturing. Various OPC approaches have been proposed, which are typically data-driven and hardly involve particular considerations of the OPC problem, leading to potential performance bottlenecks. In this paper, we propose CAMO, a reinforcement learning-based OPC system that integrates important principles of the OPC problem. CAMO explicitly involves the spatial correlation among the neighboring segments and an OPC-inspired modulation for movement action selection. Experiments are conducted on via patterns and metal patterns. The results demonstrate that CAMO outperforms state-of-the-art OPC engines from both academia and industry.
Research Manuscript


Design
In-memory and Near-memory Computing Circuits
DescriptionRange search is a key part of the point cloud processing pipeline. CAM has proven its efficiency for search tasks on switches. In this work, we propose CAMPER, aiming to explore the potential of CAM for point cloud range search. We develop a ripple-comparison 13T CAM cell for distance comparison, design a spatial approximation search algorithm based on Chebyshev distance, and discuss the flexibility and scalability of the architecture. The results show that in the 64k@64k task, CAMPER achieves a latency of 0.83 ms and a power consumption of 114.6 mW, improvements of 10.4x and 228x, respectively.
Research Manuscript


Design
In-memory and Near-memory Computing Circuits
DescriptionDemands for efficient computing under memory wall have led to computation-in-memory (CIM) accelerators that leverages memory structure to perform in-situ computing. The content addressable memory (CAM) processing is a CIM paradigm that accomplishes general purpose functions, via sequences of search