Extension of ACETONE C code generator for multi-core architectures

Imagine you are the manager of a busy kitchen in a high-stakes restaurant (like an airplane's navigation system). Your goal is to prepare a complex dish (a Deep Neural Network) that must be perfect and, crucially, finished within a strict, predictable time limit. If the dish isn't ready on time, the plane can't land safely.

For a long time, this kitchen had only one chef (a single-core processor). This chef was incredibly reliable and followed a strict recipe book (the ACETONE code generator) that ensured the food was always safe and the cooking time could be predicted down to the second. However, as the recipes got more complex, one chef simply couldn't cook fast enough.

The problem? The restaurant couldn't afford to buy a magical, specialized robot chef (a dedicated AI accelerator) yet. They had to stick with their existing team of human chefs (multi-core CPUs). But hiring more chefs introduced a new problem: coordination. If you just tell five chefs to start cooking at once, they might bump into each other, fight over the same ingredients, or wait around doing nothing.

This paper is about teaching the ACETONE system how to manage a team of chefs instead of just one. Here is how they did it, broken down into simple concepts:

1. The Recipe Map (The DAG)

First, the team realized that a complex recipe isn't just a list of steps; it's a map. Some steps must happen in order (you can't frost the cake before baking it), but other steps can happen at the same time (chopping vegetables and boiling water).

The Analogy: They turned the neural network into a flowchart (called a DAG). Imagine a subway map where some stations are connected by tracks (dependencies). You can't get to the next station until you pass the current one, but some lines run parallel.
The Goal: The system needs to figure out which chef works on which part of the map so that the whole meal is ready as fast as possible without anyone stepping on toes.

2. The Scheduling Puzzle (The Brain)

The hardest part is deciding who does what. This is a math puzzle.

The Old Way: The researchers started from an existing exact math formulation (ILP). Solving it is like solving a Rubik's cube by checking every single possible move: it works for small puzzles but takes forever for big ones.
The New Way: They tightened the exact formulation so it copes with bigger puzzles, and they also relied on two faster rule-of-thumb strategies:
- The "Fast & Good Enough" Strategy (ISH): Like a head chef who hands each task to whichever chef can start it soonest, and slips smaller tasks into any idle gaps. It's fast and gets the job done, but the plan may not be perfectly optimized.
- The "Copy-Paste" Strategy (DSH): Sometimes, a chef has to wait for an ingredient to arrive from another station. Instead of waiting, this strategy says, "Let's just have a second chef cook a small batch of that ingredient right here!" It duplicates work to save time, trading a little extra cooking for a lot less waiting.

3. The Handshake (Synchronization)

When Chef A finishes chopping onions and needs to pass them to Chef B (who is on a different counter), they can't just throw them across the room. They need a system to ensure Chef B doesn't grab the onions before they are chopped, and Chef A doesn't drop new onions on top of the old ones.

The Analogy: They built a flag system in the shared memory.
- Chef A puts the onions on a specific spot on the counter and raises a red flag.
- Chef B looks at the red flag. If it's up, they know the onions are ready. They take them, lower the flag, and start cooking.
This ensures that even though the chefs are working in parallel, they never mess up the order of operations.

4. The Result: A Faster, Safer Kitchen

The team tested this new system on a real kitchen (a Texas Instruments computer chip with 4 cores).

The Outcome: By splitting the work among the cores and using their new "flag" system, they managed to cook the meal 8% faster overall.
The Real Win: While 8% sounds small, every saved cycle counts when a result must arrive before a hard deadline. More importantly, they showed that they could do this predictably. They could still put an upper bound on how long the worst-case scenario would take, which is the golden rule for safety-critical systems.

Why This Matters

Before this paper, if you wanted to run complex AI on an airplane, you had to wait for new, expensive hardware or accept that it would be too slow. This paper shows that we can take existing, reliable multi-core processors and make them work together like a well-oiled team.

It's like upgrading a single-lane road to a multi-lane highway with a smart traffic light system. You don't need to build a new highway; you just need to teach the cars (the code) how to drive in parallel without crashing. This makes advanced AI safer and more practical for the future of aviation.

Here is a detailed technical summary of the paper "Extension of ACETONE C code generator for multi-core architectures."

1. Problem Statement

The integration of Deep Neural Networks (DNNs) into safety-critical aeronautical systems faces significant challenges regarding predictability and certification. While the existing ACETONE framework successfully generates certifiable, sequential C code for single-core systems, it suffers from performance bottlenecks as model sizes increase.

The Challenge: Aeronautical systems are transitioning from single-core to multi-core architectures but are not yet ready to embed dedicated hardware accelerators (GPUs/TPUs). Therefore, efficient inference must be achieved on multi-core CPUs.
The Gap: Simply running sequential code on a single core is too slow for time-constrained applications. Parallelizing the code on multi-core systems introduces complexity regarding task scheduling, inter-core communication latency, and synchronization, which the original ACETONE framework does not address.
Goal: To extend ACETONE to generate predictable, parallel C code for multi-core architectures without dedicated accelerators, ensuring the code remains certifiable (i.e., Worst-Case Execution Time (WCET) is estimable).

2. Methodology

The authors propose a three-stage methodology: modeling the problem as a scheduling task, searching for schedules that minimize the makespan, and extending the code generator.

A. System and Application Modeling

Platform: Modeled as a multi-core CPU with $m$ identical cores under a Unified Memory Architecture (UMA). Inter-core communication latency is assumed constant, and memory interference is abstracted as a margin on the WCET.
Application: The DNN is modeled as a Directed Acyclic Graph (DAG).
- Nodes ( $V$ ): Represent DNN layers (tasks) with an associated WCET ( $t$ ).
- Edges ( $E$ ): Represent data dependencies with associated communication latency ( $w$ ) if tasks are on different cores.
Scheduling Constraints:
- Static, non-preemptive scheduling.
- Tasks are "ready" only when all predecessors are complete.
- Task Duplication: Allowed to reduce communication overhead (a task can be computed on multiple cores to avoid data transfer), provided it is not redundant.
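The model above can be made concrete with a small sketch. It is an illustration under stated assumptions, not code from ACETONE: the names `Dag` and `data_ready_time` are hypothetical, WCETs and latencies are plain integers, and a duplicated task is represented as several (core, finish time) copies. The function computes when all inputs of a task are available on a given core, charging the latency $w$ only when the chosen copy of a predecessor sits on another core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dag:
    wcet: dict  # node name -> WCET t of that layer
    comm: dict  # (pred, succ) -> latency w, paid only across cores

    def preds(self, node):
        return [a for (a, b) in self.comm if b == node]


def data_ready_time(dag, node, core, copies):
    """Earliest time at which all inputs of `node` are available on `core`.

    `copies` maps each already-scheduled node to a list of
    (core, finish_time) pairs, one per copy, so a duplicated
    predecessor simply has several entries.
    """
    ready = 0
    for pred in dag.preds(node):
        # Use whichever copy of the predecessor delivers its data first.
        arrival = min(
            finish + (0 if c == core else dag.comm[(pred, node)])
            for c, finish in copies[pred]
        )
        ready = max(ready, arrival)
    return ready
```

In this toy setting, adding a second copy of a predecessor on the destination core can only lower (or leave unchanged) the data-ready time, which is the effect task duplication is meant to exploit.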

B. Schedule Optimization Strategies

The paper evaluates three approaches to find the schedule that minimizes the makespan (total execution time):

Improved Integer Linear Programming (ILP):
- Builds upon Tang et al.'s [15] formulation but introduces an optimized encoding.
- Key Innovation: Removes the complex 4D communication variable ( $d_{a,i,b,j}$ ) and replaces it with tighter constraints on task duplication and start/end times. This reduces the search space complexity, allowing the solver to find solutions for larger graphs within reasonable timeframes compared to the original formulation.
Heuristic Approaches:
- Insertion Scheduling Heuristic (ISH): Assigns tasks to the core that minimizes start time. If idle time exists, it attempts to insert lower-priority tasks into the gap.
- Duplication Scheduling Heuristic (DSH): Similar to ISH but actively attempts to duplicate parent tasks on the destination core to eliminate communication delays, thereby reducing the start time of the current task.
Search Space Pruning:
- Utilizes dominance and equivalence relations (Chou and Chung [2]) to prune sub-optimal branches in the solution tree.
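The core step shared by the heuristics, placing each ready task on the core where it can start earliest, can be sketched as a plain greedy list scheduler. This is a deliberately simplified illustration and not the ISH or DSH algorithm itself: it performs no gap insertion and no duplication, and the priority rule (longest task first) and the names `list_schedule` and `makespan` are assumptions made for the example.

```python
def list_schedule(wcet, comm, num_cores):
    """Greedy static schedule: returns node -> (core, start, finish)."""
    preds = {n: [] for n in wcet}
    for (a, b) in comm:
        preds[b].append(a)

    core_free = [0] * num_cores  # time at which each core becomes idle
    placed = {}
    remaining = set(wcet)
    while remaining:
        # A task is ready once all of its predecessors have been scheduled.
        ready = sorted(n for n in remaining if all(p in placed for p in preds[n]))
        # Hypothetical priority rule: longest task first, ties broken by name.
        node = max(ready, key=lambda n: wcet[n])

        best = None
        for core in range(num_cores):
            start = core_free[core]
            for p in preds[node]:
                p_core, _, p_finish = placed[p]
                # Communication latency applies only across cores.
                delay = 0 if p_core == core else comm[(p, node)]
                start = max(start, p_finish + delay)
            if best is None or start < best[1]:
                best = (core, start)

        core, start = best
        finish = start + wcet[node]
        placed[node] = (core, start, finish)
        core_free[core] = finish
        remaining.remove(node)
    return placed


def makespan(placed):
    return max(finish for (_, _, finish) in placed.values())
```

On a four-node fork-join graph (`in` and `out` with WCET 1, two branches with WCET 4, every edge with latency 1), this sketch yields a makespan of 10 on one core and 7 on two cores. The communication latency on the join is what keeps the two-core result above the ideal 6.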

C. ACETONE Extension & Code Generation

Workflow: The ACETONE parser now accepts a DAG, the scheduler assigns layers to specific cores, and the generator produces separate C functions for each core.
Synchronization Mechanism:
- Implemented for a bare-metal environment.
- Uses shared memory flags and arrays for inter-core communication.
- Protocol: A "Writer" waits for a flag, writes data to a shared array, and increments the flag. A "Reader" waits for the flag update, reads the data, and increments the flag.
- Overhead: Requires $2m(m-1)$ variables (flags + arrays) for $m$ cores.
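The writer/reader protocol can be mimicked in a few lines to show how a single counter orders the accesses. The sketch below is a simulation under assumptions, not the generated code: Python threads stand in for cores, the parity convention (even means the buffer is free, odd means data is ready) is one possible reading of "waits for the flag", and the names `Channel`, `write`, `read` and `sync_variables` are hypothetical. A bare-metal C version would additionally have to deal with memory ordering and cache visibility, which this simulation ignores. The variable count assumes one flag and one array per ordered pair of cores, which matches the $2m(m-1)$ figure.

```python
import time


class Channel:
    """One-directional link from a writer core to a reader core."""

    def __init__(self, size):
        self.flag = 0            # even: buffer free, odd: data ready (assumed)
        self.data = [0] * size   # shared array


def write(ch, values):
    while ch.flag % 2 != 0:      # wait until the reader has consumed the data
        time.sleep(0)            # yield, standing in for a busy-wait loop
    ch.data[:] = values
    ch.flag += 1                 # signal: data ready


def read(ch):
    while ch.flag % 2 != 1:      # wait until the writer has published data
        time.sleep(0)
    values = list(ch.data)
    ch.flag += 1                 # signal: buffer free again
    return values


def sync_variables(m):
    """Flags plus arrays, one of each per ordered pair of distinct cores."""
    return 2 * m * (m - 1)
```

Because each side increments the flag only when it is its own turn, the writer can never overwrite data that has not been read, and the reader can never read data that has not been written. For the 4-core target, `sync_variables(4)` gives 24.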

3. Key Contributions

Formalization: Defined the offline parallel scheduling of DNNs on multi-core systems as a DAG scheduling problem with specific constraints for embedded safety-critical contexts.
Optimized ILP Encoding: Proposed a more efficient ILP formulation that scales better than the earlier encoding it builds on, enabling the solution of larger DAGs.
ACETONE Extension: Successfully integrated scheduling logic and synchronization primitives into the ACETONE framework to generate parallel C code.
Validation: Validated the approach through both static WCET analysis (using the OTAWA tool) and experimental execution on real hardware.

4. Results and Evaluation

A. Scheduling Performance (Offline)

Heuristics vs. ILP:
- DSH provides the highest speedup (closest to optimal) but has high computational cost (up to 2 minutes per graph).
- ISH is significantly faster and more stable but yields slightly lower speedups.
- Optimized ILP: Outperforms the original Tang et al. encoding, finding solutions where the original timed out. However, for very large graphs, it remains slower than heuristics.
Speedup: Speedup increases with the number of cores until it hits a plateau determined by the graph's maximum parallelism (number of independent branches).
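One way to see why such a plateau must exist is the standard work/critical-path bound: no schedule can finish faster than the longest dependency chain, so the speedup cannot exceed total work divided by that chain, however many cores are added. The sketch below applies this textbook bound. It is not a computation from the paper, it ignores communication latency, and the function name `speedup_bound` is hypothetical.

```python
from functools import lru_cache


def speedup_bound(wcet, edges, num_cores):
    """Upper bound on speedup: min(cores, total work / critical path)."""
    succs = {n: [] for n in wcet}
    for a, b in edges:
        succs[a].append(b)

    @lru_cache(maxsize=None)
    def longest_from(n):
        # Length of the longest chain of WCETs starting at node n.
        return wcet[n] + max((longest_from(s) for s in succs[n]), default=0)

    total = sum(wcet.values())
    critical = max(longest_from(n) for n in wcet)
    return min(num_cores, total / critical)
```

For a fork-join graph with three branches of WCET 4 between two nodes of WCET 1, the total work is 14 and the critical path is 6, so the bound is 2 on two cores and stays at 14/6 (about 2.33) for four, eight, or any larger number of cores.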

B. Experimental Evaluation (On Target)

Hardware: Texas Instruments Keystone II SoC (4x ARM Cortex-A15).
Test Case: A modified GoogLeNet architecture.
Performance Gains:
- Total WCET: Achieved an 8% reduction in total execution time (from $2.90 \times 10^{10}$ to $2.68 \times 10^{10}$ cycles in static analysis; similar results in hardware).
- Parallelizable Segment: The section of the network capable of parallelization (from maxpool_2 to inception_2/concat) saw a 46% gain in static analysis and 31% gain in hardware execution.
Bottlenecks: The overall gain was limited by large sequential layers (conv_1 and conv_2) that could not be parallelized.
Overhead: Inter-core communication and synchronization flags introduced minor delays, but the synchronization mechanism proved robust.
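The relation between the two headline numbers can be checked with a short calculation. The first function reproduces the overall gain from the reported cycle counts. The second is a back-of-the-envelope estimate that is not a figure from the paper: it assumes the sequential layers are unchanged and that the whole gain comes from the parallelizable segment, and then asks what share of the sequential WCET that segment would have to represent. Both function names are hypothetical.

```python
def relative_gain(before, after):
    """Fraction of execution time saved."""
    return (before - after) / before


def implied_segment_share(total_gain, segment_gain):
    """Share of the original WCET spent in the parallelized segment,
    assuming everything outside that segment is unchanged."""
    return total_gain / segment_gain


total_gain = relative_gain(2.90e10, 2.68e10)       # static-analysis cycle counts
share = implied_segment_share(total_gain, 0.46)    # 46% gain on the segment
```

Under these assumptions the overall gain comes out at roughly 7.6%, consistent with the rounded 8%, and the parallelizable segment would account for roughly one sixth of the sequential WCET. That is in line with the observation that the large sequential convolution layers dominate the total.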

5. Significance and Future Work

Certifiability: The work demonstrates that DNN inference can be parallelized on multi-core CPUs while maintaining the strict predictability required for aeronautical certification (DO-178C/ED-12C context).
Practicality: It offers a viable path for deploying medium-sized DNNs in safety-critical systems without waiting for the maturation of embedded AI accelerators.
Limitations & Future Directions:
- Current assumptions rely on homogeneous cores and UMA. Future work aims to support heterogeneous architectures (mixing CPUs and accelerators).
- Investigating non-blocking write schemes to reduce synchronization overhead.
- Adapting the ILP and heuristics for non-uniform memory access (NUMA) systems.

In conclusion, this paper bridges the gap between theoretical multi-core scheduling and practical, certifiable code generation for safety-critical AI applications. It shows that measurable gains (8% overall, and up to 46% on the parallelizable segment in static analysis) are achievable even when only part of a complex neural network can be parallelized.