2011 International Conference on Parallel Processing (ICPP), Taipei, Taiwan, September 13-16, 2011

Optimal Data Allocation for Scratch-Pad Memory on Embedded Multi-core Systems

Yibo Guo2, Qingfeng Zhuge1, Jingtong Hu2, Meikang Qiu3 and Edwin H.-M. Sha1,2
1College of Information Science and Engineering, Hunan University, Changsha, China 410082
2Dept. of Computer Science, University of Texas at Dallas, Richardson, TX 75080
3Dept. of Electrical and Computer Engineering, University of Kentucky, Lexington, KY 40506

Abstract—Multi-core systems have become a popular design for high-performance embedded systems. Scratch-Pad Memory (SPM), a software-controlled on-chip memory, has been widely adopted in many embedded systems due to its small area and low energy consumption. Existing data allocation algorithms either cannot achieve optimal results or take exponential time to complete. In this paper, we propose a polynomial-time algorithm to solve the data allocation problem on multi-core systems in which each datum has one exclusive copy. According to the experimental results, the proposed optimal data allocation method alone reduces the time cost of memory accesses by 16.45% on average compared with a greedy algorithm. The proposed data allocation algorithm also reduces the energy cost significantly.

I. INTRODUCTION

With the ever-growing performance requirements of embedded applications, multi-core designs have become the trend for embedded systems to achieve high performance. Scratch-Pad Memory (SPM), a software-controlled on-chip memory, has been widely adopted in multi-core embedded systems because it has less hardware overhead and better control over data allocation [1], [2], [3], [4], [5]. Example multi-core systems employing SPM include TI's TNETV3010 CMP [6] and IBM's Cell processor [7]. The data accesses on SPM are entirely managed by software (e.g., by a compiler). The key issue in effectively using these on-chip memory modules is how to optimize data allocation so that the cost of memory accesses is minimized.

Several papers have proposed data allocation methods on SPM for single-core systems. Generally speaking, there are two kinds of SPM data allocation techniques: static data allocation and dynamic data allocation. In static data allocation methods, the data allocation is determined for the whole program and never changes during the execution of the program. Avissar et al. [8] proposed an ILP formulation to generate an allocation for the whole program. In dynamic data allocation methods, programs are divided into many regions and the compiler generates a data allocation for each program region. During the execution of a particular region, the data allocation remains the same. Udayakumaran et al. [9] proposed a dynamic data allocation algorithm. Dynamic data allocation is better than static data allocation since it can take better advantage of the data locality of each program region. In this paper, we also adopt the dynamic approach, for multi-core systems.

There are also several works proposed for SPMs on multi-core systems. Hu et al. [10] presented their programming experience on a multi-core chip architecture which employs SPM. In their work, programmers still need to manually decide how to allocate data for the multi-core SPMs. Che et al. [11] proposed an ILP formulation to generate data placement for stream applications in order to minimize execution time for the whole program on a multi-core architecture. There are two drawbacks to this approach. First, the technique they employ is a static allocation approach; therefore, it cannot take advantage of data localities.
Second, the ILP formulation may take exponential time to finish as the size of the input increases. Kandemir et al. [12] proposed data tiling to reduce extra off-chip memory accesses caused by inter-processor communication. They determine the data allocation for each loop, so it is a dynamic approach. However, their methods rely heavily on the characteristics of loops and cannot achieve optimal results.

In this paper, we propose a polynomial-time algorithm that determines the optimal data allocation for programs running on SPM-equipped multi-core systems. In this algorithm, programs are divided into parallel regions. Each parallel region is a block of code that can be executed in parallel on multiple cores. For each parallel region, we generate a data allocation which guarantees that the memory access cost during the execution of this parallel region is optimal. We assume that each datum can have only one exclusive copy, either in one of the SPMs or in main memory.

According to the experimental results, the proposed RODAM algorithm reduces the time cost of memory accesses for a multi-core system by 16.45% on average compared with an algorithm derived from Udayakumaran's algorithm. For energy consumption, the proposed optimal data allocation method reduces the energy cost by 17.59%.

The rest of this paper is organized as follows. Background and related works are discussed in Section II. The hardware architecture and software execution model are introduced in Section III. A motivational example is presented in Section IV to illustrate the basic ideas of this paper. The main algorithms are explained in detail in Section V. The experiments are presented in Section VI. Finally, this paper is concluded in Section VII.

II. BACKGROUND AND RELATED WORKS

In this section, we introduce the background of the SPM architecture and related works on multi-core systems.

Several SPM data allocation and management approaches have been proposed to minimize either the execution time of programs or the energy consumption on single-core processors. Generally speaking, there are two kinds of SPM data allocation schemes: static and dynamic. In static data allocation methods, the data allocation is determined for the whole program and never changes during the execution of the program. Panda et al. [13], [14] proposed static data partition schemes for scalar and array variables. Avissar et al. [8] proposed an ILP solution for the data allocation problem on SPMs. Sjödin et al. [15] also proposed a static data allocation method. Static allocation methods have an inherent drawback: they do not account for dynamic program behavior. Thus, most recent research focuses on dynamic management of data. In dynamic data allocation methods, programs are divided into many regions and the compiler generates a data allocation for each program region. During the execution of a particular region, the data allocation remains the same. Udayakumaran et al. [16] proposed a greedy algorithm to solve the data allocation problem on a single-core system with a dynamic approach. In this paper, we also take a dynamic approach.

Several research works have proposed different data allocation algorithms for SPM-equipped multi-core systems. Hu et al. [10] presented their programming experience on a multi-core chip architecture which employs SPM.
In their work, the programmer still needs to manually decide how to allocate data for the multi-core SPMs. Che et al. [11] proposed an ILP formulation and a heuristic approach to maximize the throughput of stream programs on multi-core processors. There are two drawbacks to this approach. First, they employ a static allocation approach; therefore, it cannot take advantage of data localities. Second, ILP takes exponential time to finish. Zhang et al. [17] introduced a data partitioning and scheduling method for SPM on multi-core systems. In their work, the data allocation scheme they use cannot generate an optimal data allocation for a single region. Kandemir et al. [12] proposed data tiling to reduce extra off-chip memory accesses caused by inter-processor communication. They determine the data allocation for each loop. However, their methods rely heavily on the characteristics of loops and cannot achieve optimal results.

There are also works focusing on improving program parallelism or balancing workloads among different cores. However, none of them considers minimizing the cost of memory accesses. Tan et al. [18] proposed a scheme to exploit fine-grain parallelism and locality of a dynamic programming algorithm with non-uniform data dependence on a multi-core architecture. Their work focuses on increasing the parallelism of programs; they do not consider data allocation. Edmonds et al. [19] proposed a dynamic programming method on a shared-memory multiprocessor. They attempt to balance the workload while keeping synchronization cost low. Again, they do not consider data allocation for SPM at all.

III. HARDWARE AND SOFTWARE MODEL

In this section, we introduce the hardware architecture that is relevant to our algorithm, and the software execution model of programs for our algorithm.

A. Hardware Architecture

We focus on the Virtually Shared SPM (VS-SPM) architecture [20] in this paper, which is shown in Fig. 1. In a VS-SPM architecture, every core has its own private on-chip SPM. All cores share a main memory, which can be a DRAM with large capacity. Each core can access its own SPM and other cores' SPMs. We call a core's own SPM its local SPM and other cores' SPMs remote SPMs. Accessing data in a remote SPM costs more time and energy than accessing data in a local SPM, but costs less than accessing data in the off-chip main memory. The communication cost grows as the distance between cores increases. In a system with a large number of cores, therefore, it is very likely that each core can only access a constant number of other cores' SPMs.

Fig. 1: The first architecture (each core with its private SPM, connected by an interconnecting bus to the off-chip main memory)

B. Execution Model

Before determining the data allocation for SPMs in multi-core systems, the program (tasks/regions) must be partitioned and assigned to different cores. There are many works that focus on this problem [12], [17], [21]. In this paper, we assume that this stage has been performed. After task partitioning and assignment, we already know which task is executed on which core at what time. Then we need to determine the data allocation for each task.

Fig. 2: Demonstration for parallel regions (a program as a sequence of parallel regions, with data allocation instructions between them)

We now give the basic ideas of our software execution model. The execution model is shown in Fig. 2. A program for multi-core systems can be divided into several parallel regions. For each parallel region, there are multiple threads executing in parallel. It is possible for compilers to add data allocation instructions between different parallel regions, as the sketch below illustrates. The goal of our algorithm is to minimize the cost of memory accesses for a parallel region by finding the best data allocation over the SPMs and main memory.
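The following minimal Python sketch is our own illustration of this execution model, not code from the paper; all function and parameter names are hypothetical. It shows a program running as a sequence of parallel regions, with compiler-inserted allocation steps migrating data between consecutive regions.

```python
# A minimal sketch of the execution model in Fig. 2 (our illustration; the
# names are hypothetical). A program is a sequence of parallel regions; between
# regions, compiler-inserted instructions may migrate data among the SPMs and
# main memory before the region's threads run.

def run_program(regions, initial_allocation, choose_allocation, migrate, execute):
    allocation = initial_allocation
    for region in regions:
        # Compiler-inserted step: pick the allocation for this region (this is
        # what RODAM computes) and realize it by moving data.
        new_allocation = choose_allocation(region, allocation)
        migrate(allocation, new_allocation)   # pays the reallocation cost
        allocation = new_allocation
        execute(region, allocation)           # threads run in parallel on all cores
    return allocation

# Tiny demo with stub callbacks: two regions, with a migration in between.
if __name__ == "__main__":
    run_program(
        regions=["region 1", "region 2"],
        initial_allocation={"A": "main"},
        choose_allocation=lambda region, alloc: {"A": "SPM1"},
        migrate=lambda old, new: print("migrate", old, "->", new),
        execute=lambda region, alloc: print("run", region, "with", alloc),
    )
```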
IV. MOTIVATIONAL EXAMPLE

In this section, we present a motivational example to illustrate the main idea of the proposed algorithm. We compare the results of the algorithm proposed in this paper with the results of an algorithm derived from Udayakumaran's algorithm on a multi-core system.

A list of notations is defined in Table I. The column "Notation" gives the name of each variable and the column "Definition" gives its definition.

TABLE I: Notations used in this paper

Notation    Definition
R_Si        the cost of reading its own SPM for core i
W_Si        the cost of writing to its own SPM for core i
R_Mi        the cost of reading from main memory for core i
W_Mi        the cost of writing to main memory for core i
R_Si→Sj     the cost of reading from core j's SPM for core i
W_Si→Sj     the cost of writing to core j's SPM for core i
M_Si→Sj     the cost of moving data from the SPM of core i to the SPM of core j
M_Si→M      the cost of moving data from the SPM of core i to main memory
M_M→Si      the cost of moving data from main memory to the SPM of core i
Size_Si     the SPM size on core i

In the motivational example, we assume that there are two cores in the system. Each core is equipped with an SPM of capacity 2 (data units). There are two tasks: task X and task Y. Task X is assigned to core 1 and task Y is assigned to core 2. Tasks X and Y both access 5 data items: A, B, C, D, and E. The number of memory accesses to each datum is shown in Table II. The rows "Core 1 Read" and "Core 1 Write" show the read and write counts for core 1, and the rows "Core 2 Read" and "Core 2 Write" show the read and write counts for core 2.

TABLE II: Number of Accesses of Data for Each Core

              A    B    C    D    E
Core 1 Read   2    15   25   40   0
Core 1 Write  1    0    0    0    1
Core 2 Read   0    15   4    40   1
Core 2 Write  0    0    0    0    0

We also assume that initially data A is in main memory; data B and C are in the SPM of core 1 (SPM1); and data D and E are in the SPM of core 2 (SPM2). For this example, we define the cost of reading or writing data in a private SPM as 1, the cost of reading or writing data in main memory as 50, and the cost of accessing data in a remote SPM as 10, as shown in Table III.

TABLE III: Access Cost Table for the Motivational Example

Data                    Cost
R_S1, R_S2              1
W_S1, W_S2              1
R_M1, R_M2              50
W_M1, W_M2              50
R_S1→S2, R_S2→S1        10
W_S1→S2, W_S2→S1        10
M_S1→S2, M_S2→S1        11
M_M→S1, M_M→S2          51
M_S1→M, M_S2→M          51

When the initial state is not considered, or when the initial state is that all SPMs are empty and all variables reside in main memory, it is not very difficult to apply a greedy strategy to the data allocation problem. Yet, when a pre-existing initial state on the SPMs and the communication among remote SPMs are considered, the greedy strategy cannot find the optimal solution. Udayakumaran's algorithm [9] is a greedy algorithm that is most similar to our approach, in which programs are divided into regions and the most accessed data in each region is allocated into the SPM. However, their approach only targets single-core processors. Therefore, we adapted their algorithm to multi-core systems for comparison purposes. The derived Udayakumaran's algorithm works as follows. First, it sorts data according to the total number of accesses. Then, it repeatedly picks the most accessed datum. There may be many cores that access this datum; the algorithm allocates it into the SPM of the core that accesses it the most times. When the SPM of that core is full, it chooses the SPM of the core with the second most accesses, and so on. When the SPMs of all cores are full, the datum is allocated into main memory. A sketch of this derived greedy allocator is shown below.
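The following Python sketch is our reading of this derived greedy allocator, not the authors' code. It is applied to the access counts of Table II with capacity-2 SPMs; ties in access counts are broken by core index, which reproduces the "Uday." allocation of Table IV below.

```python
# Our reading of the derived Udayakumaran-style greedy allocator (a sketch,
# not the authors' code), run on the motivational example of Table II.
TOTAL_ACCESSES = {             # reads + writes per core, from Table II
    "A": [3, 0], "B": [15, 15], "C": [25, 4], "D": [40, 40], "E": [1, 1],
}
CAPACITY = [2, 2]              # SPM capacity of core 1 and core 2 (data units)

def greedy_allocate():
    free = CAPACITY[:]
    alloc = {}
    # Consider the most accessed data first (total over all cores).
    for d in sorted(TOTAL_ACCESSES, key=lambda d: -sum(TOTAL_ACCESSES[d])):
        # Try cores in decreasing order of this datum's per-core access count;
        # ties are broken by core index (so core 1 wins a tie, as in Table IV).
        for core in sorted(range(2), key=lambda c: -TOTAL_ACCESSES[d][c]):
            if free[core] > 0:
                free[core] -= 1
                alloc[d] = f"SPM{core + 1}"
                break
        else:                  # every SPM is full
            alloc[d] = "main memory"
    return alloc

print(greedy_allocate())
# {'D': 'SPM1', 'B': 'SPM1', 'C': 'SPM2', 'A': 'SPM2', 'E': 'main memory'}
```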
The data allocation generated by the derived Udayakumaran's algorithm is not optimal. Table IV shows two different data allocations generated by the different algorithms. As shown in the column "Uday.", for the motivational example, the derived Udayakumaran's algorithm allocates data B and D in core 1's SPM, data A and C in core 2's SPM, and data E in main memory. The total cost is 1113.

TABLE IV: Data Allocations for the Example

Data Alloc.   Uday.   Optimal
Core 1        B, D    A, C
Core 2        A, C    B, D
Main Mem.     E       E
Total Cost    1113    886

Instead of this data allocation, if we place data A and C in SPM1, data B and D in SPM2, and data E in main memory, the total memory access cost is 886, as shown in the column "Optimal". The cost of memory accesses is reduced by 20.40%. In fact, this is the data allocation with the minimum memory access cost when duplicate data is not considered. In a later section, we show how to achieve this with a dynamic programming approach in our algorithm.

V. OPTIMAL DATA ALLOCATION ALGORITHM

In this section, we present the details of the proposed algorithm. In Section V-A, we define the problem formally. After that, we propose the Regional Optimal Data Allocation for Multi-core (RODAM) algorithm. We use the motivational example to illustrate the algorithm.

A. Problem Definition

Definition 5.1 (SPM Cost Optimization problem on multi-core): Given the initial data allocation of all cores' on-chip SPMs, the capacity of every core's on-chip SPM, the number of data items Nd, the number of cores C, the number of read and write accesses to each datum by each core, and the read and write costs to the on-chip SPMs and main memory for each core, what is a data allocation over all cores' SPMs and main memory such that the total time/energy cost of memory accesses is minimized?

The inputs are: the initial data allocation of each core's on-chip SPM; the SPM capacity Size_Si of each core i; the number of data items Nd; the number of cores C; the read and write costs R_Si and W_Si to the local on-chip SPM of core i; the read and write costs R_Si→Sj and W_Si→Sj from core i to core j's SPM; and the read and write costs R_Mi and W_Mi to main memory for core i.

The output is: a data allocation over all cores' SPMs and main memory under which the total cost of memory accesses is minimized.

B. Regional Optimal Data Allocation for Multi-core

In this subsection, we introduce the process of determining the optimal data allocation for multi-core systems that employ SPMs. The process includes two steps. First, we build a cost table. Then, a dynamic programming algorithm uses the cost table to obtain the optimal data allocation.

Step 1: Build Cost Table

For each datum, we compute (C + 1) costs, where C is the number of cores. In our algorithm, the number of cores is a constant, as we do not change the number of cores during the execution of a program. For each memory location (main memory or SPMi), we have a cost of allocating the datum there. The cost of each datum in each location is computed as shown in Equation (1):

Cost = Σ (number of accesses × cost of each access) + reallocating cost    (1)

The cost is the sum of the execution cost and the reallocating cost. We compute the execution cost by counting the number of accesses to the datum by each core, multiplying by that core's cost per access, and summing over all cores. The reallocating cost is the cost of moving the datum from its initial location to the new location; if the datum is not moved, the reallocating cost is 0. We can see that the reallocating cost depends on the initial state of the datum. The derived Udayakumaran's algorithm does not consider the initial data allocation and only has the first part of Equation (1); it also cannot minimize the communication costs among the cores. Our algorithm, on the other hand, generates the optimal solution because we consider the initial state of all cores' SPMs as well as the communication costs.

Table V is the cost table for the motivational example. The column "Data" lists the variables. The column "Cost_S1" shows the memory access cost of datum Di when it is allocated in core 1's SPM, and the column "Cost_S2" shows the cost when it is allocated in core 2's SPM. If there are k cores, we have k such columns for each datum. The column "Cost_M" shows the memory access cost of datum Di when it is allocated in main memory.

TABLE V: Cost Table for the Example

Data   Cost_S1   Cost_S2   Cost_M
A      54        81        150
B      165       176       1551
C      65        265       1501
D      451       440       4051
E      22        11        151

In Table V, the cost of data A being allocated in the SPM of core 1 is 54. It is computed as (2 × R_S1 + 1 × W_S1) + (R_M1 + W_S1) = (2×1 + 1×1) + (50 + 1) = 54. Here, the (R_M1 + W_S1) term accounts for changing data A's initial location by moving it from main memory to SPM1; this action includes one read from main memory and one write to the SPM. We do not count the bus communication cost, which may further increase the reallocating cost. The sketch below reproduces this computation for the whole table.
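The following minimal Python sketch (our illustration, not the authors' implementation) builds the cost table of Equation (1) from the access counts of Table II and the per-access costs of Table III; it should reproduce every entry of Table V.

```python
# Build the cost table of Equation (1) for the two-core motivational example.
# ACCESSES holds (reads, writes) per core for each datum (Table II).
ACCESSES = {
    "A": ((2, 1), (0, 0)),
    "B": ((15, 0), (15, 0)),
    "C": ((25, 0), (4, 0)),
    "D": ((40, 0), (40, 0)),
    "E": ((0, 1), (1, 0)),
}
INITIAL = {"A": "M", "B": "S1", "C": "S1", "D": "S2", "E": "S2"}

# Per-access cost of (location, accessing core) from Table III:
# local SPM 1, remote SPM 10, main memory 50; writes cost the same as reads.
READ = {("S1", 1): 1, ("S2", 2): 1, ("S1", 2): 10, ("S2", 1): 10,
        ("M", 1): 50, ("M", 2): 50}
WRITE = READ
# Data movement (reallocating) costs from Table III.
MOVE = {("S1", "S2"): 11, ("S2", "S1"): 11, ("M", "S1"): 51, ("M", "S2"): 51,
        ("S1", "M"): 51, ("S2", "M"): 51}

def cost(data, loc):
    """Equation (1): access costs summed over all cores, plus reallocating cost."""
    total = 0
    for core, (reads, writes) in enumerate(ACCESSES[data], start=1):
        total += reads * READ[(loc, core)] + writes * WRITE[(loc, core)]
    src = INITIAL[data]
    return total + (0 if src == loc else MOVE[(src, loc)])

for d in "ABCDE":
    print(d, [cost(d, loc) for loc in ("S1", "S2", "M")])
# Expected (Table V): A [54, 81, 150], B [165, 176, 1551], C [65, 265, 1501],
#                     D [451, 440, 4051], E [22, 11, 151]
```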
Step 2: Dynamic Programming Scheme

After the cost table is built, a dynamic programming algorithm finds the optimal data allocation in the second step.

Let CostMin[j, i_1, i_2, ..., i_C] be the minimal cost of memory accesses when the allocation of the first j data items has been optimally determined (j ≤ N_d, where N_d is the number of variables in the current region) while the remaining data items are in main memory, and there are i_k empty memory units on the SPM of core k (k ≤ C, where C is the number of cores). CostMin is thus a (C + 1)-dimensional table.

In Equation (2), we present the recursive function for computing CostMin, from which we derive our dynamic programming scheme:

CostMin[j, i_1, i_2, ..., i_C] =

  Σ_{j=1}^{N_d} Cost_M(D_j),
      if j = 0 and i_k = Size_Sk for all k = 1, ..., C;

  ∞,
      if Σ_{k=1}^{C} i_k < Σ_{k=1}^{C} Size_Sk − j, or i_k > Size_Sk for some k ∈ {1, ..., C};

  min( CostMin[j−1, i_1, i_2, ..., i_C],
       CostMin[j−1, i_1+1, i_2, ..., i_C] − (Cost_M(D_j) − Cost_S1(D_j)),
       CostMin[j−1, i_1, i_2+1, ..., i_C] − (Cost_M(D_j) − Cost_S2(D_j)),
       ...,
       CostMin[j−1, i_1, i_2, ..., i_C+1] − (Cost_M(D_j) − Cost_SC(D_j)) ),
      if Σ_{k=1}^{C} i_k ≥ Σ_{k=1}^{C} Size_Sk − j.    (2)

The equation has three parts. First, if j = 0 and i_k = Size_Sk for all k = 1, 2, ..., C, then CostMin[j, i_1, i_2, ..., i_C] equals Σ_{j=1}^{N_d} Cost_M(D_j). This means that no datum has yet been assigned to any SPM and all data are in main memory; the cost therefore equals the sum of the main-memory costs, which by Equation (1) already include the cost of migrating to main memory any datum whose initial location is an SPM. In the second part of Equation (2), if Σ i_k < Σ Size_Sk − j, or if i_k > Size_Sk for some k ∈ {1, 2, ..., C}, those cells represent impossible states and are set to ∞. In the third part, if Σ i_k ≥ Σ Size_Sk − j, we compute the minimum over the candidate placements of datum D_j (main memory or one of the C SPMs). That is, CostMin[j, i_1, i_2, ..., i_C] is the minimum over the cells of CostMin in which only the first j − 1 data items have been considered and one of the SPMs has one more empty unit. Since there are C SPMs, where C is a constant (the number of cores), we only need to choose the minimal cell among them.
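To make the recurrence concrete, here is a minimal Python sketch of Equation (2) for C = 2, including the trace-back bookkeeping that Algorithm V.1 below implements with its TraceBack and CurrentMove arrays. The three-item cost table in it is hypothetical, chosen only to keep the trace small; this is our illustration, not the authors' implementation.

```python
# A minimal sketch of the Equation (2) recurrence for C = 2 cores, with a
# hypothetical Step-1 cost table (three data items, SPM capacity 1 per core).
import itertools

INF = float("inf")

DATA = ["X", "Y", "Z"]                     # D_1, D_2, D_3 (hypothetical)
COST_S1 = {"X": 10, "Y": 40, "Z": 25}      # Cost_S1(D_j)
COST_S2 = {"X": 12, "Y": 35, "Z": 30}      # Cost_S2(D_j)
COST_M = {"X": 100, "Y": 60, "Z": 90}      # Cost_M(D_j)
SIZE1, SIZE2 = 1, 1                        # Size_S1, Size_S2

def rodam_dp():
    nd = len(DATA)
    # cost_min[j][(i1, i2)]: minimal cost after deciding D_1..D_j with i1 and
    # i2 empty units left on SPM1 and SPM2, as in Equation (2).
    cost_min = [{cell: INF for cell in itertools.product(range(SIZE1 + 1),
                                                         range(SIZE2 + 1))}
                for _ in range(nd + 1)]
    cost_min[0][(SIZE1, SIZE2)] = sum(COST_M.values())  # all data in main memory
    move = {}                                           # CurrentMove analogue
    for j in range(1, nd + 1):
        d = DATA[j - 1]
        for i1, i2 in cost_min[j]:
            if i1 + i2 < SIZE1 + SIZE2 - j:
                continue                                # impossible cell, stays INF
            best, where = cost_min[j - 1][(i1, i2)], "main"  # D_j stays in main
            if i1 < SIZE1:                              # D_j into SPM1
                c = cost_min[j - 1][(i1 + 1, i2)] - (COST_M[d] - COST_S1[d])
                if c < best:
                    best, where = c, "SPM1"
            if i2 < SIZE2:                              # D_j into SPM2
                c = cost_min[j - 1][(i1, i2 + 1)] - (COST_M[d] - COST_S2[d])
                if c < best:
                    best, where = c, "SPM2"
            cost_min[j][(i1, i2)] = best
            move[(j, i1, i2)] = where
    # Trace back from the cheapest final cell (TraceBack analogue).
    i1, i2 = min(cost_min[nd], key=cost_min[nd].get)
    total, alloc = cost_min[nd][(i1, i2)], {}
    for j in range(nd, 0, -1):
        where = move[(j, i1, i2)]
        alloc[DATA[j - 1]] = where
        if where == "SPM1":
            i1 += 1
        elif where == "SPM2":
            i2 += 1
    return total, alloc

print(rodam_dp())   # (97, {'Z': 'SPM1', 'Y': 'main', 'X': 'SPM2'})
```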
According to the recursive function, the Regional Optimal Data Allocation for Multi-Core (RODAM) algorithm is presented in Algorithm V.1. The first step sets the initial state of the dynamic programming, in which all data are in main memory: line 1 initializes CostMin according to the first part of Equation (2). This is the case in which no datum has yet been placed. Then, we compute each cell of CostMin by the second and third parts of the recursive function. In lines 6 to 7, we set CostMin to ∞ for the impossible cases. In lines 2 to 28, we compute CostMin according to the third part of Equation (2). There are (C + 1) nested loop levels in this part of Algorithm V.1; in lines 4 and 26, we use "..." to denote the omitted loop levels. In lines 11 to 24, we construct two arrays from which the optimal solution can be generated. There are C + 1 if/else branches; in line 20, we use "..." to represent the omitted branches. The two arrays are TraceBack[j, i_1, i_2, ..., i_C] and CurrentMove[j, i_1, i_2, ..., i_C]. TraceBack keeps track of the previous position from which the current solution is obtained. CurrentMove records the data movement action for the current datum D_j.

Algorithm V.1 Regional Optimal Data Allocation for Multi-Core (RODAM) Algorithm

Input: Number of data items N_d, capacity of core i's SPM Size_Si, total cost of each datum D_j on core i Cost_Si(D_j), total cost of each datum D_j in main memory Cost_M(D_j).
Output: A data allocation under which the total cost of memory accesses for the execution of this parallel region is minimized.

1:  CostMin[0, Size_S1, Size_S2, ..., Size_SC] ← Σ_{j=1}^{N_d} Cost_M(D_j);
2:  for j ← 1 to N_d do
3:    for i_1 ← Size_S1 downto 0 do
4:      ...
5:      for i_C ← Size_SC downto 0 do
6:        if Σ_k i_k < Σ_k Size_Sk − j or ∃k ∈ {1, 2, ..., C}: i_k > Size_Sk then
7:          CostMin[j, i_1, i_2, ..., i_C] ← ∞;
8:        else
9:          compute CostMin[j, i_1, i_2, ..., i_C] according to Equation (2);
10:       end if
11:       if CostMin[j, i_1, i_2, ..., i_C] = CostMin[j − 1, i_1, i_2, ..., i_C] then
12:         CurrentMove[j, i_1, i_2, ..., i_C] ← "data D_j in main memory";
13:         TraceBack[j, i_1, i_2, ..., i_C] ← (j − 1, i_1, ..., i_C);
14:       else if CostMin[j, i_1, ..., i_C] = CostMin[j − 1, i_1+1, i_2, ..., i_C] − (Cost_M(D_j) − Cost_S1(D_j)) then
15:         CurrentMove[j, i_1, i_2, ..., i_C] ← "data D_j into core 1's SPM";
16:         TraceBack[j, i_1, i_2, ..., i_C] ← (j − 1, i_1+1, i_2, ..., i_C);
17:       else if CostMin[j, i_1, ..., i_C] = CostMin[j − 1, i_1, i_2+1, ..., i_C] − (Cost_M(D_j) − Cost_S2(D_j)) then
18:         CurrentMove[j, i_1, i_2, ..., i_C] ← "data D_j into core 2's SPM";
19:         TraceBack[j, i_1, i_2, ..., i_C] ← (j − 1, i_1, i_2+1, ..., i_C);
20:       ...
21:       else if CostMin[j, i_1, i_2, ..., i_C] = CostMin[j − 1, i_1, i_2, ..., i_C+1] − (Cost_M(D_j) − Cost_SC(D_j)) then
22:         CurrentMove[j, i_1, i_2, ..., i_C] ← "data D_j into core C's SPM";
23:         TraceBack[j, i_1, i_2, ..., i_C] ← (j − 1, i_1, i_2, ..., i_C+1);
24:       end if
25:     end for
26:     ...
27:   end for
28: end for
After presenting the algorithm, we illustrate RODAM with the motivational example. We have already shown the cost table for the example in Table V. Based on this cost table, Algorithm V.1 generates three tables: CostMin, TraceBack, and CurrentMove. The content of CostMin[j, i_1, i_2] is shown in Table VI. TraceBack contains the coordinates of the previous cell, and CurrentMove stores the specific location chosen for the datum at each cell. When we find the minimum cost in CostMin, we can trace the path via TraceBack and find the allocation of each datum via CurrentMove.

TABLE VI: CostMin[j, i_1, i_2] for the motivational example

i_2 = 0
i_1 \ j   1(A)   2(B)   3(C)   4(D)   5(E)
0         ∞      ∞      ∞      886    886
1         ∞      ∞      4524   982    982
2         ∞      5960   4784   2418   2418

i_2 = 1
i_1 \ j   1(A)   2(B)   3(C)   4(D)   5(E)
0         ∞      ∞      4497   971    971
1         ∞      5933   4593   2357   2357
2         7335   6029   6029   3793   3793

i_2 = 2
i_1 \ j   1(A)   2(B)   3(C)   4(D)   5(E)
0         ∞      5922   4582   2368   2368
1         7308   6018   5968   3804   3804
2         7404   7404   7404   7404   7404

The cell "886" in the last column of the table for i_2 = 0 is the minimum memory access cost for this parallel region. One possible path through Table VI, from which we can construct the optimal data allocation solution, visits the cells CostMin[0, 2, 2], CostMin[1, 1, 2], CostMin[2, 1, 1], CostMin[3, 0, 1], CostMin[4, 0, 0], and CostMin[5, 0, 0]. From this path we can construct the optimal solution: data A is in SPM1; data B is in SPM2; data C is in SPM1; data D is in SPM2; and data E is in main memory.

VI. EXPERIMENT

In this section, we first describe the setup of the experiments in Section VI-A. Then, we present the comparison of experimental results between the derived Udayakumaran's algorithm, as described in the motivational example, and RODAM in Section VI-B.

A. Experiment setup

The experiments are conducted on a custom simulator which is similar to the IBM Cell processor. We simulate a multi-core architecture with 4 cores. Each core has a 16 KB on-chip SPM, and all cores share an off-chip DDR main memory with a capacity of 512 MB. Each core can access its own SPM and other cores' SPMs. The memory access costs are obtained from HP CACTI 5.3 [22]. CACTI is an integrated model of cache and memory access time, cycle time, area, leakage, and dynamic power. From CACTI, we obtained the SPM and main memory access times, dynamic energy consumption, and leakage power consumption, and we integrated these parameters into our custom simulator. The experimental system specification, including the time and energy costs of accessing on-chip and off-chip memory, is shown in Table VII.

TABLE VII: System Specification

Component     Description
CPU Core      Number of cores: 4, frequency: 1.0 GHz
SRAM SPM      Size: 16 KB, access latency: 0.553 ns, access energy: 0.008 nJ
Main memory   DDR SDRAM, size: 512 MB, access latency: 19.51 ns, access energy: 0.607 nJ

We use benchmarks from MiBench [23] to conduct the experiments. First, the memory traces of each parallel region are obtained. Then, a standalone program decides the memory allocation for each region. After that, the data items are allocated in our custom simulator.
Finally, each parallel region of the MiBench programs is executed in the simulator and the results are collected. We selected 11 applications from the MiBench benchmark suite: qsort, susan, basicmath, bitcount, dijkstra, patricia, stringsearch, rijndael, SHA, CRC32, and FFT.

B. Experimental Results

In this subsection, we present the experimental results. We compare the latency and energy costs of two algorithms: the derived Udayakumaran's algorithm and the RODAM algorithm. The comparison of memory access latency is shown in Table VIII, and the comparison of energy consumption is shown in Table IX.

TABLE VIII: Comparison of Time Cost between the derived Udayakumaran's algorithm and RODAM

Benchmarks     Udayakumaran (μs)   RODAM (μs)   Impr.
basicmath      27769.03            22320.79     19.62%
bitcount       1769.76             1109.05      37.33%
qsort          36784.01            31696.96     13.83%
susan          5101.02             4531.21      11.17%
dijkstra       1856.06             1324.87      28.62%
patricia       18475.29            17271.59     6.51%
stringsearch   3021.15             2563.84      15.14%
rijndael       22916.43            20038.42     12.56%
SHA            13228.17            12304.84     6.98%
CRC32          10879.59            9403.09      13.58%
FFT            10257.47            8651.27      15.66%
Average        -                   -            16.45%

In Table VIII, the column "Benchmarks" shows the names of the benchmarks. The column "Udayakumaran" shows the time cost of the derived Udayakumaran's algorithm, and the column "RODAM" shows the time cost of the RODAM algorithm. The improvement of our RODAM algorithm over the derived Udayakumaran's algorithm is shown in the column "Impr.". From Table VIII, we can see that RODAM achieves an average 16.45% reduction in memory access latency compared with the derived Udayakumaran's algorithm. The comparison bar graph is shown in Fig. 3; our algorithm shows consistent improvements over the derived Udayakumaran's algorithm across all benchmarks.

Fig. 3: Time cost of memory access between the two algorithms

TABLE IX: Comparison of Energy Cost between the derived Udayakumaran's algorithm and RODAM

Benchmarks     Udayakumaran (μJ)   RODAM (μJ)   Impr.
basicmath      654.61              548.49       16.18%
bitcount       38.56               22.55        20.37%
qsort          893.23              762.55       14.63%
susan          180.39              158.95       11.89%
dijkstra       44.42               30.88        30.48%
patricia       438.32              396.00       9.66%
stringsearch   76.79               63.34        17.51%
rijndael       548.17              487.05       11.15%
SHA            316.11              291.30       7.85%
CRC32          265.17              225.34       14.99%
FFT            244.72              202.47       17.26%
Average        -                   -            17.59%

In Table IX, the second and third columns show the energy costs of the derived Udayakumaran's algorithm and the RODAM algorithm; the other columns are the same as in Table VIII. From Table IX, we can see that RODAM achieves an average 17.59% reduction in energy consumption compared with the derived Udayakumaran's algorithm. The comparison bar graph is shown in Fig. 4, which illustrates the improvement our algorithm brings to the system compared with the derived Udayakumaran's algorithm.

Fig. 4: Energy cost of memory access between the two algorithms

VII. CONCLUSION

In this paper, we propose a polynomial-time optimal dynamic data allocation algorithm to minimize the cost of SPM memory accesses for multi-core systems. Our algorithm achieves the optimal data allocation for each region under different scenarios.
According to the experimental results, the RODAM algorithm reduces the time cost of memory accesses by an average of 16.45% compared with the derived Udayakumaran's algorithm for multi-core systems. For energy cost, our algorithm achieves an average reduction of 17.59% compared with the derived Udayakumaran's algorithm. The experimental results show that the proposed algorithm outperforms the existing related algorithms for a single parallel region.

VIII. ACKNOWLEDGMENTS

This work is partially supported by NSF CNS-1015802, Texas NHARP 009741-0020-2009, HK GRF 123609, NSFC 61073148, the Changjiang Scholar Professorship, and the China Thousand-Talent Program.

REFERENCES

[1] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, "Scratchpad memory: design alternative for cache on-chip memory in embedded systems," in CODES '02, 2002, pp. 73–78.
[2] J. Hu, C. J. Xue, W.-C. Tseng, Y. He, M. Qiu, and E. H.-M. Sha, "Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation," in DAC '10, 2010, pp. 350–355.
[3] J. Hu, C. Xue, Q. Zhuge, W. Tseng, and E. H.-M. Sha, "Towards energy efficient hybrid on-chip scratch pad memory with non-volatile memory," in DATE '11, 2011, pp. 1–6.
[4] W.-C. Tseng, C. J. Xue, Q. Zhuge, J. Hu, and E. H.-M. Sha, "Optimal scheduling to minimize non-volatile memory access time with hardware cache," in VLSI-SoC '10, 2010.
[5] J. Hu, W.-C. Tseng, C. J. Xue, Q. Zhuge, Y. Zhao, and E. H.-M. Sha, "Write activity minimization for non-volatile main memory via scheduling and recomputation," IEEE TCAD, vol. 30, no. 4, pp. 584–592, 2011.
[6] S. Kaneko, H. Kondo, N. Masui, K. Ishimi, T. Itou, M. Satou, N. Okumura, Y. Takata, H. Takata, M. Sakugawa, T. Higuchi, S. Ohtani, K. Sakamoto, N. Ishikawa, M. Nakajima, S. Iwata, K. Hayase, S. Nakano, S. Nakazawa, K. Yamada, and T. Shimizu, "A 600-MHz single-chip multiprocessor with 4.8-GB/s internal shared pipelined bus and 512-kB internal memory," IEEE Journal of Solid-State Circuits, vol. 39, no. 1, pp. 184–193, 2004.
[7] H. P. Hofstee, "Power efficient processor architecture and the Cell processor," in HPCA '05, 2005, pp. 258–262.
[8] O. Avissar, R. Barua, and D. Stewart, "An optimal memory allocation scheme for scratch-pad-based embedded systems," ACM Trans. Embed. Comput. Syst., vol. 1, no. 1, pp. 6–26, 2002.
[9] S. Udayakumaran, A. Dominguez, and R. Barua, "Dynamic allocation for scratch-pad memory using compile-time decisions," ACM Trans. Embed. Comput. Syst., vol. 5, no. 2, pp. 472–511, 2006.
[10] Z. Hu, G. Gerfin, B. Dobry, and G. R. Gao, "Programming experience on Cyclops-64 multi-core chip architecture," in STMCS '06, 2006.
[11] W. Che, A. Panda, and K. S. Chatha, "Compilation of stream programs for multicore processors that incorporate scratchpad memories," in DATE '10, 2010, pp. 1118–1123.
[12] M. Kandemir, J. Ramanujam, and A. Choudhary, "Exploiting shared scratch pad memory space in embedded multiprocessor systems," in DAC '02, 2002, pp. 219–224.
[13] P. R. Panda, N. D. Dutt, and A. Nicolau, "Efficient utilization of scratch-pad memory in embedded processor applications," in ED&TC '97, 1997, p. 7.
[14] ——, "On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems," ACM Trans. Des. Autom. Electron. Syst., vol. 5, pp. 682–704, July 2000.
[15] J. Sjödin and C. von Platen, "Storage allocation for embedded processors," in CASES '01, 2001, pp. 15–23.
[16] S. Udayakumaran and R. Barua, "Compiler-decided dynamic memory allocation for scratch-pad based embedded systems," in CASES '03, 2003, pp. 276–286.
[17] L. Zhang, M. Qiu, and W.-C. Tseng, "Variable partitioning and scheduling for MPSoC with virtually shared scratch pad memory," Journal of Signal Processing Systems, vol. 50, no. 2, pp. 247–265, 2010.
[18] G. Tan, N. Sun, and G. R. Gao, "A parallel dynamic programming algorithm on a multi-core architecture," in SPAA '07, 2007, pp. 135–144.
[19] P. Edmonds, E. Chu, and A. George, "Dynamic programming on a shared memory multiprocessor," Parallel Computing, vol. 19, no. 1, pp. 9–22, 1993.
[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh, "Dynamic management of scratch-pad memory space," in DAC '01, 2001, pp. 690–695.
[21] V. Suhendra, C. Raghavan, and T. Mitra, "Integrated scratchpad memory optimization and task scheduling for MPSoC architectures," in CASES '06, 2006, pp. 401–410.
[22] "CACTI model." [Online]. Available: http://www.hpl.hp.com/research/cacti/
[23] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in WWC '01, 2001, pp. 3–14.

