New Generation Computing, 11 (1993) 251-269, OHMSHA, LTD. and Springer-Verlag
© OHMSHA, LTD. 1993

UNIRED II: The High Performance Inference Processor for the Parallel Inference Machine PIE64

Kentaro SHIMADA, Hanpei KOIKE and Hidehiko TANAKA
Department of Electrical Engineering, Faculty of Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan.

Received 20 November 1992
Revised manuscript received 26 March 1993

Abstract    UNIRED II is the high performance inference processor for the parallel inference machine PIE64. It is designed for the committed choice language Fleng and for use as an element processor of parallel machines. Its main features are: 1) a tag architecture, 2) three independent memory buses (instruction fetching, data reading, and data writing), 3) a dedicated instruction set for efficient execution of Fleng, and 4) multi-context processing for reducing pipeline interlocking and the overhead of inter-processor synchronization. With the multi-context processing mechanism, the internal pipeline is shared by several independent instruction streams (contexts), and which contexts are to be executed is determined cycle by cycle. Thus UNIRED II acts as a shared-pipeline MIMD processor. In this paper, several architectural features, including the multi-context processing and the instruction set, are described. Performance measurement results obtained by simulation are also presented. On a 10MHz UNIRED II, 920K Reductions Per Second has been achieved, and it is shown that the multi-context processing mechanism is very effective in improving performance.

Keywords: Multithreading, Dedicated Processor, Fleng, Parallel Inference Machine.

§1 Introduction
Committed choice languages, such as Fleng,6) FCP,8) and KL1,7) are parallel logic programming languages designed for efficient parallel execution of logic programs. However, we have anticipated that high performance can hardly be achieved by parallel machines built out of conventional processors, because these processors are basically designed for sequential and procedural languages. Recently, conventional RISC processors have proven very effective in achieving high performance with procedural languages on sequential machines, but we see the following difficulties for committed choice languages on parallel machines:
(1) Committed choice languages, like Lisp and some functional languages, are not statically typed. Static analysis techniques for eliminating type checking are not powerful enough to remove all runtime type checks, and type checking can take a significant fraction of the instructions executed for a logic program on conventional processors.
(2) Committed choice languages are parallel languages which, like most modern high-level languages, require a global address space for implementation. When a parallel machine with non-uniform memory access is used, long-latency remote memory accesses and inter-processor synchronization occur frequently. This can degrade processor utilization dramatically. If the processing element of such a machine can switch contexts very fast, processor utilization can be improved by switching contexts at the time of a remote memory access. However, it is well known that conventional processors are quite slow in switching contexts.
For these reasons, we designed a dedicated processor, named UNIRED II, for the processing element of the parallel inference machine PIE64,3) which we are developing.
PIE64 executes the committed choice language Fleng and has sixty-four processing elements. In this paper, we concentrate on the architecture of UNIRED II, and explain Fleng and the PIE64 system only briefly. To address the above two points, UNIRED II has the following features:
(i) UNIRED II has a tag architecture and a dedicated instruction set including dynamic type checking instructions such as variable dereference ones. These instructions are simple enough to be implemented on a highly pipelined architecture without undermining the philosophy of modern RISC processors.
(ii) UNIRED II has a cycle-by-cycle multithreading mechanism for very fast context switching. In practice, there is no overhead to switch contexts because context switching is performed almost every cycle.
In the rest of this paper, we first explain the committed choice language Fleng in Section 2. Then we give an overview of PIE64 in Section 3. In Section 4, we propose the UNIRED II processor architecture. In Section 5, we discuss some instructions of UNIRED II, and, in Section 6, we present some simulation results. Finally, in Section 7, we conclude this paper.

§2 Committed Choice Language Fleng
A Fleng program is a set of declarations of Horn clauses like:

    Head :- Goal1, Goal2, ..., Goaln.

where both the head and the body goals have a form like:

    pred(Arg1, Arg2, ..., Argm).

Execution begins when the top query goals are given. Each goal is checked against the heads of the clauses in the program, and when a match is found, new goals are spawned for each of the body goals of the corresponding clause. This process is called goal reduction. The new goals are then checked in the same way, and thus the execution proceeds. The head matching process, which is characteristic of committed choice languages, is called "passive unification", because unbound variables in the goal are not instantiated to a specific value. If some specific value in the goal is required during the head matching process, the process suspends until all the requisite variables are bound by other goals (variable binding is performed by the built-in predicate called unify). This suspension mechanism provides a parallel processing scheme based on data flow control, and gives Fleng a parallel execution model, as sketched below.
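The following fragment is a minimal sketch, in C, of the reduction-and-suspension cycle just described. It is illustrative only: all names (try_clause, reduce, and so on) are hypothetical and are not taken from the PIE64 implementation.

```c
/* Minimal sketch of Fleng goal reduction (hypothetical names). */
#include <stdio.h>
#include <stdbool.h>

typedef enum { COMMIT, FAIL, SUSPEND } match_t;

/* Passive unification of a goal against one clause head.  Goal variables
 * are never bound here; needing the value of an unbound goal variable
 * yields SUSPEND.  This stub pretends clause 0 needs an unbound variable. */
static match_t try_clause(int goal, int clause) {
    (void)goal;
    return (clause == 0) ? SUSPEND : COMMIT;
}

static void reduce(int goal, int nclauses) {
    bool may_suspend = false;
    for (int c = 0; c < nclauses; c++) {
        switch (try_clause(goal, c)) {
        case COMMIT:    /* spawn one new goal per body goal of clause c */
            printf("goal %d committed to clause %d\n", goal, c);
            return;
        case SUSPEND:   /* remember, but still try the remaining clauses */
            may_suspend = true;
            break;
        case FAIL:
            break;
        }
    }
    if (may_suspend)    /* hook the goal until unify binds the variable */
        printf("goal %d suspends\n", goal);
    else
        printf("goal %d fails\n", goal);
}

int main(void) { reduce(1, 2); return 0; }
```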
§3 PIE64
PIE64 is a parallel inference machine which has been under development at the University of Tokyo, Japan since 1988. It is a parallel machine dedicated to the parallel execution of the committed choice language Fleng. PIE64 has sixty-four processing elements, which are called inference units (IUs), and two independent interconnection networks (the Process Allocation Network, PAN, and the Data Access/Allocation Network, DAAN).4) These interconnection networks are implemented as multistage (actually 3-stage), circuit-switching networks, and have a special function of broadcasting load information for automatic load balancing, so that each IU can select the minimum-load IU automatically.

Fig. 1 Organization of the Inference Unit of PIE64 (CBIF: Command Bus Interface; MBIF: Memory Bus Interface).

Each IU has an inference processor UNIRED II, two* network interface processors (NIPs),5) a management processor (MP), and a local memory of four banks (see Fig. 1). The NIP, which is a dedicated processor like UNIRED II, performs inter-IU communication and synchronization functions in a form suitable for Fleng execution. The transmission throughput is 10M words per sec., namely 40MBytes per sec., at one connection for each network. The MP manages process scheduling, load distribution, and load balancing, and performs other functions such as system maintenance. We use a general purpose processor, SPARC, as the MP to make process management flexible. Thus, in the IUs, we exploit functional parallelism among the three kinds of processors: UNIRED II performs computation, the NIPs perform communication and synchronization, and the MP performs process management.

* In practice, there are four network interface processor chips in one IU. Two of them act in master mode and can start the action of a network connection, while the other two act in slave mode and respond to the master NIPs when requested.

In an IU, these three kinds of processors share the local memory and access it through a three-way memory bus which is driven with a 10MHz clock to synchronize with network access via the NIPs. So we get a throughput of 10M words (40MBytes) per sec. each way, namely 120MBytes per sec. in all. In addition, UNIRED II, the NIPs, and the MP can communicate with each other through a coprocessor command bus using a specialized command-reply protocol among these processors. This command bus exists for efficient Fleng-specific operations, such as goal forking and remote structured data access. The format of the coprocessor commands is determined with the Fleng data types taken into account.

§4 UNIRED II Processor Architecture
4.1 Overview
UNIRED II is a dedicated processor. It was designed for executing Fleng programs efficiently and to meet the requirements of an element processor for parallel machines. Its main features are:
(1) a tag architecture for efficient type checking,
(2) three independent memory buses (instruction fetching, data reading, and data writing) to make the most of the three-way local memory bus in an IU of PIE64,
(3) cycle-by-cycle multithreading for fast context switching, and
(4) a dedicated instruction set to execute Fleng programs efficiently.
All instructions of UNIRED II are one word (32 bits) long and are single-cycle instructions, because we adopted a high performance RISC-type pipelined architecture. The data types of UNIRED II also have a length of 32 bits, including mark bits for garbage collection and tag bits (see Fig. 2).

Fig. 2 UNIRED II Data Formats (Unbound Variable (VAR), Undefined (UDF), List (LST), Vector (VEC), and Constant cells; each word carries mark bits (MS) and a tag, pointers contain an IU id and an offset within the IU, and constant cells carry a cell type such as integer (0000) or symbol (0001); cell types 0010-1110 are undefined).
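As a rough illustration of such tagged cells, the following C fragment encodes and decodes a 32-bit word in the style of Fig. 2. Only the 28-bit address with its upper six bits as the IU identifier is taken from the text (see Section 4.3(2)); the exact positions of the tag and mark fields below are our assumptions for the sketch.

```c
/* Sketch of a tagged 32-bit cell (field positions are assumptions). */
#include <stdint.h>
#include <assert.h>

enum tag { TAG_VAR, TAG_UDF, TAG_LST, TAG_VEC, TAG_CONST };  /* Fig. 2 types */

#define TAG(w)    ((w) >> 29)                 /* assumed: 3-bit tag, bits 31..29 */
#define MARK(w)   (((w) >> 28) & 1u)          /* assumed: GC mark bit 28 */
#define ADDR(w)   ((w) & 0x0FFFFFFFu)         /* 28-bit address */
#define IU_ID(w)  (ADDR(w) >> 22)             /* upper 6 bits: IU identifier */
#define OFFSET(w) (ADDR(w) & 0x003FFFFFu)     /* lower 22 bits: offset within IU */

int main(void) {
    /* a list pointer into IU 5, offset 0x1234 */
    uint32_t w = ((uint32_t)TAG_LST << 29) | (5u << 22) | 0x1234u;
    assert(TAG(w) == TAG_LST && IU_ID(w) == 5 && OFFSET(w) == 0x1234);
    return 0;
}
```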
In the following sections, we will discuss the cycle-by-cycle multithreading mechanism and the instruction set of UNIRED II. We have implemented UNIRED II in a 1.2μm CMOS gate array and drive it with a 10MHz clock. This clock rate is comparatively slow, but was dictated by the absence of a cache between UNIRED II and its local memory: UNIRED II must synchronize with the local memory bus, which in turn synchronizes with the interconnection network interface hardware (namely the NIPs) of PIE64 for high speed communication through the network. Besides the cost restrictions, a good cache design for a Fleng processor is not yet clear. Though this will be studied in the future, currently the local memory bus in an IU provides enough memory bandwidth for UNIRED II. Similarly, no virtual address system has been implemented: UNIRED II handles only physical memory addresses and has neither a memory page fault trap mechanism nor an address translation mechanism.

4.2 Multi-Context Processing
Here we explain the cycle-by-cycle multithreading mechanism (we call it "multi-context processing") of UNIRED II.
(1) Basic idea
The internal pipeline of UNIRED II is shared by multiple instruction streams (contexts). UNIRED II executes them concurrently, and which contexts should be executed is determined cycle by cycle. In other words, UNIRED II acts as a pipeline-shared MIMD processor (see Fig. 3). Because Fleng is a parallel language which generates many instruction streams in parallel, we can expect to get enough instruction streams to fill the contexts of UNIRED II. Process scheduling is determined by the MP, and UNIRED II starts a new process as one of its contexts upon receiving an appropriate coprocessor command from the MP. When one of the contexts waits for the result of some remote memory access, UNIRED II fills its pipeline with the other contexts dynamically so that the pipeline is kept busy.

Fig. 3 Multi-Context Processing of UNIRED II (newly generated processes are fed from the MP into one of the multiple contexts, each with its own program counter (PC) and status register entry; one active context is selected at every clock cycle and the value of its PC is fed to the internal pipeline; a context is put into the sleep state if it waits for a reply, and a sleeping context is activated again when the reply is received).

Some earlier multithreaded processors, such as MASA12) and CPC,13) do not have a pipeline interlocking mechanism and cannot execute one instruction stream in consecutive cycles. Therefore they slow down dramatically as the number of executable instruction streams decreases. UNIRED II also slows down when the number of executable instruction streams decreases, but only a little, because it can execute instructions of a single context in consecutive cycles as long as no pipeline interlock occurs.

The context scheduling strategy for multi-context processing is very simple. UNIRED II schedules its four contexts in a round-robin manner as long as the contexts are executable. When one of the contexts becomes unexecutable, UNIRED II simply omits it and schedules only the remaining contexts. For this purpose, there is a small special register (Status Reg. in Fig. 3; two bits per context, eight bits in all) indicating which contexts are executable, which are waiting for some reply, and so on, as sketched below.
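The following C fragment sketches this selection policy: round-robin over the four contexts, skipping those that are not executable. It is a behavioural sketch only; the actual status encodings of UNIRED II are not given in the text, so the three states below are assumptions.

```c
/* Sketch of cycle-by-cycle round-robin context selection. */
#include <stdio.h>

enum ctx_state { RUNNABLE, WAIT_REPLY, IDLE };  /* assumed 2-bit encodings */

static enum ctx_state status[4];  /* the 8-bit Status Reg.: 2 bits per context */
static int last = 3;              /* context issued in the previous cycle */

/* Pick the next executable context in round-robin order, or -1 when all
 * contexts wait for replies (the pipeline then sleeps until one arrives). */
static int select_context(void) {
    for (int i = 1; i <= 4; i++) {
        int c = (last + i) % 4;
        if (status[c] == RUNNABLE)
            return last = c;
    }
    return -1;
}

int main(void) {
    status[0] = RUNNABLE; status[1] = WAIT_REPLY;
    status[2] = RUNNABLE; status[3] = IDLE;
    for (int cycle = 0; cycle < 6; cycle++)   /* contexts 0 and 2 alternate */
        printf("cycle %d: context %d\n", cycle, select_context());
    return 0;
}
```

Note that when only one context is runnable, the same context is selected every cycle; this is why UNIRED II, unlike MASA or CPC, can execute a single instruction stream in consecutive cycles.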
4.3 Hardware Organization
(1) Pipeline organization
Figure 4 shows the internal pipeline organization of UNIRED II. The pipeline consists of seven stages. The main reason why as many as seven stages are required is that, in an IU, every memory access needs two phases: a bus arbitration phase, in which the two kinds of independent accesses from UNIRED II and the NIPs are arbitrated, and a data transfer phase, in which the memory access is actually performed, so that the memory access time hides the bus arbitration time. Thus, UNIRED II fetches an instruction from the local memory at the first and second stages; decodes the instruction and reads registers at the third stage; executes the instruction at the fourth stage; reads data from memory at the fifth and sixth stages; and writes data into memory at the sixth and seventh stages. UNIRED II also writes data back into registers at the seventh stage, and provides appropriate bypasses in case the data is required by the instruction which immediately follows in the pipeline.

Fig. 4 Pipeline Organization of UNIRED II (Stages 1-2: instruction fetch (bus arbitration, then data transfer); Stage 3: decode and register read; Stage 4: ALU calculation; Stages 5-6: data read (arbitration, then transfer); Stages 6-7: data write (arbitration, then transfer) and register write).

The effectiveness of the pipeline architecture is determined by when and where pipeline interlocks occur. In UNIRED II, a pipeline interlock cannot occur if all four contexts are executable, because both the branch delay and the load delay are just three cycles each. Because of the cycle-by-cycle multi-context mechanism, it cannot be predicted statically which instructions will execute next, so the delayed branch scheme cannot be used. Instead, UNIRED II invalidates already-fetched instructions in the branch delay slots if they belong to the same context as the branch instruction. Similar conditions hold for the load delay slots. When all four contexts are executable, both the branch delay slots and the load delay slots are filled with instructions of different contexts, so that there is no dependency among them. Consequently, a pipeline interlock never occurs in that case.

(2) Mechanism of remote memory access
In principle, the memory-accessing instructions of UNIRED II can perform remote memory access automatically. UNIRED II has a special register which holds its six-bit IU identifier, and it compares the IU identifier field of the memory address (the upper six bits of the twenty-eight-bit address, see Fig. 2) with the IU-id register when executing memory access instructions. When they are not equal, UNIRED II issues a remote memory access command to a NIP instead of accessing its local memory. The result of the remote memory access is sent back as a coprocessor reply from the NIP. UNIRED II must receive the replies of the remote memory access commands correctly. For this purpose, each general register has an additional bit indicating whether the register is reserved for some reply or not. When an instruction reads a register whose reply wait bit is set before the reply has been received, UNIRED II cancels the instruction and puts its context to sleep. After receiving the reply, UNIRED II wakes the context up and resumes the canceled instruction.
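A behavioural sketch of this reply-wait mechanism is given below. The structure and names (issue_load, read_reg, and so on) are illustrative assumptions, not the actual UNIRED II hardware interfaces; only the IU-id comparison and the per-register wait bit come from the text.

```c
/* Sketch of remote access detection and the per-register reply wait bit. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { uint32_t value; bool wait_reply; } greg_t;
static greg_t regs[32];

static uint32_t local_read(uint32_t a)              { return a; /* stub */ }
static void remote_read_via_nip(int rd, uint32_t a) { (void)a; printf("NIP: read for r%d\n", rd); }
static void sleep_context(int ctx)                  { printf("context %d sleeps\n", ctx); }

/* A load whose address lies in another IU becomes a coprocessor command
 * to a NIP; the destination register is reserved until the reply arrives. */
static void issue_load(int rd, uint32_t addr, uint32_t my_iu) {
    if ((addr >> 22) != my_iu) {   /* upper 6 bits of the 28-bit address */
        remote_read_via_nip(rd, addr);
        regs[rd].wait_reply = true;
    } else {
        regs[rd].value = local_read(addr);
    }
}

/* Reading a reserved register cancels the instruction and puts the
 * context to sleep; the reply later clears the bit and wakes it up. */
static bool read_reg(int rs, uint32_t *out, int ctx) {
    if (regs[rs].wait_reply) {
        sleep_context(ctx);
        return false;              /* instruction canceled, resumed later */
    }
    *out = regs[rs].value;
    return true;
}

int main(void) {
    uint32_t v;
    issue_load(4, 7u << 22, /* my_iu = */ 3u);  /* address in IU 7: remote */
    if (!read_reg(4, &v, 0))
        puts("r4 still waiting for its reply");
    return 0;
}
```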
§5 Instruction Set
Table 1 gives an overview of the instruction set. UNIRED II has 32 general registers and some special registers, such as the program counter and the condition flag register, for each context (so 128 general registers in all, and so on). Some of the general registers are implicitly specified by several instructions to reduce the number of bits necessary for encoding operand registers within one word (32 bits). More technical detail can be found in Ref. 2). Here we describe the most interesting instructions of UNIRED II.

Table 1 The Instruction Set of UNIRED II.

Dereference: derf dereference; dfca dereference and check atomic; dfcl dereference and check list; dcll dereference and check list and load car; dfcv dereference and check vector; dcvl dereference and check vector and load top; dfcc dereference and check constant
Execute: exec execute; exea execute on atomic; exel execute on list; exll execute on list and load car; exev execute on vector; exvl execute on vector and load top; exct execute on constant
Structure Manipulation: cpir copy if remote; cprr copy if remote with register; cvtp check vector top; cvtr check vector top with register
Load/Store: ld load; tgld tagged load; ldst load and store; tlds tagged load and store; st store; sudf store undefined code; stim store immediate
Active Unification: bind bind variable; bdim bind with immediate; cvos check variable order and swap
Heap Allocation: allc allocate
Flow Control: jump jump; call call; jcmp jump on compare; jncp jump on not compare; jtag jump on tag; jntg jump on not tag; jrmt jump on remote pointer; jloc jump on local pointer; jcc jump on flag condition; stop stop
Coprocessor: cpcm send coprocessor command
Garbage Collection: ldsm load and store mark; stmm store with modified mark; jmrk jump on mark/stop condition; swmm swap with modified mark
Set: setc set constant; sett set mark and tag; seta set alternative pointer; setf set condition flags; clrf clear condition flags; gfc get flag condition
Arithmetic and Logical: add add; adc add with carry; sub subtract; sbb subtract with borrow; and bit-wise and; or bit-wise or; xor bit-wise exclusive or; rol rotate left; ror rotate right; rcl rotate with carry left; rcr rotate with carry right; shl shift left; shr shift right; asr arithmetic shift right
Management: spid set pid register; pidt preset pid and tag; shp set heap pointer; lhp load from heap pointer; exhp exchange heap pointer; stp set scratch area pointer; ltp load from scratch area pointer; ldf load from flags; stf store flags

5.1 Dereference Instructions
(1) Re-execution technique
The dereference instruction is the most characteristic one in the instruction set of UNIRED II. A variable in logic programming languages can refer to another variable, which in turn can refer to yet another variable, and so on. Hence, when a processor tries to get the value of a variable, it may have to trace links of variables in memory until it reaches something other than a reference to a variable. This variable dereferencing requires multiple memory reads, but we restricted all instructions of UNIRED II to single-cycle ones for implementation on its highly pipelined architecture, so we cannot read multiple memory words with one execution of a single-cycle instruction. However, this kind of operation is dynamically determined and, to get high performance, it is desirable to implement it with one instruction.

The central idea of implementing an (unbounded) multiple operation with a single-cycle instruction is the re-execution technique. That is, when a single-cycle instruction needs more than one cycle to complete its operation, it jumps to itself so that it will be fetched and executed again to do the rest of the operation. As for the dereference instruction, when the operand register contains a pointer to a variable, it reads the data of the variable from memory, writes the data into the same operand register, and jumps to itself. As a result, it has dereferenced one link. Thus, it can dereference one link per cycle without stalling the pipeline or blocking the execution of instructions from other contexts.
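In software terms, one execution of derf behaves roughly as below: it follows at most one link, and "jumps to itself" when more links remain, so the next turn of the same context re-executes it. The tag encodings reuse the assumptions of the earlier sketches, and the treatment of a still-unbound (UDF) cell is simplified.

```c
/* Sketch of one single-cycle execution of the derf instruction. */
#include <stdint.h>
#include <stdbool.h>

#define TAG(w)    ((w) >> 29)            /* assumed 3-bit tag, as before */
#define ADDR(w)   ((w) & 0x0FFFFFFFu)
#define TAG_VAR   0u                     /* reference to a variable cell */
#define TAG_UDF   1u                     /* still-unbound variable */
#define TAG_CONST 4u

static uint32_t heap[1024];
static uint32_t memory_read(uint32_t a) { return heap[a % 1024]; }
static uint32_t regs[32];
static uint32_t pc = 100;                /* word-addressed program counter */

/* Returns true when dereferencing is finished for register r. */
static bool derf(int r) {
    uint32_t w = regs[r];
    if (TAG(w) != TAG_VAR)               /* not a reference: done */
        return true;
    uint32_t cell = memory_read(ADDR(w));
    if (TAG(cell) == TAG_UDF)            /* unbound variable: stop here */
        return true;                     /* (suspension handling omitted) */
    regs[r] = cell;                      /* one link followed this "cycle" */
    pc -= 1;                             /* jump to itself: re-execute next time */
    return false;
}

int main(void) {
    heap[1] = (TAG_VAR << 29) | 2u;      /* cell 1 refers to cell 2 */
    heap[2] = (TAG_CONST << 29) | 42u;   /* cell 2 holds a constant */
    regs[0] = (TAG_VAR << 29) | 1u;
    while (!derf(0)) ;                   /* two links, two re-executions */
    return 0;
}
```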
(2) Suspension register
One significant point of the dereference instruction of UNIRED II is a special facility for committed choice languages: the suspension register mechanism. In the head matching of committed choice languages, a goal may suspend when a component of its arguments is an unbound variable while the corresponding component of the current clause head is not a variable. Only after trying all alternative clauses and committing to none does the goal really suspend. Therefore, when a goal may suspend at one of the alternative clauses, the variable which caused the potential suspension must be recorded, by some suspension stack mechanism, so that the goal can be hooked on it once all alternative clauses have been tried. In the case of UNIRED II, the variable is recorded in the suspension register (general register R30) by the dereference instruction.

The suspension register mechanism has two kinds of effect. First, the suspension stack, which resides in memory in the case of other similar processors,14) can easily be implemented in registers, so that the suspension check is sped up. Second, and more important, when the head matching is deterministic, as is often the case in real programs, once a goal suspends at one of the alternative clauses, the goal suspends after all, so no suspension stack in memory is necessary in that case. The suspension register mechanism speeds up this very common case.

(3) Combined instructions
Several combined versions of the dereference instructions exist to be used by the compiler for further optimization. The dereference-and-check-list (dfcl) instruction checks whether the dereferenced value is a pointer to a list or not, and the dereference-and-check-list-and-load-car (dcll) instruction additionally reads the car part of the list if the dereferenced value is a pointer to a list. Similarly, the dereference-and-check-vector (dfcv) instruction checks whether the dereferenced value is a pointer to a vector, and so on. These instructions are capable of a two-way jump, one for suspension and the other for failure of the pointer type check (a three-way branch if you count the branch-not-taken path). The jump addresses are given by an offset value from the instruction itself and by the alternative pointer register (general register R28). Another kind of combined instruction is the execute instructions, which perform tail recursion and are combined with the dereference instructions so that they speed up the tail recursion and the consequent head matching sequence. These combined instructions are implemented only as far as they do not violate the restrictions of the highly pipelined architecture and the hardware cost of the instruction decoder. Nevertheless, they have a considerable effect on performance, as shown later.

5.2 Arithmetic and Bit-wise Logic Instructions
The arithmetic and bit-wise logic instructions of UNIRED II are very similar to those of conventional processors. They exist for compiling such things as built-in arithmetic predicates. One difference is that those of UNIRED II check the tag parts of the operands and set the tag error flag bit of the flag register according to the values of the tags.* Therefore these instructions come with several switches to deal with the various tag types. For example, the add.i instruction (.i switch on) adds an integer to an integer (and otherwise sets the tag error flag), the add.p instruction adds an integer to a pointer, and the add.b instruction adds the total 32 bits to 32 bits and does not change the tag error flag.

* To simplify the hardware, there are no tag error trap mechanisms in UNIRED II. Besides, UNIRED II's target is not only numerical computation but general parallel applications including symbolic computation.
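The sketch below illustrates the intent of these switches for add.i: the tag check and the addition happen together, and a mismatch only sets a flag (there is no trap, as noted in the footnote). Field widths and encodings are again the assumed ones from the earlier sketches, and the cell-type subfield of constants is ignored for brevity.

```c
/* Sketch of the tag-checked add.i behaviour (assumed encodings). */
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define TAG(w)     ((w) >> 29)
#define VAL(w)     ((w) & 0x0FFFFFFFu)
#define TAG_CONST  4u

static bool tag_error_flag;    /* tested later by software, never trapped */

static uint32_t add_i(uint32_t a, uint32_t b) {
    if (TAG(a) != TAG_CONST || TAG(b) != TAG_CONST)
        tag_error_flag = true;                 /* operands were not constants */
    return ((VAL(a) + VAL(b)) & 0x0FFFFFFFu) | (TAG_CONST << 29);
}

int main(void) {
    uint32_t two = (TAG_CONST << 29) | 2u, three = (TAG_CONST << 29) | 3u;
    assert(VAL(add_i(two, three)) == 5 && !tag_error_flag);
    return 0;
}
```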
5.3 An Example of Compiled Code
Now, we present an example of compiled code of UNIRED II in Fig. 5.

    append([H|T], X, Y) :- Y = [H|Z], append(T, X, Z).
    append([],    X, Y) :- X = Y.

    append: seta  $suspend
            dcll  r1, $2, r4       ; [H|
    $1:     tgld  [r1 + 1], r1     ; T],
            allc  s, lst, 2, r5
            st    r4, [r5], %tmp   ; H|
            sudf  [r5 + 1], r6     ; Z],
            bind  r5, r3, r7       ; Y = [H|Z],
            jntg  r7, udf, $check1
            mov   r6, r3           ; Z).
            exll  r1, $1, r4       ; [H|
    $2:     dfcc  r1, nil, $fail   ; []
            cvos  r2, r3, r2, r3   ; X = Y.
            bind  r2, r3, r7
            jntg  r7, udf, $check2
            succ  gtp, mp
            stop

Fig. 5 An Example of Compiled Code (append).

It is the code compiled from a deterministic append program. In the tail recursion loop (between labels $1 and $2 in the figure), there are only eight instructions. Therefore UNIRED II can execute the append program at a maximum rate of 1.25 million Reductions Per Second at the clock rate of 10MHz.

§6 Simulation Results
6.1 Simulation Model
Fig. 6 shows the simulation model used for the evaluation. We evaluated UNIRED II as a single, independent processor, and emulated the coprocessor commands issued by UNIRED II with the imaginary command processor shown in Fig. 6. In a real IU of PIE64, these commands are processed by the MP and the NIPs. Table 2 shows the number of cycles which are necessary for emulating the commands with the command processor. As for the network access commands, the number of cycles in the table was determined based on the NIP's performance from Ref. 5). In addition, we used an independent queue memory for queuing newly spawned goals in the simulation model; this roughly corresponds to the MP memory in Fig. 1. The goal scheduling strategy with this queue memory is LIFO (Last-In/First-Out).

6.2 Performance with Sample Programs
First, we evaluate UNIRED II's performance under the condition that there is no remote memory access. We use, as the sample programs, append 100 (deterministic append of a list of length 100), nreverse 30 (naive reverse of a list of length 30), qsort 50 (quick sort of a list of length 50), 8 queens (the 8-queens problem), and bestpath 36 (finding the shortest paths to a node from the other 35 nodes). Table 3 shows some aspects of the sample program behaviors, including the performance numbers. The performance numbers are expressed in terms of Fleng Goal Reductions Per Second (RPS) and are measured with a 10MHz clock. All codes of the sample programs are compiled by the prototype compiler, which is still under development.
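For reference, the columns of Table 3 are tied together by the simple relation performance ≈ clock rate / (cycles per instruction × instructions per reduction); the check below against the nreverse 30 column is ours, with the small discrepancy due to rounding in the published figures.

```c
/* Relating the Table 3 columns (our check; rounding differences remain). */
#include <stdio.h>

int main(void) {
    double clock = 10e6;   /* 10MHz */
    double cpi   = 1.12;   /* clock cycles per instruction (nreverse 30) */
    double ipr   = 9.61;   /* instructions per reduction   (nreverse 30) */
    printf("%.1f KRPS\n", clock / (cpi * ipr) / 1e3);  /* ~929 vs. 924.9 */
    return 0;
}
```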
As for append 100, the performance is comparatively low because it spawns only one goal in the body, which is executed by tail recursion, so that only one context is available. That makes the CPI (clock cycles per instruction) go over two, because the multi-context processing does not work and pipeline interlocks occur frequently. (Note that the CPI does not reach four, because UNIRED II does not switch contexts unconditionally every cycle: unexecutable contexts are simply omitted.) So the performance is degraded from the number predicted from the compiled code shown in Fig. 5. A similar condition holds for the bestpath 36 program: there the average number of active contexts is about two, so the CPI goes up and the performance is somewhat degraded. However, the other three programs in Table 3 (nreverse 30, qsort 50, 8 queens) have enough contexts to keep the CPI close to one.

Another interesting feature revealed in Table 3 is that the rate of memory access is about twenty percent or higher in the sample programs. This seems somewhat higher than for conventional procedural languages. It is partly because the prototype compiler uses the goal-in-register scheme, in which all arguments of a goal are loaded into registers before the reduction of the goal starts and stored back into memory when the goal suspends. Besides, we can say that committed choice languages tend to have many memory references because of their great ability to manipulate data structures.

Table 3 Several Aspects and Performance Numbers of the Sample Programs.

    programs                           append 100   nreverse 30   qsort 50      8 queens        bestpath 36
    total clock cycles                 1930         5364          8336          704844          304188
    performance [KRPS]                 523.3        924.9         455.9         551.6           224.9
    number of reductions               101          496           380           38878           6842
    number of suspensions              0            29            122           545             1211
    number of executed instructions    813          4768          7864          696069          236039
    instructions per reduction         8.05         9.61          20.7          17.9            34.5
    average number of active contexts  0.998        3.24          3.79          3.95            2.33
    clock cycles per instruction       2.37         1.12          1.06          1.01            1.29
    number of memory reads             305 (15.8%)  1699 (31.7%)  2236 (26.8%)  154885 (22.0%)  53488 (17.6%)
    number of memory writes            301 (15.6%)  1663 (31.0%)  2042 (24.9%)  142716 (20.2%)  54607 (18.0%)

RPS: Fleng Goal Reductions Per Second. Numbers in parentheses are rates of memory access to the total clock cycles.

6.3 Tolerance of Remote Access Latency
To evaluate the tolerance of remote memory access latency, we incorporated a pseudo-remote access mechanism into the simulator in spite of its single-processor model shown in Fig. 6. In more detail, we change the value of the IU-identifier field of the pointers included in every goal, when reduction of the goal starts or resumes after suspension, with a probability which we call the remote pointer ratio. Remote memory access commands issued by UNIRED II are emulated by the command processor shown in Fig. 6 with the cycle counts listed in Table 2.
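A sketch of this pointer rewriting is shown below; the function name and the random number usage are ours, but the rewritten field (the 6-bit IU identifier in the upper bits of the 28-bit address) and the probability p, the remote pointer ratio, follow the text.

```c
/* Sketch of the pseudo-remote access trick of Section 6.3. */
#include <stdint.h>
#include <stdlib.h>

/* Applied to each pointer argument of a goal when its reduction starts
 * or resumes: with probability p, redirect the pointer to another IU. */
static uint32_t pseudo_remote(uint32_t ptr, double p, uint32_t my_iu) {
    if ((double)rand() / RAND_MAX < p) {
        uint32_t other = (my_iu + 1u + (uint32_t)(rand() % 63)) % 64u;
        ptr = (ptr & ~(0x3Fu << 22)) | (other << 22);  /* rewrite IU id field */
    }
    return ptr;  /* an access through it is then charged the Table 2 latency */
}

int main(void) { (void)pseudo_remote(0x1234u, 0.5, 3u); return 0; }
```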
Under these conditions, we varied the maximum number of contexts from one to four, and measured the clock cycles consumed by all sorts of pipelined execution of instructions using the 8 queens program. The results are shown in Figs. 7 to 9. In these figures, the lowest part (shadowed) of each graph represents the number of executed instructions; the second part (hatched) represents the number of instructions invalidated by branches; the third part (lightly shadowed) the number of cycles during which the internal pipeline of UNIRED II holds; and the fourth, uppermost part (white) the number of cycles during which the pipeline sleeps because no contexts can be executed, all waiting for replies. The ratio of the lowest, shadowed part to the total clock cycles corresponds to the processor utilization under remote memory access.

In Fig. 7, the multi-context processing mechanism of UNIRED II is not activated because the maximum number of contexts is set to one. Therefore the pipeline sleeping time (the white part of the graph) cannot be hidden and grows longer and longer as the remote memory access increases. Moreover, the pipeline hold time and the number of invalidated instructions are large because pipeline interlocks occur frequently. In the other two figures (Figs. 8 and 9), the multi-context processing mechanism works, and works more effectively as the number of contexts increases. The pipeline sleeping time is smallest in Fig. 9, and pipeline interlocks (the pipeline hold and the instruction invalidation) hardly occur in that figure; they become only a little longer as the remote memory access increases, because the average number of active contexts decreases. Figure 8 shows an intermediate state between Figs. 7 and 9.

Fig. 7 All Sorts of Clock Cycles vs. Remote Memory Access (8 queens, the maximum number of contexts = 1; total clock cycles, divided into executed instructions, invalidated instructions, pipeline hold, and pipeline sleep, plotted against the remote pointer ratio [%]).
Fig. 8 All Sorts of Clock Cycles vs. Remote Memory Access (8 queens, the maximum number of contexts = 2).
Fig. 9 All Sorts of Clock Cycles vs. Remote Memory Access (8 queens, the maximum number of contexts = 4).

6.4 Effects of Dedicated Instructions
Here we present the effect of the dereference and combined instructions, which are the most characteristic of the instruction set of UNIRED II. Fig. 10 shows the speedup of four sample programs (nreverse 30, qsort 50, 8 queens, bestpath 36) without the dereference instructions (the dereference instructions resolved into more basic instructions), with only the basic dereference (derf) instruction, with the dereference-and-check-list/vector/constant (dfcl/dfcv/dfcc) instructions, and with all the combined instructions such as the dereference-and-check-list-and-load-car (dcll) and execute-on-list-and-load-car (exll) instructions, respectively. In the figure, the speedup from the basic dereference instruction is about ten percent. In addition, the combined instructions have the further effect shown, and the total speedup from these instructions is about thirty percent. Therefore it can be said that the dereference instructions have a great effect.

Fig. 10 Effects of Dereference Instructions (speedup of nreverse 30, qsort 50, 8 queens, and bestpath 36 over the configurations none, +derf, +dfcl/dfcv/dfcc, and +dcll/dcvl/exll).

6.5 Comparison with Other Processors
Finally, we compare the performance of UNIRED II with that of other dedicated processors for committed choice languages and sequential logic languages, namely Prolog.
Table 4 shows the performance of UNIRED II and the other processors.

Table 4 Comparison with Other Processors for Committed Choice Languages and Prolog.

    processor                    language   S/P*   performance [KRPS]   clock [MHz]
    BAM 17)                      Prolog     S      3680                 33
    CARMEL-2 15)                 FCP        P      2400                 31
    UNIRED II                    Fleng      P      920                  10
    KCM 16)                      Prolog     S      833                  12.5
    PIM/m element processor 19)  KL1        P      833                  16.7
    PIM/p element processor 18)  KL1        P      600                  20

    * S: sequential, P: parallel.

BAM17) is a RISC processor tuned for Prolog. It has the highest performance, but its main advantage is static analysis premised on sequential execution. KCM16) has a 64-bit tagged architecture and is controlled by a single central microsequencer. CARMEL-215) is a RISC processor tuned for FCP; its performance number was estimated based on the critical path design of the chip. PIM/p18) and PIM/m19) are parallel inference machines which adopt KL1 as their programming language; each has a dedicated processing element which corresponds to UNIRED II of PIE64. In comparison with the others, UNIRED II shows fairly good performance in spite of its slow clock rate and parallel execution. As for the comparison with general purpose processors, we discussed that point in another paper.1) There it is shown that UNIRED II is about fifty percent faster than a SPARC processor with a 33MHz clock; if we normalize the device technology by the clock rate, this means that UNIRED II is about four times faster than the SPARC processor.

§7 Conclusion
We have described the architecture of the inference processor UNIRED II and evaluated some aspects of the architecture. According to the simulation results, we get high performance (920KRPS for nreverse 30) in spite of the slow clock rate of 10MHz, and we have confirmed that the multi-context processing of UNIRED II has a great effect both on reducing pipeline interlocking and on reducing the overhead of remote memory access latency. Recently, we have produced real LSI chips of UNIRED II, and we are now completing the total system of PIE64. In the near future, we will evaluate the real PIE64 by running larger, real application programs.

Acknowledgements
This work was supported by a Grant-in-Aid for Specially Promoted Research (No. 62065002), and is now supported by a Grant-in-Aid for Encouragement of Young Scientists (No. 03001269) of the Ministry of Education, Science and Culture.

References
1) Shimada, K., Koike, H. and Tanaka, H., "The Instruction Set Architecture of the Inference Processor UNIRED II," Proc. of IFIP Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, IFIP Transactions A-23, North-Holland, pp. 117-128, 1993.
2) Shimada, K., "Inference Processor UNIRED II Design Specification, Volume I: Architecture" (in Japanese), Technical Report, TRIE-92-2, Information Engineering Course, Graduate School of Engineering, University of Tokyo, 1992.
3) Koike, H. and Tanaka, H., "Multi-Context Processing and Data Balancing Mechanism of the Parallel Inference Machine PIE64," Proc. of Fifth Generation Computer Systems 1988, ICOT, pp. 970-977, 1988.
4) Takahashi, E., Shimizu, T., Koike, H. and Tanaka, H., "A Study of a High Bandwidth and Low Latency Interconnection Network in PIE64," Proc. of Pacific Rim Conference on Communications, Computers and Signal Processing, IEEE, pp. 5-8, 1991.
5) Shimizu, T., Koike, H. and Tanaka, H., "Details of the Network Interface Processor for PIE64" (in Japanese), SIG Reports on Computer Architecture, 87-5, IPSJ, 1991.
6) Nilsson, M. and Tanaka, H., "Massively Parallel Implementation of Flat GHC on the Connection Machine," Proc. of Fifth Generation Computer Systems 1988, ICOT, pp. 1031-1040, 1988.
7) Chikayama, T., Sato, H. and Miyazaki, T., "Overview of the Parallel Inference Machine Operating System (PIMOS)," Proc. of the International Conference on Fifth Generation Computer Systems 1988, ICOT, pp. 230-251, 1988.
8) Shapiro, Ehud Y., "A Subset of Concurrent Prolog and Its Interpreter," Technical Report, TR-003, Institute for New Generation Computer Technology, 1983.
9) Agarwal, A., Lim, B., Kranz, D. and Kubiatowicz, J., "APRIL: A Processor Architecture for Multiprocessing," Proc. of the 17th International Symposium on Computer Architecture, IEEE, pp. 104-114, 1990.
10) Weber, W. D. and Gupta, A., "Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results," Proc. of the 16th Annual International Symposium on Computer Architecture, ACM, pp. 273-280, 1989.
11) Jordan, H. F., "Performance Measurements on HEP -- A Pipelined MIMD Computer," Proc. of the 10th Annual International Symposium on Computer Architecture, ACM, pp. 207-212, 1983.
12) Halstead, R. and Fujita, T., "MASA: A Multithreaded Processor Architecture for Parallel Symbolic Computing," Proc. of the 15th International Symposium on Computer Architecture, IEEE, pp. 443-451, 1988.
13) Shimizu, K., Goto, E. and Ichikawa, S., "CPC (Cyclic Pipeline Computer) -- An Architecture Suited for Josephson and Pipelined-Memory Machines," Transactions on Computers, 38, 6, IEEE, pp. 825-832, 1989.
14) Kimura, Y. and Chikayama, T., "An Abstract KL1 Machine and Its Instruction Set," Proc. of the 1987 Symposium on Logic Programming, pp. 468-477, 1987.
15) Harsat, A. and Ginosar, R., "CARMEL-2: A Second Generation VLSI Architecture for Flat Concurrent Prolog," Proc. of the International Conference on Fifth Generation Computer Systems 1988, ICOT, pp. 962-969, 1988.
16) Benker, H. et al., "KCM: A Knowledge Crunching Machine," Proc. of the 16th International Symposium on Computer Architecture, pp. 186-194, 1989.
17) Holmer, B. K., Sano, B., Carlton, M., Roy, P. V., Haygood, R., Bush, W. R. and Despain, A. M., "Fast Prolog with an Extended General Purpose Architecture," Proc. of the 17th International Symposium on Computer Architecture, IEEE, pp. 282-291, 1990.
18) Shinogi, T., Kumon, K., Hattori, A., Goto, A., Kimura, Y. and Chikayama, T., "MACRO-CALL Instruction for the Efficient KL1 Implementation on PIM," Proc. of the International Conference on Fifth Generation Computer Systems 1988, ICOT, pp. 953-961, 1988.
19) Nakashima, H., Takeda, Y. and Nakajima, K., "Architecture of PIM/m Processor Element" (in Japanese), Technical Report, TR-564, Institute for New Generation Computer Technology, 1990.

Kentaro Shimada, Ph.D.: He was born in Okayama on June 16, 1964, and grew up in Tokyo, Japan. He received the degrees of Bachelor of Electrical Engineering, Master of Information Engineering and Doctor of Information Engineering from the University of Tokyo in 1988, 1990 and 1993, respectively. His current research interests are in computer architecture, processor design, and parallel programming languages. He is a member of the Information Processing Society of Japan.

Hanpei Koike, Ph.D.: He was born in Kamakura, Japan on June 15, 1961.
He received the degrees of Bachelor of Electronics Engineering, Master of Information Engineering and Doctor of Information Engineering from the University of Tokyo in 1984, 1986 and 1990, respectively. In 1989, he joined the Faculty of Engineering, University of Tokyo, where he is now a lecturer. His current research interests are in computer architecture, parallel programming, and system software. He is a member of IEEE, ACM, the Information Processing Society of Japan, and the Japan Society for Software Science and Technology.

Hidehiko Tanaka, Ph.D.: He was born in Hyogo, Japan on January 15, 1943. He received the degrees of Bachelor of Electronics Engineering, Master of Electrical Engineering and Doctor of Electrical Engineering from the University of Tokyo in 1965, 1967 and 1970, respectively. In 1970, he joined the Faculty of Engineering, University of Tokyo, where he is now a professor. From 1978 to 1979, he was a visiting professor at the City College of New York, NY. His current research interests are in computer architecture, distributed systems, CAD systems for LSI design, and artificial intelligence. He is a member of IEEE, ACM, the Information Processing Society of Japan, and the IECE of Japan.
Report "UNIRED II: The high performance inference processor for the parallel inference machine PIE64"