# LAYOUT OF A HIGH-PERFORMANCE MULTIPLY-ACCUMULATE UNIT WITH ADDITIONS INTEGRATED TO REDUCE PARTIAL PRODUCT PROCESSING USING PASTA ADDER

## <sup>1</sup>T. Nageswhar Rao, <sup>2</sup>Dr. Venkata Reddy Adama, <sup>3</sup>R. Srinvias, <sup>4</sup>CH Sanjay Kumar

<sup>1,3</sup>Associate Professor, <sup>2</sup>Professor & HOD, <sup>4</sup>UG Student, <sup>1,2,3,4</sup>Department of Electronics and Communication Engineering, Vaageswari College of Engineering, Karimnagar, Telangana, India

### ABSTRACT

In this paper, we propose a low-power, high-speed pipelined MAC architecture. The carry propagation of additions consumes considerable power and causes a large path delay. To resolve this problem, we integrate a part of the additions into the partial product reduction (PPR) process. The addition and accumulation of the most significant bits are not performed until the PPR process of the next multiplication. To handle overflow in the PPR process correctly, a small adder is designed to accumulate the total number of carries. The proposed design is implemented using Xilinx ISE 14.7.

Keywords: MAC unit, Dadda multiplier, arithmetic circuits, $\alpha$-bit adder

### Introduction

The proposed method uses a two-stage MAC unit, implemented for 8-bit and 16-bit operands. The first stage performs partial product generation (PPG) and partial product reduction (PPR); the second stage performs the $(k+\alpha)$-bit addition to produce the accumulation result.

The main features of the proposed architecture are as follows. First, to reduce the lengths of carry propagations, we integrate a part of the additions into the PPR process. Second, to handle overflow in the PPR process, an $\alpha$-bit adder is used to count the total number of carries. Third, by applying the gating technique, the second stage needs to be executed only in the last cycle (of the entire sequence of multiply-accumulate operations), which saves power.





The proposed MAC architecture is shown in the figure above. Our partial product matrix (PPM) for the PPR process is composed of two PPMs: one is derived from the PPG and the other is derived from the accumulation. In the first stage of the proposed MAC, the final addition of the higher-significance bits is not performed. Register *accumulation* is used to store the PPM derived from the accumulation. It includes three parts: REG1 (the first row) has 2N-1 bits, REG2 (the second row) has k bits, where k can be defined by the user, and REG3 (the third row) has 1 bit.

In the PPR process, we adopt the Dadda tree approach to reduce our PPM to two rows. After the PPM has been reduced to two rows, we perform the (2N-k-1)-bit final addition. Since a (2N-k-1)-bit adder is used for the addition of the last two rows obtained by the Dadda tree approach, a larger k gives a shorter carry propagation in the (2N-k-1)-bit adder. However, since the final addition and accumulation of the higher-significance bits are deferred, a larger k also leaves more work for the second stage.

In the second stage of the proposed MAC unit, we produce the accumulation result. The inputs of the second stage are register *accumulation* (consisting of REG1, REG2 and REG3) and register REG4. In the proposed MAC unit, the accumulation is done in both the $\alpha$-bit addition and the next PPR process. Thus, if the final result is needed only in the last cycle, the $(k+\alpha)$-bit addition can be disabled in the other cycles to save power.
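The deferred accumulation described above can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions, not the authors' register layout: the function name `mac_deferred`, the split of the product into a low part of `2*n - k` bits and a high part of `k` bits, and the default widths are all hypothetical choices made for illustration. The low part is added with a short adder every cycle, the high part is kept as two carry-save rows, carries out of the high part are counted in an `alpha`-bit counter, and the wide addition happens only once at the end.

```python
def mac_deferred(pairs, n=8, k=4, alpha=4):
    """Behavioral sketch: accumulate x*y, deferring the high-bit addition.

    Returns the accumulated sum modulo 2**(2*n + alpha).
    Operands are assumed to be unsigned n-bit integers.
    """
    low_w = 2 * n - k                  # width of the short per-cycle adder (assumption)
    low_mask = (1 << low_w) - 1
    k_mask = (1 << k) - 1
    alpha_mask = (1 << alpha) - 1

    low = 0            # low part, fully added every cycle
    hs = hc = 0        # high part kept as two carry-save rows (sum row, carry row)
    cnt = 0            # alpha-bit count of carries out of the high part

    for x, y in pairs:
        p = x * y
        # First stage: short carry-propagate addition on the low bits only.
        total_low = low + (p & low_mask)
        low = total_low & low_mask
        carry_in = total_low >> low_w              # 0 or 1

        # High bits: 3:2 carry-save compression, no carry propagation.
        p_high = p >> low_w
        s = hs ^ hc ^ p_high
        maj = (hs & hc) | (hs & p_high) | (hc & p_high)
        c = (maj << 1) | carry_in                  # free LSB slot takes the low carry
        hs = s
        hc = c & k_mask
        cnt = (cnt + (c >> k)) & alpha_mask        # count the overflow carry

    # Second stage, executed once: the (k + alpha)-bit addition.
    high = (hs + hc + (cnt << k)) & ((1 << (k + alpha)) - 1)
    return (high << low_w) | low
```

For example, `mac_deferred([(17, 23)])` returns 391, and for longer sequences the result matches the plain sum of products reduced modulo `2**(2*n + alpha)`. Only the last step uses a wide adder, which is the part the gating technique would disable in all but the final cycle.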

### Advantages

Low power consumption.

Carry propagation lengths are reduced in this method, and hence the delay is reduced.

#### **Related work**

In the conventional MAC unit, it is necessary to perform two carry propagations: the additions in multiplication and the additions in accumulation. Multiplication in the MAC unit is done using a Dadda multiplier. The Dadda approach differs from the Wallace approach in that it uses the minimum number of full and half adders necessary to meet a predetermined height at each stage. The stage height limits are {2, 3, 4, 6, 9, 13, 19, 28, 42, ...}. Each limit is at most 1.5 times the next smaller one. The 1.5 reduction ratio results from the use of full adders, each of which takes 3 partial products and reduces them to 2. Thus the maximum height of the succeeding stage is about $\frac{2}{3}$ that of the previous stage. In doing this, Dadda seeks to optimize the area of the multiplier by using the fewest adders needed to reach the carry-propagate adder (CPA) stage.
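The stage height limits quoted above follow the recurrence $d_1 = 2$, $d_{j+1} = \lfloor 1.5\, d_j \rfloor$. The following minimal sketch generates them and lists the target heights a Dadda reduction would step through for a given number of partial product rows; the function names are illustrative and not taken from the paper.

```python
def dadda_heights(limit):
    """Return the Dadda height limits 2, 3, 4, 6, ... that do not exceed `limit`."""
    heights = [2]
    while True:
        nxt = heights[-1] * 3 // 2     # floor(1.5 * d) in integer arithmetic
        if nxt > limit:
            break
        heights.append(nxt)
    return heights


def dadda_stage_targets(rows):
    """Target heights, largest first, for reducing `rows` partial product rows to 2."""
    return [d for d in reversed(dadda_heights(rows)) if d < rows]


print(dadda_heights(42))         # [2, 3, 4, 6, 9, 13, 19, 28, 42]
print(dadda_stage_targets(8))    # [6, 4, 3, 2]
print(dadda_stage_targets(16))   # [13, 9, 6, 4, 3, 2]
```

An 8-bit multiplier therefore needs four reduction stages and a 16-bit multiplier needs six before the final two rows reach the adder.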

#### Disadvantages

Power consumption is higher because of the large gate count in multiplication, and the path delay is large.

#### **PROPOSED METHOD**

In data transmission applications, public-key cryptosystems are widely used, and their encryption and decryption processes call for a simple, efficient Montgomery multiplication algorithm that is both low-cost and high-performance. The Montgomery multiplier receives and outputs data in binary representation and uses only a one-level carry-save adder (CSA) to avoid carry propagation at each addition. This CSA is also used to perform operand pre-computation and format conversion from carry-save format to binary representation, leading to low hardware cost and a short critical path delay at the expense of extra clock cycles for completing one modular multiplication. To overcome this weakness, a configurable CSA (CCSA), which can act as one full adder or two serial half adders, has been proposed to halve the extra clock cycles for operand pre-computation and format conversion. A modular multiplier built with the CCSA technique still has drawbacks in critical path delay and power consumption. To overcome them, the CCSA is replaced with PASTA (Parallel Self-Timed Adder) in the Montgomery modular multiplier; the PASTA adder can achieve lower power consumption.

Modular multiplication is the central operation in many application areas, including public-key cryptography for encryption and decryption, and the Montgomery modular multiplier is the widely used method for it. The operation to be performed is $X \cdot Y \bmod M$, where X and Y are the inputs and M is the modulus. Because the result must be reduced modulo M, this algorithm is adopted; compared with earlier algorithms, it gives a more optimized result. There are two cases: semi-carry-save (SCS) addition and full-carry-save (FCS) addition. In the SCS case, the inputs are in binary and only the intermediate outputs are in carry-save format, whereas in the FCS case both the inputs and the intermediate outputs are in carry-save format. Of the two, SCS is the more advantageous because it has only one carry-save level and therefore less area and high speed, which is what a VLSI-based multiplier requires. The carry-save approach is the basic key to operating a Montgomery modular multiplier. With the SCS type, only a one-level carry-save adder is implemented, which can be two serial half adders or one full adder as required. This reduces the number of clock cycles and hence the delay. The design can be implemented in Verilog.

The above architecture is the semi-carry-save-based Montgomery multiplier, in which the loop is reduced compared with the existing one. It consists of two multiplexers, one multiplier, one configurable carry-save adder, flip-flops, a skip detector and a zero detector. To increase the speed of operation, we replace the CSA with PASTA in the proposed architecture. Montgomery multiplication performs fast modular multiplication (MM), and using the PASTA adder in it reduces area and clock cycles. The aim is a simple and efficient radix-2 Montgomery modular multiplier with PASTA.

The design of PASTA uses half adders (HAs) along with multiplexers, requiring minimal interconnections. The selection input for the two-input multiplexers corresponds to the request handshake signal and is a single 0-to-1 transition denoted by SEL. It initially selects the actual operands while SEL = 0 and switches to the feedback/carry paths for subsequent iterations when SEL = 1. The feedback path from the HAs enables multiple iterations to continue until completion, when all carry signals assume zero values, as shown in Fig. 3. In Fig. 4, two state diagrams are drawn for the initial phase and the iterative phase of the proposed architecture. Each state is represented by a $(C_{i+1} S_i)$ pair, where $C_{i+1}$ and $S_i$ represent the carry-out and sum values, respectively, from the $i$th bit adder block. During the initial phase, the circuit merely works as a combinational HA operating in fundamental mode. Because HAs are used instead of FAs, state (11) cannot appear.
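The iterative behavior of PASTA described above can be mimicked in software. The sketch below is a bit-parallel behavioral model, not a description of the asynchronous circuit itself: the function name `pasta_add` is hypothetical, the first step corresponds to the initial phase (SEL = 0, plain half adders on the operands), and each loop pass corresponds to one iteration of the feedback path (SEL = 1). It stops when every carry is zero, which is the completion condition stated in the text.

```python
def pasta_add(a, b):
    """Behavioral sketch of half-adder based recursive addition.

    Returns (sum, iterations) for non-negative integers a and b, where
    `iterations` counts the feedback passes after the initial phase.
    """
    if a < 0 or b < 0:
        raise ValueError("operands must be non-negative")

    # Initial phase (SEL = 0): each bit position acts as a half adder.
    s = a ^ b                 # S_i     = a_i XOR b_i
    c = (a & b) << 1          # C_{i+1} = a_i AND b_i, fed to the next bit

    # Iterative phase (SEL = 1): feed sums and carries back through the HAs.
    iterations = 0
    while c != 0:             # completion: all carry signals are zero
        s, c = s ^ c, (s & c) << 1
        iterations += 1
    return s, iterations
```

For instance, `pasta_add(1, 1)` returns `(2, 1)`: the initial phase produces a sum of 0 with one carry, and a single feedback pass absorbs it. The number of passes depends on the longest carry chain in the operands rather than on the word length, which is the property a self-timed adder exploits.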



Fig. Block diagram of PASTA.



Fig. Montgomery Modular Multiplication using PASTA adder

The proposed architecture for Montgomery modular multiplication using the PASTA adder consists of one one-level Parallel Self-Timed Adder (PASTA), two 4-to-1 multiplexers (M1 and M2), one simplified multiplier SM3, one skip detector Skip\_D, one zero detector Zero\_D, and six registers. The zero detector Zero\_D detects whether SC is equal to zero. Skip\_D is composed of four XOR gates, three AND gates, one NOR gate, and two 2-to-1 multiplexers; it detects unnecessary multiplication operations so that they can be skipped.
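The arithmetic that such a radix-2 Montgomery multiplier carries out can be sketched in a few lines. This is the textbook bit-serial Montgomery algorithm written with ordinary integer addition; it does not model the carry-save format, the skip detector or the PASTA adder of the proposed hardware, and the function name and parameter names are assumptions made for illustration. For a k-bit odd modulus `m` it returns $X \cdot Y \cdot 2^{-k} \bmod M$.

```python
def montgomery_radix2(x, y, m, k):
    """Textbook radix-2 Montgomery multiplication (behavioral sketch).

    Returns x * y * 2**(-k) mod m for an odd modulus m < 2**k
    and operands 0 <= x, y < m.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    if m >= (1 << k):
        raise ValueError("modulus must fit in k bits")
    if not (0 <= x < m and 0 <= y < m):
        raise ValueError("operands must be in [0, m)")

    s = 0
    for i in range(k):
        x_i = (x >> i) & 1        # scan x one bit per iteration
        t = s + x_i * y
        q = t & 1                 # q makes t + q*m even
        s = (t + q * m) >> 1      # exact division by 2
    if s >= m:                    # single final correction, since s < 2m
        s -= m
    return s


# Usage: the result carries an extra factor of 2**(-k) modulo m.
m, k = 239, 8
r = montgomery_radix2(17, 23, m, k)
assert (r << k) % m == (17 * 23) % m
```

Each loop pass performs up to two additions, which is where the hardware substitutes a carry-save or self-timed adder to keep the critical path short.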

### SIMULATION RESULTS

The design has been implemented in Verilog using Xilinx ISE. For further verification, the design can also be carried out using Cadence tools. The behavior can be clearly understood from the waveform shown below. The results indicate reduced area and delay in comparison with the other multipliers. The method has been implemented using a configurable carry-save adder so as to show that its maximum delay is lower than that of the others. Delay and area are minimized as far as possible in comparison with previously existing architectures.

RTL



Internal Block Diagram


Simulation Waveform

| Name    | Value (0 ns to 1,000 ns) |
|---------|--------------------------|
| p[15:0] | 391                      |
| x[7:0]  | 17                       |
| y[7:0]  | 23                       |

### CONCLUSION

We have reviewed various applications of the MAC unit. MAC units have been designed with various multipliers; among them, Booth's multiplier has the highest operating speed and consumes less power. The MAC unit is in high demand in digital signal processing, where it provides the basic hardware for such systems. It is used for many types of arithmetic operation, such as addition, multiplication, division, squaring and square root. A MAC unit must perform well in terms of area, delay, power consumption, speed and complexity.

FCS-based multipliers maintain the input and output operands of the Montgomery MM in carry-save format to avoid format conversion, leading to fewer clock cycles but a larger area than SCS-based multipliers. The disadvantages of the existing architecture are carry propagation delay and extra clock cycles. To overcome them, we adopt the PASTA adder. Using the PASTA adder in the Montgomery modular multiplier gives low hardware cost, a short critical path delay, and fewer clock cycles for completing one MM operation.
