OTP/eMTP: 128bit  256Kbit NVM in 180nm Standard Logic CMOS process
Software generated BCH as a way to solve challenges of providing multiple configuration IP
Michal Jedrak, Evatronix S.A.
Filip Rak, Warsaw University of Technology
Tomasz Wojciechowski, Evatronix S.A.
Abstract
This article describes the idea of generating synthesizable IP core by a software tool taking an error correction algorithm of BCH (BoseChaudhuriHocquenghem) as an example. First, it gives an overview on the challenges associated with the error correction module flexibility being a trigger to study the subject. It is followed by a short introduction of NAND Flash memory and Error Correction Codes (ECC) supplemented by BCH algorithm description. In the next chapters specific implementation details are provided accompanied by highlights of configuration parameters and procedure conducted to generate selected architecture of module. Finally, the article concludes giving a very simple example how the application takes full set of parameters and translate it into the RTL source code.
Overview
The last decade brought large demand for nonvolatile data storage, especially for modern embedded systems and mobile applications; that caused rapid increase of NAND Flash memory popularity. The new generation of such memories leverage the MLC (Multi Level Cell) technology, that allows to store more than one bit in a single memory cell. Although, it dramatically increases data density, it also decreases reliability and boost data error rate. To mitigate data corruption effect, various correction algorithms are used. Also the rapid development of NAND Flash memory forces developers to think beyond the current memory error correction requirements. To save application compatibility with future memories they need to implement multiple correction strength capability.
The correction code can significantly reduce memory access speed. Written as software application, it increases the processor workload and slows down memory read and write operations – that is why correction codes are usually implemented in hardware. However, with many hardware implementation there is an issue with lack of flexibility – once coded, hardware cannot be changed to suit particular application in an easy way, just opposite to the software approach. Every change of code parameter (such as data block length or correction capability) or implementation constrains (like throughput or chip area) triggers an extra development work consequently delaying project introducing unwanted costs and risk. The reality is that such changes are required in every single design forcing managers and engineers to make tradeoffs between project goals and engineering time to fit module to their needs.
To help reducing this unpleasant effect by improving module flexibility several parameterization methods can be employed. Many of them come from hardware description language capabilities, some are provided by CAD (Computer Aided Design) tools, but most are too simple solutions that are merely enough only in simple cases, where only slight module adjustment is needed. More complex tasks – like changing not only structure, but also algorithm – find mentioned solutions to be insufficient. A real solution to this dilemma is full software generation of the error correction module, it takes the behavioral description of the algorithm, applies configuration parameters and translates them into fully synthesizable component in either Verilog or VHDL.
NAND Flash memory introduction
The NAND Flash came to life in 1984 at Toshiba laboratories being an effect of research conducted over NOR Flash technology. As the NAND Flash cell silicon area is much smaller than NOR Flash one it became an efficient solution for storing media and user data, it is not dedicated for processor operation memory.
The memory cell transistor can store single bit of information and such memory is called SLC. As the charge of the cell can be quantified with higher precision than just “0” and “1” the MultiLevel Cell memory came to live. First, storing two data bit and also lately three bits gaining a new name of Triple Level Cell. The MLC name has been reserved for just two bit memory over last year. As transistor gate stores more information it is consequently also more vulnerable to lose this information.
The NAND Flash memory is organized into the pages, which are power of two kilobytes in size (2KB, 4KB, 8KB …) extended by some additional space. The primary page area is called user space and the extra space on the top is called spare area. Every page is divided into the subpages which are also called codewords for ECC operation.
Detailed explanation of the NAND Flash technology is beyond scope of this article, for more please see [1].
Correction codes introduction
To protect against loosing of the information stored in NAND Flash memory the various error correction codes have been employed. At the very beginning just simple Hamming algorithm was enough to detect two errors and correct single error. However, as memory technologies shrank and MLC as well as TLC (Triple Level Cell) memory became more common the more advanced solutions where required. The two prevailed the most, they are: RS (ReedSolomon) and BCH (BoseChaudhuriHocquenghem) codes, with the latter one winning the prime position in the race. As advantage to RS codes, BCH achieves better performance [2], needs less redundant bits and corrects single error bits instead of the whole words.
In the just last years, along the technology progressing even further more and more applications target the LDPC (Low Density Parity Code). As it is more effective than BCH in terms of silicon area for high error bit correction.
BCH algorithm principles
For sake of this study we selected BCH as it still the most popular error correction algorithm in NAND Flash applications. The rest of this chapter gives basic details of the algorithm, still for more detailed description please refer to [3, 4, 5].
The BCH uses operations on the Galois Field GF(2n). The field size (n) is one of basic parameters depending on the codeword size. It is the exponent of the smallest power of two greater then sum of codeword size and parity bits.
The element of the Galois Field are nsize vectors, which requires a specific arithmetic to manipulate them, hereafter “multiplication” and “addition” operation will refer to the GF operations.
The BCH can be divided in two distinguishable submodules of encoder and decoder. The former one is used at the memory write stage to calculate the parity bits. The codeword bits are written into the shift register, at every shift operation the content of the register is XOR’ed with predefined, so called, generation polynomial depending on the value of the register most significant bit. The shift operation is repeated as long as all bits are pushed through the whole register. In summary, the encoder is the Linear Feedback Shift Register.
Figure 1. Simplified example encoder
The other submodule of the BCH engine is decoder and it is used during memory read operation, its architecture is by far more complicated that the one of encoder, primarily because it is composed of independent stages, which are error detection (syndrome calculation) and error correction, which further is composed of solving key equations, equation roots inversion to find error locations parts [6, 7].
The error detection is realized as multiplication and accumulation in GF(2n) of every data bit which are treated as coefficients of polynomial and given α raised to appropriate power. The polynomial roots are calculated as an effect of the operation giving syndromes in specific points. All syndromes equal zero means no error, otherwise the calculation is continued to find the erroneous bits of information.
Figure 2. Syndrom calculation
The actual correction process is composed of computing error location polynomial followed by finding specific error bit locations within the data block. The first stage is achieved by solving a set of equation, which roots provide information on error locations. The reduced inversionless BerlekampMassey [8] algorithm has been employed to do this job. The computations is performed in so called Processing Elements (PE); varying number of PEs the desired processing time can be achieved.
The other part of decoding phase is to compute error locations within the data block. This is achieved by calculating inversion of error location polynomial roots by means of Chien’s search algorithm. For consecutive elements of Galois field GF(2n) the values of location polynomial are computed, then roots are found as places, where polynomial evaluates to zero
Figure 3. Chien’s search algorithm
Generator implementation
Initially the idea was to build the configurator around the inhouse tested and well known technique of parameterizing Verilog code along with on code pragmas informing parser to eliminate part of the code. However, these techniques very effective in process of configuring for example R8051XC2 microcontroller proved to be insufficient for complex algorithm of BCH that should allow not only elements extraction and bus width changes but also deep architectural changes of the design.
To avoid all drawback of script driver configuration an alternative approach of software tool called generator was worked out. The generator uses algorithm description provided in proprietary Intermediate Language to build full blown ECC engine. All optimization and architecture dynamic modifications are done at the IL level and only then at the very end the in memory description of algorithm is translated into desired RTL language of VHDL or/and Verilog, which, in fact, is further processed to introduce language specific optimization.. It does not only give great flexibility in configuration ability, but it also eases code management as instead of two sets of RTL source code files (in Verilog and VHDL) only one single version needs to be maintained. Moreover, any additional output processor can be built to generate source code file in other language as C for example.
Such approach saves an immense amount of time associated with delivering BCH module with always changing correction strength and codeword size requirements.
The process of translating IL into fullfledged RTL source code can be divided into few steps:
 Creation of Intermediate Language memory tree applying configuration parameters
 Languageindependent optimizations and transformations
 Translating IL tree into target language tree (VHDL and/or Verilog)
 Target languagespecific optimizations and transformations
 Text source code generation (using output formatting settings)
Figure 4. Code generation flow
Scope of configurability
The generator enables to engage many code parameters to precisely shape the final architecture of the BCH engine. They can be divided into two groups that differ by function they serve:
 BCH native characteristics like for example:
 Correction capability
 Codeword size
 Hardware implementation features like for example:
 Input data bus width
 Parallel factor
Inside these groups the parameters can be farther divided into available and not available to the user, that is, the ones that can be modified by a user, and the ones that are predefined by the author or calculated internally basing on the user parameters.
Specifically, user parameters define:
 Correction capabilities, which can be multiple
 Codeword size, also multiple values are acceptable
 Speed vs. area ratio
The internal parameters are for example:
 Galois Field size
 Size and coefficients of generation polynomial
 Elements of the Galois Field itself.
The overall configuration of the module architecture can be divided into three abstraction levels:
 Calculations of constants used in code (e.g. generation polynomial, Galois field elements)
 Structure modifications (e.g. functional blocks replication)
 Functional modifications (e.g. FSM states, that are present only for certain parameter values, including some pieces of code depending on parameters)
The first abstraction level concerns parameters that need to be assigned before actual module implementation starts, e.g. elements of Galois field GF(2n) used in calculations, generation polynomial used in encoder, registers widths or number of data encoding/decoding phase iterations. These constants are computed by generator and placed directly into module code.
The second level of abstraction contains simple architecture changes including repetition of certain statements such as number of arithmetic units for blocks depending on parallel factor, or number of data registers that control Berlekamp algorithm data flow.
The third abstraction level refers to changes in blocks architecture and functionality. Here, for example the most important ones are: architecture of computational elements that allow to perform calculations of Galois field GF(2n) elements, and construction of FSM (Finite State Machines) in all module blocks.
Example of use
For sake of demonstration how the generator is used, the desired correction strength is assumed to be 4, 8, 16, 32 with codeword size of 512 bytes.
The generator is to be executed in JAVA environment. Once run it displays a set of few prompt windows, such as demonstrated below.
Figure 5. Defining correction strength
Figure 6. Defining codeword (block) size
These windows enable user to enter all required parameters, after defining them generator executes the actual generation procedure producing set of Verilog and VHDL files ready to be synthesized.
Summary
In case of complex mathematical algorithm the legacy way of configuring the IP is not enough and requires a new approach of generating synthesizable code out of abstractive mathematical description. It allows to gain a great control over the results of such operation providing unprecedented flexibility in configuration of the module. It saves hundreds of development hours that otherwise would need to be spent to provide every single version of BCH engine. Finally, it eases the everyday code maintaining tasks as improvements need to be introduced just in a single base code.
Bibliography
 M. Jedrak, Evatronix SA, “NAND Flash memory in Embedded Systems”. IP ESC 2009, December 2009
 Yanni Chen and Keshab K. Parhi, “Area Efficient Parallel Decoder Architecture For Long BCH Codes” Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN USA
 J. Moreira, P. Farell, „Essentials of ErrorControl Coding”, Wiley, 2006
 Shu Lin, Daniel Costello, „Error Control Coding: Fundamentals and Applications”, PrenticeHall, 1983
 R. MorelosZaragoza, „The Art of Error Correcting Coding”, Wiley, 2006
 Wei Liu, Junrye Rho, Wonyong Sung, “LowPower HighThroughput BCH Error Correction VLSI Design for MultiLevel Cell NAND Flash Memories”, 2006
 F. Sun, S. Devarajan, K. Rose, T. Zhang, “Design of onchip error correction systems for multilevel NOR and NAND flash memories”, IET Circuits Devices Syst., 2007, 1, (3), pp. 241–249
 D.V. Sarwate, N.R. Shanbhag, “Highspeed architectures for ReedSolomon Decoders”, October 5, 2001
