Preface, or a ‘what? why?’
Error correction code in this context refers to mechanism used to compute number sequences that can then be used for correcting data corruption that has occured because of damaged storage media, noisy data transmission method, or any other reason.
There’s a lot of different pre-existing designs, some of which I’ve implemented before, but until recently I hadn’t really had the time to come up with one of my own. This needed to change, so I got myself a notebook, pen, some coffee, combined those with lack of sleep, and then on a flight whilst moving to Germany I came up with a silly error correction scheme :)
Disclaimer; I’m fairly sure I’m not the first one to come up with this approach, the whole purpose of this project was just to get to play around with some maths I like, and to get to implement code for error correction.
Mathematics
I’ll try and explain all of the relevant mathematics in a way where it’s understandable without very deep understanding of mathematics. However, there’s still non-zero chance that I’ve fallen into the average familiarity issue. Please feel free to poke me about it, and I might try and re-work this chapter.
The design is built around prime numbers, and more specifically the Chinese remainder theorem, mainly because I’ve been playing with this theorem for a while now and I wanted to apply it to something.
Chinese remainder theorem
Our error correction code – here in after referred to as ECC – calculation starts by selecting two integers m and n that are relative prime numbers, meaning that biggest integer that both m and n can be divided by is 1. For simplicity’s sake, let’s decide that we always select these numbers so that m is less than n.
The Chinese remainder theorem states that for every integer a and b between 0 and m, there exists one and only one x that fulfills the following requirements:
- x is higher than a and b
- x is less than m multiplied by n
- Remainder of a divided by m is equal to remainder of x divided by m
- Remainder of b divided by n is equal to remainder of x divided by n
Taken the requirements 1 and 2, we’ll want to select smallest possible values for m and n to narrow down the range of possible values of x. It’s easy to get into values for x where it’s no longer practical to compute x with home computers.
As a and b are less than m, remainder of a divided by m is always a and b divided by n is always b. We can therefore work the above requirements into following equation:
- Remainder of x divided by m is a
- Remainder of x divided by n is b
Or by using the word mod to represent modulo:
- x mod m = a
- x mod n = b
And with above equation pair, we have a way to bundle any two numbers in our data into a single number that’s unique for that combination of a and b.
Representing 4 values with a single X and 2 co-prime numbers.
We’ve now figured out that for any a, b, m and n we’ll have one unique x. However, we might like to have a bit less numbers to store per every a-b pair. Since we have unique x for every combination of ours, we can make use of exclusive or so that we only need to store one value of x to represent 4 different numbers a, b, c and d. For simplicity’s sake, let’s define a function crt(a, b) as a function that computes x for any given a and b with pre-defined m and n. Then, our x1, x2 pair would be computed by
x1 = crt(a, b)
x2 = crt(c, d)
Now, as x1 and x2 are unique for every a, b, c, d combination, that means there’s one and only one possible x3 for every a, b, c, d group, and we’ve successfully come up with a single number to represent all four input numbers. Finally, our ECC word for any 4 numbers of data would be defined as:
ECC = x1 ⊕ x2 = crt(a, b) ⊕ crt(c, d)
Summary of everything above
Let m and n be co-prime integers such that 0 < m < n.
For every a, b, c, d integers such that 0 <= a, b, c, d < m
there exists only one pair of integers x1, x2 that fulfill equations
x1 mod m = a AND x1 mod n = b
x2 mod m = c AND x2 mod n = d
And for every x1, x2 there exists one and only one ECC such that
ECC = x1 ⊕ x2.
Implementation
But wait! there’s more maths to be done
As we’re operating with bytes, so numbers going from 0 to 255, or 0 to FF, we can select our m and n to be 256 and 257. This gives range of x of 256 to 65792. For convenvience I’ll be moving on to base 16 representation of numbers here, so our range for our 4 data numbers would be 0 to FF each, m and n are 100 and 101, and x can be anything from 101 to 010100. Annoyingly, the highest value of x doesn’t quite fit into 2 bytes, which leaves us with ratio of 3 ECC bytes per 4 bytes of data.. this isn’t ideal.
I decided to solve this issue by having a bitmask for indicating which ECC values were above FFFF, and just store modulo 10000 of those code values:
uint32_t x = crt(a, b) ^ crt(c, d);
if (x > 0xFFFF) {
x = x % 0x10000;
ecc_overflow_bitmask |= (1 << (offset));
}
ecc_block[offset] = (uint16_t)x;
And thanks to that approach, instead of having 3/4 ECC per data ratio, we have block-size + bitmask-size + 2/4 ratio instead.
Blocks everywhere
I came up with a following structure for the ECC data:
ECC block format:
offset | size (B) | data
-----------------------------------------------
0 | 1 | Amount of ECC words stored
-----------------------------------------------
1 | 32 | Overflow bitmask
-----------------------------------------------
2 | 1 - 510 | ECC words 0 - 255
-----------------------------------------------
This type of structure allows us to have ratio of 543 bytes of ECC related data per 1020 bytes of data, which isn’t awfully bad I think. The flow for calculating ECC words would be basically:
- Per every 4 bytes of data, calculate 1 ECC word, and update our block info
- Per every 1020 bytes of data, create a new ECC block
- Once finished, append all ECC blocks to end of the data.
This might not be the ideal way of storing ECC information, but at this point it’s good enough :)
Error detection and correction
Now that we have our ECC blocks added to our data, it’s time to play around with error detection. The way I went with here is to just calculate new ECC word per every 4 bytes of data we receive, compare that to corresponding received ECC value, and if new and old ECC dwords match, just continue on. Note, that double word isn’t a typo here, the received ECC value has to be updated based on the ECC Overflow bitmask. Essentially the process is something like:
uint32_t current = ecc_block->ecc[offset];
if (ecc_block->of_mask & (1 << offset)) {
current |= 0x10000;
}
if (current != calculated_ecc) {
// Error detected
}
Now, what happens when we encounter an error? How exactly do we correct from that? The approach I took here was to break the new and incorrect ECC dword into it’s two x components, and then go through the corrupted 4-bytes long data block byte by byte, checking with which value we’ll get the correct ECC value. And as we go through the received corrupted data, for every a, b and c, d pair, we’ll do brute-force search from 0 to 255 for finding the correct original byte.
/* Try and correct a single byte error
*
* @param uint8_t a -- 1st byte of data
* @param uint8_t b -- 2nd byte of data
* @param uint32_t ex -- Assumed correct x
* @param uint32_t ecc -- Correct ecc dword
* @return corrected data on success or -1 on error
*/
uint16_t correct_byte_error(uint8_t a, uint8_t b, uint32_t ex, uint32_t ecc) {
uint32_t cx, dx;
uint8_t c, d;
for (c = 0; c < 255; c++) {
cx = crt(c, b);
if ((cx ^ ex) == ecc) {
return (a << 8) | b;
}
}
for (d = 0; d < 255; d++) {
dx = crt(a, d);
if ((dx ^ ex) == ecc) {
return (d << 8) | b;
}
}
return -1;
}
// ...
uint8_t a = data[offset];
uint8_t b = data[offset + 1];
uint8_t c = data[offset + 2];
uint8_t d = data[offset + 3];
uint32_t x1 = crt(a, b);
uint32_t x2 = crt(c, d);
uint16_t c1 = correct_byte_error(a, b, x2, ecc);
uint16_t c2 = correct_byte_error(c, d, x1, ecc);
if (c1 != -1) {
// a, b block had errors detected and corrected
}
// ...
Edit on 10th of September
As pointed out by fogti, the above error correction mechanism could be greatly simplified by exploiting the fact that:
let m = 256, n = 257
(ECC mod m) = a ⊕ c
(ECC mod n) = b ⊕ d
Now, assume b has been corrupted, the error correction flow would be something among the lines of;
- Check if (ECC % m) == (a ⊕ c), if equation is true, a and c are correct
- Check if (ECC % n) == (b ⊕ d), doesn’t hold true, b or d is corrupt.
- Assume x1 is correct, error would then be at d
- corrected x2, cx2, would then be ECC ⊕ x1
- d would then be corrected with cx2 mod n to cd
- compare (b ⊕ cd) to (ECC mod n), they’re not equal, so error must be at b, and x2 must be correct
- Corrected x1, cx1, would then be ECC ⊕ x2
- b would be corrected with cx1 mod n to cb
- compare (cb ⊕ d) to (ECC mod n), they’re equal, so error must be have been b and it can now be corrected to cb.
or in pseudo-code, something like:
bool check(ecc, which, x, y) {
if (((ecc % which)) != (x ^ y)) {
return false;
}
return true;
}
// ...
uint8_t correct_byte_error(ECC, x1, x2, a, b, c, d) {
if ((ECC % m) == (a ^ c)) {
// a and c are correct
// assume x1 is correct
uint32_t cx2 = ECC ^ x1;
uint8_t cd = cx2 % n;
if ((b ^ cd) == (ECC % 257)) {
return cd;
}
// x1 was not correct
uint32_t cx1 = ECC ^ x2;
uint8_t cb = cx1 % n;
if ((cb ^ d) == (ECC % 257)) {
return cb;
}
} else if ((ECC % n) == (b ^ d)) {
// Above but for a/c pair
// ..
} else {
// Unrecoverable error
}
}
What next?
I still have to figure out how to detect over 16-bit errors, correct over 8-bit errors, and how to tell apart ECC being corrupted from data being corrupted. However those – as well as further math optimisations – are something to do later, eventually, and they’ll deserve blog post of their own I should think.
If you did read this far, cool ^^ Hope you liked it