                         The MVCOMP compression format
                                 Mateusz Viste
                            Last update: 2024-10-01


MVCOMP is a minimalist data compression format. It is very easy to implement:
a depacker can be written in fewer than 20 lines of C. It's easy, fast and
light.

MVCOMP is meant to be used as a compression method in highly constrained
environments (think 8086 and 16K of RAM). Technically speaking, it is a
primitive compression method that is limited to back references and literal
strings.

The compressed output is a stream of 16-bit words (tokens). There are three
types of tokens, but a decoder only needs to tell the first two apart, by
checking whether the highest 4 bits of the token are zero. The third type is
recognized purely by its position in the stream.


=== TOKEN TYPE 1: "BACK REFERENCE" ============================================

Token format is LLLL OOOO OOOO OOOO, where:

OOOO OOOO OOOO is the back reference offset (number of bytes-1 to rewind)
LLLL           is the number of bytes (-1) that have to be copied from the
               offset.

In other words, when the decoder encounters such a token, it needs to copy
LLLL+1 bytes from past output that occurred OOOO+1 bytes ago. For example, the
token 0x2002 copies 3 bytes, starting 3 bytes back.

LLLL is guaranteed to be non-zero (meaning that a back reference of 1 byte is
not legal). If LLLL is zero, then the token is of TYPE 2.


=== TOKEN TYPE 2: "LITERAL STRING START" ======================================

Token format is 0000 RRRR BBBB BBBB

This token is used to encapsulate uncompressible data.

BBBB BBBB is the literal value of the byte to be copied
RRRR      is the number of RAW (uncompressible) WORDS that follow (possibly 0)


=== TOKEN TYPE 3: "LITERAL STRING CONTINUATION" ===============================

Token format is AAAA AAAA BBBB BBBB

Such a token occurs after a TYPE 2 (literal string start) with non-zero RRRR.
It contains two raw bytes (A and B) to be copied to output.

===============================================================================
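
The three token types described above can be decoded with a short loop. The
following is only a minimal sketch and not the reference depacker. The function
name mvcomp_unpack and its signature are made up for illustration. It also
relies on assumptions that this document does not settle: tokens are handed
over as host-order 16-bit values, a TYPE 3 token emits its low byte (B) before
its high byte (A), as a plain word copy would on a little-endian machine such
as the 8086, and the input is valid with a large enough output buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of an MVCOMP depacker (hypothetical name and signature).
 * Assumptions, none of which are stated by the format description:
 *  - src holds tokens as host-order 16-bit values,
 *  - a TYPE 3 token emits its low byte (B) first, then its high byte (A),
 *  - the stream is valid and dst is large enough (no bounds checking).
 * Returns the number of bytes written to dst. */
size_t mvcomp_unpack(uint8_t *dst, const uint16_t *src, size_t ntokens)
{
    size_t out = 0;    /* bytes written so far */
    unsigned raw = 0;  /* TYPE 3 words still expected */

    while (ntokens--) {
        uint16_t token = *src++;

        if (raw != 0) {
            /* TYPE 3: two raw bytes, known only from context */
            dst[out++] = (uint8_t)(token & 0xFF);
            dst[out++] = (uint8_t)(token >> 8);
            raw--;
        } else if ((token & 0xF000) == 0) {
            /* TYPE 2: one literal byte, RRRR raw words follow */
            dst[out++] = (uint8_t)(token & 0xFF);
            raw = (unsigned)(token >> 8);
        } else {
            /* TYPE 1: copy LLLL+1 bytes from OOOO+1 bytes back */
            size_t from = out - (size_t)(token & 0x0FFF) - 1;
            unsigned len = (unsigned)(token >> 12) + 1;
            /* byte-by-byte so that overlapping copies repeat data */
            while (len--) dst[out++] = dst[from++];
        }
    }
    return out;
}
```

The pending raw word count is tested before the top 4 bits, because a TYPE 3
token may carry any value in those bits. Under the assumptions above, the four
tokens 0x0141, 0x4342, 0x2002, 0x1000 unpack to the 8 bytes "ABCABCCC".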