Binary description of the document format for a known document type

This format is designed to be compact and simple to implement, while still supporting the streaming of large documents and fast access.

1) Value types

Base types can be encoded in little-endian (LE) or big-endian (BE) byte order.

CODE  NAME       DESCRIPTION                                          ENDIANNESS
0     Bit bool   1 bit interpreted as a boolean value
1     Byte bool  1 byte interpreted as a boolean value
2     Int8       signed byte
3     UInt8      unsigned byte
4     Int16      signed short (2 bytes)                               LE
5     UInt16     unsigned short (2 bytes)                             LE
6     Int32      signed integer (4 bytes)                             LE
7     UInt32     unsigned integer (4 bytes)                           LE
8     Int64      signed long (8 bytes)                                LE
9     UInt64     unsigned long (8 bytes)                              LE
10    Float      IEEE 754 (4 bytes)                                   LE
11    Double     IEEE 754 (8 bytes)                                   LE
12    Int16      signed short (2 bytes)                               BE
13    UInt16     unsigned short (2 bytes)                             BE
14    Int32      signed integer (4 bytes)                             BE
15    UInt32     unsigned integer (4 bytes)                           BE
16    Int64      signed long (8 bytes)                                BE
17    UInt64     unsigned long (8 bytes)                              BE
18    Float      IEEE 754 (4 bytes)                                   BE
19    Double     IEEE 754 (8 bytes)                                   BE
20    VarUInt    variable size unsigned integer (1 to 8 bytes)
21    Text       for strings
22    Doc        for sub documents
23    Variable   varying type, which can take any of the above values
0x80  Var Bits   mask used for 1 to 64 bit fields

All other types are defined as documents.
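
As an illustration, the fixed-size codes of the table above can be mapped to decoders. The following Python sketch is not part of the format definition: the FIXED_TYPES dictionary, the "_LE"/"_BE" name suffixes and the read_fixed function are names assumed for this example, and only codes 2 to 19 are covered.

```python
import struct

# Type code -> (name, struct format). Codes, sizes and byte orders come from
# the table above; the dictionary itself is an assumption of this sketch.
FIXED_TYPES = {
    2: ("Int8", "<b"), 3: ("UInt8", "<B"),
    4: ("Int16_LE", "<h"), 5: ("UInt16_LE", "<H"),
    6: ("Int32_LE", "<i"), 7: ("UInt32_LE", "<I"),
    8: ("Int64_LE", "<q"), 9: ("UInt64_LE", "<Q"),
    10: ("Float_LE", "<f"), 11: ("Double_LE", "<d"),
    12: ("Int16_BE", ">h"), 13: ("UInt16_BE", ">H"),
    14: ("Int32_BE", ">i"), 15: ("UInt32_BE", ">I"),
    16: ("Int64_BE", ">q"), 17: ("UInt64_BE", ">Q"),
    18: ("Float_BE", ">f"), 19: ("Double_BE", ">d"),
}


def read_fixed(code, data, offset=0):
    """Decode one fixed-size value; return (value, offset after the value)."""
    _name, fmt = FIXED_TYPES[code]
    (value,) = struct.unpack_from(fmt, data, offset)
    return value, offset + struct.calcsize(fmt)
```

For instance, read_fixed(6, b"\x01\x00\x00\x00") and read_fixed(14, b"\x00\x00\x00\x01") both decode the integer 1, the first from little-endian bytes and the second from big-endian bytes.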

Text value

Text is a very common structure, encoded as:

TYPE     CARD         DESCRIPTION
VarUInt  (1,1)        data array size
UByte    (1,1) ARRAY  the text data

The character encoding is given by the document field type.
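
A minimal Python sketch of reading a Text value follows. The byte layout of VarUInt is not described in this document, so the read_varuint function below ASSUMES a LEB128-style encoding (7 data bits per byte, least significant group first, high bit set on every byte except the last); replace it with the real encoding if it differs. The function names and the encoding parameter are also assumptions of the example.

```python
def read_varuint(data, offset=0):
    """ASSUMED encoding: LEB128-style, 7 bits per byte, low group first."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, offset
        shift += 7


def read_text(data, offset=0, encoding="utf-8"):
    """Read the size, then that many bytes, decoded with the field's encoding."""
    size, offset = read_varuint(data, offset)
    raw = data[offset:offset + size]
    if len(raw) != size:
        raise ValueError("truncated text value")
    return raw.decode(encoding), offset + size
```

With this assumption, read_text(b"\x03ZIP") yields the string "ZIP" and the offset 4.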

Varying value

The type of a varying value is unknown before the document is read. Every varying value is composed of a field description document followed by the value itself.

TYPE   CARD   DESCRIPTION
Doc    (1,1)  field value type
XXXXX  (1,1)  data

2) Binary structure - Streamed

The streamed form allows fast writing with no backward cursor movement. The downside is that the document size is unknown until reading is finished. This form is best suited to tiny and very large documents.

Bytes 0   1               X               Y
      +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+
      | 0 |  FIELD 1 DEF  | FIELD 1 VALUE |  FIELD 2 DEF  | FIELD 2 VALUE |  FIELD N DEF  | FIELD N VALUE |
      +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+
Offset  Description
0-1     signature, 0x00 for streamed
1-X     field definition, if the type does not define all elements
X-Y     field value, if values are present

PSEUDO-CODE

FUNCTION Document readDocument(Input in, DocumentType docType)

    Document doc = Document.new()

    FOR EACH field IN docType.fields DO

        # read number of occurrences
        Int nbOcc = readNbOccurence(in, field)

        IF nbOcc != 0 THEN

            # read field value doc type if undefined
            DocumentType fieldDocType = field.docType
            IF field.type = 'Document' AND field.docType = null THEN
                Document encodedDocType = readDocument(in, SCHEMA_DOCTYPE)
                fieldDocType = toDocumentType(encodedDocType)
            END IF

            # read field values
            IF nbOcc = 1 AND field.maxOcc = 1 THEN
                # read a single value

                doc.setFieldValue(field.id, readValue(in, field))

            ELSE IF nbOcc = -1 THEN
                # read a streamed collection: a non-zero byte precedes
                # each value, a zero byte ends the collection

                List values = List.new()
                WHILE in.readByte() != 0
                    values.add(readValue(in, field))
                END WHILE

                doc.setFieldValue(field.id, values)
            ELSE
                # read a defined size collection of exactly nbOcc values

                List values = List.new(nbOcc)
                FOR i = 0 TO nbOcc - 1
                    values.add(readValue(in, field))
                END FOR

                doc.setFieldValue(field.id, values)
            END IF

        END IF

    END FOR

    RETURN doc

END FUNCTION
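
The streamed collection branch of readDocument reads one marker byte before each value and stops when that byte is zero. A minimal Python sketch of that loop is given below; the function names and the read_value callback convention (it takes the buffer and an offset and returns the value and the next offset) are assumptions of the example, not part of the format.

```python
def read_streamed_collection(data, offset, read_value):
    """Read values until a zero marker byte; return (values, next offset)."""
    values = []
    while True:
        marker = data[offset]
        offset += 1
        if marker == 0:
            # zero marker: end of the collection
            return values, offset
        value, offset = read_value(data, offset)
        values.append(value)


def read_uint8(data, offset):
    """Hypothetical value reader used for the demonstration: one UInt8."""
    return data[offset], offset + 1
```

For instance, read_streamed_collection(b"\x01\x0a\x01\x0b\x00", 0, read_uint8) returns the values 10 and 11 and the offset 5. The writer never needs to know the number of values in advance, which is what makes this branch usable for streaming.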



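FUNCTION Int readNbOccurence(Input in, Field field)

    # fixed cardinality (minOcc = maxOcc): nothing is stored in the stream
    Int nbOcc = field.maxOcc
    IF field.minOcc = 0 AND field.maxOcc = 1 THEN
        # optional single value: one presence bit
        IF in.readBit() = 0 THEN
            nbOcc = 0
        ELSE
            nbOcc = 1
        END IF
    ELSE IF field.minOcc != field.maxOcc THEN
        # variable cardinality: 0 means streamed, n means minOcc + n - 1
        Int n = in.readVarUInt()
        IF n = 0 THEN
            nbOcc = -1
        ELSE
            nbOcc = field.minOcc + n - 1
        END IF
    END IF

    RETURN nbOcc

END FUNCTION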
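
The three cardinality cases of readNbOccurence can be sketched in Python as follows. The Field dataclass and the reader interface (read_bit, read_varuint) are assumptions made for the example; ScriptedReader is only a test double that replays already decoded values, since bit packing and the VarUInt encoding are not described here.

```python
from dataclasses import dataclass


@dataclass
class Field:
    min_occ: int
    max_occ: int


def read_nb_occurrence(reader, field):
    """Return the number of occurrences, or -1 for a streamed collection."""
    if field.min_occ == 0 and field.max_occ == 1:
        # optional single value: one presence bit
        return 1 if reader.read_bit() else 0
    if field.min_occ != field.max_occ:
        # variable cardinality: 0 means streamed, n means min_occ + n - 1
        n = reader.read_varuint()
        return -1 if n == 0 else field.min_occ + n - 1
    # fixed cardinality: nothing is stored in the stream
    return field.max_occ


class ScriptedReader:
    """Hypothetical reader replaying pre-decoded values, for demonstration."""

    def __init__(self, values):
        self._values = list(values)

    def read_bit(self):
        return self._values.pop(0)

    def read_varuint(self):
        return self._values.pop(0)
```

For a field with cardinality (2,10), a stored value of 3 gives 2 + 3 - 1 = 4 occurrences, and a stored value of 0 announces a streamed collection.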



FUNCTION Object readValue(Input in, Field field)

    # read array size, dimensions <= 0 in the field definition are
    # stored in the stream; work on a copy to keep the field unchanged
    Int[] arraySize = null
    IF field.arraySize != null THEN
        arraySize = field.arraySize.copy()
        FOR i = 0 TO arraySize.length - 1
            IF arraySize[i] <= 0 THEN
                arraySize[i] = in.readVarUInt()
            END IF
        END FOR
    END IF

    # field values
    IF arraySize != null THEN
        RETURN readArrayValues(in, field, arraySize, 0)
    ELSE
        RETURN readSingleValue(in, field)
    END IF

END FUNCTION


FUNCTION Object readArrayValues(Input in, Field field, Int[] arraySize, Int depth)

    Object[] values = new Object[arraySize[depth]]

    FOR i = 0 TO values.length - 1
        IF depth = arraySize.length - 1 THEN
            # last dimension: read the values themselves
            values[i] = readSingleValue(in, field)
        ELSE
            # intermediate dimension: read a nested array
            values[i] = readArrayValues(in, field, arraySize, depth + 1)
        END IF
    END FOR

    RETURN values

END FUNCTION
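
The recursion of readArrayValues can be illustrated with a short Python sketch. It assumes the sizes of all dimensions are already known and that read_single is a callable returning the next decoded value; both names are assumptions of the example.

```python
def read_array_values(read_single, sizes, depth=0):
    """Build nested lists; the last dimension holds the values themselves."""
    count = sizes[depth]
    if depth == len(sizes) - 1:
        return [read_single() for _ in range(count)]
    return [read_array_values(read_single, sizes, depth + 1)
            for _ in range(count)]
```

With sizes [2, 3] and a reader producing 0, 1, 2, 3, 4, 5 in order, the result is [[0, 1, 2], [3, 4, 5]]: values are stored with the last dimension varying fastest.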


FUNCTION Object readSingleValue(Input in, Field field)

    SWITCH field.type
        CASE 'Bit'       : RETURN in.readBit() != 0
        CASE 'ByteBool'  : RETURN in.readByte() != 0
        CASE 'Int8'      : RETURN in.readByte()
        CASE 'UInt8'     : RETURN in.readUByte()
        CASE 'Int16_BE'  : RETURN in.readShortBE()
        CASE 'UInt16_BE' : RETURN in.readUShortBE()
        CASE 'Int32_BE'  : RETURN in.readIntBE()
        CASE 'UInt32_BE' : RETURN in.readUIntBE()
        CASE 'Int64_BE'  : RETURN in.readLongBE()
        CASE 'UInt64_BE' : RETURN in.readULongBE()
        CASE 'Float_BE'  : RETURN in.readFloatBE()
        CASE 'Double_BE' : RETURN in.readDoubleBE()
        CASE 'Int16_LE'  : RETURN in.readShortLE()
        CASE 'UInt16_LE' : RETURN in.readUShortLE()
        CASE 'Int32_LE'  : RETURN in.readIntLE()
        CASE 'UInt32_LE' : RETURN in.readUIntLE()
        CASE 'Int64_LE'  : RETURN in.readLongLE()
        CASE 'UInt64_LE' : RETURN in.readULongLE()
        CASE 'Float_LE'  : RETURN in.readFloatLE()
        CASE 'Double_LE' : RETURN in.readDoubleLE()
        CASE 'VarUInt'   : RETURN in.readVarUInt()
        CASE 'VarBits'   : RETURN in.readBits(field.nbBits)
        CASE 'Text'      : RETURN Chars.new(in.readBytes(in.readVarUInt()), field.charEncoding)
        CASE 'Document'  : RETURN readDocument(in, field.docType, field.inline)
        CASE 'Variable'  : 
            Document typeDoc = readDocument(in, SCHEMA_FIELDVALUETYPE)
            FieldValueType type = toFieldValueType(typeDoc)
            RETURN readValue(in, type)
    END SWITCH

END FUNCTION

3) Binary structure - Indexed

The indexed form allows fields to be skipped and accessed quickly. The downside is a slightly bigger file and the need to move the cursor backward when writing.

Bytes 0   1   2               X               Y               Z
      +---+---+ - - - + - - - +---+ - - - +---+---+ - - - +---+
      | 1 | S | SIZE1 | SIZEN |  FIELD N DEF  | FIELD N VALUE |
      +---+---+ - - - + - - - +---+ - - - +---+---+ - - - +---+
Offset  Description
0-1     marker, 0x01 for indexed
1-2     number of bytes used to store a field size
2-X     size in bytes of each document field; the number of values is given by
        the document type. The total document size can be calculated using the
        formula: sum ( size1 ... sizeN ) + 2
X-Y     field definition, if the type does not define all elements
Y-Z     field value, if values are present

The field definitions and values use the same structure as in the streamed form.
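
A minimal Python sketch of reading the size table of an indexed document follows. The byte order of the size entries is not stated in this document, so the sketch ASSUMES unsigned big-endian integers; the function name and the n_fields parameter (the number of fields, taken from the document type) are also assumptions of the example.

```python
def read_index(data, n_fields):
    """Return (sizes, starts, table_end) for an indexed document.

    starts[i] is the position of field i relative to the first field,
    table_end is the offset X where the size table ends.
    """
    if data[0] != 0x01:
        raise ValueError("not an indexed document")
    width = data[1]                      # bytes per size entry (S)
    table_end = 2 + n_fields * width
    if len(data) < table_end:
        raise ValueError("truncated size table")
    sizes = []
    for i in range(n_fields):
        entry = data[2 + i * width:2 + (i + 1) * width]
        sizes.append(int.from_bytes(entry, "big"))  # ASSUMED byte order
    starts = []
    position = 0
    for size in sizes:
        starts.append(position)
        position += size
    return sizes, starts, table_end
```

Once the table is read, a decoder can reach field i by skipping starts[i] bytes of field data instead of decoding every preceding field.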

4) Binary structure - Encapsulated

The encapsulated form is intended for compression and encryption. The encapsulated document is itself a document in any form, which allows several layers of encapsulation to be combined.

Bytes 0   1               X               Y
      +---+---+ - - - +---+---+ - - - +---+
      | 2 |    METHOD     |   ENC. SIZE   |
      +---+---+ - - - +---+---+ - - - +---+
Offset  Description
0-1     signature, 0x02 for encapsulated
1-X     UTF-8 string identifying the method,
        for example [0x03,'Z','I','P'] or [0x03,'A','E','S']
X-Y     variable size integer giving the encapsulated document size

If the encapsulated size is not zero, the complete encapsulated document is stored in the next bytes.

Bytes Y               Z
      +---+ - - - +---+
      |  COMP. DOC.   |
      +---+ - - - +---+
Offset  Description
Y-Z     the encapsulated document on N bytes, N being the size stored at [X-Y[

If the encapsulated size is zero, the encapsulated document is split into blocks of fixed size.

Bytes Y               Z               T  T+1
      +---+ - - - +---+---+ - - - +---+---+---+ - - - +---+---+
      |  BLOCK SIZE   |    BLOCK 1    | F |    BLOCK N    | F |
      +---+ - - - +---+---+ - - - +---+---+---+ - - - +---+---+
Offset  Description
Y-Z     variable size integer defining the size of the blocks
Z-T     block
T-T+1   block flag; a value of zero means this was the last block
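
The two encapsulated layouts can be read with the Python sketch below. It rests on several ASSUMPTIONS that this document does not settle: VarUInt uses a LEB128-style encoding, the method string is prefixed by a VarUInt length (as the [0x03,'Z','I','P'] example suggests), and every block, including the last one, occupies exactly BLOCK SIZE bytes. The function names are chosen for the example. The returned payload is still compressed or encrypted; applying the method is left to the caller.

```python
def read_varuint(data, offset=0):
    """ASSUMED encoding: LEB128-style, 7 bits per byte, low group first."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, offset
        shift += 7


def read_encapsulated(data):
    """Return (method, raw payload, offset after the structure)."""
    if data[0] != 0x02:
        raise ValueError("not an encapsulated document")
    length, pos = read_varuint(data, 1)
    method = data[pos:pos + length].decode("utf-8")
    pos += length
    size, pos = read_varuint(data, pos)
    if size != 0:
        # whole payload stored in one piece
        return method, data[pos:pos + size], pos + size
    # size of zero: payload split in blocks, each followed by a flag byte
    block_size, pos = read_varuint(data, pos)
    chunks = []
    while True:
        chunks.append(data[pos:pos + block_size])
        pos += block_size
        flag = data[pos]
        pos += 1
        if flag == 0:
            # a zero flag marks the last block
            return method, b"".join(chunks), pos
```

The block layout lets an encoder emit compressed data as it is produced, without knowing the final size in advance.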

5) Binary structure - Reference

In some cases it is necessary to define cyclic, backward or distant references. The reference binary structure encodes a UTF-8 string which points to the document. Common cases include URLs, URNs and file paths, but references are not restricted to these.

Bytes 0   1               X
      +---+---+ - - - +---+
      | 3 |   REFERENCE   |
      +---+---+ - - - +---+
Offset Description
0-1 signature, 0x03 for reference
1-X Reference String in UTF-8

6) Binary structure - Deleted

Documents can be deleted. This particular structure allows document files to be modified without rewriting the entire file. Decoders must skip deleted documents when they occur.

Bytes 0   1               X               Y
      +---+---+ - - - +---+---+ - - - +---+
      |255|   DOC SIZE    |    PADDING    |
      +---+---+ - - - +---+---+ - - - +---+
Offset  Description
0-1     signature, 0xFF for deleted
1-X     VarUInt, size of the deleted document; the size includes only the
        padding length
X-Y     bytes to skip; may contain any kind of data. Encoders should fill it
        with random or constant values for security reasons.
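
Skipping deleted documents can be sketched in Python as follows. As elsewhere, the VarUInt layout is ASSUMED to be LEB128-style since it is not described in this document, and the function names are chosen for the example.

```python
def read_varuint(data, offset=0):
    """ASSUMED encoding: LEB128-style, 7 bits per byte, low group first."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, offset
        shift += 7


def skip_deleted(data, offset=0):
    """Return the offset of the first document that is not deleted."""
    while offset < len(data) and data[offset] == 0xFF:
        padding, pos = read_varuint(data, offset + 1)
        # the stored size covers only the padding, so jump right past it
        offset = pos + padding
    return offset
```

A decoder would call skip_deleted before reading each document signature, so that consecutive deleted documents are passed over in one step.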