How to do Type-Length-Value (TLV) serialization with Serde?


I need to serialize a class of structs according to the TLV format with Serde. TLV can be nested in a tree format.

The fields of these structs are serialized normally, much like bincode does, but before the field data I must write a tag (ideally one associated with the struct) and the length, in bytes, of the field data.

Ideally, Serde would recognize the structs that need this kind of serialization, probably by having them implement a TLV trait. This part is optional, as I can also explicitly annotate each of these structs.

So this question breaks down into 3 parts, in order of priority:

  1. How do I get the length data (from Serde?) before the serialization of that data has been performed?

  2. How do I associate tags with structs (though I guess I could also include tags inside the structs..)?

  3. How do I make Serde recognize a class of structs and apply custom serialization?

Note that 1) is the (core) question here. I will post 2) and 3) as individual questions if 1) can be solved with Serde.


There is 1 answer

Caesar On

Brace yourself, long post. Also, as a convention: I'm picking both type and length to be unsigned 4-byte big-endian integers. Let's start with the easy stuff:

  3. How do I make Serde recognize a class of structs and apply custom serialization?

That's really a separate question, but you can either do that via the #[serde(serialize_with = …)] attributes, or in your serializer's fn serialize_struct(self, name: &'static str, _: usize) based on the name, depending on what exactly you have in mind.

  2. How do I associate tags with structs (though I guess I could also include tags inside the structs..)?

This is a known limitation of serde, and the reason protobuf implementations typically aren't based on serde (take e.g. prost) but have their own derive proc macros that let you annotate structs and fields with the respective tags. You should probably do the same, as it's clean and fast. But since you asked about serde, I'll pick an alternative inspired by serde_protobuf: if you look at it from a weird angle, serde is just a visitor-based reflection framework. It will provide you with structure information about the type you're currently (de-)serializing, e.g. it'll tell you the type, name, and fields of the type you're visiting. All you need is a (user-supplied) function that maps from this type information to the tags. For example:

struct TLVSerializer<'a> {
    ttf: &'a dyn Fn(TypeTagFor) -> u32,
    …
}
impl<'a> Serializer for TLVSerializer<'a> {
    fn serialize_bool(self, v: bool) -> Result<Self::Ok, Self::Error> {
        let tag = &(self.ttf)(TypeTagFor::Bool).to_be_bytes();
        let len = &1u32.to_be_bytes();
        todo!("write");
    }

    fn serialize_i32(self, v: i32) -> Result<Self::Ok, Self::Error> {
        let tag = &(self.ttf)(TypeTagFor::Int {
                    signed: true,
                    width: 4,
                })
                .to_be_bytes();
        let len = &4u32.to_be_bytes();
        todo!("write");
    }
}

Then, you need to write a function that supplies the tags, e.g. something like:

enum TypeTagFor {
    Bool,
    Int { width: u8, signed: bool },
    Struct { name: &'static str },
    Seq, // used by serialize_seq below
    // ...
}
fn foobar_type_tag_for(ttf: TypeTagFor) -> u32 {
    match ttf {
        TypeTagFor::Int {
            width: 4,
            signed: true,
        } => 0x69333200, // "i32\0"
        TypeTagFor::Bool => 0x626f6f6c, // "bool"
        // Anything without a tag assigned ends up here and panics.
        _ => unreachable!(),
    }
}

If you only have one set of type → tag mappings, you could also put it into the serializer directly.
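
To make the header layout concrete, here is a minimal std-only sketch with the mapping hard-wired, no serde involved. `Kind`, `tag_for`, `write_header`, `write_bool` and `write_i32` are names I made up for illustration, and the payload encoding (one byte for a bool, big-endian for an i32) is my assumption, not something the question specifies:

```rust
use std::io::{self, Write};

/// Kinds of values this sketch knows about (a cut-down TypeTagFor).
enum Kind {
    Bool,
    I32,
}

/// The type -> tag mapping, hard-wired instead of passed in as a closure.
fn tag_for(kind: &Kind) -> u32 {
    match kind {
        Kind::Bool => 0x626f6f6c, // "bool"
        Kind::I32 => 0x69333200,  // "i32\0"
    }
}

/// Write the 8-byte header: big-endian u32 tag, then big-endian u32 length.
fn write_header(out: &mut dyn Write, kind: &Kind, len: u32) -> io::Result<()> {
    out.write_all(&tag_for(kind).to_be_bytes())?;
    out.write_all(&len.to_be_bytes())
}

/// Roughly what serialize_bool would emit: header plus one payload byte.
fn write_bool(out: &mut dyn Write, v: bool) -> io::Result<()> {
    write_header(out, &Kind::Bool, 1)?;
    out.write_all(&[v as u8])
}

/// Roughly what serialize_i32 would emit: header plus four payload bytes.
fn write_i32(out: &mut dyn Write, v: i32) -> io::Result<()> {
    write_header(out, &Kind::I32, 4)?;
    out.write_all(&v.to_be_bytes())
}
```

Fixed-size values like these are the easy case, since the length is known up front. The rest of the answer is about values where it isn't.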

  1. How do I get the length data (from Serde?) before the serialization of that data has been performed?

The short answer is: you can't. The length can't be known without inspecting the entire structure (there could be Vecs in it, for example). But that also tells you what you need to do: you need to inspect the entire structure first, deduce the length, and then do the serialization. And you have precisely one method for inspecting the entire structure at hand: serde. So, you'll write a serializer that doesn't actually serialize anything and only records the length:

// Every value is preceded by a 4-byte tag and a 4-byte length.
const HEADER_LEN: usize = 8;

struct TLVLenVisitor;
impl Serializer for TLVLenVisitor {
    type Ok = usize;
    type SerializeSeq = TLVLenSumVisitor;

    fn serialize_i32(self, _v: i32) -> Result<Self::Ok, Self::Error> {
        Ok(4)
    }
    fn serialize_str(self, str: &str) -> Result<Self::Ok, Self::Error> {
        Ok(str.len())
    }
    fn serialize_seq(self, _len: Option<usize>) -> Result<Self::SerializeSeq, Self::Error> {
        Ok(TLVLenSumVisitor { sum: 0 })
    }
}
struct TLVLenSumVisitor {
    sum: usize,
}
impl serde::ser::SerializeSeq for TLVLenSumVisitor {
    type Ok = usize;
    fn serialize_element<T: Serialize + ?Sized>(&mut self, value: &T) -> Result<(), Self::Error> {
        // The length of a sequence is the length of all its parts, plus the bytes for type tag and length
        self.sum += value.serialize(TLVLenVisitor)? + HEADER_LEN;
        Ok(())
    }
    fn end(self) -> Result<Self::Ok, Self::Error> {
        Ok(self.sum)
    }
}

Fortunately, serialization is non-destructive, so you can use this first serializer to get the length, and then do the actual serialization in a second pass:

    let len = foobar.serialize(TLVLenVisitor).unwrap();
    foobar.serialize(TLVSerializer {
        target: &mut File::create("foobar").unwrap(), // No seeking performed on the file
        len,
        ttf: &foobar_type_tag_for,
    })
    .unwrap();

Since you already know the length of what you're serializing, the second serializer is relatively straightforward:

struct TLVSerializer<'a> {
    target: &'a mut dyn Write, // Using dyn to reduce verbosity of the example
    len: usize,
    ttf: &'a dyn Fn(TypeTagFor) -> u32,
}
impl<'a> Serializer for TLVSerializer<'a> {
    type Ok = ();
    type SerializeSeq = TLVSeqSerializer<'a>;

    // Glossing over error handling here.
    fn serialize_seq(self, _len: Option<usize>) -> Result<Self::SerializeSeq, Self::Error> {
        self.target
            .write_all(&(self.ttf)(TypeTagFor::Seq).to_be_bytes())
            .unwrap();
        // Normally, there'd be no way to find the length here.
        // But since TLVSerializer has been told, there's no problem
        self.target
            .write_all(&u32::try_from(self.len).unwrap().to_be_bytes())
            .unwrap();
        Ok(TLVSeqSerializer {
            target: self.target,
            ttf: self.ttf,
        })
    }
}

The only snag you may hit is that the TLVLenVisitor only gave you one length. But you have many TLV-structures, recursively nested. When you want to write out one of the nested structures (e.g. a Vec), you just run the TLVLenVisitor again, for each element.

struct TLVSeqSerializer<'a> {
    target: &'a mut dyn Write,
    ttf: &'a dyn Fn(TypeTagFor) -> u32,
}
impl<'a> serde::ser::SerializeSeq for TLVSeqSerializer<'a> {
    type Ok = ();

    fn serialize_element<T: Serialize + ?Sized>(&mut self, value: &T) -> Result<(), Self::Error> {
        value.serialize(TLVSerializer {
            // Getting the length of a subfield here
            len: value.serialize(TLVLenVisitor)?,
            target: self.target,
            ttf: self.ttf,
        })
    }

    fn end(self) -> Result<Self::Ok, Self::Error> {
        Ok(())
    }
}
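
If the serde boilerplate obscures what is going on, the same two passes can be sketched with std only. `Node` is a made-up stand-in for whatever serde would walk, and `value_len`, `write_node` and both tag constants are hypothetical as well:

```rust
use std::convert::TryFrom;
use std::io::{self, Write};

const HEADER_LEN: usize = 8; // 4-byte tag + 4-byte length

// Hypothetical tags, chosen only for this sketch.
const TAG_I32: u32 = 0x6933_3200;
const TAG_SEQ: u32 = 0x7365_7100;

/// Stand-in for the data being serialized: an i32 leaf or a sequence of nodes.
enum Node {
    I32(i32),
    Seq(Vec<Node>),
}

/// Pass 1: length of the value part only (this node's own header excluded).
fn value_len(node: &Node) -> usize {
    match node {
        Node::I32(_) => 4,
        // Each child contributes its own header plus its value.
        Node::Seq(items) => items.iter().map(|n| HEADER_LEN + value_len(n)).sum(),
    }
}

/// Pass 2: write the tag, the precomputed length, then the value.
fn write_node(node: &Node, out: &mut dyn Write) -> io::Result<()> {
    let len = u32::try_from(value_len(node))
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
    match node {
        Node::I32(v) => {
            out.write_all(&TAG_I32.to_be_bytes())?;
            out.write_all(&len.to_be_bytes())?;
            out.write_all(&v.to_be_bytes())
        }
        Node::Seq(items) => {
            out.write_all(&TAG_SEQ.to_be_bytes())?;
            out.write_all(&len.to_be_bytes())?;
            for item in items {
                // This call runs value_len again for the nested node.
                write_node(item, out)?;
            }
            Ok(())
        }
    }
}
```

Note that `write_node` calls `value_len` once per node, and each of those calls walks the whole subtree below it, so deeply nested data gets measured over and over.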

This also means that you may have to do many passes over the structure you're serializing. This might be fine if speed is not of the essence and you're memory-constrained, but in general, I don't think it's a good idea. You may be tempted to try to get all the lengths in the entire structure in a single pass, which can be done, but it'll either be brittle (since you'd have to rely on visiting order) or difficult (because you'd have to build a shadow structure which contains all the lengths).
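
For the curious, the "rely on visiting order" variant could look roughly like this. It is a std-only sketch, not a serde implementation: `Node`, `collect_lens`, `write_with_lens`, `encode` and the tags are all made up. One pass records every length in pre-order, and the write pass consumes them in the same order:

```rust
use std::convert::TryFrom;

const HEADER_LEN: usize = 8; // 4-byte tag + 4-byte length
const TAG_LEAF: u32 = 1; // hypothetical tags
const TAG_SEQ: u32 = 2;

enum Node {
    Leaf(Vec<u8>),
    Seq(Vec<Node>),
}

/// Length pass: store each node's value length in pre-order and return it.
fn collect_lens(node: &Node, lens: &mut Vec<usize>) -> usize {
    let slot = lens.len();
    lens.push(0); // reserve this node's slot before visiting its children
    let len: usize = match node {
        Node::Leaf(bytes) => bytes.len(),
        Node::Seq(items) => items
            .iter()
            .map(|child| HEADER_LEN + collect_lens(child, lens))
            .sum(),
    };
    lens[slot] = len;
    len
}

/// Write pass: must visit nodes in exactly the order collect_lens did.
fn write_with_lens(node: &Node, lens: &mut impl Iterator<Item = usize>, out: &mut Vec<u8>) {
    let len = lens.next().expect("fewer recorded lengths than nodes");
    let len = u32::try_from(len).expect("value too long for a u32 length");
    match node {
        Node::Leaf(bytes) => {
            out.extend_from_slice(&TAG_LEAF.to_be_bytes());
            out.extend_from_slice(&len.to_be_bytes());
            out.extend_from_slice(bytes);
        }
        Node::Seq(items) => {
            out.extend_from_slice(&TAG_SEQ.to_be_bytes());
            out.extend_from_slice(&len.to_be_bytes());
            for item in items {
                write_with_lens(item, lens, out);
            }
        }
    }
}

/// Run both passes: every node is measured once and written once.
fn encode(node: &Node) -> Vec<u8> {
    let mut lens = Vec::new();
    collect_lens(node, &mut lens);
    let mut out = Vec::new();
    write_with_lens(node, &mut lens.into_iter(), &mut out);
    out
}
```

The brittleness is visible in `write_with_lens`: nothing ties a recorded length to its node except the position in the list, so any difference in traversal order between the two passes silently produces wrong lengths.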

Also, do note that this approach expects that two serializer invocations on the same struct traverse the same structure. But an implementer of Serialize is perfectly capable of generating random data on the fly or mutating itself via interior mutability, which would make this serializer generate invalid data. You can ignore that problem since it's far-fetched, or add a check to the end call that makes sure the announced length matches the data actually written.
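
One way to build such a check (my own suggestion, not something serde provides) is to wrap the target in a writer that counts bytes and compare the count against the announced length afterwards. `CountingWriter` and `write_checked` are hypothetical names:

```rust
use std::io::{self, Write};

/// Wraps a writer and counts the bytes that pass through it.
struct CountingWriter<W: Write> {
    inner: W,
    written: usize,
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.written += n; // count only what the inner writer accepted
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Run `body`, then fail if it did not write exactly `expected` bytes.
fn write_checked<W: Write>(
    out: W,
    expected: usize,
    body: impl FnOnce(&mut dyn Write) -> io::Result<()>,
) -> io::Result<()> {
    let mut counter = CountingWriter { inner: out, written: 0 };
    body(&mut counter)?;
    if counter.written != expected {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("announced {} bytes, wrote {}", expected, counter.written),
        ));
    }
    Ok(())
}
```

Keep in mind that this only detects the mismatch. By the time the error is returned, the bad bytes have already reached the underlying writer.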


Really, I think it'd be best if you didn't worry about finding the length before serialization and wrote the serialization result to memory first. To do so, you first write every length field as a dummy value to a Vec<u8>:

struct TLVSerializer<'a> {
    target: &'a mut Vec<u8>,
    ttf: &'a dyn Fn(TypeTagFor) -> u32,
}
impl<'a> Serializer for TLVSerializer<'a> {
    type Ok = ();
    type SerializeSeq = TLVSeqSerializer<'a>;
    
    fn serialize_seq(self, _len: Option<usize>) -> Result<Self::SerializeSeq, Self::Error> {
        let idx = self.target.len();
        self.target
            .extend((self.ttf)(TypeTagFor::Seq).to_be_bytes());
        // Writing dummy length here
        self.target.extend(u32::MAX.to_be_bytes());
        Ok(TLVSeqSerializer {
            target: self.target,
            idx,
            ttf: self.ttf,
        })
    }
}

Then after you serialize the content and know its length, you can overwrite the dummies:

struct TLVSeqSerializer<'a> {
    target: &'a mut Vec<u8>,
    idx: usize, // This is how it knows where it needs to write the length
    ttf: &'a dyn Fn(TypeTagFor) -> u32,
}
impl<'a> serde::ser::SerializeSeq for TLVSeqSerializer<'a> {
    type Ok = ();

    fn serialize_element<T: Serialize + ?Sized>(&mut self, value: &T) -> Result<(), Self::Error> {
        value.serialize(TLVSerializer {
            target: self.target,
            ttf: self.ttf,
        })
    }

    fn end(self) -> Result<Self::Ok, Self::Error> {
        end(self.target, self.idx)
    }
}

fn end(target: &mut Vec<u8>, idx: usize) -> Result<(), std::fmt::Error> {
    let len = u32::try_from(target.len() - idx - HEADER_LEN)
        .unwrap()
        .to_be_bytes();
    target[idx + 4..][..4].copy_from_slice(&len);
    Ok(())
}

And there you go: single-pass TLV serialization with serde.
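
To see the backpatching on actual bytes, here is a std-only sketch with two nested records. `begin`, `finish` and `nested_example` are hypothetical helpers, and the tags 1 and 2 are arbitrary:

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

const HEADER_LEN: usize = 8; // 4-byte tag + 4-byte length

/// Start a record: write the tag and a placeholder length, return the header's offset.
fn begin(target: &mut Vec<u8>, tag: u32) -> usize {
    let idx = target.len();
    target.extend_from_slice(&tag.to_be_bytes());
    target.extend_from_slice(&u32::MAX.to_be_bytes()); // dummy, patched in finish
    idx
}

/// Finish a record: overwrite the placeholder with the bytes written since begin.
fn finish(target: &mut Vec<u8>, idx: usize) -> Result<(), TryFromIntError> {
    let len = u32::try_from(target.len() - idx - HEADER_LEN)?;
    target[idx + 4..idx + HEADER_LEN].copy_from_slice(&len.to_be_bytes());
    Ok(())
}

/// An outer record (tag 1) holding two raw bytes and an inner record (tag 2).
fn nested_example() -> Result<Vec<u8>, TryFromIntError> {
    let mut buf = Vec::new();
    let outer = begin(&mut buf, 1);
    buf.extend_from_slice(&[0xAA, 0xBB]);
    let inner = begin(&mut buf, 2);
    buf.push(0xCC);
    finish(&mut buf, inner)?; // inner length becomes 1
    finish(&mut buf, outer)?; // outer length becomes 2 + 8 + 1 = 11
    Ok(buf)
}
```

Because each record only needs to remember its own offset, nesting works without any extra bookkeeping: inner records are patched before the outer one, and the outer length naturally includes the inner headers.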