I am currently working on an architecture where users can post content to any server. To ensure the content has actually been posted by a certain user (and has not been altered after being posted), a signature is created using the private key of the author of the content, whose public key is available to everyone in a centralized repository.
Problem is, I have no control over how the content is actually stored on these servers. So I might transmit the content e.g. as a JSON object with all data being base64-encoded, and the signature is created over a hash of the base64-encoded values concatenated in a certain order:
{
"a": "b",
"c": "d",
"signature": "xyz"
}
with
signature := sign(PrivKey, hash(b + d));
Now the server will probably store this content in another way, e.g. in a database. So maybe the encoding changes. Maybe a mysql_real_escape_string() is done in PHP, so stuff gets lost. If one then wants to check the signature, the exact bytes that were signed may no longer be recoverable and verification fails.
So usually when creating signatures you have a fixed encoding and a byte sequence (or string) with some kind of unambiguous delimiter - which is not the case here.
Hence the question: How to deal with signatures in this kinda scenario?
It is still required to have a specific message representation in bits or bytes to be able to sign it. There are two ways to do this:
A canonical representation of a message is a single, uniquely defined encoding of the data: every message with the same meaning maps to exactly the same byte sequence, and messages with different meanings map to different ones. Producing it may for instance include sorting the entries of a table (as long as the order doesn't change the meaning of the table), removing insignificant whitespace etc.
XML Signature, for instance, specifies canonicalization methods for XML (Canonical XML). Obviously it is not possible to define canonicalization for data that has no intrinsic structure. Another, even more complicated, canonical representation is DER for ASN.1 messages (used e.g. for X.509 certificates themselves, as well as for the DigestInfo structure inside RSA PKCS#1 v1.5 signatures).
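For the JSON object from the question, a canonical form could be sketched roughly as follows. This is only a minimal illustration, not a complete scheme: the helper names canonical_bytes and digest are made up for this example, SHA-256 is an assumed choice of hash, and it is assumed that the "signature" field is left out of the signed data. The resulting digest is what would be handed to sign() or to the verification routine.

```python
import hashlib
import json


def canonical_bytes(message: dict) -> bytes:
    """Serialize the signed fields to one fixed byte sequence (hypothetical helper)."""
    # Assumption: the "signature" field itself is not part of the signed data.
    signed_fields = {k: v for k, v in message.items() if k != "signature"}
    return json.dumps(
        signed_fields,
        sort_keys=True,         # fixed key order
        separators=(",", ":"),  # no insignificant whitespace
        ensure_ascii=True,      # non-ASCII characters are escaped, output is pure ASCII
    ).encode("ascii")


def digest(message: dict) -> bytes:
    """Hash of the canonical form; this is the input to sign/verify."""
    return hashlib.sha256(canonical_bytes(message)).digest()


original = {"a": "b", "c": "d", "signature": "xyz"}
# The same content as a server might hand it back: keys reordered, extra whitespace.
restored = json.loads('{ "c": "d",\n  "signature": "xyz", "a": "b" }')
tampered = {"a": "b", "c": "e", "signature": "xyz"}

print(canonical_bytes(original))             # b'{"a":"b","c":"d"}'
print(digest(original) == digest(restored))  # True
print(digest(original) == digest(tampered))  # False
```

Because the field names and JSON string quoting stay part of the signed bytes, the values "b" and "d" can no longer be confused with, say, "bd" and an empty string, which plain concatenation b + d would allow. A real deployment would also have to pin down number formatting and Unicode normalization; RFC 8785 (JSON Canonicalization Scheme) specifies such rules in full.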