I am designing a java app that uses algorithms to import data from other sources into a database. And the app also searches for records in the database.
How can I implement row level security in a way that the database does not even know that the data is encrypted, but also in a way that allows searching of the database using queries called from the java code?
I can use BouncyCastle to encrypt each field in each row of data before it is inserted into the database. But then how do I search the rows if every row and field in the database is individually encrypted? Is the answer as simple as encrypting each search parameter using the same keys before the search parameters are passed into the SQL or JPA SELECT queries? Or is a more complicated approach required?
I am using MySQL at the moment, but it would be nice if this were agnostic with respect to database vendor.
One of the most important properties of good encryption is that similar plaintexts are encrypted into vastly different ciphertexts. Roughly half of the bits of two ciphertexts will match. This property makes it hard (impossible) to formulate any kind of query that looks for substrings through
LIKE
or determines whether field values are greater or smaller than a given value.There is another property and that is semantic security. When the same plaintext is encrypted under the same key, the produced ciphertexts should be different. This property makes it impossible for an attacker get some meta information about the plaintext blocks, but this property must be dropped, because of the way the proposed solution works.
Let's take for example AES as a basic encryption primitive in CBC mode. It's block size it 16 bytes, so ciphertexts will be a multiple of this. If this is an overhead that is too big, you should use Triple DES with three different keys (=24 byte key for 168-bit security).
Simple Example:
All the table cells are encrypted using the same key. Now, you want to query the table to get rows where one column has a specific value. First, you encrypt the value to match with the same key and since we said there is no semantic security, the resulting ciphertext will exactly match the ciphertext in the table.
Then you iterate over the result set and decrypt every value. Warning: query is not parametrized for the sake of simplicity. Use prepared statements to disable SQL injection.
Achieving non-semantic security:
ECB mode is a pillar for insecurity and I would suggest to use CBC mode with a static IV (maybe all 0x00 bytes:
new byte[16];
) instead. There are other modes of operation that are also deterministic, but more on that later.Limitations:
order by
<
,>
,<=
or>=
in the where clauseMake it interesting:
There are some things that you can do to increase the security.
If you know beforehand that you will never try to see if two columns have the same value, then you can use a semi-randomized approach where you would assign every column of every table a different random Initialization Vector (IV). That way an attacker cannot try to match the ciphertexts from one column with the ciphertexts from another column to find similarities to get some meta data about the plaintexts.
If decreasing the overhead is not that important, you can opt-in for a deterministic authenticated encryption mode like SIV, but not CCM or GCM (not sure about EAX). It has only the overhead of the authentication tag (16 byte for AES). By using it, you can always check if the ciphertext was manipulated by somebody and you can check whether the ciphertext value was moved from another table cell, because you can simply use the column name as associated data. It is still be hard to detect if it was moved in the column without serious performance drop.
Fantasies about fixing the limitations
Order-Preserving Encryption can be used to fix the 1st limitation presented above, but you compomise the security, because
Source: How does order-preserving encryption work?
The 2nd limitation may be avoided (and maybe the other ones too) if the SQL flavor provides encryption functions directly in SQL, but this is probably too slow to be used in a large scale.
Public-Key Crypto
You may have noticed that I only referred to symmetric crypto throughout. It's not necessary to use only symmetric crypto, but the problem with for example RSA is that the ciphertexts are huge (256 byte for a 2048-bit key) compared to that little overhead for AES. The footprint for ECC based encryption is much better (such as ElGamal Encrypt).
The other nice thing about Public-Key Crypto is that you can query the data all you want, but you can't decrypt it without the private key. So you can always put data in (using the public key), but only get the data out with the private key.