Most efficient method for persisting complex types with variable schemas in SQL

Question

Most efficient method for persisting complex types with variable schemas in SQL

2.1k views Asked by smartcaveman At 23 February 2011 at 14:13

What I'm doing

I am creating an SQL table that will provide the back-end storage mechanism for complex-typed objects. I am trying to determine how to accomplish this with the best performance. I need to be able to query on each individual simple type value of the complex type (e.g. the String value of a City in an Address complex type).

I was originally thinking that I could store the complex type values in one record as an XML, but now I am concerned about the search performance of this design. I need to be able to create variable schemas on the fly without changing anything about the database access layer.

Where I'm at now

Right now I am thinking to create the following tables.

 TABLE:  Schemas
   COLUMN NAME       DATA TYPE
   SchemaId          uniqueidentifier
   Xsd               xml                 //contains the schema for the document of the given complex type
   DeserializeType   varchar(200)        //The Full Type name of the C# class to which the document deserializes.

 TABLE:  Documents
   COLUMN NAME       DATA TYPE      
   DocumentId        uniqueidentifier
   SchemaId          uniqueidentifier

 TABLE:  Values                            //The DocumentId+ValueXPath function as a PK
   COLUMN NAME       DATA TYPE      
   DocumentId        uniqueidentifier
   ValueXPath        varchar(250)
   Value             text

from these tables, when performing queries I would do a series of self-joins on the value table. When I want to get the entire object by the DocumentId, I would have a generic script for creating a view mimics a denormalized datatable of the complex-type.

What I want to know

I believe there are better ways to accomplish what I am trying to, but I am a little too ignorant about the relative performance benefits of different SQL techniques. Specifically I don't know the performance cost of:

1 - comparing the value of a text field versus of a varchar field.
2 - different kind of joins versus nested queries
3 - getting a view versus an xml document from the sql db
4 - doing some other things that I don't even know I don't know would be affecting my query but, I am experienced enough to know exist

I would appreciate any information or resources about these performance issues in sql as well as a recommendation for how to approach this general issue in a more efficient way.

For Example,

Here's an example of what I am currently planning on doing.

I have a C# class Address which looks like

public class Address{
     string Line1 {get;set;}
     string Line2 {get;set;}
     string City {get;set;}
     string State {get;set;}
     string Zip {get;set;
}

An instance is constructed from new Address{Line1="17 Mulberry Street", Line2="Apt C", City="New York", State="NY", Zip="10001"}

its XML value would be look like.

<Address>
   <Line1>17 Mulberry Street</Line1>
   <Line2>Apt C</Line2>
   <City>New York</City>
   <State>NY</State>
   <Zip>10001</Zip>
</Address>

Using the db-schema from above I would have a single record in the Schemas table with an XSD definition of the address xml schema. This instance would have a uniqueidentifier (PK of the Documents table) which is assigned to the SchemaId of the Address record in the Schemas table. There would then be five records in the Values table to represent this Address.

They would look like:

DocumentId                              ValueXPath        Value
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/Line1    17 Mulberry Street
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/Line2    Apt C
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/City     New York
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/State    NY
82415E8A-8D95-4bb3-9E5C-AA4365850C70    /Address/Zip      10001

Just Added a Bounty...

My objective is to obtain the resources I need in order to give my application a data access layer that is fully searchable and has a data-schema generated from the application layer that does not require direct database configuration (i.e. creating a new SQL table) in order to add a new aggregate root to the domain model.

I am open to the possibility of using .NET compatible technologies other than SQL, but I will require that any such suggestions be adequately substantiated in order to be considered.

Original Q&A

There are 5 answers

**averell** · Answer 1 · 2011-02-23T14:34:30+00:00

Somehow what you want sounds like a painful thing to do in SQL. Basically, you should treat the inside of a text field as opaque as when querying an SQL database. Text fields were not made for efficient queries.

If you just want to store serialized objects in a text field, that is fine. But do not try to build queries that look inside the text field to find objects.

Your idea sounds like you want to perform some joins, XML parsing, and XPath application to get to a value. This doesn't strike me as the most efficient thing to do.

So, my advise:

Either just store serialized objects in the db, and do nothing more than load them and perform all other operations in memory
Or, if you need to query complex data structures, you may really want to look into document stores/databases like CouchDB or MongoDB; you can also check Wikipedia on the subject. There are even databases specifically designed for storing XML, even though I personally don't like them very much.

Addendum, per your explanations above

Simply put, don't go over the top with this thing:

If you just want to persist C#/.NET objects, just use the XML Serialization already built into the framework, a single table and be done with it.
If you, for some reason, need to store complex XML, use a dedicated XML store
If you have a fixed database schema, but it is too complex for efficient queries, use a Document Store in memory where you keep a denormalized version of your data for faster queries (or just simplify your database schema)
If you don't really need a fixed schema, use just a Document Store, and forget about having any "schema definition" at all

As for your solution, yes, it could work somehow. As could a plain SQL schema if you set it up right. But for applying an XPath, you'll probably parse the whole XML document each time you access a record, which wouldn't be very efficient to begin with.

If you want to check out Document databases, there are .NET drivers for CouchDB and MongoDB. The eXist XML database offers a number of Web protocols, and you can probably create a client class easily with VisualStudio's point-and-shoot interface. Or just google for someone who already did.

**Alex** · Answer 2 · 2011-02-23T15:18:32+00:00

In part, it will depend of your DB Engine. You're using SQL Server, don't you?

Answering your topics:

1 - Comparing the value of a text field versus of a varchar field: if you're comparing two db fields, varchar fields are smarter. Nvarchar(max) stores data in unicode with 2*l+2 bytes, where "l" is the lengh. For performance issues, you will need consider how much larger tables will be, for selecting the best way to index (or not) your table fields. See the topic.

2 - Sometimes nested queries are easily created and executed, also serving as a way to reduce query time. But, depending of the complexity, would be better to use different kind of joins. The best way is try to do in both ways. Execute two or more times each query, for the DB engine "compiles" a query on first executing, then the subsequent are quite faster. Measure the times for different parameters and choose the best option.

"Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row." - link

3- There's no much information in this question, but if you will get the xml document directly from the table, would be a good idea insted a view. Again, it will depends of the view and the document.

4- Other issues is about the total records expected for your table; the indexing of the columns, in wich you need to consider sorting, joining, filtering, PK's and FK's. Each situation could demmand different aproaches. My sugestion is to invest some time reading about your database engine and queries functioning and relating to your system.

I hope I've helped.

**Neville Kuyt** · Answer 3 · 2011-03-09T11:36:32+00:00

Interesting question.

I think you may be asking the wrong question here. Broadly speaking, as long as you have a FULLTEXT index on your text field, queries will be fast. Much faster than varchar if you have to use wild cards, for instance.

However, if I were you, I'd concentrate on the actual queries you're going to be running. Do you need boolean operators? Wildcards? Numerical comparisons? That's where I think you will encounter the real performance worries.

I would imagine you would need queries like:

"find all addresses in the states of New York, New Jersey and Pennsylvania"
"find all addresses between house numbers 1 and 100 on Mulberry Street"
"find all addresses where the zipcode is missing, and the city is New York"

At a high level, the solution you propose is to store your XML somewhere, and then de-normalize that XML into name/value pairs for querying.

Name/value pairs have a long and proud history, but become unwieldy in complex query situations, because you're not using the built-in optimizations and concepts of the relational database model.

Some refinements I'd recommend is to look at the domain model, and at least see if you can factor out separate data types into the "value" column; you might end up with "textValue", "moneyValue", "integerValue" and "dateValue". In the example you give, you might factor "address 1" into "housenumber" (as an integer) and "streetname".

Having said all this - I don't think there's a better solution other than completely changing tack to a document-focused database.

**Ben** · Answer 4 · 2011-03-10T18:15:06+00:00

I need to be able to create variable schemas on the fly without changing anything about the database access layer.

You are re-implementing the RDBMS within an RDBMS. The DB can do this already - that is what the DDL statements like create table and create schema are for....

I suggest you look into "schemas" and SQL security. There is no reason with the correct security setup you cannot allow your users to create their own tables to store document attributes in, or even generate them automatically.

Edit: Slightly longer answer, if you don't have full requirements immediately, I would store the data as XML data type, and query them using XPath queries. This will be OK for occasional queries over smallish numbers of rows (fewer than a few thousand, certainly).

Also, your RDBMS may support indexes over XML, which may be another way of solving your problem. CREATE XML INDEX in SqlServer 2008 for example.

However for frequent queries, you can use triggers or materialized views to create copies of relevant data in table format, so more intensive reports can be speeded up by querying the breakout tables.

I don't know your requirements, but if you are responsible for creating the reports/queries yourself, this may be an approach to use. If you need to enable users to create their own reports that's a bigger mountain to climb.

I guess what i am saying is "are you sure you need to do this and XML can't just do the job".

**Derick Schoonbee** · Answer 5 · 2011-03-11T10:43:30+00:00

How about looking for a solution at the architectural level? I was also breaking my head on complex graphs and performance until I discovered CQRS.

[start evangelist mode]

You can go document-based or relational as storage. Even both! (Event Sourcing)
Nice separation of concerns: Read Model vs Write Model
Have your cake and eat it too!

Ok, there is an initial learning / technical curve to get over ;)

[end evangelist mode]

As you stated: "I need to be able to create variable schemas on the fly without changing anything about the database access layer." The key benefit is that your read model can be very fast since it's made for reading. If you add Event Sourcing to the mix, you can drop and rebuild your Read Model to whatever schema you want... even "online".

There are some nice opensource frameworks out there like nServiceBus which saves lots of time and technical challenges. All depends on how far you want to take these concepts what you're willing/can spend time on. You can even start with just basics if you follow Greg Young's approach. See the info in the links below.

See

TechQA.

Most efficient method for persisting complex types with variable schemas in SQL

Just Added a Bounty...

There are 5 answers

Related Questions in SQL

Related Questions in PERFORMANCE

Related Questions in DATABASE-DESIGN

Related Questions in SELF-JOIN

Related Questions in DOCUMENT-DATABASE

Popular Questions

Trending Questions