Wednesday, September 19, 2007

The Past and the Future

Just finished reading two great articles about the past, history, and the future.

The first article is from Joel Spolsky (Strategy Letter VI) and how history tends to repeat itself. Starting there, and with some extrapolation from performance considerations in the past (with Excel) and the present (with Ajax and JavaScript), some patterns become apparent (although the actual names could change). But the real kicker is from the comparison between mainframes, CICS and the evolution to microcomputers and Windows to the current state of the Web and it's problems (interoperability between Ajax applications). Very good and thought provoking. It just reminds me of how circular history is: we started with mainframes and dumb terminals, then to client PCs applications (Office, Access, fat clients), then back to big servers and thin clients (with the Web), and with smart clients we are returning the processing power to the client.

 

The other great post (more of a link to actual MSDN content - Finally some news on PLINQ), reveals the future after .NET 3.5 and MS solutions to the increasing cores with the same speed problem.

The MSDN article shows the primitives for concurrency and how easily concurrency can be "added" to a program. The PTasks article shows it's application to Linq (specially Linq to SQL).

Just checked the MSDN October cover, and it's mainly about concurrency/threads and multiple cores. I'll put it on my short list of "must read"s.

Marcas Technorati: , ,

Friday, September 14, 2007

Yet another PK debate

Recently the article in http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx got me thinking again in DB design, especially the PK dilemma (natural or surrogate).

So i googled around to see the current discussion and came up with some links:

Articles defending surrogate keys:
http://www.4guysfromrolla.com/webtech/110806-1.shtml
http://searchsqlserver.techtarget.com/general/0,295582,sid87_gci1156604,00.html
http://r937.com/20020620.html
http://www.bcarter.com/intsurr1.htm

http://www.aspfaq.com/show.asp?id=2504
http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html
Articles defending natural keys:
http://www.sqlteam.com/article/identity-and-primary-keys

http://www.solidqualitylearning.com/blogs/Dejan/archive/2005/02/24/434.aspx

http://groups.google.com/group/microsoft.public.sqlserver.programming/msg/6d61dbf80d6f0fb6?hl=en&lr=&ie=UTF-8&oe=UTF-8&rnum=14

But returning to the initial article, in it, it's argued that in relation tables the PK should be natural (composite of the PKs of the relations). Naturally, it seems right, and it aligns well with the relational theory. Unfortunately, business clients don't align so well ;)

The initial problem is modeling a CustomerProducts relation table, to which the following two solutions are proposed:

Create table CustomerProducts
(
    CustomerID int references Customers(CustomerID) not null,
    ProductID int references Products(ProductID) not null,
    OrderLimit int not null,
    Primary key (CustomerID, ProductID)
)

Create table CustomerProducts
(
    Customer_ProductID int identity primary key,
    CustomerID int references Customers(CustomerID) not null,
    ProductID int references Products(ProductID) not null,
    OrderLimit int not null
)

The exchange of ideas in the comments was good, and it got me thinking how I do design.

So here are my thoughts (published also in the comments of the initial article).

One of the downsides of the modeling based on natural keys is it's averse to change.

Of course with an identity column, the model, the application must be validated, and possibly changed (including migration) to acommodate the change. But the change doesn't ripple throughout the model as with natural keys.

I honestly can't see the big difference between entity tables and relation tables. I could argue (and many DB purists would) that the PK in an entity table should be a natural key (probably composite).

And it appears you can relate to the problems it introduces:

  • very difficult to change (I can only imagine what must have been the migration of a DB where ISBN was used as a PK, or using the bank filial as part of the PKs of bank cards - imagine changing the account);
  • no significant advantadge over the more flexible surrogate model.

To me the same argument holds with relation tables. It eases future changes. The disadvantages are extra complexity, but the end result is more flexible.

I haven't seen a solid example of the supposed advantages (in practice) of the composite key approach. The extra space isn't so much as stated if considering that other indexes include the PK. So the wider is the PK, bigger becomes the index.

I haven't seen any examples or papers that demonstrate in practice that it's more performant to use composite keys (in general).

Regarding the purity or correctness of the model, I can only say, that it's a model... It can be abused if one tries to be to flexible, but sometimes it's better to be flexible and pragmatic than to be rigid and theoretic.

This comment generated a response by Jeff Smith, in which he defends his article conclusions.

 

Until any of this is refuted, we've clearly established that, for relation tables:
* using a natural PK uses less space, less resources, since if you use an identity, you MUST also introduce a unique constraint on your true key columns
* using a natural PK clarifies your data model and makes it easier to read, understand and work with
* using a natural PK simplifies your SQL statements
* using a natural PK enforces data integrity without the need for an additional unique constraint
* using a natural PK results in true normalization, since your attributes are fully dependent on your primary key and not just a meaningless, random number
* using an identity PK actually complicates things like editing a relation, if you are relying only on the PK, since knowing that you are editing or viewing "CustomerProductID 3922" means nothing, while knowing that you are editing or vewing the relation between "CustomerID 23 and ProductID 4" tells you everything, simplifying your code and your logic.
I'd write more, but again, I feel as though I am just repeating the same points over and over and no one is specifically addressing them ....
Let's talk actuals, here, not theory ... can you please give me a simple example where using an identity as a PK for a relation table, instead of a composite PK, makes future changes "easier"?

 

But this got me thinking even more in the merits and shortcomings of each design technique, in primary keys and it's relation with natural keys, business keys/rules, etc.

So my response headed for evolutive maintenance, change, flexibility:

I just can't run away from a good discussion. :)

Suppose that the client comes in and wants to introduce some “minor” changes/improvements.
First, he wants to keep the history of changes in the CustomerProducts table. So you create a CustomerProductsHistory table with the timestamp of the operation.
But he also reports that querys to find associations within an OrderLimit range are running very slowly. So you add an index to the OrderLimit.

These “changes” compound and the table becomes larger and wider with more indexes. Child tables are added and because we shouldn't use a surrogate as PK, we keep the parent PK and add some other attribute. Using Ids as the PKs from Customer and Products, the effect is lessened (the size of the PK is just double of a surrogate key, but with natural keys the effect is clear), but eventually your indexes start to suffer because the PK is larger, as well as all child tables PKs and indexes.

Future changes are easier because you get more flexibility in maintenance and with the model.

  • In maintenance: if a mistake was detected and you MUST correct an association (the CustomerId or ProductId), you don’t have to mess with the PK. Because you didn’t expose the PK of the table to the customer (the surrogate key), you can change the data easily.
  • In evolutions: if there’s a need to change the CustomerProducts table (you want to turn it into a temporal table, adding a start date and end date), it’s much easier if there’s a surrogate key, especially if the table has child tables.

I think most pragmatic people, having been through similar situations, can see the benefits and flexibility of the surrogate key model. Besides that, there’s also the other advantages mentioned like:

  • Consistency in the model (easing mapping tools and development),
  • Comprehension and usage of the model (this table PK is what 4 columns? And joins this other table with what 3 columns?) because the PK is just TableNameId and joins with the FK named TableNameId

The last point I don’t understand. You say CustomerProductID 3922 means nothing. Yet CustomerId 23 has meaning? Why? Only if you look it up in the Customers table… I just hope you don’t show CustomerId 23 in your UI, because then you just exposed the PK, and you can’t ever change it easily.

I’ll concede to the argument that it’s not completely 3NF. But if you go this way, why not make all the tables 5NF mandatory?

Just my 0.02 €

And just to clear Jeff Smith's doubts:

When you create your tables, and you are adding your identity primary keys, do you also create additional unique constraints to all of your tables to ensure you have data integrity?

Got me thinking in the design of the models that I currently adhere, and why I like them:

Yes, of course.
But maybe I wasn't explicit in my last comment (or there's some ambiguity in my terms). To me TableNameId is the PK of the table. It's "private" and not exposed to the business. If the natural key (defined by the business) is just a number, I normally use TableNameNumber (eg: CustomerNumber, ProductNumber). This column is visible to the business, and follows the business rules (no gaps, prefixes, suffixes, etc).
This way the model is very consistent, predictable, flexible and usable. It may have some overhead (not much), and not be completely normalized, but to me it's a good compromise.

Marcas Technorati: ,