Normalised or Denormalised?

Hi All

I’m working on a server application and I’m fiddling with the database design at the moment, trying different ideas. It’s been a long time since I did any database design work (the Continua CI schema back in 2012) so I’m a bit rusty!

In the past I have usually normalised the tables (i.e. used relationships rather than duplicating data) - the benefits are

  • Data integrity
  • Data size

The downsides are

  • it’s a pain to work with - joins and more joins and more…
  • performance - all those joins slow things down.

Since this application will be heavily skewed towards reads (search) I’m leaning towards denormalising a bit (at the moment it’s completely normalised). There will be inserts of course but very little in the way of updates (initial processing after which the data doesn’t change).

FWIW, this is for my package manager project, the tables look like this (well the ones that matter)

Package (name) → PackageTargetPlatform (compiler, platform) → PackageVersion (version, metadata etc)

Right now in testing there are only 36 rows in the Package table and a few hundred in the PackageTargetPlatform table - the PackageVersion table has 6500 rows. The explosion comes from the combinations of compiler version, platform and package version (13 compiler versions, win32/64/mac/android etc), so the PackageVersion table has the potential to blow out to quite a few rows over time.
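In simplified DDL terms it looks roughly like this (just a sketch - the types and constraints here are illustrative and the real tables have more columns):

-- Simplified sketch of the three tables; types/constraints are illustrative.
CREATE TABLE package (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    packageid text   NOT NULL UNIQUE,   -- the public package id used in searches
    name      text   NOT NULL
);

CREATE TABLE package_targetplatform (
    id               bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    package_id       bigint NOT NULL REFERENCES package (id),
    compiler_version text   NOT NULL,
    platform         text   NOT NULL,
    UNIQUE (package_id, compiler_version, platform)
);

CREATE TABLE package_version (
    id                bigint NOT NULL GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    targetplatform_id bigint NOT NULL REFERENCES package_targetplatform (id),
    version           text   NOT NULL,
    -- plus the metadata columns (description, dependencies, hashes etc.)
    UNIQUE (targetplatform_id, version)
);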

Using Postgresql 13 FWIW (have used it for years with Continua CI).

I obviously need to do some testing with the different options, but I'm wondering what others do. How far do you go with denormalising (if at all), and how do you manage data integrity if you do?

I have always had a mixture. Certain numbers are too slow to evaluate on the fly in a grid situation, so for some fields I have triggers updating them. That way the numbers can't go wrong.

I normalise the database as much as I can. Keeping duplicate values in sync can be a pain and bite you later on. DB servers are very powerful nowadays. If you want a more comfortable experience when searching for data, build some views that combine the data from the different tables for your most common queries. That way you don't need to write all those joins again and again.

Salut,
Mathias


I must have denormalised a schema to some degree at some point but it’s not something I find myself doing often. To be honest I’d be further normalising that PackageVersion table to reduce that explosion of 6500 records.

You can always use views to simplify your SQL queries if the joins are getting you down.

As for performance, perhaps the databases I’ve worked on haven’t been big enough but I’ve never felt the need to denormalise to increase performance. I’m sure someone will correct me but I wonder if that’s still a valid reason to denormalise these days.

My opinion is that you should only have denormalised (duplicate) data for performance. In general that means tables with more than a million rows; below that, queries against well-indexed databases on decent servers are near instant, no matter how many joins you have. I use views almost exclusively to read complex data, and views can reference other views. SQL Server does a good job of optimising no matter how deep you go (I might have 5 levels for a really complex view). I work with databases up to 1 TB in size, sometimes with more than 100 million rows in a table, and I'm astounded by what a good database server can do with large data and a decent amount of RAM (64-128 GB).

Thanks all - normalized it is then - and yes using views will help - that’s something we do a lot of in Continua’s db.

Good decision in my view.

I have used normalisation for the past 30-odd years, not because it matters for my small systems but because it just makes sense to me and also improves speed and accuracy. I purchased the Australia Post postcode table and only store a pointer to the postcode record - you cannot mistype a suburb. Views in MS SQL appear to be instant even when accessing the server over the LAN with 13 terminals.

I have dabbled with Postgresql, which runs some of the largest DBs in the world (so I am told). I don't see why it shouldn't perform as fast as other mainstream systems.

I have fully normalised my schema - it can be a pain to work with but I’m getting there - brushing up on my sql skills again!

For those wondering about postgresql - it’s very good. We find it performs better than sql server, uses less resources (memory in particular) and we don’t get anywhere near the issues we do with sql server (deadlocks etc).

One downside is it’s a bit more complicated to deal with collations - something sql server does really well.
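For example, getting sql server style case-insensitive comparisons means defining a collation yourself - roughly like this (just a sketch; needs postgresql 12+ built with ICU, and the table/column names are purely illustrative):

-- An ICU collation that compares case-insensitively.
CREATE COLLATION ci (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

CREATE TABLE collation_demo (
    name text COLLATE ci
);

INSERT INTO collation_demo (name) VALUES ('MyPackage');

-- matches despite the different casing
SELECT * FROM collation_demo WHERE name = 'mypackage';

-- note: non-deterministic collations come with restrictions, e.g. LIKE and
-- other pattern matching won't work against the "name" column above.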

Having used SQL Server, you really need to understand how it does locking, where you can use “nolock”, etc. This has improved with each version. Fine-tuning a highly concurrent database takes a lot of experience, and this is one particular area where you need specific experience with that particular database server (generic SQL suffices for most queries). I’m sure I could get your database to run without any deadlocks using SQL Server, but I would be totally out of my depth with Postgresql :wink:

Normalise, but contemplate denormalising summaries.

I had a huge application that tracked subcontractor work orders. On any one docket, we could have multiple subcontractors working on it, and each subcontractor would have all their charge items.

The tables were:

  • Docket
  • Docket_Subcontractor (allowing a many to many relationship between dockets and subcontractors)
  • Docket_Item
  • Subcontractor (name, contact details etc)
  • Item (standard list of SKUs, base price etc)

From a reporting perspective, we could report all the way down to the item level, but more often than not we simply wanted the subcontractor total (to pay them) or the docket total (to charge our customer).

I had buy_price and sell_price on the docket, docket_subcontractor and docket_item tables. There was a trigger on the docket_item table that automatically updated the docket_item buy and sell price, and then updated the docket_subcontractor buy and sell price. A trigger on the docket_subcontractor then updated the docket.

The benefit was a huge increase in reporting speeds and vastly simpler SQL.
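To sketch the idea in PostgreSQL-flavoured SQL (simplified and illustrative - the real system had more columns and the actual syntax may have differed):

-- Sketch only: roll docket_item buy/sell prices up into docket_subcontractor.
-- A similar trigger on docket_subcontractor then rolls the totals up into docket.
CREATE OR REPLACE FUNCTION docket_item_rollup() RETURNS trigger AS $$
DECLARE
    v_ds_id bigint;
BEGIN
    IF TG_OP = 'DELETE' THEN
        v_ds_id := OLD.docket_subcontractor_id;
    ELSE
        v_ds_id := NEW.docket_subcontractor_id;
    END IF;

    -- recompute the totals for the affected subcontractor row
    -- (an UPDATE that moves an item between subcontractors would need
    -- both the old and new rows recomputed)
    UPDATE docket_subcontractor ds
       SET buy_price  = (SELECT COALESCE(SUM(di.buy_price), 0)
                           FROM docket_item di
                          WHERE di.docket_subcontractor_id = ds.id),
           sell_price = (SELECT COALESCE(SUM(di.sell_price), 0)
                           FROM docket_item di
                          WHERE di.docket_subcontractor_id = ds.id)
     WHERE ds.id = v_ds_id;

    RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_docket_item_rollup
AFTER INSERT OR UPDATE OR DELETE ON docket_item
FOR EACH ROW EXECUTE FUNCTION docket_item_rollup();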

Given that you are mostly reading and not updating, denormalising the sum/count fields may be beneficial.

The other cool trick we used triggers for was the Active flag where we had a tree hierarchy. We needed to be able to deactivate an asset and then have all of its child assets deactivated as well.

We ended up with two fields. The first was the Active Flag and the second was the IsActive flag.

The trigger on the table was:
if Active then
begin
  if asset has parent then
    IsActive = parent.IsActive
  else
    IsActive = true
end
else
  IsActive = false

update assets set IsActive = this asset.IsActive from assets join assets on hierarchy stuff where you want the children assets and I couldn’t be bothered writing the SQL here

And that would propagate the change down the tree.

We would control the Active flag but check the IsActive flag.

It is also possible to make the Active field nullable and then use it to force the flag on or off explicitly.
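For concreteness, here is a rough PostgreSQL-flavoured sketch of that propagation (the actual SQL wasn't shown above, so the asset table and its id, parent_id, active and is_active columns are illustrative assumptions, not the real schema):

-- Sketch only: recompute the effective IsActive flag for an asset and
-- everything below it after its Active flag changes.
WITH RECURSIVE subtree AS (
    -- the asset whose Active flag was just changed
    SELECT id,
           CASE
               WHEN NOT active THEN false
               WHEN parent_id IS NULL THEN true
               ELSE (SELECT p.is_active FROM asset p WHERE p.id = asset.parent_id)
           END AS effective
      FROM asset
     WHERE id = :changed_asset_id   -- parameter placeholder
    UNION ALL
    -- children are effectively active only if they are active themselves
    -- AND their parent is effectively active
    SELECT a.id, (a.active AND s.effective)
      FROM asset a
      JOIN subtree s ON a.parent_id = s.id
)
UPDATE asset a
   SET is_active = subtree.effective
  FROM subtree
 WHERE a.id = subtree.id;

-- Run from an AFTER trigger on asset (or from application code) whenever
-- the Active flag changes.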

@mishac the deadlock challenge comes from using an ORM (NHibernate) and not having control over the SQL - we rewrote some parts using Dapper (a thin wrapper over IDbConnection) to work around the deadlocks (and NHibernate performance issues).

Needless to say, I would never do another project with NHibernate (or Entity Framework) - they get you 80% there in 20% of the time, but you pay for that in spades afterwards with a ton of obscure issues.

We chose postgresql back in 2012 - we evaluated a bunch of database engines at the time (the requirement being free or free to distribute) including firebird, and postgresql won out easily on all fronts. It has taken some getting used to, especially the case sensitivity of objects and the default case-sensitive collation. I’m still learning things about it.

I have 3 main tables (there are plenty of others but these are the ones users will be querying)

Package → PackageTargetPlatform → PackageVersion

Most of the work happens returning data from the PackageVersion table, so I have queries like this (to find a package version by packageid, compiler version, platform and version):

select pv.* from package_version pv
inner join package_targetplatform tp
on tp.id = pv.targetplatform_id
inner join package p
on p.id = tp.package_id
where p.packageid = @packageId
and tp.compiler_version = @compilerVersion
and tp.platform = @platform
and pv.version = @version

Not too bad and performance is ok given that the tables don’t have much data right now. I will probably leave it as is for now and focus on other parts of the application and look at optimising it if/when needed - there’s still plenty of other work to complete!
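If the joins do start to grate, a view over that join is straightforward - something like this (a sketch; the real view would list whatever metadata columns the search needs):

-- Sketch: wrap the three-table join in a view so searches hit one relation.
CREATE VIEW package_version_search AS
SELECT p.packageid,
       tp.compiler_version,
       tp.platform,
       pv.id AS package_version_id,
       pv.version
       -- ...plus whatever metadata columns the search needs
  FROM package_version pv
  JOIN package_targetplatform tp ON tp.id = pv.targetplatform_id
  JOIN package p ON p.id = tp.package_id;

-- The earlier query then becomes (same parameter style as above):
SELECT *
  FROM package_version_search
 WHERE packageid = @packageId
   AND compiler_version = @compilerVersion
   AND platform = @platform
   AND version = @version;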

Good to know. Thanks

I can’t see how denormalising this would have any benefit.

I’m not sure how you would. The package will have multiple platforms, so package_targetplatform is, at a guess, giving you a many to many relationship.

You are right - I looked at this and it's not going to make it any quicker; in fact it would cause index sizes to balloon (complex indexes in postgresql can be quite large).

The simpler the code, the better. But I could be wrong :slight_smile: