After my first small paternity
leave from about two weeks I got a direct tweed asking for a sql
script which is cleaning all unnecessary versions from all nodes at
once. Unlike to the SQL Script posted
before which deletes all versions from a given nodeId (usefull
if you have a massive growing database with thousand of versions
for one node) this script will delete all unnecessary versions from
all nodes at once. As usual this is a works on my machine post.
Please make a backup from your database and report
misbehaviour.
At first the script will create a temporary table and inserts all nodeId's from nodes which are documents (remember, not only documents have versions, also members and medias can have versions) and which are published and not in the recycle bin.
-- Create a temporary table for all documents which are published and not in the recycle bin
CREATE TABLE #Nodes (id int)
-- Delete all rows if the table exists before
TRUNCATE TABLE #Nodes
-- Insert all nodeIds from all documents which are published and not in the recycle bin
INSERT INTO #Nodes
SELECT N.id
FROM umbracoNode N
INNER JOIN cmsDocument D ON N.ID = D.NodeId
WHERE nodeObjectType = 'C66BA18E-EAF3-4CFF-8A22-41B16D66A972'
AND [path] NOT LIKE '%-20%'
AND D.Published = 1
After that it creates another temporary table and inserts all versions from the nodeIds in the first temporary table which not are published an which not are the newest ones.
-- Create a temporary table for all versionId's to delete
CREATE TABLE #Versions (id UniqueIdentifier)
-- Delete all rows if it exists before
TRUNCATE TABLE #Versions
-- Insert all versionId's from all nodeIds in the #Nodes table
-- and where published is set to false and newest is set to false
INSERT INTO #Versions
SELECT versionId
FROM cmsDocument
WHERE nodeId IN (SELECT id FROM #Nodes)
AND published = 0 AND newest = 0
Now it will delete all rows from the cmsPropertyData, cmsContentVersion and cmsDocument where the versionId's are in the second temporary table.
-- DELETE all versions from cmsPropertyData, cmsContentVersion, cmsDocument -- from the nodes which are published and which are not in the recycle bin -- and which are not published and which are not the newest DELETE FROM cmsPropertyData WHERE VersionId IN (SELECT id FROM #Versions) DELETE FROM cmsContentVersion WHERE VersionId IN (SELECT id FROM #Versions) DELETE FROM cmsDocument WHERE VersionId IN (SELECT id FROM #Versions)
Unlike the script in the post Cleaning versions in the umbraco db for given nodes via SQL this script doesn't delete the indexes. If the script runs to long you are free to add these step to this script. On the other hand this script can be used as a regularly cleanup.
Download the complete script here and give me your feedback.
In this thread I want to explain how to clean the umbraco database from unnecessary versions to reduce the database size. Again I got some mails pointing me to help them to clean the database from millions of versions some nodes have. In one of the last mails Jennifer mentioned that she has 4 nodes which have each more than 900.000. This results in a size for the cmsPropertyData table of more than 1.7 GB.
Before we go on I have to mention again that this
post is a "Works on my machine" post.
Please be sure what you do and please, please make a backup of your
database first. Also please read my post Understanding the
umbraco database first to get a glue of what you are doing.
If you only have a few nodes which have millions of versions you can clean them easily via SQL, far from it it is the easiest way. In the next paragraphs I will show how you can do it by yourself.
First of all identify the nodes which have the various versions and their ids. As in my first post described it can be done with the following SQL:
SELECT nodeId, COUNT(*) Number FROM cmsDocument GROUP BY nodeId HAVING COUNT(*) > 2 ORDER BY Number DESC
Just mark the ids down.
Each time umbraco saves or publishes a node a new version is created in the database. But for each published version there is always a newer version in the database. But why? Umbraco is by concept working with a new version (a copy of the published version, or a copy of the last saved version) and not the one which is published. So it prevents the user to break the published version. So if you publish a node the actual version you are working on is saved as the published one, the previously published version is marked as not published, a new version is created and send back to the umbraco user to let him work with this. Each published node has a published and a newest row in cmsDocument:
For each id (node) you now have to identify which version is the newest and which version is the published one. I added some variables via DECLARE for typing horror:
DECLARE @id int SET @id = 3243 DECLARE @newest UNIQUEIDENTIFIER SET @newest = (SELECT versionId FROM cmsDocument WHERE nodeId = @id AND newest = 1) DECLARE @published UNIQUEIDENTIFIER SET @published = (SELECT versionId FROM cmsDocument WHERE nodeId = @id AND published = 1)
The cmsPropertyData table has four indexes for better reading performance. For deleting it slows down and let the SQL Server log file grow heavily. So if you have the possibility you can delete the indexes first like this:
IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[cmsPropertyData]') AND name = N'IX_cmsPropertyData') DROP INDEX [IX_cmsPropertyData] ON [dbo].[cmsPropertyData] WITH ( ONLINE = OFF ) IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[cmsPropertyData]') AND name = N'IX_cmsPropertyData_1') DROP INDEX [IX_cmsPropertyData_1] ON [dbo].[cmsPropertyData] WITH ( ONLINE = OFF ) IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[cmsPropertyData]') AND name = N'IX_cmsPropertyData_2') DROP INDEX [IX_cmsPropertyData_2] ON [dbo].[cmsPropertyData] WITH ( ONLINE = OFF ) IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[cmsPropertyData]') AND name = N'IX_cmsPropertyData_3') DROP INDEX [IX_cmsPropertyData_3] ON [dbo].[cmsPropertyData] WITH ( ONLINE = OFF )
You can create these entry's by a rightclick on the index in SQL Server Management Studio.
Now you can delete the not needed versions via
DELETE FROM cmsPropertyData WHERE contentNodeId = @id AND versionId NOT IN (@newest, @published) DELETE FROM cmsContentVersion WHERE ContentId = @id AND VersionId NOT IN (@newest, @published) DELETE FROM cmsDocument WHERE nodeId = @id AND versionId NOT IN (@newest, @published)
Be aware that these scripts can last a very long time thus you deleted the indexes (without deleting the indexes it will last hours ;-))
After deleting all unneeded versions you can now recreate the indexes:
CREATE NONCLUSTERED INDEX [IX_cmsPropertyData] ON [dbo].[cmsPropertyData]
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_cmsPropertyData_1] ON [dbo].[cmsPropertyData]
(
[contentNodeId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_cmsPropertyData_2] ON [dbo].[cmsPropertyData]
(
[versionId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_cmsPropertyData_3] ON [dbo].[cmsPropertyData]
(
[propertytypeid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
After completing the cleaning you now can reorganize or shrink your database files.
The complete SQL script can be downloaded here.
I hope I could show how various versions easily can be deleted via SQL if you have a growing database like in this umbraco forum post described. As mentioned before please be careful with your database and make a backup first. If you have any suggestions just write a comment or get in contact with me via mail (th |at| thoehler |dot| com).
In the last weeks I got many mails asking for help regarding a growing umbraco database. Two months ago I helped a friend cleaning his database which was grown up to 8 GB. But be aware: working in the database is not best practice. We don't know exactly the internal relations how umbraco handles the entries in the database. We don't know what effects a direct change in the database has.
Regarding this I have to say: this
post is a "Works on my machine" post.
Please be sure what you do and please, please make a backup of your
database first.
�
Ok, one of the main things you have to know regarding the umbraco database is the umbracoNode table. This table is the main table where all things in umbraco, regardless of which type or meaning, are stored. This means that there are content nodes, members, document types, property types, and all other things in this table. How does umbraco now identify what kind of type this row is? Is it a content node or a member or a document type? The answer is stored in the column nodeObjectType. Each type has its own GUID. The GUID is defined in the code of Umbraco. For example: a document has the GUID "c66ba18e-eaf3-4cff-8a22-41b16d66a972". It is defined in the umbraco.cms.businesslogic.web.Document class:
namespace umbraco.cms.businesslogic.web
{
/// <summary>
/// Document represents a webpage,
/// type (umbraco.cms.businesslogic.web.DocumentType)
///
/// Pubished Documents are exposed to the runtime/the public website in a cached xml document.
/// </summary>
public class Document : Content
{
public static Guid _objectType = new Guid("c66ba18e-eaf3-4cff-8a22-41b16d66a972");
...
After knowing what the umbracoNode table is we can try to identify which tables are related to documents and versions. We have to look at the following tables:
You can see all these tables and their database relations here:
Coming back to the growing databases. You can now identify which nodes have massive publishing issues by the number of versions the nodes have. For example you can run the following SQL:
SELECT nodeId, COUNT(*) Number FROM cmsDocument GROUP BY nodeId HAVING COUNT(*) > 2 ORDER BY Number DESC
This SQL gives you all nodeIds which have more than two versions (a published node has at minimum two versions). Check the nodes with massive numbers, unpublish it or delete it manually and the issue should stop.
I hope you got a glue on how umbraco acts in the database. If you have massive number of versions you now can identify the related nodes. In the next post I will explain how to delete the old versions manually or by my ClientTools.
Thomas
09.07.10 20:02: Aaaaadler ;-)
09.07.10 20:01: Einmal Bier holen = Tor verpasst => Der Weg in den Keller ist eindeutig zu lang...
09.04.10 06:04: @Shazwazza Done ;-) #ucomponents #Umbraco
09.03.10 20:43: @netaddicts LOL
09.03.10 20:43: @asp_net wie war das bei der 2010 Roadshow, Zitat: 'google das mal mit Bing'
![]()
This blog is written by Thomas Höhler. Living next to Frankfurt, Germany,
I am trying to share my experiences about
Umbraco [the most flexible CMS I know],
my ClientTools for Umbraco,
and some other more or less usefull stuff