There is always a threat from somewhere, and old-style DBAs are standing on shaky ground.
Five years ago I remember defending SQL Server against the domineering Oracle DBAs. The argument was always about who is better. Management Studio was a great weapon against sqlplus, and you could always pull out the "DDL is not part of the transaction" card. With no Oracle DBAs around, I must admit that Oracle is a pretty amazing product, although it requires a great deal of tuning.
Then came the era of NoSQL. It's not easy to compete against such fast loads, easy scale-out and rapid development. NoSQL solutions support use cases that do not really fit the relational model, like graphs or unstructured data. On the other hand, eventual consistency does not suit everyone, joins between entities are hard, and many NoSQL products have stability problems. Twitter apparently has been trying to move from MySQL to Cassandra for over a year. In August 2011 Foursquare reported an 11-hour downtime caused by MongoDB.
A few years ago a new trend/threat arrived in our lands: analytical databases.
There are quite a few rather expensive products positioning themselves as a revolutionary technology that can easily handle overwhelming amounts of Big Data querying and analytical processing: self-tuning, with no need for query tuning, prepared aggregations or pre-calculations. There is no locking or logging overhead, so data loading into the analytical DBMS is declared to be extremely fast. On the other hand, these DBMSs admit that they are not built for OLTP loads; massive updates and deletes are their weak point.
The Data Warehouse team at my company has come to the decision that they need such an analytical database solution. The vision is a scalable, self-tuning database that will ingest a terabyte or two on a weekly basis and will allow any number of concurrent users to execute ad-hoc analytical queries over any period of time.
I strongly believe that, with the proper database design, SQL Server 2012 can provide good performance under a heavy analytical workload. However, I was really curious to understand how the analytical databases achieve their self-tuning performance magic. There is nothing better than the competitors to show us where we have weaknesses that need strengthening.
First of all, here is a great map of all the technologies from the Big Data puzzle http://www.bigdatamap.net/technologies/ .
In order to get a better understanding of the strong and weak points of the analytical databases world, I will drill down a little bit into the technologies behind the buzzwords.
There are four categories of products inside the analytical databases part of the Data Repositories world:
- In-Memory solutions (like SAP HANA, XtremeData, Kognitio etc.):
- When querying the DBMSs from this group, you are working primarily against tables in memory. These DBMSs are the fastest solutions in the analytical databases world. In-memory DBMSs rely primarily on memory optimization algorithms, which are much simpler than disk access. There are no locking/concurrency issues and queries are amazingly fast. With these systems you can expect sub-second results in a truly ad-hoc DWH workload with unpredictable queries coming through, sometimes six pages of code long. I asked whether there are partitions on the tables and the Kognitio representative was quite surprised. "What for?" he asked me. "Well, how do you delete the old data?" "You just write a DELETE statement and the data gets deleted," he answered. That's it. Just use a DELETE statement.
- There are several flavors of in-memory products, each requiring a different memory-to-data ratio. For instance, SAP HANA loads 100% of the database into memory, which makes this solution extremely expensive for databases of hundreds of terabytes. Kognitio, on the other hand, allows you to choose which data goes into memory, usually 20% - 40%, while the rest of the data stays in the persistence layer. For instance, if your queries usually access only the last month's data, you can keep a sliding window of this data in memory and, in case an unpredicted query comes in, the system will load the missing data off the disk. However, on each cluster node 60% of the memory will be used for data and the other 40% for user activity and temporary calculations. Summed up, you still get to quite a pricey solution.
- All products in this category save the data to the disk as well. This persistency layer can be their proprietary files, Hadoop or any other DBMS.
- One might think it will take time after the server downtime to get all data back into the memory. Kognitio states that it will take only 5 min for each 10 TB of data.
- Watch out for some limitations that come along with the in-memory architecture. Some products are not available during maintenance tasks, like backups or new node additions (XtremeData), and should be taken down. This does not suit systems that must be up 24/7.
- Disk-oriented solutions (Vertica, Greenplum, Netezza, SybaseIQ, VectorWise, Microsoft PDW, etc.)
- Disk-oriented products are very well optimized for working with data on disk. The main trend in this category is storing the data in a columnar way, which allows a high compression ratio. However, some products, like Netezza, are row-based. Microsoft PDW (V2, which is based on SQL 2012) allows both flavors of storage, row-based and columnar. Its release is actually in February 2013, so we'll see how it will compete with the others.
- Disk-oriented solutions usually do not assume that there is a lot of RAM.
- Since the world around us is row-based, all the columnar products load their data from row-based files. They inject the incoming data into some write-optimized row store, and a background process afterwards converts the inserted data into the read-optimized, sorted columnar storage.
- Because each column of a columnar table is stored in a separate file, the fewer columns a query requests, the faster it performs. Wider queries usually require some sort of manual optimization. Vertica, for instance, keeps several copies of the table, called projections. The projections, like covering indexes, contain columns grouped together or columns from different tables (for faster joins), nullifying some of the benefits of the compression. This is a hybrid columnar/row-based storage solution.
- Hybrid DBMS products ( ParAccel, Exasol, etc. )
- Those products are built on a disk-oriented technology, usually columnstores, but provide the best performance when given as much RAM as they can get. For instance, ParAccel submitted a TPC-H benchmark on a system with a 288% data/memory ratio.
- Daemons running on the Hadoop nodes (Cloudera Impala)
- A kind of HIVE alternative, these solutions run on the same physical cluster as Hadoop, creating a seamless environment. Kognitio is considering such an architecture as a future enhancement of their product.
- Cloudera Impala is in public beta and goes out sometime in 2013. According to some web forums, the fastest Impala queries take more than 1 second. Impala was tested and survived a load of 100 concurrent queries. Another test showed Impala running 10 cloned copies of a query with roughly 25% performance degradation.
If you are evaluating an analytical database solution for your Big Data, the features below can help you to understand better the different flavors of the products:
- Product scalability. Scalability is about whether the product is an MPP (massively parallel processing) system or a single-node system. MPP systems are usually built as a cluster of commodity servers where the table data is distributed across the cluster according to the administrator's choice, like some sort of partitioning. Cluster nodes combined into a single very powerful analytic machine have vast amounts of processing power, and new nodes can easily be added to the cluster when needed. Vertica, ParAccel, Exasol and Greenplum are MPP solutions. Sybase IQ and VectorWise are single-node systems, though VectorWise promises that an MPP version is coming out soon.
- Single point of failure. While exploring an MPP product, it's good to know whether it's a shared-nothing system or there is a main node (ParAccel, Netezza or Greenplum) that is in charge of producing query plans and processing the serial parts of the execution plans. For instance, a count distinct operation requires running the distinct sort again on the merged results. The main node might become a bottleneck in the future when the system load increases.
- Replication factor/data redundancy. Most MPP solutions replicate every subset of the data to an additional node for redundancy. In case any node goes down, there is always another copy of the data, which means that the cluster continues to work. The question is what kind of algorithm they use to store the second copy. Two-phase commit is too pricey because it slows down the inserts, so most products use an optimistic approach. Each one implements that approach differently, but the data is usually replicated by a background process. What happens if the node fails after the data was inserted and before it got replicated?
- Watch out for the limitations. VectorWise can support databases up to 30TB on a single node. Another common limitation of the columnar solutions is character data type length. SAP HANA has a 2 billion record limit per partition.
- Data Encryption. Most of the analytical databases that I have looked into did not have a data encryption feature.
- Row to columnar conversion. Querying freshly loaded data is definitely worth a check, especially on the column-storage solutions. Your queries will probably be merging data from the write-optimized row-based store and the column-based files.
- Data modifications. Vertica and Exasol lock the entire table during row modification operations (updates/deletes), blocking other modifying processes. I must note here that this does not impact read flows against that table.
- Management tool. All this can sound great until you realize that your main tool will be the command line, as in ParAccel. In such a case you will probably need to purchase some third-party management tool, like Toad. Make sure there is an ODBC/JDBC provider.
- Languages support. An analytical solution should usually support ANSI SQL, including the analytical extensions. Many products support other languages, like R, Java, MDX, etc.
- Development/QA environment. It's worth checking whether it is possible to restore the production multi-node database to a single-node server for development or testing purposes.
- Hadoop connectivity. Many products have implemented this quite useful feature, called external tables. This is very tight product integration, making Hadoop data visible to user queries without pre-loading it. For some products, though, the integration is not so tight: under the hood they simply connect to Hadoop using Hive, and performance is slow in such cases. Kognitio, on the other hand, has an agent installed on the Hadoop cluster nodes and thus can "push down the predicates", asking Hadoop to filter the data prior to loading it.
- Number of customers. If there are only 10 customers behind the revolutionary technology, would you consider paying a huge amount of money for it?
Machine-generated data grows at an amazing speed, and therefore the requirements for DBMS products are changing. SQL Server is evolving as well, with the new columnstore index feature in SQL 2012. In the next version, more features from the analytical databases world will be implemented, like in-memory technology (Hekaton), a Hadoop external tables feature (Polybase) and truly updatable columnar storage implemented via a new column-based clustered index.
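For a flavor of what such an external-tables integration looks like in SQL Server terms, here is a hedged sketch based on the Polybase-style DDL; the exact syntax belongs to a later SQL Server release, and all object names, columns and the Hadoop URI are illustrative assumptions:

```sql
-- Hypothetical Polybase-style setup: expose a Hadoop file as a queryable table.
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://<namenode>:8020');

CREATE EXTERNAL FILE FORMAT TabDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '\t'));

CREATE EXTERNAL TABLE dbo.WebLogs
(
    LogDate    DATETIME2,
    Url        VARCHAR(400),
    DurationMs INT
)
WITH (LOCATION = '/logs/weblogs/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TabDelimited);

-- From here on, user queries can read Hadoop data without pre-loading it:
SELECT Url, AVG(DurationMs) AS AvgDuration
FROM dbo.WebLogs
GROUP BY Url;
```

This is the "external tables" idea from the feature list above, expressed in SQL Server syntax rather than any one vendor's dialect.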
Any software solution has its strengths, and there is no fit-for-everyone solution. There are many shades of grey besides black and white.
We are starting a POC of several analytical DBMSs. I will surely test the same load on SQL 2012. I accept the architectural limitation that it's not scalable. I also understand that until Hekaton comes to town, SQL Server cannot compete with the in-memory products. However, I cannot wait to compare the performance of the columnstore index against the other columnstore products.
Are there any performance benchmarks between a columnar non-MPP product and SQL 2012 that you were a part of?
… Four tortoises were playing in the backyard when they decided they needed hibiscus flower snacks. They pooled their money and sent the smallest tortoise out to fetch the snacks.
Two days passed and there was no sign of the tortoise.
"You know, she is taking a lot of time," said one of the tortoises.
A little voice from just outside the fence said, "If you are going to talk that way about me, I won't go."
Is it too much to ask of a quite expensive 3rd party backup tool to be way faster than the SQL Server native backup?
Or at least to save a respectable amount of storage by producing really smaller backup files? By saying "really smaller", I mean getting a file of at least half the size.
After Googling around in an attempt to understand what other "SQL people" are using for database backups, I see that most people use one of three tools, which are the main players in the SQL backup area:
- LiteSpeed by Quest
- SQL Backup by Red Gate
- SQL Safe by Idera
The feedback about those tools is truly emotional and happy. However, while reading the forums and blogs I wondered whether many people are simply accustomed to using the above tools since SQL 2000 and 2005.
This can easily be understood: a 300GB database backup, for instance, using a regular SQL 2005 backup statement, would run for about 3 hours and produce a ~150GB file (depending on the content, of course).
Then you take a 3rd party tool which performs the same backup in 30 minutes, resulting in a 30GB file and leaving you speechless; you run to management persuading them to buy it because it is definitely worth the price.
In addition to the increased speed and disk space savings, you would also get backup file encryption and virtual restore, features that are still missing from SQL Server.
But if, like me, you don't need these additional features and only want a tool that performs a full backup MUCH faster AND produces a far smaller backup file (like the gain you observed back in the SQL 2005 days), you will be quite disappointed. The SQL Server backup compression feature has totally changed the market picture.
Medium size database.
Take a look at the table below, check out how my SQL server 2008 R2 compares to other tools when backing up a 300GB database.
It appears that, when talking about backup speed, SQL 2008 R2 compresses and performs the backup in a similar overall time to all three other tools. The 3rd party tools' maximum compression level takes twice as long.
The backup file gain is not that impressive, except at the highest compression levels, but the price you pay is a very high CPU load and a much longer run time. Only SQL Safe by Idera was quite fast at its maximum compression level, but for most of the run time it used 95% CPU on the server.
Note that I have used two types of destination storage, SATA 11 disks and FC 53 disks, and, obviously, on the faster storage the backup was ready in half the time.
Looking at the above results, should we spend money and bother with another layer of complexity and a software middle-man for medium-sized databases? I'm definitely not going to.
Very large database
As a next phase of this benchmark, I have moved to a 6 terabyte database which was actually my main backup target.
Note how using multiple files enables the SQL Server backup operation to use parallel I/O and remarkably increases its speed, especially when the backup device is heavily striped. SQL Server supports a maximum of 64 backup devices for a single backup operation, but the most speed is gained when using one file per CPU core; in the case above, 8 files for a server with 2 quad-core CPUs. The impact of additional files is minimal.
However, SQLsafe doesn’t show any speed improvement between 4 files and 8 files.
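The striped native backup described above looks like this; server and share names are placeholders matching my setup of two destination volumes with four files each:

```sql
-- Striped native backup: one file per CPU core spreads the I/O
-- across both destination volumes.
BACKUP DATABASE BigDB
TO  DISK = '\\storage1\backups\BigDB_1.bak',
    DISK = '\\storage1\backups\BigDB_2.bak',
    DISK = '\\storage1\backups\BigDB_3.bak',
    DISK = '\\storage1\backups\BigDB_4.bak',
    DISK = '\\storage2\backups\BigDB_5.bak',
    DISK = '\\storage2\backups\BigDB_6.bak',
    DISK = '\\storage2\backups\BigDB_7.bak',
    DISK = '\\storage2\backups\BigDB_8.bak'
WITH COMPRESSION, STATS = 5;
```

To restore, all eight files must be supplied to the RESTORE statement in the same way.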
Of course, with such huge databases every half percent of compression turns into noticeable numbers. Saving almost 470GB of space may turn the backup tool into quite a valuable purchase. Still, the backup speed and high CPU usage are variables that should be taken into consideration.
As for us, the backup speed is more critical than the storage, and we cannot allow a production server to sustain 95% CPU for such a long time.
Bottom line, 3rd party backup tool developers: we are waiting for a breakthrough release.
There are a few unanswered questions, like the restore speed comparison between different tools and the impact of multiple backup files on restore operation. Stay tuned for the next benchmarks.
- SQL Server 2008 R2 sp1
- 2 Quad CPU
- Database location: NetApp FC 15K Aggregate 53 discs
- Backup destination volumes: two physical NetApps FC 15K Aggregate 53 discs, 4 files on each volume.
No matter how good the UI is, we need to run the backup tasks from inside SQL Server Agent to make sure they are covered by our monitoring systems. I have used extended stored procedures (command-line execution is also an option; I haven't noticed any impact on backup performance).
-- Native SQL Server backup:
backup database <DBNAME> to
disk = '\\<networkpath>\par1.bak'
with format, compression

-- LiteSpeed by Quest (parameters as used; the invoking procedure call was truncated):
@database = N'<DBName>',
@backupname = N'<DBName> full backup',
@desc = N'Test',
@init = 1

-- SQL Backup by Red Gate (command string passed to its backup procedure; truncated):
'-SQL "BACKUP DATABASE <DBNAME>
TO DISK= ''\\<networkpath>\par1.sqb'',
DISKRETRYINTERVAL = 30,
DISKRETRYCOUNT = 10,
COMPRESSION = 4,

-- SQL Safe by Idera (parameters as used; the invoking procedure call was truncated):
@database = 'UCMSDB',
@filename = '\\<networkpath>\par1.bak',
@backuptype = 'Full',
@compressionlevel = 4,
@backupfile = '\\<networkpath>\par2.bak',
@backupfile = '\\<networkpath>\par3.bak'
If you still insist on using 3rd party tools for the backups in your production environment with the maximum compression level, you will definitely need to consider limiting CPU usage, which will increase the backup operation time even more:
- Red Gate: use the THREADPRIORITY option (values 0 - 6)
- LiteSpeed: use @throttle (a percentage, like 70)
- SQL Safe: the only thing I have found was the @Threads option.
The TRY/CATCH construct is a very useful error handling mechanism, widely used to catch all execution errors that do not close the database connection. Its biggest downside is that, in the case of multiple errors, the TRY/CATCH mechanism will only catch the last one.
An example of this can be seen during a standard restore operation. In this example I attempt to perform a restore from a file that no longer exists, and two errors are fired: 3201 and 3013.
Assuming that we are using the TRY/CATCH construct, the ERROR_MESSAGE() function will return the last message only:
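A minimal repro of this behavior (database name and file path are placeholders):

```sql
-- Restore from a non-existent file: errors 3201 and 3013 are raised,
-- but the CATCH block only sees the last one.
BEGIN TRY
    RESTORE DATABASE TestDB
    FROM DISK = 'C:\Backups\no_such_file.bak';
END TRY
BEGIN CATCH
    SELECT ERROR_NUMBER()  AS ErrNum,     -- 3013, not 3201
           ERROR_MESSAGE() AS ErrMessage;
END CATCH
```

The 3201 message, which names the missing file and is the one you actually want, is lost inside the CATCH block.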
To work around this problem you can prepare a temporary table that will receive the statement output. Execute the statement through the xp_cmdshell stored procedure, connecting back to the SQL Server with the command-line utility sqlcmd and redirecting its output into the previously created temp table.
After receiving the output, you will need to parse it to understand whether the statement finished successfully or failed. That's quite easy to accomplish as long as you know which statement was executed. In the case of generic executions you can query the output table and search for patterns like "Msg%Level%State%", which are usually part of the error message.
Furthermore, you don't need TRY/CATCH in the above workaround, since the xp_cmdshell procedure always finishes successfully, and you can decide whether to fire a RAISERROR statement or not.
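A minimal sketch of this workaround; the server name, trusted-connection flag and file path are assumptions to adapt to your environment:

```sql
-- Capture the FULL error output of a statement by round-tripping through sqlcmd.
CREATE TABLE #CmdOutput (line NVARCHAR(4000) NULL);

INSERT INTO #CmdOutput (line)
EXEC master..xp_cmdshell
    'sqlcmd -S MYSERVER -E -Q "RESTORE DATABASE TestDB FROM DISK = ''C:\Backups\no_such_file.bak''"';

-- Both messages (3201 and 3013) are now captured and can be parsed.
IF EXISTS (SELECT 1 FROM #CmdOutput WHERE line LIKE '%Msg%Level%State%')
    RAISERROR ('Restore failed; see the captured output for the full error text.', 16, 1);

DROP TABLE #CmdOutput;
```

Since xp_cmdshell itself succeeds either way, the RAISERROR at the end is entirely under your control.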
- Click on me. Choose me. - asked one forgotten feature when some bored DBA was purposelessly wandering through the Management Studio menu at the end of her long and busy working day.
- Why would I use you? I have heard of no one who does. What are you for? - wondered the perplexed, aged and wise DBA. At least that DBA thought she was aged and wise, though each day tried to prove to her that she wasn't.
- I know you. You are quite lazy. Why would you do additional clicks to move from window to window? From Tool to tool ? This is irritating, isn't it? I can run windows system commands, sql statements and much more from the same script, from the same query window!
- I have all my tools that I‘m used to, I have Management Studio, Cmd, Powershell. They can do anything for me. I don’t need additional tools.
- I promise you, you will like me. – the thing continued to whine .
- All right, show me. – she gave up. It’s always this way, she thought sadly, - easier to agree than to explain why you don’t want.
- Enable me and then think about anything that you always couldn’t do through the management studio and had to use other tools.
- Ok. Google for me the list of greatest features of SQL SERVER 2012.
- Well... I’m not sure... Think about something else.
- Ok, here is something easy for you. I want to check if file folder exists or if file is there. Though, I can easily do this using xp_cmdshell …
- This is easy for me. – rejoiced the feature.
By the way, having the items of the menu talking to you usually means you should stop working and go home. Or drink coffee. Or both. Well, aged and wise dba wasn’t thinking about the weirdness of the situation at that moment.
- After enabling me, - said the unfairly forgotten feature (it thought of itself in such a manner) - after enabling me you can use all the command line commands in the same Management Studio query window by adding two exclamation marks (!!) at the beginning of the script line to denote that you want to run a cmd command:
- Just keep in mind that when using this feature, you are actually running the commands ON YOUR computer and not on the SQL Server that the query window is connected to. This is the main difference from xp_cmdshell, which executes commands on the SQL Server itself. Bottom line: use UNC paths instead of local paths.
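A few examples of what such script lines look like in a query window with SQLCMD Mode enabled (server and share names are placeholders; note the UNC paths, since these run on your workstation):

```sql
-- Each line prefixed with !! is passed to cmd.exe on YOUR machine.
!!dir \\myserver\backups
!!ping myserver
!!md \\myserver\backups\archive
```

Regular T-SQL statements can sit between these lines in the same window.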
- Look, there is much more than that. - The SQLCMD feature was getting excited. - You can get the IP of your servers, create, rename and drop folders. You can see the contents of any file anywhere and even start different tools from the same query window:
The not-so-aged-and-wise DBA was getting interested: - I also want to run different scripts on different servers without changing the connection of the query window.
- Sure, sure! Another great feature that SQLCMD mode provides us with, giving more power to querying. Use ":" for the additional commands, like :connect, which allows you to change the connection:
- Now imagine, you have one script where you have all your changes, like creating staging table on the DWH staging server, adding fact table to DWH itself and updating stored procedures in the server where reporting database is located.
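A sketch of such a one-script, three-server change set in SQLCMD mode; the server, table and procedure names are invented for illustration:

```sql
-- One script, three servers: :connect switches the connection mid-script.
:connect DWHStagingServer
CREATE TABLE dbo.StageSales (SaleID INT, Amount MONEY);
GO

:connect DWHServer
CREATE TABLE dbo.FactSales (SaleID INT, Amount MONEY, LoadDate DATETIME);
GO

:connect ReportingServer
EXEC dbo.RefreshReportProcedures;
GO
```

Each batch after a :connect runs against that server until the next :connect.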
- Now, give me more challenges!
- Script out a list of stored procedures into the text files.
- You can do it easily by using the :out command, which writes the query results into the specified text file. The output can be the code of a stored procedure or any data. Actually, this is the same as switching the query output to a file instead of the grid.
- Now, take all of the scripts and run all of them, one by one, on the different server.
- Come on... I’m sure that you can not...
- Why not? Naturally, I can do it using the :r command, which opens a script and executes it. Look, I can also use the :setvar command to define an environment variable in SQLCMD mode. Just note that you have to leave an empty line between :r commands, otherwise it doesn't work, although I have no idea why.
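A sketch of :setvar, :out and :r working together; the paths, server and procedure names are placeholders, and note the blank line between the two :r commands:

```sql
-- :setvar defines a variable, :out redirects results to a file,
-- :r reads and executes a script file.
:setvar ScriptDir "C:\Scripts"
:out $(ScriptDir)\usp_GetOrders.txt
EXEC sp_helptext 'dbo.usp_GetOrders';
GO

:connect AnotherServer
:r $(ScriptDir)\usp_GetOrders.sql

:r $(ScriptDir)\usp_GetCustomers.sql
GO
```

The first half scripts a procedure out to a text file; the second half replays saved scripts against a different server.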
- Wow.- She was really impressed. - Ok, I’ll go to try all those…
- Wait, wait! I know how to Google the SQL Server features for you! This example will open the Chrome browser with search results for "SQL Server 2012 top features" (change the path to suit your PC):
"Well, this can probably be useful stuff; maybe this feature really is unfairly forgotten," thought the DBA while walking through the dark, empty parking lot to her lonely car. "As someone really wise once said: It is what we think we know that keeps us from learning. Learn, unlearn and relearn."
There are about 200 QA and DEV SQL Servers out there.
There is a maintenance plan on many of them that performs a backup of all databases and removes the backup history files.
First of all, I must admit that I’m no big fan of maintenance plans in particular or the SSIS packages in general.
In this specific case, if I ever need to change anything in the way backup is performed, such as the compression feature or perform some other change, I have to open each plan one by one. This is quite a pain.
Therefore, I have decided to replace the maintenance plans with a stored procedure that will perform exactly the same thing. Having such a procedure will allow me to open multiple server connections and just execute an ALTER PROCEDURE whenever I need to change anything in it. There is nothing like good ole T-SQL.
The first challenge was to remove the unneeded maintenance plans. Of course, I didn't want to do it server by server. I found the procedure msdb.dbo.sp_maintplan_delete_plan, but its only parameter is the maintenance plan id; there are no other parameters, like a plan name, which would have been much more useful. Now I needed to find the table that holds all the maintenance plans on the server. You would think that it would be msdb.dbo.sysdbmaintplans but, unfortunately, regardless of the number of maintenance plans on the instance, it contains just one row.
After a while I found another table: msdb.dbo.sysmaintplan_subplans. It contains the plan id that I was looking for, in the plan_id column, as well as the agent job id that executes the plan's package:
That was all I needed and the rest turned out to be quite easy. Here is a script that can be executed against hundreds of servers from a multi-server query window to drop the specific maintenance plans.
DECLARE @PlanID uniqueidentifier

SELECT @PlanID = plan_id
FROM msdb.dbo.sysmaintplan_plans -- assumed source of the plan name; sysmaintplan_subplans maps plan_id to the agent job
WHERE name LIKE 'BackupPlan%'

EXECUTE msdb.dbo.sp_maintplan_delete_plan @plan_id = @PlanID
The second step was to create a procedure that will perform all of the old maintenance plan tasks: create a folder for each database, backup all databases on the server and clean up the old files. The script is below. Enjoy.
ALTER PROCEDURE BackupAllDatabases
	@PrintMode BIT = 1
AS
BEGIN
	DECLARE @BackupLocation VARCHAR(500)
	DECLARE @PurgeAfterDays INT
	DECLARE @PurgingDate VARCHAR(30)
	DECLARE @SQLCmd VARCHAR(MAX)
	DECLARE @FileName VARCHAR(100)

	SET @PurgeAfterDays = -14
	SET @BackupLocation = '\\central_storage_servername\BACKUPS\' + @@SERVERNAME
	SET @PurgingDate = CONVERT(VARCHAR(19), DATEADD(dd, @PurgeAfterDays, GETDATE()), 126)
	SET @FileName = '?_full_' + REPLACE(CONVERT(VARCHAR(19), GETDATE(), 126), ':', '-') + '.bak'

	SET @SQLCmd = '
	IF ''?'' <> ''tempdb'' BEGIN
		EXECUTE master.dbo.xp_create_subdir N''' + @BackupLocation + '\?\'' ;
		BACKUP DATABASE ? TO DISK = N''' + @BackupLocation + '\?\' + @FileName + '''
			WITH NOFORMAT, NOINIT, SKIP, REWIND, NOUNLOAD, COMPRESSION, STATS = 10 ;
		EXECUTE master.dbo.xp_delete_file 0, N''' + @BackupLocation + '\?\'', N''bak'', N''' + @PurgingDate + ''', 1;
	END'

	IF @PrintMode = 1 BEGIN
		PRINT @SQLCmd
	END

	EXEC sp_MSforeachdb @SQLCmd
END
There are those rare times, when your replication solution is working perfectly. No consistency errors, no errors at all.
Or so it seems.
Then, all of a sudden, you catch one application which is writing to the read-only subscriber. Next, a developer complains that some data is missing from the subscriber.
Similarly, after consistency errors have appeared and been solved, you always wonder whether the data is synchronized. In some specific cases, the publication can be dropped by mistake and re-created, with replication continuing without reporting any errors.
Is there a native way to make sure all data is synchronized between the publisher and the subscriber?
Validate subscriptions using Replication Monitor
If you open the Replication Monitor application and right click on your publication, you will notice the “Validate subscriptions…” feature.
Click on it, choose the subscription that requires validation and select the way validation will be performed.
These options are quite critical, since they directly impact the speed of the validation process. Here you can choose whether the tables themselves will be queried or the row count information will be extracted from the sysindexes system view.
The default option is to compare the numbers from sysindexes and compare the actual row counts only if differences were found.
You can also choose to compare checksums of the data. In this case, the validation process will take a long time.
The result of the comparison will appear in the distribution service details:
You will see all tables that were scanned by the validation process and the row counts.
If a table is out of sync, you will see an error like the one below.
What’s going on behind the scenes?
Here is the query that is being executed against each table in the database if you choose to compare actual count and the checksum:
SELECT count_big(*), sum (convert(numeric, binary_checksum(*) ) )
FROM <tablename> WITH (TABLOCK HOLDLOCK)
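If you want to run the same check manually on both sides, something along these lines works (the table name is a placeholder):

```sql
-- Run on the publisher and on the subscriber, then compare the two result rows.
-- TABLOCK HOLDLOCK keeps the table stable for the duration of the scan.
SELECT COUNT_BIG(*)                              AS RowsTotal,
       SUM(CONVERT(NUMERIC, BINARY_CHECKSUM(*))) AS DataChecksum
FROM dbo.Orders WITH (TABLOCK, HOLDLOCK);
```

Matching counts with different checksums means the rows exist on both sides but their contents differ.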
All the above is nice when you have only one publication to validate. What if you have hundreds of publications?
As you might have expected, this UI executes a stored procedure on the publisher. The stored procedure is called sp_publication_validation and its main parameters are:
- @rowcount_only:
- 1 - perform a rowcount check only (default)
- 2 - perform a rowcount and binary checksum
- @full_or_fast:
- 0 - full count using COUNT(*)
- 1 - fast count using the sysindexes view
- 2 - compare the subscriber and publisher using the fast count and, if the results are not equal, use a count on the table; if the rows field in sysindexes is NULL, a full count will be used (default)
USE <published database>
EXEC sp_publication_validation @publication = 'PublicationName'
, @rowcount_only = 2
, @full_or_fast = 2
You can execute this stored procedure over all publications on the server.
I have put together a simple script to do this. In order to use it, you need to define a linked server to the Distributor, which is used to get the list of publications from the MSpublications table located in the Distribution database.
CREATE PROCEDURE ValidateSubscriptions
	@PrintMode int = 0
AS
BEGIN
	DECLARE @SQLCmd varchar(max);
	SET @SQLCmd = ''

	SELECT @SQLCmd = @SQLCmd + 'EXEC [' + publisher_db + '].dbo.sp_publication_validation
		@publication = ''' + publication + ''',
		@rowcount_only = 2,
		@full_or_fast = 2;' + CHAR(10) + CHAR(13) +
		' WAITFOR DELAY ''00:01:00'';' + CHAR(10) + CHAR(13)
	FROM DISTRIBUTOR.DistributionDB.dbo.MSpublications p
	JOIN DISTRIBUTOR.master.sys.servers s
		ON p.publisher_id = s.server_id
	WHERE s.name = @@SERVERNAME

	IF @PrintMode = 1 BEGIN
		PRINT @SQLCmd
	END

	EXEC (@SQLCmd)
END
All the validation results (for all publications) will be inserted into the MSdistribution_history table which is located in the Distribution database on the Distributor server.
Viewing it can be done using the following query:
SELECT time AS Time, -- column list reconstructed; comments holds the validation result text
	s.name AS Publisher,
	s2.name AS Subscriber,
	comments AS Comments
FROM dbo.MSdistribution_history h
JOIN dbo.MSdistribution_agents a
	ON h.agent_id = a.id
JOIN master.sys.servers s
	ON a.publisher_id = s.server_id
JOIN master.sys.servers s2
	ON a.subscriber_id = s2.server_id
WHERE comments like 'Table %'
The best part is that if any article fails validation, you will immediately see it in the error log:
EXEC master..xp_readerrorlog 0,1,'failed data validation'
The above means that if you have some usual routine that notifies you when there are unusual errors in the SQL Server error log, it will show validation failures as well.
May all your articles pass their validation successfully.
Partitioning is a very important feature for scalable applications, it allows breaking up huge monolithic tables in order to ensure their scalability & manageability within a single instance. For example, data deletion and loading can be very fast if done by partition. Reorganizing, optimizing and rebuilding indices also can be done by partition which is much less intrusive and faster.
Also, when it comes to performance, joins become faster when the tables involved are partitioned. Smaller b-trees for each partition make partition access fast, especially when limiting rows by the partition key so that only a subset of the data is accessed. Another performance benefit can be achieved through partition-level lock escalation.
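To make the maintenance operations above concrete, here is a minimal sketch. The names (dbo.Sales, dbo.SalesArchive, IX_Sales_Date) are hypothetical; it assumes dbo.Sales is natively partitioned and dbo.SalesArchive has an identical structure on the same filegroup.

```sql
-- Purge an entire partition almost instantly: SWITCH is a metadata-only
-- operation, so no rows are individually deleted or logged.
ALTER TABLE dbo.Sales SWITCH PARTITION 1 TO dbo.SalesArchive;
TRUNCATE TABLE dbo.SalesArchive;

-- Rebuild the index for a single partition instead of the whole table:
ALTER INDEX IX_Sales_Date ON dbo.Sales
    REBUILD PARTITION = 3;
```

Because the switch only updates metadata, the "delete" completes in milliseconds regardless of how many rows the partition holds.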
There are two partitioning designs in SQL Server. In a partitioned view, the member tables are combined horizontally by a view using UNION ALL, so from the user's perspective the data comes from a single table. A natively partitioned table, a feature introduced in SQL Server 2005, is treated as a single object by the relational engine, yet handled as multiple objects by the storage engine.
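As a quick illustration of the two designs, here is a sketch with hypothetical names (dbo.Sales2011, dbo.Sales2012, dbo.SalesView, pfSalesYear, psSalesYear). In the partitioned view design, the CHECK constraints on the partitioning column are what allow the optimizer to eliminate member tables from a query.

```sql
-- Design 1: a partitioned view over per-year member tables.
CREATE TABLE dbo.Sales2011 (
    SaleDate date NOT NULL
        CHECK (SaleDate >= '2011-01-01' AND SaleDate < '2012-01-01'),
    Amount money NOT NULL);
CREATE TABLE dbo.Sales2012 (
    SaleDate date NOT NULL
        CHECK (SaleDate >= '2012-01-01' AND SaleDate < '2013-01-01'),
    Amount money NOT NULL);
GO
CREATE VIEW dbo.SalesView
AS
SELECT SaleDate, Amount FROM dbo.Sales2011
UNION ALL
SELECT SaleDate, Amount FROM dbo.Sales2012;
GO

-- Design 2: a natively partitioned table (SQL Server 2005 and later).
CREATE PARTITION FUNCTION pfSalesYear (date)
    AS RANGE RIGHT FOR VALUES ('2012-01-01');
CREATE PARTITION SCHEME psSalesYear
    AS PARTITION pfSalesYear ALL TO ([PRIMARY]);
CREATE TABLE dbo.Sales
(
    SaleDate date NOT NULL,
    Amount   money NOT NULL
) ON psSalesYear (SaleDate);
```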
When comparing native partitioning to partitioned views, one of the greatest improvements is that a partitioned table produces relatively small query plans. PVs in general had enormous query plans, because each table in the view was a separate object with separate statistics. Because the relational engine treats a natively partitioned table as a single object, fewer permutations are considered during the optimization phase, meaning faster optimization. Also, auto-parameterization does not work with PVs, because the rules of auto-parameterization are quite strict.
Ad-hoc SQL that uses a single partitioned table has many more opportunities to be auto-parameterized and produce a reusable ad-hoc query plan.
In PVs, tables are accessed sequentially. When using native partitioning many operations can be done in parallel.
However, there are many challenges when migrating from an environment that uses partitioned views to native partitioning. The usual way of upgrading such an environment is to create a new partitioned table and then switch the data in, partition by partition. At some point, the old view and the new table swap names. During such an upgrade, from the user's perspective, the data is unavailable or, even worse, partially available, and reports are incorrect.
After the upgrade, maintenance tasks are also more cumbersome when performed against a partitioned table. Adding new indices to a huge table with billions of rows takes about 4 hours, and the table is locked during that time. The same goes for data type changes on columns. Also consider primary key changes: dropping the PK takes about 3 hours, and adding a new one takes another 4 (a PK change is usually a bad sign anyway, since it indicates the database design is not correct, but we all know it happens and sometimes we need to add a column to the PK). Finally, there is no scale-out for partitioned tables: all partitions must reside in the same database on the same server.
Here, surprisingly, our good old, unfairly forgotten partitioned views can help us make the partitioning design even better.
Consider the following scenario. You have a partitioned view over many big tables. Instead of switching the data, partition by partition, into a new table, you simply add the new natively partitioned table under the same partitioned view. No long Sch-M locks, no partial data availability, no downtime. All new data is inserted into the new partitioned table, and when the time comes, the old partitions get purged, leaving the partitioned view with only the natively partitioned table underneath. If you need to add indices or make other lengthy metadata changes, you can simply add another partitioned table with the required changes under the view, which means zero downtime for your environment. If you need a distributed architecture across servers, you can also achieve that using partitioned views.
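A minimal sketch of the step described above, with hypothetical names (dbo.SalesView, dbo.SalesOld, dbo.SalesPartitioned): the natively partitioned table is simply added as one more member under the existing view, so readers never lose access to the data while the migration runs.

```sql
-- Assumes dbo.SalesPartitioned was created on a partition scheme and carries
-- a CHECK constraint on SaleDate covering only the new date range, so the
-- optimizer can still eliminate members the query does not need.
ALTER VIEW dbo.SalesView
AS
SELECT SaleDate, Amount FROM dbo.SalesOld          -- legacy member table(s)
UNION ALL
SELECT SaleDate, Amount FROM dbo.SalesPartitioned; -- new partitioned table
```

ALTER VIEW is a quick metadata change, so the swap avoids the long Sch-M locks that switching data partition by partition would incur.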
Native partitioning is a great feature of SQL Server, but partitioned views can make it even better, helping you scale out and perform maintenance tasks with zero downtime.
I am truly excited to start blogging here among this incredible #sqlfamily.
My name is Maria Zakourdaev. I have more than 10 years of experience with SQL Server. The last five years have been spent mostly on benchmarking different SQL Server features and flows, like replication, data import, the impact of indexes on DML flows, star transformations in RDBMSs, hierarchical queries and custom OLAP-like aggregations. I have spoken at Microsoft TechEd (Israel) on the SQL Server track, and I am an active member of the Israeli SQL Server user group.