
Alexander Kuznetsov

  • Cars, Databases, and Benchmarks

    Over the last few months we have migrated some functionality from Sql Server to PostgreSql, and developed a couple of new systems powered by PostgreSql. So far I do not see that either of these two products is better than the other - they are different. Even though I guess this is the right time to write something like "N reasons why PostgreSql is better than Sql Server", or vice versa, "N reasons why Sql Server is better than PostgreSql", or both (I guess I could write it both ways), I am not going to write along these lines.

    I do not see either of these two products being better than the other - they are just different. I am migrating to PostgreSql and choosing it for my new systems because it makes practical sense given the specific circumstances, considering the specific requirements I am dealing with right now. I guess in agile environments it makes sense not to be too partial to any particular technology, but to choose the right technology for the problem at hand and to be open-minded.

    Let me provide an analogy between choosing and using a database and choosing and using a car. I am definitely not the first to introduce the analogy between cars and databases. Consider, for example, the logo of the Sql Rally conference, with what is presumably a car's RPM needle deep in the red. This logo has always kept me wondering: what exactly is the message it is trying to convey? We are not supposed to drive cars like this - it means abusing and eventually ruining the engine, does it not?

    Also let us consider the super-sleek race car on the cover of a book about administering SQL Server. This cover is also kind of surprising: is it related to the content of the book? If the answer is yes, does it mean that the database requires a full-time team of highly skilled mechanics, maintaining it every few minutes? If we are in the business of winning races, I guess this is the way to do it, although I've never done anything like that.

    However, if all we need from a car is to reliably get from point A to point B, then there are simpler and cheaper alternatives to a race car, such as a common kind of sedan or van or SUV - this is what most of us are actually driving to get from point A to point B.

    Let us extend this analogy to gas mileage and TPC benchmarks. As I drive my car, I do not really get the gas mileage as specified in the manual. On the highway, I may get somewhat more miles per gallon than the number in the manual. In the city, it is the other way around - I burn more fuel than specified in the manual. My mileage may and does vary - this is not a surprise at all. The roads I drive on might be quite different from the ones the standard mileage is measured on. Also I am definitely not a perfect driver - driving is not my core competence, which probably impacts the mileage I am getting.

    Similarly, as we develop systems that use databases, our results may be quite different from the ones in TPC benchmarks. Our workloads might be quite different from the standardized ones actually used in standard benchmarks. More importantly, we are not full-time experts in any database engine. Instead, we are more or less close to full stack developers. As such, we are not going to spend months squeezing every last bit of performance from every SQL query - we have neither the time nor the expertise to do so. More likely we would expect to spend 20% of effort and hopefully get 80% of results. Our typical usage of databases is not like driving a race car trying to win a race - it is more like a daily commute in a typical commuter's car: we can command some reasonable effort to get where we want, but we cannot compete in an all-out race twice a day, every day.

    Also I would not typically consider a checklist of available/missing features as the most important criterion when comparing databases. Just like with cars, some features are important or mandatory, others are irrelevant, and this all depends on the workload we are dealing with. For example, having 4WD/AWD is a must in the Upper Peninsula, but it is probably useless in Southern Texas.

    Instead, we should evaluate the overall efficiency: how much time and money we spend building and maintaining the thing, and how well it solves users' problems.

    Regarding migrations from SQL Server to PostgreSql, there are quite a few details we need to be aware of. In the past I've described a few of these details, and I am going to continue that, so stay tuned.

     

  • Book Review: Tribal SQL, Performance Tuning With SQL Trace and Extended Events

    Tara Kizer wrote this chapter, and it is relevant for us developers.

    Tara briefly explains how to use the GUI tool, the Profiler, and states that it can heavily impact performance on the server. Then she explains how to set up a server-side trace, which should incur less overhead, and I noticed that the working example she provided is quite similar to the one our team is using.

    There is one minor thing, however, that we do differently: our T-SQL is self-documenting. For example, instead of the following:

    EXEC sp_trace_setevent @traceid , 10, 1, @on

    We use this:

    EXEC sp_trace_setevent @traceid = @traceid,
        @eventid = @RpcCompleted, @columnid = @Cpu, @on = @on;
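
    For context, here is a hedged sketch of how such self-documenting variables might be declared before the call. The names mirror the example above, and the values are the standard SQL Trace IDs; this is an assumption for illustration, not code from the book:

    -- hypothetical declarations for the self-documenting call above;
    -- 10 is the RPC:Completed event ID and 18 is the CPU column ID in SQL Trace
    DECLARE @RpcCompleted INT = 10,
            @Cpu          INT = 18;
    DECLARE @on BIT = 1; -- @traceid itself comes from sp_trace_create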

    After describing how to set up a trace, Tara demonstrates several real life ways to use its output, such as finding missing indexes, stale statistics, and parameter sniffing causing problems.

    I have found this section of the chapter practical and useful for developers. We need to be proficient with this tool, and we need to practice using it, so that we can use it quickly whenever the need arises. This chapter provides a good, short, practical introduction to becoming proficient. I would recommend that developers read it.

    Later Tara proceeds to describe Extended Events. I am not qualified to review that, since I do not have enough real life experience with them.

    This completes the chapter review, which is followed by my personal opinion on the deprecation of the Profiler.

    I am disappointed that the familiar and convenient interface of the Profiler and server-side traces is going to disappear, rendering all the experience of using it useless, and forcing lots of people to spend precious time mastering the next thing that replaces it.

    I understand that the old implementation might have to go, but I also think that it might be feasible to have the old interface invoke the new implementation. 

    In my narrow experience, even if an application has just ten or twenty users, it may be cheaper to keep the old interface unchanged, even when we completely replace the implementation, such as when migrating from SQL Server to PostgreSql. As a result our users do not have to relearn how to do the same thing with the new tool, and can do something more useful instead. The math is simple: a day of a developer's time spent on backward compatibility is cheaper than two hours of a user's time spent on relearning, multiplied by the number of users.

    Because the Profiler and server side traces have a huge number of customers, I think that SQL Server would be a more useful product if it kept providing the old interface even if the old implementation needs to be replaced with a new one. 

    The time and resources spent on keeping the old interface would probably be just a tiny fraction of the time and resources spent by the huge number of users relearning how to do the same thing with the new tool.
  • Book Review: Tribal SQL, Chapter 1.

    "Tribal SQL" is an interesting book, so I will be reviewing some, although not all, of its chapters.

    Quoting from the introduction, "This is a book for DBAs, for things you think they really ought to know".
    As an agile developer, I clearly do not belong to this book's target audience. Also, I am not qualified to review some of the chapters, as I have zero real life experience with some of the technologies described in them.
    I will not be reviewing these chapters.

    As a developer frequently working with RDBMS, I am much more productive when I understand the underlying RDBMS, and develop software that utilizes the strengths and avoids the weaknesses of the RDBMS I currently work with. This is why I am curious about SQL Server, and this is why some of the chapters in "Tribal SQL" are really interesting to me.

    So, the following series of posts is written by a developer, and targeted at developers.

    Review for "SQL Server Internals 101"

    Mark S Rasmussen has written a chapter entitled "SQL Server Internals 101".

    This chapter is very relevant for us developers - to deliver robust and performant systems,
    we do need to have some knowledge of internals.

    Quoting from the author, "... every SQL Server DBA and developer should have a sound basic understanding,
    not just of what storage objects exist in SQL Server (heaps and indexes), but of the underlying data structures".

    This rings true to me - in all my experience it is cheaper to invest some time, learn the internals, and avoid very costly mistakes.

    The author begins with a personal story, describing his own mistake caused by insufficient knowledge.
    This is exactly what's needed to grab readers' attention.

    The story definitely resonates with me - I did make some mistakes with SQL Server databases myself, and who did not? These days I develop against PostgreSql a lot, and maybe I am doing something wrong with my PostgreSql systems as well, and need better knowledge of it.

    This story is followed by an introduction to records and pages, and a practical, hands-on way to look directly into the bytes and bits these pages consist of.

    This is followed by a practical introduction to B-trees, indexes, both clustered and non-clustered, and heaps.
    The details on heaps and clustered indexes are especially necessary, because this is where SQL Server is so different from PostgreSQL and Oracle.

    Conclusion


    This chapter is a highly practical introduction into a very relevant topic. It has been written for busy practitioners, including developers - it starts with explaining why we want to know all these things, then it clearly and succinctly explains the basics, and refers to more detailed resources for further learning.

  • Learning PostgreSql: serialization failures with SERIALIZABLE

    We shall reproduce a serialization failure and see how the SERIALIZABLE isolation level enforces data integrity.

    Prerequisites

    We shall need the following test data:

    CREATE TABLE Carpools(
        Car_Name VARCHAR NOT NULL,
        Passenger VARCHAR NOT NULL
    );

    INSERT INTO Carpools(car_name, passenger)
    VALUES
    ('Carol''s car', 'Jim'),
    ('Carol''s car', 'Carmen'),
    ('Carol''s car', 'Ted');

    Reproducing a serialization failure

    Suppose that Carol can only take four passengers in her car, so we can add at most one more passenger. The following command verifies that there still is enough room for one more passenger, adds one more person, but does not commit. The script completes successfully:

    BEGIN ISOLATION LEVEL SERIALIZABLE;
    INSERT INTO Carpools(car_name, passenger)
    SELECT 'Carol''s car', 'Juan'
    WHERE (SELECT COUNT(*) FROM Carpools
           WHERE car_name='Carol''s car') < 4;

    In another session, we can successfully add one passenger - there is no blocking whatsoever:

    BEGIN ISOLATION LEVEL SERIALIZABLE;
    INSERT INTO Carpools(car_name, passenger)
    SELECT 'Carol''s car', 'Pawel'
    WHERE (SELECT COUNT(*) FROM Carpools
           WHERE car_name='Carol''s car') < 4;

    Should both transactions commit, we'd have five passengers, but this is not going to happen. We can commit only one transaction, no matter whether the first or the second. When we try to commit the other one, PostgreSql detects serialization failure, rolling back the transaction with the following error:

    ERROR:  could not serialize access due to read/write dependencies among transactions

    The reason is as follows: there is no way these two transactions could insert these two rows serially, with one transaction committing before the other one begins. Whichever transaction runs last, it would see that there already are four passengers, and not insert the fifth one.

    As we have seen, PostgreSql successfully detects a potential violation of data integrity, and fails a transaction to prevent the violation. Clearly this is a highly useful feature. As we migrate functionality to PostgreSql, we should utilize its advantages. In my experience, we are not very productive when we are trying to emulate Sql Server's behavior on PostgreSql - we are much more productive when we have a good understanding of the new platform and use it the way it is supposed to be used.

    Of course, this data integrity rule is very simple. As such, we could use constraints to enforce it. I have chosen this simple scenario because I wanted to keep my examples simple, not because we cannot solve the problem using only constraints.
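
    For illustration, here is a minimal constraint-only sketch of that alternative (an assumed schema, not from the original example): each passenger occupies a numbered seat, and the CHECK together with the UNIQUE constraint caps every car at four passengers.

    CREATE TABLE Carpools_Constrained(
        Car_Name  VARCHAR NOT NULL,
        Passenger VARCHAR NOT NULL,
        Seat_No   INT NOT NULL,
        -- only seat numbers 1 through 4 are allowed per car...
        CONSTRAINT CHK_seat_range CHECK (Seat_No BETWEEN 1 AND 4),
        -- ...and no two passengers can occupy the same seat in the same car
        CONSTRAINT UNQ_car_seat UNIQUE (Car_Name, Seat_No)
    );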

  • Learning PostgreSql: Fun with REPEATABLE READ

    In this post we shall run some examples under REPEATABLE READ, and see how they behave differently.

    Setting up test data

    On SQL Server, run the following:

    CREATE TABLE Tickets(
        ID INT NOT NULL,
        Problem VARCHAR(100) NOT NULL,
        SpaceFiller CHAR(200) NOT NULL
    );

    INSERT INTO Tickets(ID, Problem, SpaceFiller)
    SELECT Number*10, 'Problem '+CAST([Number] AS VARCHAR(10))+'0', 'Space Filler'
    FROM data.Numbers;

    ALTER TABLE Tickets ADD PRIMARY KEY(ID);

    UPDATE Tickets SET Problem = 'Add 16Gb Ram to Kenny''s workstation'
    WHERE ID = 90090;

    On PostgreSql, we do not need a Numbers table - there is a built-in function, generate_series, which is very useful:

    CREATE TABLE Tickets(
        ID INT NOT NULL,
        Problem VARCHAR NOT NULL
    );

    TRUNCATE TABLE Tickets;

    INSERT INTO Tickets(ID, Problem)
    SELECT generate_series*10, 'Problem ' || CAST(generate_series AS VARCHAR)
    FROM generate_series(1, 16000);

    ALTER TABLE Tickets ADD PRIMARY KEY(ID);

    UPDATE Tickets SET Problem = 'Add 16Gb Ram to Kenny''s workstation'
    WHERE ID = 90090;

    Identical selects running under REPEATABLE READ on Sql Server may return different results 

    The following scripts demonstrate that reads are not completely repeatable, despite the isolation level's name, REPEATABLE READ, claiming otherwise.

    In one tab, run this:

    BEGIN TRANSACTION;
    UPDATE Tickets SET Problem = 'Replace printer in NW corner'
    WHERE ID = 49000;

    In another tab, run this:

    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    BEGIN TRANSACTION;
    SELECT COUNT(*) AS TotalProblems,
    SUM(CASE WHEN Problem LIKE '%Kenny%' THEN 1 END) AS [Kenny's Problems],
    SUM(CASE WHEN Problem LIKE '%printer%' THEN 1 END) AS [Printer Problems]
    FROM Tickets WHERE ID BETWEEN 0 AND 100000;

    This select is blocked by the uncommitted update in the first tab and as such it does not complete. Let us get back to the first tab and run the following:

    UPDATE Tickets SET ID=101
    WHERE ID = 90090;

    COMMIT;

    This update physically moves Kenny's ticket from a page that has not been read by the select yet to a page which has already been read. Once the transaction in the first tab commits, the select in the second tab completes with the following results:

    TotalProblems Kenny's Problems Printer Problems
    ------------- ---------------- ----------------
    10000         NULL             1

    Clearly Kenny's ticket has not been read at all. If we rerun this select, in the same transaction, we shall get different results, as follows:

    TotalProblems Kenny's Problems Printer Problems
    ------------- ---------------- ----------------
    10001         1                1

    Note that Kenny's problem is included in the second result set, and the total in TotalProblems column is now correct as well.

    As we have seen,

    • when we rerun queries in the same transaction, we may get different results 
    • selects may return rows inserted after the select started - such as the only ticket for "Printer problem"
    • selects may fail to return some rows that were committed before the transaction started - note that the ticket for "Kenny's problem" is completely missing from the first result set. This effect may also cause incorrect totals. Note that TotalProblems column is different in the first and second result sets.

    If we rerun this example on PostgreSql, the first select is not blocked by the modification, and consequent selects return the same results as the first one.
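
    For reference, here is a hedged sketch of what the reading session looks like when translated to PostgreSql (assumed, not shown in the original):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT COUNT(*) AS TotalProblems,
           SUM(CASE WHEN Problem LIKE '%Kenny%' THEN 1 END) AS Kennys_Problems,
           SUM(CASE WHEN Problem LIKE '%printer%' THEN 1 END) AS Printer_Problems
    FROM Tickets WHERE ID BETWEEN 0 AND 100000;

    It completes immediately even while the other session holds an uncommitted update, and rerunning it inside the same transaction returns the same counts.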

    REPEATABLE READ on Sql Server does not prevent lost updates

    The following scenario demonstrates how an update can be lost. To begin the scenario, run the following script in one tab:

    BEGIN TRANSACTION;

    UPDATE Tickets
    SET Problem = 'Water cooler making funny noises'
    WHERE ID = 4010;

    In another tab, run this:

    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    BEGIN TRANSACTION;

    UPDATE Tickets
    SET Problem = 'Printer problem ' + CAST(ID AS VARCHAR(5))
    WHERE ID BETWEEN 4000 AND 4100;

    The update is waiting on the uncommitted change. As soon as we commit the transaction in the first tab, the update completes. Note that it overwrites the change from the first tab:

    SELECT Problem FROM Tickets
    WHERE ID BETWEEN 4000 AND 4100;

    Problem
    ----------------------------------------------------------------------------------------------------
    (snip)
    Printer problem 4010
    (snip)
    Similarly, a modification running under REPEATABLE READ can update or delete rows that were inserted after the transaction started. 
    To reproduce, let us begin archiving tickets in one tab: 

    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    BEGIN TRANSACTION;

    SELECT *
    INTO ArchivedTickets
    FROM Tickets
    WHERE ID BETWEEN 4000 AND 4100;
    In another tab, let us add a new ticket: 
    INSERT Tickets(ID, Problem, SpaceFiller)
    VALUES(4005, 'Water cooler making funny noises', 'Test value');

    Let us get back to the first tab and complete the archiving:

    DELETE FROM Tickets
    WHERE ID BETWEEN 4000 AND 4100;
    COMMIT;

    SELECT COUNT(*) FROM Tickets
    WHERE ID BETWEEN 4000 AND 4100;

    -----------
    0

    Clearly the new ticket number 4005 was deleted along with the archived ones. This is a textbook example of a lost update.

    REPEATABLE READ on PostgreSql prevents lost updates

    Let us rerun the previous scenario - we shall observe a very different behavior. In one session, let us start archiving a range of tickets:

    BEGIN ISOLATION LEVEL REPEATABLE READ;

    CREATE TABLE archived_tickets
    AS
    SELECT *
    FROM Tickets
    WHERE ID BETWEEN 4000 AND 4100;

    In another session, let us add a new ticket:

    INSERT INTO Tickets(ID, Problem)
    VALUES(4005, 'Water cooler making funny noises');

    In the first session, let us complete the archiving. The unarchived ticket number 4005 has not been deleted:

    DELETE FROM Tickets
    WHERE ID BETWEEN 4000 AND 4100;

    COMMIT;

    SELECT Problem FROM Tickets
    WHERE ID BETWEEN 4000 AND 4100;

    "Water cooler making funny noises"

    Let us rerun the scenario that updates an existing ticket. In one session, run this:

    BEGIN TRANSACTION;

    UPDATE Tickets
    SET Problem = 'Add index on Shipments.Height'
    WHERE ID = 4010;

    In another session run this:

    BEGIN ISOLATION LEVEL REPEATABLE READ;

    UPDATE Tickets
    SET Problem = 'Printer problem ' || CAST(ID AS VARCHAR(5))
    WHERE ID BETWEEN 4000 AND 4100;

    Commit the transaction in the first session, and the second transaction detects a lost update and blows up, as follows: "ERROR:  could not serialize access due to concurrent update".

    As we have seen, PostgreSql has detected and prevented both lost updates. As such, we do not need to troubleshoot two subtle bugs that are rather difficult to reproduce.

    Conclusion

    As we have seen, REPEATABLE READ is implemented very differently on PostgreSql, so we need to be very careful when we port T-SQL that uses REPEATABLE READ. As we develop against PostgreSql, we can and should take advantage of its optimistic implementation of isolation levels, rather than trying to exactly replicate Sql Server's behavior.

  • Learning PostgreSql: READ COMMITTED and Data Integrity

    As we have just discussed, READ COMMITTED isolation level behaves very much like Sql Server's READ_COMMITTED_SNAPSHOT. As such, we need to be very careful with data integrity - lots of code that just works on Sql Server under its default isolation level, READ COMMITTED, does not work on PostgreSql under its default isolation level, which is also READ COMMITTED, but behaves differently.

    Note: whenever we use READ_COMMITTED_SNAPSHOT on Sql Server, we need to be just as careful with data integrity  - otherwise we can introduce lots of subtle bugs. Hugo Kornelis described these bugs in his blog series entitled "Snapshot isolation: a threat for integrity". Also I described the same effects in the book entitled "Defensive Database Programming". This is why we are going to discuss how to enforce data integrity when readers do not block writers.

    I am well aware that there is a lot of advice to just enable READ_COMMITTED_SNAPSHOT and boost performance, without mentioning the side effects impacting data integrity. We need to be very careful with such advice: performance is important, of course, but I am usually working on systems where sacrificing data integrity to boost performance is not acceptable.

    Demonstrating the problem.

    Suppose that we have the following business rule: high priority tickets cannot be assigned to developers who are on vacation. Also suppose that to enforce this business rule, we are using subqueries whenever we modify data in Developers and Tickets tables. It does not matter if we add subqueries directly to our DML commands or wrap them in triggers - the problem stays the same.

    Let us set up the tables and some test data:

    CREATE TABLE Developers(
        Name VARCHAR NOT NULL PRIMARY KEY,
        Status VARCHAR NOT NULL
    );

    INSERT INTO Developers(Name, Status)
    VALUES('Jeff', 'Active');

    CREATE TABLE Tickets(
        Problem VARCHAR NOT NULL PRIMARY KEY,
        Priority VARCHAR NOT NULL,
        AssignedTo VARCHAR NULL,
        CONSTRAINT FK_devlopers FOREIGN KEY(AssignedTo)
            REFERENCES Developers(Name),
        CONSTRAINT CHK_high_priority_always_assigned
            CHECK((Priority='high' AND AssignedTo IS NOT NULL)
                  OR Priority<>'high')
    );

    INSERT INTO Tickets(Problem, Priority, AssignedTo)
    VALUES('TPS report hangs','low','Jeff');

    We are ready to reproduce the problem. In one session, let us begin a transaction, update Jeff's status to "Vacation", but not commit it yet:

    BEGIN TRANSACTION;
    UPDATE Developers SET Status='Vacation'
    WHERE Name='Jeff'
      AND NOT EXISTS(SELECT * FROM Tickets AS t
                     WHERE t.AssignedTo = Developers.Name
                       AND t.Priority='high');

    In another session, let us begin a transaction, increase the ticket's priority to high, but not commit it yet:

    BEGIN TRANSACTION;
    UPDATE Tickets SET Priority='high'
    WHERE Problem='TPS report hangs'
      AND (SELECT Status FROM Developers WHERE Name=Tickets.AssignedTo)<>'Vacation';

    Apparently the update succeeds:

    SELECT * FROM Tickets;

    "TPS report hangs";"high";"Jeff"

    The reason why the update succeeded is simple: this session does not see the uncommitted data from the other session, and it is not blocked, because writers do not block readers. This session read Jeff's status, which is still active:

    SELECT * FROM Developers ;

    "Jeff";"Active"

    We can commit both transactions. Now we have data that violates our data integrity rules: a high priority ticket is assigned to a developer who is on vacation. Let us restore the original data:

    UPDATE Developers SET Status='Active'
    WHERE Name='Jeff';

    UPDATE Tickets SET Priority='low'
    WHERE Problem='TPS report hangs';

    Let us try out a higher isolation level. Our attempts to use REPEATABLE READ and even SERIALIZABLE make no difference. However, after a short investigation, we realize that the following script works differently on PostgreSql and Sql Server:

    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    BEGIN TRANSACTION;

    On PostgreSql, the SET command does not affect the isolation level of the following transaction. The following section explains why.

    SET TRANSACTION works differently

    Unlike in SQL Server, a standalone SET TRANSACTION command has no effect whatsoever. The following script shows that:

    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    select current_setting('transaction_isolation');

    "read committed"

    The reason is simple: in PostgreSql, SET TRANSACTION command affects only the current transaction. The standalone SET TRANSACTION runs in its own transaction, so when it completes, its transaction completes too, restoring the default isolation level. The correct way to set isolation level is as follows:

    BEGIN TRANSACTION;
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    select current_setting('transaction_isolation');    

    "repeatable read"

    Also we can use an equivalent, but more concise, method:

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    select current_setting('transaction_isolation');    
    "repeatable read"

    Trying out higher isolation levels

    Now that we know how to set isolation levels properly, let us rerun the same scenario under higher isolation levels.

    Under REPEATABLE READ, we still get the same problem, and we can commit both transactions.

    After restoring the original data, let us replay the scenario under SERIALIZABLE isolation level. As under lower isolation levels, both modifications complete - this is completely different from Sql Server, where one modification would be blocked. However, when we try to commit both transactions, only one will succeed, while another will fail with the following error message:

    ERROR:  could not serialize access due to read/write dependencies among transactions

    So, essentially the PostgreSql implementation is optimistic - it only detects conflicts at commit time.

    Mimicking Sql Server's behavior

    We can make sure that the subqueries that enforce data integrity are blocked by writers, which mimics Sql Server's implementation of isolation levels. The following script shows how to do that - note the FOR UPDATE clause in the subquery. In one session, run this:

    --restore original data
    UPDATE Developers SET Status='Active'
    WHERE Name='Jeff';

    UPDATE Tickets SET Priority='low'
    WHERE Problem='TPS report hangs'; 

    -- enforce data integrity Sql Server style
    BEGIN TRANSACTION;
    UPDATE Developers SET Status='Vacation'
    WHERE Name='Jeff'
      AND NOT EXISTS(SELECT * FROM Tickets AS t
                     WHERE t.AssignedTo = Developers.Name
                       AND t.Priority='high' FOR UPDATE);

    In another session, run this:

    BEGIN TRANSACTION;
    UPDATE Tickets SET Priority='high'
    WHERE Problem='TPS report hangs'
    AND (SELECT Status FROM Developers WHERE Name=Tickets.AssignedTo FOR UPDATE)<>'Vacation'; 

    This command hangs, exactly like it does on SQL Server. Let us return to the first session and commit. The update in the second session immediately completes, but the ticket's priority is not updated:

    SELECT * FROM Developers;
    "Jeff";"Vacation"

    SELECT * FROM Tickets;
    "TPS report hangs";"low";"Jeff"

    As we have seen, the FOR UPDATE clause allows us to replicate SQL Server's pessimistic implementation of isolation levels on PostgreSql.

    Conclusion

    As we have seen, PostgreSql's implementation of isolation levels is substantially different. As such, we need to be very careful whenever we port DML that uses subqueries to enforce data integrity - otherwise we may end up with dirty data, just as when we turn on READ_COMMITTED_SNAPSHOT on Sql Server. There is much more to say about the differences in the implementation of isolation levels - we shall continue next time.

    Note that we could use constraints to enforce this business rule and save ourselves a lot of trouble. I've described the solution in my free eBook "Defensive Database Programming", in the chapter entitled "Advanced Use of Constraints".
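
    To give an idea of the shape of that solution, here is a hedged sketch (assumed schema changes, not the exact code from the book): the developer's status is duplicated into Tickets via a composite foreign key with ON UPDATE CASCADE, so that a plain CHECK constraint can see both the priority and the assignee's status.

    ALTER TABLE Developers
        ADD CONSTRAINT UNQ_developers_name_status UNIQUE(Name, Status);

    ALTER TABLE Tickets ADD COLUMN AssignedToStatus VARCHAR NULL;

    -- when a developer's status changes, the copy in Tickets changes with it
    ALTER TABLE Tickets
        ADD CONSTRAINT FK_tickets_developer_status
        FOREIGN KEY(AssignedTo, AssignedToStatus)
        REFERENCES Developers(Name, Status)
        ON UPDATE CASCADE;

    -- a high priority ticket may not point at a developer on vacation
    ALTER TABLE Tickets
        ADD CONSTRAINT CHK_high_priority_not_on_vacation
        CHECK(Priority <> 'high' OR AssignedToStatus <> 'Vacation');

    For this sketch to actually enforce the rule, AssignedToStatus must be kept populated whenever a ticket is assigned; treat it as an outline of the approach rather than a finished solution.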

     

  • Learning PostgreSql: different behavior with READ UNCOMMITTED/COMMITTED

    Because of multi-version concurrency control, aka MVCC, there are many differences in query behavior under different isolation levels. Before running examples, let us set up test data.

    Setting up test data

    The following script sets up the data we shall be playing with:

    DROP TABLE IF EXISTS test;
    CREATE TABLE test(ID INT NOT NULL,
        CONSTRAINT PK_test PRIMARY KEY(ID),
        col2 INT NOT NULL,
        col3 FLOAT);
    INSERT INTO test(ID, col2, col3)
    VALUES(1,0,0),(8,1,1);

    There are no dirty reads.

    In one session, let us insert an uncommitted row:

    BEGIN TRANSACTION;

    INSERT INTO test(ID, col2, col3)
    VALUES(3,1,1);

    In another session let us try to read uncommitted data:

    BEGIN ISOLATION LEVEL READ UNCOMMITTED;
    SELECT * FROM test;

    1;0;0
    8;1;1

    Clearly, we do not see the uncommitted row inserted by another session. This is exactly as documented:

    "When you select the level Read Uncommitted you really get Read Committed, and phantom reads are not possible in the PostgreSQL implementation of Repeatable Read, so the actual isolation level might be stricter than what you select"

    READ COMMITTED behaves like READ_COMMITTED_SNAPSHOT

    At this point we still have an uncommitted row. Let us keep the inserting transaction open, commit the selecting one, and begin another selecting transaction:

    BEGIN ISOLATION LEVEL READ COMMITTED;
    SELECT * FROM test;

    1;0;0
    8;1;1

    This SELECT works exactly like the previous one, exactly as documented. This behavior is different from how SQL Server executes a SELECT under READ COMMITTED, where the SELECT would be blocked. However, it is consistent with how SQL Server executes SELECTs under READ_COMMITTED_SNAPSHOT.

    If we commit the insert and rerun the select without committing the selecting transaction, we shall see the new row:

    1;0;0
    8;1;1
    3;1;1

    Again, this is consistent with how SQL Server executes SELECTs under READ_COMMITTED_SNAPSHOT.

    Multi-statement functions and READ COMMITTED

    When multi-statement functions run under READ COMMITTED isolation level, each statement can see the committed data as of the beginning of the statement, and not as of the beginning of function execution. This is consistent with how SQL Server executes SELECTs under READ_COMMITTED_SNAPSHOT.

    The following function selects two snapshots of data ten seconds apart:

    CREATE OR REPLACE FUNCTION two_snapshots()
    RETURNS TABLE(test_id int, snapshot_name varchar) AS
    $BODY$
    BEGIN
        CREATE TEMP TABLE before_sleep AS
            SELECT ID FROM test;

        PERFORM pg_sleep(10);

        RETURN QUERY SELECT ID, CAST('before sleep' AS VARCHAR) FROM before_sleep
                     UNION ALL
                     SELECT ID, 'after sleep' FROM test;

        DROP TABLE before_sleep;
    END;
    $BODY$
    LANGUAGE plpgsql;

    Let us open a window and cut and paste the following modification, but not run it yet: 

    INSERT INTO test(ID, col2, col3)
    VALUES(16,1,1);

    In another session, let us invoke the function:

    SELECT * FROM two_snapshots();

    We have ten seconds to run the insert - that should be more than enough. Once the function completes, it returns the following result, including the row inserted after the function has been invoked:

    1;"before sleep"
    8;"before sleep"
    3;"before sleep"
    1;"after sleep"
    8;"after sleep"
    3;"after sleep"
    16;"after sleep"

    As we have seen, the data modified after the function has been invoked is still visible to commands in the function, as long as the data has been committed before the command started.

    There is much more to discuss about isolation levels in PostgreSql - we shall continue later.

     

  • Learning PostgreSql: XACT_ABORT is Always ON

    PostgreSql has simple and consistent error handling, which can be roughly explained in Sql Server terms as follows: XACT_ABORT is always ON. In other words, error handling in PostgreSql has substantially fewer features, but it does have all the features we actually use in the project being migrated.

    The simplicity of error handling in PostgreSql is very good news - we need to learn much less before we are productive, and there are fewer chances to make a mistake. Let us consider some examples.

    On errors, transactions roll back.

    This behavior is similar to what Sql Server does when XACT_ABORT is set ON.

    Let us add a test table, and add one row of test data:

    CREATE TABLE test(ID INT NOT NULL, col2 INT NOT NULL, col3 FLOAT);
    INSERT INTO test(ID, col2, col3)
    VALUES(1,0,0);

    Let us run a transaction that is supposed to fail:

    BEGIN TRANSACTION;
    INSERT INTO test(ID, col2, col3)
    VALUES(2,1,1);
    UPDATE test SET col3=ID/col2;
    COMMIT; 

    ERROR: division by zero
    SQL state: 22012

    After the transaction fails, there is only one row visible from the same session:

    SELECT * FROM test;

    1;0;0

    Also the transaction is rolled back, so there is nothing to commit:

    COMMIT;
    WARNING:  there is no transaction in progress
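
    For comparison, here is a rough T-SQL sketch of the same experiment with XACT_ABORT turned on (assumed, for illustration only); the division by zero aborts the batch and rolls back the whole transaction, which is what PostgreSql does by default:

    SET XACT_ABORT ON;
    BEGIN TRANSACTION;
    INSERT INTO test(ID, col2, col3)
    VALUES(2,1,1);
    UPDATE test SET col3 = ID/col2; -- division by zero: the batch stops and the transaction rolls back
    COMMIT;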
      

    Function bodies are always run in transaction context.

    We have already discussed that in the previous post. As such, any error occurring inside a function causes the function to abort, and the whole transaction to roll back. The following example shows it:

    DROP TABLE test;
    CREATE TABLE test(ID INT NOT NULL, 
    CONSTRAINT PK_test PRIMARY KEY(ID),
    col2 INT NOT NULL, 
    col3 FLOAT);
    INSERT INTO test(ID, col2, col3)
    VALUES(1,0,0);

    CREATE OR REPLACE FUNCTION Add_Test_Rows()
    RETURNS VARCHAR AS
    $BODY$
    BEGIN
    -- this insert succeeds
    INSERT INTO test(ID, col2, col3)
    VALUES(2,0,0);
    -- the first row violates the PK
    INSERT INTO test(ID, col2, col3)
    VALUES(1,0,0),(3,0,0);
    RETURN 'Success';
    END;
    $BODY$
    LANGUAGE plpgsql VOLATILE
    ;

    SELECT Add_Test_Rows();

    ERROR:  duplicate key value violates unique constraint "pk_test" DETAIL:  Key (id)=(1) already exists.

    SELECT * FROM test;
    1;0;0

    COMMIT;
    WARNING:  there is no transaction in progress

    Note that the error message clearly indicates the offending row - this is very convenient.

    If we catch exceptions, the transaction has already rolled back.

    When we've caught an exception, we can still get some diagnostics information, but all the changes are gone. See for yourself:

    CREATE OR REPLACE FUNCTION Add_Test_Rows()
    RETURNS VARCHAR AS
    $BODY$
    DECLARE
        err_msg VARCHAR;
        err_detail VARCHAR;
        cnt INT;
    BEGIN
        -- this row inserts
        INSERT INTO test(ID, col2, col3)
        VALUES(2,0,0);
        -- this row violates the PK
        INSERT INTO test(ID, col2, col3)
        VALUES(1,0,0);
        RETURN 'Success';
    EXCEPTION
        WHEN OTHERS THEN
            GET STACKED DIAGNOSTICS err_detail = PG_EXCEPTION_DETAIL;
            cnt := (SELECT COUNT(*) FROM test);
            RETURN err_detail || ' Visible rows: ' || cnt;
    END;
    $BODY$
    LANGUAGE plpgsql VOLATILE;

    SELECT Add_Test_Rows();

    "Key (id)=(1) already exists. Visible rows: 1"
      

    As we have seen, only one row is visible when we catch the exception: the row that existed before the transaction started.

  • Learning PostgreSql: No Nested Transactions, No Transactions in PL/pgSQL

    Transactions and error handling in PostgreSql are substantially different. We shall discuss only the features that we needed to learn to complete our project. It will take us more than one post to describe them all.

    TL;DR: Transactions and error handling in PostgreSql are much simpler, but there are enough features to implement robust error handling without too much learning.

    No Nested Transactions

    We do not use nested transactions in SQL Server, because we have never seen any practical need for them. From my perspective, the lack of nested transactions in PostgreSql simply means that we have less to learn and less to be careful about.

    The following script demonstrates the lack of nested transactions in PostgreSql:

    CREATE TABLE test(ID INT NOT NULL, Some_Value VARCHAR);
    BEGIN TRANSACTION;
    INSERT INTO test(ID, Some_Value)
    VALUES(1, 'After the first BEGIN TRANSACTION');

    In SQL Server, we could issue another BEGIN TRANSACTION, and it would do nothing except increment @@TRANCOUNT. Instead, the following command raises a warning:

    BEGIN TRANSACTION;
    WARNING:  there is already a transaction in progress

    At this point, this table has no committed rows - we can select from another session and see for ourselves.

    The following script does not just decrement @@TRANCOUNT - the COMMIT actually commits, as we can see from another session:

    INSERT INTO test(ID, Some_Value)
    VALUES(2, 'After the second BEGIN TRANSACTION');

    COMMIT;

    Should we run the following script on SQL Server, all the inserted rows would be rolled back, and the table would be empty. On PostgreSql, however, there is no outstanding transaction - it has just been committed. So the INSERT runs as a standalone transaction, and the ROLLBACK fails because there is no outstanding transaction:

    INSERT INTO test(ID, Some_Value)
    VALUES(3, 'After the first COMMIT');

    ROLLBACK;
      

    --NOTICE:  there is no transaction in progress  
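
    For comparison, here is a hedged T-SQL sketch of the nested BEGIN TRANSACTION behavior mentioned above (assumed, for illustration only):

    BEGIN TRANSACTION;
    SELECT @@TRANCOUNT;  -- 1
    BEGIN TRANSACTION;   -- no new transaction is started, the counter is just incremented
    SELECT @@TRANCOUNT;  -- 2
    COMMIT;              -- only decrements the counter, nothing is committed yet
    COMMIT;              -- now the outer transaction really commits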

    As we have seen, one and the same SQL may run differently on PostgreSql and SQL Server, because BEGIN TRANSACTION, COMMIT, and ROLLBACK are interpreted differently.

    We Cannot Begin/End Transactions in PL/pgSQL

    If we wrap DML in a function and invoke that function, the function call of course runs in a transaction. Because there are no nested transactions, it makes no sense to allow BEGIN TRANSACTION and COMMIT in functions. The following script results in a runtime error:

    CREATE OR REPLACE FUNCTION InsertTest(p_ID INT, P_Value VARCHAR)
    RETURNS VOID AS
    $BODY$
    BEGIN
        INSERT INTO test(ID, Some_Value)
        VALUES(p_ID, P_Value);
        COMMIT;
    END;
    $BODY$
    LANGUAGE plpgsql VOLATILE;

    BEGIN TRANSACTION;
    SELECT InsertTest(p_ID := 5, P_Value := 'Added by InsertTest');
    COMMIT; 

    --ERROR:  cannot begin/end transactions in PL/pgSQL
      

    There is more to transactions and error handling. We shall continue later, so stay tuned.

  • Learning PostgreSql: UPDATE...FROM works differently

    The UPDATE...FROM command on PostgreSql may raise no errors, yet produce completely different results. Later I shall provide a repro, but first let us briefly review how UPDATE...FROM works on Sql Server.

    UPDATE...FROM on SQL Server ignores ambiguity

    We shall need the following test data:

    CREATE TABLE #Problems(
    ProblemID INT NOT NULL, 
    Problem VARCHAR(30) NOT NULL,
    Solution VARCHAR(30) NULL);

    CREATE TABLE #SuggestedSolutions(
    SuggestedSolutionID INT NOT NULL,
    ProblemID INT NOT NULL, 
    IsAccepted CHAR(1) NULL,
    Solution VARCHAR(30) NOT NULL);

    INSERT INTO #Problems
            ( ProblemID, Problem, Solution )
    VALUES  ( 0, 'Washer won''t drain', NULL),
            ( 1, 'Kitchen looks dirty', NULL);

    INSERT INTO #SuggestedSolutions
            ( SuggestedSolutionID, ProblemID, IsAccepted, Solution )
    VALUES  (0, 0, 'Y', 'Rebalance the load'),
            (1, 0, 'N', 'Turn washer off then on'),
            (2, 1, 'N', 'Turn off the light'),
            (3, 1, 'Y', 'Clean up the kitchen');

    While we are at it, let us also make sure that at most one proposed solution per problem can be accepted:

    CREATE UNIQUE INDEX OneSolutionPerProblem 
    ON #SuggestedSolutions(ProblemID) 
    WHERE (IsAccepted='Y');

    The following update ignores the ambiguity: two proposed solutions match each problem, and the database engine picks one of them to use. It does not raise any errors. I have no idea how the engine chooses the value to update with in case of ambiguity, so the output on your server may be different:

    UPDATE #Problems SET Solution = s.Solution
    FROM #Problems AS p, #SuggestedSolutions AS s 
    WHERE p.ProblemID = s.ProblemID;

    SELECT * FROM #Problems;

    0  Washer won't drain   Rebalance the load
    1  Kitchen looks dirty  Turn off the light

    We can narrow down the WHERE clause, so that there is no ambiguity at all, and the results of UPDATE are predictable:

    UPDATE #Problems SET Solution = s.Solution
    FROM #Problems AS p, #SuggestedSolutions AS s 
    WHERE p.ProblemID = s.ProblemID AND s.IsAccepted='Y';

    SELECT * FROM #Problems;

    0  Washer won't drain   Rebalance the load
    1  Kitchen looks dirty  Clean up the kitchen
      

    As we have seen, SQL Server interprets the FROM clause in this UPDATE as a correlated subquery - if it uniquely identifies a matching row, we can predict the results of UPDATE command.

    PostgreSql interprets the same UPDATE...FROM differently

    Let us set up test data again:

    CREATE TEMP TABLE Problems(
    ProblemID INT NOT NULL,
    Problem VARCHAR(30) NOT NULL,
    Solution VARCHAR(30) NULL);

    CREATE TEMP TABLE SuggestedSolutions(
    SuggestedSolutionID INT NOT NULL,
    ProblemID INT NOT NULL,
    IsAccepted CHAR(1) NULL,
    Solution VARCHAR(30) NOT NULL);

    CREATE UNIQUE INDEX OneSolutionPerProblem
    ON SuggestedSolutions(ProblemID)
    WHERE (IsAccepted='Y');

    INSERT INTO Problems
            ( ProblemID, Problem, Solution )
    VALUES  ( 0, 'Washer won''t drain', NULL),
            ( 1, 'Kitchen looks dirty', NULL);

    INSERT INTO SuggestedSolutions
            ( SuggestedSolutionID, ProblemID, IsAccepted, Solution )
    VALUES  (0, 0, 'Y', 'Rebalance the load'),
            (1, 0, 'N', 'Turn washer off then on'),
            (2, 1, 'Y', 'Turn off the light'),
            (3, 1, 'N', 'Clean up the kitchen');
    Note: OneSolutionPerProblem would be called a filtered index in the SQL Server universe. In PostgreSql, it is a partial index.

    Let us rerun the second update, which produced predictable results on SQL Server:
    UPDATE Problems SET Solution = s.Solution
    FROM Problems AS p, SuggestedSolutions AS s
    WHERE p.ProblemID = s.ProblemID AND s.IsAccepted='Y';

    SELECT * FROM Problems;

    0;"Washer won't drain""Rebalance the load" 1;"Kitchen looks dirty""Rebalance the load"
    See how the second problem was updated with a solution for the first one? 
    The reason is simple: PostgreSql interprets the FROM clause of this UPDATE as the following uncorrelated subquery:
    SELECT s.Solution  FROM Problems AS p, SuggestedSolutions AS s 
    WHERE p.ProblemID = s.ProblemID AND s.IsAccepted='Y';
    So every row in its output is a potential match for every row in Problem, which is why we are getting these results.
    To demonstrate this behavior one more time, let us add a problem without any suggested solutions, and rerun the update:
    INSERT INTO Problems
            ( ProblemID, Problem, Solution )
    VALUES  ( 2, 'Fridge is empty', NULL);

    UPDATE Problems SET Solution = s.Solution
    FROM Problems AS p, SuggestedSolutions AS s
    WHERE p.ProblemID = s.ProblemID AND s.IsAccepted='Y';

    SELECT * FROM Problems;

    0;"Washer won't drain";"Rebalance the load"
    1;"Kitchen looks dirty";"Rebalance the load"
    2;"Fridge is empty";"Rebalance the load"
    Rewriting UPDATE...FROM

    The following script shows the correct way to do this update in PostgreSql:
    -- erase wrong values
    UPDATE Problems SET Solution = NULL;

    -- the correct UPDATE
    UPDATE Problems SET Solution = s.Solution
    FROM SuggestedSolutions AS s
    WHERE Problems.ProblemID = s.ProblemID AND s.IsAccepted='Y';

    SELECT * FROM Problems ORDER BY ProblemID;

    0;"Washer won't drain""Rebalance the load" 1;"Kitchen looks dirty""Turn off the light" 2;"Fridge is empty"""
    Conclusion

    To ensure correct results, all UPDATE...FROM commands need to be rewritten, because UPDATE...FROM is interpreted differently by PostgreSql.
     

     

  • Learning PostgreSql: replacing TOP and APPLY with LIMIT and LATERAL

    All SQL Server queries using TOP and/or APPLY need to be changed - PostgreSql uses completely different syntax.

    Replacing TOP with LIMIT

    The following script shows how to do that:

    CREATE TEMP TABLE Runs(
    State_Code VARCHAR,
    Run_Date DATE,
    Distance FLOAT,
    Description VARCHAR);

    INSERT INTO Runs(State_Code, Run_Date, Distance, Description)
    VALUES
    ('IL', '2013-11-16', 16.6, '2 loops in Wolf Road Woods'),
    ('IL', '2013-11-17', 8.4, 'Herrick Woods'),
    ('IL', '2013-11-19', 7.2, 'Chicago Lakefront');

    SELECT State_Code, Run_Date, Distance FROM Runs
    ORDER BY Run_Date DESC LIMIT 2;
      

    "IL";"2013-11-19";7.2
    "IL";"2013-11-17";8.4

    Replacing CROSS APPLY with LATERAL

    The following script shows how to do that:

    CREATE TEMP TABLE States(Code VARCHAR PRIMARY KEY);
    CREATE TEMP TABLE Runs(
    State_Code VARCHAR,
    Run_Date DATE,
    Distance FLOAT,
    Description VARCHAR);

    INSERT INTO States(Code)
    VALUES('IL'), ('WI');

    INSERT INTO Runs(State_Code, Run_Date, Distance, Description)
    VALUES
    ('IL', '2013-11-16', 16.6, '2 loops in Wolf Road Woods'),
    ('IL', '2013-11-17', 8.4, 'Herrick Woods'),
    ('IL', '2013-11-19', 7.2, 'Chicago Lakefront');

    SELECT s.code, r.Run_Date, r.Distance FROM States AS s
    CROSS JOIN LATERAL(SELECT Run_Date, Distance FROM Runs AS r
                       WHERE s.Code = r.State_Code
                       ORDER BY Run_Date DESC LIMIT 2) AS r;
      

    ---------

    "IL";"2013-11-19";7.2
    "IL";"2013-11-17";8.4

    Replacing OUTER APPLY with LATERAL

    The following script shows how to do that:

    CREATE TEMP TABLE States(Code VARCHAR PRIMARY KEY);
    CREATE TEMP TABLE Runs(
    State_Code VARCHAR,
    Run_Date DATE,
    Distance FLOAT,
    Description VARCHAR);

    INSERT INTO States(Code)
    VALUES('IL'), ('WI');

    INSERT INTO Runs(State_Code, Run_Date, Distance, Description)
    VALUES
    ('IL', '2013-11-16', 16.6, '2 loops in Wolf Road Woods'),
    ('IL', '2013-11-17', 8.4, 'Herrick Woods'),
    ('IL', '2013-11-19', 7.2, 'Chicago Lakefront');

    SELECT s.code, r.Run_Date, r.Distance FROM States AS s
    LEFT JOIN LATERAL(SELECT Run_Date, Distance FROM Runs AS r
                      WHERE s.Code = r.State_Code
                      ORDER BY Run_Date DESC LIMIT 2) AS r ON TRUE
    ORDER BY s.Code, r.Run_Date;

    --------

    "IL";"2013-11-17";8.4
    "IL";"2013-11-19";7.2
    "WI";"";
     

  • Learning PostgreSql: differences in implementation of constraints

    Constraints in PostgreSql are implemented somewhat differently. To use them efficiently, there are quite a few details we need to be aware of.

    NULLs and uniqueness

    In PostgreSql, unique constraints allow multiple NULLs. This behavior follows the ANSI standard; SQL Server's implementation of unique constraints does not.

    The following example demonstrates how unique constraints allow multiple NULLs:

    CREATE TABLE public.test(
    ID INTEGER NOT NULL,
    CONSTRAINT PK_test PRIMARY KEY(ID),
    SomeValue INTEGER NULL,
    CONSTRAINT UNQ_test UNIQUE(SomeValue)
    );

    INSERT INTO public.test(ID, SomeValue)
    VALUES(1,NULL),(2,NULL);

    The insert works on PostgreSql. It fails on SQL Server. This can be a breaking change.

    Uniqueness and multi-row updates

    By default, PostgreSql verifies uniqueness after every row. As a result, the following statement fails, even though IDs would be unique at the end of the statement:

    UPDATE public.test
    SET    id = 3 - id;

    -- new IDs would be unique at the end of the statement
    SELECT 3 - ID FROM public.test;
    2
    1

    This is not ANSI standard, and this is a breaking change - in this case SQL Server behaves exactly as specified in ANSI standard, verifying uniqueness at the end of the statement.

    It is possible to create a PRIMARY KEY constraint that behaves as specified in ANSI standard:

    ALTER TABLE public.test DROP CONSTRAINT PK_test;

    ALTER TABLE public.test ADD CONSTRAINT PK_test
    PRIMARY KEY(ID) DEFERRABLE INITIALLY IMMEDIATE;

    -- now this update succeeds
    UPDATE public.test
    SET    id = 3 - id;

    By default, UNIQUE constraints behave exactly like PRIMARY KEY ones - they verify uniqueness one row at a time. If that is a problem, we can fix it in exactly the same way, as sketched below.
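
    Here is a short sketch of that same fix applied to the UNIQUE constraint (assumed, mirroring the PRIMARY KEY example above):

    ALTER TABLE public.test DROP CONSTRAINT UNQ_test;

    ALTER TABLE public.test ADD CONSTRAINT UNQ_test
    UNIQUE(SomeValue) DEFERRABLE INITIALLY IMMEDIATE;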

    The difference between DEFERRABLE INITIALLY IMMEDIATE and DEFERRABLE INITIALLY DEFERRED

    As we have already seen, DEFERRABLE INITIALLY IMMEDIATE constraints are verified at the end of each statement. As such, the following transaction fails:

    UPDATE public.test SET SomeValue=ID;
    BEGIN TRANSACTION;

    UPDATE public.test SET ID=2 WHERE SomeValue=1;
    UPDATE public.test SET ID=1 WHERE SomeValue=2;

    COMMIT;

    DEFERRABLE INITIALLY DEFERRED constraints verify at the end of transaction. Let us re-create our PRIMARY KEY:

    ALTER TABLE public.test DROP CONSTRAINT PK_test;

    ALTER TABLE public.test ADD CONSTRAINT PK_test
    PRIMARY KEY(ID) DEFERRABLE INITIALLY DEFERRED;

    ALTER TABLE public.test DROP CONSTRAINT UNQ_test;

    That done, we can rerun the transaction, and it will succeed. This is a great feature - it allows us to move from one valid state to another valid state in more than one DML command. In other words, we have less need to write complex, monster modifications - instead, we can write several simpler, easier to understand ones.

    Using deferrable constraints

    Deferrable constraints allow simple solutions to some very common problems. For example, we can easily guarantee that every Order in our system has at least one OrderItem. This is a very common requirement, and we can easily implement it:

    CREATE TABLE public.Orders(
    order_id INT NOT NULL,
    has_order_item_number INT NOT NULL,
    some_data VARCHAR,
    CONSTRAINT PK_orders PRIMARY KEY(order_id)
    );

    CREATE TABLE public.Order_Items(
    order_id INT NOT NULL,
    CONSTRAINT FK_order_items_orders
        FOREIGN KEY(order_id)
        REFERENCES public.orders(order_id),
    item_number INT NOT NULL,
    some_data VARCHAR,
    CONSTRAINT PK_order_items PRIMARY KEY(order_id, item_number)
    );

    ALTER TABLE public.Orders
    ADD CONSTRAINT FK_orders_order_items
        FOREIGN KEY(order_id, has_order_item_number)
        REFERENCES public.order_items(order_id, item_number)
        DEFERRABLE;

    Let us run a short test:

    BEGIN TRANSACTION;

    SET CONSTRAINTS FK_orders_order_items DEFERRED;

    INSERT INTO public.Orders(order_id, has_order_item_number, some_data)
    VALUES(1, 1, 'Some order data');

    INSERT INTO public.Order_Items(order_id, item_number, some_data)
    VALUES(1, 1, 'Some order item data');

    COMMIT;
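
    And here is a hedged negative test (assumed, not from the original post): an order inserted without any matching order item passes the individual statements, but the deferred foreign key is checked at COMMIT, so the transaction fails with a foreign key violation.

    BEGIN TRANSACTION;

    SET CONSTRAINTS FK_orders_order_items DEFERRED;

    INSERT INTO public.Orders(order_id, has_order_item_number, some_data)
    VALUES(2, 1, 'Order without any items');

    -- no matching row is added to public.Order_Items, so this COMMIT fails
    COMMIT;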

    As we have seen, deferrable constraints may be very useful.

  • Learning PostgreSql: Embracing Change With Copying Types and VARCHAR(NO_SIZE_NEEDED)

    PostgreSql 9.3 allows us to declare parameter types that match column types, aka copying types. It also allows us to omit the length of VARCHAR columns, without any performance penalty. These two features make PostgreSql a great back end for agile development, because they make PL/pgSQL more resilient to change. Neither feature exists in SQL Server 2008 R2; I am not sure about later releases of SQL Server.

    Let us discuss them in more detail and see why they are so useful.

    Using Copying Types

    Suppose that we have the following table:

    CREATE TABLE MyData(ID SMALLINT, SomeValue REAL);
    INSERT INTO MyData(ID, SomeValue) VALUES(1, 1.23);

    The following two functions do the same thing. The second one uses copying types, and as such it will be more robust should the table structure change. We shall see that soon.


    CREATE OR REPLACE FUNCTION GetSomeValueById(p_ID INT) RETURNS TABLE(SomeValue REAL) AS
    $BODY$
    BEGIN
        RETURN QUERY SELECT MyData.SomeValue FROM MyData
                     WHERE ID = p_ID;
    END;
    $BODY$
    LANGUAGE plpgsql;

    CREATE OR REPLACE FUNCTION GetSomeValueById_Robust(p_ID MyData.ID%TYPE) RETURNS TABLE(SomeValue MyData.SomeValue%TYPE) AS
    $BODY$
    BEGIN
        RETURN QUERY SELECT MyData.SomeValue FROM MyData
                     WHERE ID = p_ID;
    END;
    $BODY$
    LANGUAGE plpgsql;

    Let us change the table structure, as follows:


    ALTER TABLE MyData ALTER COLUMN ID Type NUMERIC(10,2);
    ALTER TABLE MyData ALTER COLUMN SomeValue Type FLOAT;

    INSERT INTO MyData(ID, SomeValue) VALUES(1.34, 1.234567890101112);

    All we need to do to get the second function working is drop it and re-run its CREATE statement unchanged:

    DROP FUNCTION GetSomeValueById_Robust(MyData.ID%TYPE);

    The first function needs more work - we have to explicitly change all the hard-coded types:

    DROP FUNCTION GetSomeValueById(p_ID INT);

    CREATE OR REPLACE FUNCTION GetSomeValueById(p_ID NUMERIC) RETURNS TABLE(SomeValue FLOAT) AS
    $BODY$
    BEGIN
        RETURN QUERY SELECT MyData.SomeValue FROM MyData
                     WHERE ID = p_ID;
    END;
    $BODY$
    LANGUAGE plpgsql;
    As we have seen, copying types reduce the need for tedious, error-prone busywork after changes to the table structure.

    Using VARCHAR without explicitly specifying the size

    Theoretically, we might want to specify the length of VARCHAR columns, and use that length as a kind of CHECK constraint. However, in practice I usually do not do it - typically this approach is not good enough, for two reasons.

    First reason: usually we do not want silent truncation of values that are too long, as shown in the following example:

    DECLARE @ConfirmationCode VARCHAR(6);
    SET @ConfirmationCode = '1234567';
    SELECT @ConfirmationCode AS ConfirmationCode;

    ConfirmationCode
    ----------------
    123456

    Usually we want the code to blow up rather than silently truncate a wrong value. CHECK constraints do exactly what we want: blow up. This is why we typically prefer to use CHECK constraints to validate the length of a VARCHAR value.
    Second reason: changing column length may require changes to child tables and parameters. Should we have to change the length of ConfirmationCode in a parent table, for example, we would have to modify all the child tables that refer to ConfirmationCode. This process is slow and expensive. Similarly, we might have to comb through all our code, modifying all the occurrences of the parameter @ConfirmationCode VARCHAR(6). This is error-prone and inefficient. This is why in our SQL Server development we typically declare our VARCHAR columns wider than necessary, and enforce the length limit via a CHECK constraint.
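
    As an illustration, here is a hedged sketch of that approach in PostgreSql (the table and names are assumed): the column is an unsized VARCHAR, and the length limit is enforced by a CHECK constraint, so an over-long value errors out instead of being silently truncated.

    CREATE TABLE Confirmations(
        ConfirmationCode VARCHAR NOT NULL,
        CONSTRAINT CHK_confirmation_code_length
            CHECK (LENGTH(ConfirmationCode) <= 6)
    );

    -- fails with a check constraint violation instead of truncating
    INSERT INTO Confirmations(ConfirmationCode)
    VALUES('1234567');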

    In other words, in my practice the need to specify the length of VARCHAR columns is an inconvenience. In PostgreSql, this inconvenience is eliminated: we are advised not to specify the length of VARCHAR columns at all. This carries no performance penalty, as discussed here:
    http://people.planetpostgresql.org/dfetter/index.php?/archives/24-VARCHARn-Considered-Harmful.html
    and here:
    http://www.postgresql.org/docs/9.3/static/datatype-character.html

    Quoting from the last link: "While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead."

     

     

  • Learning PostgreSql: old versions of rows are stored right in the table

    PostgreSql features multi-version concurrency control, aka MVCC. To implement MVCC, old versions of rows are stored right in the same table. This is very different from what SQL Server does, and it leads to some very interesting consequences. Let us play with this a little bit, but first we need to set up some test data.

    Setting up.

    First of all, let us create a numbers table. Any production database must have it anyway:

    CREATE TABLE Numbers(i INTEGER);

    INSERT INTO Numbers(i) VALUES(0),(1);

    CREATE OR REPLACE FUNCTION PopulateNumbers()
    RETURNS VOID AS
    $BODY$
    DECLARE
        k INTEGER;
    BEGIN
        -- each pass doubles the row count; after 19 passes we have 2^20 = 1048576 numbers (0 .. 1048575)
        FOR k IN 1..19 LOOP
            INSERT INTO Numbers(i) SELECT i + POW(2,k) FROM Numbers;
        END LOOP;
    END;
    $BODY$
    LANGUAGE plpgsql
    ;

    SELECT PopulateNumbers();
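    As an aside, PostgreSql's built-in generate_series() could produce the same numbers in a single statement; the doubling loop above is shown mostly because it mirrors the way we usually build numbers tables in SQL Server. A sketch of the alternative setup:

    -- instead of the table and function above: the numbers 0 .. 1048575 in one statement
    CREATE TABLE Numbers AS SELECT generate_series(0, 1048575) AS i;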
      

    Next, let us use the numbers table to create another test table:

    CREATE TABLE public.TestUpdate AS
    SELECT i,
           2 AS SomeValue,
           CAST('Let''s add some space to make the table wider' AS VARCHAR) AS SpaceFiller
    FROM public.Numbers;

    ALTER TABLE public.TestUpdate ADD CONSTRAINT PK_TestUpdate PRIMARY KEY(i);

    Updates insert new versions of rows without modifying old ones.

    Whenever we update a row, we essentially insert a new version, and the old version stays in the same table. This behavior is completely different from SQL Server, and leads to many consequences we need to be aware of.

    For example, if we update all the rows in a table, it can grow twice as big. Let us see for ourselves.

    At this time the test table's size is 84 MB, as returned by the following script:

    select pg_size_pretty( pg_relation_size('public.TestUpdate'));

    Presumably the following query scans all the rows in the table:

    SELECT MIN(SomeValue), MAX(SomeValue) FROM public.TestUpdate;

    This query returns 2 for both MIN and MAX - this is how we populated our test table.

    EXPLAIN output clearly demonstrates that all 1M rows are scanned:

    EXPLAIN
    SELECT MIN(SomeValue), MAX(SomeValue) FROM public.TestUpdate;

    "  ->  Seq Scan on testupdate  (snip) rows=1048576 "

    Let us update all the rows in this table, as follows:

    BEGIN TRANSACTION;
    UPDATE public.TestUpdate SET SomeValue=i;

    Query OK, 1048576 rows affected (execution time: 5.912 sec; total time: 5.912 sec)

    After the update the test table's size is 169 MB, as returned by the following script:

    select pg_size_pretty( pg_relation_size('public.TestUpdate'));

    Both uncommitted new versions and old versions are stored in the same table, which is why it uses up twice as much storage.
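    If we want to see the live and dead row versions counted separately, the statistics collector exposes them. A quick check (the counters are approximate and are reported at transaction end, so they may lag behind what we have just done):

    SELECT n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'testupdate';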

    The following query, executed from the same connection, shows that although the table still has exactly 1M current rows, about 2M row versions are scanned: both the original data and the new versions:

    EXPLAIN
    SELECT MIN(SomeValue), MAX(SomeValue) FROM public.TestUpdate;

    (snip) Rows=2097055

    We can rerun this query from another connection, and see the original, unmodified values, with MIN and MAX both equal to 2:

    SELECT MIN(SomeValue), MAX(SomeValue) FROM public.TestUpdate;

    Yet if we EXPLAIN this query, we still see that about 2M row versions are scanned.

    Rollback is instantaneous.

    It took the database engine almost 6 seconds to modify the table. Yet when we roll back the transaction, the rollback completes immediately, because the original rows are still in the table.
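    The statement itself is just the standard command:

    ROLLBACK;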

    Query OK, 0 rows affected (execution time: 0 ms; total time: 0 ms)

    In SQL Server, one would expect the rollback to take considerable time, because the update took considerable time as well.

    The table does not shrink, however: its size is still 169 MB.

    The extra storage temporarily used by the rolled-back transaction cannot be reused yet

    At this time we have 1M current versions and roughly 1M old versions in the table. Let us update the table one more time:

    UPDATE public.TestUpdate SET SomeValue=i;

    The table size grows again, to 253 MB. Even though we have rolled back the update, the storage used by the rolled-back versions is not available for reuse yet, so new storage is allocated for the new versions.

    At this time, for every current row in the table we have two old versions. As a result, even though the table still has 1M rows, exactly as before, its size has roughly tripled.

    Using VACUUM to reclaim unused storage

    To reclaim the unused storage, we need to VACUUM the table, either explicitly or via the autovacuum background process. Let us do it explicitly:

    VACUUM public.TestUpdate;
    Query OK, 0 rows affected (execution time: 31 ms; total time: 31 ms)
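    As a side note, the VERBOSE variant of the same command reports how many dead row versions it found and how many it could remove, which can make experiments like this one easier to follow:

    VACUUM VERBOSE public.TestUpdate;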

    Having reclaimed the free space, we can add 1M rows, and the table size stays the same:

    INSERT INTO public.TestUpdate(i, someValue, SpaceFiller)
    SELECT i - 1048576, someValue, SpaceFiller
    FROM public.TestUpdate;

    Let us add 1M rows one more time, and the table size still stays the same, even though the table now has 3M rows:

    INSERT INTO public.TestUpdate(i, someValue, SpaceFiller)
    SELECT i - 2*1048576, someValue, SpaceFiller
    FROM public.TestUpdate
    WHERE i>=0;
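    We can verify this with the same size check we used earlier:

    select pg_size_pretty( pg_relation_size('public.TestUpdate'));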

    VACUUM and real execution costs

    Currently our table has 3M rows. Let us delete 2M rows:

    DELETE FROM public.TestUpdate WHERE i<0;

    Even though the changes are committed, SELECTs against our table still have to scan all 3M row versions:

    EXPLAIN
    SELECT MIN(SomeValue), MAX(SomeValue) FROM public.TestUpdate;

    "  ->  Seq Scan on testupdate  (snip) rows=3735300 "

    Let us VACUUM the table and rerun the EXPLAIN. It will show that fewer rows are scanned:

    (snip) rows=1591111

    Yet the number of scanned rows is still about 50% more than the 1M rows actually in our table. Clearly some space has not been reclaimed yet. VACUUM FULL is our friend: after it completes, EXPLAIN shows that exactly 1M rows are scanned.
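    For completeness, that last step looks like this; note that VACUUM FULL rewrites the whole table, so it takes a stronger lock and more time than plain VACUUM:

    VACUUM FULL public.TestUpdate;

    EXPLAIN
    SELECT MIN(SomeValue), MAX(SomeValue) FROM public.TestUpdate;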

    Conclusion

    We have seen several significant differences between PostgreSql and SQL Server, caused by the way MVCC is implemented in PostgreSql. As we continue migrating functionality to PostgreSql, we need to be aware of these profound differences, so that our modifications perform well against PostgreSql.

    On a side note, I am really impressed by the quality of PostgreSql online documentation.
     

     

  • Learning PostgreSql: overloading

    Functions in PL/PgSql can be overloaded, which is very different from T-SQL.

    Demonstrating overloading

    For example, the second CREATE FUNCTION in the following script does not replace the first function - it creates a second one:

    CREATE OR REPLACE FUNCTION public.GetQuoteOfTheDay(someNumber INTEGER)
    RETURNS VARCHAR AS
    $body$
    BEGIN
        RETURN 'Say my name.';
    END;
    $body$
    LANGUAGE plpgsql
    ;

    CREATE OR REPLACE FUNCTION public.GetQuoteOfTheDay(someNumber REAL)
    RETURNS VARCHAR AS
    $body$
    BEGIN
        RETURN 'I am the danger.';
    END;
    $body$
    LANGUAGE plpgsql
    ;

    Let us verify that two functions have been created:

    SELECT * FROM public.GetQuoteOfTheDay(1)
    UNION ALL 
    SELECT * FROM public.GetQuoteOfTheDay(1.2);

    The output shows both "Say my name." and "I am the danger.".

    This means that the function name alone is not enough to uniquely identify a function. As a result, we cannot drop a function unless we provide its signature. The following command fails:

    DROP FUNCTION public.GetQuoteOfTheDay;
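    If we do not remember the exact signatures, we can look them up in the catalog; a quick sketch using the built-in pg_proc and pg_get_function_identity_arguments() (unquoted names are stored in lower case):

    SELECT p.proname, pg_get_function_identity_arguments(p.oid) AS arguments
    FROM pg_proc p
    WHERE p.proname = 'getquoteoftheday';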

    Once we have provided the function's signature, we can drop it:

    DROP FUNCTION public.GetQuoteOfTheDay(INTEGER);

    Optional parameters

    Optional parameters in PL/PgSql also behave differently from T-SQL. In T-SQL we cannot add an optional parameter to a function without breaking all the code that uses it. Consider, for example, the following T-SQL function:

    CREATE FUNCTION dbo.TestOptionalParameters ( @i INT, @optional INT = 0 )
    RETURNS TABLE
    AS RETURN
    (   SELECT @i AS i,
               @optional AS Optional
    );
    GO

    On SQL 2008 R2, we cannot invoke it with only one parameter; we must provide both:

    -- this works
    SELECT * FROM dbo.TestOptionalParameters(1, 2);

    -- this does not compile
    SELECT * FROM dbo.TestOptionalParameters(1);

    Msg 313, Level 16, State 3, Line 2
    An insufficient number of arguments were supplied for the procedure or function dbo.TestOptionalParameters.

    This behavior is both counter-intuitive for C# developers and inconsistent with T-SQL stored procedures. To invoke a T-SQL function without supplying an optional parameter, we must explicitly pass the DEFAULT keyword:

    SELECT * FROM dbo.TestOptionalParameters(1, DEFAULT);

    As a result, if we add an optional parameter to a function, we still need to change all the code that invokes it. This makes T-SQL optional parameters less useful than they could be, less useful than optional parameters in other languages.

    The following script demonstrates the use of optional parameters in PL/PgSql:

    CREATE OR REPLACE FUNCTION public.EchoName(FirstName VARCHAR = 'Jesse', LastName VARCHAR = 'Pinkman')
    RETURNS VARCHAR AS
    $body$
    BEGIN
        RETURN LastName || ', ' || FirstName;
    END;
    $body$
    LANGUAGE plpgsql
    ;

    SELECT * FROM public.EchoName()
    UNION ALL 
    SELECT * FROM public.EchoName('Walter')
    UNION ALL 
    SELECT * FROM public.EchoName('Walter', 'White');

    This script outputs the following results: "Pinkman, Jesse", "Pinkman, Walter", "White, Walter".
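    On a side note, PL/PgSql functions can also be called with named notation (using := in this version of PostgreSql), which lets us skip an earlier optional parameter and supply only a later one. For example, the following call leaves FirstName at its default and returns "White, Jesse":

    SELECT * FROM public.EchoName(LastName := 'White');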

    A parameter's optionality does not change the function's signature

    For example, let us recreate the function which we dropped earlier:

    CREATE OR REPLACE FUNCTION public.GetQuoteOfTheDay(someNumber INTEGER)
    RETURNS VARCHAR AS
    $body$
    BEGIN
        RETURN 'Say my name.';
    END;
    $body$
    LANGUAGE plpgsql
    ;

    The following script will replace it:

    CREATE OR REPLACE FUNCTION public.GetQuoteOfTheDay(someNumber INTEGER = 0)
    RETURNS VARCHAR AS
    $body$
    BEGIN
        RETURN 'No more half-measures.';
    END;
    $body$
    LANGUAGE plpgsql
    ;

    When we select from the function, we shall get "No more half-measures." in both cases:

    SELECT * FROM public.GetQuoteOfTheDay(1)
    UNION ALL 
    SELECT * FROM public.GetQuoteOfTheDay();

    Parameter names are optional

    Theoretically, we do not have to provide parameter names at all. The following script creates a function that does exactly the same thing, although it is less readable:


    DROP FUNCTION public.EchoName(VARCHAR, VARCHAR);

    CREATE OR REPLACE FUNCTION public.EchoName(VARCHAR = 'Jesse', VARCHAR = 'Pinkman')
    RETURNS VARCHAR AS
    $body$
    BEGIN
        RETURN $1 || ', ' || $2;
    END;
    $body$
    LANGUAGE plpgsql
    ;
      

    We can rerun the following script, and get exactly the same results as before:

    SELECT * FROM public.EchoName()
    UNION ALL 
    SELECT * FROM public.EchoName('Walter')
    UNION ALL 
    SELECT * FROM public.EchoName('Walter', 'White');

    Npgsql driver disregards parameter names too

    The following code returns results even though neither parameter of the function being invoked has a name:

    [Test, Explicit]
    public void ReadFromProcWithOptionalParams()
    {
        // assumes an NUnit fixture with a ConnString field pointing at the test database
        using (var conn = new Npgsql.NpgsqlConnection(ConnString))
        {
            conn.Open();
            for (var i = 0; i < 3; i++)
            {
                // the transaction is never committed; we only read
                using (var tran = conn.BeginTransaction())
                using (var command = conn.CreateCommand())
                {
                    command.CommandText = "EchoName";
                    command.CommandType = CommandType.StoredProcedure;

                    if (i > 0)
                    {
                        // the client-side parameter name does not correspond to any server-side name here,
                        // because the function's parameters are unnamed
                        var firstName = new NpgsqlParameter("LastName", NpgsqlTypes.NpgsqlDbType.Varchar);
                        command.Parameters.Add(firstName);
                        command.Parameters[0].Value = "Hank";
                    }
                    if (i == 2)
                    {
                        var lastName = new NpgsqlParameter("FirstName", NpgsqlTypes.NpgsqlDbType.Varchar);
                        command.Parameters.Add(lastName);
                        command.Parameters[1].Value = "Schrader";
                    }

                    using (var dr = command.ExecuteReader())
                    {
                        while (dr.Read())
                        {
                            Console.WriteLine(dr.GetString(0));
                        }
                    }
                }
            }
        }
    }
      

    Even if the function's parameters do have names, Npgsql does not match on parameter names at all. Instead, it submits the first value to the first parameter, and so on. For example, let us restore the version of our function with named parameters:

    DROP FUNCTION public.EchoName(VARCHAR, VARCHAR);

    CREATE OR REPLACE FUNCTION public.EchoName(FirstName VARCHAR = 'Jesse', LastName VARCHAR = 'Pinkman')
    RETURNS VARCHAR AS
    $body$
    BEGIN
        RETURN LastName || ', ' || FirstName;
    END;
    $body$
    LANGUAGE plpgsql
    ;

    Now let us rerun ReadFromProcWithOptionalParams().

    Even though we are passing "Hank" to a parameter named "LastName", "Hank" is passed to the first parameter, as we can see from the output:

    Jesse, Pinkman
    Hank, Pinkman
    Hank, Schrader

    This is a breaking change. We shall have to change our client code in many places.

    Alternatively, we are considering changing Npgsql itself, so that it uses parameter names instead of ignoring them - that would make migrating our client code much easier. Because Npgsql is open source, changing it should not be a problem: we do not have to plead our case and, in the best-case scenario, wait months or years for the change. If needed, we can get it done soon.

    In the next post we shall briefly discuss PostgreSql's MVCC vs SQL Server's snapshot isolation. This is a huge topic, and we do not have room for a more detailed comparison right now.

     

     

     
