SQLblog.com - The SQL Server blog spot on the web

Maria Zakourdaev

  • Data Quake

    Data Quake. That's what it is.
    Dave Wells has just given this great definition, which clearly describes what has been happening in the data management world in recent years.
    I am greatly enjoying Dave's session today at the Enterprise Data World summit and couldn't resist writing down a summary.
    Everything that we did in the last decade is now considered wrong. We used to believe that application logic can run faster and do better if it sits inside the database layer. Now this architecture is considered a wrong choice. Same goes for data normalization or strong schemas. Some people even say that data warehouses are dead.
    We need to rethink everything. Data schemas used to be defined during the design phase. Now we define schema-on-read, after the data has been persisted. Good news - I have always believed this, and Dave has just confirmed it - there is no such thing as schema-less data. Even though we do not get to design the schema up front anymore, for Big Data we need to understand the schema from the existing data and define separate schema models for each use case. The same goes for data quality rules, which aren't generic anymore: we need to figure out the data quality rules for each use case. Even data governance is changing. We cannot govern the data anymore - it is getting out of our boundaries, out of our control. We can only govern what people do with the data.
    In the modern data world we have more data sources and more types of data, many more ways to organize and store data, more uses for data, and more data consumers that require fast, on-demand data delivery.
    The data management world is changing.
    The age of Data Warehousing and BI has come to an end.
    We are now in the age of Big Data and Data Lakes but they are slowly going away as well.
    And our future is approaching fast, bringing with it Data Catalogs, Data Hubs and Data Fabric concepts.
    I am still figuring out what all of this is and how it fits together.
    What a great speaker, an awesome session and tons of learning ahead of me.

    Yours,
    Maria


  • Read, Enrich and Transform Data with AWS Glue Service

    More and more companies are aiming to move away from managing their own servers and toward a cloud platform. Going serverless offers a lot of benefits, like lower administrative overhead and server costs. In a serverless architecture, developers work with event-driven functions that are managed by cloud services. Such an architecture is highly scalable and boosts developer productivity.

    AWS Glue is an ETL service that utilizes a fully managed Apache Spark environment. A Glue ETL job can clean and enrich your data and load it into common database engines inside the AWS cloud (EC2 instances or the Relational Database Service), or write files to S3 storage in a great variety of formats, including Parquet.

    I have recently published three blog posts on how to use the AWS Glue service when you want to load data into SQL Server hosted on the AWS cloud platform.

     

    1. Serverless ETL using AWS Glue for RDS databases

    2. Join and Import JSON files from s3 to SQL Server RDS instance Part 1 and Part 2

     

    Are you using AWS Glue? What do you think about serverless ETL?

    Yours

    Maria

     

  • Compare SQL Server, MySQL and PostgreSQL Features

    Have you ever wondered how MySQL and PostgreSQL differ from SQL Server? Take a look at the mapping that I have made for myself: https://www.mssqltips.com/sqlservertip/5745/compare-sql-server-mysql-and-postgresql-features/

    I would be happy to see comments posted for this blog and add more things to the mapping.

    Yours

    Maria 

  • Discover how SQL Server can use Python to access any NoSQL engine


    Many companies these days keep their data assets in multiple data stores. Many companies that I have worked at have used other database systems alongside SQL Server, such as PostgreSQL, Redis, Elasticsearch or Couchbase. There are situations when an application that uses SQL Server as its main database needs to access data from another database system. Some datastores have ODBC/JDBC drivers, so you can easily add a linked server; others do not.

    Want to learn how to access NoSQL platforms from SQL Server with Python? Read my article here: https://www.mssqltips.com/sqlservertip/5738/discover-how-sql-server-can-use-python-to-access-any-nosql-engine/
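
    For a flavor of what this looks like, here is a minimal, hypothetical sketch using SQL Server Machine Learning Services (it assumes SQL Server 2017+ with Python enabled and the requests package installed on the instance; the Elasticsearch host, index and document fields are placeholders, not something from the article):

    EXEC sp_execute_external_script
        @language = N'Python',
        @script = N'
    import requests, pandas as pd

    # hypothetical Elasticsearch endpoint and index
    resp = requests.get("http://my-elastic-host:9200/my-index/_search?size=10")
    hits = [h["_source"] for h in resp.json()["hits"]["hits"]]

    # whatever comes back as a DataFrame is returned to SQL Server as a result set
    OutputDataSet = pd.DataFrame(hits)
    ';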

    Happy NoSQLing

    Maria 

  • My AWS Aurora Postgres instance is 100% CPU. What’s next?

    Today I had to investigate a situation where my AWS Aurora PostgreSQL instance was at 100% CPU.

    I started searching for the root cause in the database statistics views. The pg_stat_activity view shows information about the current activity of each process, such as the client IP, the start time of the current transaction and wait information. There were no long-running transactions or long waits, and unfortunately this view does not have any counters for CPU or memory usage per process. Boom.
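
    For reference, a query along these lines is usually my starting point in pg_stat_activity (column names as of PostgreSQL 9.6+; adjust for your version):

    SELECT pid, client_addr, application_name, state,
           now() - xact_start AS xact_age,
           wait_event_type, wait_event,
           left(query, 60) AS current_query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY xact_start;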

    Another way to track down performance issues is to use the pg_stat_statements extension to see execution statistics. Each row in this view represents a query and provides information such as how many times the query was executed, its total execution time, the number of rows retrieved and so on. Again, no counters related to memory or CPU usage.

    Some use the query below (source here), which is based on the query's total_time, assuming that a query that runs longer uses more CPU.

    SELECT substring(query, 1, 50) AS query,
           round(total_time::numeric, 2) AS total_time,
           calls,
           rows,
           round(total_time::numeric / calls, 2) AS avg_time,
           round((100 * total_time / sum(total_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;
    

    I am not sure it’s true in all cases but this might point us to the root cause.

    I could have used the plperlu extension, which can show the percentage of CPU and memory used by a particular session, but it is not supported on AWS RDS.


    PostgreSQL is a process-based system; it starts a new process for each database connection. This is why you can see a connection's memory and CPU usage only through OS facilities.

    When using RDS, we have no access to the OS level and cannot run top to look at the processes.

    Today I discovered Enhanced Monitoring for RDS instances, which can monitor OS processes:

    After enabling this monitoring, we can choose OS process list in the drop-down below:

    Using the above, you can monitor all PostgreSQL processes and their resource consumption!

    Process 6723 held some sensitive information, so I had to blur it. The screenshot above was taken after the CPU peak was over, which is why the numbers are low.

    Now I can go back to pg_stat_activity, check which host and which application own the specific connection, and see the executed query and the wait information:

    select * from pg_stat_activity where pid = 6723
    

    Unfortunately, pg_stat_activity does not show the actual active statement, only the top-level one. And there is no way to join pg_stat_activity to pg_stat_statements to match the pid of the connection with the query history that we can see in pg_stat_statements.
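
    The closest workaround I can think of is to narrow pg_stat_statements down to the same role and database as the suspicious connection. It is an approximation, not an exact per-pid match:

    SELECT s.calls,
           round(s.total_time::numeric, 2) AS total_time,
           left(s.query, 80) AS query
    FROM pg_stat_statements s
    JOIN pg_stat_activity a
      ON a.usesysid = s.userid AND a.datid = s.dbid
    WHERE a.pid = 6723
    ORDER BY s.total_time DESC
    LIMIT 10;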

    We are halfway through our problem. We now know which processes are consuming CPU but do not really know which queries they have executed.

    However, since we have the hostname and the application behind the problematic connections, we can go to the application developers, review together the query patterns they execute and try to understand why their connections are so CPU intensive.

    I would appreciate hearing your thoughts on this

    Yours

    Maria

  • My #AWSsummit keynote liveblog

    The company I work for, Perion, chose Amazon Web Services as the main cloud provider for managing and operating our applications. These days I am learning how to manage databases and data related flows in the AWS cloud.

    Which is the reason I have attended the AWS summit in Tel Aviv today.

    Today's AWS summit was the first in a series of AWS summits held in major cities around the world. There were ~5,000 attendees, 25 technical sessions, and the event was sponsored by 33 AWS partners. Most of the sessions I attended had a lot of sponsor content mixed with not-too-deep technical dives into various AWS services.

    Sessions were divided into 5 tracks, led by AWS Solution Architects and sponsor representatives:

    • Big data & Analytics sessions on data ingestion, data transformation and data analysis using AWS services
    • AI and Deep Learning – sessions on AWS offerings for data enrichment and learning models
    • Innovate – Rapid Application Development services
    • Cloud operations – sessions on DevOps methodologies and tools
    • Let's start - for those who are starting their journey into the AWS cloud

    I really enjoyed the keynote by Dr. Werner Vogels, Amazon CTO. The keynote hall was already fully packed when I arrived, and I had to watch the keynote video stream from the second hall. The sound was awesome, but the person operating the camera wasn't great at his job: half of the slides could not be seen because the focus stayed on the speaker the whole time.

    Dr. Werner talked about the AWS revenue run rate, which is near $20 billion (this forbes.com article says $16 billion, Microsoft has $16.7 billion, IBM $15.8 billion). AWS has millions of active customers, and their pace of innovation (the number of new services and features per year) keeps growing at a very high rate:

    • 24 during 2008
    • 48 during 2009
    • 61 during 2010
    • 82 during 2011
    • 159 during 2012
    • 280 during 2013
    • 516 during 2014
    • 722 during 2015
    • 1017 during 2016
    • 1403 during 2017

    For 7 consecutive years, Amazon has been named a Leader in Gartner's Infrastructure as a Service (IaaS) Magic Quadrant.

    Dr. Werner talked about how infrastructure and software vendors used to tell us how to develop software. These days, in the cloud, we have as many tools as we need to tailor our development to our own needs.

    Modern applications are continuously evolving. They serve more than web content, they use a microservices architecture and are API-based. Everything is programmable; many applications have smart connectivity, push notifications and responsive design. We used to have 1 deployment a week; now we have 50 deployments a day.

    There were many awesome examples of AWS customers during the keynote. One of them was the cloud infrastructure behind iRobot:

     

    (Image source : https://aws.amazon.com/solutions/case-studies/irobot/)

    Dr. Werner mentioned how Netflix is making use of Machine Learning to build smarter applications: 75% of Netflix videos are watched following a recommendation made by the recommendation engine. He also mentioned Pinterest, one of the world's largest visual bookmarking tools, where all images go through a Machine Learning process to improve visual search. Other cool examples were boxing gloves that measure punch speed and intensity, a helmet that analyses the pilot's emotions, and soccer and baseball video ingestion that uses Machine Learning to help coaches understand how the team can do better in the next game.

    Machine Learning can be used in many more use-cases:

    • Shopping advice
    • Customer problem detection
    • Product modeling
    • Product substitutes
    • Customer adoption models
    • Sales lead ranking

    Dr. Werner talked about modern data architectures and their 10 main characteristics: reliable data ingestion, preservation of source data, lifecycle management, metadata capture, security, self-service discovery, data quality, analytics, orchestration and capturing data changes. He mentioned that 60,000+ databases have already been migrated into the AWS cloud from private datacenters. He also talked about Aurora, the fast-growing database service: Aurora data is replicated across three availability zones, and capacity can auto-scale up and down. For example, Expedia runs 4,000 writes and 5,000 reads per second against its Aurora database.

    I will skip the keynote part where Dr. Werner talked about rapid application development, serverless architecture, code services, Kubernetes and application services - that is not my cup of tea.

    There was an awesome demo by Trax, a company that uses innovative image recognition technology to help retail stores keep their shelves in order, and uses Artificial Intelligence to unlock business opportunities. As the next stage of their solution, they are planning to use robots to take the pictures inside stores, which is very cool.

    Another great demo was about Flytrax, the first drone-based food delivery service. Drones are super-fast (they can fly up to 60 km/h), affordable, not impacted by traffic, and can easily reach almost any location within a few minutes instead of a drive of a few hours. Unfortunately, this service is not available in Israel yet.

    Great keynote, #awssummit !

    Yours,

    Maria

  • Query big data, stored in a csv or parquet files, from any stored procedure

    Every cloud provider these days has a serverless interactive query service that uses standard SQL for data analysis.

    From the biggest cloud providers, we have Azure Data Lake Analytics, Google BigQuery and Amazon Athena.

    Since my company uses only the Amazon cloud, I have some experience with the Athena service, and I must say that it is awesome.

    Today I will show you how you can use Management Studio or any stored procedure to query data stored in a CSV file located on S3 storage. I am using the CSV format as an example here; the columnar Parquet format gives much better performance.

    I am going to:

    1. Put a simple CSV file on S3 storage

    2. Create an external table in the Athena service over the data file bucket

    3. Create a linked server to Athena inside SQL Server

    4. Use OPENQUERY to query the data.

    The Athena service is built on top of Presto, a distributed SQL engine, and uses Apache Hive to create, alter and drop tables. You can run ANSI SQL statements in the Athena query editor, launched from the AWS web console. You can use complex joins, window functions and many other great SQL language features. Using Athena eliminates the need for ETL because it projects your schema onto the data files at query time.

    Let's create a database in the Athena query editor.

    image
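
    The statement behind that screenshot is roughly the following (the database name is just an example):

    CREATE DATABASE IF NOT EXISTS athena_demo;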

    As a next step I will put this CSV file on S3. By the way, Athena supports the JSON, TSV, CSV, Parquet and Avro formats.

    Uploading the file below to the S3 bucket (don't put a column header row in the file):

    image

    image

    Going back to Athena to create an external table over the S3 folder. You can later add more files to the same folder and your queries will return the new data immediately.

    image
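
    For reference, the DDL looks roughly like this; the column names, types and the S3 bucket path are illustrative and should match your own file:

    CREATE EXTERNAL TABLE athena_demo.sales (
        id        INT,
        sale_date STRING,
        customer  STRING,
        amount    DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION 's3://my-demo-bucket/sales/';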

    Now I can query the data:

    image

    image
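
    Any ANSI SQL works at this point, for example (still using the illustrative table from above):

    SELECT customer, SUM(amount) AS total_amount
    FROM athena_demo.sales
    GROUP BY customer
    ORDER BY total_amount DESC;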

    As a next step I will set up a linked server from my SQL Server instance, because I would like to offload the big data querying to Athena. Of course, in this example I am using a tiny data file, but in real life we sometimes query 300 GB of data files in a single query and it takes only a few seconds.

    Athena has an ODBC driver; I will install it on the SQL Server machine (an AWS EC2 instance in this example).

    Here is an installation link: https://s3.amazonaws.com/athena-downloads/drivers/ODBC/Windows/Simba+Athena+1.0+64-bit.msi

    Setting up the ODBC connection. Important: click on the Authentication Options and fill in an AccessKey and SecretKey that have permissions to access the S3 bucket. The S3 output location below will hold CSV files with the results of your queries; remember to clean up the output files from time to time.

    image

    What is left is to set up a linked server inside Management Studio using the OLE DB provider for ODBC (MSDASQL).

    EXEC master.dbo.sp_addlinkedserver @server = N'DWH_ATHENA', @srvproduct=N'', @provider=N'MSDASQL', @datasrc=N'DWH_ATHENA'

    GO

    EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'DWH_ATHENA',@useself=N'False',@locallogin=NULL,@rmtuser=N'*******',@rmtpassword='*********'

    GO

    Replace @rmtuser and @rmtpassword with the AWS access key and secret key, and now we can query the data files from any script or stored procedure.

    There is one very important thing that you need to know. SQL Server's default behavior when querying an ODBC linked server is to send "select *" to the linked server and do the filtering inside SQL Server. This is very bad for us, since we want to offload all the work to Athena and do not want to receive all the data. The way to overcome this is to use OPENQUERY.

    Here is an example of a query that uses the linked server. The remote query omits all filtering and retrieves ALL columns from the remote table; the filter is applied later, inside the "Filter" step:

    image

    The same query that is using OPENQUERY instead of linked server:

    image

    Result.

    image
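
    To recap, the two variants look roughly like this in T-SQL (table and column names are illustrative, and the exact four-part naming depends on how the ODBC driver exposes the Athena catalog):

    -- Linked server with four-part naming: "select *" is sent to Athena and the filter runs locally
    SELECT customer, amount
    FROM DWH_ATHENA...sales
    WHERE sale_date >= '2018-01-01';

    -- OPENQUERY: the whole statement, including the filter, is executed on the Athena side
    SELECT *
    FROM OPENQUERY(DWH_ATHENA,
         'SELECT customer, amount
          FROM athena_demo.sales
          WHERE sale_date >= ''2018-01-01''');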

     

    Isn't it wonderful to be able to keep the raw data in files and query it with minimal effort from the SQL Server side?

    Yours,

    Maria

  • How would you do dynamic pivot SQL in Amazon Redshift? (This elegant script might be used in many other databases as well)

     

    Several days ago I spent a few hours of my life figuring out how to do a dynamic pivot in Amazon Redshift. To tell you the truth, I expected much more from this DBMS's SQL dialect.

    Redshift is based on PostgreSQL 8.0.2 (which was released in 2005!).

    Anything you would want for this not-too-difficult task does not exist. No stored procedures. No JSON datatype. No variables outside of UDFs, no queries inside UDFs ("a UDF can be used to calculate values but cannot be used to call SQL functions"). Python UDFs also cannot query the data, only perform calculations.

    Finally I found one useful function, LISTAGG, that helped me get the distinct values of all the pivoted columns.

    For each group in a query, the LISTAGG aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string.

    Syntax


    LISTAGG( [DISTINCT] aggregate_expression [, 'delimiter' ] ) [ WITHIN GROUP (ORDER BY order_list) ]

    As a next step, I used a Python UDF to dynamically build the pivoting query (I love Python more and more, it is such a nice language :-) ). In addition, I used a Common Table Expression, CTE (yes, they have it!), to pass the list of pivoted columns to the Python UDF.

     

    The last step, executing the generated SQL query, I am still trying to figure out - there is no dynamic query execution in the Redshift SQL dialect. But my customer can execute this query from a reporting tool.

     

    Here is a setup and all queries.

     

    Creating a table: 

    create table maria_deleteme ( id int, year int, customer varchar, amount int);

     

    Loading some test values:

    insert into maria_deleteme values
    ( 12, 2010, 'customer1', 4),
    ( 2, 2010, 'customer1', 80),
    ( 3, 2010, 'customer2', 41),
    ( 4, 2011, 'customer1', 45),
    ( 5, 2011, 'customer2', 15),
    ( 6, 2011, 'customer3', 18),
    ( 7, 2012, 'customer1', 23),
    ( 8, 2012, 'customer1', 1),
    ( 9, 2012, 'customer1', 8),
    ( 10, 2012, 'customer3', 98),
    ( 11, 2013, 'customer2', 1);


    This query gives us a list of pivoted columns:

           select listagg(distinct year, ',') as years
           from maria_deleteme;


    The result will be :  2011,2013,2012,2010 

     

    Using CTE over the above query and passing its result to Python UDF:

    with vars
        as (
        select listagg(distinct year, ',') as years
        from maria_deleteme
        )
    select maria_pivoting ('maria_deleteme',years,'year','customer','amount')
    from vars;

     

    The above query will print out this:

    select   sum (case when year = 2010  then amount else 0 end) as "2010" ,
             sum (case when year = 2012  then amount else 0 end) as "2012" ,
             sum (case when year = 2011  then amount else 0 end) as "2011" ,
             sum (case when year = 2013  then amount else 0 end) as "2013" ,
             customer
    from maria_deleteme group by customer ;

     

    Which results in: 

    #  2010  2012  2011  2013  customer
    1     0    98    18     0  customer3
    2    41     0    15     1  customer2
    3    88    32    45     0  customer1

     

    Python UDF: 

    DROP FUNCTION maria_pivoting (varchar, varchar, varchar, varchar, varchar);

     

    CREATE FUNCTION maria_pivoting(tablename varchar, list_vals varchar, pivot_col varchar, groupby_col varchar, counted_col varchar )
    RETURNS varchar(4000) IMMUTABLE AS $$

        vals = list_vals.split(",")
        pivot_query = " select  "

        for value in vals:
            pivot_query = pivot_query + ' sum (case when {} = {}  then {} else 0 end) as "{}" ,'.format(pivot_col,str(value),counted_col,str(value))

        pivot_query = pivot_query + ' {} from {} group by {} ;'.format(groupby_col,tablename,groupby_col)

        return pivot_query
    $$ LANGUAGE plpythonu;


     

    I will be glad to hear what you think about the Redshift SQL dialect. (Of course, I know that this is a very powerful database for data processing. And VERY expensive.)

    Yours, Maria

  • Becoming a Multiplatform DBA.

    I wonder: if I had started my career as a PostgreSQL DBA or a MongoDB DBA, would it have been easier for me to accept the rapid growth and variety of data storage solutions and our new reality of polyglot persistence?

     

    Polyglot is a term that comes from the Ancient Greek, meaning speaking many languages. Polyglot persistence is about storing your data not only inside SQL Server but in multiple data storage technologies - whatever better suits your application's needs, or sometimes even a single application component.

     

    Be prepared that tomorrow or next month one of the developers will come up with some "other database" which (they will be 100% confident) will serve their application's needs better. In some situations they might be right. My own first natural reaction in those situations is to immediately start searching for "why not" arguments, to keep them in SQL Server. These days I try my best to hold back this reaction and allow them to try. If it works out - everyone will be happy. If not, everyone will learn.

     

    If you push yourself toward this change, you will soon realize that the learning curve is not that huge and you are not starting from zero. Data can be modeled in various forms - relational, document, or key-value - but you already understand how to deal with data and how to take care of it. And believe me or not, the concepts are quite similar.

     

    Of course, your developer is now convinced that he will not need a DBA with his brand new NoSQL technology. Surprise him. You, as a DBA, and Mr. Developer have a strong focus and interest in entirely different yet equally important aspects of the application infrastructure.

    You know that data must be backed up on some agreed schedule, and that from time to time it is beneficial to try and restore the backup to make sure it is usable - if the data is important and needs to be persistent, of course.

    After digging for a while you learn that many open-schema systems need someone to manage the schema anyway. Correct data types impact queries in Elasticsearch, Cassandra and other systems that I have worked with.

    Data modelling can be tricky, and an experienced DBA has a lot of added value there. Some queries might benefit from pre-aggregations.

    Monitoring your server is another huge topic. Monitoring free space, monitoring queries per second, events per second - trust me, Mr. Developer didn't think about all of this. He will be more than grateful to have you as a part of his project. He will re-learn to value the DBA.

     

    A DBA stays a DBA no matter which technology he is managing. Your experience is very valuable and you can apply it to any technology you meet on your way.

  • SQL Saturday on Monday

    “No, no! The adventures first, explanations take such a dreadful time.”
    ― Lewis Carroll. Alice's Adventures in Wonderland

     

    The idea of organizing a SQL Saturday is so exciting and scary at the same time. For me - this is my first SQL Saturday. For SQL Saturday - this is the first time in Israel.

    All my religious and not-so-religious friends are energetically making it clear to me that no one will show up on a Saturday. OK, no problem. We are focused and determined. We settle for the only day that has yet to host a SQL Saturday - Monday.

    From here, "SQL Saturday on Monday" starts to take shape. My two friends, Geri Reshef and Itai Binyamin, allow themselves to get involved in this risky adventure.

    We are enthusiastically brainstorming and discussing the details. We are discussing a possible venue. We are discussing the potential speakers. We are discussing how to get some sponsorship.

    We are meeting someone who is temporarily filling the product marketing position at Microsoft. He tells us that he will help us with everything he can. He looks reliable and trustworthy. He books 2 conference rooms for us and promises a winter rain cover for the open terrace intended for lunch. He likes our beer party idea and is certain that together we will set up an awesome event. A few weeks later he disappears. Like Bilbo Baggins at his birthday party. But he works at Microsoft, right? Nothing can go wrong if you work with Microsoft, right?

    The event planning goes smoothly forward. 13 speakers are chosen out of 90 terrific submissions, 3 of them MVPs from overseas. Two generous sponsors, Xplenty and Pyramid Analytics, make up their minds to be a part of our SQL Saturday and we are extremely excited. Valinor, a project consultation company, agrees to sponsor the beer party at the end of the day. Everything appears to us so flawless and faultless: we visualize breakfast on a sunny deck, accompanied by music from the SQL DJ, our attendees mingling with sweet warm pastry in one hand and the aroma of coffee lingering in the air. Attendees are studying the schedule, discussing how hard it is to decide which session to attend because all of them are so appealing. A big crowd surrounds the SQL Clinic during the breaks, where top SQL problem solvers answer the attendees' questions. A beer party at the end of the day, where everyone discusses how awesome the PASS organization is, how great the idea was to arrange a SQL Saturday, and of course everyone hopes there will be another one next year.

    One month prior, the event plans fall under serious risk. After many unanswered emails to Microsoft, we are finally allowed to meet our new contacts who will be responsible for the events. Wait, what do they say? The venue will not tolerate any music, alcohol is not allowed, and the patio will not have a winter cover, which means there will be no place for lunch. But don't worry! They have a much better solution for us. Maria, please don't panic. Listen carefully. There is another place for the event, much better. It has a very big auditorium and a fancy entrance area, however somewhat limited in space. Sponsors don't really need a table, they can hang out in the corner with a small rollup, can't they? However, we can bring in the beer and have music. Isn't it a great idea? We can do 30-minute sessions to squeeze the agenda of the 2 planned tracks into it. This will be a very cool event. Luckily, my peaceful and self-confident friend Geri is here with me. He nicely stops my hysterical remarks and thanks them for their efforts. I am drinking Valerian Root to calm my nervous system.

    Patience is the best solution to most troubles. Somehow the things around us slowly get sorted out. The game room, located near the auditorium, once declared closed to sponsors and eating guests, can now serve as a second track room where people can also eat and play the piano. Microsoft extraordinarily generously offers us Platinum sponsorship, and now we can have a really tasty "Speaker & Volunteer Dinner" on the day before the event, buy T-shirts and fun attendee bags.

    One day before the event, at 10 am in the morning, I receive an email that the event is cancelled because we haven't correctly filled in some insurance form for the catering. I am enjoying my second heart attack, but a few hours later we learn that we didn't actually need this form. My next heart attack comes at 5 pm in the evening, when we are told that alcohol is not allowed in the venue. Half an hour later we are notified that it is allowed. But a few minutes after the alcohol approval, more information comes in: beer bottles are OK, but the beer cart with draft beer cannot enter the venue without that incorrectly filled insurance form. I am drinking rum mixed with the Valerian Root. Around 9 pm, during the "Speaker Dinner", we receive our final approvals. We are still not 100% sure that the event will take place.

    And THE DAY finally arrives. We seem to be out of the mud at last. Everything works out just as it should. I never realized how stressful the work at the registration desk is. We are learning the art of finding the name tags for arriving guests, every time missing something important. One guest is not marked as present in the list. Another reminds us that he hasn't received an attendee bag. A third gets offended that we didn't remember his name. Itai, while the venue owners are not looking, is barbarically sticking all kinds of instructions to the brightly white walls using scotch tape. Geri exists simultaneously in two rooms and waves a "session is over" sign at the enthusiastic speakers. SQL Doctors in green surgical gowns and caps, wearing toy stethoscopes, dissect attendees' SQL problems. The amazing Steve Simon, Lenni Lobel and Peter Larsson, who flew across half of the globe, make this event special and unique. The food is delicious. Despite the fact that the amounts were calculated for 200+, our 160 attendees work really hard and do a great job of devouring all of it. And, of course, everyone discusses how awesome the PASS organization is, how great the idea was to arrange a SQL Saturday in Israel, and everyone hopes there will be another one next year.

    Now, I have a question. Is there such a sickness as being addicted to organizing SQL Saturdays?

  • Why would the system process lock the transaction log of the user database?

     

    Yesterday I noticed that the log file for one of the user databases had grown much bigger than it should have. First of all, I went to check the sys.databases catalog view. The reason for the log growth appeared to be an active transaction, which is quite a common cause. For some reason, log reuse for the master database (database_id = 1) was also waiting on an active transaction.

    select database_id,log_reuse_wait_desc from sys.databases

    image

    Further investigation revealed that the oldest active transaction was held by the SIGNAL_HANDLER system user, session 8, which held a lock on the system table sysobjvalues in both databases; the transaction had started about a week earlier, when the server was restarted. I used DBCC OPENTRAN to get the oldest transaction in the database, together with the sys.dm_exec_requests, sys.dm_tran_active_transactions and sys.dm_tran_locks DMVs and the sys.partitions catalog view:

    -- find the oldest open transaction in the database

    DBCC OPENTRAN

    image

     

    -- what the session 8 is, how many open transactions it holds and what is the transaction id

    select * from sys.dm_exec_requests

    image

    -- transaction details

    select * from sys.dm_tran_active_transactions

    where transaction_id = 87868

    image

    -- resources held by the transaction, resource_associated_entity_id can show the object id or partition id 

    select * from sys.dm_tran_locks

    where request_session_id = 8

    image

    -- object id and name for the partition held by the transaction

    select object_id,object_name(object_id)

    from sys.partitions

    where partition_id = 281474980642816

    image

    The sys.sysobjvalues table can be accessed only from a DAC connection; otherwise you will get this error message:

    image

    From DAC connection:

    image
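
    A minimal sketch of what that looks like: connect through the dedicated admin connection (prefix the server name with admin: in Management Studio, or use sqlcmd -A) and the base table becomes visible:

    SELECT TOP (20) *
    FROM sys.sysobjvalues;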

    This table stores encrypted and unencrypted object values and properties. I saw linked server names and connection strings there, the security audit path and many other interesting things. I didn't really need this table in order to fix the issue; I just wanted to understand who might be using it.

    Who could change system objects or their properties at the time of a server restart? There should be some upgrade process, but I knew that we had applied the last service pack about a month ago.

    After looking at the SQL Server Error Log, I clearly saw what session 8 was doing right after the server restarted: it was executing the upgrade scripts. Looking back through the error log entries, I saw that every time the SQL Server service was restarted, it tried to run this upgrade script and failed on every run. The marked-out database in the screenshot below is the database with the log growth issue. We hadn't noticed this error message, the server worked as usual, and because this user database is not heavily used we hadn't noticed its log growth earlier.

    image

    The sp_vupgrade_replication stored procedure had failed because it couldn't find the MSreplication_subscriptions table, and after the failure the table locks were never released. The MSreplication_subscriptions table, in fact, should not be there: this table is usually located in the subscriber database, and my problematic database was a publisher, not a subscriber. Three other objects, MSreplication_objects, MSsubscription_agents and MSsubscription_properties, should also not be there; they too are part of the subscriber metadata. The reason they are there is that in the past this database was a subscriber which was promoted to a publisher; most probably the subscription objects were not cleaned out properly.

    image

    As I was not quite sure how to clean up the old subscription leftovers, I went ahead and created the table. Here is the script, so you don't need to search for it if you happen to hit the same issue. After creating the table, I executed the sp_vupgrade_replication procedure to make sure it works properly, and it ran successfully.

    SET ANSI_NULLS OFF

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    SET ANSI_PADDING ON

    GO

     

    CREATE TABLE [dbo].[MSreplication_subscriptions](

           [publisher] [sysname] NOT NULL,

           [publisher_db] [sysname] NULL,

           [publication] [sysname] NULL,

           [independent_agent] [bit] NOT NULL,

           [subscription_type] [int] NOT NULL,

           [distribution_agent] [sysname] NULL,

           [time] [smalldatetime] NOT NULL,

           [description] [nvarchar](255) NULL,

           [transaction_timestamp] [varbinary](16) NOT NULL,

           [update_mode] [tinyint] NOT NULL,

           [agent_id] [binary](16) NULL,

           [subscription_guid] [binary](16) NULL,

           [subid] [binary](16) NULL,

           [immediate_sync] [bit] NOT NULL DEFAULT ((1))

    )

     

    GO

    SET ANSI_PADDING OFF

    GO

    I restarted the SQL Server service, the upgrade script started again and finally finished without any errors. The database statuses returned to their normal state. (At the beginning of the investigation, I had tried to set the problematic user database OFFLINE and then bring it back ONLINE. This solved the log reuse issue in the user database only; the master database, however, stayed locked, and after every SQL service restart the issue returned.)

    Only after finishing this investigation, and after the issue was resolved, did I finally find a bug on this issue documented here: http://support2.microsoft.com/kb/2509302. The article is not quite complete: the issue also happens with databases using the Simple recovery model. Also, the transaction name was simply "user_transaction", not what they claim in the article. Maybe this case was a somewhat different bug, but it is really close to what is described in the article.

    Yours,

    Maria

  • Rainbow in the Cloud

    Cloud technologies burst into our lives without asking our permission.
    At the beginning, many of us thought the only things they would bring would be thunder and rain.
    Time shows that, instead, they might even bring some color to our skies.

    Today, in the first-day keynote of the 16th PASS Summit, Microsoft representatives
    - Corporate VP Data Platform, Cloud & Enterprise T.K. "Ranga" Rengarajan,
    - General Manager of Power BI James Phillips,
    - Corporate VP Information Management & Machine Learning Joseph Sirosh
    took the stage one after another, talking about the Microsoft Data Platform and what it can offer in the cloud space.

    They talked about the explosion of all kinds of data devices. Those devices produce enormous amounts of data. Many also consume enormous amounts of data.
    This data is changing the way we work, the way we do business, and the way we live. We all depend on data to make decisions.
    The Microsoft Data Platform allows us to do more. To achieve more. In theory, I might add, only because I have had no chance to try their new products myself.

    The Microsoft Data Platform has all kinds of technologies that are used by many mission-critical environments around the world.

    Technologies based on structured data & technologies based on unstructured data.
    In-memory & on-disk data processing.
    Products that can scale up & products that can scale out.
    Technologies that can store data & technologies that only process the data.
    Some can live on premises & some can live only in the cloud.

    The Microsoft Data Platform in the cloud has a lot to offer.
    Key/value pair management using Microsoft Azure Tables ( http://msdn.microsoft.com/en-us/library/azure/jj553018.aspx )
    Document management - a schema-free, write-optimized platform using the new, highly scalable Azure DocumentDB ( http://azure.microsoft.com/en-us/services/documentdb/ )
    Relational database-as-a-service ( http://azure.microsoft.com/en-us/services/sql-database/ )
    File system computation using Azure HDInsight, which both processes data and stores it in the cloud ( http://azure.microsoft.com/en-us/services/hdinsight/ )
    JSON-based full indexing using Azure Search ( http://azure.microsoft.com/en-us/services/search/ )
    Browser-based data orchestration with Azure Data Factory ( http://azure.microsoft.com/en-us/services/data-factory/ )
    Prediction model technology with Azure Stream Analytics and Azure Machine Learning ( http://azure.microsoft.com/en-us/services/machine-learning/ )
     
    The PowerBI.com dashboards used in several demos today were very impressive, with all their interactive and colorful reports.

    Loud music contributed to the great spirits of 3,941 attendees from 56 countries and 2,000 companies. Huge screens thanked all  community volunteers, chapter leaders and sponsors.

    The world is excited about data. Microsoft is excited about its data platform. I am excited to be at the PASS Summit.

    Yours,
    Maria
  • Using Google Charts API to Visualize Schema Changes

     

    Last week I worked on a new email report using Google Charts and liked it so much that I decided to share it here for anyone who finds it useful.

    I have a Schema Changes audit table which is maintained by a DDL trigger. A record is added to this table every time anyone changes an object on the server.

    image              image

    If you are not familiar with Google Charts, you can read my old post about it and how it works here. It is easy to use, very customizable and FREE. The above visualization uses the Google Bar Chart.

    To keep things short, I am using Transact SQL to build an HTML Image tag. The above email body contains this HTML:

    -------------------------------------------------------------------------

    <img src="http://chart.apis.google.com/chart?
    cht=bvg&
    chs=660x250&
    chco=CF9036,90062D,67E13B,82088D,319CBA&
    chd=t:0,1|1,3|3,2|18,5|4,8&
    chds=0,19&
    chxt=x&
    chxs=0,ff0000,12,0,lt&
    chxl=0:|Sep%202|Sep%203&
    chm=N,000000,0,-1,10|N,000000,1,-1,10|N,000000,2,-1,10|N,000000,3,-1,10|N,000000,4,-1,10&
    chdl=SendBarReport%20(by%20Maria)|TestAuditTable%20(by%20Diana)|TestAUditTable%20(by%20Maria)|TestAUditTable%20(by%20Roy)|V1%20(by%20Diana)
    " />

    In order to test the chart in a regular browser, you need to take the source of the IMG tag and remove the line breaks.

    http://chart.apis.google.com/chart?cht=bvg&chs=660x250&chco=CF9036,90062D,67E13B,82088D,319CBA&chd=t:0,1|1,3|3,2|18,5|4,8&chds=0,19&chxt=x&chxs=0,ff0000,12,0,lt&chxl=0:|Sep%202|Sep%203&chm=N,000000,0,-1,10|N,000000,1,-1,10|N,000000,2,-1,10|N,000000,3,-1,10|N,000000,4,-1,10&chdl=SendBarReport%20(by%20Maria)|TestAuditTable%20(by%20Diana)|TestAUditTable%20(by%20Maria)|TestAUditTable%20(by%20Roy)|V1%20(by%20Diana)

    ------------------------------------------------------------------------

    Here is a script that produces the HTML code and sends out the email. Take into consideration that, if you have a high variety of changing objects, you should render this report for one day only; otherwise the chart will be too big.

     

    Create Stored Procedure that builds the HTML and sends it

     

    In order to make sure that your server can send emails, you will need to enable Database Mail. Here you can find a nice article on how to configure a mail profile.

    /*

     

    EXEC SendBarReport 'yourmailgoeshere', 1,1

    */

     

    ALTER PROCEDURE [dbo].[SendBarReport]

                                      @MailRecipient  varchar(256),

                                      @PrintMode  bit = 1,

                                      @MailMode bit = 1

    AS

    BEGIN

     

    DECLARE @Legend varchar(max);

    DECLARE @ChartSeries varchar(max) = '';

    DECLARE @ChartColors varchar(max)= '';

    DECLARE @AxisLabels varchar(max)= '';

    DECLARE @ChartLegend varchar(max)= '';

    DECLARE @DataValueMarkers varchar(max)= '';

    DECLARE @Min int, @Max int;

    DECLARE @html varchar(max)='';

     

    -- Prepare the data for the chart

    SELECT  ChangeDate   = CAST(ChangeDate AS Date),

                  Change = ObjectName +' (by '+LoginName+')' ,

                  NumOfChanges = COUNT(1)

    INTO #FinalData

    FROM TestAuditTable

    GROUP BY CAST(ChangeDate AS Date),ObjectName,LoginName;

     

    -- Get distinct dates for the X axis

    SELECT @AxisLabels = @AxisLabels + LEFT(DATENAME(m,ChangeDate),3) +' '+ CONVERT(varchar(5),DAY(ChangeDate)) + '|'

    FROM #FinalData

    GROUP BY ChangeDate

    ORDER BY ChangeDate

     

    -- Get MIN and MAX values to scale the chart correctly

    SELECT        @Min = MIN(NumOfChanges) -1,

                  @Max = MAX(NumOfChanges) +1

    FROM #FinalData;

     

    -- Prepare Chart legend, dynamically generate colors, define data value markers and chart series

    -- Chart series must be in the following format |series1_val1,series1_val2,series1_val3|series2_val1,series2_val2,series2_val3| … where each part between | | belongs to separate value on X axis

    SELECT  @ChartLegend = @ChartLegend + Change + '|',

               @ChartColors = @ChartColors + CONVERT(varchar(6),LEFT(newid(),6)) + ',',

               @DataValueMarkers = @DataValueMarkers + 'N,000000,'+CAST(ROW_NUMBER() OVER ( ORDER BY Change) -1 AS VARCHAR(10)) +',-1,10|',

               @ChartSeries = @ChartSeries + ( SELECT CAST(ISNULL(NumOfChanges,0) AS varchar(50)) +','

                                               FROM #FinalData f4

                                               RIGHT JOIN (SELECT DISTINCT ChangeDate,f2.Change

                                                           FROM #FinalData

                                                           OUTER APPLY (SELECT DISTINCT Change FROM #FinalData ) f2

                                                           ) f3

                                                    ON f3.ChangeDate = f4.ChangeDate and f3.Change = f4.Change

                                               WHERE f3.Change = f1.Change

                                               ORDER BY f3.ChangeDate

                                               FOR XML PATH ('')

                                       ) + '|'

    FROM #FinalData f1

    GROUP BY Change

    ORDER BY Change;

     

    -- Remove last character “|” or “,” from dynamically created strings

    set @ChartSeries = SUBSTRING(@ChartSeries,1,LEN(@ChartSeries)-2);

    set @ChartSeries = REPLACE(@ChartSeries,',|','|');

    set @ChartColors = SUBSTRING(@ChartColors,1,LEN(@ChartColors)-1);

    set @AxisLabels = SUBSTRING(@AxisLabels,1,LEN(@AxisLabels)-1);

    set @ChartLegend = SUBSTRING(@ChartLegend,1,LEN(@ChartLegend)-1);

    set @DataValueMarkers = SUBSTRING(@DataValueMarkers,1,LEN(@DataValueMarkers)-1);

     

    -- Prepare HTML

    set @html =

     

    '<img src="http://chart.apis.google.com/chart?

    cht=bvg&

    chs=660x250&

    chdlp=t&

    chco='+@ChartColors+'&

    chd=t:'+@ChartSeries+'&

    chds='+CAST(@Min as varchar)+','+CAST(@Max as varchar)+'&

    chxt=x&

    chxs=0,ff0000,12,0,lt&

    chxl=0:|'+@AxisLabels+'&

    chm='+@DataValueMarkers+'&

    chdl='+@ChartLegend+'&

     

    " />'

     

    IF @PrintMode = 1 BEGIN

           PRINT @html

    END

     

    -- Send email using prepared HTML as an email body

    IF @MailMode = 1 BEGIN

           EXEC msdb.dbo.sp_send_dbmail

           @recipients = @MailRecipient,

           @subject ='Schema Changes',

           @body = @html,

           @body_format = 'HTML'

    END

     

    END

    Audit table script

     

    CREATE SEQUENCE dbo.GetNextNumber AS int START WITH 1 INCREMENT BY 1;

     

    CREATE TABLE dbo.TestAuditTable(

           ID int NOT NULL DEFAULT NEXT VALUE FOR dbo.GetNextNumber PRIMARY KEY,

           ChangeDate datetime NULL,

    ObjectName sysname,

           LoginName sysname) ;

    P.S.

    Please, don’t say “Have you noticed that Google Charts API has been deprecated?”

    They announced the deprecation of this tool about two years ago. We can still enjoy this awesome API: Google will not make any changes to it during the next year and will announce if they intend to discontinue or change it. As soon as this sad day arrives, we will adjust ourselves to their new API https://google-developers.appspot.com/chart/interactive/docs/gallery based on JavaScript classes.

    Yours,

    Maria

  • Choosing SEQUENCE instead of IDENTITY? Watch your step.

     

    On a nice sunny day, getting this error message can be really frustrating. In fact, it is frustrating on any day. Especially if you are doing something really simple.

    image

    Most of you are at this moment yelling at your monitor: "DDL trigger!!!!". Yes. I have a simple DDL trigger for auditing purposes.

    If you use the script below for the auditing table, any DDL statement will get a pretty red notification and will be rolled back.

    CREATE DATABASE TestDB;

    USE TestDB;

    CREATE SEQUENCE GetNextNumber AS int START WITH 1 INCREMENT BY 1;

     

     

    CREATE TABLE dbo.TestAuditTable(

           ID            int NOT NULL DEFAULT NEXT VALUE FOR GetNextNumber PRIMARY KEY,

           ChangeDate    datetime NULL,

    ObjectName    sysname,

           LoginName     sysname) ;

     

    You see, the table is pretty basic, I even use SEQUENCE instead of IDENTITY. Which is, in fact, the main problem.

     

     

    CREATE TRIGGER TRG_DDLServer

    ON  ALL server

    For   

    DDL_EVENTS

    AS

    DECLARE @error varchar(256) ;

    DECLARE @LoginName nvarchar(max) ;

    DECLARE @ObjectName nvarchar(max) ;

     

    SELECT @ObjectName = EVENTDATA().value('(/EVENT_INSTANCE/ObjectName)[1]','nvarchar(max)'),

            @LoginName = EVENTDATA().value('(/EVENT_INSTANCE/LoginName)[1]','nvarchar(max)');

     

     

    BEGIN TRY

                  INSERT INTO TestDB.dbo.TestAuditTable (ChangeDate,ObjectName,LoginName)

                  VALUES(getdate(),@ObjectName,@LoginName);

    END TRY

    BEGIN CATCH

          

            SET @error = 'Failed execute DDL trigger:' + ERROR_MESSAGE()

            ROLLBACK;

            RAISERROR (@error,18,1) WITH LOG

            

    END CATCH

    GO

     

    In addition to the unclear error message, you get a nice memory dump logged in the Error Log each time some DDL comes through.

     

    image

     

    Searching for the problem, I spent a few hours of my life granting various types of permissions on the sequence object to the user I was working with and to the guest user, adding WITH EXECUTE AS to the trigger, and many other things. Regular inserts into the audit table worked. I even created an INSERT trigger and an INSTEAD OF INSERT trigger, and all of this worked perfectly - apart from the insert from the DDL trigger.

     

    My stubborn friend Yoni Nakache, after a long fight, discovered that if you change the execution context by using EXEC or sp_executesql for the insert statement inside the DDL trigger, the DDL will not produce the error anymore.

     

    BEGIN TRY

    EXEC ('INSERT INTO TestDB.dbo.TestAuditTable (ChangeDate,ObjectName,LoginName)

            VALUES(getdate(),'''+@ObjectName+''','''+@LoginName+''');')

    END TRY

     

    More brainstorming around the subject has revealed the truth.

    Server-side DDL triggers do not have a default schema, so you must add the schema name to the sequence call inside the table's DEFAULT definition. I admit, I frequently forget to add the schema name when I am using the dbo schema.

     

    CREATE TABLE dbo.TestAuditTable(

          ID         int NOT NULL DEFAULT NEXT VALUE FOR dbo.GetNextNumber PRIMARY KEY,

          ChangeDate datetime NULL,

          ObjectName sysname,

          LoginName  sysname

    ) ;

     

    The fact that SQL Server does not produce a correct error message looks like a bug. Apparently the QA plan for the SEQUENCE feature did not include DDL triggers - neither on SQL Server 2012 nor on SQL Server 2014.

     

    May all your errors turn into experiences that will cause you to evolve.

     

    Yours,

    Maria

  • Memory is a new disk

    In the database world, disk-based data stores are slowly being replaced by memory-based data stores. Memory prices are becoming more affordable, and operational databases can usually fit entirely into memory.

    According to Gartner (the research company that provides independent technology reports), by the end of 2015 all enterprise DBMSs will use memory optimizations, and most of this transformation will happen this year.

    I am playing with Hekaton tables these days and thinking about which of my customers it might be relevant for. Most of them prefer the new stuff and are quite eager to put new terms on their CVs. They like to say, "Why SQL Server? Everyone is using Redis as an in-memory database, it's free and works blazingly fast. And the other department is using a Couchbase cluster. We don't want to stay behind…" In such situations I need to step outside of the wardrobe where I'm hiding and peek around. The DBMS market keeps growing and many great new technologies are being introduced. I believe that by learning other technologies we better understand the advantages and disadvantages of our favorite ones. It is good to know what our competitors are doing. I also want to be capable of identifying those situations where SQL Server is the better solution, and to have enough expertise to explain why.

    Here is the Gartner Magic Quadrant from October 2013 for operational (OLTP) DBMS providers, some already have in-memory optimizations:

    image

    The big vendors - Microsoft, Oracle, SAP and IBM - are still the Leaders in the field; all others are way behind. I believe that the true Leaders are always aware of all the trends in the database world, adjust to them and adopt the new ideas, which sometimes means buying emerging breakthrough solutions.

    Oracle has finally understood that having several database products is not easy for their customers. They have announced an In-Memory option for Oracle 12c; it should be available this July. As opposed to TimesTen, which is a separate caching product, the In-Memory option will be fully integrated into the 12c version. It will use an in-memory column-store format, which is usually used for data analysis, but they say this feature will also speed up OLTP workloads. They will keep the data simultaneously in two formats - the old row-based format and the in-memory column-based one. Both formats will be transactionally consistent.

    SAP has had its High-performance Analytic Appliance (HANA) DBMS since 2011. It supports two formats, storing in memory both row-based tabular data and column-based tables. Naturally, the tabular orientation is recommended for OLTP environments and the column store for OLAP-like queries. SAP HANA can be scaled out by adding nodes sharing the same storage, in which case the master node stores the row-based data and statistics on the rest of the data.

    IBM DB2 10.5 has BLU Acceleration. This is a memory-resident index that helps find data stored in columnar form on disk. There is no scale-out solution yet, but they claim it will come soon. IBM BLU is not really for OLTP environments, as opposed to SQL Server: for writes, IBM recommends "committing 10,000 rows at a time". Transactions use locks to make changes to the data and latches to access shared structures.

    Microsoft SQL Server 2014 has a new query engine, "In-Memory OLTP" ("Hekaton"), where the row-based data resides completely in memory, keeping a copy of the data on disk for durability purposes. Indexes on memory-resident tables live only in memory. Queries can span all three query engines: the traditional relational engine where the data sits on disk, the column-store engine where data uses both memory and disk, and this third new engine that manages memory-resident data. Hekaton tables do not use locks and latches for concurrency control; they use a combination of timestamps and row versioning.
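
    As a minimal illustration, this is roughly what a Hekaton table definition looks like in SQL Server 2014 (it assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup; the table and columns are made up):

    CREATE TABLE dbo.SessionCache
    (
        SessionId INT NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
        UserName  NVARCHAR(128) NOT NULL,
        LastSeen  DATETIME2 NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);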

    Here are some of the proven alternative technologies on the DBMS market that are worth noticing:

    Aerospike: a key-value, row-based NoSQL DBMS with in-memory indexes and data on SSD. It has many customers among gaming and advertising-driven sites. Key-value stores are very simple and powerful structures if you always use key-matching searches. They are extremely fast and can scale easily by adding more servers to the cluster. The main downside of the key-value structure is a limited ability to search inside the value. Aerospike, unlike many of its competitors, allows the creation of secondary indexes on the value and stores them in memory. Redis, for instance, is a quite popular, really fast in-memory key-value store, but it is not scalable beyond one node and does not support secondary indexes on the value. Probably due to those two limitations, it's not yet in the Gartner quadrant. Aerospike uses latches and short-term locks to ensure isolation between transactions.

    VoltDB: an all-in-memory NewSQL relational database, designed to run as a cluster of servers with tables partitioned across the cluster. Transactions in VoltDB run in a serial timestamp order, do not use locks or latches, and SQL is translated into Java code. They have many customers in telecom, utilities and financial services.

    MemSQL: a distributed, in-memory, row-based NewSQL relational DBMS. It stores all data in memory in a row format and maintains a write-ahead log for durability purposes. SQL is compiled into C++ code. Data distribution is implemented using an aggregator node, which splits the query and sends it across all leaf nodes for processing; the aggregator then combines the results before sending them to the client. Recently they added a new column-store table type, which must reside on SSD. This is somewhat in contradiction to the current trend, where everyone who adds column-based technology makes it memory-resident. This IMDBMS is used by customers such as Comcast and Zynga.

    MongoDB and Couchbase are the most popular document-based data stores. Their greatest advantage is a well-known open schema in which working with JSON documents becomes very easy. Neither of them is really an in-memory product; they work with data on disk. MongoDB, however, has a mode of work where it uses memory-mapped files to access the data on disk. This is in fact an operating system feature: the files appear as regular files but can sit entirely in memory. Couchbase is an "almost" in-memory database: it caches as much as possible in RAM, but the data is persisted.

    There are many developers eager to play with new technologies. However, the biggest challenge for a new DBMS is to convince organizations to put aside well-known products and start using its new solution. I don't think this will be an easy task, despite the fact that many of them are great products. They can be extremely fast, but it takes years to understand all the users' requirements and provide the complete, flexible solution that will suit more than just a few cases.

    Enterprise solutions are like LEGO: you can build anything you want out of them. However, there are situations when all your kid wants is just a toy car. He doesn't care whether it can be transformed into a toy plane or a castle, and doesn't intend to invest the time needed to build it.

    Yours,

    Maria
