
Davide Mauri

A place for my thoughts and experiences with the Microsoft Data Platform

  • For The Better Developer: SQL Server Indexes

    Indexes?

    Indexes are a key factor for a database, relational or not. Without indexes the only option for a database would be to read all the data and discard what is not needed. Very inefficient.

    If a query has a where, a join or a group by clause and you feel performance is not good, before trying to figure out how to cache results to improve overall application performance — which will surely help, but will put on you the burden of maintaining the cache, making the solution more expensive and complex to develop, maintain and evolve — keep in mind that you will surely benefit from using an index. How much is hard to say, but I would expect performance improvements in the order of 10x to 1000x. Yes: an index can make such a difference.

    More on caching and other stuff is discussed at the bottom of the article, in the “For those who want to know more” and “For those who know more” sections.

    Notes

    Wondering who I am and why I’m writing these posts? Read the introductory article here. Also be sure to read the notes about the “For the Better Developer” article series, so that you can know what to expect.

    The indexes you have to know

    There are two principal types of indexes in SQL Server: clustered and non-clustered. The first main difference between the two is that a clustered index works directly on table data, while a non-clustered index works on a copy of that data.

    Both index types support two different storage models: row-store and column-store.

    There are therefore four possible combinations to choose from when we need to add an index to our tables.

    Clustered Row-Store Indexes

    An easy example to explain this index is an encyclopedia or a phone number directory. Data is stored according to the order specified by the index; more specifically, by the columns declared when the index is created.

    create clustered index IXC on dbo.Users(Surname, Name)

    In the example above, rows will be stored ordered by Surname and then Name. Just like the phone number directory, duplicates are allowed.

    Each row will also be stored along with the values of all the remaining columns. If the table dbo.Users has the following definition:

    create table dbo.Users
    (
        Id int not null,
        Surname nvarchar(50) not null,
        Name nvarchar(50) not null,
        DateOfBirth date not null,
        PhoneNumber varchar(50) not null,
        Address nvarchar(100) not null,
        City nvarchar(50) not null,
        State nvarchar(50) null,
        Country nvarchar(50) not null
    )

    the index row will contain not only the indexed columns but also the Id, DateOfBirth, PhoneNumber, Address, City, State and Country. This happens because, remember, a clustered index works directly on table data.

    Of course, since a clustered row-store index works directly on the table's data, there can be only one clustered index per table.

    When should you use this type of index? It is perfect if:

    • you are looking for zero, one or more rows in a table
    • you are looking for a range of rows: from “A” to “B” for example.
    • you need to have the rows returned in a specific order

    As you can see it fits all usage scenarios and, in fact, the clustered row-store index is very flexible and can be useful in almost every scenario. Unfortunately, as already said, only one can exist per table. This limitation comes from the fact that such an index physically orders data in the table and, of course, you can order data in only one way at a time.

    If you need to order data in more than one way at the same time, you need to duplicate some or all of it, and that's why Non-Clustered Row-Store Indexes exist.
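    To make the three scenarios above concrete, here are a few hedged example queries that the IXC index defined earlier could serve efficiently (the surname values are, of course, just made up):

    -- Point search: zero, one or more rows
    select * from dbo.Users where Surname = 'Smith';

    -- Range search: from "A" to "B"
    select * from dbo.Users where Surname between 'Anderson' and 'Brown';

    -- Ordered result: rows come back already sorted by Surname and Name
    select Surname, Name, PhoneNumber from dbo.Users order by Surname, Name;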

    Non-Clustered Row-Store Index

    A Non-Clustered Row-Store index is somewhat like the index you can find at the beginning of a magazine, where the topics are listed along with a pointer to the page where they can be found.

    In the case of SQL Server, when you create a non-clustered row-store index on a column — and following the phone number directory example let’s say the PhoneNumber column — all values in that column will be indexed and each one will have a pointer that allows SQL Server to relate that value with the row in the table that contains it.

    create nonclustered index IXNC on dbo.Users(PhoneNumber)

    Again, please note that values in the chosen columns do not need to be unique.

    The non-clustered index requires additional space to be stored, since it creates a copy of all the data in the columns you specified in the index definition, plus the space needed for the pointer.

    When the index is used to locate some data, the result of that search is a pointer. This means that in order to get the real data a lookup operation needs to be done. For example, given a PhoneNumber you want to know who will answer if you call it. SQL Server will use the index on PhoneNumber and, after having found the number you are looking for, will take the pointer and use it to find all the related values like Name, Surname, Address and so on. A lookup is nothing more than a jump to the location indicated by the pointer in order to access the real data.

    Of course this additional operation has a little overhead, and that overhead is directly proportional to the number of rows for which the lookup is needed. Are you looking for a thousand rows? A thousand lookups need to be done. The consequence of this behavior is that it doesn't always make sense to use an index, even if it exists. If the effort to do all the lookups (which can be an expensive operation) is too high, compared to reading the full table and just discarding the rows that don't fall in the scope of the requested search, the index will not be used.

    As a general rule, then, it can be said that this index is perfect when you are looking

    • for zero or one row in a table
    • for *very few* rows in a table

    As a rule of thumb, "very few" means 1% or less of the rows in your table.
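    As a hedged illustration, a query like the following (the phone number is a made-up value) is the ideal case for the IXNC index defined above: SQL Server seeks the index to find the matching entry and then performs a single lookup to fetch the remaining columns.

    select Surname, Name, Address
    from dbo.Users
    where PhoneNumber = '555-0123';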

    Clustered Column-Store Index

    This index effectively turns SQL Server into a columnar database. Really. Once created on a table, since it is a clustered index, it changes the way SQL Server stores that table: instead of saving data one row at a time, data is saved one column at a time, following the principles of columnar databases. For that reason you don't need to specify any columns when creating such an index, since it applies to the whole table and thus all columns will be indexed:

    create clustered columnstore index IXCCS on dbo.TrainingSessions

    Column-store indexes are perfect for analytical queries. All queries in which data needs to be grouped, optionally filtered and aggregated fall into this scope. Performance can easily be increased by a factor of 10x, and it's quite common to see improvements of 30x and more. This is also a result of the fact that, to work properly, columnar solutions need to compress data and SQL Server is no exception: the clustered column-store index will compress the data in your table using several techniques, and you may see your table's space shrink a lot, up to 60%-70% or even more.

    Column-store indexes are especially good in read-intensive scenarios, while they do not excel at being written very often. There is an exception for bulk-load scenarios. A bulk load is activated when you use the T-SQL command BULK INSERT, load data into SQL Server using the SqlBulkCopy object, or use any solution, like SQL Server Integration Services, that supports the Bulk Load option.

    If you have a table in which the majority of the queries have this pattern:

    SELECT
        <aggregation_function>(Column1),
        <aggregation_function>(Column2),
        ...
    FROM
        <table>
    WHERE
        ...
    GROUP BY
        Column3,
        Column4,
        ...

    and write operations, compared to read operations, are a small number, then give this index a try. You will probably be amazed by how quickly data can be aggregated and manipulated, even on huge tables (huge: hundreds of millions of rows or more).
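    For example, on the dbo.TrainingSessions table used above, a typical analytical query could look like the following sketch (the column names are hypothetical, since the table definition is not shown here):

    select
        UserId,
        year(SessionDate) as SessionYear,
        count(*) as SessionCount,
        sum(CaloriesBurnt) as TotalCalories
    from
        dbo.TrainingSessions
    where
        SessionDate >= '20160101'
    group by
        UserId,
        year(SessionDate);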

    The downside of this index is that while it is great for working with aggregations, it is not that good at retrieving one or a few rows with all their columns. So you may want to mix the row-store index and the column-store index together, and that's why the next, and last, index on the list may be very useful too.

    Non-Clustered Column-Store Index

    This index allows you to create a column-store index on only a subset of the columns of your table.

    create nonclustered columnstore index IXNCCS on dbo.Users(City, State, Country)

    This is helpful when you need your table to support a row-store index, which is quick for retrieving whole rows, but you also need to quickly aggregate some columns of your data.

    Using again our phone directory example, the above index will make any aggregation on City or State or Country or a combination of them, really really really fast.

    Let's make a more real-life scenario: say you have a database that stores training sessions, and you want to be able to present to your users a nice dashboard with aggregated information for the last six months of the most useful indicators, like calories burnt, number of training sessions done and so on, and you also need to be able to quickly retrieve a single row or a bunch of rows to show all the data you have for a specific training session. The non-clustered column-store index, especially when used together with a clustered row-store index, is just perfect for this job.
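    A hedged sketch of what such a mixed setup could look like (column names are illustrative only; note that the clustered row-store index here is an alternative to the clustered column-store one shown earlier, since a table can have only one clustered index):

    -- Row-store clustered index: fast retrieval of a single session with all its columns
    create clustered index IXC on dbo.TrainingSessions(SessionId);

    -- Column-store index on the columns used by the dashboard aggregations
    create nonclustered columnstore index IXNCCS
        on dbo.TrainingSessions(UserId, SessionDate, CaloriesBurnt);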

    Be Warned

    Creating, altering or dropping an index may have a huge impact on the server. The bigger the table, the bigger the impact. Depending on the size of your data, such an operation can take seconds or hours, and in the meantime all users trying to use the database will be affected in one way or another. So before playing with indexes, test the script on a test server to evaluate the impact.

    Conclusions

    You now have the basic foundation to understand how to make applications go faster. Much, much, much faster.

    Take the sample above, replace training sessions and users with your own objects (invoices, bills, orders, whatever you want), and you can easily understand how SQL Server can be perfect to manage a real-time analytic workload and at the same time also a transactional workload, which is quite a typical requirement.

    This is a really unique feature in the DBMS space today.

    My opinion is that, if you have less than 1 TB of "hot" data (data updated quite frequently during the day), you don't really need to look for a Big Data solution, since the overhead it will put on you and your team is so big that I can hardly justify it for 1 TB of data. If you have 1 PB, or close to it, then let's talk.

    External References

    Indexes are really a huge topic; here we just scratched the surface, and even just a bit. I'll surely talk more about indexes in future posts, but if you don't want to wait and you're interested in the topic, which is my main objective with this post series, then you can go for these external references:

    For those who want to know more

    Caching

    SQL Server has its own internal cache. Actually, it has several different caches. The most important one is the buffer cache. This is where data lives. In a healthy situation 99% of your data comes from the cache. Yeah, right, 99%.
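    If you are curious about how your instance is doing, a hedged way to peek at the buffer cache hit ratio is the following query (the ratio is exposed as two separate performance counters that must be divided):

    select 100.0 * a.cntr_value / nullif(b.cntr_value, 0) as BufferCacheHitRatio
    from sys.dm_os_performance_counters a
    join sys.dm_os_performance_counters b
        on a.object_name = b.object_name
    where a.counter_name = 'Buffer cache hit ratio'
      and b.counter_name = 'Buffer cache hit ratio base';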

    Row-Store Indexes

    Row-store indexes are implemented as B+Trees. The technical difference between clustered and non-clustered is that for a clustered index the leaf pages contain the actual data, while for a non-clustered index they contain only the data related to the columns used or included in the index, plus the pointer.

    Clustered Row-Store Index

    SQL Server doesn't really enforce the physical order of the data after the index has been created. Data is ordered following the index definition during the index creation phase, but after that, if there are operations that insert, delete or update data, the physical order is not preserved. What is preserved is the logical order of the data. This means that data will be read in the correct sequence, which may be different from the sequence in which rows can be found on disk. This logical/physical difference is the index fragmentation, which is not necessarily a bad thing. It usually is a bad thing, but not in 100% of the cases. For example, if you always access your data one row at a time, you care little about fragmentation. Or, if you want to insert data as fast as possible, you don't really care about fragmentation; on the contrary, you may even want fragmented data in order to reduce the chances of having hot spots where contention will happen and performance will suffer. Anyway, as usual, edge cases like the two described must not be taken as the general rule.
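    If you want to check how fragmented an index actually is, a hedged starting point is the following query (try it on a test server first for very large tables, since the function has to read index pages):

    select object_name(ips.object_id) as TableName,
           i.name as IndexName,
           ips.avg_fragmentation_in_percent
    from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.Users'), null, null, 'LIMITED') as ips
    join sys.indexes as i
        on i.object_id = ips.object_id
       and i.index_id = ips.index_id;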

    Non-Clustered Row-Store Index

    You can have more than one non-clustered row-store index on a table, and of course you can create such an index on more than one column. This enables some nice "tricks", like the covering index, that I will describe in future articles.
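    Just as a teaser, here is a hedged sketch of a covering index: by adding to the index, via the INCLUDE clause, the extra columns a query needs, the lookup described earlier is avoided entirely.

    -- A query like "select Surname, Name from dbo.Users where PhoneNumber = ..."
    -- can be answered entirely from this index, with no lookup
    create nonclustered index IXNC_Covering on dbo.Users(PhoneNumber)
        include (Surname, Name);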

    Other Indexes

    There are many more index types than the ones I described. They are specific to a certain data type, like geospatial indexes, or to a specific feature that I will discuss in another post, like Hekaton, the in-memory, lock-free engine. The indexes I described in this article are the most commonly used.

    For those who know more

    Caching

    Caching is usually a nice thing to have and I use caching solutions quite a lot myself. I just don't want to use it as a surrogate for performance I can get right from my database. Caching is perfect, in my opinion, in highly concurrent systems to reduce the stress on your database, so that you can deliver the same user experience with a less expensive solution (be it smaller hardware on-premises or a lower performance level in the cloud).

    Keep also in mind that SQL Server offers an extremely good option for creating customized caching solutions using In-Memory Tables, as Bwin, the online betting giant, demonstrated. So if you are already using a SQL Server version that supports this feature, it may be worth giving it a try. Otherwise Redis and the likes are just perfect.
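    Just to give an idea, a cache-like In-Memory table could look more or less like the following sketch (hedged: the database must already have a MEMORY_OPTIMIZED_DATA filegroup and run a version/edition that supports the feature; table and column names are made up):

    create table dbo.CacheItems
    (
        CacheKey nvarchar(256) not null primary key nonclustered,
        CacheValue nvarchar(max) not null,
        ExpiresAt datetime2 not null
    )
    with (memory_optimized = on, durability = schema_only); -- schema_only: contents are lost on restart, acceptable for a cache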

    The “User” table

    Yes, the table used in the sample has several design problems (both logical and physical), for example:

    • there is a functional dependency between City, State and Country
    • there is no Primary Key defined
    • data types are really not the best possible

    but I just wanted to show something that is as close to reality as possible, so it will be easier to relate the example to your own experience. In addition to that, I'll talk about keys and indexes in dedicated articles, so here I went for something that is just plain, simple and easy to understand.

    You've really come this far?

    Well, thanks for reading. :)

  • Speaking at the SQL Server Seattle User Group in April

    I'm happy to share that I'll be speaking at the next SQL Server Seattle User Group meeting on April 13th. The topic is again Temporal Tables, a subject I love a lot and that is 100% useful in all possible roles: dev, DBA or BI guy.

    Temporal Tables with SQL Server 2016, vNext and Azure SQL

    Keeping track of how data changed over time in a table has always been a difficult task. Both data insertion or modification and even querying are just more complex when you want to get the result that was returned at a specific point in time in the past, and even more so when you're not looking for a specific point in time but for a period of time. Temporal database theory and temporal operators surely can come to the rescue, but they are not a matter for the faint of heart! Luckily one of the biggest - and most requested - features added to SQL Server 2016 solves exactly this problem, allowing the creation of change audit trails, data history and point-in-time queries in such a simple way that anyone, even on *current* applications, can benefit from it, simplifying solution architecture and saving time (and money) on maintenance and reporting.
    In this session we'll see how the feature works on SQL Server 2016 and Azure SQL v12 and also what will be available in the vNext version of SQL Server.
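    If you want a quick taste of the syntax before the session, here is a hedged minimal example of a system-versioned temporal table (table and column names are made up):

    create table dbo.Products
    (
        Id int not null primary key clustered,
        Name nvarchar(100) not null,
        Price decimal(10,2) not null,
        ValidFrom datetime2 generated always as row start not null,
        ValidTo datetime2 generated always as row end not null,
        period for system_time (ValidFrom, ValidTo)
    )
    with (system_versioning = on (history_table = dbo.ProductsHistory));

    -- Point-in-time query: how did the data look on January 1st?
    select * from dbo.Products for system_time as of '2017-01-01';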

    Additional details, location and free registration are here:

    https://www.eventbrite.com/e/temporal-tables-with-sql-server-2016-and-azure-sql-w-davide-mauri-tickets-33015877378

  • Speaking at SQL Saturday #613 in Redmond, WA

    On April 15, 2017 I'll be speaking at SQL Saturday in Redmond, WA. I've settled down after the recent changes, so it's time to start being an active part of the Microsoft Data Platform (formerly SQL Server) community again.

    I’ll be talking about monitoring SQL Server Integration Services and the amazing support for JSON in SQL Server 2016:

    SSIS Monitoring Deep Dive

    In this session we'll see everything interesting that is hidden in the SSISDB database, where you can gain a lot of insight into the outcome, the performance and the status of your SSIS packages. I'll share everything I've learned building the SSIS Dashboard we're actually using in production and that you can test here: http://ssis-dashboard.azurewebsites.net/. We'll see the internals of the SSISDB database, how we can add custom logging information and how we can use all these data in order to know exactly what happened at a specific point in time.

    JSON & SQL Server: using it effectively

    JSON is a de-facto standard for machine-to-machine communication, but also for storing "liquid" data. Once a killer feature of NoSQL databases, with SQL Server 2016 it is now available also in a (post) relational environment.
    But how does it fit in the relational database space? How can the usage of JSON simplify DBAs' and DEVs' everyday work? What about performance? And what about flexibility?
    In this session we'll see how powerful JSON can be, when used correctly and not abused, providing solutions to common problems (like the well known "dynamic schema" case) and simplifying development (for example to pass complex objects or variable numbers of parameters to a stored procedure), with also very good performance, even if it may be hidden in not-so-obvious places.
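    As a hedged example of one of the patterns mentioned above (passing a variable number of values to a stored procedure as JSON), the idea is more or less this (table and procedure names are made up):

    create procedure dbo.GetOrders
        @OrderIds nvarchar(max) -- e.g. N'[1, 2, 42]'
    as
    begin
        select o.*
        from dbo.Orders as o
        inner join openjson(@OrderIds) as j
            on o.OrderId = cast(j.[value] as int);
    end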

    The event is going to be an amazing one, as you can see from the agenda:

    SQLSaturday #613 - Redmond 2017

    So, if you live near there, don’t miss it.

    See you there in a bit more than a month!

  • Apache Zeppelin 0.7 for SQL Server

    During the weekend I've updated my SQL Server interpreter for the amazing Apache Zeppelin to be compatible with the latest version — 0.7 — which was released just a couple of days ago:

    Apache Zeppelin Release 0.7.0

    This new release has a lot of nice features (customizable charts, better multi-user support, improved tables and many others) and in order to start to use the new features also with SQL Server and SQL Azure, the first step was to sync the existing interpreter with version 0.7. Well, that’s done.


    You can download the source code and build it yourself from here:

    GitHub - yorek/zeppelin: Apache Zeppelin with support for SQL Server

    or get the Docker image with everything already built, for maximum comfort:

    https://hub.docker.com/r/yorek/zeppelin-sqlserver/

    If you’re new to Apache Zeppelin, you can take a look at the articles I’ve already written to help people to get started:

    Breaking Changes

    In order to take advantage of the automatic interpreter discovery and native syntax highlighting introduced in version 0.7, I had to rename the interpreter group and the interpreter name from tsql to sqlserver. This means that any binding to the tsql interpreter will be invalid which, in turn, means that all notebooks using it won't work. Luckily fixing this situation is easy: just create a new interpreter binding, using sqlserver as the interpreter group, and give it the same name you used before. Make sure the affected notebooks use this new binding (as the default one, if in your notebooks you didn't use the %interpreter notation) and you're done.

    Updating Docker Container to version 0.7

    If you were already running a Docker container with Apache Zeppelin for SQL Server, you may have notebooks and interpreter configurations that you want to keep.

    The process to update everything without losing any existing work is the following (just be sure also to read the “Breaking Changes” section above!):

    Get the updated image

    docker pull yorek/zeppelin-sqlserver:v0.7

    Once it’s downloaded, check if you have any running container

    docker ps

    If you see that an Apache Zeppelin container is running (it is named zeppelin if you followed my tutorials), stop it

    docker stop zeppelin

    Now create a new container pointing to the volumes of the previous version. The new container will automatically use the updated image

    docker run --name zeppelin2 -p 8080:8080 --volumes-from zeppelin -d yorek/zeppelin-sqlserver:v0.7

    List again all the existing containers (running and not running)

    docker ps -a

    Note that the old zeppelin container is still there and, if not needed, can be removed

    docker rm zeppelin

    The new container can now be renamed to the usual name

    docker rename zeppelin2 zeppelin

    The old image can also be removed to free disk space

    docker rmi yorek/zeppelin-sqlserver:v0.6.2

    Now check that only one zeppelin container, with the version 0.7, is available:

    docker ps

    Upgrade done! Now you can start Apache Zeppelin for SQL Server with the usual command:

    docker start zeppelin

    and you’ll have access to all your existing notebooks and interpreter bindings.

  • Azure Functions, Dapper and SQL Server JSON to store form data

    A common problem

    Recently we had to set up a couple of contact form pages to allow users to register for the App 2.0 Beta program (now RTM) or to tell us which of the products we launched at CES 2017 they are interested in.

    Such kinds of requests are quite frequent and usually, from a developer's perspective, they are the worst thing someone can ask for. They are boring to architect, boring to write, boring to test. Still, business and marketing need and love them.

    So how to satisfy this requirement in the fastest and easiest way, but still delivering high-quality and easy to maintain code, while also keeping an eye on integrating the data with the existing environment?

    Evaluating Azure DocumentDB

    Given that we are on Azure, one option was Azure DocumentDB. There is no need to define a schema upfront and it is usable just via REST, so the UX developer could do everything on their own, just using jQuery. Azure DocumentDB can also be queried using a SQL language, so extracting data from it wouldn't have been a problem.

    But in the end, I decided to go another way. All our data, after swimming in a Data Lake, is saved in an Azure SQL database where we heavily rely on its column-store features. So having an additional database to integrate was something that would have made the solution a bit more complex than the simplest possible one. The famous quote

    everything should be made as simple as possible, but no simpler

    is what drives all my architectural decisions, so I wasn’t really happy about going with Azure DocumentDB.

    With the additional fact that there is no really good tooling around Azure DocumentDB yet, I started to look for alternatives. The obvious alternative (aside from saving data into a blob, which was not an option since that would have been too simple, and it doesn't offer any querying capability) was to use Azure SQL.

    Moving to Azure SQL?

    With Azure SQL you have great tooling (SQL Server Management Studio and now also the online query editor), and we already have all the knowledge in house; but surely the fact that it doesn't let you use just REST to read and write data was, again, something that wasn't making me really happy.

    Besides that, Azure SQL seemed to be the perfect option. JSON is now natively supported, so there is no problem storing data without a strictly enforced schema.

    Since we're already using SQL Azure, we wouldn't even have to spend any additional cent for it. The only problem to solve was that you can't use Azure SQL directly via jQuery.

    Serverless is the key

    The missing link — the REST interface — can easily be created using Azure Functions and a micro-ORM like Dapper. Thanks to the serverless nature of Azure Functions, all that's needed are the few lines of code to get the HTTP request body that contains the contact form's "jsonified" data and store it into the SQL Azure database.

    The created Azure Function gets called each time an HTTP request is made, using an HTTP trigger. Here is the function.json file that defines the function bindings:

    {
        "bindings": [
            {
                "authLevel": "function",
                "name": "req",
                "type": "httpTrigger",
                "direction": "in"
            },
            {
                "name": "res",
                "type": "http",
                "direction": "out"
            }
        ],
        "disabled": false
    }

    and the function's full code is here:

    https://gist.github.com/yorek/89fb1afecc34db3f9419ceb3ae9664d9#file-azure-functions-save-form-data-to-sql-azure-csx

    Such an additional layer is also a welcome addition since it allows injecting some additional validation checks and business logic if needed.

    I’ve used a Stored Procedure here for better decoupling, and it does just a simple insert, with some additional checks.

    https://gist.github.com/yorek/2877b555452c5d5a155a8f50dbfd9bf7#file-azure-functions-save-form-data-to-sql-azure-sql

    It also turns some JSON data into columns, to make querying even easier.
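    This is not the exact code of the gist above, but a hedged sketch of the idea (table and JSON property names are illustrative): the procedure receives the raw JSON, extracts a few properties into regular columns for easier querying, and stores the original payload as well.

    create procedure dbo.SaveFormData
        @payload nvarchar(max)
    as
    begin
        if (isjson(@payload) <> 1)
            throw 50000, 'Payload is not valid JSON', 1;

        insert into dbo.FormData (Email, ProductName, ReceivedOn, RawPayload)
        select
            json_value(@payload, '$.email'),
            json_value(@payload, '$.product'),
            sysutcdatetime(),
            @payload;
    end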

    Dapper helps to make interacting with SQL Azure a breeze (two, yes, two lines of code to open the connection and write to SQL Azure), so all simplicity requirements are finally met. No more impedance mismatch. No more.

    Given the existing external factors (resources, knowledge, integration) the simplest solution but not simpler has been achieved.

    Without any existing factors I would probably have chosen Azure DocumentDB. But as an afterthought, I still have the gut feeling that SQL Azure would have been better in the long term (for example, I can connect the Marketing Manager's Excel directly to SQL Azure, something that I cannot do yet with DocumentDB)… so maybe SQL Azure would be my choice once again. After all, software is made to serve people, and this should drive at least 50% (if not more) of the architectural choices, IMHO.

    In conclusion

    Since I joined Sensoria I’ve moved to work on the cloud 100% of my time. And the more I work on the cloud, the more it is clear to me that serverless and simplicity (which means, implementation-wise: microservices) is the way to go. Efficiency is increased so much in this way that it’s really worth the effort, and embracing PaaS clearly becomes an obvious choice.

  • Creating your first Dashboard with Apache Zeppelin and SQL Server/Azure SQL

    Now that Apache Zeppelin has been downloaded and configured, it’s time to create a dashboard. It will be very easy: all you have to do is figure out which data you want to show, write the corresponding T-SQL query and then add some charts and information to make it perfect.

    To create the first Apache Zeppelin dashboard, let’s use the new Wide World Imports sample database from Microsoft:

    Microsoft/sql-server-samples

    Take the “Standard” version so that you can use it also on a non-premium Azure SQL instance, if you want to try it on Azure.

    Once you have restored the WideWorldImportersStandard database, run the Apache Zeppelin 0.6.2 for SQL Server Docker container:

    docker run --name zeppelin -p 8080:8080 -v /zeppelin-sqlserver/notebook -v /zeppelin-sqlserver/conf -d yorek/zeppelin-sqlserver:v0.6.2

    make sure you have configured an interpreter (call it "Azure", for example) with the connection information to point to WideWorldImportersStandard and then create a new note:

    And now it’s just a matter of writing the queries and turning the result into charts. As usual I’ve recorded a quick video (it’s just six-something minutes) to easily show how you can do that. (And I’ve also started from configuring the interpreter so that you can also review that part).

    Enjoy:


    https://vimeo.com/198582184

    In the video I also show how the markdown interpreter can be used to add information to the dashboard.

    The sample dashboard, which also includes the queries, can be downloaded here:

    My First Dashboard.json

    If you prefer to download only the queries and then DIY, here’s a SQL file with all the used queries:

    My First Dashboard.sql

    I really recommend you start using Apache Zeppelin if you haven't done so yet. It's incredibly useful, even just for DBAs to monitor SQL Server status. I'll talk about this in a forthcoming post. Stay tuned!

  • Apache Zeppelin 0.6.2 for SQL Server Docker Image updated

    In order to be able to keep created notebooks and interpreter configurations when upgrading the Docker image to a newer version, I changed the dockerfile to use Docker volumes, as described here (if you're interested in learning more):

    Manage data in containers

    To run the container, the command is now

    docker run --name zeppelin -p 8080:8080 -v /zeppelin-sqlserver/notebook -v /zeppelin-sqlserver/conf -d yorek/zeppelin-sqlserver:v0.6.2

    The -v options do the trick and will be very useful the first time a new image is released: you'll be able to keep all your notebooks without having to export them first and, in addition, interpreter configurations will also be preserved.

    The solution used until now (sharing a volume with the host) works nicely, but unfortunately only for notebooks. If you have a lot of different interpreters configured (like me), re-configuring them every time the image is updated is really time consuming and boring.

    To be sure that your container is using volumes, you can check it using the inspect command:

    docker inspect zeppelin

    The output is a JSON with detailed information on the container. Look for the Mounts node:

    If you are using a previous version of the image, my recommendation is to download this updated one so that you’ll be ready for future updates.

    If you're not using Apache Zeppelin yet, you should really start. No matter if you are a Data Scientist or a DBA, Zeppelin is *really* useful to create nice visualizations and dashboards just using T-SQL:


  • Apache Zeppelin 0.6.2 for SQL Server now with correct Syntax Highlighting

    After a little bit of work I've managed to get syntax highlighting for T-SQL in Apache Zeppelin 0.6.2 working correctly.

    The tricky part is that the Ace Editor has supported T-SQL-specific syntax highlighting since v1.2.0, but Apache Zeppelin is still using version v1.1.9.

    Moving to v1.2.0 doesn't work, since it creates some compatibility issues, so the only way to go was to manually patch and tweak Apache Zeppelin to use the T-SQL highlighting file available in version 1.2.0.

    Said and done, now T-SQL is beautifully highlighted:

    SQL Server 2016 and vNext aren't supported yet, but I'll work on this in the future for sure.

    Both the GitHub repository and Docker Hub are already updated. To update your Docker image, if you have already downloaded it before, just do the usual pull:

    docker pull yorek/zeppelin-sqlserver:v0.6.2

    Remember to export your notebooks to be sure not to lose any work done so far (this behavior will be fixed soon). Then go for

    docker stop zeppelin

    docker rm zeppelin

    docker run -p 8080:8080 --name zeppelin -d yorek/zeppelin-sqlserver:0.6.2

  • Configure and use Apache Zeppelin with SQL Server: the Basics

    In the previous post I’ve shown how to download and run Apache Zeppelin 0.6.2 for SQL Server, using Docker to simplify the installation procedure. It’s now time to start to use the fantastic Apache Zeppelin to connect to SQL Server and run some queries.

    This version, specifically created for SQL Server, already comes with a predefined and almost fully configured interpreter, so that you just have to specify the SQL Server (or Azure SQL or Azure DW) server you want to connect to, the database, the username and the password, and you're good to go.

    Apache Zeppelin Interpreter

    An Apache Zeppelin interpreter is a library that allows Apache Zeppelin to connect to and use a specific database, data provider or even language.

    Notebooks and Paragraphs

    A notebook is made of one or more paragraphs, which are the containers where you actually write and run code. To specify which interpreter a paragraph should use, you just have to write the interpreter name prefixed by the percent (%) character:

    Interpreter Binding

    Each notebook can use one or more interpreters. The process of defining which interpreters a notebook can use is called binding. The interpreters shipped with Apache Zeppelin 0.6.2 for SQL Server are:

    T-SQL (%tsql)
    Markdown (%md)
    Python (%python)
    Shell (%sh)
    Angular (%angular)

    To specify which interpreters are available in the active notebook, you can just click on the little gear icon on the top right:

    Apache Zeppelin 0.6.2 for SQL Server comes with the following interpreters already bound to each notebook by default:

    The default interpreter will be used if you don't specify, for a paragraph, which interpreter you want to use. As you can see, you can mix different interpreters in a notebook so that you can do almost everything you want. The Markdown interpreter is very useful since with it you can create self-explanatory notebooks.

    Interpreter Configuration

    Before using an interpreter you have to configure it. This can be done using the “Interpreter” menu accessible by clicking on the drop-down arrow visible on the right of the top search bar.

    Click on Interpreter menu item to enter configuration page

    Each interpreter has its own set of configuration options. For T-SQL you have to specify the following options:

    sqlserver.connections
    sqlserver.database
    sqlserver.driver
    sqlserver.max.result
    sqlserver.password
    sqlserver.url
    sqlserver.user

    Besides the usual username, password and initial database, you have a couple more options:

    driver

    The JDBC driver to use. Just leave the default, unless you want to use, for any reason, another JDBC driver to connect to SQL Server, which is very unlikely. I'll probably hide this option in the future, now that Microsoft has officially released and open sourced the JDBC driver for SQL Server, which is the driver to use.

    url

    This is the "address" of the SQL Server instance you want to connect to, using the JDBC format:

    jdbc:sqlserver://[serverName[\instanceName][:portNumber]][;property=value[;property=value]]

    more details on the available properties can be found here: Building the Connection URL.

    max.result

    Big Data means that you cannot work by looking at all your data. It is just too big (by definition!). So you'll normally want to avoid returning all of your rows if, by mistake, you forget to specify a where clause. Here you specify the maximum number of rows you ever want to be returned. 1000 is the default and should be more than enough. If you want to change it, keep in mind that all downloaded data will live in your browser's memory space, so you don't really want to push that number too high.

    connections

    This is a very tricky and interesting property: it allows you to decide whether all paragraphs in the notebook will share the same connection or each paragraph will have a dedicated connection of its own.

    With a notebook connection you will be able to share #temp tables between paragraphs, for example, but you won't be able to execute two or more paragraphs in parallel. Each paragraph will, in fact, be executed using a FIFO logic.
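    For example, with the notebook connection style something like this works across two different paragraphs (a hedged sketch; Sales.Customers is a table from the WideWorldImporters sample used elsewhere on this blog):

    -- Paragraph 1
    select top 100 * into #top_customers from Sales.Customers;

    -- Paragraph 2: works only because both paragraphs share the same connection
    select count(*) from #top_customers;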

    By using the paragraph connection style, on the contrary, each paragraph will be totally independent and isolated from the others (which means it will have its own SPID), just like what happens when you use more than one document with SQL Server Management Studio. This connection style allows each paragraph to be executed in parallel with the others.

    Ready, set, go!

    Once the settings are saved, you can start to query your SQL Server instance:

    T-SQL query executed in Apache Zeppelin 0.6.2

    See it in action

    As I did the last time, I’ve also created a quick video to show, in less than 5 minutes, how you can configure Apache Zeppelin 0.6.2 for SQL Server and run your first queries:

    https://vimeo.com/195148479

  • Run Apache Zeppelin 0.6.2 for SQL Server

    As promised in my last post, here is the first post that aims to show how to use Apache Zeppelin with SQL Server. First things first: installation. Or maybe not.

    The first thing to do, usually, is installing the product. Since we're talking Java here, things may get a little bit long and complex if, like me, you're more of a .NET guy. Even worse if you're neither a Java nor a .NET guy, and you're just a DBA or a Data Scientist. Well, Docker is here to help.

    Download and install Docker. It's very easy and takes only a few minutes.

    https://www.docker.com/products/overview

    Once it is running, open a console where you can send docker commands (any terminal if on Linux or macOS, PowerShell if running on Windows, Docker Quickstart Terminal if running using the Docker Machine toolbox) and go for the following commands:

    docker pull yorek/zeppelin-sqlserver:0.6.2

    docker run -p 8080:8080 --name zeppelin -d yorek/zeppelin-sqlserver:0.6.2

    The first command downloads the Docker image (depending on your connection speed it may take a while) and the second runs the Docker container with Apache Zeppelin inside. It also exposes port 8080 so that it can be used to reach the contained Apache Zeppelin process.

    That’s it. Now you can connect to your local machine and start using Apache Zeppelin:

    http://localhost:8080

    If you’re still using the “old” Docker Machine (maybe because, like me, you also need to use VMWare and cannot then install Hyper-V), you have to connect to your Docker Machine instead of localhost. To get the IP Address of your Docker Machine simply do

    docker-machine ip

    From the Docker Quickstart Terminal.

    To view the entire process in just a minute, here’s a short video:

    https://vimeo.com/193654694 

    Next stop: Configure Apache Zeppelin and run your first query against SQL Server.

  • SQL Server Interpreter for Apache Zeppelin 0.6.2

    I've updated the code base to Apache Zeppelin 0.6.2 and I've also finished a first simple-but-working support for autocomplete (you can activate it using CTRL + .). Right now the autocomplete is based on the keywords specified here:

    Reserved Keywords (Transact-SQL)

    It's not much, I know, but it's something, at least. Next steps will be to read schemas, tables and columns from the SQL Server catalog tables. And maybe extract the list of keywords from… somewhere else, to have more complete coverage.

    I've also removed the additional interpreters that may not be useful if you just plan to use it against T-SQL/TDS compatible engines (SQL Server, Azure SQL and Azure DW), and configured the defaults so that it is ready to use SQL Server right from the beginning.

    The code — along with compilation/install/basic usage instructions — is available on GitHub:

    Apache Zeppelin 0.6.2 for SQL Server

    Right now I've tested it only on Ubuntu Linux 16.04 LTS 64-bit. It should also work on native Windows, but since I haven't tried it yet on that platform, I don't know the challenges you may face in order to get the full stack (Java, Maven, Node, etc.) working so that you can compile and run it.

    At the beginning of next week I'll release a small tutorial to show how you can use Apache Zeppelin for SQL Server also on your Windows machine using Docker. I plan to do a few tutorials on the subject, since I find Apache Zeppelin very useful and I'm sure it will be loved by many other SQL Server guys too, once they start to play with it.

    At some point I'll also release just the bin package so that one doesn't have to compile it oneself (but hey, we do love Linux right now, don't we?) and so that it can just run on Windows, but for now I find the Docker container approach so much better than anything else (it "just runs" and I can do everything via GitHub and Docker Hub) that I'll stay with this for a while.

  • Azure Functions to Schedule SQL Azure operations

    One of the things that I miss a lot when working on SQL Azure is the ability to schedule jobs, something that one normally does via SQL Server Agent when running on premises.

    To execute scheduled tasks on Azure, Microsoft recommends using Azure Automation. While this is surely one way of solving the problem, I find it a little bit too complex for my needs. First of all I'm not a PowerShell fan, and Azure Automation is all about PowerShell. Secondly, I just need to schedule some SQL statements to be executed and I don't really need all the other nice features that come with Azure Automation. With Azure Automation you can automate pretty much *all* the resources available on Azure, but my interest, for now, is only in SQL Azure. I need something simple. As simple as possible.

    Azure Functions + Dapper are the answer. Azure Functions can be triggered via CRON settings, which means that a job scheduler can be easily built.
    Here's an example of a CRON trigger (in function.json):

    {
        "bindings": [
            {
                "name": "myTimer",
                "type": "timerTrigger",
                "direction": "in",
                "schedule": "0 30 4 * * *"
            }
        ],
        "disabled": false
    }

    CRON format is detailed here: Azure Function Timer Trigger. As a simple guideline, the format is:

    {second} {minute} {hour} {day} {month} {day of the week}

    The sample above tells the Azure Function to execute every day at 04:30. To turn such an expression into something that can be more easily read, tools like

    https://crontranslator.appspot.com/

    are available online. If you use such tools, just keep in mind that many don't support seconds, so you have to remove them before using the tool.

    Dapper is useful because it makes executing a query really a breeze:

    using (var conn = new SqlConnection(_connectionString))
    {
        conn.Execute("<your query here>");
    }

    To use Dapper in an Azure Function, a reference to its NuGet package has to be put in the project.json file:

    {
        "frameworks": {
            "net46": {
                "dependencies": {
                    "Dapper": "1.50.2"
                }
            }
        }
    }

    It's also worth mentioning that Azure Functions can be called via HTTP or Web Hooks, and thus also via Azure Logic Apps or Slack. This means that complex workflows that automatically respond to certain events can be put in place very quickly.

  • Temporal Tables

    I delivered a talk about "SQL Server 2016 Temporal Tables" for the Pacific Northwest SQL Server User Group at the beginning of October. Slides are available on SlideShare here:

    http://www.slideshare.net/davidemauri/sql-server-2016-temporal-tables

    and the demo source code is — of course — available on GitHub:

    https://github.com/yorek/PNWSQL-201610

    The ability to automatically keep previous versions of data is really a killer feature for a database, since it lifts the burden of doing such a really-not-so-simple task off developers and bakes it directly into the engine, in a way that won't even affect existing applications, so it can be used even in legacy solutions.

    The feature is useful even for really simple use cases, and it opens up a nice set of analytics options. For example, I've just switched the feature on for a table where I need to store the status of an object that needs to pass through several steps to be fully processed. Instead of going through the complexity of managing the validity interval of each row, I've just asked the developer to update the row with the new status and that's it. Now, by querying the history table, I can understand which status takes the most time, on average, to be processed.
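    A hedged sketch of the kind of query I mean (table and column names are made up, assuming the usual ValidFrom/ValidTo period columns):

    select Status,
           avg(datediff(second, ValidFrom, ValidTo)) as AvgSecondsInStatus
    from dbo.ProcessingStatusHistory
    group by Status
    order by AvgSecondsInStatus desc;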

    That's great: with less time spent doing technical stuff, more time can be spent doing other, more interesting activities (like optimizing the code to improve performance where analysis shows it is not as good as expected).

  • Azure SQL Database DTU Calculator

    One of the most common questions when you start to use SQL Azure is related to the choice of the service level needed for your workload. On the cloud every wasted resource is a tangible additional cost, so it is good to choose the service level that best fits your needs, no more and no less. You can always scale it up later if needed.

    The "problem" is that the level is measured in DTUs - Database Transaction Units - a value that represents a mix of CPU, memory and I/O. The problem is that it is very difficult, if not impossible, to calculate this value for an existing on-premises server, so that you can compare it with the performance of your well-known on-premises server.

    Well, it *was* impossible. Now you can, thanks to this tool:

    Azure SQL Database DTU Calculator

    developed by Justin Henriksen, a Solution Architect specializing in Azure, that simplifies the estimation effort a lot. After running a PowerShell script to collect some metrics on the on-premises server, you upload the collected values on that site to get an idea of what DTU level is optimal in case you want to move that database or server to the cloud.

    Of course the more your workload is representative of a real-world scenario, the better estimates you will get: keep this in mind before taking any decision. In addition to this website, there are also two very useful links to better understand what level of service is best suited to your situation:

    Enjoy!

  • Operator progress changes in LQS

    This may have gone unnoticed since August is usually a "slow" month, but with the August release there has been a major change in SQL Server Management Studio and how it shows the Live Query Statistics data.

    The operator-level percentages shown in the Live Query Statistics are now the ratio between actual and estimated rows, which means that the value can get way higher than 100%. The purpose of this approach is to make it easier to spot places where cardinality estimation got it wrong for some reason, so that you can go and try to understand the problem and fix the query in order to improve performance or reduce resource usage.

    A detailed post on this topic by Pedro Lopes of the SQL Tiger team is here:

    https://blogs.msdn.microsoft.com/sql_server_team/operator-progress-changes-in-lqs/

    Now that Management Studio is also following a monthly release schedule, the posts done by the SQL Server Release Services team about SSMS really need to be read carefully, just to be sure not to miss these small-but-huge-impact changes:

    https://blogs.msdn.microsoft.com/sqlreleaseservices/tag/ssms/
