
Davide Mauri

A place for my thoughts and experiences with the Microsoft Data Platform

  • Introduction to Azure Cosmos DB: Microsoft’s Globally Distributed, Multi-Model Database Service

    It’s time for something NoSQL at the PASS Application Development Virtual Group. Next week Denny Lee will present on Cosmos DB, the latest evolution of DocumentDB:

    Introduction to Azure Cosmos DB: Microsoft’s Globally Distributed, Multi-Model Database Service

    At Build we announced Azure Cosmos DB, the industry’s first globally-distributed service that enables you to elastically scale throughput and storage across any number of geographical regions while guaranteeing low latency, high availability and consistency – all backed by industry leading, comprehensive SLAs. Azure Cosmos DB allows you to use key-value, graph, and document data in one service, at global scale and without worrying about schema or index management. Join this upcoming webinar to learn more about Azure Cosmos DB, what it offers developers and how it can help you quickly build your modern applications.

    I’m looking forward to it: keeping up to date with all the database technologies available in the cloud is a mandatory task for today’s architects and developers, so don’t miss it. Registration, as usual, is free here:

    http://appdev.pass.org/

  • Graph data processing with SQL Server 2017 and Azure SQL DB

    As soon as I saw that SQL Server 2017 supports graph database concepts (queries and modeling), I started to play with it and wanted to know more. So I took the chance to ask the program managers working on that feature to come and speak at the PASS Application Development Virtual Group, and they kindly accepted. :)

    So, if you’re also interested in the topic (and you should be!), point your browser here and register; the meeting is set for June 8th, online:

    Graph data processing with SQL Server 2017 and Azure SQL DB

    Shreya Verma & Arvind Shyamsundar

    Microsoft has introduced Graph Database Processing to SQL Server 2017, which will help users track many-to-many relationships between entities and gather powerful insights from different pieces of connected data. This is well suited for applications where users need to extract information from different pieces of connected data, for example IoT, fraud detection, recommendation engines, predictive analysis, etc. In this session, we will learn about the new features and demonstrate them by going over some use cases.

  • From ElasticSearch back to SQL Server

    Sharing with the community why we decided to move back from ElasticSearch to SQL Server

    Sometimes it happens: you move from something that is considered cool and bleeding edge back to something that has dust on it.

    Well, at least this is how SQL Server has been perceived lately by the majority of developers and software architects. Microsoft must have heard this sort of disappointment with its RDBMS and decided to make up for the lost time and try to make it cool again.

    If you still think that SQL Server (and its twin, SQL Azure) is the “old” RDBMS with all its “rigid” limitations, my advice, as a Software and Data Architect, is to go back and check it again.

    This is what we did at Sensoria over the last few months, and we have now moved everything from ElasticSearch back to SQL Server.

    I think that what we discovered and the experience we gained are worth sharing: I believe that a lot of companies will find themselves (or already are) in the same situation. Maybe it won’t be ElasticSearch but another product, but the story will be the same. Here it goes.

    The Story So Far

    Something like five years ago, if you were looking for a highly scalable database solution that could also provide a lot of flexibility in terms of the objects stored and could also offer a good analytical solution, one of the options was ElasticSearch. It can ingest and index JSON documents using a schema-on-read approach (instead of the schema-on-write approach typical of RDBMSs), it allows for really easy scale-out, it offers a very nice monitoring tool (based on Kibana), it uses REST as its communication protocol so it is compatible with everything (which makes it perfect for the web), and, last but not least, it is based on the well-known Lucene technology, which is solid and mature.

    So ElasticSearch it was. It did great for several years: millions and millions of documents have been stored in it, coming both from end users using our products and from internal machine learning algorithms that crunch numbers and try to turn all the raw data into something more useful for the athlete, making sense of the data we gather from garment sensors.

    Requirements and possible options

    When I joined Sensoria, we were in the middle of an architectural revision. Of course, one of the systems being reviewed was the database. We had to decide whether to stay with ElasticSearch or move to something different. I was in charge of this decision, given my strong data architect background, and I could have chosen any database. We could have stayed on ElasticSearch or switched.

    ElasticSearch has lately clearly stated that its focus is on search and analytics, and that ruled it out immediately, since we needed a database where we could store all our data in the first place.

    We started to evaluate the options. One of our main objectives is to go 100% PaaS so, being on Azure, one of the first options was DocumentDB (now Cosmos DB).

    Besides storing and querying JSON, we also looked for something that would allow us to quickly aggregate data using any possible combination of filters and aggregation logic. For example, someone may want to see their heart rate aggregated on a weekly basis for the last two years, while someone else may want to get their pace aggregated by day for the last month. This must be done in real time while, at the same time, new sessions are loaded and processed.
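    A minimal sketch of such a query; the table and column names (dbo.SessionMetrics, MeasuredAt, HeartRate) are hypothetical, not our actual schema:

    -- Hypothetical example: average heart rate per week over the last two years
    SELECT
        DATEPART(year, MeasuredAt) AS [Year],
        DATEPART(week, MeasuredAt) AS [Week],
        AVG(HeartRate) AS AvgHeartRate
    FROM dbo.SessionMetrics
    WHERE MeasuredAt >= DATEADD(year, -2, GETUTCDATE())
    GROUP BY DATEPART(year, MeasuredAt), DATEPART(week, MeasuredAt)
    ORDER BY [Year], [Week];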

    We need to ingest data coming from our IoT devices, apps and SDKs, but we also need to store “classic” data like user accounts, preferences, device info and so on.

    Architectural considerations

    Of course, there is no one-size-fits-all solution. Or, better, there is no such solution at a decent cost.

    We’re also implementing a Lambda architecture, and the final data store used to serve data to the end user only contains “squeezed” data, where the schema is actually well known; otherwise developers wouldn’t know how to interact with the stored data in order to efficiently generate statistics and charts. In the other stages data is stored in other, more suitable places. For example, we’ll use a streaming query engine to query data flowing into the system in real time, and we’ll be using a data lake to store the raw IoT data.

    Such an architecture allows us to focus, at least for now, only on the master database (I like to call it an online transactional data warehouse, since what we want to do is basically real-time analytics), so the search scope is now narrowed down and well defined.

    Technical decision?

    The truth is that almost all the enterprise databases we evaluated, relational or not, provide, albeit with different capability levels, the features we were looking for in our master database.

    So the decision was not only a technical one, but more a strategic one. Which database, in our context, will give us the richest natively supported feature set, the lowest amount of maintenance and the simplest platform, so that our developers can use it efficiently? All at a competitive cost, of course.

    SQL Azure

    As is obvious from the title of this article, the final decision was to use Azure SQL. You may think my choice is biased since I’m a Microsoft Data Platform MVP and my background is mostly SQL Server. “If all you know is a hammer, everything will look like a nail”, right? Sure, but while that may have been true of SQL Server several years ago, the engine has improved a lot in the latest versions, both in terms of raw performance and features offered. Here are the features that motivated my choice:

    • Columnar storage support: perfect for real-time aggregations
    • Columnar and Row storage at the same time: optimum to perform aggregate and singleton queries
    • In-Memory (Lock-Free structures): great to support high-concurrency workloads
    • Partitions: perfect for distributing workload
    • Temporal Tables: great to simplify development
    • JSON support with Non-Clustered & Generalized Inverted Indexes: perfect to support the migration from ElasticSearch and also to make life easier for developers
    • Row-Level Security: optimum for securing data at the source

    And these are the features that are available now. In the near future SQL Azure will also provide native support for Graph Queries, all within the same engine.

    So technically it’s great, but that alone was not sufficient to justify it. Costs played a huge role in the final decision. Initial tests showed that we could handle the load managed by several ElasticSearch servers with just an entry-level Azure SQL Premium database, thanks to the columnstore indexes. The costs showed a reduction ratio of 3:1, which the business guys immediately loved.

    Integration is a key factor

    For me, having all the aforementioned features in just one product is really a big thing. Of course, SQL Server / Azure SQL may not excel at all of them, so you don’t get the best of the best, but, except for rare edge cases, I prefer to have just one tool that provides hundreds of features at a near-perfect level rather than hundreds of tools, each doing just one thing but doing it perfectly.

    The fact that everything is integrated means that we don’t have to move data around, manually figuring out how to integrate different data stores while keeping logical consistency and integrity of data.

    Simplicity is a big plus. Making our solution simpler, much simpler, means lower development and maintenance costs.

    Having to learn only one query language is also a big plus. One of the downsides of using different databases is that NoSQL databases tend to each have their own query language, which makes everything a little more challenging: developers need to master n languages, which makes them way less productive, and bugs and performance problems are just more frequent.

    Mastering a database well is really important because it keeps developers from going on a caching spree. Caching is a fantastic technology if used correctly (we use the well-known Redis database for that). If used just to hide performance problems, it will quickly turn your solution into something worse than the worst spaghetti code you have ever seen, with concurrency management and stale-data issues added on top. Hell is a nice place in comparison.

    So far, for us, the feature set, the integration of everything, the SQL language and the existence of very good tools (this is another really important point to keep in mind: good tools allow productivity to increase a lot) were pointing us to Azure SQL. But what about scalability? Will it scale enough for us?

    Scalability is in the fabric

    One argument that was true when SQL Server only offered an on-premises option is the scalability factor. What happens if we need to scale out to handle a lot more data and concurrent access than we have right now?

    Besides the fact that our architecture uses a queue to handle spikes in workload, especially for data ingestion, Azure offers the solution. Azure SQL allows the database to scale up to 4 TB of data and, should this not be enough, Elastic Pools will help even more. We comfortably fit in that space for now (we’re not Twitter or Facebook just yet…), so we’re fine within that limit (also because all the raw data stays in a data lake, and that is usually where things get huge).

    Being a PaaS solution also means that we no longer have to manage clusters, nodes, updates, reboots and all the well-known things that you normally have to deal with when you maintain your own servers.

    Python and R are also in the mix

    A lot of our machine learning and advanced analytics code is based on R and Python. With the latest release of SQL Server, besides native support for R, Python has also been added as a language that can be used in-database to crunch data. This means that we can reuse all our assets based on scikit-learn. While this is not yet available on Azure SQL, I expect it to come soon, as also mentioned at the //build 2017 event.
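    Just to illustrate what in-database Python looks like in SQL Server 2017 (a minimal sketch, not our production code; it assumes Machine Learning Services is installed, and the dbo.SessionMetrics table and its columns are hypothetical):

    -- Run a small pandas aggregation next to the data instead of moving it out
    EXEC sp_execute_external_script
        @language = N'Python',
        @script = N'OutputDataSet = InputDataSet.groupby("UserId", as_index=False)["HeartRate"].mean()',
        @input_data_1 = N'SELECT UserId, HeartRate FROM dbo.SessionMetrics'
    WITH RESULT SETS ((UserId int, AvgHeartRate float));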

    Refactoring Code

    One key factor that allowed us to use Azure SQL is the native support for JSON that it now finally has. While it is still a v1 feature, it is enough, for us, to implement all the features we needed to replace ElasticSearch. I’ll write a more technical post on the subject in the future, so stay tuned if you’re interested in the implementation details but, as you can imagine, the trick was to fool the code into thinking it was still talking to ElasticSearch, so overall the changes to our code have been minimal.
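    To give a flavor of what this looks like (a hedged sketch only: the table name, JSON paths and values below are made up for illustration), a JSON document can be stored in a plain nvarchar(max) column, validated with ISJSON, queried with JSON_VALUE and indexed through a computed column:

    -- Store documents as JSON, expose one property as a computed column and index it
    CREATE TABLE dbo.SessionDocuments
    (
        Id bigint IDENTITY PRIMARY KEY,
        Doc nvarchar(max) NOT NULL CONSTRAINT CK_SessionDocuments_Doc_IsJson CHECK (ISJSON(Doc) = 1),
        UserId AS CAST(JSON_VALUE(Doc, '$.userId') AS int)
    );

    CREATE NONCLUSTERED INDEX IX_SessionDocuments_UserId ON dbo.SessionDocuments (UserId);

    -- Queries can then filter on the indexed property and extract other values on the fly
    SELECT Id, JSON_VALUE(Doc, '$.activity') AS Activity
    FROM dbo.SessionDocuments
    WHERE UserId = 42;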

    Conclusions

    ElasticSearch does offer a lot of nice features but, on Azure, it is not offered as PaaS. Since it is mostly focused on analytics and search, if we had chosen it we would have still needed a database to store OLTP data. Cosmos DB is PaaS, offers a lot of flexibility and also supports SQL, but… it’s another database, one that would have required us to create an additional integration solution to keep the existing Azure SQL data and the newly created Cosmos DB data in sync. We also have a quite well-defined schema, so Azure SQL was the perfect choice for us in terms of the ratio between costs, performance and features. It provides a very good range of capabilities, from OLTP to (real-time) OLAP, which is great for us, at least until we reach the 4 TB limit. Just like my beloved Python language, I like to think of Azure SQL as a “batteries included” solution.

    What I’d like to underline here is that we’re now living in an era where technology is usually not the limiting factor. Humans and our limited knowledge are. So what is really important now is the balance between costs, features and the simplicity of the overall solution. Technology should help us create solutions that are as simple as possible (both for developers and users). But no simpler.

    And this clarifies the role of Solution/Data/Software Architects especially well: find the best technical solution with the best benefits/costs ratio, now and in the near future. And make the solution modular so that in the distant future we only have to change what needs to be changed in order to keep up with new challenges.

    Luckily we live in an era where the cloud makes this possible. We just have to evaluate and re-evaluate technologies without any preconceptions.

  • PASS AppDev next webinar: “Introducing R”

    On May 25th the PASS Application Development Virtual Group will host a new webinar delivered by the well-known Dejan Sarka, and the topic will be R (https://www.r-project.org/), which is more and more used by developers too:

    Introducing R

    R is a free software programming language and software environment for statistical computing, data mining, and graphics. R is extensively supported in SQL Server 2016 suite, in Power BI, and in Azure ML. In order to use it, you have to learn the language. After attending this session, you should be able to write simple R code and use it in the products and services mentioned.

    As usual the webinar is free, and it will be accessible here:

    https://attendee.gotowebinar.com/register/8547344291727590146

  • PASS AppDev Recording: “Building REST API with SQL Server using JSON functions”

    Last week Jovan Popovic (SQL Server Program Manager for JSON support and other nice things) delivered a very interesting webinar on JSON and SQL Server / Azure SQL for the PASS Application Development Virtual Group. If you’re interested in the subject, here are a couple of links that will make you happy:

    The full recording is on the PASS Application Development Virtual Group YouTube channel:

    https://www.youtube.com/watch?v=0m6GXF3-5WI&feature=youtu.be

    Slides are available on Slideshare:

    https://www.slideshare.net/JovanPopovic/building-rest-api-using-sql-server-and-json

    while demos are on GitHub:

    https://github.com/Microsoft/sql-server-samples/tree/master/samples/features/json/product-catalog

    Jovan showed not only how this feature can help in .NET development, but also how it makes life so much easier when using SQL Server from any other language, be it JavaScript or Node.js.

    I’ve been using JSON with SQL Server for a while now, and I can really say it’s a game changer for developers. If you haven’t played with it yet, you definitely have to give it a try: managing a dynamic schema has never been so simple.

  • Data Juice

    The data ecosystem explained in 10 simple images

    A couple of weeks ago I was explaining how IoT and Big Data can be game changers in sports and human activities. Yeah, it’s cool to put a sensor in your own grill so that it can automatically tell you the exact time you should flip your filet, but I think it’s even cooler to instrument the human body and objectively quantify human activities, especially in sports and health care, so that we can enjoy even more the sports we love and the life we live. After all, knowledge is power, right? And the first step to knowledge is the ability to measure: IoT is really a big deal here.

    But then? From a business perspective, how and where will all the gathered data be stored? What do terms like Data Lake, Data Science and the like really mean? Surprisingly enough, the answer is as simple as describing the life cycle of something that everyone knows very well: orange juice.

    I created a slide deck, named Data Juice, to explain the whole data ecosystem to C-level guys who had to understand the business value of every piece of that ecosystem, but without having to deal with technical details. It had to be clear, concise and easy to remember. The juice analogy was just perfect, as you’ll see.

    Now, the same concept applies to IoT: the sensors working right now in the wearable you’re wearing almost surely collect raw data (acceleration, magnetic field, etc.) at something around 50 samples per second (sometimes up to 100). But what you care about is not that raw value; it is the trend of your preferred exercise performance.

    What’s the line that connects those two points? Happily enough for me, answering this question also allows me to describe what kind of work I do and how it is related to sports, which was one of the questions I was recently asked by a well-known tech & lifestyle magazine.

    Here’s the answer:

    Data Sources

    New data is born every day.

    New data is generated every day, everywhere. Accounting systems, IoT devices, machine logs, clickstreams… anything generates some kind of data. At this stage data is generated and consumed, usually, locally and in very small sizes. For example, an accounting system generates an invoice and allows users to interact with it. One at a time. Just like oranges that are born on the tree: you can just grab one and consume it in place.

    Big Data

    Oranges are harvested and sent to a factory for processing. The factory needs to be prepared to receive oranges of any size, kind, color and ripeness. Sometimes it will get only one load a day, sometimes several per hour. What happens with data is really similar. The famous 3 Vs, Volume, Variety and Velocity, say that your data system must be able to accept almost any kind of data. Again, just like with oranges: if you want to be in the business, you had better be prepared to handle them when they arrive.

    Data Lake

    All the ingested data needs to be stored somewhere. And it had better be a place where space can grow as needed; it should also be pretty cheap if you don’t want to spend all your money storing everything, even if you don’t know whether and how you will use that data in the future. Such a place also needs to support querying and processing capabilities in order to allow you to dive into the lake. You don’t really care too much about shape, format or correctness here. You just store data as it was generated, just to have it in case you need it. Even if it may not be good for consumption.

    Data Curation

    Once also known as ETL, Extract-Transform-Load, this is the process where data is selected, cleansed, standardized and integrated with additional information, if available. More or less how oranges are cleaned, divided into different groups for different usages and, if not in good condition, discarded.

    Data Warehouse

    Once data has been processed by the previous system, it needs to be stored in a safe place. Safe not only in the sense that no one can steal from it, but also in the sense that everyone who gets data from it can assume it is safe to consume. Again, just like oranges: if you get one from a market warehouse, you can safely enough assume that you won’t be poisoned. The same goes for data. If you get data from the Data Warehouse, you can safely assume that it is safe to consume, meaning that you can use it to make informed decisions. The same does not apply to a Data Lake, where usually you don’t really know what you’ll get and how to deal with it.

    Data Mart

    Once you get your oranges from the store, you can decide how to consume them. If you have hundreds of boxes you won’t consume them one by one. You’ll surely try to transform them into a product that is easier to consume. Depending on the target, you may want to process them in different ways, in order to make sure that everyone gets exactly what they need, no more and no less. A Data Mart is exactly this: data ready to be consumed with very little additional processing, if any at all.

    Of course, different people or different situations may require different aspects of that data, so having more than one Data Mart is quite common. If something is not clear enough in the Data Mart, one can always go to the Data Warehouse that lies behind it and check how and why such unexpected data has been produced.

    Self-Service Analysis

    Also known as Self-Service BI, this is what you do when you’re not happy with the Data Marts: since none of them provides exactly what you need, you have to do it yourself. You need to squeeze the juice of three different oranges, of three different kinds, in order to create the perfect mixture you want. You go to the Data Warehouse (usually, but sometimes you can also grab data from different Data Marts) and create your mix. Maybe adding a hint of cinnamon taken from the outside world.

    Business Intelligence

    Using all the things described so far, you apply the Business Intelligence process to figure out how to drive your business, check how it is performing and analyze the outcome of defined strategies to stay in business and, possibly, be successful.

    Data Science

    What to do with all the vast amount of unused data in the Data Lake? Maybe among the oranges that do not fit the current sales plan and target there is the next “big thing”. Something you can leverage to create a new product, sell a new service, or something that can help you understand whether anything in the collect-store-process-enrich life cycle can be improved. Usually this means that you are looking for questions, not for answers. This is the main difference between the process of doing Business Intelligence and the process of doing Data Science: in the first you look for answers to questions you know, in the second you look for questions you don’t know yet.

    Data Juice

    In conclusion, this is what information technology, from a data standpoint, is all about: squeezing every drop of information out of all the data we have, so that we can gain knowledge that ultimately leads to wisdom.

    So now, when someone asks me what kind of work I do, I say that “I squeeze information out of raw data”. Easy, understandable and, actually, really precise.

  • PASS Application Development Virtual Group

    I've been involved with PASS, once named the Professional Association of SQL Server users, now just... PASS, since 2006, when I stepped up as president of the Italian SQL Server User Group. Now that I have moved to Redmond, WA, I'm still involved with it, but of course I'm not leading the Italian User Group anymore.

    Since I really think that community activity is vital for a healthy and successful career, as soon as I noticed that the Application Development Virtual Group was not very active anymore, I decided to offer my help. So here I am: I've just stepped up as the new leader of that group, AppDev for friends:

    http://appdev.pass.org/

    The focus will be on the entire Microsoft Data Platform, so not only SQL Server, and the languages we'll try to cover are, of course, C# and .NET in general and also, more broadly, any language that can interact with the Microsoft Data Platform. So Python (my passion!), Java, Ruby, F#, Rust, R... anything you can think of. Of course, everything will be approached from a developer perspective. No administration here, no sir. Security? Yes. Performance? Sure, give me more! Best practices? Absolutely! Everything that can be of interest to developers and architects will be in our scope (given that it is somehow related to the Microsoft Data Platform, of course).

    I've just started to contact some great speakers so we can start on the right foot, and the first has just confirmed his support. Jovan Popovic, SQL Server / SQL Azure PM, will present a session on JSON, one of the most requested and powerful features in SQL Server 2016 and beyond:

    Building REST API with SQL Server using JSON functions

    Registration is free both for the seminar and for the Virtual Group. You can find the registration link and additional info on the UG homepage:

    http://appdev.pass.org/

    I recommend registering for both, so you won't miss a very interesting webcast (which will be recorded and made available on YouTube for future reference) and you'll also stay informed about the future activities, which will be, I can assure you, just amazing!

    And, last but not least, if you have a topic you want to present, just let me know.

  • For the Better Developer: Notes

    I want to be clear about the target and the scope of the “For the Better Developer” series.

    The articles don’t aim to be 100% precise; the aim is to give you, the developer, who understands how important the ability to manipulate data is in today’s work, a good knowledge of how to use some features, not of how they really work behind the scenes. I hope that you will be interested in learning how things work, and the section “for those who want to know more” that you can find at the end is there with that specific purpose in mind. But again, it doesn’t really make sense to dive really deeply into how things work, since there are a lot of extremely good books, articles, documentation and white papers that already do this. My objective is to give you enough information to

    • help you in your work
    • spark in you the will to know more

    so sometimes I have to simplify things in order to make them consumable

    • by everyone, especially those who don’t have a deep database background yet
    • in just a bunch of minutes, since you’re probably reading this while at work, and you surely have some work to deliver within a deadline that is always too near

    The articles in the series assume that you already have a basic practical knowledge of a database. For example, you know how to create a table and how to query it.

    The section “for those who know more” is created for those who already know something deeper about databases and who found the simplifications in the article too… simple. Here I explain and clarify the topics I had to simplify a lot, sometimes even telling a lighter version of the truth, in order to make them easily accessible to everyone. But the truth needs to be told, and this section is where I tell it.

  • Know your data(base), or Introduction to the “For the Better Developer” series

    After 15 years in Database Performance Tuning, Database Optimization, Data Modeling, Business Intelligence, Data Warehousing, Data Analytics, Data Science, Data Whatever, I’m back in the development space, where I also lived for the first 10 years of my career.

    Houston, we have a problem

    After 15 years I still see the same big problem: the huge majority of developers don’t really know how to properly deal with data. It’s not a critique or a rant. It’s just a matter of fact: especially several years ago, if you spoke to a developer, it was clear that dealing with data, and especially with a database, was not something he felt was part of his job.

    This is not true anymore: any developer today knows that a bit of database knowledge is needed. No, better: it is mandatory. Still, that way of thinking that was common years ago created a huge hole in the developer knowledge bag that is showing its bad effects now. If you didn’t learn how to properly deal with a database, it’s now time to fill that gap. Unfortunately 15 years is a lot of time and databases evolved immensely in that time frame, especially relational databases, so trying to recover the lost time may be really hard. Still, it’s mandatory. For you, your team and the company you own or work for.

    I really do believe in community and sharing, so I decided to try to help fix this problem. If you want to read the full story, go on; otherwise you can just skip to the last paragraph. As always, it’s your choice.

    Changing times

    The market changed, asking for more integrated and interdisciplinary knowledge in order to be able to react more quickly to changes. Developers needed a way out: databases were just too old and boring, and they really didn’t want to learn SQL, that freakin’ old language.

    For a while the NoSQL movement seemed to come to the rescue, promising to give developers something that would take care of data for them. Auto-Indexing, Auto-Sharding, Auto-Scaling, Auto-Replication. JSON queries. JSON data. JSON everything. No Schema. Just freedom, beers and peace for everyone.

    The world today

    Of course, that was (obviously) too good to be true. Even NoSQL databases, now that they have matured, need to have data properly modeled and indexes correctly defined, used and maintained, not to mention security, transactions, and techniques to optimize performance.

    There is a big difference between now and 15 years ago: now we live in a world where data is at the center of everything, and so developers need to know how to handle it. Look at any job offer for developers and you will find that some degree of database knowledge is always required.

    You probably already know how to query several databases, be they SQL or NoSQL: you had to learn it to survive.

    Become a Better Developer

    Frankly speaking, surviving is just not that nice. Evolving, thriving, accomplishing something bigger is nice. And to do that, you really have to be able to make a difference. To stand out from the crowd, there is only one thing you have to do: become good, extremely good, at dealing with data.

    You don’t have to become a DBA or a Data Scientist if you don’t want to. But dealing with data from a position of control, rather than just trying to avoid dying under its weight, is something any good developer should learn.

    Do you want to use MongoDB? Go and become the champion of MongoDB. Are you excited about the new Google Spanner? Go and learn it like there is no tomorrow. Choose one or more databases of your choice and master them. It will make a whole lot of difference, since you will have the power to turn the data into what you want, without having to reinvent the wheel every time.

    More and more computing power is being moved into the database. It is just much easier to move compute logic into the database than to move the huge amount of data we have to deal with every day out to the application. This is already happening and will continue in the foreseeable future: knowledge of a database is now more important than ever. Performance, and thus costs, depend on how good you are at manipulating data.

    The cloud makes all these reasons even more important, since the relationship between performance and costs is stronger than when you are on-premises.

    A good data model, the correct index strategy and cleverly written code can improve your performance by orders of magnitude. And reduce costs at the same time.

    My choice

    I have chosen SQL Server for myself. I think that nowadays it is the most complete platform: it supports all the workloads one can ask for and offers NoSQL, In-Memory and Columnar capabilities all in one product. All with exceptional performance and a great TCO. Plus, it lives in the cloud.

    Just to make it clear: I’m not a fanboy. I’ve also studied and used MongoDB, CouchDB, Memcached, Hadoop, Redis, Event Store, SQLite, MySQL, ElasticSearch and PostgreSQL. But since the Microsoft platform is where I work most of the time, it makes sense for me to use SQL Server.

    To be 100% honest and transparent: I’m also a Microsoft Data Platform MVP. But that’s not why I like SQL Server and SQL Azure so much. It is because I wanted to be a developer who can make a difference that I learned SQL Server so well, which in turn drove me to be recognized as an MVP. Since I’m a data lover, I always take a deep look at what the market has to offer: I always go for an objective and rational approach.

    “For the Better Developer” series

    I really love development, and I love it even more when I am among talented developers who love their job and know the technologies they work with extremely well, because when that happens, great things happen as a result. When that happens my work becomes pure joy. It becomes poetry. I’ve been lucky enough to experience such a situation several times in the past. And I want more.

    So I want to increase my chances of being in that situation, and that’s why I’m starting a series of articles that will try to help everyone who wants to become a better developer with SQL Server and SQL Azure.

    If you’re a developer who needs to use SQL Server or SQL Azure, you will find in the articles of this series a quick way to learn the basics and also learn where to look for deeper knowledge.

    Here I’ll keep the updated list of the articles I have written so far:

    SQL Server Indexes

  • For The Better Developer: SQL Server Indexes

    Indexes?

    Indexes are a key factor for a database, relational or not. Without indexes the only option for a database would be to read all the data and discard what is not needed. Very inefficient.

    If a query has a where or a join or a group by clause and you feel performance is not good, before trying to figure out how to cache results in order to improve overall application performance — which will surely help, but will put on you the burden of maintaining the cache, making the solution more expensive and complex to develop, maintain and evolve — keep in mind that you will surely benefit from using an index. How much is hard to say, but I would expect performance improvements in the order of 10x to 1000x. Yes: an index can make such a difference.

    More on caching and other stuff is discussed at the bottom of the article, in the “For those who want to know more” and “For those who know more” sections.

    Notes

    Wondering who I am and why I’m writing these posts? Read the introductory article here. Also be sure to read the notes about the “For the Better Developer” article series, so that you can know what to expect.

    The indexes you have to know

    There are two principal types of indexes in SQL Server: clustered and non-clustered. The first main difference between the two is that a clustered index works directly on the table data, while a non-clustered index works on a copy of that data.

    Both index types support two different storage models: row-store and column-store.

    There are therefore four possible combinations to choose from when we need to add an index to our tables.

    Clustered Row-Store Indexes

    An easy example to explain this index is an encyclopedia or a phone directory. Data is stored according to the order specified by the index; more specifically, by the columns declared when the index is created.

    create clustered index IXC on dbo.Users(Surname, Name)

    In the example above, rows will be stored ordered by Surname and then Name. Just like in a phone directory, duplicates are allowed.

    Each row will also be stored along with all the remaining column values. If the table dbo.Users has the following definition:

    create table dbo.Users
    (
    Id int not null,
    Surname nvarchar(50) not null,
    Name nvarchar(50) not null,
    DateOfBirth date not null,
    PhoneNumber varchar(50) not null,
    Address nvarchar(100) not null,
    City nvarchar(50) not null,
    State nvarchar(50) null,
    Country nvarchar(50) not null
    )

    the index rows will contain not only the indexed columns but also Id, DateOfBirth, PhoneNumber, Address, City, State and Country. This happens because, remember, a clustered index works directly on the table data.

    Of course, since a clustered row-store index works directly on the table’s data, there can be only one clustered index per table.

    When should you use this type of index? It is perfect if:

    • you are looking for zero, one or more rows in a table
    • you are looking for a range of rows: from “A” to “B” for example.
    • you need to have the rows returned in a specific order

    As you can see, it fits all usage scenarios and, in fact, the clustered row-store index is very flexible and can be useful in almost every scenario, but unfortunately, as already said, only one can exist per table. This limitation is imposed by the fact that such an index physically orders data in the table and, of course, you can order data in only one way at a time.

    If you need to order it in more than one way at the same time, you need to duplicate some or all of the data, and that’s why Non-Clustered Row-Store Indexes exist.

    Non-Clustered Row-Store Index

    A Non-Clustered Row-Store index is somewhat like the index you can find at the beginning of a magazine, where the topics are listed along with a pointer to the page on which they can be found.

    In the case of SQL Server, when you create a non-clustered row-store index on a column — and, following the phone directory example, let’s say the PhoneNumber column — all values in that column will be indexed, and each one will have a pointer that allows SQL Server to relate that value to the row in the table that contains it.

    create nonclustered index IXNC on dbo.Users(PhoneNumber)

    Again, please note that the values in the chosen columns need not be unique.

    A non-clustered index requires additional space, since it stores a copy of all the data in the columns you specified in the index definition, plus the space needed for the pointer.

    When the index is used to locate some data, the result of that search is a pointer. This means that in order to get the real data, a lookup operation needs to be done. For example, given a PhoneNumber you want to know who will answer if you call it. SQL Server will use the index on PhoneNumber and, after having found the number you are looking for, will take the pointer and use it to find all the related values like Name, Surname, Address and so on. A lookup is nothing more than a jump to the location indicated by the pointer in order to access the real data.
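    As a concrete illustration (the phone number is obviously made up), a query like the following can use the IXNC index to find the matching row and then perform a lookup to fetch the other columns:

    -- Seek on the non-clustered index, then lookup to get the remaining columns
    SELECT Name, Surname, Address
    FROM dbo.Users
    WHERE PhoneNumber = '555-0100';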

    Of course, this additional operation has a little overhead, an overhead that is directly proportional to the number of rows for which the lookup is needed. Are you looking for a thousand rows? A thousand lookups need to be done. The consequence of this behavior is that it doesn’t always make sense to use an index, even if it exists. If the effort of doing all the lookups (which can be an expensive operation) is too high compared to reading the full table and just discarding the rows that don’t fall within the scope of the requested search, the index will not be used.

    As a general rule, then, it can be said that this index is perfect when you are looking

    • for zero or one row in a table
    • for *very few* rows in a table

    As a rule of thumb, “very few” means 1% or less of the rows in your table.

    Clustered Column-Store Index

    This index effectively turns SQL Server into a columnar database. Really. Once created on a table, since it is a clustered index, it changes the way SQL Server stores that table: instead of saving data one row at a time, data will be saved one column at a time, following the principles of columnar databases. For that reason you don’t really need to specify any columns when creating such an index, since it will be applied to the whole table and thus all columns will be indexed:

    create clustered columnstore index IXCCS on dbo.TrainingSessions

    Column-store indexes are perfect for analytical queries. All queries in which data needs to be grouped, optionally filtered and aggregated fall into this scope. Performance can easily be increased by a factor of 10x, and it’s quite common to see improvements of 30x and more. This is also a result of the fact that, to work properly, columnar solutions need to compress data, and SQL Server is no exception: the clustered column-store index will compress the data in your table using several techniques, and you may see your table’s space shrink a lot, up to 60%-70% or even more.

    Column-store indexes are especially good in read-intensive scenarios, while they do not excel at being written to very often. There is an exception for the bulk-load scenario. A bulk load is activated when you use the T-SQL command BULK INSERT, load data into SQL Server using the SqlBulkCopy object, or use any solution, like SQL Server Integration Services, that supports the Bulk Load option.

    If you have a table in which the majority of the queries have this pattern:

    SELECT 
    <aggregation_function>(Column1),
    <aggregation_function>(Column2),
    ...
    FROM
    <table>
    WHERE
    ...
    GROUP BY
    Column3,
    Column4,
    ...

    and write operations, compared to read operations, are a small number, then give this index a try. You will probably be amazed by how quickly data can be aggregated and manipulated, even on huge tables. (Huge: hundreds of millions of rows or more.)
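    As a concrete instance of the pattern above (the columns on dbo.TrainingSessions are hypothetical, just for illustration), a dashboard-style aggregation could look like this and would benefit directly from the clustered column-store index:

    -- Daily totals per user; only the needed columns are scanned thanks to columnar storage
    SELECT
        UserId,
        CAST(StartTime AS date) AS SessionDate,
        SUM(Calories) AS TotalCalories,
        AVG(HeartRate) AS AvgHeartRate
    FROM dbo.TrainingSessions
    WHERE StartTime >= '20170101'
    GROUP BY UserId, CAST(StartTime AS date);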

    The downside of this index is that, while it is great for working with aggregations, it is not that good at retrieving one or a few rows with all their columns. So you may want to mix row-store and column-store indexes together, and that’s why the next, and last, index on the list may be very useful too.

    Non-Clustered Column-Store Index

    This index allows you to create a column-store index on only a subset of the columns of your table.

    create nonclustered columnstore index IXNCCS on dbo.Users(City, State, Country)

    This is helpful when you need your table to support a row-store index, which is quick for retrieving whole rows, but you also need to quickly aggregate some columns of your data.

    Using our phone directory example again, the above index will make any aggregation on City or State or Country, or a combination of them, really, really fast.

    Let’s take a more real-life scenario: say you have a database that stores training sessions and you want to be able to present your users a nice dashboard with aggregated information for the last six months of the most useful indicators, like calories burnt, number of trainings done and so on, and you also need to be able to quickly retrieve a single row or a bunch of rows showing all the data you have for a specific training session. The non-clustered column-store index, especially when used together with a clustered row-store index, is just perfect for this job, as the sketch below shows.
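    A minimal sketch of that combination (column names are hypothetical; note that this is the alternative to the clustered column-store index shown earlier, since the clustered slot is now taken by the row-store index):

    -- Row-store clustered index for fast singleton lookups by session
    create clustered index IXC_Sessions on dbo.TrainingSessions(SessionId)

    -- Non-clustered column-store index covering the columns used by the dashboard aggregations
    create nonclustered columnstore index IXNCCS_Sessions on dbo.TrainingSessions(UserId, StartTime, Calories, HeartRate)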

    Be Warned

    Creating, altering or dropping an index may have a huge impact on the server. The bigger the table, the bigger the impact. Depending on the size of your data, such an operation can take seconds or hours, and in the meantime all users trying to use the database will be affected in one way or another. So before playing with indexes, test the script on a test server to evaluate the impact.

    Conclusions

    You now have the basic foundation to understand how to make applications go faster. Much, much, much faster.

    Take the samples above, replace trainings and users with your own objects, like invoices, bills, orders, whatever you want, and you can easily understand how SQL Server can be perfect for managing a real-time analytical workload and, at the same time, a transactional workload, which is a quite typical requirement.

    This is a really unique feature in the DBMS space today.

    My opinion is that, if you have less than 1 TB of “hot” data (data updated quite frequently during the day), you don’t really need to look for a Big Data solution, since the overhead it would put on you and your team can be big enough that I can hardly justify it for 1 TB of data. If you have 1 PB, or close to it, then let’s talk.

    External References

    Indexes are really a huge topic; here we have just scratched the surface, and only a bit. I’ll surely talk more about indexes in future posts, but if you don’t want to wait and you’re interested in the topic, which is my main objective with this post series, you can go to these external references:

    For those who want to know more

    Caching

    SQL Server has its own internal cache. Actually, it has several different caches. The most important is the buffer cache: this is where data lives. In a healthy situation 99% of your data comes from the cache. Yeah, right, 99%.

    Row-Store Indexes

    Row-store indexes are implemented as B+Trees. The technical difference between clustered and non-clustered is that for a clustered index the leaf pages contain the actual data, while for a non-clustered index they contain only the data related to the columns used or included in the index, plus the pointer.

    Clustered Row-Store Index

    SQL Server doesn’t really enforce the physical order of the data after the index has been created. Data is ordered following the index definition during the index creation phase, but after that, if there are operations that insert, delete or update data, the physical order is not preserved. What is preserved is the logical order of the data. This means that data will be read in the correct sequence, which may be different from the sequence in which rows can be found on disk. This logical/physical difference is the index fragmentation, which is not necessarily a bad thing. It usually is a bad thing, but not in 100% of the cases. For example, if you always access your data one row at a time, you care little about fragmentation. Or if you want to insert data as fast as possible, you don’t really care about fragmentation; on the contrary, you may even want fragmented data in order to reduce the chances of having hot spots where contention will happen and thus your performance will suffer. Anyway, as usual, edge cases like the two described must not be used as the general rule.

    Non-Clustered Row-Store Index

    You can have more than one non-clustered row-store index on a table. Of course, you can create such an index on more than one column. This enables some nice “tricks”, like the covering index, that I will describe in future articles.

    Other Indexes

    There are many more index types than the ones I described. They are specific to a certain data type, like geospatial indexes, or to a specific feature that I will discuss in another post, like Hekaton, the In-Memory, lock-free engine. The indexes I described in this article are the most commonly used.

    For those who know more

    Caching

    Caching is usually a nice thing to have, and I also use caching solutions quite a lot. I just don’t want to use it as a surrogate for the performance I can get right from my database. Caching is perfect, in my opinion, in highly concurrent systems to reduce the stress on your database, so that you can deliver the same user experience with a less expensive solution (be it smaller hardware on-premises or a lower performance level in the cloud).

    Keep in mind also that SQL Server offers an extremely good option for creating customized caching solutions using In-Memory tables, as Bwin, the online betting giant, demonstrated. So if you are already using a SQL Server version that supports this feature, it may be worth giving it a try. Otherwise, Redis and the like are just perfect.
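    A minimal sketch of such a cache table (the names are hypothetical; it assumes SQL Server 2016 or later and a database that already has a MEMORY_OPTIMIZED_DATA filegroup):

    -- Non-durable, memory-optimized table usable as a simple key/value cache
    CREATE TABLE dbo.SessionCache
    (
        CacheKey nvarchar(100) COLLATE Latin1_General_100_BIN2 NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
        Payload nvarchar(max) NOT NULL,
        ExpiresAt datetime2 NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);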

    The “User” table

    Yes, the table used in the sample has several design problems (both logical and physical), for example:

    • there is a functional dependency between City, State and Country
    • there is no Primary Key defined
    • data types are really not the best possible

    but I just wanted to show something that is as close to reality as possible, so it will be easier to relate the example to your own experience. In addition, I’ll talk about Keys and Indexes in other dedicated articles, so here I went for something that is just plain, simple and easy to understand.

    You really came this far?

    Well, thanks for reading. :)

  • Speaking at the SQL Server Seattle User Group in April

    I’m happy to share that I’ll be speaking at the next SQL Server Seattle User Group meeting on April 13th. The topic is again Temporal Tables, a subject I love a lot and that is 100% useful in all possible roles: dev, DBA or BI guy.

    Temporal Tables with SQL Server 2016, vNext and Azure SQL

    Keeping track of how data changed over time in a table has always been a difficult task. Both data insertion or modification and even querying are just more complex when you want the result that was returned at a specific point in time in the past. And it is even more complex when you’re not looking for a specific point in time, but for a period of time. Temporal database theory and temporal operators surely can come to the rescue, but they are not a matter for the faint of heart! Luckily one of the biggest (and most requested) features added to SQL Server 2016 solves exactly this problem, allowing the creation of change audit trails, data history and point-in-time queries in such a simple way that anyone, even on *current* applications, can benefit from it, simplifying solution architecture and saving time (and money) on maintenance and reporting.
    In this session we’ll see how the feature works on SQL Server 2016 and Azure SQL v12 and also what will be available in the vNext version of SQL Server.
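    For readers who have not seen the feature yet, here is a minimal sketch (the table is hypothetical) of a system-versioned temporal table and a point-in-time query:

    -- System-versioned table: SQL Server maintains the history automatically
    CREATE TABLE dbo.Product
    (
        Id int NOT NULL PRIMARY KEY CLUSTERED,
        Name nvarchar(100) NOT NULL,
        Price decimal(10,2) NOT NULL,
        ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

    -- How did the catalog look on April 1st?
    SELECT Id, Name, Price
    FROM dbo.Product
    FOR SYSTEM_TIME AS OF '2017-04-01T00:00:00'
    ORDER BY Name;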

    Additional details, location and free registration are here:

    https://www.eventbrite.com/e/temporal-tables-with-sql-server-2016-and-azure-sql-w-davide-mauri-tickets-33015877378

  • Speaking at SQL Saturday #613 in Redmond, WA

    On April 15, 2017 I’ll be speaking at SQL Saturday in Redmond, WA. I’ve settled down after the recent changes, so it’s time to start being an active part of the Microsoft Data Platform (formerly SQL Server) community again.

    I’ll be talking about monitoring SQL Server Integration Services and the amazing support for JSON in SQL Server 2016:

    SSIS Monitoring Deep Dive

    In this session we’ll see everything interesting that is hidden in the SSISDB database, where you can gain a lot of insight into the outcome, the performance and the status of your SSIS packages. I’ll share everything I’ve learned building the SSIS Dashboard we’re actually using in production and that you can test here: http://ssis-dashboard.azurewebsites.net/. We’ll see the internals of the SSISDB database, how we can add custom logging information and how we can use all this data in order to know exactly what happened at a specific point in time.

    JSON & SQL Server: using it effectively

    JSON is a de-facto standard for machine-to-machine communication but also for storing “liquid” data. Once a killer feature of NoSQL databases, with SQL Server 2016 it is now available also in a (post-)relational environment.
    But how does it fit into the relational database space? How can the usage of JSON simplify DBAs’ and DEVs’ everyday work? What about performance? And what about flexibility?
    In this session we’ll see how powerful JSON can be, when used correctly and not abused, providing solutions to common problems (like the well-known “dynamic schema” case) and simplifying development (for example, passing complex objects or a variable number of parameters to a Stored Procedure), with very good performance too, even if it may be hidden in not-so-obvious places.
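    As a small taste of the “variable number of parameters” case mentioned in the abstract (a hedged sketch; the procedure and table names are made up), a JSON array can be passed as a single nvarchar parameter and shredded with OPENJSON:

    -- Pass an arbitrary list of ids as a JSON array, e.g. N'[1, 5, 42]'
    CREATE PROCEDURE dbo.GetOrdersByIds
        @Ids nvarchar(max)
    AS
    BEGIN
        SELECT o.*
        FROM dbo.Orders AS o
        INNER JOIN OPENJSON(@Ids) WITH (Id int '$') AS j
            ON o.Id = j.Id;
    END;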

    The event is going to be an amazing one, as you can see from the agenda:

    SQLSaturday #613 - Redmond 2017

    So, if you live nearby, don’t miss it.

    See you there in a bit more than a month!

  • Apache Zeppelin 0.7 for SQL Server

    During the weekend I updated my SQL Server interpreter for the amazing Apache Zeppelin to be compatible with the latest version — 0.7 — which was released just a couple of days ago:

    Apache Zeppelin Release 0.7.0

    This new release has a lot of nice features (customizable charts, better multi-user support, improved tables and many others), and in order to start using the new features with SQL Server and SQL Azure too, the first step was to sync the existing interpreter with version 0.7. Well, that’s done.

    You can download the source code and build it yourself from here:

    GitHub - yorek/zeppelin: Apache Zeppelin with support for SQL Server

    or get the Docker image with everything already built, for maximum comfort:

    https://hub.docker.com/r/yorek/zeppelin-sqlserver/

    If you’re new to Apache Zeppelin, you can take a look at the articles I’ve already written to help people get started:

    Breaking Changes

    In order to take advantage of the automatic interpreter discovery and native syntax highlighting introduced in version 0.7, I had to rename the interpreter group and the interpreter name from tsql to sqlserver. This means that any binding to the tsql interpreter will be invalid which, in turn, means that all notebooks using it won’t work. Luckily, fixing this situation is easy: just create a new interpreter binding, using sqlserver as the interpreter group, and give it the same name you used before. Make sure the affected notebooks use this new binding (as the default one, if in your notebooks you didn’t use the %interpreter notation) and you’re done.

    Updating Docker Container to version 0.7

    If you were already running a Docker container with Apache Zeppelin for SQL Server, you may have notebooks and interpreter configurations that you want to keep.

    The process to update everything without losing any existing work is the following (just be sure to also read the “Breaking Changes” section above!):

    Get the updated image

    docker pull yorek/zeppelin-sqlserver:v0.7

    Once it’s downloaded, check whether you have any running containers

    docker ps

    If you see that an Apache Zeppelin container is running (it is named zeppelin if you followed my tutorials), stop it

    docker stop zeppelin

    Now create a new container pointing to the volumes of the previous version. The new container will automatically use the updated image

    docker run --name zeppelin2 -p 8080:8080 --volumes-from zeppelin -d yorek/zeppelin-sqlserver:v0.7

    List again all the existing containers (running and not running)

    docker ps -a

    Note that the old zeppelin container is still there and, if not needed, can be removed

    docker rm zeppelin

    The new container can now be renamed to the usual name

    docker rename zeppelin2 zeppelin

    The old image can also be removed to free disk space

    docker rmi yorek/zeppelin-sqlserver:v0.6.2

    Now check that only one zeppelin container, with the version 0.7, is available:

    docker ps

    Upgrade done! Now you can start Apache Zeppelin for SQL Server with the usual command:

    docker start zeppelin

    and you’ll have access to all your existing notebooks and interpreter bindings.

  • Azure Functions, Dapper and SQL Server JSON to store form data

    A common problem

    Recently we had to set up a couple of contact form pages to allow users to register for the App 2.0 Beta program (now RTM) or to tell us which of the products we launched at CES 2017 they are interested in.

    Such requests are quite frequent and, from a developer’s perspective, they are usually the worst thing someone can ask for. They are boring to architect, boring to write, boring to test. Still, business and marketing need and love them.

    So how do you satisfy this requirement in the fastest and easiest way, while still delivering high-quality, easy-to-maintain code and keeping an eye on integrating the data with the existing environment?

    Evaluating Azure DocumentDB

    Given that we are on Azure, one option was Azure DocumentDB. There is no need to define a schema upfront, and it can be used via plain REST calls, so the UX developer could do everything on his own, just using jQuery. Azure DocumentDB can also be queried using a SQL language, so extracting data from it wouldn’t have been a problem.

    But in the end, I decided to go another way. All our data, after swimming in a Data Lake, is saved in an Azure SQL database, where we heavily rely on its columnstore features. So having an additional database to integrate would have made the solution a bit more complex than the simplest possible one. The famous quote

    everything should be made as simple as possible, but no simpler

    is what drives all my architectural decisions, so I wasn’t really happy about going with Azure DocumentDB.

    Add the fact that there is no really good tooling around Azure DocumentDB yet, and I started to look for alternatives. The obvious alternative, aside from saving data into a blob, which was not an option since it would have been too simple (it doesn’t offer any querying capability), was to use Azure SQL.

    Moving to Azure SQL?

    With Azure SQL you get great tooling (SQL Server Management Studio and now also the online query editor), and we already have all the knowledge in house; but the fact that it doesn’t let you read and write data just via REST was, again, something that didn’t make me really happy.

    Aside from that, Azure SQL seemed to be the perfect option. JSON is now natively supported, so there is no problem storing data without a strictly enforced schema.

    Since we’re already using SQL Azure, we wouldn’t even have to spend any additional cent on it. The only problem to solve was that you can’t use Azure SQL directly via jQuery.
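
    As a minimal sketch of the idea (the table and column names here are just illustrative, not the ones we actually use), storing schema-less form data in Azure SQL boils down to something like this:

    -- raw form submissions, stored as JSON text
    CREATE TABLE dbo.FormSubmissions
    (
        Id INT IDENTITY(1,1) PRIMARY KEY,
        SubmittedOn DATETIME2 NOT NULL DEFAULT (SYSUTCDATETIME()),
        -- the form data as sent by the page; ISJSON guarantees it is well-formed JSON
        Payload NVARCHAR(MAX) NOT NULL CHECK (ISJSON(Payload) = 1)
    );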

    Serverless is the key

    The missing link, the REST interface, can easily be created using Azure Functions and a micro-ORM like Dapper. Thanks to the serverless nature of Azure Functions, all that’s needed is a few lines of code to get the HTTP request body that contains the contact form data as JSON and store it into the SQL Azure database.

    The Azure Function gets called each time an HTTP request comes in, using an HTTP Trigger. Here is the function.json file that defines the function bindings:

    {
        "bindings": [
            {
                "authLevel": "function",
                "name": "req",
                "type": "httpTrigger",
                "direction": "in"
            },
            {
                "name": "res",
                "type": "http",
                "direction": "out"
            }
        ],
        "disabled": false
    }

    and the function full code is here:

    https://gist.github.com/yorek/89fb1afecc34db3f9419ceb3ae9664d9#file-azure-functions-save-form-data-to-sql-azure-csx

    Such an additional layer is also a welcome addition since it allows you to inject some additional validation checks and business logic if needed.

    I’ve used a Stored Procedure here for better decoupling, and it does just a simple insert, with some additional checks.

    https://gist.github.com/yorek/2877b555452c5d5a155a8f50dbfd9bf7#file-azure-functions-save-form-data-to-sql-azure-sql

    It also turns some JSON data into columns, to make querying even easier.
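
    The real procedure is the one in the gist above; just to sketch the general pattern (the procedure, table and JSON property names here are made up): store the raw JSON, and surface selected properties as columns with JSON_VALUE to make querying easier.

    CREATE PROCEDURE dbo.SaveFormSubmission
        @payload NVARCHAR(MAX)
    AS
    BEGIN
        SET NOCOUNT ON;

        -- reject anything that is not well-formed JSON
        IF ISJSON(@payload) <> 1
            THROW 50000, 'Payload is not valid JSON.', 1;

        INSERT INTO dbo.FormSubmissions (Payload)
        VALUES (@payload);
    END;
    GO

    -- selected JSON properties can then be surfaced as regular columns for easier querying
    SELECT
        Id,
        SubmittedOn,
        JSON_VALUE(Payload, '$.email') AS Email,
        JSON_VALUE(Payload, '$.product') AS Product
    FROM dbo.FormSubmissions;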

    Dapper helps to make interacting with SQL Azure a breeze (two, yes, two lines of code to open the connection and write to SQL Azure), so all simplicity requirements are finally met. No more impedance mismatch. No more.

    Given the existing external factors (resources, knowledge, integration), the simplest possible solution, but no simpler, has been achieved.

    Without those existing factors I would probably have chosen Azure DocumentDB. But on second thought, I still have the gut feeling that SQL Azure would have been better in the long term (for example, I can connect the Marketing Manager’s Excel directly to SQL Azure, something that I cannot do yet with DocumentDB)… so maybe SQL Azure would be my choice once again. After all, software is made to serve people, and this should drive at least 50% (if not more) of the architectural choices, IMHO.

    In conclusion

    Since I joined Sensoria I’ve been working on the cloud 100% of my time. And the more I work on the cloud, the clearer it is to me that serverless and simplicity (which, implementation-wise, means microservices) are the way to go. Efficiency increases so much this way that it’s really worth the effort, and embracing PaaS clearly becomes the obvious choice.

  • Creating your first Dashboard with Apache Zeppelin and SQL Server/Azure SQL

    Now that Apache Zeppelin has been downloaded and configured, it’s time to create a dashboard. It will be very easy: all you have to do is figure out which data you want to show, write the corresponding T-SQL query and then add some charts and information to make it perfect.

    To create the first Apache Zeppelin dashboard, let’s use the new Wide World Imports sample database from Microsoft:

    Microsoft/sql-server-samples

    Take the “Standard” version so that you can also use it on a non-premium Azure SQL instance, if you want to try it on Azure.

    Once you have restored the WideWorldImportersStandard database, run Apache Zeppelin 0.6.2 for SQL Server docker container:

    docker run --name zeppelin -p 8080:8080 -v /zeppelin-sqlserver/notebook -v /zeppelin-sqlserver/conf -d yorek/zeppelin-sqlserver:v0.6.2

    make sure you have configured an interpreter (call it “Azure”, for example) with the connection information pointing to WideWorldImportersStandard, and then create a new note.

    And now it’s just a matter of writing the queries (like the one sketched below) and turning the results into charts. As usual, I’ve recorded a quick video (it’s just six-something minutes) to show how easily you can do that. (I’ve also started from configuring the interpreter, so that you can review that part too.)
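
    Just to give an idea of the kind of query that sits behind a chart, here is an illustrative example against WideWorldImporters (not necessarily one of the queries used in the video), returning total sales per month:

    SELECT
        FORMAT(o.OrderDate, 'yyyy-MM') AS OrderMonth,
        SUM(ol.Quantity * ol.UnitPrice) AS TotalSales
    FROM Sales.Orders AS o
    INNER JOIN Sales.OrderLines AS ol
        ON ol.OrderID = o.OrderID
    GROUP BY FORMAT(o.OrderDate, 'yyyy-MM')
    ORDER BY OrderMonth;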

    Enjoy:

    https://vimeo.com/198582184

    In the video I also show how the markdown interpreter can be used to add information to the dashboard.

    The sample dashboard, which also includes the queries, can be downloaded here:

    My First Dashboard.json

    If you prefer to download only the queries and then DIY, here’s a SQL file with all the used queries:

    My First Dashboard.sql

    I really recommend you start using Apache Zeppelin if you haven’t done so yet. It’s incredibly useful, even just for DBAs to monitor SQL Server status. I’ll talk about this in a forthcoming post. Stay tuned!
