
Davide Mauri

A place for my thoughts and experiences with the Microsoft Data Platform

  • What's new in Entity Framework Core 2.0?

    In just a couple of weeks the PASS Application Development Virtual Group will host a new webinar, this time with Chris Woodruff, who will talk about what's new in EF Core 2.0:

    What's new in Entity Framework Core 2.0?

    Entity Framework (EF) Core is the lightweight, extensible, and cross-platform version of Entity Framework, the popular Object/Relational Mapping (O/RM) framework for .NET. This talk will give a quick look at the new features for .NET Core 2.0. We will look at the new release, how to install it and show sample projects covering the following topics: Improved LINQ Translation, Owned Entities and Table Splitting, Global Query Filters, DbContext Pooling, String interpolation in raw SQL methods, Explicitly compiled queries, Self-contained entity configurations in code first

    As usual, attending is free; just register here:

    http://appdev.pass.org/

    The session will be recorded and made available on the group's YouTube channel (http://bit.ly/appdevyoutube) within a few days after the live broadcast.

  • Working with SQL Server for Linux Cross-Platform

    Next week, on August 24th, a new webinar from the PASS Application Development Virtual Group will be aired. This time the topic is the cross-platform support introduced with the forthcoming SQL Server 2017. Maximo Trinidad will deliver the following session:

    Working with SQL Server for Linux Cross-Platform

    You're welcome to explore the latest build of SQL Server for Linux. Connect to SQL Server looking into the tools available for both Linux and Windows. Maximo will provide samples on querying for database information using Python and PowerShell between two environments. This will be a demo intensive session you will not want to miss!

    As usual, registration is free, via the following link:

    http://bit.ly/appdevreg

    The session will be recorded and made available on the group's YouTube channel (http://bit.ly/appdevyoutube) within a few days after the live broadcast.

    Don't miss the webinar if you're thinking about, planning, or already using SQL Server in a multi-platform environment!

  • Azure SQL: work with JSON files where they are

    Dealing with CSV or JSON data is more and more common these days. I do it on a daily basis, since the data our application sends to our microservice gateway backend is in (compressed) JSON format.

    Sometimes, especially when debugging or developing a new feature, I need to access that JSON data before it is sent on to other microservices for processing or, after that, stored in the database.

    So far I have usually used CloudBerry Explorer to locate and download the JSON I was interested in, and then a tool like Notepad++ with the JSON-Viewer plugin, or Visual Studio Code, to load and analyze the content.

    Since Azure SQL is our main database, I spend a lot of time working with T-SQL, so I would really love to be able to query JSON directly from T-SQL, without even having to download the file from the Azure Blob Storage account where it is stored. That would make my work easier and more efficient.

    I would love to access JSON where it is, just like Hadoop or Azure Data Lake allow you to do.

    Well, you can. I just found out that with the latest additions (introduced in SQL Server 2017 CTP 1.1 and already available on Azure SQL v 12.0.2000.8) it is incredibly easy.

    First of all, a Shared Access Signature needs to be created to allow Azure SQL to access the Azure Blob Storage account where you keep your JSON files. This can be done using the Azure Portal, from the Azure Storage blade:

    SAS Signature Creation Window

    or you can also do it via the Azure CLI 2.0 as described here:

    Azure CLI 2.0: Generate SAS Token for Blob in Azure Storage

    Once you have the signature, a Database Scoped Credential that points to it needs to be created too.
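
    A minimal sketch of the command, with a placeholder credential name and SAS token (paste your own token, without the leading '?'):

    -- Minimal sketch: credential name and SAS token are placeholders
    CREATE DATABASE SCOPED CREDENTIAL AzureBlobCredential
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
    SECRET = 'sv=2017-04-17&ss=b&srt=co&sp=rl&sig=...';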

    If you haven't done this before, you will be warned that you need to create a Database Master Key before being able to run the above command.
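
    If so, creating the master key is a one-liner (the password below is, of course, a placeholder):

    -- Needed only once per database, if a master key does not exist yet
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<use-a-strong-password-here>';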

    Once the credential is created, it's time to point to the Azure Blob Storage account where your JSON files are stored by creating an External Data Source.
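
    Again a minimal sketch, with placeholder storage account and container names; TYPE = BLOB_STORAGE is the option used for bulk access to Azure Blob Storage:

    CREATE EXTERNAL DATA SOURCE AzureBlobStore
    WITH (
        TYPE = BLOB_STORAGE,
        LOCATION = 'https://<myaccount>.blob.core.windows.net/<mycontainer>',
        CREDENTIAL = AzureBlobCredential
    );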

    Once this is done, you can start to play with the JSON files using OPENROWSET along with OPENJSON.
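
    Something along these lines, where the blob path is hypothetical; SINGLE_CLOB returns the whole file as a single BulkColumn value that OPENJSON can then shred:

    SELECT J.[key], J.[value]
    FROM OPENROWSET(
            BULK 'sessions/session-001.json',   -- hypothetical path inside the container
            DATA_SOURCE = 'AzureBlobStore',
            SINGLE_CLOB) AS T
    CROSS APPLY OPENJSON(T.BulkColumn) AS J;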

    And voilà, the JSON content is at your fingertips. For example, I can access all the activity data contained in our "running session" JSON.
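
    The exact document structure is specific to our application, but the query looks more or less like this sketch (the '$.activities' path and the column names are made up for illustration):

    SELECT A.*
    FROM OPENROWSET(
            BULK 'sessions/session-001.json',
            DATA_SOURCE = 'AzureBlobStore',
            SINGLE_CLOB) AS T
    CROSS APPLY OPENJSON(T.BulkColumn, '$.activities')
        WITH (
            ActivityType nvarchar(50) '$.type',
            StartTime    datetime2    '$.startTime',
            HeartRate    int          '$.heartRate'
        ) AS A;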

    This is just amazing: now my work is much simpler, especially when I'm traveling and maybe don't have good internet access. I can process and work on my JSON files without them ever leaving the cloud.

    What about CSV?

    If you have a CSV file the technique is very similar, and it is already documented in the official Microsoft documentation:

    Examples of Bulk Access to Data in Azure Blob Storage

    What about on-premises?

    The same approach also works with SQL Server 2017 (now in CTP 2.1). You can access files not stored in the cloud but on your on-premises storage. In that case, of course, you don't specify a Shared Access Signature as the authentication method, since SQL Server will just rely on Windows authentication. Jovan showed a sample usage of this.
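
    The shape of such a query is roughly the following (the local path is hypothetical; access is governed by Windows authentication):

    SELECT J.[key], J.[value]
    FROM OPENROWSET(
            BULK 'C:\Data\session-001.json',    -- hypothetical on-premises path
            SINGLE_CLOB) AS T
    CROSS APPLY OPENJSON(T.BulkColumn) AS J;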

    Is the code available?

    Sure, there is a Gist for that:

    https://gist.github.com/yorek/59074a4c4176191687d6a17dabb426ed

    • Speaking @ PASS Summit 2017

      After years and years of talking about Business Intelligence and Business Analytics, this year at PASS I'll be speaking about Application Development and how to make the best use of the data platform Microsoft has built so far.

      I started talking about the impedance mismatch at the beginning of my career and now, after 15 years of Business Intelligence, Big Data, Analytics and Data Warehousing, I'm back talking about it again, but this time there is an amazing new approach to the problem: microORMs.

      MicroORMs are ORMs that take care of only one thing: mapping database result sets to classes, and nothing more. I love the idea because it allows developers and DBAs to leverage all the power of the RDBMS while, at the same time, removing all the boring and error-prone plumbing code. As Van Halen said, it is just the "Best of Both Worlds".

      As a result, microORMs are, in my opinion, just perfect in a microservices scenario. They make things simple, but not simpler, while preserving performance and making developers more productive (given that they must know how to query a database, something that is mandatory IMHO).

      More specifically, I'll be talking about Dapper.NET, a microORM I've learned to love over the last few years and that I use every day:

      Dapper: the microORM that will change your life

      ORM or Stored Procedures? Code First or Database First? Ad-Hoc Queries? Impedance Mismatch? If you're a developer, or you are a DBA working with developers, you have heard all these terms at least once in your life… and usually in the middle of a strong discussion, debating one or the other. Well, thanks to StackOverflow's Dapper, all these fights are finished. Dapper is a blazing fast microORM that allows developers to map SQL queries to classes automatically, leaving (and encouraging) the usage of stored procedures, parameterized statements and all the good stuff that SQL Server offers (JSON and TVPs are supported too!). In this session I'll show how to use Dapper in your projects, from the very basics to some more complex usages that will help you create *really fast* applications without the burden of huge and complex ORMs. The days of Impedance Mismatch are finally over!

      See you at PASS 2017!

    • T-SQL Window Functions

      I'll soon post something on this, but let me tell you right here: if you don't want AI to take your DEV/DBA/BI job, you *need* to be smarter than AI (which is not AI at all right now, but let's keep on playing the marketing game). One way to become smarter is to - guess! - use your brain and train it to think out of the box, practice lateral thinking and, more in general, do all the things that brute force and Machine Learning cannot do.

      Of course, when you do that, you need to be supported by a language that allows you to exploit all that potential. SQL is one of those nice languages. So don't miss Itzik Ben-Gan's session on T-SQL Window Functions on July 13th for the Application Development Virtual Group:

      T-SQL Window Functions

      Itzik Ben-Gan

      T-SQL window functions allow you to perform data analysis calculations like aggregates, ranking, offset and more. When compared with alternative tools to achieve similar tasks like grouping, joining, using subqueries, window functions have several interesting advantages that allow you to solve your tasks often more elegantly and more efficiently. Furthermore, as it turns out, window functions can be used to solve a wide variety of T-SQL querying tasks well beyond their original intended use case, which is data analysis. This session introduces window functions and their evolution from SQL Server 2005 to SQL Server 2017, explains how they get optimized, and shows practical use cases.

      Registration via the usual link:

      http://appdev.pass.org/

      Don't miss it, whether you're into SQL Server or not. It will be useful to everyone who works with databases that support the ANSI SQL:2003 standard and later. That means SQL Server and Azure SQL, of course, but also Hive, Spark SQL, Postgres, MySQL (nope, sorry) and MariaDB.

    • Introduction to Azure Cosmos DB: Microsoft’s Globally Distributed, Multi-Model Database Service

      It’s time for something NoSQL at the PASS Application Development Virtual Group. Next week Denny Lee will present on Cosmos DB, the latest evolution of DocumentDB:

      Introduction to Azure Cosmos DB: Microsoft’s Globally Distributed, Multi-Model Database Service

      At Build we announced Azure Cosmos DB, the industry’s first globally-distributed service that enables you to elastically scale throughput and storage across any number of geographical regions while guaranteeing low latency, high availability and consistency – all backed by industry leading, comprehensive SLAs. Azure Cosmos DB allows you to use key-value, graph, and document data in one service, at global scale and without worrying about schema or index management. Join this upcoming webinar to learn more about Azure Cosmos DB, what it offers developers and how it can help you quickly build your modern applications.

      I'm looking forward to it: keeping up to date with all the database technologies available in the cloud is a mandatory task for today's architects and developers, so don't miss it. Registration, as usual, is free here:

      http://appdev.pass.org/

    • Graph data processing with SQL Server 2017 and Azure SQL DB

      As soon as I saw that SQL Server 2017 supports graph database concepts (queries and modeling), I started to play with it and wanted to know more. So I took the chance to ask the program managers working on that feature to come and speak at the PASS Application Development Virtual Group, and they kindly accepted. :)

      So, if you're also interested in the topic (and you should be!), point your browser here and register; the meeting is set for June 8th, online:

      Graph data processing with SQL Server 2017 and Azure SQL DB

      Shreya Verma & Arvind Shyamsundar

      Microsoft has introduced graph database processing to SQL Server 2017, which will help users track many-to-many relationships between entities, to gather powerful insights from different pieces of connected data. This is well suited for applications where users need to extract information from different pieces of connected data, for example IoT, fraud detection, recommendation engines, predictive analysis, etc. In this session, we will learn about the new features and demonstrate them by going over some use cases.

    • From ElasticSearch back to SQL Server

      Sharing with the community why we decided to move back from ElasticSearch to SQL Server

      Sometimes it happens: you move from something that is considered cool and bleeding edge back to something that has dust on it.

      Well, at least this is how SQL Server has been perceived lately by the majority of developers and software architects. Microsoft must have heard this sort of disappointment with its RDBMS and decided to recover the lost time and try to make it cool again.

      If you still think that SQL Server (and its twin, Azure SQL) is the "old" RDBMS with all its "rigid" limitations, my advice, as a Software and Data Architect, is to go back and check it out again.

      This is what we did at Sensoria over the last few months, and we have now moved everything from ElasticSearch back to SQL Server.

      I think that what we discovered and the experience we gained are worth sharing: I believe a lot of companies will find themselves (or already are) in the same situation. Maybe it won't be ElasticSearch but another product, but the story will be the same. Here it goes.

      The Story So Far

      Something like five years ago, if you were looking for a highly scalable database solution that could also provide a lot of flexibility in terms of the objects stored, and could also offer a good analytical solution, one of the options was ElasticSearch. It can ingest and index JSON documents using a schema-on-read approach (instead of the schema-on-write typical of RDBMSs), it allows for really easy scale-out, it offers a very nice monitoring tool (based on Kibana), it uses REST as its communication protocol so it is compatible with everything (which makes it perfect for the web), and, last but not least, it is based on the well-known Lucene technology, which is solid and mature.

      So ElasticSearch it was. It did great for several years; millions and millions of documents have been stored in it, coming both from end users using our products and from internal machine learning algorithms that crunch numbers and try to turn all the raw data into something more useful for the athlete, making sense of the data we gather from garment sensors.

      Requirements and possible options

      When I joined Sensoria, we were in the middle of an architectural revision. Of course, one of the systems being reviewed was the database. We had to decide whether to stay with ElasticSearch or move to something different. I was in charge of this decision, given my strong data architect background, and I could have chosen any database. We could have stayed on ElasticSearch or switched.

      ElasticSearch has lately clearly stated that its focus is on search and analytics, and that ruled it out immediately, since we need a database in which to store all our data in the first place.

      We started to evaluate the options. One of our main objectives is to go 100% PaaS so, being on Azure, one of the first options was DocumentDB (now Cosmos DB).

      Besides storing and querying JSON, we also looked for something that would allow us to quickly aggregate data using any possible combination of filters and aggregation logic. For example, someone may want to see heart rate aggregated on a weekly basis for the last two years, while someone else may want to get pace aggregated by day over the last month. This must be done in real time while, at the same time, new sessions are loaded and processed.

      We need to ingest data coming from our IoT devices, apps and SDKs, but we also need to store "classic" data like user accounts, preferences, device info and so on.

      Architectural considerations

      Of course there is no one-size-fits-all solution. Or, better, there is no such solution at a decent cost.

      We're also implementing a Lambda architecture, and the final data store that is used to serve data to the end user only contains "squeezed" data, where the schema is actually well known; otherwise developers wouldn't know how to interact with the stored data in order to efficiently generate statistics and charts. In the other stages, data is stored in other, more suitable places. For example, we'll use a streaming query engine to query data flowing into the system in real time, and we'll be using a data lake to store raw IoT data.

      Such an architecture allows us to focus, at least for now, only on the master database (I like to call it an online transactional data warehouse, since what we want to do is basically real-time analytics), so the search scope is now narrowed down and well defined.

      Technical decision?

      The truth is that almost all the enterprise databases we evaluated, relational or not, provide, albeit with different capability levels, the features we were looking for in our master database.

      So the decision was not only a technical one, but more a strategic one. Which database, in our context, would give us the richest natively supported feature set, the lowest amount of maintenance and the simplest platform, so that our developers could use it efficiently? All at competitive costs, of course.

      SQL Azure

      As is obvious from the title of this article, the final decision was to use Azure SQL. You may think my choice is biased, since I'm a Microsoft Data Platform MVP and my background is mostly SQL Server. "If all you have is a hammer, everything looks like a nail", right? Sure, but while that might have applied to SQL Server several years ago, the engine has improved a lot in the latest versions, both in terms of raw performance and of features offered. Here are the features that motivated my choice:

      • Columnar storage support: perfect for real-time aggregations
      • Columnar and Row storage at the same time: optimum to perform aggregate and singleton queries
      • In-Memory (Lock-Free structures): great to support high-concurrency workloads
      • Partitions: perfect for distributing workload
      • Temporal Tables: great to simplify development
      • JSON support, with the ability to index JSON values (via non-clustered indexes on computed columns and full-text indexes): perfect to support the migration from ElasticSearch and to make developers' lives easier (see the sketch after this list)
      • Row-Level Security: optimum for securing data at the source
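
      As an illustration of the JSON bullet above, this is the kind of pattern documented for indexing a JSON property in SQL Server / Azure SQL: a computed column over JSON_VALUE plus a regular non-clustered index. The table and property names here are hypothetical:

      -- Hypothetical table: the JSON payload is validated with ISJSON and
      -- a property is exposed as a computed column so it can be indexed
      CREATE TABLE dbo.TrainingSessions
      (
          Id int NOT NULL PRIMARY KEY,
          Payload nvarchar(max) NOT NULL CHECK (ISJSON(Payload) = 1),
          ActivityType AS JSON_VALUE(Payload, '$.activityType')
      );

      CREATE NONCLUSTERED INDEX IX_TrainingSessions_ActivityType
          ON dbo.TrainingSessions (ActivityType);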

      And these are just the features that are available now. In the near future Azure SQL will also provide native support for graph queries, all within the same engine.

      So technically it's great, but that alone was not sufficient to justify the choice. Costs played a huge role in the final decision. Initial tests showed that we could handle the load managed by several ElasticSearch servers with just an entry-level Azure SQL Premium database, thanks to the columnstore indexes. Costs showed a reduction ratio of 3:1, which was immediately loved by the business guys.

      Integration is a key factor

      For me, having all the aforementioned features in just one product is really a big thing. Of course, SQL Server / Azure SQL may not excel at all of them, so you don't get the best of the best, but, except for rare edge cases, I prefer to have one tool that provides a hundred features at a near-perfect level rather than a hundred tools, each doing just one thing perfectly.

      The fact that everything is integrated means that we don’t have to move data around, manually figuring out how to integrate different data stores while keeping logical consistency and integrity of data.

      Simplicity is a big plus. Making our solution simpler, much simpler, means that we have lower development and maintenance costs.

      Having to learn only one query language is also a big plus. One of the downsides of using different databases is that NoSQL databases tend to have their own query languages, which makes everything a little more challenging: developers need to master n languages, which makes them far less productive, and bugs and performance problems are just more frequent.

      Mastering a database well is really important because it keeps developers from going on a caching spree. Caching is a fantastic technology if used correctly (we use the well-known Redis database for that). If used just to hide performance problems, it will quickly turn your solution into something worse than the worst spaghetti code you have ever seen, with concurrency management and stale-data issues added to the problem. Hell is a nice place in comparison.

      So far, for us, the feature set, the integration of everything, the SQL language and the existence of very good tools (this is another really important point to keep in mind: good tools allow productivity to increase a lot) were pointing us to Azure SQL. But what about scalability? Will it scale enough for us?

      Scalability is in the fabric

      One of the arguments that was true when SQL Server only offered an on-premises option is the scalability factor. What happens if we need to scale out to handle a lot more data and concurrent access than we have right now?

      Besides the fact that our architecture uses a queue to handle workload spikes, especially for data ingestion, Azure offers the solution. Azure SQL allows a database to scale up to 4 TB of data and, if that should not be enough, Elastic Pools will help even more. We comfortably fit in that space for now (we're not Twitter or Facebook just yet…), so we're fine within that limit (also because all the raw data stays in a data lake, and that is where things usually get huge).

      Being a PaaS solution also means that we no longer have to manage clusters, nodes, updates, reboots and all the well-known things that you normally have to deal with when you maintain your own servers.

      Python and R are also in the mix

      A lot of our machine learning and advanced analytics code is based on R and Python. With the latest release of SQL Server, besides native support for R, Python has also been added as a language that can be used in-database to crunch data. This means that we can reuse all our assets based on scikit-learn. While this is not yet available on Azure SQL, I expect it to come soon, as also mentioned at the //build 2017 event.

      Refactoring Code

      One key factor that allowed us to use Azure SQL is the native JSON support it now finally has. While it is still a v1 feature, it is enough for us to implement everything we needed to replace ElasticSearch. I'll write a more technical post on the subject in the future, so stay tuned if you're interested in the implementation details but, as you can imagine, the trick was to fool the code into feeling like it was still talking to ElasticSearch, so overall the changes to our code have been minimal.

      Conclusions

      ElasticSearch does offer a lot of nice features but, on Azure, it is not offered as PaaS. And since it is mostly focused on analytics and search, if we had chosen it we would still have needed a database to store OLTP data. Cosmos DB is PaaS, offers a lot of flexibility and also supports SQL, but… it's another database, and it would have required us to create an additional integration solution to keep the existing Azure SQL data and the newly created Cosmos DB data in sync. We also have a fairly well-defined schema, so Azure SQL was the perfect choice for us in terms of the ratio between costs, performance and features. It provides a very good range of capabilities, from OLTP to (real-time) OLAP, which is great for us, at least until we reach the 4 TB limit. Just like my beloved Python language, I like to think of Azure SQL as a "batteries included" solution.

      What I'd like to underline here is that we're now living in an era where technology is usually not the limiting factor; humans, with our limited knowledge, are. So what really matters now is the balance between costs, features and the simplicity of the overall solution. Technology should help us create solutions that are as simple as possible (both for developers and users), but no simpler.

      And this clarifies the role of Solution/Data/Software Architects especially well: find the technical solution with the best benefit/cost ratio, now and in the near future, and make the solution modular so that in the more distant future we only have to change what needs to be changed in order to keep up with new challenges.

      Luckily we live in an era where the cloud is making it possible. We just have to evaluate and re-evaluate technologies without any preconceptions.

    • PASS AppDev next webinar: “Introducing R”

      On May 25th the PASS Application Development Virtual Group will host a new webinar delivered by the well-known Dejan Sarka; the topic will be R (https://www.r-project.org/), which is used more and more by developers too:

      Introducing R

      R is a free software programming language and software environment for statistical computing, data mining, and graphics. R is extensively supported in SQL Server 2016 suite, in Power BI, and in Azure ML. In order to use it, you have to learn the language. After attending this session, you should be able to write simple R code and use it in the products and services mentioned.

      As usual, the webinar is free, and it will be accessible here:

      https://attendee.gotowebinar.com/register/8547344291727590146

    • PASS AppDev Recording: “Building REST API with SQL Server using JSON functions”

      Last week Jovan Popovic (SQL Server Program Manager for JSON support and other nice things) delivered a very interesting webinar on JSON and SQL Server / Azure SQL for the PASS Application Development Virtual Group. If you're interested in the subject, here are a couple of links that will make you happy:

      Full recording is on PASS Development Virtual Group YouTube channel:

      https://www.youtube.com/watch?v=0m6GXF3-5WI&feature=youtu.be

      Slides are available on Slideshare:

      https://www.slideshare.net/JovanPopovic/building-rest-api-using-sql-server-and-json

      while demos are on GitHub:

      https://github.com/Microsoft/sql-server-samples/tree/master/samples/features/json/product-catalog

      Jovan showed not only how this feature can help in .NET development, but also how it makes it so easy to use SQL Server from any other language, be it JavaScript or Node.js.

      I've been using JSON with SQL Server for a while now, and I can really say it's a game changer for developers. If you haven't played with it yet, you definitely have to give it a try: managing a dynamic schema has never been so simple.

    • Data Juice

      The data ecosystem explained in 10 simple images

      A couple of weeks ago I was explaining how IoT and Big Data can be game changers for sports and human activities. Yeah, it's cool to put a sensor in your grill so that it can automatically tell you the exact time you should flip your filet, but I think it's even cooler to instrument the human body and objectively quantify human activities, especially in sports and health care, so that we can enjoy the sports we love and the life we live even more. After all, knowledge is power, right? And the first step to knowledge is the ability to measure: IoT is really a big deal here.

      But then what? From a business perspective, how and where will all the gathered data be stored? What do terms like Data Lake, Data Science and the like really mean? Surprisingly enough, the answer is as simple as describing the life cycle of something everyone knows very well: orange juice.

      I created a slide deck, named Data Juice, to explain the whole data ecosystem to C-level people who had to understand the business value of every piece of that ecosystem, but without having to deal with technical details. It had to be clear, concise and easy to remember. The juice analogy was just perfect, as you'll see.

      I created a slide deck to explain the whole data ecosystem: it had to be clear, concise and easy to remember.

      Now, the same concept applies to IoT: the sensor working right now in the wearable you're wearing almost surely collects raw data (acceleration, magnetic field, etc.) at something around 50 times per second (sometimes up to 100). But what you care about is not that raw value; it's the trend of your preferred exercise performance.

      What's the line that connects those two points? Happily for me, answering this question also allows me to describe what kind of work I do and how it is related to sports, which was one of the questions I was asked recently by a well-known tech & lifestyle magazine.

      Here’s the answer:

      Data Sources

      New data is born every day.

      New data is generated every day, everywhere. Accounting systems, IoT devices, machine logs, clickstreams… everything generates some kind of data. At this stage, data is usually generated and consumed locally and in very small quantities. For example, an accounting system generates an invoice and allows users to interact with it, one at a time. Just like oranges born on the tree: you can just grab one and consume it on the spot.

      Big Data

      Oranges are harvested and sent to a factory for processing. The factory needs to be prepared to receive oranges of any size, kind, color and ripeness. Sometimes it will get only one load a day, sometimes several per hour. What happens with data is really similar. The famous three Vs, Volume, Variety and Velocity, say that your data system must be able to accept almost any kind of data. Again, just like with oranges: if you want to be in the business, you'd better be prepared to handle them when they arrive.

      Data Lake

      All the ingested data needs to be stored somewhere. And it had better be a place where space can grow as needed, and it should also be pretty cheap if you don't want to spend all your money storing everything, even if you don't know whether and how you will use that data in the future. Such a place also needs to offer querying and processing capabilities that allow you to dive into the lake. You don't really care too much about shape, format or correctness here. You just store the data as it was generated, just to have it in case you need it, even if it may not be good for consumption.

      You just store the data as it was generated, just to have it in case you need it, even if it may not be good for consumption.

      Data Curation

      Once also known as ETL, Extract-Transform-Load, this is the process where data is selected, cleansed, standardized and integrated with additional information, if available. More or less how oranges are cleaned, divided into different groups for different uses and, if not in good condition, discarded.

      Data Warehouse

      Once data has been processed by the previous stage, it needs to be stored in a safe place. Safe not only in the sense that no one can steal from it, but also in the sense that everyone who gets data from it can assume it is safe to consume. Again, just like oranges: if you get one from a market warehouse, you can safely enough assume that you won't be poisoned. The same goes for data. If you get data from the Data Warehouse, you can safely assume that it is safe to consume, meaning that you can use it to make informed decisions. The same does not apply to a Data Lake, where usually you don't really know what you'll get or how to deal with it.

      If you get data from the Data Warehouse, you can safely assume that it is safe to consume, meaning that you can use it to make informed decisions.

      Data Mart

      Once you get your oranges from the warehouse, you can decide how to consume them. If you have hundreds of boxes, you won't consume them one by one. You'll surely try to transform them into an easier-to-consume product. Depending on the target, you may want to process them in different ways, to make sure that everyone gets exactly what they need, no more and no less. A Data Mart is exactly this: data ready to be consumed with very little additional processing, if any at all.

      Of course, different people or different situations may require different aspects of that data, so having more than one Data Mart is quite common. If something is not clear enough in the Data Mart, one can always go into the Data Warehouse that lies behind it and check how and why such unexpected data was produced.

      Self-Service Analysis

      Also known as Self-Service BI, this is for when you're not happy with the Data Marts: since none of them provides exactly what you need, you have to do it yourself. You need to squeeze the juice of three different oranges, of three different kinds, to create the perfect mixture you want. You go into the Data Warehouse (usually, but sometimes you can also grab data from different Data Marts) and create your mix, maybe adding a hint of cinnamon taken from the outside world.

      Business Intelligence

      Using all the things described so far, you apply the Business Intelligence process to figure out how to drive your business, check how it is performing and analyze the outcomes of the strategies you have defined, in order to stay in business and, possibly, be successful.

      Data Science

      What should you do with the vast amount of unused data in the Data Lake? Maybe among the oranges that do not fit the current sales plan and targets there is the next "big thing": something you can leverage to create a new product, sell a new service, or something that can help you understand whether anything in the collect-store-process-enrich life cycle can be improved. Usually this means that you are looking for questions, not for answers. This is the main difference between the process of doing Business Intelligence and the process of Data Science: in the first you look for answers to questions you know, in the second you look for questions you don't know yet.

      This is the main difference between the process of doing Business Intelligence and the process of Data Science: in the first you look for answers to questions you know, in the second you look for questions you don’t know yet.

      Data Juice

      In conclusion, this is what information technology, from a data standpoint, is all about: squeezing every drop of information out of all the data we have, so that we can gain knowledge that ultimately leads to wisdom.

      We squeeze every drop of information out of all the data we have, so that we can gain knowledge that ultimately leads to wisdom.

      So now, when someone asks me what kind of work I do, I say that "I squeeze information out of raw data". Easy, understandable and, actually, really precise.

    • PASS Application Development Virtual Group

      I've been involved with PASS, once named the Professional Association of SQL Server users, now just... PASS, since 2006, when I stepped up as president of the Italian SQL Server User Group. Now that I've moved to Redmond, WA, I'm still involved with it, but of course I'm not leading the Italian User Group anymore.

      Since I really do think that community activity is vital for a healthy and successful career, as soon as I noticed that the Application Development Virtual Group was not very active anymore, I decided to offer my help. So here I am: I've just stepped up as the new leader of that group, AppDev to friends:

      http://appdev.pass.org/

      The focus will be on the entire Microsoft Data Platform, so not only SQL Server, and the languages we'll try to cover are, of course, C# and .NET in general and also, more broadly, any language that can interact with the Microsoft Data Platform. So Python (my passion!), Java, Ruby, F#, Rust, R... anything you can think of. Of course, everything is approached from a developer's perspective. No administration here, no sir. Security? Yes. Performance? Sure, give me more! Best practices? Absolutely! Everything that can be of interest to developers and architects will be in our scope (given that it is somehow related to the Microsoft Data Platform, of course).

      I've just started contacting some great speakers to get off on the right foot, and the first has just confirmed his support. Jovan Popovic, SQL Server / Azure SQL PM, will present a session on JSON, one of the most requested and powerful features in SQL Server 2016 and beyond:

      Building REST API with SQL Server using JSON functions

      Registration is free, both for the webinar and for the Virtual Group itself. You can find the registration link and additional info on the group's homepage:

      http://appdev.pass.org/

      I recommend registering for both, so you won't miss a very interesting webcast (which will be recorded and made available on YouTube for future reference) and you'll also stay informed about future activities, which will be, I can assure you, just amazing!

      And, last but not least, if you have a topic you want to present, just let me know.

    • For the Better Developer: Notes

      I want to be clear about the target and the scope of the “For the Better Developer” series.

      The articles don't aim to be 100% precise; the aim is to give you, the developer, who understands how important the ability to manipulate data is in today's work, a good knowledge of how to use some features, not of how they really work behind the scenes. I hope that you will be interested in learning how things work, and the "for those who want to know more" section you can find at the end is there with that specific purpose in mind. But, again, it doesn't really make sense to dive deeply into how things work, since there are a lot of extremely good books, articles, documentation and white papers that already do that. My objective is to give you enough information to

      • help you in your work
      • spark in you the will to know more

      so sometimes I have to simplify things in order to make them consumable

      • by everyone, especially those who don't have a deep database background yet
      • in just a bunch of minutes, since you're probably reading this while at work, and you surely have some work to deliver within a deadline that is always too close

      The articles in the series assume that you already have basic practical knowledge of a database: for example, you know how to create a table and how to query it.

      The "for those who know more" section is for those who already know something deeper about databases and who found the simplifications in the article too… simple. Here I explain and clarify the topics I had to simplify a lot, sometimes even telling a lighter version of the truth, in order to make them easily accessible to everyone. But the truth needs to be told, and this section is where I tell it.

    • Know your data(base), or Introduction to the “For the Better Developer” series

      After 15 years in Database Performance Tuning, Database Optimization, Data Modeling, Business Intelligence, Data Warehousing, Data Analytics, Data Science, Data Whatever, I'm back in the development space, where I also lived for the first 10 years of my career.

      Houston, we have a problem

      After 15 years I still see the same big problem: the huge majority of developers don't really know how to properly deal with data. This is not a criticism or a rant. It is just a matter of fact: especially several years ago, if you spoke to a developer, it was clear that dealing with data, and especially with a database, was not something he felt was part of his job.

      This is not true anymore; any developer today knows that some database knowledge is needed. No, better: it is mandatory. Still, that way of thinking, common years ago, created a huge hole in developers' bags of knowledge that is showing its bad effects now. If you didn't learn how to properly deal with a database, it's now time to fill that gap. Unfortunately, 15 years is a long time, and databases evolved immensely in that time frame, especially relational databases, so trying to recover the lost time may be really hard. Still, it's mandatory: for you, your team and the company you own or work for.

      I really do believe in community and sharing, so I decided to try to help fix this problem. If you want to read the full story, go on; otherwise you can just skip to the last paragraph. As always, it's your choice.

      Changing times

      The market changed, asking for more integrated and interdisciplinary knowledge in order to react more quickly to changes. Developers needed a way out: databases were just too old and boring, and they really didn't want to learn SQL, that freakin' old language.

      For a while the NoSQL movement seemed to come to the rescue, promising developers something that would take care of the data for them. Auto-indexing, auto-sharding, auto-scaling, auto-replication. JSON queries. JSON data. JSON everything. No schema. Just freedom, beers and peace for everyone.

      The world today

      Of course that was (obviously) too good to be true. Even NoSQL databases, now that they have matured, need to have data properly modeled and indexes correctly defined, used and maintained, not to mention security, transactions and techniques to optimize performance.

      There is a big difference between now and 15 years ago: now we live in a world where data is at the center of everything, and so developers need to know how to handle it. Look at any job posting for developers and you will find that some degree of database knowledge is always required.

      You probably already know how to query several databases, be they SQL or NoSQL: you had to learn it to survive.

      Become a Better Developer

      Frankly speaking, merely surviving is just not that nice. Evolving, thriving, accomplishing something bigger is nice. And to do that, you really need to be able to make a difference. To stand out from the crowd, there is only one thing you have to do: become good, extremely good, at dealing with data.

      You don't have to become a DBA or a Data Scientist if you don't want to. But dealing with data from a position of control, and not just trying to avoid dying under its weight, is something any good developer should learn.

      Do you want to use MongoDB? Go and become a MongoDB champion. Are you excited about the new Google Spanner? Go and learn it like there is no tomorrow. Choose one or more databases and master them. It will make a whole world of difference, since you will have the power to turn data into whatever you want, without having to reinvent the wheel every time.

      More and more computing is being moved into the database. It is just much easier to move compute logic into the database than to move the huge amount of data we deal with every day out to the application. This is already happening and will continue in the foreseeable future: database knowledge is now more important than ever. Performance, and thus costs, depend on how good you are at manipulating data.

      The cloud makes all these reasons even more important, since the relationship between performance and costs is stronger than when you are on-premises.

      A good data model, the correct indexing strategy and cleverly written code can improve your performance by orders of magnitude, and reduce costs at the same time.

      My choice

      I have chosen SQL Server for myself. I think that nowadays it is the most complete platform, supporting all the workloads one can ask for and offering NoSQL, in-memory and columnar capabilities all in one product. All with exceptional performance and a great TCO. Plus, it lives in the cloud.

      Just to be clear: I'm not a fanboy. I've also studied and used MongoDB, CouchDB, Memcached, Hadoop, Redis, Event Store, SQLite, MySQL, ElasticSearch and PostgreSQL. But since the Microsoft platform is where I work most of the time, it makes sense for me to use SQL Server.

      To be 100% honest and transparent: I'm also a Microsoft Data Platform MVP. But that's not why I like SQL Server and Azure SQL so much. It is because I wanted to be a developer who can make a difference that I learned SQL Server so well, which in turn led to my being recognized as an MVP. Since I'm a data lover, I always take a deep look at what the market has to offer: I always go for an objective and rational approach.

      “For the Better Developer” series

      I really love development, and I love it even more when I am among talented developers who love their job and know the technologies they work with extremely well, because when that happens, great things happen as a result. When that happens, my work becomes pure joy. It becomes poetry. I've been lucky enough to experience such situations several times in the past. And I want more.

      So I want to increase my chances of being in that situation, and that's why I'm starting a series of articles that will try to help everyone who wants to become a better developer with SQL Server and Azure SQL.

      If you're a developer who needs to use SQL Server or Azure SQL, you will find in the articles of this series a quick way to learn the basics and also where to look for deeper knowledge.

      Here I'll keep an updated list of the articles I have written so far:

      SQL Server Indexes

    • For The Better Developer: SQL Server Indexes

      Indexes?

      Indexes are a key factor for a database, relational or not. Without indexes the only option for a database would be to read all the data and discard what is not needed. Very inefficient.

      If a query has a WHERE, JOIN or GROUP BY clause and you feel performance is not good, before trying to figure out how to cache results to improve overall application performance — which will surely help, but will also put on you the burden of maintaining the cache, making the solution more expensive and complex to develop, maintain and evolve — keep in mind that you will surely benefit from using an index. How much is hard to say, but I would expect performance improvements in the order of 10x to 1000x. Yes: an index can make that much difference.

      More on caching and other stuff is discussed at the bottom of the article, in the “For those who want to know more” and “For those who know more” sections.

      Notes

      Wondering who I am and why I’m writing these posts? Read the introductory article here. Also be sure to read the notes about the “For the Better Developer” article series, so that you can know what to expect.

      The indexes you have to know

      There are two principal types of indexes in SQL Server: clustered and non-clustered. The main difference between the two is that a clustered index works directly on the table data, while a non-clustered index works on a copy of that data.

      Both index types support two different storage models: row-store and column-store.

      There are therefore four possible combinations to choose from when we need to add an index to our tables.

      Clustered Row-Store Indexes

      An easy way to explain this index is the encyclopedia or the phone directory. Data is stored according to the order specified by the index; more specifically, by the columns declared when the index is created.

      create clustered index IXC on dbo.Users(Surname, Name)

      In the example above, rows will be stored ordered by Surname and then by Name. Just like in a phone directory, duplicates are allowed.

      Each row will also be stored along with all the remaining column values. If the table dbo.Users has the following definition:

      create table dbo.Users
      (
      Id int not null,
      Surname nvarchar(50) not null,
      Name nvarchar(50) not null,
      DateOfBirth date not null,
      PhoneNumber varchar(50) not null,
      Address nvarchar(100) not null,
      City nvarchar(50) not null,
      State nvarchar(50) null,
      Country nvarchar(50) not null
      )

      the index row will contain not only the indexed columns but also the Id, DateOfBirth, PhoneNumber, Address, City, State and Country. This happens because, remember, a clustered index works directly on table data.

      Of course, since a clustered row-store index works directly on the table's data, there can be only one clustered index per table.

      When should you use this type of index? It is perfect when:

      • you are looking for zero, one or more rows in a table
      • you are looking for a range of rows: from “A” to “B” for example.
      • you need to have the rows returned in a specific order (see the sample query after this list)
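
      For example, with the IXC index defined above, a range query like the following sketch can be answered by reading just the relevant slice of the table, already in order (the surname range is, of course, just an example):

      -- Served directly by the clustered index on (Surname, Name)
      SELECT Surname, Name, PhoneNumber
      FROM dbo.Users
      WHERE Surname BETWEEN N'Rossi' AND N'Russo'
      ORDER BY Surname, Name;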

      As you can see, it fits all usage scenarios and, in fact, the clustered row-store index is very flexible and can be useful in almost every scenario, but unfortunately, as already said, only one can exist per table. This limitation comes from the fact that such an index physically orders the data in the table and, of course, you can order data in only one way at a time.

      If you need to order it in more than one way at the same time, you need to duplicate some or all of the data, and that's why Non-Clustered Row-Store Indexes exist.

      Non-Clustered Row-Store Index

      A Non-Clustered Row-Store index is somewhat like the index you can find at the beginning of a magazine, where the topics are listed along with a pointer to the page on which they can be found.

      In the case of SQL Server, when you create a non-clustered row-store index on a column — and, following the phone directory example, let's say the PhoneNumber column — all values in that column will be indexed, and each one will carry a pointer that allows SQL Server to relate that value to the table row that contains it.

      create nonclustered index IXNC on dbo.Users(PhoneNumber)

      Again, please note that values in the chosen columns need not be unique.

      A non-clustered index requires additional storage space, since it creates a copy of all the data in the columns you specified in the index definition, plus the space needed for the pointers.

      When the index is used to locate some data, the result of that search is a pointer. This means that, in order to get the real data, a lookup operation needs to be performed. For example, given a PhoneNumber, you want to know who will answer if you call it. SQL Server will use the index on PhoneNumber and, after finding the number you are looking for, will take the pointer and use it to find all the related values like Name, Surname, Address and so on. A lookup is nothing more than a jump to the location indicated by the pointer in order to access the real data.
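
      A query of this shape is the typical case (the phone number is obviously made up): the IXNC index finds the matching row, and a lookup fetches the remaining columns.

      -- Index seek on IXNC (PhoneNumber), then a lookup to get the other columns
      SELECT Surname, Name, Address, City
      FROM dbo.Users
      WHERE PhoneNumber = '425-555-0123';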

      Of course, this additional operation has some overhead, overhead that is directly proportional to the number of rows for which the lookup is needed. Are you looking for a thousand rows? A thousand lookups need to be done. The consequence of this behavior is that it doesn't always make sense to use an index, even if one exists. If the effort of doing all the lookups (which can be an expensive operation) is too high compared to reading the full table and just discarding the rows that don't fall within the scope of the requested search, the index will not be used.

      As a general rule, then, it can be said that this index is perfect when you are looking

      • for zero or one row in a table
      • for *very few* rows in a table

      As a rule of thumb, "very few" means 1% or less of the rows in your table.

      Clustered Column-Store Index

      This index effectively turns SQL Server into a columnar database. Really. Once created on a table, since it is a clustered index, it will change the way SQL Server stores that table: instead of saving data one row at a time, data will be saved one column at a time, following the principles of columnar databases. For that reason you don't need to specify any columns when creating such an index, since it applies to the whole table and thus all columns will be indexed:

      create clustered columnstore index IXCCS on dbo.TrainingSessions

      Column-store indexes are perfect for analytical queries. All queries in which data needs to be grouped, optionally filtered and aggregated fall into this scope. Performance can easily be increased by a factor of 10x, and it's quite common to see improvements of 30x and more. This is also a result of the fact that, to work properly, columnar solutions need to compress data, and SQL Server is no exception: the clustered column-store index will compress the data in your table using several techniques, and you may see your table's space shrink a lot, up to 60%-70% or even more.

      Column-store indexes are especially good in read-intensive scenarios, while they do not excel at being written to very often. The exception is the bulk-load scenario. A bulk load is activated when you use the T-SQL command BULK INSERT, when you load data into SQL Server using the SqlBulkCopy object, or when you use any solution, like SQL Server Integration Services, that supports the bulk-load option.

      If you have a table in which the majority of the queries have this pattern:

      SELECT
          <aggregation_function>(Column1),
          <aggregation_function>(Column2),
          ...
      FROM
          <table>
      WHERE
          ...
      GROUP BY
          Column3,
          Column4,
          ...

      and write operations are few compared to read operations, then give this index a try. You will probably be amazed by how quickly data can be aggregated and manipulated, even on huge tables (huge: hundreds of millions of rows or more).

      The downside of this index is that, while it is great for aggregations, it is not that good at retrieving one or a few rows with all their columns. So you may want to mix row-store and column-store indexes together, and that's why the next, and last, index on the list may be very useful too.

      Non-Clustered Column-Store Index

      This index allows you to create a column-store index on only a subset of the columns of your table.

      create nonclustered columnstore index IXNCCS on dbo.Users(City, State, Country)

      This is helpful when you need your table to support a row-store index, which is quick for retrieving whole rows, but you also need to quickly aggregate over some columns of your data.

      Using our phone directory example again, the above index will make any aggregation on City, State or Country, or a combination of them, really, really fast.
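
      A sketch of the kind of aggregation that the IXNCCS index above serves well (the grouping columns are chosen just for illustration):

      -- Answered by the non-clustered column-store index on (City, State, Country)
      SELECT Country, State, City, COUNT(*) AS UsersCount
      FROM dbo.Users
      GROUP BY Country, State, City
      ORDER BY UsersCount DESC;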

      Let's take a more real-life scenario: if you have a database that stores training sessions, for example, and you want to present your users a nice dashboard with the last six months of the most useful indicators, like calories burnt, number of trainings done and so on, aggregated, and you also need to be able to quickly retrieve a single row or a bunch of rows to show all the data you have for a specific training session, then the non-clustered column-store index, especially when used together with a clustered row-store index, is just perfect for the job.

      Be Warned

      Creating, altering or dropping an index may have a huge impact on the server. The bigger the table, the bigger the impact. Depending on the size of your data, such an operation can take seconds or hours, and in the meantime all users trying to use the database will be affected in one way or another. So before playing with indexes, test the script on a test server to evaluate the impact.

      Conclusions

      You now have the basic foundation to understand how to make applications go faster. Much, much, much faster.

      Take the sample above, replace trainings and users with your own objects, like invoices, bills, orders, whatever you want, and you can easily see how SQL Server can be perfect for managing a real-time analytical workload and, at the same time, a transactional workload, which is quite a typical requirement.

      This is a really unique feature in the DBMS space today.

      My opinion is that, if you have less than 1 TB of "hot" data (data updated quite frequently during the day), you don't really need to look for a Big Data solution, since the overhead it will put on you and your team can be so big that I can hardly justify it for 1 TB of data. If you have 1 PB, or close to it, then let's talk.

      External References

      Indexes are a really huge topic; here we have just scratched the surface, and only a bit. I'll surely talk more about indexes in upcoming posts, but if you don't want to wait and you're interested in the topic, which is my main objective with this post series, you can go to these external references:

      For those who want to know more

      Caching

      SQL Server has its own internal cache. Actually, it has several different caches. The most important is the buffer cache: this is where data lives. In a healthy situation, 99% of your data comes from the cache. Yeah, that's right, 99%.

      Row-Store Indexes

      Row-store indexes are implemented as B+trees. The technical difference between clustered and non-clustered is that in a clustered index the leaf pages contain the actual data, while in a non-clustered index they contain only the data related to the columns used or included in the index, plus the pointer.

      Clustered Row-Store Index

      SQL Server doesn't really enforce the physical order of the data after the index has been created. Data is ordered following the index definition during the index creation phase, but after that, if there are operations that insert, delete or update data, the physical order is not preserved. What is preserved is the logical order of the data. This means that data will be read in the correct sequence, which may be different from the sequence in which rows can be found on disk. This logical/physical difference is index fragmentation, which is not necessarily a bad thing. It usually is a bad thing, but not in 100% of cases. For example, if you always access your data one row at a time, you care little about fragmentation. Or, if you want to insert data as fast as possible, you don't really care about fragmentation either; on the contrary, you may want fragmented data in order to reduce the chance of hot spots where contention happens and your performance suffers. Anyway, as usual, edge cases like these two must not be taken as the general rule.

      Non-Clustered Row-Store Index

      You can have more than one non-clustered row-store index on a table. Of course, you can create such an index on more than one column. This enables some nice "tricks", like the covering index, that I will describe in future articles.

      Other Indexes

      There are many more index types than the ones I described. They are specific to a certain data type, like geospatial indexes, or to a specific feature that I will discuss in another post, like Hekaton, the in-memory, lock-free engine. The indexes I described in this article are the most commonly used.

      For those who know more

      Caching

      Caching is usually a nice thing to have, and I also use caching solutions quite a lot. I just don't want to use it as a surrogate for the performance I can get right from my database. Caching is perfect, in my opinion, in highly concurrent systems, to reduce the stress on your database so that you can deliver the same user experience with a less expensive solution (be it smaller hardware on-premises or a lower performance level in the cloud).

      Also keep in mind that SQL Server offers an extremely good option for creating customized caching solutions using in-memory tables, as Bwin, the online betting giant, demonstrated. So if you are already using a SQL Server version that supports this feature, it may be worth giving it a try. Otherwise Redis and the like are just perfect.

      The “User” table

      Yes, the table used in the sample has several design problems (both logical and physical), for example:

      • there is a functional dependency between City, State and Country
      • there is no Primary Key defined
      • data types are really not the best possible

      but I just wanted to show something that is as close to reality as possible, so it will be easier to relate the example to your own experience. In addition, I'll talk about keys and indexes in other dedicated articles, so here I went for something that is just plain, simple and easy to understand.

      You really came this far?

      Well, thanks for reading. :)
