SQLblog.com - The SQL Server blog spot on the web

Davide Mauri

A place for my thoughts and experiences with the Microsoft Data Platform

  • Lateral Thinking: Transitive Closure Clustering with SQL Server, UDA and JSON

    For a feature we’re developing at Sensoria, we had to solve one of the well-known, yet hard to solve, problems with data and the relationships between elements in a data set. While there are several ways to solve it (just search for “transitive closure” with your favorite search engine), I’d like to describe here a very interesting approach that not only shows how to get the most out of SQL Server/Azure SQL, .NET and the newly added JSON support, but also highlights that one key asset of an architect / developer, in a world where (so-called) AI is going to be very strong, is the human ability to find creative solutions. Thinking out of the box or, in other words, practicing some lateral thinking, is going to be a key factor in the future: let me show you one case that explains why.

    The Problem

    Let’s use an example to clarify the problem. Say you are at a party where a lot of people have been invited, and you wonder how, and if, people are connected to each other via common friends. Let’s use letters to easily identify people, and say we have this situation:

    As you can see, people can be divided into two groups, so that each group, or cluster, is made only of those people who are connected to each other via a common friend, a friend of a friend, and so on.

    This means that you have to find the transitive closure for the elements and then create groups so that the elements with the same transitive closure (put more simply: those that are directly or indirectly related to each other) end up in the same group.

    The obvious solution

    This is a typical graph problem and so, since we’re using SQL Azure, we tried to use the new Graph features. Unfortunately, calculating the transitive closure is not yet supported, so another solution was needed.

    It seemed that the only option to solve our problem was to use a graph database (Azure Cosmos DB would have been the choice), but that would have required us to move data in and out of our database, which is Azure SQL. That in turn would have made our architecture a little more complex, and thus more expensive to manage and maintain, and, in addition, we would have needed to figure out how to keep the two databases in sync.

    Nothing really too complex or hard but, before going down that road, I decided to spend some time figuring out whether such a solution would be viable with good old SQL. If so, it would help us save time and money while keeping the overall architectural complexity low (which helps keep maintenance costs low and performance high). Modeling a graph in a relational database is possible, and it is also quite easy, as you can write the data shown above as a set of pairs:
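    A minimal sketch of such a pair table, using the column names that appear in the query further below (the actual schema in the accompanying code may differ):

    CREATE TABLE dbo.[Friends]
    (
        [Person]     INT NOT NULL,  -- an element (a person at the party)
        [IsFriendOf] INT NOT NULL   -- another element directly connected to it
    );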

    but performance is usually less than good when compared to graph databases. In our case we had a very specific use case: we just needed to group all the elements that are connected together, and thus we could focus on solving this specific problem.

    The creative solution

    I proposed the problem to my good friend Itzik Ben-Gan, who helped me find a very nice SQL-only solution that will soon be published in SQL Server Magazine, but I also decided to try a different, creative approach, just to experiment a bit and keep my lateral thinking abilities trained.

    At first glance this sounds like the perfect job for a UDA (User-Defined Aggregate), but there is an additional problem: a user-defined aggregate must return just a scalar value, which is the result of the aggregation function applied to all values that belong to the same group. If the aggregation function is a sum, the returned scalar is the value obtained by summing all the group values; if the aggregation function is concatenation, the resulting scalar is a string containing all the string values in the group concatenated one after the other.

    Now, what if, due to how the custom aggregation function works, the data may generate subgroups? And what if the number of such subgroups cannot be known in advance, but only after the data has been processed?

    Let’s say, for example, that you want to take all the orders of a customer and create an aggregation function that splits them into two groups: those above the customer’s average order amount and those below. This can be easily done in SQL, I know, so I would never create a UDA for this, but it easily and clearly illustrates the problem. This kind of problem cannot be solved using a UDA, it seems, since the return value must be a scalar and nothing else: no sub-grouping, because how would you fit subgroups into a scalar value?

    Unfortunately for us, this constraint is a blocking problem, since we need to read all the data, and only after calculating the transitive closure of each element do we know how many groups we really have. It may be one, but it may be more than one, like in the example I described at the beginning.

    Now, let’s try to think in a very creative way: what is a scalar, but can also be seen as a complex object… like an array of objects? Yes: JSON is an answer. Let’s say we don’t use letters but numbers, so the original data can be rewritten as follows:
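    For example, a hypothetical set of pairs consistent with the two clusters of the result shown below could be:

    INSERT INTO dbo.[Friends] ([Person], [IsFriendOf])
    VALUES
        (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),    -- first cluster: 1 to 6
        (7, 8), (8, 9), (9, 10), (10, 11);         -- second cluster: 7 to 11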

    Now, if an aggregation function could return a “scalar” value like:

    {
    "1": [1, 2, 3, 4, 5, 6],
    "3": [7, 8, 9, 10, 11]
    }

    then the problem would be elegantly solved. As you can see the two groups are correctly identified and each group has a unique number assigned that identifies it.

    Now, say that the UDA is called TCC:

    Once you have such JSON, transforming it into a table is really simple:

    WITH cte AS
    (
        SELECT
            dbo.TCC([Person], [IsFriendOf]) AS Result
        FROM
            [dbo].[Friends]
    ),
    cte2 AS
    (
        SELECT
            CAST(J1.[key] AS INT)   AS groupid,
            CAST(J2.[value] AS INT) AS id
        FROM
            cte
        CROSS APPLY
            OPENJSON(cte.[Result]) AS J1
        CROSS APPLY
            OPENJSON(J1.[value]) AS J2
    )
    SELECT
        *
    FROM
        cte2;

    And the result will be a table with one row per (groupid, id) pair: groupid 1 for elements 1 through 6 and groupid 3 for elements 7 through 11.

    Problem solved!

    Now, this is exactly what I have built over the last few days. You can find the fully working code and example data here:

    yorek/non-scalar-uda-transitive-closure

    What I love about this solution, besides the performance that will be discussed later, is how it keeps things easy, elegant and simple. All the complexity is hidden and encapsulated in the UDA, data doesn’t need to move between different systems, which helps reduce friction and thus costs, and there is no need to learn a new language, like Gremlin, to solve our small and very specific problem.

    Performance

    What about performance? Well, this is one of the very few occasions where you can actually beat the database optimizer. Thanks to the fact that the UDA allows you to scan the data only once, you can load each number into a list and add numbers to that list only if they are connected. If they are not connected, just create a new list. If you discover at some point that two lists have a common number (or friend, to follow the original example), you merge them into one.

    This is just the perfect use case for a HashSet, which has constant search time, O(1), but unfortunately it cannot be used in a SQLCLR object since it is marked as MayLeakOnAbort.

    The other object that offers the same constant search time is the Dictionary, via its ContainsKey method, and it can be used in SQLCLR. So I’ve built the entire algorithm around dictionaries, key-value pairs whose value is always set to true (it could have been anything else, since I don’t use that value at all), where each number belonging to the group represented by the dictionary is stored as a key.

    Performance is great, as you can see in the comparison chart here:

    The values on the horizontal axis describe how the random test data was built: 2000x100 means that data was generated so that there would be 2000 resulting groups, each one with 100 elements in it. The SQL solution used is the best one we’ve been able to find (big thanks to Itzik, of course, who came up with a very elegant and clever solution). Of course, if you can come up with a better one, let me know.

    Conclusions

    Lateral thinking is our secret weapon. If you are afraid of AI coming to steal your job, don’t be, and try to solve problems in the most creative way, using your technical knowledge, intuition, gut feeling and your ability to invent a solution where one doesn’t even seem to exist.

    The very first solution I tried was so slow that after minutes, even on small data sets, it was still running.

    But then, in a couple of days, I was able to find this solution, which is just what we need right now: it allows us to keep overall architecture complexity low and gives us amazing performance, while also allowing us to save money (we’re 100% on the cloud, so we pay for each bit we use… in this case that means several thousand per year).

    If in the future we need more complex graph support, extending our platform to use Cosmos DB (or any other graph database) will be inevitable and we’ll gladly embrace it, be assured: the message here is not about which technology is best for what, but that one should always try to look for a solution different from the obvious one. It just may be the best one for the target use case.

  • Historical CRUD and Application Business Events

    PASS took almost all my “free” time and I almost forgot to write a quick post to let you know that tomorrow, November 8th (so I’m just right on time!), Dino Esposito, whom I’m sure you all know, will deliver an interesting session for the PASS AppDev VG:

    Historical CRUD and Application Business Events

    Every time you update or delete a record in classic relational database you lose information. In particular, you lose the state of the entities which was current until you changed it. For many years this automatic loss of information didn’t represent a problem for companies. But now, in the days of big data and business intelligence, any data you lose is potentially money you lose. Until the release of SQL Server 2016, for updating the state of a system you had two options: lose old data or shift to another persistence paradigm like Event Sourcing. SQL Server 2016 provides an intriguing middle ground as it introduces concepts like temporal tables and native JSON serialization and indexing that combined with existing columnar indexing make it suitable to be the storage of choice even for event-driven systems. However, the real question is do you need plain CRUD or domain-specific CRUD? If it’s the latter, you need to move towards Event Sourcing.

    registration is free via the usual link:

    http://appdev.pass.org/

    and, again as usual, the session will be recorded and made available on the group YouTube channel a couple of days after the live event.

  • For the Better Developer: When indexes are not enough

    Another month, another session for the PASS Application Development VG. This time I will be both the host and the guest :), talking about something that is very close to my interests and, I think, also a key point for any developer who wants to be successful in a future shaped by AI. Here’s the session title and abstract:

    For the Better Developer: When indexes are not enough

    If you want optimum performance in SQL Server, you have to use indexes. But what if you have already implemented indexes and your solution is still slow or doesn’t scale like you want? Or, if you’re on Azure, it just requires too many resources, which in turn means more money spent on it? You may have to rethink the way you write your queries. How you write your queries is directly related to how you approach and solve your business problems, and more often than not thinking outside the box is the way to unlock incredible performance. But what exactly does this mean in a database? And how can a developer do that? In this session, you’ll see how using some lateral thinking and a set-based approach will open up a whole world of possibilities. Thanks to this demo-intensive session, you'll never be the same after switching on this new mindset!

    registration is free via the usual link:

    http://appdev.pass.org/

    The live event will take place on October 18th and, again as usual, the session will be recorded and made available on the group YouTube channel a couple of days after the live event.

  • What's new in Entity Framework Core 2.0?

    In just a couple of weeks the PASS Application Development Virtual Group will host a new webinar, this time with Chris Woodruff, who will talk about what’s new in EF Core 2.0:

    What's new in Entity Framework Core 2.0?

    Entity Framework (EF) Core is the lightweight, extensible, and cross-platform version of Entity Framework, the popular Object/Relational Mapping (O/RM) framework for .NET. This talk will give a quick look at the new features for .NET Core 2.0. We will look at the new release, how to install it and show sample projects covering the following topics: Improved LINQ Translation, Owned Entities and Table Splitting, Global Query Filters, DbContext Pooling, String interpolation in raw SQL methods, Explicitly compiled queries, Self-contained entity configurations in code first

    as usual attending is free, just register here:

    http://appdev.pass.org/

    and the session will be recorded and made available on the group YouTube channel (http://bit.ly/appdevyoutube) within a few days after the live broadcast.

  • Working with SQL Server for Linux Cross-Platform

    Next week, on August 24th, a new webinar from the PASS Application Development Virtual Group will be aired. This time the topic is the cross-platform support introduced with the forthcoming SQL Server 2017. Maximo Trinidad will deliver the following session:

    Working with SQL Server for Linux Cross-Platform

    You're welcome to explore the latest build of SQL Server for Linux. Connect to SQL Server looking into the tools available for both Linux and Windows. Maximo will provide samples on querying for database information using Python and PowerShell between two environments. This will be a demo intensive session you will not want to miss!

    as usual registration is free, via the following link:

    http://bit.ly/appdevreg

    and the session will be recorded and made available on the group YouTube channel (http://bit.ly/appdevyoutube) within a few days after the live broadcast.

    Don’t miss the webinar if you’re thinking about, planning, or even already using SQL Server in a multi-platform environment!

  • Azure SQL: work with JSON files where they are

    Dealing with CSV or JSON data is more and more common today. I do it on a daily basis, since the data our applications send to our microservice gateway backend is in a (compressed) JSON format.

    Sometimes, especially when debugging or developing a new feature, I need to access that JSON data before it is sent to any further microservices for processing or, after that, stored in the database.

    So far I have usually used CloudBerry Explorer to locate and download the JSON file I was interested in, and then a tool like Notepad++ with the JSON-Viewer plugin, or Visual Studio Code, to load and analyze its content.

    Since Azure SQL is our main database, I spend a lot of time working with T-SQL, so I would really love to be able to query JSON directly from T-SQL, without even having to download the file from the Azure Blob Storage where it is stored. This would make my work more efficient and easier.

    I would love to access JSON where it is, just like Hadoop or Azure Data Lake allows you to do

    Well, you can. I just found out that, with the latest additions (added in SQL Server 2017 CTP 1.1 and already available on Azure SQL v 12.0.2000.8), it is incredibly easy.

    First of all, a Shared Access Signature needs to be created to allow Azure SQL to access the Azure Blob Storage account where you have your JSON files. This can be done using the Azure Portal, from the Azure Storage blade:

    SAS Signature Creation Window

    or you can also do it via the Azure CLI 2.0 as described here:

    Azure CLI 2.0: Generate SAS Token for Blob in Azure Storage

    Once you have the signature, a Database Scoped Credential that points to the created Shared Access Signature needs to be created too:
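    A minimal sketch of the command, with placeholder names and a placeholder SAS token (the actual values come from the signature created above):

    -- prerequisite, if the database doesn't have one yet:
    -- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    CREATE DATABASE SCOPED CREDENTIAL MyAzureBlobStorageCredential
    WITH
        IDENTITY = 'SHARED ACCESS SIGNATURE',
        SECRET = 'sv=2017-04-17&ss=b&...';  -- the SAS token, without the leading '?'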

    If you haven’t done it before you will be warned that you need to create a Database Master Key before being able to run the above command.

    After the credential is created, it’s time to point to the Azure Blob Storage account where your JSON files are stored by creating an External Data Source:
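    Again a minimal sketch, with placeholder account and container names:

    CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
    WITH
    (
        TYPE = BLOB_STORAGE,
        LOCATION = 'https://<myaccount>.blob.core.windows.net/<mycontainer>',
        CREDENTIAL = MyAzureBlobStorageCredential
    );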

    Once this is done, you can just start to play with JSON files using OPENROWSET along with OPENJSON:
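    Something along these lines (the file name is a placeholder; MyAzureBlobStorage is the external data source created above):

    SELECT J.[key], J.[value]
    FROM OPENROWSET
         (
             BULK 'my-session.json',
             DATA_SOURCE = 'MyAzureBlobStorage',
             SINGLE_CLOB
         ) AS T
    CROSS APPLY OPENJSON(T.BulkColumn) AS J;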

    and voilà, the JSON content is right at your fingertips. For example, I can access all the activity data contained in our “running session” JSON:
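    For example, assuming a hypothetical structure and hypothetical property names for the “running session” JSON, something along these lines extracts the activity data into a relational shape:

    SELECT A.*
    FROM OPENROWSET
         (
             BULK 'running-session.json',
             DATA_SOURCE = 'MyAzureBlobStorage',
             SINGLE_CLOB
         ) AS T
    CROSS APPLY OPENJSON(T.BulkColumn, '$.activityData')
    WITH
    (
        [timestamp] DATETIME2    '$.ts',
        heartRate   INT          '$.hr',
        pace        DECIMAL(5,2) '$.pace'
    ) AS A;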

    This is just amazing: now my work is much simpler, especially when I’m traveling and maybe don’t have good internet access. I can process and work on my JSON files without them ever leaving the cloud.

    What about CSV?

    If you have a CSV file the technique is very similar, and it is already documented in the official Microsoft documentation:

    Examples of Bulk Access to Data in Azure Blob Storage

    What about on-premises?

    The same approach is also doable with SQL Server 2017 (now in CTP 2.1). You can also access files not stored in the cloud, but on your on-premises storage. In that case, of course, you don’t specify a Shared Access Signature as the authentication method, since SQL Server will just rely on Windows authentication. Jovan showed a sample usage here:
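    A minimal sketch of the same idea, with a hypothetical local path (Jovan’s original example may differ):

    SELECT J.[key], J.[value]
    FROM OPENROWSET(BULK 'C:\Temp\my-session.json', SINGLE_CLOB) AS T
    CROSS APPLY OPENJSON(T.BulkColumn) AS J;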

    Is the code available?

    Sure, there is a Gist for that:

    https://gist.github.com/yorek/59074a4c4176191687d6a17dabb426ed

    • Speaking @ PASS Summit 2017

      After years and years of talking about Business Intelligence and Business Analytics, this year at PASS I’ll be speaking about application development and how to make the best use of the data platform that Microsoft has created so far.

      I started talking about the impedance mismatch at the beginning of my career and now, after 15 years of Business Intelligence, Big Data, Analytics and Data Warehousing, I’m back talking about it again, but this time there is an amazing new approach to the problem: microORMs.

      MicroORMs are ORMs that take care of only one thing: mapping database result sets to classes, and nothing more. I love the idea because it allows developers and DBAs to leverage all the power of the RDBMS while, at the same time, removing all the boring and error-prone plumbing code. As Van Halen said, it is just the “Best of Both Worlds”.

      As a result, microORMs are, in my opinion, just perfect in a microservices scenario. They make things simple, but not simpler, while preserving performance and making a developer more productive (given that he/she must know how to query a database, something that is mandatory IMHO).

      More specifically, I’ll be talking about Dapper.NET, a microORM I’ve learned to love in the last few years and that I use every day:

      Dapper: the microORM that will change your life

      ORM or Stored Procedures? Code First or Database First? Ad-Hoc Queries? Impedance Mismatch? If you're a developer or you are a DBA working with developers you have heard all these terms at least once in your life…and usually in the middle of a strong discussion, debating about one or the other. Well, thanks to StackOverflow's Dapper, all these fights are finished. Dapper is a blazing fast microORM that allows developers to map SQL queries to classes automatically, leaving (and encouraging) the usage of stored procedures, parameterized statements and all the good stuff that SQL Server offers (JSON and TVP are supported too!) In this session I'll show how to use Dapper in your projects from the very basics to some more complex usages that will help you to create *really fast* applications without the burden of huge and complex ORMs. The days of Impedance Mismatch are finally over!

      See you at PASS 2017!

    • T-SQL Window Functions

      I’ll soon post something on this, but let me tell you right here: if you don’t want AI to take your DEV/DBA/BI job, you *need* to be smarter than AI (which is not really AI at all right now, but let’s keep on playing the marketing game). One way to become smarter is to - guess! - use your brain and train it to think out of the box, practice lateral thinking and, more in general, do all the things that brute force and Machine Learning cannot do.

      Of course, when you do that, you have to be supported by a language that allows you to exploit all that potential. SQL is one of those nice languages. So don’t miss Itzik Ben-Gan’s session on T-SQL Window Functions on July 13th for the Application Development Virtual Group:

      T-SQL Window Functions

      Itzik Ben-Gan

      T-SQL window functions allow you to perform data analysis calculations like aggregates, ranking, offset and more. When compared with alternative tools to achieve similar tasks like grouping, joining, using subqueries, window functions have several interesting advantages that allow you to solve your tasks often more elegantly and more efficiently. Furthermore, as it turns out, window functions can be used to solve a wide variety of T-SQL querying tasks well beyond their original intended use case, which is data analysis. This session introduces window functions and their evolution from SQL Server 2005 to SQL Server 2017, explains how they get optimized, and shows practical use cases.
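      To give a taste of what the session covers, here is a minimal sketch, against a hypothetical Sales.Orders table, of a running total, a ranking and an offset calculation done with window functions:

      SELECT
          custid,
          orderid,
          orderdate,
          val,
          SUM(val) OVER (PARTITION BY custid
                         ORDER BY orderdate, orderid
                         ROWS UNBOUNDED PRECEDING)    AS running_total,  -- running total per customer
          RANK()   OVER (PARTITION BY custid
                         ORDER BY val DESC)           AS val_rank,       -- ranking per customer
          LAG(val) OVER (PARTITION BY custid
                         ORDER BY orderdate, orderid) AS prev_val        -- previous order's value (offset)
      FROM Sales.Orders;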

      Registration via the usual link:

      http://appdev.pass.org/

      Don’t miss it, whether you’re into SQL Server or not. It will be useful to everyone who works with a database that supports the ANSI SQL 2003 standard and later. This means SQL Server and Azure SQL, of course, but also Hive, Spark SQL, Postgres, MySQL (nope, sorry, not that one) and MariaDB.

    • Introduction to Azure Cosmos DB: Microsoft’s Globally Distributed, Multi-Model Database Service

      It’s time for something NoSQL at the PASS Application Development Virtual Group. Next week Denny Lee will present on Cosmos DB, the latest evolution of DocumentDB:

      Introduction to Azure Cosmos DB: Microsoft’s Globally Distributed, Multi-Model Database Service

      At Build we announced Azure Cosmos DB, the industry’s first globally-distributed service that enables you to elastically scale throughput and storage across any number of geographical regions while guaranteeing low latency, high availability and consistency – all backed by industry leading, comprehensive SLAs. Azure Cosmos DB allows you to use key-value, graph, and document data in one service, at global scale and without worrying about schema or index management. Join this upcoming webinar to learn more about Azure Cosmos DB, what it offers developers and how it can help you quickly build your modern applications.

      I’m looking forward to it: keeping up to date with all the database technologies available in the cloud is a mandatory task for today’s architects and developers, so don’t miss it. Registration, as usual, is free here:

      http://appdev.pass.org/

    • Graph data processing with SQL Server 2017 and Azure SQL DB

      As soon as I saw that SQL Server 2017 supports graph database concepts (queries and modeling) I started to play with it and wanted to know more. So I took the chance to ask the program managers working on that feature to come and speak at the PASS Application Development Virtual Group, and they kindly accepted. :)

      So, if you’re also interested in the topic (and you should be!), point your browser here and register; the meeting is set for June 8th, online:

      Graph data processing with SQL Server 2017 and Azure SQL DB

      Shreya Verma & Arvind Shyamsundar

      Microsoft has introduced graph database processing to SQL Server 2017, which will help users track many-to-many relationships between entities and gather powerful insights from different pieces of connected data. This is well suited for applications where users need to extract information from different pieces of connected data, for example IoT, fraud detection, recommendation engines, predictive analysis, etc. In this session, we will learn about the new features and demonstrate them by going over some use cases.
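      To give a taste of the new syntax, here is a minimal sketch, with hypothetical node and edge tables, of how a graph can be modeled and queried with the new MATCH predicate:

      -- node and edge tables (SQL Server 2017 graph syntax)
      CREATE TABLE dbo.Person
      (
          id   INT PRIMARY KEY,
          name NVARCHAR(100)
      ) AS NODE;

      CREATE TABLE dbo.FriendOf AS EDGE;

      -- find everyone who is a friend of Alice
      SELECT p2.name
      FROM dbo.Person AS p1, dbo.FriendOf AS f, dbo.Person AS p2
      WHERE MATCH(p1-(f)->p2)
        AND p1.name = N'Alice';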

    • From ElasticSearch back to SQL Server

      Sharing with the community why we decided to move back from ElasticSearch to SQL Server

      Sometimes it happens: you move from something that is considered cool and bleeding edge back to something that has dust on it.

      Well, at least this is how SQL Server has been perceived lately by the majority of developers and software architects. Microsoft must have heard this sort of disappointment with its RDBMS and decided to make up for all the lost time and try to make it cool again.

      If you think that SQL Server (and its twin, SQL Azure) is still the “old” RDBMS with all its “rigid” limitations, my advice, as a Software and Data Architect, is to go back and check it again.

      This is what we did at Sensoria over the last months, and we have now moved everything from ElasticSearch back to SQL Server.

      I think that what we discovered and the experience we gained are worth sharing: I believe that a lot of companies will find themselves (or already are) in the same situation. Maybe it won’t be ElasticSearch but another product, but the story will be the same. Here it goes.

      The Story So Far

      Something like five years ago, if you were looking for a highly scalable database solution that could also provide a lot of flexibility in terms of the objects stored and could also offer a good analytical solution, one of the options was ElasticSearch. It can ingest and index JSON documents using a schema-on-read approach (instead of the schema-on-write typical of an RDBMS), it allows for really easy scale-out, it offers a very nice monitoring tool (based on Kibana), it uses REST as its communication protocol so it is compatible with everything (which makes it perfect for the web), and, last but not least, it is based on the well-known Lucene technology, which is solid and mature.

      So ElasticSearch it was. It did great for several years; millions and millions of documents have been stored in it, coming both from end users using our products and from internal machine learning algorithms that crunch numbers and try to turn all the raw data into something more useful for the athlete, making sense of the data we gather from garment sensors.

      Requirements and possible options

      When I joined Sensoria, we were in the middle of an architectural revision. Of course, one of the systems being reviewed was the database. We had to decide whether to stay with ElasticSearch or move to something different. I was in charge of this decision, given my strong data architect background, and I could have chosen any database. We could have stayed on ElasticSearch or switched.

      ElasticSearch has lately clearly stated that its focus is on search and analytics, and that ruled it out immediately, since we need a database where we can store all our data in the first place.

      We started to evaluate the options. One of our main objectives is to go 100% PaaS so, using Azure, one of the first options was DocumentDB (now CosmosDB).

      Besides storing and querying JSON, we also looked for something that would allow us to quickly aggregate data using any possible combination of filters and aggregation logic. For example, someone may want to see their heart rate aggregated on a weekly basis for the last two years, while someone else may want their pace aggregated by day over the last month. This must be done in real time while, at the same time, new sessions are loaded and processed.

      We need to ingest data coming from our IoT devices, apps and SDKs, but we also need to store “classic” data like user accounts, preferences, device info and so on.

      Architectural considerations

      Of course there is no one-size-fits-all solution. Or, better, there is no such solution at a decent cost.

      We’re also implementing a Lambda architecture, and the final data store used to serve data to the end user only contains “squeezed” data, where the schema is actually well known; otherwise developers wouldn’t know how to interact with the stored data in order to efficiently generate statistics and charts. In the other stages data is stored in other, more suitable places. For example, we’ll use a streaming query engine to query data flowing into the system in real time, and we’ll be using a data lake to store raw IoT data.

      Such an architecture allows us to focus only, at least for now, on the master database (I like to call it an online transactional data warehouse, since what we want to do is basically real-time analytics), so the search scope is now narrowed down and well defined.

      Technical decision?

      The truth is that almost all the enterprise databases we evaluated, relational or not, provide, albeit with different capability levels, the features we were looking for in our master database.

      So the decision was not only a technical one, but more of a strategic one. Which database, in our context, will give us the richest set of natively supported features, the lowest amount of maintenance and the simplest platform, so that our developers can use it efficiently? All at competitive costs, of course.

      SQL Azure

      As is obvious from the title of this article, the final decision was to use Azure SQL. You may think my choice is biased since I’m a Microsoft Data Platform MVP and my background is mostly SQL Server. “If all you have is a hammer, everything looks like a nail”, right? Sure, but while that may have applied to SQL Server several years ago, the engine has improved a lot in the latest versions, both in terms of raw performance and of features offered. Here are the features that motivated my choice:

      • Columnar storage support: perfect for real-time aggregations
      • Columnar and Row storage at the same time: optimum to perform aggregate and singleton queries
      • In-Memory (Lock-Free structures): great to support high-concurrency workloads
      • Partitions: perfect for distributing workload
      • Temporal Tables: great to simplify development
      • JSON support with Non-Clustered & Generalized Inverted Indexes: perfect to support the migration from ElasticSearch and also to make life easier for developers (see the sketch after this list)
      • Row-Level Security: optimum for securing data at the source
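
      As a small illustration of the JSON point above, here is a minimal sketch, with a hypothetical table and hypothetical JSON property names, of how a JSON payload can be validated, indexed through a computed column and queried:

      CREATE TABLE dbo.SessionData
      (
          SessionId    INT IDENTITY PRIMARY KEY,
          UserId       INT NOT NULL,
          Payload      NVARCHAR(MAX) NOT NULL CHECK (ISJSON(Payload) = 1),
          -- computed column extracted from the JSON payload, so it can be indexed
          AvgHeartRate AS CAST(JSON_VALUE(Payload, '$.avgHeartRate') AS INT)
      );

      CREATE INDEX IX_SessionData_AvgHeartRate
          ON dbo.SessionData (AvgHeartRate);

      -- mix relational predicates and JSON extraction in the same query
      SELECT UserId, AVG(AvgHeartRate) AS AvgHr
      FROM dbo.SessionData
      WHERE JSON_VALUE(Payload, '$.activity') = 'running'
      GROUP BY UserId;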

      And these are the features that are available now. In the near future SQL Azure will also provide native support for Graph Queries, all within the same engine.

      So technically it’s great, but that alone was not sufficient to justify it. Costs played a huge role in the final decision. Initial tests showed that we could handle the load managed by several ElasticSearch servers with just an entry-level Azure SQL Premium database, thanks to the columnar indexes. Costs showed a reduction ratio of 3:1, which was immediately loved by the business guys.

      Integration is a key factor

      For me, having all the aforementioned features in just one product is really a big thing. Of course SQL Server / Azure SQL may not excel in all of them, so you don’t get the best of the best but, except for rare edge cases, I prefer to have one tool that provides a hundred features at a near-perfect level rather than a hundred tools each doing just one thing, but perfectly.

      The fact that everything is integrated means that we don’t have to move data around, manually figuring out how to integrate different data stores while keeping logical consistency and integrity of data.

      Simplicity is a big plus. Making our solution simpler, much simpler, means that we have lower development and maintenance costs.

      Having to learn only one query language is also a big plus. One of the downsides of using different databases is that NoSQL databases tend to have their own query language, which makes everything a little more challenging: developers need to master n languages, which makes them way less productive, and bugs and performance problems are just more frequent.

      Mastering a database well is really important because it keeps developers from going on a caching spree. Caching is a fantastic technology if used correctly (we use the well-known Redis database for that). If used just to hide performance problems, it will quickly turn your solution into something worse than the worst spaghetti code you have ever seen, with concurrency management and stale-data issues added to the problem. Hell is a nice place in comparison.

      So far, for us, the feature set, the integration of everything, the SQL language and the existence of very good tools (this is another really important point to keep in mind: good tools allow productivity to increase a lot) were pointing us to Azure SQL. But what about scalability? Will it scale enough for us?

      Scalability is in the fabric

      One of the arguments that was true when SQL Server only offered an on-premises option is the scalability factor. What happens if we need to scale out to handle a lot more data and concurrent access than we have right now?

      Besides the fact that our architecture uses a queue to handle workload spikes, especially for data ingestion, Azure offers the solution. Azure SQL allows the database to scale up to 4 TB of data and, if that should not be enough, Elastic Pools will help even more. We comfortably fit in that space for now (we’re not Twitter or Facebook just yet…), so we’re fine within that limit, also because all the raw data stays in a data lake, and that is where things usually get huge.

      Being a PaaS solution also means that we no longer have to manage clusters, nodes, updates, reboots and all the well-known things that you normally have to deal with when you maintain your own servers.

      Python and R are also in the mix

      A lot of our machine learning and advanced analytics code is based on R and Python. With the latest release of SQL Server, besides native support for R, Python has also been added as a language that can be used in-database to crunch data. This means that we can reuse all our assets based on scikit-learn. While this is not yet available on Azure SQL, I expect it to come soon, as also mentioned at the //build 2017 event.

      Refactoring Code

      One key factor that allowed us to use Azure SQL is the native JSON support that it now finally has. While it is still a v1 feature, it is enough for us to implement all the features we needed to replace ElasticSearch. I’ll write a more technical post on the subject in the future, so stay tuned if you’re interested in the implementation details but, as you can imagine, the trick was to fool the code into thinking it was still talking to ElasticSearch, so overall changes to our code have been minimal.

      Conclusions

      ElasticSearch does offer a lot of nice features but, on Azure, it is not offered as PaaS. Since it’s focused mostly on analytics and search, if we had chosen it we would still have needed a database to store OLTP data. CosmosDB is PaaS, offers a lot of flexibility and also supports SQL but… it’s another database, and it would have required us to create an additional integration solution to keep the existing Azure SQL data and the newly created CosmosDB data in sync. We also have a quite well-defined schema, so Azure SQL was the perfect choice for us in terms of the ratio between costs, performance and features. It provides a very good range of capabilities, from OLTP to (real-time) OLAP, which is great for us, at least until we reach the 4 TB limit. Just like my beloved Python language, I like to think of Azure SQL as a “batteries included” solution.

      What I’d like to underline here is that we’re now living in an era where technology is usually not the limiting factor. Humans, with our limited knowledge, are. So what is really important now is the balance between costs, features and the simplicity of the overall solution. Technology should help us create solutions that are as simple as possible (both for developers and users). But no simpler.

      And this clarifies the role of Solution/Data/Software Architects especially well: find the best technical solution with the best benefits/costs ratio, now and in the near future, and make the solution modular so that in the distant future we can change only what needs to be changed in order to keep up with new challenges.

      Luckily we live in an era where the cloud is making it possible. We just have to evaluate and re-evaluate technologies without any preconceptions.

    • PASS AppDev next webinar: “Introducing R”

      On May 25th the PASS Application Development Virtual Group will host a new webinar delivered by the well-known Dejan Sarka, and the topic will be R (https://www.r-project.org/), which is used more and more by developers too:

      Introducing R

      R is a free software programming language and software environment for statistical computing, data mining, and graphics. R is extensively supported in SQL Server 2016 suite, in Power BI, and in Azure ML. In order to use it, you have to learn the language. After attending this session, you should be able to write simple R code and use it in the products and services mentioned.

      As usual the webinar is free, and it will be accessible here:

      https://attendee.gotowebinar.com/register/8547344291727590146

    • PASS AppDev Recording: “Building REST API with SQL Server using JSON functions”

      Last week Jovan Popovic (SQL Server Program Manager for JSON support and other nice things) delivered a very interesting webinar on JSON and SQL Server / Azure SQL for the PASS Application Development Virtual Group. If you’re interested in the subject, here are a couple of links that will make you happy:

      Full recording is on PASS Development Virtual Group YouTube channel:

      https://www.youtube.com/watch?v=0m6GXF3-5WI&feature=youtu.be

      Slides are available on Slideshare:

      https://www.slideshare.net/JovanPopovic/building-rest-api-using-sql-server-and-json

      while demos are on GitHub:

      https://github.com/Microsoft/sql-server-samples/tree/master/samples/features/json/product-catalog

      Jovan showed not only how this feature can help in .NET development, but also how it makes life so much easier when using SQL Server from any other language, be it JavaScript or Node.js.

      I’ve been using JSON with SQL Server for a while now, and I can really say it’s a game changer for developers. If you haven’t played with it yet, you definitely have to give it a try: managing a dynamic schema has never been so simple.

    • Data Juice

      The data ecosystem explained in 10 simple images

      A couple of weeks ago I was explaining how IoT and Big Data can be a game changer in sports and human activities. Yeah, it’s cool to put a sensor in your own grill so that it can automatically tell you the exact time you should flip your filet, but I think it’s even cooler to instrument the human body and objectively quantify human activities, especially in sports and health care, so that we can enjoy even more the sports we love and the life we live. After all, knowledge is power, right? And the first step to knowledge is the ability to measure: IoT is really a big deal here.

      But then? From a business perspective, how and where will all the gathered data be stored? What do terms like Data Lake, Data Science and the like really mean? Surprisingly enough, the answer is as simple as describing the life cycle of something that everyone knows very well: orange juice.

      I created a slide deck, named Data Juice, to explain the whole data ecosystem to C-level people who had to understand the business value of every piece of that ecosystem, but without having to deal with technical details. It had to be clear, concise and easy to remember. The orange juice example was just perfect, as we’ll see.

      I created a slide deck to explain the whole data ecosystem: it had to be clear, concise and easy to remember.

      Now, the same concept applies to IoT: the sensor working right now in the wearable you’re wearing almost surely collects raw data (acceleration, magnetic field, etc.) something around 50 times per second (sometimes up to 100). But what you care about is not that raw value, it’s the trend of your preferred exercise performance.

      What’s the line that connects those two points? Happily enough for me, answering this question also allows me to describe what kind of work I do and how it is related to sports, which was one of the questions I was asked recently by a well-known tech & lifestyle magazine.

      Here’s the answer:

      Data Sources

      New data is born every day.

      New data is generated every day, everywhere. Accounting systems, IoT devices, machine logs, clickstreams… anything generates some kind of data. At this stage data is generated and consumed, usually, locally and in very small sizes. For example, an accounting system generates an invoice and allows users to interact with it. One at a time. Just like oranges born on the tree: you can just grab one and consume it on the spot.

      Big Data

      Oranges are harvested and sent to a factory for processing. The factory needs to be prepared to get oranges of any size, kind, color and ripeness. Sometimes it will get only one load a day, sometimes several per hour. What happens with data is really similar. The famous 3 Vs, Volume, Variety and Velocity, say that your data system must be able to accept almost any kind of data. Again, just like with oranges: if you want to be in the business, you’d better be prepared to handle them when they arrive.

      Data Lake

      All the ingested data needs to be stored somewhere. And it had better be a place where space can grow as needed; it should also be pretty cheap if you don’t want to spend all your money storing everything, even if you don’t know whether and how you will use that data in the future. Such a place also needs to support querying and processing capabilities in order to allow you to dive into the lake. You don’t really care too much about shape, format or correctness here. You just store data as it was generated, just to have it in case you need it. Even if it may not be good for consumption.

      You just store data as it was generated, just to have it in case you need it. Even if it may not be good for consumption.

      Data Curation

      Once also called ETL, Extract-Transform-Load, this is the process where data is selected, cleansed, standardized and integrated with additional information, if available. More or less how oranges are cleaned, divided into different groups for different uses and, if not in good condition, discarded.

      Data Warehouse

      Once data has been processed by the previous step, it needs to be stored in a safe place. Safe not only in the sense that no one can steal from it, but also in the sense that everyone who gets data from it can assume it is safe to consume. Again, just like oranges: if you get one from a market warehouse you can safely enough assume that you won’t be poisoned. The same goes for data. If you get data from the Data Warehouse, you can safely assume that it is safe to consume, meaning that you can use it to make informed decisions. The same does not apply to a Data Lake, where usually you don’t really know what you’ll get and how to deal with it.

      If you get data from the Data Warehouse, you can safely assume that it is safe to consume, meaning that you can use it to make informed decisions.

      Data Mart

      Once you get your oranges from the store, you can decide how to consume them. If you have hundreds of boxes you won’t consume them one by one; you’ll surely try to transform them into a product that is easier to consume. Depending on the target, you may want to process them in different ways, in order to make sure that everyone gets exactly what they need, no more and no less. A Data Mart is exactly this: data ready to be consumed with very little additional processing, if any at all.

      Of course, different people or different situations may require different aspects of that data, so having more than one Data Mart is quite common. If something is not clear enough in the Data Mart, one can always go to the Data Warehouse that lies behind it and check how and why such unexpected data was produced.

      Self-Service Analysis

      Also known as Self-Service BI, this is for when you’re not happy with the Data Marts: since none of them provides exactly what you need, you have to do it yourself. You need to squeeze the juice of three different oranges, of three different kinds, in order to create the perfect mixture you want. You go to the Data Warehouse (usually, but sometimes you can also grab data from different Data Marts) and create your mix. Maybe adding a hint of cinnamon taken from the outside world.

      Business Intelligence

      Using all the things described so far, you apply the Business Intelligence process to figure out how to drive your business, check how it is performing and analyze the outcome of defined strategies to stay in business and, possibly, be successful.

      Data Science

      What to do with the vast amount of unused data in the Data Lake? Maybe among the oranges that do not fit the current sales plan and targets there is the next “big thing”. Something you can leverage to create a new product, sell a new service, or something that can help you understand whether there is anything in the collect-store-process-enrich life cycle that can be improved. Usually this means that you are looking for questions, not for answers. This is the main difference between the process of doing Business Intelligence and the process of Data Science: in the first you look for answers to questions you know, in the second you look for questions you don’t know yet.

      This is the main difference between the process of doing Business Intelligence and the process of Data Science: in the first you look for answers to questions you know, in the second you look for questions you don’t know yet.

      Data Juice

      In conclusion, this is what information technology, from a data standpoint, is all about: squeezing every drop of information out of all the data we have, so that we can gain knowledge that ultimately leads to wisdom.

      We squeeze every drop of information out of all the data we have, so that we can gain knowledge that ultimately leads to wisdom.

      So now, when someone asks me what kind of work I do, I say that “I squeeze information out of raw data”. Easy, understandable and, actually, really precise.

    • PASS Application Development Virtual Group

      I've been involved with PASS, once named the Professional Association for SQL Server, now just... PASS, since 2006, when I stepped up as president of the Italian SQL Server User Group. Now that I have moved to Redmond, WA, I'm still involved with it but, of course, I'm not leading the Italian User Group anymore.

      Since I really think that community activity is vital for a healthy and successful career, as soon as I noticed that the Application Development Virtual Group was not very active anymore I decided to offer my help, and so here I am: I've just stepped up as the new leader of that group, AppDev for friends:

      http://appdev.pass.org/

      The focus will be on the entire Microsoft Data Platform, so not only SQL Server, and the languages we'll try to cover are, of course, C# and .NET in general, and also, more broadly, any language that can interact with the Microsoft Data Platform. So Python (my passion!), Java, Ruby, F#, Rust, R... anything you can think of. Of course, everything approached from a developer perspective. No administration here, no sir. Security? Yes. Performance? Sure, give me more! Best practices? Absolutely! Everything that can be of interest to developers and architects will be in our scope (as long as it is somehow related to the Microsoft Data Platform, of course).

      I've just started to contact some great speakers so we can start off on the right foot, and the first has just confirmed his support. Jovan Popovic, SQL Server / SQL Azure PM, will present a session on JSON, one of the most requested and powerful features in SQL Server 2016 and beyond:

      Building REST API with SQL Server using JSON functions

      Registration is free both for the webinar and for the Virtual Group. You can find the registration link and additional info on the group homepage:

      http://appdev.pass.org/

      I recommend registering for both, so you won't miss a very interesting webcast (which will be recorded and made available on YouTube for future reference) and you'll also stay informed about future activities that will be, I can assure you, just amazing!

      And, last but not least, if you have a topic you want to present, just let me know.
