

Browse by Tags
All Tags » statistics (RSS)
Showing page 1 of 5 (41 total posts)

I am continuing with my data mining and machine learning algorithms series. Naive Bayes is a nice algorithm for classification and prediction.
It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on ...

Support vector machines are both, unsupervised and supervised learning models for classification and regression analysis (supervised) and for anomaly detection (unsupervised). Given a set of training examples, each marked as belonging to one of categories, an SVM training algorithm builds a model that assigns new examples into one category. An SVM ...

Hierarchical clustering could be very useful because it is easy to see the optimal number of clusters in a dendrogram and because the dendrogram visualizes the clusters and the process of building of that clusters. However, hierarchical methods don’t scale well. Just imagine how cluttered a dendrogram would be if 10,000 cases would be shown on ...

Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects.
There are a large number of clustering algorithms. The ...

We are close to the publishing day of the TSQL Querying book. Of course, like always in this series, the main author of the book is Itzik BenGan. This time, besides me, Adam Machanic and Kevin Farlee are the coauthors. The information I want to share now is that you can get a substantial discount if you preorder the book today, Monday, February ...

I was just starting to work on a post on column statistics, using one of my favorite metadata functions: sys.dm_db_stats_properties(), when I realized something was missing.
The function requires a stats_id value, which is available from sys.stats. However, sys.stats does not show the column names that each statistics object is attached ...

I spent the last few days in Zagreb, Croatie, at the third edition of the SQL TuneIn conference, and I had a very good time here. Nice company, good sessions, and awesome audiences.
I presented my “Understanding Execution Plans” precon to a small but interested audience on Monday. Participants have received a download link for the slide deck.
On ...

There are two ways to test how your queries behave on huge amounts of data. The simple option is to actually use them on huge amounts of data – but where do you get that if you have no access to the production database, and how do you store it if you happen not to have a multiterabyte storage array sitting in your basement? So here’s the second ...

The session Skewed Data, Poor Cardinality Estimates, and Plans Gone Bad by Kimberly Tripp (@KimberlyLTripp) has been published on channel SQLPASS TV. Abstract When data distribution is heavily skewed, cardinality estimation (how many rows the query optimizer expects each operator to process) can be wildly incorrect, resulting in ...

This is the fifth, the final part of the fraud detection whitepaper. You can find the first part, the second part, the third part, and the fourth part in my previous blog posts about this topic. The Results In my original fraud detection whitepaper I wrote for SolidQ, I was advised by my friends to include some concrete and simple numbers to ...
1



