This topic is titled to specifically consider only row estimation after joins, precluding discussion of row estimates at the source table, which has already been addressed in papers covering the new Cardinality Estimator in SQL Server 2014 and other statistics papers for previous versions of SQL Server.
There are certain situations in which the query compile time can be excessively long even for queries of moderate complexity. This could be true when the plan cost is very high, so that the query optimizer expends more effort to find a better plan before reaching the time out, that is, the time out setting appear to be a function of the plan cost. Even then, the query optimizer could still make row estimation errors in the operations after the source table (for which data distribution statistics are kept on columns and indexes) of sufficient consequence that renders the remainder of the plan un-executable in a practical time frame.
The new cardinality estimator in SQL Server 2014 is helpful in resolving known issues, but has little to improve row estimates after the initial access at the data source beyond fixed rules which may be more generally true than the rule used before. That said, the query optimizer only attempts to estimate rows (and other cost factors) using a combination of the information it has, and rules for situations where there are no grounds for making a knowledge based estimate.
So why be bound by this rule of estimating only? The other company (only sometimes in the news concerning databases) has (recently introduced?) adaptive query optimization that can make run-time adjustments to the execution plan. Sure that’s nice, but I am thinking something more sophisticated could done in forming the plan but less complex than the ability to change course in the midst of execution.
Obviously, if a query has an obviously correct execution plan that can be determined by existing techniques, and the plan cost is fairly low, then no change is necessary. Further special technique should only be considered for expensive queries, particularly in which the row estimates at the intermediate steps are difficult to assess.
Consider query below
SELECT bunch of columns
FROM Nation n
JOIN Customers c ON c.NationId = n.NationId
JOIN Orders o ON o.CustumerId = c.CustomerId
WHERE n.Country LIKE ‘Z%’
From statistics, we know approximately how many countries have a name beginning with Z. SQL Server also has a histogram for the NationId column in the Customers table. If we had specified the list of NationId values (with equality on to both tables), SQL Server could use the more detailed information from the histogram to make a row estimate. But because we specified the Country name column that is only in the Nation table, we must use the average number of customers per county (in the density vector) multiplied by the number of countries to estimate rows after the join to customers.
And next of course, all customers are not alike. There are customers who place a small, medium or large number of orders. So why expend a great deal of effort to find a plan from all the possible join orders and index combination based on only estimates of rows when it is known that data distribution in each succeeding table is heavily skewed. Why not pre-execute the tables for which a SARG was specified to get the column values used in the join to next table so that the more detailed histogram distribution information. This technique could be pushed to multiple levels depending on the initially assessed plan cost, and perhaps controlled by a query hint.
For example, in the above query, suppose NationId’s 19, 37, 42, and 59 meet the criteria Country beginning with Z. The optimizer would next look at the Customers table histogram on NationId for these values to estimate rows after the join. If the situation warrants, the next level could be examined as well.
It could be argued that the query optimizer should not execute the query to determine the plan, but why follow that principle if the cost of query optimization is excessively high (several seconds) in relation to relatively minor effort to make a more extensive reconnaissance (of tens or hundreds of milli-seconds)? Especially considering that the reliability of row estimates becomes progressively worse after each join or other operation beyond the original source?
This technique should probably be used when there are tables with search arguments joining to tables on columns with highly skewed distribution. The first implementation might be activated only be a query hint until some maturity is achieved, followed by greater use.
Presumably there might be a cost threshold as well. I would prefer not to tie it with parallelism. Of course, given the nature of modern systems, it really is time for the cost threshold for parallelism and max degree of parallelism to have graduated controls, instead of the single setting on-off.
Side note 1
OK, now forget what I said at the beginning and I will gripe about SQL Server default statistics. It has been discussed else where that SQL Server uses random page samples and not random row samples, as this is a much less expensive way to collect data. It does use an index for which the column is not a lead key if available, to improve randomness. Still, I have notice a severe sensitivity to sampling percentage in cases where the column value is correlated with (page) storage location.
So I suggest that as the page sampling is in progress, a count of new values found in each successive page sampled versus existing values be kept. If the number of new values found falls off sharply, then most distinct values have probably been found, and the interpretation of the existing sample is that its distribution should be scaled by ratio of total rows to the rows sampled. If almost of the values in the last few pages sampled are new (not previously found), then the interpretation should be that the number of distinct values should be scaled by the total to sample ratio. And some blend when there is an intermediate number of new values versus previously found values in each successive page.
The query optimization/plan compile process is single threaded.
The modern microprocessor might be 3GHz, so a 10 sec compile is 30 Billion cpu-cycles.
And I have seen compiles run more than 100 sec? One query even broke SQL Server, of course that was a set of deeply nested, repeating CTE's that should have been PIVOT/UNPIVOT.
So why the principle of optimizing based on unreliable estimates when an index seek is a mere 10 micro-sec? Presumably the key column statistics have already been decoded.
It would be nice to have a more powerful query optimizer,
but there is a method for writing SQL to get a specific execution plan,
Of course the other element in this is known what the correct execution plan is.
This involves not what the query optimizer uses for cost model,
but what the true cost model is.