Entity Framework 6 - Dynamically creating temporary tables from IQueryable (Part 1 of 2)

I'm a huge fan of entity based and result set based relational mapping (classic ORMs). I'm also huge fan of DML based relational mapping (micro ORMs). In general I'm a huge fan of every technology that allows me to get the job done in the best possible way. I believe that one should never limit himself to single approach within a project - it should always be about choosing the best tool for the job. But sometimes there are real life constraints (licensing, business, politics etc.) which are limiting the choices. This was one of those situations.

Entity Framework is not the perfect tool for complex reporting but still can be used to do it. Recently I was tasked with optimizing such a scenario. The code was building a quite complex report across several tables all filtered by the same sub query. In general it looked similar to the snippet below.

public IQueryable<StudentItem> GetStudentsItemsQueryable(Filters filters, IEnumerable<StudentIdentity> students)
{
    IQueryable<StudentItem> itemsQueryable = null;

    IQueryable<int> studentsIds = students.Select(s => s.Id)
    foreach (IStudentItemManager itemManager in _itemsManagers)
    {
        IQueryable<StudentItem> itemQueryable = itemManager.GetStudentsItemQueryable(filters, studentsIds);
        if (itemQueryable != null)
        {
            itemsQueryable = (itemsQueryable != null) ? itemsQueryable.Concat(itemQueryable) : itemQueryable;
        }
    }

    ...

    return itemsQueryable;
}

The top level method generates union from number of independent item queries. Each item query receives identifiers of students for whom the report is being generated.

public IQueryable<StudentItem> GetStudentsItemQueryable(Filters filters, IEnumerable<int> studentsIds)
{
    IQueryable<StudentItem> itemQueryable = null;

    ...

    itemQueryable = itemQueryable.Where(i => studentsIds.Contains(i.StudentId));

    ...

    return itemQueryable;
}

In the end Entity Framework was generating SQL query with pattern visible below.

SELECT ...
FROM (SELECT ...
        FROM ... AS [Extent1]
        WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM (<sql representation="" of="" studentidentity="" enumerable="">) AS [Limit1] WHERE [Limit1].[Id] = [Extenten1].[StudentId])
      UNION ALL
      SELECT ...
        FROM ... AS [Extent2]
        WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM (<sql representation="" of="" studentidentity="" enumerable="">) AS [Limit2] WHERE [Limit2].[Id] = [Extenten2].[StudentId])
      UNION ALL
      ...
      UNION ALL
      SELECT ...
        FROM ... AS [ExtentN]
        WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM (<sql representation="" of="" studentidentity="" enumerable="">) AS [LimitN] WHERE [LimitN].[Id] = [ExtentenN].[StudentId])) AS UnionAll1

The redundancy in the query is immediately visible. Both the students and items queries were not a trivial ones, which resulted in pressure on the database. Quick look at SQL Server profiler provided CPU time ~750ms and overall duration ~1700ms. Taking into consideration that those data had to be rendered and returned to user this is "bad experience" territory. How to optimize this? All the sub queries by itself looked quite good (despite not being trivial) so the initial idea became removing the redundancy.

Changing the shared sub query to in-memory collection

First thing which came into mind was grabbing the students identifiers once and passing them as in-memory collection to all the sub queries. The code change was in fact easy and cheap.

public IQueryable<StudentItem> GetStudentsItemsQueryable(Filters filters, IEnumerable<StudentIdentity> students)
{
    IQueryable<StudentItem> itemsQueryable = null;

    IQueryable<int> studentsIds = students.Select(s => s.Id).ToList();
    ...

    return itemsQueryable;
}

Generated SQL query also looked promising.

SELECT ...
FROM (SELECT ...
        FROM ... AS [Extent1]
        WHERE ... AND [Extenten1].[StudentId] IN (...)
      UNION ALL
      SELECT ...
        FROM ... AS [Extent2]
        WHERE ... AND [Extenten2].[StudentId] IN (...)
      UNION ALL
      ...
      UNION ALL
      SELECT ...
        FROM ... AS [ExtentN]
        WHERE ... AND [ExtentenN].[StudentId] IN (...)) AS UnionAll1

Verification with SQL Server profiler showed that times have been nicely reduced (CPU time ~345ms and overall duration ~565ms) so the problem should be solved. Unfortunately the end-to-end stress performance test results didn't improve. The overall latency remained at similar level and application CPU usage went up. Why?

The only change that has been done was introduction of IN clause representing Enumerable.Contains. This scenario was indeed problematic with Entity Framework 4 as usage of Enumerable.Contains was translated into a tree of OR expressions, but Entity Framework 6 has introduced native support for DbInExpression so it should be ok. Still the Visual Studio profiler was pointing at query generation as the reason. Quick look at Entity Framework performance considerations regarding Autocompiled Queries brought the answer - the IN clause prevents query caching. Every time the code was being executed Entity Framework had to regenerate all the items queries (and the top query as well) which resulted in high CPU usage. This meant that different solution was needed.

Moving the optimization to SQL level

The repetition had to be removed, but it couldn't be done by IN clause with static identifiers list. The identifiers sub query results had to be gathered once and reused at SQL level. A potential solution I could think of was temporary table. The creation and access in case of temporary table shouldn't be expensive, it will also be automatically removed when the session ends. This seemed to be a reasonable approach so a POC was born.

IF OBJECT_ID('tempdb..#TemporaryStudentIdentity') IS NOT NULL
BEGIN
    DROP TABLE #TemporaryStudentIdentity
END

SELECT * INTO #TemporaryStudentIdentity
FROM (<sql representation="" of="" studentidentity="" enumerable="">) AS [TemporarySnapshotQueryable]

SELECT ...
FROM (SELECT ...
        FROM ... AS [Extent1]
        WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM [tempdb].[#TemporaryStudentIdentity] AS [Limit1] WHERE [Limit1].[Id] = [Extenten1].[StudentId])
      UNION ALL
      SELECT ...
        FROM ... AS [Extent2]
        WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM [tempdb].[#TemporaryStudentIdentity] AS [Limit2] WHERE [Limit2].[Id] = [Extenten2].[StudentId])
      UNION ALL
      ...
      UNION ALL
      SELECT ...
        FROM ... AS [ExtentN]
        WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM [tempdb].[#TemporaryStudentIdentity] AS [LimitN] WHERE [LimitN].[Id] = [ExtentenN].[StudentId])) AS UnionAll1

Testing of POC has shown that this approach is more expensive than IN clause (CPU time ~550ms and overall duration ~660ms) but way better then original one (and can be considered acceptable). Now the same thing needed to be done using Entity Framework.

Implementing the temporary table creation

First of all I needed an entity to represent the temporary table. As the project was using Code First approach I've created a POCO with same properties as the projection DTO (having a separated class allowed me to protect constructor and properties setters).

public class TemporaryStudentIdentity
{
    protected TemporaryStudentIdentity()
    { }

    public virtual int Id { get; protected set; }

    ...
}

In order to map this entity to a temporary table TableAttribute can be used, but I prefer to keep mappings separated from entities so I have created a configuration class.

public sealed class TemporaryStudentIdentityConfiguration : EntityTypeConfiguration<TemporaryStudentIdentity>
{
    public TeamMonitorTemporarySuperviseesInfoMap()
    {
        ToTable("#" + typeof(TemporaryStudentIdentity).Name, "tempdb");

        HasKey(x => x.Id);

        Property(x => x.Id);
        ...
    }
}

Now I needed to get the actual parametrized query out of IQueryable (I didn't want to hardcode everything so the solution would be more generic in future), this post got me going. First I've created a small extension method.

internal static class ObjectQueryExtensions
{
    internal static ObjectQuery<TQueryProjection> GetObjectQuery<TQueryProjection>(this IQueryable<TQueryProjection> query)
    {
        object internalQuery = GetField(query, "_internalQuery");

        ObjectQuery<TQueryProjection> objectQuery = GetField(internalQuery, "_objectQuery") as ObjectQuery<TQueryProjection>;

        return objectQuery;
    }

    private static object GetField(object objectInstance, string fieldName)
    {
        object fieldValue = null;

        if (objectInstance != null)
        {
            FieldInfo field = objectInstance.GetType().GetField(fieldName, BindingFlags.NonPublic | BindingFlags.Instance);

            if (field != null)
            {
                fieldValue = field.GetValue(objectInstance);
            }
        }

        return fieldValue;
    }
}

After separating those nasty internals I could write nice method which would get the job done.

public IQueryable<TTemporaryEntity> AsTemporarySnapshotQueryable<TQueryProjection, TTemporaryEntity>(IQueryable<TQueryProjection> query)
{
    IQueryable<TTemporaryEntity> snapshotQueryable = null;

    ObjectQuery<TQueryProjection> temporarySnapshotObjectQuery = query.GetObjectQuery();

    if (temporarySnapshotObjectQuery != null)
    {
        string temporarySnapshotTableName = "#" + typeof(TTemporaryEntity).Name;
        string temporarySnapshotSqlCommand = "IF OBJECT_ID('tempdb..{0}') IS NOT NULL BEGIN DROP TABLE {0} END SELECT * INTO {0} FROM ({1}) AS [TemporarySnapshotQueryable]", temporarySnapshotTableName, temporarySnapshotObjectQuery.ToTraceString());
        object[] temporarySnapshotSqlCommandParameters = temporarySnapshotObjectQuery.Parameters.Select(parameter => new SqlParameter(parameter.Name, parameter.Value)).ToArray();

        // Since Entity Framework 6 manually opened connection will not be automatically closed until context disposal - this way the temporary table will be visible for other queries.
        _dbContext.Database.Connection.Open();
        _dbContext.Database.ExecuteSqlCommand(temporarySnapshotSqlCommand, temporarySnapshotSqlCommandParameters);

        snapshotQueryable = _dbContext.Set<TTemporaryEntity>().AsNoTracking().AsQueryable();
    }

    return snapshotQueryable;
}

That wasn't that hard. Proud of myself I've quickly changed the method I wanted to optimize.

public IQueryable<StudentItem> GetStudentsItemsQueryable(Filters filters, IEnumerable<StudentiIdentity> students)
{
    IQueryable<StudentItem> itemsQueryable = null;

    IQueryable<int> studentsIds = AsTemporarySnapshotQueryable<StudentIdentity, TemporaryStudentIdentity>(students).Select(s => s.Id);
    ...

    return itemsQueryable;
}

First test... and it fails. The error log contained entry saying that #TemporaryStudentIdentity was not present. Did I miss something? I've made sure that connection stayed open so the session doesn't change, what else could be causing my temporary table not being visible to the query? After couple minutes of stupid staring into SQL Server profiler log it hit me. I've simplified my POC by omitting sp_executesql.

When a parametrized query is being executed by ADO.NET provider for SQL Server it is being wrapped by call to sp_executesql - this is how parameters are being passed. The tricky part of sp_executesql is the fact that the statement is being executed in "inner scope". Temporary table created by parent scope would be visible inside of sp_executesql but one created within sp_executesql is not visible outside. In my case this meant that I had to separate creation of the table from filling it with data (I didn't want to go away from parametrized query for obvious reasons).

Adjusting POC to sp_executesql usage

I've read all I could find regarding temporary tables but I wasn't able to find a way to create one ahead without specifying full definition, so the POC had to include it.

IF OBJECT_ID('tempdb..#TemporaryStudentIdentity') IS NOT NULL
BEGIN
    DROP TABLE #TemporaryStudentIdentity
END

CREATE TABLE #TemporaryStudentIdentity
(
    [Id] INT NOT NULL,
    ...
)

exec sp_executesql N'INSERT INTO #TemporaryStudentIdentity([Id], ...)
    (SELECT [Id], ...
    FROM (<sql representation="" of="" studentidentity="" enumerable="">) AS [TemporarySnapshotQueryable]'

exec sp_executesql N'SELECT ...
    FROM (SELECT ...
            FROM ... AS [Extent1]
            WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM [tempdb].[#TemporaryStudentIdentity] AS [Limit1] WHERE [Limit1].[Id] = [Extenten1].[StudentId])
          UNION ALL
          SELECT ...
            FROM ... AS [Extent2]
            WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM [tempdb].[#TemporaryStudentIdentity] AS [Limit2] WHERE [Limit2].[Id] = [Extenten2].[StudentId])
          UNION ALL
          ...
          UNION ALL
          SELECT ...
            FROM ... AS [ExtentN]
            WHERE ... AND EXISTS (SELECT 1 AS [C1] FROM [tempdb].[#TemporaryStudentIdentity] AS [LimitN] WHERE [LimitN].[Id] = [ExtentenN].[StudentId])) AS UnionAll1'

This worked and was even faster than previous one (probably because SQL Server didn't had to invest processing into figuring out the definition) - CPU time ~420ms, overall duration ~520ms. The challenge was how to do that from code in as generic as possible way.

Adjusting the implementation

Putting this in simple words all I had to do was generating a "create table" script based on entity. This should be doable, Entity Framework should be storing the information I needed somewhere. After some research I've found what I was looking for. I've adopted the information provided into a from of extension method.

internal static class DbContextExtensions
{
    internal IEnumerable<EdmProperty> GetEntityPropertyColumns<TEntity>(this DbContext dbContext)
    {
        // Get the metadata
        MetadataWorkspace metadata = ((IObjectContextAdapter)_dbContext).ObjectContext.MetadataWorkspace;

        // Get the space within the metadata which contains information about CLR types
        ObjectItemCollection clrSpace = ((ObjectItemCollection)metadata .GetItemCollection(DataSpace.OSpace));

        // Get the entity type from the metadata that maps to the CLR type
        EntityType entityEntityType = metadata.GetItems<entitytype>(DataSpace.OSpace).Single(e => clrSpace.GetClrType(e) == typeof(TEntity));

        // Get the entity set that uses this entity type
        EntitySet entityEntitySet = metadata.GetItems<entitycontainer>(DataSpace.CSpace).Single().EntitySets.Single(s => s.ElementType.Name == entityEntityType.Name);

        // Get the mapping between conceptual and storage model for this entity set
        EntitySetMapping entityEntitySetMapping = metadata.GetItems<EntityContainerMapping>(DataSpace.CSSpace).Single().EntitySetMappings.Single(m => m.EntitySet == entityEntitySet);

        // Get the entity columns
        return entityEntitySetMapping.EntityTypeMappings.Single().Fragments.Single().PropertyMappings.OfType<ScalarPropertyMapping>().Select(p => p.Column);
    }
}

Armed with this extension method I've started refactoring my temporary table generation code.

public IQueryable<TTemporaryEntity> AsTemporarySnapshotQueryable<TQueryProjection, TTemporaryEntity>(IQueryable<TQueryProjection> query)
{
    IQueryable<TTemporaryEntity> snapshotQueryable = null;

    ObjectQuery<TQueryProjection> temporarySnapshotObjectQuery = query.GetObjectQuery();
    IEnumerable<EdmProperty> temporarySnapshotColumns = _dbContext.GetEntityPropertyColumns<TTemporaryEntity>().ToArray();

    if ((temporarySnapshotObjectQuery != null) && temporarySnapshotColumns.Any())
    {
        string temporarySnapshotTableName = "#" + typeof(TTemporaryEntity).Name;

        StringBuilder temporarySnapshotCreateColumnsListBuilder = new StringBuilder();
        StringBuilder temporarySnapshotFillColumnsListBuilder = new StringBuilder();
        foreach (EdmProperty temporarySnapshotColumn in temporarySnapshotColumns)
        {
            temporarySnapshotCreateColumnsListBuilder.Append(GetTemporarySnapshotColumnCreateSql(temporarySnapshotColumn));
            temporarySnapshotFillColumnsListBuilder.AppendFormat("[{0}],", temporarySnapshotColumn.Name);
        }
        temporarySnapshotCreateColumnsListBuilder.Length -= 1;
        temporarySnapshotFillColumnsListBuilder.Length -= 1;

        string temporarySnapshotCreateSqlCommand = String.Format("IF OBJECT_ID('tempdb..{0}') IS NOT NULL BEGIN DROP TABLE {0} END{1}CREATE TABLE {0} ({2})", temporarySnapshotTableName, Environment.NewLine, temporarySnapshotCreateColumnsListBuilder);
        string temporarySnapshotFillSqlCommand = String.Format("INSERT INTO {0}({1}) (SELECT {1} FROM ({2}) AS [TemporarySnapshotQueryable])", temporarySnapshotTableName, temporarySnapshotFillColumnsListBuilder, temporarySnapshotObjectQuery.ToTraceString());
        object[] temporarySnapshotFillSqlCommandParameters = temporarySnapshotObjectQuery.Parameters.Select(parameter => new SqlParameter(parameter.Name, parameter.Value)).ToArray();

        // Since Entity Framework 6 manually opened connection will not be automatically closed until context disposal - this way the temporary table will be visible for other queries.
        _dbContext.Database.Connection.Open();
        _dbContext.Database.ExecuteSqlCommand(temporarySnapshotCreateSqlCommand.ToString());
        _dbContext.Database.ExecuteSqlCommand(temporarySnapshotFillSqlCommand, temporarySnapshotFillSqlCommandParameters);

        snapshotQueryable = _dbContext.Set<TTemporaryEntity>().AsNoTracking().AsQueryable();
    }

    return snapshotQueryable;
}

private static string GetTemporarySnapshotColumnCreateSql(EdmProperty temporarySnapshotColumn)
{
    string typeNameUpperCase = temporarySnapshotColumn.TypeName.ToUpperInvariant();
    string temporarySnapshotColumnCreateSqlSuffix = temporarySnapshotColumn.Nullable ? "," : " NOT NULL,";
    switch (typeNameUpperCase)
    {
        case "NUMERIC":
            return String.Format("[{0}] NUMERIC({1},{2}){3}", temporarySnapshotColumn.Name, temporarySnapshotColumn.Precision, temporarySnapshotColumn.Scale, temporarySnapshotColumnCreateSqlSuffix);
        case "NVARCHAR":
        case "VARCHAR":
            return String.Format("[{0}] {1}({2}){3}", temporarySnapshotColumn.Name, typeNameUpperCase, temporarySnapshotColumn.MaxLength, temporarySnapshotColumnCreateSqlSuffix);
        default:
            return String.Format("[{0}] {1}{2}", temporarySnapshotColumn.Name, typeNameUpperCase, temporarySnapshotColumnCreateSqlSuffix);
    }
}

I've added all the HasColumnType, HasMaxLength etc. to the configuration class and fired away... It worked!!!

Limitations

This solution has one serious limitation - the names of columns in temporary table must match the names of corresponding columns in query generated by Entity Framework. This won't always be true. If the query contains only columns coming directly from tables this should be safe assumption, but when calculated columns appears they will have names chosen by Entity Framework. In above approach this would require nasty hard coding like HasColumnName("C2") which should rather be avoided.

There is a potential way of working around this limitation - I will explore it in part two.