Google+

LINQ to SQL & Entity Framework Pitfalls

In my last post describing the differences between LINQ to objects and LINQ to SQL, I mentioned how LINQ to SQL and Entity Framework "interpret" your LINQ code, and create the corresponding SQL. Forgetting this fact is extremely dangerous, because LINQ to SQL and other object relational mappers are extremely leaky abstractions. LINQ is obviously a wonderful technology, but this post will be talking about some potential pitfalls you may run into.

SQL Query Complexity Disproportional to LINQ Complexity

Recall the example from my last post:

//Query Syntax:
from device in Devices
where device.Type != null
select device.DeviceId

//SQL Syntax:
SELECT [t0].[DeviceId]
FROM [Devices] AS [t0]
WHERE [t0].[Type] IS NOT NULL

In this case, LINQ to SQL has done something wonderful. It's saved us from having to understand or worry about the translation of syntax between C# and SQL. Now, what happens when we write something a little more advanced, such as a nested group by?

from d in Devices
group d by d.CZone into czoneGroup
select new { Key = czoneGroup.Key, val = from d2 in czoneGroup
    group d2 by d2.LZone into lzoneGroup
    select lzoneGroup.Key }

And the corresponding SQL:

SELECT [t0].[CZone] AS [Key]
FROM [Devices] AS [t0]
GROUP BY [t0].[CZone]
GO

DECLARE @x1 Int = 3
SELECT [t0].[LZone]
FROM [Devices] AS [t0]
WHERE ((@x1 IS NULL) AND ([t0].[CZone] IS NULL)) OR ((@x1 IS NOT NULL) AND ([t0].[CZone] IS NOT NULL) AND (@x1 = [t0].[CZone]))
GROUP BY [t0].[LZone]
GO

DECLARE @x1 Int = 1
SELECT [t0].[LZone]
FROM [Devices] AS [t0]
WHERE ((@x1 IS NULL) AND ([t0].[CZone] IS NULL)) OR ((@x1 IS NOT NULL) AND ([t0].[CZone] IS NOT NULL) AND (@x1 = [t0].[CZone]))
GROUP BY [t0].[LZone]
GO

//Reminaing SQL removed....

What just happened? Our innocent nested join has turned into a monster! This is an example of a query that is simple to do in LINQ, but has no translation to a simple SQL statement. Instead of just bombing, the LINQ to SQL engine comes up with a solution that a user may not have written themselves. A typical SQL developer may have looked for a different approach.

Side note: In the nested group-by, notice that LINQ to SQL uses multiple queries. This differs from the Entity Framework approach, which uses outer joins to achieve the same effect.

Does it matter? The answer isn't so simple. In this simplified example, the performance impact is minimal. Unfortunately, with a large amount of data in this type of query, you could start to experience terrible performance. I personally saw a nested query that was only a few lines of code turn into a 27 page SQL statement. The SQL statement was technically correct, but took seconds to execute, when it should have taken a fraction of a second.

One simple solution that I have found to be very effective, yet not intuitive, is breaking apart the initial query and forcing it to execute using the ToList() method. You'll have to have a decent "where" clause to avoid excessive amounts of data being returned. Once we have the raw data, LINQ to objects will provide us the same set of tools to further manipulate our data. For instance, here is a modified version of the example presented earlier:

//Simple & fast initial query from the database
var rawData = (from d in Devices
where d.Location = 'B3').ToList();

//This operation happens "disconnected"
var results = from d in rawData
group d by d.CZone into czoneGroup
select new { Key = czoneGroup.Key, val = from d2 in czoneGroup
    group d2 by d2.LZone into lzoneGroup
    select lzoneGroup.Key };

The reason this works well is that it's taking advantage of the strength of SQL Server, which is to query data, and the strength of .NET, which is to process and manipulate data.

LINQ Abstracting Away Problems it Can't Solve

Here is a simplified version of a query I saw recently:

int sum = (from d in Devices
where 1 == 2 && d.CZone != null
select d.CZone.Value).Sum()

To make it extremely clear what I'm trying to accomplish, I put "1 == 2" in the "where" clause, so that no rows match the condition. The "Sum()" method returns the type that it's acting on. For example, if you're summing integers, the result is an integer. If you're summing nullable integers, the result is a nullable integer. This is perfectly valid LINQ. This is effectively the SQL that is generated (I simplified it for clarity):

Select SUM(CZone)
From Devices
Where 1 = 2

Since the result of this SQL statement is NULL, it can't be converted back to an integer. The exception is "InvalidOperationException: The null value cannot be assigned to a member with type System.Int32 which is a non-nullable value type."

When the LINQ is translated to SQL, there is no such operation as converting a nullable value to a non-nullable, so the ".Value" operation is ignored. This would be fine if the sum function still expected a nullable return type, but it's now expecting an integer. When it can't find any rows to return, it tries to return NULL. Since it's trying to package up a NULL value into a standard integer type, it has no choice but to throw an exception.

Conclusion

Getting started with LINQ is fairly straightforward, but you can't forget the fact that whatever query you're writing must be converted into a SQL statement, and the results must be converted back to data that is understandable to .NET. Every LINQ query you write should be checked with a tool such as LINQPad to ensure that the SQL is efficient, and matches what you expect.

Also keep in mind that when you upgrade your data provider, your queries can change. For example, converting a statement from LINQ to SQL to Entity Framework can generate different SQL queries, just as updating to a newer version of the same ORM can.

Like this post? Please share it!

See a mistake? Edit this post!

Understanding LINQ and LINQ to SQL (and EF)

Back to basics for this post. Developers often throw around the word LINQ when talking about a number of different technologies. Now that I have been comfortably using a wide variety of LINQ technologies for a fair amount of time, I'm now able to convey some of the key differences that are critical to using LINQ technologies efficiently. I'm also using this as a foundation and reference for some exciting upcoming posts.

The first key point is to know what the heck LINQ is. LINQ itself is a number of separate features. One of these key features is being able to write SQL-like syntax (query syntax) in your code. At a basic level, that's all you need to know for now.

LINQ (to objects)

First, we're going to talk about LINQ to objects, which I typically just refer to as LINQ (possibly making the matter more confusing). It has absolutely nothing to do with SQL Server, Oracle, or any other kind of relational database. I'm talking about LINQ to objects, because I think that understanding it and contrasting it with LINQ to SQL is critical to understanding both.

For a moment, forget that LINQ exists. Let's say that you wanted to filter a list of names, to only get names that start with the letter "J". You could write the following "utility" function: (if you don't understand "yield return", see this post on that topic).

public static IEnumerable<string> GetNamesStartingWithJ(IEnumerable<string> names)
{
    foreach(var name in names)
        if(name.StartsWith("J"))
            yield return name;
}

A new feature in C# introduced in .NET 3.0, is a concept known as an extension method. This lets us turn my handy dandy static utility method into a method that can be called on a list of names. By changing the signature to this:

public static IEnumerable<string> GetNamesStartingWithJ(this IEnumerable<string> names)

I can then call it like this (Sweet!):

var myListOfNames = new List<T> {"Abe", "Jack", "Jason"};
var jNames = myListOfNames.GetNamesStartingWithJ();

We haven't even talked about LINQ yet, but we've basically reinvented a portion of it. As an exercise for the reader, think about how you could use a Lambda parameter to pass in a filter criteria to create a ".Where" method. All the pieces are in place to re-create this form of LINQ yourself.

One actual new feature for LINQ is known as query syntax. Basically, it gives us an alternative way to write our query. It makes the code look more like SQL, and less like a long chain of extension methods.

Lambda Syntax:

var uppercaseJNames = names.Where(name => name.StartsWith("J")).Select(name => name.ToUpper());

Query Syntax (same query):

var uppercaseJNames = from name in names
where name.StartsWith("J")
select name.ToUpper();

In both of those examples, the exact same operations are occurring, and you get the result. The one you choose will most likely come down to personal preference. It's also worth noting that some of the extension methods provided out of the box are not available in query syntax. You can either avoid the query syntax in those cases, or use a hybrid approach.

How is LINQ to SQL (and Entity Framework, etc) Different?

Now, I hope you understand that there isn't really any magic going on in LINQ. Microsoft has simply given us a new set of easy to use tools that make working with sets a breeze.

LINQ to SQL is a different matter. Instead of executing code, you're building an expression. An expression is simply a "picture" of what you're trying to accomplish. It can interpreted in many different ways. To understand the underlying technology, you'll have to read up on expression trees, which I'm intentionally keeping outside the scope of this post.

If we have a "picture" of a query, what happens to it when we want to "run" it? LINQ to SQL, Entity Framework, and other LINQ implementations look at your query, and basically translate it into something else. How about an example?:

//Query Syntax
var deviceIds = from device in Devices
where device.Type == "I"
select device.DeviceId

//Lambda Sytax (extension methods)
var deviceIds = Devices
   .Where (device => (device.Type == "I"))
   .Select (device => device.DeviceId)

//SQL
SELECT [t0].[DeviceId]
FROM [Devices] AS [t0]
WHERE [t0].[Type] = "I"

I've provided the query syntax and the lambda syntax. At the bottom is the resulting translation into a SQL statement.

In this last example, I'll try to make it clear that your code is simply interpreted and translated:

//Query Syntax:
from device in Devices
where device.Type != null
select device.DeviceId

//SQL Syntax:
SELECT [t0].[DeviceId]
FROM [Devices] AS [t0]
WHERE [t0].[Type] IS NOT NULL

Notice that the C# operator "!=" translates in SQL to "IS NOT NULL". This was handled automatically for us. Our expression did NOT get back all the rows and apply a conditional to it.

Why is this important? To use either technology effectively, you have to understand that when you're working with objects, it's simply a chain of methods, and often behaves as you would expect. When working with LINQ to SQL (or a related technology), the expression is evaluated, and might not execute like you expected.

Understanding the internal workings of these technologies will let us fully take advantage of all the wonderful features it has to offer. In upcoming posts, I'll be warning you of some potential pitfalls related to how your queries are interpreted and translated. I'll also be showing you how to get significant performance gains by using LINQ to SQL or Entity Framework efficiently (over traditional SQL based solutions). I'll also be showing you how I write LINQ queries to query an AutoCAD document!

Related Posts:

Like this post? Please share it!

See a mistake? Edit this post!

Determine if a point is contained within a polygon

One of my recent projects had a requirement to take a list of points and a list of polygons (of any order), and determine which points were in which polygons. I find this problem interesting, because the solution is not apparent, but it is easy to implement.

One common algorithm is called the ray casting algorithm. You can read more about the ray casting algorithm on Wikipedia. My buddy Google was able to find another great resource with some sample Java code.

After an initial performance test, I found this algorithm to be extremely fast. I was able to process over 200,000 checks in under a second.

I converted the code to something a little more object oriented. I wanted a class that would represent a Polygon, and also have a method that would tell me if a point was contained within it. I'm including the code in the hopes that it may help someone else one day:

/// <summary>
///      Represents a geometric polygon made up of any number of sides, defined by <see cref="PointF"/> structures
///      between those points.
/// </summary>
public class Polygon
{
    private readonly PointF[] _vertices;

    /// <summary>
    ///     Creates a new instance of the <see cref="Polygon"/> class with the specified vertices.
    /// </summary>
    /// <param name="vertices">
    ///     An array of <see cref="PointF"/> structures representing the points between the sides of the polygon.
    /// </param>
    public Polygon(PointF[] vertices)
    {
        _vertices = vertices;
    }

    /// <summary>
    ///     Determines if the specified <see cref="PointF"/> if within this polygon.
    /// </summary>
    /// <remarks>
    ///     This algorithm is extremely fast, which makes it appropriate for use in brute force algorithms.
    /// </remarks>
    /// <param name="point">
    ///     The point containing the x,y coordinates to check.
    /// </param>
    /// <returns>
    ///     <c>true</c> if the point is within the polygon, otherwise <c>false</c>
    /// </returns>
    public bool PointInPolygon(PointF point)
    {
        var j = _vertices.Length - 1;
        var oddNodes = false;

        for (var i = 0; i < _vertices.Length; i++)
        {
            if (_vertices[i].Y < point.Y && _vertices[j].Y >= point.Y ''
                _vertices[j].Y < point.Y && _vertices[i].Y >= point.Y)
            {
                if (_vertices[i].X +
                    (point.Y - _vertices[i].Y)/(_vertices[j].Y - _vertices[i].Y)*(_vertices[j].X - _vertices[i].X) < point.X)
                {
                    oddNodes = !oddNodes;
                }
            }
            j = i;
        }

        return oddNodes;
    }
}

Of course I can't write a class without the appropriate unit tests:

[TestClass]
public class PolygonTests
{
    [TestMethod]
    public void PointInPolygon_InnerPoint_ContainedWithinPolygon()
    {
        var vertices = new PointF[4]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1),
                                new PointF(4, 3)
                            };

        var p = new Polygon(vertices);

        Assert.AreEqual(true, p.PointInPolygon(new PointF(2,2)));
    }

    [TestMethod]
    public void PointInPolygon_OuterPoint_NotContainedWithinPolygon()
    {
        var vertices = new PointF[4]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1),
                                new PointF(4, 3)
                            };

        var p = new Polygon(vertices);

        Assert.AreEqual(false, p.PointInPolygon(new PointF(5,3)));
    }

    [TestMethod]
    public void PointInPolygon_DiagonalPointWithin()
    {
        var vertices = new PointF[3]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1)
                            };

        var p = new Polygon(vertices);

        Assert.AreEqual(true, p.PointInPolygon(new PointF(2, 2)));
    }

    [TestMethod]
    public void PointInPolygon_DiagonalPointOut()
    {
        var vertices = new PointF[3]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1)
                            };

        var p = new Polygon(vertices);

        Assert.AreEqual(false, p.PointInPolygon(new PointF(3, 3)));
    }

    [TestMethod]
    public void PointInPolygon_PerformanceTest()
    {
        var vertices = new PointF[4]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1),
                                new PointF(4, 3)
                            };

        var p = new Polygon(vertices);

        var sw = new Stopwatch();
        sw.Start();

        for(var i = 0; i < 200000; i++)
            p.PointInPolygon(new PointF(2, 2));

        sw.Stop();

        Assert.IsTrue(sw.Elapsed.TotalSeconds < 1);
    }
}

The last unit test was only to determine if this method was going to be performant enough for the scenario I wanted to use it in. You may want to remove it or mark it as explicit if you can to avoid timing issues affecting your test outcomes.

If anyone see's a bug, let me know!

Like this post? Please share it!

See a mistake? Edit this post!

Jason Young I'm Jason Young, software engineer. This blog contains my opinions, of which my employer - Microsoft - may not share.

@ytechieGitHubLinkedInStack OverflowPersonal VLOG