Upgrading CRM 4 Attachments to 2011 Using a Linked Server

We ran into a situation where our CRM 4 database would take days to upgrade to 2011. It turned out we had accumulated over 150GB of attachments from CRM tracked emails. To drastically speed up the migration, we decided to truncate the ActivityMimeAttachment table prior to the upgrade. I then wrote SQL scripts to pull the data over after the conversion using the linked server functionality in SQL Server. The advantage is that data can be pulled even while one or both of the systems are in use.

If you’re not familiar with linked servers, they’re a useful feature that lets one SQL Server instance connect to another so that you can access both within the same query or script. To link a server, log in to the destination server. You can actually link in either direction; I just chose to write the script for the destination server. The account you use on the source server must have the proper permissions. In this case, since the migration was temporary, I created a temporary sysadmin account. I was then able to link the source server with this SQL (if it’s not obvious, there are some tokens you need to substitute):

Exec sp_addlinkedserver '{source_server}', 'SQL Server';
Exec sp_addlinkedsrvlogin '{source_server}', 'false', null, '{source_server_user}', '{source_server_password}';

Once the servers are linked, you can start pulling over the data. This is where things get really hairy. In CRM 4, the ActivityMimeAttachment table stored all of the attachments and the associated metadata. In CRM 2011, this information is split apart, and there are extra columns that we need to supply data for. First, we’ll need to populate the Attachment table:

Insert Into Attachment
Select Body, [Subject], FileSize, MimeType, [FileName], Null as 'VersionNumber',
	ActivityMimeAttachmentId as 'AttachmentId'
From {source_server}.{source_crm_database_name}.dbo.ActivityMimeAttachment

After that, we can pull over the metadata. One piece of information that we’re missing is the “solutionId”. I sent a test email from 2011 and inspected the row it inserted to get the solutionId.

Insert into ActivityMimeAttachment
Select AttachmentNumber, ActivityMimeAttachmentId, Null as [VersionNumber],
	{solution_id} as SolutionId, ama.ActivityMimeAttachmentId as 'AttachmentId',
	Null as 'SupportingSolutionId', ActivityTypecode as 'ObjectTypeCode',
	0 as 'IsManaged', 0 as 'ComponentState', '1900-01-01 00:00:00.000' as 'OverwriteTime',
	(newid()) as 'ActivityMimeAttachmentIdUnique', ama.ActivityId as 'ObjectId'
From {source_server}.{source_crm_database_name}.dbo.ActivityPointerBase apb
Join {source_server}.{source_crm_database_name}.dbo.ActivityMimeAttachment ama
	on ama.ActivityId = apb.ActivityId

When you’re ready to unlink the source server, use the drop commands:

Exec sp_droplinkedsrvlogin '{source_server}', null
Exec sp_dropserver '{source_server}'

At this point, I was concerned that the script might crash and/or run out of memory. To avoid this possibility, I decided to pull the data over in batches. I won’t get into the specifics, but you can reference my complete script below. At a high level, I’m adding a column to the source table that marks the rows I’m pulling over and records their migration state. I made it nullable so that it doesn’t affect normal CRM operations.

Declare @LinkedServerUser Varchar(Max)
Declare @LinkedServerPassword Varchar(Max)

Set @LinkedServerUser = '{linked_server_user}';
Set @LinkedServerPassword = '{linked_server_password}';

Print 'Linking Remote Server...'
Exec sp_addlinkedserver '{linked_server}', 'SQL Server';
Exec sp_addlinkedsrvlogin '{linked_server}', 'false', null, @LinkedServerUser, @LinkedServerPassword;
Go

/*
Create a column in the source database to track which rows have been converted.

Values:
--1=Queued to move to attachments
--2=Moved to attachments
--3=Queued to move to mime attachments
--4=Fully migrated
*/
EXECUTE {linked_server}.{crm_database_name}.dbo.sp_executesql
	N'Alter Table ActivityMimeAttachment Add T_Processed Int'
Go

Declare @SolutionId UniqueIdentifier
Set @SolutionId = '{solution_id}';

Declare @Rows Int

Print 'Copying attachments...'
Set @Rows = 1
While(@Rows > 0)
Begin
	--Queue up records we haven't moved
	Update Top (1000)
	{linked_server}.{crm_database_name}.dbo.ActivityMimeAttachment
	Set T_Processed = 1
	Where T_Processed Is Null

	Insert Into Attachment
	Select Body, [Subject], FileSize, MimeType, [FileName], Null as 'VersionNumber',
		ActivityMimeAttachmentId as 'AttachmentId'
	From {linked_server}.{crm_database_name}.dbo.ActivityMimeAttachment
	Where T_Processed = 1

	Set @Rows = @@RowCount
	RAISERROR ('Attachment Rows Updated...', 0, 1) WITH NOWAIT

	--Mark the records as moved
	Update {linked_server}.{crm_database_name}.dbo.ActivityMimeAttachment
	Set T_Processed = 2
	Where T_Processed = 1
End

Print 'Copying attachment metadata...'
Set @Rows = 1
While(@Rows > 0)
Begin
	--Queue up records we haven't moved
	Update Top (1000)
	{linked_server}.{crm_database_name}.dbo.ActivityMimeAttachment
	Set T_Processed = 3
	Where T_Processed = 2

	Insert into ActivityMimeAttachment
	Select AttachmentNumber, ActivityMimeAttachmentId, Null as [VersionNumber],
		@SolutionId as SolutionId, ama.ActivityMimeAttachmentId as 'AttachmentId',
		null as 'SupportingSolutionId', ActivityTypecode as 'ObjectTypeCode',
		0 as 'IsManaged', 0 as 'ComponentState', '1900-01-01 00:00:00.000' as 'OverwriteTime',
		(newid()) as 'ActivityMimeAttachmentIdUnique', ama.ActivityId as 'ObjectId'
	From {linked_server}.{crm_database_name}.dbo.ActivityPointerBase apb
	Join {linked_server}.{crm_database_name}.dbo.ActivityMimeAttachment ama
		on ama.ActivityId = apb.ActivityId
	Where ama.T_Processed = 3

	Set @Rows = @@RowCount
	RAISERROR ('ActivityMimeAttachment Rows Updated...', 0, 1) WITH NOWAIT

	--Mark the records as moved
	Update {linked_server}.{crm_database_name}.dbo.ActivityMimeAttachment
	Set T_Processed = 4
	Where T_Processed = 3
End

Print 'Dropping temp column'
EXECUTE {linked_server}.{crm_database_name}.dbo.sp_executesql  N'Alter Table ActivityMimeAttachment Drop Column T_Processed'

Print 'Unlinking remote server...'
Exec sp_droplinkedsrvlogin '{linked_server}', 'sa'
Exec sp_droplinkedsrvlogin '{linked_server}', null
Exec sp_dropserver '{linked_server}'

Print 'Migration complete.'
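
The queue/copy/mark pattern in the script above can be tried out in isolation. Below is a minimal sketch in Python using an in-memory SQLite database rather than SQL Server and a linked server; the table names (source_rows, dest_rows), the t_processed column, and the tiny batch size are all stand-ins I made up for illustration, not part of the CRM schema.

```python
import sqlite3

BATCH_SIZE = 3  # the T-SQL script uses 1000; kept tiny here for illustration


def migrate_in_batches(conn, batch_size=BATCH_SIZE):
    """Copy rows from source_rows to dest_rows in batches, tracking state.

    State values in the hypothetical t_processed column:
    NULL = untouched, 1 = queued, 2 = copied.
    Returns the number of batches that copied at least one row.
    """
    batches = 0
    while True:
        # Queue up a limited number of rows we haven't moved yet.
        conn.execute(
            "UPDATE source_rows SET t_processed = 1 "
            "WHERE id IN (SELECT id FROM source_rows "
            "WHERE t_processed IS NULL LIMIT ?)",
            (batch_size,),
        )
        # Copy only the queued rows.
        copied = conn.execute(
            "INSERT INTO dest_rows (id, body) "
            "SELECT id, body FROM source_rows WHERE t_processed = 1"
        ).rowcount
        # Mark the queued rows as moved.
        conn.execute(
            "UPDATE source_rows SET t_processed = 2 WHERE t_processed = 1"
        )
        conn.commit()
        if copied == 0:
            break
        batches += 1
    return batches


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE source_rows "
        "(id INTEGER PRIMARY KEY, body TEXT, t_processed INTEGER)"
    )
    conn.execute("CREATE TABLE dest_rows (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO source_rows (id, body) VALUES (?, ?)",
        [(i, f"attachment {i}") for i in range(1, 8)],
    )
    batches = migrate_in_batches(conn)
    copied = conn.execute("SELECT COUNT(*) FROM dest_rows").fetchone()[0]
    pending = conn.execute(
        "SELECT COUNT(*) FROM source_rows WHERE t_processed IS NOT 2"
    ).fetchone()[0]
    return batches, copied, pending
```

Because each pass only touches rows in the "queued" state, the loop can be stopped and restarted without copying a row twice, as long as the marking step runs after every copy.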

Converting Subversion to Mercurial/Git

A few years ago, I set up a privately hosted Subversion repository to hold the source code for all of my various non-work-related projects. These days, it is showing its age, as the world and my professional life have moved on to distributed version control (DVCS). I decided it was time to convert to something more modern, maintainable, and upgradable in the future.

In this post, I’ll show you how I converted my repositories over to Mercurial to archive them in Kiln, complete with history. There are a number of ways to accomplish this goal, and I thought this method was fairly simple and straightforward.

Instructions

First, install TortoiseHg if you haven’t already done so. You actually won’t need Subversion installed to do the conversion.

Next, right click anywhere in an Explorer window, and select TortoiseHg->Global Settings.

Go to the “Extensions” section, and check the “convert” checkbox, which will enable the conversion extension functionality.

One Repository vs Many

In Subversion, there were two common repository structure camps: those who used a single repository for multiple projects, and those who used a separate repository for each project. Due to the complexity of setting up projects on the Subversion server, I always stuck with the “one giant repo” model.

In a DVCS such as Git or Mercurial, repositories are distributed and are ideally as lightweight as possible. The typical pattern is to create a new repository for each project. This obviously means that we have to do some work to separate a single Subversion repository into multiple repositories.

Creating a Filemap (optional)

If you have separate repositories in Subversion, you can skip this step.

Create a “filemap”, which is just a text file with the name of your choosing that will look like this:

include "Utilities/ProjRefToDll"
rename "Utilities/ProjRefToDll" .

The first line tells the converter to include only the path specified. The second line tells the converter to move the project to the root (note the period) of the new repository. If we leave the second line out, we’ll only have the project we’re looking for, but our source code will be nested under folders from our old structure.

You’ll need to update this file for each project you convert.
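
Since the filemap only differs by the project path, it is easy to generate. Here is a minimal sketch in Python; make_filemap is a helper name I made up, and it simply reproduces the two-line format shown above.

```python
def make_filemap(project_path):
    """Return filemap text that keeps one project and moves it to the repo root."""
    return (
        f'include "{project_path}"\n'  # only convert this path
        f'rename "{project_path}" .\n'  # move it to the root of the new repo
    )


if __name__ == "__main__":
    # Prints the same two lines as the example filemap above.
    print(make_filemap("Utilities/ProjRefToDll"), end="")
```

You would write the returned text to a file of your choosing and pass that file to the converter.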

Convert!

Run the following command line with the correct paths (omit the filemap parameter if converting the whole repo):

"<path-to-hg>\hg.exe" convert -s svn "<path-to-svn-repo>" --filemap "<path-to-optional-filemap>"
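
If you are converting many projects, the command can be assembled in a script. The sketch below only builds the argument list; build_convert_command is a hypothetical helper, and the placeholder paths are the same tokens used above, not real locations.

```python
def build_convert_command(hg_path, svn_repo, filemap=None):
    """Build the argument list for `hg convert` from a Subversion source."""
    command = [hg_path, "convert", "-s", "svn", svn_repo]
    if filemap is not None:
        # The filemap is optional; omit it to convert the whole repository.
        command += ["--filemap", filemap]
    return command


if __name__ == "__main__":
    print(build_convert_command(
        r"<path-to-hg>\hg.exe", "<path-to-svn-repo>", "<path-to-optional-filemap>"
    ))
```

The resulting list could be handed to subprocess.run(command, check=True) to actually start the conversion.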

It will then iterate through every revision in the source repository, even revisions that aren’t relevant to the filter you’re applying. It will create a new folder next to the folder you’re converting, with a “-hg” suffix.

Open it up, and it will be a full-fledged Mercurial repository. Do an “update” to confirm that everything looks like you expect.

At this point, you have successfully converted a project from your Subversion repository into a Mercurial repository, revisions and all.

Personally, I’m using Kiln as my online source archive. Even if you aren’t interested in Kiln and would rather use something like GitHub, that’s no problem: push the repository up to Kiln, and then pull it via Git for an automatic conversion at any time. I’m using Kiln for the time being, considering that it’s free for up to 2 users. It works great for my personal projects that I don’t want public on GitHub. My plan is to eventually scrub those projects, and put the latest revision of each out on GitHub to make it available to the community.

I’m not sure if I’m switching the full repositories to Git yet, so I like the flexibility that Kiln offers. If you are interested in converting to Git directly, there is probably an easier way.

The Pomodoro Technique & Scrum/Agile

I recommend checking out the Pomodoro Technique. It’s a super simple productivity technique designed to improve your focus and productivity by eliminating distractions, and to maintain your concentration by taking breaks at optimal times.

The best part is that there is no book to buy, and you can get started in minutes. The Pomodoro website can get you up and running quickly. Basically, you’re focusing on a single task for 25 minutes, and then you take a 5 minute break before starting the next Pomodoro. After 4 Pomodoros, take a longer break.
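
To make the rhythm concrete, here is a small sketch in Python that lays out a day’s schedule. The 25 and 5 minute values come from the description above; the 15 minute long break is my own assumption, since the length of the “longer break” is up to you.

```python
def pomodoro_schedule(pomodoros, work=25, short_break=5, long_break=15, cycle=4):
    """Return a list of (label, minutes) for the requested number of Pomodoros.

    long_break=15 is an assumed value; only the 25/5 split and the
    "longer break after 4" rule come from the technique as described.
    """
    schedule = []
    for n in range(1, pomodoros + 1):
        schedule.append((f"Pomodoro {n}", work))
        if n % cycle == 0:
            schedule.append(("Long break", long_break))
        else:
            schedule.append(("Short break", short_break))
    return schedule


if __name__ == "__main__":
    for label, minutes in pomodoro_schedule(4):
        print(f"{label}: {minutes} min")
```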

Lifehacker users voted it the #1 productivity method, and I find that it’s compatible with the Getting Things Done methodology (a book I personally recommend). Will it cure world hunger? No. In fact, you can’t even eat the timer even though it looks like a tomato.

It works a lot like Scrum. You create your list of tasks, and then estimate them based on the number of Pomodoros they will take to complete. Each day is like a sprint, and each Pomodoro is like a story. As you get better at the technique, you perform daily retrospectives to improve the process and the estimation. If you don’t like their pencil and paper approach, you can use Trello.

I’m definitely not an expert, but I’ve already realized some of the benefits. It’s amazing how many little distractions occur in a 25 minute period, killing productivity. Unlike a computer, we humans are absolutely terrible at switching tasks.

With an entire team, the challenge would be coordinating everyone to minimize potential interruptions. I did find some teams using it with limited success. For example, here is a detailed analysis of Pomodoro combined with Scrum:

http://www.devoteddeveloper.com/2012/02/pomodoro-scrum-development-objective-i.html

Some interesting quotes from that article:

“I think expanding the Pomodoro Technique® to a whole team can prove very difficult. Probably it’s not even desirable.”

“How you work most efficient is very individual, but to be productive I think you have to feel comfortable about, and like, the way you work.”

Give it a try. Couldn’t hurt.

My Ivy Bridge i7 PC Build

I recently had an opportunity to sell my original i7 920 PC, and build a 3rd generation Ivy Bridge.  My goal was to minimize power usage so that I could run it as a 24/7 media server, and maximize its encoding horsepower. After hooking this build up to a Kill-A-Watt, it measured at just over 30 watts at idle, and less than 100 watts maxed out.

Component | Price | Name
PSU | $89.99 | SeaSonic M12II 620 Bronze 620W ATX12V V2.3 / EPS 12V V2.91 SLI Ready 80 PLUS BRONZE Certified Modular Active PFC Power Supply
Motherboard | $104.99 | ASRock Z77 Extreme4 LGA 1155 Intel Z77 HDMI SATA 6Gb/s USB 3.0 ATX Intel Motherboard
CPU Fan | $49.99 | ZALMAN CNPS9500A-LED 92mm 2 Ball CPU Cooler
Memory (32GB) | $109.98 | G.SKILL Ares Series 16GB (2 x 8GB) 240-Pin DDR3 SDRAM DDR3 1866 (PC3 14900) Desktop Memory Model F3-1866C10D-16GAB (two kits)
Intel 3.5GHz i7 Ivy Bridge | $299.99 | Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K
Antec 300 Case | $44.99 | Antec Three Hundred Black Steel ATX Mid Tower Computer Case
128GB SSD HD | $102.44 | Crucial 128 GB m4 2.5-Inch Solid State Drive SATA 6Gb/s CT128M4SSD2
LG Blu-ray Drive | $94.99 | LG BH14NS40 14X SATA Blu-ray BDXL Internal Rewriter with Software – Retail Box

Total Price: $897.36 + shipping

Processor: Intel Core i7-3770K 3.5GHz

This was the first component I picked out, as I suspected all of the other components would be dependent on the processor. We have built some AMD machines at work, but they are severely lagging behind their Intel-based counterparts for performance and power usage. Intel is a little pricier, but definitely worth it. I went on the high end as far as speed to minimize video encoding time, and hopefully postpone my next upgrade as long as possible.

Since I don’t have any intentions of using this system for high-end gaming, my goal is to actually use the Intel 4000 graphics that are integrated into the chip. Integrated graphics are still significantly behind their discrete counterparts, but this saves me over 100 watts of power usage.

Motherboard: ASRock Z77 Extreme4

I always find the motherboard the hardest component to spec for a new build. In the past I’ve chosen Gigabyte boards for their low price and innovative features such as long-life solid state capacitors. However, I ran into some buggy firmware with my last build, and hoped to avoid them this round. Solid state capacitors are now the norm, so features and stability were my primary goals. We’ve used a few ASRock boards for builds at work, and they have worked flawlessly. They are also the current leader for fast booting when using UEFI. At $105 for this board, it seemed like a great price/value ratio. It has dual-monitor outputs so that I can use the integrated graphics while still keeping 2 monitors.

A thermal image of the board showed the ASRock chipset at 143°F. I can’t explain why it runs so hot. It’s not hot enough to be a huge concern, but I would prefer to keep every component cool for long life.

Memory: 32GB G.SKILL

Memory speed used to have a significant impact on the real-world performance of your system. These days, buy memory that is well reviewed and falls within the memory speed requirements of your motherboard. At the time of my purchase, I was able to get the Ares memory for only $55 per 2x8GB kit. At this price, it was worth maxing out the board at 32GB (I’m still in awe) just to avoid a future upgrade. One interesting feature of the motherboard is the ability to use part of the RAM as a RAMDISK, which is essentially a virtual hard drive that would outperform even the best SSD.

PSU: SeaSonic M12II 620 Bronze 620W

Bad power supplies create bad power, which kills components. A good power supply will provide clean power, which will make your system more stable, and will help ensure a long lifespan for all of the other components.

Why did I choose a 620W power supply? My experience has been that people often oversize their power supplies. If you use a 1000W power supply on a 50W load, it’s going to be terribly inefficient. Power supply efficiency, which measures how much power is lost in the conversion from AC to DC, is rated at specific load levels (80 PLUS certification tests at 20%, 50%, and 100% of rated load). As you get out of that sweet spot, particularly at very light loads, efficiency drops.

When sizing a power supply, the #1 power consumer is your video card. An inefficient video card will pull more power than the rest of your system. Add multiple cards and the power requirements go through the roof.

In my case, I’m using integrated graphics and shooting for low power usage. My goal was to stay under 100W at load on the DC side of the power supply. With 85% efficiency, that would be about 118 watts on the AC side. Even at 620W, this power supply is WAY oversized. However, I have a quality power supply that gives me plenty of headroom for any upgrades down the road. Besides, there are not many well-built power supplies at lower wattages these days.
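
The arithmetic behind those numbers is simple enough to sketch in Python. The function names below are my own; the 100W DC load, 85% efficiency, and 620W rating are the figures from this build.

```python
def ac_draw_watts(dc_load_watts, efficiency):
    """AC power pulled from the wall for a given DC load and efficiency (0-1)."""
    return dc_load_watts / efficiency


def load_fraction(dc_load_watts, rated_watts):
    """How much of the power supply's rated capacity the DC load uses (0-1)."""
    return dc_load_watts / rated_watts


if __name__ == "__main__":
    print(round(ac_draw_watts(100, 0.85)))      # about 118 watts at the wall
    print(round(load_fraction(100, 620) * 100))  # about 16 percent of capacity
```

At roughly 16% of rated capacity, the load sits below the range where the supply is most efficient, which is the trade-off I accepted in exchange for headroom.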

Hard Drive: Crucial m4 128GB SSD

SSDs are undoubtedly the biggest technological breakthrough in computing as it relates to the perceived speed of the computer. People usually quote 7 second boot times, but the value is truly in the fact that applications open instantly.

I’ve had my fair share of bad luck with SSDs. Since they’re a newer technology than their spinning disk counterparts, they have firmware that can be buggy in certain versions. I was a victim of the OCZ firmware bug, as well as the 7 month bug in the Crucial drives. That being said, I trust this drive thanks to its fairly consistent track record, and favorable performance.

Why only 128GB? I have a 3TB storage array that I use for my bulk data. Unless you install a number of large games on your SSD, or store other bulk data, 128GB is fine for just about everyone. If you feel more comfortable with the 256GB version, the price gap has closed significantly.

Case: Antec 300

I hate choosing a case. Without seeing them in person, it’s difficult to gauge their build quality and features. I’ve used this case in the past, and it’s worked well enough. It’s usually inexpensive, and easy to work with. It has excellent ventilation, a cleanable air filter, and a place to mount your SSD.

CPU Fan: ZALMAN 92mm

I ran some temperature tests on the i7 Ivy Bridge with a stock cooler, and found that it can run pretty hot under load. An upgraded fan was probably not absolutely necessary, but it’s a $50 investment that should keep your CPU running 20-30°C cooler. This one is a simple design that’s easy to install, and the copper fins dissipate heat very easily.

In a photo I took with a FLIR thermal camera, there was almost no measurable heat coming from the heat sink.

Blu-ray / DVD: LG Blu-ray Drive

This was probably overkill, but it was the only Blu-ray drive that I could find that got exceptional reviews and is 14x.

Why I Would No Longer Choose Silverlight

I recently spoke to some of the wonderful guys on the ASP.NET team over at Microsoft (@shanselman @haacked and others). Somebody asked if anyone had any complaints they wanted to share. I brought up the topic of Silverlight since it’s been on my mind recently.

I would like to make it clear that the product we built with it is still amazing, and Silverlight has a lot to do with that. Silverlight is a solid product that has reflected the amount of effort that the developers put into it.

When I was originally confronted with the decision of building our user interface in Silverlight or HTML, I chose Silverlight for the following reasons:

  • It’s the cool thing to do
  • Perceived fast development
  • User interface performance without having to touch “icky” JavaScript

Now that the first commercially available version of our software is available, I can’t help but admit some regret with the choice to use Silverlight.

The bottom line for me comes down to the fact that HTML runs everywhere, and Silverlight runs some places.

It’s becoming abundantly clear that the future of computing is in a wide variety of form factors, and they’re getting smaller. iPads, tablets/slates, iPhones, Android phones, Microsoft phones, etc. are all widely available. Even Microsoft’s own Windows CE devices don’t support the Silverlight we know and love. These new types of mobile devices are showing us that the future is not just staring at a 19” screen. Our own Silverlight front-end is not only un-optimized for these devices, but worse yet, it doesn’t work, period.

So is the HTML story better? I believe it is. Let’s examine my original reasons for going to Silverlight:

  • It’s the cool thing to do – I believe this gap is closing. jQuery and other animation and scripting tools are making cross platform development reliable and easy to develop. The proliferation of web standards has been helpful as well.
  • Perceived fast development – My mistake here was thinking that since we didn’t have to deal with HTML, CSS, and JavaScript issues, we would be able to develop much quicker. The reality is that the time we would have spent on those issues is now spent dealing with another layer of indirection. We have to get the data from our database to the web server, and then we still have to get it to the client. This is all while managing the subtleties of RIA services and LINQ to SQL objects. ASP.NET MVC removes one layer of mapping and complexity (we can optionally add layers obviously).

    One notable exception to this is our scheduling interface. We’re using the wonderful Telerik scheduler, which really shows where Silverlight can shine compared to an equivalent HTML page.

  • User interface performance – Silverlight wins this one. We’re doing some amazing things with layout transforms that really wow our customers. The easy animations will be about the same amount of work in JavaScript, but the more complex animations will likely be easier in Silverlight.

Most of my reasons make common scenarios a wash (or possibly in favor of Silverlight). However, there are reasons that I’m avoiding Silverlight going forward:

  • It doesn’t run well on Linux, CE, iPad, etc – This is huge. Unless you have a captive audience, you’ll want to make sure that your application runs on the newest devices. HTML is the only sure bet.
  • Memory leaks/issues – This one really irks me. I’m sure that there are issues in our code that cause some leaks, and I’m sure some can be attributed to this seemingly unsolvable issue. We have a screen that stays open on a touchscreen, and it consumes all the memory on the machine within half a day. The bottom line is that these simply don’t happen in HTML.
  • Initial load hit – I know there are ways to mitigate this in Silverlight by spreading your application over multiple XAP files. However, HTML is the better choice for slow connections. The slow speed is spread throughout the screens, instead of switching between fast and slow.

In conclusion, all of the amazing reasons that I went with Silverlight have faded, and now I’m left with the reality of a platform that will never achieve 100% accessibility. There are plenty of situations where Silverlight may be the perfect fit, but next time I won’t be as quick to lean toward Silverlight without serious consideration of the potential audience and goals.

Introduction to Distributed Source Control

Version control systems manage the changes of documents. In software development, their primary purpose is to store the source code for an application, as well as every revision created during its development.

Currently, many developers use a centralized version control system such as Visual Studio Team System (VSTS) or Subversion. With such systems there is a central repository (e.g., Team Foundation Server (TFS)), usually located remotely, that houses the different versions of source code.

Unfortunately, a number of issues accompany typical source control systems that are based on a centralized repository, including, but not limited to the following:

  • Many operations such as checking code in or out can perform poorly over slow connections.
  • Working offline results in a reduced set of functionality, such as branching or committing multiple features or bug fixes.
  • Moving the repository can be difficult due to the fact that there is front-end and back-end management.
  • Working between networks that may never become bridged is impossible or difficult, since a connection must be made to the central repository.
  • Private work is typically not under source control.
  • There is often a single point of failure.
  • Security must be managed, and may become complex due to multiple permission sets and projects.

Distributed source control systems (or distributed version control systems, "DVCS" for short) are starting to gain popularity because they offer many advantages over the traditional, centralized repository.

They allow users to work independently in either a connected or disconnected environment. There is a tremendous amount of flexibility in regards to merging, managing different branches of development, and managing product features.

Adopters of this technology include Google Code and Sourceforge. Moreover, many major projects such as GNOME, Perl, MySQL, Python, and Ubuntu are also using a distributed source control system.

You may have already heard about some of the popular implementations. Git, Mercurial, and Bazaar are a few choices that have started to become mainstream. If you’ve worked with Subversion, you’ll find that migrating to this new generation of source control systems doesn’t mean giving up the features that you’re used to.

There are many problems with centralized repositories that simply disappear when you’re working with a distributed system:

  • Merging is a core feature and works how and when you want.
  • Security is trivial since everyone works in their own sandbox. You simply choose whom you allow to push changes to your repository and whom you pull changes from. In open source projects, this typically means allowing certain trusted individuals to push changes to the project repositories. When needed, additional security models such as authentication can be imposed.
  • Working disconnected doesn’t require any preparation. You are working offline by default. The only online operation is synchronizing with other repositories.
  • All operations are near-instantaneous. Synchronizing is the only operation that is dependent on the speed of your connection.

What is a Distributed Source Control System?

Distributed source control systems have the same purpose, but work much differently than systems like Team Foundation Server (TFS), SourceSafe, Subversion, and CVS. Instead of having a single repository that contains the source code and history, there are many repositories that have the source code, and some or all of its revision history. One or more peers have repositories for a project, and synchronize what they want, when they want to. There are really no requirements or restrictions. The focus is on synchronizing and working independently.

Workflow

Examples & screenshots included here are from Mercurial using the TortoiseHg explorer extension. Git has similar functionality using TortoiseGit. You can also use the command line for all/some operations if you prefer.

1. Clone a repository – To create your own local repository, you have to clone (copy) all or part of an existing repository. Since each developer has a copy of the repository, you can clone it from anyone.

2. Create a "working copy" of the code – Even though you have the full repository from another developer, you still need to "check-out" or get the latest version of the code. Since the repository is local, this operation is quick and can be done offline.

3. Make changes – Simply make any changes you like, without concerning yourself about how your source control works. This is similar to Subversion, and contrasts sharply with TFS which needs to track any changes you make by interfacing with Visual Studio.

4. Check-in changes – When it’s time to check in your local changes, they are simply committed to your local repository. They do not affect any other repository. Changes are detected by comparing the newest committed revision with the current version on disk.

5. Push or pull changesets – To actually send your changesets to another repository, you need to "push" or "pull" them. In the TortoiseHg dialog, there are options labeled "Incoming" and "Outgoing" which simply compare the local changes with the remote changes and determine what will get pushed or pulled. The "Push" and "Pull" operations send your changesets to another repository, or pull changesets from another repository respectively.

When changesets are transferred between repositories, they do not affect any working copies. This flexibility allows changes to be synchronized without affecting work in progress.
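
The workflow above can be modeled in a few lines of Python. This is a toy sketch, not how Mercurial or Git store history: the Repository class and its methods are names I made up, and history is simplified to a flat list of changeset IDs.

```python
class Repository:
    """Toy model of a DVCS repository (history simplified to a flat list)."""

    def __init__(self, name):
        self.name = name
        self.changesets = []      # changeset IDs stored in this repository
        self.working_copy = None  # changeset the files on disk are based on

    def clone(self, name):
        """Step 1: copy the full history into a new repository."""
        copy = Repository(name)
        copy.changesets = list(self.changesets)
        return copy

    def update(self):
        """Step 2: create a working copy from the newest changeset."""
        self.working_copy = self.changesets[-1] if self.changesets else None

    def commit(self, changeset_id):
        """Step 4: record a change locally; no other repository is touched."""
        self.changesets.append(changeset_id)
        self.working_copy = changeset_id

    def outgoing(self, other):
        """Changesets we have that the other repository lacks."""
        return [c for c in self.changesets if c not in other.changesets]

    def incoming(self, other):
        """Changesets the other repository has that we lack."""
        return other.outgoing(self)

    def push(self, other):
        """Step 5: send our missing changesets to the other repository."""
        other.changesets.extend(self.outgoing(other))

    def pull(self, other):
        """Step 5: fetch missing changesets; the working copy is untouched."""
        self.changesets.extend(self.incoming(other))


def demo():
    central = Repository("central")
    central.changesets.append("a1")
    alice = central.clone("alice")
    bob = central.clone("bob")
    alice.update()
    bob.update()
    alice.commit("a2")
    alice.push(central)
    bob.pull(central)
    return central.changesets, bob.changesets, bob.working_copy
```

After demo() runs, Bob’s repository contains the new changeset, but his working copy is still based on the older one until he chooses to update.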

Online/Offline Operations

Operation | TFS | Subversion | Mercurial/Git/Bazaar
Get/Update | Online | Online | Offline
Check-out | Online | Online | Offline
Check-in | Online | Online | Offline
View History | Online | Online | Offline
Revert | N/A | Offline | Offline
Compare working changes | Online | Offline | Offline
Change tracking | Limited* | Offline | Offline

* Changes can be made in a special "offline" mode, and edited files will be checked-out when returning to "online" mode.

Merging Divergent Development Branches

In traditional, centralized source control systems, the only way for a divergence in code paths was to explicitly create a branch. While this is still possible in a distributed source control system, it is also possible for multiple developers to make independent changes that may or may not conflict.

The beauty of the system is that divergent code paths can be merged at any time. It is possible for the developers to make multiple changes, perform multiple synchronizations (pulls), yet not have to merge until they want to, or until they need to push their changes to another repository.
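
One way to picture this is as a graph in which every changeset records its parents. The sketch below is a simplified model in Python, not the actual storage format of any DVCS; the changeset names and the heads and merge helpers are made up for illustration.

```python
def heads(parents):
    """Return changesets that no other changeset lists as a parent."""
    referenced = {p for ps in parents.values() for p in ps}
    return sorted(c for c in parents if c not in referenced)


def merge(parents, new_id, first, second):
    """Return a new history with a merge changeset joining two heads."""
    merged = dict(parents)
    merged[new_id] = (first, second)
    return merged


def demo():
    # Two developers commit independently on top of the same base.
    history = {
        "base": (),
        "alice-1": ("base",),
        "bob-1": ("base",),
        "bob-2": ("bob-1",),
    }
    before = heads(history)
    after = heads(merge(history, "merge-1", "alice-1", "bob-2"))
    return before, after
```

Two heads simply mean the lines of development have diverged; they can coexist through any number of pulls, and the merge changeset is only created when someone decides to join them.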

Where Is My Repository?

If you’re working with a team of 20 developers, and each one has a full copy of the repository, you don’t need a central repository. However, there are a few reasons why it is recommended:

  1. Central backup location – Even though you have numerous copies of the repository, it is still useful to have a single, known location that an automated backup process can target.
  2. Central communication hub – The logistics of pushing and pulling code between a number of developers can get complicated. Distributing your repository simplifies many of these problems, but is not perfect by itself. Having a central "authoritative" repository can make it quick and easy for developers to collaborate.
  3. Central location for builds – Automated builds and continuous integration servers need a location to pull source code from, which an authoritative repository provides.
  4. Central merge location – If multiple developers are pulling changes from one another, implicit branches can be created. A central repository serves as a location where all of these branches are merged into one development line.

Repositories can typically be hosted internally with little effort using Apache, a built-in daemon, a CGI script, or simply a file share. For simplicity, there are many services that provide repository hosting. For Git, there is GitHub and SourceForge. For Mercurial, there is BitBucket, Google Code, and SourceForge.

The beauty of distributed source control is apparent when you take into account the administrative overhead of a central server. Since the central server is no different than any other peer, it can be easily moved or modified. For example, you can start out with no central server, then you can use BitBucket to store your revisions, then you can move to another service within minutes. Changing providers simply means pushing your changes to another server.

Common Operations

Importing Existing Code

Importing existing code is an extremely simple operation. If it is new code that is not yet under source control, you can simply create a new repository within the folder that contains your code. You can then check in your code as desired.

In Subversion, the import process involved importing the code into the repository, and then checking out a working copy. Mercurial does not have this complexity.

image

Mercurial also comes with built-in support for converting existing Subversion repositories to Mercurial repositories, including the entire revision history. More information is available here.

To convert from an older source control system such as Visual Source Safe, you can first convert the repository to Subversion, and then to Mercurial.

Checking-in Code

It is worth mentioning a typical philosophical difference between how some source control systems promote the check-in process for changes. Systems like TFS and Visual Source Safe only provide limited functionality for reverting and re-applying specific changesets. For this reason, developers tend to check-in groups of unrelated changes. This tends to lead to less useful generic or incomplete comments such as "done for the day".

Flexible source control systems such as Subversion, Mercurial, and Git provide a lot more value when changesets are fine-grained, and represent a single change to the system. For example, renaming a page and changing the tab order are two changes that should be checked-in separately. If needed, either feature can be pulled in or out, moved, synchronized, or used to patch other versions. It also reduces the likelihood of conflicts, and typically makes conflict resolution easier. Other developers can quickly scan through the changelog and get a clear list of the features that were added, or bugs that were fixed. In an ideal world, all commits should be tied to a bug or feature to increase traceability.

Managing Branches & Releases

It is simple to create explicit branches that allow you to maintain parallel development of different features or versions. Branching simply involves entering a branch name when you commit your code. Switching between branches is as easy as performing an update to the latest revision of a branch. In contrast, TFS requires a branch to be created before you can commit changes to it. TFS also keeps a copy of each branch on the developer’s machine, which is optional with Subversion, Mercurial, and others.

Since changes can be made independently, there is also a concept of implicit branching. If we have two users, Ann and Bob, they are free to make changes independently of each other. If Ann checks in her changes, and then Bob pulls down those changes while having changes of his own, there are now two implicit branches of development. In this case pulling changes will automatically create multiple parallel lines of development. Changes cannot be pushed unless the code has been merged. The system is designed this way so that merging is only necessary when pushing, typically to a central repository or build server. The effect is that repositories that are only "pushed to" can be easily and cleanly maintained remotely.

Most distributed source control systems include tools that allow a visual display of code branches. This functionality is also likely to be included in Team Foundation Server 2010.

Tagging Revisions

In order to mark the significance of certain revisions, they can be tagged with a specific label. For example, when you release a specific version of your project, you can tag that revision with the label "v1.2" as seen below. Additional flexibility is provided by the "local tag" functionality, which lets you tag code on your computer without sharing the tag with others.

image

Terminology

Distributed Version Control System (DVCS) – Version control systems manage the changes of documents. In software development, their primary purpose is to store the source code for an application, as well as every revision created during its development.

Repository – A container for a set of changes that represent the history of the source code for a project. A repository may have the ability to store a partial history of the project, or the entire history. The repository is typically optimized by using compression and by only storing deltas or changes of files.

Changeset/Revision – A particular "delta" or change in the codebase. This can include any type of change, in any number of files. Visual Source Safe stored revision numbers for each file. Team Foundation Server and Subversion have global revision numbers for the entire repository. Distributed source control systems often use GUIDs or hash codes to represent specific revisions.

Working copy – A particular revision of the code that has been extracted or checked out from the repository. This revision includes the full version of all the files involved so that the developer can load and make changes to the code.

Bundle – A bundle is a file that contains a set of changes that is intended to be sent to another user to update their repository. This technology allows users to be physically disconnected yet pass code changes to each other. This file typically employs some form of compression to minimize file size.

Patch/diff – A patch is a file that shows the changes between two versions of a file or multiple files. It contains enough information to transform the old version into the new version, or vice-versa. It’s a quick way of sending someone a changeset. Patches are usually in the "unified diff" format, which looks like the following:

--- /path/to/original timestamp 
+++ /path/to/new      timestamp 
@@ -1,3 +1,9 @@ 
+This is an important 
+notice! It should 
+therefore be located at 
+the beginning of this 
+document! 
+ 
 This part of the 
 document has stayed the 
 same from version to 
@@ -5,16 +11,10 @@ 
 be shown if it doesn't 
 change.  Otherwise, that 
 would not be helping to 
-compress the size of the 
-changes. 
- 
-This paragraph contains 
-text that is outdated. 
-It will be deleted in the 
-near future. 
+compress anything. 
  
 It is important to spell 
-check this dokument. On 
+check this document. On 
 the other hand, a 
 misspelled word isn't 
 the end of the world. 
@@ -22,3 +22,7 @@ 
 this paragraph needs to 
 be changed. Things can 
 be added after it. 
+ 
+This paragraph contains 
+important new additions 
+to this document. 

 

Speeding up data access by using Linq to SQL or EF

Recall that LINQ-based object relational mappers (ORMs) use expression trees to translate your C# (or other language) LINQ code into SQL. Many DBAs and developers who don’t fully understand this technology are quick to discredit it. I’m going to show how significant performance, simplicity, and clarity can be gained by using LINQ to SQL.

A DBA recently asked me the question “I thought inline SQL was bad, so why are we using it again?” LINQ may *smell* like inline SQL, but it is not. Let’s first take a look at some simple LINQ that is easy to read:

from d in Devices
where d.CZone == 4
&& d.Type == "X"
select d.Id

So how is this different than inline SQL? To be technical, you’re writing a query against a data model, with full IntelliSense. You’re also writing a provider-agnostic query. This same query can be performed against SQL Server, Oracle, or even the Facebook API if there were a supporting framework in place. We now have a truly unified query architecture.

Let’s keep talking about this simple LINQ query, and see how you would write it if you didn’t want to use LINQ. Most developers before the days of LINQ would probably use a stored procedure. Stored procedures are great. They’re efficient, reusable, and easily updatable. Here is what it may look like:

Create Procedure GetMyStuff
As
Select Id
From Devices
Where CZone = 4
And Type = 'X'
GO

A nice, simple SQL query. There are a few disadvantages that may not be immediately apparent:

  • If you need a second, similar query, you have to create and maintain two stored procedures. As an alternative, you could modify the stored procedure to operate differently based on a parameter. Neither of these options is ideal, but LINQ does give us an alternative that I’ll discuss in a bit.
  • You don’t get IntelliSense when you’re writing your code.
  • You have to be concerned with two different “programming” paradigms, and also have to manually manage the translation in both directions.

Now, let’s take our example to the other end of the spectrum, which will help show where LINQ can really shine where straight SQL does not. This example is a query for a search page. I set up a simple ASPX page to demonstrate. Here is a sample of the user interface:

image

The user enters a number of search criteria, and the results are displayed. I literally coded this in under 5 minutes. If you’re used to using stored procedures to retrieve this type of data, think about how you would go about creating this. You have a couple of options that I’m aware of:

  • Write a separate stored procedure for every combination of parameters. In this case that would be 7 stored procedures. This would certainly not be ideal.
  • Write a single stored procedure that can handle each parameter as nullable parameters, and use “Where @Param Is Null Or Param = @Param”. This option is easy, but has some potential performance implications.
  • Write a single stored procedure that accepts each parameter as a nullable parameter, and uses “If” statements to handle each scenario. This would be time-consuming and error-prone.

In LINQ, we’re able to dynamically build up a query. For the search example, my LINQ looks like this:

var dataContext = new DataClassesDataContext();

var query = (IQueryable<Device>) dataContext.Devices;

if (txtCZone.Text.Length > 0)
    query = query.Where(device => device.CZone == int.Parse(txtCZone.Text));
if (txtUCZone.Text.Length > 0)
    query = query.Where(device => device.UCZone == int.Parse(txtUCZone.Text));
if (txtLZone.Text.Length > 0)
    query = query.Where(device => device.LZone == int.Parse(txtLZone.Text));

dgResults.DataSource = query.ToList();
dgResults.DataBind();

And of course we can use the query syntax instead (replacing lines 5-10 above):

if (txtCZone.Text.Length > 0)
	query = from device in query where device.CZone == int.Parse(txtCZone.Text) select device;
if (txtUCZone.Text.Length > 0)
	query = from device in query where device.UCZone == int.Parse(txtUCZone.Text) select device;
if (txtLZone.Text.Length > 0)
	query = from device in query where device.LZone == int.Parse(txtLZone.Text) select device;

The result is that the SQL code is specifically written to support only the parameters that the user has entered. No extra SQL, and no specific SQL to maintain. Remember that LINQ can be chained together without querying the underlying data. The actual querying of the data only occurs when enumerating the results, using an operation like “ToList()”.
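The chaining behavior described above can be observed without a database. The following is a minimal sketch that uses LINQ to objects rather than LINQ to SQL, so it only illustrates deferred execution in general. The class and member names (DeferredExecutionSketch, PredicateCalls, IsEven, BuildQuery) are hypothetical and not part of the original search page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names, for illustration only.
public static class DeferredExecutionSketch
{
    // Counts how many times the filter actually runs.
    public static int PredicateCalls;

    public static bool IsEven(int n)
    {
        PredicateCalls++;
        return n % 2 == 0;
    }

    // Composes two filters. Nothing is enumerated here,
    // so IsEven has not been called when this method returns.
    public static IEnumerable<int> BuildQuery(IEnumerable<int> source)
    {
        return source.Where(IsEven).Where(n => n > 2);
    }
}
```

Calling BuildQuery leaves PredicateCalls unchanged. The filter only runs once the result is enumerated, for example with ToList(). A LINQ to SQL query behaves analogously: the SQL is sent when the results are enumerated, not when the query is composed.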

To support paging we need to run 3 different types of underlying queries. Here is where LINQ is really going to shine. We can use the same base query for all of these operations, and not have to worry about the drastically different underlying SQL statements.

  1. Result count – Simply by calling the “.Count()” method, we can retrieve the number of rows the query will return in total. The underlying SQL will be a simple and efficient Count operation.
  2. Page n query – By utilizing Skip and Take, any page within the results can be queried. The work of generating a common table expression is handled for you.
  3. First page query – If the underlying provider has an optimization for using the SQL TOP command, the first page of data you query will be able to avoid a common table expression. This has the advantage of being more efficient when the first (and often most common) page of results is displayed.
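The three operations above can be sketched against a single base query. This is an illustrative helper, not the code from the original application: the names PagingSketch and GetPage are assumptions, and the query passed in should already be ordered so that pages are stable (some providers require an OrderBy before Skip).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class PagingSketch
{
    // Returns the total row count and one page of items from the same base query.
    public static (int Total, List<T> Items) GetPage<T>(IQueryable<T> query, int pageIndex, int pageSize)
    {
        // A LINQ provider can translate this into a count query.
        int total = query.Count();

        var items = query
            .Skip(pageIndex * pageSize) // rows before the requested page
            .Take(pageSize)             // rows on the requested page
            .ToList();                  // the page query executes here

        return (total, items);
    }
}
```

With pageIndex set to 0, the Skip argument is zero, which is the case where a provider has the opportunity to emit a simpler first-page query. Whether it actually does is worth checking in the generated SQL.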

Real-world Results

I initially ran into this in a real application that was primarily used to search through a large table of records. It had originally used the stored procedure approach, and was causing the entire system to slow down to the point of being unusable. Thanks to LINQ, we were able to make the search usable. In fact, the results were drastic:

            Stored Procedure    LINQ to SQL
Reads       Over 4,000,000      8,948
Duration    3,249 ms            189 ms

In addition to the improved performance, the code was easier to maintain. The stored procedure was extremely cluttered, had large where clauses, and even contained two nearly identical copies of the query: one for calculating the count, and one for paging support.

Conclusion

LINQ gives us much more than “inline SQL”. It gives us a unified query syntax, delayed execution, query expression building, and dynamically created SQL output. Additionally, the generated queries are optimized based on the exact query being performed instead of making generic SQL that is optimized for multiple scenarios.

LINQ to SQL & Entity Framework Pitfalls

In my last post describing the differences between LINQ to objects and LINQ to SQL, I mentioned how LINQ to SQL and Entity Framework “interpret” your LINQ code, and create the corresponding SQL. Forgetting this fact is extremely dangerous, because LINQ to SQL and other object relational mappers are extremely leaky abstractions. LINQ is obviously a wonderful technology, but this post will be talking about some potential pitfalls you may run into.

SQL Query Complexity Disproportional to LINQ Complexity

Recall the example from my last post:

//Query Syntax:
from device in Devices
where device.Type != null
select device.DeviceId

//SQL Syntax:
SELECT [t0].[DeviceId]
FROM [Devices] AS [t0]
WHERE [t0].[Type] IS NOT NULL

In this case, LINQ to SQL has done something wonderful. It’s saved us from having to understand or worry about the translation of syntax between C# and SQL. Now, what happens when we write something a little more advanced, such as a nested group by?

from d in Devices
group d by d.CZone into czoneGroup
select new { Key = czoneGroup.Key, val = from d2 in czoneGroup
	group d2 by d2.LZone into lzoneGroup
	select lzoneGroup.Key }

And the corresponding SQL:

SELECT [t0].[CZone] AS [Key]
FROM [Devices] AS [t0]
GROUP BY [t0].[CZone]
GO

DECLARE @x1 Int = 3
SELECT [t0].[LZone]
FROM [Devices] AS [t0]
WHERE ((@x1 IS NULL) AND ([t0].[CZone] IS NULL)) OR ((@x1 IS NOT NULL) AND ([t0].[CZone] IS NOT NULL) AND (@x1 = [t0].[CZone]))
GROUP BY [t0].[LZone]
GO

DECLARE @x1 Int = 1
SELECT [t0].[LZone]
FROM [Devices] AS [t0]
WHERE ((@x1 IS NULL) AND ([t0].[CZone] IS NULL)) OR ((@x1 IS NOT NULL) AND ([t0].[CZone] IS NOT NULL) AND (@x1 = [t0].[CZone]))
GROUP BY [t0].[LZone]
GO

//Remaining SQL removed....

What just happened? Our innocent nested group-by has turned into a monster! This is an example of a query that is simple to do in LINQ, but has no translation to a simple SQL statement. Instead of just bombing, the LINQ to SQL engine comes up with a solution that a user may not have written themselves. A typical SQL developer may have looked for a different approach.

Side note: In the nested group-by, notice that LINQ to SQL uses multiple queries. This differs from the Entity Framework approach, which uses outer joins to achieve the same effect.

Does it matter? The answer isn’t so simple. In this simplified example, the performance impact is minimal. Unfortunately, with a large amount of data in this type of query, you could start to experience terrible performance. I personally saw a nested query that was only a few lines of code turn into a 27 page SQL statement. The SQL statement was technically correct, but took seconds to execute, when it should have taken a fraction of a second.

One simple solution that I have found to be very effective, yet not intuitive, is breaking apart the initial query and forcing it to execute using the ToList() method. You’ll have to have a decent “where” clause to avoid excessive amounts of data being returned. Once we have the raw data, LINQ to objects will provide us the same set of tools to further manipulate our data. For instance, here is a modified version of the example presented earlier:

//Simple & fast initial query from the database
var rawData = (from d in Devices
where d.Location == "B3"
select d).ToList();

//This operation happens "disconnected"
var results = from d in rawData
group d by d.CZone into czoneGroup
select new { Key = czoneGroup.Key, val = from d2 in czoneGroup
	group d2 by d2.LZone into lzoneGroup
	select lzoneGroup.Key };

The reason this works well is that it’s taking advantage of the strength of SQL Server, which is to query data, and the strength of .NET, which is to process and manipulate data.

LINQ Abstracting Away Problems it Can’t Solve

Here is a simplified version of a query I saw recently:

int sum = (from d in Devices
where 1 == 2 && d.CZone != null
select d.CZone.Value).Sum()

To make it extremely clear what I’m trying to accomplish, I put “1 == 2” in the “where” clause, so that no rows match the condition. The “Sum()” method returns the type that it’s acting on. For example, if you’re summing integers, the result is an integer. If you’re summing nullable integers, the result is a nullable integer. This is perfectly valid LINQ. This is effectively the SQL that is generated (I simplified it for clarity):

Select SUM(CZone)
From Devices
Where 1 = 2

Since the result of this SQL statement is NULL, it can’t be converted back to an integer. The exception is “InvalidOperationException: The null value cannot be assigned to a member with type System.Int32 which is a non-nullable value type.”

When the LINQ is translated to SQL, there is no such operation as converting a nullable value to a non-nullable, so the “.Value” operation is ignored. This would be fine if the sum function still expected a nullable return type, but it’s now expecting an integer. When it can’t find any rows to return, it tries to return NULL. Since it’s trying to package up a NULL value into a standard integer type, it has no choice but to throw an exception.
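One commonly used way around this is to keep the projection nullable so that a NULL result has somewhere to go, and then coalesce it to zero in .NET. The sketch below is an assumption on my part rather than the fix applied to the original query. The Device class and SumCZone method are hypothetical, and the example runs against in-memory data, where an empty sum is already 0, so the behavior against a real database should be confirmed by inspecting the generated SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names, for illustration only.
public static class NullableSumSketch
{
    public class Device
    {
        public int? CZone { get; set; }
    }

    public static int SumCZone(IQueryable<Device> devices)
    {
        return devices
            .Where(d => d.CZone != null)
            .Select(d => d.CZone) // stays int?, so Sum() returns int?
            .Sum() ?? 0;          // a null sum becomes 0 on the .NET side
    }
}
```

The important difference from the failing query is that “.Value” is never used, so the declared result type of Sum() matches what SQL can actually return.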

Conclusion

Getting started with LINQ is fairly straightforward, but you can’t forget the fact that whatever query you’re writing must be converted into a SQL statement, and the results must be converted back to data that is understandable to .NET. Every LINQ query you write should be checked with a tool such as LINQPad to ensure that the SQL is efficient, and matches what you expect.

Also keep in mind that when you upgrade your data provider, your queries can change. For example, converting a statement from LINQ to SQL to Entity Framework can generate different SQL queries, just as updating to a newer version of the same ORM can.

Understanding LINQ and LINQ to SQL (and EF)

Back to basics for this post. Developers often throw around the word LINQ when talking about a number of different technologies. Now that I have been comfortably using a wide variety of LINQ technologies for a fair amount of time, I can convey some of the key differences that are critical to using LINQ technologies efficiently. I’m also using this as a foundation and reference for some exciting upcoming posts.

The first key point is to know what the heck LINQ is. LINQ itself is a number of separate features. One of these key features is being able to write SQL-like syntax (query syntax) in your code. At a basic level, that’s all you need to know for now.

LINQ (to objects)

First, we’re going to talk about LINQ to objects, which I typically just refer to as LINQ (possibly making the matter more confusing). It has absolutely nothing to do with SQL Server, Oracle, or any other kind of relational database. I’m talking about LINQ to objects, because I think that understanding it and contrasting it with LINQ to SQL is critical to understanding both.

For a moment, forget that LINQ exists. Let’s say that you wanted to filter a list of names, to only get names that start with the letter “J”. You could write the following “utility” function: (if you don’t understand “yield return”, see this post on that topic).

public static IEnumerable<string> GetNamesStartingWithJ(IEnumerable<string> names)
{
    foreach(var name in names)
        if(name.StartsWith("J"))
            yield return name;
}

A feature introduced in C# 3.0 (which shipped with .NET 3.5) is a concept known as an extension method. This lets us turn my handy dandy static utility method into a method that can be called on a list of names. By changing the signature to this:

public static IEnumerable<string> GetNamesStartingWithJ(this IEnumerable<string> names)

I can then call it like this (Sweet!):

var myListOfNames = new List<string> {"Abe", "Jack", "Jason"};
var jNames = myListOfNames.GetNamesStartingWithJ();

We haven’t even talked about LINQ yet, but we’ve basically reinvented a portion of it. As an exercise for the reader, think about how you could use a lambda parameter to pass in a filter criterion to create a “.Where” method. All the pieces are in place to re-create this form of LINQ yourself.
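For reference, one possible answer to that exercise is sketched below. The method is named MyWhere (a hypothetical name) so that it does not clash with the real Where in System.Linq. It is a simplified illustration, not the framework’s actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names, for illustration only.
public static class FilterExtensions
{
    // Yields only the items for which the supplied predicate returns true.
    public static IEnumerable<T> MyWhere<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
            if (predicate(item))
                yield return item;
    }
}
```

With this in place, the earlier filter can be written as myListOfNames.MyWhere(name => name.StartsWith("J")), and the same method works for any element type and any condition.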

One actual new feature for LINQ is known as query syntax. Basically, it gives us an alternative way to write our query. It makes the code look more like SQL, and less like a long chain of extension methods.

Lambda Syntax:

var uppercaseJNames = names.Where(name => name.StartsWith("J")).Select(name => name.ToUpper());

Query Syntax (same query):

var uppercaseJNames = from name in names
	where name.StartsWith("J")
	select name.ToUpper();

In both of those examples, the exact same operations are occurring, and you get the same result. The one you choose will most likely come down to personal preference. It’s also worth noting that some of the extension methods provided out of the box are not available in query syntax. You can either avoid the query syntax in those cases, or use a hybrid approach.
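As a small illustration of the hybrid approach, Count() has no query syntax keyword in C#, so it can be called on a parenthesized query expression. The class and method names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names, for illustration only.
public static class HybridSyntaxSketch
{
    public static int CountJNames(IEnumerable<string> names)
    {
        // Query syntax for the filter, then an extension method for the count.
        return (from name in names
                where name.StartsWith("J")
                select name).Count();
    }
}
```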

How is LINQ to SQL (and Entity Framework, etc) Different?

Now, I hope you understand that there isn’t really any magic going on in LINQ. Microsoft has simply given us a new set of easy to use tools that make working with sets a breeze.

LINQ to SQL is a different matter. Instead of executing code, you’re building an expression. An expression is simply a “picture” of what you’re trying to accomplish. It can be interpreted in many different ways. To understand the underlying technology, you’ll have to read up on expression trees, which I’m intentionally keeping outside the scope of this post.

If we have a “picture” of a query, what happens to it when we want to “run” it? LINQ to SQL, Entity Framework, and other LINQ implementations look at your query, and basically translate it into something else. How about an example?
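Without going into expression trees in depth, a brief sketch can show the difference between code and a “picture” of code. The same lambda is assigned once to a delegate and once to an Expression. The names ExpressionSketch, AsCode, AsPicture, and Describe are hypothetical.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical names, for illustration only.
public static class ExpressionSketch
{
    // A delegate: compiled code that can only be invoked.
    public static readonly Func<int, bool> AsCode = zone => zone == 4;

    // An expression tree: a data structure describing the same comparison.
    public static readonly Expression<Func<int, bool>> AsPicture = zone => zone == 4;

    // Walks the tree the way a LINQ provider would before translating it.
    public static string Describe()
    {
        var body = (BinaryExpression)AsPicture.Body;
        return body.NodeType + " " + body.Left + " " + body.Right;
    }
}
```

Describe() can read the operator and both operands out of the tree, which is the kind of inspection a provider performs before emitting something like “WHERE zone = 4”. The tree can still be turned back into runnable code with AsPicture.Compile().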

//Query Syntax
var deviceIds = from device in Devices
where device.Type == "I"
select device.DeviceId

//Lambda Syntax (extension methods)
var deviceIds = Devices
   .Where (device => (device.Type == "I"))
   .Select (device => device.DeviceId)

//SQL
SELECT [t0].[DeviceId]
FROM [Devices] AS [t0]
WHERE [t0].[Type] = 'I'

I’ve provided the query syntax and the lambda syntax. At the bottom is the resulting translation into a SQL statement.

In this last example, I’ll try to make it clear that your code is simply interpreted and translated:

//Query Syntax:
from device in Devices
where device.Type != null
select device.DeviceId

//SQL Syntax:
SELECT [t0].[DeviceId]
FROM [Devices] AS [t0]
WHERE [t0].[Type] IS NOT NULL

Notice that the C# comparison “!= null” translates to “IS NOT NULL” in SQL. This was handled automatically for us. Our expression did NOT get back all the rows and then apply a conditional to them.

Why is this important? To use either technology effectively, you have to understand that when you’re working with objects, it’s simply a chain of methods, and often behaves as you would expect. When working with LINQ to SQL (or a related technology), the expression is evaluated, and might not execute like you expected.

Understanding the internal workings of these technologies will let us take full advantage of all the wonderful features they have to offer. In upcoming posts, I’ll be warning you of some potential pitfalls related to how your queries are interpreted and translated. I’ll also be showing you how to get significant performance gains by using LINQ to SQL or Entity Framework efficiently (over traditional SQL based solutions). I’ll also be showing you how I write LINQ queries to query an AutoCAD document!

Determine if a point is contained within a polygon

One of my recent projects had a requirement to take a list of points and a list of polygons (of any order), and determine which points were in which polygons. I find this problem interesting, because the solution is not apparent, but it is easy to implement.

One common algorithm is called the ray casting algorithm. You can read more about the ray casting algorithm on Wikipedia. My buddy Google was able to find another great resource with some sample Java code.

After an initial performance test, I found this algorithm to be extremely fast. I was able to process over 200,000 checks in under a second.

I converted the code to something a little more object oriented. I wanted a class that would represent a Polygon, and also have a method that would tell me if a point was contained within it. I’m including the code in the hopes that it may help someone else one day:

using System.Drawing; // PointF

/// <summary>
///		Represents a geometric polygon with any number of sides, defined by the <see cref="PointF"/>
///		vertices at which those sides meet.
/// </summary>
public class Polygon
{
    private readonly PointF[] _vertices;

    /// <summary>
    ///		Creates a new instance of the <see cref="Polygon"/> class with the specified vertices.
    /// </summary>
    /// <param name="vertices">
    ///		An array of <see cref="PointF"/> structures representing the points between the sides of the polygon.
    /// </param>
    public Polygon(PointF[] vertices)
    {
        _vertices = vertices;
    }

    /// <summary>
    ///		Determines if the specified <see cref="PointF"/> is within this polygon.
    /// </summary>
    /// <remarks>
    ///		This algorithm is extremely fast, which makes it appropriate for use in brute force algorithms.
    /// </remarks>
    /// <param name="point">
    ///		The point containing the x,y coordinates to check.
    /// </param>
    /// <returns>
    ///		<c>true</c> if the point is within the polygon, otherwise <c>false</c>
    /// </returns>
    public bool PointInPolygon(PointF point)
    {
        var j = _vertices.Length - 1;
        var oddNodes = false;

        for (var i = 0; i < _vertices.Length; i++)
        {
            if (_vertices[i].Y < point.Y && _vertices[j].Y >= point.Y ||
                _vertices[j].Y < point.Y && _vertices[i].Y >= point.Y)
            {
                if (_vertices[i].X +
                    (point.Y - _vertices[i].Y)/(_vertices[j].Y - _vertices[i].Y)*(_vertices[j].X - _vertices[i].X) < point.X)
                {
                    oddNodes = !oddNodes;
                }
            }
            j = i;
        }

        return oddNodes;
    }
}

Of course I can’t write a class without the appropriate unit tests:

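using System.Diagnostics; // Stopwatch
using System.Drawing;     // PointF
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]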
public class PolygonTests
{
    [TestMethod]
    public void PointInPolygon_InnerPoint_ContainedWithinPolygon()
    {
        var vertices = new PointF[4]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1),
                                new PointF(4, 3)
                            };

        var p = new Polygon(vertices);

        Assert.AreEqual(true, p.PointInPolygon(new PointF(2,2)));
    }

    [TestMethod]
    public void PointInPolygon_OuterPoint_NotContainedWithinPolygon()
    {
        var vertices = new PointF[4]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1),
                                new PointF(4, 3)
                            };

        var p = new Polygon(vertices);

        Assert.AreEqual(false, p.PointInPolygon(new PointF(5,3)));
    }

    [TestMethod]
    public void PointInPolygon_DiagonalPointWithin()
    {
        var vertices = new PointF[3]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1)
                            };

        var p = new Polygon(vertices);

        Assert.AreEqual(true, p.PointInPolygon(new PointF(2, 2)));
    }

    [TestMethod]
    public void PointInPolygon_DiagonalPointOut()
    {
        var vertices = new PointF[3]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1)
                            };

        var p = new Polygon(vertices);

        Assert.AreEqual(false, p.PointInPolygon(new PointF(3, 3)));
    }

    [TestMethod]
    public void PointInPolygon_PerformanceTest()
    {
        var vertices = new PointF[4]
                            {
                                new PointF(1, 3),
                                new PointF(1, 1),
                                new PointF(4, 1),
                                new PointF(4, 3)
                            };

        var p = new Polygon(vertices);

        var sw = new Stopwatch();
        sw.Start();

        for(var i = 0; i < 200000; i++)
            p.PointInPolygon(new PointF(2, 2));

        sw.Stop();

        Assert.IsTrue(sw.Elapsed.TotalSeconds < 1);
    }
}

The last unit test was only there to determine whether this method was going to be performant enough for the scenario I wanted to use it in. You may want to remove it, or mark it as explicit if you can, to avoid timing issues affecting your test outcomes.

If anyone sees a bug, let me know!