Diff tooling

2016-07-29

An important part of our line of work (e.g. coding) is source control management (SCM) tooling. These days, for the web, that translates pretty much into GitHub, and so, git. As you know, git isn't the only type of SCM. There's a plethora of them, actually. Git is just the jQ^H^HAng^H^H^HReact of the moment. This post is not about git or GitHub in particular, though. It's more about the actual diff and the blatantly missing tools to work with it.

One of the key points of a diff is showing you what changed. To you, the one who coded it. Or you, the one who's reviewing it. Or you, the one who's auditing it. Or you, the one who's trying to untangle the shit of your predecessor who may have messed it up but at least used some kind of SCM in the process. You. Are trying to figure out what changed.

And what changed reaaally depends on the person committing changes. Some people will literally commit every line changed. Some people will do monthly commits. Some people will make semantically split commits, others pile it all together. Who are you again?

Unless you are the one who made the commit, and sometimes even if you are, your task at hand is likely going to be as much fun as one of those fancy rings of hell. The commit you need to review is going to be a bunch of out-of-context changes and unless you are well acquainted with the affected code you're going to have a tough time with it. But why? Because more often than not, the semantics of a commit are going to be entangled. Refactors are the worst offenders here, of course: the more drastic the change, the more trouble the commit.

What surprises me, getting down to the core of this post, is why the tooling is so bad on this front. For example, why can't you check a diff and suppress certain semantic changes? Was it perhaps the job of the committer to have split the commit in the first place? Yes. Maybe. Who knows. Maybe it wasn't relevant. Maybe it was laziness. Maybe it just wasn't feasible. No? Regardless, that doesn't really help you when you're on the reviewing end of things. Especially when you can't/won't throw it back at the committer.

Here's a fact: a commit can consist of various atomic semantic actions. For example, a change can be a rename, multiple renames, movement of code, re-ordering of code without actually changing the logic, business logic changes, etc. A commit really is like a Mikado game and taking away each stick can be a very tricky business. But it's not impossible. And the results can be very pretty.

Let's look at a diff. Conceptually a diff is really the result of some algorithm that compares two strings (usually files) and, under the assumption that these two strings have various similarities, tries to match lines together while jumping over arbitrary changes. Bare diffs are often represented by prefixing removed lines with a - and added lines with a +. The removed lines often (but not necessarily) have an added counter-line which would reflect the change, like this:

Code:
- console.warn('hello world');
+// console.warn('hello world');

The diff format will also carry some metadata regarding the file and line position of each change, and the diff usually, but not necessarily, has some unchanged lines preceding and succeeding it. For this post, I'm only interested in the actual change, so only in the lines where something changed.
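To make the format concrete, here's a minimal sketch that produces such a diff with Python's standard difflib module. The file name greet.js and the wrapper function line_diff are made up for the example, and difflib is only a stand-in here; it is not necessarily the same algorithm your SCM uses.

```python
import difflib

# Two versions of a hypothetical file, one list entry per line.
old = ["function greet() {", "  console.warn('hello world');", "}"]
new = ["function greet() {", "  // console.warn('hello world');", "}"]


def line_diff(old_lines, new_lines, context=1):
    """Return the unified diff between two lists of lines.

    `context` is the number of unchanged lines kept around each change.
    """
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile="a/greet.js", tofile="b/greet.js",  # hypothetical names
        n=context, lineterm="",
    ))


diff_lines = line_diff(old, new)
print("\n".join(diff_lines))
```

The first lines of the result are the file and position metadata, the lines starting with a space are the unchanged context, and the - and + lines are the part this post cares about.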

What a diff will not do is tell you where on a line the changes actually happened. It's really a vertically oriented protocol. That makes sense, of course, in the context of source code that has a tendency to consist of relatively short, condensed lines of instructions. Unless you're writing Java. Nah I'm kidding. Or am I. Yes.

The diff won't tell you how many changes it found on a line. It won't tell you the column where the difference was spotted. Won't tell you the length of the diff(s) found. In short, if you changed some minified JS and you want to diff it you'll just get the whole file as one changed line and good luck to you. (Minified JS tends to put everything on one line.)
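The missing information isn't hard to get at, though. As a rough sketch, assuming you already have the old and new version of a single line, Python's difflib.SequenceMatcher can compare them character by character and report column and length for each change. The function name column_changes and the sample line are made up for the example.

```python
from difflib import SequenceMatcher


def column_changes(old_line, new_line):
    """Return (tag, old_col, old_len, new_col, new_len) per changed span.

    Columns are 0-based character offsets. `tag` is difflib's own
    'replace', 'delete' or 'insert'.
    """
    # autojunk=False: don't let difflib ignore frequent characters,
    # which matters for long single-line input such as minified JS.
    matcher = SequenceMatcher(None, old_line, new_line, autojunk=False)
    return [
        (tag, i1, i2 - i1, j1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old_line = "var a=1;var b=2;"
new_line = "var a=1;var b=3;"
for change in column_changes(old_line, new_line):
    print(change)
```

The number of entries is the number of changes on the line, and each entry carries the column and length that a plain line diff leaves out. Note that SequenceMatcher gets slow on very long lines, so treat this as an illustration, not as something to point at a megabyte of minified code.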

What if you take this diff and start analyzing it? What if you start untangling the diff? What if you took a diff and created multiple small diffs? Trim the fat to get to the meat of the diff. That'd be great, right? Right. Right now this is a tedious manual task while committing and not something that you can do while reviewing. I'm not sure why that is, though, because it'd be a tremendously helpful tool.

Imagine you have to review this terror commit. GitHub makes your life slightly easier by having a side-by-side diff mode where actual changes per line are highlighted, if their algorithm could pinpoint it anyways. Note that you can actually have this in the CLI as well (git diff --word-diff, for example). If you changed one line and replaced it with a completely different line there's nothing to highlight so YMMV there. Regardless, it's a vast improvement over the regular raw, or even colored, diff.

I would like to be able to, after the fact or while reviewing, split a commit up into two individual succeeding commits. I want to say "show me a diff as if the renaming of foo into bar had already happened in an earlier commit". Or better yet, make it so. Rewrite the commit tree history such that there is first a commit where the only thing that happened is renaming foo into bar and then a second commit containing whatever is left of the original commit. That shouldn't be too hard, assuming you can do horizontally what diff can do vertically. That is, find the actual columns of changes. You'll need that because if a line did two things, like rename a var and change its assigned value, then this action should only "move" the renaming but leave the remainder of the changed line intact.
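For the simple case you can fake the "earlier commit" without any column tracking: apply the rename to the old text to get an intermediate version, then diff old against intermediate and intermediate against new. Below is a minimal sketch of that idea. The function split_out_rename is hypothetical, and the rename is a naive whole-word text replacement that knows nothing about scopes, strings or comments, which a real tool would have to handle.

```python
import re
from difflib import unified_diff


def split_out_rename(old_text, new_text, old_name, new_name):
    """Split one change into (rename_diff, remainder_diff).

    Both results are lists of unified-diff lines without context.
    """
    # Naive assumption: every whole-word occurrence is the same variable.
    pattern = re.compile(r"\b" + re.escape(old_name) + r"\b")
    intermediate = pattern.sub(new_name, old_text)

    def diff(a, b):
        return list(unified_diff(a.splitlines(), b.splitlines(),
                                 lineterm="", n=0))

    # First "commit": only the rename. Second: whatever is left.
    return diff(old_text, intermediate), diff(intermediate, new_text)


old_text = "var foo = 1;\nlog(foo);\n"
new_text = "var bar = 2;\nlog(bar);\n"
rename_diff, remainder_diff = split_out_rename(old_text, new_text, "foo", "bar")
print("\n".join(rename_diff))
print("\n".join(remainder_diff))
```

In this example the first line both renames the var and changes its value. The rename diff touches both lines, while the remainder diff only shows the value going from 1 to 2, which is exactly the part a reviewer wants to look at.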

Hey you could take a changed line, split it with newlines between each character, and run a diff over it ;) That's probably how all the big companies do it, anyways. (Nah, I'm kidding)

This renaming of a var is actually two steps further than what I want, initially. At first I'd just want to say "move all these from-to changes into their own commit and show me the current commit as if it appeared after that injected commit". The actual change can be arbitrary, var names or simply repeating patterns of changes, regular expressions, it matters naught to me.

Similarly I'd like to be able to select a bunch of lines, or even just a part (or multiple parts) of the diff and have it split that up into two commits. Rewrite history and allow me to proceed to edit the two commits as if that's how they were committed originally.
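The line-level version of this is mostly bookkeeping. Here's a rough sketch, again on top of Python's difflib: number the changed hunks, let the reviewer pick some, and build the intermediate file that has only the picked hunks applied. Old to intermediate is then the first commit and intermediate to new is the second. The names changed_hunks and apply_selected are made up, and actually rewriting the SCM history is left out entirely.

```python
from difflib import SequenceMatcher


def changed_hunks(old_lines, new_lines):
    """List the non-equal opcodes; the list index is the hunk number."""
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


def apply_selected(old_lines, new_lines, selected):
    """Return the old lines with only the selected hunk numbers applied."""
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    result, hunk_no = [], 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old_lines[i1:i2])
            continue
        if hunk_no in selected:
            result.extend(new_lines[j1:j2])  # take the new side
        else:
            result.extend(old_lines[i1:i2])  # keep the old side for now
        hunk_no += 1
    return result


old_lines = ["a = 1", "b = 2", "c = 3", "d = 4"]
new_lines = ["a = 10", "b = 2", "c = 3", "d = 40"]
middle = apply_selected(old_lines, new_lines, {0})
print(changed_hunks(old_lines, new_lines))
print(middle)
```

Selecting a part of a line instead of whole lines would need the same trick one level down, on the characters of the line.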

You can't tell me this is new. There's no way somebody went through all the trouble of refining diffing to what it currently is but then didn't go the extra mile and do anything like that. So why can't we? Or if we can, where is that magic hiding? Why doesn't git have a CLI editor that works similar to git add -i, but better? Why doesn't GitHub, with all its funding and smart people, offer better tools for doing code reviews? Come on, this isn't even rocket science. It's probably harder to work it out in the UI than to actually do it.

The rocket science would be two steps further. An algorithm that would analyze a code diff and be able to suggest various Mikado sticks to "normalize" into separate commits. Like "hey, I noticed this variable was renamed" or "ooh, you moved that function down a few lines, I think it looks nicer if you put that in its own commit". Or "hey, this is a bug, shall I drop the commi" NO wait!
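Even a dumb heuristic gets you a first "hey, I noticed this variable was renamed". The sketch below is an assumption about how one could start, not a real analyzer: it pairs up replaced lines, splits them into crude tokens, and counts which identifier keeps getting swapped for which other identifier. The tokenizer regex, the min_count threshold and the function name suggest_renames are all made up for the example.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Crude JS-ish tokenizer: identifiers, numbers, or any single other character.
TOKEN = re.compile(r"[A-Za-z_$][\w$]*|\d+|\S")
IDENT = re.compile(r"[A-Za-z_$][\w$]*")


def suggest_renames(old_lines, new_lines, min_count=2):
    """Return (old_name, new_name, count) for repeated identifier swaps."""
    counts = Counter()
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only look at blocks where lines were replaced one-for-one.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for old_line, new_line in zip(old_lines[i1:i2], new_lines[j1:j2]):
            old_tokens = TOKEN.findall(old_line)
            new_tokens = TOKEN.findall(new_line)
            if len(old_tokens) != len(new_tokens):
                continue  # line changed shape, too risky to guess
            for a, b in zip(old_tokens, new_tokens):
                if a != b and IDENT.fullmatch(a) and IDENT.fullmatch(b):
                    counts[(a, b)] += 1
    return [(a, b, n) for (a, b), n in counts.most_common() if n >= min_count]


old_lines = ["var foo = 1;", "log(foo);", "return foo + 1;"]
new_lines = ["var bar = 2;", "log(bar);", "return bar + 1;"]
print(suggest_renames(old_lines, new_lines))
```

Every suggestion that comes out of this is a candidate stick to pull into its own commit. Detecting moved functions or re-ordered code would need a different heuristic.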

Anyways. I'm just frustrated that this kind of tooling isn't available, knowing it's not _that_ hard to build.

So why am I not doing that myself? Good question. Simple truth is I have plenty of pet projects being neglected already, the last thing I need is another one. Let that company valued at $2B spend some money on its core business. I'd start paying for an account for that shit. Just. Give. Me. Or hey, hire me to do it for you. That's probably the only way I can make time for it right now.

Honestly, if you're an SCM-as-a-service and can't see the supreme value in this kind of tooling ...