Friday, November 26, 2010

Good practice in the use of source code control

Generated files

Never check in generated files, period. I mean it, and I hope by the end of this article you will be convinced too. There should never be a need to put a generated file under revision control; if you think you need to, it usually means you have lost sight of what SOURCE code control is for: SOURCE material. I'm not saying you can't save generated files somewhere for later use (e.g. a file server that gets backed up), but they shouldn't need revision control, because it makes no sense to have different versions of something generated from source material unless the corresponding source material, or the scripts used to generate it, changed. A source code control or revisioning system is not a backup system; don't abuse it as one. More on this later.

The source material is what needs versioning. Sure, if you recompile the same code 20 times you could get 20 different binaries from a binary diff: things like the date and time can work their way into the binary, as can the path from which the code was compiled, depending on the compiler and flags. Which OS, compiler, version and flags were used all have an effect. But the scripts and makefiles that control the compilation of the source code should themselves be under source code control. The generated file you were going to check in should be replaced with a script (or equivalent) that checks the pre-conditions and then runs the precise commands with the precise flags that produce that generated file. The build should be reproducible, so there should be no need to check the generated file in. If it is not reproducible, something is wrong: perhaps there is too much magic involved in compiling that code properly, or perhaps someone wants to exercise control over other developers, preventing them from compiling the code so that they must use the checked-in binaries instead of building from source. Either way, something is wrong.
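To make that concrete, here is a minimal sketch of such a script, in Python. The compiler, source file names, paths and flags are all made up for illustration; the point is that the script checks its pre-conditions and then runs the exact commands with the exact flags, so anyone with the source can reproduce the output and the binary never needs to be checked in.

    #!/usr/bin/env python3
    """Reproducible build script -- check THIS in, not the binary it produces.
    The tool names, paths and flags below are illustrative only; substitute
    whatever your project actually uses."""
    import os
    import shutil
    import subprocess
    import sys

    # Exact flags pinned down so every developer generates the same output.
    CFLAGS = ["-O2", "-Wall", "-DNDEBUG"]
    SOURCES = ["src/main.c", "src/util.c"]
    OUTPUT = "build/mytool"

    def check_preconditions():
        # Fail loudly if the environment can't reproduce the build.
        if shutil.which("gcc") is None:
            sys.exit("error: gcc not found on PATH")
        for src in SOURCES:
            if not os.path.exists(src):
                sys.exit("error: missing source file %s" % src)

    def build():
        os.makedirs("build", exist_ok=True)
        cmd = ["gcc"] + CFLAGS + SOURCES + ["-o", OUTPUT]
        print("running:", " ".join(cmd))
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        check_preconditions()
        build()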

Either the code is fragile, brittle and breaks easily, or someone is trying to hide something or to protect their job, or some combination of these.

It's not the way it should be. Code only gets better with more people compiling it and trying it with different compilers and different flags. Any code that needs a specific compiler on a particular person's computer, compiled in some specific way with magic scripts that exist only on that computer, or that needs a tool with restrictive licensing, is some funky shit that smells like something is off, and you want to stay away from it because it will be a liability to you and your company. If the code you work on gains a dependency on such code, you are making yourself reliant on someone else; you are giving them more power, feeding the beast. Work around the dependency, find alternatives, or try to work out the magic; ask to watch it get built.

One does need to be pragmatic too: sometimes there is no way around using a critical third-party library. Obviously some things, like Win32 API DLL dependencies, can't be avoided on the Windows platform, but there is a reasonable assumption that MS and the DLLs won't disappear suddenly, and there are fallbacks like WINE if MS went bust or decided to go 64-bit only. Another strategy is to design the code to be cross-platform from the beginning, so there is no dependency on any particular OS or OS vendor. That's part of the problem with checking in generated binary files without a means to re-generate them: it limits the OS, toolchain, flags etc. to just the set for which binaries were made available. Going cross-platform removes dependencies and gets developers into the mindset of thinking about their dependencies. Too often, dependencies are only given thought after the fact, when code that works on one platform suddenly needs to work on other platforms and all those dependencies become painfully obvious. Digging around for the old computer of a developer who left the company 5 years ago, because his special .NET tool did some funky code generation and only the generated code was checked in, not the tool's source, is the price being paid now for nobody thinking about the right source code control disciplines back then.

Merge vs exclusive check-out

The exclusive check-out model doesn't work. One user is working on something and needs another person to finish with their lock on a file. If they're in the office, they can just ask that person to check it in, but if that person is away, they might be stuck until the lock is released, or they carry on working in some compromised way. It just doesn't scale.

Thinking it through, and imagining the locking being more granular, down to the function or line level, shows some of the problems with the model: concurrent changes at the line or function level between users can still create a non-working build. What really needs checking is that the combination of changes, including the latest changes you have made and the changes others have made, works together. The exclusive check-out model doesn't magically solve that; it doesn't prevent incompatible changes happening concurrently in different files. I might remove all calls to a function throughout the code and prepare to do the final step of removing its implementation from the .h and .c files while, simultaneously, somewhere in the code (it could be any file in the project), someone checks in a change that calls that function. The only way locking could prevent this is to check out every file in the project and also stop other users adding new files, so they can't add a call to the function I am about to remove. One still needs to test the combination of changes and correct these kinds of problems when files are checked in concurrently. This really is a merge of sorts, but the granularity is different and the conflict is harder to detect; it isn't detected by the versioning software at all, it is found by taking the latest code from the repository, compiling it and running it. Nothing substitutes for testing; the versioning software cannot resolve incompatible changes.

No, the real solution is to mark versions as working, merge in sets of changes, test the merged results, version them, and so on. Best of all is when the testing can be automated, though that is not always possible. There is no merge-free solution once you understand that every change needs to be checked for compatibility with all previous changes, whether it is in the same file as a recent change or in a different one.
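As a rough sketch of that discipline in script form (a Python wrapper assuming a Subversion-style command-line client and a project-specific make test target; the commands are illustrative, not a recommendation of any particular tool): update to pick up everyone else's changes, build and test the combined result, and only then commit.

    #!/usr/bin/env python3
    """Illustrative pre-commit routine: merge in the latest changes, test the
    combination, and only commit if the combined result still works.
    Assumes a Subversion working copy and a 'make test' target (placeholders)."""
    import subprocess
    import sys

    def run(cmd):
        print("+", " ".join(cmd))
        return subprocess.call(cmd)

    def main():
        # Bring the working copy up to date; this is where textual merges happen.
        if run(["svn", "update"]) != 0:
            sys.exit("update failed -- resolve problems first")

        # The versioning tool can't tell whether the *combination* of changes
        # works, so build and test the merged result before committing.
        if run(["make", "test"]) != 0:
            sys.exit("tests failed against the merged code -- fix before committing")

        message = sys.argv[1] if len(sys.argv) > 1 else "merge-tested change"
        run(["svn", "commit", "-m", message])

    if __name__ == "__main__":
        main()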

Exclusive check-out is really annoying when generated files are also checked in. That has to be the worst combination. When recompiling, or doing anything that causes the generated files to change, the compile can fail because it can't overwrite the read-only generated file, or it automatically checks the file out, which blocks others from compiling. It causes all kinds of blocking, or further bad practices like kludging the file permissions. It is just horrible.

Working from checked-in generated files when the source is available is bad. It's worse when the source was available and someone removes it from the repository to prevent others compiling it, leaving just the binaries and headers to link against. How do people seriously think that is progress?

In particular, if the binaries are compiled with one version of the toolchain, say VC9, it forces everyone to switch to VC9; VC6 builds can no longer be done even if there is some need for them. The code would be better if it were compiled and tested with more toolchains. It also stops the code being compiled in other configurations, e.g. with flags for MBCS or UNICODE. Particularly if the API is not well written, it can force the calling code that links with the checked-in binary to be compiled the same way the binary was, to stay compatible with how the API was defined and with the mangled types in the binary's function symbols, in order to get correct linkage. It sucks even more when other binaries are checked in that were compiled in incompatible ways, with similarly brain-dead APIs; that can prevent the two libraries being used together in one application.

The root of the problem is not just how the source code control system is used, but if the generated binaries weren't checked in, it would have been possible to generate the binaries in whatever form was required, and things would be much nicer: one could make the different versions of the binaries one needs, rather than being dictated to by whatever fixed forms happen to be checked in.

The term "source code control" is bad. It uses the word "control", which is I guess is how some people use it. But i think that's part of the problem.

People use it as a tool of power and control. It shouldn't be a restrictive system that inhibits changes or access. The real problem that needs solving is versioning: what is needed is the ability to mark versions of the software and to manage concurrent versions, merge them, branch, etc.

Not a centralized control system. Of course, with decentralized systems there is the question of backups, and source code control/versioning systems are just that, versioning systems, not backup systems.

Backup-proxy

Please stop using the repository as a backup system. Perhaps that is why all those generated files are checked in? Maybe that is why the server is grinding to a halt and can't handle the load of the repository? Maintaining file histories of big binary files that change almost completely with each version is not what source code control or versioning systems are designed to handle. Smart ones detect binary files and store them differently, but they can still seriously bloat the store. Normally a version control system stores a series of diffs, either forward or backward or both: it might keep the earliest and latest versions of a file plus a set of differences that can be applied to reach any other version. That works great for most text-based files but falls apart completely for binary files. For binary files, which are usually much bigger to begin with, the repository generally has to store a complete copy of every version, because a diff between two big binary files rarely makes much sense. And most of the time, if we were interested in checking out generated files at all, we would only want the latest version.
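A toy illustration (Python; not how any particular system actually stores its data): a one-line edit to a text file yields a tiny delta, whereas two builds of the "same" binary can differ in nearly every byte, leaving a delta-based store nothing to save.

    #!/usr/bin/env python3
    """Toy illustration: text deltas are small, binary 'deltas' are not."""
    import difflib

    # Two revisions of a text file: one line changed.
    v1 = ["int add(int a, int b) {\n", "    return a + b;\n", "}\n"]
    v2 = ["int add(int a, int b) {\n", "    return b + a;  /* reordered */\n", "}\n"]

    delta = list(difflib.unified_diff(v1, v2, "file.c@r1", "file.c@r2"))
    print("".join(delta))          # a handful of lines -- cheap to store

    # Two builds of the 'same' binary: timestamps, embedded paths etc. shuffle
    # the bytes, so nearly every byte differs and the store ends up keeping
    # full copies of each version.
    build1 = bytes(range(256)) * 4
    build2 = bytes(reversed(range(256))) * 4
    changed = sum(1 for a, b in zip(build1, build2) if a != b)
    print("bytes differing: %d of %d" % (changed, len(build1)))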

VirtualMachines

Using a VirtualMachine is not an excuse to avoid using source code control. Don't just make backups of the virtual machine to a multi-spanning DVD collection.
Try to make the steps to set up the virtual machine repeatable, or at least document them. VMware and VirtualBox are scriptable; please check out their command-line tools.
When setting up the virtual machine, try to do that with scripts that can be version controlled.
Any files created and being edited inside the virtual machine are likely candidates to be version controlled too.
Preferably use a version control system that is cross-platform, supporting both the host OS and the guest OS of the VirtualMachine. Failing that, most VirtualMachine applications and guest OSes have a means to share folders with the host OS, which makes it possible to use the version control software on the host to version the files being edited in the VM. Don't be lazy: using a VirtualMachine is not an excuse or an easy way out, far from it; doing it right is hard work.
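For example, here is a minimal sketch of scripting VirtualBox from the host with Python. The VM name, OS type, memory size and shared-folder path are placeholders, and the exact VBoxManage options vary between versions, so treat this as a starting point rather than a recipe.

    #!/usr/bin/env python3
    """Sketch of a repeatable VM setup using VirtualBox's VBoxManage CLI.
    All names, sizes and paths are placeholders -- adjust for your project.
    Check a script like this in; don't check in multi-gigabyte VM images."""
    import subprocess

    VM = "build-vm"                 # hypothetical VM name

    def vboxmanage(*args):
        cmd = ["VBoxManage"] + list(args)
        print("+", " ".join(cmd))
        subprocess.check_call(cmd)

    def main():
        # Create and register the VM (ostype depends on the guest OS; a real
        # setup would also attach disks, an install ISO, etc.).
        vboxmanage("createvm", "--name", VM, "--ostype", "Debian_64", "--register")
        vboxmanage("modifyvm", VM, "--memory", "1024", "--cpus", "2")

        # Share the host's working copy with the guest so files edited in the
        # VM can be version controlled from the host.
        vboxmanage("sharedfolder", "add", VM, "--name", "src",
                   "--hostpath", "/home/dev/project")   # placeholder path

        vboxmanage("startvm", VM, "--type", "headless")

    if __name__ == "__main__":
        main()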

Best practice
1) Never check in anything generated. Make the steps to generate the output automated and as reproducible as possible.
2) Never modify generated files except via an automated process.
3) Do check in the scripts and files that are part of the automated processes for generating the output.
4) Do check in, and automate, anything related to setting up the environment assumed for development, beyond the company's standard developer machine setup.
5) Before checking in/committing changes, check out the latest version, merge your changes, test, repeat as needed, then check in when ready.
6) Don't use the repository as a backup location; find alternative and more direct means of doing backups. E.g. if you save binary builds of past versions, send them to a network drive in dated folders that gets backed up (see the sketch after this list); don't use the repo as a backup proxy, it's not designed for that.
7) Do ensure there are backups of the source code control system itself, preferably stored off-site.
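As a sketch of point 6 (in Python, with placeholder paths): copy the build output to a dated folder on a share that is backed up, rather than committing it to the repository.

    #!/usr/bin/env python3
    """Sketch: archive build outputs to a dated folder on a backed-up share,
    instead of abusing the repository as a backup system.
    The paths below are placeholders."""
    import datetime
    import os
    import shutil

    BUILD_DIR = "build"                       # where the generated files land
    ARCHIVE_ROOT = "/mnt/backups/myproject"   # network share that gets backed up

    def archive_build():
        stamp = datetime.date.today().strftime("%Y-%m-%d")
        dest = os.path.join(ARCHIVE_ROOT, stamp)
        # copytree refuses to overwrite, so each day's build gets its own folder.
        shutil.copytree(BUILD_DIR, dest)
        print("archived %s -> %s" % (BUILD_DIR, dest))

    if __name__ == "__main__":
        archive_build()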

Tuesday, November 23, 2010

Artifacts of Good Code

One definition of an artifact is "any feature that is not naturally present but is a product of an extrinsic agent, method, or the like...". I'm interested in the method of coding. When coding, the obvious output or product is the code itself, but I want to talk about some of the other products of good coding.

If the code compiles, there is the binary output, and there are the compiler messages produced during compilation. But these are products of compiling, not so much of coding.


When you see good code you often know it. It looks inviting. It draws you in and lets you play with it.

The kind of characteristics or products of good coding I want to talk about are the things you often see when you find good code. They are correlated with good code; I am not claiming or trying to prove causality. They are indicators. Think of it like trying to judge the inside of someone's house by looking at their yard: well-trimmed plants, a neat lawn, no weeds.


So what kind of things might be in the yard of a piece of code? Some obvious ones are documentation, test cases, a specification and/or requirements and/or user stories, bug tracking, use of a source code versioning system, etc. Not all code has all of these things, but my observation is that more mature code tends to have most of them.


Let's look at why. A number of these artifacts exist to facilitate more than one person working on the code. They are a sign that the code wasn't developed by a single developer, and more heads are better than one. It's immediately a good sign that the code hasn't been looked at by just one person; several people working on it means some review from different perspectives. It's like a house with a single occupant versus a share house.


Anyway, we were talking about code. If the code is occupied by a few coders, some things will probably happen. Firstly, some kind of source code versioning system will be introduced, if one isn't already in use. It's quite hard for several people to work on the same code at the same time and expect anything to work without a system to mark a working version and a way to combine the coders' contributions into newer marked versions. Even for one person writing code alone, marking working versions and being able to move backwards and forwards between versions is very useful, but it becomes much more necessary when several people start working on the code.

Often the next thing that happens, as more and more people work on the code and use it, is the need for documentation. The original author knew what a function or module was meant to do and what its limitations were by design, but if the code doesn't do what was meant, or the original limitations are no longer sufficient, or the assumptions about those limitations were wrong, then without documentation it's hard for others to work with or use that code. One can't change or fix a function without knowing what the function is intended to do. That needs documenting, so that one can verify the implementation of the class/module/function does what is intended and that its callers are using it as intended.
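As a trivial, made-up illustration, documentation of this kind records the intent, inputs, outputs and limitations, so that a later coder can tell whether it is the implementation or the caller that is broken:

    def chunk(items, size):
        """Split a list into consecutive chunks of at most `size` items.

        Intent: preserve order; the last chunk may be shorter than `size`.
        Inputs: `items` is any list, `size` is a positive integer.
        Limitations: the whole list is held in memory; not suitable for streams.
        """
        if size <= 0:
            raise ValueError("size must be positive")
        return [items[i:i + size] for i in range(0, len(items), size)]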

Functions shouldn't be so complicated that the intention of what they do is hard to describe, so good code tends to have nicely named functions/classes/modules too. Good code uses sensible, descriptive but concise naming of functions and variables. It shows that thought went into slicing the program into functions in a logical way. Hare-brained code is easy to spot: take a sample of the function/class names, and if you can't guess from a name what it does, it's probably because that function or class is ill defined and ill conceived, existing only to factor the code in a way that isn't aligned with any semantic meaning. Another sign is many functions with very similar names doing almost the same thing with very minor variations (besides operator-overloading convenience functions that accept different types). Having 10 different functions to query the same value in different ways is a bad sign.

That is often the result of insufficient documentation: as more coders worked on the code, the original functions, never documented as to their intentions, became hard to modify, and their callers were so fragile to changes that new coders created new, similar functions rather than risk breaking everything. You can more confidently fix a broken implementation of a given API if you know from the documentation what the intent is; then it is the callers that depended on the broken behaviour that are broken. Without good API documentation you cannot make this judgement as easily.


Someone on a project like this (perhaps one of the original coders, if they are still around) will eventually notice how brain-dead it is all becoming. Perhaps something happens that brings to a head the fact that the code was not documented, so there was no way to know how it should have behaved compared to how it actually behaved. Then begins the task of documenting. It's always interesting watching coders forced to write documentation for old, crufty code they didn't originally write. Okay, next someone introduces a policy about swear words in the documentation, and then it's down a road of policies. But the point is, documentation is a sign of good code because it signifies that someone, at some time, has had to stop and think for a moment: what does this function do, what are its limitations, what are the inputs, what are the outputs? And they'll look at the function name and hopefully stop to consider its design.

It's one thing to document what a function is intended to do; it's another thing altogether to actually check that it does it. That's what testing is for, and having test cases is how one does it. Without testing, there is no way to really know that something does what is intended. Tests don't need to be automated, but automated is better in most cases. Unit tests can test at a fine-grained level, but any testing and any test case is a start, and a good sign that someone is trying to check that something does what it's meant to do. If you are given code with no test cases, no sample input files and no examples, it's a sign that the code might not have been tested against its intent. Having unit tests or various test cases makes it much easier for coders to fix bugs and get the code doing what the documentation says it should do, so it's a good sign that the code does, or is on its way to doing, what it is documented to do.
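For example, a minimal set of unit tests for the hypothetical chunk function documented earlier; they pin down the documented behaviour so that a later change which breaks it gets caught immediately (the module name is a placeholder):

    import unittest

    from mymodule import chunk  # placeholder module holding the chunk function above

    class ChunkTests(unittest.TestCase):
        def test_even_split(self):
            self.assertEqual(chunk([1, 2, 3, 4], 2), [[1, 2], [3, 4]])

        def test_last_chunk_may_be_shorter(self):
            # Documented behaviour: the last chunk may be shorter than `size`.
            self.assertEqual(chunk([1, 2, 3], 2), [[1, 2], [3]])

        def test_rejects_non_positive_size(self):
            with self.assertRaises(ValueError):
                chunk([1], 0)

    if __name__ == "__main__":
        unittest.main()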


Requirements or user stories are a step up from checking that the code does what was intended; they check what the intentions of the code should be. That might evolve (or devolve) over time, but so too should the user stories or requirements. They are a statement of the need for the software and the purpose it will fulfil; they give the software context and meaning. They don't turn the code into good code, but if the code has no context and meaning, the coders might be producing something that will have little value to anyone but themselves.

So these are some of the obvious ways to identify that potentially some good coding might be going on, making some good code. Obviously, in a company setting it is possible for managers to require that "software engineers" must use or produce the above-mentioned artifacts as standard (but then they are no longer really by-products of good coding; they have been turned into required outputs of the workers). That is a kind of cargo-cult way of doing things (google "cargo cult" if you haven't heard of it before). It is not the way to "inspired code". In later blogs I want to write about other signs that code is good, or even inspired, not just cargo-cult coding dressing itself up to look like good code.