Generated files
Never check-in generated files, period. I mean it. I hope by the end of this article you will be convinced too. There should never be a need to revision a generated file; if you think you need to, you are doing it all wrong. It means you don't understand what SOURCE CODE control is, so I'll spell it out: it is for SOURCE material. I'm not saying you can't save those generated files somewhere for later use (i.e. a file server that has backups), but they shouldn't need revision control, because it is not logical to have different versions of something that was generated from source material unless there is a change or a different version of the corresponding source material or the scripts used to generate it. A source code control or revisioning system is not a backup system; don't abuse it for that. More on this later.
The source material is what needs versioning. Sure, if you recompile the same code 20 times you could get 20 different binaries if you do a binary diff; things like the date and time can work their way into the binary, as can the path from which the code was compiled, depending on the compiler and flags. Which OS, compiler, version and flags were used will all have an effect. But really, the scripts and makefiles used to control the compiling of the source code should be under source code control. The generated file that you were going to check in should be replaced with a script or equivalent (see the sketch at the end of this section) that checks the pre-conditions and then runs the precise commands with the precise flags that result in that generated file being created. The output should be reproducible, so there should be no need to check the generated file in. If it is not reproducible, you are doing something wrong. It might mean there is too much magic involved to reproduce compiling that code properly; that is a problem, a symptom that something is off. It might be that someone wants to exercise control over other developers, or prevent others from compiling their code, so that everyone must use the generated binaries that are checked in instead of compiling from source. This means something is wrong.
Either the code is fragile, brittle and breaks easily, or someone is trying to hide something or protect their job, or some combination of these.
It's not the way it should be. The code will only get better with more people compiling it and trying it with different compilers and different flags. Any code that needs a specific compiler, compiled in some specific way with magic scripts that only exist on one person's computer, or that only works with one person's setup or with some tool with restrictive licensing, is funky shit that smells like something is off, and you want to stay away from it because it will be a liability to you and your company. If the code you work on gains a dependency on such code, you'll be making yourself reliant on someone else; you will be giving them more power, feeding the beast. Work around the dependency, find alternatives, or try to work out the magic; ask to watch it get built.

One does need to be pragmatic too; sometimes there is no way around using critical 3rd party libraries. Obviously some things like Win32 API DLL dependencies can't be avoided on the Windows platform, but there is a reasonable assumption that MS and the DLLs won't disappear suddenly, and there are alternatives like WINE as a fallback if MS went bust or decided to go 64-bit only. Another strategy is to design the code to be cross-platform from the beginning, so there is no dependency on any particular OS or OS vendor. That's part of the problem with checking in generated binary files without a means to re-generate them: it limits the OS, toolchain, flags etc. to just the set for which binaries were made available. Going cross-platform removes dependencies and gets developers into the mind-set of thinking about their dependencies. It's too bad that dependencies are too often only given thought after the fact, when some code that works on one platform suddenly needs to also work on other platforms and all those dependencies become painfully obvious. Digging around for the old computer of a developer who left the company 5 years ago, because his special .NET tool that did some funky code generation was never checked in and only the generated code was, means no one thought about the right source code control disciplines back then, and now it is being paid for.
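To be concrete about what "a script that checks the pre-conditions and runs the precise commands" might look like, here is a minimal sketch in Python; the compiler, flags and file names are placeholders for whatever a real project would use:

    #!/usr/bin/env python
    # build.py - check this in, not the binary it produces.
    # A sketch only: gcc, the flags and the file names below are
    # placeholders for whatever the real project actually needs.
    import shutil
    import subprocess
    import sys

    def check_preconditions():
        # Fail early and loudly if the assumed environment isn't there,
        # instead of relying on magic that only exists on one machine.
        if shutil.which("gcc") is None:
            sys.exit("error: gcc not found on PATH")

    def build():
        # The precise commands and flags that produce the generated file,
        # so anyone with the source can reproduce exactly the same output.
        subprocess.check_call(["gcc", "-O2", "-Wall", "-c", "widget.c", "-o", "widget.o"])
        subprocess.check_call(["gcc", "widget.o", "-o", "widget"])

    if __name__ == "__main__":
        check_preconditions()
        build()

The script is small, it is plain text that diffs well, and it is the thing actually worth versioning; the binary it spits out is not.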
Merge vs exclusive check-out
The exclusive check-out model doesn't work. One user is working on something and needs another person to finish with their lock on a file. If both are in the office, they can just ask the person to check it in, but if that person is away, they might be stuck until that person checks in their work, or they continue working in some compromised way. It just doesn't scale.

Thinking about it logically, and imagining the locking were more granular, down to the function or line level, one can see some of the problems with the model: concurrent changes at the line or function level between users can create a non-working build. One really needs to check that the combination of changes, including the latest changes they have made and others have made, works together. The exclusive check-out model doesn't magically solve that; it doesn't prevent incompatible changes happening concurrently to different files. I might remove all calls to a function throughout the code and then prepare for the final step of removing the implementation of that function from the .h and .c files, while simultaneously, somewhere in the code (it could be any file in the project), someone checks in a change that calls that function. The only way to prevent this with locking is to check out every file in the project and also lock other users from adding any files, so they can't add a call to the function I am about to remove. Obviously one still needs to test the combination of changes and correct for these kinds of problems when files are checked in simultaneously. This really is a merge of sorts, but the granularity is different and the conflict is harder to detect; it's not detected by the versioning software at all, it's something that gets found by taking the latest code from the repository, compiling it and running it.

Nothing substitutes for testing; the versioning software cannot solve incompatible changes. No, the real solution is to mark versions as working, merge in sets of changes, test the merged results, version them, and so on. And really the best is when the testing can be automated, but that is not always possible. There is no merge-free solution once you understand that every change needs to be checked for compatibility with all previous changes, irrespective of whether it is in the same file or a different file from one that has recently changed.
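To make that concrete, here's a contrived sketch (the file and function names are invented). Neither change touches the other's file, so no exclusive lock and no textual merge conflict ever fires, yet the combination is broken, and only building and testing the merged result finds it:

    # mathutils.py, after developer A's change: A removed every existing
    # call to the helper below and then deleted it.
    #
    #   def double(x):
    #       return x * 2

    # report.py, developer B's concurrent change in a completely different
    # file, written against an older checkout that still had the helper.
    from mathutils import double   # fails only once both changes are combined

    def summarise(values):
        return [double(v) for v in values]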
Exclusive check-out is really annoying when combined with checking in generated files, too. That has to be the worst possible combination. When recompiling, or doing anything that causes the generated files to change, the compile can fail because it can't overwrite the read-only generated file, or it automatically checks the file out, which blocks others from compiling. It causes all kinds of blocking, or further bad practices like kludging the file permissions. It is just horrible.
Working with checked-in generated files, instead of generating them when the source is available, is bad. It's worse when the source was available and someone takes it out of the repository to prevent others compiling it, leaving just the binaries and headers there to link with. How do people seriously think that is progress?
It gets particularly bad when the binaries are compiled with one version of the toolchain, like VC9: it forces everyone to switch to VC9, and VC6 builds can't be done anymore even if there is some need for them. Code would be better if it were compiled and tested with more toolchains. It also stops the code being compiled in other configurations, e.g. with flags for MBCS or UNICODE. Particularly if the API is not well written, calling code that links with a checked-in binary compiled in a particular way can be forced to be compiled in the same way, so that it is compatible with how the API was defined and with the mangled types in the function symbols of the binary, in order to get correct linkage with it. It sucks even more when other binaries with similarly braindead APIs are checked in that were compiled in incompatible ways; that may prevent the two libraries being used together in one application.
While the root of the problem is not just the use of the source code control system, if the generated binaries weren't checked in, it would have been possible to generate the binaries as required in the correct form and things would be much nicer: one could make the different versions of the binaries one requires, rather than being told how the binary should be generated and being handed it in some fixed forms.
The term "source code control" is bad. It uses the word "control", which is I guess is how some people use it. But i think that's part of the problem.
People use it as a tool of power and control. It shouldn't be a restrictive system to inhibit changes or access. The real problem that needs solving is versioning: what is needed is being able to mark versions of the software and to manage concurrent versions, merge them, branch, etc.
Not a centralized control system. Of course, with decentralized systems there is the question of backups. But source code control/versioning systems are just that: versioning systems, not backup systems.
Backup-proxy
Please stop using the repository as a backup system. Perhaps that is why all the generated files are checked in? Maybe that is why the server is grinding to a halt and can't handle the load of the repository? Maintaining file histories of big binary files that change nearly completely with each version is not what source code control or versioning systems are designed to handle. Smart ones detect binary files and store them differently, but they can still seriously bloat the store. Normally version control systems store a series of diffs, either forward or backward or both, so there might be the earliest and latest version of the file stored plus a set of differences that can be applied to get to any other version of that file. That works great for most text-based files, but falls apart completely for binary files. Binary files are usually a lot bigger to begin with, and the repository generally needs to store a complete copy of every version of such a file, because it rarely makes much sense to diff two big binary files. And if we were at all interested in checking out generated files, most of the time it would only be the latest version we wanted.
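A contrived sketch of the difference (the file contents are made up): a one-line edit to a text file costs a few diff lines, while two builds of a binary share almost nothing, so each version effectively costs a full copy.

    import difflib
    import os
    import zlib

    # Text: a one-line change between two versions of a source file is a
    # handful of diff lines, which is all the repository needs to keep.
    old_src = ["int main(void) {\n", "    return 0;\n", "}\n"]
    new_src = ["int main(void) {\n", "    return 1;\n", "}\n"]
    delta = list(difflib.unified_diff(old_src, new_src, "v1", "v2"))
    print(len(delta), "diff lines reconstruct v2 from v1")

    # Binary: two 1 MB blobs standing in for two builds that differ almost
    # everywhere; storing them together compresses to barely less than the
    # sum of their sizes, i.e. a full copy per version.
    old_bin, new_bin = os.urandom(1 << 20), os.urandom(1 << 20)
    print(len(zlib.compress(old_bin + new_bin)), "bytes to keep both versions")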
VirtualMachines
Using a VirtualMachine is not an excuse to avoid using source code control. Don't just make backups of the virtual machine to a multi-spanning DVD collection. Try to make the steps to set up the virtual machine repeatable, or at least document them. VMware and VirtualBox are scriptable; please check out their command-line tools. When setting up the virtual machine, try to do it with scripts that can be version controlled.

Any files created and edited inside the virtual machine are likely candidates to be version controlled too. Preferably use a version control system that is cross-platform, supporting both the host and guest OSes of the VirtualMachine; failing that, most VirtualMachine applications and guest OSes have means to share folders with the host OS, which makes it possible to use the version control software from the host to version the files being edited in the VM. Don't be lazy; using a VirtualMachine is not an excuse or an easy way out. Far from it, doing it right is hard work.
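For example, a VirtualBox setup could be captured in a small script like the following and checked in alongside the project. This is only a rough sketch: it assumes VirtualBox's VBoxManage command is on the PATH, and the VM name, OS type, memory and disk sizes are made-up placeholders.

    #!/usr/bin/env python
    # provision_vm.py - scripting VirtualBox via VBoxManage so the VM setup
    # lives in version control instead of in someone's head.
    import subprocess

    def run(*args):
        print("+", " ".join(args))
        subprocess.check_call(args)

    # Names and sizes below are illustrative only.
    run("VBoxManage", "createvm", "--name", "buildvm", "--ostype", "Debian_64", "--register")
    run("VBoxManage", "modifyvm", "buildvm", "--memory", "2048", "--cpus", "2")
    run("VBoxManage", "createhd", "--filename", "buildvm.vdi", "--size", "20480")
    run("VBoxManage", "storagectl", "buildvm", "--name", "SATA", "--add", "sata")
    run("VBoxManage", "storageattach", "buildvm", "--storagectl", "SATA",
        "--port", "0", "--device", "0", "--type", "hdd", "--medium", "buildvm.vdi")

VMware has similar command-line tooling (vmrun), and the same idea applies: the setup steps become a checked-in, repeatable artifact rather than something remembered by one person.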
Best practice
1) never check-in anything generated. Make the steps to generate the output automated and as reproducible as possible
2) never modify generated files unless done via an automated process
3) do check-in the scripts and files that are part of the automated processes for generating the output
4) do check-in and automate anything related to setting up the assumed environment under which development happens, other than what might be the company's standard developer machine setup
5) before checking in/committing changes, check out the latest, merge in your changes, test, repeat as needed, then check in when ready
6) don't use the repository as a backup location (find alternative and more direct means of doing backups; e.g. if you save binary builds of past versions, send them to a network drive in dated folders that will get backed up, rather than using the repo as a backup proxy, which it's not designed for; see the sketch after this list)
7) do ensure there are backups of the source code control system, preferably with copies of the repo stored off-site
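For point 6, something as small as this is usually enough; a sketch only, with the source and destination paths made up:

    #!/usr/bin/env python
    # archive_build.py - copy the build output to a dated folder on a share
    # that gets backed up, instead of checking binaries into the repository.
    import datetime
    import os
    import shutil

    build_dir = "build/output"                       # placeholder path
    stamp = datetime.date.today().strftime("%Y-%m-%d")
    dest = os.path.join(r"\\fileserver\builds", "myproject", stamp)

    shutil.copytree(build_dir, dest)                 # fails if already archived today
    print("archived", build_dir, "to", dest)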