N900 tweaking
Posted by Jilles in Blog Posts on June 12, 2010
I’ve been tweaking my N900 quite a bit (just because I can).
Power management. Sadly there are some issues with some wifi routers related to power management. If you find yourself with connections timing out, the solution is going to settings, internet connections. Then edit the problematic connection and go to the last page which features an advanced button. Then under ‘other’ set power management to intermediate or off.
With that sorted out, you’ll want to be offline most of the day. So don’t turn on sip/im/facebook unless you need it and switch it off right after you’re done. Push email is nice but with 15/30 minute polling your battery will last longer.
To gain insight, of course install battery-eye. This plots a graph of your battery’s power reserves. Finally, you may want to install a few applets to dim the screen, turn on/off wifi, and switch between 2G and 3G. You can find these in the extras repository that is enabled by default in the application manager.
Apt-get. The application manager is nice but a bit sluggish and it insists on refreshing catalogs after just about each tap. Use it to install openssh and make sure to pick a good password (or set up key authentication). Then ssh into your n900 and use apt-get update and apt-get install just like you would on any decent Debian box. This is why you got this device.
Finding stuff to install. Instead of listing all the crap I installed, I’ll provide something more useful: ways of finding crap to install.
- Ovi store. Small selection of goodies. Check it out but don’t count on finding too much there. Included for completeness.
- Misc sites with the latest cool stuff:
- nokian900applications.com. Loads of cool things to try here.
- garage.maemo.com. Source forge for maemo, this is where many of the cool apps live.
- maemo.org. Maemo community page. Loads of stuff to find there.
- web applications
- Nokia blog, Maemo 5 category.
- Advanced (i.e. don’t come crying when you mess up and have to reflash): enable the extras, extras-testing, extras-devel repositories from here. Many useful things are provided here. Some of them have the potential to seriously mess up your device. Extras-devel is where all the good stuff comes from but it’s very much like Debian unstable.
Browser extensions. The N900 browser supports extensions. Install the adblock and maemo geolocation extensions through the application manager.
Use the browser. Instead of applications, you can use the browser and rely on web applications instead:
- Cotchin. A web based foursquare client. Relies on the geolocation API for positioning.
- Google Reader for touch screen phones.
- Google maps mobile. Includes latitude, routing and other cool features. Relies on the geolocation API for positioning.
- Maemaps. Pretty cool N900 optimized unofficial frontend for Google Maps.
- Hahlo. A nice twitter client in the browser.
One week with the N900
Posted by Jilles in Blog Posts on June 5, 2010
This is me pimping a Nokia product. I work for Nokia and I rarely do this, and never without believing 100% what I write.
With some delays, I managed to get my hands on an N900. Our internal ordering system took nearly five months to deliver this thing (something involving bureaucracy and the popularity of this device. I guess the external paying customers lining up for it had some priority too). But it was well worth the wait.
For those who don’t know what the N900 is: it is the first phone in a series of linux based tablet devices from Nokia that started with the N770 in 2006, followed by the N800 (which I still have) and the N810. As such, this series of devices was the start of something beautiful a few years ago. Not hindered by any operator limitations, these were essentially pocketable linux pcs. So naturally the engineers working on this selected Debian Linux and named it Maemo Linux. Then they built a tool chain and ecosystem around the platform and tapped into all the readily available OSS goodies. It was great. Any research lab in need of some hackable devices jumped on this. As I recall from when I was still doing pervasive computing research, most of the researchers in this field were using these devices to study all sorts of stuff. Because no matter how obscure your OSS project is, barring screen and cpu limitations you can probably get it going on Maemo Linux. You can, and people did. Most of Ubuntu cross compiles to Maemo without much effort. For example, I was running tomcat, equinox, and lucene on a port of Sun’s CDC J2ME environment (roughly equivalent to java 1.4) on an N800 three years ago. It actually ran well too. In short, these babies are the ultimate hacker’s devices. There really is no alternative in terms of openness or scope in the industry. Android may be linux deep down inside, and Palm’s webOS may be linux deep down inside, but Maemo is Debian Linux without buts or ifs.
And now there is the N900. The N900 is about as thick as an N97, about as long, about 3mm wider, and slightly heavier (I actually did the comparison). Unlike its predecessors, it is a phone as well as a Debian linux running internet tablet. So you get all the goodness from the past versions with a 2X performance and memory boost, a good quality phone stack (hey, it’s still a Nokia), and lots of UI work. While it has some rough edges (the software, not the hardware), it is surprisingly useful as a smart phone despite its current status as an early adopter’s device. It has one of the best browsers around (some would say the best); the UI is responsive and very touch friendly, it multitasks without effort, and it comes with tons of goodies like SIP, skype, google talk, Facebook, and twitter support. And that’s just the out of the box stuff. You can do most of what the N900 does on an iphone. But not all at once. You can on the N900, plus some.
So, best phone ever as far as I’m concerned. MeeGo, the consumer friendly version of Maemo that was born out of our recent deal with Intel and Moblin, is coming soon in the form of new Nokia phones (you can already get it for netbooks). I can’t wait for world+dog to start porting over their favorite software to that. Meanwhile, I just use it as is, which is plenty good. It’s a great smart phone that plays back my music and browses the web (including Google Maps, Youtube, Facebook, and other web 2.0 heavy AJAX & flash sites) without much effort. Most of the iphone optimized web apps work great on the N900 as well. For example, I use the iphone optimized mobile Google Reader (http://www.google.com/reader/i). Mail support is excellent on this device; I use mail for exchange push email and gmail. I can do regular calls, VOIP, Skype (with video), IM, upload photos/videos to facebook, flickr, and other networks. Functionally there is little left to desire. Though somebody getting a foursquare client beyond the early Alpha stage would be nice (there’s two of those).
Re: bear shaving
Posted by Jilles in Blog Posts on May 10, 2010
I was going to submit the stuff below in a shortened form as a comment to this fun little blog post on “bear shaving” but it sort of grew into a full blown article, again. To summarize the original article, there’s this nice analogy of shaving bears to help them cope with global warming and how that is not really addressing the core issues (not to mention dangerous). The analogy is applied to integration builds and people patching things up. Then the author sort of goes off and comes up with a few arguments against git and decentralization.
While some of the criticism is valid, this of course ticked me off.
I see Git as a way to increase the amount of change and to deal more effectively with people working in parallel. Yes, this puts a strain on integrating the resulting changes. But less change is the equivalent of bear shaving here. Change is good. Change is productivity. You want more productivity, not less. You want to move forward as fast as you possibly can. Integration builds breaking are a symptom of a larger problem. Bear shaving would be doing everything you can to make the integration builds work again, including forcing people to sit on their hands. The typical reflex to a crisis like this in the software industry is less change, complete with the process to ensure that people do less. This is how waterfall was born. Iterative or spiral development is about doing the same thing more frequently and for shorter periods. This was generally seen as an improvement. But you are still going to sit on your hands for prolonged periods of time. The real deal these days is continuous deployment and you can’t do this if you are sitting on your hands.
Breaking integration builds have a cause: the people making the changes are piling mistake on mistake and keep bear shaving (I love the metaphor) the problem because they are under a pressure to release and deliver functionality. All a faster pace of development does is make this more obvious. Along with the increased amount of change per time-unit comes also an increased amount of mistakes per time unit. Every quick fix and every misguided commit makes the system as a whole a little less stable. That’s why the waterfall model includes a feature freeze (aka. integration) where no changes are allowed because the system would never get finished otherwise.
A long time ago I wrote an article about design erosion. It was one of the cornerstones of my PhD thesis (check my publication page if you are interested). In a nutshell: changes are cumulative and we take design decisions in the context of our expectations of the future. Only problem: nobody can predict the future accurately and as a consequence, there will be mistakes from time to time. It is inevitable that you will get it wrong sometimes and won’t realize right away. You can’t just rip out a single change you made months or years ago without affecting the subsequent changes that depend on it. In other words, change is cumulative: rip one piece out and the whole sand castle collapses. Some of the decisions will be wrong or will have to be reconsidered at some point, and because changes are interdependent, fixing design erosion can be painful and expensive. Consequently, all software designs erode over time, because such fixes are inevitably delayed until the last possible moment. Design erosion is a serious problem. You can’t fix a badly eroded system that you have had for years overnight. Failing to address design erosion in time can actually kill your company, product or project. But you can delay the inevitable by dealing with the problems closer to where they originate instead of dealing with them later. Dealing with a problem close to where it originates means fewer subsequent changes are affected, meaning that you minimize the cost of fixing the problem. Breaking integration builds are a symptom of an eroding design. Delaying the fix makes it worse.
So, the solution is to refactor and rethink the broken parts of the system to be more robust, easier to test, more flexible to meet the requirements, etc. Easier said than done of course. However, Git is a revolutionary enabler here: you can do the more disruptive stuff on a git branch and merge it back in when it is ready instead of when you go home and break the nightly build. This way you can do big changes without destabilizing your source tree. Of course you want continuous integration on your branches too. That way, you will push fewer mistakes between branches, thus solving problems closer to their origin and without affecting each other. You will still have breaking builds, but they will be cheaper to fix. Decentralization is the solution here and not the problem as is suggested in the blog post I linked above!
Here’s why decentralization works: testing effort grows much faster than the amount of change. As a rule of thumb, double the amount of change and you quadruple the testing effort (quadratic rather than linear growth). So don’t do that and keep the testing effort low. In a centralized world you do this through feature freeze. By stopping all change, you can actually find all the problems you introduced. In a decentralized world you do this by not pushing your changes until the changes you pull are no longer breaking your local branch. Then you push your working code. Why is this better? 1) you integrate incoming changes with your changes instead of the other way around. 2) you do this continuously (every time you pull changes), so you fix problems when they happen. 3) your changes only get pushed when they are stable, which means that other people have less work with #1 and #2 on their side. 4) By keeping changes isolated from each other, you make it easier to test them. Once tested, the changes are a lot easier to integrate.
Continuous integration can help here but not if you only do it on the production branch: you need to do it all over the place. Serializing all the change through one integration environment turns it into a bottleneck: your version system may be decentralized, but if your integration process is not, you are still going to be in trouble. A centralized build system works ok with a centralized version system because a centralized version system serializes the changes anyway (which is a problem and not something to keep bear shaving). The whole point of decentralizing version management is decentralizing change. You need to decentralize the integration process as well.
In a nutshell, this is how the linux kernel handles thousands of kloc of changes per day with hundreds of developers. And, yes, it is no coincidence that those guys came up with git. The linux kernel deals with design erosion by continuous redevelopment. The change is not additive; people are literally making changes all over the linux source tree, all the time. There is no way in hell they could deal with this in a centralized version management type environment. As far as I know, the linux kernel has no automated continuous integration. But they do have thousands of developers running all sorts of developer builds and reporting bugs against them, which is really the next best thing. Nothing gets in the mainline kernel without this taking place.
Git: so far, so good
Posted by Jilles in Blog Posts on April 18, 2010
I started using git two months ago. Basically, colleagues around me fall into three categories:
- Those that already use git or mercurial (a small minority).
- Those that are considering starting to use it, like I was a few months ago (a few).
- Those that don’t get it (the majority).
To those that don’t get it: time to update your skill sets. Not getting it is never good in IT and keeping your skill set current is vital to survival long term. Git is still new enough that you can get away with not getting it but I don’t think that will last long.
The truth of the matter is that git mostly works as advertised and there are a few real benefits to using it and a few real problems with not using it. To start with the problems:
- Not using git limits you to one branch: trunk. Don’t fool yourself into thinking otherwise. I’ve seen branching in svn a couple of times and it was not pretty.
- Not using git forces you to either work in small, non-invasive increments or accept prolonged instability on trunk with lots of disruptive change coming in. Most teams tend to have a release heart beat where trunk is close to useless except when a release is coming.
- Not using git limits the size of the group of people that can work effectively on the same code base. Having too many people commit on the same code will increase the number of conflicting changes.
- Not using git exposes you regularly to merge problems and conflicts when you update your working copy from trunk.
- Not using git forces a style of working that avoids the above problems: you don’t branch; people get angry when trunk breaks (which it does, often); you avoid making disruptive changes and when you do, you work for prolonged periods of time without committing; when you finally commit, you find that some a**hole introduced conflicting changes on trunk while you weren’t committing; once you have committed other people find that their uncommitted work now conflicts with trunk etc.
- Given the above problems, people avoid the type of changes that causes them to run into these problems. This is the real problem. Not refactoring because of potential conflicts is an anti-pattern. Not doing a change because it would take too long to stabilize means that necessary changes get delayed.
All of those problems are real and the worst part is that people think they are normal. Git is hardly a silver bullet but it does take away these specific problems. And that’s a real benefit. Because it is a real benefit, more and more people are starting to use git, which puts all those people not using it at a disadvantage. So, not getting it is causing you real problems now (which you may not even be aware of). Just because you don’t get it doesn’t stop people who do get it from competing with you.
In the past few weeks, I’ve been gradually expanding my use of git. I started with the basics but I now find that my workflow is changing:
I’m no longer paranoid about updating from svn regularly because the incoming changes tend to not conflict with my local work if I “git svn rebase”. Rebasing is a git-specific process where you pull in changes from remote and “reapply” your own local commits on top of them. Basically, before you push changes to remote, you rebase them on top of the latest and greatest available remote. This way your commit to remote applies cleanly on top of what is already there. So “git svn rebase” pulls changes from trunk and applies my local commits on top of them. Occasionally there are conflicts of course but git tends to be pretty smart about resolving most of those. E.g. file renames tend to be no problem. In a few weeks of using git, I’ve only had to edit conflicts a couple of times and in all of these cases, this was straightforward. The frequency with which you rebase doesn’t really matter since the process works on a per commit basis and not on a merge basis like in svn.
I tend to isolate big chunks of work on their own git branch so I can switch between tasks. I have a few experimental things going on that change our production business logic in a pretty big way. Those changes live in their own git branch. Once in a while, I rebase those branches against master (which I regularly rebase against svn trunk) to get the latest changes from svn trunk on the branch and make sure that I can still push them back to trunk when the time comes. Simply being able to work on such changes without those changes disrupting trunk or trunk changes disrupting my changes is a great benefit. You tend to not experiment on svn trunk because this pisses people off. I can experiment all I want on a local branch though. However, most of my branches are actually short lived: just because I can sit on changes forever doesn’t mean I make a habit of doing that needlessly. The main thing for me is being able to isolate unrelated changes from each other and from trunk and switching between those changes effortlessly.
Branching and rebasing allows me to work on a group of related changes without committing back to trunk right away. I find that my svn commits tend to be bigger but less frequent now. I’ve heard people who don’t get it argue that this is a bad thing. And I agree: for svn users this would be a bad thing because of the above problems. However, I don’t have those problems anymore because I use git. So, why would I want to destabilize trunk with my incomplete work?
Whenever I get interrupted to fix some bug or address some issue, I just do the change on whatever branch I’m working on. I commit the changes in that branch. Then I do a git stash save to quickly store any uncommitted work in progress. I do a git checkout master followed by a git cherry-pick
So those are big improvements in my workflow that have been enabled by using git svn to interface with svn. I’d love to be able to collaborate with colleagues on my experimental branches. Git would enable them to do that. This is why them not getting git is a problem.
Btw. you can replace git with mercurial in the text above. They are very similar in capabilities.
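The rebase step described above (pull in the remote changes, then reapply your local commits on top) can be illustrated with a toy model in Python. This is only a sketch of the idea, not how git is implemented: a history is just a list of commit labels, conflicts are ignored entirely, and the function name and labels are invented for the example.

```python
def rebase(upstream: list[str], local_branch: list[str]) -> list[str]:
    """Replay the commits that exist only on local_branch on top of upstream."""
    # Find the fork point: the length of the shared prefix of both histories.
    fork = 0
    while (fork < len(upstream) and fork < len(local_branch)
           and upstream[fork] == local_branch[fork]):
        fork += 1
    local_only = local_branch[fork:]
    # Replayed commits are new commits with the same content, hence the quote mark.
    return upstream + [commit + "'" for commit in local_only]


if __name__ == "__main__":
    trunk = ["r1", "r2", "r3", "r4"]  # trunk moved on while I was working
    mine = ["r1", "r2", "a", "b"]     # my two local commits, forked at r2
    print(rebase(trunk, mine))        # ['r1', 'r2', 'r3', 'r4', "a'", "b'"]
```

After the rebase, the local history is trunk plus my own commits at the tip, which is why pushing it back is a simple fast-forward rather than a merge.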
The Gimp
Posted by Jilles in Blog Posts on April 7, 2010
Since getting an iMac in the summer and not spending the many hundreds of dollars needed for a Photoshop license, I’ve been a pretty happy user of Google’s Picasa. However, it is a bit underpowered and lacks the type of features that are useful for fixing contrast, color, and sharpness issues in poorly lit, partly blown out, noisy, and otherwise problematic photos that you end up with if you are shooting with a nice pocketable compact camera, like I do. My Canon S80 is actually not that bad (great lens, easy to stuff in a pocket, fast to unpocket and aim and shoot, nice controls) but it has three major limitations:
- When shooting automatic it tends to blow out on the highlights, meaning the sky and other bright areas in the photo are white. This means you have to manually set aperture, shutter time, and ISO to get more difficult shots. Most compacts suffer from this problem BTW.
- The screen and histogram on it are not that useful. Basically you will end up with photos that are too dark and that do not use the full available dynamic range if you try to optimize for what’s on the screen. Instead, I’ve been relying on spot metering and measuring different spots and compensating for that using Ansel Adams style zoning and wet finger approach (okay a lot of this). Basically this works but it is tedious.
- Like most compacts, it is useless at higher ISOs due to the noise. Basically I avoid shooting at 200 or above and usually shoot at 50 unless I can’t get the shot. This means in low light conditions, I need a really steady hand to get workable shots.
So as a result, my photos tend to need a bit of post processing to look presentable. Picasa handles the easy cases ok-ish but I know software can do better. So, after exploring the various (free) options on mac and deciding against buying Adobe Lightroom or Photoshop Elements, I ended up taking a fresh look at the Gimp.
The Gimp is, as you no doubt know, an open source photo/bitmap editor (as well as a really funny character in Pulp Fiction). It comes with a lot of features, a UI that is quite ‘challenging’ (some would say unusable), and some technical limitations. To start with the technical limitations: it doesn’t do anything but 8 bit color depth, which means lossy operations like editing contrast or running filters tend to lose a lot more information due to rounding errors that add up the more you edit. It doesn’t do adjustment layers and other forms of non destructive editing, which adds to the previous problems. It’s slow. Slow as in it can take minutes to do stuff like gaussian blur or sharpening on a large image that should be near real time in e.g. Photoshop. It doesn’t support popular non RGB color spaces (like LAB or CMYK, though it can be made to work with them if you need to). And it doesn’t come with a whole lot of filters and user friendly tools that are common in commercial packages. Finally the UI is the typical result of engineers putting together a UI for features they want to show off and of course not agreeing on such things as an overall UI design philosophy or any kind of conventions. It’s nasty, it’s weird in plenty of places, it’s counter intuitive and it looks quite ugly next to my pretty mac apps. But it sort of works, and you can actually configure some of its more annoying defaults to be more reasonable.
So there is a lot lacking and missing in the Gimp and plenty to whine about if you are used to commercial grade tooling.
But, the good news is (beyond it being free) that you can still get the job done in the Gimp. It does require a creative use of the features it has. Basically, the Gimp provides all the basic building blocks to do complex image manipulation but they are not integrated particularly well. There are only a handful of other applications that provide the same type of features and implementation quality. Most of those are expensive.
In isolation the building blocks that the Gimp provides are not that useful. You have to put them together to accomplish tasks, often in not so obvious ways (although for anyone with a solid background in advanced photo editing it is not that much of a challenge). Doing things in the Gimp mainly involves understanding what you want to do and how the Gimp does things. It’s really not very forgiving when you don’t understand this.
Here are some things that are generally in my work flow (not necessarily in this order) that work quite well in the Gimp. I just summarize the essentials here since you can find lengthy tutorials on each of these topics if you start Googling, also there is lots of potential for variation here and perfecting skills in particular areas:
Contrast: duplicate layer, set blend mode to value (just light, not color), use levels or curves tool on the layer to adjust the contrast. Fine tune the effect with layer transparency. This basically leaves the colors unmodified but modifies brightness and contrast.
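To make the “value” blend mode a bit more concrete, here is a small Python sketch for a single pixel. It assumes the mode behaves as described above: hue and saturation come from the layer below and only the value (brightness) comes from the edited copy. The `stretch_levels` helper is a crude, hypothetical stand-in for the levels tool, and pixels are (r, g, b) tuples with components between 0 and 1.

```python
import colorsys


def stretch_levels(pixel, low, high):
    """Crude levels adjustment: map the range [low, high] onto [0, 1], clamped."""
    return tuple(min(1.0, max(0.0, (c - low) / (high - low))) for c in pixel)


def blend_value(bottom, top):
    """Keep hue and saturation of the bottom pixel, take the value of the top pixel."""
    hue, saturation, _ = colorsys.rgb_to_hsv(*bottom)
    _, _, value = colorsys.rgb_to_hsv(*top)
    return colorsys.hsv_to_rgb(hue, saturation, value)


if __name__ == "__main__":
    original = (0.4, 0.3, 0.2)                     # a dull brown
    adjusted = stretch_levels(original, 0.1, 0.6)  # the edited duplicate layer
    print(blend_value(original, adjusted))         # brighter, same hue and saturation
```

Applying the levels change directly would also shift hue and saturation; blending it in value mode keeps only the brightness change.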
Improve black and white contrast with color balance: basically in black and white photography you can use a color filter in front of the lens to change the way light and dark affect the negative. E.g. a red filter is great for getting some nice detail in e.g. water or sky. You can achieve a similar effect with the color balance tool and a layer that has its mode set to value. This is nice for creating black and white photos but also nice for dealing with things like smog (a mostly red haze -> deemphasize the red) in color photos or getting some extra crisp skies. You can examine the individual color channels to find out which have more details and then boost the overall detail by mixing the channels in a different way. This will of course screw up the colors but you are only interested in light and dark here, not color. So duplicate layer, set mode to value, and edit the layer with the color balance tool. Some basic knowledge of color theory will help you make sense of what you are doing but random fiddling with the sliders also works fine.
Local contrast: duplicate layer, use the unsharp mask filter to edit local contrast by setting radius to something like 50 pixels and amount to something like 0.20. Basically this will change local color contrast and change the perceived contrast in different areas by locally changing colors and lightness. If needed, restrict the layer to either value or color mode.
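The idea behind unsharp masking is simple enough to sketch in Python on a single row of brightness values (0 to 1). This is an illustration only: a box blur stands in for the gaussian blur a real filter would use, and the radius here is tiny because the example row is only ten samples long.

```python
def box_blur(signal, radius):
    """Average each sample with its neighbours within radius (shorter window at edges)."""
    blurred = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        blurred.append(sum(window) / len(window))
    return blurred


def unsharp_mask(signal, radius, amount):
    """Push each sample away from its blurred neighbourhood by the given amount."""
    blurred = box_blur(signal, radius)
    return [min(1.0, max(0.0, s + amount * (s - b)))
            for s, b in zip(signal, blurred)]


if __name__ == "__main__":
    edge = [0.4] * 5 + [0.6] * 5  # a soft step from dark to light
    print([round(v, 3) for v in unsharp_mask(edge, radius=2, amount=0.2)])
```

Flat areas stay as they are; next to the step, the dark side gets slightly darker and the light side slightly lighter. With a large radius and a small amount this reads as extra local contrast rather than as sharpening.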
Contrast map: duplicate layer, blur at about 40 pixels, invert, desaturate, set layer to overlay. This is a great way to fix images with a high dynamic range (lots of shadow and highlight detail, histogram is sort of a V shape). Basically it pushes some of the detail to the center of the histogram, thus compressing the dynamic range. Basically it brightens dark spots and darkens bright spots. The blurring is the tricky bit since you can easily end up with some ghosting around high contrast areas. Fiddling with the pixel amount can fix things here. Also using the opacity on the layer is essential since you rarely want to go with 100% here.
Overlay to make a bland image pop: duplicate layer, set mode to overlay. This works well for photos with a low dynamic range. It basically stretches detail towards the shadow and highlights and enhances both contrast and saturation at the same time. Skies pop, grass is really green, etc. Cheap success but easy to overdo. Sort of the opposite effect of contrast map.
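Why overlaying a layer on itself stretches the tones can be seen from the blend formula. The Python sketch below uses the common textbook definition of overlay on brightness values between 0 and 1; this is an assumption, and the Gimp’s own implementation may differ in detail.

```python
def overlay(base, top):
    """Textbook overlay blend: darken where the base is dark, lighten where it is light."""
    if base < 0.5:
        return 2.0 * base * top
    return 1.0 - 2.0 * (1.0 - base) * (1.0 - top)


def overlay_self(pixels):
    """Overlay a layer on an identical copy of itself."""
    return [overlay(p, p) for p in pixels]


if __name__ == "__main__":
    tones = [0.1, 0.3, 0.5, 0.7, 0.9]
    print([round(v, 2) for v in overlay_self(tones)])  # [0.02, 0.18, 0.5, 0.82, 0.98]
```

Mid gray stays put while everything else moves towards black or white, which is the S-curve that makes a flat image pop and also why it is easy to overdo.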
Multiply the sky. Duplicate layer, set mode to multiply, mask everything but the sky (try using a gradient for this or some feathered selection). This has the effect of darkening and intensifying the sky and is great for photos that were overexposed. Also works great for water (though you might want to use overlay).
Color noise: duplicate layer, set mode to color, switch to the red channel and use a combination of blur, and noise reduction filters to smooth out the noise. Selective gaussian blur works pretty well here. Repeat for the green and blue channels. Generally, most of the noise will be in the blue and red channels (because for every cluster of 4 pixels in the sensor, two are green, i.e. most of the detail is in the green channel). Basically, you are only editing the colors here, not the detail or the light so you can push it quite far without losing a lot of detail. Apply a light blur to the whole layer to smooth things out some more.
Luminosity noise: duplicate layer, set mode to value, like with color noise, work on the individual channels to get rid of noise. You will want to go easy on the blurring since this time you are actually erasing detail. Target channels in this order, red, blue, green (in order of noisiness and reverse order of amount of detail). Stop when enough of the luminosity noise is gone.
Color: duplicate layer, set blend mode to color, adjust color balance with curves, levels or color balance tool.
Saturation: duplicate layer, set mode to saturation, use the curves tool to edit saturation (try pulling the curve down). This is vastly superior to the saturation tool. You may want to work on the individual color channels, though this can have some side effects.
Dodge/burn: create a new empty layer, set mode to overlay, paint with black and white on it using 10-20% transparency. This will darken or brighten parts of the image without modifying the image. You can undo with the eraser. Smooth things with gaussian blur, etc. This is great for highlighting people’s eyes, pretty reflections, darkening shadow areas, etc.
Crop: select rectangle, copy, paste as new image, save. Kind of sucks that there is no crop tool but this works just fine.
Sharpening: A neat trick I re-discovered in the Gimp is high-pass sharpening. High pass filtering is about combining a layer with just the outline of the bits that need sharpening with the original photo. This is great for noisy photos since you can edit the layer with the outline independent from the photo, which means that you end up only sharpening the bits that need sharpening. How this works: copy visible, paste as new image, duplicate the layer in the new image, blur (10-20px should do it) the top layer, invert, blend at 50% opacity with the layer below. You should now see a gray image with some lines in there that represent the outlines of whatever is to be sharpened. This is called a high pass. Copy visible, paste as new layer in original image, set the high pass layer’s blend mode to overlay. Observe this sharpens your image, tweak the effect with opacity. If needed manually delete portions from the high pass that you don’t want sharpened. Tweak further with gaussian blur, curves, levels, unsharp mask on the high pass layer. Basically this is a very powerful way of sharpening that gives you a lot more control than normal sharpening filters. But it involves using a lot of Gimp features together. It works especially well on noisy images since you can avoid noise artifacts from being sharpened.
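The high-pass recipe above can be followed step by step in a small Python sketch on one row of brightness values (0 to 1). As before, this is an illustration under assumptions: a box blur replaces the gaussian blur, and overlay uses the common textbook formula rather than whatever the Gimp does internally.

```python
def box_blur(signal, radius):
    """Average each sample with its neighbours within radius (shorter window at edges)."""
    blurred = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        blurred.append(sum(window) / len(window))
    return blurred


def overlay(base, top):
    """Textbook overlay blend on values between 0 and 1."""
    if base < 0.5:
        return 2.0 * base * top
    return 1.0 - 2.0 * (1.0 - base) * (1.0 - top)


def high_pass(signal, radius):
    """Blur, invert, then blend at 50% with the original: mid gray plus the outlines."""
    blurred = box_blur(signal, radius)
    return [0.5 * s + 0.5 * (1.0 - b) for s, b in zip(signal, blurred)]


def high_pass_sharpen(signal, radius):
    """Overlay the high pass layer on the original to sharpen only near edges."""
    return [overlay(s, h) for s, h in zip(signal, high_pass(signal, radius))]


if __name__ == "__main__":
    edge = [0.4] * 5 + [0.6] * 5
    print([round(v, 3) for v in high_pass(edge, 2)])
    print([round(v, 3) for v in high_pass_sharpen(edge, 2)])
```

Away from the edge the high pass is exactly mid gray, and overlaying mid gray leaves a pixel unchanged. That is why you can erase or blur parts of the high pass layer to keep noise in flat areas from being sharpened.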
A lot of these effects you can further enhance by playing with the opacity and applying masks. A key decision is the order in which you do things and what to use as the base for a new layer (either visible, or just the original layer). Of course some of these effects can work against each other or may depend on each other and some effects are more lossy than others. In general, paste as new image and paste as a new layer together with layer blending modes like color, value, or overlay are useful to achieve the semi non destructive editing that you would achieve with adjustment layers in Photoshop. You can save layers in independent files and edit them separately. And of course you don’t want to lose any originals you have.
Also nice to be aware of is that most of the effects above you can accomplish in other software packages as well. In Photoshop, most of the tricks above give you quite a bit more control than the default user friendly tools (at the price of having to fiddle more). Some other tools tend to be a bit underpowered. I’ve tried to do several of these things in paint.net under windows and was always underwhelmed with the performance and quality.
Finally, there exist Gimp plugins and scripts that can do most of the effects listed above. I have very little experience with third party plugins and I am aware of the fact that there are a huge number of plugins for e.g. sharpening and noise. However, most of these plugins just do what you could be doing yourself manually, with much more control and precision. Understanding how to do this can help you use such plugins more effectively.
To be honest, my current workflow is to do as much as possible in Picasa and I only switch to the Gimp when I am really not satisfied with the results in Picasa. Picasa does an OK but not great job. But with hundreds of photos to edit, it is a quick and dirty way to get things done. Once I have a photo in the Gimp, I tend to need quite a bit of time before I am happy with the result. But the point is that quite good results can be achieved with it, if you know what to do. The above listed effects should enable you to address a wide range of issues with photos in the Gimp (or similar tools).
Git and agile
Posted by Jilles in Blog Posts on March 7, 2010
I’ve been working with Subversion since 2004 (we used a pre 1.0 version at GX). I started hearing about git around the 2006-2007 time frame when Linus Torvalds’ replacement for Bitkeeper started maturing enough for other people to use it. I met people working on Maemo (the Debian based OS for the N770, N800, N810, and recently the N900) in Nokia who were really enthusiastic about it in 2008. They had to use it to work with all the upstream projects Maemo depends on and they loved it. When I moved to Berlin everybody there was using subversion so I just conformed and ignored git/mercurial and all those other cool versioning systems out there for an entire year. It turns out that was lost time; I should have switched around 2007/2008. I’m especially annoyed by this because I’ve been aware of decentralized versioning being superior to centralized versioning since 2006. If you don’t believe me, I had a workshop paper at SPLC 2006 on version management and variability management that pointed out the emergence of DVCSes in that context. I’ve wasted at least three years. Ages for the early adopter type guy I still consider myself to be.
Anyway, after weighing the pros and cons for way too long, I switched from subversion to git last week. What triggered me to do this was, oddly, an excellent tutorial on Mercurial by Joel Spolsky. Nothing against Mercurial, but Git has the momentum in my view and it definitely appears to be the bandwagon to be jumping on right now. I don’t see any big technical argument for using Mercurial instead of Git. There’s github and no mercurial hub as far as I know. So, I took Joel’s good advice on Mercurial as a hint that it was time to get off my ass and get more serious about switching to anything other than Subversion. I had already decided in favor of git based on stuff I’ve been reading on both versioning systems.
My colleagues of course haven’t switched (yet, mostly) but that is not an issue with git-svn, which allows me to interface with svn repositories. I’d like to say making the switch was an easy ride, except it wasn’t. The reason is not git but me. Git is a powerful tool that has quite a few more features than Subversion. Martin Fowler has a nice diagram on “recommendability” and “required skill”. Git is in the top right corner (highly recommended but you’ll need to learn some new skills) and Subversion is lower right (recommended, not much skill needed). The good news is that you will need only a small subset of commands to cover the feature set provided by svn and you can gradually expand what you use from there. Even with this small subset git is worth the trouble IMHO, if only because world + dog are switching. The bad news is that you will just have to sit down and spend a few hours learning the basics. I spent a bit more than I planned to on this but in the end I got there.
I should have switched around 2007/2008
The mistake I made that caused me to delay the switch for years was not realizing that git adds loads of value even when your colleagues are not using it: you will be able to collaborate more effectively even if you are the only one using git! There are two parts to my mistake.
The first part is that the whole point of git is branching. You don’t have a working copy, you have a branch. It’s exactly the same with git-svn: you don’t have a svn working copy but a branch forked off svn trunk. So what, you might think. Git excels at merging between branches. With svn, branching and merging is painful, so instead of having branches and merging between them, you avoid conflicts by updating often and committing often. With git-svn, you don’t update from svn trunk, you merge its changes into your local branch. You are working on a branch by default and creating more than one is really not something to be scared of. It is painless, even if you have a large amount of uncommitted work (which would get you in trouble with svn). Even if that work includes renaming the top level directories in your project (I did this). Even if other people are doing big changes in svn trunk. That’s a really valuable feature to have around. It means I can work on big changes to the code without having to worry about upstream svn commits. The type of changes nobody dares to take on because it would be too disruptive to deal with branching and merging and because there are “more important things” to do and we don’t want to “destabilize” trunk. Well, not any more. I can work on changes locally on a git branch for weeks if needed and push them back to trunk when they are ready while at the same time my colleagues and I keep committing big changes on trunk. The reason I’m so annoyed right now is that the time I spent on resolving svn conflicts in the past four years was essentially unnecessary. Not switching four years ago was a big mistake.
The second part of my mistake was assuming I needed IDE support for git to be able to deal with refactoring and particularly class renames (which I do all the time in Eclipse). While there is egit now, it is still pretty immature. It turns out that assuming I needed Eclipse support was a false assumption. If you rename a file in a git repository and commit the change, Git will automatically figure out that the file was renamed; you don’t need to tell git about it. It does this by comparing file contents when it diffs or merges, rather than by recording the rename. A simple “mv foo.java bar.java”, staged and committed like any other change, will work. On directories too. This is a really cool feature. So I can develop in eclipse without it even being aware of any git specifics, refactor and rename as much as I like, and git will keep tracking the changes for me. Even better, certain types of refactorings that are quite tricky with subclipse and subversive just work in git. I’ve corrupted svn work directories on several occasions when trying to rename packages and moving stuff around. Git will handle this effortlessly. Merges work so well because git can handle the situation where a locally renamed file needs changes from upstream merged into it. It’s a core feature, not an argument against using it. My mistake. I probably spent even more time on corrupted svn directories than conflict resolution in the last three years.
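To make the rename story a bit more concrete, here is a small sketch of why git does not need to be told about renames. Git stores file contents as blobs identified by a SHA-1 hash of a short header plus the content, so a file that only changed its name still points at the same blob. The detect_exact_renames helper and the example file names are made up for illustration; real git goes further and also matches files whose contents are merely similar (which is what you need after a class rename, since the class name inside the file changes too).

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object id git assigns to file content: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def detect_exact_renames(before, after):
    """Pair up paths that disappeared with paths that appeared carrying the same blob id.

    Both arguments map path -> blob id. Hypothetical helper: it only covers
    the identical-content case, not git's similarity matching.
    """
    removed = {path: oid for path, oid in before.items() if path not in after}
    added = {path: oid for path, oid in after.items() if path not in before}
    removed_by_id = {oid: path for path, oid in removed.items()}
    return {removed_by_id[oid]: path for path, oid in added.items() if oid in removed_by_id}


source = b"public class Foo {}\n"
before = {"src/foo.java": git_blob_id(source)}
after = {"src/bar.java": git_blob_id(source)}  # same bytes, new name
print(detect_exact_renames(before, after))
```

The point is that the rename is derived from the two snapshots after the fact, which is why plain mv, Eclipse refactorings, or anything else that moves files around works without git specific tooling.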
Git is an Agile enabler
We have plenty of pending big changes and refactorings that we have been delaying because they are disruptive. Git allows me to work on these changes whenever I feel like it without having to finish them before somebody else starts introducing conflicting changes.
This is not just a technical advantage. It is a process advantage as well. Subversion forces you to serialize change so that you minimize the interactions between the changes. That’s another way of saying that subversion is all about waterfall. Git allows you to decouple change instead and parallelize the work more effectively. Think multiple teams working on the same code base on unrelated changes. Don’t believe me? The Linux kernel community has thousands of developers from hundreds of companies working on the same code base touching large portions of the entire source tree. Git is why that works at all and why they push out stable releases every two to three months. Linux kernel development speed is measured in thousands of lines of code modified or added per day. Evaluating the incoming changes every day is a full time job for several people.
Subversion is causing us to delay necessary changes, i.e. changes that we would prefer to do if only it weren’t so disruptive. Delayed changes pile up to become technical debt. Think of git as a tool to manage your technical debt. You can work on business value adding changes (and keep the managers happy) and disruptive changes at the same time without the two interfering. In other words you can be more agile. Agile has always been about technical enablers (refactoring tooling, unit testing frameworks, continuous integration infrastructure, version control, etc) as much as it was about process. Having the infrastructure to do rapid iterations and release frequently is critical to the ability to release every sprint. You can’t do one without the other. Of course, tools don’t fix process problems. But then, process tends to be about workarounds for lacking tools as well. Decentralized version management is another essential tool in this context. You can compensate for not using it with process. IMHO life is too short to play bureaucrat.
Not an easy ride
But as I said, switching from svn to git wasn’t a smooth ride. Getting familiar with the various git commands and how they are different from what I am used to in svn has been taking some time despite the fact that I understand how it works and how I am supposed to use it. I’m a git newbie and I’ve been making lots of beginner’s mistakes (mainly using the wrong git commands for the things I was trying to do). The good news is that I managed to get some pretty big changes committed back to the central svn repository without losing any work (which is the point of version management). The bad news is that I got stuck several times trying to figure out how to rebase properly, how to undo certain changes, and how to recover a messed up checkout on top of my local work directory from the local git repository. In short, I learned a lot on this and I still have some more things to learn. On the other hand, I can track changes from svn trunk, have local topic branches, merge from those to the local git master, and dcommit back to trunk. That about covers all my basic needs.
log4j, maven, surefire, jetty and how to make it work
Posted by Jilles in Blog Posts on February 13, 2010
I spent some time yesterday on making log4j behave. Not for the first time (I gave up on several occasions), and I was getting thoroughly frustrated with how my logs refuse to conform to my log4j configuration, or rather any type of configuration. This time, I believe I succeeded. I know plenty of others must be facing the exact same misery, and most of the information out there is downright misleading in the sense of presenting all types of snake oil solutions that actually don’t change a thing. So here’s a post that offers a proper analysis of the problem and a way out. That, and it’s a nice note to self as well. I just know that I’ll need to set this up again some day.
In a nutshell, the problem is that there are multiple ways of doing logging in Java and one in particular, Apache commons-logging, is misbehaving. This trusty little library has not evolved significantly since about 2006 and is depended on by just about any dependency in the maven repository that does logging, mostly for historical reasons. Some others depend on log4j directly and yet some others depend on slf4j (Simple Logging Facade for Java). Basically, you are extremely likely to have a transitive dependency on all of these and even a few dependencies on JDK logging (introduced in Java 1.4).
The main goal of commons-logging is not having to choose log4j or JDK logging. It acts as a facade and picks one of them using some funky reflection. Nice, but most sysadmins and developers I have worked with seem to favor log4j anyway and hate commons-logging with a passion. In our case, all our projects depend on log4j directly and that’s just the way it is.
One of the nasty things with commons-logging is that it behaves weirdly in complex class loading situations. Like in maven or a typical application server. As a result, it takes over orchestration of the logs for basically the whole application and wrongly assumes that you want to use jdk logging or some log4j configuration buried deep in one of your dependencies. WTF is up with that btw, don’t ship logging configuration with a library. Just don’t.
Symptoms: you configure logger foo in log4j.xml to STFU below ERROR level and when running your maven tests (or even from eclipse) or in your application server it barfs all over the place at INFO level, which is the unconfigured default. To top it off, it does this using an appender you sure as hell did not configure. Double check, yes log4j is on the classpath, it finds your configuration, it even creates the log file you defined an appender for but nothing shows up there. You can find out what log4j is up to with -Dlog4j.debug=true. So, log4j is there, configured and all but commons-logging is trying to be smart and is redirecting all logged stuff, including the stuff actually logged with log4j directly, to the wrong place. To add to your misery, you might have partly succeeded with your attempts to get log4j working so now you have stuff from different log libraries showing up in the console.
A decent workaround in that case is to define a file appender, which will be free from non log4j stuff. This more or less is the advice for production deployments: don’t use a console logger because dependencies are prone to hijacking the console for all sorts of purposes.
So, good advice, but less than satisfactory. To fix the problem properly, make sure you don’t have commons-logging on the classpath. At all. This will break all the stuff that depends on it being there. Fix that by using slf4j instead. Slf4j comes in several maven modules. I used the following ones:
- jcl-over-slf4j is a drop-in, API compatible replacement for commons logging. It writes messages logged through commons-logging using slf4j, which is similar to commons-logging but behaves much nicer (i.e. it actually works). It’s designed to fix the problem we are dealing with here. The only reason it exists is because commons-logging is hopelessly broken.
- slf4j-api is used by dependencies already depending on slf4j
- slf4j-log4j12 the backend for log4j. If this is on the classpath slf4j will use log4j for its output. You want this.
That’s it. Here’s what I had to do to get a properly working configuration:
- Use mvn dependency:tree to find out which dependencies are transitively/directly depending on commons-logging.
- Fix all of these dependencies with an exclusion:
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
  </exclusions>
- You might have to iterate fixing the dependencies and rerunning mvn dependency:tree since only the first instance of commons-logging found will be used transitively.
- Now add these dependencies to your pom.xml:
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.5.10</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>jcl-over-slf4j</artifactId>
    <version>1.5.10</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.5.10</version>
  </dependency>
- Maven plugins have their own dependencies, separately from your normal dependencies. Make sure that you add the three slf4j dependencies to surefire, jetty, and other relevant plugins. At least jetty seems to already depend on slf4j.
- Finally make sure that your plugins have system properties defining log4j.configuration=file:[log4j config location]. Most of the googled advice on this topic covers this (and not much else). Some plugins can be a bit hard to configure due to the fact that they fork off separate processes.
That should do the trick, assuming you have log4j on the classpath of course.
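Hunting through mvn dependency:tree output by hand gets old quickly when you have to repeat it after each exclusion. Below is a small sketch of a script that reads the text output and prints the chain of dependencies leading to commons-logging, so you know where the exclusion has to go. It assumes the usual plain text layout (an optional “[INFO] ” prefix and three characters of tree drawing per nesting level); the paths_to function and the sample coordinates are made up for this example.

```python
import re

# Optional "[INFO] " prefix, then the tree drawing characters, then a coordinate with colons.
TREE_LINE = re.compile(r"^(?:\[INFO\] )?([|+\\\- ]*)(\S+:\S+)\s*$")


def paths_to(tree_output, group_and_artifact):
    """Return the dependency chains (root first) that end in the given groupId:artifactId."""
    paths = []
    ancestors = []
    for line in tree_output.splitlines():
        match = TREE_LINE.match(line)
        if not match:
            continue  # banner lines, blank lines, build summary, etc.
        prefix, coordinate = match.groups()
        depth = len(prefix) // 3  # each nesting level is drawn with three characters
        del ancestors[depth:]
        ancestors.append(coordinate)
        if coordinate.startswith(group_and_artifact + ":"):
            paths.append(list(ancestors))
    return paths


# Made-up sample in the shape that mvn dependency:tree prints.
SAMPLE = r"""
[INFO] Scanning for projects...
[INFO] com.example:app:jar:1.0
[INFO] +- org.springframework:spring-core:jar:2.5.6:compile
[INFO] |  \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] \- log4j:log4j:jar:1.2.15:compile
"""

for path in paths_to(SAMPLE, "commons-logging:commons-logging"):
    print(" -> ".join(path))
```

In this sample the exclusion belongs on the spring-core dependency. After adding it, rerun mvn dependency:tree and the script until nothing is reported any more.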
Missing the point
Posted by Jilles in Blog Posts on February 10, 2010
Like most of you (probably), I’ve been reading the news around Google Buzz with interest. At this point, the regular as clockwork announcements from Google are treated somewhat routinely by the various technology blogs. Google announced foo, competitor bar says this and expert John Doe says that. Bla bla bla, revolutionary, bla bla similar to bla, bla. Etc. You might be tempted to dismiss Buzz as yet another Google service doomed to be ignored by most users. And you’d be right. Except it’s easy to forget that most of those announcements actually do have some substance. Sure, there have been a few less than exciting ones lately and not everything Google touches turns into gold but there is some genuinely cool stuff being pushed out into the world from Mountain View on a monthly, if not more frequent, basis.
So this week it’s Google Buzz. Personally, I think Buzz won’t last. At least not in its current gmail centric form. Focusing on Buzz is missing the point however. It will have a lasting effect similar to what happened with RSS a few years back. The reason is very simple, Google is big enough to cause everybody else to implement their APIs, even if buzz is not going to be a huge success. They showed this with open social, which world + dog now implements, despite it being very unsuccessful in user space. Google wave, same thing so far. The net effect of Buzz and the APIs that come with it will be internet wide endorsement of a new real time notification protocol, pubsubhubbub. In effect this will take twitter (already an implementer) to the next level. Think pubsubhubbub sinks and sources all over the internet and absolutely massive traffic between those sources and sinks. Every little internet site will be able to notify the world of whatever updates it has, every person on the internet will be able to subscribe to such notifications directly, or more importantly, indirectly to whichever other websites choose to consume, funnel and filter those notifications on their behalf. It’s so easy to implement that few will resist the temptation to do so.
Buzz is merely the first large scale consumer of pubsubhubbub notifications. Friendfeed tried something similar with RSS, was bought by Facebook and successfully eliminated as a Facebook competitor. However, Pubsubhubbub is the one protocol that Facebook won’t be able to ignore. For now they seem to stick with their closed everything model. This means there is Facebook and the rest of the world and well guarded boundaries between those. As the rest of the world becomes more interesting in terms of notifications, keeping Facebook isolated as it is today will become harder. Technically, there are no obstacles. The only reason Facebook is isolated is because it chooses to be isolated. Anybody who is not Facebook has a stake in committing to pubsubhubbub to be able to compete with Facebook. So Facebook becoming a consumer of pubsubhubbub type notifications is a matter of time, if only because it will simply be the easiest way for them to syndicate third party notifications (which is their core business). I’d be very surprised if they hadn’t got something implemented already. Facebook becoming a source of notifications is a different matter though. The beauty of the whole thing is that the more notifications originate outside of Facebook, the less this will matter. Already some of their status updates are simply syndicated from elsewhere (e.g. mine go through Twitter). Facebook is merely a place people go to see an aggregated view on what their friends do. It is not a major source of information, and ironically the limitations imposed by Facebook make it less competitive as such.
So, those dismissing Buzz for whatever reason are missing the point: it’s the APIs, stupid! Open APIs, unrestricted syndication and aggregation of notifications, events, status updates, etc. It’s been talked about for ages, and it’s about to happen in the next few months. First to catch up will be those little social network sites that almost nobody uses but collectively are used by everybody. Hook them up to buzz, twitter, etc. Result: more detailed event streams popping up outside of Facebook. Eventually people will start hooking up Facebook as well, with or without the help of Facebook. By this time endorsement will seem like a good survival strategy for Facebook.
CouchDB
Posted by Jilles in Blog Posts on January 15, 2010
We did a little exercise at work to come up with a plan to scale to absolutely massive levels. Not an entirely academic problem where I work. One of the options I am (strongly) in favor of is using something like couchdb to scale out. I was aware of couchdb before this but over the past few days I have learned quite a bit more about it and am now even more convinced that couchdb is a perfect match for our needs. For obvious reasons I can’t dive into what we want to do with it exactly. But of course itemizing what I like in couchdb should give you a strong indication that it involves shitloads (think hundreds of millions) of data items served up to shitloads of users (likewise). Not unheard of in this day and age (e.g. Facebook, Google). But also not something any off the shelf solution is going to handle just like that.
Or so I thought …
The couchdb wiki has a nice two line description:
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.
This is not the whole story but it gives a strong indication that quite a lot is being claimed here. So, let’s dig into the details a bit.
Document oriented and schema less storage. CouchDB stores json documents. So, a document is nothing more than a JSON data structure. Fair enough. No schemas to worry about, just data. A tree with nodes, attributes and values. Up to you to determine what goes in the tree.
Conflict resolution. It has special attributes for the identity and revision of a document and some other couchdb stuff. Both id and revision are globally unique uuids. UPDATE: revision is not a uuid (thanks Matt). That means that any document stored in any instance of couchdb anywhere on this planet is uniquely identifiable and that any revision of such a document in any instance of couchdb is also uniquely identifiable. Any conflicts are easily identified by simply examining the id and revision attributes. A (simple) conflict resolution mechanism is part of couchdb. Simple but effective for simple day to day replication.
Robust incremental replication. Two couchdb nodes can replicate to each other. Since documents are globally unique, it is easy to figure out which document is on which node. Additionally, the revision id allows couchdb to figure out what the correct revision is. Should you be so unlucky as to have conflicting changes on both nodes, there are ways of dealing with conflict resolution as well. What this means is that any node can replicate to any other node. All it takes is bandwidth and time. It’s bidirectional so you can have a master-master setup where both nodes consume writes and propagate changes to each other. The couchdb developers use the concept of “eventual consistency” to emphasize the fact that a network of couchdb nodes replicating to each other will eventually have the same data and be consistent with each other, regardless of the size of the network or how out of sync the nodes are at the beginning.
Fault tolerant. Couchdb uses a file as its datastore. Any write to a couchdb instance appends stuff to this file. Data already in the file is never overwritten. That’s why it is fault tolerant. The only part of the file that can possibly get corrupted is the end of the file, which is easily detected (on startup). Aside from that, couchdb is rock solid and guaranteed to never touch your data once it has been committed to disk. New revisions don’t overwrite old ones, they are simply appended (in full) to the end of the file with a new revision id. You. Never. Overwrite. Existing. Data. Ever. Fair enough, it doesn’t get more robust than that. Allegedly, kill -9 is a supported shutdown mechanism.
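To illustrate why append only storage is so forgiving, here is a toy store that follows the same rules: every write goes to the end of the file, old revisions stay where they are, and on startup a half-written record at the end is detected and cut off. This is only a sketch of the idea and not how couchdb actually lays out its file; the one JSON record per line format, the AppendOnlyStore class, and the integer revision counter are all assumptions made for this example.

```python
import json
import os


class AppendOnlyStore:
    """Toy document store: every write is appended, committed data is never overwritten."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # doc id -> (byte offset, revision number) of the newest revision
        self._recover()

    def _recover(self):
        """Rebuild the index on startup and drop a half-written record at the end, if any."""
        if not os.path.exists(self.path):
            return
        good_end = 0
        with open(self.path, "rb") as f:
            for raw in f:
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # torn write: only the tail of the file can look like this
                if not raw.endswith(b"\n"):
                    break
                self.index[record["id"]] = (good_end, record["rev"])
                good_end += len(raw)
        if good_end < os.path.getsize(self.path):
            with open(self.path, "r+b") as f:
                f.truncate(good_end)

    def put(self, doc_id, doc):
        """Append a full new revision of the document and return its revision number."""
        rev = self.index.get(doc_id, (0, 0))[1] + 1
        line = json.dumps({"id": doc_id, "rev": rev, "doc": doc}).encode("utf-8") + b"\n"
        offset = os.path.getsize(self.path) if os.path.exists(self.path) else 0
        with open(self.path, "ab") as f:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())
        self.index[doc_id] = (offset, rev)
        return rev

    def get(self, doc_id):
        """Read the newest revision of a document."""
        offset, _ = self.index[doc_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return json.loads(f.readline())["doc"]
```

Killing the process in the middle of a put leaves at most one broken record at the end of the file. Everything before it is intact, which is the whole trick.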
Cleanup by replicating. Because it is append only, a lot of cruft can accumulate in the bit of the file that is never touched again. Solution: add an empty node, tell the others to replicate to it. Once they are done replicating, you have a clean node and you can start cleaning up the old ones. Easy to automate. Data store cleanup is not an issue. Update. As Jan and Matt point out in the comments, you can use a compact function, which would be a bit more efficient.
Restful. CouchDB’s native protocol is REST operations over HTTP. This means several things. First of all, there are no dedicated binary protocols, couchdb clients, drivers, etc. Instead you use normal REST and service related tooling to access couchdb. This is good because this is exactly what has made the internet work for all these years. Need caching? Pick your favorite caching proxy. Need load balancing? Same thing. Need access from language x on platform y? If it came with http support you are ready to roll.
Incremental map reduce. Map reduce is easy to explain if you understand functional programming. If you’re not familiar with that, it’s a divide and conquer type strategy to calculate stuff concurrently from lists of items. Very long lists with millions/billions of items. How it works is as follows: the list is chopped into chunks. The chunks are processed concurrently in a (large) cluster to calculate something. This is called the map phase. Then the results are combined by collecting the results from processing each of the chunks. This is called the reduce phase. Basically, this is what Google uses to calculate e.g. pagerank and many thousands of other things on their local copy of the web (which they populate by crawling the web regularly). CouchDB uses the same strategy as a generic querying mechanism. You define map and reduce functions in Javascript and couchdb takes care of applying them to the documents in its store. Moreover, it is incremental. So if you have n documents and those have been map reduced and you add another document, it basically incrementally calculates the map reduce stuff. I.e. it catches up real quick. Using this feature you can define views and query simply by accessing the views. The views are calculated on write (Update: actually it’s on read), so accessing a view is cheap whereas writing involves the cost of storing and the background task of updating all the relevant views, which you control yourself by writing good map reduce functions. It’s concurrent, so you can simply add nodes to scale. You can use views to index specific attributes, run clustering algorithms, implement join like query views, etc. Anything goes here. MS at one point had an experimental query optimizer backend for ms sql that was implemented using map reduce. Think expensive datamining SQL queries running as map reduce jobs on a generic map reduce cluster.
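As an illustration of the “no drivers needed” point, here is a sketch that talks to couchdb with nothing but the Python standard library. Storing a document is a PUT of a JSON body to /database/document-id and reading it back is a GET on the same URL. The base URL (couchdb’s default port on localhost), the database name, and the function names are assumptions for this example; I have not run this against our setup.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: a local couchdb on its default port


def document_url(db, doc_id, base_url=BASE_URL):
    """Build the URL of a document; the id is escaped so odd characters survive."""
    return f"{base_url}/{urllib.parse.quote(db, safe='')}/{urllib.parse.quote(doc_id, safe='')}"


def put_document_request(db, doc_id, doc, base_url=BASE_URL):
    """Prepare a PUT that stores the given dict as a JSON document."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        document_url(db, doc_id, base_url),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def get_document_request(db, doc_id, base_url=BASE_URL):
    """Prepare a GET that fetches the current revision of a document."""
    return urllib.request.Request(document_url(db, doc_id, base_url), method="GET")


def send(request):
    """Send a prepared request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = put_document_request("blog", "couchdb-post", {"title": "CouchDB", "tags": ["nosql"]})
print(request.get_method(), request.full_url)
```

Because it is plain HTTP, the same requests can go through a caching proxy or a load balancer without couchdb or the client knowing about it.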
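The incremental part is easier to see in code. Below is a toy view that keeps the rows emitted by the map function per document, so that adding or changing one document only touches the keys that document emits instead of recalculating everything. Couchdb does this with Javascript functions and a much smarter on-disk index; the IncrementalView class and the tag counting example are made up for illustration.

```python
from collections import defaultdict


class IncrementalView:
    """Toy map reduce view that only recomputes the keys touched by a changed document."""

    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn          # doc -> iterable of (key, value) pairs
        self.reduce_fn = reduce_fn    # list of values -> reduced value
        self.rows = defaultdict(dict)  # key -> {doc id -> [values emitted by that doc]}
        self.keys_by_doc = {}          # doc id -> keys it emitted last time
        self.reduced = {}              # key -> reduced value, i.e. the queryable view

    def update(self, doc_id, doc):
        """Apply the map function to one new or changed document and refresh affected keys."""
        touched = set(self.keys_by_doc.get(doc_id, ()))
        for key in touched:
            del self.rows[key][doc_id]  # forget what the old revision emitted
        new_keys = set()
        for key, value in self.map_fn(doc):
            self.rows[key].setdefault(doc_id, []).append(value)
            new_keys.add(key)
        self.keys_by_doc[doc_id] = new_keys
        for key in touched | new_keys:
            values = [v for per_doc in self.rows[key].values() for v in per_doc]
            if values:
                self.reduced[key] = self.reduce_fn(values)
            else:
                self.reduced.pop(key, None)
                del self.rows[key]


def tags_map(doc):
    """Emit one (tag, 1) row per tag, like a typical count-by-tag view."""
    for tag in doc.get("tags", []):
        yield tag, 1


view = IncrementalView(tags_map, sum)
view.update("post-1", {"tags": ["git", "svn"]})
view.update("post-2", {"tags": ["git"]})
print(view.reduced)
```

Querying is just a lookup in view.reduced, which is why reads are cheap once the view has caught up.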
It’s fast. It is implemented in erlang which is a language that is designed from the ground up to scale on massively parallel systems. It’s a bit of a weird language but one with a long and very solid track record in high performance, high throughput type systems. Additionally, couchdb’s append only and lock free files are wickedly fast. Basically, the primary bottleneck is the available IO to disk. Couchdb developers are actually claiming sustained write throughput that is above 80% of the IO bandwidth to disk. Add nodes to scale out.
So couchdb is an extremely scalable & fast storage system for documents that provides incremental map reduce for querying and mining the data; http based access and replication; and a robust append only, overwrite never, and lock free storage.
Is that all?
No.
Meebo decided that this was all nice and dandy but they needed to partition and shard their data instead of having all their data in every couchdb node. So they came up with CouchDB Lounge. Basically what couchdb lounge does is enabled by the REST like nature of couchdb. It’s a simple set of scripts on top of nginx (a popular http proxy) and the python twisted framework (a popular IO oriented framework for python) that dynamically routes HTTP messages to the right couchdb node. Each node hosts not one but several (configurable) couchdb shards. As the shards fill up, new nodes can be added and the existing shards are redistributed among them. Each shard calculates its map reduce views, and the scripts in front of the nodes take care of reducing these views across all nodes to a coherent ‘global’ view. I.e. from the outside world, a couchdb lounge cluster looks just like any other couchdb node. Its sharded nature is completely transparent. Except it is effectively infinitely scalable both in the number of documents it can store as well as in the read/write throughput. Couchdb lounge looks just like any other couchdb instance in the sense that you can run the full test suite that comes with couchdb against it and it will basically pass all tests. There’s no difference from a functional perspective.
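A rough sketch of the routing idea, as I understand it: hash the document id to pick a shard, look up which node currently hosts that shard, and combine the per shard view results into one global result. I don’t know which hash function lounge really uses, so the CRC32 below, the node names, and all function names are assumptions made for this example.

```python
import zlib


def shard_for(doc_id, shard_count):
    """Pick a shard from the document id; the number of shards stays fixed."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


def node_for(doc_id, shard_map):
    """shard_map[i] is the node that currently hosts shard i."""
    return shard_map[shard_for(doc_id, len(shard_map))]


def merge_views(per_shard_views, rereduce):
    """Combine the reduced view of every shard into one 'global' view."""
    grouped = {}
    for view in per_shard_views:
        for key, value in view.items():
            grouped.setdefault(key, []).append(value)
    return {key: rereduce(values) for key, values in grouped.items()}


# Four shards on one node; later two of them move to a second node.
before = ["node-1", "node-1", "node-1", "node-1"]
after = ["node-1", "node-1", "node-2", "node-2"]
doc_id = "user-42"
print(shard_for(doc_id, 4), node_for(doc_id, before), node_for(doc_id, after))
print(merge_views([{"git": 2}, {"git": 1, "svn": 1}], sum))
```

The document to shard mapping never changes; only the shard to node table does. That is what makes it possible to add nodes and move whole shards around without rehashing any documents.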
So, couchdb with couchdb lounge provides an off the shelf solution for storing, accessing and querying shitloads of documents. Precisely what we need. If shitloads of users come that need access, we can give them all the throughput they could possibly need by throwing more hardware in the mix. If shitloads is redefined to mean billions instead of millions, same solution. I’m sold. I want to get my hands dirty now. I’m totally sick and tired of having to deal with retarded ORM solutions that are neither easy, scalable, fast, robust, or even remotely convenient. I have some smart colleagues who are specialized in this stuff and way more who are not. The net result is a data layer that requires constant fire fighting to stay operational. The non experts routinely do stuff they shouldn’t be doing that then requires lots of magic from our DB & ORM gurus. And to be fair, I’m not an expert. CouchDB is so close to being a silver bullet here that you’d have to be a fool to ignore the voices telling you that it is all too good to be true. But then again, I’ve been looking for flaws and so far have not come up with something substantial.
Sure, I have lots of unanswered questions and I’m hardly a couchdb expert since technically, any newbie with more than an hour of experience coding stuff for the thing outranks me here. But if you put it all together you have an easy to understand storage solution that is used successfully by others in rather large deployments that seem to be doing quite well. If there are any limits in terms of the number of nodes, the number of documents, or indeed the read/write throughput, I’ve yet to identify them. All the available documentation seems to suggest that there are no such limits, by design.
Some good links:
- The couchdb main site at apache
- Cool presentations at the recent nosql conference in Berlin, including one by a Berlin local, Jan Lehnardt, who is a director at CouchIO, a consulting company around CouchDB
- Another talk at Google (with slides) by Chris Anderson, who is “an Apache CouchDB committer and co-author of the forthcoming O’Reilly book “CouchDB: The Definitive Guide””
- Planet couchdb is where you get all your couchdb news.
- CouchDB Lounge main site.
- CouchDB-python for if you prefer using python over javascript.
- Couchdb4j if you prefer Java.
- Or jrelax.
- Or jcouchdb.
OVI Prime Place and Happy Holidays!
Posted by Jilles in Blog Posts on December 22, 2009
Wow another busy month without me posting. So, here’s what will probably be the last post of this year.
In the past few weeks we rolled out OVI Prime Place, http://www.primeplace.ovi.com, which people may use to register their businesses on OVI maps. Check out the demo video here (click the arrows thingy for full screen).
Ovi Prime Place Introduction from beatbandit on Vimeo.
Anyway, busy few weeks fixing bugs, getting the release out and using the ops induced bureaucracy & downtime for some major refactoring for the next release. All more or less done now and quite ready for a vacation.
Happy Holidays & 2010.
