Thursday, December 17, 2009

Deleting thumbnails for nonexistent photos

Freedesktop has had a spec for some years on how applications should manage image thumbnails (use the Next link there to browse it). The spec is now followed by the majority of Gnome and KDE applications, including F-Spot, which is one of the very few applications that use large 256x256 thumbnails under ~/.thumbnails/large.

The spec says to store thumbnails in PNG format, naming each file after the MD5 sum of the original file's URL, eg 81347ce6c37f75513c5e517e5b1895b8.png.

The problem with the spec is that if you delete or move image files, the thumbnails stay there and take up space (for my 20000+ photos I have 1.4 GB of large thumbnails).

Fortunately, you can clean them up from time to time with simple command-line tricks, as the original URLs are stored inside the thumbnail files as Thumb::URI attributes. I don't recommend erasing all of your thumbnails, because regeneration will take time.

To create a list of thumbnail/original-URL pairs, run the following in a terminal inside either the .thumbnails/large or the .thumbnails/normal directory (it will take some time):

for i in *.png; do
  identify -verbose "$i" | \
    fgrep Thumb::URI | sed "s@.*Thumb::URI:@$i@" >> uris.txt;
done
This will get you a uris.txt file, where each line looks like the following:
f78c63184b17981fddce24741c7ebd06.png file:///home/user/Photos/2009/IMG_5887.CR2
Note that the thumbnail filenames (the first token on each line) can also be generated from the URLs (the second token) using MD5 hashes, in the following way:
echo -n file:///home/user/Photos/2009/IMG_5887.CR2 | md5sum
After you have your uris.txt file, it can be easily processed with any familiar command-line tools, like grep, sed, awk, etc.

For example, in order to delete all thumbnails matching 'Africa', use the following:
for i in $(fgrep Africa uris.txt | awk '{print $1}'); do rm "$i"; done
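Similarly, to actually delete the thumbnails whose original files no longer exist, something like the following minimal sketch can be used (it assumes plain file:// URLs without spaces or %-encoding):
while read thumb uri; do
  path="${uri#file://}"
  [ -e "$path" ] || rm "$thumb"
done < uris.txt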
So, as you can see, it is pretty simple to free a few hundred megabytes (depending on the number of thumbnails you are deleting).
With this kind of trick you can even rename the thumbnails of moved files, using md5sum to generate the new filenames from the new URLs as shown above. This will save you regeneration time.
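For example, here is a minimal sketch of re-pointing a single thumbnail after a photo has been moved (the URLs are just placeholders):
old_url=file:///home/user/Photos/2009/IMG_5887.CR2
new_url=file:///home/user/Archive/2009/IMG_5887.CR2
mv "$(echo -n "$old_url" | md5sum | cut -d' ' -f1).png" \
   "$(echo -n "$new_url" | md5sum | cut -d' ' -f1).png"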


Wednesday, August 26, 2009

Announcing F-Spot Live Web Gallery extension

I am happy to announce a new extension for F-Spot, the popular Linux photo management application - LiveWebGallery. Once installed, invoke it from the Tools menu in F-Spot's main window.

The extension contains a minimal web server implementation that serves your gallery over HTTP, so it can be viewed with any web browser, even on Mac and Windows. Now you can easily share your photos with family, friends, and colleagues, no matter what operating system and software they use, with just a few mouse clicks in F-Spot. The only requirement is that they are on the same network, or can reach your machine's IP address in some other way.

As you can see in the screenshot, you can choose whether to share photos with a particular tag, the current view in F-Spot (which allows you to create an arbitrary query), or the currently selected photos.

To activate the gallery (start the embedded lightweight web server), just click the activate button in the top-right corner. On activation, the URL of the web gallery will appear, allowing you either to open it yourself or to copy the link and give it to other viewers.

After that, all the options can still be changed in the dialog; the changes will affect all new viewers and those pressing the browser's reload button.

Most of us already know that many pictures are rarely viewed after they are taken (Point, Shoot, Kiss It Goodbye). F-Spot tries to fix this with its very powerful tagging features - tags make it much easier to find photos taken long ago. This, however, is no magic - how easily you find the right photos when you need them depends on how well you tag. This extension makes tagging even more useful, because other people can now help you with the most difficult part: tagging properly can be a lot of work, and with this extension you can delegate some of it to other people! The gallery is not read-only - if you choose so, an editable tag can be selected, and viewers can add/remove this tag on photos (currently only in the full photo view). This is especially useful for letting other people tag themselves in your library. For security reasons, editing is disabled by default.

As time goes by, many more features can be added to the Live Web Gallery extension, especially around editing photo metadata (tagging, editing descriptions, flagging for deletion).

As far as I know, being able to share your photos on the local network without any software or OS requirements is a unique feature of F-Spot now. No other photo management application can do this to date.

Downloading

The source code is on Gitorious, in the live_web_gallery branch (until it has been merged into the mainline).

To install, use the Edit->Manage Extensions menu in F-Spot, click Install Add-ins and then Refresh. After that, LiveWebGallery should be available under the Tools category.

Alternatively, you can download the precompiled binary and put it into:
~/.config/f-spot/addins or /usr/lib/f-spot/extensions

Note: F-Spot 0.6 is required for the extension to work. You can already find deb/rpm packages of F-Spot 0.6 or 0.6.1 for most distributions, and it will be included in the upcoming distro releases this autumn.

Hopefully, the extension will later be distributed with newer versions of F-Spot by default.

Enjoy! Comments are welcome!


Monday, June 1, 2009

Database Refactoring

A couple of months ago I gave a short keynote titled Dinosaur Strategies: How Can Data Professionals Still Prosper in Modern Organisations, inspired by Scott Ambler's joke on the fictional Waterfall 2006 conference website.


I primarily deal with the 'application' aspects of software development using Agile practices, so I have a hard time understanding how some Data Professionals can be so far behind in their evolution, not doing basic things like iterative development, unit tests, continuous integration, etc.

Last week I was asked to give a talk on Database Refactoring. The topic seemed challenging enough, and as no Database Professionals cared to take it, I decided to give it a try. The result is a motivational speech for database developers as well as everyone else involved in the software development process.

I discussed the cultural conflict between database and OOP developers, the problem that refactoring tools available to relational database developers are lagging behind, and some practices that can help before better tools become available:

(1) Development Sandboxes
(2) Regression Testing
(3) Automatic Changelog, Delta scripts
(4) Proper Versioning
(5) Continuous integration
(6) Teamwork & Cultural Changes

Other topics discussed include Refactoring of Stored Code vs Database Schema, Agile Reality, Overspecialization (o16n), the Database not being under control, Database Smells, Fear of Change, Scenarios, Dealing with Coupling, Dealing with unknown applications, Proper versioning, Continuous Integration using sandboxes, and Delta Scripts (Migrations), which make evolutionary database schema changes possible.
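To illustrate the last point, here is a minimal sketch of applying numbered delta scripts exactly once; it assumes a PostgreSQL database named appdb, a migrations/ directory with files like 001.sql, and a changelog table with a single script_id column (all of these names are illustrative):
# apply each delta script that is not yet recorded in the changelog table
for f in migrations/*.sql; do
  id=$(basename "$f" .sql)
  applied=$(psql -d appdb -tA -c "select count(*) from changelog where script_id = '$id'")
  if [ "$applied" = "0" ]; then
    psql -d appdb -f "$f" && psql -d appdb -c "insert into changelog (script_id) values ('$id')"
  fi
done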

The dinosaurs below are a reminder of my previous keynote mentioned above. They come from the very nice Dinosaurs Song, available on YouTube, which I actually played after the keynote itself.

Below are the full slides of the Database Refactoring talk.


Sunday, May 17, 2009

Versioning your home directory or documents with Git

Git is a relatively new Version Control System, initially started by Linus Torvalds in order to manage the source code of the Linux kernel.

Although Randal Schwartz has stated that Git was not designed to version your home directory, it seems that many people are now trying to do so :-)

Some people have used CVS or Subversion for this purpose in the past, but to my mind, Git is better suited to this task for several reasons:

  • Git is grep-friendly (it stores its metadata only in a single .git directory at the root of the working copy)
  • It is very easy to work with a local repository (just do git init and you're ready)
  • Git stores changes very efficiently (even binary files), so not much disk space is wasted, but don't forget to call git gc from time to time
  • Git repository is always available on your computer, even when you are offline, but on the other hand it is very easy to push your changes to a remote repository as well
All these things are much worse with CVS, which spams all versioned directories with CVS subdirs and stores each version of binary files fully. Subversion also requires more effort to set up, is less storage-efficient, and puts .svn subdirs everywhere.

Having said that, my setup is ultra-simple compared to others on the net!

To start versioning your home directory, just run this in the root of your home:
git init
This will initialize an empty local Git repository in ~/.git/ - this is the location that you can use when doing backups, but otherwise you shouldn't care about it anymore.

Then you need to tell Git to track your important files:
git add Documents
git add bin
git add whatever else you want to version
git commit -m "Adding initial files"
Then you can work normally with your tracked files and occasionally commit your changes to the repository with
git commit -a -m "description of changes you have done"
Note the "-a" above, that means to commit any changes made to any previously tracked files, so you don't have to use git add again. But don't forget to git add any new files you create before committing.

Use git status to show what files were changed since your last commit. Unfortunately, it will also list all untracked files in your home directory, so you may need to create a .gitignore file. You can get the initial version of this file using this command:
git status | awk '/#/ {sub("/$", ""); print $2}' > .gitignore
Then edit it, possibly replacing some full names with '*' patterns (see the example below). Don't forget to git add and git commit this file as well!
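For example, after some editing, a home-directory .gitignore might look roughly like this (the entries are purely illustrative):
.cache
.thumbnails
.mozilla
Downloads
*.tmp
*~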

That's basically it! You may also try the GUI tools provided by Git, eg gitk or git gui, to browse your history and make some changes if you can't remember the commands.

Moreover, I have some ideas on how to make all this more automatic that I am going to try later:
  • Put git commit -a into your crontab in order to commit changes automatically, eg daily (see the sketch after this list)
  • Create a couple of nautilus scripts (located in ~/.gnome2/nautilus-scripts) to make adding, committing and other actions available directly from the Nautilus file manager in Gnome.
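For instance, the daily auto-commit could be a single crontab entry like this minimal sketch (edit with crontab -e; the home path is just an example):
55 23 * * * cd /home/user && git commit -a -m "automatic daily commit" >/dev/null 2>&1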
Happy versioning! And read the Git tutorial, either with man gittutorial or on the official site.


Sunday, April 26, 2009

Excessive memory usage by Oracle driver solved

In my day job I deal with Internet banking. The Internet bank is a relatively large, high-load Java/Spring/Hibernate web application that uses Oracle databases.

During our recent transition from a centralized data accessor (VJDBC) to local JDBC connection pools, made to reduce data roundtrip times, we started having issues with memory usage in our application servers: some requests started allocating tens to hundreds of megabytes of memory. While the garbage collector was successfully reclaiming all this memory afterwards (no memory leaks), it still meant high peak memory usage as well as too frequent collections, which also affected the overall performance.

While profiling memory allocations with JProfiler, I discovered that OracleStatement.prepareAccessors() was responsible for these monstrous allocations (up to 600 MB at once, mostly in giant char or byte arrays). Google pointed me to this nice article on reducing the default prefetch size, which describes a very similar situation; however, those people had problems with queries returning LOBs. We weren't using any LOBs in our problematic queries and hadn't knowingly modified the defaultRowPrefetch connection property.

Further investigation led to the way we were using Hibernate: for some queries that were expected to return large result sets, we were using the Query.setFetchSize() or Criteria.setFetchSize() methods with rather high values (eg 5000). This seemed reasonable, because we were also using the setMaxResults() method with the same value to limit the maximum length of the returned ResultSet. However, after some upgrades of Java, Hibernate, and the Oracle driver, this started having these memory allocation side effects. It seems that Hibernate now translates the fetchSize parameter directly to OracleStatement's rowPrefetch value, forcing it to instantly allocate an array of size rowPrefetch*expectedRowSize even before it runs the actual query. This array can be ridiculously large, even if the actual query returns only a few rows afterwards. Later investigation showed that having the batch-size attribute in the Hibernate mapping files (hbm.xml) has exactly the same effect and also results in giant pre-allocations.

As a result, we had to review all the batch-size and setFetchSize() values that we were using in our Hibernate queries and mappings, in most cases reducing them significantly. This would reduce the worst-case performance of some long queries (they would require more roundtrips to the database), but it would also reduce the overall amount of garbage accumulating in the heap and thus the frequency of garbage collections, which would have a positive impact on CPU load. Shorter results would run equally fast, so it actually makes sense to rely on average statistics of the actual responses when choosing optimal rowPrefetch values. The default value is 10, which is hardcoded in the Oracle driver.

For longer queries, the above-mentioned article proposes geometrically increasing the rowPrefetch (manually setting it twice as big for each subsequent fetch). This is a nice idea, but I wonder why the Oracle driver can't do this automatically; after all, this is how Java collections behave when they resize themselves. I haven't tried doing this with Hibernate yet, but I think it should be possible, especially if you use Query.scroll() instead of Query.list().


Sunday, March 8, 2009

Exchange calendar sync to iPhone through Evolution

Finally, I have found an easy and transparent way of syncing my corporate calendar (on the evil MS Exchange server, of course) to my iPhone over the air, without any manual work involved. The same recipe can actually work for other mobile phones as well - read on!

The current syncing path is as follows:
Exchange → Evolution → (gcaldaemon) → Google Calendar → (Google Sync) → iPhone

If it looks long, don't worry - it is not so difficult in reality.

  1. Exchange - this is where the calendar is stored; very often it sits behind a firewall, where it cannot be reached directly from a mobile phone.
  2. Exchange is accessed using the Evolution Exchange connector (poor Outlook users may find this helpful).
  3. gcaldaemon is an open-source tool for doing various interesting tricks with Google Calendar, including syncing it with Evolution, see below for details.
  4. Google Calendar is a full-featured web-based calendar, where you can store several calendars. It is especially convenient together with GMail, but doesn't require it.
  5. Google Sync is a new service from Google that can sync your Google Calendar (and contacts from GMail) to various programs and mobile devices.
  6. iPhone is where you get both your personal and corporate calendars, always with you :-)
So the key piece here is Google Calendar. As it is quite popular, it is already supported by a lot of software, so you can use it as a middleman in various syncing situations, not only the one described here.

I started by testing Google Sync from Google Calendar to my iPhone. Google had the nice idea of implementing the Exchange ActiveSync protocol, which is already supported by the iPhone (probably there are lots of Google employees using iPhones). You just need to set up an Exchange account on your phone and configure it to talk to m.google.com instead of an Exchange server. This is another brilliant move after GMail started talking IMAP natively. Follow the instructions on the Google website linked above. As a bonus, you will get 2-way syncing of GMail contacts as well if you want. And everything works via Push, so you get almost instant updates when you change something on either the phone or the web.

Push syncing works on the iPhone by keeping HTTP connections open for as long as possible: the phone sends a request and waits for a response for as long as your mobile operator's infrastructure permits (this can get up to several hours). During this time no traffic is moving between the iPhone and Google, so unless something changes, there is no data to pay for, which is actually better than polling the server every 10 minutes or so. When changes are available, the server stops blocking the connection and immediately pushes the data to the phone, hence the name.

Tip: to select which calendars to sync with your phone, navigate to m.google.com/sync on your phone and select what is needed. I have at least 2 calendars there, a personal and a corporate one - you can conveniently see their appointments in different colors.

Once this works, all you need to do is get your Exchange (or any other) events into Google Calendar. This is very easy using gcaldaemon. See their website for lots of usage scenarios; we are currently interested in file-based synchronization with Evolution. Note that Evolution now has native Google Calendar support as well, but that only lets you view your existing Google Calendar in Evolution, not sync your corporate Exchange calendar to Google.

Evolution, while talking to Exchange, caches your calendar data in a file called cache.ics. You can find it in:
~/.evolution/exchange/exchange___username;auth=Basic@server_;personal_Calendar/cache.ics
Substitute your own username and server there.

All you need is to configure gcaldaemon to monitor this file and send updates to Google Calendar, totally automatically. This way you get one-way sync from Exchange to Google, but that should be enough not to miss your all-important meetings at work, because your phone will alert you wherever you are. I run gcaldaemon right after Evolution from the same launcher, so I don't have to worry about syncing anymore. For that, I have created a ~/bin/evolution file (the local bin has priority in PATH, at least on Ubuntu); on execution this script first runs /usr/bin/evolution, sleeps several seconds and then starts gcaldaemon.
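A minimal sketch of such a wrapper script, assuming gcaldaemon was unpacked to ~/gcaldaemon and is started with its standalone-start.sh script (the paths and the sleep interval are illustrative; adjust them to your installation):
#!/bin/sh
# ~/bin/evolution - start the real Evolution, then gcaldaemon a bit later
/usr/bin/evolution "$@" &
sleep 30
~/gcaldaemon/bin/standalone-start.sh &
wait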

Google Sync actually supports syncing with many mobile devices, including the iPhone, Android phones, Nokia Series60 phones with Symbian (contacts only for now), Blackberry, and the awkward Windows Mobile. But even if you cannot sync the calendar directly to your phone, you can ask Google to alert you with an SMS before each meeting starts, which is almost as good; gcaldaemon will ask Google to do this by default for each event it syncs, provided that Google knows your mobile number. Give this a try - lots of operators are supported worldwide, it's not US-only anymore.

The only thing that worries me now is that this is just another step towards Google taking over the World :-)