Monday, August 10, 2009

Adding binary files to CVS

I recently added a few Excel files to CVS and suddenly found that the files had become unreadable when I checked them out again. Excel would complain that the files were corrupted blah, blah, blah ....

The last time this had happened to me, I had been dealing with files that were over a decade old and that contained very important information and I had almost got a heart attack. At that time I had figured out the solution after a lot of tense web research but had omitted to document that since I assumed that I would remember it. Now I wish to correct that omission of mine.

The problem happens because CVS by default assumes that files are non-binary and does stuff like keyword substitution on non-binary files when you check them out. The solution is therefore simple. Just let CVS know that the files in question are binary. For example suppose you have a directory foo in your CVS repository that contains (say) some .xls files that you mistakenly checked in without marking them as binary. The all you need to do is:
  • cd foo
  • cvs admin -kb *.xls
  • cvs update
and you are all set.

To prevent this in future, go to the highest level in your repository and edit the file CVSROOT/cvswrappers just like you modify any other file in CVS. Modify it so it has the following lines at the end:

*.xls -k 'b'
*.xlsx -k 'b'
*.ppt -k 'b'
*.pptx -k 'b'
*.doc -k 'b'
*.pdf -k 'b'
*.docx -k 'b'

Then just commit that file and CVS will remember that all these files are binary files.

Note that this change will only apply to this particular repository. If you have multiple repositories you will need to modify this file in all your repositories, of course.

Friday, April 10, 2009

Shuffling a table in Excel

Here is a problem I faced recently - I had an Excel table that I wanted shuffled randomly (no, I was not dealing a bridge hand as people who know about my association with the game might conclude). As a matter of routine I was going to do it the way I knew how to do these things - by copying this into a text file and writing a Perl script to do the shuffling but I just decided to see how one could do it in-place in Excel itself. The answer is very simple and elegant! Just create a pseudo column in the table and set the value for each cell in that column to "=RAND()" and sort on this column! Excel is cool - I want to learn all about it.

Saturday, January 24, 2009

Sometimes the old ways are the best ...

I woke up this sleepy Saturday morning and decided to check out what was happening on Twitter. My network on Twitter consists of one friend Dhananjay Nene who twitters regularly (http://twitter.com/dnene) coupled with many others who have done nothing other than twittering "Test" or "Checking this out" and then tuning out completely. Anyway, to come back to the point, I saw this entry by Dhananjay

"Brian Kernighan "Sometimes the old Ways are Best" - http://tinyurl.com/c2ydgc - Precisely the reason why I cant stop using Linux.
  • P1 would consist of one line : getfacl d >oldacls
  • P2 would consist of one line : setfacl -f oldacls d
In Windows on the other hand I had the following problems:
  • Windows has at least three programs that manage file ACLs, not all of which work on all versions of Windows - cacls, xcacls, icacls
  • Except for icacls that is only supported on Vista and later, none of the programs provide a convenient way for storing current ACLs.
  • cacls is available on all Windows platforms post-2000 (which was o.k. for me) but not all cacls had a "/S" switch that prints current ACLs
  • In settings where I can use "cacls /S" to get the current ACLs I still need to parse the output which looks something like c:\dir1\dir2\currentdir: "a long ACL string". Unfortunately parsing this string and extracting the ACLs for d1 is not at all trivial in Windows, unless of course we were using UNIX tools such as grep or perl or awk.
Unfortunately for me, using any of the UNIX tools was not an option so I had to end up writing over 100 lines of C# code! Can you believe it?

Sunday, December 14, 2008

Predictions

In my life I have heard a number of very sweeping predictions that have or have not come true. For example, I remember that in the year 1981 someone (initials AV) had very authoritatively asserted that the U.S.S.R. and China would come close in a strategic alliance within a decade. This person (who is very knowledgeable and whom I hold in high regard) offered as evidence a number of events such as veiled statements made in "official" "party" publications in those two countries. Remember that in those days both these countries were extremely secretive and one had to very carefully parse comments made in these official news organs in order to understand what the powers-that-be wanted the world to hear.

What actually happened was that the U.S.S.R. crumbled under the costs of a costly war in Afghanistan coupled with an extremely inefficient agriculture plus a very expensive arms race with the U.S. and ultimately saw a collapse of the economic and political system that hitherto held sway there. China, on the other hand also underwent a significant transformation, albeit in a totally different way. Today both countries are solidly capitalist (but still very undemocratic) although China continues to pays lip service to communism. But the point is that the prediction turned out to be totally untrue or at least irrelevant. On the other hand, I know of nobody, literally nobody, who foresaw the collapse of the Soviet Union, neither people on the right nor those on the left!

I just thought it might be worthwhile to keep track of predictions people make and preserve them for posterity and this is what I want to achieve with this blog. I may not live long enough to know how some of these predictions turned out. As an example, I think I remember having read an Arthur C. Clarke book a long time ago wherein he had predicted that human beings would achieve immortality by 2100 A.D :-). I am pretty sure I cannot test that (unless we achieve this by 2022 when I will be 60 years old).

I had assumed that there would be a site http://www.predictions.com/that would have such a list but I was disappointed to see that as of December 15, 2008 it only had predictions about:

  • football prediction
  • college football predictions
  • pro football picks
  • football odds
  • psychic reading
  • psychic online
  • tarot reading
For what it is worth, wikipedia.org has a page that is relevant - it is http://en.wikipedia.org/wiki/Famous_predictions (as of December 15, 2008)

Anyway here is the list of predictions that I will update on an ongoing basis:

  • My friend SB, a hedge/mutual fund manager predicted that the U.S. economy will be worse off in 2018 compared to today (2008). This was in response to my observation that the quality of goods and services obtainable in the U.S. today is significantly inferior to what I saw in the late 20th. century. In general, I think the U.S. population feels much poorer compared to ten years ago. And this is something I noticed in 2007 before we had the country going into recession.
  • On April 7 2009 Michael Stonebreaker predicted that "in the next decade the database market will shift from elephant one-size fits-all databases to specialized databases". In other words people would be using specialized databases for different types of workloads such as transaction processing, data warehousing, text data oriented applications, scientific applications, streaming applications etc.


Successful Predictions


And here are some predictions that people I know had made in the past that turned out to be true:

  • In 1989 my friend Deepak Khemani had made a prediction that I had been extremely dismissive of then but has proved to be remarkably prescient - he said that by the year 2005 software would be the dominant technology. I think he also went on to say that by 2020 it would be the only technology. Incidentally, Khemani is now a Professor of Computer Science at I.I.T. Madras and his web site is http://www.cs.iitm.ernet.in/khemani/ (correct as of December 15, 2008)
  • The Bridge World predicted in 1957 that Canasta would not displace bridge in its popularity (but see below under "Failed Predictions")


Failed Predictions




  • The Bridge World predicted in 1957 that the new fad called "Television" would not displace bridge in its popularity

Saturday, July 05, 2008

Designing overflow-safe algorithms

I was thinking about an old interview type problem: "Given a set of N-1 distinct integers from the range 1 .. N, how will you detect the one missing integer?". Needless to say, the input contains no duplicates and the input is unordered. There are numerous solutions, with different problems/limitations:



  • Sort the input array, go through that array and locate the missing element. Problem is that this is O(NlogN)

  • Maintain an array (or a bit vector) of size N initialized to 0. As you read the input, mark the corresponding array element as 1. Then make one pass through the array and detect the missing element by looking for value "0" in the array. This algorithm is O(N) but requires investment in memory

  • A cute solution that interviewers often look for involves summing up the input and subtracting this from the sum of numbers from 1 to N ( which is 0.5 * N * (N + 1) as any high school student knows). The difference between the two is the integer missing from the input.

The last solution is quite neat except that if one were to attempt to write such a program in most commonly used programming languages one would run into overflow problems. One could use special libraries that allow one to use arbitrarily large numbers (e.g. Perl's Math::BigInt) but they come with overheads that are really unnecessary for the problem at hand. The solution I propose involves keeping a "running difference" between the sum of the input and sum(1 .. N) making sure that this difference is always within the range (-N - 1 .. N - 1). Essentially we are going to iteratively calculate the value of

missing number = sum(1..N) - Sum(input) = 1 + 2 + ... + (N-1) + N - sum(input).

The first cut implementation might be something like



i = 1; diff = 0;
while (not end of input) {
diff += i - input element;
++i;
}
return (N + diff);


The problem is that this could lead to an underflow if (say) the initial numbers in the input are all very high. If we instead start from the top like:


i = 1; diff = 0;
while (not end of input) {
diff += N - i + 1 - input element;
++i;
}
return (diff + 1);


we could have the opposite problem that we could overflow on the higher side.

The solution to this is to make sure that diff is always within bounds - if we find that it is negative, we compensate by using an unused number from the higher end of the range (1..N) whereas if we find that it is going too high, we use an unused number from the lower end of the range (1..N)

Here is my Perl implementation:



sub with_summing {# @input
my $high = @_ + 1; #Higher end of range = #elems in input + 1
my $low = 1;
my $diff = 0;
foreach my $next (@_){
if($diff < 0) {
$diff = $diff - $next + $high ; --$high;
}
else {
$diff = $diff - $next + $low; ++$low;
}
}
return $diff + $low; #At this point $high should equal $low
}


This technique of iteratively calculating a bounded value without creating intermediate values that might overflow is useful in a number of other situations. Such solutions have the additional advantage they can also give visibility into the value the input is tending to as more and more input is read. For example, we could use this technique to calculate the average. Rather than using the formula average = sum of input/cardinality of input that is susceptible to overflow, one could instead use the following iterative solution

average_so_far = 0;
num_input_elements = 0;
while (not end of input) {
average_so_far = average_so_far * (num_input_elements/(1 + num_input_elements) + new_element/(1 + num_input_elements);
++num_input_elements;
}


And we could wrap this into a nice class as follows:

class average {
public double average_so_far();
public void add_value(double new_value);
}


We could then use this class in (say) a chemical plant application where two threads share this object with one thread adding readings from a data feed while another thread could be displaying the summarized average to an operator. Notice the technique we used to update the average - we took care not to use the formula
average_so_far = (num_input_elements * average_so_far + new element)/(num_input_elements + 1);


We actually used this technique for keeping track of the results of a long running experiment we conducted as part of our work doing Performance testing at Aztecsoft, my previous company.

One might wonder how far one can go with this idea. How about calculating Standard Deviation anyone? My gut told me that this wasn't possible but it turns out that my gut was wrong! Here is how we can calculate SD without causing overflows (this may not be very readable but I am not good with representing mathematical symbols on a blog :-(:

sd = sqrt((sum(square(element - average))/(num_elems - 1)))

Let sq(i) = sum(square(element - average(i))/(i - 1)

Then it can be shown that (this is an interesting exercise that I am leaving for the reader, mainly because I will go crazy trying to format the derivation in this extremely hard to use blogging editor):
sq(i + 1) = (i-1)/(i))*sq(i) +
(delta)*(delta) +
(element(i+1) - average(i+1))*(element(i+1) - average(i+1))
where delta = (average(i+1) - average(i)
)

Notice that none of the elements of this formula are likely to cause overflow issues. This lets us write the following code to calculate the standard deviation


sub sd { # input array of numbers
my $average = 0;
my $new_average;
my $num_elems = 0;
my $old_square = 0;
my $square = 0;
foreach my $new_elem (@_){
my $old_average = $average;
$average = $old_average*($num_elems/($num_elems + 1)) +
$new_elem/($num_elems + 1) ;
my $delta = $average - $old_average;
$old_square = $square;
if($num_elems > 0){
$square = (($num_elems - 1)/$num_elems) * $old_square +
($delta ** 2) +
(($new_elem - $average) ** 2)/$num_elems;
}
++$num_elems;
}
return ($square ** 0.5, $average);
}


To sum up, many seemingly straightforward "correct" solutions can break down at high data volumes but there is often an elegant solution that can overcome the limitations of the straightforward solution. Be on the lookout for such opportunities.

Monday, May 22, 2006

How to send a mobile phone "business card"?

I always get confused about how to send a person's data saved in my mobile phone address book to another person who is also in my address book. For example, say my friend "seeker" wants to know the contact information about another friend "sought". On my phone which is an antiquated (pre-2003) model of Nokia (I cannot say which one it is since there is nothing on the instrument that tells me what model it is) I can follow the following steps:

- look up information about "sought"
- choose "options" and then choose "send bus. card"
- choose (say) "via text msg."
- look up information about "seeker"
- choose to "send the business card"

I always get confused which order to do my search in (do I lookup "sought" and send his info. to "seeker" or do I look up "seeker" and send him information about "sought"?) and I thought others too may have that confusion hence thought I should record this permanently.

- Rajendra Gokhale

Tuesday, May 02, 2006

Words containing letters in alphabetical order

My daughter Rama asked me this question the other day - which is the longest word that contains all letters in alphabetical order? For a dyed-in-the-wool UNIX hacker like me, the answer was quite straightforward:

grep -i "a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*" /usr/share/dict/words

Interestingly enough, I had left out the "*" at the end of the regular expression so all the words I found were ones ending in "z". That taught me a new word - "chintz" which interestingly enough originates from Hindi and is a type of cotton fabric. My daughter alerted me to my bug when she pointed out that the word "fix" did not appear in my output. I was able to correct my error and rerun the program. One of the longest words fulfilling the requirements of the problem was "biopsy".

Many of my posts may seem like nice ways to kill time but these skills really come handy in ones day-to-day work - for example, in my role as manager I was able to put my skills to good use in order to figure out the productivity of my (Research) group. But that will be another posting for some other day.

- Rajendra Gokhale

Solving the hexagon problem using Perl




Here is a problem that I see quite often in one of the afternoon papers such as the "MidDay". You are shown a hexagon, all of whose sides are labelled with one English letter (see above). There is another letter at the centre of the hexagon. The challenge is to make as many words of four or more letters from the letters of the hexagon.

In making a word, each letter must be used once only. Each word must contain the central letter. There should be atleast one seven letter word. Plurals, foreign words & proper names are not allowed.

Given such a hexagon (encoded in some format of your choice) and a file consisting of all acceptable words (e.g. /usr/share/dict/words), can you write a program to solve this problem?

See the Perl program below that does this job quite elegantly. A small point - when I used the dictionary on my Redhat Linux (Fedora Core 3), I could not find the seven letter word because it was missing from the dictionary - I had to figure it out manually. The word was "website" and we are talking of the year 2006 A.D. :-)!

Here is the Perl program:

use strict;
my $regex = '^b{0,1}e{0,2}i{0,1}s{0,1}tw{0,1}$';

foreach my $word (<>)
{
chop $word;
my $newword = join "", (sort (split //, $word));
if ($newword =~ /$regex/ && length($word) >= 4){
print "${word}\n";
}
}
__END__
As you can see above, the regular expression from the image shown was hardcoded into the script but anyone who understands this program will be able to trivially modify it so that the regular expression can be constructed on the fly. Also one small optimization that is possible involves precompiling the regular expression.

- Rajendra Gokhale

Sunday, April 16, 2006

Pune passport office information

I needed to get my passport renewed from Pune, India and was not able to find much information about this from the we so I decided to describe my experiences here.

My passport expires in October 2006 but I needed to get it renewed since I needed to get a visa in May 2006 - most countries do not issue a visa unless your passport is valid for at least six months beyond the date of issue of your visa. My impression was that the Government of India did not issue a new passport more than six months before the old passport is due to expire but that is apparently not the case - you can renew your passport upto one year before it is due to expire. Anyway the fact was that I needed to get my passport renewed in less than a month so I needed to go the "Tatkal" route that enables one to get an emergency passport. Note that typical waiting times at Pune are a few months so by Pune (Indian?) standards if you need your passport in less than three months, it constitutes an emergency :-). Here are my experiences getting a Tatkal passport from the Pune Passport Office.



The GoI is one of the biggest enemies of the environment. In order to apply for my passport I had to submit an application along with an elaborate set of supporting documentation, all in triplicate! Some of the documentation I had to submit was ridiculous beyond words - for example, I was required to attach photocopies of eight pages from my old passport and although these pages contained my name and date of birth I still had to provide proof of my date of birth. When I submitted my application I had with me all the following documents, each one in triplicate:
  • letter from bank confirming my address (I did not have a ration card, the ultimate proof in India)
  • letter from "reputed employer" confirming my address
  • bank statement
  • School Leaving Certificate confirming my date of birth
  • my degree certificate confirming that I was qualified for ECNR (Emigration Clearance not Required)
If you plan to submit the application in person, you have to also make sure that all these photocopies are signed by you - if someone else is submitting on your behalf, you must have all these attested by someone qualified to do so plus you have to submit some other statement.

You will need a large number of photographs but that was not a problem for me. I have lived in India long enough that I always carry about 50 photographs on my person at all times (my guess is that twelve should suffice). You will need to glue those photographs at various places and sign on those so that the signature spans both the application form as well as the photographs. Make sure that you have affixed your thumb impression at the right place on all the applications (left thumb for males and right for females).

Here then is the list of items you need:
  • glue stick (one no.)
  • stapler with extra pins
  • stamp pad
  • portable photo-copier
  • two ball point pens (black or blue ink)
  • scissors
  • plastic bag to carry all of the above
  • a friend to assist you
  • a novel/video game etc. to keep you entertained while waiting for your number in various queues
  • and above all, lots of patience ...


Armed with all the required documents and accessories, I reached the passport office on a Wednesday (I think). I will describe all that I learnt there in the form of factoids:

  • there are three steps in the application process.
    1. get a token number from the token number queue.
    2. wait to have your application scrutinized. Since there is a token system in place, you do not have to stand in any queue for this. In this stage, your application will be scrutinized in order to make sure that all your papers are in order. In general, you would have missed out some documents that need to be submitted and you are asked to bring them. For example, I had a valid visa on my current passport and I was asked to get photocopies of that (there are numerous shops providing photocopying service in the vicinity of the office). If your old passport contains valid visas, the passport is cancelled and immediately returned to you - the horror stories about your passport getting stolen etc. seem to be fiction.
    3. When your application has been scrutinized, you have to await your turn at the payment window (again, you do not have to stand in a line - you pay when your number is announced).
  • as mentioned above, I wanted to apply for a Tatkal passport. When I mentioned this to the "scrutiny" lady, she asked me to meet the Passport Officer after I had made my payment. At the payment window they just took the fees for a standard passport from me.
  • sign the token that was given to you at stage 1 of the process before you approach the scrutiny window - this is an undocumented requirement and will earn you brownie points with the lady at the scrutiny window.
  • after paying the fees I went to meet the Passport Officer and told him that I needed my Passport in a hurry - he asked me to come after eight days. Note that by this time it was close to 1 p.m. so there was no line to meet the Passport Officer.
  • I accordingly went to meet him after eight days. This time I had to stand in a long line (make sure you stand in the line to meet the Passport Officer, not the one leading up to the PRO :-) ). After an hour spent in the queue, I got an audience wth him. I had all the documents needed for getting an emergency application. He asked me to write a covering letter. After I did so, he asked me to meet him again after two days.
  • I again met him after two days going through the same process as before. Surprise of surprises - he asked me to come and collect the passport in the afternoon!
I went in the evening and collected my passport. I had got my new Indian passport with a 5 year validity on the tenth day after applying. Not only did I not have to bribe anybody, I didn't even have to pay any special fees! Of course, I had to spend three mornings and one evening, but I guess we are still only in the 21st. century out here :-(.

Lest I sound very matter-of-fact here, let me assure you that the actual experience was extremely annoying for me. I had even exchanged words with the Officer on the first day. However the speed with which I got my passport ultimately was a really pleasant surprise.

I think I will stop now.

- Rajendra Gokhale