Parsing XML data using bash and standard Unix tools

Parsing XML can be a tedious and unpleasant job if you insist on using just standard Unix tools like sed, awk, cut, grep and so on. One might say that it’s better to use Python, Perl, Ruby or another language that ships with a full-blown XML parser, and keep the standard Unix utilities for what they were meant for: plain old text files, not pesky XML. The problem with those nice programming languages is that they take away the one-liners. You need to import stuff, declare variables, add flow control and so on.

A nice tool that makes one’s life easier when it comes to XML is xml2. It converts a normal XML file into a line-oriented format, with one path=value pair per line. The standard Debian distribution has this neat tool in its repos, so you are one apt-get away from using it.

 

One simple example. Take this XML file:


<xml>
<fruits>
<fruit name="apple" type="royal gala" quantity="2" price="1"/>
<fruit name="orange" type="tasty" quantity="4" price="1.5"/>
<fruit name="banana" type="green" quantity="3" price="1"/>
</fruits>
</xml>

We run xml2 against it:

cosu@roadwarrior:/tmp$ xml2 < fruits.xml
/xml/fruits/fruit/@name=apple
/xml/fruits/fruit/@type=royal gala
/xml/fruits/fruit/@quantity=2
/xml/fruits/fruit/@price=1
/xml/fruits/fruit
/xml/fruits/fruit/@name=orange
/xml/fruits/fruit/@type=tasty
/xml/fruits/fruit/@quantity=4
/xml/fruits/fruit/@price=1.5
/xml/fruits/fruit
/xml/fruits/fruit/@name=banana
/xml/fruits/fruit/@type=green
/xml/fruits/fruit/@quantity=3
/xml/fruits/fruit/@price=1

And now we extract all the fruit names:

cosu@roadwarrior:/tmp$ xml2 < fruits.xml | grep '/@name=' | cut -d"=" -f2
apple
orange
banana

There you go! A fruit salad! Of course for more complicated stuff use other tools :)
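
As for the "more complicated stuff", here is a minimal sketch of the same extraction in Python, using the standard library's xml.etree.ElementTree. The XML is inlined as a string only to keep the example self-contained, and the names FRUITS_XML and fruit_names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same document as fruits.xml above, inlined so the sketch is self-contained.
FRUITS_XML = """<xml>
<fruits>
<fruit name="apple" type="royal gala" quantity="2" price="1"/>
<fruit name="orange" type="tasty" quantity="4" price="1.5"/>
<fruit name="banana" type="green" quantity="3" price="1"/>
</fruits>
</xml>"""


def fruit_names(xml_text):
    """Return the name attribute of every <fruit> element, in document order."""
    root = ET.fromstring(xml_text)
    return [fruit.get("name") for fruit in root.iter("fruit")]


for name in fruit_names(FRUITS_XML):
    print(name)
```

It is longer than the one-liner, but it reads the attribute directly instead of relying on how the text happens to be laid out.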

 

Internet Exchange Points

The largest Romanian IXP is Interlan. Funnily enough, Interlan is the smaller ISPs' response to the other big Romanian IXP, Ronix. Because joining Ronix was a complicated affair 3 years ago, a few small companies decided to take matters into their own hands. Currently, Interlan has 3 times more traffic than Ronix.

Joining pdf files

Combining multiple pdfs into a single file can be handy for putting together one big final report or for submitting a single print job instead of multiple smaller ones. Joining pdfs in a Debian-based Linux distribution can be easily done with the pdfjoin utility, which is provided by the pdfjam package. One only needs to run:

sudo aptitude install pdfjam

Then all that needs to be done is cd-ing into the folder containing the large number of pdfs and running:

pdfjoin *.pdf --outfile out.pdf

There you go, instant pdf!

Choosing random entries from a group

In the past two weeks we had a lottery-type thing on RGC.ro (Romanian Guitarist Community). Proguitar, the official importer of Fender products in Romania, wanted to give away a custom-made Fender Stratocaster electric guitar. To register, the community users had to fill out a form and choose from a series of custom options for the guitar.

As organizers we had to pick out the lucky winner of the raffle. Usually this is done by someone who is impartial. Because we had about 1600 entries and because we are geeks, we wanted to do something that geeks would do. Therefore we ditched the “extract the name of the lucky winner from a bowl” approach. The geek version of this is described in RFC2777, Publicly Verifiable Nomcom Random Selection.

In short, RFC2777 describes a simple, publicly verifiable algorithm for picking a set of entries from a group as randomly as possible. The keywords here are "public" (anyone can see how the entries are picked) and "as random as possible". To get random values, a thing called information entropy is needed. To get that initial random value full of juicy entropy we used, as suggested in the RFC, the results from three international lotteries. This initial random value was slightly modified for each “extracted” entry and then transformed into an MD5 hash. Due to the nature of a hash, even a slight change to the input produces a hash that differs heavily from the original one.
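
To see that last property in action, here is a small sketch that hashes two inputs differing in a single character and counts how many of the 128 output bits change. The seed strings and the helper names md5_hex and differing_bits are made up for illustration and are not part of the RFC.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 digest of a bytes value as a hex string."""
    return hashlib.md5(data).hexdigest()


def differing_bits(hex_a, hex_b):
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical seed; only the last character changes between the two inputs.
first = md5_hex(b"seed-0")
second = md5_hex(b"seed-1")

print(first)
print(second)
print(differing_bits(first, second), "of 128 bits differ")
```

This is why changing the seed just a little for each round is enough to get an unrelated-looking value for every extracted entry.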

Below you can find a naive Python implementation that can be freely used for any purpose. Just make sure you fill in the entropySource with a good initial random value.

import hashlib

if __name__ == '__main__':

    # Seed built from public lottery results, as RFC2777 suggests.
    entropySource = "9.24.30.32.36.40./18.25.35.43.46.47./1.3.4.8.23.31./"

    numberOfEntries = 1655
    numberOfWinners = 10

    # Entries are numbered 1..numberOfEntries.
    numbers = list(range(1, numberOfEntries + 1))

    entries = numberOfEntries
    print("index \t hex value of MD5 \t div \t selected")
    for i in range(numberOfWinners):
        # Wrap the seed in the round counter so each round hashes a different input.
        data = bytes([i]) + entropySource.encode("ascii") + bytes([i])
        digest = hashlib.md5(data).hexdigest()
        # Reduce the 128-bit hash to an index into the remaining entries.
        modulo = int(digest, 16) % entries
        print(str(i + 1) + "\t" + digest + "\t" + str(entries) + "\t" + str(numbers[modulo]))
        # Remove the winner so it cannot be drawn twice.
        del numbers[modulo]
        entries -= 1

SNE Update 1

So over two months have passed since my last update. I was either busy or not in the mood to update my blog. I will try to make up for lost posts somehow…

I’m now enrolled in the System and Network Engineering Master at the University of Amsterdam. How did I get here?


Great success!

(that’s what Borat would say)

Today I’ve received wonderful news! I have been accepted to the System and Network Engineering Master at the University of Amsterdam! Starting from the end of August I’ll be relocating to Amsterdam for one year of full geek experience (hopefully!). I cannot thank my girlfriend enough for being such a great support and motivator. Without her nothing could have happened. Also, my teachers (esp. prof. Rughinis and prof. Tapus) at the Faculty of Automatic Control and Computer Science at University POLITEHNICA of Bucharest have been great mentors and supporters of my admission.

This could be a good time to add this blog to your RSS reader as starting from September I’ll be posting regularly on both geek related stuff and the lifestyle of an international student in the Netherlands.

Get your personal email account

Most people use free email services like yahoo, gmail or live. Unfortunately all the nice sounding email addresses are taken by now so new users have to come up with strange combinations like johndoe19__smth_smth@yahoo.com. That’s very hard to remember and it sounds very unprofessional.

Having an online presence is no longer such a big deal. For a few dollars a year you can get your own .com (or other top-level domain), and another few dollars a month get you a hosting plan which provides a couple of megabytes of website storage and a number of email accounts. So with a small investment you can have a decent email address like name.surname@somedomain.com. That’s something that you could put on your personal business card. Few know that you can skip the email service offered by your webhost and instead use a more reliable service.

Both Microsoft and Google offer domain email hosting as a free service. Microsoft calls this Windows Live Custom Domains ( https://domains.live.com/ ) while Google calls its service Google Apps ( http://www.google.com/apps/intl/en/group/index.html ).

Using these services is quite simple. You just have to prove that you are indeed the owner of the domain and make some DNS modifications so that emails will be handled by Google or Microsoft. The DNS records can be modified through the web interface set up by your hosting provider (the one that hosts your DNS records) or by directly editing your DNS configuration in case you manage the DNS yourself. Either way, both Microsoft and Google give you directions on how and what to modify.
For the tech savvy readers there are 2 basic steps: add a CNAME record containing a random string to prove that you are the rightful owner and then modify the MX records with the one provided in the instructions. It’s not that complicated.

Why should you do this?
Well, both Microsoft and Google provide a better service than a normal hosting company when it comes to reliability. Sure, you don’t sign a contract that mentions any SLA, but statistically speaking both offer a kick-ass service. You don’t have to worry about backups, downtime, spam and so on. It just works. For small operations, say personal email or small companies like startups, this kind of service is ideal as it cuts costs and/or gives fewer headaches.
Using the administration page you can create, delete or reset any email account. If someone messes up their password you can simply reset the account.
By using either the Microsoft-based service or the Google one you get access to other related services like Office Online or Google Docs, because the created email accounts serve as Live IDs or Google Accounts. This opens a new world of online collaboration. I know a few startups that use this kind of service.

What are the downsides?
You don’t own your email (carefully read the EULAs) and some may not like this.
You are limited to 50 or 100 email accounts and when you hit that limit you have to upgrade to a paid service. Individuals and small companies will just ignore this.
The web mail interface will display ads just as gmail.com or live.com. Adblocker type software could make this a non-issue.
You get little to no tech support. This can be neglected by individuals or small companies considering the advantages.

The email account can be accessed either from a browser or from an email client. Google Apps email can be accessed by POP3, IMAP and webmail. Unfortunately Windows Live Custom Domains does not offer access using the IMAP or POP3 protocols. To use Outlook you need to install a small piece of software called Office Outlook Connector. The advantage of this approach is that besides email you can synchronize your address book and calendars. The IMAP and POP3 protocols don’t allow that. For Thunderbird + Live you need a plugin, but you get only basic service: get/send emails, no calendar :( .

For $9 a year you could get a .com domain. You just need a public DNS server to host your records and that’s it, you can sign up for free email hosting.

Regarding DNS hosting, this is really not an issue. http://freedns.afraid.org/ is a very good option. If you don’t like it you could always ask your geek friend to help you out.

It’s hard to tell which service is best. Right now I’m using both Live Custom Domains and Google Apps and I’m quite happy with either one. It all depends on what you want to achieve.

After a year or more of using Google Apps I’m thinking of decommissioning all of my postfix installs (yes, postfix is better than qmail) and switching to one of the above options. Having a full-blown email server (even if it’s just a virtual machine with just enough resources, serving many domains by means of SQL and virtual domains) seems more and more a waste of time and resources for small operations.

I have a gut feeling that more and more companies will outsource their email service. I’ve seen this happening on a large scale in a few universities in Romania. The Bucharest Academy of Economic Studies is using Google Apps to offer email accounts to all its students (that’s more than 20,000 accounts!). Likewise there’s a small implementation of Live @EDU, a Microsoft programme that basically does the same thing, in the Faculty of Automatic Control and Computers at the POLITEHNICA University in Bucharest (that’s about 3,000 accounts, give or take).

Free PowerShell book

If you’re just learning PowerShell or you’re already a top scriptwizard, “Mastering PowerShell” might prove to be a useful resource.

Besides the usual scripting basics like variables, functions, pipes and so on the later chapters show some usage of the scripting language for some more concrete problems like XML manipulation or user account management. Just give it a try, it’s free!

Color that manpage!

Manpages are the last line of defence when it comes to Unix troubleshooting. After you’ve tried everything you could think of and it still doesn’t work, you know it’s time to read the manual.

By default, Linux distributions use the less command to display the man page requested by the user. The manpage is displayed as plain text, so it can sometimes be hard to find what you’re looking for. Keywords and special parameters are printed in bold to ease document navigation, but sometimes this is not enough.

Navigation is done by using the up and down arrows, page up/page down and the space key.
Searching through the document is done by typing the / character followed by the word or phrase to search for.

One useful hack is to color the manpage so that keywords, parameters and so on are highlighted.

To do this we have to set some environment variables:

export LESS_TERMCAP_mb=$'\E[01;31m' # begin blinking
export LESS_TERMCAP_md=$'\E[01;31m' # begin bold
export LESS_TERMCAP_me=$'\E[0m' # end mode
export LESS_TERMCAP_se=$'\E[0m' # end standout-mode
export LESS_TERMCAP_so=$'\E[01;44;33m' # begin standout-mode - info box
export LESS_TERMCAP_ue=$'\E[0m' # end underline
export LESS_TERMCAP_us=$'\E[01;32m' # begin underline

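The strange \E[...m strings are ANSI escape sequences: \E is the escape character, and the numbers between [ and m are the color and attribute codes that the terminal interprets. You can check out some info about that in the bash-prompt-howto.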
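
To make the numbers less mysterious, here is a small Python sketch that translates the parameter strings used above into readable names and prints a colored sample for each. It covers only the codes that appear in this post (reset, bold and the eight basic foreground/background colors), and the function name describe_sgr is made up for illustration.

```python
# Meanings of the numeric parameters used in the variables above.
ATTRIBUTES = {0: "reset", 1: "bold"}
COLORS = ["black", "red", "green", "yellow", "blue", "magenta", "cyan", "white"]


def describe_sgr(params):
    """Translate a parameter string such as '01;31' into readable names."""
    described = []
    for part in params.split(";"):
        code = int(part)
        if code in ATTRIBUTES:
            described.append(ATTRIBUTES[code])
        elif 30 <= code <= 37:
            described.append(COLORS[code - 30] + " foreground")
        elif 40 <= code <= 47:
            described.append(COLORS[code - 40] + " background")
        else:
            described.append("unknown code " + part)
    return described


for params in ("01;31", "0", "01;44;33", "01;32"):
    # \033 is the ESC character that bash writes as \E inside $'...'.
    sample = "\033[" + params + "m" + "sample" + "\033[0m"
    print(params, "->", ", ".join(describe_sgr(params)), sample)
```

So 01;44;33 in the "standout" variable reads as bold yellow text on a blue background, and swapping the numbers is all it takes to change the colors.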

After you have customized your colors you can save the above commands in your .bashrc file (the one in your home folder) so that the variables are set every time you log in.

Quickie: Wrap to 80 columns

I got a complaint that my submitted text file was not wrapped to 80 columns. Rather than work my butt off mixing and matching the text lines until I met the bastard’s requirement, I used the neat little tool called fold:

cosu@cosu-desktop:~/Desktop$ cat file | fold -s
my monitor resolution is soooooooooooooooooo small that more than 80 colums of
text give me a segfault.

-s stands for break at spaces. fold wraps at 80 columns by default; use -w to pick another width. man fold for more options.
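
If you ever need the same thing inside a script, a rough Python equivalent can be sketched with the standard textwrap module. This is not a drop-in replacement for fold: textwrap strips the whitespace at the break points, while fold -s keeps it. The function name wrap_text is made up for illustration.

```python
import textwrap


def wrap_text(text, width=80):
    """Wrap each input line at whitespace so no output line exceeds width."""
    wrapped = []
    for line in text.splitlines():
        # textwrap.wrap returns [] for an empty line; keep it as a blank line.
        wrapped.extend(textwrap.wrap(line, width=width) or [""])
    return "\n".join(wrapped)


sample = ("my monitor resolution is soooooooooooooooooo small that more than "
          "80 columns of text give me a segfault.")
print(wrap_text(sample))
```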