Using R to visualize data

R is a great tool for visualising data. Recently I’ve got a chance to work on a project which handles rather big amounts of data. I was told the data is big but I’ve never seen any visuals myself. One day I decided to collect more information about the data and plot some graphics.

Initially data was collected into csv files of the following structure:

client,type,subtype,size,name
bidswitch,aggregates,predict_price_2,114876,2013-02-15-00.bidswitch.prod.predict_price_2.tsv.gz
admaxasia,aggregates,budget_4,10398,2013-02-14_17:25:50.2013-02-15_00:15:00.admaxasia.prod.budget_4.tsv.gz
tokyo,logs,pixel,3802377,2013-02-15-19-30-00.CET.pixel_v36.userverlua-ireland22_338a1.log.gz
tokyo,logs,pixel,2695996,2013-02-15-19-30-00.CET.pixel_v36.userverlua-ireland25_a9a45.log.gz

Later the format has changed and a few fields were added: mtime, user, group, perm. However in this acticle we will only need client and size fields.

Know your enemy

So basically what I wanted to know is what amounts of data are flowing through the system? I was curious to know the number of files for each client and their total size.

R> data <- read.csv('all/raw/files-2013-02-15.csv')
R> dim(data)
[1] 167247      5
R> names(data)
[1] "client"  "type"    "subtype" "size"    "name"   
R> length(list.files('all/raw'))
[1] 74

Not many files yet, but they’re counting every day and we can see something interesting today.

First steps

My first step was to take standart plotting functions and get somewhat acceptable plot where I can see the players. Being a novice in R I can say it was not that easy. After a lot of googling I got my first working prototype and it appeared to be quite lengthy and verbose.

#
# Plot number of files and total size of files per client from csv.
#
plotNullt <- function(file) {
    par(mfrow=c(1,2))

    d <- read.csv(file, head=T, colClasses='character')
    d$size <- as.numeric(d$size)
    sizes <- aggregate(d$size, by=list(client = d$client), FUN=sum)
    sizes$x <- sizes$x %/% (1024 ^ 2) # b to Mb
    total_size = sum(d$size) %/% (1024 ^ 2)
    total_files = nrow(d)

    # plot data size info
    K <- length(sizes$x)
    ord <- order(sizes$x, decreasing=T)
    par(oma=c(3,0,0,0))
    main <- sprintf('Mb per client (%s, total size = %dMb)', file, total_size)
    a <- barplot(sizes$x[ord], axes=F, ylab='Mb', main=main)
    m <- max(sizes$x)
    axis(side=1, at=a, labels=sizes$client[ord], las=3)
    axis(side=2, at=seq(0, m, 5000))

    # files per client
    clients <- unique(d$client)
    x <- sapply(clients, function(x) nrow(d[d$client == x, ]))
    lines <- data.frame(client=clients, files=array(x))

    # plot # of files info
    ord <- order(lines$files, decreasing=T)
    main <- sprintf('# of files per client (%s, total # of files = %d)', file, total_files)
    a <- barplot(lines$files[ord], axes=F, ylab='# of files', main=main)
    m <- max(lines$files)
    axis(side=1, at=a, labels=lines$client[ord], las=3)
    axis(side=2, at=seq(0, m, 500))
}

In this function we’re plotting two plots. We use par and mfrow to specify the number of rows and columns of subplots in our main plot.

Next we read the data from the file, making sure we get size column as numeric values. It took me a while to figure out how to use aggregate the right way. We use it to find the sum of file sizes for each client. This could probably be implemented using powerful plyr package but I’m not familiar with it yet. It’s one line of code for aggregation already which is good enough. Next we divide aggregated sums by 1024 ^ 2 to represent values in Mb. Total size and total number of files is also remembered.

After that we draw our first plot. The most challenging part here was to provide correct ordering: I wanted clients to be sorted in a decreasing order of their respective values, be it file sizes or number of files. Another tricky part was to rotate the axis labels by 90 degrees and to add more ticks to the axis. The former one was solved with axis function and las parameter. The latter one was simpler and only required to use at and seq.

Next I’m using sapply to calculate the number of files per client. This is weird and I don’t know why I didn’t use aggregate again, it could be much simpler.

And pretty much the same story with the second plot. The code is basically the same and except of a few different lines. The result looks as follows:

2013/05/03

Know your enemy

First steps