Right at the start of my PhD, pretty much the first thing my supervisor sat me down to do was to learn R (see CRAN for download). I’ll be honest and admit that I didn’t really see the use of a statistical programming language for the kind of work I envisaged doing. Two years later…well, I’ve spent a lot of that time so far using R to draw the sort of graphs you could do in Excel in a few seconds. Granted, Excel won’t do histograms or boxplots as part of it’s default set of plotting features, and in that regards, R is very handy. But thus far, this is all I’ve used it for. I set out my data in Excel, then import it into R and run a script to plot something basic.
Indeed, this is exactly what I had been doing in preparation of a new blog post. I had a large set of data which I’ve been compiling since I started my PhD, and although it is really of no real use for my research, I wanted to at least do something with it. The data is production data for the UK North Sea for every operating oil and gas field, monthly, from 1975 to the present. A twelve month rolling update of the data is provided by DECC, but for anything older you would need to email DECC to ask for it. My supervisor emailed me what data he had (1975-2009, I think at the time), and I got the rest up to the end of 2011 from DECC to update the spreadsheet.
Yes, it’s a spreadsheet. As it stood – including the 2011 data – the Excel file had around 62,900 rows of data in 10 columns of variables. The data as supplied by DECC, however, came with the variables in tabs, and ran monthly data in rows, as apposed to columns in my big ‘master’ document. When I had to update the master document, it involved copying, pasting and transposing row by row. I probably added around 3-4,000 rows to the master in a tedious and long-winded exercise (it took my about a week). At the time I thought “There must be a better way to do this, using R or something“, but I never looked into it any further.
Fast forward to this week, when I started thinking about plotting up the data for this blog. I realised that I had the 2012 data, but hadn’t updated the master spreadsheet with it yet. I was all set to spend hours copying & pasting, but instead got chatting to a colleague (Mike Spencer, Scottish Snow blog) about it. He resurrected the idea of using R to join my master data to the 2012 update, and with help from Mike I was set down the right path, and actually enjoyed tickling the data into a single dataset. I’ve got to say, I was immensely satisfied to have it work out. There’s a few very minor issues with it, but that’s more to do with my OCD than anything fundamentally wrong with the data…
The solution was pretty straightforward, in the end, if a little tortuous. I found the Quick R page extremely useful, as always, as well as Stack Overflow for more complex problems. You can see the issue I had in the following pictures:
My master spreadsheet has the production data (oil, gas, condensate, etc.) as well as year and month arranged in columns, whereas the original data from DECC comes with the production data tabbed with each column being a different month. The first task was therefore to save the tabs as separate .CSV files to import into R, and then use the melt function in the reshape package to transpose the monthly data into rows, instead of columns. In preparation for merging all the production data later, I also needed to add a “Year” column, and arrange the columns in the same order that appear in the master data.
This was done for all the different production types (oil, dry gas, associated gas, condensate, water and injected water), prior to using the merge function, which takes two data frames and glues them together in some cunning way, merging duplicate values (i.e. field names) and adding unique values (i.e. oil and gas volumes) in the right place. Once all the 2012 data was merged together, I could merge it with the master data, and once I’d converted any NA values to 0, was now suitable for extracting what I was interested in for plotting. The whole thing took me a day to do, and was very satisfying. The code I used is below, if you are interested. I’m sure there’s a more elegant way of doing this, but this worked for me!
I’ll put the plots in a separate post, to follow this one shortly.
###### Libraries ############## ############################### library(reshape) ##### Data manipulation ####### ############################### DECC <- read.table("DECC.csv", header=T, sep=",") oil2012 <- read.table("oil2012.csv", header=T, sep=",") oil <- melt(oil2012, id=c("Field", "Operator")) #melt function from reshape package, converts data in rows to column oil$Year <- 2012 #adds new column "Year" with value 2012 oil <- oil[c(1,2,5,3,4)] #reorders columns to place "Year" in 3rd column names(oil) <- c("Field", "Operator", "Year", "Month", "Oil.m3") #renames columns cond2012 <- read.table("cond2012.csv", header=T, sep=",") cond <- melt(cond2012, id=c("Field", "Operator")) cond$Year <- 2012 cond <- cond[c(1,2,5,3,4)] names(cond) <- c("Field", "Operator", "Year", "Month", "Cond.m3") drygas2012 <- read.table("drygas2012.csv", header=T, sep=",") drygas <- melt(drygas2012, id=c("Field", "Operator")) drygas$Year <- 2012 drygas <- drygas[c(1,2,5,3,4)] names(drygas) <- c("Field", "Operator", "Year", "Month", "Gas.Ksm3") assgas2012 <- read.table("assgas2012.csv", header=T, sep=",") assgas <- melt(assgas2012, id=c("Field", "Operator")) assgas$Year <- 2012 assgas <- assgas[c(1,2,5,3,4)] names(assgas) <- c("Field", "Operator", "Year", "Month", "Ass.Gas.Ksm3") prodwat2012 <- read.table("prodwat2012.csv", header=T, sep=",") prodwat <- melt(prodwat2012, id=c("Field", "Operator")) prodwat$Year <- 2012 prodwat <- prodwat[c(1,2,5,3,4)] names(prodwat) <- c("Field", "Operator", "Year", "Month", "Prod.Water.m3") injwat2012 <- read.table("injwat2012.csv", header=T, sep=",") injwat <- melt(injwat2012, id=c("Field", "Operator")) injwat$Year <- 2012 injwat <- injwat[c(1,2,5,3,4)] names(injwat) <- c("Field", "Operator", "Year", "Month", "Inj.Water.m3") #Removes rows with Jan and Feb 2012 DECC2 <- DECC[!(DECC$Month == "Jan" & DECC$Year==2012),] DECC3 <- DECC2[!(DECC2$Month == "Feb" & DECC2$Year==2012),] #Merges files based on column names merge1 <- merge(oil, cond, all=T) merge2 <- merge(merge1, drygas, all=T) merge3 <- merge(merge2, assgas, all=T) merge4 <- merge(merge3, prodwat, all=T) merge5 <- merge(merge4, injwat, all=T) DECCFinal <- merge(DECC3, merge5, all=T) #Replaces all NA values with 0 DECCFinal[is.na(DECCFinal)] <- 0