Tickling North Sea data using R

Right at the start of my PhD, pretty much the first thing my supervisor sat me down to do was to learn R (see CRAN for download).  I’ll be honest and admit that I didn’t really see the use of a statistical programming language for the kind of work I envisaged doing.  Two years later…well, I’ve spent a lot of that time so far using R to draw the sort of graphs you could do in Excel in a few seconds.  Granted, Excel won’t do histograms or boxplots as part of it’s default set of plotting features, and in that regards, R is very handy.  But thus far, this is all I’ve used it for.  I set out my data in Excel, then import it into R and run a script to plot something basic.

Indeed, this is exactly what I had been doing in preparation of a new blog post.  I had a large set of data which I’ve been compiling since I started my PhD, and although it is really of no real use for my research, I wanted to at least do something with it.  The data is production data for the UK North Sea for every operating oil and gas field, monthly, from 1975 to the present.  A twelve month rolling update of the data is provided by DECC, but for anything older you would need to email DECC to ask for it.  My supervisor emailed me what data he had (1975-2009, I think at the time), and I got the rest up to the end of 2011 from DECC to update the spreadsheet.

Yes, it’s a spreadsheet.  As it stood – including the 2011 data – the Excel file had around 62,900 rows of data in 10 columns of variables.  The data as supplied by DECC, however, came with the variables in tabs, and ran monthly data in rows, as apposed to columns in my big ‘master’ document.  When I had to update the master document, it involved copying, pasting and transposing row by row.  I probably added around 3-4,000 rows to the master in a tedious and long-winded exercise (it took my about a week).  At the time I thought “There must be a better way to do this, using R or something“, but I never looked into it any further.

Fast forward to this week, when I started thinking about plotting up the data for this blog.  I realised that I had the 2012 data, but hadn’t updated the master spreadsheet with it yet.  I was all set to spend hours copying & pasting, but instead got chatting to a colleague (Mike Spencer, Scottish Snow blog) about it.  He resurrected the idea of using R to join my master data to the 2012 update, and with help from Mike I was set down the right path, and actually enjoyed tickling the data into a single dataset.  I’ve got to say, I was immensely satisfied to have it work out.  There’s a few very minor issues with it, but that’s more to do with my OCD than anything fundamentally wrong with the data…

The solution was pretty straightforward, in the end, if a little tortuous.  I found the Quick R page extremely useful, as always, as well as Stack Overflow for more complex problems.  You can see the issue I had in the following pictures:

My master spreadsheet has the production data (oil, gas, condensate, etc.) as well as year and month arranged in columns, whereas the original data from DECC comes with the production data tabbed with each column being a different month.  The first task was therefore to save the tabs as separate .CSV files to import into R, and then use the melt function in the reshape package to transpose the monthly data into rows, instead of columns.  In preparation for merging all the production data later, I also needed to add a “Year” column, and arrange the columns in the same order that appear in the master data.

This was done for all the different production types (oil, dry gas, associated gas, condensate, water and injected water), prior to using the merge function, which takes two data frames and glues them together in some cunning way, merging duplicate values (i.e. field names) and adding unique values (i.e. oil and gas volumes) in the right place.  Once all the 2012 data was merged together, I could merge it with the master data, and once I’d converted any NA values to 0, was now suitable for extracting what I was interested in for plotting.  The whole thing took me a day to do, and was very satisfying.  The code I used is below, if you are interested.  I’m sure there’s a more elegant way of doing this, but this worked for me!

I’ll put the plots in a separate post, to follow this one shortly.


###### Libraries ##############
##### Data manipulation #######

DECC <- read.table("DECC.csv", header=T, sep=",")

oil2012 <- read.table("oil2012.csv", header=T, sep=",")
oil <- melt(oil2012, id=c("Field", "Operator")) #melt function from reshape package, converts data in rows to column
oil$Year <- 2012 #adds new column "Year" with value 2012
oil <- oil[c(1,2,5,3,4)] #reorders columns to place "Year" in 3rd column
names(oil) <- c("Field", "Operator", "Year", "Month", "Oil.m3") #renames columns

cond2012 <- read.table("cond2012.csv", header=T, sep=",")
cond <- melt(cond2012, id=c("Field", "Operator"))
cond$Year <- 2012
cond <- cond[c(1,2,5,3,4)]
names(cond) <- c("Field", "Operator", "Year", "Month", "Cond.m3")

drygas2012 <- read.table("drygas2012.csv", header=T, sep=",")
drygas <- melt(drygas2012, id=c("Field", "Operator"))
drygas$Year <- 2012
drygas <- drygas[c(1,2,5,3,4)]
names(drygas) <- c("Field", "Operator", "Year", "Month", "Gas.Ksm3")

assgas2012 <- read.table("assgas2012.csv", header=T, sep=",")
assgas <- melt(assgas2012, id=c("Field", "Operator"))
assgas$Year <- 2012
assgas <- assgas[c(1,2,5,3,4)]
names(assgas) <- c("Field", "Operator", "Year", "Month", "Ass.Gas.Ksm3")

prodwat2012 <- read.table("prodwat2012.csv", header=T, sep=",")
prodwat <- melt(prodwat2012, id=c("Field", "Operator"))
prodwat$Year <- 2012
prodwat <- prodwat[c(1,2,5,3,4)]
names(prodwat) <- c("Field", "Operator", "Year", "Month", "Prod.Water.m3")

injwat2012 <- read.table("injwat2012.csv", header=T, sep=",")
injwat <- melt(injwat2012, id=c("Field", "Operator"))
injwat$Year <- 2012
injwat <- injwat[c(1,2,5,3,4)]
names(injwat) <- c("Field", "Operator", "Year", "Month", "Inj.Water.m3")

#Removes rows with Jan and Feb 2012
DECC2 <- DECC[!(DECC$Month == "Jan" & DECC$Year==2012),]
DECC3 <- DECC2[!(DECC2$Month == "Feb" & DECC2$Year==2012),]

#Merges files based on column names
merge1 <- merge(oil, cond, all=T)
merge2 <- merge(merge1, drygas, all=T)
merge3 <- merge(merge2, assgas, all=T)
merge4 <- merge(merge3, prodwat, all=T)
merge5 <- merge(merge4, injwat, all=T)
DECCFinal <- merge(DECC3, merge5, all=T)

#Replaces all NA values with 0
DECCFinal[is.na(DECCFinal)] <- 0


7 thoughts on “Tickling North Sea data using R

  1. You could try something like this to save on typing out each sheet:

    DECC.import = function(file,column.name,year){
    a = read.table(file, header=T, sep=”,”)
    a = melt(a, id=c(“Field”, “Operator”))
    a$Year = year
    a = a[c(1,2,5,3,4)]
    names(a) = c(“Field”, “Operator”, “Year”, “Month”, column.name)

    One step further would be to use lapply and list.files. Something along the lines of:

    setwd(“/yourdirectory”) # C:/ something for windows
    files = list.files(“.”) # whatever the windows equivalent is for current directory…
    filenames = unlist(strsplit(files,”[.]”)) # separate file from extension
    filenames = filenames[seq(1,length(filenames),2)] # remove extension from vector (crudely!)
    # This runs the function above and puts the output in a list
    new.data = lapply(files,function(i){

    You’d then need to figure out how to merge your list… which I could use help with!

  2. Pingback: Visualising Some Tickled North Sea Data | Vitamin CCS

  3. Pingback: An exciting new job | Promptus et Fidelis

  4. Pingback: Some satisfaction with R | Vitamin CCS

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s