An R debugging example

21 May 2013
2013/05/21

The steps taken to fix an R problem.

Task

To prepare for the Portfolio Probe blog post called “Implied alpha and minimum variance”, I tried to update a matrix of daily stock prices using a function I had written for the purpose.

Error

When I tried to do what I wanted, I got:

> univclose130518 <- pp.updateclose(jjuc[,1:5])
done with: MMM  ABT  ANF  ACE  ACN  
Error in if (beg > end) stop("Start date must be before end date.") : 
  missing value where TRUE/FALSE needed

Actually this command is just trying to do the update on the first 5 stocks.  Paring the problem down is useful because:

  • you see only a few symbols rather than 400-some
  • It is much quicker to get to the error using just a few stocks

The last point is important because usually when debugging you will end up wanting to trigger the error multiple times.  Also, reducing the problem is often a good way to figure out where the problem actually is.

Read

The first thing to do is to read the error message.  It is saying that the result of  ‘beg > end‘ is NA.  That happens when at least one of beg and end is NA.

So we know that a missing value has crept in.  From the error message we know that it has to do with the start and/or end date.

We don’t know where it has gone wrong.

Look

We can (almost always) see what functions were in operation when an error is thrown by looking at the traceback:

> traceback()
3: getYahooData(symbols[1], start = start, end = end, quiet = TRUE, 
       adjust = adjust) at backfrom_Funs.R#55
2: pp.TTR.multsymbol(outnam, as.numeric(gsub("-", "", index(univclose)[1])), 
       as.numeric(today)) at backfrom_Funs.R#29
1: pp.updateclose(jjuc[, 1:5])

We see that the error is produced in the function in the TTR package that retrieves the data, but we haven’t gained any more clues about what is wrong.

The traceback in this case shows us line numbers of the file that contains the definitions of pp.updateclose and pp.TTR.multsymbol.  The line numbers are the spots where each of the two functions are when the error occurs.

Dive in

Just looking hasn’t given us satisfaction, so let’s go inside those functions and have a look around.

Before we can do that, we need to make it possible, and then re-execute the error:

> options(error=dump.frames)
> univclose130518 <- pp.updateclose(jjuc[, 1:5])
done with: MMM  ABT  ANF  ACE  ACN  
Error in if (beg > end) stop("Start date must be before end date.") : 
  missing value where TRUE/FALSE needed

We set the option that says what to do when an error is thrown.  Then we have to recreate the error so that action is actually done.

Now we are ready to go into the debugger:

> debugger()
Message:  Error in if (beg > end) stop("Start date must be before end date.") : 
  missing value where TRUE/FALSE needed
Available environments had calls:
1: pp.updateclose(jjuc[, 1:5])
2: backfrom_Funs.R#29: pp.TTR.multsymbol(outnam, as.n
3: backfrom_Funs.R#55: getYahooData(symbols[1], start

Enter an environment number, or 0 to exit  
Selection:

We get a menu of where we can look.  We know that we want to look at start and end times.  Let’s start with the top-level function:

Selection: 1
Browsing in the environment with call:
   pp.updateclose(jjuc[, 1:5])
Called from: debugger.look(ind)
Browse[1]> ls()
 [1] "backup"    "cc"        "dif"       "ncom"     
 [5] "newclose"  "ocom"      "outnam"    "overlap"  
 [9] "start"     "today"     "univclose"
Browse[1]> start
[1] "20130401"
Browse[1]> today
[1] "20130518"

We want to investigate the dates.  Everything looks fine here.  No problem.

So let’s back out of looking in the top-level function and go into the second function:

Browse[1]> c
Available environments had calls:
1: pp.updateclose(jjuc[, 1:5])
2: backfrom_Funs.R#29: pp.TTR.multsymbol(outnam, as.n
3: backfrom_Funs.R#55: getYahooData(symbols[1], start

Enter an environment number, or 0 to exit  
Selection: 2
Browsing in the environment with call:
   backfrom_Funs.R#29: pp.TTR.multsymbol(outnam, as.n
Called from: debugger.look(ind)
Browse[1]> ls()
[1] "adjust"        "end"           "fun.copyright"
[4] "item"          "start"         "symbols"      
[7] "verbose"      
Browse[1]> start
[1] 1
Browse[1]> end
[1] 20130518

Now we see something obviously wrong.  start should be a date, instead it is 1.

That’s impossible

It isn’t possible for a date in the top-level function to turn into 1 in the second function, yet there it is.

Debugging is hard because we trap ourselves into being sure about certain things.

If we take a more careful look at the traceback, we see that — at the point where the error happens — the start time for the second function is not start in the first function but rather:

as.numeric(gsub("-", "", index(univclose)[1]))

So the difference between start and start is not because of some cosmic ray or such.

The function

The definition of the function is:

pp.updateclose <- function (univclose, overlap=5, backup=FALSE) 
{
  # Placed in the public domain 2013 by Burns Statistics

  # Testing status: seems to work

  require(BurStMisc)

  today <- gsub("-", "", Sys.Date())

  if(backup) {
    save(univclose, file=paste("univclose_", today, ".rda", sep=""))
  }
  start <- if(inherits(univclose, "zoo")) {
      tail(index(univclose), overlap)[1]
    } else {
      tail(rownames(univclose), overlap)[1]
   }
  start <- substring(gsub("-", "", start), 1, 8)
  newclose <- pp.TTR.multsymbol(colnames(univclose), as.numeric(start),
                                as.numeric(today))
  cc <- intersect(colnames(univclose), colnames(newclose))
  if(nrow(univclose) < overlap) overlap <- nrow(univclose)
  ocom <- tail(univclose[, cc], overlap)
  ncom <- head(newclose[, cc], overlap)
  if(any(abs(ocom - ncom) / ocom > 1e-6)) {
    dif <- apply(abs(ocom - ncom) / ocom, 2, max)
    outnam <- cc[dif > 1e-6]
    replace <- pp.TTR.multsymbol(outnam, as.numeric(gsub("-", "", 
                                                         index(univclose)[1])), 
                                 as.numeric(today))
  } else {
    outnam <- NULL
  }
  if(length(cc) < ncol(univclose)) {
    stop("new has fewer assets")
  }
  ans <- rbind(univclose, tail(newclose, -overlap))
  if(length(outnam)) {
    ans[, outnam] <- replace
    warning("changes to: ", paste(outnam, collapse=", "))
  }
  print(cbind(corner(ans, 'bl', n=c(7,4)), corner(ans, 'br', n=c(7,4))))
  ans
}

The testing status is stated as “seems to work”.  Now we know that “works” is not universal.

What it’s doing

One of the first things the function does is to pick a start date so that there is an overlap between the new data that is retrieved and what is already there.  This is an attempt to avoid ‘Living it up with computational errors’.  We don’t want to join two price series if they don’t agree with each other.  (It is reasonably common for the prices to change because of corporate actions.)

It then retrieves data and compares the overlapping data.  For stocks that have mismatches, it does another retrieval going back to the start of the data.

So we get a little data for all the stocks, then we get a lot of data for a few stocks.  This avoids getting a lot of data for all the stocks, yet having confidence in the data.

The final thing it does is to use the corner function to print out a little of the  most recent updated data so the user can verify that the updating seemed to go okay.  Things to look for include:

  • repeated dates
  • missing dates
  • missing data
  • prices that don’t change as expected

What goes wrong

In our example the first retrieval goes okay.  It then gets confused about what the start time is for the second retrieval.

The price data can be in two forms:

  • a matrix with the dates in the row names
  • a zoo (more specifically xts) object

The start time for the initial retrieval looks to see which type of object we have and takes appropriate action.  The second start time just assumes we have a zoo object.  But really we have a matrix.

Solution

We can write a sub-function that gets the start time.  Writing a function means that we don’t have to repeat the same operations in two places.  This makes it easier to debug if any more bugs come to light.  It also gives a natural way to allow additional data structures.

The new sub-function is:

sfun.makeStart <- function(x)
{
  start <- if(inherits(x, "zoo")) {
    index(x)[1]
  } else {
    rownames(x)[1]
  }
  substring(gsub("-", "", start), 1, 8)
}

The first use of it is:

start <- sfun.makeStart(tail(univclose, overlap))

The second use is:

start2 <- sfun.makeStart(univclose)

Why was it wrong

Whenever you find a bug, it is worthwhile asking yourself why it is there.  The most important reason for doing this is so that you will think of where else that same bug may appear.  (Go fix them now.)

In this case the bug was there because the function was repurposed from another place where xts objects were guaranteed, and there was not an abstraction for “make start time”.

More on debugging

Circle 8 of The R Inferno begins with a brief introduction to debugging in R.  The rest of Circle 8 is a bunch of ways that things can go rogue.

Tao Te Programming has a chapter on debugging.  Plus it has suggestions on how to try to avoid needing to debug in the first place.

Tags:
0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *

© Copyright - Burns Statistics