# R para principiantes # PUCE, Quito, 5-8 enero 2010 # Simon Queenborough # MIERCOLES: clase 6 # GRAFICOS AVANCADOS ############################### pie chart ######################### # We demonstrate the pie chart using the avian influenza dataset from Exercise 1 # in Section 3.7. Recall that the data represent the numbers of confirmed human # cases of Avian Influenza A/(H5N1) reported to theWorld Health Organization # (WHO). The data for several countries were taken from the WHO website at # www.who.int and are reproduced only for educational purposes. We exported # the data in the Excel file, BidFlu.xls, to a tab-separated ascii file with the name # Birdflucases.txt. The following code imports the data and presents the usual information. setwd("C:/R_Quito/data/") BFCases <- read.table(file = "Birdflucases.txt", header = TRUE) names(BFCases) str(BFCases) # We have annual data from the years 2003–2008. The first variable contains # the years. There are various things we can learn fromthis dataset.An interesting # question is whether the number of bird flu cases has increased over time.We can # address this question for individual countries or for the total number of cases. # The latter is calculated by Cases <- rowSums(BFCases[, 2:16]) names(Cases) <- BFCases[,1] Cases # Columns 2–16 of BFCases contain the information per country. The rowSums function # calculates totals per year and the names function adds the labels2003–2008 # to the variable Cases. (Note that the 34 cases in 2008 is misleading, # as this was written halfway through 2008. If this were a proper statistical # analysis, the 2008 data would be dropped.) The function for a pie chart in R # is pie. It has various options. The pie function requires as input a vector of # nonnegative numerical quantities; anything more is optional and deals with labels, colours, and the like. par(mfrow = c(2, 2), mar = c(3, 3, 2, 1)) pie(Cases, main = "Ordinary pie chart") #A pie(Cases, col = gray(seq(0.4, 1.0, length = 6)), clockwise = TRUE, main = "Grey colours") #B pie(Cases, col = rainbow(6), clockwise = TRUE, main = "Rainbow colours") #C library(plotrix) # hay que bajar plotrix por el R console > packages > install packages pie3D(Cases, labels = names(Cases), explode = 0.1, main = "3D pie chart", labelcex = 0.6) #D # The par function is discussed in the next section. The variable Cases is of # length 6 and contains totals per year. The command pie (Cases) creates the # pie chart in Fig. 7.1A. Note that the direction of the slices is anticlockwise, # which may be awkward, because our variable is time related. We reversed this # in the second pie chart (Fig. 7.1B) with the option clockwise = TRUE.We # also changed the colours, try # this yourself: type in the code and see the colours of the pie charts in panels # A–C. # Because most of your work is likely to end up in a greyscale paper or # report, we recommend using greyscale fromthe beginning. The only exception # is for a PowerPoint presentation, where it is useful to present coloured pie # charts. Note that the term ‘‘useful’’ refers to ‘‘coloured,’’ rather than to pie # charts per se. The main problem with the pie chart is illustrated in Fig. 7.1: # Although 2005 and 2006 have the largest slices, it is difficult to determine # whether you should stay at home and close the windows and doors to survive # the next pandemic, or whether ‘‘only’’ a handful of people were unfortunate # enough to contract the disease. The pie chart does not give information on # sample size. # # Finally, Fig. 7.1D shows a three-dimensional pie chart. Although it # now looks more like a real pie, it is, if anything, even less clear in its # presentation than the other three graphs. To make this graph, you need # to install the package plotrix. The function pie3D has many options, # and we suggest that you consult its help file to improve the readability of # labels. ################################ par ################################### # The par function has an extensive list of graph parameters (see ?par) that can # be changed. Some options are helpful; others you may never use. # The mfrow =c (2, 2) creates a graphic window with four panels. Chan- # ging the c (2, 2) to c (1, 4) or c (4, 1) produces a row (or column) of # four pie charts. If you have more than four graphs, for instance 12, use mfrow # =c(3, 4), although now things can become crowded. # # The mar option specifies the amount of white space around each graph (each # pie chart in this case). The white space is defined by the number of lines of # margin at the four sides; bottom, left, top, and right. The default values are, # respectively, c (5, 4, 4, 2)+0.1. Increasing the values gives more white # space. Using trial and error, we chose c (3, 3, 2, 1). # # A problem arises with the par function if you execute the code for the four # pie charts above and, subsequently, make another graph. R is still in the 2 62 # mode, and will overwrite Figure 7.1A, leaving the other three graphs as they are. # # The next graph will overwrite panel B, and so on. There are two ways to avoid # this. The first option is simply to close the four-panel graph in R before making # a new one. This is a single mouse click. The alternative is a bit more programming intensive: op <- par(mfrow = c(2, 2), mar = c(3, 3, 2, 1)) pie(Cases, main = "Ordinary pie chart") pie(Cases, col = gray(seq(0.4, 1.0, length = 6)), clockwise = TRUE, main = "Grey colours") pie(Cases, col = rainbow(6), clockwise = TRUE, main = "Rainbow colours") pie3D(Cases, labels = names(Cases), explode = 0.1, main = "3D pie chart", labelcex = 0.6) par(op) # The graph parameter settings are stored in the variable op on the first line. # The graphs are made as before, and the last line of code converts to the default # settings. Any new graph created after the par (op) command will be plotted as # if the par function had not been used. This is useful if you need to create many # graphs in sequence. It is neat programming, but takes more typing. It is often # tempting to be lazy and go for the first approach. However, for good program- # ming practice, we recommend making the extra effort. You will also see this # style of programming in the help files. ################################ bar chart & strip chart ############################### ### bar chart # In the previous section, an avian influenza dataset was used to create pie charts # showing the total number of cases per year. In addition to bird flu cases, the # number of deaths is also available and can be found in the tab-separated ascii # file, Birdfludeaths.txt. The data are loaded with the commands: BFDeaths <- read.table(file = "Birdfludeaths.txt", header = TRUE) Deaths <- rowSums(BFDeaths[, 2:16]) names(Deaths) <- BFDeaths[,1] Deaths # The data are structured in the same manner as the bird flu cases. We can # visualise the change in the number of cases over time, and then compare number # of cases to deaths. # # The bar chart shows the change in the number of cases # over time using the data from the variable Cases. Recall that Cases has six values with the # labels 2003–2008. Each year is presented as a vertical bar. This graph is # more useful than the pie chart, as we can read the absolute values from # the y-axis. However, a great deal of ink and space is consumed by only # six values. par(mfrow = c(2, 2), mar = c(3, 3, 2, 1)) barplot(Cases , main = "Bird flu cases") #A Counts <- cbind(Cases, Deaths) barplot(Counts) #B barplot(t(Counts), col = gray(c(0.5, 1))) #C barplot(t(Counts), beside = TRUE) #D # In panels B–D, we used the combined data for cases and deaths; these are # called Counts and are of dimension 6 6 2: Counts # In panel B the bars represent data for each year. The graph gives little usable # information. Also, years with small numbers (e.g., 2003) are barely visible. To # produce panel C, we took the transposed values of Counts using the function t, # making the input for the barplot function a matrix of dimension 2 6 6. t(Counts) # Although you see many such graphs in the literature, they can be misleading. # If you compare the white boxes with one another, your eyes tend to compare the # values along the y-axis, but these are affected by the length of the grey boxes. If # your aim is to show that in each year there are more cases than deaths, this # graph may be sufficient (comparing compositions). Among the bar charts, # panel D is probably the best. It compares cases and deaths within each year, # and, because there are only two classes per year, it is also possible to compare # cases and deaths among years. #### bar chart of mean values with standard deviations # Core samples were taken at 45 stations on # nine beaches along the Dutch coastline. The marine benthic species were # determined in each sample with over 75 identified. # The file RIKZ2.txt contains the richness values for the 45 stations and also a column # identifying the beach. # The following R code imports the data and calculates the mean richness and # standard deviation per beach. setwd("C:/R_Quito/data/") Benthic <- read.table(file = "RIKZ2.txt", header = TRUE) Bent.M <- tapply(Benthic$Richness, INDEX = Benthic$Beach, FUN = mean) Bent.sd <- tapply(Benthic$Richness, INDEX = Benthic$Beach, FUN = sd) MSD <- cbind(Bent.M, Bent.sd) # The variable Bent.M contains the mean richness values, and Bent.sd the # standard deviation, for each of the nine beaches.We combined them in a matrix # MSD with the cbind command. The values are as follows: MSD # To make a graph in which the mean values are plotted as a bar and the # standard deviations as vertical lines extending above the bars use the # following procedure. For the graph showing mean values, enter barplot(Bent.M) # Add labels and perhaps some colour for interest: barplot(Bent.M, xlab = "Beach", ylim = c(0, 20), ylab = "Richness", col = rainbow(9)) # The vertical lines indicating standard deviations are added using the function # arrows to draw an arrow between two points with coordinates (x1, y1)and(x2, # y2). TellingR to draw an arrow between the points (x, y1)and(x, y2), will produce a # vertical arrow, as both points have the same x-value. The y1-value is the mean, and # the y2-value is the mean plus the standard deviation. The x is the coordinate of the # midpoint of a bar. The following code obtains these values and creates Fig. 7.3A. bp <- barplot(Bent.M, xlab = "Beach", ylim = c(0,20), ylab = "Richness", col = rainbow(9)) arrows(bp, Bent.M, bp, Bent.M + Bent.sd, lwd = 1.5, angle = 90, length = 0.1) box() # It is the bp<–barplot (Bent.M, ...) that helps us out. The best way to # understand what it does is by typing: bp # They are the midpoints along the x-axis of each bar, which are used as input # for the arrows function. The angle = 90 and length = 0. 1 options # change the head of the arrow into a perpendicular line. The lwd stands for line # width with 1 as the default value. The box function draws a box around the # graph. Run the code without it and see what happens. #################################### strip chart ############################### # It is relatively easy to produce # a graph with the raw data, mean values, and either the standard deviation # or standard error around the mean. An example is given in Fig. 7.3B. # Instead of using the plot function, we used the stripchart function. # The open dots show the raw data. We have added random jittering (varia- # tion) to distinguish observations with the same value, which would other- # wise coincide. The filled dots are the mean values per beach, and were # calculated in the previous section. We illustrate the standard errors, which # are calculated by dividing the standard deviation by the square root of the # sample size (we have five observations per beach). In R, this is done as follows. Benth.le <- tapply(Benthic$Richness, INDEX = Benthic$Beach, FUN = length) Bent.se <- Bent.sd / sqrt(Benth.le) # The variable Bent.se now contains the standard errors. Adding the # lines for standard error to the graph is now a matter of using the arrow # function; an arrow is drawn from the mean to the mean plus the standard # error, and also from the mean to the mean minus the standard error. The # code is below. stripchart(Benthic$Richness ~ Benthic$Beach, vert = TRUE, pch = 1, method = "jitter", jit = 0.05, xlab = "Beach", ylab = "Richness") points(1:9, Bent.M, pch = 16, cex = 1.5) arrows(1:9, Bent.M, 1:9, Bent.M + Bent.se, lwd = 1.5, angle = 90, length = 0.1) arrows(1:9, Bent.M, 1:9, Bent.M - Bent.se, lwd = 1.5, angle = 90, length = 0.1) # The options in the stripchart function are self-explanatory. Change them # to see what happens. The points function adds the dots for the mean values. # Instead of the stripchart function, you can also use the plot function, but # it does not have a method = "jitter" option. Instead you can use jitter # (Benthic$Richness). ########################## boxplot ################################## # The boxplot should most often be your tool of choice, especially when working # with a continuous numerical response (dependent) variable and categorical # explanatory (independent) variables. Its purpose is threefold: detection of out- # liers, and displaying heterogeneity of distribution and effects of explanatory # variables. Proper use of this graphing tool, along with the Cleveland dotplot, # can provide a head start on analysis of data. setwd("C:/R_Quito/data/") Owls <- read.table(file = "Owls.txt", header = TRUE) boxplot(Owls$NegPerChick) # The thick horizontal line is the median; the box is # defined by the 25th and 75th percentiles (lower and upper quartile). The difference between the # two is called the spread. The dotted line has a length of 1.5 times the spread. (The length of the # line pointing up is shorter if the values of the points are smaller than the 75th percentile+1.56 # spread, and similar for the line pointing downwards. This explains why there is no line at the # bottom of the box.) All points outside this range are potential outliers. S par(mfrow = c(2,2), mar = c(3, 3, 2, 1)) boxplot(NegPerChick ~ SexParent, data = Owls) boxplot(NegPerChick ~ FoodTreatment, data = Owls) boxplot(NegPerChick ~ SexParent * FoodTreatment, data = Owls) boxplot(NegPerChick ~ SexParent * FoodTreatment, names = c("F/Dep", "M/Dep", "F/Sat", "M/Sat"), data = Owls) # There are 27 nests, all with long names. If we had entered boxplot(NegPerChick ~ Nest, data = Owls) # only a few of the labels would be shown. The solution was to create the boxplot # without a horizontal axis line and to put the labels in a small font, at an angle, # under the appropriate boxplot. This sounds complicated, but requires only # three lines of R code. par(mar = c(2, 2, 3, 3)) boxplot(NegPerChick ~ Nest, data = Owls, axes = FALSE, ylim = c (-3, 8.5)) axis(2, at = c(0, 2, 4, 6, 8)) text(x = 1:27, y = -2, labels = levels(Owls$Nest), cex = 0.75, srt = 65) # Because we used the option axes =FALSE, the boxplot function drew # the boxplot without axes lines. The ylim specifies the lower and upper limits of # the vertical axis. Instead of using limits from 0 to 8.5, we used –3 to 8.5. This # allowed us to put the labels in the lower part of the graph # # The axis function draws an axis. Because we entered 2 as the first argument, # the vertical axis on the left is drawn, and the at argument specifies where the # tick marks should be. The text command places all the labels at the appro- # priate coordinates. The cex argument specifies the font size (1 is default) and # srt defines the angle. You will need to experiment with these values and choose # the most appropriate settings. ############################## cleveland dotplots ################################### # Before doing any analysis, we should inspect each continuous variable in the # dataset for outliers. This can be done with a boxplot or with a Cleveland dotplot. # The graphs were created using the R function, dotchart. Function # dotchart2 in the package Hmisc (which is not part of the base installation) # can produce more sophisticated presentations. We limit our discussion to # dotchart. The data are imported with the following two lines of code. setwd("C:/RBook/") Deer <- read.table("Deer.txt", header = TRUE) dotchart(Deer$LCT, xlab = "Length (cm)", ylab = "Observation number") # The dotchart function has various options. The groups option allows # grouping the data by categorical variable: dotchart(Deer$LCT, groups = factor(Deer$Sex)) # The variable Sex has missing values (type Deer $Sex in the R console to # view them), and, as a result, the dotchart function stops and produces an # error message. The missing values can easily be removed with the following # code. Isna <- is.na(Deer$Sex) dotchart(Deer$LCT[!Isna], groups = factor(Deer$Sex[!Isna]), xlab = "Length (cm)", ylab = "Observation number grouped by sex") # The is.na function produces a vector of the same length as Sex, with the # values TRUE and FALSE. The ! symbol reverses them, and only the values for # which Sex is not a missing value are plotted. # If you want to have the two Cleveland dotplots in one graph, put the # par (mfrow = c (1, 2) ) in front of the first dotchart. # Cleveland dotplots are a good alternative to boxplots when working with small # sample sizes. Figure 7.9A shows a Cleveland dotplot of the benthic data used # earlier in this chapter. Recall that there are five observations per beach. The # right graph shows the same information with the mean value for each beach # added. This graph clearly shows at least one ‘‘suspicious’’ observation. The code # is basic; see below. The first three commands import the data, with Beach # defined as a factor. A graph window with two panels is created with the par # function. The first dotchart command follows that of the deer data. To the # second dotchart command, we have added the gdata and gpch options. # The g stands for group, and the gdata attribute is used to overlay a summary # statistic such as the median, or, as we do here with the tapply function, the # mean. Finally, the legend function is used to add a legend.We discuss the use # of the legend function in more detail later in this chapter. Benthic <- read.table(file = "RIKZ2.txt", header = TRUE) Benthic$fBeach <- factor(Benthic$Beach) par(mfrow = c(1, 2)) dotchart(Benthic$Richness, groups = Benthic$fBeach, xlab = "Richness", ylab = "Beach") Bent.M<-tapply(Benthic$Richness, Benthic$Beach, FUN = mean) dotchart(Benthic$Richness, groups = Benthic$fBeach, gdata = Bent.M, gpch = 19, xlab = "Richness", ylab = "Beach") legend("bottomright", c("values", "mean"), pch = c(1, 19), bg = "white") ####################### mas 'plot' ################################################ # abline plot(y = Benthic$Richness, x = Benthic$NAP, xlab = "Mean high tide (m)", ylab = "Species richness", main = "Benthic data") M0 <- lm(Richness ~ NAP, data = Benthic) abline(M0) # The new addition is the lm and abline functions. Without going into statis- # tical detail, the lm applies linear regression in which species richness is modelled as # a function of NAP, the results are stored in the list M0,andthe abline function # superimposes the fitted line. Note that this only works if there is a single explana- # tory variable (otherwise, plotting the results in a two-dimensional graph becomes # difficult), and if the abline function is executed following the plot function. # xlim, ylim # The xlim and ylim specify the ranges along the x- and y-axes. # Suppose that you wish to set # the range of the horizontal axis from –3 to 3 metres and the range along the # vertical axis from 0 to 20 species. Use plot(y = Benthic$Richness, x = Benthic$NAP, xlab = "Mean high tide (m)", ylab = "Species richness", xlim = c(-3, 3), ylim = c(0,20)) # The xlim argument has to be of the form c(x1, x2), with numerical values # for x1 and x2. The same holds for the ylim argument. plot(y = Benthic$Richness, x = Benthic$NAP, type = "n", axes = FALSE, xlab = "Mean high tide", ylab = "Species richness") points(y = Benthic$Richness, x = Benthic$NAP) # The type = n produces a graphwithout points, and, because we use axes = # FALSE, no axes lines are plotted. We begin with a blank window with only the # labels. The points function superimposes the points onto the graph (note that # you must execute the plot function prior to the points function or an error # message will result). # # In panel C, we basically told R to prepare a graph window, but not to plot # anything. We can then proceed, in steps, to build the graph shown in Panel D. # The axis function is the starting point in this process. It allows specifying the # position, direction, and size of the tick marks as well as the text labelling them. plot(y = Benthic$Richness, x = Benthic$NAP, type = "n", axes = FALSE, xlab = "Mean high tide", ylab = "Species richness", xlim = c(-1.75,2), ylim = c(0,20)) points(y = Benthic$Richness, x = Benthic$NAP) axis(2, at = c(0, 10, 20), tcl = 1) axis(1, at = c(-1.75, 0,2), labels = c("Sea", "Water line", "Dunes")) # The first two lines of code are identical to those for panel C. The axis (2, ...) # command draws the vertical axis line and inserts tick marks of length 1 (the # default value is –0.5) at the values 0, 10, and 20. Setting tcl to 0 eliminates tick # marks. Tick marks pointing outwards are obtained by a negative tcl value; a # positive value gives inward pointing tick marks. The axis (1, ...) command # draws the horizontal axis, and, at the values –1.75, 0, and 2, adds the character # strings Sea, Water line, and Dunes. See the axis help file for further graphing # facilities. ######################## puntos, texto, lineas ################################## # This section addresses features that can be used to increase the visual appeal of # graphs. Possible embellishmentsmight be different types of lines and points, grids, # legends, transformed axes, andmuchmore. Look at the par help file, obtained by # typing ?par, to see many of the features that can be added and altered. # The function points adds new values to a plot, such as x-values and (option- # ally) y-values. By default, the function plots points, so, just as with plot, type is # set to "p". However, all the other types can be used: "l" for lines, "o" for # overplotted points and lines, "b" for points and lines, "s" and "S" for steps, and # "h" for vertical lines. Finally, "n" produces a graph-setup with no data points or # lines (see Section 7.5.2). Symbols can be changed using pch. # # The function text is similar to points in that it uses x and (optionally) y- # coordinates but adds a vector called labels containing the label strings to be # positioned on the graph. It includes extra tools for fine-tuning the placement of # the string on the graph, for example, the attributes pos and offset. The pos # attribute indicates the positions below, to the left of, above, and to the right of # the specified coordinates (respectively, 1, 2, 3, 4) and offset gives the offset of the # label from the specified coordinate in fractions of a character width. These two # options become relevant with long character strings that are not displayed # properly in R’s default display. # Using type = "n" # With the plot function, it is possible to include the attribute type = "n" to # draw everything but the data. The graph is set up for data, including axes and # their labels. To exclude these, add axes = FALSE, xlab = "", ylab = # "". It then appears there is nothing left. However, this is not the case, because # the plot retains the data that were entered in the first part of the plot function. # The user is now in full control of constructing the plot. Do you want axes lines? # If so, where do you want themand how do you want themto look?Do you want # to display the data as points or as lines? Everything that is included in the # default plot, and much more, can be altered and added to your plot. # Identifying Points # The function identify is used to identify (and plot) points on a plot. It can be # done by giving the x, y coordinates of the plot or by simply entering the plot # object (which generally defines or includes coordinates). Here is an example: plot(y = Benthic$Richness, x = Benthic$NAP, xlab = "Mean high tide (m)", ylab = "Species richness", main = "Benthic data") identify(y = Benthic$Richness, x = Benthic$NAP) # With the attribute labels in the identify function, a character vector giving # labels for the points can be included. To specify the position and offset of the # labels relative to the points; place your mouse near a point and left-click; R will # plot the label number close to the point. Press ‘‘escape’’ to cancel the process. It # is also possible to use the identify function to obtain the sample numbers of # certain points; see its help file.Note that the identify function only works for # graphs created with the plot function, and not with boxplots, dotcharts, bar # charts, pie charts, and others. # adding special characters > setwd("C:/R_Quito/data/") Whales <- read.table(file="TeethNitrogen.txt", header = TRUE) N.Moby <- Whales$X15N[Whales$Tooth == "Moby"] Age.Moby <- Whales$Age[Whales$Tooth == "Moby"] plot(x = Age.Moby, y = N.Moby, xlab = "Age", ylab = expression(paste(delta^{15}, "N"))) ########################### pairplot ############################### Benthic <- read.table(file = "RIKZ2.txt", header = TRUE) pairs(Benthic[, 2:9]) ########################## coplot ################################### coplot(Richness ~ NAP | as.factor(Beach), pch=19, data = Benthic) coplot(Richness ~ NAP | grainsize, pch=19, data = Benthic) # IF YOU FINISH: http://www.harding.edu/fmccown/R/