# R para principiantes # PUCE, Quito, 5-8 enero 2010 # Simon Queenborough # MIERCOLES: clase 10 # RANDOM DATA # Although Einstein said that god does not play dice, R can. For example sample(1:6,10,replace=T) # or with a function RollDie = function(n) sample(1:6,n,replace=T) RollDie(5) # In fact, R can create lots of di erent types of random numbers ranging from familiar families of distributions to # specialized ones. # Random number generators in R{ the \r" functions. # As we know, random numbers are described by a distribution. That is, some function which species the probabil- # ity that a random number is in some range. For example P(a < X <= b). Often this is given by a probability density # (in the continuous case) or by a function P(X = k) = f(k) in the discrete case. R will give numbers drawn from lots # of different distributions. In order to use them, you only need familiarize yourselves with the parameters that are # given to the functions such as a mean, or a rate. Here are examples of the most common ones. For each, a histogram # is given for a random sample of size 100, and density (using the \d" functions) is superimposed as appropriate. # Uniform # # Uniform numbers are ones that are "equally likely" to be in the specified range. Often these numbers are # in [0,1] for computers, but in practice can be between [a,b] where a,b depend upon the problem. An example # might be the time you wait at a traffic light. This might be uniform on [0; 2]. runif(1,0,2) # time at light runif(5,0,2) # time at 5 lights runif(5) # 5 random numbers in [0,1] # The general form is runif(n,min=0,max=1) which allows you to decide how many uniform random numbers # you want (n), and the range they are chosen from ([min,max]) # To see the distribution with min=0 and max=1 (the default) we have x=runif(100) # get the random numbers hist(x,probability=TRUE,col=gray(.9),main="uniform on [0,1]") curve(dunif(x,0,1),add=T) ## Normal. ## # Normal numbers are the backbone of classical statistical theory due to the central limit theorem The # normal distribution has two parameters a mean  and a standard deviation . These are the location and # spread parameters. For example, IQs may be normally distributed with mean 100 and standard deviation # 16, Human gestation may be normal with mean 280 and standard deviation about 10 (approximately). The # family of normals can be standardized to normal with mean 0 (centered) and variance 1. This is achieved by # "standardizing" the numbers, i.e. Z = (X - mu) / sigma. # here are some examples rnorm(1,100,16) # an IQ score rnorm(1,mean=280,sd=10) # Here the function is called as rnorm(n,mean=0,sd=1) where one specifies the mean and the standard deviation. # Too see the shape for the defaults (mean 0, standard deviation 1). x=rnorm(100) hist(x,probability=TRUE,col=gray(.9),main="normal mu=0,sigma=1") curve(dnorm(x),add=T) ## also for IQs using rnorm(100,mean=100,sd=16) # Binomial. # # The binomial random numbers are discrete random numbers. They have the distribution of the number # of successes in n independent Bernoulli trials where a Bernoulli trial results in success or failure, success with # probability p. # A single Bernoulli trial is given with n=1 in the binomial n=1; p=.5 # set the probability rbinom(1,n,p) # different each time rbinom(10,n,p) # 10 different such numbers # A binomially distributed number is the same as the number of 1's in n such Bernoulli numbers. For the last # example, this would be 5. There are then two parameters n (the number of Bernoulli trials) and p (the success # probability). # To generate binomial numbers, we simply change the value of n from 1 to the desired number of trials. For # example, with 10 trials: n = 10; p=.5 rbinom(1,n,p) # 6 successes in 10 trials rbinom(5,n,p) # 5 binomial number # The number of successes is of course discrete, but as n gets large, the number starts to look quite normal. This # is a case of the central limit theorem which states in general that ( X - mu ) / sigma is normal in the limit # The graphs show 100 binomially distributed random numbers for 3 values of n and for p = 0.25. # Notice in the graph, as n increases the shape becomes more and more bell-shaped. These graphs were made # with the commands n=5;p=.25 # change as appropriate x=rbinom(100,n,p) # 100 random numbers hist(x,probability=TRUE,) ## use points, not curve as dbinom wants integers only for x xvals=0:n;points(xvals,dbinom(xvals,n,p),type="h",lwd=3) points(xvals,dbinom(xvals,n,p),type="p",lwd=3) # ... repeat with n=15, n=50 # Exponential. # # The exponential distribution is important for theoretical work. It is used to describe lifetimes of # electrical components (to first order). For example, if the mean life of a light bulb is 2500 hours one may # think its lifetime is random with exponential distribution having mean 2500. The one parameter is the rate = # 1/mean. We specify it as follows rexp(n,rate=1). Here is an example with the rate being 1/2500 (figure 28). x=rexp(100,1/2500) hist(x,probability=TRUE,col=gray(.9),main="exponential mean=2500") curve(dexp(x,1/2500),add=T) # There are others of interest in statistics. Common ones are the Poisson, the Student t-distribution, the F # distribution, the beta distribution and the 2 (chi squared) distribution. ### Sampling with and without replacement using sample #### # R has the ability to sample with and without replacement. That is, choose at random from a collection of things # such as the numbers 1 through 6 in the dice rolling example. The sampling can be done with replacement (like # dice rolling) or without replacement (like a lottery). By default sample samples without replacement each object # having equal chance of being picked. You need to specify replace=TRUE if you want to sample with replacement. # Furthermore, you can specify separate probabilities for each if desired. # Here are some examples ## Roll a die sample(1:6,10,replace=TRUE) ## toss a coin sample(c("H","T"),10,replace=TRUE) ## pick 6 of 54 (a lottery) sample(1:54,6) # no replacement ## pick a card. (Fancy! Uses paste, rep) cards = paste(rep(c("A",2:10,"J","Q","K"),4),c("H","D","S","C")) sample(cards,5) # a pair of jacks, no replacement ## roll 2 die. Even fancier dice = as.vector(outer(1:6,1:6,paste)) sample(dice,5,replace=TRUE) # replace when rolling dice # The last two illustrate things that can be done with a little typing and a lot of thinking using the fun commands # paste for pasting together strings, rep for repeating things and outer for generating all possible products. #### A bootstrap sample #### # Bootstrapping is a method of sampling from a data set to make statistical inference. The intuitive idea is that by # sampling, one can get an idea of the variability in the data. The process involves repeatedly selecting samples and # then forming a statistic. Here is a simple illustration on obtaining a sample. # The built in data set faithful has a variable \eruptions" that measures the time between eruptions at Old # Faithful. It has an unusual distribution. A bootstrap sample is just a sample with replacement from the given values. # It can be found as follows data(faithful) # part of R's base names(faithful) # find the names for faithful eruptions = faithful[['eruptions']] # or attach and detach faithful sample(eruptions,10,replace=TRUE) hist(eruptions,breaks=25) # the dataset ## the bootstrap sample hist(sample(eruptions,100,replace=TRUE),breaks=25) # Notice that the bootstrap sample has a similar histogram, but it is different: #### d, p and q functions #### # The d functions were used to plot the theoretical densities above. As with the \r" functions, you need to specify # the parameters, but di erently, you need to specify the x values (not the number of random numbers n). # The p and q functions are for the cumulative distribution functions and the quantiles. As mentioned, the distri- # bution of a random number is specified by the probability that the number is between a and b for arbitrary a and b, # P(a < X  b). In fact, the value F(x) = P(X  b) is enough. # The p functions answer what is the probability that a random variable is less than x. Such as for a standard # normal, what is the probability it is less than :7? pnorm(.7) # standard normal pnorm(.7,1,1) # normal mean 1, std 1 # Notationally, these answer P(Z  :7) where Z is a standard normal or normal(1,1). To answer P(Z > :7) is also # easy. You can do the work by noting this is 1 - P(Z < 7) or let R do the work, by specifying lower.tail=F as in: pnorm(.7,lower.tail=F) # The q function are inverse to this. They ask, what value corresponds to a given probability. This the quantile # or point in the data that splits it accordingly. For example, what value of z has .75 of the area to the right for a # standard normal? (This is Q3) qnorm(.75) # Notationally, this is finding z which solves 0:75 = P(Z <= z). #### Standardizing, scale and z scores #### # To standardize a random variable you subtract the mean and then divide by the standard deviation. # To do so requires knowledge of the mean and standard deviation. # You can also standardize a sample. There is a convenient function scale that will do this for you. This will # make your sample have mean 0 and standard deviation 1. This is useful for comparing random variables which live # on di erent scales. # Normal random variables are often standardized as the distribution of the standardized normal variable is again # normal with mean 0 and variance 1. (The \standard" normal.) The z-score of a normal number is the value of it # after standardizing. # If we have normal data with mean 100 and standard deviation 16 then the following will find the z-scores x = rnorm(5,100,16) x z = (x-100)/16 z # The z-score is used to look up the probability of being to the right of the value of x for the given random variable. # This way only one table of normal numbers is needed. With R, this is not necessary. We can use the pnorm function # directly pnorm(z) pnorm(x,100,16) # enter in parameters