How To Use R Statistics

The R programming language is a powerful and free statistical software tool that analysts use frequently.

The R programming language is open source software: the R community develops and maintains it, and users can download it for free.

Being open source provides many advantages, including the following:

New statistical methods are quickly available because the R community is vast and active.
The source code for each function is freely available and everybody can review it.
Using the R programming language is free! That's a significant advantage over relatively expensive statistical tools, such as SAS, Stata, and SPSS.

In this article, I give you a brief introduction to the strengths of the R programming language by applying basic statistical concepts to a real dataset using R functions.

If you want to follow the examples, you can copy and paste the code shown in this article into R or RStudio. All code is 100% reproducible.

Let's dive into it!

Example Data for the R Programming Language

In the first section of this article, we'll load the iris dataset into R. Ronald Fisher, biologist and statistician, introduced the iris flower dataset in 1936. It contains flower measurements.

The iris dataset ships with R, so no separate download is needed. Load it into R by executing the following code:

data(iris)     # Load iris flower data as example data set

Next, we can inspect the structure of the iris flower data using the head function. By default, the head function returns only the first six rows of a data set:

head(iris)     # Printing first 6 rows of iris data set

This table displays the first six rows of our example data and indicates that our data contain five variables: "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", and "Species".

The first four variables contain numeric values, and the fifth variable groups our data into different flower species.
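If you want to check the variable types directly, a minimal sketch could look like the following. It uses the base R functions str, dim, and levels; the object names dims and species_levels are just placeholders chosen for this illustration.

```r
str(iris)                                  # Compact overview: type of each column
dims <- dim(iris)                          # Number of rows and columns
species_levels <- levels(iris$Species)     # Names of the flower species
dims                                       # Print dimensions
species_levels                             # Print species names
```

The str output should list four numeric columns and one factor column, which matches the description above.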

In the following sections, we'll analyze these data, so keep on reading!

Using R Functions to Calculate Basic Descriptive Statistics for a Dataset

The following syntax illustrates how to calculate a set of descriptive statistics for all variables in a dataset.

For this task, we can use the summary function as shown below:

summary(iris)     # Return summary statistics

This table contains the minimum, 1st quartile, median, mean, 3rd quartile, and maximum for the numeric columns in our data, and the count of each category for the non-numeric columns.

This information gives us an overview of the data distributions for our variables. However, we can analyze our data in much more detail!
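As a small sketch of what "more detail" can mean for a single variable, the following code applies summary, sd, and quantile to the Sepal.Length column. The object names (sepal_summary, sepal_sd, sepal_quantiles) and the chosen percentiles are assumptions made for this example.

```r
sepal_summary <- summary(iris$Sepal.Length)     # Summary statistics of one column
sepal_sd <- sd(iris$Sepal.Length)               # Standard deviation (not part of summary)
sepal_quantiles <- quantile(iris$Sepal.Length,
                            probs = c(0.1, 0.9))     # 10th and 90th percentiles
sepal_summary                                   # Print summary statistics
sepal_sd                                        # Print standard deviation
sepal_quantiles                                 # Print percentiles
```

The probs argument accepts any values between 0 and 1, so you can request whichever percentiles you need.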

Related posts: Measures of Central Tendency and Interpreting Percentiles and Quartiles

Calculating Descriptive Statistics by Group using R

As you have seen in the previous sections, our dataset groups the observations by three flower species: setosa, versicolor, and virginica. Therefore, it might be interesting to compare the descriptive statistics of the different flower species.

The following R code uses the aggregate and mean functions to calculate the mean by group (i.e., flower species) for the variable Sepal.Length:

aggregate(Sepal.Length ~ Species, iris, mean)     # Return mean by group

The table indicates that the average sepal length of the flower species setosa is the shortest and the average sepal length of the species virginica is the longest.

Note that we have calculated only the mean by group for the first variable. However, we can replace the variable name to calculate the mean by group for other variables. Additionally, we can calculate other summary statistics instead of the mean.
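Both variations can be sketched as follows. The first line swaps in another variable and another statistic; the second uses the dot notation of the formula interface to cover all numeric variables at once. The object name means_by_species is made up for this illustration.

```r
aggregate(Petal.Length ~ Species, iris, median)             # Median of another variable by group
means_by_species <- aggregate(. ~ Species, iris, mean)      # Mean of all numeric variables by group
means_by_species                                            # Print group means
```

Any function that returns a summary of a numeric vector, such as sd, min, or max, can take the place of mean or median here.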

To determine whether these mean differences are statistically significant, you need to perform a one-way ANOVA.
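A minimal sketch of such a test, assuming the base R aov function is a suitable choice for your analysis, might look like this. The object names anova_fit, anova_table, and p_value are placeholders for this example.

```r
anova_fit <- aov(Sepal.Length ~ Species, data = iris)     # Fit one-way ANOVA
anova_table <- summary(anova_fit)                         # ANOVA table with F-test
anova_table                                               # Print ANOVA table
p_value <- anova_table[[1]][["Pr(>F)"]][1]                # p-value for the Species effect
p_value                                                   # Print p-value
```

A small p-value suggests that at least one species mean differs from the others. It does not tell you which ones, so a post hoc comparison would be the next step.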

Creating a Correlation Matrix with R Programming

Understanding the relationships between variables provides additional useful information. To gain this information, we can create a correlation matrix (i.e., a table showing the correlation coefficients between multiple variables at the same time) by applying the cor function to the numeric variables of our data:

cor(iris[ , 1:4])     # Return correlation matrix

In the correlation matrix, you can see, for instance, that the correlation between Petal.Width and Petal.Length is very high, but the correlation between Sepal.Width and Sepal.Length is relatively low.

Related post: Understanding Correlation Coefficients
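To make the matrix easier to read and to pull out single coefficients, one option is sketched below. The object name cor_matrix is an assumption for this example; round and name-based indexing are standard base R.

```r
cor_matrix <- cor(iris[ , 1:4])                  # Store correlation matrix
round(cor_matrix, 2)                             # Round to two digits for readability
cor_matrix["Petal.Length", "Petal.Width"]        # Extract a single coefficient by name
cor_matrix["Sepal.Length", "Sepal.Width"]        # Extract another coefficient
```

Indexing by row and column name avoids mistakes that can happen when counting positions in a larger matrix.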

Using the R Programming Language to Estimate a Linear Regression Model

The R programming language also provides functions to estimate statistical models. One of the most commonly used model types is linear regression. Using the lm and summary functions in R, we can estimate and evaluate these models.

The following R syntax uses the variable Sepal.Length as the dependent variable and the remaining variables in the dataset as independent variables:

summary(lm(Sepal.Length ~ ., iris))     # Results of linear regression

Besides many other metrics, the output displays regression coefficients, standard errors, t-values, and p-values. As the stars on the right side of the output indicate, all independent variables significantly impact the dependent variable.
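If you prefer to work with these numbers rather than read them from the printed output, a sketch like the following stores the model and extracts the coefficient table. The object names fit, coef_table, and r_squared are chosen for this illustration only.

```r
fit <- lm(Sepal.Length ~ ., data = iris)       # Estimate linear regression model
coef_table <- coef(summary(fit))               # Matrix of estimates, SEs, t-values, p-values
coef_table                                     # Print coefficient table
r_squared <- summary(fit)$r.squared            # Share of variance explained by the model
r_squared                                      # Print R-squared
```

Note that the factor Species enters the model as two dummy variables, so the table has six rows including the intercept.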

Related post: Interpreting Regression Coefficients and their P-values

Generating Random Numbers with R Programming

So far, we have used R to analyze the iris flower dataset. However, the R programming language also provides powerful functions to generate random data.

Whenever random processes are involved, it is useful to set a random seed. A random seed is a number that initializes a pseudorandom number generator and allows other analysts to reproduce our "random" output. We can set a random seed in R using the set.seed function:

set.seed(101101)     # Set a random seed
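To see what the seed does, here is a small sketch: setting the same seed before each draw should give identical numbers. The object names first_draw and second_draw are made up for this illustration, and the last line resets the seed so that the examples in this article start from the same state.

```r
set.seed(101101)                         # Set a random seed
first_draw <- rnorm(5)                   # Draw five random numbers
set.seed(101101)                         # Set the same seed again
second_draw <- rnorm(5)                  # Draw five random numbers again
identical(first_draw, second_draw)       # Check whether both draws are the same
set.seed(101101)                         # Reset the seed before continuing
```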

Next, we can draw random numbers from a probability distribution. The rnorm function draws numbers from a normal distribution (by default with mean 0 and standard deviation 1):

x_small <- rnorm(20)     # Generate small sample

The previous R code has generated 20 random numbers following a normal distribution. We can visualize our randomly generated data in a histogram using the hist function:

hist(x_small)     # Draw histogram of small sample

The previous figure visualizes our random data in a histogram. As you can see, our data does not look normally distributed yet.

Related posts: Using Histograms to Understand Your Data and Assessing Normality: Histograms vs. Q-Q Plots

The reason for this is that we have drawn only 20 random numbers from the normal distribution and, due to the law of large numbers, we need to draw a larger sample to approximate a normal distribution.

We can do this by simply increasing the number within the rnorm function. The following R code draws 10000 random numbers from the normal distribution:

x_large <- rnorm(10000)     # Generate large sample

Let's draw these data in a histogram:

hist(x_large, breaks = 100)     # Draw histogram of large sample

After executing the code, R creates the histogram below.

As you can see, our data almost perfectly follows the normal distribution.
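Besides the visual check, we can compare the sample statistics with the parameters of the standard normal distribution (mean 0, standard deviation 1). The following is a rough sketch that regenerates both samples so it runs on its own; the object names stats_small and stats_large are placeholders.

```r
set.seed(101101)                                          # Set a random seed
x_small <- rnorm(20)                                      # Generate small sample
x_large <- rnorm(10000)                                   # Generate large sample
stats_small <- c(mean = mean(x_small), sd = sd(x_small))  # Statistics of small sample
stats_large <- c(mean = mean(x_large), sd = sd(x_large))  # Statistics of large sample
stats_small                                               # Print statistics of small sample
stats_large                                               # Print statistics of large sample
```

The statistics of the large sample should be close to 0 and 1, while those of the small sample will typically deviate more.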

Summary & About the Author

In this tutorial, you have learned how to calculate basic statistics using the R programming language. In case you want to learn more about topics like this, you may check out my website, Statistics Globe, as well as the Statistics Globe YouTube channel.

My name is Joachim Schork. I'm a survey statistician and developer, and I provide many R programming and statistics tutorials on these platforms. Thanks a lot to Jim Frost for providing me this opportunity to introduce the R programming language on his wonderful website!