Preface

Figure 0.2: A Statypus Reading its Favorite Book in Nature

A platypus may enter a state of something called “torpor” during the colder months, which is similar to hibernation, for periods of up to six days at a time.³

0.1 Philosophy

This book has been written for an introductory course in computer-aided statistics that does not rely on any calculus. All code has been written so as to be executable in a “fresh” R environment by relying on functions and techniques that do not require the installation of packages. The few exceptions are the original functions introduced in Chapters 5 and 10 as well as a few others in the later chapters. However, care has been given to make the code as readable as possible so that the enthusiastic reader can follow as much as is possible. Data manipulation is done with “base” R techniques which may appear cumbersome to advanced users who are familiar with packages such as dplyr or ggplot2.

Here at Statypus, we strive to offer a supportive learning environment which is anchored on the following principles:

Students should be critical consumers of statistics. This means that they are able to discuss possible issues with statistical experiments/studies and question the validity of data.
Students should be thoughtful producers of statistics. This means that they are able to make calculations and produce inferences that they can explain clearly to others.
Instruction should focus on the interpretation of statistical calculations rather than testing a student’s ability to memorize equations and formulas.
We learn by making mistakes. I have taught children that mistakes are an integral part of the learning process and not something to be seen as detrimental. Students should Dare to be Wrong and interact in a way that they do not feel encumbered by the fear of making mistakes. Too many students sit quietly unable to bring themselves to ask the many questions they do have. My classroom and this book are meant to be safe places where all ideas are entertained and where no one is ever ridiculed for making a mistake.
Computers should be used as an integral tool of instruction and concepts should be taught through the lens of using a computer.
The use of AI tools such as GitHub Copilot should be encouraged, but only after students have built a solid foundation in basic programming logic and statistical computation through hands-on practice.

0.2 AI Use in Elementary Undergraduate Statistics

Figure 0.3: A Statypus Ignoring the Elephant in the Room

The landscape of statistics is changing rapidly. Tools like GitHub Copilot can now generate code in seconds, leading many to ask: “If the AI can write the code, why do I need to learn it?” In this course, we view AI as a high-powered “autopilot” system. However, just as a pilot must log hundreds of manual flight hours before being trusted with an automated cockpit, you must understand the language of R to safely “fly” a tool like Copilot. Using AI without a foundation in Base R is like using a calculator before you understand addition—you might get an answer, but you won’t know if it’s the right answer.

Learning to code manually throughout this book is your “flight school.” This hands-on path allows you to:

Become a Critical Commander: AI often suggests overly complex solutions or uses external packages when a simple Base R function is more efficient. By learning the core logic, you gain the authority to tell the AI: “No, use a simpler base function instead.”
Debug with Confidence: AI-generated code will eventually fail. When it does, the AI often struggles to explain why. Students who understand syntax can peer “under the hood” to pinpoint and fix problems that would leave a prompt-only user stranded.
See Past the “Black Box”: We want to avoid a “black box” mentality where data goes in and magic comes out. Writing your own code for a confidence interval or a linear regression ensures you understand the statistical mechanics behind the result.
The Calculator Analogy: Think of how we learn arithmetic. We practice long division by hand not because we hate calculators, but because the process reveals the underlying patterns of mathematics. Relying on a calculator too early can stunt your ability to see those patterns in more advanced courses like Algebra or Calculus.
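As a small illustration of the "See Past the Black Box" point, the sketch below builds a 95% confidence interval for a mean by hand and then compares it with the interval reported by t.test. The choice of the built-in mtcars data and the variable names are assumptions made only for this illustration.

```r
# Hand-computed 95% confidence interval for a mean, using base R only.
x <- mtcars$mpg                          # built-in data; any numeric vector works

n     <- length( x )
xbar  <- mean( x )
se    <- sd( x ) / sqrt( n )             # standard error of the mean
tstar <- qt( 0.975, df = n - 1 )         # critical value for 95% confidence

manual_ci <- c( xbar - tstar * se, xbar + tstar * se )

# The built-in function should agree with the hand calculation.
builtin_ci <- as.numeric( t.test( x, conf.level = 0.95 )$conf.int )

print( manual_ci )
print( builtin_ci )
print( all.equal( manual_ci, builtin_ci ) )
```

Seeing the two intervals agree is a quick way to confirm that each piece of the calculation (center, standard error, critical value) is understood.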

The “Verify then Trust” Workflow: As you incorporate GitHub Copilot into your RStudio environment, try this three-step approach:

Draft First: Try to write the code yourself using the functions covered in the book.
Use Copilot for Speed: Once you understand the syntax, use Copilot to help with repetitive tasks or to suggest the next line in a sequence.
Critique the Output: Always look at what the AI suggests. If it uses a function you haven’t learned yet, ask yourself (or even the AI) if there is a Base R way to do it instead.
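To make the "Critique the Output" step concrete, suppose an assistant proposed a package-based pipeline for a grouped summary. The sketch below shows one possible base R way to get the same kind of result. The task (mean miles per gallon by cylinder count in the built-in mtcars data) is a hypothetical chosen for illustration.

```r
# Mean miles per gallon for each cylinder count, using only base R.
# This is one possible base R equivalent of a package-based grouped
# summary, not the only one.
mpg_by_cyl <- tapply( mtcars$mpg, mtcars$cyl, mean )
print( mpg_by_cyl )

# aggregate() gives the same numbers, arranged as a data frame.
mpg_table <- aggregate( mpg ~ cyl, data = mtcars, FUN = mean )
print( mpg_table )
```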

AI offers incredible and ever-improving sets of tools, but it is most beneficial when used by critical consumers. Our goal is to ensure that by the end of this semester, you aren’t just a user of AI-generated code, but a thoughtful producer and critical consumer of statistical analysis.

0.3 Acknowledgements

The process of writing a book is a very humbling experience, especially when it is someone’s first book. My own personal experience was a winding walk through periods of absolute confidence and determination as well as darker periods of self-doubt and lack of belief in my ability to accomplish this goal. There is absolutely no way that this book would exist without the help of so many people.

I am forever indebted to Darrin Speegle and Bryan Clair. Their book Probability, Statistics, and Data was a very early inspiration to develop some of my own materials which eventually turned into this book. The first document that eventually led to this book was a modified version of Section 2.2 of Darrin and Bryan’s book. In addition, an early version of Chapter 1 of Statypus is a modified version of their Chapter 1. I have learned so many techniques and been pulled away from so many bad habits by learning from both of them. From simply allowing me to see how they coded certain parts of their book to countless discussions about how certain topics should be taught, this book would not exist without them. Darrin is definitely responsible for pushing me to rethink introductory statistics within a mindset of using the technology correctly. I am sure Darrin sometimes dreaded seeing me walk towards his office door with another question that was painfully simple for him, but I could not have done this without his help. Bryan was the person who literally told me to turn my early PDF documents into a bookdown. Without that nudge, this project would never have gotten to this point. He has also been there for numerous impromptu conversations about statistical issues and concepts as well as many technical explanations to allow me to create this book. I would have no idea what a “cascading style sheet” was without him, but Bryan offered an example of his own and helped me understand how to use it.

I would also like to thank Anneke Bart. She was the chair of the Mathematics and Statistics Department during the time this book was written and has always been a true friend. She offered countless hours within her busy schedule to allow me to discuss this project and used some of my early PDF documents in her own course.

While I am absolutely appreciative of the efforts and time from each of these three amazing professors, I wish to thank them more for offering me respect and treating me as a friend and colleague regardless of how lost in the weeds I would often become.

I am also extremely thankful to my family. My focus was definitely pulled away from them for the past year to a degree which I have often regretted. Even when I was able to put the laptop down and be a husband or father, I know I did not always offer them the focus that they rightfully deserve. My children each make an appearance in this book and I hope that they can accept that as my apologies for not always actively listening to them like I should have.

My wife, Alie, has been the absolute bedrock from which I have drawn strength throughout the process of writing this book. Alie is one of the kindest individuals I have ever met and absolutely brilliant (other than in her choice of spouse). There has never been a single moment where she has not shown absolute belief in me and I can say unequivocally that I could not have done this without her. Alie, I thank you for putting up with the subpar husband I have been as I have worked on this book. I don’t think any man is deserving of you, but I am so grateful to be the person you have chosen to spend your life with.

0.4 Parts of this Book

As you navigate through the numerous webpages that make up Statypus, you will encounter many different colored boxes which set aside a portion of the screen for a specific task. Knowing the purpose of these different boxes can assist you in understanding what is being talked about and expedite your ability to find something you are looking for within Statypus. The following gives examples of the different types of boxes you will encounter here.

0.4.1 Alert

As mentioned earlier, this book accepts mistakes as an important and unavoidable part of the learning process. That being said, the purpose of making mistakes is to be able to avoid making similar mistakes in the future, and where possible, a good teacher tries to offer a cautionary tale about common mistakes students make. Red Alert boxes, like the one shown below from Section 9.3.2, do just this. They offer a cautionary warning about easy mistakes that can be made while covering certain topics.

It is important to not confuse the different uses of \(p\) here. We have a population proportion, which we denoted \(p\), and now a measure of how strong the evidence from a sample is, which we call the \(p\)-value. We also use \(P\) for probability calculations. We will always write out \(p\)-value to avoid confusion as much as possible.

0.4.2 Big Idea

There are a lot of students who can memorize nearly any equation given to them and many can use them effortlessly and without error. However, a lot of students struggle to answer questions of the form, “What does that mean?”, when asked to put things into context. A good instructor should be the person who can facilitate a student’s deeper understanding of what something means and not simply recite notes from a piece of paper which a student then transcribes to their own paper (or iPad?) which they then compare to their textbook and find very little difference. This book attempts to do just that with the green Big Idea boxes like the one below from Section 9.3.2. Big Idea boxes try to offer a heuristic view of a complex concept.

Loosely, we can take the \(p\)-value to represent how much you can still believe the assumption made in \(H_0\) after examining the evidence against it. If we believe in the statement(s) made in \(H_0\) prior to running the test, then we can view the \(p\)-value as how much belief we still have in \(H_0\) after examining the evidence provided by our sample. We will soon discuss how little belief is acceptable before we are forced to reject \(H_0\). However, it is important to also remember that this is just a loose way to make sense of it and that the technical meaning is given in Definition 9.11.

0.4.3 Code Template

Coding is hard and computers don’t care about what you “meant.” They only care what you explicitly tell them to do. For example, the code view( mtcars ) will cause an error in R because it should be View( mtcars ), with an upper case V. Human beings (even lowly textbook authors) are not perfect and typos are just a matter of “when” and not “if.” To minimize the number of simple typographical errors students encounter, it is helpful if they can “borrow” code that they know will work and be able to adapt it to their own needs rather than asking them to write new code on their own. Green Code Template boxes such as the one below from Section 3.3.3 do just this. The user should be able to automatically copy the contents of the lighter colored boxes by placing their cursor over the upper right hand portion of the box. This allows students to easily migrate code from Statypus directly into their own work with minimal concerns about typing errors.

To make a stem and leaf plot of a vector x, you use the code:

#Will only run if x is defined
stem( x )

or if we have a column of a dataframe, we would use:

#Will only run if df is appropriately defined
stem( df$Col )
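As a hedged illustration of how a template like this might be adapted, the sketch below fills in both versions with the built-in mtcars data. The choice of mtcars and of its mpg column is an assumption made only so that the code runs without any downloads.

```r
# Stem-and-leaf plot of a plain vector (mtcars ships with R)
x <- mtcars$mpg
stem( x )

# The same plot, taken straight from a column of a data frame
df <- mtcars
stem( df$mpg )
```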

0.4.4 Data Download

Statistics can be simply thought of as the science of working with data to understand our world. For most people, there is no need to discuss the concepts in a statistics course unless they relate to some sort of data. Getting that data in a way that is easily used can sometimes be tricky, and with a myriad of formats out there, modern computing has made this even trickier in some ways. However, we try to minimize this with purple Data Download boxes such as the one below from Example 4.1. These offer code that you can copy and paste which will automatically download the data from the Statypus servers and move it into your RStudio environment.

Use the following code to download BabyData1.

BabyData1 <- read.csv( "https://statypus.org/files/BabyData1.csv" )

0.4.5 Example

If mistakes are an important part of learning, then examples are even more important. Blue Example boxes like the one below, Example 4.4, offer us a way to keep track of when we leave abstract concepts and begin to work on a specific application.

Example 0.1 If we wanted to find the range of birth masses of babies in our sample, we can do this with the following code.

range( BabyData1$weight )

## [1]  907 4825

This shows us that the smallest baby in our sample had a mass of 907 grams while the largest had a mass of 4825 grams.

The range length is thus \(4825 - 907 = 3918\) grams.
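The subtraction above can also be left to R. The sketch below uses a short stand-in vector so that it runs without downloading anything. Only the values 907 and 4825 come from the example, and the remaining masses are hypothetical.

```r
# Hypothetical masses in grams; only 907 and 4825 come from the example above.
masses <- c( 907, 2950, 3310, 3650, 4825 )

r <- range( masses )      # smallest and largest values
r
diff( r )                 # range length: largest minus smallest
```

With the real data, the same idea would read diff( range( BabyData1$weight ) ).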

0.4.6 Let’s Explore

Most good mathematicians can “see” math happen in their heads. For example, envision two vertical poles situated a certain distance apart. Further imagine that a wire is connected from the top of each pole to the base of the other. The two wires would obviously cross and a simple (at least simple to ask) question would be: “How high is the intersection of the two wires in terms of the distances and lengths of the poles and wires?” If you had a Ph.D. in geometry, you may be able to see this entire image in your head, but most people would need to make a sketch of the figure to understand what is going on. However, this problem requires us to consider the figure where we do not know any of the distances or lengths. Light blue Let’s Explore boxes attempt to offer just such a tool. The exploration below gives an interactive visualization of exactly the problem we just set up here. The answer, however, is left for you to figure out!⁴

0.4.7 New Function

Using computer software such as R requires us to use functions that are built into its system. Pink New Function boxes like the one below from Section 5.1 give a place to begin your understanding of how these software functions work. They are meant to offer the important information about the function and how to use it before we begin to actually enter values or data into them.

The syntax of plot is

plot( x, y, type )

where the arguments are:

type: Sets what type of plot should be drawn. See ?plot for a full list of options

The first vector entered, x, is graphed on the horizontal axis and the second vector entered, y, is graphed on the vertical axis.
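A minimal sketch of this syntax in use is given below, assuming the built-in mtcars data. The choice of the wt and mpg columns is only for illustration.

```r
# Scatterplot: weight on the horizontal axis, miles per gallon on the vertical
plot( mtcars$wt, mtcars$mpg, type = "p" )

# The same call with type = "l" joins the points with line segments,
# in the order the values appear in the vectors
plot( mtcars$wt, mtcars$mpg, type = "l" )
```

Comparing the two pictures shows why type = "p" is usually the sensible choice for unordered data such as these.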

0.4.8 New Functions

Each chapter begins with a list of the new functions it will introduce with a pink New Functions box like the one found at the beginning of Chapter 3. This offers students (and instructors) a quick place to reference where different functions were introduced. Functions in these boxes should be in the order that they appear within the text.

We will see the following functions in Chapter 3.

table(): Uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
proportions(): Returns conditional proportions, that is, the entries of x divided by the appropriate sum(s).
barplot(): Creates a bar plot with vertical or horizontal bars.
hist(): Computes a histogram of the given data values.
stem(): Produces a stem-and-leaf plot of the values in x.
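For a quick taste of the functions in a list like this, the sketch below calls each of them once on the built-in mtcars data. The choice of columns is an assumption for illustration, and proportions() assumes a reasonably recent version of R (4.0.1 or later).

```r
# Counts of cars at each combination of gears and cylinders
counts <- table( mtcars$gear, mtcars$cyl )
counts

# Proportions of the overall total; margin = 1 would give row proportions instead
proportions( counts )

barplot( table( mtcars$gear ) )   # bar plot of the gear counts
hist( mtcars$mpg )                # histogram of miles per gallon
stem( mtcars$mpg )                # stem-and-leaf plot of miles per gallon
```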

0.4.9 Now It’s Your Turn!

If you read an entire book on the theory of how to properly shoot a basketball, would that improve your ability to make a free throw? The answer is probably not unless you actually took time to also practice the concepts you are learning. Now there are exercises at the end of every chapter (everyone loves homework), but the yellow Now It’s Your Turn! boxes, like the one below from Section 3.2.2, offer a low stakes way for students to check if they are grasping the material as they go.

Make a comparative bar chart of the number of gears a car had based on the shape of the engine. The two variables are gear which gives the number of forward gears a car had and vs which tells whether an engine was V-shaped (value of 0) or straight (value of 1). Try changing the order of the variables and playing with the beside argument. Can you see a relationship between the variables?

0.4.10 Platypus Oddity

The platypus is weird… there’s no way to get around that. However, so are most mathematicians. We celebrate the uniqueness of the platypus with a gray Platypus Oddity box at the beginning of each chapter immediately after sharing an image of a Statypus (a statistics loving platypus) entirely to bring humor and happiness to the reader! The following fact does not appear in any chapter, but is tucked away here for the most invested reader.

Platypuses are five times as sexy as humans. Well, at least “chromosomally.” A platypus has ten sex chromosomes while a human has only two.

Male platypuses have the pattern

\[\text{XYXYXYXYXY}\]

while females have the pattern

\[\text{XXXXXXXXXX}.\]

0.4.11 Definition

It is “turtles all the way down” as the saying goes. To make any headway in mathematics or statistics, we must begin with defining certain things and the green Definition boxes, like the one below for Definition 4.1, do just that. While not as fun as a fun fact about an egg-laying mammal, definitions cannot be left out. Some definitions here may not match ones you may have learned in the past and that is fine. It is up to the author of each book to define what terms mean within the pages (webpages, I guess) of their book.

Definition 0.1 Given a vector \({\bf x} = ( x_1, x_2, \ldots, x_n )\) having \(n\) values, we can define the arithmetic mean or simply mean of \({\bf x}\), which we denote as \(\bar{x}\) if \({\bf x}\) is a sample or \(\mu\) if \({\bf x}\) is the entire population, as follows. \[\bar{x} \text{ or }\mu= \frac{1}{n} \sum_{i = 1}^n x_i = \frac{1}{n} \left( x_1 + x_2 + \cdots + x_n \right) = \frac{ x_1 + x_2 + \cdots + x_n }{n}.\]
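The formula in this definition can be checked directly in R. The sketch below uses a short vector of hypothetical values and compares the hand calculation with the built-in mean function.

```r
x <- c( 4, 8, 15, 16, 23, 42 )   # hypothetical values

n <- length( x )
by_formula <- sum( x ) / n       # (x_1 + x_2 + ... + x_n) / n
by_formula
mean( x )                        # built-in function, same value
```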

0.4.12 Remark

Sometimes a point needs to be made and stand out, but it’s not a potential mistake nor does it fit into any of the other categories of boxes we have here at Statypus. Orange Remark boxes like the one below from Section 9.3.2 fill this gap. They will contain a wide array of ideas and concepts that students should pay attention to.

Definition 9.11 is an interpretation of the “informal” definition of a \(p\)-value as given by the American Statistical Association. Unfortunately, a rigorous definition is not easily given, nor is its interpretation fully agreed upon.

0.4.13 Theorem

“Mathematicians turn coffee into theorems” is an old adage and isn’t necessarily untrue, although the author of this book prefers tea! The theorems of this book will appear in bright green Theorem boxes like the one below for Theorem 5.1. If it’s a theorem, it’s probably important.

Theorem 0.1 The correlation, \(r\), between two vectors, \({\bf x}\) and \({\bf y}\), satisfies the following:

\(-1 \leq r \leq 1\)
\(|r| =1\) means that the ordered pairs \(\big\{ (x_i, y_i)\big\}\) are collinear.
Correlation is a symmetric operation. That is, the correlation of \({\bf x}\) and \({\bf y}\) is the same as the correlation of \({\bf y}\) and \({\bf x}\), i.e. the order of the vectors does not matter.
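The properties in this theorem can be explored numerically with the base R function cor. The sketch below is only an illustration on the built-in mtcars data, not a proof. The vector y2 is a hypothetical exact linear function of x, built to show that collinear points give \(|r| = 1\).

```r
x <- mtcars$wt
y <- mtcars$mpg

r <- cor( x, y )
r                                  # lies between -1 and 1
all.equal( r, cor( y, x ) )        # symmetry: the order does not matter

# Collinear points: y2 is an exact linear function of x
y2 <- 3 - 2 * x
cor( x, y2 )                       # -1, up to rounding
```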

0.5 A Note From the Author

First off, most people will interact with this as a website, but for simplicity’s sake, we may refer to this collection of webpages as a “book.” We may refer to it as “this book,” “Statypus,” or something similar, but any reference to something outside of this document will be given as explicitly as is possible.

This book is the culmination of a project to overhaul the STAT 1300 course at Saint Louis University (SLU). In 2011, the math department elected to require the course be taught using the statistical software known as R. Prior to this, they had used SPSS going back to Version 1 in the 1970s. Members of the math department, in consultation with other departments at the university, decided to make the shift as R began to attract more use among researchers as well as academics. The open source nature of R (and RStudio) removes any economic barrier that commercial software may have and this was also seen as a huge benefit for all parties.

However, after a few years the course was being taught nearly entirely by adjuncts and a question as to the quality and consistency of the course was raised, especially with how R was integrated into the course. After an internal review, it was decided that an overhaul of the course was needed and materials to allow a high quality and consistent level of instruction using R needed to be developed.

Initially, the hope was to find a traditional commercial product to accomplish this goal. There was the expectation that some sort of scaffolded materials to incorporate R may be needed with a traditional textbook, but there was no plan to “start from scratch.” Dozens of introductory statistics books were reviewed from every publisher who could be thought of. However, there was no introductory (non-Calculus based) statistics book that really aligned itself to using R.

There are many outstanding commercial introductory statistics books out there, but most commercial products are designed to reach as many people or universities as they can. A lot of books will walk through the theory and calculations in a software-agnostic way and then have examples of how to handle those concepts within different types of technology. After trying different ideas and methods to find the right angle from which to structure this course, I began to realize that I needed to provide my students with written instruction about how to use R.

Using R had become absolutely interwoven with how I was teaching the course, and I was spending countless classroom hours showing students code in R. I had begun to share the R scripts I created during lectures with students and this worked fairly well, but something more seemed necessary. In the Spring semester of 2024, I adapted a portion of the book Probability, Statistics, and Data by Darrin Speegle and Bryan Clair into a document I called “SimulationProject.” With this PDF, I provided my students with examples of R code which they could copy and paste and use to work problems on their own. Seeing R code in a “published” format seemed to be revolutionary to my students.

That led me to the conclusion that I should be providing professional-looking examples of R code to my students if I was going to expect them to be able to use it at the appropriate level. That appropriate level may be different for different people, but my vision is that students in this course should be expected to use and slightly modify code they are given. Students are not expected to create novel code of their own, so code templates and worked examples were necessary for them to know how to really use R. The next document to develop was called “SRSwSample.” A lot of introductory books discuss the antiquated practice of using random digit tables to find a Simple Random Sample. However, with the course being taught through R, it seemed silly to not leverage the software to do the sampling for us. This was the first document that I created completely from scratch, and students liked the ability to review a finished PDF rather than having to simply rely on their ability to take notes on examples of how I used R functions in class.
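As a rough sketch of what letting the software do the sampling can look like, the base R function sample can draw a Simple Random Sample of labels in one line. The population size, sample size, and seed below are hypothetical and are not taken from the “SRSwSample” document.

```r
set.seed( 1300 )                   # arbitrary seed so the draw can be repeated

N <- 500                           # hypothetical population size
n <- 10                            # hypothetical sample size

chosen <- sample( 1:N, size = n )  # sampling without replacement is the default
chosen
```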

From here, it was “Game On” and in the Fall semester of 2024 I began to create stand alone documents to supplement how to use R to do the concepts in each chapter of the textbook we were using. In addition to seeing me work problems in class, students had typed out examples of the use of R for the concepts we were discussing in class. Approximately 10 documents were developed over the course of that semester and the total page count was at over 200 pages. What started as a simple “add-on” was turning into a full workbook or possibly even a textbook.

With the guidance of other faculty members, I decided to combine all of these documents into a single location and it was decided that the easiest way to disseminate that collection to our students was via a website. The goal was to create a workbook to exist in tandem with an existing commercial textbook that would allow the commercial textbook to handle the heavy lifting of setting up all of the content while allowing my website to offer students a place to learn how to integrate those concepts in R. Even during the opening week of the Spring 2025 semester I told the publisher of the commercial book we were using that I had no intentions of working without a traditional textbook for this course.

However, dancing with two partners is not easy. It became clear that it was not advantageous to work with students in one book and then pivot to another book to learn “how to do that concept” and then have to go back to the original book for the next concept. In addition, no commercial book has terminology and notation which is consistent with that found in R, so some “translation” became necessary. Translation became more and more like teaching, and most of the materials developed later in the Fall semester of 2024 were nearly a stand-alone textbook for the material they were teaching.

While laborious to write your own stand-alone textbook, it does allow you to tailor it to the exact specifications that you wish. There was no existing commercial book that fit the course that SLU envisioned, so it started to become clear what this project needed to do: write a whole new textbook with the exact vision that the course had. I want my students to interact with statistics and data in a modern way. That is, they should always expect to be able to have computational power nearby, be it a laptop or just their smartphone. The idea of doing anything statistical without technology seems almost laughable in the modern world.

This book attempts to teach statistics using the computer as a tool rather than a stumbling block. We can leverage the power of computational tools to free our shoulders of the burden of memorizing intricate formulas which often add very little to the ideas they are trying to convey. I find it much more important that a student can find a \(p\)-value using something such as t.test and interpret it correctly than to know the formulas below:

\[t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

\[df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2} {\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 -1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2-1}} \]

From here, simple black and white pages began to see the addition of color, graphics, interactive explorations, and even some platypus flavoring here and there.
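For readers curious about how t.test relates to the test statistic and degrees of freedom displayed above, the sketch below computes both by hand and compares them with the function's output. The two groups (automatic and manual cars in the built-in mtcars data) and the null hypothesis \(\mu_1 - \mu_2 = 0\) are assumptions chosen only for illustration.

```r
# Two groups from the built-in mtcars data: automatic (am = 0) and manual (am = 1)
x1 <- mtcars$mpg[ mtcars$am == 0 ]
x2 <- mtcars$mpg[ mtcars$am == 1 ]

n1 <- length( x1 ); n2 <- length( x2 )
v1 <- var( x1 ) / n1                 # s_1^2 / n_1
v2 <- var( x2 ) / n2                 # s_2^2 / n_2

# Test statistic under H_0: mu_1 - mu_2 = 0
t_manual  <- ( mean( x1 ) - mean( x2 ) ) / sqrt( v1 + v2 )

# Welch-Satterthwaite degrees of freedom
df_manual <- ( v1 + v2 )^2 / ( v1^2 / ( n1 - 1 ) + v2^2 / ( n2 - 1 ) )

# Two-sided p-value from the t distribution
p_manual  <- 2 * pt( -abs( t_manual ), df = df_manual )

result <- t.test( x1, x2 )           # the Welch test is the default (var.equal = FALSE)
c( t_manual, df_manual, p_manual )
c( result$statistic, result$parameter, result$p.value )
```

The single call to t.test replaces every intermediate line, which is the point being made here: the interpretation of the result matters more than the bookkeeping.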

For those wondering, “Why a platypus?” I will tell you a story. My daughter, Amelia, was born in April of 2023 and has been an amazing addition to the world since her birth. When she was little and would need a new diaper, Amelia would fuss unless you spoke to her. It didn’t matter what you said, she was only a few weeks old after all, but she wanted to hear you talk. One day, while changing her diaper, I didn’t know what to say, but I knew that if I talked, it would soothe her. For some reason, I asked Amelia: “Did you know that a platypus is furry and has a tail like a beaver, but that it also has a bill like a duck?” Amelia instantly calmed down and locked eyes with me. I continued to rattle off a few other platypus facts I happened to know and she stared at me transfixed. That began a minor obsession with the unique mammal from Australia and Amelia even began being called “Platypus Baby” or PB for short. In fact, to this day, Amelia responds to and refers to herself as PB.

Figure 0.4: Amelia, the Platypus Baby

I hope that this labor of love aids you in either your efforts to learn or teach statistics. Please reach out to me if you have any questions or suggestions as you venture down your own personal path of discovery.

Dr. Phil Huling

28 February, 2025

³ https://platypus.asn.au/platypus-body-temperature/
⁴ If you really must know the solution to this problem, you can look here.