Preface
Figure 0.2: A Statypus Reading its Favorite Book in Nature
A platypus may enter a state of something called “torpor” during the colder months, which is similar to hibernation, for periods of up to six days at a time.3
0.1 Philosophy
This book has been written for an introductory course in computer-aided statistics that does not rely on any calculus. All code has been written so as to be executable in a “fresh” R environment by relying on functions and techniques that do not require installing packages. The few exceptions are the original functions introduced in Chapters 5 and 10 as well as a few others in the later chapters. However, care has been taken to make the code as readable as possible so that the enthusiastic reader can follow as much as possible. Data manipulation is done with “base” R techniques, which may appear cumbersome to advanced users who are familiar with packages such as dplyr or ggplot2.
Here at Statypus, we strive to offer a supportive learning environment which is anchored on the following principles:
Students should be critical consumers of statistics. This means that they are able to discuss possible issues with statistical experiments/studies and question the validity of data.
Students should be thoughtful producers of statistics. This means that they are able to make calculations and produce inferences that they can explain clearly to others.
Instruction should focus on the interpretation of statistical calculations rather than testing a student’s ability to memorize equations and formulas.
We learn by making mistakes. I have taught children that mistakes are an integral part of the learning process and not something to be seen as detrimental. Students should Dare to be Wrong and interact in a way that they do not feel encumbered by the fear of making mistakes. Too many students sit quietly unable to bring themselves to ask the many questions they do have. My classroom and this book are meant to be safe places where all ideas are entertained and where no one is ever ridiculed for making a mistake.
Computers should be used as an integral tool of instruction and concepts should be taught through the lens of using a computer.
The use of AI tools such as GitHub Copilot should be encouraged, but only after students have built a solid foundation in basic programming logic and statistical computation through hands-on practice.
0.2 AI Use in Elementary Undergraduate Statistics
Figure 0.3: A Statypus Ignoring the Elephant in the Room
The landscape of statistics is changing rapidly. Tools like GitHub Copilot can now generate code in seconds, leading many to ask: “If the AI can write the code, why do I need to learn it?” In this course, we view AI as a high-powered “autopilot” system. However, just as a pilot must log hundreds of manual flight hours before being trusted with an automated cockpit, you must understand the language of R to safely “fly” a tool like Copilot. Using AI without a foundation in Base R is like using a calculator before you understand addition—you might get an answer, but you won’t know if it’s the right answer.
Learning to code manually throughout this book is your “flight school.” This hands-on path allows you to:
Become a Critical Commander: AI often suggests overly complex solutions or uses external packages when a simple Base R function is more efficient. By learning the core logic, you gain the authority to tell the AI: “No, use a simpler base function instead.”
Debug with Confidence: AI-generated code will eventually fail. When it does, the AI often struggles to explain why. Students who understand syntax can peer “under the hood” to pinpoint and fix problems that would leave a prompt-only user stranded.
See Past the “Black Box”: We want to avoid a “black box” mentality where data goes in and magic comes out. Writing your own code for a confidence interval or a linear regression ensures you understand the statistical mechanics behind the result.
The Calculator Analogy: Think of how we learn arithmetic. We practice long division by hand not because we hate calculators, but because the process reveals the underlying patterns of mathematics. Relying on a calculator too early can stunt your ability to see those patterns in more advanced courses like Algebra or Calculus.
The “Verify then Trust” Workflow: As you incorporate GitHub Copilot into your RStudio environment, try this three-step approach:
Draft First: Try to write the code yourself using the functions covered in the book.
Use Copilot for Speed: Once you understand the syntax, use Copilot to help with repetitive tasks or to suggest the next line in a sequence.
Critique the Output: Always look at what the AI suggests. If it uses a function you haven’t learned yet, ask yourself (or even the AI) if there is a Base R way to do it instead.
AI offers incredible and ever-improving sets of tools, but it is most beneficial when used by critical consumers. Our goal is to ensure that by the end of this semester, you aren’t just a user of AI-generated code, but a thoughtful producer and critical consumer of statistical analysis.
0.3 Acknowledgements
The process of writing a book is a very humbling experience, especially when it is someone’s first book. My own personal experience was a winding walk through periods of absolute confidence and determination as well as darker periods of self-doubt and lack of belief that I could accomplish this goal. There is absolutely no way that this book would exist without the help of so many people.
I am forever indebted to Darrin Speegle and Bryan Clair. Their book Probability, Statistics, and Data was a very early inspiration to develop some of my own materials which eventually turned into this book. The first document that eventually led to this book was a modified version of Section 2.2 of Darrin and Bryan’s book. In addition, an early version of Chapter 1 of Statypus is a modified version of their Chapter 1. I have learned so many techniques and been pulled away from so many bad habits by learning from both of them. From simply allowing me to see how they coded certain parts of their book to countless discussions about how certain topics should be taught, this book would not exist without them. Darrin is definitely responsible for pushing me to rethink introductory statistics within a mindset of using the technology correctly. I am sure Darrin sometimes dreaded seeing me walk towards his office door with another question that was painfully simple for him, but I could not have done this without his help. Bryan was the person who literally told me to turn my early PDF documents into a bookdown. Without that nudge, this project would never have gotten to this point. He has also been there for numerous impromptu conversations about statistical issues and concepts as well as many technical explanations to allow me to create this book. I would have no idea what a “cascading style sheet” was without him, but Bryan offered an example of his own and helped me understand how to use it.
I would also like to thank Anneke Bart. She was the chair of the Mathematics and Statistics Department during the time this book was written and has always been a true friend. She offered countless hours within her busy schedule to allow me to discuss this project and used some of my early PDF documents in her own course.
While I am absolutely appreciative of the efforts and time from each of these three amazing professors, I wish to thank them more for offering me respect and treating me as a friend and colleague regardless of how lost in the weeds I would often become.
I am also extremely thankful to my family. My focus was definitely pulled away from them for the past year to a degree which I have often regretted. Even when I was able to put the laptop down and be a husband or father, I know I did not always offer them the focus that they rightfully deserve. My children each make an appearance in this book and I hope that they can accept that as my apologies for not always actively listening to them like I should have.
My wife, Alie, has been the absolute bedrock from which I have drawn strength throughout the process of writing this book. Alie is one of the kindest individuals I have ever met and absolutely brilliant (other than in her choice of spouse). There has never been a single moment where she has not shown absolute belief in me and I can say unequivocally that I could not have done this without her. Alie, I thank you for putting up with the subpar husband I have been as I have worked on this book. I don’t think any man is deserving of you, but I am so grateful to be the person you have chosen to spend your life with.
0.4 Parts of this Book
As you navigate through the numerous webpages that make up Statypus, you will encounter many different colored boxes which set aside a portion of the screen for a specific task. Knowing the purpose of these different boxes can assist you in understanding what is being talked about and expedite your ability to find something you are looking for within Statypus. The following gives examples of the different types of boxes you will encounter here.
0.4.1 Alert
As mentioned earlier, this book accepts mistakes as an important and unavoidable part of the learning process. That being said, the purpose of making mistakes is to be able to avoid making similar mistakes in the future and, where possible, a good teacher tries to offer a cautionary tale about common mistakes students make. Red Alert boxes, like the one shown below from Section 9.3.2, do just this. They offer a cautionary warning to be aware of easy mistakes that can be made while covering certain topics.
It is important to not confuse the different uses of \(p\) here. We have a population proportion which we denoted \(p\) and now a measure of how strong the evidence based on a sample is, which we call the \(p\)-value. We also use \(P\) for probability calculations. We will always write out \(p\)-value to avoid confusion as much as possible.
0.4.2 Big Idea
There are a lot of students who can memorize nearly any equation given to them and many can use them effortlessly and without error. However, a lot of students struggle to answer questions of the form, “What does that mean?”, when asked to put things into context. A good instructor should be the person who can facilitate a student’s deeper understanding of what something means and not simply recite notes from a piece of paper which a student then transcribes to their own paper (or iPad?) which they then compare to their textbook and find very little difference. This book attempts to do just that with the green Big Idea boxes like the one below from Section 9.3.2. Big Idea boxes try to offer concepts in a way that is meant as a sort of heuristic view of a complex concept.
Loosely, we can take the \(p\)-value to represent how much you can still believe the assumption made in \(H_0\) after examining the evidence against it. If we believe in the statement(s) made in \(H_0\) prior to running the test, then we can view the \(p\)-value as how much belief we still have in \(H_0\) after examining the evidence provided by our sample. We will soon discuss how little belief is acceptable before we are forced to reject \(H_0\). However, it is important to also remember that this is just a loose way to make sense of it and that the technical meaning is given in Definition 9.11.
0.4.3 Code Template
Coding is hard and computers don’t care about what you “meant.” They only care what you explicitly tell them to do. For example, the code view( mtcars ) will cause an error in R because it should be View( mtcars ), with an upper case V. Human beings (even lowly textbook authors) are not perfect and typos are just a matter of “when” and not “if.” To minimize the number of simple typographical errors students encounter, it is helpful if they can “borrow” code that they know will work and be able to adapt it to their own needs rather than asking them to write new code on their own. Green Code Template boxes such as the one below from Section 3.3.3 do just this. The user should be able to automatically copy the contents of the lighter colored boxes by placing their cursor over the upper right hand portion of the box. This allows students to easily migrate code from Statypus directly into their own work with minimal concerns about typing errors.
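As a small illustration of the case sensitivity described above, the following sketch checks which spelling R recognizes and captures the error from the misspelled call instead of halting. It assumes a fresh R session with no packages loaded that define their own `view()`; the variable names `has_upper`, `has_lower`, and `msg` are made up for this illustration and are not taken from a Code Template box in the book.

```r
# R is case sensitive: View() is available in a fresh session, view() is not.
# exists() looks a function up by name without calling it.
has_upper <- exists("View", mode = "function")
has_lower <- exists("view", mode = "function")
has_upper
has_lower

# Calling the misspelled name raises an error. tryCatch() captures the
# error message as text so the rest of a script can keep running.
msg <- tryCatch(
  view(mtcars),
  error = function(e) conditionMessage(e)
)
msg
```

Wrapping the call in `tryCatch()` is only a convenience for showing the message; in everyday work you would simply correct the capitalization.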
0.4.4 Data Download
Statistics can be simply thought of as the science of working with data to understand our world. For most people, there is no need to discuss the concepts in a statistics course unless they relate to some sort of data. Getting that data in a way that is easily used can sometimes be tricky and with a myriad of formats out there, modern computing has made this even trickier in some ways. However, we try to minimize this with purple Data Download boxes such as the one below from Example 4.1. These offer code that you can copy and paste which will automatically download the data from the Statypus servers and move it into your RStudio environment.
0.4.5 Example
If mistakes are an important part of learning, then examples are even more important. Blue Example boxes like the one below, Example 4.4, offer us a way to keep track of when we leave abstract concepts and begin to work on a specific application.
Example 0.1 If we wanted to find the range of birth masses of babies in our sample, we can do this with the following code.
## [1] 907 4825
This shows us that the smallest baby in our sample had a mass of 907 grams while the largest had a mass of 4825 grams.
The range length is thus \(4825 - 907 = 3918\) grams.
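The kind of calculation in this example can be sketched with the base R function `range()`. The vector `birth_mass` below is a hypothetical stand-in invented for this sketch, not the book's data set; its values were chosen only so that the minimum and maximum match the output quoted above.

```r
# Hypothetical birth masses in grams (NOT the book's data set).
birth_mass <- c(3120, 907, 4825, 3540, 2980, 3310)

# range() returns a vector of length two: the minimum, then the maximum.
mass_range <- range(birth_mass)
mass_range

# diff() subtracts the first entry from the second, giving the range length.
range_length <- diff(mass_range)
range_length   # 4825 - 907 = 3918
```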
0.4.6 Let’s Explore
Most good mathematicians can “see” math happen in their heads. For example, envision two vertical poles situated a certain distance apart. Further imagine that a wire is connected from the top of each pole to the base of the other. The two wires would obviously cross and a simple (at least simple to ask) question would be: “How high is the intersection of the two wires in terms of the distances and lengths of the poles and wires?” If you had a Ph.D. in geometry, you may be able to see this entire image in your head, but most people would need to make a sketch of the figure to understand what is going on. However, this problem requires us to consider the figure where we do not know any of the distances or lengths. Light blue Let’s Explore boxes attempt to offer just such a tool. The exploration below gives an interactive visualization of exactly the problem we just set up here. The answer, however, is left for you to figure out!4
0.4.7 New Function
Using computer software such as R requires us to use functions that are built into its system. Pink New Function boxes like the one below from Section 5.1 give a place to begin your understanding of how these software functions work. They are meant to offer the important information about the function and how to use it before we begin to actually enter values or data into them.
0.4.8 New Functions
Each chapter begins with a list of the new functions it will introduce with a pink New Functions box like the one found at the beginning of Chapter 3. This offers students (and instructors) a quick place to reference where different functions were introduced. Functions in these boxes should be in the order that they appear within the text.
We will see the following functions in Chapter 3.
table(): Uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
proportions(): Returns conditional proportions given entries of x divided by the appropriate sum(s).
barplot(): Creates a bar plot with vertical or horizontal bars.
hist(): Computes a histogram of the given data values.
stem(): Produces a stem-and-leaf plot of the values in x.
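To give a feel for the first two functions in such a list, here is a minimal sketch using the built-in `mtcars` data set. The choice of the `cyl` variable and the names `cyl_counts` and `cyl_props` are assumptions made for this illustration and are not drawn from Chapter 3. Note that `proportions()` requires R 4.0.1 or later; `prop.table()` does the same job in older versions.

```r
# Count how many cars in mtcars have 4, 6, or 8 cylinders.
cyl_counts <- table(mtcars$cyl)
cyl_counts   # 11 cars with 4 cylinders, 7 with 6, and 14 with 8

# Convert the counts to proportions of the 32 cars in the data set.
cyl_props <- proportions(cyl_counts)
cyl_props

# The proportions of a one-way table always add up to 1.
sum(cyl_props)
```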
0.4.9 Now It’s Your Turn!
If you read an entire book on the theory of how to properly shoot a basketball, would that improve your ability to make a free throw? The answer is probably not unless you actually took time to also practice the concepts you are learning. Now there are exercises at the end of every chapter (everyone loves homework), but the yellow Now It’s Your Turn! boxes, like the one below from Section 3.2.2, offer a low stakes way for students to check if they are grasping the material as they go.
Make a comparative bar chart of the number of gears a car had based on the shape of the engine. The two variables are gear which gives the number of forward gears a car had and vs which tells whether an engine was V-shaped (value of 0) or straight (value of 1). Try changing the order of the variables and playing with the beside argument. Can you see a relationship between the variables?
0.4.10 Platypus Oddity
The platypus is weird… there’s no way to get around that. However, so are most mathematicians. We celebrate the uniqueness of the platypus with a gray Platypus Oddity box at the beginning of each chapter immediately after sharing an image of a Statypus (a statistics loving platypus) entirely to bring humor and happiness to the reader! The following fact does not appear in any chapter, but is tucked away here for the most invested reader.
Platypuses are five times as sexy as humans. Well, at least “chromosomally.” A platypus has ten sex chromosomes while a human has only two.
Male platypuses have the pattern
\[\text{XYXYXYXYXY}\]
while females have the pattern
\[\text{XXXXXXXXXX}.\]
0.4.11 Definition
It is “turtles all the way down” as the saying goes. To make any headway in mathematics or statistics, we must begin with defining certain things and the green Definition boxes, like the one below for Definition 4.1, do just that. While not as fun as a fun fact about an egg-laying mammal, definitions cannot be left out. Some definitions here may not match ones you may have learned in the past and that is fine. It is up to the author of each book to define what terms mean within the pages (webpages, I guess) of their book.
Definition 0.1 Given a vector \({\bf x} = ( x_1, x_2, \ldots, x_n )\) having \(n\) values, we can define the arithmetic mean or simply mean of \({\bf x}\), which we denote as \(\bar{x}\) if \({\bf x}\) is a sample or \(\mu\) if \({\bf x}\) is the entire population, as follows. \[\bar{x} \text{ or }\mu= \frac{1}{n} \sum_{i = 1}^n x_i = \frac{1}{n} \left( x_1 + x_2 + \cdots + x_n \right) = \frac{ x_1 + x_2 + \cdots + x_n }{n}.\]
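The formula in this definition can be checked directly in base R. The sketch below uses a small made-up vector `x` (an assumption for illustration only) and compares the sum-divided-by-\(n\) calculation with the built-in `mean()` function.

```r
# A small made-up vector with n = 5 values.
x <- c(2, 4, 6, 8, 10)
n <- length(x)

# Apply the definition: add the values, then divide by how many there are.
manual_mean <- sum(x) / n   # (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6
manual_mean

# The built-in function gives the same value.
builtin_mean <- mean(x)
builtin_mean
```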
0.4.12 Remark
Sometimes a point needs to be made and stand out, but it’s not a potential mistake nor does it fit into any of the other categories of boxes we have here at Statypus. Orange Remark boxes like the one below from Section 9.3.2 fill this gap. They will contain a wide array of ideas and concepts that students should pay attention to.
Definition 9.11 is an interpretation of the “informal” definition of a \(p\)-value as given by the American Statistical Association. Unfortunately, a rigorous definition is not easily given, nor is its interpretation fully agreed upon.
0.4.13 Theorem
“Mathematicians turn coffee into theorems” is an old adage and isn’t necessarily untrue, although the author of this book prefers tea! The theorems of this book will appear in bright green Theorem boxes like the one below for Theorem 5.1. If it’s a theorem, it’s probably important.
Theorem 0.1 The correlation, \(r\), between two vectors, \({\bf x}\), and \({\bf y}\), satisfies the following:
\(-1 \leq r \leq 1\)
\(|r| =1\) means that the ordered pairs \(\big\{ (x_i, y_i)\big\}\) are collinear.
Correlation is a symmetric operation. That is, the correlation of \({\bf x}\) and \({\bf y}\) is the same as the correlation of \({\bf y}\) and \({\bf x}\), i.e. the order of the vectors does not matter.
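The three properties in this theorem can be spot-checked numerically with the base R function `cor()`. This is only a sketch on particular vectors, not a proof; the choice of the `wt` and `mpg` columns of `mtcars` and the made-up vectors `u` and `v` are assumptions for illustration.

```r
# Two numeric vectors from the built-in mtcars data set.
x <- mtcars$wt
y <- mtcars$mpg

# Property 1: the correlation lies between -1 and 1.
r_xy <- cor(x, y)
abs(r_xy) <= 1

# Property 3: swapping the order of the vectors gives the same value.
r_yx <- cor(y, x)
isTRUE(all.equal(r_xy, r_yx))

# Property 2 (one direction): points that lie exactly on a line
# have |r| = 1. Here v is a decreasing linear function of u, so
# the correlation is -1 up to floating point rounding.
u <- c(1, 2, 3, 4, 5)
v <- 3 - 2 * u
r_uv <- cor(u, v)
isTRUE(all.equal(r_uv, -1))
```

The comparison uses `all.equal()` rather than `==` because computed correlations are floating point numbers and may differ from the exact value in the last few digits.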