11.1. ASSESSING MODELS
In data science, a “model” is a set of assumptions about data. Often, models include assumptions about chance processes used to generate data.

14.1. PROPERTIES OF THE MEAN
14.1.2. The Mean is a “Smoother”. You can think of taking the mean as an “equalizing” or “smoothing” operation. For example, imagine the entries in not_symmetric above as the dollars in the pockets of four different people. To get the mean, you first put all of the money into one big pot and then divide it evenly among the four people.
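The pool-and-divide view of the mean can be sketched in a few lines. The four dollar amounts below are hypothetical stand-ins for the entries of not_symmetric, which are not shown in this excerpt.

```python
# Pooling-and-dividing view of the mean: a minimal sketch.
# The four dollar amounts are hypothetical, not the book's data.
pockets = [2, 3, 3, 9]

pot = sum(pockets)                # put all the money into one big pot
equal_share = pot / len(pockets)  # divide it evenly among the people

print(equal_share)  # 4.25, the same as the mean of the list
```

Dividing the pooled total evenly is exactly the arithmetic mean, which is why the mean is an "equalizer."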
8.1. APPLYING A FUNCTION TO A COLUMN
8.1.2. Functions as Values. We’ve seen that Python has many kinds of values. For example, 6 is a number value, "cake" is a text value, Table() is an empty table, and ages is a name for a table value (since we defined it above). In Python, every function, including cut_off_at_100, is also a value. It helps to think about recipes again.

12. COMPARING TWO SAMPLES
We have seen several examples of assessing whether a single sample looks like random draws from a specified chance model.

8.3. CROSS-CLASSIFYING BY MORE THAN ONE VARIABLE
When individuals have multiple features, there are many different ways to classify them. For example, if we have a population of college students for each of whom we have recorded a major and the number of years in college, then the students could be classified by major, or by year, or by a combination of major and year.

17.5. THE ACCURACY OF THE CLASSIFIER
17.5.1. Measuring the Accuracy of Our Wine Classifier. Let’s apply the hold-out method to evaluate the effectiveness of the \(k\)-nearest neighbor classifier for identifying wines. The data set has 178 wines, so we’ll randomly permute the data set and put 89 of them in the training set and the remaining 89 in the test set.
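The hold-out method described in 17.5 above can be sketched without the wine data: randomly permute the labeled examples, split them in half, fit on one half, and measure accuracy on the other. The data and the threshold classifier below are both made up for illustration; they are not the book's \(k\)-nearest neighbor classifier.

```python
import random

# Hold-out evaluation sketch with hypothetical (value, label) pairs,
# not the 178-wine data set from the text.
random.seed(0)
data = ([(random.gauss(0, 1), 'A') for _ in range(50)] +
        [(random.gauss(3, 1), 'B') for _ in range(50)])

random.shuffle(data)                 # randomly permute the data set
train, test = data[:50], data[50:]   # half for training, half for testing

def classify(x):
    # Stand-in classifier: assign the class whose training mean is closer.
    # (An assumption for the sketch, not the book's k-NN.)
    mean_a = (sum(v for v, c in train if c == 'A') /
              sum(1 for _, c in train if c == 'A'))
    mean_b = (sum(v for v, c in train if c == 'B') /
              sum(1 for _, c in train if c == 'B'))
    return 'A' if abs(x - mean_a) < abs(x - mean_b) else 'B'

# Accuracy is measured only on the held-out test set.
accuracy = sum(classify(v) == c for v, c in test) / len(test)
print(accuracy)
```

The key point of the protocol is that the test half never influences the classifier, so the accuracy estimate is not overly optimistic.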
2.3. ESTABLISHING CAUSALITY
In the language developed earlier in the section, you can think of the people in the S&V houses as the treatment group, and those in the Lambeth houses as the control group.

7.1. VISUALIZING CATEGORICAL DISTRIBUTIONS
7.1.2. Features of Categorical Distributions. Apart from purely visual differences, there is an important fundamental distinction between bar charts and the two graphs that we saw in the previous sections.
8.4. JOINING TABLES BY COLUMNS
Often, data about the same individuals is maintained in more than one table. For example, one university office might have data about each student’s time to completion of degree, while another has data about the student’s tuition and financial aid.

2. CAUSALITY AND EXPERIMENTS
“These problems are, and will probably ever remain, among the inscrutable secrets of nature. They belong to a class of questions radically inaccessible to the human intelligence.” —The Times of London, September 1849, on how cholera is contracted and spread. Does the death penalty have a deterrent effect?
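The joining idea from 8.4 above can be sketched with plain dicts: match rows from the two offices' tables on the shared student column. The names and numbers are invented; the datascience library's Table.join does this for real tables.

```python
# Join-by-column sketch; all rows here are hypothetical.
completion = [
    {'student': 'Ana', 'years_to_degree': 4},
    {'student': 'Ben', 'years_to_degree': 5},
]
aid = [
    {'student': 'Ana', 'tuition': 14000, 'aid': 6000},
    {'student': 'Ben', 'tuition': 14000, 'aid': 0},
]

# Index one table by the join column, then look up each row of the other.
aid_by_student = {row['student']: row for row in aid}
joined = [{**row, **aid_by_student[row['student']]}
          for row in completion if row['student'] in aid_by_student]

print(joined[0])
```

Rows with no match in the second table are simply dropped, which mirrors an inner join.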
15.1. CORRELATION
The table hybrid contains data on hybrid passenger cars sold in the United States from 1997 to 2013. The data were adapted from the online data archive of Prof. Larry Winner of the University of Florida. The columns are: vehicle, the model of the car; year, the year of manufacture; msrp, the manufacturer’s suggested retail price in 2013 dollars; and acceleration, the acceleration rate in km per hour per second.

1. WHAT IS DATA SCIENCE?
Data Science is about drawing useful conclusions from large and diverse data sets through exploration, prediction, and inference.

16.1. A REGRESSION MODEL
In reality, of course, we will never see the true line. What the simulation shows is that if the regression model looks plausible, and if we have a large sample, then the

3.2.1. EXAMPLE: GROWTH RATES
The relationship between two measurements of the same quantity taken at different times is often expressed as a growth rate. For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012. To compute a growth rate, we must first decide which value to treat as the initial amount.
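Treating the 2002 figure as the initial amount, the growth-rate computation from 3.2.1 looks like this; the annual-rate step assumes the standard compound-growth formula over the ten years.

```python
# Growth rate from the federal employment figures in the text.
initial = 2_766_000   # employees in 2002, treated as the initial amount
final = 2_814_000     # employees in 2012

growth_rate = final / initial - 1                # total growth over 10 years
annual_rate = (final / initial) ** (1 / 10) - 1  # equivalent annual rate

print(round(growth_rate, 4), round(annual_rate, 4))
```

Swapping which value counts as "initial" would give a slightly different (negative) rate, which is why the choice must be made first.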
10. SAMPLING AND EMPIRICAL DISTRIBUTIONS
An important part of data science consists of making conclusions based on the data in random samples. In order to correctly interpret their results, data scientists have to first understand exactly what random samples are.

15.3. THE METHOD OF LEAST SQUARES
Our first example is a dataset that has one row for every chapter of the novel “Little Women.” The goal is to estimate the number of characters (that is, letters, spaces, punctuation marks, and so on) based on the number of periods.

8.5. BIKE SHARING IN THE BAY AREA
We end this chapter by using all the methods we have learned to examine a new and large dataset. We will also introduce map_table, a powerful visualization tool. The Bay Area Bike Share service published a dataset describing every bicycle rental from September 2014 to August 2015 in their system. There were 354,152 rentals in all.

COMPUTATIONAL AND INFERENTIAL THINKING: THE FOUNDATIONS OF DATA SCIENCE
By Ani Adhikari and John DeNero, with contributions by David Wagner and Henry Milner. This text was originally developed for the UC Berkeley course Data 8: Foundations of Data Science. You can view this text online or view the source. The contents of this book are licensed for free consumption under the following
12.1. A/B TESTING
12.1.1. Smokers and Nonsmokers. The table births contains the following variables for 1,174 mother-baby pairs: the baby’s birth weight in ounces, the number of gestational days, the mother’s age in completed years, the mother’s height in inches, pregnancy weight in pounds, and whether or not the mother smoked during pregnancy.
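An A/B comparison like the smokers-versus-nonsmokers analysis can be sketched with a permutation test: shuffle the group labels and see how often the simulated difference in means is as large as the observed one. The birth weights below are randomly generated stand-ins, not the 1,174-row births table.

```python
import random
import statistics

# Permutation (A/B) test sketch with made-up birth weights.
random.seed(0)
smokers = [random.gauss(114, 17) for _ in range(40)]
nonsmokers = [random.gauss(123, 17) for _ in range(60)]

observed = statistics.mean(nonsmokers) - statistics.mean(smokers)

pooled = smokers + nonsmokers
diffs = []
for _ in range(1000):
    random.shuffle(pooled)              # relabel at random under the null
    sim_s, sim_n = pooled[:40], pooled[40:]
    diffs.append(statistics.mean(sim_n) - statistics.mean(sim_s))

# Empirical p-value: how often chance alone produces a difference
# at least as large as the one we observed.
p_value = sum(d >= observed for d in diffs) / len(diffs)
print(observed, p_value)
```

A small p-value says the observed difference is unlikely under random relabeling, which is the logic of the A/B test.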
14.6. CHOOSING A SAMPLE SIZE
14.6.2. The SD of a collection of 0’s and 1’s. If we knew the SD of the population, we’d be done: we could solve for the sample size that makes the confidence interval as narrow as we want.
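For a population of 0's and 1's, the SD is sqrt(p(1-p)) where p is the proportion of 1's, and it is never more than 0.5. Using that worst case in the usual ~95% interval width of about 4 SDs of the sample mean gives a guaranteed sample size:

```python
import math

# SD of a collection of 0's and 1's: sqrt(p * (1 - p)).
def zero_one_sd(p):
    return math.sqrt(p * (1 - p))

worst_case_sd = zero_one_sd(0.5)   # 0.5, the largest possible value

# Sample size so that a ~95% confidence interval for a proportion
# has total width at most 0.01, using the worst-case SD:
width = 0.01
n = (4 * worst_case_sd / width) ** 2
print(n)  # 40000.0
```

Because 0.5 is an upper bound on the SD, this sample size works no matter what the true proportion turns out to be.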
6. TABLES
The with_columns method on a table constructs a new table with additional labeled columns. Each column of a table is an array. To add one new column to a table, call with_columns with a label and an array. (The with_column method can be used with the same effect.) Below, we begin each example with an empty table that has no columns.
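The label-then-array pattern of with_columns can be sketched with a plain dict of lists; this is an emulation of the idea, not the datascience library itself, and the flavor/price data is invented.

```python
# Sketch of the with_columns pattern: label, array, label, array, ...
def with_columns(table, *labels_and_columns):
    new = dict(table)  # construct a NEW table; the original is unchanged
    for i in range(0, len(labels_and_columns), 2):
        label = labels_and_columns[i]
        column = list(labels_and_columns[i + 1])
        new[label] = column
    return new

empty = {}  # begin with an empty table that has no columns
t = with_columns(empty, 'flavor', ['chocolate', 'vanilla'], 'price', [5, 4])
print(t)
```

Note that the original table is left untouched; like the real method, this returns a new table rather than modifying in place.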
15. PREDICTION
The prediction at a given midparent height lies roughly at the center of the vertical strip of points at the given height. This method of prediction is called regression. Later in this chapter we will see whether we can avoid our arbitrary definitions of “closeness” being “within 0.5 inches”.
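The vertical-strip idea can be sketched directly: average the child heights of all points whose midparent height is within 0.5 inches of the given height. The (midparent, child) pairs below are hypothetical, not Galton's data.

```python
# Graph-of-averages prediction sketch with hypothetical height pairs.
points = [(64.0, 63.5), (64.3, 65.0), (67.0, 66.2),
          (67.4, 68.0), (70.1, 69.5), (70.2, 71.0)]

def predict(midparent, close=0.5):
    # The vertical strip: child heights whose midparent height is
    # within `close` inches of the given height.
    strip = [child for mp, child in points if abs(mp - midparent) <= close]
    return sum(strip) / len(strip)

print(predict(64.1))  # averages the two nearby points: 64.25
```

The arbitrary choice of 0.5 inches is exactly the "closeness" definition the chapter later replaces with the regression line.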
13.2. THE BOOTSTRAP
13.2.1. Employee Compensation in the City of San Francisco. SF OpenData is a website where the City and County of San Francisco make some of their data publicly available. One of the data sets contains compensation data for employees of the City.
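The bootstrap itself is simple to sketch: resample the sample with replacement, at the same size, many times, and collect the statistic. The salaries below are made up, not the SF compensation data.

```python
import random
import statistics

# Bootstrap sketch: resample WITH replacement, same size as the sample.
random.seed(0)
sample = [52000, 61000, 58000, 75000, 49000, 66000, 71000, 55000]

boot_medians = []
for _ in range(2000):
    resample = random.choices(sample, k=len(sample))  # with replacement
    boot_medians.append(statistics.median(resample))

boot_medians.sort()
# Middle 95% of bootstrap medians: an approximate confidence interval.
lo, hi = boot_medians[50], boot_medians[1949]
print(lo, hi)
```

The interval's endpoints come entirely from the observed sample, which is the point of the bootstrap: the sample stands in for the population.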
5.2. RANGES
A range is an array of numbers in increasing or decreasing order, each separated by a regular interval. Ranges are useful in a surprisingly large number of situations, so it’s worthwhile to learn about them. Ranges are defined using the np.arange function, which takes either one, two, or three arguments: a start, an end, and a ‘step’.

1.1. CHAPTER 1: INTRODUCTION
Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data.
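The start/end/step behavior of np.arange described in 5.2 above can be sketched in pure Python (this is an emulation for illustration, not NumPy itself): the end value is excluded, and a negative step counts down.

```python
# Pure-Python sketch of np.arange(start, end, step).
def arange(start, end, step=1):
    out = []
    x = start
    while (step > 0 and x < end) or (step < 0 and x > end):
        out.append(x)
        x += step
    return out

print(arange(0, 10, 2))   # [0, 2, 4, 6, 8] -- end is excluded
print(arange(5, 0, -1))   # [5, 4, 3, 2, 1] -- decreasing order
```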
4. DATA TYPES
Every value has a type, and the built-in type function returns the type of the result of any expression. One type we have encountered already is a built-in function. Python indicates that the type is a builtin_function_or_method; the distinction between a function and a
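The type function can be applied to any expression's value, including a built-in function itself:

```python
# The built-in type function returns the type of any expression's value.
print(type(6))        # <class 'int'>
print(type("cake"))   # <class 'str'>
print(type(abs))      # <class 'builtin_function_or_method'>
```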
17.2. TRAINING AND TESTING
17.2.1. Overly Optimistic “Testing”. The training set offers a very tempting set of patients on whom to test out our classifier, because we know the class of each patient in the training set.
9. RANDOMNESS
As before, the random choice will not always be the same, so the result of the comparison won’t always be the same either. It will depend on whether treatment or control was chosen. With any cell that involves random selection, it is a good idea to run the cell several times to get a sense of the variability in the results.
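A minimal cell of this kind looks like the following; each run can print a different result, because the group is chosen at random each time.

```python
import random

# The comparison below depends on which group is chosen at random,
# so rerunning this cell can change the printed result.
group = random.choice(['treatment', 'control'])
print(group == 'treatment')
```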
18.1. A “MORE LIKELY THAN NOT” BINARY CLASSIFIER
18.1.2. Tree Diagram. The proportion that we have just calculated was based on a class of 100 students. But there’s no reason the class couldn’t have had 200 students, for example, as long as all the proportions in the cells were correct.
17. CLASSIFICATION
David Wagner is the primary author of this chapter. Machine learning is a class of techniques for automatically finding patterns in data and using them to draw inferences or make predictions. You have already seen linear regression, which is one kind of machine learning. This chapter introduces a new one: classification.
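The chapter's central idea, classifying a point by its nearest labeled neighbor, fits in a few lines. The two-attribute training points and class names below are hypothetical.

```python
# Minimal nearest-neighbor classification sketch on made-up points.
training = [((1.0, 1.0), 'blue'), ((1.5, 2.0), 'blue'),
            ((5.0, 5.0), 'gold'), ((5.5, 4.5), 'gold')]

def classify(point):
    # Predict the class of the closest training example
    # (1-nearest neighbor, using Euclidean distance).
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    closest = min(training, key=lambda row: dist(row[0], point))
    return closest[1]

print(classify((1.2, 1.4)))  # near the 'blue' cluster
print(classify((6.0, 5.0)))  # near the 'gold' cluster
```

Using \(k\) nearest neighbors instead of one, and taking a majority vote, gives the \(k\)-NN classifier the chapter develops.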
11. TESTING HYPOTHESES
Data scientists are often faced with yes-no questions about the world. You have seen some examples of such questions in this course.
As before, the random choice will not always be the same, so the result of the comparison won’t always be the same either. It will depend on whether treatment or control was chosen. With any cell that involves random selection, it is a good idea to run the cell severaltimes to get a
3.2.1. EXAMPLE: GROWTH RATES 3.2.1. Example: Growth Rates¶. The relationship between two measurements of the same quantity taken at different times is often expressed as a growth rate.For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012. To compute a growth rate, we must first decide which value to treat as the initial amount. 8.2. CLASSIFYING BY ONE VARIABLE 8.2.2. Finding a Characteristic of Each Category¶. The optional second argument of group names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the prices in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table. WWW.INFERENTIALTHINKING.COM 301 Moved Permanently. nginx COMPUTATIONAL AND INFERENTIAL THINKING: THE FOUNDATIONS OF Computational and Inferential Thinking: The Foundations of Data Science¶. By Ani Adhikari and John DeNero with contributions by David Wagner and Henry Milner.. This text was originally developed for the UC Berkeley course Data 8: Foundations of Data Science.. You can view this text online or view the source.. The contents of this book are licensed for free consumption under the following 1.1. CHAPTER 1: INTRODUCTION 1.1. Chapter 1: Introduction — Computational and Inferential Thinking. 1.1. Chapter 1: Introduction. Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data4. DATA TYPES
4. Data Types¶. Every value has a type, and the built-in type function returns the type of the result of any expression.. One type we have encountered already is a built-in function. Python indicates that the type is a builtin_function_or_method; the distinction betweena function and a
2. CAUSALITY AND EXPERIMENTS 2. Causality and Experiments¶ “These problems are, and will probably ever remain, among the inscrutable secrets of nature. They belong to a class of questions radically inaccessible to the human intelligence.” —The Times of London, September 1849, on how cholera is contracted and spread Does the death penalty have adeterrent effect?
17.2. TRAINING AND TESTING 17.2.1. Overly Optimistic “Testing”¶ The training set offers a very tempting set of patients on whom to test out our classifier, because we know the class of each patient in the training set. 6. TABLES — COMPUTATIONAL AND INFERENTIAL THINKINGINFERENTIAL THINKING GOALINFERENTIAL THINKING SKILLS The with_columns method on a table constructs a new table with additional labeled columns. Each column of a table is an array. To add one new column to a table, call with_columns with a label and an array. (The with_column method can be used with the same effect.). Below, we begin each example with an empty table that has no columns. 14.1. PROPERTIES OF THE MEAN 14.1.2. The Mean is a “Smoother”¶ You can think of taking the mean as an “equalizing” or “smoothing” operation. For example, imagine the entries in not_symmetric above as the dollars in the pockets of four different people. To get the mean, you first put all of the money into one big pot and then divide it evenly among the fourpeople.
9. RANDOMNESS
As before, the random choice will not always be the same, so the result of the comparison won’t always be the same either. It will depend on whether treatment or control was chosen. With any cell that involves random selection, it is a good idea to run the cell several times to get a sense of the variability in the result.
14.6. CHOOSING A SAMPLE SIZE
14.6.2. The SD of a collection of 0’s and 1’s¶. If we knew the SD of the population, we’d be done. We could use it to compute the SD of the sample mean and solve for the sample size that achieves the desired accuracy.

18.1. A “MORE LIKELY THAN NOT” BINARY CLASSIFIER
18.1.2. Tree Diagram¶. The proportion that we have just calculated was based on a class of 100 students. But there’s no reason the class couldn’t have had 200 students, for example, as long as all the proportions in the cells were correct.
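The sample-size reasoning in 14.6 works out as follows for a population of 0's and 1's, whose SD is at most 0.5. The accuracy target (a 95% confidence interval of total width 0.02) is a hypothetical choice for illustration:

```python
import math

# For 0/1 data, the population SD is at most 0.5.
max_sd = 0.5

# Hypothetical target: a 95% confidence interval of total width 0.02.
width = 0.02

# Width of an approximate 95% CI for a proportion: 4 * SD / sqrt(n).
# Solving 4 * max_sd / sqrt(n) <= width for n:
n = math.ceil((4 * max_sd / width) ** 2)
print(n)  # samples needed under these assumptions
```

Because 0.5 is the worst-case SD for 0/1 data, this n is sufficient no matter what the true population proportion is.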
17. CLASSIFICATION
17. Classification¶. David Wagner is the primary author of this chapter. Machine learning is a class of techniques for automatically finding patterns in data and using them to draw inferences or make predictions. You have already seen linear regression, which is one kind of machine learning. This chapter introduces a new one: classification.
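The hold-out idea from 17.2 (never test a classifier on its own training set) can be sketched with a random split. The 178-row size mirrors the wine data set mentioned in 17.5; the row contents here are placeholder indices:

```python
import random

# 178 rows, as in the wine data set; contents are placeholder indices.
rows = list(range(178))

# Randomly permute the data set, then split it in half.
random.shuffle(rows)
train, test = rows[:89], rows[89:]

# A classifier fit on `train` is evaluated only on `test`,
# so accuracy is measured on rows it has never seen.
assert not set(train) & set(test)
print(len(train), len(test))
```

Because the split is random, the test half is representative of the whole, yet none of its rows influenced the classifier, which avoids the overly optimistic "testing" the chapter warns about.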
12.1. A/B TESTING
12.1.1. Smokers and Nonsmokers¶. The table births contains the following variables for 1,174 mother-baby pairs: the baby’s birth weight in ounces, the number of gestational days, the mother’s age in completed years, the mother’s height in inches, pregnancy weight in pounds, and whether or not the mother smoked during pregnancy.

15.1. CORRELATION
The table hybrid contains data on hybrid passenger cars sold in the United States from 1997 to 2013. The data were adapted from the online data archive of Prof. Larry Winner of the University of Florida. The columns: vehicle: model of the car; year: year of manufacture; msrp: manufacturer’s suggested retail price in 2013 dollars; acceleration: acceleration rate in km per hour per second.

11. TESTING HYPOTHESES
11. Testing Hypotheses¶. Data scientists are often faced with yes-no questions about the world. You have seen some examples of such questions in this course.
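The two-sample comparison described in 12.1 is typically carried out with a permutation (label-shuffling) test. Here is a minimal sketch on made-up numbers, not the births table itself:

```python
import random

# Made-up outcome values for two groups (e.g. birth weights in pounds).
group_a = [7.5, 8.1, 6.9, 7.8, 8.0]
group_b = [6.5, 7.0, 6.8, 7.2, 6.6]
observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Under the null hypothesis the group labels don't matter,
# so shuffle the pooled values and recompute the difference of means.
pooled = group_a + group_b
reps, extreme = 10_000, 0
for _ in range(reps):
    random.shuffle(pooled)
    diff = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
    if abs(diff) >= abs(observed):  # at least as extreme as observed
        extreme += 1

p_value = extreme / reps
print(round(observed, 2), p_value)
```

A small p-value means a difference as large as the observed one is rare under random label assignment, which is the logic the A/B testing section builds on.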
* Introduction
* 1. Data Science
* 1.1 Introduction
* 1.1.1 Computational Tools
* 1.1.2 Statistical Techniques
* 1.2 Why Data Science?
* 1.3 Plotting the Classics
* 1.3.1 Literary Characters
* 1.3.2 Another Kind of Character
* 2. Causality and Experiments
* 2.1 John Snow and the Broad Street Pump
* 2.2 Snow’s “Grand Experiment”
* 2.3 Establishing Causality
* 2.4 Randomization
* 2.5 Endnote
* 3. Programming in Python
* 3.1 Expressions
* 3.2 Names
* 3.2.1 Example: Growth Rates
* 3.3 Call Expressions
* 3.4 Introduction to Tables
* 4. Data Types
* 4.1 Numbers
* 4.2 Strings
* 4.2.1 String Methods
* 4.3 Comparisons
* 5. Sequences
* 5.1 Arrays
* 5.2 Ranges
* 5.3 More on Arrays
* 6. Tables
* 6.1 Sorting Rows
* 6.2 Selecting Rows
* 6.3 Example: Population Trends
* 6.4 Example: Trends in Gender
* 7. Visualization
* 7.1 Categorical Distributions
* 7.2 Numerical Distributions
* 7.3 Overlaid Graphs
* 8. Functions and Tables
* 8.1 Applying Functions to Columns
* 8.2 Classifying by One Variable
* 8.3 Cross-Classifying
* 8.4 Joining Tables by Columns
* 8.5 Bike Sharing in the Bay Area
* 9. Randomness
* 9.1 Conditional Statements
* 9.2 Iteration
* 9.3 Simulation
* 9.4 The Monty Hall Problem
* 9.5 Finding Probabilities
* 10. Sampling and Empirical Distributions
* 10.1 Empirical Distributions
* 10.2 Sampling from a Population
* 10.3 Empirical Distribution of a Statistic
* 11. Testing Hypotheses
* 11.1 Assessing Models
* 11.2 Multiple Categories
* 11.3 Decisions and Uncertainty
* 11.4 Error Probabilities
* 12. Comparing Two Samples
* 12.1 A/B Testing
* 12.2 Deflategate
* 12.3 Causality
* 13. Estimation
* 13.1 Percentiles
* 13.2 The Bootstrap
* 13.3 Confidence Intervals
* 13.4 Using Confidence Intervals
* 14. Why the Mean Matters
* 14.1 Properties of the Mean
* 14.2 Variability
* 14.3 The SD and the Normal Curve
* 14.4 The Central Limit Theorem
* 14.5 The Variability of the Sample Mean
* 14.6 Choosing a Sample Size
* 15. Prediction
* 15.1 Correlation
* 15.2 The Regression Line
* 15.3 The Method of Least Squares
* 15.4 Least Squares Regression
* 15.5 Visual Diagnostics
* 15.6 Numerical Diagnostics
* 16. Inference for Regression
* 16.1 A Regression Model
* 16.2 Inference for the True Slope
* 16.3 Prediction Intervals
* 17. Classification
* 17.1 Nearest Neighbors
* 17.2 Training and Testing
* 17.3 Rows of Tables
* 17.4 Implementing the Classifier
* 17.5 The Accuracy of the Classifier
* 17.6 Multiple Regression
* 18. Updating Predictions
* 18.1 A "More Likely Than Not" Binary Classifier
* 18.2 Making Decisions

Powered by Jupyter Book
COMPUTATIONAL AND INFERENTIAL THINKING
THE FOUNDATIONS OF DATA SCIENCE
By Ani Adhikari and John DeNero
Contributions by David Wagner and Henry Milner
This is the textbook for the Foundations of Data Science class at UC Berkeley.
View this textbook online on GitHub Pages.
The contents of this book are licensed for free consumption under the following license:
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).