Chapter Learning Objectives
After reading this chapter, you should be able to do the following:
1. Organize measures into frequency distributions, ordered arrays, and stem-and-leaf plots.
2. Create pie charts, bar graphs, and frequency polygons using Excel.
3. Describe the components of data normality.
4. Judge data normality by performing manual calculations and by using Excel output.
5. Develop tools to identify outliers.
© 2016 Bridgepoint Education, Inc. All rights reserved. Not for resale or redistribution.
Introduction
People who like to organize things will especially like this chapter. What we cover here can be particularly helpful in an age where we are exposed to much more data than we can absorb. When the material is irrelevant, this data overload is not a problem, but when the information is important, we need ways to retain it. This chapter offers some solutions involving visual data displays, which an anecdote will help to illustrate.
During World War II, a British analyst was assigned to recommend to aircraft builders the points on airframes that should be reinforced with armor plating. Too much armor plating and the aircraft would lose maneuverability and range; too little and it would become too vulnerable to enemy fire. The analyst examined aircraft returning from combat, noted which areas showed damage, and drew pictures of the places where they had been hit. He recommended reinforcing the areas where the returning planes had not been damaged. How counterintuitive was that? As illogical as his approach seems, he reasoned that if the damage had been fatal to either the pilot or the aircraft’s ability to fly, the airplanes he examined would not have returned. So damage to the other areas, those left unmarked on the returning planes, was apparently the most serious, and those were the areas that needed the most protection.
This story is a lesson in the value of clarifying relationships with visual displays. Certainly, mathematical manipulation and statistical procedures are required at times, but often a necessary first step to understanding a data set is to arrange the data so that they can be visually analyzed. The understanding researchers gain from observation can then guide the mathematical analyses that follow.
Chapter 1 emphasized the descriptors and the statistical shorthand that allow us to classify and describe groups of data. That chapter limited descriptions to the scale of the data and the measures of central tendency and variability that allow data summaries. This chapter uses visual display for some of the same purposes and expands the applications for descriptive statistics.
2.1 From Description to Display
The study of statistics has an incremental nature: Each step becomes part of a more involved process later, which makes grasping the early topics important, since they are building blocks for subsequent ones. For now, we will use what we know about data scale and descriptive
statistics to arrange measures into the tables and figures that reveal the multiple dimensions of numerical data. Although the stakes for us may be different than they were for the British warplane analyst, the issues are important nevertheless.
Most audiences are more engaged by a visual display than by a text presentation. When a good deal of data must be communicated in a short time, a visual display serves as a good place to begin. The discussions that follow suggest some of the more common procedures for representing different kinds of data, if only to introduce them briefly. For someone interested in a more in-depth discussion, books by authors such as Friendly (2000) and Tufte (2001) will be helpful. Tufte in particular has a reputation for innovative and informative data displays.
Data distributions of one sort or another are ubiquitous. A glance at the latest news reports indicates how unemployment numbers have changed during the year. Checking how the stock market has fluctuated over today’s trading session indicates highs, lows, and the volume of trading. The fact that data fluctuate makes them interesting. Data that either all have the same value or that always occur in the same proportions leave little to be analyzed. They interest us much less than data for which proportions and frequencies change.
Frequency Distributions
Scores on most measures vary, but the variation will generally have some repetition. Whether college admissions test results or the scores on a statistics quiz, not all scores are equally likely; some will occur more frequently than others. Frequency distributions indicate the number of measures in a data set that have the same characteristic. They allow us to display scores in terms of both their variability and their frequency of occurrence.
Suppose a state board administers a licensing test for marriage and family counselors. Rather than report every individual score, the board finds it more economical to report test results in categories:
Meritorious
Exceeds Expectations
Pass
Pass with Exceptions
Fail
Consider the following example: A group of 25 graduates of State U’s marriage and family counseling program takes the test. Table 2.1 shows the group’s results.
Table 2.1: A frequency distribution for licensing test results

Licensing test results     f
Meritorious                4
Exceeds expectations       6
Pass                       8
Pass with exceptions       4
Fail                       3
Total                     25
Table 2.1 depicts a frequency distribution, with the symbol f indicating the number of scores that occur in a particular category. If each individual score had been entered rather than being grouped into categories, the result would have been a table with 25 discrete entries. Instead, the data in Table 2.1 represent a grouped frequency distribution. Such a table provides a compact presentation when there are many scores.
Ordered and Disordered Arrays
Table 2.1 is divided into categories, but if each of the 25 results was listed in ranked order from the four that were meritorious down to the three fails, the display would reflect an ordered array. If instead of listing them from highest to lowest, the board arbitrarily piled all the scores into the table, it would show, not surprisingly, a disordered array. In such a table, for example, although the meritorious scores would still occur as a group, they would be in no particular order. Table 2.1 is a much shorter display than either an ordered or a disordered array.
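The difference between a disordered and an ordered array can be sketched in a few lines of Python. The 25 values below are the hypothetical individual licensing scores that this chapter lists later, when it calculates the actual mean. The variable names and the scrambled starting order are assumptions made only for illustration.

```python
# The 25 hypothetical licensing-test scores used later in this chapter,
# entered here in an arbitrary (disordered) sequence for illustration.
disordered = [19, 3, 26, 34, 12, 15, 23, 8, 20, 33, 1, 24, 17,
              29, 14, 26, 9, 18, 6, 22, 33, 15, 11, 23, 19]

# An ordered array ranks the same scores from highest to lowest.
ordered = sorted(disordered, reverse=True)

print(ordered)
```

Both lists hold the same 25 entries; only the arrangement differs, which is why neither is as compact as the grouped table.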
When sample sizes are comparatively small (15 or 20 scores from a larger population, for example), the type of presentation is not an issue, but presentation would be a greater issue if the frequency distribution included data for every aspiring marriage and family counselor in the state who took the licensing test. Even if hundreds of scores were being reported, a grouped frequency distribution would have the same number of rows as Table 2.1. Frequency distributions, then, can make a presentation compact.
Jokela (2012) studied whether associations between individuals’ personality traits and whether they have children are affected by when they were born. Table 2.2 is part of his subjects’ description. It shows the birth cohort, or particular period of birth, and gender for 6,259 subjects (2,971 men and 3,288 women) in a relatively compact display.
Class Intervals
The “groups” in grouped frequency distributions (the birth cohorts in Table 2.2) are called class intervals. Although they provide an economical data presentation and make a great deal of data accessible to even a casual observer, some details are inevitably lost. It is not apparent from studying Table 2.1, for example, which numerical test scores belong to a particular class
interval. We can address that deficiency by incorporating a list of score ranges, which might be the following:
28–34 Meritorious
21–27 Exceeds Expectations
14–20 Pass
7–13 Pass with Exceptions
0–6 Fail
With the ranges, we know how scores were classified, but it still is not apparent exactly how one individual whose score is in the “pass” interval, for example, scored. The person could have scored anywhere from 14 to 20. We know only the category. The same difficulty emerges in Table 2.2. The table shows 347 female subjects in the 1920–1929 birth cohort, but it does not make any distinction within the 1920–1929 group, a range of 9 years.
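To make the grouping concrete, the following is a minimal sketch of how individual scores could be sorted into the class intervals listed above and then tallied. The scores are the hypothetical individual results that the chapter lists later; the function name `classify` and the table layout are assumptions chosen for illustration, not part of the licensing board's procedure.

```python
from collections import Counter

# Apparent limits for each class interval, taken from the list of score ranges.
CLASS_INTERVALS = [
    ("Meritorious", 28, 34),
    ("Exceeds Expectations", 21, 27),
    ("Pass", 14, 20),
    ("Pass with Exceptions", 7, 13),
    ("Fail", 0, 6),
]


def classify(score):
    """Return the label of the class interval containing a whole-number score."""
    for label, low, high in CLASS_INTERVALS:
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} falls outside every class interval")


# Hypothetical individual scores (the same 25 values the chapter lists later).
scores = [34, 33, 33, 29, 26, 26, 24, 23, 23, 22,
          20, 19, 19, 18, 17, 15, 15, 14, 12, 11, 9, 8, 6, 3, 1]

# Tally how many scores fall into each interval: the frequency f.
frequencies = Counter(classify(s) for s in scores)

for label, low, high in CLASS_INTERVALS:
    print(f"{low:>2}-{high:<2}  {label:<21} f = {frequencies[label]}")
```

The tallies reproduce the frequencies in Table 2.1, and the sketch also shows what is lost: once only `frequencies` is kept, the individual values in `scores` cannot be recovered.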
If we cannot know precisely how a particular individual scored, or the exact year in which a subject was born (Table 2.2), the data can at least be roughly ranked. Clearly, those in Table 2.1 who “exceeded expectations” did better than those in the pass category, although exactly how much better is not indicated.
Estimating the Mean from a Class Interval
Indicating the score frequencies in the class intervals reduces the scores to values that can be ranked approximately. Even without the individual scores, we can use the categories to estimate the mean of the scores from class intervals. To estimate the mean from class intervals,
1. Determine the midpoint of each class interval.
2. Multiply each midpoint by the number of scores in its interval.
3. Sum the products.
4. Divide the sum of the products by the total number of scores.
Table 2.2: A grouped frequency distribution of subjects’ birth cohort

Birth year    Men (2,971)    Women (3,288)
1914–1919               0                0
1920–1929             316              347
1930–1939             498              614
1940–1949             732              795
1950–1959             816              802
1960–1969             585              707
1970–1979              24               23

Source: Jokela, M. (2012). Birth-cohort effects in the association between personality and fertility. Psychological Science, 23, 835–841.
Try It!: #1
According to the discussion of the scale of data in Chapter 1, what scale do data categories such as meritorious, exceeds expectations, and so on indicate?
To see how accurate the estimated mean is, using the data in Table 2.1, we will first calculate the actual mean. Perhaps for the licensing test data in the grouped frequency distribution above, the individual scores were the following:
Meritorious: 34, 33, 33, 29
Exceeds Expectations: 26, 26, 24, 23, 23, 22
Pass: 20, 19, 19, 18, 17, 15, 15, 14
Pass with Exceptions: 12, 11, 9, 8
Fail: 6, 3, 1
Using the formula for the mean, $M = \frac{\sum x}{n}$, verify that $\frac{460}{25} = 18.40$.
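The verification can also be done in a few lines of Python. This is only a sketch of the arithmetic, using the hypothetical individual scores listed above; the variable names are illustrative.

```python
# The 25 hypothetical individual scores, grouped by category as listed above.
scores = [
    34, 33, 33, 29,                    # Meritorious
    26, 26, 24, 23, 23, 22,            # Exceeds Expectations
    20, 19, 19, 18, 17, 15, 15, 14,    # Pass
    12, 11, 9, 8,                      # Pass with Exceptions
    6, 3, 1,                           # Fail
]

total = sum(scores)   # the sum of x
n = len(scores)       # the number of scores
mean = total / n      # M = (sum of x) / n

print(total, n, mean)
```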
Now, to estimate the mean based on the class intervals, follow these four steps:
1. Determine the midpoint of each class interval by
   a) adding the two possible extreme scores within each interval (not the actual scores) and then
   b) dividing by 2.

   Meritorious: (28 + 34)/2 = 31
   Exceeds Expectations: (21 + 27)/2 = 24
   Pass: (14 + 20)/2 = 17
   Pass with Exceptions: (7 + 13)/2 = 10
   Fail: (0 + 6)/2 = 3

2. Multiply the midpoint values from Step 1 by the number of scores in the interval.

   31 × 4 = 124
   24 × 6 = 144
   17 × 8 = 136
   10 × 4 = 40
   3 × 3 = 9

3. Sum Step 2’s products (the midpoints times the number of values).

   124 + 144 + 136 + 40 + 9 = 453

4. Divide the sum of the products from Step 3 by the number of scores.

   453/25 = 18.12
The actual mean is 18.40. The estimated mean is 18.12.
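The four steps can be written as a short Python sketch. It uses only the score ranges and the frequencies from Table 2.1, not the individual scores; the variable names are assumptions made for illustration.

```python
# Each row: (category, lowest possible score, highest possible score, frequency f).
# Score ranges come from the chapter's list; frequencies come from Table 2.1.
intervals = [
    ("Meritorious", 28, 34, 4),
    ("Exceeds Expectations", 21, 27, 6),
    ("Pass", 14, 20, 8),
    ("Pass with Exceptions", 7, 13, 4),
    ("Fail", 0, 6, 3),
]

# Step 1: midpoint of each interval from its two possible extreme scores.
midpoints = [(low + high) / 2 for _, low, high, _ in intervals]

# Step 2: midpoint times the number of scores in the interval.
products = [m * f for m, (_, _, _, f) in zip(midpoints, intervals)]

# Step 3: sum the products.
sum_of_products = sum(products)

# Step 4: divide by the total number of scores.
n = sum(f for _, _, _, f in intervals)
estimated_mean = sum_of_products / n

print(midpoints)
print(products)
print(sum_of_products, n, estimated_mean)
```

Weighting each midpoint by its frequency matters: a plain average of the five midpoints would treat the eight "pass" scores and the three "fail" scores as if they counted equally.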
Because this is an estimate, there will generally be a minor discrepancy between the value estimated from the class intervals and the actual value of the mean. In this example, the difference between the estimated and actual mean is 0.28. As the number of values in the data set increases, the discrepancy will usually diminish. The point is that with only the values that constitute the class intervals and the number of scores in each interval, it is possible to estimate the value of the mean. That can be helpful in a data summary when the original scores are unavailable, as is the case for data in Table 2.2. Whenever the value of M is estimated from the class intervals, any reporting of the value must clearly state that it is an estimate and that it was not calculated directly from the raw data.
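As an illustration of that last point, the same procedure can be applied to the birth cohorts in Table 2.2, where no individual birth years are available. The sketch below assumes that each cohort's midpoint is the average of its first and last listed year, following the midpoint rule used above; the function name and column positions are illustrative, and the results are estimates made here, not values reported by Jokela (2012).

```python
# Each row: (first birth year, last birth year, men, women), from Table 2.2.
cohorts = [
    (1914, 1919, 0, 0),
    (1920, 1929, 316, 347),
    (1930, 1939, 498, 614),
    (1940, 1949, 732, 795),
    (1950, 1959, 816, 802),
    (1960, 1969, 585, 707),
    (1970, 1979, 24, 23),
]


def estimated_mean_year(rows, column):
    """Estimate the mean birth year from cohort midpoints weighted by frequency.

    `column` is 2 for men and 3 for women (an assumption of this sketch).
    """
    total_f = sum(row[column] for row in rows)
    weighted_sum = sum(((row[0] + row[1]) / 2) * row[column] for row in rows)
    return weighted_sum / total_f


men_estimate = estimated_mean_year(cohorts, 2)
women_estimate = estimated_mean_year(cohorts, 3)

# Both values are estimates from class intervals, not means of raw data.
print(round(men_estimate, 1), round(women_estimate, 1))
```

Under these assumptions both estimates fall in the late 1940s, and they should be reported as estimates from class intervals rather than as means of the raw data.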
The Difference Between Apparent and Actual Limits
For the licensing data, the scores are all whole numbers: integers. This makes creating the class intervals easy, but researchers often work with data that include decimal values, and class limits must accommodate any value between the highest and lowest integers. The highest and lowest integers in the category represent the apparent limits of the class interval. For example, in Table 2.1’s meritorious category, the apparent limits are 28 and 34. If the scores do not involve decimal values, determining class limits does not pose a problem, but sometimes decimals are part of the data being represented. A student’s grade point average, for example, is likely to have a decimal value. Ordinary grading procedures also often include decimals. If the lower limit for A work is 90% and the upper limit for B work is 89%, to which class interval does 89.5% belong?
To accommodate any value, class intervals must have actual limits in addition to apparent limits. In the case of grade averages and a great many other kinds of data, the class interval actually extends from a half point below the lowest whole number in the interval to a half point above the highest. That means the lower limit for an A would be 89.5%. For the 21–27 class interval (exceeds expectations), the actual limits are 20.5 to 27.5. If we subtract the lower from the upper actual limit, we have the width of the class interval: 27.5 − 20.5 = 7.0.
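The relationship between apparent limits, actual limits, and interval width can be sketched as a small Python function. The function name and the `unit` parameter are assumptions added for illustration; `unit` is the size of one measurement step, taken to be 1 for whole-number scores as in the examples above.

```python
def actual_limits(apparent_low, apparent_high, unit=1.0):
    """Return the actual (lower, upper) limits for a class interval.

    The interval is extended by half a measurement unit in each direction.
    `unit` is an assumed parameter: 1 for data recorded in whole numbers.
    """
    half = unit / 2
    return apparent_low - half, apparent_high + half


# The "exceeds expectations" interval has apparent limits of 21 and 27.
lower, upper = actual_limits(21, 27)
width = upper - lower

print(lower, upper, width)

# With a lower apparent limit of 90%, the actual lower limit for an A is 89.5%.
# The upper value of 100 is a placeholder chosen only for this example.
a_lower, _ = actual_limits(90, 100)
print(a_lower)
```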