When I watch basketball on television, it is common for an announcer to state that some player has the hot-hand. This raises the question: Are Bernoulli trials an adequate model for the outcomes of successive shots in basketball? This paper addresses this question in a controlled (practice) setting. A large simulation study examines the power of the tests that have appeared in the literature as well as tests motivated by the work of Larkey, Smith, and Kadane (LSK). Three test statistics for the null hypothesis of Bernoulli trials have been considered in the literature; one of these, the runs test, is effective at detecting one-step autocorrelation, but poor at detecting nonstationarity. A second test is essentially equivalent to the runs test, and the third is shown to be worthless. The LSK-motivated tests are shown to be effective at detecting nonstationarity. Finally, a case study of 2,000 shots by a single player is analyzed. For this player, the model of Bernoulli trials is inadequate.

KEY WORDS: Bernoulli trials, the hot-hand, power, simulation study, case study.

1 Introduction

In this paper I consider a statistical analysis of basketball shooting in a controlled (practice) setting, with special interest in the hot-hand. In Section 2, I review and critically examine the two seminal papers on this topic: Gilovich, Vallone, and Tversky (GVT) [5], and Tversky and Gilovich (TG1) [10]. A simulation study of power is presented in Section 3. Finally, in Section 4, a case study of 2,000 trials is analyzed. In GVT and TG1, three additional topics appear which are beyond the scope of this paper:

1. Modeling game free throw shooting,

2. Modeling game shooting, and

3. Opinions and misconceptions of fans.

Readers interested in the first of these topics should refer to Wardrop [12] for a further analysis of the free throw data from the papers.

Several researchers have considered the second topic; the interested reader is referred to Larkey, Smith, and Kadane (LSK) [7], Tversky and Gilovich (TG2) [11], Hooke [6] and Forthofer [4]. For related work in baseball, see Albright [1], Albert [2], Stern and Morris [9] and Stern [8]. Topic 3 is considered briefly in Section 2 of this paper.

Finally, readers interested in statistical research in sports are referred to Bennett [3]. The chapters on basketball and baseball should prove to be of special interest to readers of this paper.

2 Review of Literature

GVT appeared in 1985 in a “psychology journal.” Four years later the same research was restructured as TG1 and appeared in a “statistics journal.” For the most part the papers present identical analyses and interpretations of the data, with the earlier paper generally providing more detail. Twenty-six members of the Cornell University varsity and junior varsity basketball teams generated the data that are examined. The players are labeled M1 (for male one) through M14, and F1 through F12. Each player provided two sequences of shot attempts: the shooting data and the prediction data. I will begin with an examination of the shooting data.

The plan was for each player to provide a sequence of 100 shots, but three of the players, M4 (90 shots), M7 (75), and M8 (50), fell short of the target number. Twenty-six null hypotheses are tested; namely, for each player the null hypothesis is that his or her shots satisfy the assumptions of Bernoulli trials. Below is one summary of the data obtained by M9.

Previous      Current Shot
Shot          S      F      Total
S             38     15     53
F             16     30     46

The researchers describe these data in two ways. First, note that M9 made 72 percent of his shots after a hit, but only 35 percent after a miss; a difference of 37 percentage points. Second, the researchers compute the serial correlation and obtain 0.37.
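Both descriptions can be reproduced from the table with a few lines of Python. This is only a sketch: it assumes that the serial correlation is well approximated by the phi coefficient of the 2 x 2 table of (previous, current) outcomes, and the function names are illustrative rather than taken from GVT or TG1.

```python
from math import sqrt

# Transition counts for M9, copied from the table above.
# Rows are the previous shot, columns are the current shot.
after_hit = {"S": 38, "F": 15}
after_miss = {"S": 16, "F": 30}


def hit_rate(row):
    """Proportion of successes on the current shot within one row."""
    return row["S"] / (row["S"] + row["F"])


def phi_coefficient(a, b, c, d):
    """Phi coefficient of the 2x2 table [[a, b], [c, d]].

    Assumption: used here as an approximation to the lag-one
    serial correlation of the shot sequence.
    """
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


p_after_hit = hit_rate(after_hit)
p_after_miss = hit_rate(after_miss)
phi = phi_coefficient(after_hit["S"], after_hit["F"],
                      after_miss["S"], after_miss["F"])

print(round(100 * p_after_hit), round(100 * p_after_miss), round(phi, 2))
# prints: 72 35 0.37
```

Under this approximation the correlation and the difference in conditional percentages nearly coincide because the row and column totals of the table are close to balanced.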

The researchers analyze each player’s data with three test statistics. The first two are a test of the serial correlation and the runs test. They summarize their findings as follows.

With the exception of one player (r = 0.37) who produced a significant positive correlation (and we might expect one significant result out of 26 just by chance), both the serial correlations and the distributions of runs indicated that the outcomes of successive shots are statistically independent.

Use the 25 counts to test the null hypothesis that the data come from a binomial distribution with n = 4 and p estimated as the proportion of successes obtained in the data. The first difficulty with implementing this test is that typically one or more of the expected counts is quite small. The researchers overcame this problem by combining the observed counts (the O’s) and expected counts (the E’s) to yield three response categories: fewer than 2, 2, and more than 2 successes, and then applied a test with one degree of freedom. The test can be made one-sided by rejecting if and only if the test would reject at 0.10 and E > O for the middle category (corresponding to two successes).

The rationale for this decision rule is that E > O in the central category indicates heavier tails, which implies more streakiness. The theoretical basis for this test is shaky, but the simulation study reported in Section 3 suggests that its probability of type 1 error might be close to its nominal level. It is unclear whether the researchers apply the test of fit as a one- or two-sided test. In any event, the researchers apply their test of fit to their 26 sets of data and report,

The results provided no evidence for departures from stationarity for any player but (M)9.

The researchers describe the shooting data in two other ways. First, they explore the possibility that the past influences the present only after two consecutive successes or failures. For example, this approach yields the following data for M9.

Next, I will criticize the above analysis.

Performing a test of serial correlation and a runs test is largely redundant. For the 26 players studied, the correlation coefficient for a player’s serial correlation and standardized number of runs is −0.993. (Two notes: Positive serial correlation corresponds to fewer runs than expected and a negative z; hence, the correlation is negative. While performing this computation I discovered two misprints in Table 4 of GVT. First, the entry for player M10 for P(hit | 1 hit) should read 0.58(60). Second, the serial correlation for F12 is −0.070.) Thus, it is incorrect to believe that performing these two tests gives the data “two chances” to reveal that the Bernoulli trials model is inadequate. Throughout the remainder of this paper I will focus on the runs test and disregard the test of serial correlation. The simulation study reported in Section 3 reveals that the runs test has some ability to detect autocorrelation, but is poor at detecting nonstationarity.

In GVT, the test of fit is introduced under the heading, “Test of Stationarity.” In TG1, it is described as “A more sensitive test of stationarity.” The simulation study reported in Section 3 indicates that the test of fit is, in fact, abysmally poor at detecting any but the most extreme form of nonstationarity. In addition, for every alternative examined in Section 3, other simple tests are far superior to the test of fit. To put it bluntly, the test of fit should not be used.
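For reference, the standardized number of runs on which the rest of this critique focuses can be sketched as follows. The sketch uses the usual normal approximation to the runs test; it is not the exact computation behind the P-values quoted in this paper, and the function names are illustrative.

```python
from math import erfc, sqrt


def runs_z(shots):
    """Number of runs and its standardized value for a 0/1 sequence.

    shots: list of 1s (hits) and 0s (misses) containing both outcomes.
    Fewer runs than expected (streakiness) gives a negative z.
    """
    n1 = sum(shots)
    n0 = len(shots) - n1
    n = n1 + n0
    # A new run starts at the first shot and at every change of outcome.
    runs = 1 + sum(a != b for a, b in zip(shots, shots[1:]))
    mean = 1 + 2 * n1 * n0 / n
    var = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n**2 * (n - 1))
    return runs, (runs - mean) / sqrt(var)


def lower_tail_p(z):
    """Approximate one-sided P-value P(Z <= z) for a standard normal Z."""
    return 0.5 * erfc(-z / sqrt(2))


# Hypothetical sequence: four hits followed by four misses (two runs).
runs, z = runs_z([1, 1, 1, 1, 0, 0, 0, 0])
print(runs, round(z, 3), round(lower_tail_p(z), 4))
```

The sign convention matches the note above: a streaky sequence has fewer runs than expected and therefore a negative z.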

Thus, my first conclusion is that the researchers used only one (distinct and possibly effective) test statistic, the runs test. As noted earlier the researchers find one significant result and note that, “We might expect one significant result out of 26 by chance.” I will now explain why I consider this quote to be an incomplete description of their findings.

The significant result was obtained by M9 and his exact one-sided P-value for the runs test is 0.000044. Having noted that only one P-value was smaller than 0.05, would it not have been fair to mention that one out of 26 P-values was smaller than 1 in 20,000? Is this result not a bit surprising to one who believes in the omnipresence of Bernoulli trials?
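An exact one-sided P-value for the runs test can be sketched by conditioning on the observed numbers of hits and misses and enumerating the null distribution of the number of runs. This is a sketch of the standard conditional calculation, offered under the assumption that it matches what is meant by "exact" here; it is not applied to M9's shot sequence, which is not reproduced in this paper, and the function names are illustrative.

```python
from math import comb


def runs_pmf(n1, n0):
    """Exact null distribution of the number of runs.

    Conditions on n1 hits and n0 misses (both at least 1), with every
    arrangement of the sequence equally likely.
    """
    total = comb(n1 + n0, n1)
    pmf = {}
    for r in range(2, n1 + n0 + 1):
        k, odd = divmod(r, 2)
        if odd:
            # k + 1 runs of one outcome interleaved with k runs of the other.
            ways = (comb(n1 - 1, k) * comb(n0 - 1, k - 1)
                    + comb(n1 - 1, k - 1) * comb(n0 - 1, k))
        else:
            # k runs of each outcome; either outcome may come first.
            ways = 2 * comb(n1 - 1, k - 1) * comb(n0 - 1, k - 1)
        if ways:
            pmf[r] = ways / total
    return pmf


def one_sided_p(runs, n1, n0):
    """P(number of runs <= runs): small values indicate streakiness."""
    return sum(p for r, p in runs_pmf(n1, n0).items() if r <= runs)


# Hypothetical example: 4 hits, 4 misses, only 2 runs observed.
print(one_sided_p(2, 4, 4))
```

In the hypothetical example only 2 of the 70 equally likely arrangements have as few as two runs, which is how the function arrives at its answer.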

Throughout the two papers the researchers refer to the alternative as being “the hot-hand.” For the two test statistics considered, the runs test and the test of fit, the hot-hand alternative, whether it means autocorrelation or nonstationarity, is naturally a one-sided alternative and should therefore, in my opinion, have a one-sided test. The researchers acknowledge this “one-sidedness” repeatedly; for example, in TG1 they refer to a negative serial correlation as coming from data that, “Run counter to the streak-shooting hypothesis.”

The one-sided versus two-sided debate has a long history and is not going to be settled here. I believe it is worth noting, however, that in addition to M9 two other players had one-sided P-values below 0.05 and two others had values slightly above 0.05. In particular, M3 shot better after a hit than after a miss by 18 percentage points, and gave an exact one-sided P-value of 0.0375. Similarly, M6 shot better after a hit than after a miss by 17 percentage points, and gave an exact P-value of 0.0403. Player F3 shot better after a hit than after a miss by 16 percentage points, and gave an exact P-value of 0.0680. Player M7 attempted only 75 shots and was better after a hit than after a miss by 15 percentage points; but with so little data the runs test is not statistically significant: the exact one-sided P-value is 0.0636. Disregarding M8 because he took only 50 shots, we find that:

• One of 25 P-values is extremely small, and

• Three of 25 P-values are smaller than 0.05 and two others are only slightly larger than 0.05.

This is not overwhelming evidence in support of the hot-hand theory, but it is equally wrong to interpret these data as indicating, as the researchers write in TG1, “People . . . tend to detect patterns even where none exist.” In addition, as shown in Section 3, the runs test is poor at detecting nonstationarity. Who can say how many of these 25 players exhibited some form of nonstationarity?
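As a rough check on the two tallies above, one can ask how likely they would be if all 25 players were in fact shooting Bernoulli trials. The sketch below assumes that the 25 tests are independent and that each P-value falls below a threshold with probability exactly equal to that threshold; exact tests on discrete data are somewhat conservative, so these figures are approximations only.

```python
from math import comb


def prob_at_least(k, n, alpha):
    """P(at least k of n independent P-values fall below alpha) under the null."""
    return sum(comb(n, j) * alpha**j * (1 - alpha) ** (n - j)
               for j in range(k, n + 1))


n_players = 25  # M8 disregarded, as in the text

expected_below_05 = n_players * 0.05
p_three_or_more = prob_at_least(3, n_players, 0.05)
p_any_as_small_as_m9 = prob_at_least(1, n_players, 0.000044)

print(expected_below_05)
print(round(p_three_or_more, 3))
print(round(p_any_as_small_as_m9, 4))
```

Under these assumptions, three or more P-values below 0.05 is unremarkable (roughly one chance in eight), whereas a single P-value as small as 0.000044 among 25 players would arise by chance only about once in a thousand such studies.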

There is, of course, an advantage to using only one test statistic; namely, one test statistic makes it relatively easy to keep track of the probability of a type 1 error. If, however, the analyst guesses wrong about the form of the alternative, then performing the test is, at best, a waste of time, and, at worst, misleading. How is one to decide on an alternative? The definitive answer can be obtained only by examining the performances of a great number of basketball players. Such an examination might show that departures from Bernoulli trials are:

• Almost always in the form of autocorrelation,

• Almost always in the form of nonstationarity,

• Frequently in the form of autocorrelation and frequently in the form of nonstationarity,

• So rare and so minor as to be unworthy of further attention, or

• Some other pattern.
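To make the distinction between the first two forms concrete, the sketch below generates shot sequences under the null model and under one simple version of each alternative. The particular models are assumptions chosen for illustration (a two-state Markov chain for autocorrelation, and a single "hot" stretch with a raised success probability for nonstationarity); they are not claimed to be the alternatives studied in Section 3, and all function and parameter names are hypothetical.

```python
import random


def bernoulli_shots(n, p, rng):
    """Null model: independent shots with constant success probability p."""
    return [int(rng.random() < p) for _ in range(n)]


def autocorrelated_shots(n, p_after_hit, p_after_miss, rng):
    """Autocorrelation: success probability depends on the previous shot only.

    The first shot uses the chain's long-run hit rate.
    Requires p_after_miss > 0 or p_after_hit < 1.
    """
    p_start = p_after_miss / (1 - p_after_hit + p_after_miss)
    shots = [int(rng.random() < p_start)]
    for _ in range(n - 1):
        p = p_after_hit if shots[-1] else p_after_miss
        shots.append(int(rng.random() < p))
    return shots


def nonstationary_shots(n, p_usual, p_hot, hot_start, hot_length, rng):
    """Nonstationarity: independent shots, but p is raised to p_hot during
    one stretch of hot_length shots beginning at index hot_start."""
    shots = []
    for i in range(n):
        hot = hot_start <= i < hot_start + hot_length
        shots.append(int(rng.random() < (p_hot if hot else p_usual)))
    return shots


rng = random.Random(2024)  # fixed seed so the sketch is repeatable
print(sum(bernoulli_shots(100, 0.5, rng)))
print(sum(autocorrelated_shots(100, 0.6, 0.4, rng)))
print(sum(nonstationary_shots(100, 0.5, 0.8, 40, 20, rng)))
```

In the autocorrelated sequence every shot carries information about the next one; in the nonstationary sequence the shots are independent throughout, and only the underlying success probability changes for a while.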

Until such data are available, in addition to keeping an open mind one might choose to be guided by the opinions of experts (in basketball, not statistics!). The researchers preface TG1 with a quote from professional basketball player Purvis Short:

You’re in a world all your own. It’s hard to describe. But the basket seems to be so wide. No matter what you do, you know the ball is going to go in.

The researchers then write,

This statement describes a phenomenon known to everyone who plays or watches the game of basketball, a phenomenon known as the “hot hand.”

Clearly, Short is describing an occasional phenomenon and is not describing anything as omnipresent as lag one autocorrelation. Unfortunately, the researchers appear to be confused about the distinction between autocorrelation and nonstationarity. For example, immediately after the previously quoted statement, they write,

The term refers to the putative tendency for success (and failure) in basketball to be self- promoting or self-sustaining.

In other words, they are saying that Short’s description of nonstationarity is the hot-hand which in turn is autocorrelation! Later they write,

Do players occasionally get a “hot hand”?

This question now suggests that the researchers acknowledge that the hot-hand might be nonstationarity rather than autocorrelation.

The concluding section of TG1 is titled, “The Hot Hand as Cognitive Illusion,” and it contains further evidence of the researchers’ confusion about alternatives. They write,

Naturally, every now and then, a player may make, say, nine of ten shots, and one may wish to claim—after the fact—that he was hot. Such use, however, is misleading if the length and frequency of such streaks do not exceed chance expectation.

Nowhere in either paper do the researchers analyze their data searching for streaks of successes or streaks of k successes in k + 1 shots. But in this passage they seem to be suggesting that their data contained no such unusual streaks. In a later paper, TG2, the researchers report,

[In] Our previous analyses . . . We found . . . the frequency of streaks of various lengths was not significantly different from that expected by chance.

This statement puzzles me because I found no evidence in GVT or TG1 that the researchers examined the lengths of streaks of successes. The following revealing statement also appears in TG2.

Many observers of basketball believe that the probability of hitting a shot is higher following a hit than following a miss, and this conviction is at the heart of the belief in the “hot hand” (emphasis added).

The belief in autocorrelation is certainly not at the heart of my belief in the hot-hand. My belief is that on those somewhat infrequent occasions when the Bernoulli trials model is inadequate, nonstationarity is a much more common phenomenon than autocorrelation. If Mr. Purvis Short spoke in the language of statisticians, he would certainly say that he believes in nonstationarity rather than autocorrelation. I do acknowledge that for persons who have studied neither probability nor statistics carefully, it is easy to be confused about differences between autocorrelation and nonstationarity. My generosity to such persons, however, is strained when their extremely incomplete data analysis is followed by a statement that anyone who disagrees with them is suffering from a “Cognitive Illusion.”

Lest I be charged with ignoring the researchers’ “survey data,” let me turn to that issue. The researchers asked a convenience sample of 100 “avid basketball fans” to consider a hypothetical player who shoots 50 percent from the field. Each fan was asked two questions about this player.

1. (2.) What is your estimate of his field goal percentage for those shots that he takes after having just made (missed) a shot?

The means of the responses to questions 1 and 2 were 61 percent and 42 percent, respectively. In contrast to Short’s statement, these fans are describing autocorrelation. There are three points, however, that make Short’s description more compelling than the opinion of the fans. First, he is the expert and his opinion should be better informed than those of the fans. Second, the Short quote appears to be a “free response,” whereas the fans’ opinion is largely a result of the bias, perhaps unconscious, in the research method. For example, what would the results have been if the fans had been presented with the following two-part question?

Consider a collection of hypothetical players whose “usual ability” is to have a probability of 50 percent of making an 18 foot jump shot.

1. What percentage of such players occasionally “get hot” and perform at a level above his or her usual ability?

2. Consider a player who is typical of those who occasionally “get hot.” What is your estimate of his or her probability of making an 18 foot jump shot when “hot?”

If the researchers had substituted this two-part question for their questions, they might well have reached the conclusion that, “A belief in nonstationarity is at the heart of the belief in the hot-hand.” Third, the fans’ response is so ludicrous that it does not merit further consideration. The practical difference between 61 percent and 42 percent shooters is huge. It is absurd to believe that the norm is for a player to be constantly fluctuating between two states: a great shooter and a horrible shooter.

Finally, the researchers’ interpretation of results is extremely biased. If the null hypothesis of Bernoulli trials fails to be rejected, they interpret this as lack of evidence for the hot-hand. Fair enough, power considerations aside. But data that reject a null hypothesis are disdained by the researchers also, in three ways:

1. Differences are not important unless they exceed in magnitude the 19 percentage points difference obtained by questioning the fans. (See the discussion of the prediction data below. In particular, the researchers state that the difference between 60 and 40 percent shooters is “small.”)

2. Differences are not important unless they match the omnipresence expressed by the fans. In a criticism of LSK, the researchers write in TG2,

As our survey shows, it is widely believed that the hot hand applies to most people. . . . Because LSK’s entire argument is based on the performance of a single player, we could rest our case right there.

(Note: In addition to revealing a biased approach to data analysis, this quote is an extremely unfair evaluation of LSK. LSK did not conduct hypothesis tests; the goal of the research was descriptive; they wanted to find the streakiest among several players. Also note that their convenience sample of 100 basketball fans is interpreted as showing that “. . . it is widely believed.”)

3. After rejecting the null hypothesis of Bernoulli trials, the researchers decide that the alternative really is not the hot-hand, but a tendency to “try harder.” (See my discussion of the prediction data below.)

In short, the researchers view the fans’ opinion as the gold standard; I see it as a straw man that, despite whatever data might be obtained, allows them to continue to proclaim the folly of all who believe that perhaps on occasion basketball is more complex than Bernoulli trials.

In the remainder of this section I will examine the researchers’ prediction data.

Return to the Short quote. Notice that he is describing how he feels. He does not say, for example,

When I analyze lengthy records of my shooting data I detect patterns. Therefore, I conclude that the null hypothesis of Bernoulli trials should be rejected.

The researchers realize the difference between data analysis and feelings. In GVT they write,

There is another cluster of intuitions about “being hot” that involves predictability rather than sequential dependency. If, on certain occasions, a player can predict a “hit” before taking a shot, he or she may have a justified sense of being “hot” even when the pattern of hits and misses does not stray from chance expectation.

They investigate this notion with the prediction data.

Each player attempted 100 shots from a location at which he or she was believed to be about a 50 percent shooter. Before each shot a player would bet high or low and was advised to bet high if and only if he or she felt confident about the pending attempt. A success on a high bet would earn the shooter five cents, while a miss would cost four cents; for a low bet the values were two and one cents, respectively. The data analysis strategy and conclusions are presented in the following passage, taken from GVT.

If players can predict their hits and misses, their bets should correlate with their performance. . . . These data reveal that the players were generally unsuccessful in predicting hits and misses. . . . Only 5 of the 26 individual correlations were statistically significant, of which four were quite low (0.20 to 0.22) and the 5th was negative (−0.51). The four small but significant positive correlations may reflect either a limited ability to predict the outcome of an upcoming shot, or a tendency to try harder following a high bet.

This is quite a passage! Unlike the shooting data for which one significant result is discounted as being due to chance, the researchers do not acknowledge that obtaining four (or five if you look at both tails) significant results is noteworthy. Instead they label the correlations “quite low” and then “small.” Note, however, that the table below gives a correlation of 0.20.

As argued above, for this application, a difference of 20 percentage points is not small! Finally, having obtained results counter to what they hoped to find, the researchers suggest that the data are due to players trying harder. In other words, if the null hypothesis is not rejected, there is no hot hand; if the null hypothesis is rejected, there is no hot hand. Why bother collecting data?

3 A Simulation Study of Power

In this section I study the performances of seven test statistics: the runs test and test of fit used in GVT and TG1; two tests motivated by data summaries presented in GVT and TG1; and three tests motivated by the work of LSK. The following is an overview of the results.

1. None of the tests possesses much power unless the departure from Bernoulli trials is fairly substantial.

2. Three of the tests—the runs test and the two motivated by GVT and TG1—are good at detecting autocorrelation, but poor at detecting nonstationarity.

3. The three tests motivated by the work of LSK are good at detecting nonstationarity, but poor at detecting autocorrelation.

4. The test of fit is inferior to the other tests at detecting any departure from Bernoulli trials.

I will begin with a description of the tests and the critical regions used.