Equilibrium Data Mining and Data Abundance

We analyze how computing power and data abundance affect speculators' search for predictors. In our model, speculators search for predictors through trials and optimally stop searching when they find a predictor with a signal-to-noise ratio larger than an endogenous threshold. Greater computing power raises this threshold, and therefore price informativeness, by reducing search costs. In contrast, data abundance can reduce this threshold because (i) it intensifies competition among speculators and (ii) it increases the average number of trials to find a predictor. In the former (latter) case, price informativeness increases (decreases) with data abundance. We derive implications of these effects for the distribution of asset managers' skills and trading profits.


Introduction
Progress in information technologies has reduced information processing costs (due to improvements in computing power) and considerably increased the volume and diversity of available data (due to digitization and increases in storage capacity). This evolution is affecting many economic activities, in particular the production of financial information. The active asset management industry (whose size was $16.5 trillion in 2019) is a case in point. Managers of actively managed mutual funds and hedge funds devote considerable effort and money to finding new investment signals (predictors of asset cash-flows or returns). To do so, they increasingly use so-called "alternative data" (e.g., social media, web traffic, credit card and point-of-sale data, geolocation, and satellite imagery) and rely on computer-based methods. In this paper, we study the effects of this evolution on the heterogeneity of asset managers' signals (does progress in information technologies increase or reduce the diversity of predictors?) and the informativeness of asset prices about fundamentals (does progress in information technologies enhance or impair the informational role of financial markets?).
Data abundance and improvements in computing power are related but distinct phenomena. For instance, unstructured data such as satellite images or text from social media expand the set of variables available to obtain predictors of asset returns. However, they do not per se reduce the cost of processing data for obtaining these predictors. Thus, understanding the effects of data abundance requires analyzing the effect of expanding the search space for predictors holding the cost of data processing constant (and vice versa). This is not possible in existing models of financial information acquisition (e.g., Verrecchia (1982)) because they do not explicitly formalize information acquisition as a search problem.
In this paper, we do so and we show that the effects of data abundance and progress in computing power on equilibrium outcomes in financial markets are different.
Our model features a continuum of risk-averse speculators (asset managers). In the first stage (the "exploration stage"), each speculator optimally scours available data to find a predictor of the payoff of a risky asset. In the second stage (the "trading stage"), each speculator observes the realization of her predictor and optimally chooses her trading strategy. We formalize the trading stage as a standard rational expectations model (similar to Vives (1995)). The novelty of our model (and its implications) stems from the exploration stage. Here, instead of following the standard approach (e.g., Grossman and Stiglitz (1980) or Verrecchia (1982)), whereby speculators obtain a predictor of a given precision in exchange for a payment, we explicitly model the search for a predictor as a sequential process and we analyze how the optimal search strategy depends on (i) the cost of exploration and (ii) the amount of data available for exploration (the "search space").
We model the search for predictors as follows. We assume that existing data can be combined to generate predictors differing in their signal-to-noise ratios ("quality"). The search space is determined by the quality of the most informative predictor (the "data frontier"), denoted $\tau_{\max}$, and the least informative predictor, which is just noise. The distribution of the quality of predictors on this interval is exogenous. Given this distribution, each speculator simultaneously and independently explores ("mines") the data. Each new exploration costs c and returns a predictor whose quality is drawn from the distribution of predictors' quality. After obtaining a predictor, a speculator can decide either to explore the data further, to possibly obtain an even better predictor, or to trade on the predictor she just found.
As a motivation for our approach, consider using accounting variables to forecast future stock earnings. There are many ways to combine these variables to obtain predictors. For instance, Yan and Zheng (2017) build more than 18,000 trading signals combining 240 accounting variables, and find that many of these yield significant abnormal returns (even after accounting for the risk of data snooping). The data mining cost, c, represents the labor and computing costs of finding a particular predictor, designing a trading strategy based on this predictor, and backtesting it. After obtaining a predictor, each manager can decide to start trading on it or to keep searching for another, more precise, predictor.
New datasets enable speculators to use new variables to forecast asset payoffs and should therefore push back the data frontier, i.e., increase $\tau_{\max}$.⁴ In fact, asset managers' interest in alternative data is often described as a gold rush for this reason. Indeed, it is expected that the exploitation of these data with new forecasting techniques (machine learning) should enable asset managers to discover more precise predictors.⁵ We refer to this dimension of data abundance as the "hidden gold nugget" effect. However, data abundance also creates a "needle in the haystack problem": It results in a proliferation of datasets and only a fraction of these datasets contains useful information for forecasting asset payoffs. Separating the wheat from the chaff can only be done through explorations, which are costly. To capture this dimension of data abundance, we assume that each exploration returns an informative predictor with probability α < 1. In sum, we analyze the effect of data abundance on equilibrium outcomes by considering either an increase in $\tau_{\max}$ (the hidden gold nugget effect) or a decrease in α (the needle in the haystack problem).⁶ As for greater computing power, it reduces the cost of exploring a new dataset.⁷ Thus, we study the effect of greater computing power by considering the effect of a decrease in the cost of exploration, c, on equilibrium outcomes.
In equilibrium, each speculator's optimal search strategy follows a stopping rule: She stops searching for a predictor after finding one whose quality (signal-to-noise ratio) exceeds an endogenous threshold, denoted $\tau^*$ (we refer to such a predictor as being "satisficing"). This threshold is such that the speculator's expected utility of trading on a predictor of quality $\tau^*$ is just equal to her expected utility of searching for another predictor. The latter reflects the prospect of obtaining a larger expected trading profit by finding a predictor of higher quality deflated by the total expected cost of search to find such a predictor (i.e., the per-exploration cost, c, times the expected number of explorations required to find a predictor with a quality higher than $\tau^*$).
⁴ Recent empirical findings support this conjecture. For instance, Katona et al. (2019) find that combining satellite images of parking lots of U.S. retailers from two distinct data providers improves the accuracy of the forecasts of retailers' quarterly earnings (see also Zhu (2019)). Also, van Binsbergen et al. (2020) find that, with machine learning techniques, one can obtain more precise forecasts of firms' future earnings than analysts' forecasts (they use random forest regressions combining more than 70 accounting variables with analysts' forecasts). Last, Gu et al. (2020) consider 900+ predictors of stock and market returns and find that machine learning techniques (trees and neural networks) considerably increase the out-of-sample $R^2$ of predictive models.
⁵ See, for instance, "Hedge funds see a gold rush in data mining", Financial Times, August 28, 2017.
⁶ As an illustration, consider searching the scientific literature for medication to cure the Coronavirus. There have been more than 23,000 scientific papers written on this topic between January and June 2020 (see da Silva et al. (2020)). As this number grows, the fraction of truly informative papers might drop, even though the chance of a scientific discovery that stops the virus goes up.
⁷ For instance, an increase in computing power reduces the time costs of finding predictors. Brogaard and Zareei (2019) use a genetic algorithm approach to select technical trading rules. They note that "the average time needed to find the optimum trading rules for a diversified portfolio of ten NYSE/AMEX volatility assets for the 40 year sample using a computer with an Intel® Core(TM) CPU i7-2600 and 16 GB RAM is 459.29 days (11,022.97 hours). For one year it takes approximately 11.48 days." They conclude that their analysis would not be possible without the considerable increase in computing power in the last 20 years.
All speculators use the same stopping rule because they are ex-ante identical (same preferences, search cost, etc.). However, as explorations' outcomes are random, speculators find and trade on predictors of different quality. Thus, in equilibrium, (i) only predictors of sufficiently high quality are used for trading and (ii) speculators endogenously exploit predictors of different quality. Specifically, the quality of predictors used in equilibrium ranges from $\tau^*$ (the least informative predictor used in equilibrium) to $\tau_{\max}$ (the most informative).
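To fix ideas, the stopping rule described above can be illustrated with a minimal simulation sketch. It is not the paper's equilibrium computation: the threshold `tau_star` is taken as given rather than solved for, and the quality of an informative predictor is assumed, purely for illustration, to be uniformly distributed on [0, `tau_max`]. All function and parameter names are hypothetical.

```python
import random


def search_once(rng, tau_star, tau_max, alpha):
    """Simulate one speculator's sequential search.

    Returns (quality of the predictor traded on, number of explorations).
    """
    n_explorations = 0
    while True:
        n_explorations += 1
        # With probability 1 - alpha the exploration yields pure noise.
        if rng.random() >= alpha:
            continue
        # ASSUMPTION (illustration only): quality is uniform on [0, tau_max].
        tau = rng.uniform(0.0, tau_max)
        # Stopping rule: trade on the first predictor with quality >= tau_star.
        if tau >= tau_star:
            return tau, n_explorations


def simulate(num_speculators=10_000, tau_star=2.0, tau_max=5.0, alpha=0.5, seed=0):
    """Simulate ex-ante identical speculators who all use the same threshold."""
    rng = random.Random(seed)
    outcomes = [search_once(rng, tau_star, tau_max, alpha)
                for _ in range(num_speculators)]
    qualities = [q for q, _ in outcomes]
    counts = [n for _, n in outcomes]
    return qualities, counts


if __name__ == "__main__":
    qualities, counts = simulate()
    print("lowest quality used :", min(qualities))
    print("highest quality used:", max(qualities))
    print("average explorations:", sum(counts) / len(counts))
```

Under these assumptions, all simulated speculators use the same rule, yet the qualities they end up trading on are spread between the threshold and the data frontier, and the number of explorations differs across speculators.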
Greater computing power induces speculators to adopt a more stringent stopping rule in equilibrium, i.e., a decrease in c raises $\tau^*$. Indeed, a decrease in the per-exploration cost, c, directly reduces the total expected utility cost of launching a new exploration after finding a predictor. Hence, it raises the value of searching for another predictor after finding one and therefore it induces speculators to be more demanding for the quality, $\tau^*$, of the least informative predictor used in equilibrium. An indirect consequence (the "competition effect") is that, on average, speculators trade more aggressively on their signal. Indeed, they face less uncertainty on the asset payoff because their predictors are better on average. As a result, the informativeness of the asset price about the payoff of the asset increases. The competition effect dampens the positive effect of a reduction in the exploration cost on the value of searching for a better predictor. However, it is never strong enough to fully offset it.
The needle in the haystack problem (a drop in α) does not affect the per-exploration cost, c. However, it raises the total expected utility cost of search for speculators because it reduces the chance of finding a satisficing predictor in each exploration. For this reason, it leads speculators to be less demanding for the quality of the least informative predictor (i.e., $\tau^*$ drops).
The effect of pushing back the data frontier (an increase in $\tau_{\max}$) on speculators' optimal search strategy ($\tau^*$) is more subtle because it directly affects the value of searching for another predictor in two opposite directions. On the one hand, it raises this value for two reasons. First, holding investors' stopping rule constant, it enlarges the range of satisficing predictors, which raises the probability that each exploration is successful. This effect reduces the total expected cost of search. Second, holding price informativeness constant, it increases the expected utility of trading on a satisficing predictor due to the prospect of finding even more informative predictors (the "hidden gold nugget effect").
However, an increase in the quality of the best predictor also has a direct positive effect on price informativeness because it raises the average quality of predictors and therefore the average aggressiveness with which speculators exploit their signals. This competition effect reduces the value of searching for predictors. We show that it dominates when $\tau_{\max}$ is high enough. Then, a push back of the data frontier leads speculators to follow a less demanding search policy (i.e., $\tau^*$ drops). Thus, the model implies an inverse U-shape relationship between the quality of the least informative predictor used in equilibrium ($\tau^*$) and the quality of the most informative predictor.
In sum, the model highlights two channels through which data abundance can reduce the quality of the least informative predictor used in equilibrium: (i) It reduces the trading value of predictors by intensifying competition among speculators (the "competition effect") and (ii) it increases the total expected cost of search, even though it does not change the per-exploration cost (the "needle in the haystack effect").
The model has several testable implications. First, it has implications for the distribution of investment skills across funds (or managers of these funds). Several papers (e.g., Kacperczyk and Seru (2007) or Kacperczyk et al. (2014)) relate these skills to the quality (precision) of asset managers' signals and interpret heterogeneity in skills as heterogeneity in the quality of these signals. In our model, speculators' skills are heterogeneous even though speculators are ex-ante identical. This heterogeneity is not due to innate differences in abilities or differences in effort (in the model, speculators who happen to pay a larger search cost and therefore seem to exert more effort do not necessarily trade on predictors of higher quality). Rather, it reflects the fact that, even though all speculators follow the same optimal search strategy to find predictors, the outcome of their search for predictors is random. Shocks to computing power, data abundance and other parameters of the model affect the equilibrium search policy followed by speculators and thereby the distribution of skills between speculators (e.g., the difference in skills between funds in the lowest and top skill deciles). For instance, the model predicts that improvements in computing power should reduce heterogeneity in funds' skills (because they increase $\tau^*$) while data abundance (a push back of the data frontier or the needle in the haystack problem) can have the opposite effect (because it can reduce $\tau^*$). The model also implies that an increase in prior uncertainty (the variance of the asset payoff) or the volume of uninformed (noise) trading should reduce heterogeneity in funds' skills because it induces speculators to be more demanding for the quality of their predictors in equilibrium.
Our second set of predictions is about the informativeness of asset prices for fundamentals. Our model predicts that greater computing power improves price informativeness because it leads speculators to be more demanding for the quality of their predictors.⁸ In contrast, the effect of data abundance on asset price informativeness is more complex. On the one hand, it can lead speculators to be less demanding for the quality, $\tau^*$, of the least informative satisficing predictor. On the other hand, it pushes back the data frontier and improves the quality of the most informative predictor. The first effect reduces the average quality of predictors used by investors while the second improves it. As a result, the effect of data abundance on price informativeness is ambiguous in our model. In the absence of the needle in the haystack problem (α = 1), we show that the second effect dominates and therefore data abundance improves price informativeness. In contrast, if data abundance also makes the needle in the haystack problem more severe (α decreases), then the first effect can dominate so that price informativeness drops when more data become available.⁹
Our third set of predictions regards the effects of computing power and data abundance on speculators' trading profits (excess returns) and the crowdedness of their strategies (measured by the correlation of their holdings). The model predicts an inverse U-shape relationship between speculators' average trading profits and computing power. Indeed, greater computing power raises the average quality of the predictors used in equilibrium and therefore price informativeness. The first effect raises speculators' expected trading profit while the second reduces it. The former dominates if and only if speculators' cost of exploration, c, is small enough. A push back of the data frontier has the same effect for the same reasons. The needle in the haystack problem reduces price informativeness and the average quality of predictors used in equilibrium. The second (first) effect dominates when this problem becomes sufficiently severe (α is low enough). Hence, ultimately, the model also predicts an inverse U-shape relationship between speculators' average trading profits and data abundance. Overall, the model implies that progress in information technologies initially benefits all speculators until a point where it starts reducing their profits. We also show that greater computing power or an improvement in the data frontier reduces the pairwise correlation in speculators' trades while a drop in the proportion of informative datasets (α) has the opposite effect. Finally, in the last part of the paper, we show that data abundance can reduce speculators' expected utility in equilibrium so that speculators would be better off if they could commit not to exploit new datasets.
⁸ In line with this prediction, Gao and Huang (2019) find that the introduction of the EDGAR system in the U.S. (which allows investors to have internet access to electronic filings by firms) had a positive effect on measures of price efficiency. One possible reason, as argued by Gao and Huang (2019), is that the EDGAR system reduced the cost of accessing data (a component of exploration cost) for investors.
⁹ Given that technological progress has both enlarged the search space and reduced search costs, these implications of our model can explain why the empirical literature on the effect of this progress on asset price informativeness reports conflicting results. See Section 5.2 for a discussion.

Related Literature
Our paper contributes to the literature on informed trading in financial markets when information acquisition is endogenous (e.g., Grossman and Stiglitz (1980), Verrecchia (1982); see Veldkamp (2011) for a survey). This literature often takes a reduced-form approach to model the cost of acquiring a signal of given precision. For instance, Verrecchia (1982) (and several subsequent papers) assumes that this cost is a convex function of the precision of the signal. The learning technology in our model is different. The relationship between a speculator's total expected cost of obtaining information and the expected precision of her signal is endogenous and micro-founded by an optimal search model.¹⁰ As explained previously, this approach gives us a way to analyze separately the effects of greater computing power (a decrease in the cost of processing data) and data abundance (an expansion of the search space).
Banerjee and Breon-Drish (2020) consider a model in which one informed investor can dynamically control her timing for information acquisition about the payoff of a risky asset. In this model, the informed investor optimally alternates between periods in which she searches for information (when the volume of noise trading is high enough) and periods in which she does not (when the volume of noise trading is low). When she searches for information, the investor finds a signal of a given precision according to a Poisson process and starts trading on this signal as soon as she finds it. Interestingly, Banerjee and Breon-Drish (2020) show that this dynamic model generates predictions different from the standard static model in which the informed investor must decide to acquire a signal before trading. In contrast, we depart from the standard static model by modeling informed investors' search for signals of different precisions (in a static environment because there is no time-variation in parameters affecting the profitability of informed trading over exploration rounds in our model) and we compare the effects (e.g., on the heterogeneity in signals' precisions) of a reduction in search costs with the effects of expanding the search space (data abundance).
¹⁰ Han and Sangiorgi (2018) offer an interesting micro-foundation for the specification of information acquisition costs based on a model in which an agent can draw normally distributed signals from a fixed set (an "urn"), with replacement (so that the agent can draw the same signal multiple times). Each draw is costly in their model. They show that the relationship between the precision of the average signal obtained by the agent (a sufficient statistic for all her signals) and her total investment in drawing signals is convex and becomes linear when the number of possible signals goes to infinity. Han and Sangiorgi (2018) use this specification to analyze an optimal forecasting problem. Our approach differs in many respects. In particular, we jointly solve for the equilibrium of the market for a risky asset and speculators' optimal search for predictors (in Han and Sangiorgi (2018), the number of draws by an agent is exogenous and they do not apply their model to trading in financial markets).
Our paper is also related to the recent literature analyzing the economic effects of progress in information technologies (see Goldfarb and Tucker (2019) and Veldkamp and Chung (2020) for reviews) and more specifically theoretical papers analyzing the effects of these technologies for the production of financial information (e.g., Abis (2020)). Theories in these papers explore ramifications of the idea that progress in information technologies reduces the cost of processing information or relaxes investors' attention constraints. In contrast, our model focuses on another dimension of this progress, namely data abundance, i.e., the expansion of investors' search space for predictors. We show that the effects of data abundance and the cost of processing data (c in our model) are different and derive several implications that should allow empiricists to test whether these differences matter empirically. Also, we explicitly analyze the acquisition of financial information as a search problem and consider the effects of reducing the cost of search

Model
We consider a financial market with a unit mass continuum of risk averse (CARA) speculators, a risk neutral and competitive market maker, and noise traders. Investors can invest in a risky asset and a risk free asset with interest rate normalized to zero. Speculators have no initial endowments in the risky and the riskless assets. Figure 1 describes the timing of the model.

Period 0
Exploration : Each speculator searches for a predictor of the asset payoff.

Period 1
Trading : Each speculator observes the realization of her predictor ($s_\theta$) and chooses a trading strategy, $x(s_\theta, p)$.
Speculators, noise traders and dealers trade.
Market clears : The asset price is realized.

Figure 1: Timing
The payoff of the risky asset, ω, is realized in period 2 and is normally distributed with mean zero and variance $\sigma^2$. Speculators search for predictors of the asset payoff in period 0 (the "exploration stage"). Then, in period 1 (the "trading stage"), they observe the realization of these predictors and can trade on them in the market for the risky asset. We now describe these two stages in detail.
The exploration stage. In period 0, each speculator i searches for a predictor of the asset payoff, ω. There is a continuum of potential predictors. Each predictor, $s_\theta$, is characterized by its type θ and is such that:
$$s_\theta = \cos(\theta)\,\omega + \sin(\theta)\,\varepsilon_\theta,$$
where $\theta \in [0, \pi/2]$ and the $\varepsilon_\theta$s are normally and independently distributed with mean zero and variance $\sigma^2$. Moreover, $\varepsilon_\theta$ is independent from ω. Let $\tau(\theta) \equiv \cos^2(\theta)/\sin^2(\theta) = \cot^2(\theta)$ denote the signal-to-noise ratio for a predictor with type θ. We refer to this ratio as the "quality" of a predictor.¹¹ The quality of a predictor decreases with its type, θ, and varies from zero (when $\theta = \pi/2$) to infinity (when θ goes to zero). It is unrelated to the uncertainty about the asset payoff, $\sigma^2$, because $\mathrm{Var}[\varepsilon_\theta] = \mathrm{Var}[\omega] = \sigma^2$. Without this assumption, the quality of all predictors would, counter-intuitively, increase with uncertainty.
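The link between a predictor's type, its quality, and its forecasting power can be checked numerically. The sketch below assumes the predictor takes the form $s_\theta = \cos(\theta)\omega + \sin(\theta)\varepsilon_\theta$ (the form consistent with the signal-to-noise ratio $\cot^2(\theta)$ stated above), in which case the population $R^2$ of a regression of ω on $s_\theta$ is $\cos^2(\theta) = \tau/(1+\tau)$. The function names and the Monte Carlo settings are hypothetical choices made for illustration.

```python
import math
import random


def quality(theta):
    """Signal-to-noise ratio tau(theta) = cot^2(theta), for 0 < theta <= pi/2."""
    return (math.cos(theta) / math.sin(theta)) ** 2


def theoretical_r2(theta):
    """Population R^2 of regressing omega on s_theta: cos^2(theta)."""
    return math.cos(theta) ** 2


def simulated_r2(theta, sigma=1.0, n_draws=100_000, seed=0):
    """Squared sample correlation between omega and s_theta (Monte Carlo)."""
    rng = random.Random(seed)
    omegas, signals = [], []
    for _ in range(n_draws):
        omega = rng.gauss(0.0, sigma)  # asset payoff
        eps = rng.gauss(0.0, sigma)    # noise with the same variance as omega
        omegas.append(omega)
        # ASSUMED functional form of the predictor (see lead-in).
        signals.append(math.cos(theta) * omega + math.sin(theta) * eps)
    mean_w = sum(omegas) / n_draws
    mean_s = sum(signals) / n_draws
    cov = sum((w - mean_w) * (s - mean_s) for w, s in zip(omegas, signals))
    var_w = sum((w - mean_w) ** 2 for w in omegas)
    var_s = sum((s - mean_s) ** 2 for s in signals)
    return cov * cov / (var_w * var_s)


if __name__ == "__main__":
    theta = math.pi / 3
    print("quality tau      :", quality(theta))
    print("theoretical R^2  :", theoretical_r2(theta))
    print("simulated R^2    :", simulated_r2(theta))
    print("simulated R^2 (sigma=3):", simulated_r2(theta, sigma=3.0))
```

Running the simulation with different values of `sigma` illustrates the point made in the text: because the noise and the payoff have the same variance, the quality of a predictor does not depend on the level of uncertainty.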
We assume that predictors' types, θs, are distributed according to the cumulative probability distribution $\Phi(\cdot)$ (density $\varphi(\cdot)$) on $[0, \pi/2]$. Speculators discover predictors' types in period 0 via a sequential search process. Each search round corresponds to a new exploration ("mining") of available data to obtain a new type of predictor. Each exploration costs c. It is unsuccessful, i.e., yields no predictor (or equivalently a predictor that is just noise), with probability $1 - \alpha \Pr(\theta \in [\underline{\theta}, \pi/2])$, where $0 < \alpha \le 1$ and $\underline{\theta}$ is the type of the most informative predictor available (the data frontier). Otherwise the exploration is successful and returns a predictor of type $\theta \in [\underline{\theta}, \pi/2]$, with types distributed according to the density $\varphi(\cdot)$ restricted to this interval. After each exploration, a speculator can decide (i) to stop searching and trade in period 1 on the predictor she just found or (ii) to start a new exploration in the hope of finding an even better predictor. We assume that there is no limit on the number of explorations.
It is worth stressing that speculators observe the realization of their chosen predictor, $s_\theta$, in period 1, not in period 0. In period 0, they just choose the type (quality) of the predictor whose realization they will observe at date 1. A predictor can be viewed as a particular combination of variables from various datasets (e.g., past earnings, satellite images and consumer transactions data) that forecast the payoff of the asset. One exploration consists in testing the predictive power of a particular combination with prediction tools (e.g., linear regressions or machine learning techniques). For instance, one can interpret each exploration as collecting various variables and running a regression of the asset payoff (e.g., stock earnings) on these variables. The estimates of the coefficients of this regression can then be used to compute the predicted value of the regression, $s_\theta$, at date 1 after observing the realization of the variables used in this regression at this date.¹² Thus, a predictor does not need to be interpreted as a single variable. It can be viewed as a combination of variables whose weights have been optimally chosen to minimize the predictor's forecasting error in-sample. In this interpretation, speculators can try to improve the quality of their predictors by trying new combinations (e.g., by buying datasets with new variables).
¹¹ Observe that the predictor $s_\theta$ is equivalent (in terms of informativeness) to the predictor $\hat{s}_\theta = \omega + \cot(\theta)^{-1}\varepsilon_\theta$, whose precision is $\tau(\theta)/\sigma^2$. Thus, a predictor of high quality is a predictor with high precision.
As more datasets become available ("data abundance"), the number of possible combinations of variables that one can use to predict asset payoffs increases. This evolution has two consequences controlled by parameters $\underline{\theta}$ and α in the model. First, it pushes back the "data frontier", i.e., it increases the chance (at least weakly) of finding even more informative predictors than those existing before. We refer to this dimension of data abundance as the "hidden gold nugget effect." For instance, by combining satellite images of retailers' parking lots and point of sale data with more traditional accounting data, one might be able to find more informative predictors of future earnings for these retailers than using accounting data alone. This dimension of data abundance is controlled by $\underline{\theta}$ in our model: When $\underline{\theta}$ decreases, the quality of the best predictor (the "hidden gold nugget"), denoted $\tau_{\max} \equiv \tau(\underline{\theta})$, improves.
Second, the share of combinations that yield informative predictors might fall as the number of all possible combinations explodes. For instance, there are myriad ways in which one could combine traffic data in large cities with other data to predict economic growth. However, only a few are likely to be informative and discovering these combinations takes time. We refer to this dimension of data abundance as the "needle in the haystack problem."¹³ It is controlled by α in our model: As α decreases, each round of exploration is less likely to be successful, as if the share of informative predictors were falling.¹⁴
Finally, parameter c represents the cost of exploring a specific dataset to identify a predictor. Greater computing power reduces this cost. For instance, with more powerful computers, one can explore more datasets in a fixed amount of time, so the time cost of data mining is smaller. Thus, we analyze the effect of progress in computing power by considering the effect of a decrease in c on the equilibrium.¹⁵
We focus on equilibria in which each speculator follows an optimal stopping rule $\theta^*_i$. That is, speculator i stops searching for new predictors once she finds a predictor with type $\underline{\theta} \le \theta < \theta^*_i$ (a predictor of sufficiently high quality in the feasible range). We denote by $\Lambda(\theta^*_i; \underline{\theta}, \alpha)$ the likelihood of this event (the probability of success) for speculator i in a given search round. That is:
$$\Lambda(\theta^*_i; \underline{\theta}, \alpha) \equiv \alpha \Pr(\theta \in [\underline{\theta}, \theta^*_i]) = \alpha\,\big(\Phi(\theta^*_i) - \Phi(\underline{\theta})\big).$$
Thus, a decrease in $\underline{\theta}$ raises the likelihood of finding a predictor in a given exploration, holding α constant. This effect captures the idea that while data abundance might reduce the fraction of informative datasets, it increases the chance of finding a good predictor once one has identified an informative dataset.
As the outcome of each exploration is random, the realized number of explorations varies across speculators (even if they use the same stopping rule). We denote by $n_i$ the realized number of search rounds for speculator i. This number follows a geometric distribution with parameter $\Lambda(\theta^*_i; \underline{\theta}, \alpha)$. Thus, the expected number of explorations for a given speculator (a measure of her search intensity) is:
$$E[n_i] = \Lambda(\theta^*_i; \underline{\theta}, \alpha)^{-1}.$$
¹² In this approach, the $R^2$ of the regression is a measure of the quality of the predictor. Indeed, the theoretical $R^2$ of a regression of ω on $s_\theta$ (i.e., $1 - \mathrm{Var}[\omega \mid s_\theta]/\mathrm{Var}[\omega]$) is equal to $\cos^2(\theta)$. Thus, the higher the quality of a predictor, the higher the $R^2$ of a regression of the asset payoff on the predictor. In other words, searching for predictors of high quality in the model is the same thing as searching for predictors with high $R^2$s. Note that, as usual in rational expectations models, we assume that there is no uncertainty on θ, i.e., on the true predictive model relating the payoff of the asset to the predictor. In reality, investors might be uncertain about the true $R^2$ of a predictive model (e.g., because of too few past observations for past cash-flows relative to the number of variables used to forecast these cash-flows) and learn it over time (see Martin and Nagel (2020)). In our model, this means that speculators would learn about the true θ of a predictor (e.g., after observing an estimate of θ). We leave this extension for the future.
¹³ Agrawal et al. (2019) discuss a related problem for the generation of new scientific ideas. Specifically, as the space of possible combinations of existing ideas to create new ones enlarges, it becomes more difficult to identify new useful combinations. One can think of the search for predictors at date 0 as a search for new "ideas" to forecast asset payoffs. Each new idea is characterized by its forecasting power.
¹⁴ See for instance "The quant fund investing in humans not algorithms" (AlphaVille, Financial Times, December 6, 2017), reporting discussions with a manager from TwoSigma noting that: "Data are noise. Drawing a tradable signal from that noise, meanwhile, takes work, since the signal is continuously evolving [...] Crucially, Duncombe added, there's qualitative data decay going on too. Back in the day, star managers may have had access to far smaller data sets, but the data in hand was of much higher quality."
¹⁵ We assume that the data frontier, $\underline{\theta}$, and the cost of exploration, c, are identical for all speculators. Thus, speculators are ex-ante identical and heterogeneity in their performance is endogenous. In a more complex model, these parameters may also differ across speculators (e.g., $\underline{\theta}$ may be lower for institutions who accumulated more data over time). We leave the analysis of this case for future work.
To simplify the exposition, we assume that speculators cannot "store" predictors that they turn down (i.e., the search for predictors is without recall). We show in Section 1 of the online appendix that this assumption is innocuous.
A last observation is in order. In our model, launching a new exploration does not guarantee that one will necessarily obtain a better predictor than in previous explorations.
At first glance, this may look counterintuitive because one might think that as speculators observe more predictors, they should be able to obtain an increasingly precise signal about the asset payoff (e.g., by just taking the average of all signals). However, at date 0, speculators discover in each exploration the type of a particular predictor, not its future realization (signals are observed only at date 1). And, as previously explained, we see an exploration as experimenting with a new combination of variables (a new "investment idea") to build a predictor of the asset payoff. As this combination is new, it does not necessarily have a higher forecasting power than previous combinations.
The trading stage. Trading begins after all speculators find a predictor with satisficing quality. At the beginning of period 1, each speculator observes the realization of her predictor, $s_\theta$, and chooses a trading strategy, i.e., a demand schedule, $x_i(s_\theta, p)$, where p is the asset price in period 1. As in Vives (1995), speculators trade with noise traders and risk-neutral market makers. Noise traders' aggregate demand is price-inelastic and equal to η, where $\eta \sim N(0, \nu^2)$ (η is independent of ω and of the errors in speculators' signals). Market makers observe investors' aggregate demand, $D(p) = \int x_i(s_\theta, p)\,di + \eta$, and behave competitively. The equilibrium price, $p^*$, is equal to their expectation of the asset payoff conditional on aggregate demand from noise traders and speculators:

$$p^* = \mathrm{E}[\omega \mid D(p)].$$
Speculators' objective function. At t = 2, the asset pays off and speculator i's final wealth is $W_i = x_i(s_\theta, p)(\omega - p) - n_i c$. The number of explorations for speculator i, $n_i$, is independent from the asset payoff, its price, and the realization of the speculator's predictor, $s_\theta$, because $n_i$ is determined in period 0, before the realizations of these variables. Thus, the ex-ante expected utility of a speculator can be written:

$$\mathrm{E}[-\exp(-\rho W_i)] = \underbrace{\mathrm{E}\left[-\exp\left(-\rho\, x_i(s_\theta, p)(\omega - p)\right)\right]}_{\text{Expected Utility from Trading}} \times \underbrace{\mathrm{E}\left[\exp(\rho c\, n_i)\right]}_{\text{Expected Utility Cost of Exploration}} \qquad (6)$$

The first term in this expression represents the ex-ante expected utility that a speculator derives from trading gross of her total exploration cost while the second term represents the expected utility of the total cost paid to find a predictor (we call it the expected utility cost of exploration). The expected utility from trading depends both on the investor's optimal trading strategy ($x_i(s_{\theta,i}, p)$) and her optimal stopping rule ($\theta_i^*$) because this rule determines the distribution of $s_\theta$. The expected utility cost of exploration depends on the speculator's stopping rule, $\theta_i^*$, because it determines the distribution of $n_i$. In the existing literature (e.g., Grossman and Stiglitz (1980)), $n_i = 1$ (each investor pays a cost and gets one signal of known quality). In our model, $n_i$ is random and its distribution is controlled by the speculator via her stopping rule.
Each speculator chooses her stopping rule, θ * i , and her trading strategy, x i (s θ,i , p), to maximize her ex-ante expected utility.
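As a worked illustration of the expected utility cost of exploration, the following sketch assumes that this cost takes the form $\mathrm{E}[\exp(\rho c\, n_i)]$ with $n_i$ geometrically distributed on $\{1, 2, \dots\}$ with success probability Λ. The closed form is our own computation under these assumptions, not an equation quoted from the paper.

```latex
\begin{align*}
\mathrm{E}\left[e^{\rho c\, n_i}\right]
  &= \sum_{k=1}^{\infty} \Lambda (1-\Lambda)^{k-1} e^{\rho c k}
   = \Lambda e^{\rho c} \sum_{k=1}^{\infty} \left[(1-\Lambda) e^{\rho c}\right]^{k-1} \\
  &= \frac{\Lambda e^{\rho c}}{1 - (1-\Lambda) e^{\rho c}},
  \qquad \text{provided } (1-\Lambda) e^{\rho c} < 1 .
\end{align*}
```

Under these assumptions the expression equals 1 when c = 0, increases with c, and decreases with Λ, which is in line with the idea that a less stringent stopping rule (a larger Λ) lowers the expected utility cost of search.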

Equilibrium
We focus on symmetric equilibria in which all speculators choose the same stopping rule, θ * . We solve for such an equilibrium as follows. First, we solve for the equilibrium of the trading stage in period 1 taking θ * as given and we deduce the ex-ante expected utility achieved by speculator i when she chooses a predictor of type θ in period 0. We then observe that a speculator should stop searching when she finds a predictor such that the expected utility of trading on this predictor is larger than or equal to the expected utility she can obtain by launching a new exploration. The optimal stopping rule of each investor, θ * i (θ * ), is such that this condition holds as an equality (so that the speculator is just indifferent between searching more or stopping). Finally, we pin down θ * by observing that, in a symmetric equilibrium, each speculator's best response to other speculators' stopping rule, θ * , must be identical, i.e., θ * i (θ * ) = θ * .
Equilibrium of the asset market in period 1. The outcome of the exploration phase is characterized by the distribution of the predictors' types found by speculators.
Let $\phi^*(\theta; \theta^*, \underline{\theta}, \alpha)$ be this distribution given that speculators follow the stopping rule $\theta^*$: This distribution characterizes the heterogeneity of speculators' predictors in equilibrium. We denote the average quality of predictors across all speculators in period 1 by $\bar{\tau}(\theta^*; \underline{\theta}, \alpha)$ and we make the following assumption on the distribution φ(·): exists.
This technical condition guarantees that the equilibrium remains well defined even when θ = 0. 16 Proposition 1 provides the equilibrium of the asset market in period 1.
Proposition 1. In period 1, the equilibrium trading strategy of a speculator with type θ is: where $\hat{s}_\theta = \omega + \tau(\theta)^{-1/2}\varepsilon_\theta$ and the equilibrium price of the asset is: This result extends Proposition 1.1 in Vives (1995) to the case in which speculators have signals of heterogeneous precisions (determined by their θ in our model). The predictor $s_\theta$ is informationally equivalent to the predictor $\hat{s}_\theta = \omega + \tau(\theta)^{-1/2}\varepsilon_\theta$. A speculator's optimal position in the asset is equal to the difference between $\hat{s}_\theta$ and the price of the asset (her expected dollar return) scaled by a factor that increases with the quality of the predictor and decreases with the speculator's risk aversion. The scaling factor measures the speculator's aggressiveness in trading on her predictor. Speculators with predictors of higher quality trade more aggressively on their signal because they face less risk (their forecast of the asset payoff is more precise).
The total demand for the asset (D(p)) aggregates speculators' orders and therefore reflects their information. Observing this demand is informationally equivalent to observing the signal ξ, whose informativeness increases with the average quality of speculators' predictors, $\bar{\tau}(\theta^*; \underline{\theta}, \alpha)$. Thus, the market maker can form a more precise forecast of the asset payoff, and the asset price is therefore more informative about this payoff, when the average quality of speculators' predictors, $\bar{\tau}(\theta^*; \underline{\theta}, \alpha)$, is higher. Formally, let us measure the informativeness of the asset price by $I(\theta^*; \underline{\theta}, \alpha) = \mathrm{Var}[\omega \mid p^*]^{-1}$ as in Grossman and Stiglitz (1980). Using Proposition 1, we obtain: where $\tau_\omega = 1/\sigma^2$ is the precision of speculators' prior about the asset payoff. As expected, the asset price is more informative when the average quality of speculators' predictors increases. Thus, the informativeness of the asset price is inversely related to $\theta^*$ because $\bar{\tau}(\theta^*; \underline{\theta}, \alpha)$ decreases with $\theta^*$. Hence, other things equal, price informativeness is smaller when speculators choose a less stringent stopping rule for the quality of the predictors on which they trade.
Equilibrium of the exploration phase. Using the characterization of the equilibrium of the asset market, we compute a speculator's expected utility from trading ex-ante, i.e., before observing the realization of her predictor and the equilibrium price, when her predictor has type θ and other speculators follow the stopping rule θ * . We denote this ex-ante expected utility by g(θ, θ * ) and refer to it as the trading value of a predictor with type θ. Formally: Lemma 1. In equilibrium, the trading value of a predictor with type θ is: The trading value of a predictor increases with its quality and decreases with the informativeness of the asset price. 17 Thus, it is inversely related to the average quality of predictors used by speculators. Hence, the value of a given predictor for a speculator depends on the search strategy followed by other speculators: It is smaller if other speculators are more demanding for the quality of their predictors (i.e., when θ * decreases).
Armed with Lemma 1, we can now derive a speculator's optimal stopping rule given that other speculators follow the stopping rule $\theta^*$. Let $\theta_i$ be an arbitrary stopping rule for speculator i. The speculator's continuation utility (the expected utility of launching a new round of exploration) after turning down a predictor is: The first term (exp(ρc)) in eq. (14) is the expected utility cost of running an additional search. The second term is the likelihood that the next exploration is successful times the average trading value of a predictor conditional on the type of this predictor being satisficing (i.e., in $[\underline{\theta}, \theta_i]$). Finally, the third term is the likelihood that the next exploration is unsuccessful times the speculator's continuation utility when she turns down a predictor. Solving eq.(14) for $J(\theta_i, \theta^*)$, we obtain: The continuation value of the speculator when she turns down a predictor does not depend on the outcomes of past explorations because these outcomes do not affect the speculator's opportunity set in future explorations. Thus, $J(\theta_i, \theta^*)$ is also the speculator's ex-ante expected utility before starting any exploration in period 0. As explained previously, it is the product of the expected utility cost from explorations and the expected utility from trading.

Now suppose that speculator i has obtained a predictor with quality θ. If the speculator stops exploring the data at this stage, her expected utility is $g(\theta, \theta^*)$ (her cost of exploration to obtain this predictor is sunk). If instead the speculator decides to launch a new round of exploration, her expected utility is $J(\theta_i, \theta^*)$. Thus, her optimal decision is to stop searching for a predictor if $g(\theta, \theta^*) \ge J(\theta_i, \theta^*)$ and to keep searching otherwise.

17 [...] p] because $\mathrm{E}[\omega \mid s_\theta, p^*] - p^* = 0$. Thus, eq.(13) implies that $\frac{\tau(\theta)\tau_\omega}{I(\theta^*; \underline{\theta}, \alpha)} = \mathrm{E}\left[\left(\frac{\mathrm{E}(R_\theta \mid s_\theta)}{\sigma_{R_\theta \mid s_\theta}}\right)^2\right]$, where $R_\theta = \omega/p^* - 1$ is the excess return of a speculator with type θ (the riskless rate of return is normalized to zero) and $\sigma_{R_\theta \mid s_\theta}$ is the standard deviation of this return conditional on the observation of $s_\theta$. In other words, $\frac{\tau(\theta)\tau_\omega}{I(\theta^*; \underline{\theta}, \alpha)}$ is the equilibrium value of the expected squared Sharpe ratio of a speculator trading on a predictor with type θ.
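The verbal description of the continuation utility suggests a recursion of the form J = exp(ρc) × [Λ × (average trading value) + (1 − Λ) × J]. The sketch below solves this recursion in closed form and checks it by fixed-point iteration. The recursion is our reading of the text rather than a quoted equation, all numerical inputs are hypothetical, and the trading value is taken to be negative, as it would be under exponential utility.

```python
import math


def continuation_closed_form(success_prob, avg_trading_value, rho, cost):
    """Solve J = exp(rho*cost) * (Lambda * g_bar + (1 - Lambda) * J) for J."""
    growth = math.exp(rho * cost)
    denom = 1.0 - (1.0 - success_prob) * growth
    if denom <= 0.0:
        raise ValueError("search cost too large: no finite solution")
    return success_prob * growth * avg_trading_value / denom


def continuation_by_iteration(success_prob, avg_trading_value, rho, cost,
                              rounds=2000):
    """Iterate the same recursion from an arbitrary starting point."""
    growth = math.exp(rho * cost)
    j = avg_trading_value
    for _ in range(rounds):
        j = growth * (success_prob * avg_trading_value
                      + (1.0 - success_prob) * j)
    return j


if __name__ == "__main__":
    # Hypothetical inputs: Lambda = 0.3, g_bar = -0.8, rho = 1, c = 0.05.
    print(continuation_closed_form(0.3, -0.8, 1.0, 0.05))
    print(continuation_by_iteration(0.3, -0.8, 1.0, 0.05))
```

In this sketch, a lower cost or a higher success probability moves J closer to the average trading value, which makes continued search more attractive relative to stopping.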
As $g(\theta, \theta^*)$ decreases with θ, the optimal stopping rule of the speculator, $\theta_i^*(\theta^*)$, is the value of θ such that the speculator is just indifferent between these two options: $g(\theta_i^*(\theta^*), \theta^*) = J(\theta_i^*(\theta^*), \theta^*)$. In a symmetric equilibrium, it must be that $\theta_i^*(\theta^*) = \theta^*$. We deduce that $\theta^*$ solves: $g(\theta^*, \theta^*) = J(\theta^*, \theta^*)$. Using the expression for $J(\cdot, \theta^*)$ in eq. (14), we can equivalently rewrite this equilibrium condition as: where: with where the second equality in eq.(20) follows from eq.(13). Assumption A.1 guarantees that $F(\theta^*)$ is well defined even when $\underline{\theta} = 0$. The next proposition shows that there is a unique interior solution (i.e., $\theta^* \in (\underline{\theta}, \pi/2)$) to the equilibrium condition (18) when c is small enough.
Proposition 2. There is a unique symmetric interior equilibrium of the exploration phase in which all speculators are active (i.e., a unique stopping rule such that θ < θ * < π/2 common to all speculators) if and only if F (π/2) < exp(−ρc) < 1.
When exp(−ρc) ≤ F(π/2) (i.e., c large enough), there is no symmetric interior equilibrium. However, in this case, one can build an equilibrium in which only a fraction of all speculators are active, i.e., search for a predictor and trade (if c is not too large). In this equilibrium, active speculators search for a predictor with a stopping rule equal to θ* = π/2 while others remain completely inactive (do not search and do not trade). Moreover, the fraction of speculators who are active is such that all speculators are indifferent between being active or not. Henceforth, we focus on the case in which the equilibrium is interior (i.e., F(π/2) < exp(−ρc) < 1) because (i) we are interested in what happens when the cost of exploration becomes small and (ii) this shortens the exposition.
4.2 Data abundance, computing power and optimal data mining.
We now analyze how data abundance (a decrease in $\underline{\theta}$ and/or α) and computing power (a decrease in c) affect the quality of the worst predictor on which speculators trade in equilibrium, i.e., τ(θ*). Indeed, the quality of this predictor determines the range of predictors used in equilibrium and ultimately several equilibrium outcomes of interest (see next section).
Proposition 3. A decrease in the cost of exploration, c, always reduces the stopping rule θ * used by speculators in equilibrium (∂θ * /∂c > 0). Thus, greater computing power raises the quality, τ (θ * ), of the worst predictor used by speculators in equilibrium.
The economic mechanism for this finding is as follows. Holding θ * constant, a decrease in the per-exploration cost, c, directly reduces the expected utility cost of launching a new exploration after finding a predictor (the first term in bracket in eq.(15)). Hence, it raises the value of searching for another predictor after finding one (i.e., J(θ * , θ * )). This direct effect induces speculators to be more demanding for the quality of their predictor and therefore works to decrease θ * . One indirect consequence of this behavior is that, on average, speculators trade more aggressively on their signal (the "competition effect") because they face less uncertainty on the asset payoff (their predictors are better on average). As a result, price informativeness increases. This indirect effect reduces the expected utility from trading on a satisficing predictor (the second term in bracket in eq.(15)) and therefore dampens the direct positive effect of a decrease in c on the value of searching for a better predictor after finding one. However, it is never strong enough to fully offset it.
We now consider the effect of data abundance on speculators' optimal stopping rule.
Remember that data abundance has two consequences in the model: (i) it pushes back the data frontier by raising the quality of the best predictor and (ii) it increases the risk for speculators of using datasets which, after exploration, prove to be useless (the needle in the haystack problem).
2. The effect of a decrease in θ on speculators' stopping rule is ambiguous. However, when θ is less than θ tr (c), a decrease in θ always increases speculators' stopping rule in equilibrium (∂θ * /∂θ < 0 for θ < θ tr (c)) and reduces the quality, τ (θ * ), of the worst predictor used by speculators in equilibrium.
When the needle in the haystack problem becomes more acute, speculators become less demanding for the quality of their predictors. Intuitively, a drop in α increases the expected utility cost of launching a new exploration after finding a predictor (the first term in brackets in eq.(15)) because it reduces the likelihood of finding a predictor in a given exploration (Λ). Thus, after turning down a predictor, speculators expect to go through a larger number of exploration rounds before finding a satisficing predictor, which increases their total cost of search. This direct effect induces speculators to be less demanding for the quality of their predictor and therefore works to increase θ* (reduce τ(θ*)). Indirectly, this behavior reduces asset price informativeness and therefore raises the expected utility from trading on a satisficing predictor (the second term in brackets in eq.(15)), which alleviates the direct negative effect of a decrease in α on the value of searching for a better predictor after finding one. However, this indirect effect is never strong enough to fully offset the direct effect. In sum, qualitatively, the effect of a drop in α is similar to that of an increase in the per-exploration cost. 18

18 Given this, one might be tempted to capture the needle in the haystack effect by just considering the effect of increasing c (on the ground that it becomes more costly to find good datasets). But this approach is inconsistent with the argument that progress in information technology has reduced information processing costs. This point illustrates the importance of having separate parameters to capture the effects of (i) greater information processing power (a decrease in c in our model) on the one hand and (ii) data abundance on the other hand.
The effect of pushing back the data frontier on speculators' stopping rule is more complex. Counterintuitively, it can lead speculators to trade on predictors of worse quality, even though the quality of the best predictor increases. The reason is as follows. On the one hand, pushing back the data frontier increases the chance of finding a satisficing predictor holding the search strategy, θ * constant (Λ(θ * ; θ, α) increases when θ goes down).
This effect reduces the expected number of rounds required to find a predictor and therefore reduces the expected utility cost of searching for a new predictor after rejecting one. Therefore, it increases the continuation value of searching for a predictor (see eq. (15)).
On the other hand, a push back of the data frontier affects the expected utility from trading for two reasons. First, it gives the possibility to obtain more informative predictors than those existing before ("the hidden gold nugget effect"), which raises the expected utility from trading on a satisficing predictor. Second, it increases price informativeness (other things equal, I(θ * ; θ, α) increases when θ decreases) because speculators who obtain the most informative predictors trade even more aggressively than before the change in the data frontier. As a result, speculators' aggregate demand and therefore the asset price are more informative, which reduces the value of being informed ("the competition effect"). This effect reduces the expected utility from trading on a satisficing predictor.
Thus, the sign of a change in the data frontier (holding θ * constant) on the expected utility from trading is ambiguous.
To analyze this more formally, we differentiate the expected utility from trading, When $\underline{\theta}$ becomes small enough, the competition effect dominates the hidden gold nugget effect and the expected utility from trading on a satisficing predictor drops. The second part of Proposition 4 shows that there is always a sufficiently low value of $\underline{\theta}$ such that this drop more than offsets the reduction in the expected utility cost of finding a predictor.
When this happens, pushing back the data frontier further reduces the continuation value of exploration. Hence, speculators choose a less stringent stopping rule in equilibrium and some optimally choose to trade on less informative predictors (τ (θ * ) decreases).
Proposition 5. The quality of the worst predictor used in equilibrium, τ (θ * ), increases with the volume of noise trading, ν 2 , or the volatility of the asset payoff, σ 2 .
An increase in the volume of noise trading or the volatility of the asset reduces the informativeness of the equilibrium price. This effect raises the expected value of trading, holding the search policy, θ*, constant. Thus, the continuation value from searching increases and speculators therefore become more demanding for their predictors (θ* decreases).

Data Abundance, Computing Power, and Managerial Skills
As explained in the previous section, the model has implications for the effects of data abundance and computing power on the distribution of the quality of predictors used by speculators in equilibrium, in particular the lower bound of this distribution, τ(θ*). To test these implications, one can use data on active funds' holdings and their returns on these holdings and regress the position of each fund (speculator) in a given asset ($x_i(s_\theta, p^*)$ in the model) at a given point in time on their return on this position ($(\omega - p^*)$ in the model). In the model, the coefficient of this regression, $\beta_\theta$, is: where the last equality follows from Proposition 1. Intuitively, $\beta_\theta$ is a measure of a speculator's stock picking ability or investment "skills". 20 Equation (22) [...] positive shocks to computing power increase the stock picking ability (measured by β) of the funds with the lowest βs (say in the lowest decile) while positive shocks to data abundance (e.g., the availability of new alternative data as in Zhu (2019) or Dessaint et al. (2021)) have the opposite effect (even though they may increase the stock picking ability of the best performing funds).

20 [...] Grinblatt and Titman (1993) and Daniel et al. (1997).

21 Alternatively, one could proceed as in Kacperczyk and Seru (2007) to measure asset managers' investment skills and rank these. Specifically, Kacperczyk and Seru (2007) measure the precision of asset managers' signals (their "skill") by the sensitivity of their holdings to public information. The higher this sensitivity, the lower the precision of a manager's private signals. This would also be the case in a simple extension of our model in which speculators receive a public signal at date 1 in addition to their private signal $s_\theta$.
One could also test whether the difference between the stock picking ability of speculators with the lowest and highest ability is reduced in periods of heightened fundamental volatility or noise trading, as implied by Proposition 5.
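The regression of fund positions on subsequent returns described in this section can be sketched as follows. This is a minimal illustration on synthetic data: `simulate_fund` and its `skill` parameter are hypothetical devices for generating data with a known slope, not part of the model or of any actual holdings dataset.

```python
import random


def ols_slope(positions, returns):
    """Slope of a univariate OLS regression of positions on returns."""
    n = len(returns)
    mean_r = sum(returns) / n
    mean_x = sum(positions) / n
    cov = sum((r - mean_r) * (x - mean_x)
              for r, x in zip(returns, positions))
    var = sum((r - mean_r) ** 2 for r in returns)
    return cov / var


def simulate_fund(skill, n_obs=20_000, seed=0):
    """Synthetic fund whose positions load on returns with slope `skill`."""
    rng = random.Random(seed)
    returns = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    positions = [skill * r + rng.gauss(0.0, 1.0) for r in returns]
    return positions, returns


if __name__ == "__main__":
    positions, returns = simulate_fund(skill=0.5)
    print("estimated beta:", ols_slope(positions, returns))
```

Estimating this slope fund by fund and ranking funds by the estimate would give one way to locate the lowest and highest deciles of stock picking ability before and after a technology shock.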
Kacperczyk and Seru (2007) (and others) find that there is considerable heterogeneity in asset managers' skills (see their Table I). Our model suggests that one source of heterogeneity might be managers' luck in their search for a predictor, rather than differences in innate abilities to find investment ideas or effort. Indeed, in our model, all speculators are ex-ante identical and choose the same effort in terms of search in the sense that their stopping rule (and therefore expected total cost of search) is identical. Yet, they end up trading on predictors of different qualities because the outcome of the search process is random. This implies in particular that a speculator might end up paying a large total search cost (n i c) and yet appear as having low skills (trading on a signal of poor quality).

Data Abundance, Computing Power, and Asset Price Informativeness
Progress in information technologies has improved investors' ability to forecast asset payoffs in two ways. On the one hand, these technologies reduce the cost of filtering out noise from raw data (e.g., greater computing power enables asset managers to use powerful statistical techniques, such as deep neural networks, to form their forecasts). On the other hand, they allow investors to collect and store an increasing volume of data. Propositions 6 and 7 show that these two distinct dimensions of technological progress do not affect asset price informativeness in the same way.
Proposition 6. In equilibrium, an increase in computing power (a decrease in c) raises the average quality of speculators' predictors and therefore price informativeness.
Greater computing power induces speculators to be more demanding for the quality of their predictors (to put more effort in the search of good predictors) because it reduces the cost of exploring new data to obtain a predictor (see Proposition 3). Thus, speculators obtain signals of higher quality on average. Hence, on average, they trade more aggressively on their signals, their aggregate demand for an asset becomes more informative and, for this reason, price informativeness increases (see eq.(11)).
Proposition 7.
1. In equilibrium, an improvement in the quality of the most informative predictor (a decrease in $\underline{\theta}$) raises the average quality of speculators' predictors and therefore price informativeness.
2. In equilibrium, a decrease in the proportion of informative datasets (a decrease in α) reduces the average quality of speculators' predictors and therefore price informativeness.
Thus, the effect of data abundance on price informativeness is ambiguous. Holding α constant, data abundance (a decrease in θ) improves asset price informativeness, even when it induces speculators to be less demanding for the quality of their predictors (i.e., when a decrease in θ reduces τ (θ * ); see Proposition 4). The reason is that the negative effect of the drop in the quality of the worst predictor used in equilibrium (when it happens) on the average quality of speculators' signals is never sufficient to offset the positive effect of the improvement in the quality of the best predictor in equilibrium.
As a result, a push back of the data frontier raises the average quality of predictors and speculators' average trading aggressiveness. In contrast, holding θ constant, data abundance (a decrease in α) leads speculators to be less demanding for the quality of their predictors. As a result, the average quality of predictors drops, speculators' aggregate demand is less informative and therefore price informativeness drops.
In reality, data abundance is likely to both push back the data frontier (reduce $\underline{\theta}$) and exacerbate the needle in the haystack problem (reduce α). As a result, the net effect of data abundance on the long run evolution of asset price informativeness is ambiguous, as shown in Figure 3 (in which we assume that $\alpha = \min\{1, 0.32 + 0.8 \times \underline{\theta}\}$). [...] investment to stock prices after the digitization of firms' regulatory filings, which they explain by a decline in the production of private information. Our results suggest that considering shocks that affect only computing power or only data abundance (rather than both dimensions simultaneously) would help in understanding how progress in information technologies affects asset price informativeness.

Data Abundance, Computing Power and Trading Profits
In equilibrium, the total trading profit ("excess return"), $\pi(s_\theta)$, of a speculator with type θ on her position in the risky asset is: $\pi(s_\theta) = x^*(s_\theta, p^*)(\omega - p^*)$, where $x^*(s_\theta, p^*)$ and $p^*$ are given by eq.(8) and eq.(9), respectively. Using eq.(8), we deduce that: Using eq.(23), the expected trading profit of a speculator with type θ, $\bar{\pi}(\theta)$, is therefore: where the last equality follows from the fact that $\mathrm{Var}[\omega \mid p^*] = (I(\theta^*, \underline{\theta}))^{-1}$ (by definition of $I(\theta^*, \underline{\theta})$).
Thus, the unconditional expected trading profit of all speculators (the average trading profit across all speculators) is: and the variance of trading profits for speculators (the dispersion of trading profits across all speculators) is: Empirically, $\mathrm{E}[\bar{\pi}(\theta)]$ and $\mathrm{Var}[\bar{\pi}(\theta)]$ could be measured by the cross-sectional mean and variance of trading profits of active funds (for instance in a given quarter). Another possibility is to consider the distribution (across funds) of the squared Sharpe ratio (the ratio of average excess returns for a fund divided by the standard deviation of returns) of active funds. Indeed, $\bar{\pi}(\theta)$ is equal to the expected squared Sharpe ratio of a speculator with type θ, divided by her risk aversion (see Footnote 17). Thus, $\mathrm{E}[\bar{\pi}(\theta)]$ and $\mathrm{Var}[\bar{\pi}(\theta)]$ can also be interpreted as the mean and variance of the distribution of squared Sharpe ratios across funds.
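The empirical proxies suggested above can be computed along the following lines. This is a sketch under stated assumptions: the input is a hypothetical dictionary mapping fund names to lists of periodic excess returns, the sample standard deviation is used within each fund, and the population variance is used across funds. These are our choices, not prescriptions from the paper.

```python
import statistics


def squared_sharpe(excess_returns):
    """Squared ratio of mean excess return to its sample standard deviation."""
    mu = statistics.mean(excess_returns)
    sigma = statistics.stdev(excess_returns)
    return (mu / sigma) ** 2


def cross_section_moments(returns_by_fund):
    """Cross-sectional mean and variance of funds' squared Sharpe ratios."""
    ratios = [squared_sharpe(r) for r in returns_by_fund.values()]
    return statistics.mean(ratios), statistics.pvariance(ratios)


if __name__ == "__main__":
    # Hypothetical excess returns for two funds, for illustration only.
    sample = {"fund_a": [0.01, 0.03], "fund_b": [0.00, 0.04]}
    print(cross_section_moments(sample))
```

Tracking these two moments around shocks to computing power or data availability would be one way to confront the predictions on average profits and on their dispersion with data.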
3. If $\bar{\tau}(\theta^*(\underline{\theta}, c, 1), \underline{\theta}, 1) > (\tau_\omega \rho^2 \nu^2)^{1/2}$ then speculators' expected profit is a hump-shaped function of α, which reaches its maximum at an interior value of α (characterized in the proof of the proposition). Otherwise, speculators' expected profit increases with α and reaches its maximum for α = 1.

Thus, data abundance or greater computing power do not necessarily improve speculators' expected trading profit. Consider first a decrease in c or $\underline{\theta}$. Such a decrease leads speculators to be more demanding for the quality of their predictors and raises the average quality of their signals. However, for this reason, it raises price informativeness.
The first effect has a positive effect on speculators' expected profit while the second has a negative effect. The latter effect always dominates when c or $\underline{\theta}$ are small enough (see Figure 4 for a numerical example). A decrease in α has the opposite effects: It reduces the average quality of speculators' signals and price informativeness. The first effect reduces speculators' expected profit while the second increases this expected profit. The former effect always dominates when α is small enough. Overall, these findings suggest that there can be a point at which further improvements in computing power or data availability reduce speculators' expected profit.

Figure 4. Left: Speculators' expected profits, E(π), as a function of the search cost, c (other parameter values are $\underline{\theta} = \pi/5$, $\rho = \sigma^2 = \nu^2 = 1$). Right: Speculators' expected profits, E(π), as a function of the data frontier, $\underline{\theta}$ (other parameter values are c = 0.05, $\rho = \sigma^2 = \nu^2 = 1$). Upper graphs: $\phi(\theta) = 3\cos(\theta)\sin^2(\theta)$. Lower graphs: $\phi(\theta) = 5\cos(\theta)\sin^4(\theta)$.
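The two densities φ(θ) used in the numerical examples can be checked and sampled with a few lines of code. The sketch below assumes that both are densities on [0, π/2]. It verifies numerically that each integrates to one and draws predictor types by inverting the cumulative distribution function, which is $\sin^{k+1}(\theta)$ for a density of the form $(k+1)\cos(\theta)\sin^k(\theta)$. The function names are ours.

```python
import math
import random


def phi_upper(theta):
    """Density used in the upper graphs: 3 cos(theta) sin^2(theta)."""
    return 3.0 * math.cos(theta) * math.sin(theta) ** 2


def phi_lower(theta):
    """Density used in the lower graphs: 5 cos(theta) sin^4(theta)."""
    return 5.0 * math.cos(theta) * math.sin(theta) ** 4


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


def sample_theta(power, rng):
    """Draw theta with density (power+1) cos(theta) sin^power(theta)."""
    # The CDF is sin^(power+1)(theta), so invert it at a uniform draw.
    return math.asin(rng.random() ** (1.0 / (power + 1)))


if __name__ == "__main__":
    print(integrate(phi_upper, 0.0, math.pi / 2))
    print(integrate(phi_lower, 0.0, math.pi / 2))
    rng = random.Random(0)
    print([round(sample_theta(2, rng), 3) for _ in range(5)])
```

Both densities put little mass near θ = 0, so low-θ types are rare draws, and the second density concentrates even more mass at high values of θ.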
Now consider the effect of changes in the cost of processing data and data abundance on the dispersion (Var[π(θ)]) of expected trading profits across speculators. Using eq.(27), we obtain the following result.
Proposition 9.
1. Other things equal, the dispersion of speculators' expected trading profit decreases when the cost of processing data goes down for c small enough ($d\,\mathrm{Var}[\bar{\pi}(\theta)]/dc > 0$ for c sufficiently close to zero).
2. Other things equal, the dispersion of speculators' expected profit increases when the data frontier is pushed back for $\underline{\theta}$ small enough ($d\,\mathrm{Var}[\bar{\pi}(\theta)]/d\underline{\theta} < 0$ for $\underline{\theta}$ sufficiently close to zero).
To understand the first part of the proposition, suppose that c = 0. In this case, all speculators search for a predictor until they find one with the highest possible quality, i.e., $\theta^* = \underline{\theta}$. As a result, all speculators trade on predictors of the same quality ($\mathrm{Var}[\tau(\theta) \mid \underline{\theta} < \theta < \theta^*] = 0$) and therefore the dispersion of expected trading profits is nil (see eq.(27)). Now consider a small increase in c starting from the situation in which c = 0.
This increase raises θ * and therefore the dispersion of the quality of predictors used by speculators (Var[τ (θ) | θ < θ < θ * ] increases). As a result, the dispersion of trading profits increases as well. This increase is amplified by the fact that price informativeness goes down, which works to increase the dispersion in trading profits as well (see the expression for Var[π(θ)] in eq.(27)). As these effects still hold for larger values of c, we conjecture that the first part of Proposition 9 holds for all values of c but we have not been able to show it analytically (numerical simulations suggest that our conjecture is correct; see Figure 5 below for an example).
When $\underline{\theta} < \underline{\theta}^{tr}(c)$, pushing back the data frontier further raises the quality of the best predictor and reduces the quality of the worst predictor used by speculators (see Proposition 4). Thus, the range of quality for the predictors used in equilibrium widens. This effect increases the dispersion of the quality of predictors used by speculators (Var[τ(θ)] increases), which increases the dispersion of speculators' expected profits, holding price informativeness constant. In equilibrium, price informativeness improves, which dampens the previous effect (since $\mathrm{Var}[\bar{\pi}(\theta)]$ is inversely related to price informativeness; see eq. (27)). However, for $\underline{\theta}$ small enough, this second effect is not sufficient to offset the first. This explains the second part of the proposition.

Data Abundance, Computing Power and Crowding
Practitioners refer to the tendency for investors to follow the same trading strategy and exploit the same signals as "crowding". 22 Let $\mathrm{Cov}(x(s_{\theta_i}, p^*), x(s_{\theta_j}, p^*))$ be the covariance between the equilibrium holdings of a speculator with type $\theta_i$ and a speculator with type $\theta_j$. Using eq.(24) and the fact that $\mathrm{Var}[\omega - p^*] = (I(\theta^*, \underline{\theta}))^{-1}$, we obtain: We deduce that the pairwise correlation between the equilibrium positions of a speculator with type $\theta_i$ and a speculator with type $\theta_j$ (a measure of crowding) is: Thus, holding the quality of the predictors used by two speculators constant, their positions become less correlated when price informativeness is higher. The reason is that speculators trade on the component of their forecast of the asset payoff that is orthogonal to the price. This component reflects both the component of the fundamental, ω, that is not reflected into the equilibrium price and the noise in speculators' signal. The higher the first component relative to the second, the higher the pairwise correlation in speculators' positions in the asset. As the price becomes more informative, the first component becomes smaller relative to the noise component and, as a result, the pairwise correlation between speculators' positions drops. Using Proposition 7, we deduce the following result. Testing Proposition 10 requires measuring the pairwise correlation of speculators' positions, holding the quality of their signal constant.

22 Shanta Putchler, the CEO of Man Numeric (a quantitative investment fund), notes that: "The single largest contributor to crowding is the simple fact that investors tend to do the same sorts of things. There is a real propensity for investors to analyse the same datasets, with the same statistical techniques, and hence end up with largely overlapping positions." See https://www.man.com/maninstitute/crowding.
One possibility is to estimate the cross-sectional distribution of funds' predictors quality using the method described in Section 5.1 and analyze the effect of shocks to computing power or data abundance on the correlation in the positions of funds in different quantiles of the distribution.
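The correlation argument of this section can be sketched as follows. This is a reconstruction under stated assumptions, not a restatement of eq.(24): we assume that a speculator of type $\theta$ holds $x(s_\theta, p^*) = a(\theta)(\hat{s}_\theta - p^*)$ with $a(\theta) = \tau(\theta)/(\rho\sigma^2)$ and $\hat{s}_\theta = \omega + \tau(\theta)^{-1/2}\varepsilon_\theta$, that the noise terms $\varepsilon_\theta$ have variance $\sigma^2$ and are independent across speculators and of $\omega - p^*$, and we write $I$ as shorthand for price informativeness, so that $\mathrm{Var}[\omega - p^*] = I^{-1}$.

```latex
% Sketch under the assumptions stated in the lead-in (i \neq j).
\begin{align*}
x(s_{\theta_i}, p^*) &= a(\theta_i)\Big[(\omega - p^*) + \tau(\theta_i)^{-1/2}\varepsilon_{\theta_i}\Big],\\
\mathrm{Cov}\big(x(s_{\theta_i}, p^*), x(s_{\theta_j}, p^*)\big)
  &= a(\theta_i)\,a(\theta_j)\,\mathrm{Var}[\omega - p^*]
   = \frac{a(\theta_i)\,a(\theta_j)}{I},\\
\mathrm{Var}\big[x(s_{\theta_i}, p^*)\big]
  &= a(\theta_i)^2\Big(\frac{1}{I} + \frac{\sigma^2}{\tau(\theta_i)}\Big),\\
\mathrm{Corr}\big(x(s_{\theta_i}, p^*), x(s_{\theta_j}, p^*)\big)
  &= \Big[\Big(1 + \frac{\sigma^2 I}{\tau(\theta_i)}\Big)
          \Big(1 + \frac{\sigma^2 I}{\tau(\theta_j)}\Big)\Big]^{-1/2}.
\end{align*}
```

Under these assumptions, the correlation falls when $I$ rises for given $\tau(\theta_i)$ and $\tau(\theta_j)$, and rises with the quality of either predictor, in line with the discussion in the text.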
That is, each speculator's expected utility is just equal to the expected utility from trading on the worst predictor used in equilibrium. The reason is that the increase in the expected utility from trading associated with further explorations for a speculator who has found a predictor with type θ * is just offset by the expected utility cost of further explorations.
As can be seen from eq.(13), the data frontier, θ, affects speculators' ex-ante expected utility only via its effects on (i) the quality of the worst predictor, τ (θ * ) and (ii) the informativeness of the asset price, I(θ * ; θ, α). Now, a decrease in θ always raises price informativeness (Proposition 7) and, when θ < θ tr (c), it reduces the quality of the worst predictor (Proposition 4). Thus, it unambiguously reduces speculators' expected utility because g(θ * , θ * ) decreases with the informativeness of the asset price and increases with the quality of the worst predictor (τ (θ * )).
An increase in computing power raises the quality of the worst predictor and price informativeness in equilibrium. Thus, its effect on speculators' welfare is ambiguous.
Numerical simulations show that the first effect dominates unless c becomes very small.
Thus, in contrast to a push back of the data frontier, an improvement in computing power raises speculators' welfare (even though it can reduce their average gross trading profits; see Proposition 8). Figure 6 illustrates this point.
For similar reasons, the needle in the haystack problem (a decrease in α) has an ambiguous effect on speculators' welfare: It reduces price informativeness but it also decreases the quality of the worst predictor. The first effect improves speculators' welfare while the second reduces it. Numerical simulations show that the second effect dominates for α low enough.
Thus, data abundance can make speculators worse off in equilibrium. One might then wonder whether it would not be optimal for a speculator to ignore new data. This is not the case, however. To see this, suppose that the emergence of new datasets enables investors to reduce θ from $\theta_0$ to $\theta_1 < \theta_0$ but that speculators agree not to take advantage of the new datasets. In this case, each speculator obtains an expected utility equal to $J(\theta^*(\theta_0), \theta^*(\theta_0))$. If an investor secretly deviates by acquiring the new data, she does not affect the equilibrium of the trading stage. Hence, price informativeness is unchanged. It follows that if the investor finds a predictor with a type in $[\theta_0, \theta^*(\theta_0)]$, her expected utility of trading on this predictor is unchanged. However, in addition, the speculator has the possibility to find a predictor with a type in $[\theta_1, \theta^*(\theta_0)]$ and her expected utility from trading on a predictor with a type in this range is strictly higher than her expected utility from trading on a predictor with a type in $[\theta_0, \theta^*(\theta_0)]$. Thus, the deviation is profitable for the speculator. In other words, each speculator individually finds it optimal to use the new datasets if she expects others not to do so. Hence, unless speculators can credibly commit not to use the new datasets, all of them do, which makes them collectively worse off than if they did not.
Thus, data abundance can be "excessive" from speculators' viewpoint in the sense that they would be better off if the data frontier could not be improved. We now show that speculators' average investment in search is also excessive in the sense that, holding all exogenous parameters constant, they would be better off if they could commit to use a less demanding stopping rule (and therefore predictors of lower quality on average).
To see this, let us assume that speculators can collectively choose a stopping rule, $\theta_r$, and commit to this choice. In this case, speculators would optimally choose the stopping rule $\theta_r^{**}$ such that: $\theta_r^{**} = \arg\max_{\theta_r} J(\theta_r, \theta_r)$.
Proposition 12. In a symmetric interior equilibrium of the exploration phase, the stopping rule used by speculators is more demanding than the optimal stopping rule with commitment, that is, θ * < θ * * r . Thus, in equilibrium, speculators' investment in search for predictors, E(n i c), is higher than the investment that would maximize their welfare if they could collectively choose their stopping rule.
Thus, there is excessive investment in search for predictors in equilibrium from speculators' viewpoint. The reason is as follows. When speculators choose a more stringent stopping rule, they expect to trade on a predictor of better quality on average, which raises their expected utility. However, this choice raises their expected utility cost of exploration (as it will take more exploration rounds to find a predictor) and price informativeness (as speculators trade more aggressively on more precise signals). Both effects reduce their expected utility. When they individually choose their stopping rule, each speculator accounts for the first cost but ignores the second (as each speculator is too "small" to affect price informativeness). In contrast, a central planner organizing the search for predictors in speculators' interests internalizes both costs and therefore chooses a less stringent stopping rule than that chosen individually by speculators.
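The wedge between the two stopping rules can be illustrated with first-order conditions. The notation is an assumption made for this sketch: $J(\theta_r, \theta_m)$ is read as the expected utility of a speculator who uses the stopping rule $\theta_r$ when all other speculators use $\theta_m$, $J$ is taken to be differentiable, and a larger $\theta$ corresponds to a less demanding rule.

```latex
% Illustrative local argument; J(\theta_r, \theta_m) and its
% differentiability are assumptions of this sketch.
\begin{align*}
\text{Individual choice:}\quad
  & \frac{\partial J}{\partial \theta_r}(\theta_r, \theta_m)
    \Big|_{\theta_r = \theta_m = \theta^*} = 0,\\
\text{Commitment:}\quad
  & \frac{d}{d\theta_r} J(\theta_r, \theta_r)
    = \frac{\partial J}{\partial \theta_r}(\theta_r, \theta_r)
    + \frac{\partial J}{\partial \theta_m}(\theta_r, \theta_r),\\
\text{Evaluated at } \theta^*:\quad
  & \frac{d}{d\theta_r} J(\theta_r, \theta_r)\Big|_{\theta_r = \theta^*}
    = \frac{\partial J}{\partial \theta_m}(\theta^*, \theta^*) > 0.
\end{align*}
```

The sign of the last term reflects the externality described above: when other speculators use a less demanding rule, price informativeness is lower, which raises each speculator's expected utility. Starting from $\theta^*$, a collective move toward a less demanding rule therefore raises welfare, which is consistent with $\theta^* < \theta_r^{**}$. The argument is only local; a global statement would require additional structure on $J$.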

Conclusion
Progress in information technologies enables investors to have access to more data (data abundance), both in terms of volume and diversity, and greater computing power, so that they can deploy more powerful techniques to extract information from raw data. In this paper, we propose a new model of information acquisition to analyze separately the effects of these two distinct dimensions of technological progress.
In our model, speculators search (mine data) for predictors via trials and optimally stop searching when they find a predictor with a signal-to-noise ratio larger than an endogenous threshold. As the outcome of speculators' search process is random, speculators discover different predictors. Thus, even though they are homogeneous ex-ante, speculators are heterogeneous ex-post in terms of the quality of their predictors, their performance, their holdings, etc. In this way, our model generates predictions about the effects of data abundance and computing power on the distribution of asset managers' skills (precisions of their signals), the distribution of their trading profits, or the correlation in their holdings. Moreover, asset price informativeness is determined by speculators' optimal data mining strategy because this strategy determines the average quality of their signals and thereby the informativeness of their aggregate demand.
The main message of our model is that the effects of data abundance and greater computing power are not the same. For instance, greater computing power always induces speculators to be more demanding for the minimal quality of their predictors while this is not necessarily the case for data abundance. As a result, positive shocks to computing power improve and homogenize predictors' quality across speculators and, for this reason, improve price informativeness. In contrast, data abundance can result in a greater dispersion of predictors' quality across speculators and a drop in price informativeness.
In this case, the aggregate demand for the asset is given by:
where $\bar{a}$ is the average value of $a(\theta)$ across all speculators. Hence, observing $D(p)$ (and $p$) is informationally equivalent to observing $\xi = \omega + \bar{a}^{-1}\eta$. Thus:
where $\tau_\xi \equiv \bar{a}^2\nu^2$ is the precision of $\xi$ as a signal about $\omega$.
Now consider speculators. Using standard calculations in the CARA-Gaussian framework, we obtain that the optimal demand for the risky asset of a speculator with signal $s_\theta$ is:
As speculators have rational expectations on the price, they anticipate that it is linear in $\xi$, as in eq.(32). Moreover, let $\hat{s}_\theta \equiv \omega + \tau(\theta)^{-1/2}\varepsilon_\theta$, so that $s_\theta = \cos(\theta)\hat{s}_\theta$. Thus,
and
Note that the precision of $\hat{s}_\theta$ is $\tau(\theta)\tau_\omega$. Thus, as all variables are normally distributed and $\varepsilon_\theta$ and $\eta$ (the noises in $\hat{s}_\theta$ and $\xi$) are independent, standard calculations yield:
and
Thus, we can rewrite eq.(33) as:
Using the fact that $p = \frac{\tau_\xi}{\tau_\omega + \tau_\xi}\xi$, we deduce that:
Thus, $x^*(s_\theta, p)$ is as conjectured (and as in eq.(8)) if and only if $a(\theta) = \frac{\tau(\theta)}{\rho\sigma^2}$. It follows that $\bar{a} = \frac{\bar{\tau}(\theta)}{\rho\sigma^2}$. Eq.(9) and eq.(10) in the text immediately follow from substituting this expression for $\bar{a}$ in eq.(32).
In sum we have shown that (i) if dealers expect speculators to follow the trading strategy x * (s θ , p) given by eq.(8) then they set a price given by eq. (9) and (ii) if dealers set a price given by eq.(9) then speculators follow the trading strategy x * (s θ , p) given by eq.(8). Thus, eq.(8) and eq.(9) form an equilibrium. More generally, it is possible to show that this is the unique equilibrium in which speculators' trading strategy is a linear function of their signal and the price.
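The projection steps behind this argument can be sketched with standard Gaussian updating. The sketch assumes a zero prior mean for $\omega$, $\tau_\omega = 1/\sigma^2$, and the precisions stated in the proof ($\tau(\theta)\tau_\omega$ for $\hat{s}_\theta$ and $\tau_\xi$ for $\xi$). It is a reconstruction consistent with the stated result $a(\theta) = \tau(\theta)/(\rho\sigma^2)$, not necessarily the exact intermediate equations of the original derivation.

```latex
% Sketch: conditioning on (s_\theta, p) is equivalent to conditioning
% on (\hat{s}_\theta, \xi); zero prior mean and \tau_\omega = 1/\sigma^2
% are assumptions of this sketch.
\begin{align*}
\mathrm{Var}[\omega \mid \hat{s}_\theta, \xi]^{-1}
  &= \tau_\omega + \tau(\theta)\tau_\omega + \tau_\xi,\\
E[\omega \mid \hat{s}_\theta, \xi]
  &= \frac{\tau(\theta)\tau_\omega\,\hat{s}_\theta + \tau_\xi\,\xi}
          {\tau_\omega + \tau(\theta)\tau_\omega + \tau_\xi},\\
x^*(s_\theta, p)
  &= \frac{E[\omega \mid \hat{s}_\theta, \xi] - p}
          {\rho\,\mathrm{Var}[\omega \mid \hat{s}_\theta, \xi]}
   = \frac{1}{\rho}\Big[\tau(\theta)\tau_\omega\,\hat{s}_\theta + \tau_\xi\,\xi
     - \big(\tau_\omega + \tau(\theta)\tau_\omega + \tau_\xi\big)\,p\Big],\\
\tau_\xi\,\xi = (\tau_\omega + \tau_\xi)\,p
  \;&\Longrightarrow\;
  x^*(s_\theta, p) = \frac{\tau(\theta)\tau_\omega}{\rho}\,(\hat{s}_\theta - p)
   = \frac{\tau(\theta)}{\rho\sigma^2}\,(\hat{s}_\theta - p).
\end{align*}
```

The price-related terms cancel because the price already aggregates the information in $\xi$, so each speculator's demand loads only on the gap between her rescaled signal and the price.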
Proof of Proposition 3. In equilibrium, $F(\theta^*) = \exp(-\rho c)$. We have shown that $F(\cdot)$ decreases in $\theta^*$ in the proof of Proposition 2. As $\exp(-\rho c)$ decreases in $c$, it immediately follows from these two observations that $\theta^*$ increases in $c$.
Proof of Proposition 9.
If the second moment of the distribution for the variable τ(θ) converges when θ goes to zero, the analysis is more complex. 23 Indeed, as shown below, both the second and the first moments of the distribution for τ(θ) decrease with θ. If the effect on the second moment dominates then Var[π(θ)] decreases with θ, while if the effect on the first moment dominates then Var[π(θ)] increases with θ (see eq. (61)). We show below that for θ sufficiently close to zero the first effect dominates.
Proof of Proposition 10. Direct from the arguments in the text.
Proof of Proposition 11. Direct from the arguments in the text.