We decided to target this book at the investment community. Applications, of course, can be found elsewhere, and indeed everywhere. By staying within the financial domain, we could also have discussed areas such as credit decisions or insurance pricing, for example.

We will not discuss these particular applications in this book, as we decided to focus on questions that an investor might face. Of course, we might consider adding these applications in future editions of the book. It is a world in which it is very important for decision makers to make the right judgement and, furthermore, these decisions must be made in a timely manner.

Delays or poor decision making can have fatal consequences in the current environment. Having access to data streams that track the foot traffic of people can be crucial to curb the spread of the disease. Using satellite or aerial images could be helpful to identify mass gatherings and to disperse them for reasons of public safety. From an asset manager's point of view, creating nowcasts before official macroeconomic figures and company financial statements are released results in better investment decisions.

It is no longer sufficient to wait several months to find out about the state of the economy. Investors want to be able to estimate such points on a very high frequency basis. The recent advances in technology and artificial intelligence make all this possible. So, let us commence on our journey through alternative data.

We hope you will enjoy this book!

Acknowledgments

We would like to thank our friends and colleagues who have helped us by providing suggestions and correcting our errors. In the first place, we would like to express our gratitude to Dr. Marcos Lopez de Prado who gave us the idea of writing this book.

We would like to thank Kate Lavrinenko without whom the chapter on outliers would not have been possible; Dave Peterson, who proofread the entire book and provided useful and thorough feedback; Henry Sorsky for his work with us on the automotive fundamental data and missing data chapters, as well as proofreading many of the chapters and pointing out mistakes; Doug Dannemiller for his work around the risks of alternative data which we leveraged; Mike Taylor for his contribution to the data vendors section; Jorge Prado for his ideas around the auctions of data.

We would also like to extend our thanks to Paul Bilokon and Matthew Dixon for their support during the writing process. We are very grateful to Wiley, and Bill Falloon in particular, for the enthusiasm with which they have accepted our proposal, and for the rigor and constructive nature of the reviewing process by Amy Handy. Last but not least, we are thankful to our families.

Without their continuous support this work would have been impossible.

In this book, we seek to discuss the topic in detail, showing how alternative data can be used to enhance understanding of financial markets, improve returns, and manage risk better.

This book is aimed at investors who are in search of superior returns through nontraditional approaches. These methods are different from fundamental analysis or quantitative methods that rely solely on data widely available in financial markets. It is also aimed at risk managers who want to identify early signals of events that could have a negative impact, using information that is not yet present in any standard and broadly used datasets. There is news in the press about hedge funds and banks that have tried, but failed, to extract value from it (see e.g. Risk).

We must stress, however, that the absence of predictive signals in alternative data is only one of the components of a potential failure. In fact, we will try to convince the reader, through the practical examples that we will examine, that useful signals can be gleaned from alternative data in many cases.

At the same time, we will also explain why any strategy that aims to extract and make successful use of signals is a combination of algorithms, processes, technology, and careful cost-benefit analysis. Failure to tackle any of these aspects in the right way will lead to a failure to extract usable insights from alternative data. Hence, the proof of the existence of a signal in a dataset is not sufficient to benefit from a superior investment strategy, given that there are many other subtle issues at play, most of which are dynamic in nature, as we will explain later.

In this book, we will also discuss in detail the techniques that can be used to make alternative data usable for the purposes we have already noted. Hence, we will also include simpler and more traditional techniques, such as linear and logistic regression, with which the financial community is already familiar. Indeed, in many instances simpler techniques can be very useful when seeking to extract signals from alternative datasets in finance. Nevertheless, this is not a machine learning textbook and hence we will not delve into the details of each technique we use; we will only provide a succinct introduction.

We will refer the reader to the appropriate texts where necessary. This is also not a book about the technology and the infrastructure that underlie any real-world implementations of alternative data. These topics encompassing data engineering are still, of course, very important. Indeed, they are necessary for anything found to be a signal in the data to be of any use in real life. However, given the variety and the deep expertise needed to treat them in detail, we believe that these topics deserve a book on their own.

Nevertheless, we must stress that methodologies that we use in practice to extract a signal are often constrained by technological limitations. Do we need an algorithm to work fast and deliver results in almost real time or can we live with some latency? Hence, the type of algorithm we choose will be very much determined by technological constraints like these.

We will hint at these important aspects throughout, although this book will not be, strictly speaking, technological. In this book, we will go through practical case studies showing how different alternative data sources can be profitably employed for different purposes within finance. These case studies will cover a variety of data sources, and for each of them we will explore in detail how to solve a specific problem such as, for example, predicting equity returns from fundamental industrial data or forecasting economic variables from survey indices.

The case studies will be self-contained and representative of a wide array of situations that could appear in real-world applications, across a number of different asset classes. Finally, this book will not be a catalogue of all the alternative data sources existing at the moment of writing. We deem this to be futile because, in our dynamic world, the number and variety of such datasets increase every day. What is more important, in our view, is the process and techniques of how to make the available data useful.

In doing so, we will be quite practical by also examining mundane problems that appear in sieving through datasets, the missteps and mistakes that any practical application entails. This book is structured as follows.

Part I will be a general introduction to alternative data, the processes and the techniques to make it usable in an investment strategy. In Chapter 1, we will define alternative data and create a taxonomy. In Chapter 2 we will discuss the subtle problem of how to price datasets. This subject is currently being actively debated in the industry. Chapter 3 will talk about the risks associated with alternative data, in particular the legal risks, and we will also delve more into the details of the technical problems that one faces when implementing alternative data strategies.

Chapter 4 introduces many of the machine learning and structuring techniques that can be relevant for understanding alternative data. Again, we will refer the reader to the appropriate literature for a more in-depth understanding of those techniques. Chapter 5 will examine the processes behind the testing and the implementation of strategies based on alternative data signals. We will recommend a fail-fast approach to the problem.

In a world where datasets are many and further proliferating, we believe that this is the best way to proceed. Part II will focus on some real-world use cases, beginning with an explanation of factor investing in Chapter 6, and a discussion of how alternative data can be incorporated in this framework.

One of the use cases will not be directly related to an investment strategy but is a problem at the entry point of any project and must be treated before anything else is attempted: missing data, in Chapters 7 and 8. We also address another ubiquitous problem, that of outliers in data (see Chapter 9). We will then examine use cases for investment strategies and economic forecasting based on a broad array of different types of alternative datasets, in many different asset classes, including public markets such as equities and FX.

We also look at the applicability of alternative data to understanding private markets (see Chapter 20), where markets are typically more opaque given the lack of publicly available information. The alternative datasets we shall discuss include automotive supply chain data (see Chapter 10), satellite imagery (see Chapter 13), and machine readable news. In many instances, we shall also illustrate the use case with trading strategies on various asset classes.

It is widely known that information can provide an edge.

Hence, financial practitioners have historically tried to gather as much data as is feasible. The nature of this information, however, has changed over time, especially since the beginning of the Big Data revolution.

These include, for example, satellite imagery, social media, ship movements, and the Internet-of-Things (IoT). In practice, alternative data has several characteristics, which we list below. It is data that has at least one of the following features:

- Less commonly used by market participants
- Tends to be more costly to collect, and hence more expensive to purchase
- Usually outside of financial markets
- Has shorter history
- More challenging to use

We must note from this list that what constitutes alternative data can vary significantly over time according to how widely available it is, as well as how embedded in a process it is.

Obviously, today most financial market data is far more commoditized and more widely available than it was decades ago. Hence, it is not generally labeled as alternative. For example, a daily time series for equity closing prices is easily accessible from many sources and it is considered nonalternative. In contrast, very high frequency FX data, although financial, is far more expensive, specialized, and niche.

The same is also true of comprehensive FX volume and flow data, which is less readily available. Hence, these market derived datasets may then be considered alternative. The cost and availability of a dataset are very much dependent on several factors, such as asset class and frequency.

In recent years, the alternative data landscape has significantly expanded. One major reason is that there has been a proliferation of devices and processes that generate data. Furthermore, much of this data can be recorded automatically, as opposed to requiring manual processes to do so.

The cost of data storage is also coming down, making it more feasible to record this data to disk for longer periods of time. Traders trade with one another on an exchange and on an over-the-counter basis.

Every time they post quotes or agree to trade at a price with a counterparty, they create a data point. This data exists as an exhaust of the trading activity. The concept of distributing market data is hardly new: it has been an important part of markets for ages and is an important part of the revenue for exchanges and trading venues.

However, there are other types of exhaust data that have been less commonly utilized. Take, for example, a large newswire organization. Journalists continually write news articles to inform their readers as part of their everyday business. This generates large amounts of text daily, which can be stored on disk and structured.

If we think about firms such as Google, Facebook, and Twitter, their users essentially generate vast amounts of data, in terms of their searches, their posts, and likes. This exhaust data, which is a by-product of user activity, is monetized by serving advertisements targeted at users. Additionally, each of us creates exhaust data every time we use our mobile phones, creating a record of our location and leaving a digital footprint on the web. Corporations that produce and record this exhaust data are increasingly beginning to think about ways of monetizing it outside of their organization.

Most of the exhaust data, however, remains underutilized and not monetized. It could be archived emails, project communications, and so on. Structuring such data will also make it more useful for generating internal insights, as well as for external monetization.

First, we can divide the alternative data sources into the following high-level categories of generators: individuals, institutions, and sensors, and derivations or combinations of these. The latter is important because it can lead to the practically infinite proliferation of datasets. For example, a series of trading signals extracted from data can be considered as another transformed dataset.

The collectors of data can be either institutions or individuals. They can store information created by other data generators. For example, credit card institutions can collect transactions from individual consumers. Concert venues could use sensors to track the number of individuals entering a particular concert hall.

The data collection can be either manual or automatic. The latter is prevalent in the modern age, although until a couple of decades ago the opposite was true. This segmentation is summarized in Table 1. We can further subdivide the high-level categories into finer-grained categories according to the type of data generated.

A list can never be exhaustive. For example, individuals generate internet traffic and activity, physical movement and location, and so on.

[TABLE 1. Who Generates the Data? Entries: via digital methods; via analog methods.]

As individuals, we generate data via our actions: we spend, we walk, we talk, we browse the web, and so on.

Each of these activities leaves a digital footprint that can be stored and later analyzed. We have limited action capital, which means that the number of actions we can perform each day is limited. Hence, the amount of data we can generate individually is also limited by this.

Institutions also have limited action capital: mergers and acquisitions, corporate reports, and the like. Sensors also have limited data generation capacity given by the frequency, bandwidth, and other physical limitations underpinning their structure. However, data can also be artificially generated by computers that aggregate, interpolate, and extrapolate data from the previous data sources.

They can transform and derive the data as already mentioned above. Therefore, for practical purposes we can say that the amount of data is unlimited. One such example of data generated by a computer is that of an electronic market maker, which continually trades with the market and publishes quotes, creating a digital footprint of its trading activity. How to navigate this infinite universe of data and how to select which datasets we believe might contain something valuable for us is almost an art.

Practically speaking, we are limited by time and budget constraints. Hence, venturing into inspecting many data sources, without some process of prescreening, can be risky and is also not cost effective. We will discuss how to approach this problem of finding datasets later and how a new profession is emerging to tackle this task — the data scout and data strategist.

Data can be collected by firms and then resold to other parties in a raw format. This means that no or minimal data preprocessing is performed. Data can be then processed by cleansing it, running it through quality control checks, and maybe enriching it through other sources. Processed data can then be transformed into signals to be consumed by investment professionals. These signals could be, for example, a factor that is predictive of the return of an asset class or a company, or an early warning indicator for an extreme event.

A subsequent transformation could then be performed to convert a signal, or a series of signals, into a strategy encompassing several time steps based, for instance, on determining portfolio weights at each time step over an investment horizon.
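The chain from raw data to processed data, signal, and strategy can be sketched in a few lines. This is a minimal illustration only: the footfall numbers, the cleaning rule, the trailing-average signal, and the `max_weight` parameter are all hypothetical choices made for the example, not a method taken from any vendor or from later chapters.

```python
from statistics import mean

def clean(raw):
    """Raw -> processed: drop missing and non-positive readings
    (a stand-in for real quality-control checks)."""
    return [x for x in raw if x is not None and x > 0]

def to_signal(processed, lookback=3):
    """Processed -> signal: +1 if the latest reading is above the average of
    the previous `lookback` readings, -1 otherwise, 0 if history is too short."""
    if len(processed) <= lookback:
        return 0
    trailing = mean(processed[-lookback - 1:-1])
    return 1 if processed[-1] > trailing else -1

def strategy_weights(processed, lookback=3, max_weight=0.1):
    """Signal -> strategy: a portfolio weight at each time step, using only
    the data available up to that step."""
    return [max_weight * to_signal(processed[:t + 1], lookback)
            for t in range(len(processed))]

# Hypothetical raw feed with a missing value and an obviously bad reading.
raw_footfall = [100, None, 104, -1, 103, 110]
processed = clean(raw_footfall)
latest_signal = to_signal(processed)
weights = strategy_weights(processed)
```

Each function corresponds to one transformation between stages, so a dataset can be bought (or sold) at any point along the chain.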

These four stages are illustrated in Figure 1.

Volume (increasing) refers to the amount of generated data. For example, the actions of individuals on the web (browsing, blogging, uploading pictures, etc.) are aggregated into many billions of records globally. Furthermore, computer algorithms are used to further process, aggregate, and, hence, multiply the amount of data generated. Traditional databases can no longer cope with storing and analyzing these datasets.

Instead, distributed systems are now preferred for these purposes.

Variety (increasing) refers to both the diversity of data sources and the forms of data coming from those sources. The latter can be structured in different ways. The increasing variety is due to the fact that the set of activities and physical variables that can be tracked is increasing, alongside the greater penetration of devices and sensors that can collect data.

Trying to understand different forms of data can come with analytical challenges. These challenges can relate to structuring these datasets and also how to extract features from them.

Velocity (increasing) refers to the speed with which data are being generated, transmitted, and refreshed. In fact, the time to get hold of a piece of data has decreased as computing power and connectivity have increased.

In substance, the 3 Vs signal that the technological and analytical challenges to ingest, cleanse, transform, and incorporate data in processes are increasing. For example, a common analytical challenge is tracking information about one specific company in many datasets. If we want to leverage information from all the datasets at hand, we must join them by the identifier of that company. A hurdle to this can be the fact that the company appears with different names or tickers in the different datasets.
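The hurdle of joining datasets on a company identifier can be illustrated with a toy example. The sketch below assumes the simplest possible approach, normalizing names before joining; the company names, the suffix list, and both datasets are invented for illustration, and real entity mapping typically also relies on tickers, security identifiers, and fuzzy matching.

```python
import re

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "plc", "llc"}

def normalize_name(name):
    """Lower-case, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def join_on_company(left, right):
    """Inner-join two {raw_name: value} mappings on the normalized name."""
    right_by_key = {normalize_name(k): v for k, v in right.items()}
    joined = {}
    for raw_name, value in left.items():
        key = normalize_name(raw_name)
        if key in right_by_key:
            joined[key] = (value, right_by_key[key])
    return joined

# Two invented datasets that spell the same company differently.
sales = {"Acme Motors, Inc.": 120, "Globex Corp": 75}
sentiment = {"ACME MOTORS": 0.4, "Initech LLC": -0.1}
joined = join_on_company(sales, sentiment)
```

A naive join on the raw names would return nothing here, whereas the normalized join recovers the one company present in both datasets.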

The complexity of this problem explodes exponentially as we add more and more datasets. We will discuss the challenges behind this later in a section specifically dedicated to record linkage and entity mapping (see Chapter 3).

These 3 Vs are more related to technical issues, rather than business specific issues. Recently 4 further Vs have been defined, namely Variability, Veracity, Validity, and Value, which are focused more on the usage of Big Data.

Variability (increasing) refers to both the regularity and the quality inconsistency of the data. As we explained above, the diversity of the data sources and the speed at which data originates from them has increased. In this sense, the regularity aspect of Variability is a consequence of both Variety and Velocity.

Veracity (decreasing) refers to the confidence or trust in the data source. In fact, with the multiplication of data sources it has become increasingly difficult to assess the reliability of the data originating from them. While one can be pretty confident of the data, say, from a national bureau of statistics such as the Bureau of Labor Statistics in the United States, a greater leap of faith is needed for smaller and unknown data providers.

This refers both to whether data is truthful and to the quality of the transformations the provider has performed on the data, such as cleansing, filling missing values, and so on.

Validity (decreasing) refers to how accurate and correct the data is for its intended use. For example, data might be invalid because of purely physical limitations. These limitations might reduce accuracy and also result in missing observations; for example, a GPS signal can deteriorate on narrow streets in between buildings (in this case, overlaying the positions onto a roadmap can be a good solution to rectify incorrect positioning information).
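As a rough illustration of overlaying noisy positions onto a roadmap, the sketch below snaps a point to the nearest location on a set of straight road segments. It assumes flat planar coordinates and a hypothetical two-segment road; production map matching works on geodesic coordinates and usually takes the whole trajectory into account.

```python
def snap_to_segment(p, a, b):
    """Project point p onto the segment from a to b, clamped to the endpoints."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return (float(ax), float(ay))
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def snap_to_road(p, segments):
    """Return the closest snapped position over all road segments."""
    candidates = [snap_to_segment(p, a, b) for a, b in segments]
    return min(candidates, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Hypothetical road: east along the x-axis, then north.
road = [((0, 0), (10, 0)), ((10, 0), (10, 10))]
corrected_a = snap_to_road((4, 1.5), road)   # drifted off the first segment
corrected_b = snap_to_road((11, 5), road)    # drifted off the second segment
```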

Value (increasing) refers to the business impact of data. This is the ultimate motivation for venturing into data analysis. In general, the belief is that overall Value is increasing but this does not mean that all data has value for a business. This must be proven case by case, which is the purpose of this book. We have encountered other Vs, such as Vulnerability, Volatility, and Visualization. We will not debate them here because we believe they are a marginal addition to the 7 Vs we have just discussed.

In closing, we note that parts of the alternative data universe are not characterized by all these Vs if looked upon in isolation. The 7 Vs should, therefore, be interpreted as a general characterization of data nowadays. Hence, they paint a broad picture of the data universe, although some alternative datasets can still exhibit properties that are more typical of the pre-Big Data age.

Now that we have defined what alternative data is, it is time to ask the question of why investment professionals and risk managers should be concerned with it.

The insights can be of two types: either anticipatory or complementary to already available information. Hence, information advantage is the primary reason for using alternative data. Official macroeconomic data points are released infrequently and with a lag; they are nevertheless deemed to be important factors in portfolio performance. Investors want to anticipate these macro data points and rebalance their portfolios in the light of early insights.

For example, GDP figures, which are the main indicator for economic activity, are released quarterly. This is because compiling the numbers that compose it is a labor-intensive and meticulous process, which takes some time. Furthermore, revisions of these numbers can be frequent.

Nevertheless, knowing in advance what the next GDP figure will be can provide an edge, especially if done before other market participants. Central banks, for example, closely watch inflation and economic activity (i.e. GDP) as an input to the decision on the next rate move. FX and bond traders try in their turn to anticipate the move of the central banks and make a profitable trade.

Furthermore, on an intraday basis, traders with good forecasts for official data can trade the short-term reaction of the market to any data surprise. What can be a proxy for GDP that is released at a higher frequency than quarterly? One candidate is the Purchasing Managers' Index (PMI), a survey-based indicator. Any value above the 50 level is considered to show expanding conditions, while a value below the 50 mark potentially signals a recession.
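To make the idea of a higher-frequency proxy concrete, the sketch below reads a survey index (such as the PMIs discussed in this section) against the 50 level and fits a one-variable least-squares line to nowcast GDP growth. The numbers are synthetic and chosen only so that the arithmetic is easy to follow; they are not real PMI or GDP data, and a simple linear fit is an assumption of this example rather than the model used later in the book.

```python
def pmi_regime(level, threshold=50.0):
    """Classify a survey reading relative to the 50 level."""
    if level > threshold:
        return "expanding"
    if level < threshold:
        return "contracting"
    return "neutral"

def fit_ols(x, y):
    """Ordinary least squares for y = alpha + beta * x with one regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    beta = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
            / sum((a - mean_x) ** 2 for a in x))
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Synthetic, illustrative history: quarterly average survey level
# and the GDP growth (in %) later reported for the same quarter.
survey_level = [48.0, 50.0, 52.0, 54.0]
gdp_growth = [-0.2, 0.0, 0.2, 0.4]

alpha, beta = fit_ols(survey_level, gdp_growth)

# Nowcast for the current quarter from the latest survey reading,
# available before the official GDP release.
latest_reading = 53.0
nowcast = alpha + beta * latest_reading
regime = pmi_regime(latest_reading)
```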

One explanation is the relative differences in what the measures represent. GDP measures economic output that has already happened. Hence, it is defined as hard data. By contrast, PMIs tend to be more forward-looking, given the nature of the survey questions asked.

We define such forward-looking, survey-based releases as soft data. We should note that soft data is not always perfectly confirmed by subsequent hard data, even if the two are generally correlated. The dots indicate quarterly values. The PMI indicators are considered alternative data, in particular when we look at them in a much more granular form. We will examine them in more detail in a later chapter.

Value investing, for example, is rooted in the idea that share prices should reflect company fundamentals in the long term (which are also reflective of the macro environment), so the best predictors are the current fundamentals of a firm.

However, maybe we could do even better if we knew, or could forecast, the current fundamentals in advance of the market? We will test this hypothesis later. An example of alternative data in this context is the aggregated, anonymized data on millions of consumers' retail transactions, which can be mapped to the sales numbers of the shopping malls where these purchases happened.

The performance and hence the fundamentals of a mall can thus be forecasted relatively accurately long before the official income statement is released. Alternative data can also be used as a complement, not just a replacement or substitute for other data sources as we have already mentioned.

Thus, investors will look at it for signals that are uncorrelated or weakly correlated to existing ones. For example, apart from company fundamentals disclosed in the financial statements, a good predictor for the future performance of an industrial firm could be the capacity and utilization of the plants it operates or the consumer loyalty to the brand. Alternatively, we could collect data about its greenhouse gas emissions. Some of this information could be absent from a balance sheet but could be an indicator of the long-term performance of the company.

A seminal paper (see Bollen et al.) provided the spark for alternative data and, since then, quantitative hedge funds have been at the forefront of the usage of and investment in this space. Zuckerman discusses how a very sophisticated quant firm, Renaissance Technologies, had been using unusual forms of data for many years.

At time of press, several asset management firms are setting up data science teams to experiment with the alternative data world. To the knowledge of the authors, many attempts have been unsuccessful so far. This can be due to many reasons and some of them are not linked to the presence or absence of signals in the dataset they have acquired but to setting the right processes in place.

As a cautious first step, many are using it as a confirmation of the information coming from more traditional data sources. Fortado, Wigglesworth, and Scannell talk about many of the price and logistics barriers faced by hedge funds when using alternative data. Some of these are fairly obvious, such as the cost associated with alternative data. There are also often internal barriers, related to procurement, which can slow down the purchase of datasets.

It also requires management buy-in to provide budget, not only for purchasing of alternative data, but also to hire sufficiently experienced data scientists to extract value from the data. The underusage of data could happen for a variety of reasons, as mentioned in the previous paragraph. Another reason could be coverage.

Systematic funds, for example, try to diversify their portfolios by investing in many assets. While machine readable news tends to have an extensive coverage of all the assets, other datasets like satellite images may only be available for a small subset of assets. Hence, in many instances, strategies derived from satellite images could be seen as too niche to be implemented and they are thus defined as low capacity.

Larger firms with substantial amounts of assets under management typically need to deploy capital to strategies that have large capacity, even if the risk-adjusted returns might be smaller compared to low-capacity strategies. We give a more detailed definition of what capacity is in the context of a trading strategy later in this chapter.

The decision of whether to buy a dataset is often based on a performance measure such as backtests. A quandary with alternative data is that, as we have mentioned, it tends to be characterized by a shorter history. In order to have an effective backtest, a longer history is preferred. A buy side firm could of course simply wait for more history to become available. However, this can result in a decay in the value of the data due to overcrowding.

We tackle the problem of valuing alternative data in Chapter 2. All these considerations point to the fact that, as with every innovation, only a few bold players have taken the risk of starting to use alternative data, but further along the way, other firms might also get involved.

We illustrate a snapshot of our thinking in Figure 1. We expect that, as technological and talent barriers decrease and the awareness of the market to alternative data increases, every investor will make use of at least a few alternative data signals in the next decade.

When we talk about the capacity of a strategy, we are essentially referring to the amount of capital that can be allocated to it without the performance of the strategy being degraded significantly.

In other words, we want to make sure that the returns of our strategy are sufficiently large to offset the transaction costs of executing it in the market and the crowding out of the signal by other market participants, who are also trading similar strategies. Trying to understand whether other market participants are trading similar strategies is challenging. One way to do it is to look at the correlation of the strategy returns against fund returns, although this is only likely to be of use for strategies that dominate a fund's AUM.
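The correlation check described above can be sketched as follows. The return series and fund names are invented for illustration; in practice one would use much longer histories and remember the caveat that a high correlation only shows up when the strategy dominates a fund's AUM.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long return series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical periodic returns of our strategy and of two funds.
strategy = [0.01, -0.02, 0.015, 0.005, -0.01]
fund_returns = {
    "fund_a": [0.02, -0.04, 0.03, 0.01, -0.02],   # a levered copy of the strategy
    "fund_b": [0.01, 0.01, -0.01, 0.0, 0.005],    # largely unrelated
}

correlations = {name: pearson(strategy, r) for name, r in fund_returns.items()}
most_similar = max(correlations, key=correlations.get)
```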

We can also try to look at positioning and flow data collected from across the market. When it comes to transaction costs, at least for more liquid markets, the problem is somewhat easier to measure. When we refer to transaction costs, we include not only the spread between where we execute and the prevailing market mid-price, but also the market impact, namely how much the price moves during our execution.

Typically, for large orders we need to split up the risks and execute them over a longer period, during which time the price could drift against us. As we would expect, the transaction costs that we incur increase as we trade larger order sizes. However, this relationship is not linear. In practice, for example, doubling the size of the notional that we trade is likely to increase our transaction costs by much more than a factor of 2. It has been shown with empirical trading data across many different markets, ranging from equities and options to cryptocurrencies, that there is a square root relationship between the size of our orders and the market impact (see Lehalle). As well as the size of the order, the transaction costs are contingent on a number of other factors, such as the volatility of the underlying market, the traded volume in that asset, and so on.
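The square root relationship is often written in a stylized form where the impact per unit traded is proportional to volatility times the square root of the order size relative to traded volume. The sketch below uses that form; the constant `y`, the half-spread, and all market parameters are assumptions chosen for illustration, not calibrated values.

```python
from math import sqrt

def impact_bps(order_size, daily_volume, daily_vol_bps, y=1.0):
    """Stylized square-root impact per unit traded, in basis points.
    `y` is an assumed constant of order one."""
    return y * daily_vol_bps * sqrt(order_size / daily_volume)

def total_cost(order_size, daily_volume, daily_vol_bps, half_spread_bps=0.5, y=1.0):
    """Total cost in currency units: notional times (half-spread + impact)."""
    per_unit_bps = half_spread_bps + impact_bps(order_size, daily_volume, daily_vol_bps, y)
    return order_size * per_unit_bps / 1e4

# Hypothetical market: 100 bps daily volatility, 100 million traded per day.
small = total_cost(1e6, 1e8, 100.0)
large = total_cost(2e6, 1e8, 100.0)
cost_multiple = large / small   # how much total cost grows when size doubles
```

Under these assumptions the impact per unit grows by a factor of the square root of 2 when the order doubles, so the impact component of the total cost grows by a factor of about 2.8, consistent with costs increasing by more than a factor of 2.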

If the asset we are trading has very high volatility and low traded volume, we would expect the market impact to be very high. Let us take for example a trading strategy that trades on a relatively high frequency, where on average we can make 1 basis point per trade in the absence of transaction costs.

In this instance, if our transaction costs exceed 1 basis point per trade, the strategy would become loss making. By contrast, if a trading strategy has high capacity, then we can allocate large amounts of capital to it, without our returns being degraded significantly by increased transaction costs. Say, for example, we are seeking to make 20-30 basis points per trade.

Hence, we could conceivably allocate a much larger amount of capital to such a strategy. Note that, if we are trading a very illiquid asset, where typically transaction costs are much higher, then such a strategy could be rendered as low capacity. One simple way to understand the capacity of a strategy is to look at the ratio of returns to transaction costs.

If this ratio is very high, it would imply that you can allocate a large amount of capital to that strategy. By contrast, if that ratio is very low, then it is likely that the strategy is much lower capacity, and we cannot trade very large notional sizes with it. It is too labor intensive to deploy large amounts of capital only to niche strategies because it would require a significant amount of research to create and implement many of them.
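The ratio of returns to transaction costs, and the order size at which costs eat up the whole return, can be sketched as below. The break-even formula assumes the stylized square-root impact model (cost per unit equal to half-spread plus `y` times volatility times the square root of size over volume); all parameter values are hypothetical.

```python
def return_to_cost_ratio(gross_bps, cost_bps):
    """Expected gross return per trade divided by transaction cost per trade."""
    return gross_bps / cost_bps

def breakeven_order_size(gross_bps, half_spread_bps, daily_vol_bps, daily_volume, y=1.0):
    """Largest order size for which cost per unit does not exceed the gross
    return per trade, assuming cost = half_spread + y * vol * sqrt(size / volume)."""
    edge = gross_bps - half_spread_bps
    if edge <= 0:
        return 0.0
    return daily_volume * (edge / (y * daily_vol_bps)) ** 2

# Hypothetical market: 0.5 bps half-spread, 100 bps daily vol, 100 million daily volume.
high_freq = breakeven_order_size(1.0, 0.5, 100.0, 1e8)     # 1 bp per trade
trend = breakeven_order_size(25.0, 0.5, 100.0, 1e8)        # 25 bps per trade
```

With these made-up inputs, the strategy earning 25 basis points per trade can absorb orders that are thousands of times larger than the one earning 1 basis point, which is the sense in which it has higher capacity.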

Different types of strategies can require very different skillsets as well. For more fundamentally focused firms, having a dataset that is only available for a smaller subset of firms is less of an impediment. Typically, they will drill down into greater detail to investigate a narrower universe of assets. Hence, for smaller trading firms, niche strategies might be more attractive, as they are less impacted by capacity considerations.

In other words, given that they have less AUM, they typically trade smaller notional sizes in the markets, and smaller orders are less impacted by transaction costs. Hence, they are able to run strategies that trade more often, such as high-frequency trading strategies, or those with more illiquid assets.

Below we summarize some of the properties that are typical of high-capacity strategies:

- Returns are less sensitive to increased transaction costs.
- Higher amounts of capital can be allocated without negatively impacting returns.
- Can be traded on a wide variety of tickers.
- Lower frequency.
- Lower Sharpe ratio.

Here we do the same for low-capacity strategies:

- Returns are sensitive to transaction costs.
- Higher amounts of capital will render the strategy loss making.
- Restricted to a small number of tickers.
- Higher frequency.
- Higher Sharpe ratio.

We show the risk-adjusted returns of Cuemacro's CTA (commodity trading advisor) strategy over the sample period, under different assumptions for transaction costs. These strategies are often known as CTA-type strategies, because originally the firms trading them predominantly traded commodities.

However, these days firms trade these strategies across liquid futures in a number of different asset classes, including FX, fixed income, equity indices, and commodities. The CTA strategy involves trend following and typically also involves some sort of risk allocation based on volatility targeting; positions are often leveraged.
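As a rough illustration of these two ingredients, the sketch below combines a simple trend signal with volatility-targeted leverage on simulated prices. It is not Cuemacro's strategy: the lookback, volatility window, target volatility, leverage cap, and the simulated price path are all assumptions chosen for the example.

```python
import numpy as np

def trend_signal(prices, lookback=60):
    """+1 if the price is above its level `lookback` periods ago, -1 if below, 0 before that."""
    prices = np.asarray(prices, dtype=float)
    signal = np.zeros(len(prices))
    signal[lookback:] = np.sign(prices[lookback:] - prices[:-lookback])
    return signal

def vol_target_leverage(returns, target_vol=0.10, window=20,
                        max_leverage=5.0, periods_per_year=252):
    """Leverage = target volatility / trailing annualized realized volatility, capped."""
    returns = np.asarray(returns, dtype=float)
    leverage = np.zeros(len(returns))
    for t in range(window, len(returns)):
        recent = returns[t - window + 1:t + 1]
        realized = recent.std(ddof=1) * np.sqrt(periods_per_year)
        if realized > 0:
            leverage[t] = min(target_vol / realized, max_leverage)
    return leverage

# Simulated daily prices with a small upward drift (illustrative only).
rng = np.random.default_rng(0)
daily_returns = 0.0005 + 0.01 * rng.standard_normal(500)
prices = 100.0 * np.cumprod(1.0 + daily_returns)

# Pad with a leading zero so returns align with prices.
returns = np.concatenate([[0.0], np.diff(prices) / prices[:-1]])
position = trend_signal(prices) * vol_target_leverage(returns)
print(position[-5:])
```

The position is larger when realized volatility is low and smaller when it is high, which is why such positions are often leveraged in calm markets.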

This perhaps isn't surprising given the strategy trades relatively infrequently, and relies upon identifying longer term trends. Hence, the returns per trade are typically quite large compared to the transaction costs. The various properties of the strategy suggest that we could label it as a higher-capacity strategy.

Increasing the transaction costs for a low-capacity strategy would have a negative impact on both the information ratio and annualized returns. Source: Based on data from Cuemacro. Why is this concept of strategy capacity important in the context of alternative data? Once we know the approximate amount of capital we can deploy to a strategy, it enables us to understand the dollar value we can make, as opposed to purely the percentage returns. This in turn helps us when evaluating how much value to associate with a certain alternative dataset, if we are using it to generate trading signals.

However, the capacity of the strategy is very limited. Hence, we can only allocate at most 1 million USD to it, without transaction costs significantly impacting our returns. If we have lots of capital available for deployment, then the second dataset generates more value in dollar terms. Hence, we would likely be willing to pay more for the second dataset.

By contrast, if we have very limited capital available, it is unlikely we would be willing to pay as much for the second dataset, as we would be unable to use up much of the capacity of that strategy. As discussed elsewhere in the book, we also need to evaluate other costs associated with using the dataset, such as the time taken to incorporate it within our investment process.
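The comparison in dollar terms can be sketched as follows. Only the 1 million USD capacity figure comes from the text; the percentage returns, the capacity of the second strategy, and the capital levels are hypothetical numbers chosen to illustrate the reasoning.

```python
def dollar_value(annual_return, capacity_usd, capital_available_usd):
    """Dollar profit per year: the return applied to whatever capital can actually be deployed."""
    deployed = min(capacity_usd, capital_available_usd)
    return annual_return * deployed

# Strategy on the first dataset: high return, capacity limited to 1 million USD.
# Strategy on the second dataset: lower return, much higher capacity (assumed).
first = {"annual_return": 0.20, "capacity_usd": 1_000_000}
second = {"annual_return": 0.05, "capacity_usd": 100_000_000}

for capital in (2_000_000, 500_000_000):
    value_first = dollar_value(capital_available_usd=capital, **first)
    value_second = dollar_value(capital_available_usd=capital, **second)
    print(f"capital {capital:>11,}: first {value_first:>10,.0f} USD, second {value_second:>10,.0f} USD")
```

With these assumed numbers, the first dataset is worth more to an investor with 2 million USD of capital, while the second is worth far more to an investor who can deploy 500 million USD.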

In Chapter 2, we discuss the value of alternative data in more detail from the perspective of both buyers and sellers. Every time an investor ponders whether to purchase a dataset, they must bear in mind all these aspects together, along with other important issues such as the business use and technological limitations. We show in this section a summary of dimensions along which a potential data source should be projected in our view, ideally before it is purchased.

Of course, the most important thing in the end is the amount of extracted alpha, but before venturing into alpha research some prescreening should be carried out along the lines of these dimensions.

Full - None - 1
Breadth14 within an asset class (score 1-10), e.g. None - 1
Depth15 within an asset class (score 1-10), e.g. None - 1
Free data? Yes, the raw data only / Yes, the processed dataset / No
History (score 1-10), e.g. Short - 1, Medium - 5
Real-time - Lagged - 5
Public dataset - 10 … Widely sold against a subscription fee - 7 … Exclusive - 1
Data originality (score 1-10), e.g. Similar to many other datasets in the market - 1 … Unique - 10
Technology (score 1-10)
No legal limitations to use the data - 10 … Limitations only in certain jurisdictions - 5 … Severe restrictions to use the data - 1
Portfolio effects: degree of orthogonality to other already purchased datasets (score 1-10)
Investment style suitability: Macro / Sector specific / Asset specific
Time frequency of the investment strategy: Intraday / Daily / Weekly / Monthly / Quarterly / Yearly / Other

Building a scorecard by considering some or all of these dimensions is an option to decide whether to purchase a dataset.

If the score is higher than a certain threshold, a dataset might be considered further for acquisition. To some extent data brokers and scouts can help to outsource this type of scoring process. In many cases, financial firms will ask data firms to fill in questionnaires to answer similar questions to the above.
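A scorecard of this kind can be sketched as a weighted average with a threshold, together with a hard rule that excludes a dataset outright (for instance, on legal grounds). The dimension names, weights, threshold, and scores below are hypothetical and would need to be set by each firm.

```python
def evaluate_dataset(scores, weights, threshold=6.0, min_required=None):
    """Return (decision, weighted_score) for a dataset scored 1-10 on several dimensions.

    min_required maps a dimension to a minimum score; falling below it excludes
    the dataset directly, without scoring the other dimensions.
    """
    for dimension, floor in (min_required or {}).items():
        if scores[dimension] < floor:
            return "excluded", None
    total_weight = sum(weights.values())
    weighted = sum(scores[d] * w for d, w in weights.items()) / total_weight
    decision = "consider further" if weighted > threshold else "reject"
    return decision, weighted

weights = {"history": 1, "originality": 2, "legal": 1, "orthogonality": 2}
min_required = {"legal": 2}  # a score of 1 means severe legal restrictions

candidate = {"history": 5, "originality": 8, "legal": 10, "orthogonality": 7}
restricted = {"history": 9, "originality": 9, "legal": 1, "orthogonality": 9}

print(evaluate_dataset(candidate, weights, min_required=min_required))
print(evaluate_dataset(restricted, weights, min_required=min_required))
```

The second dataset scores well on every other dimension but is still excluded, because the legal rule is applied before any weighting takes place.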

In building a scorecard, one must also consider rules that directly exclude or include a dataset for further consideration, for example, when there are severe legal restrictions on using the dataset. In this case, a dataset can be blacklisted directly without scoring it across the other dimensions. We have noted that alternative data has proliferated over the years, increasing its supply to the market, with this trend likely to accelerate over time. Indeed, statistics from Neudata show that the number of alternative datasets is now around see Figure 1.

The alternative vendors can range significantly in size and what they do. They can include well-known existing market data companies such as Bloomberg, which sell their own alternative datasets, such as machine readable news see Chapter 15 , or IHS Markit, which sells alternative datasets related to crude oil shipping see Chapter A lot of these firms are also creating their own data markets to offer data from third-party alternative data vendors.

At the other end of the spectrum, many alternative data vendors can be start-ups. Large corporates, not traditionally associated with this space, can also be alternative data vendors. They can sell their datasets derived from their exhaust directly to data users. These firms include MasterCard, which sells its consumer transaction data see Chapter In practice, many corporates who wish to monetize their own exhaust data often work with an alternative data vendor or a consultancy to help them.

These vendors can use their expertise in alternative data processing to monetize these datasets, which includes structuring the data, creating data products, marketing and selling the data to users, and so on. Source: Based on data from Neudata. Source: Based on data from Greenwich Associates. Having an internal exhaust source requires a firm to engage in a large amount of business tangential to selling data.

As a result, many alternative data vendors source their raw data from many different external sources, rather than being able to exclusively use their own exhaust data. In terms of the brands most associated with alternative data, we present a recent survey of market participants from Greenwich Associates, in Figure 1. The poll is topped by Quandl, which is an aggregator and marketplace of alternative datasets.

It is followed by Orbital Insight, which sells its own datasets related to satellite imagery. Neudata is an alternative data scouting firm see Chapter 5. Thinknum creates datasets based upon web data. As we can see the most recognized alternative data vendors differ significantly in terms of what they do and also in what the focus of their business is.

We, of course, acknowledge that the sample is relatively small, and given the fast-moving nature of the alternative data landscape it is likely that these names may have changed recently. Indeed, since publication a number of entrants have entered this space, such as Bloomberg, which has launched its platform for distributing alternative data. In Section 5.

We mentioned in Section 1. Source: Based on data from alternativedata. A survey from alternativedata. Typically, these employees have more than a decade of experience, often from areas outside of asset management such as technology, academia, and working at data providers themselves. This increase in the capability of funds to process alternative datasets has perhaps unsurprisingly been accompanied by an increase in spending on the actual alternative datasets themselves. It is forecasted that spending by buy-side firms on alternative data is likely to increase to close to 2 billion USD for see Figure 1.

This compares with million USD in We would expect continuing growth in alternative data spending by the buy side in the coming decade. We noted that one of the main differentiating properties of alternative datasets is that they are not purely derived from financial markets. The usage of alternative datasets by funds varies significantly by type. Datasets derived from web scraping are most popular at present at funds see Figure 1. By contrast, those datasets from satellite imagery, geolocation, and email receipts are less popular, using the data available at the time of print.

In doing so we have only scratched the surface of a big and complex world. In the next chapters we will dig more into the details of this world that underpin practical applications. We will thus reexamine many of the concepts and topics introduced here. Later, in the second part of this book we will explore real-world case studies so that the reader can become further familiar with the concepts discussed in this chapter and also in the next few. NOTES 1 A lot of applications of alternative data are being found today in insurance and credit markets see e.

Turner, ; Turner, ; Financial Times, We will not explicitly treat them here, although the alternative data generalities we will examine are also applicable to those cases. In Thank You for Being Late: An Optimist's Guide to Thriving in the Age of Accelerations, Thomas Friedman puts the starting year as because this is the year when major development in computational power, software, sensors, and connectivity happened. In this sense, data can be also semi-processed.

We will not use this fine distinction here, but this is something to bear in mind. By , that volume is forecast to increase up to 40 times over, as technologies including the Internet of Things create vast new datasets. For example, do we have information about the whole supply chain and assets of a manufacturing stock? This needs to be addressed from the perspectives of both data consumers and data producers. Data is only valuable from the perspective of a data consumer if it can be monetized, directly or indirectly.

From the viewpoint of a data vendor, the cost of creating and distributing the dataset needs to be recouped when selling it. The data vendor would of course also want to add a margin on top. In this chapter, we discuss this topic in some detail and show some directions to help value alternative data. The development of marketplaces where market participants can converge to one price is still in its infancy and, given the nature of what data is, many challenges still remain open, as we will shortly discuss.

We will also show that having a standardized marketplace might not be the economically optimal solution for a data vendor. In the investment space, if all or most of the market participants make the same prediction based on the same information, they can trade on it and opportunities could quickly disappear.

The Efficient Market Hypothesis (EMH) in its semi-strong form reflects this point of view by asserting that public information is incorporated almost immediately in the prices of financial assets and hence any hope to outperform the market in the long term based on that information is in vain. A direct result of this, if true, is that superior risk-adjusted returns are only available through insider information or an exclusive or restricted access to a dataset in the sense of this book.

We will not debate the validity of this hypothesis1 but what everybody would agree on is that if a piece of information has been available publicly for a while, then it has most probably lost a lot of its investment value. This could be a problem for data providers whose datasets then face the danger of quick obsolescence. In fact, in our experience, some of the alternative datasets that emerged first are now decreasing in their ability to generate alpha e.

All this reasoning might lead us to the conclusion that data, alternative or not, could be of little value unless exploited almost immediately after its release, and gains over the market can only be made by having a speed edge. However, there are some counterarguments to this. First, the variety and the multitude of data could make the decay of a signal less rapid. New data sources continuously appear. Hence, it becomes less probable that a large number of the market participants have access to all of them and have incorporated a given dataset into their processes once it is available, let alone that they have combined it with exactly the same set of other data sources that other participants are using.

There are many more alternative data sources than standardized financial market data and their types are much more diverse. Hence, it is less probable that two different market players will discover precisely the same datasets and gain access to them at the same time. It is also less likely that they will end up mining these datasets and combining them with other signals gleaned from other data sources in the same way.

We can argue that essentially there are more degrees of freedom for the data sources used in general, with the advent of alternative data. This could lead to different results, unless there is a very strong directional signal, in particular because they are likely to augment it with different datasets. A linear regression model, for instance, is not able to exploit nonlinear relationships in the data that a deep learning model can naturally capture.
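The contrast between a linear and a nonlinear model can be illustrated on synthetic data. In the sketch below a quadratic fit stands in for a nonlinear model such as a deep network, which is a simplifying assumption made to keep the example short. The data-generating process and the interpretation of the sign as a trading direction are also assumptions.

```python
import numpy as np

# Synthetic data: the target depends on the feature in a purely nonlinear way.
rng = np.random.default_rng(42)
x = rng.uniform(-2.0, 2.0, size=500)
y = x ** 2 - 1.0 + 0.1 * rng.standard_normal(500)

# Linear regression versus a quadratic fit (the stand-in for a nonlinear model).
linear_coefs = np.polyfit(x, y, deg=1)
quadratic_coefs = np.polyfit(x, y, deg=2)

# Prediction for a new observation of the feature.
x_new = 0.0
linear_pred = np.polyval(linear_coefs, x_new)
nonlinear_pred = np.polyval(quadratic_coefs, x_new)

def action(prediction):
    """Map a predicted return to a direction (illustrative rule)."""
    return "buy" if prediction > 0 else "sell"

print(f"linear model:    {linear_pred:+.2f} -> {action(linear_pred)}")
print(f"nonlinear model: {nonlinear_pred:+.2f} -> {action(nonlinear_pred)}")
```

Both models see exactly the same data, yet the linear fit averages over the curvature and predicts a positive value where the nonlinear fit predicts a negative one.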

The two could lead to quite different predictions over the next investment horizon and hence point to different actions e. Third, another factor that contributes to the persistence of the value of a dataset is the different investment mandates, horizons, styles, and risk appetite that investors have. Given this, the number of relevant features that can be extracted from a combination of orthogonal data sources is, therefore, further multiplied. For example, investors interested in directional trading will be looking at trends and features that can predict it.

Volatility investors, on the other hand, will search for signals that drive the price of an asset in both directions. Styles like long-only, long-short, and so forth also determine what is relevant in the data and what is not. Sometimes datasets can be relevant for many different investment styles. We can take the example of machine-readable news. Longer term investors can aggregate the sentiment from machine-readable news articles over a long period of time to inform their trading.

By contrast, high-frequency traders will use machine-readable news at a much more granular level, using it to trigger very short-term trades, and also for risk management purposes, to identify when an asset is suspended from trading on an exchange. Still, the timeliness of the prediction and the speed of the subsequent action are also of the essence to make the most of a dataset.

In fact, hedge funds have invested millions in servers physically located as close as possible to stock exchanges to gain a timing edge over their competitors. For latency-sensitive strategies, like high-frequency trading, this is very important. However, it is not the only thing that matters as we have just explained. It is also important to make an accurate prediction of at least the direction of the markets. In summary, getting access to the right data at the right time is advantageous for the monetization of a dataset in the likely short opportunity window after it is released.

Market players who are quick to discover valuable datasets will have an edge, before these datasets become more commoditized. However, whether a dataset has a positive investment value will also depend on other factors such as its price. We will turn to examine the delicate issue of pricing in the next sections. We note in closing that arguments about decay of investment value can be time dependent. A dataset could cease to provide signals only temporarily if the economy enters an irrelevant period for the type of data it contains but could re-surface again in a future period.

For example, a political news stream could bear almost no impact on financial markets in relatively calm periods, but in times of political turbulence (e.g., Brexit) it could be the most important source of signals. In these circumstances data pricing is often determined by the seller, who does not provide the buyer with visibility into the cost of collecting, treating, and packaging the data (see Heckman et al.).

According to Heckman, this asymmetry of information results in a lack of pricing transparency, hurting both the seller and the buyer. The former is unable to price optimally in the market, and the latter cannot strategically assess pricing options across data service providers. According to Heckman, a more structured data market with standardized pricing models would improve the transaction experience for all parties. Indeed, recently we have seen the rise of data marketplaces,4 although we are still far from the adoption of standardized models.

A data marketplace (also called Data as a Service, or DaaS) is essentially a platform in which data sellers and data buyers connect to buy and sell data from each other. A typical data market comprises three main roles: data sellers, data buyers, and a data marketplace owner. Data sellers supply data to the data marketplace and set the corresponding prices.
