The CQ9 Beat

How CQ9 Thinks About Data

Published

Feb 1, 2022

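Algorithms play a central role in algorithmic trading, as you might have guessed from the name. The simplest mental model of an algorithm is a black box that takes in data as input and outputs trading decisions. As researchers, we spend a lot of time thinking about what goes on inside that box, which translates into a great deal of diversity in the inner workings of trading strategies.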

However, there is one thing that all algorithms have in common: if the data they ingest is fundamentally flawed, they are destined to fail. “Garbage in, garbage out” sums it up. In this post, I share the broad strokes of how we think about data quality and provenance in order to give our trading strategies a fighting chance to succeed.

What do we want from data?

There is an objective yardstick with which to measure any dataset: the improved performance of optimal trading strategies when they get to condition on the dataset vs. when they don’t. Unfortunately, applying the yardstick is expensive. It takes time and effort to build trading strategies, let alone find one good enough to merit the word “optimal.” We’d like to know which datasets are valuable so that we can focus on those rather than waste time on the useless ones. To that end, we apply heuristics to make initial judgements about how datasets will fare when scrutinized further.

Relevance

Let’s start with the most prominent heuristic: is there a reason to believe the data could be predictive of future asset prices? What makes it relevant? There are plenty of datasets we can confidently dismiss in this spirit: sunspot events, UFO sightings, etc. In fact, if we didn’t throw them out, our real discoveries could get lost in a vast sea of spurious correlations.
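To make the “sea of spurious correlations” concrete, here is a minimal sketch using only made-up noise. The helper names (`pearson`, `best_spurious_correlation`) and the sizes (50 observations, 1000 candidate datasets) are assumptions chosen for illustration, not anything we use in practice. It shows that if you test enough irrelevant datasets, the best-looking one will appear correlated with returns purely by chance.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_datasets, n_obs, seed=0):
    """Largest |correlation| between noise "returns" and noise candidate datasets."""
    rng = random.Random(seed)
    # Pure-noise stand-in for asset returns: nothing can truly predict it.
    returns = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_datasets):
        # Each candidate dataset is independent noise, i.e. irrelevant by construction.
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(candidate, returns)))
    return best


one = best_spurious_correlation(n_datasets=1, n_obs=50)
many = best_spurious_correlation(n_datasets=1000, n_obs=50)
print(f"best |corr| among 1 noise dataset:     {one:.2f}")
print(f"best |corr| among 1000 noise datasets: {many:.2f}")
```

The more irrelevant candidates you screen, the more impressive the best accidental correlation looks, which is why a relevance filter up front is worth having.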

These fanciful examples are clear-cut, but most datasets we’re considering aren’t quite as obvious. For example, say you have a dataset of all Reddit posts and comments. Should that dataset be relevant to pricing US stocks? Maybe? Good researchers develop simplified mental models of how the world works that help them answer questions like this one. The best researchers, in addition to being technically strong, continually refine their mental models as they make sense of disparate observations.

Uniqueness

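Our job at CQ9 is to model asset prices and help markets reflect the “correct” price. To that end, we compete with other trading firms and sophisticated investors who have the same goal. If a dataset is already widely used by our competitors, we can’t expect to add much value by using the same data in the same way. What does this mean for evaluating datasets? Simple datasets that are widely used, even if unquestionably relevant, probably won’t fare well.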

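For example, the price-to-earnings (P/E) ratio is a widely reported company metric. If I search Google for “Apple stock,” Google displays Apple’s P/E ratio. The P/E ratio is the share price divided by earnings per share, and roughly speaking, it captures how “expensive” the company is. It’s incredibly relevant to investors. Yet, if you were to build a trading strategy that uses only the P/E ratio, you would probably be underwhelmed by its performance given how important the metric is. We could say it’s already been “priced in,” which is to say that this is information so many people have extracted that it doesn’t provide our strategies a competitive advantage.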

The most exciting datasets are novel or complex enough that we can hope to extract something from them that our competitors do not.

Avoiding lookahead

Let’s think through a hypothetical. Say you’re approached by a newspaper which has a new data product targeted at investors: Market Moving News Articles™ (MMNA). The newspaper recently realized that most of its articles are irrelevant to investors, and it wants to add value by delivering *tagged* articles. In addition to the regular article content (text, time of publication, author’s name, headline, etc.), whenever a new article is published, subscribers of MMNA will get three extra fields annotated by the newspaper’s stock analyst: affected_stock_ticker, is_market_moving, and is_good_news. They generously offer you a trial of five years of historical data. You backtest a simple strategy of buying (selling) the stocks with good (bad) market-moving articles and it looks great. What’s wrong with this picture?

How did they tag five years’ worth of articles if they only recently had the idea to tag articles? If they had the foresight five years ago to start tagging articles in real time, then there’s nothing wrong. But let’s say they didn’t. Instead, they had the analyst go through past articles and tag them retroactively. This, though unintentional, is incredibly dangerous! There are all sorts of ways in which the analyst might be benefitting from hindsight to make better annotations than would have been made in the moment. The MMNA historical dataset is effectively unusable.

Lookahead can manifest in subtle ways. It’s imperative as researchers to stay vigilant and understand the provenance of the data we work with to dodge these traps.
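One way to act on this is a point-in-time check, sketched below. It assumes the vendor can supply a timestamp for when each annotation was written; the `tagged_at` field, the `TaggedArticle` class, the five-minute tolerance, and the sample records are all hypothetical and are not part of the MMNA product as described. If no such timestamp exists, the provenance of the tags simply cannot be verified this way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TaggedArticle:
    affected_stock_ticker: str
    is_market_moving: bool
    is_good_news: bool
    published_at: datetime
    tagged_at: datetime  # hypothetical field: when the analyst wrote the tags


def point_in_time_safe(articles, max_tag_delay=timedelta(minutes=5)):
    """Keep only articles whose tags existed shortly after publication."""
    return [
        a for a in articles
        if timedelta(0) <= a.tagged_at - a.published_at <= max_tag_delay
    ]


articles = [
    # Tagged two minutes after publication: plausibly a real-time annotation.
    TaggedArticle("AAPL", True, True,
                  datetime(2019, 3, 1, 9, 30), datetime(2019, 3, 1, 9, 32)),
    # Tagged years later: the analyst could have seen the price reaction.
    TaggedArticle("MSFT", True, False,
                  datetime(2018, 6, 5, 14, 0), datetime(2021, 11, 20, 10, 0)),
]
safe = point_in_time_safe(articles)
print([a.affected_stock_ticker for a in safe])  # ['AAPL']
```

A backtest restricted to `safe` only sees annotations that could have existed at decision time. In the retroactive-tagging scenario above, that filter would leave essentially nothing, which is the point.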

Sample size and noise

Otherwise useful data can be disqualified if there is too little of it or it is too noisy. Thanks to simplifying assumptions and statistics, we can usually get a reasonable idea of whether this will afflict us even without looking at the dataset.

The US Bureau of Labor Statistics (BLS) publishes closely watched monthly statistics on employment, such as the unemployment rate. Say a hypothetical survey company constructs a high-fidelity random sample one day before the BLS release to try to independently estimate the unemployment rate. However, constructing true random samples is expensive, so they only survey 250 people (and report the fraction that are unemployed). They are offering to sell you the survey result for $X, but won’t show you their data before you pay them. Are you interested? (Assume you believe their methodology is airtight.)

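The best tool for answering this kind of question is some rough statistics. How much uncertainty is there in the new unemployment rate that the BLS will publish? We can put an upper bound on it by fitting a simple autoregressive model to the unemployment rate. Let’s say our model has prediction errors with a standard deviation of 0.5 percentage points. What is the standard error of the proportion in the survey? Modeling the survey as a binomial and assuming the unemployment rate is around 5%, it is sqrt(0.05 × 0.95 / 250) ≈ 1.4 percentage points.

If we assume the survey is an unbiased estimator for the BLS figure and it’s uncorrelated with our autoregressive model’s errors, we can combine the two estimates with inverse-variance weights. This means we can reduce the standard deviation of our prediction error from 0.5 to 1 / sqrt(1/0.5² + 1/1.4²) ≈ 0.47 percentage points.

We could then use some further back-of-the-envelope math to conclude whether our trading strategy will benefit from an additional 0.03 percentage points of predictiveness on the unemployment rate by more than the cost of the data (we’d need some model of the sensitivity of asset prices to the unemployment rate).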
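The arithmetic in this example can be sketched in a few lines. The inputs are the figures from the text (250 respondents, roughly 5% unemployment, 0.5 percentage points of model error); combining the two estimates with inverse-variance weights is the standard approach for unbiased, uncorrelated estimators, and the variable names are just illustrative.

```python
import math

n_surveyed = 250      # survey sample size, from the example
p_unemployed = 0.05   # assumed unemployment rate of about 5%
model_sd = 0.5        # autoregressive model error, in percentage points

# Binomial standard error of a sample proportion, converted to percentage points.
survey_se = 100 * math.sqrt(p_unemployed * (1 - p_unemployed) / n_surveyed)

# Inverse-variance weighting of two unbiased, uncorrelated estimates.
combined_sd = 1 / math.sqrt(1 / model_sd**2 + 1 / survey_se**2)

improvement = model_sd - combined_sd
print(f"survey standard error: {survey_se:.2f} pp")    # 1.38 pp
print(f"combined error sd:     {combined_sd:.2f} pp")  # 0.47 pp
print(f"improvement:           {improvement:.2f} pp")  # 0.03 pp
```

Because the survey is almost three times noisier than the model, it receives little weight, which is why 250 respondents buy so small an improvement.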

Where do we get data from?

We’ve discussed how to assess the potential of a dataset, but where do we actually get data from?

Market data

Financial markets generate a prodigious amount of data, primarily in the form of orders and trades placed on exchanges. The data is unquestionably relevant and complex; in other words, it checks all the boxes for data we expect to be useful.

Let’s consider one possible use of market data. Companies don’t exist in a vacuum; they are all related to one another in various ways. A chip-maker sells to a phone-maker; an app-maker sells on the phone’s platform; an advertiser buys ads on the app; the list of relationships goes on. Stock prices have to somehow reflect these relationships. A simple model of these relationships is that similar stocks should have similar returns. Thus, you could conceivably use the data of AAPL stock returns to predict, say, MSFT stock returns. Of course, things can get far more sophisticated than this basic “pairs trading.”
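As a toy illustration of using one stock’s returns to predict another’s, the sketch below regresses a “MSFT” return on the previous period’s “AAPL” return. Everything here is an assumption made for the example: the returns are synthetic noise, the lead-lag relationship (`true_beta`) is planted by hand, and the helper `ols_slope_intercept` is a plain least-squares fit. Real market data will not look this clean, and this is not a description of any actual strategy.

```python
import random


def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(42)
true_beta = 0.3  # planted lead-lag relationship; an assumption of this toy example

# Synthetic per-period returns standing in for AAPL; real data would replace these.
aapl = [rng.gauss(0, 0.01) for _ in range(5000)]
# Synthetic MSFT returns: each one depends on the *previous* AAPL return plus noise.
msft = [0.0] + [true_beta * a + rng.gauss(0, 0.01) for a in aapl[:-1]]

# Regress MSFT's return in period t on AAPL's return in period t-1,
# so the predictor is always known before the quantity being predicted.
slope, intercept = ols_slope_intercept(aapl[:-1], msft[1:])
print(f"estimated beta: {slope:.2f}")
```

Note the explicit one-period lag: regressing same-period returns on each other would measure co-movement, not predictability.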

Alternative data

Widespread computer usage has made it easier to collect granular data about all sorts of things that aren’t securities but could be related. This type of data is the hardest to describe because the datasets are proprietary and no two datasets are exactly alike. For example, you can purchase foot-traffic data for various stores, capture Twitter streams, or even track wind patterns in various geographies.

It is worth noting a recurrent commonality in this type of data: the relationships can be delightfully creative and/or surprising! To give an example that I haven’t personally verified but that sounds plausible: there is evidence to suggest that angry reviews claiming Yankee Candles are scentless coincide with surges in COVID-19 prevalence.

Data Transformations

This may seem like cheating, but data transformations can sometimes so fundamentally change a dataset that in effect you’ve made a new dataset.

Say you start with a dataset of all trades in US stocks. As is, this data is already plenty interesting. But let’s consider a transformation that changes its character. Roughly speaking, US regulation (SEC Rule 612) prohibits exchanges from quoting prices in increments smaller than one cent while allowing sub-penny price improvement (subject to some limitations). This sub-penny price improvement is particularly characteristic of wholesalers who provide these small “discounts” to retail customers. Thus, if you transform the data into the fraction of trades executed at sub-penny prices, you get a proxy for retail participation across stocks and time.
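A minimal sketch of that transformation might look like the following. The trade records, prices, and function names are made up for illustration, and the sketch ignores the limitations mentioned above; it simply flags any trade price that is not a whole number of cents. Prices are held as `Decimal` so that the sub-penny test is not confused by floating-point rounding.

```python
from collections import defaultdict
from decimal import Decimal


def is_sub_penny(price):
    """True if a dollar price is not a whole number of cents."""
    return (price * 100) % 1 != 0


def sub_penny_fraction(trades):
    """Map (ticker, date) -> fraction of trades executed at sub-penny prices."""
    counts = defaultdict(lambda: [0, 0])  # key -> [sub-penny trades, all trades]
    for ticker, date, price in trades:
        counts[(ticker, date)][0] += is_sub_penny(price)
        counts[(ticker, date)][1] += 1
    return {key: sub / total for key, (sub, total) in counts.items()}


# Hypothetical trades: (ticker, trade date, execution price in dollars).
trades = [
    ("AAPL", "2022-01-03", Decimal("182.0100")),
    ("AAPL", "2022-01-03", Decimal("182.0125")),
    ("AAPL", "2022-01-03", Decimal("182.0200")),
    ("AAPL", "2022-01-03", Decimal("182.0199")),
    ("MSFT", "2022-01-03", Decimal("334.7500")),
]
fractions = sub_penny_fraction(trades)
print(fractions)
# {('AAPL', '2022-01-03'): 0.5, ('MSFT', '2022-01-03'): 0.0}
```

The output is a new panel indexed by stock and day, which is a different object from the trade-level data it was built from.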

There are all sorts of research questions you can ask of a dataset of retail participation that you couldn’t directly ask of a dataset of trades. New data!

Parting thoughts

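This was just a short, high-level overview of how we think about data at CQ9 and its profound impact on the quality of our strategies. It’s worth noting that many important things weren’t covered here. For example, good engineering practices are vital both for researching data and for using it in production, particularly when the datasets get large. There are also many nuances to the different types of data (time series, textual data, etc.). If these kinds of problems interest you, check out our current openings!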