Data curation for data-hungry models
Data curation, collection, and cleansing may be the unglamorous part of being a quant, but the value of doing it right is observable in model performance. Saeed Amen, Founder, Cuemacro, explores this challenge in a handful of scenarios and looks at possible solutions.
What’s the most important thing for quants: the data or the model? I don’t think it is possible to answer this question for all cases. What is clear, however, is that if we want to train a model, we need to have data. The more complex the model, the more data-hungry it is likely to be.
So how do we curate our dataset? For something like an LLM, we need massive amounts of data (or we might choose to fine-tune an existing LLM). It’s less a case of careful curation, and more a case of ingesting as much data as we can. Obviously, over time this might prove more problematic, as more of the content on the internet is itself generated by LLMs.
However, in many other cases, we do not have such huge amounts of data, and we instead need to brainstorm to identify the very particular types of data that could be relevant. If we are forecasting time series, we essentially have a y variable which we are looking to forecast and x variables which are our inputs.
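As a minimal sketch of that setup (the series names and lag choices below are made up for illustration, not anything specific to this article), we can construct the x variables as lagged versions of candidate inputs and align them with the y variable we want to forecast:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: a target series we want to forecast (y) and two
# candidate input series (the x variables)
idx = pd.date_range("2020-01-01", periods=500, freq="B")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "target": rng.normal(size=500).cumsum(),
    "input_1": rng.normal(size=500).cumsum(),
    "input_2": rng.normal(size=500).cumsum(),
}, index=idx)

# Build X from lagged inputs, so we only use information available at time t
lags = [1, 2, 5]
X = pd.concat({f"{col}_lag{lag}": df[col].shift(lag)
               for col in ["input_1", "input_2"] for lag in lags}, axis=1)
y = df["target"]

# Drop the rows made incomplete by lagging before fitting any model
data = pd.concat([y, X], axis=1).dropna()
```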
The dataset we curate will depend on our particular use case and requires domain-specific knowledge. Let’s take, for example, forecasting high frequency FX markets. It isn’t my area of expertise, but I think we can all agree that high frequency FX data itself is likely to be a good first input. We might also consider ingesting data from associated markets, such as bond and equity markets. After all, market moves do not occur purely in a vacuum. We might also consider other high frequency alternative data, such as newswire data, which can be useful for explaining market moves. We might have more specific models that are useful for explaining particular points in time, e.g. forecasts around data releases. Is there any point adding very low frequency data to explain the next tick? Probably not. We will also likely face limits on the complexity of the model we employ, given the constraints on execution times.
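To make that concrete, here is a rough sketch of aligning high frequency FX ticks with another market and a newswire feed on a common time grid; the file names, column names and one-second grid are assumptions purely for illustration:

```python
import pandas as pd

# Assumed tick-level files with a "time" column and a "mid" price column
fx = pd.read_csv("eurusd_ticks.csv", parse_dates=["time"]).set_index("time")
bonds = pd.read_csv("bund_futures_ticks.csv", parse_dates=["time"]).set_index("time")
news = pd.read_csv("newswire.csv", parse_dates=["time"])  # e.g. headline sentiment scores

# Resample each market to a common 1 second grid, taking the last observed price
grid = pd.concat({
    "eurusd": fx["mid"].resample("1s").last(),
    "bund": bonds["mid"].resample("1s").last(),
}, axis=1).ffill()

# Attach the most recent news observation at or before each grid point
grid = pd.merge_asof(grid.reset_index(), news.sort_values("time"),
                     on="time", direction="backward").set_index("time")
```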
If our problem is forecasting variables at a lower frequency horizon, let’s say an economic variable, we are likely able to curate a far larger dataset, given that a larger array of data is available at lower frequencies. Our dataset is likely to vary significantly between countries, and how we curate it will depend on each economy. If we are trying to model economic variables in New Zealand, we may want to include variables related to dairy markets, given that dairy is such a large export. In Chile, we are likely to want to include variables that describe the copper market, which is an important part of Chile’s GDP.
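One simple way to capture this country-specific curation is a configuration that combines variables common to every economy with variables particular to each one; the variable names below are illustrative only:

```python
# Illustrative mapping of variables to include when curating a dataset for
# lower frequency economic forecasting (names are made up for this sketch)
country_variables = {
    "common": ["cpi", "unemployment_rate", "pmi", "policy_rate"],
    "NZ": ["dairy_auction_prices", "nzd_trade_weighted_index"],
    "CL": ["copper_futures", "mining_production"],
}

def curate_variable_list(country_code):
    """Combine variables common to every economy with country specific ones."""
    return country_variables["common"] + country_variables.get(country_code, [])

print(curate_variable_list("NZ"))
# ['cpi', 'unemployment_rate', 'pmi', 'policy_rate',
#  'dairy_auction_prices', 'nzd_trade_weighted_index']
```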
Once we have selected the variables to be included in our dataset, we need to collect the data. Some data might be available both from data providers and from public sources, such as the websites of national statistics agencies. Even if data is available on the web, that does not necessarily mean it is “easier” to obtain by web scraping. In practice, it might be more cost effective from a maintenance perspective to pay a data provider for it (e.g. they will deal with the complexity of changing APIs, websites etc.). Other datasets might only be available from specific data providers, because the data is not published publicly. Even here we might have a choice, because we may need to select from a myriad of different providers, and we then need to evaluate which one to choose based on factors such as the quality of the data.
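As an example of the public-source route, the sketch below pulls a series from FRED via the pandas-datareader library; the series code is simply a placeholder, and a commercial data provider’s API would be a drop-in alternative that shifts the maintenance burden onto them:

```python
from datetime import datetime

import pandas_datareader.data as web

start, end = datetime(2010, 1, 1), datetime(2023, 12, 31)

# US GDP from FRED (a public source); the series code here is a placeholder
# you would swap for those in your own curated variable list
gdp = web.DataReader("GDP", "fred", start, end)
print(gdp.tail())
```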
Of course, only once we’ve curated our dataset, collected the data and cleaned it, can we actually begin with the more “exciting” parts of modelling. However, if we neglect the data curation step, it will become far more challenging for our model to perform well.
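For completeness, here is a minimal sketch of the kind of cleaning that typically precedes the modelling step; the thresholds and fill limits are assumptions you would tune to your own data:

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few common cleaning steps before modelling."""
    df = df[~df.index.duplicated(keep="last")]  # remove duplicate timestamps
    df = df.sort_index()                        # ensure chronological order
    # Clip extreme outliers (e.g. bad prints) at the 0.1% tails of each column
    df = df.clip(lower=df.quantile(0.001), upper=df.quantile(0.999), axis=1)
    df = df.ffill(limit=5)                      # fill short gaps only
    return df.dropna()                          # drop anything still missing
```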