Insight

The best python tools to analyze alternative investment data for $0

Posted by Norman Niemer on 09 October 2018

Share this article

Alternative data such as credit card transaction and web data is becoming an integral part of the investment process. Building an alternative data effort can be expensive but fortunately unlike buying datasets and hiring people, the technology infrastructure required to analyze alternative data can be acquired at minimal to zero cost. Here, Norman Niemer, Chief Data Scientist at UBS Asset Management (QED)*, explores the best tools to analyse alternative investment data.

These free tools will allow you to analyze alternative data without having to spend a lot, and can be used individually to perform a particular function or together to form a comprehensive technology stack to operate your entire data effort. The list walks you through the alternative data analysis process step-by-step. For each step, you have the key tools including links - the final product will give you the resources needed to analyze alternative data at a cost of $0!

Overview of the alternative data analysis process

Get data: copy data from the vendor to where you want to analyse data
Ingest data: you may need to manipulate raw data for it to be loaded correctly
Load data: import data into the analysis package
Preprocess data: clean, filter and transform data for it to be ready for modeling
Modeling: apply your analysis, make predictions and draw conclusions
Presenting: present your insights and conclusions in a digestible format

Step	Goal	Tool
Get
	AWS S3 file download	S3 browser (graphical)AWS CLI (programmatic)
	FTP file download	Winscp (graphical)Pyftpsync (programmatic)
	Vendor API data download	Postman (graphical)Requests (programmatic)
	Corporate proxy authentication	Cntlm
	Manage data pipelines	Luigi / Airflow / d6tpipe
Ingest & Load
	Read CSV, Text an Excel files directly without ingesting	Pandas / Dask / Pyspark
	Quickly ingest any type of CSV or Excel data	d6tstack
	Store data in a database	Postgres / Mysql / Others (database engines)SQuirrel / HeidiSQL (GUI tools)
Preprocess
	Prepare data to fit into prediction model	Pandas / Dask / Pyspark
	Join datasets even if they don't perfect match	d6tjoin
	Statistical data vizualisation	seaborn
	Handle missing values	fancyimpute
Modelling
	Basic statistical analysis	Pandas
	Advanced statistical analysis	Statsmodels
	Machine Learning	Sklearn
	Bayesian modelling in python	Pymc3
	Probabilistic modelling in python	Pomegranate
Presentation
	Summarize your results in static charts	Pandas visualization
	Interactive reports and dashboards	Rstudio markdown (reports, includes python support)Rshiny (dashboards)
	Interactive visualizations	Plot.ly
	Interactive visualizations, alternative	Bokeh
	Turn jupyter notebooks into slideshows	Jupyter present

Step 1. Getting Data

Why should I care?

So you’ve found a dataset on alternativedata.org, the vendor gave you access and you are ready to leverage alternative data! The very first step to analyzing alternative data is to get your hands on the data.

Help, I’m a total beginner! How how do I install python and how do I learn how to use it?

We will be using python a lot for many tasks. If you’ve never used python before, we recommend you install Anaconda python and learn to code in python using many of the excellent free/affordable resources such as codecademy and Coursera. Note that for getting data you can get by without python if need be but you will definitely need it later on when you want to process and analyze data. For additional details see alternativedata.org step 1.

Step 2 & 3. Ingesting and Loading Data

Why should I care?

In step 1 you obtained data from vendors for you to do your analysis. The data may arrive in various formats and you may need to manipulate the raw data for it to be loaded correctly.

What are pandas, dask, pyspark? How do I choose between them?

Those are core python data analytics libraries. They all have some pros/cons, mostly related to the size of data you are analyzing. An overview of each is given below. Also note that there are graphical and commercial alternatives available for data preparation, data integration and data analysis.

Pandas

Overview: THE python data analysis library. If you’re just getting started, stick with this it take you a long way
Pros: Widely used with lots of support and many extensions available
Cons: Likely to break when you start analyzing >=1GB files

Dask

Overview: built on top of pandas to make it scale for large datasets. The next best thing after you’ve pushed pandas to its limits
Pros: Good for analyzing large files and intermediate multi-core/distributed computing
Cons: Missing some pandas functionality and works slightly differently from pandas

Pyspark

Overview: once you start analyzing really large datasets TB+ datasets pyspark gives you fully distributed computing
Pros: Scales really well and integrated into Hadoop ecosystem
Cons: Steeper learning curve and infrastructure setup and more expensive

For additional details see alternativedata.org step 2-3.

Step 4. Preprocess: clean, filter and transform data for it to be ready for modelling

Why should I care?

Everybody knows - or should know - the maxim of data analysis: garbage in, garbage out! Before you start feeding data into a prediction model, you need to explore the data and likely clean and transform it for the data to be appropriate to feed into a prediction model.

What are common problems and how can I solve them?

Look-ahead bias: for your model to work in practice you can't be making predictions with data that was not available at the time of decision. Say for example a company reports 2018-01-08 and an alt data provider has data for 2017-12-31 but the data gets published with a 2 week delay, you couldn't have used it for making a prediction in your backtest. Use asof joins to address avoid this bias.
Missing values: They are obvious when present because they likely break your prediction code so you have to deal with them somehow. The simple solution is to just drop them but alternative datasets often have short histories and therefore simply dropping missing values can significantly impact model training and/or skew results. Inspect what you will be dropping and why it is missing, perhaps there is a fix or try to fill them intelligently
Outliers: unlike missing values they likely won’t break your prediction library but skew your predictions without you even noticing! Alternative datasets often have short histories, combining that with noisy financial data means you should carefully inspect your prediction input data
Stationarity: prediction models often don’t produce good results when you data is not “stationary”. Practically that means you want to model changes instead of levels, for example it’s better to model sales growth than it is to model sales
Seasonality: especially consumer companies have high seasonality, eg over Christmas. If possible work with yearly changes instead of sequential changes

Common preprocessing operations

You’re really getting into the weeds now with the data analysis libraries. To get you started we suggest working with pandas, here are some resources:

10 Minutes to pandas: good overview into what pandas can do
Indexing and Selecting Data: how to filter out the data you are interested in from raw data
Group By: split-apply-combine: perform the same computation for different groups, eg calculate an industry mean or generate predictions for all companies in the vendor universe
Merge, join, and concatenate: combine different datasets
Reshaping: reshape your data, eg transpose from fiscal quarters going sideways to quarters going down

See documentation and the pandas book for full details.

Combining datasets

Almost always you need to combine different datasets to make a prediction, eg vendor data with financial statement data or multiple vendors. There are three types of joins:

Exact joins: say you have a dataframe with data from a vendor but want to analyze it by industry but the vendor didn't include industry data. You can merge another dataset with industry data on company ticker using pd.merge() if the ticker format matches exactly
Fuzzy joins: say you want to combine data from two vendors but they use different ticker (“ABC US Equity” vs “ABC-US”) or date conventions (business vs calendar days). To find the closest match, you can use d6join top1 with a variety of similarity functions
Asof join: say you have a set of company reported kpis with announcement dates and want to know the last known data point before announcement, use pd.merge_asof(direction : ‘backward’). Alternatively say you want to know the first trading price after some announcement including weekends etc, use pd.merge_asof(direction : ‘forward’)

Many things can go wrong when joining datasets, we suggest your use d6tjoin prejoin prior to running joins to assess join quality and analyze join problems.

Step 5. Modelling: apply your analysis, make predictions and draw conclusions

Why should I care?

Finally! This is the step where you predict an important investment metric, such as returns or financials using alternative data. Because you’ve diligently ingested and cleaned the data, you’ve maximized your chances of making meaningful and accurate forecasts.

What are some unique challenges modeling alternative datasets and how do I address them?

Short history: since many of those datasets just weren’t available before, they have a short history. Use simple models on a small set of predictors to avoid overfitting and improve out-sample performance.
Limited breadth: quants might be used to datasets that cover 1000+ names. Just because a dataset has limited breadth doesn’t mean it can’t be really valuable
Lots of datasets: with more datasets coming out every day, it is tempting to try to throw everything in and see what happens. This might work well when modeling but often leads to bad decisions in the future. Use cross-validation, point-in-time out-sample testing and your economic intuition to avoid overfitting and making bad investment decisions in the future

Step 6. Presenting: present your insights and conclusions in a digestible format

Why should I care?

Many senior stakeholders, as well as fundamental investment analysts, prefer simple messages with a key insight backed up by data. They might also want to interact with the data in an intuitive way. Less is more here, follow Einstein’s advice: “Everything should be made as simple as possible (, but not simpler)”.

Next Steps: Sample Code and Tutorials: Additional resources including sample code and hands-on tutorials are available on github.

*Disclaimer: These materials, and any other information or data conveyed in connection with these materials, is intended for informational purposes only. Under no circumstances are these materials, or any information or data conveyed in connection with such report, to be considered an offer or solicitation of an offer to buy or sell any securities of any company. Nor may these materials, or any information or data conveyed in connection with such report, be relied on in any manner as legal, tax or investment advice. The facts and opinions presented are those of the author only and not official opinions of UBS.

The best python tools to analyze alternative investment data for $0

Share this article

Overview of the alternative data analysis process

Step 1. Getting Data

Step 2 & 3. Ingesting and Loading Data

Pandas

Dask

Pyspark

Step 4. Preprocess: clean, filter and transform data for it to be ready for modelling

Step 5. Modelling: apply your analysis, make predictions and draw conclusions

Step 6. Presenting: present your insights and conclusions in a digestible format

Share this article

Beware the hardware in quant finance with Sarina Sit, AMD

Python in quant finance: How to speed it up to compete with C++?

QuantMinds International

Related Articles

Beware the hardware in quant finance with Sarina Sit, AMD

Python in quant finance: How to speed it up to compete with C++?

Related Event

QuantMinds International

Sign up for Quant Finance email updates

Quick Links