What do you use to build your machine learning framework? Erdem Ultanir, Quantitative Credit Risk Analytics Lead at Barclays, says Python is still the preferred programming language by far, but R, Julia, Spark and Scala are worth considering. Ultanir explains why these other languages can be relevant to quants and evaluates the various properties they offer.
Python is the language of choice for machine learning projects, with many programmers favouring its familiarity, its constructs and its object-oriented approach.
But how does modelling in Python compare with other new open source data libraries? And which are the easiest to use and to understand? Erdem Ultanir, Quantitative Credit Risk Analytics Lead at Barclays, interrogated the options in his presentation at QuantMinds International in Vienna.
“Machine learning is not just an optimisation problem, the real world is different,” he told delegates at the conference. It is not just about programming: it is also about getting the data, visualising it and explaining the results. “A data scientist needs significant knowledge in these areas and we will look at the full tool set.”
Data scientists and engineers are grappling with the best ways to utilise and scale machine learning for analytics applications in quant. While Ultanir said he primarily uses Python and his analysis of the top machine learning projects showed that almost all of them were written in Python, other languages, like R, Julia, Spark and Scala, offer different properties and are worth considering.
The dominance of Python in areas including reinforcement learning and natural language processing is “astonishing,” he said. “R seems to be lagging behind. When people are talking about machine learning, they are primarily talking about Python.”
In addition, GitHub statistics show that there is “a clear domination of Python.”
Wes McKinney, an American software developer, created the open-source pandas package for data analysis in Python when he was working at AQR Capital Management. It has an R-style data frame and can be useful, but it uses memory very inefficiently and creates temporary tables that take up additional memory, Ultanir said.
“It’s an overall bad situation for large datasets,” he said. “If you come to the limit of your machine, you will have problems.”
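The memory pressure described above can be measured directly. The following is a minimal sketch, not something shown in the presentation: it builds a hypothetical table (the column names `rating` and `exposure` and the row count are assumptions made for illustration), reports its footprint with `memory_usage(deep=True)`, and then applies two common mitigations, a categorical dtype for repeated text and a narrower float type.

```python
import pandas as pd

# Hypothetical loan book: a low-cardinality text column and a numeric column.
n = 200_000
df = pd.DataFrame({
    "rating": ["AAA", "AA", "A", "BBB"] * (n // 4),
    "exposure": [1000.0] * n,
})

def megabytes(frame: pd.DataFrame) -> float:
    """Memory footprint in MB, including the contents of string values."""
    return frame.memory_usage(deep=True).sum() / 1e6

before = megabytes(df)

compact = df.copy()
# Store each distinct rating once and keep small integer codes per row.
compact["rating"] = compact["rating"].astype("category")
# Downcast 64-bit floats to 32-bit where the values fit.
compact["exposure"] = pd.to_numeric(compact["exposure"], downcast="float")

after = megabytes(compact)
print(f"before: {before:.1f} MB, after: {after:.1f} MB")
```

How much is saved depends on the data and on the pandas version, so the sketch prints the figures rather than promising a particular ratio.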
A big deficiency of Python is loading and dealing with large data tables. Pandas’ speed can be improved through parallelisation: with minimal code changes, it is possible to run the code in parallel and take advantage of the available processing power.
R offers a similar take on the same thing in its data.table package, which can be used for fast aggregation of large data sets. It offers fast primary indexing and automatic secondary indexing. It is memory efficient, and it allows rolling joins and non-equi joins in a single line of code.
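A rolling join matches each row to the most recent row in another table rather than requiring an exact key match, which is how trades are typically paired with the prevailing quote. For readers working in Python, pandas exposes a comparable operation as `merge_asof`. The sketch below uses made-up quotes and trades (the ticker, timestamps and prices are illustrative assumptions), and it shows the pandas analogue rather than the R syntax discussed in the talk.

```python
import pandas as pd

# Hypothetical quotes and trades; both tables are sorted by time, as merge_asof requires.
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-02 09:00:00",
                            "2020-01-02 09:00:05",
                            "2020-01-02 09:00:10"]),
    "ticker": ["ABC", "ABC", "ABC"],
    "bid": [100.0, 100.5, 101.0],
})
trades = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-02 09:00:03",
                            "2020-01-02 09:00:07",
                            "2020-01-02 09:00:12"]),
    "ticker": ["ABC", "ABC", "ABC"],
    "qty": [10, 20, 30],
})

# Rolling (as-of) join: each trade picks up the latest quote at or before its timestamp,
# matched within the same ticker.
joined = pd.merge_asof(trades, quotes, on="time", by="ticker")
print(joined)
```

Each trade here inherits the bid from the quote immediately preceding it, so the three trades pick up bids of 100.0, 100.5 and 101.0 respectively.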
R is slightly better in the area of data visualisation, he said, thanks to its ggplot2 package, which is based on the grammar of graphics. While Python is improving, the standard visualisations use matplotlib, which is quite old. Some programmers layer other libraries on top of matplotlib to improve the output, but switching between the different formats can be difficult.
On the algorithm side, Python relies on packages. Scikit-learn offers simple and efficient tools for data mining and data analysis, and compares well to caret in R. One key difference is that R has more data analysis built in.
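To give a sense of what a standard scikit-learn workflow looks like, here is a minimal sketch that trains and scores a classifier. The data is synthetic (generated with `make_classification` as a stand-in for a real credit dataset), and the choice of logistic regression and its settings are assumptions made for illustration, not a recommendation from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for, say, default / no-default labels.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    class_sep=2.0,
    random_state=0,
)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns mean accuracy on the held-out rows.
accuracy = model.score(X_test, y_test)
print(f"hold-out accuracy: {accuracy:.3f}")
```

Other scikit-learn estimators expose the same `fit` and `score` interface, so swapping the model is usually a one-line change.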
“What is striking is that if you want to do coding, with Python you need to write a lot more lines,” he said. “It requires more code to get the performance.”
TensorFlow is a symbolic math library that is also used for machine learning applications such as neural networks. It is not really a ready-made set of machine learning algorithms.
“This is primarily a gradient descent optimisation tool that’s doing well because it is backed by Google,” Ultanir said. “Keras on the other hand is user friendly, since you can build neural nets one layer at a time.”
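For readers less familiar with the term, gradient descent repeatedly nudges a model's parameters in the direction that reduces its error. The sketch below is plain Python and does not use any TensorFlow or Keras API: it fits a straight line to toy data by computing the gradients by hand, which is the step such libraries automate. The data, learning rate and step count are assumptions chosen for illustration.

```python
# Toy data generated from y = 2x + 1 (hypothetical, for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0        # slope and intercept to be learned
learning_rate = 0.05
n = len(xs)

for step in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Move each parameter a small step against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

On this data the loop recovers a slope close to 2 and an intercept close to 1. A neural network applies the same update rule to many more parameters.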
Apache Spark is preferred for big data processing, because it works faster by caching data in the memory across multiple parallel operations, Ultanir said.
“What is interesting about Spark is it is not written in R or Python, it is written in another language, Scala,” he said. “If you want to get the full benefit of Spark you will need to learn this language.”
Julia is another language with interesting features, he said. While it is definitely faster than Python, it is so young that there is less of a community and it is not object-orientated.
Which is best?
So which one should tomorrow’s data programmers and quants be using to get the best results? Ultanir said Python and R are quite comparable, while Julia and Scala are far behind in terms of available packages for machine learning, visualisation and data manipulation.
He advised sticking with Python for complex projects and neural networks, although he said for simpler and standard machine learning tasks, R could result in faster solutions.
“We prefer to use a single language for each project,” he concluded. “That’s our motto.”