Tag Archives: Microsoft

Capturing, Storing and Backtesting CTS/CQS Tick Data in C#/.NET

Tick data is the lifeblood of the capital markets. Unlike order book data, which can be stuffed, stale, and away from the inside market in the majority of cases, tick data represents actionable quotes and transpired trades that can be regarded as the “principal components” of capital market data. Within tick data, one can measure volume, quoting frequency, spreads, VWAPs, moving averages, volatility, and so forth. This article therefore emphasizes the capture and analysis of tick data as opposed to order book information, which can be loosely defined as orthogonal in certain respects.

There was once a time when even the attempt to capture and record tick data, specifically the CTS/CQS “tape” from the U.S. equity markets, was a sophisticated process involving a team of individuals. Even more highly regarded was the replay/analysis/backtesting of the tick data. This was often conducted only in the realm of investment banks or hedge funds.

I want to briefly describe how I record, store, analyze, and backtest the “tape” easily and efficiently each day as part of my model construction and trading strategy deployment.

On average, the CTS/CQS feed produces about 30GB of information per day, plus or minus a few GB depending on precisely which fields are stored. I attempt to store everything (condition codes and such), so my files tend to be a little larger. I receive the tick data through multicast UDP, and I immediately fire an event that strips it off the network buffer and throws it onto a separate queue in memory. This is so as not to lose data during periods of intense volume (the open, the close, Fed announcements, and so forth). Once a tick is in my in-memory queue, I write it to disk, represented as either a trade or a quote. I use a common class to represent both trades and quotes, as the two share a lot of useful characteristics.
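As a minimal sketch of the hand-off described above (not the actual production code), the receiving side can push ticks onto a thread-safe queue while a background task drains it. The `Tick` fields, the `TickKind` enum, and the `TickPipeline.StartWriter` helper are all hypothetical names chosen for illustration, and the network receiver itself is left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical common representation for both trades and quotes.
public enum TickKind { Trade, Quote }

public sealed class Tick
{
    public TickKind Kind { get; set; }
    public string Symbol { get; set; }
    public DateTime Timestamp { get; set; }
    public decimal Price { get; set; }      // trade price, or bid price for a quote
    public decimal AskPrice { get; set; }   // quotes only
    public int Size { get; set; }           // trade size, or bid size for a quote
    public int AskSize { get; set; }        // quotes only
    public string Conditions { get; set; }  // condition codes, kept verbatim
}

public static class TickPipeline
{
    // Drains the queue on a background task and hands each tick to 'write'.
    public static Task StartWriter(BlockingCollection<Tick> queue, Action<Tick> write)
    {
        return Task.Run(() =>
        {
            // GetConsumingEnumerable blocks until items arrive and ends
            // once CompleteAdding has been called and the queue is empty.
            foreach (Tick tick in queue.GetConsumingEnumerable())
                write(tick);
        });
    }
}
```

The receiver thread only calls `queue.Add(tick)`, which returns quickly, so a slow disk does not stall the network side. Calling `queue.CompleteAdding()` at the end of the session lets the writer task finish cleanly.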

I begin recording at 09:00 each day (for the possibility of algorithmic “pre-market” analysis), and stop at 16:20. The roughly 20-30GB files are then compressed into .gz format using standard software such as 7-Zip. The original files are then discarded, and the compressed files are transferred to my Microsoft Azure Cloud Storage account. I can invariably compress the files to about 10% of their original size, or roughly 2.5GB to 3.5GB.
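The compression step above uses an external tool, but the same .gz output can also be produced from .NET itself. The following is a sketch under that assumption, using `GZipStream` from `System.IO.Compression`; the `TickArchive` class name and the paths passed to it are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

public static class TickArchive
{
    // Compresses sourcePath into destinationPath using the gzip format.
    public static void Compress(string sourcePath, string destinationPath)
    {
        using (FileStream input = File.OpenRead(sourcePath))
        using (FileStream output = File.Create(destinationPath))
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
        {
            // CopyTo streams in chunks, so the file is never fully in memory.
            input.CopyTo(gzip);
        }
    }
}
```

Because the copy is streamed, this approach works for a multi-gigabyte file on a machine with far less memory than the file size.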

I then download recent updates on a periodic (weekly) basis and distribute them across all my backtesting/analysis servers. I replay the tick data using the decompressing stream reader built into C#/.NET. Keep in mind that as each tick is decompressed, it is placed on a queue and an event is fired that processes the tick throughout my backtesting system and strategies. As a result, I usually have 6 cores operational on a dual Xeon 8-core server at any given point. Backtesting a single day requires only a few minutes (depending, of course, on the complexity of the strategy), and the entire set of trades and messages over the backtesting period is then serialized and stored as a “Model” object.
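The replay loop can be sketched roughly as follows. This assumes, purely for illustration, that each tick was stored as one line of text; the `TickReplayer` class and its `TickReceived` event are hypothetical names, and the sketch raises the event directly rather than going through an intermediate queue.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public sealed class TickReplayer
{
    // Raised once per record; strategies subscribe to this event.
    public event Action<string> TickReceived;

    // Streams a gzip-compressed, line-oriented tick file and returns the record count.
    public long Replay(string gzipPath)
    {
        long count = 0;
        using (FileStream file = File.OpenRead(gzipPath))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Decompression and dispatch are interleaved, record by record.
                TickReceived?.Invoke(line);
                count++;
            }
        }
        return count;
    }
}
```

Only a small buffer of the file is decompressed at any moment, which is what keeps the memory footprint low on a 12GB machine.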

I have created a WPF viewer for the model that displays the market data and various transformations (differencing, moving averages, volume, cumulative volume, quote frequency, and so forth). I use the Visiblox package to greatly facilitate this, and I include annotations on where I’ve placed my trades so I have a visual sense of the strategy. Additionally, because I have the full Model characteristics, I can compute various performance measures against the backtest (Sharpe ratio, annualized return, and so forth).
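For reference, the two performance measures named above can be computed from a series of daily returns along these lines. This is a sketch under stated assumptions: a zero risk-free rate, 252 trading days per year, and the sample standard deviation. The `Performance` class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Performance
{
    public const int TradingDaysPerYear = 252;   // assumed convention

    // Annualized Sharpe ratio from daily returns, assuming a zero risk-free rate.
    public static double SharpeRatio(IReadOnlyList<double> dailyReturns)
    {
        if (dailyReturns.Count < 2)
            throw new ArgumentException("At least two returns are required.");

        double mean = dailyReturns.Average();
        double sumSquares = dailyReturns.Sum(r => (r - mean) * (r - mean));
        double stdDev = Math.Sqrt(sumSquares / (dailyReturns.Count - 1)); // sample std dev
        return mean / stdDev * Math.Sqrt(TradingDaysPerYear);
    }

    // Compound annualized return from daily returns.
    public static double AnnualizedReturn(IReadOnlyList<double> dailyReturns)
    {
        double growth = dailyReturns.Aggregate(1.0, (acc, r) => acc * (1.0 + r));
        return Math.Pow(growth, (double)TradingDaysPerYear / dailyReturns.Count) - 1.0;
    }
}
```

Note that a series with zero variance makes the Sharpe ratio undefined (the division yields infinity or NaN), so a real implementation would want to guard against that case.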

Now, the entire process I described is necessary because I am using machines with only 12GB of memory. Each day’s worth of compressed CTS/CQS data is approximately 3GB. If I had access to a 64GB or 128GB machine, the backtesting procedure would be far quicker, as I could load an entire month or two worth of data into memory and never have to access secondary storage (be it an HDD or SSD).

My current project is to move the entire backtesting apparatus onto the Microsoft Azure platform, so that I can take full advantage of the “utility computing” model and backtest day and night with virtually unlimited resources. As trading volumes have decreased, backtesting with home-grown software has actually become easier. That is another reason why I develop fully on the Microsoft stack – everything just “works” together, without the headaches of which version of Linux I’m using and so forth. But that’s just a personal aside.

The gold standard, in the final analysis, for these sorts of systems is of course KDB+, which is incredibly fast and powerful. It is an in-memory database with an exceptionally brilliant design, and it comes with its own extremely concise language (q). But since I’ve been a freelancer, I’ve had to develop my own techniques for managing large amounts of tick data.

I hope this article is useful to other financial technologists who regularly record and analyze capital market data.

Copyright © 2013, Srikant Krishna

Srikant Krishna is a financial technologist and quantitative trader. He has a background in biophysics, software development, and the capital markets.

You can follow him on Twitter @SrikantKrishna, and on LinkedIn at http://www.linkedin.com/in/srikantkrishna/, or e-mail him at sri@srikantkrishna.com.


Why C# is an Awesome Programming Language (Part 2, Specifics)

In my previous post, I very briefly covered the history and evolution of the C-esque languages, concluding with the release of C# by Microsoft in the early 2000s.

C# is, and was, a direct consequence of the design and popularity of Java. Like Java, which shipped with a large standard library, C# was released as part of the .NET Framework – probably the largest class library ever released. And Microsoft also improved their development suite – Visual Studio .NET being the flagship.

Any two developers can engage in an endless discourse as to why one language is superior to another, but I’d like to address four very specific reasons why C# is simply awesome.

Firstly, the Visual Studio environment is hands-down the best development platform that exists today, period. Anyone who has used Eclipse and Visual Studio, if they are honest, will confirm the elegance, features, and power of VS. As someone who immensely enjoyed coding in C using xemacs, I became a true believer in C#/.NET when I started using Visual Studio. From a purist point of view, it is entirely true that MonoDevelop could be used instead of Visual Studio to write in C#, and that therefore this is not a particular advantage. However, I am particularly referencing the majority of C# developers who use VS.

Secondly, C# incorporated, especially in version 3.0, a duo of extraordinarily powerful features, lambda expressions and LINQ, that I simply refuse to give up in any other language. The simplicity and declarative style afforded by the combination of these two features has completely changed my programming style. This is particularly relevant when dealing with “big data” and analytics, which are my specialties (especially as they pertain to the capital markets).
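To give a flavor of that declarative style, here is a small sketch that computes a volume-weighted average price (VWAP) per symbol with LINQ and lambda expressions. The tuple shape `(Symbol, Price, Size)` and the `TapeAnalytics` class are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TapeAnalytics
{
    // Volume-weighted average price per symbol: sum(price * size) / sum(size).
    public static Dictionary<string, decimal> VwapBySymbol(
        IEnumerable<(string Symbol, decimal Price, int Size)> trades)
    {
        return trades
            .GroupBy(t => t.Symbol)
            .ToDictionary(
                g => g.Key,
                g => g.Sum(t => t.Price * t.Size) / g.Sum(t => t.Size));
    }
}
```

The whole computation is a single expression that states what is wanted (group, then weight and divide) rather than how to loop over the data.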

Thirdly, C# has extensive support for built-in parallelism. From PLINQ to the Task Parallel Library, I can maximize the resources available on my machine with remarkably little effort. I don’t need to manually develop my own parallel/distributed framework, or use third-party add-ons (that may or may not be maintained frequently). Microsoft has made life much easier for developers who rely on taking full advantage of all their cores, and I commend them for this.
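As an illustration, independent units of work (say, one backtest per day file) can be spread across cores with a few PLINQ calls. This is only a sketch: the `ParallelBacktest.Run` helper and its parameters are hypothetical, and the per-day work is passed in as a delegate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelBacktest
{
    // Runs 'backtestDay' for every file in parallel and returns results in input order.
    public static List<TResult> Run<TResult>(
        IEnumerable<string> dayFiles, Func<string, TResult> backtestDay, int maxCores)
    {
        return dayFiles
            .AsParallel()
            .AsOrdered()                          // keep results aligned with the input order
            .WithDegreeOfParallelism(maxCores)    // cap the number of concurrent workers
            .Select(backtestDay)
            .ToList();
    }
}
```

Apart from the `AsParallel`, `AsOrdered`, and `WithDegreeOfParallelism` calls, this reads exactly like the sequential LINQ query it replaces.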

Finally, C# is very much a growing language. The most recent introduction of the Async framework in version 5.0 is a game changer. In a world where any sizable software system is virtually guaranteed to be asynchronous in nature, the inclusion of powerful built-in language support is another incredibly effective gift from Microsoft. Just two keywords, await and async, will completely change the way that a developer can process multiple, complex data streams or queries in real time. And all this for free.
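A minimal sketch of what those two keywords buy you is shown below: several queries are started at once and awaited together. The `AsyncQueries` class is hypothetical, and `Task.Delay` merely stands in for real network or disk latency.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class AsyncQueries
{
    // Simulated asynchronous query for one symbol; Task.Delay stands in for I/O latency.
    public static async Task<int> CountTicksAsync(string symbol, int simulatedCount)
    {
        await Task.Delay(50);
        return simulatedCount;
    }

    // Starts all queries at once and awaits them together.
    public static async Task<int> TotalTicksAsync(params (string Symbol, int Ticks)[] sources)
    {
        Task<int>[] pending = sources
            .Select(s => CountTicksAsync(s.Symbol, s.Ticks))
            .ToArray();
        int[] counts = await Task.WhenAll(pending);
        return counts.Sum();
    }
}
```

Because the queries overlap, the total wait is close to that of the slowest query rather than the sum of all of them, and no thread is blocked while waiting.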

At the end of the day, it’s not going to be one specific feature or theoretical characteristic of a language that will render it an exceptional framework. It is rather the utility, the practicality of being able to simply get things done that will measure the success of a language or platform. As a front-line developer working under rigorous deadlines and having to maintain large scale systems, this is the simplest yet most profound reason for using a particular framework.

It is completely understandable that different developers will feel extremely comfortable with other platforms, operating systems, and languages, but in these two brief discussions I really wanted to explain why I feel that C# is awesome.

So I urge developers, even those who are not using the Microsoft stack, to consider Mono and C# to get a feel for how powerful the language is and how much built-in support facilitates getting your work done quickly and correctly.

Why C# is an Awesome Programming Language (Part 1, Background)

As is the case with many Gen-X developers (or perhaps even Baby Boomers), I’ve worked with a wide variety of programming languages, but almost invariably the development environment for large-scale projects was C or C++, often glued together by a variety of scripts and build tools.

C and C++ suffer from problems at opposite ends of the same spectrum. With C, the standard library provides only eighty-seven functions, and usually entire libraries have to be written or procured from a third party. The saying used to be that when you hire a C developer, you are really buying their libraries. Of course, the standards have changed, and the ubiquity of the internet makes it possible to procure and share well-tested code easily. C also suffers from platform dependence. The primary advantages of C are its simplicity and exceptional speed, which are consequences of its evolutionary proximity to assembly language, which greatly facilitates compiler optimizations.

When C++ was created at Bell Laboratories, the goals that were sought were reusability and the ability to develop large-scale systems through the collaborative work of teams of developers. Ostensibly, C++ is an object-oriented language, replete with encapsulation, polymorphism and inheritance. A language developed for these goals, and with an object-oriented approach would have been an ideal evolutionary step in the 1980s.

But there were two major flaws that were lethal to C++, and they continue to haunt the billions of lines of code that have been written since. First, C++ was to be backward compatible with C, and in fact the earliest C++ compilers were preprocessors that converted C++ code into C. This single design decision resulted in all of the atrocities of C (global variables, macros, conditional compilation) being inherited by C++. So from inception, bad C code could be processed perfectly as bad C++ code. This continues to the present day, though developers today take strong measures to avoid dangerous coding habits and design.

The second, more nuanced bombshell within C++ is that even though C++ can be regarded as an object-oriented language, it can also be viewed as a generic, template-driven programming language. These two paradigms are completely orthogonal, and almost without exception, the largest design flaws in C++ systems result from combining both paradigms simultaneously. Imagine nested template-driven code applied to a complex class hierarchy. To manage and continue to extend such a codebase would be a task worthy of Hercules. In my opinion, this is why highly experienced C++ developers are always in demand – only true experts can work with the multiple-inheritance, C-backward-compatible, template-driven complex class hierarchies that can be found in modern large-scale C++ systems.

The year was 1993, and enter Java – an experimental language under development at Sun Microsystems (codename “Oak”) as a platform-independent language in which a virtual machine enables a “Write Once, Run Anywhere” development paradigm. Fortunately the designers of Java had decided to break ties with the inescapable detriments afforded by C/C++, and instead focus on designing a modern language for the 99% replete with automatic garbage collection, single inheritance, and a robust built-in library. Originally, Java was expected to run within browsers on the client side. However, the formidable qualities of the programming language from both business and development perspectives became exquisitely evident, and Java was being used in back end server-side projects, with a web presentation interface using an assortment of ancillary technologies. From the earliest versions through Java 7, a host of incremental refinements, performance improvements, and language features such as generics have resulted in Java supplanting C/C++ as the “gold standard” for software development. It was far easier for universities to teach and for students to learn Java, and it attained ubiquity in the professional development community.

Microsoft, being the notorious laggard in innovation that it is, quickly realized that it was rapidly losing market share as the world moved towards the browser rather than applications, and developers were flocking to the new Java gold standard of modern programming language design. It responded in the same manner in which it had dealt with other threats in the past (including MacOS and Netscape Navigator) – it created an imitation product that retained the flagship features of the cloned product. Anders Hejlsberg, a designer of Turbo Pascal and Delphi, was in charge of this process. When C# 1.0 was first released, I can vividly recall not being able to distinguish whether the code was Java or C#. The syntax, design, and even keywords were closely related. In fact, in the early 2000s, writing a translator between Java and C# would have been a fairly trivial enterprise. Of course, as the languages have evolved and the libraries have grown enormously, such a task would be much more complex today.

The stage was now set for Microsoft to actually innovate in terms of programming language design, and create a superb development environment and framework. I will follow this up shortly in another post (Part 2, Specifics).

Srikant Krishna (contact: sri@srikantkrishna.com) is a financial technologist and quantitative trader. He has a strong background in biophysics, software development, and the capital markets.