Ever since colocation was launched in India, traders have been hungry for the fastest possible trading systems. While developing our muTrade application, we learnt a lot from experience. This article captures some architecture principles that should help anybody developing a low-latency, high-volume algorithmic trading platform.


While laying out the architecture blocks, the following things should be kept in mind:

  1. Monolithic design
    • This goes against enterprise software development guidelines, but a single application avoids the latency of unnecessary communication between modules
  2. Heavy multi-threading
    • To make the best use of the server, the number of threads should match the number of cores
    • Threads should cooperate and free up the CPU for other threads as well
  3. Cores vs Frequency
    • Server configuration should be planned in advance, as there are limits on the available combinations of clock frequency and core count
  4. CEP (Complex Event Processing)
    • Algorithmic applications work on three main events: Market Data, Exchange Confirmation (order/trade) and User Input
    • The handling priority of these events should be defined, and their processing time minimised
  5. Optimise the most critical path
    • Broadly, the components are as follows:
    • Time to React = Time to process market data + Execute strategy logic + OMS Time + RMS Time
    • Time to Send = Time in writing the order towards exchange
    • Turnaround Time = Network time towards exchange + Exchange processing and matching time + Network time from exchange
    • The first two, Time to React and Time to Send, are together called Tick to Trade (or Tick to Order) time. This is the software processing time, and it needs to be optimised well
    • Network infrastructure impacts Turnaround Time
  6. RMS (Risk Management System) computations
    • They should be divided into sequential and parallel checks
    • Fat-finger checks, which are applied to each order individually, can run in parallel threads
    • Only sequential checks, which need cumulative computations, have to take locks
  7. Locks
    • Wherever possible, avoid locks
    • Lock-free designs can be implemented with third-party solutions or in-house code
    • Mutexes are the easiest way to implement synchronisation, but acquiring and releasing a mutex itself introduces significant latency. Spin locks can be used instead
  8. Different IPC mechanisms based on specific need
    • One solution does not fit all needs
    • Benchmarking should be done; shared memory is one of the fastest modes of communication
  9. Database updates
    • Use files wherever possible
    • If database usage is a must, offline threads should run the updates and queries rather than doing everything online
  10. Kernel bypass
    • Special NICs from vendors such as Mellanox, Solarflare and Exablaze all come with their own libraries for kernel bypass. They generally offer two methods: loading the libraries externally or integrating them in code. Code integration gives better results
  11. In-memory structures to keep everything that is required in RAM and readily accessible
    • In-memory structures are pre-allocated to avoid allocation and deallocation at runtime. Offline threads take care of memory management
  12. Frontend-Backend communication
    • Separate communication channels (preferably sockets) for order flow and market data are recommended
  13. Market Data Processing
    • A separate module is recommended for market data processing, which updates a shared structure. All other modules can use an API to fetch the market data they need
  14. Buffering
    • Incoming messages (and market data messages) should be read by dedicated threads into buffers
  15. Backend Transaction Application
    • Core allocation between the various modules/threads is required to fully utilise the processor's potential
    • Some kind of scheduler helps allocate cores efficiently if the number of threads is higher than the number of cores available
  16. Exchange adaptors
    • Exchange adaptors can be configured to run as a separate process or as a separate thread. A separate process gives flexibility for cross-exchange setups, while a separate thread gives a latency advantage
  17. Network inconsistencies
    • Switch configuration, ping time, actual server frequency and overheating also have an impact on performance. All of these should be checked on a regular basis
  18. Overclocked servers
    • Overclocked servers run at a higher frequency
    • Supermicro servers are well known in this space; assembled servers are another option
    • Locally overclocked servers generally have cooling problems and a lifespan of one year
  19. Keyboard shortcuts
    • For traders, the latency of modifying parameters matters more than order-to-trade latency
    • Provide proper shortcuts for ease of use and the fastest modification of parameters
  20. Custom algos API
    • Pro traders usually have their own, better logic in mind that they want to use for themselves
    • The algo trading application should expose APIs so that traders can code their logic themselves
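
Points 6 and 7 can be illustrated together with a small C++ sketch. It is only one possible reading of those points, not the muTrade implementation: the names `SpinLock`, `Order`, `RiskLimits`, `RiskManager` and `run_demo`, as well as the limit values, are hypothetical and chosen for illustration. The per-order fat-finger check touches no shared state and therefore runs without a lock from any thread, while the cumulative exposure check is serialised with a short spin lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Minimal test-and-set spin lock (hypothetical helper). It busy-waits instead
// of sleeping, so it only pays off when the critical section is very short.
class SpinLock {
public:
    void lock() noexcept {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // spin until the current holder calls unlock()
        }
    }
    void unlock() noexcept { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

struct Order {
    std::int64_t qty;
    std::int64_t price;
};

// Illustrative limits only; real RMS limits are assumed to come from config.
struct RiskLimits {
    std::int64_t max_order_qty;
    std::int64_t max_order_value;
    std::int64_t max_total_exposure;
};

class RiskManager {
public:
    explicit RiskManager(RiskLimits limits) : limits_(limits) {}

    // Parallel part: a per-order check that reads only immutable limits,
    // so any number of threads can call it without taking a lock.
    bool fat_finger_ok(const Order& o) const noexcept {
        return o.qty > 0 && o.qty <= limits_.max_order_qty &&
               o.qty * o.price <= limits_.max_order_value;
    }

    // Sequential part: the cumulative exposure must be checked and updated
    // as one step, so it is guarded by the spin lock.
    bool reserve_exposure(const Order& o) noexcept {
        const std::int64_t value = o.qty * o.price;
        std::lock_guard<SpinLock> guard(lock_);
        if (exposure_ + value > limits_.max_total_exposure) return false;
        exposure_ += value;
        return true;
    }

    // Run the cheap lock-free check first; take the lock only if it passes.
    bool check(const Order& o) noexcept {
        return fat_finger_ok(o) && reserve_exposure(o);
    }

    std::int64_t exposure() noexcept {
        std::lock_guard<SpinLock> guard(lock_);
        return exposure_;
    }

private:
    const RiskLimits limits_;
    SpinLock lock_;
    std::int64_t exposure_ = 0;
};

// Demo: several threads submit identical orders (qty 1, price 10) against a
// total exposure cap of 1000. Returns the number of accepted orders.
inline int run_demo(int threads, int orders_per_thread) {
    RiskManager rms(RiskLimits{100, 1000000, 1000});
    std::atomic<int> accepted{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&rms, &accepted, orders_per_thread] {
            for (int i = 0; i < orders_per_thread; ++i) {
                if (rms.check(Order{1, 10})) {
                    accepted.fetch_add(1, std::memory_order_relaxed);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return accepted.load();
}
```

Calling `run_demo(4, 50)` submits 200 orders worth 10 each, and the cap of 1000 lets exactly 100 of them through regardless of how the threads interleave. A spin lock like this wastes CPU under heavy contention, so it is assumed here that each thread has its own core and that the guarded section stays at a few instructions; a mutex or a lock-free counter may still benchmark better on a given setup.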