High Performance Computing
Parallel programming and parallel algorithms
Live online class through
Batch details
Time : 11 PM - 12 AM (Night) IST
Days : Mon, Wed & Fri.
Starts on 23-Feb-2026
Fee : Rs. 24,000/- (cannot be paid in installments)
Duration : 10 to 12 months.

Programming languages used for teaching : C & C++
Prerequisite : You should not be afraid of pointers, pointers to pointers, or using a pointer as an array. You should also know classes, objects, encapsulation, inheritance, polymorphism (including virtual polymorphism), function templates, class templates, how abstract classes are used to impose guidelines, constructors, destructors and operator overloading.
How can you register for the course ?
Register for two free demo lectures without paying any fee.
The demo lectures will help you decide whether this course is for you.
Before joining a demo lecture, make sure you know the math behind 2D matrix multiplication.
After attending the demo lectures, you will have a window of two days to register for the course by paying the fee.
Register now
Live class audio recordings / code will be available for download through our website.
This helps students revise, fill in missed theory and document everything in a neat and tidy fashion.
If you are going to miss a class for a serious reason, tell us in advance so that we can provide you the audio / video recording.
Course Contents
 HPC v/s HFT and the HFC overlap
  1. What is HPC ?
  2. What is HFT ?
  3. How is HFC the intersection of both ?
  4. Can it be learned on one machine, or is a cluster required ?
 Minimizing Latency
  1. Understanding latency
  2. Measuring latency
  3. Is it always related to network programming ?
  4. Cache hit/miss
  5. Cache friendly data access to maximize cache hit
  6. Aligning Data
  7. Alternative to virtual polymorphism
  8. Optimizing DS, loops, function calls
  9. How exception handling slows you down
  10. Compiler optimization flags
  11. How dynamic memory allocation slows you down
  12. Avoiding dynamic memory allocations
  13. Creating memory pools
  14. Creating lock free data structures to avoid wait times
  15. Low latency logging
  16. Network programming
 Multi-threading
  1. Matrix multiplication
  2. Measuring processing time
  3. Understanding cache hit/cache miss
  4. Revised matrix multiplication
  5. Concurrency
  6. Creating threads
  7. Threaded matrix multiplication
  8. Lambdas
  9. Locks
  10. Lock guards
  11. Preventing deadlocks
  12. Condition variables
  13. Atomics
  14. Tasks and futures
  15. Synchronizing threads
  16. Communication between threads
  17. Creating lock based data structures
  18. Creating thread pools
  19. Creating lock free data structures
  20. Parallel Standard Template Library
  21. Execution policies
  22. Vectors in parallel
  23. for_each in parallel
  24. Load balancing
  25. Exception handling in parallel execution
  26. for_each_n and ranges
  27. Custom iterators in parallelism
  28. Synchronization
  29. Parallel data transformation using transform
  30. reduce and accumulate in parallel
  31. Sorting in parallel
  32. Searching in parallel
 Open Multi-Processing (OpenMP)
  1. OpenMP Directives
  2. Parallelize loops
  3. Implementing reduction
  4. Environment variables
  5. Parallel Regions
  6. Work sharing
  7. Decomposing data structures for parallelism
  8. Controlling / Removing data dependencies
  9. Synchronization
  10. Mutual exclusion
  11. Synchronizing events
  12. Communication between threads
  13. Thread affinity
  14. SIMD Vectorization
  15. GPU Offloading
 Compute Unified Device Architecture (CUDA)
  1. CPU v/s GPU
  2. Which GPU for learning CUDA ?
  3. Data v/s Task parallelism
  4. GPU Architecture
  5. Setting up development environment for CUDA Programming
  6. Parallel programming begins with SIMD
  7. Compilation
  8. Writing kernel function
  9. Measuring GPU processing time
  10. Thread / Block / Grid
  11. Organizing parallel threads
  12. Query GPU Information
  13. Error handling
  14. CUDA Memory model
  15. Asynchronous execution with streams/events
  16. Setting up launch configurations
  17. Designing parallel algorithms
  18. Reduction algorithm
  19. Sorting in parallel
  20. Profiling and optimizing code
  21. Unrolling loops
  22. Debugging techniques
  23. CUDA Streams
  24. Creating library for integration with other programming languages
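To give a flavor of the kernel-function and launch-configuration topics above, here is a hedged sketch of the classic vector-addition kernel. It assumes nvcc and a CUDA-capable GPU; the block size of 256 and the use of unified memory are illustrative choices, not the course's prescribed setup.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: one thread per element. blockIdx/threadIdx locate the thread
// inside the grid, giving it a unique global index into the arrays.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    // Unified memory: the runtime migrates pages between host and device.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch configuration: enough 256-thread blocks to cover n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    std::printf("c[0] = %f\n", c[0]);  // each element should be 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The thread/block/grid hierarchy, error handling, streams and profiling listed above all build on this basic launch pattern.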
 What next ?
  1. HPC using distributed computing frameworks - An introduction
  2. How to get into the High Frequency Trading domain as a programmer ?
  3. What is FPGA, Verilog & VHDL ? - An introduction