
This course provides a comprehensive overview of techniques to enhance the performance of large language models (LLMs) during inference. It begins with an introduction to the principles of LLM inference optimization, focusing on the transformer architecture and various optimization strategies. Participants will explore advanced methods, including quantization and speculative decoding, to reduce model complexity and improve execution speed. The course also covers model parallelism and sharding techniques for effective deployment in real-world applications. Finally, learners will complete a project on accelerating news headline generation using LLM optimization, demonstrating practical implementations of the concepts discussed.
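Of the techniques listed above, quantization is the most self-contained to illustrate. The sketch below is a minimal, pure-Python illustration of symmetric int8 weight quantization (not course material, and not how production libraries implement it; frameworks such as TensorRT-LLM or bitsandbytes handle this at the kernel level):

```python
# Minimal sketch of symmetric int8 weight quantization, the core idea behind
# shrinking model weights for faster LLM inference. Illustrative only.

def quantize_int8(weights):
    """Map float weights to int8 range [-127, 127] with one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.05, 0.4]
q, scale = quantize_int8(weights)       # q = [82, -127, 5, 40]
approx = dequantize(q, scale)
# Each value now fits in one byte instead of four, at a small accuracy cost.
```

Real deployments quantize per-channel or per-group rather than with a single scale, but the trade-off is the same: smaller weights and faster memory-bound inference in exchange for a bounded approximation error.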

Subscription · Monthly
2 skills
4 prerequisites
Prior to enrolling, you should have the following knowledge:
You will also need to be able to communicate fluently and professionally in written and spoken English.
1 instructor
Unlike typical professors, our instructors come from Fortune 500 and Global 2000 companies and have demonstrated leadership and expertise in their professions:

Rishabh Misra
Staff Machine Learning Engineer
Master LLM inference optimization with transformer tweaks, model parallelism, and sharding using DeepSpeed, TensorRT-LLM, Triton Inference Server, and more.
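As a rough intuition for the sharding mentioned above, the toy sketch below shows the column-parallel pattern behind tensor parallelism: a weight matrix is split across "devices", each computes a partial result, and the partials are gathered. This is an assumption-laden illustration in plain Python; frameworks like DeepSpeed automate the actual device placement and communication:

```python
# Toy sketch of tensor (model) parallelism via column-wise weight sharding.
# Pure Python lists stand in for device-resident tensors.

def matmul(x, w_cols):
    """Multiply row vector x by a matrix stored as a list of columns."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in w_cols]

# Full weight matrix as four columns, sharded into two halves.
w_full = [[1, 0], [0, 1], [2, 0], [0, 3]]
shard_a, shard_b = w_full[:2], w_full[2:]

x = [5, 7]
out_a = matmul(x, shard_a)   # "device A" computes output columns 0-1
out_b = matmul(x, shard_b)   # "device B" computes output columns 2-3
output = out_a + out_b       # an all-gather concatenates the partials
```

Because each shard holds only part of the weights, a model too large for one accelerator's memory can still run, at the cost of the gather communication step.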