
Course Syllabus

Module 1: Introduction to Machine Learning (ML) and Deep Learning (DL)

• ML revolution and the cloud
• ML systems and their constituents
• ML performance concepts/techniques: overfitting and generalization, bias, variance, regularization
• Performance metrics: algorithmic and system level
• DL training: backpropagation, activation functions, data preprocessing, batch normalization, SGD and its variants, exploding and vanishing gradients, weight initialization, learning rate schedule
• Regularization techniques in DL training: dropout, early stopping, data augmentation (see the sketch below)
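
As an illustration of the regularization topics above, here is a minimal sketch, assuming PyTorch and a small synthetic dataset, of dropout combined with a validation-driven early-stopping loop; the network shape, patience value, and data are placeholder assumptions, not course material.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train, y_train = torch.randn(512, 20), torch.randint(0, 2, (512,))
X_val, y_val = torch.randn(128, 20), torch.randint(0, 2, (128,))

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                      # dropout: randomly zero 50% of activations in training mode
    nn.Linear(64, 2),
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # early stopping: validation loss stopped improving
            break
```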

Module 2: DL Training: Architecture, Frameworks, Hyperparameters

• DL training architectures
  • Model and data parallelism
  • Single-node training
  • Distributed training
• DL training hyperparameters
  • Batch size, learning rate, momentum, weight decay; convergence and runtime
  • Relation between batch size and learning rate (see the scaling sketch below)
• Hardware acceleration: CPUs, GPUs, FPGAs, TPUs, specialized systems for AI acceleration
• Specialized DL architectures: CNNs, RNNs, LSTMs, GANs
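
The batch-size/learning-rate relation above is often summarized by the linear scaling heuristic: when the batch size grows by a factor k, scale the learning rate by roughly k. A minimal sketch; the base values are illustrative assumptions, not course-prescribed settings.

```python
def scaled_learning_rate(base_lr: float, base_batch: int, batch: int) -> float:
    """Scale the learning rate proportionally to the batch-size increase (linear scaling heuristic)."""
    return base_lr * (batch / base_batch)

# Example: a recipe tuned at lr=0.1 with batch size 256, run at batch size 1024 on more GPUs.
print(scaled_learning_rate(0.1, 256, 1024))  # -> 0.4
```

In practice this heuristic is usually paired with a warm-up phase at the start of training, which ties back to the learning rate schedule topic in Module 1.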

Module 3: ML and Cloud Technologies

• ML system stack on the cloud
• Microservices architecture, Docker, Kubernetes, Kubeflow
• Cloud storage: File, Block, Object storage; performance (latency, throughput), cost, scalability

Cloud-Based ML Platforms

• ML-as-a-service offerings: AWS, Microsoft, Google, and IBM
• System stack, capabilities, and tool support
• Monitoring and observability
• Performance and availability

Module 4: DL Performance Evaluation: Tools and Techniques

• Monitoring tools for GPU resources (nvprof, nvidia-smi); see the polling sketch below
• Time-series analysis of resource usage data
• Predictive performance modeling techniques
  • Black-box vs. white-box modeling
  • Analytical modeling
• Predictive performance models for DL: accuracy and runtime
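
For the monitoring tools above, a minimal sketch of collecting a GPU resource-usage time series by polling nvidia-smi from Python; the query fields are standard nvidia-smi options, while the sampling cadence, sample count, and the presence of an NVIDIA driver are assumptions.

```python
import csv
import io
import subprocess
import time

def sample_gpu(n_samples: int = 10, interval_s: float = 1.0):
    """Poll nvidia-smi and return one row per GPU per sample: [timestamp, util %, memory MiB]."""
    rows = []
    for _ in range(n_samples):
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=timestamp,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        rows.extend(csv.reader(io.StringIO(out)))
        time.sleep(interval_s)
    return rows

if __name__ == "__main__":
    for row in sample_gpu(n_samples=3):
        print(row)
```

The collected rows can then feed the time-series analysis listed above.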

Module 5: ML Benchmarks

• DAWNBench, MLPerf
• Datasets: MNIST, CIFAR10/100, ImageNet, OpenML
• Performance metrics for DL jobs
  • Runtime, cost, response time, accuracy, Time To Accuracy (TTA; see the sketch below)
• Study of numbers published by different cloud service providers/vendors
• Comparison of performance scaling across GPUs for different models in MLPerf
• Open Neural Network Exchange (ONNX)
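
Time To Accuracy (TTA), listed above, can be read off a training log as the first wall-clock time at which a target validation accuracy is reached. A minimal sketch; the (elapsed_seconds, accuracy) log format is an illustrative assumption, not a benchmark-defined one.

```python
def time_to_accuracy(log, target=0.93):
    """log: list of (elapsed_seconds, validation_accuracy) pairs, one per evaluation."""
    for elapsed, acc in log:
        if acc >= target:
            return elapsed   # first time the target accuracy is reached
    return None              # target never reached within the logged run

print(time_to_accuracy([(600, 0.88), (1200, 0.92), (1800, 0.94)]))  # -> 1800
```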

Module 6: DL Systems Performance

• Training logs: framework-specific support, instrumentation, analysis
• Checkpointing: framework-specific support, restarting from a checkpoint (example sketch below)
• Job scheduling on clusters
  • Policies: FIFO, Gang, Earliest Deadline First
  • Job schedulers: Kubernetes default, Gandiva, Optimus
• Job elasticity: scaling GPUs during runtime, platform support
• Performance scalability of DL clusters
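
For the checkpointing topic above, a minimal sketch of framework-level checkpointing and restart, shown with PyTorch as one example; the saved fields and file path are assumptions, not a format prescribed by the course.

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    """Persist enough state to resume training after a preemption or failure."""
    torch.save(
        {"epoch": epoch,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    """Restore model and optimizer state; returns the epoch to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1
```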

Module 7: Operational DL and MLOps

• Model lifecycle management
• DL systems testing and quality
• Drift detection and re-training (see the KS-test sketch below)
• Robustness and adversarial training
• DevOps principles in machine learning
• MLOps and the associated toolchain
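
For drift detection above, a minimal sketch that flags feature drift by comparing a production window against the training distribution with a two-sample Kolmogorov-Smirnov test; the choice of test and the significance threshold are illustrative assumptions, not the course's prescribed detector.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_col: np.ndarray, live_col: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True when the live feature distribution differs significantly from training."""
    _, p_value = ks_2samp(train_col, live_col)
    return p_value < alpha   # small p-value: distributions differ, candidate for re-training

rng = np.random.default_rng(0)
print(feature_drifted(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000)))    # drift not expected
print(feature_drifted(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000)))  # drift expected
```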

Module 8: Special Topics

• Transfer Learning: fine-tuning (sketched below) and pseudo-labeling techniques
• Deep Reinforcement Learning systems
• Neural network synthesis and neural architecture search (NAS)
• Hyperparameter optimization
• Automated Machine Learning (AutoML)
• Federated machine learning
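
For the transfer-learning topic above, a minimal sketch of the fine-tuning pattern: freeze a pretrained backbone, replace the task head, and train only the new head. The ResNet-18 backbone and 10-class head are arbitrary illustrative choices, and the weights API shown assumes torchvision 0.13 or newer.

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # downloads pretrained weights
for p in backbone.parameters():
    p.requires_grad = False                               # freeze the pretrained backbone

backbone.fc = nn.Linear(backbone.fc.in_features, 10)      # new task head, trainable by default

head_params = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(head_params, lr=0.01, momentum=0.9)
# ...train as usual; only the new head's parameters are updated
```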

Assignments based on the following technologies
