
Course Syllabus

Module 1: Introduction to Machine Learning (ML) and Deep Learning (DL)

• ML revolution and the cloud
• ML systems and their constituents
• ML performance concepts/techniques: overfitting and generalization, bias, variance, regularization
• Performance metrics: algorithmic and system level
• DL training: backpropagation, activation functions, data preprocessing, batch normalization, SGD and its variants, exploding and vanishing gradients, weight initialization, learning rate schedule
• Regularization techniques in DL training: dropout, early stopping, data augmentation (see the sketch below)
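
As an illustration of the regularization topics above, here is a minimal sketch, assuming PyTorch and a small synthetic dataset, of dropout combined with a validation-driven early-stopping loop; the network shape, patience value, and data are placeholder assumptions, not course material.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train, y_train = torch.randn(512, 20), torch.randint(0, 2, (512,))
X_val, y_val = torch.randn(128, 20), torch.randint(0, 2, (128,))

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                      # dropout: randomly zero 50% of activations in training mode
    nn.Linear(64, 2),
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # early stopping: validation loss stopped improving
            break
```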

Module 2: DL Training: Architecture, Frameworks, Hyperparameters

• DL training architectures
  • Model and data parallelism
  • Single-node training
  • Distributed training
• DL training hyperparameters
  • Batch size, learning rate, momentum, weight decay; convergence and runtime
  • Relation between batch size and learning rate (see the scaling sketch below)
• Hardware acceleration: CPUs, GPUs, FPGAs, TPUs, specialized systems for AI acceleration
• Specialized DL architectures: CNNs, RNNs, LSTMs, GANs
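
The batch-size/learning-rate relation above is often summarized by the linear scaling heuristic: when the batch size grows by a factor k, scale the learning rate by roughly k. A minimal sketch; the base values are illustrative assumptions, not course-prescribed settings.

```python
def scaled_learning_rate(base_lr: float, base_batch: int, batch: int) -> float:
    """Scale the learning rate proportionally to the batch-size increase (linear scaling heuristic)."""
    return base_lr * (batch / base_batch)

# Example: a recipe tuned at lr=0.1 with batch size 256, run at batch size 1024 on more GPUs.
print(scaled_learning_rate(0.1, 256, 1024))  # -> 0.4
```

In practice this heuristic is usually paired with a warm-up phase at the start of training, which ties back to the learning rate schedule topic in Module 1.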

Module 3: ML and Cloud Technologies

• ML system stack on the cloud
• Microservices architecture, Docker, Kubernetes, Kubeflow
• Cloud storage: File, Block, Object storage; performance (latency, throughput), cost, scalability

Cloud-Based ML Platforms

• ML-as-a-service offerings: AWS, Microsoft, Google, and IBM
• System stack, capabilities, and tool support
• Monitoring and observability
• Performance and availability

Module 4: DL Performance Evaluation: Tools and Techniques

• Monitoring tools for GPU resources (nvprof, nvidia-smi); see the polling sketch below
• Time-series analysis of resource usage data
• Predictive performance modeling techniques
  • Black-box vs. white-box modeling
  • Analytical modeling
• Predictive performance models for DL: accuracy and runtime
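
For the monitoring tools above, a minimal sketch of collecting a GPU resource-usage time series by polling nvidia-smi from Python; the query fields are standard nvidia-smi options, while the sampling cadence, sample count, and the presence of an NVIDIA driver are assumptions.

```python
import csv
import io
import subprocess
import time

def sample_gpu(n_samples: int = 10, interval_s: float = 1.0):
    """Poll nvidia-smi and return one row per GPU per sample: [timestamp, util %, memory MiB]."""
    rows = []
    for _ in range(n_samples):
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=timestamp,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        rows.extend(csv.reader(io.StringIO(out)))
        time.sleep(interval_s)
    return rows

if __name__ == "__main__":
    for row in sample_gpu(n_samples=3):
        print(row)
```

The collected rows can then feed the time-series analysis listed above.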

Module 5: ML Benchmarks

• DAWNBench, MLPerf
• Datasets: MNIST, CIFAR10/100, ImageNet, OpenML
• Performance metrics for DL jobs
  • Runtime, cost, response time, accuracy, Time To Accuracy (TTA; see the sketch below)
• Study of numbers published by different cloud service providers/vendors
• Comparison of performance scaling across GPUs for different models in MLPerf
• Open Neural Network Exchange (ONNX)
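
Time To Accuracy (TTA), listed above, can be read off a training log as the first wall-clock time at which a target validation accuracy is reached. A minimal sketch; the (elapsed_seconds, accuracy) log format is an illustrative assumption, not a benchmark-defined one.

```python
def time_to_accuracy(log, target=0.93):
    """log: list of (elapsed_seconds, validation_accuracy) pairs, one per evaluation."""
    for elapsed, acc in log:
        if acc >= target:
            return elapsed   # first time the target accuracy is reached
    return None              # target never reached within the logged run

print(time_to_accuracy([(600, 0.88), (1200, 0.92), (1800, 0.94)]))  # -> 1800
```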

Module 6: DL Systems Performance

• Training logs: framework-specific support, instrumentation, analysis
• Checkpointing: framework-specific support, restarting from a checkpoint (example sketch below)
• Job scheduling on clusters
  • Policies: FIFO, Gang, Earliest Deadline First
  • Job schedulers: Kubernetes default, Gandiva, Optimus
• Job elasticity: scaling GPUs during runtime, platform support
• Performance scalability of DL clusters
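
For the checkpointing topic above, a minimal sketch of framework-level checkpointing and restart, shown with PyTorch as one example; the saved fields and file path are assumptions, not a format prescribed by the course.

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    """Persist enough state to resume training after a preemption or failure."""
    torch.save(
        {"epoch": epoch,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    """Restore model and optimizer state; returns the epoch to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1
```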

Module 7: Operational DL and MLOps

• Model lifecycle management
• DL systems testing and quality
• Drift detection and re-training (see the KS-test sketch below)
• Robustness and adversarial training
• DevOps principles in machine learning
• MLOps and the associated toolchain
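
For drift detection above, a minimal sketch that flags feature drift by comparing a production window against the training distribution with a two-sample Kolmogorov-Smirnov test; the choice of test and the significance threshold are illustrative assumptions, not the course's prescribed detector.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_col: np.ndarray, live_col: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True when the live feature distribution differs significantly from training."""
    _, p_value = ks_2samp(train_col, live_col)
    return p_value < alpha   # small p-value: distributions differ, candidate for re-training

rng = np.random.default_rng(0)
print(feature_drifted(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000)))    # drift not expected
print(feature_drifted(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000)))  # drift expected
```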

Module 8: Special Topics

• Transfer Learning: fine-tuning (sketched below) and pseudo-labeling techniques
• Deep Reinforcement Learning systems
• Neural network synthesis and neural architecture search (NAS)
• Hyperparameter optimization
• Automated Machine Learning (AutoML)
• Federated machine learning
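
For the transfer-learning topic above, a minimal sketch of the fine-tuning pattern: freeze a pretrained backbone, replace the task head, and train only the new head. The ResNet-18 backbone and 10-class head are arbitrary illustrative choices, and the weights API shown assumes torchvision 0.13 or newer.

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # downloads pretrained weights
for p in backbone.parameters():
    p.requires_grad = False                               # freeze the pretrained backbone

backbone.fc = nn.Linear(backbone.fc.in_features, 10)      # new task head, trainable by default

head_params = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(head_params, lr=0.01, momentum=0.9)
# ...train as usual; only the new head's parameters are updated
```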

Assignments based on the following technologies
