Course Syllabus
Module 1: Introduction to Machine Learning (ML) and Deep Learning (DL)
• ML revolution and the cloud
• ML systems and their constituents
• ML performance concepts/techniques: overfitting and generalization, bias, variance, regularization
• Performance metrics: algorithmic and system level
• DL training: backpropagation, activation functions, data preprocessing, batch normalization, SGD and its variants; exploding and vanishing gradients, weight initialization, learning rate schedules
• Regularization techniques in DL training: dropout, early stopping, data augmentation
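The early-stopping criterion listed above can be made concrete in a few lines. A minimal sketch (the function name and the `patience` default are ours, not from any particular framework, which would expose this as a training callback):

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch at which training should halt: the first epoch
    where validation loss has failed to improve for `patience` epochs
    in a row. Illustrative only; real callbacks also support a minimum
    improvement delta and restoring the best weights."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop: no improvement for `patience` epochs
    return len(val_losses) - 1  # trained to the end
```

Stopping on the validation loss rather than the training loss is what makes this a regularizer: it halts before the model starts fitting noise in the training set.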
Module 2: DL Training: Architectures, Frameworks, Hyperparameters
• DL training architectures
• Model and Data Parallelism
• Single node training
• Distributed training
• DL training hyperparameters
• Batch size, learning rate, momentum, weight decay; convergence and runtime
• Relation between batch size and learning rate
• Hardware Acceleration: CPUs, GPUs, FPGAs, TPUs, Specialized systems for AI acceleration
• Specialized DL architectures: CNNs, RNNs, LSTMs, GANs
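The batch size/learning rate relation above is often summarized by the linear scaling rule: when the batch size grows by a factor k, scale the learning rate by k as well. A minimal sketch with hypothetical example values:

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: multiply the learning rate by the same
    factor as the batch size. In practice this is combined with a
    warmup phase, and the rule breaks down at very large batches."""
    return base_lr * new_batch_size / base_batch_size

# a recipe tuned at batch size 256 with lr 0.1, scaled up for
# data-parallel training at batch size 1024:
lr_1024 = scaled_lr(0.1, 256, 1024)  # 4x the base learning rate
```

This matters for distributed data parallelism: adding workers raises the effective batch size, so the learning rate must be re-tuned rather than reused as-is.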
Module 3: ML and Cloud Technologies
• ML system stack on the cloud
• Micro-services architecture, Docker, Kubernetes, Kubeflow
• Cloud storage: File, Block, Object storage; performance (latency, throughput), cost, scalability
Cloud-Based ML Platforms
• ML as a service offering: AWS, Microsoft, Google, and IBM
• System stack, capabilities, and tool support
• Monitoring and observability
• Performance and availability
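The cloud-storage trade-off between latency and throughput listed above can be illustrated with a first-order cost model (the formula and numbers are illustrative, not tied to any provider):

```python
def transfer_time_s(object_mb, latency_ms, throughput_mb_s):
    """First-order model of fetching one object: one round-trip
    latency plus size divided by bandwidth. Small objects are
    latency-bound; large objects are throughput-bound."""
    return latency_ms / 1000.0 + object_mb / throughput_mb_s

# 1000 small objects of 0.1 MB fetched serially vs one 100 MB object,
# assuming 20 ms latency and 100 MB/s throughput (hypothetical values):
many_small = 1000 * transfer_time_s(0.1, 20, 100)  # latency-dominated
one_large = transfer_time_s(100, 20, 100)          # throughput-dominated
```

The same total bytes can differ by an order of magnitude in fetch time, which is why training pipelines pack many small samples into large sequential files.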
Module 4: DL Performance Evaluation: Tools and Techniques
• Monitoring tools for GPU resources (nvprof, nvidia-smi)
• Time series analysis of resource usage data
• Predictive performance modeling techniques
• Black-box vs white-box modeling
• Analytical modeling
• Predictive performance models for DL: accuracy and runtime
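A black-box runtime model in the spirit of this module: fit measured step times to a simple linear function of batch size and extrapolate. The closed-form least-squares fit below uses made-up illustrative timings:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b  # (intercept a, slope b)

# measured (batch_size, seconds_per_step) pairs -- hypothetical data
obs = [(32, 0.05), (64, 0.09), (128, 0.17), (256, 0.33)]
a, b = fit_line([x for x, _ in obs], [y for _, y in obs])
predicted_512 = a + b * 512  # extrapolated step time at batch size 512
```

A white-box (analytical) model would instead derive the step time from FLOP counts and hardware peak rates; the black-box approach treats the system as opaque and fits observations.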
Module 5: ML Benchmarks
• DAWNBench, MLPerf
• Datasets: MNIST, CIFAR10/100, ImageNet, OpenML
• Performance metrics for DL jobs
• Runtime, cost, response time, accuracy, Time To Accuracy (TTA)
• Study of published numbers by different cloud service providers/vendors
• Compare performance scaling across GPUs for different models in MLPerf
• Open Neural Network Exchange (ONNX)
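Time To Accuracy (TTA), the headline metric of DAWNBench and MLPerf, is simple to state precisely. A sketch over a hypothetical training log:

```python
def time_to_accuracy(log, target):
    """Elapsed wall-clock time at which validation accuracy first
    reaches `target`. `log` is a chronological list of
    (seconds_elapsed, validation_accuracy) pairs; returns None if
    the target accuracy is never reached."""
    for seconds, acc in log:
        if acc >= target:
            return seconds
    return None
```

TTA captures what raw throughput misses: a configuration that runs more steps per second but converges to the target more slowly can still lose on TTA.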
Module 6: DL Systems Performance
• Training-logs: framework-specific support, instrumentation, analysis
• Checkpointing: framework-specific support, restarting from a checkpoint
• Job scheduling on clusters:
• Policies: FIFO, Gang, Earliest Deadline First
• Job Scheduler: Kubernetes default, Gandiva, Optimus
• Job Elasticity: scaling GPUs during runtime, platform support
• Performance scalability of DL clusters
Module 7: Operational DL and MLOps
• Model Lifecycle management
• DL systems testing and quality
• Drift detection and re-training
• Robustness and Adversarial training
• DevOps principles in machine learning
• MLOps and associated tool-chain
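Drift detection at its simplest compares a statistic of live inputs against the training distribution. A toy mean-shift check (real systems use proper statistical tests such as Kolmogorov-Smirnov; the names and threshold below are ours):

```python
def mean_drift(reference, live, threshold=0.1):
    """Flag drift when a feature's mean over live traffic deviates
    from its mean over the training reference sample by more than
    `threshold` (absolute difference). A drift alarm typically
    triggers re-training on fresh data."""
    ref_mean = sum(reference) / len(reference)
    live_mean = sum(live) / len(live)
    return abs(live_mean - ref_mean) > threshold
```

In an MLOps pipeline this check runs continuously on production inputs, and a sustained alarm kicks off the re-training stage of the model lifecycle.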
Module 8: Special Topics
• Transfer learning: fine-tuning and pseudo-labeling techniques
• Deep Reinforcement Learning systems
• Neural Network synthesis and architecture search (NAS)
• Hyperparameter optimization
• Automated Machine Learning (AutoML)
• Federated machine learning
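Of the hyperparameter optimization methods this module covers, random search is the standard baseline. A self-contained sketch (the search space and objective are hypothetical):

```python
import random

def random_search(objective, space, n_trials=25, seed=0):
    """Sample `n_trials` configurations uniformly from `space` (a dict
    mapping hyperparameter name -> list of candidate values) and
    return the (config, score) pair with the lowest score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(vals) for name, vals in space.items()}
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# toy stand-in for a validation loss, minimized at lr=0.01, bs=128
space = {"lr": [0.1, 0.01, 0.001], "batch_size": [32, 128, 512]}
objective = lambda c: (c["lr"] - 0.01) ** 2 \
    + (c["batch_size"] - 128) ** 2 / 1e6
```

AutoML systems layer smarter strategies (Bayesian optimization, successive halving) on this same loop; NAS applies the loop to architecture choices rather than scalar hyperparameters.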