Mining from Massive Data

About This Course

This course is a comprehensive guide to big data analytics, machine learning, and advanced NLP using PySpark, from exploratory data analysis to cloud-native production deployment via Docker and Kubernetes.

Course Syllabus (2025)

  1. Databricks

  2. How to do exploratory data analysis in PySpark (slides)

  3. Linear regression in Python and PySpark (slides)

  4. How to do grid search and classification in PySpark (slides)

  5. Handling imbalanced data in classification models (slides)

  6. Predicting Stock Price (slides)

  7. Docker and Kubernetes (slides)

  8. How to deploy PySpark to a Kubernetes cluster (slides)

  9. Spark NLP Models Hub (slides)

  10. How to use John Snow Labs' spark-nlp for sentiment analysis (slides)

  11. Hugging Face (slides)

  12. Horovod (slides)