Large-scale data science for real-world data

COM-490 / 6 credits

Lecturer(s): Bouillet Eric Pierre, Delgado Borda Pamela Isabel, Sarni Sofiane, Verscheure Olivier

Language: English

Withdrawal: It is not allowed to withdraw from this subject after the registration deadline.

Summary

This hands-on course covers tools and methods used by data scientists, from researching solutions to scaling prototypes on Spark clusters. Students engage with the full data engineering and data science pipeline, from data acquisition to extracting insights, applied to real-world problems.

Content

1. Crash Course in Python for Data Science

Use essential Python libraries for data manipulation, visualization, and introductory machine learning; get hands-on with development environments, collaborate via version control tools, work with interactive notebooks, and build workflows using real-world datasets.

2. Distributed Data Wrangling at Scale

Understand distributed data processing platforms spanning multiple servers and storage systems; build and optimize data lakes using efficient storage formats; perform large-scale Extract-Transform-Load (ETL) workflows; and explore and transform massive datasets for batch processing.

3. Distributed Processing with Spark

Apply advanced data engineering techniques using Spark; process and transform large datasets; train machine learning models in distributed pipelines; and optimize performance through efficient execution strategies.

4. Real-Time Big Data Processing

Learn real-time and event-driven processing concepts; design scalable streaming pipelines integrated with batch systems; and perform live inference on streaming data under dynamic conditions.

5. Final Project (Assignment) - Integration and Application

Design a comprehensive data science solution combining batch and streaming workflows; integrate methods from previous modules; apply best practices for scalable processing and deployment; and demonstrate full-cycle implementation on a real-world-inspired problem.

Keywords

Data Engineering, Data Lakes, Machine Learning Operations (MLOps), Distributed Computing, Real-Time Data Stream Processing, Scalable Data Processing, Large-Scale Data Analysis, Predictive Modeling, Apache Spark, Hadoop, Kafka.

Learning Prerequisites

Important concepts to start the course

Participants are expected to have prior experience with Python programming and an understanding of the fundamental mathematical concepts relevant to data science. Familiarity with key data science libraries, such as NumPy, pandas, and scikit-learn, is strongly recommended. A basic understanding of the Linux terminal, including navigating file systems and running command-line tools, is also beneficial.

Learning Outcomes

By the end of the course, the student must be able to:

Apply and coordinate the use of standard data science libraries and big data technologies to manage distributed and real-time data workflows.
Design, build, and optimize data lakes using efficient storage formats to enable scalable, high-performance data engineering.
Conduct large-scale data wrangling, transformation, and model training tasks on complex, real-world datasets.
Design, develop, and optimize scalable data pipelines for both batch and streaming contexts.
Integrate machine learning techniques into end-to-end data science workflows using appropriate tools and environments.
Interpret complex datasets and evaluate outcomes to extract actionable insights that support data-driven decision-making.
Formulate and justify technical choices in data storage, pipeline design, and machine learning model deployment, and communicate results clearly through visualizations, documentation, and oral presentations.

Transversal skills

Continue to work through difficulties or initial failure to find optimal solutions.
Identify the different roles that are involved in well-functioning teams and assume different roles, including leadership roles.
Use a work methodology appropriate to the task.
Manage priorities.
Use both general and domain-specific IT resources and tools.

Teaching methods

Teaching combines lectures and hands-on lab sessions. All activities use real-world datasets and distributed computing and storage services to ensure practical, applied learning.

Expected student activities

Apply: Put concepts into practice during hands-on lab sessions.
Engage: Take part in class discussions and interactive activities.
Collaborate: Work in teams to complete assignments and tackle real-world challenges.
Explain: Present your ideas and results clearly and concisely.

Assessment methods

60% Continuous group assessments during the semester
40% Final group project
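
As a rough illustration of how the two weights above combine, the following minimal sketch computes a weighted overall score. The function name `overall_score`, the example scores, and the assumption that both components are marked on the same scale are hypothetical; the course's actual grading and rounding rules are not stated here.

```python
# Weights taken from the assessment breakdown above.
CONTINUOUS_WEIGHT = 0.60  # continuous group assessments during the semester
PROJECT_WEIGHT = 0.40     # final group project


def overall_score(continuous: float, project: float) -> float:
    """Return the weighted combination of the two assessment components.

    Assumption: both scores are expressed on the same scale.
    No rounding is applied, since the official rounding rule is not given.
    """
    return CONTINUOUS_WEIGHT * continuous + PROJECT_WEIGHT * project


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(overall_score(continuous=5.0, project=5.5))
```

Because the continuous assessments carry the larger weight, a weak project can only partly offset strong work during the semester, and vice versa.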

Supervision

Office hours	Yes
Assistants	Yes
Forum	Yes

Resources

Virtual desktop infrastructure (VDI)

Yes

Bibliography

Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas, O'Reilly Media, 2023

Library resources

Find the references at the Library

Moodle Link

https://go.epfl.ch/COM-490

In the study plans

Semester: Spring
Exam form: During the semester (summer session)
Subject examined: Large-scale data science for real-world data
Project: 4 hour(s) weekly x 14 weeks
Type: optional
