Large-scale data science for real-world data
COM-490 / 6 credits
Teacher(s): Bouillet Eric Pierre, Delgado Borda Pamela Isabel, Sarni Sofiane, Verscheure Olivier
Language: English
Withdrawal: Not permitted after the registration deadline.
Summary
This hands-on course covers tools and methods used by data scientists, from researching solutions to scaling prototypes on Spark clusters. Students engage with the full data engineering and data science pipeline, from data acquisition to extracting insights, applied to real-world problems.
Content
1. Crash Course in Python for Data Science
Use essential Python libraries for data manipulation, visualization, and introductory machine learning; get hands-on with development environments, collaborate via version control tools, work with interactive notebooks, and build workflows using real-world datasets.
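For illustration only (not part of the official course material), here is a minimal sketch of the kind of notebook workflow this module covers, combining pandas, matplotlib, and scikit-learn; the file name trips.csv and its columns (distance_km, duration_min) are hypothetical placeholders.

```python
# Illustrative sketch only: a typical notebook-style workflow with pandas,
# matplotlib and scikit-learn. The file "trips.csv" and its columns
# ("distance_km", "duration_min") are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("trips.csv")                            # load a tabular dataset
df = df.dropna(subset=["distance_km", "duration_min"])   # basic cleaning

df.plot.scatter(x="distance_km", y="duration_min")       # quick exploratory plot
plt.show()

# Fit a simple introductory model: predict trip duration from distance.
model = LinearRegression()
model.fit(df[["distance_km"]], df["duration_min"])
print("slope (minutes per km):", model.coef_[0])
```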
2. Distributed Data Wrangling at Scale
Understand distributed data processing platforms spanning multiple servers and storage systems; build and optimize data lakes using efficient storage formats; perform large-scale Extract-Transform-Load (ETL) workflows; and explore and transform massive datasets for batch processing.
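As a hedged illustration of the kind of batch ETL step described above, the following PySpark sketch reads raw CSV files and rewrites them as a partitioned Parquet table; the bucket paths and column names (timestamp, event_date, country) are hypothetical, and the actual data lake layout depends on the course infrastructure. Partitioning by a date column and using a columnar format is what makes later large-scale scans efficient.

```python
# Illustrative sketch only: a small batch ETL job with PySpark.
# Input/output paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV files from object storage.
raw = spark.read.option("header", True).csv("s3a://raw-bucket/events/*.csv")

# Transform: deduplicate, derive a partition column, drop incomplete rows.
clean = (
    raw.dropDuplicates()
       .withColumn("event_date", F.to_date("timestamp"))
       .filter(F.col("country").isNotNull())
)

# Load: store in a columnar format, partitioned for efficient downstream scans.
clean.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://lake-bucket/events_parquet/"
)
spark.stop()
```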
3. Distributed Processing with Spark
Apply advanced data engineering techniques using Spark; process and transform large datasets; train machine learning models in distributed pipelines; and optimize performance through efficient execution strategies.
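The sketch below shows, under assumed table paths, numeric feature columns, and a label column, how a model can be trained inside a distributed Spark ML pipeline; it is an illustration of the technique, not the course's reference solution.

```python
# Illustrative sketch only: training a model in a distributed Spark ML pipeline.
# The Parquet path, feature names and label column are hypothetical, and the
# feature columns are assumed to be numeric.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

df = spark.read.parquet("s3a://lake-bucket/events_parquet/")

# Assemble raw columns into a single feature vector, then fit a classifier.
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b"], outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)            # training runs across the cluster
predictions = model.transform(test)
predictions.select("label", "prediction").show(5)
spark.stop()
```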
4. Real-Time Big Data Processing
Learn real-time and event-driven processing concepts; design scalable streaming pipelines integrated with batch systems; and perform live inference on streaming data under dynamic conditions.
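For illustration, here is one possible Spark Structured Streaming job that consumes events from a Kafka topic and computes windowed counts; the broker address and topic name are placeholders, and running it assumes the Spark-Kafka connector package is available on the cluster.

```python
# Illustrative sketch only: a Spark Structured Streaming job reading from Kafka.
# Broker address and topic name are hypothetical placeholders; the Kafka
# connector package must be available on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events-topic")
         .load()
)

# Kafka delivers key/value as bytes; decode the payload and keep the event time.
decoded = events.select(
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"),
)

# Count events per 1-minute window, tolerating 30 seconds of late data.
counts = (
    decoded.withWatermark("timestamp", "30 seconds")
           .groupBy(F.window("timestamp", "1 minute"))
           .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```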
5. Final Project (Assignment) - Integration and Application
Design a comprehensive data science solution combining batch and streaming workflows; integrate methods from previous modules; apply best practices for scalable processing and deployment; and demonstrate full-cycle implementation on a real-world-inspired problem.
Keywords
Data Engineering, Data Lakes, Machine Learning Operations (MLOps), Distributed Computing, Real-Time Data Stream Processing, Scalable Data Processing, Large-Scale Data Analysis, Predictive Modeling, Apache Spark, Hadoop, Kafka.
Learning Prerequisites
Important concepts to start the course
Participants are expected to have prior experience with Python programming and an understanding of the fundamental mathematical concepts relevant to data science. Familiarity with key data science libraries (such as NumPy, pandas, and scikit-learn) is strongly recommended. A basic command of the Linux terminal, including navigating file systems and executing command-line tools, is also beneficial.
Learning Outcomes
By the end of the course, the student must be able to:
- Apply and coordinate the use of standard data science libraries and big data technologies to manage distributed and real-time data workflows.
- Design, build, and optimize data lakes using efficient storage formats to enable scalable, high-performance data engineering.
- Conduct large-scale data wrangling, transformation, and model training tasks on complex, real-world datasets.
- Design, develop, and optimize scalable data pipelines for both batch and streaming contexts.
- Integrate machine learning techniques into end-to-end data science workflows using appropriate tools and environments.
- Interpret complex datasets and evaluate outcomes to extract actionable insights that support data-driven decision-making.
- Formulate and justify technical choices in data storage, pipeline design, and machine learning model deployment, and communicate results clearly through visualizations, documentation, and oral presentations.
Transversal skills
- Continue to work through difficulties or initial failure to find optimal solutions.
- Identify the different roles that are involved in well-functioning teams and assume different roles, including leadership roles.
- Use a work methodology appropriate to the task.
- Manage priorities.
- Use both general and domain-specific IT resources and tools.
Teaching methods
Teaching is based on lectures and hands-on lab sessions, with all activities involving real-world datasets and the use of distributed computing and storage services to ensure practical, applied learning.
Expected student activities
- Apply: Put concepts into practice during hands-on lab sessions.
- Engage: Take part in class discussions and interactive activities.
- Collaborate: Work in teams to complete assignments and tackle real-world challenges.
- Explain: Present your ideas and results clearly and concisely.
Assessment methods
- 60% Continuous group assessments during the semester
- 40% Final group project
Supervision
Office hours: Yes
Assistants: Yes
Forum: Yes
Resources
Virtual desktop infrastructure (VDI): Yes
Bibliography
- Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas, O'Reilly Media, 2023
Library resources
Moodle Link
In the programs
- Semester: Spring
- Exam form: During the semester (summer session)
- Subject examined: Large-scale data science for real-world data
- Project: 4 Hour(s) per week x 14 weeks
- Type: optional
Reference week
(Weekly timetable grid, Monday to Friday, 8:00-22:00, with a legend for Lecture, Exercise/TP, and Project/Lab/other: no time slots are marked.)