Course descriptions 2017-2018
Applied data analysis
CS-401
Instructor(s): West Robert
Language: English
Summary
This course teaches the basic techniques and practical skills required to make sense of a variety of data, with the help of the most acclaimed software tools in the data science world: pandas, scikit-learn, Spark, etc.
Content
Thanks to a new breed of software tools that make it easy to process and analyze data at scale, we are now able to extract invaluable insights from the vast amount of data generated daily. As a result, both the business and scientific worlds are undergoing a revolution fueled by one of the most sought-after job profiles: the data scientist.
This course covers the fundamental steps of the data science pipeline:
Data Acquisition
- Variety as one of the main challenges in big data: structured, semi-structured, unstructured
- Data sources: open, public (scraping, parsing, and down-sampling)
- Dataset fusion, filtering, slicing & dicing
- Data granularities and aggregations
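As a minimal sketch of what moving between data granularities can look like, the example below aggregates daily records into monthly totals. The records and the helper name `aggregate_by_month` are hypothetical and only serve as illustration. Plain Python is used to keep the sketch self-contained.

```python
from collections import defaultdict

# Hypothetical daily records: (ISO date, value)
daily = [
    ("2017-09-01", 10),
    ("2017-09-15", 5),
    ("2017-10-02", 7),
]

def aggregate_by_month(records):
    """Sum values per month, a coarser granularity than per day."""
    totals = defaultdict(int)
    for date, value in records:
        month = date[:7]  # keep only the "YYYY-MM" prefix
        totals[month] += value
    return dict(totals)

monthly = aggregate_by_month(daily)
```

Aggregating loses detail (the individual days) in exchange for a more compact view, which is why the choice of granularity matters early in the pipeline.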
Data Wrangling
- Data manipulation, array programming, dataframes
- The many sources of data problems (and how to fix them): missing data, incorrect data, inconsistent representations
- Schema alignment, data reconciliation
- Data quality testing with crowdsourcing
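To make the data problems listed above concrete, here is a small sketch that fixes two of them: inconsistent representations (several spellings of the same country) and missing data (imputed with the median). The records, the `CANONICAL` mapping, and the `clean` helper are assumptions made for illustration, and median imputation is only one of several possible strategies.

```python
from statistics import median

# Hypothetical raw records with a missing age and inconsistent country labels
rows = [
    {"country": "CH", "age": 31},
    {"country": "Switzerland", "age": None},
    {"country": " ch ", "age": 45},
]

# Assumed mapping from observed variants to one canonical label
CANONICAL = {"ch": "CH", "switzerland": "CH"}

def clean(records):
    """Normalize country labels and impute missing ages with the median."""
    known_ages = [r["age"] for r in records if r["age"] is not None]
    fill = median(known_ages)
    cleaned = []
    for r in records:
        key = r["country"].strip().lower()
        cleaned.append({
            # Fall back to the stripped original if the variant is unknown
            "country": CANONICAL.get(key, r["country"].strip()),
            "age": r["age"] if r["age"] is not None else fill,
        })
    return cleaned

cleaned = clean(rows)
```

The original list is left untouched and a cleaned copy is returned, which makes it easier to compare the data before and after the fix.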
Data Interpretation
- Stats in practice (distribution fitting, statistical significance, etc.)
- Co-occurrence grouping (market-basket analysis)
- Machine learning in practice (supervised and unsupervised, feature engineering, more data vs. advanced algorithms, curse of dimensionality, etc.)
- Text mining: vector space model, topic models, word embedding
- Social network analysis (influencers, community detection, etc.)
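As an illustration of co-occurrence grouping, the sketch below counts how often pairs of items appear together in the same basket and derives the support of one pair. The baskets and the `pair_counts` helper are hypothetical. Real market-basket analysis would usually add thresholds and further measures on top of these raw counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, each a set of purchased items
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
]

def pair_counts(transactions):
    """Count how many transactions contain each unordered item pair."""
    counts = Counter()
    for items in transactions:
        # Sort so each pair has one canonical (alphabetical) ordering
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
# Support: fraction of all transactions that contain both items
support = counts[("bread", "milk")] / len(baskets)
```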
Data Visualization
- Introduction to different plot types (1, 2, and 3 variables), layout best practices, network and geographical data
- Visualization to diagnose data problems, scaling visualization to large datasets, visualizing uncertain data
Reporting
- Results reporting, infographics
- How to publish reproducible results
- Anonymization, ethical concerns
The students will learn the techniques during the ex-cathedra lectures, and will then get familiar with the software tools to complete the homework assignments (which will be in part executed under the supervision of the teacher and the assistants, during the lab hours).
In parallel, the students will embark on a semester-long project, working in agile teams of 3. The outcomes of these team efforts will be combined towards the end of the course into a project portfolio that will be made public (and available as open source).
At the end of the semester, students will also take a 3-hour final exam in a classroom with computers, where they will be asked to complete a data analysis pipeline (both with code and extensive comments) on a dataset they have never worked with before.
Keywords
data science, data analysis, data mining, machine learning
Learning Prerequisites
Required courses
The student MUST have passed an introduction to databases course, OR a course in probability & statistics, OR two separate courses that include programming projects.
Recommended courses
- CS-423 Distributed Information Systems
- CS-433 Pattern Classification and Machine Learning
Important concepts to start the course
Algorithms, object oriented programming, basic probability and statistics
Learning Outcomes
By the end of the course, the student must be able to:
- Construct a coherent understanding of the techniques and software tools required to perform the fundamental steps of the Data Science pipeline
- Perform data acquisition (data formats, dataset fusion, Web scrapers, REST APIs, open data, big data platforms, etc.)
- Perform data wrangling (fixing missing and incorrect data, data reconciliation, data quality assessments, etc.)
- Perform data interpretation (statistics, knowledge extraction, critical thinking, team discussions, ad-hoc visualizations, etc.)
- Perform result dissemination (reporting, visualizations, publishing reproducible results, ethical concerns, etc.)
Transversal skills
- Give feedback (critique) in an appropriate fashion.
- Demonstrate the capacity for critical thinking
- Write a scientific or technical report.
- Evaluate one's own performance in the team, receive and respond appropriately to feedback.
Teaching methods
- Physical in-class recitations and lab sessions
- Homework assignments
- Course project
Expected student activities
Students are expected to:
- Attend the lectures and lab sessions
- Complete a weekly homework assignment
- Read/watch the pertinent material before a lecture
- Engage during class, and present their results in front of their classmates
Assessment methods
- 30% continuous assessment during the semester (homework)
- 30% final exam, data analysis task on a computer (3 hours)
- 40% final project, done in groups of 3
Supervision
Office hours | Yes |
Assistants | Yes |
Forum | Yes |
Others | http://ada.epfl.ch |
Resources
Virtual desktop infrastructure (VDI)
No
Websites
In the study plans
- Semester: Fall
- Exam form: Written
- Credits: 6
- Subject examined: Applied data analysis
- Lecture: 2 hour(s) per week x 14 weeks
- Project: 2 hour(s) per week x 14 weeks