CS-460 / 8 credits

Teacher(s): Ailamaki Anastasia, Kermarrec Anne-Marie

Language: English


Summary

This course is for students who want to understand modern large-scale data analysis systems and database systems. It covers the fundamental principles for understanding and building systems that manage and analyze large amounts of data, across a range of topics from distributed data management and transaction processing to large-scale analytics, stream processing, and graph processing.

Content

Topics include large-scale data systems design and implementation, and specifically:

  • Distributed data management systems
  • Data management: locality, accesses, partitioning, replication
  • Modern storage hierarchies
  • Query optimization, database tuning
  • Transaction management
  • Data structures: file systems, key-value stores, DBMS
  • Consistency models
  • Large-scale data analytics infrastructures
  • Parallel processing
  • Data stream and graph processing

Learning Prerequisites

Required courses

  • CS-107 Introduction to programming
  • CS-214 Software construction
  • CS-300 Data-Intensive Systems
  • CS-202 Computer systems

or equivalent courses

Important concepts to start the course

  • Knowledge of algorithms and data structures.
  • Scala and/or Java programming languages will be used throughout the course. Programming experience in one of these languages is strongly recommended.
  • Basic knowledge of computer networking and distributed systems.

Learning Outcomes

By the end of the course, the student must be able to:

  • Understand in detail the design of big data analytics systems using state-of-the-art infrastructures for horizontal scaling, e.g., Spark
  • Implement algorithms and data structures for streaming data analytics
  • Understand the advantages and disadvantages of different storage models for a given workload, based on the optimizations each model enables and the workload characteristics
  • Compare concurrency control algorithms, and algorithms for distributed data management
  • Configure systems parameters, data layouts, and application designs for database systems
  • Develop data-parallel analytics programs that make use of modern clusters and cloud offerings to scale up to very large workloads
  • Analyze the trade-offs between various approaches to large-scale data management and analytics, depending on efficiency, scalability, and latency needs

Teaching methods

Lectures, project, homework, exercises and practical work

Expected student activities

  • Attend lectures and participate in class
  • Complete a project as per the guidelines posted by the teaching team

Assessment methods

  • Project
  • Midterm (as needed)
  • Final exam

Supervision

Office hours: Yes
Assistants: Yes
Forum: Yes

Resources

Bibliography

J. Hellerstein & M. Stonebraker: "Readings in Database Systems", 4th Edition, 2005.
R. Ramakrishnan & J. Gehrke: "Database Management Systems", McGraw-Hill, 3rd Edition, 2002.
A. Rajaraman & J. Ullman: "Mining of Massive Datasets", Cambridge Univ. Press, 2011.

Library resources

Moodle Link

In the programs

  • Semester: Spring
  • Exam form: Written (summer session)
  • Subject examined: Systems for data management and data science
  • Courses: 2 Hour(s) per week x 14 weeks
  • Exercises: 2 Hour(s) per week x 14 weeks
  • Lab: 2 Hour(s) per week x 14 weeks
  • Type: mandatory
  • Semester: Spring
  • Exam form: Written (summer session)
  • Subject examined: Systems for data management and data science
  • Courses: 2 Hour(s) per week x 14 weeks
  • Exercises: 2 Hour(s) per week x 14 weeks
  • Lab: 2 Hour(s) per week x 14 weeks
  • Type: optional

Reference week

Monday, 14h - 16h: Lecture CE12

Monday, 16h - 18h: Exercise, TP CE12

Tuesday, 11h - 13h: Project, labs, other GRA330, GRA332, GRB330
