Large-scale data science for real-world data
COM-490 / 6 credits
Teacher(s): Bouillet Eric Pierre, Delgado Borda Pamela Isabel, Sarni Sofiane, Verscheure Olivier
Language: English
Withdrawal: Not permitted after the registration deadline.
Summary
This hands-on course covers tools and methods used by data scientists, from researching solutions to scaling prototypes on Spark clusters. Students engage with the full data engineering and data science pipeline, from data acquisition to extracting insights, applied to real-world problems.
Content
1. Crash Course in Python for Data Science
Use essential Python libraries for data manipulation, visualization, and introductory machine learning; get hands-on with development environments, collaborate via version control tools, work with interactive notebooks, and build workflows using real-world datasets.
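For illustration only (not part of the official course material), here is a minimal sketch of the kind of notebook workflow this module covers, combining pandas, matplotlib, and scikit-learn; the file name trips.csv and its columns (distance_km, duration_min) are hypothetical placeholders.

```python
# Illustrative sketch only: a typical notebook-style workflow with pandas,
# matplotlib and scikit-learn. The file "trips.csv" and its columns
# ("distance_km", "duration_min") are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("trips.csv")                            # load a tabular dataset
df = df.dropna(subset=["distance_km", "duration_min"])   # basic cleaning

df.plot.scatter(x="distance_km", y="duration_min")       # quick exploratory plot
plt.show()

# Fit a simple introductory model: predict trip duration from distance.
model = LinearRegression()
model.fit(df[["distance_km"]], df["duration_min"])
print("slope (minutes per km):", model.coef_[0])
```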
2. Distributed Data Wrangling at Scale
Understand distributed data processing platforms spanning multiple servers and storage systems; build and optimize data lakes using efficient storage formats; perform large-scale Extract-Transform-Load (ETL) workflows; and explore and transform massive datasets for batch processing.
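As a hedged illustration of the kind of batch ETL step described above, the following PySpark sketch reads raw CSV files and rewrites them as a partitioned Parquet table; the bucket paths and column names (timestamp, event_date, country) are hypothetical, and the actual data lake layout depends on the course infrastructure. Partitioning by a date column and using a columnar format is what makes later large-scale scans efficient.

```python
# Illustrative sketch only: a small batch ETL job with PySpark.
# Input/output paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV files from object storage.
raw = spark.read.option("header", True).csv("s3a://raw-bucket/events/*.csv")

# Transform: deduplicate, derive a partition column, drop incomplete rows.
clean = (
    raw.dropDuplicates()
       .withColumn("event_date", F.to_date("timestamp"))
       .filter(F.col("country").isNotNull())
)

# Load: store in a columnar format, partitioned for efficient downstream scans.
clean.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://lake-bucket/events_parquet/"
)
spark.stop()
```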
3. Distributed Processing with Spark
Apply advanced data engineering techniques using Spark; process and transform large datasets; train machine learning models in distributed pipelines; and optimize performance through efficient execution strategies.
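The sketch below shows, under assumed table paths, numeric feature columns, and a label column, how a model can be trained inside a distributed Spark ML pipeline; it is an illustration of the technique, not the course's reference solution.

```python
# Illustrative sketch only: training a model in a distributed Spark ML pipeline.
# The Parquet path, feature names and label column are hypothetical, and the
# feature columns are assumed to be numeric.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

df = spark.read.parquet("s3a://lake-bucket/events_parquet/")

# Assemble raw columns into a single feature vector, then fit a classifier.
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b"], outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)            # training runs across the cluster
predictions = model.transform(test)
predictions.select("label", "prediction").show(5)
spark.stop()
```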
4. Real-Time Big Data Processing
Learn real-time and event-driven processing concepts; design scalable streaming pipelines integrated with batch systems; and perform live inference on streaming data under dynamic conditions.
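For illustration, here is one possible Spark Structured Streaming job that consumes events from a Kafka topic and computes windowed counts; the broker address and topic name are placeholders, and running it assumes the Spark-Kafka connector package is available on the cluster.

```python
# Illustrative sketch only: a Spark Structured Streaming job reading from Kafka.
# Broker address and topic name are hypothetical placeholders; the Kafka
# connector package must be available on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events-topic")
         .load()
)

# Kafka delivers key/value as bytes; decode the payload and keep the event time.
decoded = events.select(
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"),
)

# Count events per 1-minute window, tolerating 30 seconds of late data.
counts = (
    decoded.withWatermark("timestamp", "30 seconds")
           .groupBy(F.window("timestamp", "1 minute"))
           .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```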
5. Final Project (Assignment) - Integration and Application
Design a comprehensive data science solution combining batch and streaming workflows; integrate methods from previous modules; apply best practices for scalable processing and deployment; and demonstrate full-cycle implementation on a real-world-inspired problem.
Keywords
Data Engineering, Data Lakes, Machine Learning Operations (MLOps), Distributed Computing, Real-Time Data Stream Processing, Scalable Data Processing, Large-Scale Data Analysis, Predictive Modeling, Apache Spark, Hadoop, Kafka.
Learning Prerequisites
Important concepts to start the course
Participants are expected to have prior experience with Python programming and an understanding of the fundamental mathematical concepts relevant to data science. Familiarity with key data science libraries (such as NumPy, pandas, and scikit-learn) is strongly recommended. A basic command of the Linux terminal, including navigating file systems and executing command-line tools, is also beneficial.
Learning Outcomes
By the end of the course, the student must be able to:
- Apply and coordinate the use of standard data science libraries and big data technologies to manage distributed and real-time data workflows.
- Design, build, and optimize data lakes using efficient storage formats to enable scalable, high-performance data engineering.
- Conduct large-scale data wrangling, transformation, and model training tasks on complex, real-world datasets.
- Design, develop, and optimize scalable data pipelines for both batch and streaming contexts.
- Integrate machine learning techniques into end-to-end data science workflows using appropriate tools and environments.
- Interpret complex datasets and evaluate outcomes to extract actionable insights that support data-driven decision-making.
- Formulate and justify technical choices in data storage, pipeline design, and machine learning model deployment, and communicate results clearly through visualizations, documentation, and oral presentations.
Transversal skills
- Continue to work through difficulties or initial failure to find optimal solutions.
- Identify the different roles that are involved in well-functioning teams and assume different roles, including leadership roles.
- Use a work methodology appropriate to the task.
- Manage priorities.
- Use both general and domain-specific IT resources and tools.
Teaching methods
Teaching is based on lectures and hands-on lab sessions, with all activities involving real-world datasets and the use of distributed computing and storage services to ensure practical, applied learning.
Expected student activities
- Apply: Put concepts into practice during hands-on lab sessions.
- Engage: Take part in class discussions and interactive activities.
- Collaborate: Work in teams to complete assignments and tackle real-world challenges.
- Explain: Present your ideas and results clearly and concisely.
Assessment methods
- 60% Continuous group assessments during the semester
- 40% Final group project
Supervision
Office hours: Yes
Assistants: Yes
Forum: Yes
Resources
Virtual desktop infrastructure (VDI): Yes
Bibliography
- Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas, O'Reilly Media, 2023
Library resources
Moodle Link
In the programs
- Semester: Spring
- Exam form: During the semester (summer session)
- Subject examined: Large-scale data science for real-world data
- Project: 4 Hour(s) per week x 14 weeks
- Type: optional
Reference week
(Weekly timetable grid, Monday to Friday, 8:00-22:00, with a legend for Lecture, Exercise/TP, and Project/Lab/other: no time slots are marked.)