Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.
Pre-requisites for this course include 61A, 61B, 61C and basic programming skills. Knowledge of Python will be useful for the assignments, and a few will also use the Scala language. Students will also be expected to run VirtualBox on their laptops for the assignments.
Please take the class survey here.
Please set up your machine according to these instructions.
Lecture Date | Lecture Material | Weds Lab | Reading | Assignments |
W 9/3 | L1: Introduction/Data Science Process [PPTX] [PDF] | No Lab | ||
M 9/8 | L2: Data Preparation [PPTX] [PDF] | Lab 1 Unix | Enterprise Data Analysis and Visualization: An Interview Study | Bunny 1 by 5pm on 9/8 |
M 9/15 | L3: Tabular Data [PPTX] [PDF] | Lab 2 Pandas | From Databases to Dataspaces: A New Abstraction for Information Management; Schemaless SQL and Schema on Write vs. Schema on Read | Bunny 2 by 5pm on 9/15; Homework 1 out. Due by 10/2 |
M 9/22 | ||||
M 9/29 | L4: Data Cleaning [PPTX] [PDF] | Lab 3 OpenRefine | ||
Th 10/2 | Homework 1 Due! Submit using glookup | |||
M 10/6 | L5: Data Integration [PPTX] [PDF] | Lab 4 Pandas | WebTables: Exploring the Power of Tables on the Web (Sections 1, 2 and 4; others optional) and OpenRefine Data Augmentation (video) | Bunny 3 by 5pm; Final Project Group Lists Due Midnight |
M 10/13 | L6: Exploratory Data Analysis [PPTX] [PDF] | Lab 5 Python | Statistical Thinking in the Age of Big Data; Exploratory Data Analysis from the O'Reilly book "Doing Data Science" (available on campus or via the library VPN); Introduction to Hypothesis Testing | Bunny 4 by 5pm; Final Project Proposals due Thurs 10/16 Midnight. Homework 2 out. Due by 11/6 |
M 10/20 | L7: Regression, Classification, intro to Supervised Learning Part 1: [PPTX] [PDF] Part 2: [PPTX] [PDF] Homework Tips | Lab 6 R | Three Basic Algorithms from the O'Reilly book "Doing Data Science" (available on campus or via the library VPN) | Bunny 5 by 5pm |
M 10/27 | ||||
M 11/3 | L8: Data Products, Slides: [PDF] (29MB); followed by Part 2: Unsupervised Learning and K-Means Clustering (in Python) | Lab 7 Python | K-Nearest Neighbors and K-Means clustering from Three Basic Algorithms, part of the O'Reilly book "Doing Data Science" (available on campus or via the library VPN) | No Bunny! |
Th 11/6 | Homework 2 Due. Submit using glookup | |||
M 11/10 | L9: Scaling Up Analytics; [PDF] [PPTX] | Lab 8 Spark/EC2 | "MapReduce", "Word Frequency Problem", and "Other Examples of MapReduce" sections from the O'Reilly book "Doing Data Science" (available online or from the library) and Spark Short paper | Bunny 9 by 5pm; Homework 3, Part 1 Due 4/14 |
F 4/11 | Final Project update due on glookup | |||
M 11/17 | L10: Visualization (D3 lab) [PPTX] [PDF] | Lab 9 Visualization Lab Slides | Chapter 9 on Data Visualization from "Doing Data Science" (available online or from the library); D3: Data Driven Documents by Bostock et al.; Optional: Reading by Edward Tufte about how the Challenger disaster may have been prevented with data visualization | Bunny 10 by 5pm; Homework 3, Part 1 due; Homework 3, Part 2 out. Due by 11/25. |
TBD | Midterm - 5:00 to 6:30 pm | |||
M 11/24 | L11: Graph Processing; [PPTX] (19MB) [PDF] (19MB) | Lab 10 GraphX | Chapter 2 from "Networks, Crowds, and Markets: Reasoning About a Highly Connected World" | Bunny 11 by 5pm |
Tu 11/25 | Homework 3, Part 2 due | |||
M 12/1 | L12: Putting it All Together | Bunny 12 by 5pm |