CS 194-16 Introduction to Data Science, UC Berkeley - Fall 2014

Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.

Logistics

Pre-requisites

Pre-requisites for this course include 61A, 61B, 61C and basic programming skills. Knowledge of Python will be useful for the assignments, and a few will also use the Scala language. Students will also be expected to run VirtualBox on their laptops for the assignments.

Please take the class survey here.

Please set up your machine according to these instructions.

Grading

Schedule (warning - very early draft!)

Lecture Date Lecture Material Weds Lab Reading Assignments
W 9/3 L1: Introduction/Data Science Process [PPTX] [PDF] No Lab
M 9/8 L2: Data Preparation [PPTX] [PDF] Lab 1 Unix Enterprise Data Analysis and Visualization: An Interview Study Bunny 1 by 5pm on 9/8
M 9/15 L3: Tabular Data [PPTX] [PDF] Lab 2 Pandas From Databases to Dataspaces: A New Abstraction for Information Management
Schemaless SQL and Schema on Write vs. Schema on Read
Bunny 2 by 5pm on 9/15
Homework 1 out. Due by 10/2
M 9/22
M 9/29 L4: Data Cleaning [PPTX] [PDF] Lab 3 OpenRefine
Th 10/2 Homework 1 Due! Submit using glookup
M 10/6 L5: Data Integration [PPTX] [PDF] Lab 4 Pandas WebTables: Exploring the Power of Tables on the Web (Sections 1, 2 and 4; others optional)
and OpenRefine Data Augmentation (video)
Bunny 3 by 5pm;
Final Project Group Lists Due Midnight
M 10/13 L6: Exploratory Data Analysis [PPTX] [PDF] Lab 5 Python Statistical Thinking in the Age of Big Data
Exploratory Data Analysis
From the O'Reilly Book "Doing Data Science" - available on campus or via the library VPN.

Introduction to Hypothesis Testing
Bunny 4 by 5pm;
Final Project Proposals due Thurs 10/16 Midnight.
Homework 2 out. Due by 11/6
M 10/20 L7: Regression, Classification, intro to Supervised Learning
Part 1:[PPTX] [PDF] Part 2:[PPTX] [PDF]
Homework Tips
Lab 6 R Three Basic Algorithms From the O'Reilly Book "Doing Data Science" - available on campus or via the library VPN.
Bunny 5 by 5pm
M 10/27
M 11/3 L8: Data Products, Slides: [PDF](29MB); followed by:
Part 2: Unsupervised Learning and K-Means Clustering (in Python)
Lab 7 Python K-Nearest Neighbors and K-Means clustering from Three Basic Algorithms. Part of the O'Reilly Book "Doing Data Science" - available on campus or via the library VPN.
No Bunny!
Th 11/6 Homework 2 Due. Submit using glookup
M 11/10 L9: Scaling Up Analytics; [PDF] [PPTX] Lab 8 Spark/EC2 "MapReduce," "Word Frequency Problem", and "Other Examples of MapReduce" sections from O'Reilly "Doing Data Science" book (available online or from the library) and Spark Short paper Bunny 9 by 5pm
Homework 3, Part 1 Due 4/14
F 4/11 Final Project update due on glookup
M 11/17 L10: Visualization (D3 lab)[PPTX] [PDF]
Lab 9 Visualization Lab Slides Chapter 9 on Data Visualization from "Doing Data Science" available online or from the library.
D3: Data-Driven Documents by Bostock et al.
Optional: Reading by Edward Tufte on how the Challenger disaster might have been prevented with data visualization
Bunny 10 by 5pm
Homework 3, Part 1 due
Homework 3, Part 2 out. Due by 11/25.
TBD Midterm - 5:00 to 6:30 pm
M 11/24 L11: Graph Processing; [PPTX](19MB) [PDF](19MB)
Lab 10 GraphX Chapter 2 from "Networks, Crowds, and Markets: Reasoning About a Highly Connected World" Bunny 11 by 5pm
Tu 11/25 Homework 3, Part 2 due
M 12/1 L12: Putting it All Together Bunny 12 by 5pm