A new CSC 151, revisited
I’ve been working on the new CSC 151 for much
of the summer. Today I gave a report to our Data Science workshop [1]
on our new Data Science
model for CSC 151, our introductory course.
I owe a more condensed written version of that presentation to our grants
officer. This musing serves as a draft of my report. I may continue
tomorrow with a longer attempt to transform my notes [2] into narrative.
In this project, we are adding topics from Data Science topics to
CSC 151, Grinnell’s introductory course in Computer Science. The
primary focus of CSC 151 will remain computer science, but we will
apply CS principles to problems in data science that most draw upon
computational thinking, focusing on three big pictures
issues: a
data science process, ways of thinking about data, and a cyclic
use then build
process.
Process. Data scientists employ various versions of the scientific method to their work. We will focus on an exploratory data process, one in which the researcher starts with a collection of data (e.g., found on the Internet or previously gathered) and explores the data to see what questions they reveal. After finding or gathering data, one develops a question or questions, writes programs that transform the data into useable forms, represents and stores the data, writes programs that clean the data (e.g., standardizing formats, removing entries with incomplete data), analyzes the data, visualizes the data, and presents any preliminary insights from the exploration. This process is both cyclic and back-and-forth. It is cyclic in that the data scientists will often go back to the beginning after completing the process, returning once again to find and gather more data. It is back-and-forth in that each stage may send the data scientist back to a prior stage or stages. For example, a visualization may lead to another analysis, and an analysis may lead the data scientist to revisit how they cleaned the data.
Because this is a course in computer science and not in statistics, the mechanisms for exploring data will be relatively shallow. We will send students interested in deeper analysis to other courses in Statistics and Data Science.
Thinking about data. Throughout the course, we will emphasize two key
aspects of thinking like a data scientist
. First, we will encourage
students to challenge the data. A good data scientist approaches results
skeptically and asks questions like What other tests might I do to confirm
the preliminary result I’ve found?
What data are missing and might those
missing data have affected my results? How might having those data change
my results?
How were the data gathered?
. Second, as noted in the
section on Process, we will also encourage students to actively explore
data sets, to think creatively about what they might find in a data set.
Use then build. Just as statistics students gain power and
understanding by developing and using statistical tests that they first
use as built in
functions in programs like Excel and Stata, computer
science students benefit from learning how to implement that standard
functions that they might use to do computational data science. In the
first half of the semester, students will use pre-written procedures,
such as map
, reduce
, and filter
, as they work with data. In the
second half of the semester, students will implement the procedures
they used in the first half of the semester.
Finally, the course will retain a mid-size project that students complete in teams toward the end of the semester. The project will require students to identify and explore their own data sets.
We expect that this revised version of CSC 151 will not only continue to help students develop skills in computational thinking, but also expand those skills to reflect key issues and ideas in data science.
[1] Yes, there are those damn workshops again.
[2] I just wrote notes; I did not make a presentation.
[3] Audience members, in the original.
Version 1.0 of 2017-08-07.