Reflections on the new data-science-themed CSC 151

Topics/tags: CSC 151, teaching [1], Scheme, Racket, data science, long, rambly

In the summer of 2017, Sarah Dahlby Albright, Titus Klinge, and I developed a new version of CSC 151 that used data science as the theme. I taught two sections in the fall and a section in the spring. Titus also taught a section in the fall and a section in the spring. I won’t be teaching it this fall; instead, I’m taking the semester to develop a third version of CSC 151, one that emphasizes digital humanities [4], which I will teach in the spring.

That doesn’t mean that the data science version won’t be offered. Since CSC 151 is key to the Grinnell CS curriculum [6], others will be teaching CSC 151 this fall. And, since the data science [7] curriculum is a bit more robust than the mediascheme curriculum [8], my colleagues will be following the new curriculum, using the readings and labs we developed last year.

A week or two ago, one of those colleagues asked me to summarize what I would change about this new CSC 151. I consider it useful to reflect on those matters, not just to support my colleagues who have to deal with materials that I developed, but also to help myself prepare to design the next version of the course.

Let’s start with what went well with the new version. We were able to get rid of the dependence on a custom interprocess communication library written in C. Things crash much less often. Students are also able to run the class examples on all three major platforms (Linux, Mac OS, Microsoft Windows). That’s a significant advantage, but it’s mostly behind the scenes.

We rearranged the curriculum so that students work with higher-order procedures over lists (e.g., map, reduce) starting in the second week of the semester. We also introduce section and compose early, and we introduce and use strings much earlier than before. In teaching the course, we want to help students think differently about how they define procedures and to get accustomed to techniques of higher-order programming [9]. The map-early ordering has worked well toward that end. More broadly, the order of topics seems to function well.
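
For readers who have not seen this style, here is a minimal sketch of what working with higher-order procedures over lists can look like. It uses only standard Racket, so it makes two substitutions that are assumptions rather than the course's own code: foldl stands in for reduce, and curry stands in for section. The scores are made-up sample data.

```racket
#lang racket
;; Hypothetical sample data: quiz scores out of 10.
(define scores (list 7 9 10 4 8))

;; map applies a procedure to every element of a list.
(define percentages (map (lambda (s) (* 10 s)) scores))

;; foldl combines the elements into a single value; it stands in
;; for the course's reduce here (an assumption).
(define total (foldl + 0 scores))

;; curry fills in the first argument of *; it stands in for the
;; course's section here (an assumption).
;; compose applies its procedures right to left:
;; first multiply by 10, then add 1.
(define scale-then-bump (compose add1 (curry * 10)))

(define bumped (map scale-then-bump scores))
```

With these definitions, percentages is '(70 90 100 40 80), total is 38, and bumped is '(71 91 101 41 81).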

Like the earlier mediascheme course, the data science version includes a substantial project. The project remains engaging and students feel a clear sense of ownership. We’ve also added explicit work days; those have proven useful to the students.

I will note that I find the “find a data set and do something interesting with it” projects less interesting than the old “a procedure is worth 1000 pictures” projects. And I miss the way in which those old projects helped students realize just how broadly applicable and creative CS can be. I don’t know the right way to deal with this issue. Nonetheless, it seems worthwhile to note. One possibility is to encourage students to develop new ways to visualize a data set.

More broadly, I find the image-making theme more exciting than the data science theme. But that may just be me.

Let’s turn to things that can be changed, or that are worth considering changing.

The datacsci [10] students seem less comfortable using lambda expressions than do the mediascheme students. Too often, the datacsci students want to map something over a list and struggle to build an expression with compose and section when a lambda expression would be more straightforward. We need to emphasize the benefits of using lambda with map (and filter and reduce). Pay attention when you introduce procedures and make sure to re-emphasize how to use anonymous lambda procedures.
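
A small sketch may make the contrast concrete. Both procedures below convert Fahrenheit to Celsius; the first is built with compose plus partial application, the second with a lambda expression. The names and temperatures are hypothetical, and standard Racket's curry and curryr stand in for the course's section (an assumption).

```racket
#lang racket
;; Hypothetical sample data: temperatures in degrees Fahrenheit.
(define temps-f (list 32 50 212))

;; Point-free version: subtract 32, then multiply by 5/9.
;; curryr fixes the right-hand argument of -, and curry fixes the
;; left-hand argument of *.
(define f->c/point-free
  (compose (curry * 5/9) (curryr - 32)))

;; Lambda version: the same computation, written out directly.
(define f->c/lambda
  (lambda (f) (* 5/9 (- f 32))))

(define celsius-a (map f->c/point-free temps-f))
(define celsius-b (map f->c/lambda temps-f))

;; Anonymous lambdas work the same way with filter.
(define warm (filter (lambda (c) (> c 5)) celsius-b))
```

Both celsius-a and celsius-b are '(0 10 100), and warm is '(10 100). The lambda version names its parameter and reads in the same order as the formula, which is the benefit worth emphasizing.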

For sighted students, visualization is a powerful tool. While we do present some visualization techniques and procedures, we could use more emphasis on visualization. It may be that an early homework assignment or a few extra problems will be enough.

There have been significant changes to MathLAN since the spring. The “getting started with GNU/Linux” lab needs to be rewritten and tested.

We’ve had four take-home examinations each semester, with the first exam assigned in the middle of week three. We had that early exam so that students could better understand the level of the course before the add/drop deadline. With the change in the add/drop deadline, that no longer seems necessary. I would consider switching to three exams: one assigned in week four and due in week five, one assigned in week eight or nine and due in week nine or ten, and one assigned in week thirteen and due in week fourteen. I am not sure whether I would change the number of problems on the exams; six seems to have worked well.

Since data munging and data cleaning are two core aspects of our approach to data science, we should consider adding a paired reading and lab on regular expressions [11]. However, I’m not sure what we could cut to make room for that additional material.
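
As a rough idea of what such a lab might cover, here is a minimal sketch of regular expressions used for cleaning, written with standard Racket procedures (regexp-match, regexp-replace*, string-trim, filter-map). The raw entries, the pattern, and the procedure names are all hypothetical illustrations, not material from the course.

```racket
#lang racket
;; Hypothetical raw entries of the kind a zip-code column might hold.
;; Note that "5O112" contains the letter O rather than a zero.
(define raw-zips (list " 50112" "50112-1234" "5O112" "unknown" "02134 "))

;; Five digits, optionally followed by a hyphen and four more digits.
(define zip-pattern #px"^(\\d{5})(?:-\\d{4})?$")

;; clean-zip : string -> string or #f
;; Trim surrounding whitespace and keep the five-digit part,
;; or return #f when the entry does not look like a zip code.
(define (clean-zip str)
  (let ([hit (regexp-match zip-pattern (string-trim str))])
    (and hit (cadr hit))))

;; filter-map keeps only the entries that cleaned successfully.
(define cleaned (filter-map clean-zip raw-zips))

;; regexp-replace* rewrites every match; here it collapses runs
;; of whitespace into a single space.
(define tidy (regexp-replace* #px"\\s+" "Grinnell,   IA" " "))
```

Here cleaned is '("50112" "50112" "02134") and tidy is "Grinnell, IA".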

We’ve tried to include ethical considerations implicitly in much of what we do in the class. But they could use a bit more emphasis. It might be worth having students read the ACM Code of Ethics. Of course, the ethics of data are a bit different, so it might also be worth having students read something about those issues, too. I’m not entirely sure where that fits best.

Our sample datasets are not particularly exciting. We use a set of zip codes, which provides a good starting point for thinking about visualizations (and helps students realize when they have switched latitude and longitude). It also requires some relatively straightforward cleaning. We use course registration data because it’s familiar to students. But we could come up with some more compelling sets.
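
To give a sense of the kind of cleaning involved, here is a hedged sketch of one check for switched coordinates. The row layout (zip, latitude, longitude), the values, and the procedure names are assumptions made for illustration; they are not the course's data or code.

```racket
#lang racket
;; Each hypothetical row is (zip latitude longitude). The values are
;; approximate and made up; the second row is swapped on purpose.
(define rows
  (list (list "50112" 41.7 -92.7)
        (list "50112" -92.7 41.7)))

;; A latitude must lie between -90 and 90 degrees.
(define (valid-latitude? lat)
  (<= -90 lat 90))

;; repair-row : row -> row
;; Swap the coordinates when the latitude is out of range but the
;; longitude would be a valid latitude; otherwise leave the row alone.
(define (repair-row row)
  (let ([zip (car row)]
        [lat (cadr row)]
        [lon (caddr row)])
    (if (and (not (valid-latitude? lat)) (valid-latitude? lon))
        (list zip lon lat)
        row)))

(define repaired (map repair-row rows))
```

This check only catches swaps in which the longitude's magnitude exceeds 90, so it misses many locations; plotting the points remains the more reliable way to spot the rest.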

Here are a few moderate-length additions that would be useful. If my colleagues ask nicely, I may find the time to write something up for each of them in the first few weeks of the semester [12].

Students would benefit from a style guide. We discuss style both implicitly and explicitly, but the material is not all in one place. It may be useful to put everything together, along with examples.

It has been suggested that we might want to walk students more explicitly through various problem-solving strategies: You have a computational problem to solve or a procedure to write. Now what? I model approaches in class, but students might benefit from a list of steps accompanied by examples [14].

And a big one: We lost the online reference when we switched from the old mediascheme version. We should reintroduce it. But that will take some effort.

Then there are the issues that aren’t worth considering for this fall’s course, but might be worth exploring in the future.

It may be worth revisiting approaches to local bindings. We’ve generally told students to use let, let*, letrec, and named let. The Racket community seems more fond of internal define statements. I worry that a student who does not learn let will find that their Scheme education is incomplete [15]. There’s also the (slight?) value in considering the nesting of let and lambda [16]; it helps students think more deeply about scope and the lifetimes of scopes. In contrast, internal defines seem a bit easier to understand and require less indentation.
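
For comparison, here is a minimal sketch of the same small procedure written with let* and with internal defines, followed by a named let. The procedure names are hypothetical and chosen only for this illustration.

```racket
#lang racket
;; Version 1: let* binds names in sequence, so a later binding
;; could refer to an earlier one.
(define (mean-let lst)
  (let* ([total (apply + lst)]
         [n (length lst)])
    (/ total n)))

;; Version 2: internal defines say the same thing with one less
;; level of nesting.
(define (mean-define lst)
  (define total (apply + lst))
  (define n (length lst))
  (/ total n))

;; Version 3: named let binds a local recursive helper (loop) and
;; calls it immediately with the initial arguments.
(define (sum-named-let lst)
  (let loop ([remaining lst]
             [so-far 0])
    (if (null? remaining)
        so-far
        (loop (cdr remaining) (+ so-far (car remaining))))))
```

Both (mean-let (list 2 4 9)) and (mean-define (list 2 4 9)) give 5, and (sum-named-let (list 2 4 9)) gives 15.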

Racket has a reasonable approach to structures (records, objects, whatever you want to call them). We might want to consider introducing the basics of structures.
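
As a sketch of what "the basics" might include, here is Racket's struct form applied to a made-up record type. The enrollment structure, its fields, and the seat counts are hypothetical, not drawn from the course's registration data.

```racket
#lang racket
;; A hypothetical record for one row of course-registration data.
;; #:transparent makes instances print their fields and compare
;; with equal?.
(struct enrollment (course section seats) #:transparent)

;; struct also defines a constructor (enrollment), a predicate
;; (enrollment?), and one accessor per field (enrollment-course, ...).
(define rows
  (list (enrollment "CSC 151" 1 32)
        (enrollment "CSC 151" 2 30)
        (enrollment "CSC 161" 1 24)))

;; Total seats across all sections of CSC 151.
(define csc151-seats
  (apply +
         (map enrollment-seats
              (filter (lambda (e)
                        (string=? (enrollment-course e) "CSC 151"))
                      rows))))
```

With this data, csc151-seats is 62. The accessors fit the existing map and filter patterns, so structures would not require a new style of programming.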

At one point, we taught one approach to objects in CSC 151. Students modeled objects as functions that take messages (and additional parameters) as input. They used local bindings to create state [17]. I don’t think I’d recommend revisiting this approach, but something is comforting [18] about having functional, imperative, and object-oriented concepts introduced in our first course. And Racket does have a reasonable object system.
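
For readers who have not seen that approach, here is a minimal sketch of an object modeled as a procedure that takes a message. The counter, its message names, and make-counter are hypothetical; this illustrates the general technique, not the code the course used.

```racket
#lang racket
;; make-counter : -> procedure
;; The let creates state that only the returned procedure can reach.
(define (make-counter)
  (let ([value 0])
    (lambda (message . params)
      (case message
        [(get) value]
        [(increment!) (set! value (+ value 1)) value]
        [(add!) (set! value (+ value (car params))) value]
        [else (error "counter: unknown message" message)]))))

;; Each call to make-counter creates a separate piece of state.
(define c1 (make-counter))
(define c2 (make-counter))

(define after-increment (c1 'increment!))
(define after-add (c1 'add! 5))
```

After these definitions, (c1 'get) is 6 and (c2 'get) is still 0, since the two counters do not share state.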

That’s a lot. What are the highest priorities? Reference pages because they make students’ lives easier. Better coverage of lambda expressions, since it is both straightforward and useful. More coverage of ethical considerations. Those three seem worth the most effort. Fixing the introduction to MathLAN is also necessary, but should not be complicated. If you have time, I’d also look for additional data sets and work them into the course as appropriate.

Postscript: Now I’m starting to worry about the new FunDHum [19] course. There are lots of things I want to add, such as simple XML trees, regular expressions, structures, and annotating data. But what can I drop from the current course to make room? And I’m pretty sure that I only have forty-one days in the spring, rather than the standard forty-two [20]. Oh well, finding room will be one of my (many) challenges.

[1] Or at least teaching computer science [2].

[2] Or perhaps teaching introductory computer science [3].

[3] Or is that teaching introductory computer science at Grinnell and with Racket?

[4] One useful thing about topics like data science and digital humanities is that there are many interpretations of each term, so you have a wide array of choices about what to focus on [5].

[5] Alternately: No matter what you do, someone will say that you missed a critical point.

[6] And, I would say, a valuable part of many students’ liberal arts educations.

[7] I will continue to put “data science” in quotation marks to emphasize that we don’t do everything that someone expects in a data science course.

[8] More accurately, the software for the mediascheme curriculum is less robust.

[9] Yes, this is our introductory course. We teach CS differently at Grinnell.

[10] Data Science + CSC = datacsci.

[11] I realize that some of my colleagues feel that because Racket regular expressions mimic the syntax of regular expressions in Perl and other similar languages, they are confusing and potentially harmful to students. I accept that the broader community uses this ugly syntax.

[12] Even if they don’t ask me, I may try to find the time.

[14] Here’s a sketch of potential steps. (1) Make sure that you understand what the procedure is supposed to do by identifying sample inputs and outputs. (2) Reinforce your understanding by documenting the procedure. (3) Try one of the basic approaches to designing algorithms: (a) see if you’ve previously solved a similar problem and adapt that solution; (b) work a few examples by hand and then generalize; (c) use one of the basic patterns we’ve identified; (d) identify relevant and useful procedures and think about how we might combine them. (4) Run your algorithm by hand on a few examples. (5) Turn your sample inputs and outputs into tests. (6) Convert your algorithm to Scheme code. (7) Run the tests.
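
As one possible illustration of those steps, here is a small worked example that uses rackunit for the tests. The procedure count-odds and its sample cases are hypothetical, chosen only to show how the steps might line up.

```racket
#lang racket
(require rackunit)

;;; (1) Sample inputs and outputs, worked by hand:
;;;       (count-odds (list 1 2 3)) should be 2
;;;       (count-odds (list))       should be 0
;;; (2) Documentation:
;;;       count-odds : list of integers -> integer
;;;       Count the odd values in lst.
;;; (3c) Pattern: filter the list, then measure what remains.
;;; (6) The algorithm as Scheme code:
(define (count-odds lst)
  (length (filter odd? lst)))

;;; (5) and (7) The sample inputs and outputs, turned into tests
;;; that run whenever the file runs.
(check-equal? (count-odds (list 1 2 3)) 2)
(check-equal? (count-odds (list)) 0)
(check-equal? (count-odds (list 2 4 6)) 0)
```

When every check passes, rackunit prints nothing; a failing check reports the expected and actual values.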

[15] Of course, I’ve accepted that they will survive without learning set! and continuations, to name two other useful concepts.

[16] That is, there’s a difference in efficiency between (let (...) (lambda (x) ...)) and (lambda (x) (let (...) ...)). Consider, for example,

(define fun
  (let ([square (lambda (x) (* x x))])
    (lambda (x)
      (square (square x)))))

and

(define fun
  (lambda (x)
    (let ([square (lambda (x) (* x x))])
      (square (square x)))))

You can’t do anything like the first version with define.
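
One way to observe the difference is to count how often the helper gets built. The sketch below is an illustration under assumed names (builds, make-square, fun-outer, fun-inner); the counter exists only to make the evaluation visible.

```racket
#lang racket
;; Hypothetical counter that records how often the helper is built.
(define builds 0)

;; make-square : -> procedure
;; Build a squaring procedure and record that we did so.
(define (make-square)
  (set! builds (+ builds 1))
  (lambda (x) (* x x)))

;; let outside lambda: the helper is built once, when fun-outer
;; is defined.
(define fun-outer
  (let ([square (make-square)])
    (lambda (x)
      (square (square x)))))

;; lambda outside let: the helper is rebuilt on every call.
(define fun-inner
  (lambda (x)
    (let ([square (make-square)])
      (square (square x)))))

(define results
  (list (fun-outer 2) (fun-outer 3) (fun-inner 2) (fun-inner 3)))
```

After these four calls, results is '(16 81 16 81) and builds is 3: one build for fun-outer and two for fun-inner.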

[17] That state got introduced with a let. That’s another reason to keep let.

[18] I had originally written “something nice about …”. Because “nice” is overused, Grammarly suggested “beautiful” or “sweet”. I settled on “comforting”.

[19] A functional approach to digital humanities -> FunDHum.

[20] It appears that the question is “How many times should a Grinnell MWF class meet in a semester?” Or perhaps it’s just a question.

Version 1.0 of 2018-08-19.

The opinions stated herein are those of Samuel A. Rebelsky and do not necessarily reflect those of Grinnell College, Grinnell's Computer Science Department, the Rebelsky family, CMD-IT, SIGCAS, SIGCSE, any other organizations I am or have been affiliated with, or even most other sentient beings.

SamR's Assorted Musings and Rants: Reflections on the new data-science-themed CSC 151 by Samuel A. Rebelsky is licensed under a Creative Commons Attribution 4.0 International License.
