# Codd's 1970 Paper

Codd, E. F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6), June 1970.

Comments from: Peter Brown (*), Michael Claveria, Davis Hart, Alex Leach, Angeline Namai, Mark Nettling, Dimitar Tasev, Dave Ventresca.

## Questions

Codd wrote that in some systems ordering dependence, indexing dependence and access path dependence are not clearly separable. How is this so and what are some examples of when they are not separable?

The most obvious instance is one in which the values are both ordered and indexed, and the client code relies on both facts. (So, for example, the client code assumes that if x is smaller than y, then (a) x appears before y in the file and (b) the index of x is smaller than the index of y.)

Relationships are the domain-unordered counterparts of relations. What does it mean to be domain-unordered? Does this mean that it doesn't take an element from set 1 first and then set 2, set 3, and finally set n last?

Each element of a relation is an ordered tuple. That means that, given a domain, we know the index of the corresponding value. Each element of a relationship is more like a set. Given a domain, we know that a value is there, but the particular position of the value is hidden from the client.
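As a sketch of the distinction (with made-up employee data, not Codd's), a relation's tuples expose position, while a relationship lets the client ask for a value by domain name alone:

```python
# A relation's elements are ordered tuples: position encodes the domain,
# so client code must know that name comes first, age second, etc.
employee_rel = {("clark", 21, "sales"), ("jones", 35, "parts")}
name, age, dept = next(iter(employee_rel))  # relies on tuple ordering

# A relationship is domain-unordered: values are fetched by domain name,
# and the underlying position (if any) is hidden from the client.
employee_rship = [
    {"name": "clark", "age": 21, "dept": "sales"},
    {"name": "jones", "age": 35, "dept": "parts"},
]
ages = {row["name"]: row["age"] for row in employee_rship}
```

Here the dict-based form survives a reordering of domains unchanged, while the tuple-based form would silently break.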

On pg. 384, what does T(j,s) → ∃p (R(S,p) ∧ S(p,j)) mean? I don't get the T(j,s) part and the R(s,p), S(p,j) part.

Well, there's a typo. The R(S,p) should be R(s,p). More generally, we're speaking about T in terms of R and S. If the pair (j,s) is in T, then we can find a value, p, that ensures that corresponding pairs (s,p) and (p,j) are in R and S, respectively.
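That definition can be computed directly. Here is a sketch with invented supplier/part/project values (the s/p/j identifiers are mine, not Codd's): T contains (j,s) exactly when some part p links them through R and S.

```python
R = {("s1", "p1"), ("s1", "p2"), ("s2", "p2")}  # (supplier, part) pairs
S = {("p1", "j1"), ("p2", "j2")}                # (part, project) pairs

# (j, s) is in T iff there exists p with (s, p) in R and (p, j) in S
T = {(j, s)
     for (s, p) in R
     for (p2, j) in S
     if p == p2}
```

So ("j2", "s1") lands in T because p2 witnesses both ("s1", "p2") in R and ("p2", "j2") in S.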

Will you be able to elaborate (in class) on the differences between join (chapter 2.1.3) and composition (chapter 2.1.4). I seem to understand the main differences, but not all the particular details.

Yes, I will elaborate in class. The short answer is that join treats the relations more like sets and does a variant of set product (at least the natural join does), while composition treats the relations more like functions, applying one function (relation) to the result of another function (relation).
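One way to see the difference concretely (again with made-up supplier/part/project data): the natural join keeps the shared domain in the result, while composition projects it away.

```python
R = {("s1", "p1"), ("s2", "p2")}  # (supplier, part)
S = {("p1", "j1"), ("p2", "j2")}  # (part, project)

# Natural join: a restricted set product; the shared part column survives.
join = {(s, p, j) for (s, p) in R for (p2, j) in S if p == p2}

# Natural composition: join, then project away the connecting part column.
compose = {(s, j) for (s, p, j) in join}
```

Because the part is dropped, distinct joins can collapse to the same composition, which is part of why the composition loses information that the join retains.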

I question his argument against indexing dependence especially since indices have returned in many major database packages. He claims that despite their ability to improve efficiency, indices waste space with redundancy of data. This downside is reflective of the time of writing. However, his question regarding applications support of changing indices seems to be a warning to applications developers that they ought not depend on the existence of indices in client software. Does he assume that indices are made available to applications and that they are not for internal use only?

At the time he was writing, there was no real separation between what was made available to applications and what was for internal use only.

I don't quite follow how normalizing a data structure helps. Does normalizing any given data structure guarantee an increase in efficiency? When looking at figures 3a and 3b, it doesn't look all that much better. In fact, the unnormalized set seems more desirable, in that it shows and forces the relationships between the domains. For instance, you wouldn't want to have a salaryhistory for a given jobdate if there's no jobhistory there.

You're not so much normalizing arbitrary data structures, as normalizing relations. And yes, normalization can speed things up. More importantly, it can better support more interesting queries.
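A sketch of what normalization buys you, loosely modeled on the employee/jobhistory/salaryhistory example (the particular numbers and field names here are invented): each repeating group becomes its own flat relation carrying its ancestors' keys, and queries become simple scans instead of tree walks.

```python
# Unnormalized: repeating groups nested inside one employee record.
employee = {
    "man#": 17, "name": "clark",
    "jobhistory": [
        {"jobdate": 1968, "title": "clerk",
         "salaryhistory": [{"salarydate": 1968, "salary": 8000}]},
    ],
}

# Normalized: one relation per repeating group, each row carrying
# the primary keys of its ancestors (man#, then jobdate).
employee_rel      = {(17, "clark")}
jobhistory_rel    = {(17, 1968, "clerk")}
salaryhistory_rel = {(17, 1968, 1968, 8000)}

# A query against the normalized form: all salaries for man 17,
# answered by a flat scan rather than a descent through nested groups.
salaries = [sal for (man, jd, sd, sal) in salaryhistory_rel if man == 17]
```

Note that the ancestor keys in each row are what preserve the relationships you are worried about losing: a salaryhistory row cannot exist without naming the jobdate (and man#) it belongs to.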

## Questions for Class

In section 1.1, Codd warns the reader not to mistake the derivation of connections for the derivation of relations, and he goes on to explain the difference between the two ideas in section 2.1.4. However, his explanation of the connection trap is not very clear to me. I understand the point about the supplier-project paths, but I am lost when he claims that the target relation must be a natural composition of the project and supplier relations in order for the derivation to hold. Could you please rephrase this explanation in simpler terms, and possibly provide a diagram to illustrate the difference? As Codd suggests in section 1.3, a relation is a set, and I better understand the nature of sets via visual representations.

Diagrams should probably be done in class.

As a person with a relatively weak relational-database background, I found that the paper provided me with a basic overview of the issues that concern database maintainers. It helped me comprehend the ideas of data independence and data inconsistency and the problems that arise from them.

I found the section on normalization to be of interest. Instead of the root of the tree containing all the information for the object, make each node an object that contains the identifiers of its parent nodes. It seems like this would be more efficient when searching for a particular piece of information and returning specific information about the matching objects. For instance, in the example given in the paper, if you could search for all the people that have X number of children and return the employee's name, you would just have to search for the employee with the matching id number instead of traversing a tree to determine the number of children and match it with the employee's name. In trees of great depth, this should help optimize efficiency.
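The query described above can be sketched against flat relations keyed by man number (the employee and children data here are hypothetical): counting children is a scan of one relation, and matching names is a lookup in another, with no tree traversal.

```python
employee = {(17, "clark"), (18, "jones")}               # (man#, name)
children = {(17, "susan"), (17, "peter"), (18, "kim")}  # (man#, childname)

def names_with_n_children(n):
    """Return the names of employees with exactly n children."""
    counts = {}
    for man, _child in children:
        counts[man] = counts.get(man, 0) + 1
    return {name for (man, name) in employee if counts.get(man, 0) == n}
```

For example, `names_with_n_children(2)` yields `{"clark"}` on this data.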

I found the technical details difficult to digest (particularly his mathematical representation of relations in terms of a pi operator).

It was more a matter of representing operations on relations in terms of the pi operator.

Perhaps I am missing the point, but what does the organization of data in a data bank have to do with programming language concepts? Is a data base/bank a programming language? I can understand this reading as a preparation perhaps for SQL, but it contains a good deal of information that does not seem relevant to this course.

Part of your goal is to figure out what's relevant to the course. As I suggested in class, parts of this reading relate to typing, to language design, and to abstraction, all of which are relevant to the course.

Does the author really think that having multiple columns named the same thing (he uses the example of part and part) and then giving them distinct names via sub.part and super.part is a new and profound idea? Why does the author even address the problem? It seems like a non-issue.

It has to do with the relationship between formalism (in a relationship, identifying an entry by type alone does not suffice) and practice. And yes, amazingly, many of the things we consider obvious were not so obvious 35 years ago.

The entirety of the remaining portion of the paper is focused on defining terms and how they relate to this particular model of representing data. Why? I would presume that most of us understand what a permutation is, etc. Admittedly, the part about projections was mildly interesting; however, my assertion remains. I don't mean to be so critical, but Sam, what were we supposed to LEARN from this reading?

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Wed May 10 09:03:12 2006.
The source to the document was last modified on Thu Apr 13 20:06:16 2006.
This document may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS302/2006S/Readings/codd1970.html`.

Samuel A. Rebelsky, rebelsky@grinnell.edu