CSC323 2010S Software Design

Beautiful Code 19: Multidimensional Iterators in NumPy

Oliphant, Travis E. (2007). Multidimensional Iterators in NumPy. Chapter 27 (pp. 303-318) in Oram & Wilson (eds), Beautiful Code. Sebastapol, CA: O'Reilly.


At the beginning of the chapter Oliphant explains slice using an example which involves cropping and shrinking an image. Can you explain this example?

I'd prefer to let your colleagues do so. However, if they cannot, I would be happy to do so.

Can we talk more about the idea of N-dimensional arrays, where N is greater than 3? How would we represent such arrays?

It's a generalization of the previous techniques. Essentially, we represent any N-dimensional array by a one-dimensional array with some knowledge as to where things get placed in that array.

Iterators are used everywhere in all of Python, not just NumPy. What optimizations have been done to regular iterators to make them faster?

Well, one big change was the use of Xrange to replace range, although that's really more an issue of space efficiency than time efficiency. (Given the architecture of modern systems, space efficiency can lead to time efficiency.)

How does the interface between C and Python work? The code in this section is written in C, so it's probably faster, but aren't there still substantial bottlenecks when going back and forth between Python and C code?

Python is implemented in C. The interface works by providing the C programmer with access to important Python structures. The code you're seeing appears to call no Python code.

The chapter states that NumPy is capable of iterating through both contiguous and noncontiguous arrays. The inclusion of noncontiguous arrays (especially those whose stride is arbitrary) makes generalizing iteration significantly more difficult than for contiguous arrays alone, which would only require incrementing a pointer by a constant for each index. What is the necessity of constructing arbitrarily discontiguous, multidimensional arrays that would warrant facing this sort of challenge and addition to the process' time or memory usage?

It's not quite as open as you suggest. The stride for each dimension is fixed, which means that the increments are predictable and easy to compute. The big win in this strategy for slicing is that you save a lot of memory (and time in copying).

This could be naive but it seems like the stride length could change between different data entries in a non-contiguous array. How does the NumPy iterator account for this?

Stride length cannot change for a particular dimension.

Several times it is mentioned that macros are used instead of functions, such as PyArray_ITER_NEXT. My question is how is this different from a function? How does it check if the array is contiguous in such a way that the macro needs to repeatedly check it? If the majority of the time difference between the NumPy iterator and the best contiguous-case algorithm lies in the time NumPy spends checking if the contiguous-case algorithm should be used, is it really worth even doing those checks in the first place?

Macros replace the 'function' with its body at compile time. Functions require the overhead of a function call. (That overhead involves putting values on the stack, jumping to a different area of memory, etc.)

The check is required, since we just have the iterator structure. If I really cared about running time, I would use a different strategy than the author and encapsulate the iteration code in something like a higher-order procedure that could make the decision once.

The UML seems to be mostly a way of more visualizable documentation and seems to have a fair share of its critics. Have there been any studies showing that this visualization technique can actually improve understanding of object oriented design or coding efficiency?

I'm pretty sure that there have been studies. However, the UML is often used because of anecdotal evidence: Lots of people find that pictures more succinctly and succesfully convey information, particularly in discussions.

Oliphant describes the iteration process for both contiguous and noncontiguous arrays. Why is it that iterating over a contiguous array is less time consuming than iterating over a noncontiguous array?

Because you just have to advance the pointer/counter one space in a contiguous array, and have to do various computations (as well as some tests) for the non-contiguous one.

Do all languages have implementations of multidimensional iterators like these? I remember using Iterators in Java, but not ones that could iterate over multiple arrays or noncontiguous areas. Are there multidimensional iterators like this available in every language?

No. Not all languages provide multidimensional arrays. Not all languages provide iterators. But you as programmer can always implement them (which is an implicit moral of the article).

An issue I see with broadcasting is that if my interpretation is correct, there can be new "elements" created with no value to put in them, so they end up as a 1. Wouldn't these 1's be problematic in most applications?

Your interpretation is incorrect. Basically, broadcasting treats a N-dimensional array as an NxM-dimensional array that contains multiple copies of the original array. That is, as best as I can tell, we've cast the N-dimensional array to a broader array.

What is the C broadcasting API?

I'm pretty sure that it's the API in the C implementation of Python that permits those implementing Python features in C to take advantage of broadcasting.

The author mentions random number generation, but I don't understand how broadcasting would help with that.

It doesn't help us generate the random numbers. It helps in a particular use of random number generators, in which you want to generate multiple sets of random numbers from different distributions.

Can you give a few examples of what broadcasting would be used for?

Sorry, but broadcasting is not a feature I normally use.

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Thu Feb 4 11:00:34 2010.
The source to the document was last modified on Thu Feb 4 11:00:33 2010.
This document may be found at http://www.cs.grinnell.edu/~rebelsky/Courses/CSC323/2010S/Readings/beautiful-19.html.

You may wish to validate this document's HTML ; Valid CSS! ; Creative Commons License

Samuel A. Rebelsky, rebelsky@grinnell.edu

Copyright © 2010 Samuel A. Rebelsky. This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.