Appendix A. Choosing A Collection

Appendix A. Choosing A Collection
Prev		Next

Tools.h++ has an abundance of collection classes--when you're faced with choosing which one to use in your code, it may seem like an overabundance! This section provides suggestions and information that will help you select the most appropriate collection for a given programming task.

Choosing the most appropriate collection class to fit your needs is not a trivial task. First you need to consider the data in your collection. Does your collection need to store the data in order? Will there be duplicate data? And, how do you find or insert data in your collection? The first part of this appendix, "Selecting a Tools.h++ Collection Class" includes a decision tree diagram that lets you consider specific questions about your data and, through your answers, quickly focus on the collections that will best fit your data requirements. A preface to the decision tree discusses questions you'll see in the tree and includes some additional selection criteria that address issues such as whether to choose a pointer or value-based collections, when to use sequential collections, and what to use for disk-based access.

The second part of this appendix presents a rough comparison of how much time and memory different collections and collection families need to perform common operations such as insertions, finds, and removals.

Selecting a Tools.h++ Collection Class

The decision tree diagram includes questions about the data you plan to store in your collection. By traversing the tree you can quickly see which Tools.h++ collection classes will best suit your data and programming project.

How to Use the Decision Tree

The questions that appear on the decision tree are brief, so that the diagram will be easy to read. The following questions expand upon the questions in the decision tree.

Is the order of data in the collection meaningful? Some collections allow you to control the location of data within the collection. For example, arrays or vectors, and linked lists present data in order. If you need to access data in a particular order, or based on a numeric index, then order is meaningful.
Are duplicate entries allowed? Some collections, usually called sets, will not allow you to insert an item equal to an item that is already in the collection. Other collections do permit duplicates, and have various ways to hold multiple matching items. Tools.h++ collections provide mechanisms for both checking for duplication and holding the duplicates.
Is the order determined intrinsically or externally? If data within the collection is controlled by the way you insert it, we say that the order is determined externally. For example, a vector, or a linked list is externally ordered. If the data is stored at a place determined by an algorithm used by the collection, then the ordering is intrinsic. For example, a sorted vector, or a balanced tree has intrinsic ordering.
Is data accessed by an external key? If you access a value based on a key that is not the same as the value, we say that data is accessed by an external key. For example, a "phone list" associates data, in the form of telephone numbers, with keys, in the form of names. Conversly, a list of committee members is simply a set of data in the form of names. You do not need a key to get at the information.
Is data accessed by a numeric index? Objects stored in an array or vector are accessed by numeric index. For example, you access an object at location 12 by using the numeric index "12" to find it.
Is data accessed by "comparison with self?" Data that is stored with neither an explicit index nor an explicit key can be found by looking for a match between what you need to find and what is contained. The list of committee members mentioned in item 4 is and example of this type of data. Sets or bags are examples of collections that are accessed by comparison with self.

When data is accessed by comparison with self, it is also necessary to know what kind of match is used: matching may be based on equality, which directly compares one object with another, or based on identity, which compares object addresses to see if the objects are the same.
Is the best method of access by following linked nodes? Collections that make use of linked nodes are usually called lists. Lists provide quick access to data at each end of the collection, and allow you to insert data efficiently into the middle of the collection. However, if you need repeated access to data in the middle of the collection, lists are not as efficient as some other collections.
Will most of your access to data be at the ends of a collection? There are many occasions when you need to handle an unknown amount of data, but most of that data handling will apply to data that was most recently or least recently put into the collection. A collection that is particularly efficient at handling data that was most recently added is said to have a "last in, first out" policy. A last in, first out (LIFO) container is a stack. A collection that handles the data in a "first in first out" (FIFO) manner is called a queue. Finally, a collection that allows efficient access to both the most recently and least recently added data is called a deque, or double ended queue.
For linked lists—Do you need to access the data from only one end of the list , or from both ends? Singly-linked lists are smaller, but they allow access only from the "front" of the list. Doubly-linked lists have a more flexible access policy, but at the cost of requiring an additional pointer for every stored object.
For collections that are accessed by numeric index—Do you need the collection to automatically resize? If you know the maximum number of items that will be stored in the collection, you can make insertion and removal slightly more efficient by choosing a collection with a fixed size. On the other hand, if you need to allow for nearly unlimited expansion, you should choose a collection that will automatically adjust itself to the amount of data it is currently storing.

Additional Selection Criteria

Which collection you choose will depend of many things (not the least of which will be your experience and intuition). In addition to the decision tree, the following questions will also influence your choice of container.

Do I need to maintain a single object in multiple collections? Use a pointer-based collection.
Am I collecting objects that are very expensive to copy? Use a pointer-based collection.
Is there no compelling reason to use a pointer-based collection? Use a value-based collection.
Do I want to control the order of the objects within the collection externally? Use a sequential collection such as a vector or list.
Should the items within the collection be mutable (not fixed) after they are inserted? Use a sequential or mapping (dictionary) collection. Maps and dictionaries have immutable keys but mutable values.
Would I prefer that the collection maintain its own order based on object comparison? Use a set, map, or sorted collection.
Do I wish to access objects within the collection based on a numerical index? Use a sequential or sorted collection.
Do I need to find values based on non-numeric keys? Use a map or dictionary.
Would I prefer to access objects within the collection by supplying an object for comparison? Use a set, map or hash-based collection.
Am I willing to forego meaningful ordering, and use some extra memory in return for constant-time look-up by key? Use a hash-based collection.
Do I need fast lookup and insertion in a collection that grows or shrinks to meet the current need? Use a b-tree, or an associative container based on the new Standard C++ Library.
Do I need access the data without bringing it all into memory? Use RWBTreeOnDisk or RWTValVirtualArray.

Figure A-1. Tools.h++ Collection Class Decision Tree (1 of 2)

Figure A-2. Tools.h++ Collection Class Decision Tree (2 of 2)

Time and Space Considerations

This section presents a very approximate analysis and comparison of the time and space requirements for a variety of common operations on different specific collections and collection families. We've presented the information as a set of tables that lists the operation, the time cost and the space cost. Any applicable comments appear at the bottom of the table. A key to the abbreviations used in the tables appears at the bottom of each page.

As you read these analyses, keep in mind that various processors, operating systems, compilation optimizations, and many other factors will affect the exact values. The point of these tables is to provide you with some idea of how the behaviors of the various collections will compare, all other things being equal. For more details on algorithm complexity, refer to Knuth, Sedgewick, or any number of other books.

Because many of the Tools.h++ collections have essentially similar interfaces, it is easy to experiment and discover what effect a different choice of collection will have on your program.

For each of the following tables:

N is the number of items in the collection;
M is the current size of the collection;
t is the size of the item being stored (possibly a pointer);
i is the size of an integer;
p is the size of a pointer;
C is a "constant value".
Time costs for each pointer dereference, copy, destroy, allocate, or comparison are considered equal.
Container overhead is as space cost that consists of two terms. The left term is the size of an empty container, while the right term shows the added cost for N items.
Space cost is indicated both for insertions and deletions. Space cost that is marked "(recovered)" indicates that the space has been handed back to the heap allocator.

Whenever an allocation is mentioned, you should be aware that memory allocation policies differ radically among various implementations. However, it is generally true that a heap allocation (or deallocation) that translates to a system call is more expensive than most of the other constant costs. "Amortized" cost is averaged over the life of the collection. Any individual action may have a higher or lower cost than the amortized value.

Table A-1. Key to the Comparison Tables

N	M	t	i	p	C
count of items	count of space for items	`sizeof` (`item`)	`sizeof` (`int`)	`sizeof` (`void*`)	a constant

RWGVector, RWGBitVec, RWTBitVec<size>, RWTPtrVector, and RWTValVector

Table A-2. RWGVector, RWGBitVec, RWTBitVec<size>, RWTPtrVector, and RWTValVector

Operation	Time Cost	Space Cost
Insert at an end	`C`	`0`
Insert in middle	`C`	`0`
Find (average item)	`N/2`	`0`
Change/replace item	`C`	`0`
Remove first	`C`	`0`
Remove last	`C`	`0`
Remove in middle	`C`	`0`
Container overhead		`Mt + 0`
Comments	Simple wrapper around an array of T (except `bitvec: array` of byte)	Resize only if told to.

Singly Linked Lists

Table A-3. Singly Linked Lists

Operation	Time Cost	Space Cost
Insert at an end	`C`	`t + p`
Insert in middle	`C` (assumes that you have an iterator in position)	`t + p`
Find (average item)	`N/2`	`0`
Change/replace item	`C`	`0`
Remove first	`C`	`t + p` (recovered)
Remove last	`C`	`t + p` (recovered)
Remove in middle	`C` (assumes that you have an iterator in position)	`t + p` (recovered)
Container overhead		`(2p+i) + N(t+p)`
Comments	Allocation with each insert Iterators go forward only	Grows or shrinks with each item. Smaller than doubly-linked list

Doubly Linked Lists

Table A-4. Doubly Linked Lists

Operation	Time Cost	Space Cost
Insert at an end	`C`	`t + 2p`
Insert in middle	`C` (assumes that you have an iterator in position)	`t + 2p`
Find (average item)	`n/2`	`0`
Change/replace item	`C`	`0`
Remove first	`C`	`t + 2p` (recovered)
Remove last	`C`	`t + 2p` (recovered)
Remove in middle	`C` (assumes that you have an iterator in position)	`t + 2p` (recovered)
Container overhead		`(2p+i) + N(t+2p)`
Comments	Allocation with each insert Iterate in either direction	Grows or shrinks with each item Larger than `Slist`

Ordered Vectors

Table A-5. Ordered Vectors

Operation	Time Cost	Space Cost
Insert at end	`C` (amortized)	`t` (amortized)
Insert in middle	`N/2`	`t` (amortized)
Find (average item)	`N/2`	`0`
Change/replace item	`C`	`0`
Remove first	`N`	`0`
Remove last	`C`	`0`
Remove in middle	`N/2`	`0`
Container overhead		`(Mt + p + 2i) + 0`
Comments	No iterators (use `size_t` index) Allocation only when the vector grows.	Expands as needed by adding space for many entries at once. Shrinks only via `resize()`

Sorted Vectors

Table A-6. Sorted Vectors

Operation	Time Cost	Space Cost
Insert	`logN + N/2` (average)	`t` (amortized)
Find (average item)	`logN`	`0`
Change/replace item	`N`	`0`
Remove first	`N`	`0`
Remove last	`C`	`0`
Remove in middle	`N/2`	`0`
Container overhead		`(Mt + p + 2i) + 0`
Comments	Insertion happens based on sort order. No iterators (use `size_t` index) `replace` requires `remove/add` sequence to maintain sorting Allocation only when the vector grows.	Expands as needed by adding space for many entries at once. Shrinks only via `resize()`

Stacks and Queues

Table A-7. Stacks and Queues

Operation	Time Cost	Space Cost
Insert at end	`C`	`t + p`
Remove (pop)	`C`	`t + p` (recovered)
Container overhead		`(2p+i) + N(t+p)`
Comments:	Implemented as singly -linked list. Templatized version allows choice of container: time and space costs will reflect that choice.

Deques

Table A-8. Deques

Operation	Time Cost	Space Cost
Insert at end	`C`	`t` (amortized)
Find (average item)	`N/2`	`0`
Change/replace item	`C`	`0`
Remove first	`C`	`t` (amortized, recovered)
Remove last	`C`	`t` (amortized, recovered)
Remove in middle	`N/2`	`t` (amortized, recovered)
Container overhead		`(Mt + p + i) + 0`
Comments	Implemented as circular queue in an array. Allocation only when collection grows	Expands and shrinks as needed, caching extra expansion room with each increase in size

Binary Tree

Table A-9. Binary Tree

Operation	Time Cost	Space Cost
Insert	`logN+C`	`2p+t`
Find (average item)	`logN`	`0`
Change/replace item	`2(logN + C)`	`0`
Remove first	`logN + C`	`2p+t` (recovered)
Remove last	`logN + C`	`2p+t` (recovered)
Remove in middle	`logN + C`	`2p+t` (recovered)
Container overhead		`(p+i) + N(2p+t)`
Comments	Insertion happens based on sort order. Allocation with each insert Replace requires remove/add sequence to maintain order Does not automatically remain balanced. Numbers above assume a balanced tree.	Costs same as doubly linked list per item

(multi)map and (multi)set family

Table A-10. (multi)map and (multi)set family

Operation	Time Cost	Space Cost
Insert	`logN+C`	`2p+t`
Find (average item)	`logN`	0
Change/replace item	`2(logN+C)`or`C`	0
Remove first	`logN` (worst case)	`2p+t` (recovered)
Remove last	`logN` (worst case)	`2p+t` (recovered)
Remove in middle	`logN` (worst case)	`2p+t` (recovered)
Container overhead	re-balance may occur at each `insert` or `remove`	`(3p+i) + N(2p+t)`
Comments	Insertion happens based on sort order. Allocation with each insert Replace for sets requires remove/insert. For maps the value is copied in place. implemented as balanced (red-black) binary tree.

RWBTree, RWBTreeDictionary^[30]

Table A-11. RWBTree, RWBTreeDictionary

Operation	Time Cost	Space Cost
Insert	`logN+C`	`2p + t +` small (fully amortized)
Find (average item)	`logN`	`0`
Change/replace item	`2logN+2 or C`	`0`
Remove first	`2logN(log2(ORDER))+C` (worst case)	`2p+t` (recovered)
Remove last	`2logN(log2(ORDER))+C` (worst case)	`2p+t` (recovered)
Remove in middle	`2logN(log2(ORDER))+C` (worst case)	`2p+t` (recovered)
Container overhead	Re-balance may occur at each insert or remove. However it will happen less often than for a balanced binary tree.	This depends on how fully the nodes are packed. Each node costs `ORDER(2p+t+1)+i` and there will be no more than `2N/ORDER`, and no fewer than `min(N/ORDER,1)` nodes. Inserting presorted items will tend to maximize the size. Sedgewick says the size is close to `1.44 N/ORDER` for random data
Comments	Insertion based on sort order. The logarithm is approximately base `ORDER` which is the splay of the b-tree. (In fact the base is between `ORDER` and 2`ORDER` depending on the actual loading of the b-tree) Replace for b-tree requires remove then insert. For `btreedictionary` the value is copied in place

^[30] RWBTreeOnDisk has complexity similar to RWBTreeDictionary, but the time overhead is much greater since "following linked nodes" becomes "disk seek;" and the size overhead has a much greater impact on disk storage than on core memory.

Prev	Table of Contents	Next
Chapter 18. Common Mistakes		Appendix B. Typedefs and Macros