Written by Joël Benjamin Huber. Translated by Joël Benjamin Huber.
Contents
This article should introduce the reader to the data structure std::set. For this article, basic knowledge of iterators is needed. If the reader doesn’t know what iterators are and how they work, I highly recommend them to look at iterators first. A short note about iterators can be found in the article about stdlib.
What is a set? A set is a sorted collection of unique keys. Essentially, that just means a set is a container that stores a bunch of values in sorted order. The real power lies in the logarithmic running time all of the important functions. One can insert, search, find and erase from the set in logarithmic time, while the set remains sorted all the time.
Operations of the set
Type
First of all, we can define a set as a container of any type equipped with any comparison function. This means, you can not only throw strings and integers in it, but also your own self-created types, with your homemade comparison function! There are some restrictions about the comparison function (strict weak ordering) you can use, but normally you don’t need to order them in such a way that these requirements are not satisfied.
set<int> your_set; // Just a set of integers sorted in ascending order. set<string, greater<string>> my_set; // A set of strings, sorted in lexicographic descending order.
Insertion
Then, there is the insert() function, which just - as the name suggests - inserts an element into the set. It’s as easy as this:
your_set.insert(4); // Inserts a 4 into your_set my_set.insert("Hallo"); // Inserts "Hallo" into my_set
The insertion returns a pair: The first element is an iterator which points to the newly inserted element or if the element is already in the set, then it returns an iterator pointing to said element. The second value is just a boolean which tells us whether the insertion took place. Another cool property of the set is that the iterators are not affected by insertion. This means we can insert things into our set while looping.
your_set.insert(3); // Returns {(Iterator to 3), true}; your_set.insert(3); // Returns {(Iterator to 3), false}, because 3 already existed // The iterator returned by the second insertion is the same // as the one returned by the first insertion
Removal
Then we can erase elements from the set with the erase() function. We can pass either the value we want to erase or an iterator pointing to this value as an argument of our function. This is important when using a multiset: Passing the value will erase all items with this value while passing an iterator will only erase one element. Passing the value returns the new size of the set, since in the multiset, several elements may be removed. If we pass an iterator to the element we want to remove, erase returns an iterator pointing to the successor of the removed element. This is quite useful when looping through a set. As in insertion, all iterators stay valid (except for all iterators pointing to the removed elements).
your_set.insert(9); auto it = your_set.insert(7).first; // it is an iterator pointing to the 7 it = your_set.erase(it); // it is now an iterator pointing to the 9 int j = your_set.erase(9); // returns the size of the set. // it is now invalidated, because it pointed to the 9. Other iterators however are still valid
Searching
Next on the list are the 3 search functions:
find(key) just returns an iterator to the first object equal to key. If there is no such element, then it returns the past-the-end-iterator for this set. So we can efficiently check whether an element is in the set or not.
The other two search-functions are hard to distinguish:
- lower_bound(key) returns an iterator to the first element key (greater or equal).
- upper_bound(key) returns an iterator to the first element key (strictly greater).
The small but key difference between these 2 functions is, that if you search for a key which is actually in the set, lower_bound will return it while upper_bound will return the next-bigger one. One way to remember the difference between the two functions is to think of them as vs , with lower bound returning the smaller of the two and upper bound the larger.
if (your_set.find(10) == your_set.end()) // If 10 is not in the set { your_set.insert(10); } your_set.insert(12); your_set.lower_bound(10), // Returns an iterator to the 10 your_set.upper_bound(10); // Returns an iterator to the 12 your_set.upper_bound(12); // Returns your_set.end();
If you only want to check whether some element is in the set but don’t care where exactly, you can use the function . This function returns 0 if key is not in the set and 1 if it is. It is most convenient to mentally replace the word count with contains and pretend it returns a boolean instead of 1 or 0. So the above if condition could also have been written like this:
if (!your_set.count(10)) // read as: if not your_set contains 10 { ... }
Running time
The nice part about the set is that all these operations run in , where is the number of elements in the set. This is the big strength of the set: We can do all these operations in a very short amount of time.
Regarding the memory, the set uses space.
Sortedness
Because the set is always sorted, when we iterate through the set, we see the elements in ascending order. This also means that if we have an iterator to an element in the set, we can easily find the next-biggest and the next-smallest one by just incrementing/decrementing the iterators. Another nice effect of the sortedness is that it’s easy to find the smallest element: Since the set is sorted, the .begin()-iterator always points to the smallest element if there is one. Similarly, we can find the largest element with the end-iterator, but we have to decrease it, since .end() points past the end. (Don’t forget to check if the set is not empty!)
set<int> his_set; his_set.insert(10); his_set.insert(15); his_set.insert(12); for (auto j : his_set) cout << j << ' '; cout << '\n'; // Prints "10 12 15 " auto it = his_set.find(12); cout << *next(it) << '\n'; // Prints "15"
Conceptually, next(it) is the same as (it+1). However, (it+1) is not valid because set iterators don’t support the + operator. Similarly, prev(it) is conceptually the same as (it-1).
We can use this together with lower_bound and upper_bound:
- First element : s.upper_bound(x).
- First element : s.lower_bound(x).
- First element : prev(s.upper_bound(x)) (make sure that s.upper_bound(x)!=s.begin()).
- First element : prev(s.lower_bound(x)) (make sure that s.upper_bound(x)!=s.begin()).
One very common pitfall is to use lower_bound(s.begin(), s.end(), x) and get time limit exceeded because of it. When using sets, always use the member functions.
auto good = s.lower_bound(x); // O(log(n)) auto bad = lower_bound(s.begin(), s.end(), x); // O(n)
What do we lose?
To be able to provide all these operations, there has to be some price to pay. This price is that the set doesn’t support random access. This means, we can’t ask for the element anymore. Adding and subtracting integers from iterators etc. doesn’t work anymore. Most of the time, we don’t need this.
(In case we really need this, there are some structures which allow us to also do order-statistics (which element is at index , what’s the index of value etc.). These structures are called policy-based data structures. For the olympiad, you never need to use them.)
Advanced information about the set
This section is a bit more advanced and for the interested, however it is not needed for the olympiad. You may as well skip this section and go directly to the examples.
How does the set internally work?
The C++-standard only defines what operations a set should support, but not how it is implemented. This is may be different in different compilers. I’m using the gcc g++ compiler. We can find out how the set internally works by looking at the library. If we dig into the library and search for the set-include file, we find a nearly empty file, which only includes other files. Following these files, we finally arrive at a file which defines the set:
template<typename _Key, typename _Compare = std::less<_Key>, typename _Alloc = std::allocator<_Key> > class set { ... }
So we can manually set the key, the comparison function and the allocator. That’s quite cool. What does this exactly mean?
- We can set the type of the objects we want to store in the set. This can as well be struct you defined in your code!
- Then next we can set a comparison function. This means we can as well order our set of numbers by a really weird function (for example by the sum of the digits and if the sum is the same, we can sort them decreasingly). This works as long as our comparison is a strict weak ordering.
- The allocator is an object which allocates memory for the objects in the set. This uses by default the allocator of the standard library.
However, we still didn’t find out, what our set is doing internally. Let’s continue reading the codee:
typedef _Key key_type; typedef _Key value_type; typedef _Compare key_compare; typedef _Compare value_compare; typedef _Alloc allocator_type; //@} private: typedef typename __gnu_cxx::__alloc_traits<_Alloc>::template rebind<_Key>::other _Key_alloc_type; typedef _Rb_tree<key_type, value_type, _Identity<value_type>, key_compare, _Key_alloc_type> _Rep_type; _Rep_type _M_t;
Here, we can see that our set internally actually uses a _Rb_tree, this is an abbreviaton for Red-Black Tree. A red-black tree is a self-balancing binary-search tree. A binary search tree is a search tree, where each node has exactly two nodes. If we look at a certain node, all the keys of the nodes in the subtree of the left child are smaller than the key of this node, and all the keys of the nodes of the subtree to the right are bigger. This allows us to efficiently search for a key or find the place to insert a key. The problem with binary-search-trees is that we have to balance them somehow. Imagine we have a long chain, Then we need steps to get from the top to the leaf, but if the tree is perfectly balanced, we only need steps to get there. To solve this problem, there are several approaches. Some examples are randomized trees (treaps), statically balanced trees (sparse trees) etc. The way a red-black-tree approaches this problem is to color each nodes either black or red. Then there are some rules about red and black nodes, for example all the paths from the root to any leaf should have the same number of black leaves. These rules keep the tree balanced. If you’re interested, you can read more about it here.
Maybe you have recognized that I started speaking of keys instead of values. Search trees in general allow us to store values in the nodes. The key normally is only here to guide us to the node with the value. But our set uses the same type for both the key and the value - since we don’t store any values with our keys. We can see this in the typedef _Key value_type; and the typedef _Compare value_compare; line: it defines the type of the value to be the same as the type of the key. If we look at the definition of the map, we can see that it uses both key and value and so is able to store values with the keys. But most of the time, we don’t really need this, the set alone is already strong enough.
After these lines, the functions of the set follow, but most of them just use the corresponding functions of the red-black tree. If you’re interested in the implementation of the red-black tree, you can find it in the library in bits/stl_tree.h.
What is a strict weak ordering?
We above said that the comparison function should make a strict weak ordering. What requirements must this an ordering satisfy to be a strict weak ordering?
Let be our comparison-operation defined on the set . Then our comparison function should satisfy:
- Irreflexivity:
- Asymmetry:
- Transivity:
Then equivalence is defined:
Basically, we say that two elements are equal if none of the two elements should be before the other in the ordering. This equivalence should also satisfy:
- Reflexivity:
- Symmetry:
- Transivity:
Exercice for the reader: Which of these properties imply which of the other properties? How many of these 6 properties do you need at least to be able to derive the other properties?
Example Problems
Billboards in New York
We’re given the skyline of New York as some rectangles. For each building (rectangle), we want to know the length of the part which is visible from the top.
The naive algorithm for this problem runs in : For each rectangle , we first take the full length and then we subtract for each other rectangle the part of which is covered by . Since we have to take care that we don’t subtract some values twice, we have to maintain a list of the already subtracted part. Let’s see how we can solve this problem in another way.
What we can do, is to use a sweep line to scan the rectangles from left to right and then update some data structure. We have 2 types of events:
- A new rectangle starts
- A rectangle ends
Our datastructure should thus support the following operations:
- Find the highest element
- Remove an arbitrary element
- Insert an element
We can see that our set fully supports these three operations: Insertion and Removal in , find the maximum element in . This gives us an solution. How exactly do we do this? We make an event for each start and end of an rectangle, sort these events by x-coordinates, and then process them in ascending order. While we process them, we keep a set of pair of integers, representing the active rectangle. The first integer is the height of the rectangle and the second the number / index of the rectangle. We can efficiently update this set.
The set is often used in combination with the sweep line, because it’s capabilities to store information while sweeping are outstanding.
Onoriver
In this problem, there is a frog who wants to cross a river. The frog can jump metres, however, the river is wider, it is metres wide. Luckily, there are several lilypads in a line which the frog can use to cross the river. However, in the beginning these lilypads are still to small for the frog to jump on. The lilypads grow and for each lilypad, we know when its big enough for the frog to jump on. However, at some point in time, the lilypads are too weak and can’t hold the frog anymore. For each lilypad, we also now when this happens. The frog can jump very fast: He can start on one side of the river and arrive in the same second on the other side of the river. However, the frog is heavy to carry for the lilypads so he can’t stop on the lilypads. That’s why he wants to jump from one side of the river directly to the other side without waiting on one lilypads. He now wants to know the first point in time where he can cross the river and the total number of seconds where he’s able to cross the river.
So what can we do now? This once again really looks like a sweeping line problem. Sweeping over the -axis doens’t really help us in this problem. Instead, we sweep over the timeline. This may sound a bit special if you hear it for the first time, however, this is quite normal for sweeping line problems. Our events:
- A lilypad gets strong enough to carry the frog.
- A lilypad gets too weak to carry the frog.
Now the only thing we need is an efficient data structure to find out if the frog can go from left to right. Although the master solution didn’t use a set, the solution using a set is simpler and in my opinion nicer. Exercise for the reader: Solve this problem without a set in
Instead of just describing the solution, I want to describe the thought process which leads to this solution.
How can we now find out if we can reach the other side only using the active lilypads? We can realize that we can group the active lilypads into components where we can reach every lilypad from the other lilypads. The condition for the (active) lilypads to be in the same component is that if we sort them increasingly, the distance between to neighbouring lilypads is . Now we use a quite standard trick: we add sentinels. Let’s pretend the start- and the endpont of the river are also lilypads which are always active. Now the question simplifies to: Are the first lilypad and the last lilypad in the same component? This question simplifies to: Is the number of components ? Now instead of counting the number of components, we can count the number of holes: A hole is a border between to components: The frog can’t jump over it. We can quite easily see that a hole is the gap between to neighbouring lilypads if and only if the distance between these two is . So we can simplify the question even more: Is the number of gaps between neighbouring lilypads with distance equals to ? We can easily keep the number of holes updated while keeping a set with the active lilypads and support our two events:
- We insert a lilypad . Let the active lilypad at the left of it be and the one on the right . We can find these two easily with our set. If (where denotes the distance between and ), then we decrease the hole-counter by one, because and aren’t next to each other anymore. Now if , then we have to increase the counter by . We then do the same thing for and .
- We erase a lilypad . We again look at the neighbouring active lilypads and which can be found by the set. We first have to subtract the holes which have u as endpoint. This means we again check if and and change the counter accordingly. Then we check if . If yes, then we have to increase our counter again.
Now to check if the we can get to the other side, we just have to look at the number of holes. If it is , then we can cross the river. Here is an implementation: