Sunday, March 29, 2015

This week in competitive programming

TopCoder SRM 654 took place in the early hours of Thursday (problems, results, top 5 on the left). Less than half a year after winning an SRM for the first time, Endagorion has now earned his third SRM victory, a feat accomplished by just 50 contestants in the history of TopCoder - congratulations! Next step would be to become the 32nd person to win four SRMs, of course :)

Russian Code Cup 2015 Qualification Round 1 happened on Saturday (problems in Russian, results, top 5 on the left). This was the strongest of the qualification rounds, since the top 200 have now qualified and will thus be unable to participate in the next two qualification rounds. Congratulations to Gennady on another flawless victory!

The last problem is worth mentioning at least for its very simple statement: how many strings of length n are there with the given value of a polynomial hash? n is up to 10^6, the polynomial hash is computed with the given base p and modulo m, m is up to 10^4 (more precisely, the hash is equal to a0+a1*p+a2*p^2+... mod m, where ai is the i-th character of the string, which consists only of lowercase English letters). You need to output the answer modulo 998244353.
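To make the definition concrete, here's a minimal sketch of computing such a hash (the mapping of characters to numbers, 'a' -> 0 and so on, is my assumption - the original statement might map them differently):

    #include <cstdint>
    #include <string>

    // Minimal sketch: hash(s) = (a0 + a1*p + a2*p^2 + ...) mod m,
    // where ai is the assumed numeric value of the i-th character.
    int polyHash(const std::string& s, int64_t p, int64_t m) {
        p %= m;
        int64_t hash = 0, power = 1;      // power = p^i mod m
        for (char c : s) {
            int64_t a = c - 'a';          // assumed character mapping
            hash = (hash + a * power) % m;
            power = power * p % m;
        }
        return (int)hash;
    }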

Finally, Open Cup 2014-15 Grand Prix of America happened on Sunday as usual (results, top 5 on the left, results of another contest with the same problemset but different time limits). One of the tricky problems, problem G, was about constructing a long string s using a short string t. We start with just the string t. Now we insert another occurrence of t anywhere (possibly before the first or after the last character), and repeat this process until we obtain string s. In the sample input, s was "hhehellolloelhellolo" and t was "hello". Given the string s with at most 200 characters, what's the shortest string t that could've been used to construct it?

Thanks for reading, and see you next week! Please also find a photo with a spring - if a bit gloomy - feeling on the left :)

Sunday, March 22, 2015

This week in competitive programming

TopCoder SRM 653 has ignited this week's contests on Tuesday (problems, results, top 5 on the left). Egor and Kazuhiro were in their own league with amazingly fast solutions both for the medium and for the hard problem, but Egor has squeezed out the victory during the challenge phase - congratulations!

The most interesting part of this round, in my view, was coming up with a challenge for the easy problem. You were given a sequence of numbers with some of them replaced by wildcards, and were guaranteed that the sequence before the replacement consisted of several consecutive segments of equal numbers, where each number is equal to the length of its segment. For example: 3, 3, 3, 4, 4, 4, 4, 2, 2, 2, 2 (a segment of three 3s, a segment of four 4s, and two segments of two 2s), which could become 3, *, 3, *, *, 4, 4, *, *, 2, * after the replacement. The problem asked to check if there's more than one way to reconstruct the numbers that were replaced by wildcards, and many people simply counted the number of ways to reconstruct them and compared it with 1.

Many of those people used the 32-bit integer type to count the number of ways, but it's not sufficient for a sequence of length 100, so their solutions might fail if the total number of ways is k*2^32+1. I found one such solution during the challenge phase, and tried to create a testcase to fail it - but could not, and neither did the system test fail it. However, right after the SRM Misha 'Endagorion' posted such a testcase on Codeforces. Can you come up with a tricky testcase without following that link?
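One robust way to sidestep the overflow entirely: since the problem only asks whether there is more than one reconstruction, cap the count at 2. Here's a hedged sketch of such a DP (my own reconstruction from the problem description above, with 0 standing for a wildcard - not the reference solution):

    #include <vector>
    #include <algorithm>
    using namespace std;

    // Returns the number of reconstructions, capped at 2, so no overflow is possible.
    // seq[i] == 0 means a wildcard, otherwise the fixed value at position i.
    int countWaysCapped(const vector<int>& seq) {
        int n = seq.size();
        vector<int> dp(n + 1, 0);
        dp[n] = 1;                            // one way to fill the empty suffix
        for (int i = n - 1; i >= 0; i--)
            for (int v = 1; i + v <= n; v++) { // a segment of value v has length v
                bool ok = true;
                for (int j = i; j < i + v; j++)
                    if (seq[j] != 0 && seq[j] != v) { ok = false; break; }
                if (ok) dp[i] = min(2, dp[i] + dp[i + v]);
            }
        return dp[0];                         // 0, 1, or 2 (meaning "more than one")
    }

The cap keeps all intermediate values tiny while preserving the only distinction that matters: zero, one, or more than one way.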

Codeforces Round 296 (problems, results, top 5 on the left) happened later that day. I've skipped the round, but want to congratulate piob on the amazing victory which he achieved in just 54 minutes out of two hours!

On Saturday, VK Cup 2015 Round 1 pioneered a (relatively) original competition format: 2-person teams (problems, results, top 5 on the left). Congratulations to Boris and Adam on the victory! The pre-round favorite team "Never Sorry" led through most of the contest, but had to resubmit their solution for the hardest problem several minutes before the end of the round and dropped to fourth place. The reason for the resubmission? Their solution made an out-of-bounds access, namely tried to reach the (n+1)-st character of an n-character string. Since they were using C++, this would have flown just fine with an ordinary string, but the string was constructed implicitly, so they had their own class for it, with an explicit assertion on out-of-bounds accesses. Indeed, removing line 87 "assert(false);" from their first submission makes it pass the system test!
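For illustration, here is a hypothetical sketch of the kind of wrapper that caught them (invented names - their actual class was surely more involved, since their string was constructed implicitly):

    #include <cassert>
    #include <string>

    // Hypothetical bounds-checked string wrapper: an out-of-bounds read becomes
    // a guaranteed assertion failure instead of silent undefined behavior.
    struct CheckedString {
        std::string data;
        char at(int i) const {
            if (i < 0 || i >= (int)data.size())
                assert(false);   // the kind of line whose removal made the solution pass
            return data[i];
        }
    };

Amusingly, for a plain std::string their specific access - index n of an n-character string - is even well-defined since C++11: operator[] at position size() returns a null character, so the unchecked version would indeed have flown just fine.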

I have to admit that this example goes against my philosophy that more strict languages like Java or Pascal lead to higher probability of passing the system test because more bugs can be caught during the coding phase. Of course, this is just one example :)

VK Cup 2015 Round 1 online mirror was held several hours later with a slightly modified problemset (problems, results, top 5 on the left). Congratulations to Ivan on his first victory on Codeforces!

Now, let's come back to the problem I described last week and the new data structure. You are given a tree with at most 10^5 vertices, where each edge has an integer length, and a sequence of 10^5 updates and queries. Each update asks to color all vertices in the tree that are at most the given distance from the given vertex with the given color. Each query requires you to output the current color of a given vertex.

The data structure as described in the Russian post-match discussion and in another Codeforces comment is called "Centroid Decomposition of a Tree". We start by finding the centroid of the tree: a vertex that splits the tree into components of size at most N/2, where N is the number of vertices in the tree. One way to find such a vertex is to pick an arbitrary root, run a depth-first search computing the size of each subtree, and then move from the root towards the largest subtree until we reach a vertex where no subtree has size greater than N/2.

Let's mark the centroid with label 0, and remove it. After removing the centroid the tree separates into several parts of size at most N/2. Naturally, we now do the same recursively for each part, only marking the new centroids with label 1, then we get even more parts of size at most N/4, mark their centroids with label 2, and so on, until we reach parts of size 1. Since the part size at least halves with each step, the labels will be at most log(N).
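Here's how the construction might look in code (a sketch under the assumptions of an adjacency-list tree and recursive DFS; a contest solution might need an iterative DFS for very deep trees). The decompParent links recorded here are exactly the "parent" links discussed below:

    #include <vector>
    using namespace std;

    const int MAXN = 100000;
    vector<int> adj[MAXN];     // adjacency lists of the tree
    int sz[MAXN];              // subtree sizes within the current part
    int label[MAXN];           // centroid label: 0, 1, 2, ...
    int decompParent[MAXN];    // previous-level centroid, -1 for the first one
    bool removed[MAXN];        // centroids removed so far

    int computeSizes(int v, int parent) {
        sz[v] = 1;
        for (int u : adj[v])
            if (u != parent && !removed[u]) sz[v] += computeSizes(u, v);
        return sz[v];
    }

    // Walk towards the largest subtree until no subtree exceeds half the part.
    int findCentroid(int v, int parent, int partSize) {
        for (int u : adj[v])
            if (u != parent && !removed[u] && sz[u] > partSize / 2)
                return findCentroid(u, v, partSize);
        return v;
    }

    void decompose(int v, int lvl, int parent) {
        int c = findCentroid(v, -1, computeSizes(v, -1));
        label[c] = lvl;
        decompParent[c] = parent;
        removed[c] = true;
        for (int u : adj[c])
            if (!removed[u]) decompose(u, lvl + 1, c);
    }
    // Build the whole decomposition with: decompose(0, 0, -1);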

The process of construction is displayed in the pictures on the left. The right subtree displays an interval tree analogy, while the left subtree shows that more unusual things can happen.

Now, consider any two vertices A and B in the tree and the path connecting them, and let's find the vertex C with the smallest label on that path. It's not hard to see that the path connecting A and B lies entirely in the part that vertex C was the centroid of in the above process, and that A and B lie in different parts that appear after removing C. So our path is concatenation of two paths: from C to A, and from C to B.

Finding C given A and B is also easy: let's just keep a link from each vertex to its "parent" in the above process (if our vertex has label K, the parent will have label K-1), and let's repeatedly follow this link either on A or on B, whichever currently has a higher label, until the two coincide.
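In code, with the label and decompParent arrays from the sketch above:

    // Vertex with the smallest label on the path between a and b: since a
    // parent link always decreases the label by exactly one, this is the
    // classic "lift the deeper node" walk, terminating in O(logN) steps.
    int findC(int a, int b) {
        while (a != b) {
            if (label[a] >= label[b]) a = decompParent[a];
            else b = decompParent[b];
        }
        return a;
    }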

Notice that we've chosen O(N*logN) paths in the tree (from each centroid to all vertices in the corresponding part) such that every path is a concatenation of two paths from that set, and we can find those two paths in O(logN) time. Such decomposition of paths turns out useful in many problems.

Now, how does one solve the problem in question? Well, whenever we need to color all vertices B at distance at most D from the given vertex A with color X, we will group the possible B's by C - the vertex with the smallest label on the path from A to B, as described above. To find all possible C's, we just need to follow the "decomposition parent" links from A, and there are at most O(logN) such C's. For each candidate C, we will remember that all vertices in its part with distance at most D-dist(A,C) from C need to be colored with color X.

When we need to handle the second type of query, in other words when we know vertex B but not A, we can also iterate over possible candidate C's. For each C, we need to find the latest update recorded there where the distance is at least dist(B, C). After finding the latest update for each C, we just find the latest update affecting B by comparing them all, and thus learn the current color of B.

Finally, in order to find the last update for each C efficiently, we will keep the updates for each C in a stack where the distance decreases and the time increases (so the last item in the stack is always the last update, the previous item is the last update before that one that had larger distance, and so on). Finding the latest update with at least the given distance is now a matter of simple binary search.
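A hedged sketch of both operations together (continuing with the arrays above; I also assume precomputed distances from each vertex to each of its O(logN) decomposition ancestors - filling those in during construction is a standard extra DFS per centroid):

    #include <vector>
    #include <unordered_map>
    using namespace std;

    struct Update { long long maxDist; int time, color; };
    vector<Update> stk[MAXN];                    // distances decrease, times increase
    unordered_map<int, long long> distTo[MAXN];  // distTo[v][c], assumed prefilled,
                                                 // including distTo[v][v] = 0

    void applyUpdate(int a, long long d, int time, int color) {
        for (int c = a; c != -1; c = decompParent[c]) {
            long long rem = d - distTo[a][c];    // radius left within c's part
            if (rem < 0) continue;
            // Pop entries the new update dominates (older and no larger radius).
            while (!stk[c].empty() && stk[c].back().maxDist <= rem) stk[c].pop_back();
            stk[c].push_back({rem, time, color});
        }
    }

    int queryColor(int b, int initialColor) {
        int bestTime = -1, bestColor = initialColor;
        for (int c = b; c != -1; c = decompParent[c]) {
            long long need = distTo[b][c];
            // Last stack entry with maxDist >= need: binary search over the
            // decreasing maxDist values; it is the latest update through c.
            int lo = 0, hi = (int)stk[c].size() - 1, pos = -1;
            while (lo <= hi) {
                int mid = (lo + hi) / 2;
                if (stk[c][mid].maxDist >= need) { pos = mid; lo = mid + 1; }
                else hi = mid - 1;
            }
            if (pos >= 0 && stk[c][pos].time > bestTime) {
                bestTime = stk[c][pos].time;
                bestColor = stk[c][pos].color;
            }
        }
        return bestColor;
    }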

As usual, I'm expecting that some of you have already known this data structure for ages. Still, I'd love to hear what you think about my explanation above! Also, please tell me if you've read a better explanation somewhere else.

And in any case, check back next week!

Tuesday, March 17, 2015

This week in competitive programming

TopCoder SRM 652 took place on Monday (problems, results, top 5 on the left). The round was soon after the flight home from the Hacker Cup, so I've skipped it and thus don't have much to tell about the problems. Adam "subscriber", who has already been featured in top 5 on this blog several times, has won his first SRM - great job!

Here are the solution ideas for the problems from last week's summary. The first problem described there was about picking the right order to apply skill improvements. The solution is explained very well in the editorial (heading "521D - Shop"), but the basic steps one needed to make were (see the sketch after the list):
  1. All multiplications should be applied in the end, in arbitrary order, and higher multiplier is better than lower multiplier.
  2. All assignments should be applied before all other operations, and we should use at most one assignment per skill. Since we only apply one, we can imagine that we have an addition instead of an assignment: highest value that can be assigned minus initial value.
  3. For each particular skill, higher addition is always better than lower addition, so we should sort the additions in decreasing order and imagine applying them in that order. In this case, for each addition we know for sure the value of the corresponding skill before and after the addition, and thus we can imagine that we have a multiplication instead (new value divided by old value)!
  4. Now all our operations are multiplications, and we should simply sort them in decreasing order.
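Here's a hedged sketch of that chain of reductions (invented names and simplified I/O; it only produces the sorted multipliers, while the actual problem also asks to output which improvements to use and in which order - assignments first, then additions, then multiplications). Doubles are used for brevity; an exact solution would compare fractions with integer arithmetic:

    #include <vector>
    #include <algorithm>
    using namespace std;

    // Returns the effective multipliers of all improvements, best-first.
    vector<double> effectiveMultipliers(const vector<long long>& init,
                                        vector<vector<long long>> assigns,  // per skill
                                        vector<vector<long long>> adds,     // per skill
                                        const vector<long long>& muls) {    // all skills pooled
        vector<double> result;
        for (size_t i = 0; i < init.size(); i++) {
            // Step 2: only the best assignment matters, and it becomes an addition.
            if (!assigns[i].empty()) {
                long long best = *max_element(assigns[i].begin(), assigns[i].end());
                if (best > init[i]) adds[i].push_back(best - init[i]);
            }
            // Step 3: apply additions in decreasing order; since the value before
            // and after each addition is then known, each becomes a multiplier.
            sort(adds[i].rbegin(), adds[i].rend());
            long long cur = init[i];
            for (long long a : adds[i]) {
                result.push_back((double)(cur + a) / cur);
                cur += a;
            }
        }
        // Steps 1 and 4: plain multiplications join the pool as-is, and we take
        // the largest multipliers first, ignoring any that don't exceed 1.
        for (long long m : muls) result.push_back((double)m);
        sort(result.rbegin(), result.rend());
        return result;  // the caller applies the first min(m, #entries > 1) of these
    }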
The second problem was about Conway's look-and-say sequence, and the main solution idea is actually described in the linked Wikipedia article as the "cosmological theorem": sooner or later, the sequence separates into several parts that never interact again, and it turns out that the set of all possible strings that do not separate into several parts is very small - there are around 100 such strings - so we get a linear recurrence relation on a 100-element vector and can use fast matrix exponentiation to apply it many times quickly.
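The "apply the recurrence many times" part is standard fast matrix exponentiation, which computes the n-th power of the transition matrix in O(size^3 * log n). A generic sketch follows (the actual roughly 100x100 look-and-say transition matrix has to be derived separately):

    #include <cstdint>
    #include <vector>
    using namespace std;

    typedef vector<vector<int64_t>> Matrix;

    Matrix multiply(const Matrix& a, const Matrix& b, int64_t mod) {
        int n = a.size();
        Matrix c(n, vector<int64_t>(n, 0));
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                if (a[i][k])
                    for (int j = 0; j < n; j++)
                        c[i][j] = (c[i][j] + a[i][k] * b[k][j]) % mod;
        return c;
    }

    // m^p mod `mod` by binary exponentiation.
    Matrix power(Matrix m, int64_t p, int64_t mod) {
        int n = m.size();
        Matrix result(n, vector<int64_t>(n, 0));
        for (int i = 0; i < n; i++) result[i][i] = 1;  // identity
        while (p > 0) {
            if (p & 1) result = multiply(result, m, mod);
            m = multiply(m, m, mod);
            p >>= 1;
        }
        return result;
    }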

And the third problem was about reconstructing the smallest parallelepiped drawn on the grid that contains the two given cells. Here's the solution that avoids the bulk of case analysis that I was referring to: when the two given points are close to each other, we can just try all small parallelepipeds until we find one that covers both; when they are far from each other (let's say at least 10 apart), it's not hard to see that there's a simple lower bound on the answer that is achievable. More specifically, let the parallelepiped's horizontal side be a, its vertical side be b, and the diagonal side be c. The y-coordinate can be increased at most b+c-2 times (and similarly the x-coordinate can be increased at most a+c-2 times) inside the parallelepiped, so b+c-2 must be at least y2-y1, and thus a+b+c (which is almost the answer) must be at least 5+y2-y1. Also, the difference between the coordinates doesn't change when we move diagonally, so it can only increase at most a+b-2 times, so the answer must be at least 5+(y2-x2)-(y1-x1). The only remaining step is to understand that the maximum of those lower bounds is always achievable if the two given points are sufficiently far apart.

Now, let's finish digging into the Open Cup archives! Open Cup 2014-15 Grand Prix of China happened 2 weeks ago (results, top 5 on the left). Let me describe the problem that I couldn't solve. You're given a simple undirected connected graph with at most 8 vertices. Let's say each edge has a random weight, uniform and independent between 0 and 1. What's the expected weight of the minimum spanning tree of this graph? It would seem that with 8 vertices any solution should work, but I've managed to implement one that times out :) Of course, the key to creating a fast solution lies in the linearity of expectation. Can you see how to use it?

And finally, Open Cup 2014-15 Grand Prix of Tatarstan happened this Sunday (results, top 5 on the left). I've learned a new beautiful data structure at this contest - a rare event in my old age. Here's the problem that required it: you are given a tree with at most 10^5 vertices, where each edge has an integer length, and a sequence of 10^5 updates and queries. Each update tells you to color all vertices in the tree that are at most the given distance from the given vertex with the given color. Each query requires you to output the current color of a given vertex.

I couldn't invent the data structure during the contest, nor did I know it in advance, so I've only managed to come up with an O(N*sqrt(N)) solution for the case where all update distances are equal. But it turns out that the problem is solvable in O(N*logN*logN) for arbitrary updates. Do you know which data structure helps here?

Thanks for reading, and see you next week!

Saturday, March 7, 2015

This week in competitive programming

Codeforces Round 295 happened early on Monday (problems, results, editorial with challenges, top 5 on the left). Let me describe problem D which I didn't solve during the round: you have an array with 10^5 positive integers, denoting your various skill levels in an online game. You're also given up to 10^5 possible improvements for your skills. Each improvement is applicable to one particular skill, and is of one of three types: set the skill level to the given value, add the given value to the skill level, or multiply the skill level by the given value. The skill that can be improved, the type of improvement, and the improvement value are fixed for each possible improvement, so the only freedom you have is which improvements you apply, and in which order. Your goal is to achieve the maximum product of all your skill levels using at most m improvements (m is also given).

This problem requires careful reduction of complexity until it becomes simple. The first step, for example, is to notice that it only makes sense to apply the "multiplication" improvements after all others, and the order of their application does not matter. I've managed to do a few more steps during the contest, but stopped short of the solution because I couldn't find a way to properly handle the "assignment" improvements. Can you see the remaining steps?

Facebook Hacker Cup 2015 Final Round happened on Friday in Menlo Park (results, top 5 on the left). I've managed a very good start by getting the first submission and then skipping the tricky geometry problem (the "Fox Lochs" column in the above table). After submitting the 20-point problem with almost two hours left, I had two strategies to choose from. I could go back to the geometry problem, implement it carefully, test it a lot, but thus likely not solve anything else - looking at the final scoreboard, that would've earned me the second or third place if my solution was correct - but given that even Gennady failed this problem, that's far from certain. Or I could try to solve one of the two harder problems which seemed to require some thinking but looked tractable.

I've decided to go after the 25-point problems, and after some thinking came up with an O(N*sqrt(N)) solution for the fourth problem ("Fox Hawks"). The problem was: you're given a boolean expression with at most 200000 boolean variables, each appearing exactly once, for example "((1 & (2 | 3)) | 4)". What's the k-th lexicographically smallest set of variable values that evaluates this expression to true?

It was not clear whether O(N*sqrt(N)) would pass within the time limit, so I hoped for the best and started implementing it. The implementation was a bit tricky and required more than an hour (including writing a simple but slow solution and comparing on a lot of small testcases). I've finally downloaded the input with about 30 minutes left in the contest - and my solution turned out a bit too slow (solved 12 testcases out of 20 in the time limit) :( Had I implemented it a bit more efficiently, or had I used a more powerful computer, I might've got it, and it turns out that would've earned me the victory. Well, better luck next time!

After discussing my solution with Slava "winger" Isenbaev after the contest, I've also realized that it's not hard to change it into an O(N*logN) approach. I've implemented it afterwards, and after about 20 minutes of coding and 5 minutes of debugging I got a working solution that solved all cases in about 20 seconds (out of 6 minutes). Can you guess the O(N*logN) approach, knowing that it's an improvement of an O(N*sqrt(N)) one?

Now, let's continue covering the Open Cup contests from February. Before presenting a new round, let me give the solution ideas for the problems I posted last week.

The first problem was about finding the k-th lexicographically smallest borderless word of length n (up to 64), where a word is borderless when it has no non-trivial borders (a border being a proper prefix that is also a suffix). In order to solve this problem, let's learn to count borderless words first. The first idea is to notice that if a word of length n has any borders, then it must also have a border of length at most n/2 (because if a prefix is equal to a suffix and they intersect, then we can find a shorter prefix that is equal to a suffix - see the picture on the left). Because of this, when n is odd, the number of borderless words of length n is equal to the size of the alphabet times the number of borderless words of length n-1 - we can simply put an arbitrary character in the middle of a word of length n-1. And when n is even, the number of borderless words of length n is equal to the size of the alphabet times the number of borderless words of length n-1 minus the number of borderless words of length n/2: we can also put an arbitrary character into the middle of a word of length n-1, but we need to subtract the cases where the new word gains a border of size n/2.

Finding the k-th lexicographically smallest borderless word is done in a very similar manner: instead of just counting all borderless words, we can use the same approach to count the borderless words with a given prefix.
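Here's a sketch of the counting half (my own illustration: for a 26-letter alphabet the counts overflow 64-bit integers well before length 64, so I saturate them at a cap, which is enough when searching for the k-th word with k below the cap; a big-integer type would be the alternative):

    #include <cstdint>
    #include <vector>
    #include <algorithm>
    using namespace std;

    const uint64_t CAP = (uint64_t)4e18;  // saturation ceiling

    uint64_t satMul(uint64_t a, uint64_t b) {
        if (b != 0 && a > CAP / b) return CAP;
        return min(CAP, a * b);
    }

    // b[n] = number of borderless words of length n over an alpha-letter alphabet:
    //   b[1]    = alpha
    //   b[odd]  = alpha * b[odd-1]
    //   b[even] = alpha * b[even-1] - b[even/2]
    vector<uint64_t> countBorderless(int maxN, uint64_t alpha) {
        vector<uint64_t> b(maxN + 1, 0);
        if (maxN >= 1) b[1] = alpha;
        for (int n = 2; n <= maxN; n++) {
            b[n] = satMul(alpha, b[n - 1]);
            // If b[n] saturated, the subtraction no longer matters for any
            // k-th-word search with k below CAP; otherwise both terms are exact.
            if (n % 2 == 0 && b[n] < CAP) b[n] -= b[n / 2];
        }
        return b;
    }

As a sanity check, for a 2-letter alphabet this gives b[2] = 2, b[3] = 4, b[4] = 6, matching a direct enumeration of binary borderless words.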

The second problem was about finding two deterministic finite automata with at most n+1 states that, when used together (a word is accepted only if both automata accept it), accept the given word and only that word. The automaton with n+2 states that accepts only the given word is straightforward. How do we get rid of one state? Well, let's glue two adjacent states together. This results in an automaton that accepts the given word, but also all words where the letter in a certain position is repeated an arbitrary number of times. If we do this once for the first letter, and the second time for the last letter, we obtain the two automata that we need, unless all letters in our word are equal. And in case all letters in our word are equal, it's not hard to see that there's no solution.
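A sketch of the construction (my own illustration with invented names; lowercase English alphabet assumed):

    #include <string>
    #include <utility>
    #include <vector>
    using namespace std;

    // Takes the (n+2)-state chain automaton that accepts only `word` and glues
    // states `glue` and `glue`+1 together, leaving n+1 states: 0..n-1 plus the
    // dead state n, with state n-1 accepting. Gluing turns one chain edge into
    // a self-loop, so one letter may now repeat. The caller must pick `glue`
    // with word[glue] != word[glue+1], or glue == n-1, which is always safe
    // because the final chain state has no outgoing edges.
    vector<vector<int>> buildGlued(const string& word, int glue) {
        int n = word.size(), dead = n;
        vector<vector<int>> go(n + 1, vector<int>(26, dead));
        auto id = [&](int s) { return s <= glue ? s : s - 1; };
        for (int i = 0; i < n; i++)
            go[id(i)][word[i] - 'a'] = id(i + 1);  // i == glue becomes a self-loop
        return go;
    }

    // One automaton glues around the last letter, the other at the end of the
    // initial run of the first letter; their languages intersect in {word}.
    pair<vector<vector<int>>, vector<vector<int>>> buildPair(const string& word) {
        int n = word.size(), i = 0;
        while (i + 1 < n && word[i + 1] == word[i]) i++;  // end of the first run
        // if i == n-1, all letters are equal and no solution exists
        return {buildGlued(word, n - 1), buildGlued(word, i)};
    }

If the first two letters coincide, gluing states 0 and 1 would create a transition conflict on that letter, which is why this sketch glues at the end of the initial run instead; each automaton then accepts some extra words, but the two languages still intersect in exactly the original word.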

Now, on to new tricky problems! Open Cup 2014-15 Grand Prix of Karelia happened 3 weeks ago (results, top 5 on the left). The most difficult problem F was about Conway's look-and-say sequence starting with 2: we start with the single digit 2, then repeatedly describe what we see. For example, on the first step we see one "2", so we write "12". On the second step we see one "1" and one "2", so we write "1112". On the third step we see three "1"s and one "2", so we write "3112", and so on. How many digits does the n-th step contain (modulo p)? n is up to 10^18.

Open Cup 2014-15 Grand Prix of Udmurtia happened 2 weeks ago (results, top 5 on the left). Let's talk about a relatively easy problem for a change.

Problem B of this round was concerned with drawing parallelepipeds on a grid. More specifically, a parallelepiped is drawn on a grid like this: we start with a rectangle, then add three diagonal segments of the same length, and then connect their ends as well - see the picture on the left. The parallelepiped has three parameters: the two sides of the rectangle, and the size of the diagonal. All three parameters must be at least 3.

A parallelepiped was drawn, but then all cells but two were erased, so you're given the coordinates of the two remaining cells, each coordinate up to a billion. What's the smallest possible total number of cells in the original drawing?

This problem looks quite nasty from the outside, and it feels like it can have a lot of tricky corner cases. But it turns out it's possible to write a solution that sidesteps those and solves everything in quite general manner. How would you approach this problem?

Thanks for reading, and check back next week!