Archive | March, 2015

Taking Random Samples

29 Mar

Reblogged from https://eyalsch.wordpress.com/2010/04/01/random-sample/

Sometimes we need to take a random sample from a given collection of objects. This is useful in simulations, games, and statistical applications.  In this post I discuss some efficient algorithms for choosing m items out of n available, such that the m items are chosen at random, with equal probability for all possible subsets of size m. For each algorithm I present the code in Java, as well as a formal analysis of its performance and correctness, where needed. As you may have already noticed from the introduction, this post is more theoretic than usual…

A somehow related problem is the problem of generating a random permutation of given collection. While Java provides an efficient utility method for that purpose (See Collections.shuffle(..), based on the Knuth Shuffle algorithm), there is no built-in utility for generating random subsets.

Before we start discussing some known algorithms, it is important to note that there are many variations of the problem, depending  on the following parameters:

  1. Is the collection random access (e.g. ArrayList)?
  2. Is the collection read only?
  3. Do we allow random running time?
  4. Is N known at invocation time, or are we dealing with a stream of items of unknown size?

Trial and Error

We will start with the simplest approach. Assuming  that the collection to take the sample from is random access, we can repeatedly add random items from the collection to our result set, until the set contains m different items. As an optimization, when m > n/2, we can choose (n-m) items instead of m, and then return the rest:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
public static <T> Set<T> randomSample1(List<T> items, int m){
    HashSet<T> res = new HashSet<T>();
    int n = items.size();
    if (m > n/2){ // The optimization
        Set<T> negativeSet = randomSample1(items, n-m);
        for(T item : items){
            if (!negativeSet.contains(item))
                res.add(item);
        }
    }else{ // The main loop
        while(res.size() < m){
            int randPos = rnd.nextInt(n);
            res.add(items.get(randPos));
        }
    }
    return res;
}

Clearly, the number of iterations in the main loop is not bounded. However, we can compute the expected number of iterations, what makes this algorithm a Las Vegas algorithm. The iterations can be split into m different sequences (numbered 0 to m-1), where sequence i refers to the attempts needed for adding the (i+1)-th item into the result set. By observing that the length of sequence i has a geometric distribution defined by p=(n-i)/n, we can calculate the expected number of iterations as follows:

E(m,n) = \frac{n}{n} + \frac{n}{n-1} + \frac{n}{n-2} + ... + \frac{n}{n-m+1}
And lets also define E(0,n)=0

If we examine E(m,n) as a function of m only, in the interval m=0 to m=\lfloor\frac{n}{2}\rfloor, it can be easily verified that the function is increasing and convex. Therefore we have:
E(m,n) \leq \frac{E(\lfloor\frac{n}{2}\rfloor,n)}{\lfloor\frac{n}{2}\rfloor} \cdot m=\frac{n\cdot(\frac{1}{n} + \frac{1}{n-1} + ... + \frac{1}{n-\lfloor\frac{n}{2}\rfloor+1})}{\lfloor\frac{n}{2}\rfloor} \cdot m \leq \frac{n\cdot\lfloor\frac{n}{2}\rfloor\cdot\frac{1}{\lfloor\frac{n}{2}\rfloor}}{\lfloor\frac{n}{2}\rfloor} \cdot m \leq 3m

In other words the expected time complexity is linear in m, which is optimal. This is a little surprising, given the naive nature of the algorithm, and it results from our optimization.

Swapping

If our collection is random access and its items can be freely reordered, then we can efficiently draw random items one by one from a candidates set, containing only items not chosen so far. By swapping items, we can guarantee that the candidates set is always contiguous . As a matter of fact, this algorithm is a bounded version of the Knuth Shuffle algorithm for generating random permutations, and its correctness is trivial.

1
2
3
4
5
6
7
8
9
public static <T> List<T> randomSample2(List<T> items, int m){
    for(int i=0;i<m;i++){
        int pos = i + rnd.nextInt(items.size() - i);
        T tmp = items.get(pos);
        items.set(pos, items.get(i));
        items.set(i, tmp);
    }
    return items.subList(0, m);
}

Full Scan

Sometimes our collection is not random access, so we have no choice but to traverse it completely in the worst case. Following is an elegant solution, that iterates once through the items, and selects an item with probability (#remaining to select)/(#remaining to scan):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
public static <T> List<T> randomSample3(List<T> items, int m){
    ArrayList<T> res = new ArrayList<T>(m);
    int visited = 0;
    Iterator<T> it = items.iterator();
    while (m > 0){
        T item = it.next();
        if (rnd.nextDouble() < ((double)m)/(items.size() - visited)){
            res.add(item);
            m--;
        }
        visited++;
    }
    return res;
}

Clearly, the running time is O(n), which is optimal given the constraints.
It is left to prove that for any collection C and any subset S, the chances of generating S are {1 \over {|C| \choose |S|}}. We can do it by induction on the size of C.
For |C|=1 and |S| between 0 and 1 this is trivial. Now, let C be an ordered collection and let S be a subset of C. When the algorithm starts traversing C, it first encounters item v. If v \in S, we should choose it and then proceed by choosing items S-{v} from the collection C-{v}. We can use our induction assumption to calculate the probability of this:
p(C,S)={|S|\over|C|}\cdot{1\over{|C|-1\choose |S|-1}}={1\over{|C|\choose |S|}}

If on the other hand v \notin S, then we should reject v and proceed by choosing items S from the collection C-{v}. This happens with probability:
p(C,S)={|C|-|S|\over|C|}\cdot{1\over{|C|-1\choose |S|}}={1\over{|C|\choose |S|}}
In either cases we get the required probability.

Floyd’s Algorithm

What happens if we don’t want the time complexity to be dependent on N, and the collection is read only?
In this case, assuming the collection is random access, we can use Floyd’s algorithm, which is both brilliant and easy to implement. It iterates with variable i through the last m indexes of the collection, and on each iteration adds a single item from the range 0..i, with a non-uniform distribution:

1
2
3
4
5
6
7
8
9
10
11
12
13
public static <T> Set<T> randomSample4(List<T> items, int m){
    HashSet<T> res = new HashSet<T>(m);
    int n = items.size();
    for(int i=n-m;i<n;i++){
        int pos = rnd.nextInt(i+1);
        T item = items.get(pos);
        if (res.contains(item))
            res.add(items.get(i));
        else
            res.add(item);
    }
    return res;
}

The time complexity is O(m) on the average, but we can bound the worst case time by O(m log(m)) if we use a balanced tree instead of a hash set.
The correctness follows by proving the following loop invariant: After completing an iteration with i=j, the set res contains a uniformly distributed random subset of size j-n+m+1 of the items in the range 0..j. We will prove this by induction on i. For the initial value of i (i=n-m), we simply choose a random position in the range 0..i, so it is also a random subset of size 1 of the same range. Now, let i=j (>n-m), and let S be any subset of size  j-n+m+1 from the range 0..j. If items[j] is in S, then the previous iterations must have completed with res=S-{items[j]}, and then either position j or any previously selected position should be chosen. Using the induction assumption, we can compute the probability of obtaining S as follows:

p_1 = {1 \over {j\choose |S|-1}}\cdot{|S|\over j+1}={1 \over {j+1 \choose |S|}}

If on the other hand items[j] is not in S, then we have many options of selecting |S|-1 items in the previous iterations, and then we should choose the remaining index:

p_2 = {|S| \choose |S|-1}{1 \over {j\choose |S|-1}}\cdot{1\over j+1}={1 \over {j+1 \choose |S|}}

In both cases we have a uniform probability for choosing |S| items from j+1 candidates, and that completes the proof.

Stream of items

Sometimes we don’t know the collection size in advance. We can only iterate through it, as if it was a data stream with unknown size. The following algorithm (Known as Reservoir sampling) performs only one pass on the input stream. While iterating, it maintains a set of items that represents a random subset (of size m) of the items visited so far. It starts by selecting the first m items, and then it selects the k-th item with probability m/k. If the item is selected, it replaces a randomly chosen item from the previously selected ones:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
public static <T> List<T> reservoirSample(Iterable<T> items, int m){
    ArrayList<T> res = new ArrayList<T>(m);
    int count = 0;
    for(T item : items){
        count++;
        if (count <= m)
            res.add(item);
        else{
            int r = rnd.nextInt(count);
            if (r < m)
                res.set(r, item);
        }
    }
    return res;
}

Each iteration consumes constant time, therefore the algorithm runs in O(n) time, where n is the stream length.
The correctness can be proved by induction on the stream size. We want to prove that for every stream T and every value of m (m<=|T|), all subsets of T of size m are equally likely to be returned. The base case is a stream of size m. In this case we have only one possible subset, and the algorithm always returns it. Now assume that we have a given stream T, and we know that the induction property holds for stream lengths below |T|. Let S be a subset of T. We shall inspect the last item v in the stream.
If v is not in S, then we must have chosen all items of S by the time we completed the previous iteration. We should also reject the last item:
p_1 = {1 \over {|T|-1 \choose |S|}}\cdot{|T|-|S|\over |T|}={1 \over {|T|\choose|S|}}

If on the other hand v is in S, then we should have the other |S|-1 items of the sample in the previous iteration already (plus an extra item among the |T|-|S| possible ones), and then we should choose v and make it replace the extra item:
p_2 = {|T|-|S| \over {|T| \choose |S|}}\cdot{|S|\over |T|}\cdot{1\over |S|}={1 \over {|T|\choose|S|}}

In both cases we get the required probability.

A note about random number generators

The algorithms above assume that the random number generator has a pure random behavior, but this is rarely the case. The class java.util.Random uses a very popular pseudo number generation approach, called Linear Congruential Generator.  Due to the fact that every generated number is determined by the previously generated one (or initially the seed), there is a bounded number of sequences that can be produced. Java’s Random class uses a state record of 48 bits, so we can never exceed 2^{48} different sequences. This has a huge impact when generating random constructs from a large combinatorial space, such as in the case of subsets or permutations. For example, if we are interested in generating a random subset of size 20 from a set of size 60, we have {60 \choose 20} options, which exceeds 2^{48}. Therefore, the results of any algorithm that uses java.lang.Random would be completely biased because there are many possible subsets that will never be returned. Actually, we will cover only 7% of the subsets space! For many applications this behavior is good enough. For others, which require more accuracy, we should consider a different random source than java.util.Random. After searching the web a little, I found the RngPack library, which seems to have some powerful random number generator implementations.

Sources

http://delivery.acm.org/10.1145/320000/315746/p754-bentley.pdf?key1=315746&key2=0487028621&coll=GUIDE&dl=GUIDE&CFID=79451466&CFTOKEN=34515112 http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle http://en.wikipedia.org/wiki/Linear_congruential_generatorhttp://comjnl.oxfordjournals.org/cgi/reprint/25/1/45.pdf http://comjnl.oxfordjournals.org/cgi/reprint/28/4/412.pdf http://www.jstor.org/stable/pdfplus/2347297.pdf http://stackoverflow.com/questions/48087/select-a-random-n-elements-from-listt-in-c# http://stackoverflow.com/questions/136474/best-way-to-pick-a-random-subset-from-a-collection

A window separating the world and me

28 Mar

When I took the last bus home yesterday night, as usual, I saw the the drunken people along the 6th street. Different from usual, a huge crowd gathered in the several open bars and made large sound. I suddenly realized that, it is Friday night.

Thanks God, it is Friday.

With the windows in my bus, I saw different illusions. One is the reflection image of me, one is those exhilarated people around the bar. In some sense, I was in a loss. Where am I in the hell?

Recalling my past, I am always put into such a dilemma. I witnessed different people come and go, and in some sense they are just a passenger in my life train. Some will accompany me for quite a long journey, but most of them will just cross my railway, yet gone away soon.

I sat here in the bus, I was so lucky that I could have the chance to see such a plausible image of me and the world. I know I come into the world, for the life meaning of contributing to the world. However on the other hand, to the world I am only also a passenger in the history river.

I have already ran out of one third of my life journey, but almost achieved nothing. I have made friends with different people, but few of them still in contact from the beginning. But anyway, this is life.

What can I do now? I believe the only thing I can achieve now is to stick to my dreams. I am always motivated by Pavel Korchagin, the main character of <How the Steel Was Tempered>.

Man’s dearest possession is life. It is given to him but once, and he must live it so as to feel no torturing regrets for wasted years, never know the burning shame of a mean and petty past; so live that, dying he might say: all my life, all my strength were given to the finest cause in all the world- the fight for the Liberation of Mankind.

Keep Moving

27 Mar

I still remember that in my high school period, my teacher asked everyone in class to put a slogan in the back wall of our classroom. I wrote “Accumulate deep and show off little, dare to lead others”, while the second one wrote “Keep Moving”.

I don’t quite get the meaning of “Keep moving” until recently. I realized that again and again, I fell into the “comfortable area” in my life. That is actually a signal of danger. I can still cheat to myself, I am now enjoying myself and appears to work for my dreams, but the fact is that I am farther and farther away from my goal.

In other word, I stick to my origin, and give up my effort to keep moving.

That is not what I want. I am now in my 20s, which is the most valuable period in my whole life. I can already achieve $50/hour last summer, but measured with money, my value should be more than $10000/hour. Anyway, at the end of the day, you will realize that money cannot be traded with time, since there is limited time while there will be enormous money.

That is life. You have to keep moving. You have to continuously work hard. You have to ask yourself whether you are approaching your dream or actually deviating from the focus on your dreams.

I forwarded a related article as the following,

http://www.headsup.ie/stayfocused.php

How to Stay Focused on Your Goals
Focus is what keeps us on track when we are trying to reach our goals. If you think of an archer as s/he aims at a target, it is essential that the archer has focus in order to hit the bulls eye. Likewise, when a person wants to create something in their life, they must main focus in order to achieve the goal.

There are many reasons why a person loses focus while pursuing their goals. The constant distractions in today’s world mean that maintaining focus can be very difficult. Distraction of work, home, friends, socialising, events and everything else constantly compete for our attention and dedicated time. It takes time to stay focused on achieving our goals and it’s so easy to fail in our attempt to maintain enough momentum, desire, energy and persistence needed to achieve our goals. As a result we often give up and think our goal is unachievable.

Another reason why we might lose focus in pursuing our goals is that we may talk ourselves out of pursuing our dreams. When we first begin pursuing our goals, we may be motivated by the fact that we are improving our lives and achieving our wishes and dreams. Shortly after starting the process of taking action towards achieving our goals, we can often begin to question the plausibility of achieving what we want. As soon as we start questioning our ability, we start to loose focus.

So what can we do to ensure we maintain the focus we need to achieve our goals? Here are some tips that will help us stay focused.

Write our goals down. Put your goals, reasons, objectives and beliefs in writing. When they are in writing it makes them more real and they can be reviewed and used as a constant source of encouragement. Eventually your goals will become branded in your mind.
Click here for the Goal Setting factsheet

Watch for Negative Self Talk. To find our more on noticing and changing negative thinking check out the Online Skills section

Plan your action. Build your plan for reaching your goal. Your plan becomes your road map and helps you determine the best course to follow.

Minimise Distractions. Get rid of as much temptation as possible that deviates from your focus on your goal. If your goal is to loose weight, throw away all takeaway and delivery food menus that are delivered through your letterbox. If your goal is to save money and reduce money spent on non necessities, unsubscribe from any online shopping, gaming, promotional newsletters, email or text alerts. If for example your goal is to exercise more and you have identified watching TV as a distraction, don’t turn on the TV until you have achieved the tasks relating to your goal. Use every method you can think of to remove distractions from your situation and it will certainly help your focus.

Use the Power of Visualisation. To have something to focus on we must have a strong visual picture of our target so we can maintain focus on the end result. The more you bring what it is you wish to obtain to the forefront of your mind, the more focus you can give to it.

Measure Your Progress. You can’t control your progress unless you can analyze your progress. Create a system and timetable to measure your progress. Record your daily progress. This can tell us if we are on track or if we need to make adjustments to either our plan or activities. You need to measure your progress on a regular basis as it gives you some control.

Prioritise your Goal. Focus on a few goals at one time. Try not to overburden yourself as it will limit your chance of achieving your goal and demotivate you. Concentrate on the important ones first, achieve them and then you can look at addressing the other goals.

Work your Goals into your Daily Plan. Do something towards achieving your goals everyday. The best way to achieve your goals and maintain your focus is to do something that will make it happen each and every day. Even if its only 15 minutes each day, it is better than not doing any goal related activity at all. Give your goal daily attention and you’ll remain more focused and at a better chance of achieving your goal.

Use the Power of Positive Affirminations. Keep telling yourself in the present tense and in a positive statement of what you want to achieve. When someone’s goal is to climb a mountain, his/her affirmation will detail being at the top of the mountain, not on getting from the bottom to the top. Positive affirmation can help you maintain focus and help you believe in yourself and your ability to obtain your goal.

Celebrate your Milestones. Mark your successes and acknowledge your progress towards your desired goal. Set milestones and as you achieve them reward yourself. This will motivate you and make you see your milestone in a very positive way.

Also remember that there is no quick fix to achieving your goals. It requires your motivation, effort, research and time. However our goals are far more attainable when you stay focused on them.

Goals for my current period

25 Mar

1. SuperComputing 2015 (With Chenhan Yu)

2. NLP Project (With Feng Yu)

3. Intel lab intern

4. Research Project: (Heterogeneous scheduler + DAG with communication nodes) (with Andreas)

I believe in you!!

Intel Lab Interview

24 Mar

I was interviewed by Intel lab today. Since it is a research intern opportunity, I prepared for some research statement material, but it turned out it might be useless. Actually whether you would be recruited as a intern or not depends on the number of available intern positions, your publication and the comparison with you opponents.

I was asked about my basic skills and very few about my research. My guess is that they have some available intern positions and someone else has already been interviewed, but may have the potential to decline that offer. In other words, my interview can only be a “plan B” in case the intern position may not be filled up.

My interview only lasts for about 20 minutes. Kindly of funny considering that it took me 6 rounds interviews for an internship in VMware last summer. Anyway, I hope that everything in my destiny.

Later I swam in gym again. Finish some coding for executor in research projects. Too tired now.

Regular Expression

23 Mar

http://stackoverflow.com/questions/4736/learning-regular-expressions

https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

Click to access Regular-Expressions.pdf

The Most Beautiful English Word

23 Mar

From http://www.u148.net/article/122815.html

1. Epoch: (n) a particular period of time in history or a person’s life

2. Somnambulist: (n) a person who sleepwalks

3. Ineffable: (adj) too great to expressed in words

4. Hiraeth: (n) a homesickness for a home you can’t return to, or that never was.

5. Petrichor: (n) the pleasant, earthy smell after rain

6. Mellifluous: (adj) a sound that is sweet and smooth, pleasing to hear

7. Serendipity: (n) the chance occurrence of events in a beneficial way

8. Sonorous: (adj) an imposingly deep and full sound

9. Aquiver: (adj) Quivering, trembling

10. Limerence: (n) the state of being infatuated with another person

11. Bombinate: (v) to make a humming or buzzing noize

12. Ethereal: (adj) extremely delicate light, not of this world

13. illicit: (adj) not legally, permitted

14. Nefarious: (adj) Wicked, villainous depicable

15. Iridescent: (adj) producing a display of rainbowlike colors.

16. Epiphany: (n) a moment of sudden revelation.

DOTW 1

22 Mar

I am always motivated by those people who work together with me. I believe the stronger my competitor is, the stronger I will be.

Reflection

–GPU, CUDA modeling

–Understanding of numerical linear algebra

–English (speaking, writing)

–Coding

–Swimming

What I have done in Spring Break

visit SXSW Geek Stage 03/15/2015

visit NASA Houston 03/16/2015

Finish udacity “intro to parallel computing”: https://www.udacity.com/course/cs344

Learn Git (See previous blog)

English Punctuation

21 Mar
. period/full stop
, comma
: colon
! exclamation mark
? question mark
hyphen
* asterisk
apostrophe Lily’s, omission inside words
dash
_ underscore
‘ha’ single quotation marks
“ha” double quotation marks
(ha) parenthesis or round brackets
[ha] square brackets
<ha> angle brackets
{ha} curly brackets/braces
<< ha >> French quotation mark
ellipsis
; semi-colon
ditto mark
|| parallel
| vertical bar
/ slash, stroke
& ampersand=and
~ tilde/swung dash
§ section; division
pilcrow
-> arrow
^ caret
% percent
per mil
°C Celsius
°F Fahrenheit 32 degrees Fahrenheit

Learning Git from scratch

21 Mar

Some paper to recommend:

Click to access git.from.bottom.up.pdf

Cheating Sheet:

http://www.git-tower.com/blog/git-cheat-sheet/

Click to access git-cheat-sheet.pdf