Youtube View Problem
A surprising fact of youtube Views figures
I shall assume you (yeah, you the reader) knows what is Youtube.
Have you noticed that a large majority of the “number of views” of videos start with the digit 1 ? If you haven’t, go do a quick survey now. You will realise the fact surprisingly holds without fail.
Benford’s Law
Benford’s Law, also called the FirstDigit Law, refers to the frequency distribution of digits in many (but not all) reallife sources of data. Read more about it at wikipedia.
The law states the following (Digit followed by frequency of occurence)
 1 > 30.1%
 2 > 17.6%
 3 > 12.5%
 4 > 9.7%
 5 > 7.9%
 6 > 6.7%
 7 > 5.8%
 8 > 5.1%
 9 > 4.6%
The shape of the curve is simliar to those of the normal distribution curve. I will be using youtube videos views to study the validity of this claim in terms of video views.
Youtube Videos Views Study
Now that you have a brief introduction about benford’s law, I wish to test it on video views. I will use Youtube as a datamodel as it is ubiqutous. So we will have some number of videos, and some number of views. Each view will go to a video and a video will have a number of views.
The question is
Will the percentage distribution of the first digit of number of views of videos agree with Benford’s Law ?
Some mathematical variables
N = the number of videos in YouTube
V = the total number of video views in YouTube
Assumptions
As this is a mere mathematical model, there are several pretty logical assumptions we should make during this test:

The probability for a video to double (or triple, quadriple, just simply multiplied by an arbitary k) is directly proportional to the number of views it currently has. This assumption is pretty valid

Every video start with one view. Again, logical as the uploader probably has to view the video first to stimulate some “view traffic”

The views are allocated based on a weighted random algorithm
Carrying out the test
I have written the test in ruby and it is available here at gist/github . A probably better implementation is to use a fenwick tree to record the Range sum, but the current method will do for a small experiment. This is further supported by the fact that a value of N >= 30 and V >= 30 will produce pretty decent normal approximations.
The steps are as follows:
 All videos start with 1 view
 All videos start with equal probability of being picked.
 Randomly pick a video based on their probabilities.
 Increase the number of view of the video by 1
 Increase the probability of the video accordingly
 Repeat steps 3 to 5 V times
 Record the statistics (how many views have 1 as the first digit, how many have 2 etc. )
Results of Study
It turns out that this approximation model largely agrees with Benford’s law.
I tried N = 1000 and V = 100000 and got the following results:
 no videos have no views
 334 (33.4%) videos have 1 as the first digit
 145 (14.5%) videos have 2 as the first digit
 118 (11.8%) videos have 3 as the first digit
 93 (9.3%) videos have 4 as the first digit
 67 (6.7%) videos have 5 as the first digit
 65 (6.5%) videos have 6 as the first digit
 61 (6.1%) videos have 7 as the first digit
 54 (5.4%) videos have 8 as the first digit
 63 (6.3%) videos have 9 as the first digit
Here is a comparison table of the both the observed frequency and the expected frequency:
Expected percentage  Observed percentages 
30.1%  33.4% 
17.6%  14.5% 
12.5%  11.8% 
9.7%  9.3% 
7.9%  6.7% 
6.7%  6.5% 
5.8%  6.1% 
5.1%  5.4% 
4.6%  6.3% 
The value of N is 1000, and if you were to carry out a chisquared test, the value you will get is 18.13 while the degree of freedom is 8. Hence, the hypothesis can be accepted at 1% significance level.
Conclusion
The tests made, while fit the hypothesis at 1% significance level, could be further by other means. Due to the limitation of the algorithm provided, large numbers of N and V cannot be provided (see below at mathematics notes for usage of larger N and V).
Nevertheless, as N and V both approaches infinity (with V strictly larger than N significantly), it can be shown that this argument holds as long as the logical assumptions do apply. Hence, hoorary! Data source which are consistent with Benford’s Law now include Youtube video views!
Mathematics notes
Why no videos will end up with only 1 view (under ideal conditions)
You may be tempted to think that the probability of being picked will be low when all other videos are picked. However, mathematics say otherwise. Every round, the probability of a video which was not picked before being chosen is 1 /(N+v) , where v is the number of views that have been allocated up to date. So with some simple math, we can conclude that the probability of a video being chosen is:
1 / N + 1 / (N+1) + 1 / (N+2) ….
This is a harmonic sequence, and it is divergent (really slowly), and that is a pretty good conclusion about every video.
Proving that the probability is proportional to the number of views
Complexity of Algorithm:
It is definitely not the best algorithm, but anyway.. :
 Precomputing the videos to have 1 view is O(N)
 Randomly picking a video is O(log N)
 Reassigning probability after increaseing view is O(N)
 Computing statistics is O(N)
Final complexity: O(N + V(log N + N) )
This limits N and V to be around 1000 and 10000 respectively
A better algorithm:
A better solution by using Fenwick tree as the cumulative frequency container (used by the weighted random algorithm) would give a healthier complexity:
 Precomputing the videos to have 1 view is O(N)
 Randomly picking a video is O(log N * log N) because you need to binary search a fenwick tree
 Reassigning probability after increaseing view is O(log N)
 Computing statistics is O(N)
Final complexity: O(N + V log N(1 + log N) )
The limits of N and V can be around 1000 and 1000000 respectively