
Explorations

How Much Did King Kong Weigh?

Andrew Elliott

The Empire State Building always brings to mind that iconic image of King Kong atop the skyscraper, swatting away biplanes as he clutches Fay Wray in his massive hand. But how massive? How much would the 1933 version of the mighty Kong have weighed?

[Image: King Kong, the 1933 movie]

IMDb provides some relevant information on scale. Apparently, the size of the enormous ape varies from location to location and scene to scene. The publicity described Kong as 50 feet tall, but the sets in the jungle of his home island were consistent with an 18 ft beast. The models for close-up photography of his hand were built to a scale that would fit a 40 ft animal, and the New York scenes were consistent with a 24 ft scale. Since it was the image of Kong on the Empire State Building that sparked this thought, let’s go with that last figure and treat him as 7.32 m tall.

If we take a Western Gorilla as the model for Kong when calculating height/weight ratios, we can scale up the height and use the square-cube law to scale up the weight. A very large gorilla of this species would be around 1.8m high and would weigh about 230 kg. So Kong was just over 4 times as tall as a very large gorilla, and using the cube of that ratio to scale his weight, we need a factor of 67.25 to give us a final mass of just under 15,500 kilograms. Does this seem reasonable? Three times as big as an elephant? I guess it does.
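As a quick check, the scaling takes only a few lines of R, using the figures above:

# Scale a very large Western Gorilla up to Kong's New York height.
kong_height    <- 7.32   # m (the 24 ft scale)
gorilla_height <- 1.8    # m
gorilla_mass   <- 230    # kg

ratio <- kong_height / gorilla_height   # just over 4
ratio^3                                 # about 67.25, the mass scaling factor
gorilla_mass * ratio^3                  # about 15,470 kg - just under 15,500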

It’s entirely feasible that Kong would be scornful of the aircraft, since the planes used in the scene were Curtiss O2C-2 'Helldivers', which have a gross mass of a little over 2,000 kg, around one eighth of his weight. But those annoying planes, equipped as they are with machine guns, finally cause the mighty Kong to lose his grip and tumble the 381 metres (52 times his own height) to the street below. And compared to that iconic building, the giant ape comes in at less than 1/20,000 of the mass of the Empire State Building itself, which is estimated to weigh around 331 million kilograms (331,000 tonnes).
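And the comparisons in that paragraph check out with the same figures:

kong_mass <- 15468    # kg, from the scaling above
plane     <- 2000     # kg - "a little over" this, as quoted
building  <- 331e6    # kg, the Empire State Building estimate quoted above

plane / kong_mass     # about 0.13 - roughly an eighth of Kong's weight
381 / 7.32            # about 52 body heights of fall
kong_mass / building  # about 1/21,000 - less than 1/20,000 of the building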

Plain Talking About Numbers

Andrew Elliott

I've recently been taking forward an idea that's been in the back of my mind for a while. www.IsThatABigNumber.com is a website with a simple aim: to put big numbers in context and, in so doing, start to develop a more intuitive feel for them.

While I can intellectually understand the meaning of large numbers, typically written in scientific notation (e.g. 2.5 x 10^8) or expressed in billions and trillions, that's not quite the same as having a "feeling" for very large numbers. In fact, when I really think about it, my sense of comfort with numbers runs out somewhere around the 1000 mark. That is, I can visualise 1000 items without things becoming blurry, but not much more than that. But that is another blog post for another day.

The topic for today is how we talk about numbers. The website IsThatABigNumber.com is all about numbers, and the expression of those numbers needs to be clear and comprehensible.

Take measurements of length: I was taught about the SI system, based on meters, kilograms and seconds. Now for scientists and engineers, it's perfectly fine to talk about 4 x 10^7 m. It's convenient for calculations and it's the proper thing to do. But if I want to explain how long the equator is, I want to talk about 40 thousand kilometers instead.

Because? Because that's the way folk talk. Not 4 x 10^7 m; not 40 megameters; not even 40 million meters. In my mind, things that can be measured using "meters" as the unit range from a bit less than one meter to somewhat more than a thousand. Half a meter? 0.5m is just fine; a 10,000m race? That's fine too. 50,000m? Nah, I'm better with 50km; 0.02m? Nope, give me 2cm or 20mm.

So, here are some of the principles that I am using for IsThatABigNumber:

For all numbers (a sketch implementing these rules follows the list):

  • Numbers are expressed in three parts: a base magnitude between 1 and 1000, followed by a multiple, and where needed, a unit.  So the population of the world is expressed as 7 billion, not 7,000,000,000 (all those zeroes? too hard to grok)
  • The multiple used is based around powers of 1000, with the exception that ...
  • "12,500" is more natural than "12.5 thousand", so for numbers in the 1000 - 999,999 range, we make an exception and use numerals
  • But "12.5 million" is more natural than 12,500,000, so for a million and beyond, we use "*illion" words, to the limit of septillion - 10^24 (and I struggle with septillion!)
  • Beyond septillion, fall back to scientific notation starting with 10^27. In this area, the game is pretty much out of the hands of "folk", and in the hands of the scientists.
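Something like the following, perhaps - purely an illustration in R, with a function name of my own invention rather than the site's actual code:

# Sketch only: the magnitude rules above, roughly.
formatMagnitude <- function(x) {
  if (x < 1000) return(format(x))
  if (x < 1e6) return(format(x, big.mark = ","))    # 1,000 - 999,999: plain numerals
  words <- c("million", "billion", "trillion", "quadrillion",
             "quintillion", "sextillion", "septillion")
  pow <- floor(log10(x) / 3)                        # which power of 1000
  if (pow - 1 <= length(words)) {
    return(paste(signif(x / 1000^pow, 3), words[pow - 1]))
  }
  format(x, scientific = TRUE)                      # 10^27 and beyond
}

formatMagnitude(7e9)     # "7 billion"
formatMagnitude(12500)   # "12,500"
formatMagnitude(2.5e28)  # scientific notation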

Then, when it comes to units (again, there's a sketch after the lists below). For distance measures:

  • Meters are used between 1m and 999m
  • Kilometers are used for distances above 1km
  • Millimeters are used for distances below 1 m.

For measuring mass:

  • Kilograms are used for masses above 1kg
  • Grams are used for masses below 1 kg
  • (Thinking about using metric tons - 1000kg for bigger masses - but currently undecided)
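And the unit rules, sketched in the same spirit (again, the function names are mine and this is only an illustration); in practice these would feed into the magnitude formatting above:

# Sketch only: unit selection for distance and mass.
formatDistance <- function(m) {
  if (m >= 1000) {
    paste(m / 1000, "km")       # kilometers from 1 km upwards
  } else if (m >= 1) {
    paste(m, "m")               # meters between 1 m and 999 m
  } else {
    paste(m * 1000, "mm")       # millimeters below 1 m
  }
}

formatMass <- function(kg) {
  # metric tonnes (1000 kg) for bigger masses: still undecided
  if (kg >= 1) paste(kg, "kg") else paste(kg * 1000, "g")
}

formatDistance(4e7)   # "40000 km" - the equator, before the magnitude rules kick in
formatMass(0.02)      # "20 g"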

Time is a whole separate problem, not yet addressed. For now, years are the only units in use, but really, days and seconds seem more natural for small time periods. But then this is about BIG numbers.

Money is the other measure included in IsThatABigNumber.com. For now, US Dollars are the standard unit, rendered with a "$" sign.

Is That A Big Number?

Andrew Elliott

Do numbers make you numb?

If they do, have a look  here (www.isthatabignumber.com) to restore some number sensitivity. Or read on to understand why...

Way back in May 1982, Douglas Hofstadter (he of "Gödel, Escher, Bach" fame) wrote an article for Scientific American called "Number Numbness, or Why innumeracy may be just as dangerous as illiteracy". To provoke the readers to think about how they internalise big numbers, he concocted this scenario:

'The renowned cosmologist Professor Bignumska, lecturing on the future of the universe, had just stated that in about a billion years, according to her calculations, the earth would fall into the sun in a fiery death. In the back of the auditorium a tremulous voice piped up: "Excuse me, Professor, but h-h-how long did you say it would be?" Professor Bignumska calmly replied, "About a billion years." A sigh of relief was heard. "Whew! For a minute there, I thought you'd said a million years."'

The absurdity of the comment arises because a million and a billion years are both so far beyond our lifespans as to make the difference meaningless from a personal point of view. In the article, he makes the case that most people have little real grasp of large numbers: not really being able to distinguish millions from billions from trillions, even though there is a thousand-fold difference between each.

But while this distinction may not give us sleepless nights when used in comparison to human lifespans, there are areas of life (national and corporate budgets, national population statistics, even hard disk sizes) where the billion vs million distinction DOES affect our lives, and many of us lack the "Number Sense" to be aware, instinctively, of the difference. Hofstadter argues that this "numbness" to numbers causes a loss of perspective, to the detriment of public debate.

Numbers in the News

The media themselves often fail to establish a proper context for the numbers in the news. Any number ending in "...illion" just ends up in a mental category called "big number".  

In November 2015, the UK public sector net borrowing was around £14 billion; debt was around £1.5 trillion.  Are those big numbers? Of course they are, but are they unexpectedly big? Are they alarmingly big? Are they big in context?

Lionel Messi earns around 25 million Euros a year.  Is this a big number? Of course it is, but how big, in context? And what context should we use? Other footballers? Other sports people? Other individuals? Corporations?

I'm a huge fan of the BBC Radio 4 programme "More or Less". This programme tears apart statistical claims floating about in current debates: I think it makes a vital contribution to understanding what's really going on, and to debunking inaccurate claims. And one question they will often start with, when looking at some reported statistic, is "Is that a big number?".

So, Is That A Big Number?

All this is by way of introducing an idea I am currently working on - an online service to answer just that question. Enter a number, any number, and it'll respond with a bunch of relevant comparisons, to put the number in context. 

For example: in 2015, there were 72.4 million cars sold in the world. Is that a big number? The web service tells us: "One for every 100 people in the world". 17.5 million cars sold in the USA? That's "One for every 18 people in the USA". Big numbers? You can draw your own conclusions. And that's the point: to allow people to make informed judgements by putting things in context.
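Those comparisons are just ratios, and easy to check. Here's the arithmetic in a couple of lines of R, using rough 2015 population figures (my ballpark numbers, not the site's data):

world_pop <- 7.3e9    # rough 2015 world population (ballpark)
usa_pop   <- 3.2e8    # rough 2015 US population (ballpark)

world_pop / 72.4e6    # about 100: one car sold for every 100 people in the world
usa_pop / 17.5e6      # about 18: one car sold for every 18 people in the USA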

We'll throw in a few quirky measures too, just for fun. How long is an Imperial Star Destroyer, in terms of X-Wings? How long is a football pitch in terms of iPhones laid end to end?

It's very much in development but you can play around with what's been done so far here (www.isthatabignumber.com). As you can see from all the not-yet-live links, there's a lot more to come. We're hoping to use this as a hub for a variety of numeracy-related services: a number-led blog, educational resources and more.

So, is 25 million Euros a big number? Click this link to see:
http://www.isthatabignumber.com/itabn/compare?number=25+m+EUR

I'd love to think this could play some small role in helping people such as journalists, teachers or just the curious to better understand the numbers around us.

How Fresh is that Code?

Andrew Elliott

One of the beauties of the "R" programming language is the vitality of the user community. Language users are continuously uploading newly developed or revised versions of extension packages. Looking at the range of packages available on CRAN, the "Comprehensive R Archive Network", I was struck by how many of these packages had recent versions registered. So I decided to dig a little, and at the same time give you a little flavour of quick and dirty data exploration with R. Some highlights:

Load in the package list from CRAN:

packages<- getRPackages("http://cran.r-project.org/web/packages/available_packages_by_date.html")
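getRPackages() here is a helper function rather than anything from base R or CRAN. Its details don't matter much for what follows, but a minimal sketch of one way such a helper could read the by-date table - assuming the rvest package, purely as an illustration - might be:

# Sketch only: scrape the CRAN by-date table into a data frame.
library(rvest)

getRPackages <- function(url) {
  page <- read_html(url)
  tbl  <- html_table(html_nodes(page, "table")[[1]])
  names(tbl) <- c("date", "name", "desc")
  tbl$dt <- as.POSIXct(tbl$date, tz = "UTC")   # parse the date column
  tbl
}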

How many packages are in the archive?

dim(packages)[1]
## [1] 7422

Date of stalest package?

min(packages$dt)
## [1] "2005-10-29 UTC"

Date of freshest package?

max(packages$dt)
## [1] "2015-11-03 UTC"

Ooh! That's today: how many packages are fresh today?

nrow(packages[packages$dt==max(packages$dt),])
## [1] 5

And just for interest, which are they?

packages[packages$dt==max(packages$dt),c("name", "dt")]
##            name         dt
## 1      DLMtool  2015-11-03
## 2   epiDisplay  2015-11-03
## 3         MM2S  2015-11-03
## 4    quickmapr  2015-11-03
## 5  SALTSampler  2015-11-03

Ok, so let's compute the ages of the packages (in weeks). How many packages are less than 4 weeks old?

library(lubridate)   # interval() and edays() come from the lubridate package
today<-max(packages$dt)
packages$age<-interval(packages$dt,today)/edays(7)
sum(packages$age<=4)
## [1] 587

Around 8%! Let's look at the distribution by age - for convenience, convert weeks to approximate years:

ageInYears <- packages$age / 52
hist(ageInYears, breaks=20)

More than half the packages are fresher than 1 year old, and it's easy to see that the growth took off just about 4 years ago after several years of slow burn. Let's look at the growth over just the past year - or, more precisely, the most recent 44 weeks:

freshThisYear<-packages[packages$age<=44,]$age
hist(freshThisYear, breaks=44)

I think it's clear that the takeup of R continues to accelerate, if the freshness of the user-contributed archive is any sort of guide.

"R" is for Re-use

Andrew Elliott

Previously on "R is for ..."

One of R's greatest strengths is the level of activity in the user community and the range of packages that have been developed and contributed to the general good. There are thousands of packages out there and the list grows daily. How is the young data scientist to stay on top of this flood of material, I hear you ask? Various helpful lists have been contributed by bloggers and other commentators, such as "10 R packages I wish I knew about earlier". The CRANtastic website provides a list of favourites based on user ratings (http://crantastic.org/popcon), and r-bloggers provides a list by frequency of download in RStudio (http://www.r-bloggers.com/a-list-of-r-packages-by-popularity/).

Dependencies

Another way of looking at this is to ask which packages are most fundamental to the broader R community - which packages do package authors build upon? The CRAN repository provides structured data on each package: among the data provided are "Depends" and "Imports", which list the packages each one is built upon. It seemed a fun thing to see which packages were most depended upon - which were the most fundamental in the R ecosystem.

First-Order

For this exercise I didn't bother distinguishing between "Depends" and "Imports". I wrote a simple routine to take the list of packages from CRAN and then, for each one, to harvest the contents of the "Depends" and "Imports" properties from the relevant page on the CRAN website, stashing those package names in a table which I called "antecedants". The table has columns "self" (the package in question), "ante" (the antecedant package) and "order" (the depth of the dependency).

        options(width=100)
        source("Rpackages.R")
        load("Packages.Rda")
        load("Antecedants.Rda")
        head(antecedants)
##         self         ante order
## 1  cleangeo            sp     1
## 2  cleangeo         rgeos     1
## 3  cleangeo      maptools     1
## 4     smerc  SpatialTools     1
## 5     smerc        fields     1
## 6     smerc          maps     1
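The harvesting routine itself lives in Rpackages.R and isn't reproduced here. For anyone wanting to build a similar first-order table, here is a minimal sketch assuming only base R - available.packages() exposes the "Depends" and "Imports" fields directly, so no page scraping is needed (the result won't be identical to the table above, but the idea is the same):

        # Illustration only: a first-order dependency table via available.packages().
        ap <- available.packages(contrib.url("https://cran.r-project.org", "source"))

        splitDeps <- function(x) {
          if (is.na(x)) return(character(0))
          deps <- trimws(unlist(strsplit(x, ",")))
          deps <- sub("\\s*\\(.*\\)$", "", deps)       # drop version constraints
          setdiff(deps, c("R", ""))
        }

        firstOrder <- do.call(rbind, lapply(rownames(ap), function(p) {
          ante <- unique(c(splitDeps(ap[p, "Depends"]), splitDeps(ap[p, "Imports"])))
          if (length(ante) == 0) return(NULL)
          data.frame(self = p, ante = ante, order = 1, stringsAsFactors = FALSE)
        }))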

That harvest gave the first-order dependencies, and here are some interesting glimpses into the table. I used table() to count the order-1 dependencies on each antecedant, to see which are most re-used, and then sorted that table to reveal the top ten.

        ante1<-table(antecedants[antecedants["order"]==1,]$ante)
        anteSorted1<-ante1[order(ante1, decreasing=TRUE)]
        length(anteSorted1)
## [1] 1458
        dim(antecedants[antecedants["order"]==1,])
## [1] 10330     3
        head(anteSorted1, 10)
##
##     MASS     Rcpp  ggplot2     plyr   Matrix  lattice  stringr reshape2       sp  mvtnorm
##      374      370      321      266      183      173      157      151      146      142

So something over a thousand packages are in some way re-used, for a total of over 10,000 order-1 dependencies, and the most popular include many of the usual suspects like ggplot2 and plyr.

Going Deeper

But just looking at the first level is not good enough. If your package builds on, say, ggplot2, which has plyr among its antecedants, then of course plyr is an antecedant of your package too - a second-order antecedant. So we need to get recursive, and we can do this just by analysing the antecedants table: build the order-2 antecedants from the order-1 table, the order-3 from the order-2, and so on, until we finally bottom out and reach the maximum depth. Along the way we need to make sure we don't double-count - if a package uses ggplot2 and also uses plyr directly, we don't want to be double-counting plyr. A sketch of that recursive step follows below.
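As a sketch of the idea (an illustration, not necessarily how Rpackages.R does it): each pass joins the previous order's rows back onto the order-1 table, and keeps only the self/ante pairs that haven't already been seen at a lower order.

        # Sketch only: extend the antecedants table one order at a time.
        # Assumes self/ante are character columns (stringsAsFactors = FALSE).
        nextOrder <- function(antecedants, k) {
          prev  <- antecedants[antecedants$order == k - 1, c("self", "ante")]
          names(prev) <- c("self", "mid")
          first <- antecedants[antecedants$order == 1, c("self", "ante")]
          names(first) <- c("mid", "ante")
          step <- merge(prev, first, by = "mid")               # follow one more link
          step <- data.frame(self = step$self, ante = step$ante, order = k,
                             stringsAsFactors = FALSE)
          step <- step[!duplicated(step[, c("self", "ante")]), ]
          seen <- paste(antecedants$self, antecedants$ante)    # avoid double-counting
          step[!(paste(step$self, step$ante) %in% seen), ]
        }

        k <- 2
        repeat {                                               # until we bottom out
          extra <- nextOrder(antecedants, k)
          if (nrow(extra) == 0) break
          antecedants <- rbind(antecedants, extra)
          k <- k + 1
        }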

So, for example, here are the most frequent order-3 dependencies.

        ante3<-table(antecedants[antecedants["order"]==3,]$ante)
        anteSorted3<-ante3[order(ante3, decreasing=TRUE)]
        head(anteSorted3, 10)
##
##      lattice         Rcpp      stringr RColorBrewer         plyr     magrittr       digest
##          674          659          503          469          465          421          394
##    dichromat     labeling      munsell
##          380          380          380

And having chased this down, until there were no more levels, the winners are ...

        anteN<-table(unique(antecedants[,-3])$ante)
        anteSortedN<-anteN[order(anteN, decreasing=TRUE)]
        top10ante<-head(anteSortedN, 10)
        top10ante
##
##         Rcpp      lattice         MASS     magrittr      stringi      stringr       digest
##         1341         1119         1048          911          876          864          853
##         plyr RColorBrewer   colorspace
##          799          654          650

So what are these packages that float to the top of the list?

        packages[trim(packages$name) %in% names(top10ante),1:2]
##                name                                                            desc
## 449           Rcpp                                  Seamless R and C++ Integration
## 674           MASS   Support Functions and Datasets for Venables and Ripley's MASS
## 1489       lattice                                          Trellis Graphics for R
## 1820       stringi                          Character String Processing Facilities
## 2013          plyr                Tools for Splitting, Applying and Combining Data
## 2460       stringr        Simple, Consistent Wrappers for Common String Operations
## 2871    colorspace                                        Color Space Manipulation
## 3479        digest                  Create Cryptographic Hash Digests of R Objects
## 3661  RColorBrewer                                            ColorBrewer Palettes
## 3796      magrittr                                   A Forward-Pipe Operator for R

Oh, and ...

Just for fun, some other bits and pieces

The deepest dependency:

        head(antecedants[antecedants$order==max(antecedants$order),])
##                self       ante order
## 53579  BIFIEsurvey     lattice    10
## 53580  BIFIEsurvey        Rcpp    10
## 53581  BIFIEsurvey     stringi    10
## 53582  BIFIEsurvey    magrittr    10
## 53583  BIFIEsurvey  colorspace    10

The number of dependencies for each order maxes out at second-order dependencies, and then tails away:

        table(antecedants$order)
##
##     1     2     3     4     5     6     7     8     9    10
## 10330 14504 11647  8289  5052  2582   932   208    34     5

The most dependent packages - the ones which will pull in the greatest number of other packages:

        selfN<-table(unique(antecedants[,-3])$self)
        selfSortedN<-selfN[order(selfN, decreasing=TRUE)]
        top10self<-head(selfSortedN, 10)
        top10self
##
##  BIFIEsurvey      miceadds         immer          sirt     treescape       semPlot       bootnet
##           120           119           108           106            92            87            84
##    IATscores           RAM     diveRsity
##            83            82            81

And these highly dependent packages, what do they do?

        packages[packages$name %in% names(top10self),1:2]
##               name                                                                     desc
## 386         immer                                Item Response Models for Multiple Ratings
## 437     treescape              Statistical Exploration of Landscapes of Phylogenetic Trees
## 484   BIFIEsurvey                    Tools for Survey Statistics in Educational Assessment
## 1546     miceadds    Some Additional Multiple Imputation Functions, Especially for\n'mice'
## 1817         sirt                                Supplementary Item Response Theory Models
## 2231          RAM                        R for Amplicon-Sequencing-Based Microbial-Ecology
## 2277    IATscores                 Implicit Association Test Scores Using Robust Statistics
## 2921      bootnet                Bootstrap Methods for Various Network Estimation Routines
## 3526    diveRsity   A Comprehensive, General Purpose Population Genetics Analysis\nPackage
## 4335      semPlot       Path diagrams and visual analysis of various SEM packages'\noutput