
AI and Social Science - Brendan O’Connor

Cognition, systems, decisions, visualization, machine learning, etc.


Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

Lukas and I were trying to write a succinct comparison of the packages most commonly used for data analysis.  I think most people choose one based on what the people around them use or what they learned in school, so comparative information is hard to find.  I’m posting the table here in hopes of useful comments.

| Name | Advantages | Disadvantages | Open source? | Typical users |
|------|------------|---------------|--------------|---------------|
| R | Library support; visualization | Steep learning curve | Yes | Finance; Statistics |
| Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering |
| SciPy/NumPy/Matplotlib | Python (general-purpose programming language) | Immature | Yes | Engineering |
| Excel | Easy; visual; flexible | Large datasets | No | Business |
| SAS | Large datasets | Expensive; outdated programming language | No | Business; Government |
| Stata | Easy statistical analysis | | No | Science |
| SPSS | Like Stata, but more expensive and worse | | No | Science |

[7/09 update: tweaks incorporating some of the excellent comments below, esp. for SAS, SPSS, and Stata.]

There’s a bunch more to be said for every cell.  Among other things:

  • Two big divisions in the table: the more programming-oriented solutions are R, Matlab, and Python; the more canned, analysis-oriented solutions are Excel, SAS, Stata, and SPSS.
  • Python “immature”: matplotlib, numpy, and scipy are all separate libraries that don’t always get along.  Why does matplotlib come with “pylab,” which is supposed to be a unified namespace for everything?  Isn’t scipy supposed to do that?  Why is there duplication between numpy and scipy (e.g. numpy.linalg vs. scipy.linalg; see the first sketch after this list)?  And then there’s package-compatibility version hell.  You can use SAGE or Enthought, but neither is standard (yet).  In terms of functionality and approach, SciPy is closest to Matlab, but it feels much less mature.
  • Matlab’s language is certainly weak.  It sometimes doesn’t seem to be much more than a scripting language wrapping the matrix libraries.  Python is clearly better on most counts.  R’s is surprisingly good (Scheme-derived, smart use of named args, etc.) if you can get past the bizarre language constructs and weird functions in the standard library.  Everyone says SAS is very bad.
  • Matlab is the best for developing new mathematical algorithms.  Very popular in machine learning.
  • I’ve never used the Matlab Statistical Toolbox.  I’m wondering, how good is it compared to R?
  • Here’s an interesting reddit thread on SAS/Stata vs R.
  • SPSS and Stata in the same category: they seem to have a similar role so we threw them together.  Stata is a lot cheaper than SPSS, people usually seem to like it, and it seems popular for introductory courses.  I personally haven’t used either…
  • SPSS and Stata for “Science”: we’ve seen biologists and social scientists use lots of Stata and SPSS.  My impression is they get used by people who want the easiest way possible to do the sort of standard statistical analyses that are very orthodox in many academic disciplines.  (ANOVA, multiple regressions, t- and chi-squared significance tests, etc.)  Certain types of scientists, like physicists, computer scientists, and statisticians, often do weirder stuff that doesn’t fit into these traditional methods.
  • Another important thing about SAS, from my perspective at least, is that it’s used mostly by an older crowd.  I know dozens of people under 30 doing statistical stuff and only one knows SAS.  At that R meetup last week, Jim Porzak asked the audience if there were any recent grad students who had learned R in school.  Many hands went up.  Then he asked if SAS was even offered as an option.  All hands went down.  There were boatloads of SAS representatives at that conference and they sure didn’t seem to be on the leading edge.
  • But: is there ANY package besides SAS that can do analysis on datasets that don’t fit into memory?  That is, ones that mostly have to stay on disk?  And exactly how good are SAS’s capabilities here, anyway?  (The second sketch after this list shows the chunk-at-a-time style this requires.)
  • If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work. There are a few multi-machine data processing frameworks that are somewhat standard (e.g. Hadoop, MPI), but it’s an open question what the standard distributed data analysis framework will be.  (Hive? Pig?  Or quite possibly something else.)
  • (This was an interesting point at the R meetup.  Porzak was talking about how going to MySQL gets around R’s in-memory limitations.  But Itamar Rosenn and Bo Cowgill (Facebook and Google respectively) were talking about multi-machine datasets that require cluster computation that R doesn’t come close to touching, at least right now.  It’s just a whole different ballgame with that large a dataset.)
  • SAS people complain about poor graphing capabilities.
  • R vs. Matlab visualization support is controversial.  One view I’ve heard is, R’s visualizations are great for exploratory analysis, but you want something else for very high-quality graphs.  Matlab’s interactive plots are super nice though.  Matplotlib follows the Matlab model, which is fine, but is uglier than either IMO.
  • Excel has a far, far larger user base than any of these other options.  That’s important to know.  I think it’s underrated by computer-science types.  But it does massively break down at >10k, and certainly >100k, rows.
  • Another option: Fortran and C/C++.  They are super fast and memory-efficient, but tricky and error-prone to code; you have to spend lots of time mucking around with I/O, and they have zero visualization or data-management support.  Most of the packages listed above call Fortran numeric libraries for the heavy lifting.
  • Another option: Mathematica.  I get the impression it’s more for theoretical math, not data analysis.  Can anyone prove me wrong?
  • Another option: the pre-baked data mining packages.  The open-source ones I know of are Weka and Orange.  I hear there are zillions of commercial ones too.  Jerome Friedman, a big statistical learning guy, has an interesting complaint that they should focus more on traditional things like significance tests and experimental design.  (Here; the article that inspired this rant.)
  • I think knowing where the typical users come from is very informative for what you can expect to see in the software’s capabilities and user community.  I’d love more information on this for all these options.
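
To make the numpy/scipy duplication point concrete, here is a minimal sketch (assuming only that recent versions of numpy and scipy are installed): both packages ship their own linalg module, and the overlapping functions give the same answers.

    import numpy as np
    import numpy.linalg
    import scipy.linalg

    a = np.array([[2.0, 0.0],
                  [1.0, 3.0]])

    # The same decomposition lives in two places:
    u1, s1, vt1 = numpy.linalg.svd(a)
    u2, s2, vt2 = scipy.linalg.svd(a)

    print(np.allclose(s1, s2))  # True: same answer, duplicated API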

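And on the out-of-core question: a minimal sketch of the chunk-at-a-time style that disk-resident analysis requires, in plain Python with no package support (“big.csv” is a hypothetical file whose first column holds the value of interest).

    import csv

    # Streaming (Welford) mean/variance: read one row at a time and keep only
    # the running moments, so memory use is constant however big the file is.
    n, mean, m2 = 0, 0.0, 0.0
    with open('big.csv') as f:          # hypothetical file
        for row in csv.reader(f):
            x = float(row[0])           # value of interest in column 0
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)    # Welford's update

    variance = m2 / (n - 1) if n > 1 else float('nan')
    print(n, mean, variance)
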
What do people think?

103 comments to “Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata”

  1. Eric Sun wrote:
    23. February 2009 at 8:53 pm :

    >>I know dozens of people under 30 doing statistical stuff and only one knows SAS.

    I’m assuming the “one” is me, so I’ll just say a few points:
    I’m taking John Chambers’s R class at Stanford this quarter, so I’m slowly and steadily becoming an R convert.
    That said, I don’t think anything besides SAS can do well with datasets that don’t fit in memory. We used SAS in litigation consulting because we frequently had datasets in the 1-20 GB range (i.e. can fit easily on one hard disk but difficult to work with in R/Stata where you have to load it all in at once) and almost never larger than 20GB. In this relatively narrow context, it makes a lot of sense to use SAS: it’s very efficient and easy to get summary statistics, look at a few observations here and there, and do lots of different kinds of analyses. I recall a Cournot Equilibrium-finding simulation that we wrote using the SAS macro language, which would be quite difficult in R, I think. I don’t have quantitative stats on SAS’s capabilities, but I would certainly not think twice about importing a 20 GB file into SAS and working with it in the same way as I would a 20 MB file.

    That said, if you have really huge internet-scale data that won’t fit on one hard drive, then SAS won’t be too useful either. I’ll be very interested if this R + Hadoop system ever becomes mature: http://www.stat.purdue.edu/~sguha/rhipe/

    In my work at Facebook, Python + RPy2 is a good solution for large datasets that don’t need to be loaded into memory all at once (for example, analyzing one Facebook network at a time). If you have multiple machines, these computations can be sped up using iPython’s parallel computing facilities.
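
    [[ addendum from me: for readers who haven’t seen RPy2, a minimal sketch of the kind of Python-to-R round trip Eric describes (rpy2’s robjects interface; the data is made up):

        import rpy2.robjects as robjects

        # Push a Python vector into R, run an R function on it, print the result.
        x = robjects.FloatVector([1.2, 3.4, 5.6])   # made-up data
        robjects.globalenv['x'] = x
        print(robjects.r('summary(x)'))

    ]]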

    Also, R’s graphical capabilities continue to surprise me; you can actually do a lot of advanced stuff. I don’t do much graphics, but perhaps check out “R Graphics” by Murrell or Deepayan Sarkar’s book on Lattice Graphics.

  2. Eric Sun wrote:
    23. February 2009 at 8:55 pm :

    I thought that most people consider SAS to have the steepest learning curve, certainly steeper than R’s. But maybe I’m mistaken about that.

  3. Justin wrote:
    23. February 2009 at 10:24 pm :

    Calling scipy immature sounds somehow “wrong”. The issues you bring up are more like early design flaws that will not go away, no matter how “mature” scipy gets.

    That said, these are flaws, but they seem pretty minor to me.

  4. Edward Ratzer wrote:
    23. February 2009 at 10:42 pm :

    I’ve recently seen GNU DAP mentioned as an open-source equivalent to SAS. Know if it’s any good?

  5. TS Waterman wrote:
    23. February 2009 at 10:49 pm :

    Have you considered Octave in this regard? It’s a GNU-licensed Matlab clone. Very nice graphing capability, Matlab syntax and library functions, open source.

    http://www.gnu.org/software/octave/FAQ.html#MATLAB-compatibility

  6. brendano wrote:
    23. February 2009 at 10:52 pm :

    @Eric - oops, yeah, should’ve put SAS as hardest. Good point that the standard for judging large-dataset support is whether you can manipulate a big dataset the same way you manipulate a small one. I’ve loaded 1-2 GB of data into R and you definitely have to do things differently (e.g. never use by()).

    @Justin - scipy certainly seems like it keeps improving. I just keep comparing it to matlab and it’s constantly behind. I remember once watching someone try to make a 3d plot. He spent quite a while going through various half-baked python solutions that didn’t work. Then he booted up matlab and had one in less than a minute. Matlab’s functionality is well-designed, well-put-together and well-documented.

    @Edward - I have seen it mentioned too. From glancing at its home page, it seems like a pretty small-time project.

  7. brendano wrote:
    23. February 2009 at 10:58 pm :

    @TS - yeah, i used octave just once for something simple. it worked fine. my issues were: first, i’m not impressed with gnuplot graphing. second, the interactive environment isn’t too great. third, trying to clone the matlab language seems crazy since it’s kind of crappy. i think i’d usually pick scipy over octave if being free is a requirement, else go with matlab if i have access to it.

    otoh it looks like it supports some nice things like sparse matrices that i’ve had a hard time with lately in R and scipy. i guess worth another look at some point…

  8. Michael E. Driscoll wrote:
    23. February 2009 at 11:05 pm :

    Brendan,

    Nice overview, I think another dimension you don’t mention — but which Bo Cowgill alluded to at our R panel talk — is performance. Matlab is typically stronger in this vein, but R has made significant progress with more recent versions. Some benchmark results can be found at:

    http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php

    MD

  9. Mike wrote:
    23. February 2009 at 11:27 pm :

    In high energy particle physics, ROOT is the package of choice. It’s distributed by CERN, but it’s open source, and is multi-platform (though the Linux flavor is best supported). It does solve some of the problems you mentioned, like running over large datasets that can’t be entirely memory-resident. The syntax is C++ based, and has both an interpreter and the ability to compile/execute scripts from the command line.

    There are lots of reasons to prefer other packages (like R) over ROOT for certain tasks, but in the end there’s little that can be done with other packages that one cannot do with ROOT.

  10. Pete Skomoroch wrote:
    24. February 2009 at 12:32 am :

    This is obviously oversimplified - but that is the point of a succinct comparison. I would add that you are missing a lot of disadvantages for Excel - it has incomplete statistics support and an outdated “language” :)

    Python actually really shines above the others for handling large datasets using memmap files or a distributed computing approach. R obviously has a stronger statistics user base and more complete libraries in that area - along with better “out-of-the-box” visualizations. Also, some of the benefits overlap - using numpy/scipy you get that same elegant matrix support / syntax that matlab has, basically slicing arrays and wrapping lapack.
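
    [[ addendum from me: a tiny illustration of the matlab-style slicing and broadcasting Pete means, with made-up data:

        import numpy as np

        x = np.random.randn(1000, 3)                # 1000 observations, 3 variables
        centered = x - x.mean(axis=0)               # broadcasting subtracts column means
        cov = centered.T @ centered / (len(x) - 1)  # covariance via matrix multiply
        print(np.linalg.eigvalsh(cov))              # eigenvalues; LAPACK under the hood

    ]]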

    The advantages of having a real programming language and all the additional non-statistical libraries & frameworks available to you make Python the language of choice for me. If there is something scipy is weak at that I need, I’ll also use R in a pinch or move down to C. I think you are basically operating at a disadvantage if you are using the other packages at this point. The only other reason I can see to use them is if you have no choice, for example if you inherited a ton of legacy code within your organization.

  11. John wrote:
    24. February 2009 at 3:09 am :

    I’m sure you’ve stirred up a lot of controversy. Thanks for calling ‘em like you see ‘em.

    As for Mathematica, I haven’t used it for statistics beyond some basic support for common distributions. But one thing it does well is very consistent syntax. I used it when it first came out, then didn’t use it for years, and then started using it again. When I came back to it, I was able to pick it up right where I left off. I can’t put R down for a week and remember the syntax. Mathematica may not do everything, but what it does do, it does elegantly.

  12. jessy wrote:
    24. February 2009 at 6:34 am :

    it would be awesome to have an informal, hands on tutorial comparison of several of these languages (looking at ease, performance, features, etc.). maybe a meetup at something like super happy dev house, or even something separate. just a thought!

  13. brendano wrote:
    24. February 2009 at 6:34 am :

    @Michael Driscoll - good point! I was afraid to make performance claims since I’ve heard that Matlab is getting faster; they have a JIT or a nice compiler or something now, and I haven’t used it too much recently. (That benchmark page doesn’t even say which Matlab version was used, though I emailed the guy…) I’m also suspicious of performance comparisons since I’d expect much of it to be very dependent on the matrix library, and there are several LAPACKs out there (ATLAS and others) and many compile-time parameters to fiddle with. I think I read something claiming many binary builds of R don’t use the best LAPACK they could. I’m not totally sure of this though. But if it’s true that Matlab knows how to vectorize for-loops, that’s really impressive.

    @Mike - ah yes, i remember looking at ROOT a long time ago and thinking it was impressive. But then I forgot about it because all the cs/stats people whose stuff I usually read don’t know about it. I think it just goes to show you that the data analysis tools problem is tackled so differently by different groups of people, it’s very easy to miss out on better options just due to lack of information!

    @Pete - yeah I whine about python. but I seem to use numpy plenty still :) actually its freeness is a huge win over matlab for cluster environments since you don’t have to pay for a zillion licenses…

    Hm I seem to be talking myself into thinking it’s down to R vs Python vs Matlab. then the rosetta stone http://mathesaurus.sourceforge.net/matlab-python-xref.pdf should be my guide…

    @John - very interesting. I think many R users have had the experience of quickly forgetting how to do basic things.

  14. brendano wrote:
    24. February 2009 at 9:33 am :

    From David Knowles, who did the comparison Mike Driscoll linked to (http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php):

    > Nice comparison. I would add to the pros of R/Python that the data
    > structures are much richer than Matlab. The big pro of Matlab still
    > seems to be performance (and maybe the GUI for some people). On top of
    > being expensive Matlab is a nightmare if you want to run a program on
    > lots of nodes because you need a license for every node!
    >
    > It’s 2008b I did the comparison with - I should mention that!

  15. Capt. Jean-Luc Pikachu wrote:
    24. February 2009 at 2:45 pm :

    From Rob Slaza’s statistics toolbox tutorials, it *seems* like using MATLAB for stats is reasonably simple…

  16. Gaurav wrote:
    24. February 2009 at 9:27 pm :

    > On top of being expensive Matlab is a nightmare if you want to run a program on lots of nodes because you need a license for every node!

    @Brendan:

    Re David Knowles’ comment…

    There are specialized parallel/distributed computing tools available from MathWorks for writing large-scale applications (for clusters, grid etc.). You should check out: http://www.mathworks.com/products/parallel-computing.

    Running full-fledged desktop MATLAB on a huge number of nodes is messy and of course very expensive not to mention that a single user would take away several licenses for which other users will have to wait.

    Disclosure: I work for the parallel computing team at The MathWorks

  17. brendano wrote:
    25. February 2009 at 12:27 am :

    Another guy from Mathworks, their head of Matlab product management Scott Hirsch, contacted me about the language issue and was very kind and clarifi-cative. The most interesting bits below.

    On Tue, Feb 24, 2009 at 7:20 AM, Scott Hirsch wrote:
    >> Brendan –
    >>
    >> Thanks for the interesting discussion you got rolling on several popular
    >> data analysis packages
    [...]
    >> I’m always very interested to hear the perspectives of MATLAB users, and
    >> appreciate your comments about what you like and what you don’t like. I was
    >> interested in following up on this comment:
    >>
    >> “Matlab’s language is certainly weak. It sometimes doesn’t seem to be
    >> much more than a scripting language wrapping the matrix libraries. “
    >>
    >> I have my own assumptions about what you might mean, but I’d be very
    >> interested in hearing your perspectives here. I would greatly appreciate it
    >> if you could share your thoughts on this subject.
    >
    > sure. most of my experiences are with matlab 6. just briefly,
    >
    > * leave out semicolon => print the expression. that is insane.
    > * each function has to be defined in its own file
    > * no optional arguments
    > * no named arguments
    > * no way to group variables together in a structure. (i don’t need object
    > orientation, just a bunch of named items)
    > * no perl/python-style hashes
    > * no object orientation (or just a message dispatch system) … less
    > important
    > * poor/no support for text
    > * or other things a general purpose language knows how to do (sql, networks,
    > etc etc)

    On Tue, Feb 24, 2009 at 11:27 AM, Scott Hirsch wrote:
    > Thanks, Brendan. This is very helpful. Some of the things have been
    > addressed, but not all. Here are some quick notes on where we are today.
    > Just to be clear – I have no intention (or interest) in changing your
    > perspectives, just figured I could let you know in case you were curious.
    >
    >
    >
    > > * leave out semicolon => print the expression. that is insane.
    > No plans to change this. Our solution is a bit indirect, but doesn’t break
    > the behavior that lots of users have come to expect. We have a code
    > analysis tool (M-Lint) that will point out missing semi-colons, either while
    > you are editing a file, or in a batch process for all files in a directory.
    >
    > > * each function has to be defined in its own file
    > You can include multiple functions in a file, but it introduces unique
    > semantics – primarily that the scope of these functions is limited to within
    > the file.

    [[ addendum from me: yeah, exactly. if you want to make functions that are
    shared in different pieces of your code, you usually have to do 1 function per
    file. ]]

    > > * no optional arguments
    > Nothing yet.
    >
    > > * no named arguments
    > Nope.
    >
    > > * no way to group variables together in a structure. (i don’t need object
    > orientation, just a bunch of named items)
    > We’ve had structures since MATLAB 5.

    [[ addendum from me: well, structures aren't very conventional in standard
    matlab style, or at least certainly not the standard library. most algorithm
    functions return a tuple of variables, instead of packaging things together
    into a structure. ]]

    > > * no perl/python-style hashes
    > We just added a Map container last year.
    >
    > > * no object orientation (or just a message dispatch system) … less
    > important
    > We had very weak OO capabilities in MATLAB 6, but introduced a modern system
    > in R2008a.
    >
    > > * poor/no support for text
    > This has gotten a bit better, primarily through the introduction of regular
    > expressions, but can still be awkward.
    >
    > > * or other things a general purpose language knows how to do (sql, networks,
    > etc etc)
    > Not much here, other than a smattering (Database Toolbox for SQL,
    > miscellaneous commands for web interaction, WSDL, …)
    >
    > Thanks again. I really do appreciate getting your perspective. It’s
    > helpful for me to understand how MATLAB is perceived.
    >
    > -scott

  18. brendano wrote:
    25. February 2009 at 12:38 am :

    @Gaurav - it sure would be nice if i could see how much this parallel toolbox costs without having to register for a login!

  19. Peter Skomoroch wrote:
    25. February 2009 at 11:30 am :

    There is another good numpy/matlab comparison here:

    http://www.scipy.org/NumPy_for_Matlab_Users

    As of the last year, a standard ipython install ( “easy_install IPython[kernel]” ) now includes parallel computing right out of the box, no licenses required:

    http://ipython.scipy.org/doc/rel-0.9.1/html/parallel/index.html

    If this is going to turn into a performance shootout, then I’ll add that from what I’ve seen Python with numpy/scipy outperforms Matlab for vectorized code.

    My impression has been that performance order is Numpy > Matlab > R, but as my friend Mike Salib used to say - “All benchmarks are lies”. Anyway, competition is good and discussions like this keep everyone thinking about how to improve their platforms.

    Also, keep in mind that performance is often a sticking point for people when it need not be. One of the things I’ve found with dynamically typed languages is that ease of use often trumps raw performance - and you can always move the intensive stuff down to a lower level.

    For people who like poking at numbers:

    http://www.scipy.org/PerformancePython
    http://www.mail-archive.com/numpy-discussion@scipy.org/msg14685.html
    http://www.mail-archive.com/numpy-discussion@scipy.org/msg01282.html

    Sturla has some strong points here:
    http://www.mail-archive.com/numpy-discussion@scipy.org/msg14697.html

  20. thrope wrote:
    25. February 2009 at 11:44 am :

    @brendano - I think it might be a case of “if you have to ask you can’t afford it” :)

  21. devicerandom wrote:
    25. February 2009 at 11:48 am :

    What about Origin (and Linux/Unix open source clones like Qtiplot)? I know a lot of people using them, and they allow fast, easy statistical analysis with beautiful graphs out of the box. Qtiplot is quite immature, but it is Python-scriptable, which is a definite plus for me - I don’t know about Origin.

  22. Stefan wrote:
    25. February 2009 at 12:49 pm :

    Hi. I think this is a very incomplete comparison. If you want to make a real comparison, it should be more complete than this wiki article. And to give a bit of personal feedback:
    I know 2 people using STATA (social science), 2 people using Excel (philosophy and economics), several using LabView (engineers), some using R (statistical science, astronomy), several using S-Lang (astronomy), several using Python (astronomy) and by using Python, I mean that they are using the packages they need, which might be numpy, scipy, matplotlib, mayavi2, pymc, kapteyn, pyfits, pytables and many more. And this is the main advantage of using a real language for data analysis: you can choose among the many solutions the one that fits you best. I also know several people who use IDL and ROOT (astronomy and physics).
    I have used IDL, ROOT, PDL, (Excel if you really want to count that in) and Python and I like Python best :-)
    @brendano: One other note: I think you really have to distinguish between data analysis and data visualization. In astronomy these are often handled by completely different software. The key here is to support standardized file storage/exchange formats. In your example the people used scipy, which does not offer a single visualization routine, so you cannot blame scipy for difficulties with 3D plots…

  23. david wrote:
    25. February 2009 at 12:58 pm :

    I am a core scipy/numpy developer, and I don’t think calling them immature from a user POV is totally unfair. Every time someone tries numpy/scipy/matplotlib and cannot plot something simple in a couple of minutes is a failure of our side. I can only say that we are improving - projects like pythonxy or enthought are really helpful too for people who want something more integrated.

    There is no denying that if you want an integrated solution, numpy/scipy is not the best of the ones mentioned today; it may well be the worst (I don’t know them all, but I am very familiar with matlab, and somewhat familiar with R). There is a fundamental problem with all those integrated solutions: once you hit their limitations, you can’t go beyond them. Not being able to handle data which does not fit in memory in matlab, that’s a pretty fundamental issue, for example. Not having basic data structures (hashmap, tree, etc…) is another one. Making advanced UIs in matlab is not easy either.

    You can build your own solution with the python stack: the numpy array capabilities are far beyond matlab’s, for example (broadcasting and advanced indexing are much more powerful than matlab’s current capabilities). The C API is complete, and you can do things which are simply not possible with matlab. You want to handle very big datasets? pytables gives you a database-like API on top of hdf5. Things like cython are also very powerful for people who need speed. I believe those are partially consequences of not being integrated.
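
    [[ addendum from me: the pytables pattern david mentions looks roughly like this (current PyTables API; “data.h5” is a hypothetical file):

        import numpy as np
        import tables  # PyTables

        # Write a big array to HDF5, then read back only a slice,
        # never loading the whole thing into memory.
        with tables.open_file('data.h5', mode='w') as f:
            f.create_array(f.root, 'x', np.random.randn(1000000))

        with tables.open_file('data.h5', mode='r') as f:
            chunk = f.root.x[:1000]    # reads just this slice from disk
            print(chunk.mean())

    ]]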

    Concerning the flaws you mentioned (scipy.linalg vs numpy.linalg, etc…): those are mostly legacies, or exist because removing them would be too costly. There are some efforts to remove redundancy, but not all of them will disappear. They are confusing for a newcomer (they were for me), but they are pretty minor IMHO, compared to other problems.

  24. bill wrote:
    25. February 2009 at 2:29 pm :

    You forgot support and continuity. In my experience, SAS offers very good support and continuity. Others claim SPSS does, too (I have no experience there). In a commercial environment, the programs need to outlive the analyst and the whims of the academic/grad student support/development. For one-off disposable projects, R has lots of advantages. For commercial systems, not so many.

  25. Lou Pecora wrote:
    25. February 2009 at 4:45 pm :

    I’ve looked at several of the “packages” mentioned here (R, Octave, MATLAB, C, C++, Fortran, Mathematica). I’m a physicist who is often working in new fields where understanding the phenomena is the main goal. This means my colleagues and I are often developing new numerical/theoretical/data-analysis approaches. For anyone in this situation I unequivocally recommend:

    Python.

    Why? Because given my situation there often are no canned routines. That means soon or later (usually sooner) I will be programming. Of all the languages and packages I’ve used Python has no equal. It is object oriented, has very forgiving run-time behavior, fast turn around (no edit, compile, debug cycles — just edit and run cycles), great built in structures, good modularity, and very good libraries. And, it’s easy to learn. I want to spend my time getting results, not programming, but I have to go through code development since often nothing like what I want to do exists and I’ve got to link the numerics to I/O and maybe some interactive things that make it easy to use and run smoothly. I’ve taken on projects that I would not want to attempt in any of the packages/languages I’ve listed.

    I agree that Python is not wart-free. The version compatibility can sometimes be frustrating. “One-stop shopping” for a complete Python package is not here, yet (although Enthought is making good progress). It will never be as fast as MATLAB for certain things (JIT compiling, etc. makes MATLAB faster at times). Python plotting is certainly not up to Mathematica standards (although it is good).

    However, the Python community is very nice and very responsive. Python now has several easy ways to add extensions written in C or C++ for faster numerics. And for all my desire not to spend time coding, I must admit I find Python programming fun to do. I cannot say that for anything else I’ve used.

  26. David Warde-Farley wrote:
    25. February 2009 at 6:35 pm :

    There is good reason for the duplication of “linalg” in SciPy. SciPy’s version has more features, which probably aren’t of as much use to as wide an audience, and (perhaps more importantly) one of the requirements for NumPy is that it not depend critically on a Fortran compiler. SciPy relaxes this requirement, and thus can leverage a lot of existing Fortran code. At least that’s my understanding.

  27. Bob Carpenter wrote:
    25. February 2009 at 9:27 pm :

    These packages change, and it’s easy to hold onto locked-in ideas from the past. I haven’t used Matlab since the 1990s, but the last time I used it, its I/O and singular value decomposition were so slow that we switched to S-Plus just to finish in our lifetimes.

    Can any of these packages compute sparse SVDs like folks have used for Netflix (500K x 25K matrix with 100M partial entries)? Or do regressions with millions of items and hundreds of thousands of coefficients? I typically wind up writing my own code to do this kind of thing in LingPipe, as do lots of other folks (e.g. Langford et al.’s Vowpal Wabbit, Bottou et al.’s SGD, Madigan et al.’s BMR).
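
    [[ addendum from me: scipy’s sparse module does have a truncated SVD (svds, an ARPACK wrapper); a minimal sketch on a made-up sparse matrix:

        import scipy.sparse as sp
        from scipy.sparse.linalg import svds

        # A made-up sparse matrix with ~0.1% of entries filled.
        m = sp.random(50000, 2500, density=0.001, format='csr', random_state=0)

        # Top 20 singular triplets, without ever densifying the matrix.
        u, s, vt = svds(m, k=20)
        print(s[::-1])   # svds returns singular values in ascending order

    ]]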

    What’s killing me now is scaling Gibbs samplers. BUGS is even worse than R in terms of scaling, but I can write my own custom samplers that fly in some cases and easily scale. I think we’ll see more packages like Daume’s HBC for this kind of thing.

    R itself tends to just wrap the real computing in layers of scripts to massage data and do error checking. The real code is often Fortran, but more typically C. That must be the same for SciPy given how relatively inefficient Python is at numerical computing. It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.

    A problem I see with the way R and BUGS work is that they typically try to compile a declarative model (e.g. a regression equation in R’s glm package or a model specification in BUGS), rather than giving you control over the basic functionality (optimization or sampling).

    The other thing to consider with these things from a commercial perspective is licensing. R may be open source, but its GNU license means we can’t really deploy any commercial software on top of it. SciPy has a mixed bag of licenses that is also not redistribution-friendly. I don’t know what licensing/redistribution looks like for the other packages.

    @bill Support and continuity (by which I assume you mean stability of interfaces and functionality) is great in the core R and BUGS. The problem’s in all the user-contributed packages. Even there, the big ones like lmer are quite stable.

  28. David Warde-Farley wrote:
    25. February 2009 at 9:46 pm :

    As for the rather large speed gains made by recent MATLAB releases that Lou noted, I believe this is due in most part to their switch to the Intel Math Kernel Library in place of a well-tuned ATLAS (I’m not completely sure if that’s what they used before, but it’s a good bet). This hung a good number of people with PowerPC G5’s out to dry rather quickly as newer MATLABs apparently only run on Intel Macs (probably so they don’t have to maintain two separate BLAS backends).

    Accelerated linear algebra routines written by people who know the processors inside and out will result in big wins, obviously. You can also license the MKL separately and use it to compile NumPy (if I recall correctly, David Cournapeau, who commented above, was largely responsible for this capability, so bravo!). I figure it’s only a matter of time before somebody like Enthought latches onto the idea of selling a Python environment with the MKL baked in, so you can get the speedups without the hassle.

  29. Stefan wrote:
    26. February 2009 at 9:32 am :

    @Bob The SciPy team was also unhappy about the licensing issue, so you’ll be glad to hear that SciPy 0.7 was released under a single, BSD license.

    You said “It’s frustrating that I can’t get basic access to the underlying functions without rewrapping everything myself.” We are currently working on ways to expose the mathematical functions underlying NumPy to C, so that you can access it in your extension code. During the last Google Summer of Code, the Cython team implemented a friendly interface between Cython and NumPy. This means that you can code your algorithms in Python, but still have the speed benefits of C.

    A number of posts above refer to plotting in 3D. I can recommend Enthought’s Mayavi2, which makes interactive data visualisation a pleasure:

    http://code.enthought.com/projects/mayavi/

    We are always glad for suggestions on how to improve SciPy, so if you do try it out, please join the mailing list and tell us more about your experience.

  30. Stewart wrote:
    26. February 2009 at 12:05 pm :

    You should probably add GenStat to your list; this is a UK package specialising in the biosciences. It’s a relative heavyweight in stats, having come from Rothamsted Research (home of Fisher, Yates and Nelder). Nelder was the actual originator of GenStat. GenStat is also free for teaching worldwide and free for research to the developing world. Its popularity is mainly within Europe, Africa and Oceania, hence many US researchers may not have heard of it. I hope this helps.

  31. brendano wrote:
    27. February 2009 at 3:06 am :

    Wow, this is the funnest language flamewar I’ve seen.

    I will note that no one defended SAS. Maybe those people don’t read blogs.

  32. bill wrote:
    27. February 2009 at 3:26 am :

    brendano,
    Hmm, I thought I did. I do production work in SAS and mess around (test new stuff, experimental analyses) in R.
    Bill

  33. brendano wrote:
    27. February 2009 at 3:35 am :

    Oops. Yes yes. My bad!

    OK: no one has defended Stata!

  34. John Dudley wrote:
    4. March 2009 at 2:46 pm :

    My company has been using StatSoft’s Statistica for years, and it handles all of the things that you found to be shortcomings of SAS, SPSS and Matlab…

    It’s fast, the graphs are great, and there are virtually no limitations. I’m surprised it wasn’t listed as one of the packages reviewed. We have been using it for years and it is absolutely critical to our business model.

  35. Andy Malner wrote:
    4. March 2009 at 2:48 pm :

    StatSoft is the only major package with R integration…The best of both worlds.

  36. Abhijit wrote:
    5. March 2009 at 3:38 am :

    In stats there seems to be the S-Plus/R schools and the SAS schools. SAS people find R obtuse with poor documentation, and the R people say the same about SAS (myself included). R wins in graphics and flexibility and customizability (though I certainly won’t argue with a SAS pro who can whip up macros). SAS seems a bit better with large data sets. R is ever expanding, and has improved greatly for simulations/looping and memory management. Recently for large datasets (bioinformatic, not the 5-10G financial ones), I’ve used a combination of Python and R to great effect, and am very pleased with the workflow. I think rpy2 is a great addition to Python and works quite well. For some graphs I actually prefer matplotlib to R.

    I’m also a big fan of Stata for more introductory-level stuff as well as for epidemiology-related stuff. It is developing a programming language that seems useful. One real disadvantage in my book is that it can hold only one dataset at a time; there is also a limit on the data size.

    I’ve also used Matlab for a few years. Its statistics toolbox is quite good, and Matlab is pretty fast and has great graphics. It’s limited in terms of regression modeling to some degree, as well as survival methods. Syntactically I find R more intuitive for modeling (though that is the lineage I grew up with). The other major disadvantage of Matlab is distribution of programs, since Matlab is expensive. The same complaint goes for SAS as well :)

  37. Comparing statistical packages: R, SAS, SPSS, etc. — The Endeavour wrote:
    5. March 2009 at 4:20 am :

    [...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata [...]

  38. John Johnson wrote:
    5. March 2009 at 2:59 pm :

    I’ll sing the same song here as I do elsewhere on this topic.

    In large-scale production, SAS is second to none. Of course, large-scale production shops usually have the $$$ to fork over, and SAS’s workflow capabilities (and, to a lesser extent, large-dataset handling capabilities) save enough billable hours to justify the cost. However, for graphics, exploratory data analysis, and analysis beyond the well-established routines, you have to venture into the world of SAS/IML, which is a rather painful place to be. Its PRNGs are also stuck in the last century: top of the line of a class obsolete for anything other than teaching.

    R is great for simulation, exploratory data analysis, and graphics. (I disagree with the assertion that R can’t do high-quality graphics, and, like some commenters above, recommend Paul Murrell’s book on the topic.) Its language, while arcane, is powerful enough to write outside-the-box analyses. For example, I was able to quickly write, debug, and validate an unconventional ROC analysis based on a paper I read. As another example, bootstrapping analyses are much easier in R than SAS.

    In short, I keep both SAS and R around, and use both frequently.

    I can’t comment too much on Python. MATLAB (or Octave or Scilab) is great for roll-your-own statistical analyses as well, though I can’t see using it for, e.g., a conventional linear models analysis unless I wanted the experience. R’s matrix capabilities are enough for me at this point. I used Mathematica some time ago for some chaos theory and Fourier/wavelet analysis of images and it performed perfectly well. If I could afford to shell out the money for a non-educational license, I would just to have it around for the tasks it does really well, like symbolic manipulation.

    I used SPSS a long time ago, and have no interest in trying it again.

  39. Jon Peck wrote:
    5. March 2009 at 6:11 pm :

    SPSS has for several years been offering smooth integration with both Python and R. There are extensive APIs for both. Check out the possibilities at http://www.spss.com/devcentral. See also my blog at insideout.spss.com.

    You can even easily build SPSS Statistics dialog boxes and syntax for R and Python programs. DevCentral has a collection of tools to facilitate this.

    This integration is free with SPSS Base.

  40. A lot of Stuff « Blog Pra falar de coisas wrote:
    9. March 2009 at 8:14 pm :

    [...] comparing statistical software (R, SAS, SPSS, MATLAB and Stata). [...]

  41. Sean wrote:
    11. March 2009 at 4:18 am :

    I used Matlab, R, stata, spss and SAS over the years.

    To me, the only reason for using sas is because of its large data ability. otherwise, it is a very very bad program. It, from day one, trains it users to be a third rate programmer.
    The learning curve for SAS is actually very steep, particularily for a very logical person. Why? the whole syntax in SAS is pretty illogical and inconsistent.
    sometimes, it is ‘/out’ sometimes, it is ‘output’.

    In 9.2, SAS started to make variables inside a macro as local variables by default.
    This is ridiculous!! SAS company has existed for at least 30 years. How can this basic programming rule should be implemented after 30 years?!

    Also, if a variable is uninitialized, SAS will still let the code run. One time, I worked in a company, this simple stupid SAS design flaw causes our project 3 weeks of delay (there is one uninitialized varaible among 80k lines of log, all blue). A couple of PhDs in the project who used C and Matlab did not believe why SAS makes such a stupid mistake. Yes, with a big disbelief, it made!

    My ranking is that Matlab and R are about the same, Matlab is better in plots most times. R is better is manipulation datasets. stata and SAS are the same level.
    After taking into account of cost, then the answer is more obvious.

  42. bill r wrote:
    12. March 2009 at 1:37 pm :

    SAS was not designed by a language maven, the way Pascal was. It grew from its PL/1 and Fortran roots. It is a collection of working tools, added to meet the demands of working statisticians and IT folk, that has grown since its start in the late ’60s and early ’70s. SAS clearly has cruft that shows its growth over time. Sort of like the UNIX tools, S, and R, actually.

    And, really, what competent programmer would ever use a variable without initializing or testing it first? That’s a basic programming rule I learned back in the mid ’60s, after branching off of uninitialized registers, and popping empty stacks.

    Bah, you kids. Get off of my lawn!

  43. tom p wrote:
    13. March 2009 at 4:57 am :

    i work for a retail company that deploys SAS for their large datasets and complex analysis. just about everything else is done in excel.

    we had a demo of omniture’s discover onpremise (formerly visual sciences), and the visualization tools are fairly amazing. it seems like an interesting solution for trending real time evolving data, but we aren’t pulling the trigger on it now.

  44. draegtun wrote:
    13. March 2009 at 9:12 am :

    For reference PDL (Perl Data Language) can be found at pdl.perl.org/ and is also available via CPAN

    /I3az/

  45. draegtun wrote:
    13. March 2009 at 9:14 am :

    opps.. link screwed up… here goes again ;-)

    pdl.perl.org

  46. Giles wrote:
    13. March 2009 at 5:52 pm :

    Have you seen Resolver One? It’s a spreadsheet like Excel, but it has built-in Python support and allows cells in the grid to hold objects. This means that numpy mostly works, and you can have one cell in the grid hold a complete dataset, then manipulate that dataset in bulk using spreadsheet-like formulae. Someone has also just built an extension that allows you to connect it to R, too. In theory, this means that you can get the best of all three — spreadsheet, numpy, and R — in your model, using the right tool for each job.

    On the other hand, the integration with both numpy and R is quite new, so it’s immature as a stats tool compared to the other packages in this list.

    Full transparency: I work for Resolver Systems, so obviously I’m biased towards it :-) Still, we’re very keen on feedback, and we’re happy to give out free copies for non-commercial research and for open source projects.

  47. Will Dwinnell wrote:
    13. March 2009 at 7:29 pm :

    Being the resident MATLAB enthusiast in a house built on another tool, I will pitch in my two cents by suggesting another spectrum along which these tools lie: “canned procedures” versus “roll your own”. General-purpose programming languages, such as the Fortran or C/C++ suggested in the comments, clearly anchor one end of this dimension, whereas the statistical software sporting canned routines lies all the way at the other. A tool like MATLAB, which provides some but not complete direct statistical support, is somewhere in the middle. The trade-off here, naturally, is the ability to customize analysis vs. convenience.

  48. Jude Ryan wrote:
    16. March 2009 at 4:37 pm :

    Most of the users on this post are biased towards packages like R rather than packages like SAS, so I want to offer my perspective on the relative advantages and disadvantages of SAS compared to R.

    I am primarily a SAS user (over 20 years) who has been using R as needed (a few years) to do things that SAS cannot do (like MARS splines), or cannot do as well (like exploratory data analysis and graphics), or requires expensive SAS products like Enterprise Miner to do (like decision trees, neural networks, etc).

    I have worked primarily for financial services (credit card) companies. SAS is the primary statistical analysis tool in these companies partly due to history (S, the precursor to S+ and R, was not yet developed) and partly because it can run on mainframes (another legacy system), accessing huge amounts of data stored on tapes, which I am not sure any other statistical package can do. Furthermore, businesses that have the $ will be the last to embrace open-source software like R, as they generally require quick support when they get stuck trying to solve a business problem, and researching the problem in a language like R is generally not an option in a business setting.

    Also, SAS’ capabilities for handling large volumes of data are unmatched. Using SAS, I have read huge compressed files of online data (DoubleClick) with over 2 billion records, filtering the data to keep only the records I needed. Each of the resulting SAS datasets was anywhere from 35 GB to 60 GB in size. As far as I know, no other statistical tool can process such large volumes of data programmatically. First we had to be able to read in the data and understand it. Sampling the data for modeling purposes came later. I would run the SAS program overnight, and it would generally take anywhere from 6 to 12 hours to complete, depending on the load on the server. In theory, any statistical software that works with records one at a time should be able to process such large volumes of data, and maybe the Python-based tools can do this. I do not know, as I have never used them. But I do know that R, and even tools like WEKA, cannot process such volumes of data. Reading the data from a database using R can mitigate the large-data problems encountered in R (as does using packages like biglm), but SAS is the clear leader in handling large volumes of data.

    R on the other hand is better suited for academics and research, as cutting-edge methodologies can be and are implemented much more rapidly in R than in SAS, since R’s programming language has more elegant support for vectors and matrices than SAS (proc IML). R’s programming language is much more elegant and logically consistent, while SAS’ programming language(s) are more ad hoc, with non-standard programming constructs. Furthermore, people who prefer R generally have a stronger “theoretical” programming background (most have programmed in C, Perl, or object-oriented languages) or are able to pick up programming faster, while most users who feel comfortable with SAS have less of a programming background and can tolerate many of SAS’ non-standard programming constructs and inconsistencies. These people do not need a comprehensive programming language to accomplish their tasks, and it takes much less effort to program in base SAS than in R if one has no “theoretical” programming background. SAS macros take more time to learn, and many programming languages have no equivalent (one exception I know of is C’s preprocessor commands). But languages like R do not need anything like SAS macros: they can achieve the same results in one logically consistent programming language, and do more, like enabling R users to write their own functions. The equivalent in SAS to writing functions in R is to program a new proc in C and know how to integrate it with SAS, an extremely steep learning curve. SAS is more of a suite of products, many of them with inconsistent programming constructs (base SAS is totally different from SCL, formerly Screen Control Language but now SAS Component Language), and proc SQL and proc IML are different from data step programming.

    So while SAS has a shallow learning curve initially (learn only base SAS), the user can only accomplish tasks of “limited” sophistication without resorting to proc IML (which is quite ugly). For the business world this is generally adequate. R, on the other hand, has a steeper learning curve initially, but tasks of much greater sophistication can be handled more easily in R than in SAS once R’s steeper learning curve is behind you.

    I foresee increased use of R relative to SAS over time, as many statistics departments at universities have started teaching R (sometimes replacing SAS with R), and students graduating from these universities will be more conversant with R, or equally conversant with both SAS and R. Many of these students entering the workforce will gravitate towards R, and to the extent that the companies they work for do not mandate which statistical software to use, the use of R is bound to increase over time. With memory becoming cheaper and Microsoft-based 64-bit operating systems becoming more prevalent, bigger datasets can be stored in RAM, and R’s limitations in handling large volumes of data are starting to matter less. But the amount of data is also growing, thanks to the internet, scanners (used in grocery chains), etc., and the volume of data may very well grow so rapidly that even cheaper RAM and 64-bit operating systems cannot cope with the data deluge. But not every organization works with such large datasets.

    For someone who has started their career using SAS, SAS is more than adequate to solve all the problems faced in the business world, and there may seem to be no real reason, or even justification, to learn packages like R or other statistical tools. To learn R, I have put in much personal time and effort, and I do like R and foresee using it more frequently over time for exploratory data analysis, in areas where I want to implement cutting-edge methodologies, and where I am not hampered by large-data issues. Personally, both SAS and R will always be part of my “tool kit” and I will leverage the strengths of both. For those who do not currently use R, it would be wise to start doing so, as R is going to be more widely used over time. The number of R users has already reached critical mass, and since R is free, usage is bound to increase as the R community grows. Furthermore, the R Help Digest, and the incredibly talented R users who support it, is an invaluable aid to anyone interested in learning R.

  49. Dailycious 14.03.09 « cendres.net wrote:
    17. March 2009 at 1:08 am :

    [...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata - Brendan O’Co… statistics software No comments yet. [...]

  50. Y-H Chen wrote:
    20. March 2009 at 3:36 am :

    Interesting. I don’t think I would have put SPSS and Stata in the same category. I haven’t spent a tremendous amount of time working with SPSS, but I have spent a fair amount of time with Stata, and my biased perspective is that Stata is more sophisticated and powerful than SPSS. Certainly, Stata’s language isn’t as powerful as R’s, but I definitely wouldn’t say it’s “weak.” Stata’s not my favorite statistical program in the world (that would, of course, be R), but there are definitely things I like about it; it’s a definite second to R in my book.

    By the way, here’s my (unfair) generalization regarding usage:
    – R: academic statisticians
    – SAS: statisticians and data-y people in non-academic settings, plus health scientists in academic and non-academic settings
    – SPSS: social scientists
    – Stata: health scientists

  51. Walking Randomly » R Compared to MATLAB (or ‘learning a thing or two from your students’) wrote:
    23. March 2009 at 5:58 pm :

    [...] matrices.  You don’t get much more MATLABy than matrices!  Other articles such as this comparison between various data analysis packages also proved interesting and [...]

  52. xin wrote:
    19. April 2009 at 2:03 am :

    Sean:
    I am a junior SAS user with only 3 years’ experience. But even I know that you press ‘ctrl’ and ‘F’ to search for ‘uninitialized’ and ‘more than’ in the SAS log to ensure everything is OK.
    As far as the couple of C++ PhDs in your group are concerned, they need to understand how to play by the rules of whatever system they are using……

  53. xin wrote:
    19. April 2009 at 2:07 am :

    By the way, I found that the comments the SAS people left are more tolerant and open-minded (maybe they are older, lol). Instead, the majority of ‘R’ers on this thread act like a bunch of rebellious teens…..

  54. Joe wrote:
    30. April 2009 at 6:58 pm :

    I am a big fan of Stata over SAS for medium and small businesses. SAS is the Mercedes-Benz of stats, I’ll admit, for government and big business. I use Stata a LOT for economics; it has all the most-used predictive methods (OLS, MLE, GLS, 2SLS, binary choice, etc.) built in. I think a model would have to be pretty esoteric not to be found in Stata.

    I ran Stata on a Linux server with 16 GB of RAM and about 2 TB of disk storage. The hardware config was about $12K. I would not recommend using virtual memory for Stata. That said, you can stick a lot of data in 16 GB of RAM! If I paid attention to the variable sizes (keeping textual ones out), I could get hundreds of millions of rows into memory (for instance, 200 million rows of ten 8-byte numeric variables is roughly 16 GB).

    Stata supports scripting (do-files), which is very easy to use, as is the GUI. The GUI is probably the best feature.

    The hardware ($12,000) + software ($3,000 for a 2-user license) costs $15,000. The equivalent SAS software was about $100,000. You do the math.

    I’ve used SPSS, but that was a while ago. At that time I felt Stata was the superior product.

  55. brendano wrote:
    1. May 2009 at 2:08 am :

    Finally a direct Stata vs SAS comparison! Very interesting. Thanks for posting. I can’t believe SAS = $100,000.

    > I ran Stata on linux server with 16GB ram and about 2TB of disk storage.
    > I would not recommend using virtual memory for Stata.

    In my experience, virtual memory is *always* a bad idea. I remember working with ops guys who would consider a server as good as dead once it started using swap.

    All programs that use hard disks effectively have custom code to control when to move data on and off the disk. Disk seeks and reads are just too slow and cumbersome compared to RAM to let the OS try to handle it automatically.

    This would be my guess as to why SAS handles on-disk data so well: they put a lot of engineering work into supporting that feature. Same for SQL databases, data warehouses, and inverted text indexes. (Or the widespread popularity of memcached among web engineers.) R, Matlab, Stata and the rest were originally written for in-memory data and still work pretty much only in that setting.

  56. brendano wrote:
    1. May 2009 at 2:48 am :

    And also, on the RAM vs hard disk issue — according to Jude Ryan’s very interesting comment above, SAS has a heritage of working with datasets on *tape* drives. Tape, of course, is even further along the size-vs-latency spectrum than RAM or hard disk. Now hard disk sizes are rapidly growing but seek times are not catching up, so people like to say “hard disk is the new tape” — therefore, if your software was originally designed for tape, it may do best! :)

  57. brendano wrote:
    1. May 2009 at 9:02 pm :

    Here’s an overly detailed comparison of Stata, SAS, and SPSS. Basically no coverage of R beyond the complaint that it’s too hard.
    http://www.ats.ucla.edu/stat/technicalreports/

    There’s also an interesting reply from Patrick Burns, defending R and comparing it to those 3.
    http://www.ats.ucla.edu/stat/technicalreports/Number1/R_relative_statpack.pdf

    (Found linked from a comment on John D. Cook’s blog here:
    http://www.johndcook.com/blog/2009/05/01/r-the-good-parts/ )

  58. Jaime wrote:
    27. May 2009 at 9:37 pm :

    I feel so old. Been using SAS for many years. But what the hell is this R ?????? That’s what the kids are using now?

  59. Gye Greene wrote:
    28. May 2009 at 4:54 am :

    Great comparison of SPSS, SAS, and Stata by Acock (a summary of his findings here — http://www.ocair.org/files/KnowledgeBase/willard/StatPgmEvalAb.pdf)

    Below is a summary of the summary — !!! — with my own observations added on.

    SAS: Scripting language is awkward, but it’s great for manipulating complex data structures; folks that analyze relational DBs (e.g. govt. folks) tend to use it.

    SPSS: Great for the “weekend warriors”; strongly GUI-based; has a scripting language, but it’s inelegant. They charge a license fee for **each** “module” (e.g. correlations? linear regressions? Poisson regressions? A separate fee!). Also, they charge an annual license. Can read Excel files directly. Used to have nicer graphs and charts than Stata (but see below).

    Stata: Elegant, short-’n’-punchy scripting language; CLI- and script-oriented, but also allows GUI. Strong user base, with user-written add-ons available for download. **Excellent** tech support! The most recent version (Stata 10) now has some pretty powerful chart/graph editing options (GUI plus CLI, your choice) that make it competitive with the SPSS graphs. (Minor annoyance: every few versions, they make the data format NOT backward-compatible with the previous version; you have to remember to “Save As” last year’s version, or else what you save at work won’t open at home…)

    My background: Took a course on SAS, but haven’t had a reason to use it. I’ve used SPSS and Stata both, on a reasonably regular basis: I currently teach “Intro to Methods” courses with SPSS, but use Stata for my own work. I dislike how SPSS handles missing values. Unlike SPSS, Stata sells a one-time license: once you buy a version, it’s yours to keep until you feel it’s too obsolete to use.

    –GG

  60. Gye Greene wrote:
    28. May 2009 at 1:53 pm :

    This may be an unfair generalization, but my personal observation is that SPSS users (within the social sciences, at least) tend to have less quantitative training than Stata users. Probably highly correlated with the GUI vs. CLI orientations of the two packages (although each of them allows for both).

    Another way of differentiating between the various statistical software packages is their Geek Cred. I usually tell my Intro to Research Methods students (for the social sciences) that…

    (On a scale of 0-10…)

    R, Matlab, etc. = 9

    SAS = 7

    Stata = 5

    SPSS = 3

    Excel = 2

    YMMV. :)

    COMMENT ON EXCEL: It’s a spreadsheet, first and foremost — so it doesn’t treat rows (cases) as “locked together”, like statistical software does. Thus, when you highlight a column and ask it to sort, it sorts **only** that column. I got burned by this once, back in my first year of grad school, T.A.-ing: sorted HW #1 scores (out of curiosity), and didn’t notice that the rest of the scores had stayed put. Oops.

    I now keep my gradebooks in Stata. :)

    –GG

  61. Chuck Moore wrote:
    29. May 2009 at 1:29 pm :

    I began programming in SAS every day at a financial exchange in 1995. SAS has three main benefits over all other Statistical/Data Analysis packages, as far as I know.

    1) Data size = truly unlimited. I learned to span 6 DASD (Direct Access Storage Devices = disk drives) on the mainframe when I was processing > 100 million records = quotes and trading activity from all exchanges. When we went to Unix, we used 100 GB worth of temp “WORK” space and were processing > 1 billion transactions a day in < 1 hour (IBM p630 with 4x 1.45 GHz processors and 32 GB of memory; the processing actually used < 4 GB).

    2) Tons and tons of preprogrammed statistical functions with just about every option possible.

    3) SAS can read data from almost anything: tapes, disk, etc.; fixed-field flat files, delimited text files (any delimiter, not just comma or tab or space), XML, most any database, all mainframe data file types. It also translates most any text value into data, and supports custom input and output formats.

    SAS is difficult for most real programmers (I took my first programming class in 1977, and have programmed in more languages than I care to share) because it has a data-centric perspective as opposed to a machine/control-centric one. It is meant to simplify the processing of large amounts of data for non-programmers.

    SAS used to have incredible documentation and support, at incredibly reasonable prices. Unfortunately, the new generation of programmers and product managers have lost their way, and I agree that SAS has been becoming a beast.

    For ad hoc work, I immediately fell in love with SAS/EG = Enterprise Guide. Unfortunately, EG is written in .NET and is not that well written. I would have preferred it be written in Java, so that the interface was more portable and supported a better threading model. Oh well.

    One of the better features of SAS is that it is not an interpreted programming language; from the start in 197? it was JIT-compiled. Basically, a block of code is read, compiled, and then executed. This is why it is so efficient at processing huge amounts of data. The concept of the “data step” does allow for some built-in inefficiencies in the form of multiple passes through the data, but that is the price of SAS’s convenience. A C programmer would have done more things in fewer passes, but the C programmer would have spent many more hours writing the program than SAS’s few minutes to do the same thing. I know this because I’ve done it.

    Somewhere I read a complaint about SAS holding only one observation in memory at a time. That is a gross misunderstanding/mistake. SAS holds one or more blocks of observations (records) in memory at a time. The number held is easily configurable. Each observation can be randomly accessed, whether in memory or not.

    SAS 9.2 finally fixes one of the bigger complaints: PROC FCMP allows the creation of custom functions. Originally SAS did not support custom functions; SAS wanted to write them for you.

    The most unfortunate thing about SAS currently is that it has such a long legacy on uniprocessor machines that it is having difficulty getting going in the SMP world, i.e., properly taking advantage of multi-threading and multi-processing. I believe this is due to a lack of proper technical vision and leadership. As such, I believe a Java-language HPC derivative and tools will eventually take over, providing superior ease of use, visualization, portability, and processing speed on today’s servers and clusters. Since most data will come from an RDBMS these days, flat-file input won’t carry enough weight.

    But, for my current profession = Capacity Planning for computer systems, you still can’t beat SAS + Excel. On the other hand, it looks like I’m going to have to look into R.

  62. Chuck Moore wrote:
    29. May 2009 at 1:47 pm :

    On a side note: as a “real” programmer (an expert in Pascal and C, with more languages behind me than I care to list, and more than just classes in Java), I’ll say that macros have a place in programming. There have been a few times I wished Java supported macros and not just assertions, out of my own laziness. I am a firm believer in the right tool for the job, and that not everything is a nail, so I need more than just a hammer. The unfortunate thing is that macros can be abused, just like gotos, labels, and global variables.

    To me, SAS is/was the greatest data processing language/system on the planet. But I still also program in Java, C, ksh, VBScript, Perl, etc., as appropriate. I’d like to see someone do an ARIMA forecast in Excel, or run a regression that does outlier elimination in only 3 lines of code!
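    For what it’s worth, the three-line challenge has an R answer too. A sketch, not a SAS translation, with a hypothetical data frame d and variables y and x; “outlier elimination” here means dropping points with large standardized residuals:

      fit  <- lm(y ~ x, data = d)             # initial fit
      keep <- abs(rstandard(fit)) < 3         # flag the non-outliers
      fit2 <- lm(y ~ x, data = d[keep, ])     # refit without the outliers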

  63. tom m wrote:
    11. June 2009 at 1:56 am :

    If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work.

    One thing you have to consider is that with SciPy you get all of the Python libraries for free. That includes the Apache Hadoop code, if you choose to use that. And as someone above pointed out, there is now parallel processing built right in to the most recent distributions (but I have no personal knowledge of that) for MPI or whatever.

    Coming from an engineer in industry (not academia), the really neat thing that I like about SciPy is the ease of creating web-based tools (as in, deployed to a web server for others to use) via deployment on an Apache installation with mod_python. If you can get other engineers using your analysis without sending them an Excel spreadsheet, or a .m file (for which they need a Matlab license), etc., it makes your work much more visible.

  64. sohan wrote:
    14. June 2009 at 10:30 am :

    Hello everyone…
    I want to know about comparative studies of SAS, R, and SPSS for data analysis.
    Can anyone point me to papers on the subject?

  65. ed wrote:
    18. June 2009 at 11:28 am :

    having used sas, spss, matlab, gauss and r, let me say that describing stata as having a weak programming language is a sign of ignorance.

    it has a very powerful interpreted scripting language which allows one to easily extend stata. there is a very active community and many user-written add-ons are available. see: http://ideas.repec.org/s/boc/bocode.html

    stata also has a full-fledged matrix programming language called mata, comparable to matlab, with a c-like syntax; it is compiled and therefore very fast.

    managing and preparing data for analysis is a breeze in stata.

    finally stata is easy to learn.

    obviously not many people use stata around here.

    some more biased opinions:

    sas is handy if you have some old punch cards in the cupboard or a huge dataset. apart from that it truly sucks. some people say that it is good for managing data, but why not use a good relational database to do that and then use decent statistical software to do the analysis?

    excel obviously sucks infinitely more than sas. apart from its (lack of) statistical capabilities and reliability, any point-and-click-only software is an obvious no-no from the point of view of scientific reproducibility.

    i don’t care for spss and cannot imagine anyone does.

    matlab is nice, but expensive. not so great for preparing/managing data.

    have not used scipy/numpy myself, but have colleagues who love it. one big advantage is that it uses python (i.e. a good language to master and use).

    r is great, but more difficult to get into. i don’t like the loose syntax too much though. it is also a bitch with big datasets.

  66. Willem wrote:
    17. July 2009 at 6:53 am :

    On high-quality graphics in R, one should certainly check out the Cairo package. Many graphics can be output in hip formats like SVG.
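    A minimal sketch of SVG output with that package, assuming the CairoSVG device and a made-up file name:

      library(Cairo)
      CairoSVG("density.svg", width = 6, height = 4)     # open an SVG device
      plot(density(rnorm(1000)), main = "Kernel density")
      dev.off()                                          # close it and write the file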

  67. Mathias wrote:
    17. July 2009 at 10:57 pm :

    On the point of Excel breaking down at 10,000+ rows: apparently Excel 2010 will come with Gemini, an add-on developed by the Excel and SQL teams, aimed at handling large datasets:
    Project Gemini sneak preview
    I doubt this would make Excel the platform of choice for doing anything fancy with large datasets anyway, but I am intrigued.

  68. Jay Verkuilen wrote:
    26. July 2009 at 9:48 pm :

    Some reax, as I’ve used most of these at some point:

    SAS has great support for large files even on a modest machine. A few years ago I ran a bunch of sims for my dissertation using it, and it worked happily away without so much as batting an eyelash on a crappy four-year-old Windoze XP machine with 1.5 GB of memory. Also, procedures like NLP (nonlinear optimization), NLMIXED, MIXED, and GLIMMIX are really great for various mixed-model applications; this is quite broad, as many common models can be cast in the mixed-model framework. NLMIXED in particular lets you write some pretty interesting models that would otherwise require special coding. Documentation in SAS/STAT is really solid and their tech support is great. Graphics suck, and I don’t like the various attempts at a GUI.

    I prefer Stata for most “everyday” statistical analysis. Don’t knock that, as it’s pretty common even for a methodologist such as myself to need to fit a logistic regression or whatever and not want to waste a lot of time on it, which Stata is fantastic for. Stata 11 looks to be even better, as it incorporates procedures such as Multiple Imputation easily. The sheer amount of time spent doing MI followed by logistic regression (or whatever) is irritating; Stata speeds that up. Also, when you own Stata you own it all, and the upgrade pricing is quite reasonable. Tech support is also solid.

    SPSS has a few gems in its otherwise incomprehensible mass of utter bilge. IMO it’s a company with highly predatory licensing, too.

    R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility. I like it for class because it’s free, there are nice books for it, and it lets me bypass IT as it’s possible to put a working R system on a USB drive. I love the graphics.

    Matlab has made real strides as a programming language and has superb numerics in it (or did), at least according to the numerics people I know (including my numerical analysis professor). However, Statistics Toolbox is iffy in terms of what procedures it supports, though it might have been updated. Graphics are also nice. But it is expensive.

    Mathematica is nice for symbolic calculation. With the MathStatica add-on (sadly this has been delayed for an unconscionable amount of time) it’s possible to do quite sophisticated theoretical computations. It’s not a replacement for your theoretical knowledge, but it is very helpful for doing all the error-prone and tedious calculations necessary.

  69. Brett D wrote:
    27. July 2009 at 10:58 am :

    I started in Matlab, moved on to R, looked at Octave, and am just getting into SciPy.

    Matlab is good for linear algebra and related multivariate stats. I could never get any nice plotting out of it. It can do plenty of things I never learnt about, but I can’t afford to buy it, so I can’t use it now anyway.

    R is powerful, but can be very awkward. It can write jpeg, png, and pdf files, and make 3D plots and nice 2D plots as well. Two things put me off it: it’s an absolute dog to debug (how does “duplicate row names are not allowed” help as an entire error message when I’ve got 1000 lines of code spread across 4 functions?), and its data types have weird eccentricities that make programming difficult (like transposing a data frame turns it into a matrix, and using sapply to loop over something returns a data frame of factors… I hate factors). There are a lot of packages that can do some really nice things, although some have pretty thin documentation (that’s open source for you).

    Octave is nicer to use than R (= Matlab is nicer to use than R), but I found it lacking in most things I wanted to do, and the development team seems to wait for something to come out in Matlab before they’ll do it themselves, so they’re always one step behind someone else.

    I’m surprised how quickly I’m picking up SciPy. It’s much easier to write, read and debug than R, and the code looks nicer. I haven’t done much plotting yet, but it looks promising. The only trick with Python is its assignments for mutable data types, which I’m still getting my head around.

  70. Mike wrote:
    29. July 2009 at 9:45 pm :

    Mathematica is also able to link to R via a third-party add-on distributed by ScienceOps. The numeric capabilities of Mathematica were “ramped up” 6 years ago, so it should be thought of as more than a symbolic(-only) environment. Further info here:

    http://reference.wolfram.com/mathematica/note/SomeNotesOnInternalImplementation.html#28959

    (I work for Wolfram Research)

  71. brendano wrote:
    30. July 2009 at 11:43 pm :

    R is nice for people who don’t value their time or who are doing lots of “odd” things that require programming and extensibility.

    Hah!

    Everyone really likes Stata. Interesting.

  72. Yaroslav Bulatov wrote:
    19. August 2009 at 6:17 pm :

    I use Python/Matlab for most analysis, but Mathematica is really nice for building demos and custom visualization interfaces (and for debugging your formulas)

    For instance, here’s an example of taking some mutual fund data and visualizing those mutual funds (from 3 different categories) in a Fisher Linear Discriminant transformed space (down to 3 dimensions from an initial 57 or so).

    http://yaroslavvb.com/upload/strands/dim-reduce/dim-reduce.html

  73. Brendan O'Connor wrote:
    21. August 2009 at 3:44 am :

    A post on R vs. Matlab: To R or not to R

  74. brendano wrote:
    21. August 2009 at 3:48 am :

    Also, a discussion looking for solutions that are both fast to prototype and fast to execute: suitable functional language for scientific/statistical computing

  75. Cristian wrote:
    1. September 2009 at 3:21 am :

    I do not understand why SAS is so much hailed here because it handles large datasets. I use Matlab almost exclusively in finance, and when I have problems with how large the datasets are, I don’t use SAS but a MySQL server instead. Matlab can talk to MySQL server, and thus I do not see why SAS is needed in this case.

  76. Mike wrote:
    11. September 2009 at 6:38 am :

    I have used Stata and R but for my purposes I actually prefer and use Mathematica. Unsurprisingly nobody has discussed its use so I guess I will.

    I work in ecology and I use Mathematica almost exclusively for modeling. I’ve found that the elegance of the programming language lends itself to easily using it for statistical analysis as well. Although it isn’t really a statistics package, being able to generate large amounts of data and then process them in the same place is extremely useful. To make up for the lack of built-in statistical analysis, I’ve built my own package over time by collecting and refining the tests I’ve used.
    For most people I would say using Mathematica for statistics is way more work than it is worth. Nevertheless, those who already use it for other things may find it is more than capable of performing almost any data analysis you can come up with, using relatively little code. The addition of functionality targeted at statistics in versions 6 and 7 has made this simpler, although the built-in ANOVA package is still awkward and poorly documented. One thing it and Matlab beat other packages at hands down is list/matrix manipulation, which can be extremely useful.

  77. Paul Kim wrote:
    14. September 2009 at 9:10 pm :

    I am using MATLAB along with SPSS. Does anyone know how to connect SPSS with MATLAB? Or can we use any form of programming (e.g., “for” loops and “if”) in SPSS to connect with MATLAB?
    Thank you.

    Paul

  78. Mattia wrote:
    25. September 2009 at 1:39 pm :

    I worked at the International Monetary Fund, so I thought I’d add the government perspective, which is pretty much the same as the business one. You need software that solves the following problem:

    maximize: amount of useful output
    subject to: staff salaries × hours worked + cost of software ≤ budget

    It turns out the IMF achieves that by letting every economist work with whatever they want. As a matter of fact, economists end up using Stata.

    Consider that most economics datasets are smaller than 1 GB. Stata MultiProcessor will work comfortably with up to 4 GB on the available machines. Stata has everything you need for econometrics, including a matrix language much like Matlab’s and state-of-the-art maximum likelihood optimization, so you can create your own “odd” statistical estimators. Programming has a steeper learning curve than Matlab, but once you know the language it’s much more powerful, including very nice text data support and I/O (not quite Python, but good enough). If you don’t need some of the fancy add-on packages that engineers use, like, say, “hydrodynamics simulation,” that’s all you need. But most importantly, importing, massaging and cleaning data with Stata is so unbelievably efficient that every time I have to use another program I feel like I am walking knee-deep in mud.

    So why do I have to use other programs, and which?

    The IMF has one copy of SAS that we use for big jobs, such as when I had 100 GB of data. I won’t dwell on this because it’s been covered above, but in general SAS is industrial-grade stuff. One big difference between SAS and other programs is that SAS will try to keep working when something goes wrong. If you *need* numbers by the next morning, you go to bed; the next morning you come in and Stata has stopped working because of a mistake. SAS hasn’t, and perhaps your numbers are garbage, but if you are able to tell that they are simply 0.00001% off then you are in perfectly good shape to make a decision.

    Occasionally I use Matlab or Gauss (yes, Gauss!) because I need to put the data through some black box written in that language and it would take too long to understand it and rewrite it.

    That’s all folks. Thanks for the attention.

  79. Mattia wrote:
    25. September 2009 at 6:42 pm :

    No, that was not all; I forgot one thing. Stata can map data using a free user-written add-in (spmap), so you can save yourself the time of learning some brainy GIS package. Does anyone know whether R, SAS, SPSS or other programs can do this?

  80. brendano wrote:
    25. September 2009 at 7:37 pm :

    R has some packages for plotting geo data, including “maps”, “mapdata”, and also some ggplot2 routines. Now I just saw an entire “R-GIS” project, so I’m sure there’s a lot more related stuff for R…
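    A minimal sketch with the “maps” package, using made-up coordinates:

      library(maps)
      map("state")                              # draw US state boundaries
      points(c(-122.4, -71.1), c(37.8, 42.4),   # e.g. San Francisco and Boston
             col = "red", pch = 19)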

  81. [Pingback from a Persian-language blog; title garbled in transcoding] (R, Matlab, SciPy, Excel, SAS, SPSS, Stata) wrote:
    30. September 2009 at 6:59 am :

    [...] (excerpt garbled in transcoding) [...]

  82. Tao Wu wrote:
    30. September 2009 at 5:56 pm :

    Hi, all. I think I should mention a C++ framework-based package named ROOT. See http://root.cern.ch

    You will see ROOT is definitely better than R.

  83. Tao Wu wrote:
    30. September 2009 at 5:59 pm :

    As I see it, the syntax and grammar of R are really stupid. I cannot imagine that R, S, and S+ have been so widely used by financial bodies. Furthermore, they are trying to claim they are very professional and very good at financial data analysis. I predict that if they shift to ROOT (a real language, with C++), they will see the power of data analysis.

  84. Patrick Burns wrote:
    2. January 2010 at 7:10 pm :

    xin (April 19) writes:
    > the majority of ‘R’ers on this thread act like a bunch of rebellious teens …

    Well spotted — I’ve been a rebellious teen for decades now.

  85. Wei Zhang wrote:
    10. January 2010 at 10:35 am :

    People in my workplace, an economic research trust, love STATA. Economists love STATA and they ask newcomers to use STATA as well. R is discouraged in my workplace with excuses like “it is for statisticians.” Sigh~~~~

    But!!! I keep using it and keep discovering new ways of using it. Now, I use the ‘dmsend’ function from the ‘twitteR’ package to inform me of the status of my time-consuming simulations while I am not in the office. It is just awesome; using R makes me feel bound by nothing.

    BTW, does anyone know how to use R to send emails (on various OSes: Windows, Mac, Unix, Linux)? I googled a bit and found nothing very promising. Any plans to develop a package?

    If we had the package, we could just hit ‘paste to console’ (RWinEdt) or C-c C-c (ESS+Emacs) and let R estimate, simulate and send results to co-authors automatically. What a beautiful world!!
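    One option I’m aware of for the email question is the sendmailR package; a sketch from memory, so treat the exact signature as an assumption, and note the addresses and server are placeholders:

      ## Hypothetical: mail yourself a note when a long simulation finishes.
      library(sendmailR)
      sendmail(from = "<me@example.com>",
               to = "<coauthor@example.com>",
               subject = "Simulation finished",
               msg = paste("Done at", Sys.time()),
               control = list(smtpServer = "smtp.example.com"))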

    I use Matlab and STATA as well, but R completely owns me. Being a bad boy naturally, I have started to encourage newcomers to use R in my workplace.

  86. ynte wrote:
    13. January 2010 at 8:30 pm :

    I happened to hit this page, and I am impressed by the pros and cons.
    I have been using SPSS for over 30 years, and I have appreciated the steep increase in usability from punch-card syntax to pull-down menus. I only ran into R today because it can handle zero-inflated Poisson regression and SPSS can’t or won’t.
    I think it is great to find open-source statistical software. I guess it requires a special mental framework to actually enjoy struggling through the command structure, but if I were 25 years younger………
    It really is a bugger to find that SPSS (or whatever they like to be called) and R come up with different parameter estimates on the same dataset (at least in the negative binomial model I compared).
    Is there anyone out there with experience in comparing two or more of these packages on one and the same dataset?
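    On the zero-inflated Poisson point: in R this usually goes through the pscl package. A sketch with hypothetical variables d, y, x, and z; when two packages disagree on the same data, comparing their default link functions, optimizers, and parameterizations is a good first step:

      ## Count model on the left of the bar, zero-inflation model on the right.
      library(pscl)
      fit <- zeroinfl(y ~ x | z, data = d, dist = "poisson")
      summary(fit)    # compare these estimates against the SPSS output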

  87. Wei wrote:
    16. January 2010 at 9:58 am :

    @ynte
    Why don’t you join an R mailing list? If you ask questions properly there, you will get answers.

    I would suggest a place to start: http://www.r-project.org/mail.html

    Have fun.

  88. peng wrote:
    27. January 2010 at 10:22 am :

    Hi friends,
    I am new to R. I would like to know about R-PLUS. Does anyone know where I can get free training for R-PLUS?

    Regards,
    Peng.

  89. Wayne wrote:
    12. February 2010 at 8:38 pm :

    I use R.

    I’ve looked at Matlab, but the primitive nature of its language turns my stomach. (I mean, here’s a language that uses alternating strings and values to imitate named parameters? A language where it’s not unusual to have half a page of code in a routine dedicated to filling in parameters based on the number of supplied arguments?) And the Matlab culture seems to favor Perlesque obfuscation of code as a virtue. Plus it’s expensive. It’s really an engineer’s tool, not a statistician’s tool.

    SAS creeps me out: it was obviously designed for punched cards, and it’s an inconsistent mix of 1950s and 1960s languages and batch command systems. I’m sure it’s powerful, and from what I’ve read the other statistics packages actually bend their results to match SAS’s, even when SAS’s results are arguably not good. So it’s the Gold Standard of Statistics ™, literally, but it’s not flexible and won’t be comfortable for someone expecting a well-designed language.

    R’s language has a good design that has aged well. But it’s definitely open source: you have two graphics languages that come in the box (base and lattice), with a third that’s a real contender (ggplot2). Which to choose? There are over 2,000 packages, and it takes a bit of analysis just to decide which of the four wavelet packages you want to use for your project: not just current features, but how well maintained the package appears to be, etc.
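    To make the which-graphics-system question concrete, the same scatterplot in all three, with made-up data:

      x <- rnorm(100); y <- x + rnorm(100)

      plot(x, y)                                   # base graphics

      library(lattice)
      xyplot(y ~ x)                                # lattice

      library(ggplot2)
      ggplot(data.frame(x, y), aes(x, y)) + geom_point()   # ggplot2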

    There are really three questions to answer here: 1) What field are you working in, 2) How focused are your needs, and 3) What’s your budget?

    In engineering (and Machine Learning and Computer Vision), 95% of the example code you find in articles, online, and in repositories will be Matlab. I’ve done two graduate classes using R where Matlab was the “no brainer” choice, but I just can’t stomach Matlab “programming”. Python might’ve been a good choice as well, but with R I got an incredible range of graphics combined with a huge variety of statistical and learning techniques. You can get some of that in Python, but it’s really more of a general-purpose tool where you often have to roll your own.

  90. Bookmarks for February 12th from 15:49 to 15:54 « Johnny Logic wrote:
    13. February 2010 at 5:55 am :

    [...] Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata – Brendan O… – Lukas and I were trying to write a succinct comparison of the most popular packages that are typically used for data analysis. I think most people choose one based on what people around them use or what they learn in school, so I’ve found it hard to find comparative information. I’m posting the table here in hopes of useful comments. [...]

  91. Jay wrote:
    17. February 2010 at 9:43 am :

    Yeah, quite the odd list. If *Py stuff is in there, then PDL definitely should be too.

  92. Statistical functions in Excel — The Endeavour wrote:
    17. February 2010 at 12:03 pm :

    [...] Comparison of data analysis packages from Brendan O’Connor [...]

  93. stat_stuff wrote:
    25. February 2010 at 10:24 am :

    i like what you wrote to describe spss: clear and concise… ’nuff said :-)

  94. forkandwait wrote:
    27. February 2010 at 12:05 am :

    I would like to comment on SAS versus R versus Matlab/Octave.

    SAS seems to excel at data handling, both with large datasets and with wacky proprietary formats (how else can you read a 60 GB text file and merge it with an Access database from 1998?). It is really ugly, though: not interactive/exploratory, and the graphics aren’t great.

    R is awesome because it is a fully featured language (things like named parameters, object orientation, and typing), and because every new data analysis algorithm probably gets implemented in it first these days. I rather like the graphics. However, it is a mess, with naming conventions that have evolved badly over time, conflicting types, etc.

    Matlab is awesome in its niche, which is NOT data analysis but rather math modeling with scripts between 10 and 1000 lines. It is really easy to get up and running if you have a math (i.e. linear algebra) background, the function file system is great for a medium level of software engineering, plotting is awesome and simpler than in R, and the datatypes (structs) are complex enough but don’t involve the headaches of a “well developed” type system. If you are doing data management, GUI interaction, or dealing with categorical data, it might be best to use SQL/SAS or something else and export your data into matrices of numbers.

    I would like numpy and friends, but ZERO-BASED INDEXING IS NOT MATHEMATICAL.

    Just my 2c

  95. anlaystenheini wrote:
    16. April 2010 at 4:52 pm :

    This is a great compilation, thank you.
    After working as an econometrics analyst for a while, mainly using Stata, I can say the following about STATA:
    Stata is relatively easy to get started with and to produce some graphics quickly (that’s what all the business people want: click click, here’s your PowerPoint presentation with lots of colourful graphics and no real content).
    BUT if you want to automate things, or to make Stata do things it isn’t capable of out of the box, it is pure pain!

    The big problem is: on one hand, Stata has a scripting/command interface, which is not very powerful and very, very inconsistent. On the other hand, Stata has a fully featured matrix-oriented programming language with c-like syntax, which, being c-like, is not very handy (c is old and not made for mathematics; the Matlab language is much more convenient), and which doesn’t work well with the rest of Stata (you have a superfluous extra level for interchanging data from one part to the other).

    Altogether, programming Stata feels like persuading Stata:
    error messages are almost useless, the macro text expansion used in the scripting language is not very suitable for things that have to do with mathematics (text can’t calculate), and many other little things.
    It is very inconsistent, sometimes very clumsy to handle, and has silly limitations like string expressions limited to 254 chars, as in the early 20th century.

    So go with Stata for a little ad hoc statistics, but do not use it for more sophisticated stuff; in that case, learn R!

  96. George Wolfe wrote:
    19. April 2010 at 11:13 pm :

    I’ve used Mathematica as a general-purpose programming language for the past couple of years. I’ve built a portfolio optimizer, various tools to manipulate data and databases, and a lot of statistics and graphing routines. People who use commercial portfolio optimizers are always surprised at how fast the Mathematica optimizations run: faster than their own optimizers. Based on my experience, I can say that Mathematica is great for numerical and ordinary computational tasks.

    I did have to spend a lot of time learning how to think in Mathematica; it’s most powerful when used as a functional language, and I was a procedural programmer. However, if you want to use a procedural programming approach, Mathematica supports that.

    Regarding some of the other topics discussed above: (1) Mathematica has built-in support for parallel computing, and can be run on supercomputing clusters (Wolfram Alpha is written in Mathematica). (2) The language is highly evolved and is being actively extended and improved every year. It seems to be in an exponential phase of development currently (Stephen Wolfram outlines the development plans every year at the annual user conference), and his expectations seem to be pretty much on target. (3) Wolfram has a stated goal of making Mathematica a universal computing platform which smoothly integrates theoretical and applied mathematics with general-purpose computation and graphics. I admit to a major case of hero worship, but I think he is achieving this goal.

    I’m going on and on about Mathematica because, in spite of its wonderfulness, it doesn’t seem to have taken its rightful place in these discussions. Maybe Mathematica users drop out of the “what’s the best language for x” discussion after they start using it. I don’t know, really. But anyway, that’s the way I see it.

  97. Dale wrote:
    25. April 2010 at 12:54 am :

    I am amazed that nobody has mentioned JMP. It is essentially equivalent to SPSS or Stata in capabilities but far easier to use (certainly to teach or learn). The main reason it is not so well known is that it is a SAS product, and they don’t want to market it well for fear that nobody will want SAS any more.

  98. ad wrote:
    25. April 2010 at 1:23 pm :

    The comparison did not include FreeMat, an open-source tool that follows along the lines of MATLAB. It would be interesting to see how the community compares FreeMat to Matlab.

  99. bupka's online wrote:
    27. April 2010 at 4:26 am :

    bupka’s online offers quality, original used books at discounted prices, including many technical books. Please visit
    http://bupka.wordpress.com

    The MATLAB book discussed above is currently in stock.
    Please have a look at the others too.

  100. Farhat wrote:
    27. April 2010 at 9:37 am :

    @Wolfe: I have used Mathematica a lot over the past 8 years and still use it for testing ideas, as small pieces of code can do fairly sophisticated stuff, but I’ve found it poor for large datasets and longer code development. It even lacked things like support for a code versioning system until recently. The cost is also a major detractor: Mathematica cost around $2500 or so last time I checked. Also, some of the newer features like Manipulate seem to create issues; I had a small piece of code using it for interactivity which sent CPU usage to 100% regardless of whether any change was happening.

    Also, SAGE ( http://www.sagemath.org ), the open-source alternative to Mathematica, has gotten quite powerful in the last few years.

  101. yinyangwriter wrote:
    8. May 2010 at 6:16 am :

    I just wanted to mention that Maple, which has not been commented on yet in this post or the subsequent thread, generates beautiful visuals, and I used to program in it all the time (as an alternative to Mathematica, which was used by the “other camp” and which I wouldn’t touch).

    Also, I’m starting to use Matlab now and loving how intuitive it is (for someone with programming experience, anyway).

  102. Jason wrote:
    9. May 2010 at 5:40 pm :

    let me quote some of Ross Ihaka’s reflections on R’s efficiency….

    “I’m one of the two originators of R. After reading Jan’s paper I wrote to him and said I thought it was interesting that he was choosing to jump from Lisp to R at the same time I was jumping from R to Common Lisp……

    We started work on R in the early ’90s. At the time decent Lisp implementations required much more resources than our target machines had. We therefore wrote a small scheme-like interpreter and implemented over that. Being rank amateurs we didn’t do a great job of the implementation, and the semantics of the S language which we borrowed also don’t lead to efficiency (there is a lot of copying of big objects).

    R is now being applied to much bigger problems than we ever anticipated and efficiency is a real issue. What we’re looking at now is implementing a thin syntax over Common Lisp. The reason for this is that while Lisp is great for programming it is not good for carrying out interactive data analysis. That requires a mindset better expressed by standard math notation. We do plan to make the syntax thin enough that it is possible to still work at the Lisp level. (I believe that the use of Lisp syntax was partially responsible for why XLispStat failed to gain a large user community.)

    The payoff (we hope) will be much greater flexibility and a big boost in performance (we are working with SBCL so we gain from compilation). For some simple calculations we are seeing orders of magnitude increases in performance over R, and quite big gains over Python…..”

    the full post is here:
    http://r.789695.n4.nabble.com/Ross-Ihaka-s-reflections-on-Common-Lisp-and-R-td920197.html#a920197

    it is quite interesting to note that such a “provocative” post from one of R’s originators got zero response on the R-devel list………..

  103. Business Intelligence Tools: looking at R as a platform for big BI. - SkriptFounders wrote:
    23. May 2010 at 5:36 am :

    [...] is some more information I thought was nice on the best packages for stat analysis.  The only thing thats wrong here is the [...]