Keywords: Big data, data science, machine learning, statistics, algorithms
Title: Doing Data Science
Author: Rachel Schutt and Cathy O'Neil
Publisher: O'Reilly
ISBN: 978-1449358655
Media: Book
Level: Introductory
Verdict: Highly recommended
The more savvy types have already latched on to the fact that Big Data is so last year and now the phrase of the moment is Data Science. We've still got CEOs and CIOs making announcements on Big Data strategy left, right and centre, but it's clear that there are still plenty of people out there struggling to catch up with the hype. All of which raises the question: what the hell is Data Science? Is it the same as Big Data? Is this all about scale? In some ways this was an issue that I had to cope with when earning my PhD - how could I easily encapsulate the mix of machine learning, model building and statistics that covered what I was doing? At the time the nearest articulation of it was Intelligent Data Analysis, but that's a phrase that now seems quaint and old-fashioned, and has been subsumed into Data Science. All of which brings me to Doing Data Science, edited by Rachel Schutt and Cathy O'Neil, which aims to put some meat on the bones of a definition of Data Science.
The book is a by-product of a course called Introduction to Data Science held at Columbia University in the autumn of 2012. But this is not a collection of course notes or a textbook. Rather this is a collection of materials from guest lecturers all attempting to describe what it is they do as data scientists, to come up with their own definitions of what data science is and to give outsiders a feel for the tools, the algorithms and the thinking around data. Now in general I'm not a huge fan of this approach (Think Complexity, for example, was structured around a class and read like a set of half-finished class notes), but this time it works. Although the range of material is wide, with different styles of writing from different authors, there's a certain consistency about the whole thing that makes it hang together as a coherent whole.
As you'd expect the book covers a range of data-related topics, including statistical inference, algorithmic development, logistic regression, time series analysis, modelling, data visualisation and more. There's even some coverage of core Big Data topics such as Hadoop and MapReduce, but be clear, this isn't the focus of the book. The degree of math involved varies by chapter and author, but this isn't a textbook full of derivations and proofs.
The unifying theme of the book is that data science is an iterative, interactive process and not a set of finished recipes that you can just apply blindly to the data sets to hand. The mark of the true data scientist is an affinity to the data - you need to develop an understanding and a feel for it. Treating it as an abstract set of numbers or text divorced from context is a recipe for failure, not success.
In addition to the core material, there are plenty of thought experiments, sample code, worked examples and exercises. This is a book that you can read in linear fashion from cover to cover, but it's also a book that you can dip into, use as a resource and as a help in developing your own data products.
If you want to know what people mean when they talk about Data Science, then this is as good a place to start as any. It's an interesting and thought-provoking read and one that can be recommended without the need to develop a recommendation engine first.