Why becoming a data scientist is NOT actually easier than you think

Most data scientists and the companies that employ them are not using Matlab/Octave. They have backend web services written in Java, Python, Scala, or Ruby, and the course covers none of these languages. Python has libraries like SciPy, NumPy, and scikit-learn that are great for solving numerical problems. Java has a bunch of libraries too, like the Mahout math library [2]. R is widely used by statisticians (again, not covered in the course). When your boss (or a customer) comes to you and says you need to integrate an algorithm into a pre-existing web service (for example, they need a recommendation engine), and you say “I only know Matlab,” that is going to be a huge problem. You don’t just pick up Java/Python/C++/Scala/whatever in a few days on the job.

Coursera sets up all the data sets for you. They even write the scripts to load the data (see the week 6 SVM email classification exercise, where they wrote all of the regular expressions to clean the emails for you). That doesn’t work in the real world. Real-world data is ugly and unstructured. You need to know regular expressions, UNIX commands like sed, grep, tr, cut, sort, and awk, and techniques like map/reduce to clean these data sets up and put them into “Coursera” format. Notice I said UNIX commands, which implies you need to be somewhat comfortable on UNIX/Linux, which may be a steep learning curve if you’re currently using Windows.
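
To give a feel for the kind of cleanup work involved, here is a minimal sketch in Python using only the standard `re` module. It is not the course's script. The function name `clean_email`, the placeholder tokens (`httpaddr`, `emailaddr`, `dollar`, `number`), and the sample message are all assumptions made up for illustration, and real email data would need far more care (headers, encodings, attachments).

```python
import re


def clean_email(raw: str) -> str:
    """Normalize a raw email body into lowercase, space-separated tokens.

    Illustrative only: the placeholder tokens are arbitrary choices.
    """
    text = raw.lower()
    # Drop anything that looks like an HTML tag.
    text = re.sub(r"<[^<>]+>", " ", text)
    # Replace URLs and email addresses with fixed placeholder tokens.
    text = re.sub(r"https?://\S+", " httpaddr ", text)
    text = re.sub(r"\S+@\S+", " emailaddr ", text)
    # Replace dollar signs and runs of digits with placeholder tokens.
    text = re.sub(r"\$+", " dollar ", text)
    text = re.sub(r"\d+", " number ", text)
    # Turn every remaining non-letter character into a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


sample = (
    "<p>Hi Bob,</p> Visit http://example.com NOW "
    "or mail me@example.com. Only $100!!"
)
print(clean_email(sample))
# prints: hi bob visit httpaddr now or mail emailaddr only dollar number
```

Even this toy version depends on getting the order of the substitutions right (URLs and addresses have to be replaced before punctuation is stripped), which is exactly the kind of detail the course handles for you.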