Well, I do not suppose that everyone who is once again stuck knee-deep in data would agree that it is so "sexy" when confronted with poor data quality, hardware and software constraints, interface problems and supposedly "simple" issues that, on closer inspection, can be turned this way and that like a Rubik's Cube. Nevertheless, there is no doubt that the data scientist (or perhaps, more generally and with less emphasis on science, simply the "data analyst") is set to play a hugely important role.
The first article referred to above tells us that there is still no clear definition. I find that the role is presented with a relatively heavy emphasis on information technology and mathematics, with pointers to topics like NoSQL, machine learning, predictive analytics, and R or Python as languages for statistics.
Nothing against that, but for me the following part of the article, quoted here in translation, is interesting:
"Data science = mathematics + information technology + domain knowledge
A data scientist needs (at least) knowledge in two classic subjects: mathematics and information technology. Added to this they will ideally possess knowledge of the particular field of application, because the key task of a data scientist is to find answers to questions, from numerous data sources, that provide the customer (internal or external) with value added for a concrete set of issues." [Heise]
So domain knowledge is part of data science. In the past we published here a guest contribution from Prof. Dr. Georg Herde, detailing the business informatics degree course at Deggendorf Institute of Technology. In my opinion, these studies are aimed precisely at combining the three aspects of mathematics, information technology and domain knowledge. Personally, I would place more weight on the latter than the quoted text does. I would also rephrase "ideally" as something like "by all means". I think the best analyses, and the best data analysts or data scientists, if you like, stand out for their solid, profound and extensive domain knowledge. Of course you can argue that many analytic methods build on "impartial" information purely inherent in the data (to determine relationships between data or attributes, for instance, that were not known before). Nevertheless, when analyzing everyday business it is a huge advantage to bring in precisely this domain knowledge. And it is the only way to produce the value added mentioned above for the internal or external customer.
Perhaps this viewpoint is influenced by our way of working: the analytical approach begins with locating the data and does not stop after "tool-supported data analysis"; the results must also be evaluated and communicated externally. For us, domain knowledge means knowledge of business management processes, business-oriented knowledge, and a sensitivity to possible risks brought to light by the analysis.
Let me illustrate that with a brief example. Assume the question was: "Determine the average credit ratio overall, and the credit ratio per customer". In other words, we take the total credits in the calendar year and divide them by the total invoices in the same calendar year. OK, a pocket calculator is actually enough for this purpose, and the demands on mathematics and information technology stay within reasonable bounds, but in this example we are concerned primarily with domain knowledge. We assume that the data were located appropriately in the source system, e.g. in an SAP(R) system, in the tables of the customer subledger of the SAP(R) FI-AR Accounts Receivable module. The data were extracted with the relevant fields, imported into analytics software such as ACL(TM), and evaluated with appropriate formulas, calculated fields, functions and commands.
The result in the fictitious example is an average ratio of 8.34% across all customers. This means that we (the company for which the data were analyzed) refund our customers, on average, a little over 8% of the amounts invoiced. This may be the result of outline agreements with year-end bonuses that fall due if a certain business volume is exceeded, but it can also be caused by returns and complaints. (Strictly speaking, just understanding why there should be credits at all is domain knowledge.)
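For readers who prefer code to a pocket calculator, the calculation described above (total credits divided by total invoices, once overall and once per customer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `customer`, `doc_type` and `amount`, the customer numbers and all amounts are invented here and do not correspond to SAP(R) field names, ACL(TM) commands or the figures of the example in this article.

```python
from collections import defaultdict

# Hypothetical line items from a customer subledger extract.
# Field names, customer numbers and amounts are invented for illustration.
line_items = [
    {"customer": "C001", "doc_type": "invoice", "amount": 1000.00},
    {"customer": "C001", "doc_type": "credit", "amount": 50.00},
    {"customer": "C002", "doc_type": "invoice", "amount": 400.00},
    {"customer": "C002", "doc_type": "credit", "amount": 100.00},
    {"customer": "C003", "doc_type": "invoice", "amount": 600.00},
]


def credit_ratios(items):
    """Return (overall ratio, {customer: ratio}) as credits / invoices."""
    # Sum invoices and credits separately for each customer.
    totals = defaultdict(lambda: {"invoice": 0.0, "credit": 0.0})
    for item in items:
        totals[item["customer"]][item["doc_type"]] += item["amount"]

    # Ratio per customer; None if the customer has no invoices at all.
    per_customer = {
        customer: (t["credit"] / t["invoice"] if t["invoice"] else None)
        for customer, t in totals.items()
    }

    # Overall ratio: total credits divided by total invoices.
    total_invoices = sum(t["invoice"] for t in totals.values())
    total_credits = sum(t["credit"] for t in totals.values())
    overall = total_credits / total_invoices if total_invoices else None
    return overall, per_customer


overall, per_customer = credit_ratios(line_items)
print(f"Overall credit ratio: {overall:.2%}")
for customer, ratio in sorted(per_customer.items()):
    print(customer, "n/a" if ratio is None else f"{ratio:.2%}")
```

Note that the sketch computes the overall figure as a ratio of totals, as described above. Averaging the individual customer ratios instead would generally give a different number, and deciding which of the two actually answers the question is again a matter of domain knowledge rather than of tooling.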
Now let us look at the ratio per individual customer, taking four examples from a total of, say, 10,539 customers: