2020 could be called The Year Data Science Grew Up. Organizations of all kinds significantly ramped up their adoption of data-oriented applications and turned to data science to solve their problems—with varying degrees of success. In the process, data science was increasingly called upon to show its maturity and prove its real value, demonstrating that it actually worked in production.
The emergence of a deadly global pandemic threw a wrench into designs, not all of them good, that had grown over the course of years into systems that are now difficult to maintain, adjust, or improve upon. COVID-19 required the rapid analysis and sharing of massive amounts of data. Predictive models were run and updated with new urgency amid constantly changing conditions, with the whole world judging their accuracy and integrity.
The past 12 months have revealed how valuable data science can be while also exposing its limitations. In 2020, there were numerous challenges to data science’s credibility, adaptability, and ultimate usefulness that will need to be addressed in 2021.
Let’s look at the key levers.
Data science in 2020
This proliferation of data science, while exciting, falsely suggested that the field is now somehow settled. On the contrary, data science remains very much a “new” field, innovating at a rapid clip.
If one followed the hype cycle, data science appeared to go mainstream in 2020, with vendors across the landscape co-opting AI. Every product or service seemed to have artificial intelligence somehow attached, no matter how loosely. As such, expectations rose to impossible heights, with companies expecting smart data solutions to solve all of their problems. Data science just doesn’t work that way.
Fortunately, people are now moving beyond the hype and asking the right questions in order to understand what data science can and can’t accomplish. Data science is therefore receiving attention based on its quality and on the return on investment that is possible when it is built the right way.
One of the fundamental challenges of data science has always been finding a way to repeatedly and reliably take a model from creation and put it into production. This difficulty can significantly hinder the realization of ROI, which was certainly the case after the onslaught of COVID-19. Consider all the behaviors that changed throughout the pandemic. Machine learning models built prior to COVID-19 needed at least an update, if not an entire redesign and retraining, to account for these changes.
Depending on the problem domain and what the models were asked to solve for, the new reality might look radically different from the pre-COVID world, so much so that the millions of data points relied upon for insights break down because old base assumptions no longer hold. Models needed to be updated to incorporate new data and adjust to the new reality, and the entire process from data science creation to production had to be revisited.
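One way to notice that old base assumptions no longer hold is to compare the data a model was trained on with the data it sees now. The sketch below is illustrative only and is not taken from any particular product: it computes a population stability index (PSI) for a single numeric feature, and the function name, bin count, and sample values are all assumptions made for the example. A commonly cited rule of thumb treats a PSI above roughly 0.25 as a sign of a substantial shift, but the right threshold depends on the problem.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float],
    recent: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two samples of one feature; larger values mean a bigger shift."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected = bin_shares(baseline)
    actual = bin_shares(recent)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical feature values before and after a sudden change in behavior.
pre_shift = [i / 100 for i in range(100)]
post_shift = [0.5 + i / 100 for i in range(100)]

psi_same = population_stability_index(pre_shift, pre_shift)
psi_shifted = population_stability_index(pre_shift, post_shift)
```

A check like this does not say how to fix a model, only that the inputs have moved far enough from the training data that the model should be reviewed.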
Because this has traditionally been quite difficult to do and because companies were suddenly forced to revise models quite rapidly, the rigor and frequency with which models were tested slipped. Models were instead being created in a rush without verification. This harmed the credibility of data science to some extent.
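Even under time pressure, a small automated check can stand between a rushed model and production. The following is a minimal sketch under stated assumptions: the models are plain functions, the holdout set is a hypothetical list of labeled examples collected after conditions changed, and the 0.8 accuracy floor is an arbitrary illustration rather than a recommended value.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]      # (feature value, true label), a hypothetical shape
Model = Callable[[float], int]   # any function that maps a feature to a label


def accuracy(model: Model, holdout: Sequence[Example]) -> float:
    """Fraction of held-out examples the model labels correctly."""
    correct = sum(1 for x, y in holdout if model(x) == y)
    return correct / len(holdout)


def ready_for_production(
    candidate: Model,
    current: Model,
    holdout: Sequence[Example],
    min_accuracy: float = 0.8,
) -> bool:
    """Promote only if the candidate clears the floor and does not regress."""
    candidate_score = accuracy(candidate, holdout)
    current_score = accuracy(current, holdout)
    return candidate_score >= min_accuracy and candidate_score >= current_score


# Hypothetical holdout gathered after the shift: the true boundary moved to 0.7.
holdout = [(0.1, 0), (0.3, 0), (0.6, 0), (0.65, 0), (0.8, 1), (0.9, 1)]


def old_model(x: float) -> int:
    return int(x > 0.5)  # tuned to the pre-shift world


def retrained_model(x: float) -> int:
    return int(x > 0.7)  # refit on recent data


promote = ready_for_production(retrained_model, old_model, holdout)
```

The point of the design is that the comparison runs on recent, held-out data every time a model is revised, so speed does not have to come at the cost of verification.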
2020 highlighted the gap between the creation of sound, tested data science models and the deployment of production-ready models that can subsequently be modified as needed without reinventing the wheel. Fortunately, as the year winds down, we are beginning to see new approaches that close this gap.
Bias in AI models
Another issue that struck at the heart of the credibility and usefulness of data science was that of bias. Social justice moved to the forefront in 2020. The natural reaction was to try to eliminate bias wherever possible. And because every company became an AI company, there was a push to remove bias from AI models—a task that is inherently problematic.
Often when we remove bias from data science models, when we make them “non-discriminatory,” we weaken the results and ultimately the value of the models. There also exists the danger that when one component is removed from a data science model, something else creeps in, with the result that bias is not eliminated altogether but just replaced by a different kind of bias.
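The way one bias can be replaced by another is easiest to see with a number. The sketch below is a toy illustration, not a complete fairness audit: it measures the demographic parity gap, meaning the difference in positive-prediction rates between groups, for a rule that never looks at group membership but relies on a correlated feature. The data, the `postcode_flag` feature, and the function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def positive_rates(
    predictions: Sequence[int], groups: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """Share of positive (1) predictions within each group."""
    totals: Dict[Hashable, int] = defaultdict(int)
    positives: Dict[Hashable, int] = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(
    predictions: Sequence[int], groups: Sequence[Hashable]
) -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: the sensitive attribute was removed from the model,
# but a postcode flag that tracks it closely was left in.
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
postcode_flag = [1, 1, 1, 0, 0, 0, 0, 1]

# A "group-blind" rule: approve whenever the postcode flag is 1.
predictions = list(postcode_flag)

gap = demographic_parity_gap(predictions, group)
```

Here the rule approves three of four people in group "a" and one of four in group "b", so the gap is 0.5 even though the sensitive attribute itself was never used. Measuring outcomes, rather than only inspecting inputs, is one way to catch this kind of substitution.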
Mitigating AI model bias is an important issue, as data science is increasingly relied upon to help drive decisions, and we don’t want those decisions to be prejudiced or unfair. How can we create and deploy data science in an ethical way? A model must be understandable, provable, and verifiable. This is undoubtedly an area that will be explored in greater depth in the months and years to come.
Data science in 2021 and beyond
Significant strides were made in the past year to surface the issues holding back data science. As the hype cycle surrounding data science now ends, the field can become more serious and focused on innovation and problem solving.
Perhaps the most exciting opportunity for data science is the momentum behind an integrated deployment approach. With widespread availability of technology to close the gap between creation and production, data scientists will no longer have to translate between several different technologies. This will be game changing, saving time and frustration while yielding more accurate outcomes.
As it becomes much easier and faster to move models from testing to production, data science will deliver a far greater return on its investment to multiple stakeholders—not just data scientists. Organizations will benefit by enabling different groups to consume and understand data insights.
Second-generation collaboration
Expect to see different groups get involved with the creation and development of data science moving forward. Business analysts and engineers need to work with data scientists, all collaborating to get it right. Each group brings a different perspective to the table, which makes data science more insightful, impactful, and useful for business purposes.
The advanced collaboration required specifically for data science will take the form of combining collaboration models at various levels to meet different needs. By sharing components, organizations will be able to wrap up a certain piece of expertise, data blending, machine optimization, or even a reporting module and share it across the organization. Such functional and purposeful collaboration combined with the appropriate amount of automation will characterize the next phase of data science.
One consequence of COVID-19 has been an acceleration of digital transformation initiatives, and cloud and hybrid environments have become much more prevalent. This trend will continue throughout 2021.
Organizations are not locking into one cloud, or even just moving all of their data into the cloud. Many on-premises environments remain, and companies will want to include their data center infrastructure in the mix without purchasing huge computational resources that will only be used every so often.
Instead, they will look for elasticity and the ability to scale hybrid environments up and down to meet the resource requirements of specific workloads. As such, it is essential that data science can be conducted in a variety of environments and shared across the data center and cloud in order to maximize effectiveness. Outstanding options are emerging to enable data science adoption to expand in new ways.
Data science maturity is all over the map today. The gap between the organizations that are just getting on board and those that have been in the trenches for a while may narrow somewhat in 2021, but the gulf will persist for a good while longer.
The reason? The organizations that have implemented data science successfully and that understand its capabilities and limitations will continue to experiment using open source technologies to try something out. If it works, they can make it available for broader use. They will feel free to play and push the envelope without draining IT budgets on a hunch, and this is where the greatest innovation will happen.
At the same time, data science will become more accessible. Low-code capabilities are beginning to reach more users across the enterprise, facilitating greater opportunities. With more people understanding data science and using it to solve problems faster than ever before, the benefits of data science will be democratized and new possibilities will be unlocked.
Data science came a long way in 2020, despite hitting some bumps with the pandemic. Because we’re being forced to confront key data science challenges, very exciting advances are occurring. 2021 will be the year data science gets real and shows its return on investment in deep and meaningful ways.
Michael Berthold is CEO and co-founder at KNIME, an open source data analytics company. He has more than 25 years of experience in data science, working in academia, most recently as a full professor at Konstanz University (Germany) and previously at University of California, Berkeley and Carnegie Mellon, and in industry at Intel’s Neural Network Group, Utopy, and Tripos. Michael has published extensively on data analytics, machine learning, and artificial intelligence. Follow Michael on Twitter, LinkedIn, and the KNIME blog.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to email@example.com.
Copyright © 2021 IDG Communications, Inc.