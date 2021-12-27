
The Four Most Valuable Data Projects I Worked on in 2021

 3 days ago

Whether you’re a Data Scientist, BI developer, or Data Engineer, your purpose is to extract value from data. In 2021, I stepped outside the traditional box that defines a data and analytics job. Rather than worry about the type of work I was doing, I focused solely on providing value through...

Most data warehouse projects fail. Here’s how not to.

In today’s data-driven reality, data warehouses are becoming increasingly crucial for companies in all sectors. Yet despite the hours and cash poured into them, some 80% of data warehouse projects ultimately fail to achieve their aims. There’s no shame in this at all; in fact, it happens far more often than you think. When the giants of the business world get it wrong, it might hit the headlines. But for most enterprises, the high project failure rate is simply kept quiet.
Different Methods for Pivot Tables in SQL

Data professionals often look at transactional data and need a pivot table for further analysis. For example, a banking expert might need to review each transaction to determine how certain accounts settle or a sales analyst might need to review individual transactions to determine how well certain products sell. A...
The Best Way to Manage Unstructured Data Efficiently

Handling unstructured data with object storage and data lakes. As much as 90 percent of data is defined as unstructured data. And unstructured data is growing by 55–65 percent each year. Source: Forbes. If you have been working in data science for a while you must have noticed the...
The Secret of Delivering Machine Learning to Production

87% of ML projects are never delivered to production (VB). This article from 2019 is cited in almost every MLOps startup pitch deck, and the number is well established in the ML discourse. To be totally honest, I have tried to trace this number back and figure out how it was derived, and I didn’t find any reliable source or research to support it. However, the number seems quite reasonable if you also count projects that were stopped at an early PoC stage. The more painful number is the share of projects that had already been committed to management or even to customers, and in which significant effort had already been invested, that were terminated before (or after) hitting production. In my previous post, “how to run due diligence for ML teams”, I give a high-level overview of the ingredients of successful ML teams. Here, you can find some practical advice on how to build high-impact ML teams.
Complete Python Starter Guide For Data Science For 2022

Covering all the basics and elemental concepts of Python that you need for kickstarting Data Science, with code examples. Python is one of the most significant programming languages in the modern era. Even though the language was developed almost three decades ago, it is constantly evolving, so it still holds immense value and has a lot more to offer, especially in terms of Data Science and Artificial Intelligence.
DeepLearning.AI and the ML tutorial code quality problem

Andrew Ng’s Coursera courses provided my first introduction to machine learning in 2014. The classes offer an accessible, comprehensive review of ML methods; they got me hooked! Recently, I revisited some course content and enrolled in the DeepLearning.AI TensorFlow Developer specialization as a refresher. As before, I’m astonished...
Use Pipe Operations in Python for More Readable and Faster Coding

A handy Python package to save a ton of coding time and improve readability with shell-styled pipe operations. Python is already an elegant language to program in. But that doesn’t mean there is no room for improvement. Pipe is a beautiful package that takes Python’s ability to handle data to...
Implementing an efficient generalised Kernel Perceptron in PyTorch

The Perceptron is an old linear binary classification algorithm that has formed the basis of many Machine Learning methods, including neural networks. Like many linear methods, kernel tricks can be used to enable the Perceptron to perform well on non-linear data, and as with all binary classification algorithms, it can be generalised to work for a k-class problem.
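
To make the kernel idea concrete, here is a minimal sketch of a binary kernel Perceptron. It is written in plain Python rather than PyTorch and is not the article's implementation; the function names, the RBF kernel, the `gamma` value, and the XOR toy data are all assumptions chosen for illustration. XOR is not linearly separable, so a plain Perceptron cannot fit it, while the kernelised version can.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; gamma=1.0 is an arbitrary illustrative choice."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def decision_score(X_train, y_train, alpha, x, kernel=rbf_kernel):
    """Weighted sum of kernel similarities between x and every training point."""
    return sum(a * yi * kernel(xi, x) for a, xi, yi in zip(alpha, X_train, y_train))


def train_kernel_perceptron(X, y, kernel=rbf_kernel, max_epochs=100):
    """Return per-example mistake counts (alpha) for labels in {-1, +1}."""
    alpha = [0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for j, (xj, yj) in enumerate(zip(X, y)):
            # A non-positive margin counts as a mistake: bump that example's weight.
            if yj * decision_score(X, y, alpha, xj, kernel) <= 0:
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alpha


def predict(X_train, y_train, alpha, x, kernel=rbf_kernel):
    """Classify x as +1 or -1 from the sign of the decision score."""
    return 1 if decision_score(X_train, y_train, alpha, x, kernel) > 0 else -1


# Toy XOR data: not linearly separable in the original 2-D space.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]

alpha = train_kernel_perceptron(X, y)
preds = [predict(X, y, alpha, x) for x in X]
```

A k-class version could train one such binary classifier per class (one-vs-rest) and pick the class with the highest decision score; that extension is not shown here.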
Leveling Up Your Machine Learning Projects

A practical guide on how to structure machine learning projects. When you are starting out in data science, notebooks are your friend. They are great multi-purpose tools for visualizing and exploring data, but they are not the best choice as your project becomes more complicated. For such projects, we will want code...
How to create APIs on top of Synapse serverless SQL pools

Using Azure Synapse, ADLSgen2, Python and Web Apps. Synapse serverless SQL pools is a service for querying data in data lakes. The key point is that data can be accessed without copying it into SQL tables. Typically, serverless pools are not used to serve external APIs, because external APIs require predictable, low-latency response times. Services like Cosmos DB and SQL database are better suited for that. However, two reasons to use serverless pools to serve APIs are as follows:
Deploying Docker Containerised ML Models on AWS Elastic Beanstalk

What do an iPhone photo library, an Amazon shopping basket and a Netflix home page all have in common? One way or another, each of these applications interacts with a Machine Learning model to improve user experience and to better serve end users. It is without doubt that machine learning...
3 Tips to Prevent Your Project from Ending Up in the Data Science Graveyard

Research shows 87% of Data Science projects never make it to production (VentureBeat 2019). As a Data Scientist, I have seen Data and AI projects succeed and fail at all stages: research, development, and deployment. To maximize success, I have identified three internal principles that increase the probability that a project will go into production and deliver meaningful value to clients. Learn how to keep your Data Science project out of the Data Science graveyard and instead get it into the hands of your users.
3 Steps to Build and Deploy your NLP model as a Microservice on Azure

The easiest and cheapest way to deploy ML models on Azure. After spending countless hours training your model, you now need to make it available for other applications or services. Depending on how you approach the deployment to the cloud, this process may take several hours or just a few...
4 Data Science Competition Platforms Other Than Kaggle

Kaggle is one of the most popular data science communities and is well known for hosting top-tier machine learning competitions with attractive prize pools. Here are 4 other fast-growing communities with challenging machine learning problems that might interest you. Zindi. Zindi is a social enterprise whose mission is...
How to Use the SQL GROUP BY Clause and Aggregate Functions

In this article, we’re focusing on the fundamentals of SQL, particularly aggregate functions. What are Aggregate Functions in SQL? Aggregate functions are functions that are performed over one...
5 Advanced Tips on Python Functions

Notes from Fluent Python by Luciano Ramalho (Chapters 5–6). Did you learn to code in Java, then move to Python? If you started with OOP but now work in Python, this post is for you. In chapters 5–6 of Fluent Python, Luciano Ramalho discusses how traditional object-oriented paradigms are not...
Constrained Logistic Regression with Python

Logistic regression is a basic yet popular classification model. Despite its simplicity, logistic regression is a powerful tool that is used in real-world contexts. The method’s primary benefit is perhaps its explainability, thanks to the ease with which its parameters/coefficients may be interpreted. As a result, one can derive insights from the model by fully understanding the impact of each feature on the model.
22 predictions about the Software Development trends in 2022

In only a few days, we will say goodbye to 2021 and welcome a new year: 2022. After the disastrous, pandemic-hit 2020, 2021 was a year of resiliency and fight-back for humanity. Thanks to technological advancements, countries were able to vaccinate people en masse. For the Software Development and IT industry, 2021 was a significant year as expected.
Bank Customer Churn with Tidymodels — Part 1: Model Development

Exploring imbalanced classification with Tidymodels. Imagine you’re a data scientist at a large multi-national bank and the Chief Customer Officer approaches you to develop a means of predicting customer churn. You assemble a snapshot dataset of 10,000 customers, with a class imbalance of 1:4 in favour of customers who do not leave, to train such a binary classification model. To assist in model development, you decide to investigate various sampling techniques that might help with the class imbalance.
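
As a rough illustration of what one such sampling technique does, the sketch below applies random oversampling to a label list with the same 1:4 split (2,000 churners and 8,000 non-churners out of 10,000). It is plain Python, not Tidymodels, and not the article's method; the function name, the label strings, and the seed are assumptions made for this example.

```python
import random


def random_oversample(labels, minority, seed=0):
    """Return row indices in which the minority class is resampled with
    replacement until it matches the size of the remaining rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    majority_idx = [i for i, label in enumerate(labels) if label != minority]
    shortfall = max(0, len(majority_idx) - len(minority_idx))
    # Draw extra minority rows with replacement to close the gap.
    extra = rng.choices(minority_idx, k=shortfall)
    return majority_idx + minority_idx + extra


# Hypothetical labels mirroring the 1:4 imbalance over 10,000 customers.
labels = ["churn"] * 2000 + ["stay"] * 8000

balanced_idx = random_oversample(labels, minority="churn")

counts = {}
for i in balanced_idx:
    counts[labels[i]] = counts.get(labels[i], 0) + 1
```

Oversampling like this should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.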
MedicalXpress

Real-world data provides valuable insights into COVID vaccine roll-out

Analysis of Europe's largest patient data record has provided valuable insights into the UK's COVID-19 vaccine roll-out, an Imperial expert has said. Erik Mayer, from Imperial College London's Department of Surgery and Cancer and Director of the NIHR Imperial BRC iCARE group, told a recent Academic Health Science Centre (AHSC) online seminar how real-world data was used to inform vaccination strategy in London.
