Monday 16 January 2023

Simplified explanation of virtual environment

A virtual environment is an essential tool for any developer working on one or more projects that require specific versions of packages or libraries. Ideally, your development environment should be as clean as you can make it, installing only the libraries your code needs and using only the versions of the compiler/interpreter you plan to work with. This article is a guide to understanding what an environment and a virtual environment mean in programming, and why creating a virtual environment is important. Subsequent articles will focus on the practical implementation of virtual environments in Python and R.


What is an environment?

In programming, an environment is everything installed on your machine that can affect the development and/or testing of your application. This can include:

  • The editors/IDEs you are using 
  • Compilers/interpreters and the exact versions you are using
  • The operating system installed on your machine 
  • The environment variables set on your machine
  • The extra libraries installed on your machine (third-party libraries)
  • The available RAM and Disk space on your machine
  • The local network capacity and connectivity


What is a virtual environment?

To keep our development environment clean and uncluttered, it is best practice to install and work with only the libraries and packages that your code requires to complete a development project (an app, a machine learning pipeline, a piece of software, etc.). This is where the concept of a virtual environment comes in.


A virtual environment is a tool for managing the dependencies (packages or libraries) required by one or more development projects. It creates an isolated environment, separate from the default one, giving you the flexibility to install only the packages you need to complete a project successfully.
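As a small preview of what the follow-up posts will cover, here is a minimal sketch of creating one in Python, assuming Python 3.3+ and its standard-library venv module:

```python
# A minimal sketch, assuming Python 3.3+ (venv ships with the standard library).
import venv

# Create an isolated environment in a ".venv" folder inside the project.
# with_pip=True bootstraps pip in it, so packages installed later land in
# this environment rather than in the system-wide one.
venv.create(".venv", with_pip=True)
```

From a terminal, the equivalent is python -m venv .venv; the environment is then activated (for example with source .venv/bin/activate on macOS/Linux) before installing project-specific packages.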


Many beginners, or people transitioning to a programming language for data analytics, install every library they need directly on their machine, write all their code in a single script or notebook, and run the program to get the expected output. While this might work fine for a simple scripting project, it is not ideal for complex development projects such as building a piece of software or an analytics pipeline. You will often work with multiple files, packages, and dependencies, so you will need to isolate your development environment for that particular project.


Importance of virtual environment

Having gained a fundamental understanding of what a virtual environment is, let's explore why it is important to set one up for your projects.


Let’s imagine a scenario:


Nnamdi is a Machine Learning engineer working on two projects (App A and App B) within roughly the same timeframe. For both projects, he is working with his default environment and system-installed packages. Let’s say he requires different versions of PackageX for the two projects: PackageX version 1.0 for App A and PackageX version 1.1 for App B, and the two versions contain some significant differences.


Nnamdi started working on App A. When it was time to begin App B, he installed PackageX version 1.1 in his default environment, and everything seemed to be working fine. However, he then discovered he had to make some changes to App A. Going back to the previous project, he ran into all sorts of errors, and the app no longer worked as expected.


What do you think happened in Nnamdi’s case?

App A is no longer working as expected because it was built against PackageX version 1.0. When Nnamdi switched to App B, he installed version 1.1 over it, since his default environment can only hold one version of a package at a time. Going back to App A means reinstalling version 1.0, which in turn overwrites the version 1.1 that App B needs, and he will have to repeat this every time he switches between projects. With one isolated environment per project, each app keeps the exact PackageX version it was built against.
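As an aside, a quick way to confirm which version is active in a given environment is to query it from Python. The sketch below assumes Python 3.8+ and uses "packagex" as a placeholder name for the real library:

```python
# Hypothetical check: print the version of "packagex" (a placeholder name)
# installed in the currently active environment (Python 3.8+).
from importlib import metadata

print(metadata.version("packagex"))  # e.g. "1.0" inside App A's environment,
                                     # "1.1" inside App B's environment
```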


This scenario is common in software and app development projects, and one way to get around it is by using virtual environments.

 

Outlined below are some benefits of setting up your virtual environment.

 

Flexibility to manage dependencies

Your development space is isolated, so it is not affected by packages already installed on your computer or in other virtual environments. Within a virtual environment, you can install exactly the packages your project needs.

 

Easy Reproducibility

Your program can easily be packaged and sent to other developers to reproduce. To make it simple for them to install all the dependencies used in your environment, you can quickly generate a list of your project's dependencies and sub-dependencies in a file.
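As an illustration, here is a minimal sketch, assuming Python 3.8+, that writes every package installed in the active environment to a requirements-style file:

```python
# A minimal sketch: list every distribution installed in the active
# environment as "name==version" lines (Python 3.8+, importlib.metadata).
from importlib import metadata

with open("requirements.txt", "w") as f:
    for dist in metadata.distributions():
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```

In day-to-day work the usual shortcut is pip freeze > requirements.txt, and other developers can then recreate the environment with pip install -r requirements.txt.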

 

Manoeuvre Installation Privilege Lockouts

Installing packages into the host site-packages directory may require administrative access to the machine. If you work in a corporate setting, you are unlikely to have unrestricted access to the machine you are using. Virtual environments let you work around this: they create a new installation location within the boundaries of your user rights, so you can still install and use external packages.

 

Dependency Conflicts Management

One project may need a different version of an external library than another, and you cannot use two distinct versions of the same library if there is only one place to install packages. This is the most common reason for using a virtual environment, just like the case of Nnamdi described above.

 

Conclusion

Using a virtual environment will save you a lot of hassle in the long run, whether you are building websites for clients, coding as a hobby on your computer, or working in a corporate environment. Now that we know what a virtual environment is and why it is crucial to create one, I will walk you through a step-by-step procedure for creating one yourself in the next post.

 

Stay tuned…

 


Monday 9 January 2023

Data Cleaning in Plain English


In this article, my job is to explain the concept of data cleaning in the simplest form. You will learn about what causes unclean data, why it’s so important to get your data cleaned, the processes involved in data cleaning, and some important tips to follow when performing data cleaning tasks.


What is data cleaning?

Data cleaning is the process of identifying and correcting (or removing) errors, inconsistencies, and missing values in a dataset. It is a crucial step in the data wrangling process and is typically performed before data analysis to ensure that data quality is high and that the data can be used effectively.


What causes unclean data?

Unclean data, or data errors and inconsistencies, can be caused by a number of things, which include but are not limited to:

  • Human errors 

  • Inaccurate data entry

  • System faults, and 

  • Old or wrong data 

Data cleaning techniques can be as simple as spelling checks and record duplication removal or as complex as imputing missing data and finding and correcting anomalies.
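To make this concrete, here is a minimal sketch of the simpler end of that spectrum. It assumes pandas is installed, and the file name customers.csv and the education column are hypothetical:

```python
# A minimal sketch of simple cleaning steps (assumes pandas; the file name
# and the "education" column are hypothetical).
import pandas as pd

df = pd.read_csv("customers.csv")

# Standardise obvious text inconsistencies such as stray spaces and mixed case.
df["education"] = df["education"].str.strip().str.title()

# Remove exact duplicate records.
df = df.drop_duplicates()
```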


Why is it important to get your data cleaned?

The goal of data cleaning is to improve the quality of data and make it more useful for analysis. A well-cleaned dataset will be more accurate, consistent, and reliable, which can lead to more accurate and reliable results in data analysis.


Practical implementation of data cleaning techniques

Real-world data often contains missing values and outliers. Outliers are not always wrong, so we have to handle them with care in order not to obtain false results from our analysis. I will demonstrate some data cleaning processes with illustrations and examples to help you understand the concept of data cleaning easily.


Let’s say you are handed a customer dataset for a spend-profile analysis, with the first ten records shown in the table below:


The data clearly looks unclean: some missing values and outliers can be spotted.

I will walk you through the steps of getting this data clean, with the aim of helping you understand all the procedures involved in the data cleaning process. Let’s work with the flow chart provided below as a guide in dealing with missing values, outliers, inconsistencies, and so on.


Source: ExploreAI Lecture Material - Storing and Cleaning Data

Assuming you are looking at a particular cell in your dataset, the first question you need to ask yourself is: is that cell empty?


Identifying the outliers

If it’s not, you can move on to determine if it’s an outlier. As the name suggests, an outlier is a value that is significantly different from the other values in a dataset. Outliers can have a significant impact on statistical analyses since they can skew the results (i.e. throw the result off balance) and give a misleading picture of the data. There are a number of ways to identify outliers in a dataset, such as using statistical tests or visualization techniques (these techniques will be discussed in future posts). It is important to carefully examine outliers and consider whether they should be included in the analysis, as they may represent errors or unusual observations that do not reflect the underlying pattern in the data.
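One common statistical test is the 1.5 × IQR rule. The sketch below assumes pandas and a hypothetical annual_income column:

```python
# A minimal sketch: flag values outside 1.5 * IQR of the quartiles as outliers
# (assumes pandas; "annual_income" is a hypothetical column name).
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

df = pd.read_csv("customers.csv")
print(df[iqr_outliers(df["annual_income"])])  # rows with unusually low or high incomes
```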


Dealing with the outliers


Let’s continue with our flow chart. If the value is not an outlier, you leave it as it is. If you think the value is an outlier, you then need to decide whether it is obviously wrong. If it is, replace the value with a blank, that is, delete it. However, if the value looks like an outlier but you cannot be sure it is actually wrong, then you do what is known as “marking the observation”. Marking means keeping a record of which values you believe to be suspicious. In this case, an annual income of $125,000, compared with the roughly $4,000 annual income of the other undergrads, is extremely high and certainly an outlier, but it is not impossible. A mark could be a written note, a little flag next to the row, or an entry in a list kept in a separate table; do whatever works for you. The idea is that if you continue with such data and obtain a result that seems fishy, you can consult the list of marked observations and check whether they were responsible.
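One simple way to mark observations in code is with an extra flag column. The sketch below assumes pandas and the same hypothetical annual_income column, and uses a three-standard-deviation rule purely as an example of what counts as suspicious:

```python
# A minimal sketch: mark suspicious incomes instead of deleting them
# (assumes pandas; "annual_income" is a hypothetical column name).
import pandas as pd

df = pd.read_csv("customers.csv")

income = df["annual_income"]
df["suspect_income"] = (income - income.mean()).abs() > 3 * income.std()

# If a later result seems fishy, the marked rows are easy to pull back up.
print(df[df["suspect_income"]])
```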


Dealing with missing values

Now, let’s see how to deal with the blanks. The first consideration here is an important one: look at the rest of the values in the row in question, and if there are too many missing values, get rid of the entire row. You can choose a threshold depending on the project you are working on; a 40/60 split is not a bad idea, meaning that if 40% or more of the values in a particular row are missing, you delete the entire row. Here, we have a row with 6 features, 3 of which are blank, which is 50% of the values in that row. Based on our 40% threshold, it is only reasonable to delete that row completely.


Note: Same can be applied while working with columns.
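A minimal sketch of this thresholding, assuming pandas and the same hypothetical customers.csv:

```python
# A minimal sketch: drop rows where 40% or more of the values are missing,
# i.e. keep rows where at least 60% of the columns are present.
import math

import pandas as pd

df = pd.read_csv("customers.csv")

min_present = math.ceil(0.6 * df.shape[1])  # with 6 features this is 4
df = df.dropna(thresh=min_present)          # rows below the threshold are removed

# The same idea works for columns: df.dropna(axis=1, thresh=...).
```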


Why do we delete rows or columns with too many missing values, you may ask? The simple answer is that the next step is to fill the remaining blanks with summary statistics, and we don’t want too many made-up values in the dataset if we are to maintain its integrity.


Finally, if the row contains only a small proportion of missing values, we can carry out the imputation process as follows:

  • If the feature contains numerical data, a common solution is to fill the missing values with the mean (average) of that feature. For example, in our dataset the income feature is numerical, so we fill its missing values with the mean, as shown below:
  • If the feature is categorical, however, we can fill the missing values with the mode of the feature, that is, its most frequently occurring category. If we were looking at a feature detailing the highest level of education of a group of people and the most common category was "High school", we could fill any blanks in that feature with "High school".
Data imputation
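To complement the illustration, here is a minimal sketch of both cases, assuming pandas and hypothetical annual_income (numerical) and education (categorical) columns:

```python
# A minimal sketch of imputation (assumes pandas; column names are hypothetical).
import pandas as pd

df = pd.read_csv("customers.csv")

# Numerical feature: fill blanks with the mean (or the median, see the notes below).
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].mean())

# Categorical feature: fill blanks with the mode (the most frequent category).
df["education"] = df["education"].fillna(df["education"].mode()[0])
```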


There are a few things to note here:

  • The mean is highly sensitive to very high or very low outlier values. In cases where you need to impute a numerical value for a feature that may still contain legitimate outliers, like annual income in our example, it may be more appropriate to use the median, because the median is robust to outliers (i.e. it is not affected by them).
  • Sometimes the overall mean or median is not appropriate. For example, if you needed to impute someone's annual income and that person’s highest level of education was high school, taking the mean or median salary of everyone in the whole dataset would not be very helpful.
  • It might be better to use the mean or median of a smaller subset of the data, i.e. to fill missing values with the mean or median of a group more representative of the row in question, as sketched after this list. This step can be tedious and time-consuming, but it is worth it.
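A minimal sketch of that group-wise imputation, assuming pandas and the same hypothetical columns:

```python
# A minimal sketch: fill missing incomes with the median income of each
# education group rather than the overall median (column names hypothetical).
import pandas as pd

df = pd.read_csv("customers.csv")

group_median = df.groupby("education")["annual_income"].transform("median")
df["annual_income"] = df["annual_income"].fillna(group_median)
```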

Monday 2 January 2023

Welcome to Datafied Blog - Data concepts simplified

Image source: https://zinginstruments.com/songs-about-new-beginnings/

Why Datafied?

The data world is a fascinating one. However, launching a career in data may not be simple, particularly if you lack significant technical experience. At one point or another in your journey into the data profession, I am sure you have faced several struggles. Did you ever feel clueless when trying to make sense of a problem statement? Or feel a knot in your stomach when you had to communicate your findings to stakeholders? Or feel you needed to be more knowledgeable about data analysis and statistical terms?


I can still recall how little I knew about coding, programming, statistics, mathematics, or artificial intelligence when I first began my career as a data professional. It was quite a struggle to fit in, as I needed at least foundational knowledge in these fields. I had to rely on external materials, which could be time-consuming, confusing, and difficult to understand given my limited technical experience.


Having experienced these challenges myself, I know that someone out there is going through similar situations and urgently needs support to gain a firm footing in their data career.



What we are all about


Datafied was founded to bridge the knowledge gap for aspiring data professionals, enthusiasts, and individuals transitioning to a career in data, by simplifying the complex ideas you might come across as a beginner in the field, or even as a working data professional.



How we intend to make an impact


We believe that everyone stands a chance to excel in their chosen data profession when equipped with the necessary technical expertise: strong foundational knowledge in mathematics and statistics, hands-on experience in coding and programming, and the ability to communicate in the simplest terms. We are passionate about taking the technicality out of these data concepts by breaking them down into their simplest form, one that could be understood even by a 5-year-old. Driven by this passion, we want to help you simplify your learning as you go through the journey of working with data and becoming a data professional.