Putting the machine learning pieces together

Reading through a data science book or taking a course, it can feel like you have the individual pieces but don't quite know how to put them together. Taking the next step and solving a complete machine learning problem can be daunting, but persevering through and completing a first project will give you the confidence to tackle any data science problem. This series of articles will walk through a complete machine learning solution with a real-world dataset to let you see how all the pieces come together.

We'll follow the general machine learning workflow step-by-step:

1. Data cleaning and formatting
2. Exploratory data analysis
3. Feature engineering and selection
4. Compare several machine learning models on a performance metric
5. Perform hyperparameter tuning on the best model
6. Evaluate the best model on the testing set
7. Interpret the model results
8. Draw conclusions and document work

Along the way, we'll see how each step flows into the next and how to implement each part in Python. The complete project is available on GitHub, with the first notebook here. This first article will cover steps 1–3, with the rest addressed in subsequent posts.

(As a note, this problem was originally given to me as an "assignment" for a job screen at a start-up. After completing the work, I was offered the job, but then the CTO of the company quit and they weren't able to bring on any new employees. I guess that's how things go on the start-up scene!)

Problem Definition

The first step before we start coding is to understand the problem we are trying to solve and the available data. In this project, we will work with publicly available building energy data from New York City. The objective is to use the energy data to build a model that can predict the Energy Star Score of a building and interpret the results to find the factors which influence the score.

The data includes the Energy Star Score, which makes this a supervised regression machine learning task:

- Supervised: we have access to both the features and the target, and our goal is to train a model that can learn a mapping between the two
- Regression: the Energy Star Score is a continuous variable

We want to develop a model that is both accurate — it can predict the Energy Star Score close to the true value — and interpretable — we can understand the model predictions. Once we know the goal, we can use it to guide our decisions as we dig into the data and build models.

Data Cleaning

Contrary to what most data science courses would have you believe, not every dataset is a perfectly curated group of observations with no missing values or anomalies (looking at you, mtcars and iris datasets). Real-world data is messy, which means we need to clean and wrangle it into an acceptable format before we can even start the analysis. Data cleaning is an unglamorous but necessary part of most actual data science problems.

First, we can load in the data as a Pandas DataFrame and take a look:

```python
import pandas as pd
import numpy as np

# Read in data into a dataframe
data = pd.read_csv('data/Energy_and_Water_Data_Disclosure_for_Local_Law_84_2017__Data_for_Calendar_Year_2016_.csv')

# Display top of dataframe
data.head()
```

What Actual Data Looks Like!

This is a subset of the full data, which contains 60 columns. Already, we can see a couple of issues: first, we know that we want to predict the ENERGY STAR Score, but we don't know what any of the columns mean.
While this isn't necessarily an issue — we can often make an accurate model without any knowledge of the variables — we want to focus on interpretability, and it might be important to understand at least some of the columns.

When I originally got the assignment from the start-up, I didn't want to ask what all the column names meant, so I looked at the name of the file and decided to search for "Local Law 84". That led me to this page, which explains that this is an NYC law requiring all buildings of a certain size to report their energy use. More searching brought me to all the definitions of the columns. Maybe looking at a file name is an obvious place to start, but for me this was a reminder to go slow so you don't miss anything important!

We don't need to study all of the columns, but we should at least understand the Energy Star Score, which is described as:

A 1-to-100 percentile ranking based on self-reported energy usage for the reporting year. The Energy Star Score is a relative measure used for comparing the energy efficiency of buildings.

That clears up the first problem, but the second issue is that missing values are encoded as "Not Available". This is a string in Python, which means that even the columns with numbers will be stored as object datatypes, because Pandas converts a column containing any strings into a column of all strings. We can see the datatypes of the columns using the dataframe.info() method:

```python
# See the column data types and non-missing values
data.info()
```

Sure enough, some of the columns that clearly contain numbers (such as ft²) are stored as objects. We can't do numerical analysis on strings, so these will have to be converted to number (specifically float) datatypes. The fix is to replace all the "Not Available" entries with not a number (np.nan), which Pandas treats as a float, and then convert the relevant columns to the float datatype (a condensed sketch of this and the rest of the cleaning steps appears at the end of this section). Once the correct columns are numbers, we can start to investigate the data.

Missing Data and Outliers

In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let's get a sense of how many missing values are in each column (see the notebook for code). (To create this table, I used a function adapted from a Stack Overflow answer.)

While we always want to be careful about removing information, if a column has a high percentage of missing values, it probably will not be useful to our model. The threshold for removing columns should depend on the problem (here is a discussion), and for this project we will remove any columns with more than 50% missing values.

At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

- Below the first quartile − 3 ∗ interquartile range
- Above the third quartile + 3 ∗ interquartile range

(For the full code to remove the columns and the anomalies, see the notebook.) At the end of the data cleaning and anomaly removal process, we are left with over 11,000 buildings and 49 features.
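Here is a condensed sketch of those cleaning steps. The substrings used to pick out the numeric columns and the Site EUI column name are assumptions based on the dataset's naming scheme; the notebook has the exact versions:

```python
import numpy as np

# Replace all "Not Available" entries with np.nan
data = data.replace({'Not Available': np.nan})

# Convert columns that should be numeric to floats. The substrings used
# to identify numeric columns (ft², kBtu, etc.) are assumed here.
for col in data.columns:
    if ('ft²' in col or 'kBtu' in col or 'kWh' in col or
            'therms' in col or 'gal' in col or 'Score' in col):
        data[col] = data[col].astype(float)

# Drop any column with more than 50% missing values
missing_frac = data.isnull().mean()
data = data.drop(columns=missing_frac[missing_frac > 0.5].index)

# Remove extreme outliers in Site EUI using the 3 * IQR rule
# (the exact column name is an assumption)
col = 'Site EUI (kBtu/ft²)'
q1, q3 = data[col].quantile(0.25), data[col].quantile(0.75)
iqr = q3 - q1
data = data[(data[col] > q1 - 3 * iqr) & (data[col] < q3 + 3 * iqr)]
```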
Exploratory Data Analysis

Now that the tedious — but necessary — step of data cleaning is complete, we can move on to exploring our data! Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. In short, the goal of EDA is to learn what our data can tell us. It generally starts out with a high-level overview, then narrows in on specific areas as we find interesting parts of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

Single Variable Plots

The goal is to predict the Energy Star Score (renamed score in our data), so a reasonable place to start is examining the distribution of this variable. A histogram is a simple yet effective way to visualize the distribution of a single variable, and it is easy to make using matplotlib:

```python
import matplotlib.pyplot as plt

# Histogram of the Energy Star Score
plt.style.use('fivethirtyeight')
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k');
plt.xlabel('Score'); plt.ylabel('Number of Buildings');
plt.title('Energy Star Score Distribution');
```

This looks quite suspicious! The Energy Star Score is a percentile rank, which means we would expect to see a uniform distribution, with each score assigned to the same number of buildings. However, a disproportionate number of buildings have either the highest score, 100, or the lowest score, 1 (higher is better for the Energy Star Score).

If we go back to the definition of the score, we see that it is based on "self-reported energy usage", which might explain the very high scores. Asking building owners to report their own energy usage is like asking students to report their own scores on a test! As a result, this probably is not the most objective measure of a building's energy efficiency.

If we had an unlimited amount of time, we might want to investigate why so many buildings have very high and very low scores, which we could do by selecting these buildings and seeing what they have in common. However, our objective is only to predict the score, not to devise a better method of scoring buildings! We can make a note in our report that the scores have a suspect distribution, but our main focus is on predicting the score.

Looking for Relationships

A major part of EDA is searching for relationships between the features and the target. Variables that are correlated with the target are useful to a model because they can be used to predict the target. One way to examine the effect of a categorical variable (one which takes on only a limited set of values) on the target is through a density plot using the seaborn library. A density plot can be thought of as a smoothed histogram: it shows the distribution of a single variable, and we can color it by class to see how a categorical variable changes the distribution. The code below makes a density plot of the Energy Star Score colored by the type of building, limited to building types with more than 100 data points; swapping in the borough column produces a similar plot of the Energy Star Score by borough.
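A minimal sketch of the density plot, assuming the building type is stored in a column named 'Largest Property Use Type' (check the notebook for the exact name):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only building types with more than 100 observations
types = data.dropna(subset=['score'])
counts = types['Largest Property Use Type'].value_counts()
frequent_types = counts[counts > 100].index

plt.figure(figsize=(12, 10))
for b_type in frequent_types:
    # One smoothed distribution (KDE) curve per building type
    subset = types[types['Largest Property Use Type'] == b_type]
    sns.kdeplot(subset['score'], label=b_type)

plt.xlabel('Energy Star Score')
plt.ylabel('Density')
plt.title('Density Plot of Energy Star Scores by Building Type')
plt.legend()
```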
We can see that the building type has a significant impact on the Energy Star Score: office buildings tend to have a higher score, while hotels have a lower score. This tells us that we should include the building type in our modeling because it does have an impact on the target. As a categorical variable, we will have to one-hot encode the building type. The borough, by contrast, does not seem to have as large of an impact on the score as the building type. Nonetheless, we might want to include it in our model because there are slight differences between the boroughs.

To quantify relationships between variables, we can use the Pearson correlation coefficient. This is a measure of the strength and direction of a linear relationship between two variables: a value of +1 is a perfectly positive linear relationship and a value of -1 is a perfectly negative linear relationship. Several values of the correlation coefficient are shown below:

Values of the Pearson Correlation Coefficient (Source)

While the correlation coefficient cannot capture non-linear relationships, it is a good way to start figuring out how variables are related. In Pandas, we can easily calculate the correlations between any columns in a dataframe:

```python
# Find all correlations with the score and sort
correlations_data = data.corr()['score'].sort_values()
```

The most negative (left) and positive (right) correlations with the target:

There are several strong negative correlations between the features and the target, the most negative of which are the different categories of EUI (these measures vary slightly in how they are calculated). EUI — Energy Use Intensity — is the amount of energy used by a building divided by its square footage. It is meant to be a measure of the efficiency of a building, with a lower value being better. Intuitively, these correlations make sense: as the EUI increases, the Energy Star Score tends to decrease.

Two-Variable Plots

To visualize relationships between two continuous variables, we use scatterplots. We can include additional information, such as a categorical variable, in the color of the points. For example, the following plot shows the Energy Star Score vs. Site EUI, colored by the building type:

This plot lets us visualize what a correlation coefficient of -0.7 looks like. As the Site EUI decreases, the Energy Star Score increases, a relationship that holds steady across the building types.

The final exploratory plot we will make is known as the pairs plot. This is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables. Here we use the seaborn visualization library and the PairGrid function to create a pairs plot with scatterplots on the upper triangle, histograms on the diagonal, and 2D kernel density plots and correlation coefficients on the lower triangle (a sketch follows below). To see interactions between variables, we look for where a row intersects with a column. For example, to see the correlation of Weather Norm EUI with score, we look in the Weather Norm EUI row and the score column and see a correlation coefficient of -0.67. In addition to looking cool, plots such as these can help us decide which variables to include in modeling.
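A minimal sketch of the PairGrid construction, restricted to a few columns whose names are assumed here:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

def corr_func(x, y, **kwargs):
    # Annotate each lower-triangle panel with its Pearson r
    r, _ = pearsonr(x, y)
    ax = plt.gca()
    ax.annotate('r = {:.2f}'.format(r), xy=(0.2, 0.8),
                xycoords=ax.transAxes, size=20)

# A few of the most strongly correlated variables (names assumed)
plot_data = data[['score', 'Site EUI (kBtu/ft²)',
                  'Weather Normalized Site EUI (kBtu/ft²)']].dropna()

grid = sns.PairGrid(data=plot_data)
grid.map_upper(plt.scatter, alpha=0.6)  # scatterplots on the upper triangle
grid.map_diag(plt.hist)                 # histograms on the diagonal
grid.map_lower(sns.kdeplot)             # 2D kernel density plots below
grid.map_lower(corr_func)               # overlay correlation coefficients
```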
Feature Engineering and Selection

Feature engineering and selection often provide the greatest return on time invested in a machine learning problem. First of all, let's define these two tasks:

- Feature engineering: the process of taking raw data and extracting or creating new features. This might mean taking transformations of variables, such as a natural log or square root, or one-hot encoding categorical variables so they can be used in a model. Generally, I think of feature engineering as creating additional features from the raw data.
- Feature selection: the process of choosing the most relevant features in the data. In feature selection, we remove features to help the model generalize better to new data and create a more interpretable model. Generally, I think of feature selection as subtracting features so we are left with only those that are most important.

A machine learning model can only learn from the data we provide it, so ensuring that the data includes all the relevant information for our task is crucial. If we don't feed a model the correct data, then we are setting it up to fail and we should not expect it to learn!

For this project, we will take the following feature engineering steps:

- One-hot encode categorical variables (borough and property use type)
- Add in the natural log transformation of the numerical variables

One-hot encoding is necessary to include categorical variables in a model. A machine learning algorithm cannot understand a building type of "office", so we have to record it as a 1 if the building is an office and a 0 otherwise.

Adding transformed features can help our model learn non-linear relationships within the data. Taking the square root, natural log, or various powers of features is common practice in data science and can be based on domain knowledge or what works best in practice. Here we will include the natural log of all numerical features.

The code for this step selects the numeric features, takes log transformations of them, selects the two categorical features, one-hot encodes those, and joins the two sets together (a condensed sketch of both steps appears at the end of this section). This seems like a lot of work, but it is relatively straightforward in Pandas! After this process we have over 11,000 observations (buildings) with 110 columns (features).

Not all of these features are likely to be useful for predicting the Energy Star Score, so now we will turn to feature selection to remove some of the variables.

Feature Selection

Many of the 110 features in our data are redundant because they are highly correlated with one another. For example, here is a plot of Site EUI vs. Weather Normalized Site EUI, which have a correlation coefficient of 0.997. Features that are strongly correlated with each other are known as collinear, and removing one of the variables in these pairs can often help a machine learning model generalize and be more interpretable. (To be clear, we are talking about correlations of features with other features, not correlations with the target, which help our model!)

There are a number of methods to calculate collinearity between features, one of the most common being the variance inflation factor. In this project, we will use the correlation coefficient to identify and remove collinear features: we will drop one of a pair of features if the correlation coefficient between them is greater than 0.6. For the implementation, take a look at the notebook (and this Stack Overflow answer); the sketch at the end of this section condenses it. While this value may seem arbitrary, I tried several different thresholds, and this choice yielded the best model. Machine learning is an empirical field and is often about experimenting and finding what performs best! After feature selection, we are left with 64 total features and 1 target.

```python
# Remove any columns with all na values
features = features.dropna(axis=1, how='all')
print(features.shape)
# (11319, 65)
```
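Here is the condensed sketch of the engineering and selection steps, assuming the categorical columns are named 'Borough' and 'Largest Property Use Type'; the notebook implements the full version, including handling of log(0):

```python
import numpy as np
import pandas as pd

# --- Feature engineering ---
# Select the numeric columns and add the natural log of each one.
numeric = data.select_dtypes('number').copy()
for col in list(numeric.columns):
    if col != 'score':
        # np.log of zero or negative entries yields -inf/nan;
        # the notebook deals with those cases explicitly.
        numeric['log_' + col] = np.log(numeric[col])

# One-hot encode the two categorical variables and join everything.
categorical = pd.get_dummies(data[['Borough', 'Largest Property Use Type']])
features = pd.concat([numeric, categorical], axis=1)

# --- Feature selection ---
# Drop one of each pair of features whose correlation exceeds 0.6,
# leaving the target out of the comparison.
corr = features.drop(columns='score').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.6)]
features = features.drop(columns=to_drop)
```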
Establishing a Baseline

We have now completed data cleaning, exploratory data analysis, and feature engineering. The final step before getting started with modeling is establishing a naive baseline. This is essentially a guess against which we can compare our results. If the machine learning models do not beat this guess, then we might have to conclude that machine learning is not suitable for the task, or that we need to try a different approach.

For regression problems, a reasonable naive baseline is to guess the median value of the target on the training set for all the examples in the test set. This sets a relatively low bar for any model to surpass.

The metric we will use is mean absolute error (MAE), which measures the average absolute error of the predictions. There are many metrics for regression, but I like Andrew Ng's advice to pick a single metric and then stick to it when evaluating models. The mean absolute error is easy to calculate and easy to interpret.

Before calculating the baseline, we need to split our data into a training and a testing set:

- The training set of features is what we provide to our model during training, along with the answers. The goal is for the model to learn a mapping between the features and the target.
- The testing set of features is used to evaluate the trained model. The model is not allowed to see the answers for the testing set and must make predictions using only the features. We know the answers for the test set, so we can compare the test predictions to the answers.

We will use 70% of the data for training and 30% for testing:

```python
from sklearn.model_selection import train_test_split

# "targets" holds the score column separated out from the features

# Split into 70% training and 30% testing set
X, X_test, y, y_test = train_test_split(features, targets,
                                        test_size = 0.3,
                                        random_state = 42)
```

Now we can calculate the naive baseline performance:
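A minimal sketch of the calculation; the mae helper is a hypothetical convenience, not a library function:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: the average magnitude of the prediction errors
    return np.mean(abs(np.array(y_true) - y_pred))

# Guess the median of the training targets for every test example
baseline_guess = np.median(y)

print('The baseline guess is a score of %0.2f' % baseline_guess)
print('Baseline Performance on the test set: MAE = %0.4f' % mae(y_test, baseline_guess))
```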
The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.5164

The naive estimate is off by about 25 points on the test set. The score ranges from 1–100, so this represents an error of 25%, quite a low bar to surpass!

Conclusions

In this article we walked through the first three steps of a machine learning problem. After defining the question, we:

- Cleaned and formatted the raw data
- Performed an exploratory data analysis to learn about the dataset
- Developed a set of features to use for our models

Finally, we also completed the crucial step of establishing a baseline against which we can judge our machine learning algorithms. The second post (available here) will show how to evaluate machine learning models using Scikit-Learn, select the best model, and perform hyperparameter tuning to optimize it. The third post, dealing with model interpretation and reporting results, is here. As always, I welcome feedback and constructive criticism and can be reached on Twitter @koehrsen_will.

I remember distinctly the moment I thought, "Wow, that's slow, I bet if I could parallelize these calls it would just fly!" and then, about three days later, I looked at my code and just didn't recognize it in the unreadable mash-up of calls to threading and process library functions in front of me. Then I found asyncio, and everything changed.

If you don't know, asyncio is the new concurrency module introduced in Python 3.4. It's designed to use coroutines and futures to simplify asynchronous code and make it almost as readable as synchronous code, simply because there are no callbacks.

I also remember that while on that quest for parallelisation a number of options were available, but one stood out: the excellent gevent library. It was quick and easy to introduce and well thought out. I arrived at it by reading the lovely hands-on tutorial gevent for the Working Python Developer, written by an awesome community of users. It's a great introduction not only to gevent but to concurrency in general, and you most definitely should check it out. I like the tutorial so much that I decided it would be a good template to follow when introducing asyncio.

Quick disclaimer: this is not a gevent vs. asyncio article. Nathan Road wrote a great piece on what's similar and dissimilar between the two if you're interested.

I know you're excited, but before we dive in I'd like to quickly go over some concepts that may not be familiar at first.

Update June 2018: In Python 3.7 asyncio has gotten a few upgrades in its API, particularly around the managing of tasks and event loops. I've updated the examples to encourage adoption, as I believe the new API is cleaner and more concise. If you cannot update to 3.7, there are versions of the examples for 3.6 and below available in the GitHub repository for this article.

Update May 2018: some readers reported that the code examples were no longer compatible with recent versions of aiohttp. I have now updated the examples to work with the most recent version at the time of this writing, 3.2.1.

Update Feb 2017: following some feedback I've decided to use the 3.5 async/await syntax; I've updated the examples accordingly. If you're interested, the original 3.4 syntax examples are available in the Github repo for this tutorial.

Threads, loops, coroutines and futures

Threads are a common tool that most developers have heard of and used before. However, asyncio uses quite different constructs: event loops, coroutines and futures.

- An event loop essentially manages and distributes the execution of different tasks. It registers them and handles distributing the flow of control between them.
- Coroutines are special functions that work similarly to Python generators; on await they release the flow of control back to the event loop. A coroutine needs to be scheduled to run on the event loop; once scheduled, coroutines are wrapped in Tasks, which is a type of Future.
- Futures are objects that represent the result of a task that may or may not have been executed. This result may be an exception.

Got it? Pretty simple, right? Let's dive right in!

Synchronous & Asynchronous Execution

In his talk Concurrency is not parallelism, it's better, Rob Pike makes a point that really made things click in my head: breaking down tasks into concurrent subtasks only allows for parallelism, it's the scheduling of these subtasks that creates it. Asyncio does exactly that: you can structure your code so subtasks are defined as coroutines, and it allows you to schedule them as you please, including simultaneously. Coroutines contain yield points where we define possible places where a context switch can happen if other tasks are pending, but it will not happen if no other task is pending. A context switch in asyncio represents the event loop yielding the flow of control from one coroutine to the next. Let's have a look at a very basic example.
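The script itself lives in the article's repo; here is a minimal reconstruction consistent with the output shown below:

```python
import asyncio

async def foo():
    print('Running in foo')
    await asyncio.sleep(0)   # yield to the event loop
    print('Explicit context switch to foo again')

async def bar():
    print('Explicit context to bar')
    await asyncio.sleep(0)
    print('Implicit context switch back to bar')

async def main():
    # Schedule both coroutines and wait for them to complete
    await asyncio.gather(foo(), bar())

asyncio.run(main())
```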
```
$ python 1-sync-async-execution.py
Running in foo
Explicit context to bar
Explicit context switch to foo again
Implicit context switch back to bar
```

First we declare a couple of simple coroutines that pretend to do non-blocking work using the sleep function in asyncio. Then we create an entry point coroutine from which we combine the previous coroutines using gather to wait for both of them to complete. There's a bit more to gather than that, but we'll ignore it for now. Finally, we schedule our entry point coroutine using asyncio.run, which takes care of creating an event loop and scheduling our entry point coroutine.

Note that in versions of Python prior to 3.7, coroutines had to be manually wrapped in Tasks to be scheduled, using the current event loop's create_task method. There was also a bit of boilerplate required to create an event loop and schedule our tasks. Please refer to the GitHub repository for code samples using these techniques.

By using await on another coroutine we declare that the coroutine may give control back to the event loop, in this case sleep. The coroutine will yield, and the event loop will switch contexts to the next task scheduled for execution: bar. Similarly, the bar coroutine uses await sleep, which allows the event loop to pass control back to foo at the point where it yielded before, just as with normal Python generators.

Let's now simulate two blocking tasks, gr1 and gr2, say they're two requests to external services. While those are executing, a third task can be doing work asynchronously, like in the following example:

```
$ python 1b-cooperatively-scheduled.py
gr1 started work: at 0.0 seconds
gr2 started work: at 0.0 seconds
Let's do some stuff while the coroutines are blocked, at 0.0 seconds
Done!
gr1 ended work: at 2.0 seconds
gr2 Ended work: at 2.0 seconds
```

Notice how the event loop manages and schedules the execution, allowing our single-threaded code to operate concurrently. While the two blocking tasks are blocked, a third one can take control of the flow.

Order of execution

In the synchronous world we're used to thinking linearly. If we have a series of tasks that take different amounts of time, they will be executed in the order that they were called. However, when using concurrency we need to be aware that the tasks may finish in a different order than they were scheduled:

```
$ python 1c-determinism-sync-async.py
Synchronous:
Task 1 done
Task 2 done
Task 3 done
Task 4 done
Task 5 done
Task 6 done
Task 7 done
Task 8 done
Task 9 done

Asynchronous:
Task 4 done
Task 9 done
Task 2 done
Task 3 done
Task 5 done
Task 6 done
Task 8 done
Task 1 done
Task 7 done
```

Your output will, of course, vary, since each task sleeps for a random amount of time, but notice how the resulting order is completely different even though we built the array of tasks in the same order using range.

It's important to understand that asyncio does not magically make things non-blocking. At the time of writing, asyncio stands alone in the standard library; the rest of the modules provide only blocking functionality. You can, however, use the concurrent.futures module to wrap a blocking task in a thread or a process and return a Future that asyncio can use.
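For instance, here is a minimal sketch of wrapping a blocking call in the event loop's default thread pool so we get back an awaitable the loop can work with:

```python
import asyncio
import time

def blocking_io():
    # An ordinary blocking function, e.g. from a library with no async support
    time.sleep(1)
    return 'done blocking'

async def main():
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor; run_in_executor
    # returns an awaitable that resolves when the thread finishes.
    result = await loop.run_in_executor(None, blocking_io)
    print(result)

asyncio.run(main())
```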
This same example using threads is available in the Github repo. The lack of non-blocking standard library modules is probably the main drawback right now when using asyncio; however, there are plenty of libraries for different tasks and services.

A very common blocking task is, of course, fetching data from an HTTP service. I'm using the excellent aiohttp library for non-blocking HTTP requests, retrieving data from GitHub's public event API and simply taking the Date response header. Please do not focus on the details of the aiohttp_get coroutines below. They use asynchronous context manager syntax, which is outside the scope of this article but is necessary boilerplate to perform an asynchronous HTTP request using aiohttp. Just pretend it's an external coroutine and focus on how it's used below.

```
$ python 1d-async-fetch-from-server.py
Synchronous:
Fetch sync process 1 started
Process 1: Fri, 29 Jun 2018 11:41:37 GMT, took: 0.76 seconds
Fetch sync process 2 started
Process 2: Fri, 29 Jun 2018 11:41:37 GMT, took: 0.67 seconds
Fetch sync process 3 started
Process 3: Fri, 29 Jun 2018 11:41:38 GMT, took: 0.68 seconds
Process took: 2.11 seconds

Asynchronous:
Fetch async process 1 started
Fetch async process 2 started
Fetch async process 3 started
Process 1: Fri, 29 Jun 2018 11:41:39 GMT, took: 0.70 seconds
Process 2: Fri, 29 Jun 2018 11:41:39 GMT, took: 0.71 seconds
Process 3: Fri, 29 Jun 2018 11:41:39 GMT, took: 0.84 seconds
Process took: 0.86 seconds
```

First off, note the difference in timing: by using asynchronous calls we make all the requests to the service at the same time. As discussed, each request yields the control flow to the next and returns when it's completed. The result is that requesting and retrieving the results of all requests takes only as long as the slowest request! See how the timing logs 0.84 seconds for the slowest request, which is about the total time elapsed by processing all the requests. Pretty cool, huh?

Secondly, look at how similar the code is to the synchronous version! It's essentially the same! The main differences are due to the library implementation for performing the GET request and for creating the tasks and waiting for them to finish.

Creating concurrency

So far we've been using a single method of creating and retrieving results from coroutines: creating a set of tasks and waiting for all of them to finish. But coroutines can be scheduled to run, and their results retrieved, in different ways. Imagine a scenario where we need to process the results of the HTTP GET requests as soon as they arrive; the process is actually quite similar to our previous example:

```
$ python 2a-async-fetch-from-server-as-completed.py
Fetch async process 2 started, sleeping for 5 seconds
Fetch async process 3 started, sleeping for 4 seconds
Fetch async process 1 started, sleeping for 3 seconds
>> Process 1: Fri, 29 Jun 2018 11:44:19 GMT, took: 3.70 seconds
>>>> Process 3: Fri, 29 Jun 2018 11:44:20 GMT, took: 4.68 seconds
>>>>>> Process 2: Fri, 29 Jun 2018 11:44:21 GMT, took: 5.68 seconds
Process took: 5.68 seconds
```

Note the padding and the timing of each result call: the tasks are scheduled at the same time, the results arrive out of order, and we process them as soon as they do. The code in this case is only slightly different: we gather the coroutines into a list, each of them ready to be scheduled and executed. The as_completed function returns an iterator that yields each future as it completes. Now don't tell me that's not cool. By the way, as_completed is originally from the concurrent.futures module.
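A minimal sketch of the pattern, assuming aiohttp is installed; fetch_date below is an illustrative stand-in for the repo's aiohttp_get helper:

```python
import asyncio
import time
import aiohttp

async def fetch_date(pid, session):
    # GET GitHub's public events API and read the Date response header
    start = time.time()
    async with session.get('https://api.github.com/events') as response:
        await response.read()
        print('>> Process {}: {}, took: {:.2f} seconds'.format(
            pid, response.headers['Date'], time.time() - start))

async def main():
    start = time.time()
    async with aiohttp.ClientSession() as session:
        coros = [fetch_date(pid, session) for pid in range(1, 4)]
        # as_completed yields awaitables in the order they finish,
        # letting us process each result as soon as it arrives
        for coro in asyncio.as_completed(coros):
            await coro
    print('Process took: {:.2f} seconds'.format(time.time() - start))

asyncio.run(main())
```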
Let's get to another example: imagine you're trying to get your IP address. There are several services you can use to retrieve it, but you're not sure whether they will be accessible at runtime. You don't want to check each one sequentially, ew. You would send concurrent requests to each service and pick the first one that responds, right? Right!

Well, there's one more way of scheduling tasks in asyncio, wait, which happens to have a parameter to do just that: return_when. Now we want to retrieve the results from the coroutines, so we can use the two sets of futures it returns, done and pending. In this next example we're going to use the pre-Python 3.7 way of starting things off in asyncio to illustrate a point, so please bear with me:

```
$ python 2b-fetch-first-ip-address-response-await.py
Fetching IP from ip-api
Fetching IP from ipify
ip-api finished with result: 81.106.46.223, took: 0.10 seconds
Task was destroyed but it is pending!
task: <Task pending coro=<fetch_ip() done, defined at 2b-fetch-first-ip-address-response-await.py:21> wait_for=<Future pending cb=[BaseSelectorEventLoop._sock_connect_done(10)(), <TaskWakeupMethWrapper object at 0x10c11cd38>()]>>
```

Wait, what happened there? The first service responded just fine, but what's with all those warnings? Well, we scheduled two tasks, but once the first one completed we closed the loop, leaving the second one pending. Asyncio assumes that's a bug and prints out a warning. We really should clean up after ourselves and let the event loop know not to bother with the pending futures. How? Glad you asked.

Future states

(As in states that a Future can be in, not states that are in the future… you know what I mean.) These are:

- Pending
- Running
- Done
- Cancelled

As simple as that. When a future is done, its result method will return the result of the future; if it's pending or running, it raises InvalidStateError; if it's cancelled, it will raise CancelledError; and finally, if the coroutine raised an exception, it will be raised again, which is the same behaviour as calling exception. But don't take my word for it.

You can also call done, cancelled or running on a Future to get a boolean indicating whether the Future is in that state. Note that done simply means result will either return or raise an exception. You can specifically cancel a Future by calling the cancel method (oddly enough), which is exactly what asyncio.run does under the hood with still-pending tasks in Python 3.7, so you don't have to worry about it:

```
$ python 2c-fetch-first-ip-address-response-no-warning.py
Fetching IP from ip-api
Fetching IP from ipify
ip-api finished with result: 81.106.46.223, took: 0.12 seconds
```

Nice and tidy output, gotta love it. This type of "Task was destroyed but it is pending!" error is quite common when working with asyncio, and now you know the reason behind it and how to avoid it. I hope you can forgive my little detour to pre-3.7 land.

Futures also allow attaching callbacks when they reach the done state, in case you want to add additional logic. You can even manually set the result or the exception of a Future, typically for unit testing purposes.
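A minimal, self-contained sketch of the first-response pattern using the modern API; the fetch_ip coroutine fakes the HTTP calls with sleeps:

```python
import asyncio

async def fetch_ip(service, delay):
    # Illustrative stand-in for the repo's HTTP coroutines
    print('Fetching IP from {}'.format(service))
    await asyncio.sleep(delay)
    return '81.106.46.223'

async def main():
    tasks = [
        asyncio.create_task(fetch_ip('ip-api', 0.1)),
        asyncio.create_task(fetch_ip('ipify', 0.5)),
    ]
    # Return as soon as the first task finishes
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    print('Result: {}'.format(done.pop().result()))

    # Cancel the still-pending tasks so the loop doesn't complain
    # about tasks being destroyed while pending
    for task in pending:
        task.cancel()

asyncio.run(main())
```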
Exception handling

Asyncio is all about making concurrent code manageable and readable, and that becomes really obvious in the handling of exceptions. Let's go back to an example to illustrate this. Imagine we want to ensure all our IP services return the same result, but one of our services is offline and not resolving. We can simply use try...except, as usual:

```
$ python 3a-fetch-ip-addresses-fail.py
Fetching IP from ip-api
Fetching IP from ipify
Fetching IP from borken
ipify finished with result: 81.106.46.223, took: 5.35 seconds
borken is unresponsive
ip-api finished with result: 81.106.46.223, took: 4.91 seconds
```

We can also handle the exceptions as we process the results of the futures, in case an unexpected exception occurred:

```
$ python 3b-fetch-ip-addresses-future-exceptions.py
Fetching IP from ip-api
Fetching IP from ipify
Fetching IP from borken
borken is unresponsive
Unexpected error: Traceback (most recent call last):
  File "3b-fetch-ip-addresses-future-exceptions-await.py", line 42, in main
    print(future.result())
  File "3b-fetch-ip-addresses-future-exceptions-await.py", line 30, in fetch_ip
    ip = json_response[service.ip_attr]
KeyError: 'this-is-not-an-attr'
ipify finished with result: 81.106.46.223, took: 0.52 seconds
```

Didn't see that one coming…

In the same way that scheduling a task and not waiting for it to finish is considered a bug, scheduling a task and not retrieving the exceptions it may have raised will also throw a warning:

```
$ python 3c-fetch-ip-addresses-ignore-exceptions-await.py
Fetching IP from borken
Fetching IP from ip-api
Fetching IP from ipify
borken is unresponsive
ipify finished with result: 81.106.46.223, took: 1.41 seconds
Task exception was never retrieved
future: <Task finished coro=<fetch_ip() done, defined at 3c-fetch-ip-addresses-ignore-exceptions-await.py:20> exception=KeyError('this-is-not-an-attr')>
Traceback (most recent call last):
  File "3c-fetch-ip-addresses-ignore-exceptions-await.py", line 29, in fetch_ip
    ip = json_response[service.ip_attr]
KeyError: 'this-is-not-an-attr'
```

That looks remarkably like the output from our previous example, minus the tut-tut message from asyncio.

Timeouts

What if we don't really care that much about our IP? Imagine it being a nice addition to a more complex response, but we certainly don't want to keep the user waiting for it. Ideally we'd give our non-blocking calls a timeout, after which we just send our complex response without the IP attribute. Again, wait has just the argument we need. Notice the timeout argument on wait; we're also adding a command line argument to test what happens if we do allow the requests some time. I also added some random sleeping time to ensure things didn't move too fast.

```
$ python 4a-timeout-with-wait-kwarg-await.py
Using a 0.01 timeout
Fetching IP from ipify
Fetching IP from ip-api
{'message': 'Result from asynchronous.', 'ip': 'not available'}

$ python 4a-timeout-with-wait-kwarg-await.py -t 5
Using a 5.0 timeout
Fetching IP from ipify
Fetching IP from ip-api
ip-api finished with result: 81.106.46.223, took: 0.22 seconds
{'message': 'Result from asynchronous.', 'ip': '81.106.46.223'}
```
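A minimal sketch of the timeout pattern, again faking the HTTP calls with sleeps; the names are illustrative:

```python
import asyncio

async def fetch_ip(service, delay):
    # Illustrative stand-in for the real HTTP coroutines
    print('Fetching IP from {}'.format(service))
    await asyncio.sleep(delay)
    return '81.106.46.223'

async def main(timeout):
    response = {'message': 'Result from asynchronous.', 'ip': 'not available'}
    tasks = [asyncio.create_task(fetch_ip('ip-api', 0.2)),
             asyncio.create_task(fetch_ip('ipify', 0.5))]

    # Give the requests `timeout` seconds, then move on without them
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in done:
        response['ip'] = task.result()
    for task in pending:
        task.cancel()

    print(response)

asyncio.run(main(timeout=0.01))
```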
Conclusion

Asyncio has extended my already ample love for Python. To be absolutely honest, I fell in love with the marriage of coroutines and Python when I first discovered Tornado, but asyncio has managed to unify the best of this and the rest of the excellent concurrency libraries into a rock-solid piece. So much so that a special effort was made to ensure these and other libraries can use the main IO loop, so if you're using Tornado or Twisted you can make use of libraries intended for asyncio!

As I said before, its main problem is the lack of standard library modules that implement non-blocking behaviour. You may find that a particular technology with plenty of well-established Python libraries to interact with will not have a non-blocking version, or that the existing ones are young or experimental. However, the number of asyncio-compatible libraries is always increasing.

Hopefully in this tutorial I communicated what a joy it is to work with asyncio. I honestly think it's the piece that will finally make adoption of Python 3 a reality; it really feels like you're missing out if you're stuck with Python 2.7. One thing's for sure, Python's future has completely changed, pun intended.

P.S. If you want more asyncio goodness, I've written a two-part follow-up to this article: Asyncio Coroutine Patterns: Beyond await and Asyncio Coroutine Patterns: Errors and Cancellation. Happy awaiting!