Introduction to Data Science and Computational Thinking notes

Learn Data Science: Free, Self-Taught, and Python-Powered (SUN Data)

DR JUAN H KLOPPER

SCHOOL FOR DATA SCIENCE AND COMPUTATIONAL THINKING AT STELLENBOSCH UNIVERSITY

Every year, the School for Data Science and Computational Thinking offers an online workshop in Data Science with Python, free of charge to any participants across the globe. This workshop is typically offered during the mid-semester break in the second semester at Stellenbosch University. We make this freely available so you may upskill yourself. Please note, all content belongs to Stellenbosch University.

INTRODUCTION

Our world has undergone tremendous changes in the last few decades. Knowledge has become the cornerstone of our civilisation. We have embraced the use of data to gain that knowledge. This has been brought about by our ability to generate and capture vast amounts of data. Together with an explosion of data has come the ease of access to data and the ease of extracting knowledge from data.

Extracting knowledge from data was traditionally the task of statisticians and data analysts. Formal training in statistics was not for the masses. As data became abundant in so many fields, domain experts in these fields needed to learn how to analyse data. Their unique perspective, knowledge, and experience are invaluable in making sense of all the data in their fields.

Modern computers and software have led to the democratisation of data analysis, of understanding data, and of the use of data. Expensive, closed-source software from large corporations, available only to the elite, has made way for free and open-source computer languages such as Python, R, and Julia. Any domain expert or interested party can now use data to contribute to the fundamental understanding of our world, solving problems today that would otherwise have taken decades more to solve.

Data Science is the umbrella term for this ability to gather, manipulate, analyse, visualise, and learn from data. There has never been a more exciting time.

The journey is full of surprises and moments of enlightenment as you join the massive and ever-growing community of Data Scientists. The skills that you will become aware of during this course will open a new world.

Our vehicle will be the most popular language in Data Science. Python is an easy-to-learn, yet extremely powerful computer language. We will use cloud-based computing, negating the need to install any software on your own computer.

The course has been created to jump-start your Data Science skills. As such, it is very dense with information. I want you to have the best and most complete start to your new abilities. One week is not enough to learn all these new skills. As with learning a new spoken language, you will need time and experience well beyond just this week. I do, however, want to leave you with a clear path forward. A few toy examples will not satisfy you. Instead, this course aims to highlight everything that is possible. I do not want to leave you wanting. I want you to become an expert Data Scientist in your field.

There are several educational resources available for this course. First and foremost is a set of detailed video tutorials that serve as your first contact with the course. The video lectures make use of extensive notes and code. The code is available as Google Colab notebooks. The notebooks are also provided as reference documents in portable document format (PDF) for you to read. There are also sets of exercise materials that you complete before each day’s live session. During these sessions we work through the exercise material.

The School for Data Science and Computational Thinking wants to be your partner in the future, and we hope that you stay in touch after the course.

The course comprises 14 modules, which are described below. All the modules contain Google Colab notebook files. Modules 1, 3, 4, 5, 6, 7, 8, 9, 10, 13, and 14 contain video lectures and PDF notes. Modules 3, 4, 5, 6, 7, 9, 10, and 11 contain exercise notebook files that you should attempt to complete. They will form the basis of the live sessions. A set of solution files is also available that you can use if you get stuck. Try not to use them, though. In Python, there are many ways to solve a problem and you will learn more by discovering your own solutions.

Please watch the announcement video to learn more about the course: https://youtu.be/4TlOgImDF5A It contains instructions on how to set up a Google account that you will need to complete this course. It also documents how to download, unzip, and upload all the required files to your Google account.

COURSE LOGISTICS

The course contains educational material, exercise material, and live, online sessions. The educational material consists of video lectures and print documents (PDF format).

This is a one-week short course. The video material and/or PDF documents must be viewed or read each morning (or the night before) prior to attempting the exercises. A live session will commence at 14H00 each day, during which the instructor will work through the exercises and discuss the material with the participants.

A video is available (linked in the introduction section of this document) that walks you through preparing for the course and setting up the required online account (which is free). The video must be watched prior to starting the course. All setup steps must similarly be completed prior to the start of the course. No download or installation of programs on your computer is required. The account that you need to create is a Google Drive account (Gmail and Google Docs). You will also be provided with a set of files that must be uploaded to your Google Drive. The video will also contain information on which of the 14 modules of this course will be completed on each day.

Click here (for the zip file) for other material needed for the course.

MODULES

01 INTRODUCTION TO DATA SCIENCE

This module introduces the topic of Data Science and the tools that we use to do our analysis. At the end of the module there is a short example project. You may either watch the video or read the accompanying PDF document. There is no exercise file in this module.

Video lecture link: https://youtu.be/4EMfOI1PiLQ

Click here for notes. <---- click on the word 'here'.

SECTION	DESCRIPTION
Defining Data Science	Data Science is the amalgamation of tools from statistics, mathematics, and computer science that provide us with the ability to learn from data to understand and improve our world.
The Tools of Data Science	Computer languages have revolutionised our ability to gather and analyse data. Python has emerged as the leading language in the field of Data Science. It is easy to learn, free of charge, and a large community of researchers and users have grown an ecosystem for Python. It is possible to create games and applications using Python. Its power to work with, analyse, visualise, and interpret data is at the core of its success.
Example Data Science Project	There is a lot to cover in a course on Data Science using Python. This short project serves as a demonstration of the power and ease of use of the language.

02 DATA AND DEFINITIONS

This module introduces terms and definitions that you should be familiar with. Read the accompanying PDF document. There is no video lecture or exercise file for this module.

Click here for notes. <---- click on the word 'here'.

SECTION	DESCRIPTION
Data Types	There are classification systems for the types of data that we collect. The most common system divides data into a numerical and a categorical type. Each of these two main types also has subtypes. We learn about these subtypes with examples.
Sample Space	Age in years for humans is typically 0 to 100 years. A laboratory test can only be low, normal, or high with respect to a reference range of normal values. Values for a variable have a range or set. When we collect data for a variable from a sample of subjects, each value comes from this range or set of elements.
Tidy Data	When data is captured in the form of a spreadsheet or even when it is extracted from a database, it is ideal to have each row represent a subject and each column a well-defined data type. This long-format of data is emphasised and used in Data Science.
Research Questions	Computational thinking and research in general require clarity in the expression of a research question. This clarity enables us to convert our curiosity into data that we can examine in a structured way using a computer language.
Trials Experiments Outcomes	There are many confusing and overlapping terms in Data Science. This section clarifies some of the commonly used terms.
Populations and Samples	In most cases we cannot collect data from all members of a population. Instead, we take a random sample from the population. The results are then inferred back to the population.
Randomisation	To be able to infer the results of analysis of the data from a sample, we must select subjects without bias. Randomisation helps to ensure this selection. There are numerous ways to randomise the selection of subjects from a population. The common methods are discussed.
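
The idea of selecting subjects from a population without bias can be sketched in a few lines of Python. This is only a minimal illustration of simple random sampling using the standard library. The population of 1,000 subject identifiers, the sample size of 50, and the seed value are assumptions made up for this example and do not come from the course material.

```python
import random

# Hypothetical population: subject identifiers 1 to 1000 (illustrative only)
population = list(range(1, 1001))

# Seed the generator so that the same sample is drawn on every run
rng = random.Random(2024)

# Simple random sampling without replacement: every subject has the same
# chance of selection and no subject can be selected twice
sample = rng.sample(population, k=50)

print(len(sample))          # number of subjects selected
print(sorted(sample)[:5])   # the five smallest identifiers in the sample
```

Other randomisation schemes, such as stratified or cluster sampling, need more code than this and are not shown here.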

03 PYTHON

This module introduces the Python programming language. Watch the accompanying video lecture or read the PDF document. There is an exercise file that you should attempt to complete. A new computer language can only be conquered by using it yourself.

Video lecture link: https://youtu.be/NLPoN5Oy_jw

Click here for notes. <---- click on the word 'here'.

SECTION	DESCRIPTION
Computer Languages	There are many computer languages, most geared towards a specific task. Python is a general-purpose language. Its ease of use and the clarity of its syntax have made it the leading language in Data Science.
Tools	To use a computer language, we need to have a program in which we type the actual code. Many such tools are available. Here we introduce Google Colaboratory. It is part of Google Drive and available to anyone with a Google email account. It is as simple to use as Google Docs and is free.
Arithmetic	Python is easiest to introduce using simple arithmetic. Python can serve as a giant calculator.
Conditionals	Conditionals test a question. Is 2 greater than 4? A simple concept but of extreme importance in Data Science where we extract and manipulate data based on these questions. A conditional can only return a True or a False value. We use this to include and exclude data from our analysis.
Functions	A computer language such as Python contains many keywords. They make up the syntax of a language. Many keywords are functions. They take input values that we provide and return a useful value after taking appropriate action.
Python Data Types	Objects in Python are of a certain type. Numbers might be integers (whole numbers) or decimal values. We can group numbers together into list objects. Here we explore the basic data types in Python.
Math Package	Packages are code that we import into a running session of Python. These packages contain extra functions and functionality that expand the use of Python. We use the Math package to expand on the mathematical operations that we can perform in Python.
Computer Variables and Assignment	Objects in Python can be stored in computer memory for reuse. This is done by providing an appropriate name for an object and then assigning the object to that name. The name, referred to as a computer variable, signifies the area in memory that contains the object. In Python, as in most other languages, the equal symbol serves as the assignment operator.
Collections	Numbers, words, and other objects can be combined into collections. The three main collection types in Python are lists, tuples, and dictionaries. Here we explore examples of each and learn when and how to use each.
Loops	There are various loop operators in Python. They allow us to control the flow of execution of our code. This is particularly useful when we need to iterate over many instances of our analysis while certain conditions are met.
If Elif Else	These keywords are used in conjunction with loops to control the flow of execution of our code and analysis.
List Comprehension	List comprehension is a useful and fast way to generate data based on calculations. It can save a lot of code writing and time.
Numpy	One of the fundamental packages in Python is Numerical Python or numpy for short. It adds a host of functions and functionality to Python that are geared towards the analysis of data.
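
Several of the topics in this module fit into one short sketch. The code below is not taken from the course notebooks. The variable names and the age values are hypothetical and chosen only to illustrate arithmetic, conditionals, assignment, collections, the math package, loops with if, elif, and else, and list comprehension.

```python
import math

# Arithmetic: Python as a calculator
total = 3 + 4 * 2          # multiplication is evaluated before addition
print(total)               # 11

# Conditionals return True or False
print(2 > 4)               # False

# Assignment: the equal symbol binds a name to an object
ages = [34, 51, 27, 45, 62]    # a list of hypothetical ages in years

# Collections: a tuple and a dictionary
point = (3, 4)
subject = {"id": 1, "age": 34}

# The math package adds functions such as the square root
distance = math.sqrt(point[0] ** 2 + point[1] ** 2)
print(distance)            # 5.0

# A for loop with if, elif, and else to label each age
labels = []
for age in ages:
    if age < 30:
        labels.append("young")
    elif age < 60:
        labels.append("middle")
    else:
        labels.append("older")
print(labels)

# List comprehension: keep only the ages above 40
over_forty = [age for age in ages if age > 40]
print(over_forty)          # [51, 45, 62]
```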

04 IMPORTING AND MANIPULATING TABULAR DATA

In this module we import data from a spreadsheet file and learn how to manipulate it using pandas, one of the most successful packages in Python. You can watch the video lecture or read the PDF document. There is an exercise file that you should attempt to complete.

Video lecture link: https://youtu.be/SoxroY_1QvU

Click here for notes. <---- click on the word 'here'.

SECTION	DESCRIPTION
Pandas	The pandas package is one of the main reasons for the success of Python in Data Science. It allows us to create, import, manipulate, analyse, and plot data.
Importing Data	It is most common to have data saved as a spreadsheet file. The best format in Data Science is the comma-separated values (CSV) file. Microsoft Excel, Google Sheets, and database applications can export data as CSV files. CSV files strip away all the extraneous additions that these applications add to data, such as formatting and colouring. We only need the actual values when analysing data. Pandas makes it easy to import data files.
Extracting Data	In many cases we only require a subset of our data for analysis. Here we learn how to use pandas to manipulate and extract our data.
Filtering Data	We can narrow our search by filtering out any unnecessary information.
Updating and Changing Data	Pandas allows for updating of data values and the addition and removal of data. These are required tasks as we explore the information contained in data.
Sorting Data	Many tests require sorting of data whether it be alphabetical or numerical. It is also a useful task when visualising data.
Missing Data	It can be rare to find a data set that contains values for all subjects and variables. Dealing with missing data has a direct impact on data analyses.
Dates and Times	There are many date and time formats. Computer hardware and software applications can be set to default formats, often in competing formats on the same computer. Data for dates and times can also be entered in many formats. Dealing with dates and times in a data set is challenging. Here we learn how to use Pandas to standardise formats for analyses.
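
The pandas tasks listed above can be sketched with a very small data set. This is a minimal example under stated assumptions: the column names (id, group, age, visit_date) and all the values are hypothetical, and the CSV content is held in a string so that the code runs without an external file. In practice, the path to a CSV file would be passed to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for an exported spreadsheet
csv_text = """id,group,age,visit_date
1,control,34,2021-03-01
2,treatment,51,2021-03-04
3,treatment,,2021-03-02
4,control,45,2021-03-03
"""

# Importing data: read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Extracting data: select a single column
ages = df["age"]

# Filtering data: keep only the rows in the treatment group
treatment = df[df["group"] == "treatment"]

# Missing data: count the missing ages, then drop the incomplete rows
n_missing = df["age"].isna().sum()
complete = df.dropna(subset=["age"])

# Sorting data: order the complete rows from oldest to youngest
by_age = complete.sort_values("age", ascending=False)

# Dates and times: convert the text column to a datetime type
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%Y-%m-%d")

print(n_missing)               # 1
print(by_age["id"].tolist())   # [2, 4, 1]
```

Dropping rows is only one way to deal with missing values. Whether it is appropriate depends on the data and the analysis.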

05 SUMMARISING DATA

Module five introduces descriptive statistics. Describing our data is the first step in understanding the story that our data is trying to tell. You may either watch the video lecture or read the PDF document. There is an exercise notebook that you should complete.

Video lecture link: https://youtu.be/H4-MjjFAgpI

Click here for notes. <---- click on the word 'here'.

SECTION	DESCRIPTION
Counting	Frequency is the number of times that a sample space element of a variable appears in a data set. We can also divide the frequency by the sample size to give us a relative frequency. Counting is often a particularly useful summary of data.
Measure of Central Tendency	It is not possible to stare at rows and rows of numbers and learn anything from the exercise. Instead, we calculate single values that are representative of the whole. These values include the mean, the median, and the mode.
Measures of Dispersion	We also gain knowledge from a set of numbers if we know how spread out they are. These measures include ranges, variances, standard deviations, and percentiles.
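
The summaries in this module can be computed with a few lines of code. The sketch below uses only the Python standard library rather than pandas, and the blood group and age values are hypothetical data invented for this illustration.

```python
from collections import Counter
import statistics

# Hypothetical data for one categorical and one numerical variable
blood_groups = ["A", "O", "O", "B", "A", "O", "AB", "O"]
ages = [23, 35, 35, 41, 52, 29, 35, 60]

# Counting: frequency and relative frequency of each sample space element
frequency = Counter(blood_groups)
relative_frequency = {k: v / len(blood_groups) for k, v in frequency.items()}
print(frequency["O"])             # 4
print(relative_frequency["O"])    # 0.5

# Measures of central tendency
print(statistics.mean(ages))      # 38.75
print(statistics.median(ages))    # 35.0
print(statistics.mode(ages))      # 35

# Measures of dispersion
value_range = max(ages) - min(ages)
variance = statistics.variance(ages)          # sample variance (divides by n - 1)
std_dev = statistics.stdev(ages)              # sample standard deviation
quartiles = statistics.quantiles(ages, n=4)   # the three quartile cut points
print(value_range)                # 37
```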

06 DATA VISUALISATION

The cornerstone of Data Science is data visualisation. The ability to plot our data gives us a rich understanding of the data and allows us to communicate our findings. You can watch the video lecture or read the PDF document. There is an exercise file that you should attempt to complete.

Video lecture link: https://youtu.be/AlKIQA7eWgU

Click here for notes. <---- click on the word 'here'.

SECTION	DESCRIPTION
Python Data Visualisation Ecosystem	Python has many packages that allow us to visualise data. This brings an even richer understanding of the knowledge hidden in the data. The plotly library generates interactive plots, ideal for Data Science.
Bar Plots	Bar plots are visual representations of frequency and relative frequency. They are the preferred visual representation of categorical or discrete data.
Histograms	Histograms similarly visualise frequencies and relative frequencies. Unlike bar charts, they are used for continuous numerical data.
Box-Whisker Plots	Box-and-whisker charts give an indication of the distribution of numerical data values by incorporating the median and quartiles of the data.
Scatter Plots	Scatter plots compare pairs of numerical variables for each subject. They allow us to visualise correlation between numerical variables and can help in visualising linear models such as linear regression models.
Time Series Plots	Time series plots allow us to visualise change in a variable over time.

07 RANDOMNESS AND SAMPLING

Data captures values for random events. It is important to have a basic understanding of the concept of randomness and the probability of findings from data analysis. In this module we also explore the patterns of data values called distributions. Watch the video lecture or read the PDF document. There is an exercise file that you should attempt to complete.

Video lecture link: https://youtu.be/OmFKiHTA_CQ

Click here for notes. <---- click on the word 'here'.

SECTION	DESCRIPTION
Randomness	The value for a variable in a subject can be viewed as random. The numpy package allows us to explore the topics and types of randomness that are essential in Data Science.
Probabilities	The likelihood or probability of a given value or statistic forms the core of expressing results in Data Science and in statistics.
Random Variables	A random variable is a function that assigns a value (that we can capture in a spreadsheet) to a random outcome. By understanding randomness and random variables and probabilities we can make sense of the knowledge in data.
Distributions	There are patterns to random variables when we consider their frequency in a data set. These patterns are termed distributions. Sampling distributions form the core of many analyses.
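
A sampling distribution can be simulated with the random number tools in numpy. The sketch below is an illustration under assumptions: the normally distributed population, its mean of 50 and standard deviation of 10, the sample size of 30, and the seed are all made up for this example.

```python
import numpy as np

# A seeded generator makes the random draws reproducible
rng = np.random.default_rng(seed=42)

# Hypothetical population of 10,000 values from a normal distribution
population = rng.normal(loc=50, scale=10, size=10_000)

# Draw 1,000 random samples of 30 values each and record every sample mean
sample_means = np.array([
    rng.choice(population, size=30, replace=False).mean()
    for _ in range(1_000)
])

# The sample means cluster around the population mean and vary less
# than the individual values do
print(population.mean(), population.std())
print(sample_means.mean(), sample_means.std())
```

The collection of sample means is an approximation of the sampling distribution of the mean for samples of this size.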

08 HYPOTHESIS TESTING

Our scientific method, the steps that we follow to understand our world through research, is based on the process of hypothesis testing. You can watch the video lecture or read the PDF to learn more. There is no exercise file for this module.

Video lecture link: https://youtu.be/kwocwiRM7_0

Click here for notes. <---- click on the word 'here'.

SECTION	DESCRIPTION
Sampling based on Proportions	Given a set of data for a categorical variable we determine how likely it is to have found the frequency of each sample space element.
Differences in Means	Here we look at how likely it is to find the difference in means for a variable given two groups.
Hypothesis Testing	Hypothesis testing is the bedrock of the scientific method. We start with a research question stated in such a way that it is amenable to the collection of data of specific types that can be analysed to provide an answer to our question.
Stating Hypotheses	Here we learn to state the two hypotheses that make up the method of hypothesis testing: the null hypothesis and alternative hypothesis.
Estimating Differences	Here, we look at an example problem to illustrate hypothesis testing.
One-tailed Hypothesis	While the two-tailed alternative hypothesis is used in most cases, we also learn about the one-tailed alternative hypothesis.
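
One common way to estimate how likely a difference in means is under the null hypothesis is a permutation test, in which the group labels are shuffled many times. The sketch below is an assumption about how such a test can be written and may differ from the procedure used in the course notebooks. The two groups of measurements, the number of permutations, and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical measurements for two groups (illustrative values only)
group_a = np.array([12.1, 14.3, 11.8, 15.2, 13.9, 14.8, 12.7, 15.5])
group_b = np.array([10.2, 11.9, 12.4, 10.8, 11.1, 12.9, 10.5, 11.6])

# Observed difference in means
observed = group_a.mean() - group_b.mean()

# Under the null hypothesis the group labels are interchangeable, so we
# shuffle the pooled values and recompute the difference many times
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_permutations = 5_000
simulated = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(pooled)
    simulated[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Two-tailed p-value: differences at least as extreme in either direction
p_two_tailed = np.mean(np.abs(simulated) >= abs(observed))

# One-tailed p-value: differences at least as large in the observed direction
p_one_tailed = np.mean(simulated >= observed)

print(observed, p_two_tailed, p_one_tailed)
```

A small p-value means that a difference as large as the observed one rarely arises when the labels are assigned at random, which counts as evidence against the null hypothesis.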

09 COMPARISONS FOR A NUMERICAL VARIABLE

In this module we put our knowledge from the previous module to work, comparing two means for a numerical variable. We learn how to use the null hypothesis to construct many possible outcomes and to see the probability of the findings from our own research. There is an accompanying video lecture that you can watch, or you can read the PDF document. Also attempt to complete the exercise file.

Video lecture link: https://youtu.be/YOLuscAc24E