Data Science for Everyone


 

Register Here

 

Date:

6 - 10 April 2020

Venue:

Murchison House, The King's Buildings, The University of Edinburgh, Edinburgh, Scotland, UK

Places:

20 (first come, first served)

Registration fee:

University of Edinburgh Staff/Students - £400

Non-University of Edinburgh Staff/Students - £550

(includes coffee/tea, but no lunch)

Information:

Contact our training team

 


This course is an excellent introduction to data handling and analysis using Python, regardless of what field you work in or what coding experience you have.

Python is a dynamic, readable language that is a popular platform for all types of data analysis work, from simple one-off scripts to large, complex software projects. One of the strengths of the Python language is the availability of mature, high-quality libraries for working with scientific data. Integration between the most popular libraries has led to the concept of a "scientific Python stack": a collection of packages which are designed to work well together.

This workshop is split into two sections. In the first two days we will introduce the basics of the Python language for those new to programming. For students with previous experience of Python or other languages, this will serve as a refresher and a chance to discuss best practice and focus on the parts of the language that we will need later.

In the second two days we will see how to leverage the libraries in the scientific Python stack to efficiently work with and visualise large volumes of data. Specifically, we will cover:

    • pandas for reading, cleaning and manipulating tabular data
    • numpy for efficiently working with arrays of data
    • scipy for basic statistics
    • seaborn and matplotlib for data visualization

 

Instructor

Dr. Martin Jones (Founder, Python for Biologists)
 

Who should attend

This workshop is aimed at complete beginners and assumes no prior programming experience. Rather than attempting to give a comprehensive overview of Python, we will instead concentrate on how best to use existing libraries to accomplish a lot while writing a very small amount of code! There will be opportunities to use your own data throughout, and the final day is set aside as workshop time for you to work on your own datasets with help from the instructor. If in any doubt as to whether the workshop is suitable for you, take a look at the detailed session content below or drop Martin Jones (martin@pythonforbiologists.com) an email.

 


 

 

Workshop session content

The workshop is delivered over ten half-day sessions (see the detailed curriculum below). Each session consists of a lecture of roughly one hour followed by two hours of practical exercises, with breaks at the organizer’s discretion. There will also be plenty of time for students to discuss their own problems and data.

1. Introduction
In this session we'll get familiar with the notebook environment in which we'll be working, and cover the very basics of writing and running Python code. We will briefly cover the different parts of code – variables, functions, methods and arguments – and discuss how these simple building blocks can be combined to make programs that do useful things. We will finish this session by looking at Python's tools for text manipulation, then using them to solve some exercise problems.

2. Lists and loops
In this session we will take our first steps into working with larger datasets by examining lists (which allow us to store multiple bits of information) and loops (which allow us to process them). This will require learning a little bit more about Python's syntax, which will set us up well for future sessions.
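
As a rough illustration of the kind of code this session builds towards, the sketch below stores several values in a list and processes them one at a time with a loop. The variable names and measurements are invented for this example and are not taken from the course material.

```python
# Hypothetical example data: a list holding several temperature readings.
temperatures_c = [12.5, 14.0, 9.8, 11.2]

# A loop visits each item in the list in turn.
temperatures_f = []
for temp in temperatures_c:
    # Convert each Celsius reading to Fahrenheit and store the result.
    temperatures_f.append(temp * 9 / 5 + 32)

print(temperatures_f)
```

The same pattern (start with an empty list, loop, append) works for any number of readings, which is what makes lists and loops a first step towards larger datasets.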

3. Conditions
In this session we will take the next logical step and learn how to write programs that can make decisions and implement rules for working with data. Python has a variety of ways to express complicated rules, and we will learn the circumstances that best suit each. We'll also focus a bit more on the importance of readability when there are multiple different ways to achieve the same thing in a program.

4. Organizing and structuring code
In this session we will discuss functions that we’d like to see in Python before considering how we can add to our computational toolbox by creating our own. We examine the nuts and bolts of writing functions before looking at best-practice ways of making them usable. We also look at a couple of features of Python – named arguments and defaults – that are very heavily used in the libraries that we will cover next.
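
To give a flavour of named arguments and defaults, here is a minimal sketch of a function that uses both. The function name, its parameters and the example values are hypothetical and chosen only for illustration.

```python
def normalise(values, scale=1.0, precision=2):
    """Rescale values so the largest equals `scale`, rounded to `precision` places."""
    largest = max(values)
    return [round(v / largest * scale, precision) for v in values]


# Relying on the defaults: scale=1.0 and precision=2.
print(normalise([2, 4, 8]))

# Overriding the defaults by name, which keeps the call readable.
print(normalise([2, 4, 8], scale=100, precision=0))
```

Library functions in the scientific Python stack tend to follow this style: a small number of required arguments, plus many optional ones that can be set by name when the default behaviour is not what you want.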

5. Reading and processing data with pandas
In this session we will introduce the first of our scientific Python packages: pandas. We will learn how to get our data out of files and into our Python programs. In the process we will have a chance to discuss file formats, missing data, and pandas' data model. We will learn how to efficiently carry out basic calculations and transformations on data and, crucially, how to select and filter data for analysis.
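
The steps described here (reading data, handling missing values, calculating, then selecting and filtering) might look something like the sketch below. The column names and values are made up, and an in-memory string stands in for a real file so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """species,mass_g,length_mm
mouse,21.5,90
rat,310.0,
shrew,8.2,55
"""

# read_csv accepts a file path or a file-like object; the empty field becomes NaN.
df = pd.read_csv(io.StringIO(csv_text))

# A calculation applied to a whole column at once.
df["mass_kg"] = df["mass_g"] / 1000

# Select rows with a boolean filter: length is present and mass is above 10 g.
selected = df[df["length_mm"].notna() & (df["mass_g"] > 10)]
print(selected)
```

With a real dataset, the only line that would need to change is the call to read_csv, which would be given the path to the file instead.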

6. Distributions and relationships with seaborn
In this session we will begin to look at visualisation. We will start with the workhorses of data visualization: the histogram and the scatter plot. Studying a few examples of each will allow us to get familiar with the seaborn interface and to cover a few points about visualizations that communicate effectively. We will combine this with a look at styles and colours, which contribute greatly to the readability of our charts. Here we will also discuss the use of statistical methods to discover patterns that are not obvious from simply looking at the data.

7. Relationships in different categories
In this session we will cover the most powerful aspect of seaborn: dividing up our data into different categories to look for patterns. This will build on the tools from the previous session, and there will be lots more to discuss about using them to rapidly explore new datasets. We will learn how to use pandas' grouping ability to produce summary tables, and how to use heatmaps - a fantastic and under-utilized tool for representing complex categorical data - to visualise them. We will also extend our und

8. Distributions in categories
In this session we will round up our survey of seaborn's chart types by bringing in those that directly compare categories. Here we find the classic box plot, as well as the more exotic swarm, violin, and boxen plots which aim to deal with some of its shortcomings. Drawing from a range of example datasets will allow us to illustrate which type of data suit each one best. We will also return to pandas to learn how we can sometimes represent continuous data as categories, and what trade-offs are involved.
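
One way to represent continuous data as categories in pandas is binning with pd.cut. The sketch below is an illustration under assumed bin edges and labels, not necessarily the approach taught in the session; the ages are invented.

```python
import pandas as pd

# Hypothetical continuous data.
ages = pd.Series([3, 17, 25, 42, 68])

# Bin edges and labels are assumptions chosen for this example.
# By default each bin includes its right edge: (0, 18], (18, 65], (65, 120].
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# Count how many values fall into each category.
print(age_group.value_counts())
```

The trade-off is visible even in this small case: once binned, ages 25 and 42 are indistinguishable, so the categories are easier to compare but carry less detail than the original numbers.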

9 & 10. Workshop time
The last day is set aside for workshop time. This is an ideal opportunity for students to apply the material to their own datasets (or to examples from their own field) with help from the instructor. Alternatively, we can use the time to discuss topics of particular interest that haven't been covered in the standard syllabus, or to continue to work on exercises from throughout the week.