We are accepting registrations for Spring 2018 workshops. (Workshop descriptions can be found below; note that some workshops may have pre-requisites.)

Introduction to R and RStudio (online), Spring 2018 (Instructor: Shawn O'Neil)

♦ OSU-affiliated staff, faculty, and students are eligible for a discount (normal price: $500, price after discount: $200); email for the discount code before registering.

♦ This workshop will be online-only and instructor-led, from Apr. 9 to May 18, 2018. It is not available for academic credit.


Introduction to Unix/Linux and Command-Line Data Analysis, Spring 2018 (Instructor: Matthew Peterson)

♦ Students wishing to take Introduction to Unix/Linux and Command-Line Data Analysis for credit (MCB599) need to register via the OSU Registrar.


In collaboration with the Department of Statistics and Molecular and Cellular Biology, the CGRB offers a number of workshops and classes available to both internal and external faculty, staff, postdocs, and students. Generally these are 1- or 2-credit sized (3 or 6 weeks), and are usually offered in academic terms according to the schedule illustrated below. Skills and knowledge covered in classes along "tracks" are pre-requisites (e.g. skills from Command-Line Data Analysis are required for Genotyping By Sequencing and Metagenomics I, while skills from Metagenomics I are required for Metagenomics II). Dashed lines represent recommended but not pre-requisite dependencies. If you are unsure about your abilities for any of these, see their descriptions below or feel free to contact the trainers with questions.


Workshop Cycle

Pre-requisites: The classes annotated with a "bell curve" above require some incoming knowledge of statistics, including ANOVA, regression, p-values, and multiple-test correction. These pre-requisites can be obtained with OSU classes ST 411/511, ST 201/202, or ST 351/352. The Khan Academy Statistics playlist may also be a helpful resource.

Most of these utilize our Advanced Cyberinfrastructure Teaching Facility.

Participants who satisfactorily complete a workshop will receive a certificate.

Infrastructure Programming Analysis


Command Line

Introduction to Unix/Linux (3 weeks @ 3 hrs per week)

This workshop introduces the natural environment of bioinformatics: the Linux command line. Material will cover logging into remote machines, filesystem organization and file manipulation, and installing and using software (including examples such as HMMER, BLAST, and MUSCLE). Finally, we introduce the CGRB research infrastructure (including submitting batch jobs) and concepts for data analysis on the command line with tools such as grep and wc.

Command-Line Data Analysis (3 weeks @ 3 hrs per week)

The Linux command-line environment has long been used for analyzing text-based and scientific data, and there are a large number of tools pre-installed for data analysis. These can be chained together to form powerful pipelines. Material will cover these and related tools (including grep, sort, awk, and sed) driven by examples of biological data in a problem-solving context that introduces programmatic thinking. This course also covers regular expressions, a useful syntax for matching and substituting string and sequence data.
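As a rough illustration of what "matching and substituting" sequence data with a regular expression can look like, here is a minimal sketch. It is written in Python rather than with the command-line tools the course itself uses, and the sequence is invented for the example; only the EcoRI recognition site (GAATTC) is a real biological detail.

```python
import re

# Hypothetical DNA sequence, made up for this illustration.
seq = "ATGCGAATTCGGATCCGAATTCTAA"

# Matching: find the start position of every EcoRI recognition site.
site_pattern = re.compile(r"GAATTC")
sites = [match.start() for match in site_pattern.finditer(seq)]

# Substituting: mark where the enzyme cuts (between G and AATTC).
marked = site_pattern.sub("G^AATTC", seq)

print(sites)
print(marked)
```

The same pattern syntax carries over, with small differences in dialect, to tools such as grep and sed.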



Python I (3 weeks @ 3 hrs per week)

This workshop introduces programming concepts, driven by examples of biological data analysis, in the Python programming language. Topics covered will include variables and data types (including strings, integers and floats, dictionaries and lists), control flow (loops, conditionals, and some boolean logic), variable scope and its proper use, basic usage of regular expressions, functions, file input and output, and interacting with the larger Unix/Linux environment.

Python II (3 weeks @ 3 hrs per week)

Part II of the Python series expands on basic programming topics and explores a common concept in modern software development called object-oriented design, driven again by examples of biological data analysis. Although we will not cover the subtopics of inheritance or public/private variables, we will discuss the use of objects (and their blueprints: classes) in encapsulating functionality into easily used blocks of code that more closely match the biological concepts at hand. Other topics in this area include APIs and syntactic sugar. Finally, we’ll use these ideas to explore creating and using packages such as the BioPython package.


Data Programming in R (6 weeks @ 3 hrs per week)

The R programming language is widely used for the analysis of statistical data sets. This course introduces the language from a computer science perspective, covering topics such as basic data types (e.g. integers, numerics, characters, vectors, lists, matrices, and data frames), importing and manipulating data (in particular, vector and data-frame indexing), control flow (loops, conditionals, and functions), and good practices for producing readable, reusable, and efficient R code. We'll also explore functional programming concepts and the powerful data manipulation and visualization packages dplyr, tidyr, and ggplot2.

Recursion and Dynamic Programming (3 weeks @ 3 hrs per week)

This workshop explores and implements several problem-solving techniques that are related and particularly beautiful: recursion (self-referential processes), dynamic programming (guided deconstruction of problems into subproblems), and induction (the mathematical proof technique). After some practice with basic recursive methods (the Fibonacci sequence and predator-prey recurrence relations), we'll cover variations of the sequence-alignment problem from multiple perspectives, including the local heuristic used by BLAST.
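To give a flavor of the first two techniques, the following is a minimal sketch (not taken from the workshop materials) of the Fibonacci sequence written recursively, first naively and then with cached subproblem results. The function names are hypothetical; `functools.lru_cache` is a standard-library decorator.

```python
from functools import lru_cache


def fib_naive(n):
    """Plain recursion: recomputes the same subproblems many times."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_cached(n):
    """Same recurrence, but each subproblem is solved once and remembered."""
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


print(fib_naive(10))   # 55
print(fib_cached(30))  # 832040
```

The naive version makes an exponential number of calls as n grows, while the cached version makes a number of calls proportional to n; that gap is the core motivation for dynamic programming.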

Topics covered along the way include memoization, dynamic programming, proofs by contradiction, and data structures like lists, hash tables, stacks, trees, and the call stack. Finally, we'll explore graphical modeling of plants with L-system fractals and, if time allows, other applications that utilize recursion and dynamic programming (e.g. Hidden Markov Models).
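As a small, hedged illustration of the L-system idea, the sketch below recursively rewrites a string using Lindenmayer's classic "algae" rules (A becomes AB, B becomes A). It only produces the symbol strings, not a drawing, and the function and variable names are hypothetical rather than drawn from the workshop.

```python
def expand(symbols, rules, depth):
    """Recursively apply the rewrite rules to every symbol, `depth` times."""
    if depth == 0:
        return symbols
    # Symbols without a rule are copied through unchanged.
    rewritten = "".join(rules.get(symbol, symbol) for symbol in symbols)
    return expand(rewritten, rules, depth - 1)


# Lindenmayer's algae system: axiom "A", rules A -> AB and B -> A.
algae_rules = {"A": "AB", "B": "A"}
generations = [expand("A", algae_rules, depth) for depth in range(5)]

for generation in generations:
    print(len(generation), generation)
```

The string lengths (1, 2, 3, 5, 8) follow the Fibonacci numbers, which ties this example back to the recursive methods practiced earlier in the workshop. Plant-like drawings come from interpreting richer symbol sets as turtle-graphics commands.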



Genotyping By Sequencing (3 weeks @ 3 hrs per week)

This course covers the analysis of data generated by genotyping-by-sequencing (GBS), a restriction-enzyme approach allowing the deep sequencing of many individuals or samples at select regions of the genome. We'll cover potential applications of GBS (e.g. population-level analyses), processing GBS data with the Tassel, Uneak, and Stacks software, and interpretation of the results.


Metagenomics I (3 weeks @ 3 hrs per week)

 Description coming soon.

Metagenomics II (3 weeks @ 3 hrs per week)

 Description coming soon.


RNA-Seq I (3 weeks @ 3 hrs per week)

This first workshop in a series on analyzing RNA-seq data covers the development of de novo transcriptome assemblies. This includes data cleaning and preparation, comparing methods of assembly, filtering of contigs, and assessing the quality of output. Other topics include variant detection and annotation.

RNA-Seq II (3 weeks @ 3 hrs per week)

This second in the series on analyzing RNA-seq data covers the analysis of differential expression. Topics include data preparation, read mapping, region identification and statistical analysis with R and Bioconductor.


Conservation Genomics (3 weeks @ 3 hrs per week)

The main focus of the workshop is on natural populations, especially those that are endangered, but anyone who needs to draw inferences to solve broader research questions (parentage, phenotypic plasticity, using GBS markers to detect and measure selection, basic and advanced phylogenomics, comparative genomic analyses, host specificity and disease adaptation) can benefit.