Data Science Workbook
The Data Science Workbook offers a solid introduction to Data Science built on principles that hold across disciplines. This resource summarizes essential background knowledge and good-practice tips, proposes state-of-the-art tools and methods based on benchmarks and reviews, and provides hands-on tutorials for learning through real-world examples.
01. Introduction to Data Science
Data Science is an approach shaped in response to digital information whose size and unstructured nature are far beyond the capabilities of conventional tools. It is not just about how to process or store Big Data but also about how to distill it into lasting knowledge. Data Science emerges as the fourth paradigm of science, alongside the empirical, theoretical, and computational ones, in which modern computing techniques (e.g., machine learning) lead to insights discovered through data-driven analyses.
02. Introduction to the Command Line
The command line is a text interface to the computer’s operating system: it accepts predefined commands on demand and triggers the execution of various processes. That gives the user powerful computing capabilities, great analytical flexibility, and significant time savings through task automation. In this section, you become familiar with the terminal, including command-line navigation of the file system, and learn vitally important commands for parsing text files.
See the detailed section index
1. Terminal: a text-based interface for command-line operations
2. Introduction to UNIX Shell
⦿ 2.1 Basic Commands: Navigation, File Creation & Preview
⦿ 2.2 Text Files Editors
⦿ 2.3 System Info and Access Permissions
⦿ 2.4 Admin Commands
⦿ 2.5 Tutorial: Getting Started with UNIX
3. Useful Text Manipulation Programs
⦿ 3.1 Tutorial: GREP – simple search for regular expressions
⦿ 3.2 Tutorial: SED – edit stream text
⦿ 3.3 Tutorial: AWK – advanced text processing
⦿ 3.4 Tutorial: BIOAWK – biological data manipulation
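To give a flavor of the three text-manipulation programs above before diving into the tutorials, here is a minimal sketch; samples.txt is a hypothetical tab-separated file with a numeric second column.

```bash
#!/bin/bash
# Minimal text-parsing sketch; samples.txt is a hypothetical tab-separated file.

# GREP: print only the lines that contain the word "control"
grep "control" samples.txt

# SED: replace every occurrence of "n/a" with "NA" and write to a new file
sed 's|n/a|NA|g' samples.txt > samples_clean.txt

# AWK: print columns 1 and 3 of every line where the value in column 2 exceeds 10
awk -F'\t' '$2 > 10 {print $1, $3}' samples_clean.txt
```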
03. Setting Up Computing Machine
A computer is an essential tool in today’s digital reality. With it, you can acquire, process, store, and visualize many types of information, so it is worth knowing how to adapt its configuration to your research work. There is a wealth of software that facilitates team communication, project management, presentation of results, and the developer experience. You won’t be disappointed to learn about it!
See the detailed section index
1. Operating System Installation
⦿ 1.1 Windows OS Installation
⦿ 1.2 Linux OS Installation
2. Must-Have Software
⦿ 2.1 Basic Office Software
⦿ 2.2 Basic Developer Tools
⦿ 2.3 Basic Developer Libraries
3. Various Methods of Software Installation
⦿ 3.1 Tutorial: Installations on MacBook Pro
⦿ 3.2 Tutorial: Installations on Windows
⦿ 3.3 Tutorial: Installations on Linux
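As a small taste of the installation tutorials, here is a minimal sketch of the package-manager route on Linux and macOS; git and python stand in for whatever tools you actually need.

```bash
#!/bin/bash
# Linux (Debian/Ubuntu): install Git and Python 3 via the APT package manager
sudo apt update && sudo apt install -y git python3

# macOS: install the same tools via the Homebrew package manager (https://brew.sh)
brew install git python
```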
04. Development Environment
In Data Science, everyone is more or less a developer. Whether you write simple scripts in Bash, submit jobs to a queuing system on HPC infrastructure, or develop advanced software and web applications in any programming language, an integrated development environment (IDE) makes your daily work easier. Useful features include built-in file-system browsing, syntax autocompletion, previews of available attributes, error detection, customized formatting, and even source-code rendering (e.g., Markdown or HTML). Even if you’re only using open-source pipelines, it’s worth familiarizing yourself with common editors such as Atom or Jupyter to get the most out of them.
See the detailed section index
1. Integrated Development Environment (IDE)
⦿ 1.1 Tutorial: Atom Editor
2. Python programming environment
⦿ 2.1 Jupyter: Interactive Web-Based Multi-Kernel Programming Interface
⦾ Tutorial: Getting Started with JupyterLab
⦾ Tutorial: Getting Started with Jupyter Notebook
⦾ Tutorial: Sharing Jupyter Notebooks via MyBinder
⦿ 2.2 PyCharm: IDE for Professional Python Developers
3. R programming environment
⦿ 3.1 RStudio: Integrated Environment for R Programming
⦾ Tutorial: Setting Up RStudio
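As a quick preview of the Jupyter tutorials listed above, here is a minimal sketch of launching JupyterLab, assuming it is installed into the active Python environment.

```bash
#!/bin/bash
# One-time installation into the active Python environment
pip install jupyterlab

# Launch the interface; a browser tab opens at http://localhost:8888 by default
jupyter lab

# On a remote machine, skip the browser and pick an explicit port instead
jupyter lab --no-browser --port=8888
```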
05. Introduction to Programming
With some knowledge of scripting and algorithm design, you can easily encapsulate a repetitive task in a loop that starts with a single command and runs in the background while you attend to other work. That makes a huge difference compared to manually renumbering 1000 files, as the sketch below shows. The larger the data set, the greater the savings in researcher time, the fewer the human errors, and the better the reproducibility and standardization. Here you’ll learn the basics of Bash scripting and be introduced to two of the most widely used programming languages, R and Python.
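Here is a minimal Bash sketch of that renumbering task; the data_*.txt filenames are hypothetical stand-ins for your own files.

```bash
#!/bin/bash
# Rename data_1.txt ... data_1000.txt to zero-padded sample_0001.txt ... sample_1000.txt
for f in data_*.txt; do
    n=${f//[!0-9]/}                                   # keep only the digits in the name
    mv "$f" "$(printf 'sample_%04d.txt' "$((10#$n))")" # 10# guards against octal parsing
done
```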
06. High-Performance Computing (HPC)
Although today’s handy laptops perform many advanced and computationally intensive tasks, projects involving Big Data require significantly more resources. That need is met by HPC infrastructure, built from networks of computing clusters combined with immense memory resources. Access to these resources is remote, so job submission and data preview occur through an interface on any local computing machine from any (allowed) geolocation. The HPC infrastructure is a shared community space, so you will want to familiarize yourself with the usage policy to avoid disrupting your peers’ work.
See the detailed section index
1. Setting up your home directory for data analysis
⦿ 1.1 .bashrc example file
2. Introduction to HPC infrastructure
⦿ 2.1 XSEDE Supercomputer
⦾ XSEDE Supercell Storage
⦿ 2.2 SCINet Network
⦾ Atlas Computing Cluster
⦾ Ceres Computing Cluster
⦾ Juno Storage
⦿ 2.3 ISU HPC
⦾ Condo Computing Cluster
⦾ Nova Computing Cluster
⦾ LSS Storage
3. Secure Shell Connection (SSH)
⦿ 3.1 SSH Shortcuts
⦿ 3.2 Password-less SSH
4. Remote Data Access
5. Available Software
⦿ 5.1 Accessing Pre-Installed Modules
⦿ 5.2 Installing Custom Programs in User Space
6. Introduction to Job Scheduling
⦿ 6.1 SLURM
⦾ SLURM Cheatsheet
⦾ Tutorial: Creating SLURM Job Submission Scripts
⦾ Tutorial: Submitting Dependency Jobs using SLURM
⦿ 6.2 PBS
⦾ PBS Cheatsheet
⦾ Tutorial: Creating PBS Job Submission Scripts
⦾ Tutorial: Submitting Dependency Jobs using PBS
7. Introduction to GNU Parallel
8. Introduction to Containers
⦿ 8.1 Singularity
⦾ Tutorial: Creating Containers using Singularity
⦾ Tutorial: Modifying Existing Containers
⦾ Tutorial: Singularity on your Mac via Vagrant
⦿ 8.2 Docker
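To preview what the SLURM tutorials build toward, here is a minimal sketch of a job submission script; the module name and the my_analysis.py script are hypothetical and vary by cluster.

```bash
#!/bin/bash
#SBATCH --job-name=example        # job name shown by squeue
#SBATCH --nodes=1                 # run on a single node
#SBATCH --ntasks=1                # one task
#SBATCH --cpus-per-task=4         # four CPU cores for that task
#SBATCH --mem=8G                  # total memory for the job
#SBATCH --time=01:00:00           # wall-clock limit (hh:mm:ss)
#SBATCH --output=job_%j.log       # %j expands to the job ID

module load python                # load a pre-installed module (name varies by cluster)
python my_analysis.py             # my_analysis.py is a hypothetical analysis script
```

Submit the script with `sbatch submit.sh` and check its status with `squeue -u $USER`.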
07. Data Acquisition and Wrangling
Information is the foundation of the learning process. Data acquisition and wrangling are crucial parts of Data Science that lead to extracting knowledge from that information. With large, difficult-to-transfer data, remote access is the rule, almost exclusively via a command-line interface. Luckily for you, knowing a few tricks makes it easy to access and visualize data in a friendly way on a remote machine. As you explore this section, you will also learn how to manage Excel spreadsheets and efficiently manipulate massive data with Python.
See the detailed section index
1. Remote Data Access
⦿ 1.1 Remote Data Transfer
⦾ Tutorial: Copying Data using SSH
⦾ Tutorial: Copying Data using Globus
⦾ Tutorial: File Transfer using iRODS
⦾ Tutorial: File Transfer using SRA Toolkit
⦾ Tutorial: Downloading Online Data using WGET
⦾ Tutorial: Downloading Online Data using Web Scraping
⦾ Tutorial: Downloading Online GitHub Repos using GIT
⦾ Tutorial: Downloading Online GitHub Folders using SVN
⦿ 1.2 Remote Data Preview without Downloading
⦾ Tutorial: Viewing Text Files using UNIX commands
⦾ Tutorial: Viewing PDF Files using X11 SSH connection
⦾ Tutorial: Mounting Remote Folder on Local Machine
2. Data Manipulation
⦿ 2.1 Manipulating Excel Data Sheets
⦾ Tutorial: Create Workbook from Multiple Text Files
⦾ Tutorial: Export Multiple Worksheets as Separate Text Files
⦾ Tutorial: Create Index for All Worksheets
⦾ Tutorial: Merge Two Spreadsheets using a Common Column
⦿ 2.2 Manipulating Text Files with Python
⦾ Tutorial: Read, Write, Split, Select Data
⦾ Tutorial: JSON Module - Encoding & Decoding JSON Data
⦾ Tutorial: Math Module - Various Mathematical Functions
⦾ Tutorial: Pandas Library - Data Structure Manipulation Tool
⦾ Tutorial: NumPy Library - Multi-Dimensional Arrays Parser
⦾ Tutorial: SciPy Library - Algorithms for Scientific Computing
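As a small taste of the transfer tutorials above, here is a minimal sketch using wget and scp; the URL, host name, and paths are hypothetical.

```bash
#!/bin/bash
# Download a file from the web, resuming if the transfer was interrupted
wget -c https://example.org/data/reads.fastq.gz

# Copy it to a remote machine over SSH (scp), then pull results back
scp reads.fastq.gz user@cluster.example.edu:/project/data/
scp user@cluster.example.edu:/project/results/summary.txt .
```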
08. Data Visualization
Nowadays, data visualization is much more than just listing numbers in a table. Infographics and interactive charts perhaps best capture the modern presentation of results. Such visualization is meant to make findings intuitively intelligible at a glance, even to the non-expert, while providing the means to dig interactively into the details for those interested or those evaluating the merits. Let science-based graphic design bring out the artist in you!
See the detailed section index
1. Introduction to Scientific Graphic Design
⦿ 1.1 Raster Graphics Tools
⦿ 1.2 Vector Graphics Tools
⦿ 1.3 Adobe Creative Cloud
⦿ 1.4 Template-based Web Tools
2. Introduction to Scientific Graphing
⦿ 2.1 Gnuplot – Creating Plots in the UNIX Shell
⦿ 2.2 Plotly-Dash – Data Processing & Interactive Plotting with Python
⦾ Introduction to Plotly (Python library)
⦾ Introduction to Dash (Python library)
⦾ Interactive Graphing – Local Server with Web-Based Interface
⦾ Plotly Graphing - Interactive Examples in JupyterLab
◦ Tutorial: Creating XY Scatter Plot
◦ Tutorial: Creating 1D Volcano Plot
◦ Tutorial: Creating Heatmap
◦ Tutorial: Creating Dendrogram
◦ Tutorial: Creating Clustergram (Heatmap with Dendrograms)
⦿ 2.3 RStudio – Data Processing & Plotting with R
⦾ Setting Up an RStudio Environment
⦾ Tutorial: Creating Boxplots in R
⦾ Tutorial: Creating Heatmaps in R
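To preview shell-based plotting with Gnuplot (item 2.1 above), here is a minimal sketch; data.txt is a hypothetical two-column (x, y) text file.

```bash
#!/bin/bash
# Drive Gnuplot from the shell via a heredoc to render a PNG scatter plot
gnuplot <<'EOF'
set terminal png size 800,600      # write a PNG instead of opening a window
set output "scatter.png"
set title "XY Scatter Plot"
set xlabel "x"
set ylabel "y"
plot "data.txt" using 1:2 with points title "measurements"
EOF
```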
09. Project Management
It is undeniable that good project management leads to improved reproducibility and research productivity, and it becomes significantly more important as the size of the project and the amount of data increase. It is not just about how you organize your data, files, and folders, but also about how you record the steps of the data analysis that lead to the publication of your research.
See the detailed section index
1. Introduction to Project Management
2. Project Management Tools
⦿ 2.1 Introduction to GitHub
⦾ Tutorial: Git commands (part 1)
⦾ Tutorial: Git commands (part 2)
⦿ 2.2 Introduction to BitBucket
⦿ 2.3 Introduction to ZenHub
⦿ 2.4 Tutorial: Research Reproducibility
3. Documentation Improvement Tools
⦿ 3.1 Introduction to Markdown
4. Team Communication Tools
⦿ 4.1 Introduction to Slack
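To preview the Git tutorials above, here is a minimal sketch of a typical workflow; the repository URL, branch, and file names are hypothetical.

```bash
#!/bin/bash
git clone https://github.com/user/project.git   # copy the remote repo locally
cd project
git checkout -b analysis-update                 # work on a separate branch
# ...edit files...
git add results/summary.md                      # stage the change
git commit -m "Add summary of the analysis"     # record it with a message
git push -u origin analysis-update              # publish the branch to GitHub
```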