Introduction to the Linux command-line
Introduction - Why you should learn Linux
If you’ve never used Linux before, you might be asking “do I really need to learn this? Why can’t I just use Windows or Mac?”. The short answer is “yes”. The long answer is “yes, and that’s a good thing.”
Linux is overwhelmingly common on scientific servers and clusters, as well as cloud computing platforms, for both historical and technical reasons. While it’s technically possible to use a graphical interface to connect with clusters over the network, it takes a lot of network bandwidth and processing power, so it’s not feasible for clusters with more than a few users. Instead, we must use the command line, which comes with the added bonus of allowing us to automate common, repetitive tasks (such as data analysis pipelines or programs with complicated sets of parameters).
Even if you never use Linux on your personal computer or workstation, learning to efficiently utilise the command line can still make your workflow much faster and easier in the long-run. If you’re a Windows user, a lot of the logic will transfer to Windows Powershell even though the individual commands will differ. And if you’re a Mac user, most of this guide will transfer to the Mac command line basically as-is (macOS is secretly a Unix operating system).
Finally, Linux command line skills are highly valued in the private sector, should you choose to take that route in the future. Almost all cloud computing and data analysis platforms use Linux “under the hood”, and many modern machine learning tools are made with Linux in mind.
All of this means you’re almost guaranteed to need to use the Linux command line at some point in the future. But don’t despair! You don’t need to learn very much to get started, and everything else can be picked up as you go. Consequently, this guide is not intended to be complete or comprehensive; it’s more of a crash-course, covering just enough to get you up and running and producing results. I have provided a list of further resources for more advanced topics not covered in this guide.
Table of Contents
- Introduction - Why you should learn Linux
- Terminology
- Setup and required software
- Where to get help
- Logging in
- Getting around - navigating the file system
- Reading and editing text
- Output redirection and pipes
- Automate common tasks - command-line scripting
- HPC clusters - the basics of submitting jobs
- Software module system
- Transferring files to and from remote servers
- Some useful commands
- Finding executables in non-standard directories
- Test your knowledge
- Resources and further reading
Terminology
Below are some useful definitions and terminology which will appear throughout this guide:
- Shell: a program that lets us give commands to a computer and receive output. It is also referred to as the terminal, terminal emulator or command line.
- Bash: the most commonly used shell program on Linux and Mac systems. Bash can also refer to the specific language used to issue commands to the bash shell.
- Flag: an optional parameter which can be passed to a command-line program to change its behaviour. Flags usually take the form of one or two consecutive `-` characters, followed by a letter or a word, e.g. `--help` or `-i`. Multiple single-letter flags can be combined into a single option to save on typing, e.g. `ls -l -a` can be shortened to `ls -la`.
- SSH: stands for “secure shell”. A program used to connect and log in to a remote, network-connected computer (such as an HPC cluster).
- Operating system: software such as Windows, macOS or Linux which underpins the operation of a computer. Manages software interfacing with the computer’s hardware, and handles starting and scheduling user programs.
- Unix: a family of operating systems with similar design and functionality. Linux and macOS are both Unix systems.
- File system: program which manages the storage, retrieval and organisation of data (files) on disk.
- Directory: An abstraction which allows for multiple files to be grouped together in a single “location”. Often also referred to as a folder.
- Cluster: a computer system consisting of multiple smaller, tightly interlinked computers, which are capable of coordinating to carry out large, computationally intensive calculations in parallel. Often referred to as a supercomputer.
- Node: a singular, self-contained computer, many instances of which are interlinked to form a cluster. A cluster may contain several different types of nodes, such as nodes with large amounts of RAM or attached GPUs.
- Login node: a special type of node which serves as the gateway of a cluster. Users SSH into the login node, which then provides (managed) access to the rest of the cluster.
Throughout this guide I will demonstrate the syntax of commands through the use of dummy arguments - placeholder names which you will need to replace with the actual file, path or argument when running the command. Dummy arguments are denoted by angle brackets (“<” and “>”), so `get <file>` doesn’t mean you should literally type “get <file>”, but rather that you would substitute the name of the target file, to get something like `get output.txt`.
Setup and required software
If you already use Mac or Linux (e.g. Ubuntu), then you’re all set: everything you need to follow this tutorial should already be installed on your computer. If something you need isn’t installed, it will be available on the App Store (for Mac) or your Linux distribution’s package manager (e.g. the Software Centre in Ubuntu).
If you’re using Windows, you’ll need to install some programs before you can get started. First, you’ll need a program which can execute Linux command-line programs. There are three main programs which can do this on Windows:
- Cygwin: a standalone programming environment which includes a terminal emulator and lots of Linux command-line utilities. Cygwin is very easy to install and will include most utilities you’ll need.
- Git for Windows: similar to Cygwin, but focused on source code management. Software Carpentry has a video tutorial detailing how to install and setup Git for Windows.
- Windows Subsystem for Linux (WSL): a full Linux environment, which is bundled with all new Windows 10 installations. WSL is more powerful and fully-featured than Cygwin or Git for Windows and even lets you run programs compiled for Linux from the Windows command line, but it’s a bit trickier to install than the other two tools.
You’ll also need to use SSH to connect and login to the cluster. Fortunately, Windows 10 includes its own SSH program by default. It can be accessed by running `ssh.exe` in any of the programs listed above, or in the Windows command line. More detailed instructions for using SSH can be found in the Logging in section of this guide.
Where to get help
If you get stuck, there are several places you can check for help. Unfortunately there is not much standardisation of documentation between programs, but there are a few “safe bets” to check first.
The first port of call should always be the manuals, referred to as man pages, which come bundled with Linux. These are accessed by the terminal command `man`, followed by the name of the program you want to read about. For example, you can access the manual page for the `less` command by typing `man less` at the command prompt. If you don’t remember the specific name of the command, you can search the man pages with `apropos`. For example, running `apropos editor` will search for any man pages which include the word “editor” in their name or description (in this case, it returns the names of lots of different text editor programs). Bash also has a command called `info` which will read “information documents” about a given command. It is similar to `man`, but some programs may have an `info` page but not a man page.
Sometimes a program won’t come with a man page, but the developers might still provide documentation. Almost all programs follow a similar convention here: adding `--help` or `-h` after the program’s command will almost always make it print a help message (usually including a short description and a list of arguments/parameters it accepts). For example, running the command `less --help` will show a summary of commands that the program `less` accepts, as well as general advice on how to use it. Unfortunately, different programs may use either `--help` or `-h` or both, so you’ll have to try them to find out. These optional parameters which modify commands’ behaviour are referred to as flags.
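To make this concrete, here are the help-related commands from this section in one place, using `less` as the example program:

```
$ man less          # open the full manual page for less
$ apropos editor    # search man page names and descriptions for "editor"
$ less --help       # ask the program itself for a brief usage summary
```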
Finally, managed HPC clusters usually have very good online documentation. The big three you’re likely to use at UQ are UQ’s Research Computing Centre (for Bunya and Wiener), NCI (for Gadi), and the Pawsey Centre (for Setonix). These online resources contain both general information (e.g. compiling and using software) and information specific to using those clusters (e.g. running and checking the status of computational jobs).
Logging in
Before you can use the clusters, you’ll need to apply for an account with the organisation that manages them - your supervisor will be able to tell you which ones you’ll be using and will need to approve your application. The relevant signup forms are:
- RCC (Bunya and Wiener): https://rcc.uq.edu.au/high-performance-computing
- NCI (Gadi): https://my.nci.org.au/mancini/signup/0
- Pawsey (Setonix): https://pawsey.org.au/support/
Once your account is established, the next step is to log on to the cluster over the internet via a program called SSH. This establishes a connection between your computer’s terminal and the cluster’s login node, so that any commands you type into SSH will be sent across the network and executed on the cluster, with the results sent back across the network to be displayed on your computer. Take note that the SSH connection is bound to the terminal window in which you ran the SSH command (either `ssh` on Mac or Linux, or `ssh.exe` on Windows), so any commands you type in other terminal windows (including ones you start after you launch SSH) will be executed on your computer and not on the cluster.

If you’re using Mac or Linux, open a terminal window and run the command `ssh <username@cluster.address>`, where you replace `username` with your username on the cluster (which would have been sent in an email when you signed up) and `cluster.address` with the network address of the cluster - this can be found in the online documentation of the cluster you’re logging in to. The SSH commands for the big-three clusters you’re likely to use are:
| Cluster | Command |
| --- | --- |
| Bunya (RCC) | `ssh username@bunya.rcc.uq.edu.au` |
| Wiener (RCC) | `ssh username@wiener.hpc.dc.uq.edu.au` |
| Gadi (NCI) | `ssh username@gadi.nci.org.au` |
| Setonix (Pawsey) | `ssh username@setonix.pawsey.org.au` |
Your terminal may print a warning about an “unknown server” the first time you connect; type “yes” to continue. The process is almost the same on Windows, except you need to open a command-line window (either `cmd.exe` or Powershell) and run `ssh.exe <username@cluster.address>`.
If there are no errors, you should now see a new prompt (the words and characters just to the left of where commands appear when you type) that looks something like `username@setonix-1:~>` (it will be different for different clusters), which means that any commands you type in this window will be executed on the cluster. Congrats! You’re now using a supercomputer.
Many clusters (including Setonix, Gadi and Bunya) will print a welcome notice when you first SSH in. This notice usually includes information on upcoming maintenance, information about recent changes to the system and a reminder of where to get help. These are usually worth paying attention to, as they’re one of the main ways that the cluster’s maintainers get information to you.
Finally, when you’re finished with the remote session, type `exit` to close the SSH connection and log out. Most of the time, the SSH connection will automatically close if you leave it unattended for too long (where “too long” could be anywhere from an hour to a day), but that can be messy so it’s always best to explicitly log out when you’re finished.
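Putting this together, a complete session might look something like the sketch below; the username `jsmith` is a placeholder, and the exact prompt text and paths will differ between clusters:

```
$ ssh jsmith@setonix.pawsey.org.au        # run on your own computer
jsmith@setonix.pawsey.org.au's password:
jsmith@setonix-1:~> pwd                   # commands now run on the cluster
/home/jsmith
jsmith@setonix-1:~> exit                  # close the connection when done
$                                         # back on your local machine
```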
Getting around - navigating the file system
It’s important to understand how to navigate and use the file system on the cluster. Broadly speaking, data on the computer is stored in files, which are grouped together into a hierarchy of directories (also called folders). This holds true on Windows, Mac and Linux. You may be used to navigating the file system and manipulating files (e.g. copying, renaming or deleting) through a graphical program called a file manager such as Finder on Mac or File Explorer on Windows. Since the clusters you’ll be using do not have a graphical interface, you’ll need to use the command line to manipulate files on the cluster.
There is an extra degree of complexity on HPC clusters, compared to a personal workstation, as files need to be accessible to all nodes on the cluster (not just the one physically attached to the disk). The exact details will vary from machine to machine, but you can usually count on there being two main “locations” to store files:
- Your home directory: this directory is usually persistent storage, so files stored in the home directory (or its sub-folders) are backed up and last until you delete them or your account no longer exists. It usually has small storage capacity, so only essential files that you don’t want to lose should be stored here. Equivalent to “My Documents, My Pictures, etc” on Windows or “Home” on Mac, it is labeled with your username on the cluster.
- Scratch space: this directory usually has a lot of storage capacity, but is usually not backed up, so it’s a good idea to transfer files to home (or off the cluster entirely) if you’ll need them later. What’s more, some systems (including RCC and Pawsey) regularly purge unused files, which is even more reason to transfer them off when you’re done.
All directories have certain permissions, which determine who can access files stored in them. You will always have permission to read from and write to files in your home directory and in your folder in scratch space (which is labeled with your username on the cluster), while you may have read-only permission for files in directories belonging to other members of your group. You will not be permitted to access files belonging to other users not in your group.
On Linux, all directories and files are identified, or “located”, with a file path, in which directories and sub-directories are separated by `/` characters (this is different to Windows, which uses the `\` character). For example, your home directory might have the path `/home/username/`, which indicates that the `username` folder is a sub-folder of the `/home` folder (which holds the home directories of every user on the system). The paths to home and scratch space on clusters can be quite long, so you’ll need to look them up in the cluster’s online documentation. Since the exact location will differ between systems, Linux provides the shortcut `~/`, which will always refer to your home directory, whatever its actual path.
Directories are like “places”, analogous to drawers in a filing cabinet, and you will be “in” exactly one “place” whenever you’re logged in with the shell. This is called the “working directory”. Commands you type will execute “in” this place, for example by reading and writing files, so it’s important to make sure you’re in the right directory before you do anything. Linux provides the shortcut `./`, which refers to the current directory, as well as the shortcut `../`, which refers to the parent of the current directory (i.e. the directory “above” the current working directory, so if you’re in `/home/username/data/run` then `../` will refer to `/home/username/data`). Any other filenames which start with a “.” character are “hidden files”, which are not included in file listings unless requested with specific flags (more on that in a moment).
There are a handful of bash commands which are important to know when navigating the file system (a short example session follows the list):

- `pwd` (stands for “print working directory”): prints the path of the current working directory.
- `cd <dir>` (stands for “change directory”): moves you to the specified directory. For example, `cd ~/` will change the working directory (or “move” you) to your home directory.
- `ls` (short for “list”): prints a list of files in the current directory. You can also type a directory path after `ls` and it will print the names of files in the target directory (instead of the current one). For example, `ls ~/data` will print all files in the `data` directory, a sub-directory of your home directory (assuming it exists). Three important flags to remember are `-l` (“long output”), `-h` (“human-friendly”) and `-a` (“all”).
- `cp <src> <dest>` (short for “copy”): copies a file from one location (the source, abbreviated to `src`) to another (the destination, `dest`). For example, to transfer the file `output.txt` from the current directory to the home directory, you would run `cp output.txt ~/output.txt`. If you do not provide a filename in the destination then the new file will have the same name as the old one, so the previous command could be shortened to `cp output.txt ~/`. Be careful when copying files: if you give a destination filename which already exists, then `cp` will overwrite it without warning you. You can give the `-i` flag to `cp` to tell it to ask for confirmation before overwriting any files (e.g. `cp -i <src> <dest>`). This command is conceptually equivalent to copying and pasting with File Explorer (Windows) or Finder (Mac).
- `mv <src> <dest>` (short for “move”): similar to `cp` except it moves the source file, rather than copying it. This means that after `mv` has completed, the original file will no longer be present - it will have moved completely to the new destination. Like with `cp`, `mv` will not warn you if you’re about to overwrite a file, unless you pass it the `-i` flag. This command is conceptually equivalent to cutting and pasting with File Explorer (Windows) or Finder (Mac).
- `rm <file1> <file2> ...` (short for “remove”): deletes the specified file(s) from the system. Be aware that there is no equivalent to the “Recycle Bin” on HPC clusters, so make doubly sure you want a file gone before you delete anything with `rm`. `rm` will also not remove directories without the `-r` flag, so you must do `rm -r <dir>` to delete a directory.
- `mkdir <dir>` (stands for “make directory”): creates the specified directory, which can be either a relative or absolute path. `mkdir` will fail and print an error if the specified directory already exists.
- `tree`: similar in concept to `ls`, but used for visualising the entire directory structure in a convenient “tree”-like representation. Unlike `ls`, `tree` is *recursive*, which means it will not only display the contents of the target directory (or the current directory if not given any arguments), but also the contents of each sub-directory, and so on until there are no more files to display. The output is formatted like a tree (the abstract structure from graph theory, not a biological tree), with the current directory at the root and each node’s children representing the files and folders it contains.

  For example, say we have a directory called `data`, which contains some files and two sub-folders, `Aggregator` and `Interface`. Running `ls` while in `data` would show:

  ```
  Aggregator  Interface  timing.dat  SSGK.in  template_SSGK.in
  ```

  which tells us what’s in the current directory, but we’d need to run `ls` on all of the sub-folders (`Aggregator` and `Interface`) to find out what’s in them. On the other hand, running `tree` in the same directory would show us:

  ```
  $ tree
  .
  ├── Aggregator
  │   ├── aggregate.awk
  │   ├── Parallel
  │   │   ├── INTCF1_3
  │   │   ├── PROPS
  │   │   ├── PROPS_2
  │   │   ├── TCF1_2
  │   │   └── TCF_3
  │   └── Serial
  │       ├── INTCF1_3
  │       ├── PROPS
  │       ├── PROPS_2
  │       ├── TCF1_2
  │       └── TCF_3
  ├── Interface
  │   └── F77
  │       ├── base.in
  │       └── VELOCITY_TEMP
  ├── timing.dat
  ├── SSGK.in
  └── template_SSGK.in
  ```

  This view provides much more information: we can see that `Aggregator` has two sub-directories, while `Interface` has one, as well as which files are in those sub-directories. While this is a somewhat simplified example, `tree` can be very useful when navigating an unfamiliar directory structure, such as a large, unfamiliar codebase (maybe you need to learn how to use some open-source simulation software) or a complex data-set.
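To tie these commands together, here is a short example session; the directory and file names are hypothetical placeholders:

```
$ pwd                      # where am I?
/home/username
$ mkdir analysis           # make a new directory...
$ cd analysis              # ...and move into it
$ cp ~/results.txt .       # copy a file from the home directory into the current one
$ ls -lah                  # long listing of everything, with human-friendly sizes
$ cd ..                    # go back up to the parent directory
$ rm -r analysis           # delete the directory and everything inside it
```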
Typing out a long filename and path can be tedious and error-prone, so bash provides a shortcut called tab-completion. Pressing tab while part-way through typing a command will automatically fill in the rest of the command: if you type `cd /home/username/da` and hit the “tab” key, the shell will complete the rest of the command to `cd /home/username/data` (it will only fill in the command, not execute it; you still need to hit “enter” to execute the completed command). If there are multiple possible results which a partial command could complete to, then the shell will only fill in the part they have in common. As an example, there are multiple command-line programs which start with the letters “tr”, so if you type `tr` and hit “tab” then bash will not be able to decide which one to complete to, so it will do nothing. Pressing “tab” a second time will print a list of all the possible matches, which might look like:

```
$ tr
tr          traceroute    trap    troff   trust
tracepath   traceroute6   tred    true
tracepath6  tracker       tree    truncate
```
You can then continue typing the command you want, and if you press “tab” again after typing a few more letters then bash will fill in as much as it can again. You can do this as many times as you like when typing out a command, which has the dual benefit of cutting down on the amount of typing you need to do, while also providing reminders of the available directories or commands.
Sidebar for Mac users: tab-completion does work natively on the macOS terminal, but may require some extra configuration to achieve the above behaviour. By default, tab-completion will not list possible matches for an ambiguous completion; it will make an alert sound instead (unless you have disabled system sounds). In order to change this, you will need to edit the file `~/.inputrc` (the easiest way to do this is via the TextEdit program), and add the following two lines:

```
set show-all-if-ambiguous on
TAB: menu-complete
```
Finally, even though Linux imposes very few technical restrictions on what name a file or directory can have, there are still some “best practices” which will make your life much easier. First, spaces in filenames can be extremely annoying to deal with on the command line: since bash uses spaces to determine when a new command or flag has started, spaces in filenames need special treatment to work properly. Let’s say you have a file called `calculation results.txt`. If you were to type `rm calculation results.txt`, bash would interpret the space to mean that you’re actually referring to two files called `calculation` and `results.txt`, neither of which is the original file. Instead, you need to escape the space in the name by prefixing it with a backslash (“\”): `rm calculation\ results.txt`. You’ll need to do this for every space in a filename, as well as for other special characters, like asterisks (“*”), brackets (“(” and “)”) and ampersands (“&”) (a complete list of all special characters in bash can be found here: http://mywiki.wooledge.org/BashGuide/SpecialCharacters).

It’s best to avoid using spaces and other special characters entirely, since constantly escaping characters in a filename is tedious and error-prone. So what should you do if you want to make a filename that contains multiple words? The best option is to use either a hyphen (“-”) or underscore (“_”) where you would usually use a space to separate words, so our example file would become either `calculation-results.txt` or `calculation_results.txt`. Either option is fine, but it’s best to pick one and stick with it, since using a consistent naming scheme makes it easier to remember and search through files.
Where to store your files on the cluster
So now that you know how to navigate the file-system, you may be wondering “where and how should I store all my files?”. Like we saw before, HPC clusters tend to have very specifically structured file-systems, so it’s important to make sure that you’re using them as intended to get the most out of the system.
As with everything in this guide, different clusters will have different guidelines for file-system access, so it’s a good idea to at least skim the manual pages. For RCC, the relevant page is http://www2.rcc.uq.edu.au/hpc/guides/index.html?secure/Storage_userguide.html (must log in with your UQ credentials to access it), for NCI it is https://opus.nci.org.au/display/Help/3.+Storage+and+Data+Management, while for Pawsey it is https://support.pawsey.org.au/documentation/display/US/File+Systems%2C+File+Transfers+and+File+Management (which even comes with a nice video tutorial!). All systems use the same general principles, however.
Generally speaking, it’s best not to run calculations inside your home directory: if your calculation generates lots of temporary files then it could overwhelm the filesystem and make it unresponsive for all users. Not only is this bad for your code’s performance, you’ll probably get a cranky email from the system administrator telling you to knock it off (which is never fun). Instead, you should run jobs in a temporary directory on the scratch space. Scratch filesystems are usually designed to handle lots of activity without slowing under the load, while also having much more storage available for use than your home directory will. On Gadi and Setonix, the scratch directory has the path `/scratch/<PROJECT>/<username>`, where you replace `<PROJECT>` with the project code your research group is using (ask your supervisor for this if you’re unsure) and `<username>` with the username you use to log in to Setonix or Gadi. The RCC systems instead have `/30days/<username>` and `/90days/<username>` (where `<username>` is your UQ username). As the names might suggest, `/30days` is cleared out every 30 days, while `/90days` is cleared out every 90 days.
Once you’ve generated your data, you should move the important files to your home directory in one go. Remember that scratch space is usually not backed-up and is regularly cleared out on Pawsey and RCC systems, so it’s crucial that you move any important results or data to your home directory for safe keeping.
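A typical workflow might therefore look like the sketch below; the project code `abc123`, the username and the file names are all hypothetical placeholders:

```
$ cd /scratch/abc123/jsmith            # work in scratch space, not your home directory
$ mkdir run_01 && cd run_01            # one directory per calculation keeps things tidy
# ...submit and run your job here (see the HPC clusters section below)...
$ cp results.txt ~/project_data/       # when finished, copy only the files worth keeping back home
```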
Reading and editing text
Reading and editing documents via the Linux command line is not too dissimilar to doing so via a
graphical interface. The biggest difference is that almost everything you’ll use to run calculations and
analyse the output will be plain text documents (which universally have the .txt
filename suffix), so
you won’t be able to use a word processor like MS Word or Mac Pages to edit them (they do too
much automatic formatting, spell-checking, save in the wrong file format, etc). But even on the command
line, there are a range of powerful, easy-to-use programs for editing text available on almost clusters.
First, let’s talk about viewing the contents of text files. Sometimes you don’t necessarily want to edit a file, but need to know what’s in it, and often you’ll need to see the contents at a specific line or lines. There are four main command line tools to do this:
- `less`: a type of program known as a pager, which allows you to display and scroll back and forth through the contents of a file. `less` only reads the parts of the file that it currently needs, so it can be much faster than a text editor when you need to view the contents of extremely large text files (i.e. > 100,000 lines, not unusual for simulation output).
- `cat` (short for “concatenate”): prints the entire contents of a file or files to the terminal window, without the ability to scroll back and forth. If multiple file names are passed to `cat`, then it will print their contents one after the other, essentially joining, or concatenating, the output (note that this does not change the contents of any of the input files). For example, if the file `file1.txt` contains the line `foo` and `file2.txt` contains the line `bar`, then running `cat` would give:

  ```
  $ cat file1.txt file2.txt
  foo
  bar
  ```

  which is useful for tasks like combining multiple data files into a single text file.
- `head` and `tail`: these are mentioned together, as they perform very similar roles. `head` prints the first 10 lines of a file to the terminal, while `tail` prints the last 10 lines. These default values can be overridden via the `-n <num>` command-line flag, so to print the first 20 lines of `data.txt` you would run `head -n 20 data.txt`.
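For instance, a quick way to inspect a (hypothetical) simulation output file without opening an editor might be:

```
$ head -n 5 output.txt     # peek at the header lines
$ tail -n 20 output.txt    # check how the run finished
$ less output.txt          # scroll through the whole file (press q to quit)
```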
Now, for actually editing text files, there are three major text editors available on almost every Linux system:
- nano: a simple, no-frills text editor. It has no special features - it just opens, writes and saves text files. Nano is very easy to use, and is the closest equivalent to a “notepad”-type application on the Linux command line. Setonix requires you to load the `nano` module before you can use it (via `module load nano`).
- vim: a more fully-featured and customisable text editor with loads of special features like syntax-highlighting and the ability to define custom macros. The flip-side of this extra flexibility is that it has a somewhat steep learning curve. Vim is available on every Linux system you are likely to encounter, with no special effort required (just run `vim` in the command line).
- emacs: a text editor which is similarly full-featured and customisable as vim, albeit with a completely different user interface. Emacs also has a steep learning curve, but can be very powerful once you learn it. Again, emacs will be available on every Linux system you are likely to encounter.
If you want to edit text files on your personal computer before transferring them to the cluster, some useful open-source graphical applications are:
- Atom (Windows, Mac, Linux)
- Geany (Windows, Mac, Linux)
- notepad++ (Windows only)
- Kate (Windows, Mac, Linux)
All of the above applications are free and open-source and support “advanced” features like syntax-highlighting for programming languages, autosaving and tabbed editing. The default text editors on Windows and Mac (Notepad and TextEdit, respectively) can also be used, but are extremely barebones and lack nice usability features. If you’re going to be writing code, then it may be worthwhile using a full integrated development environment (IDE) as well. An IDE will do syntax highlighting and automatic code correctness checks, and most have integrated debugging and source code management tools which make the development process much easier. Microsoft Visual Studio Code (or the full Visual Studio on Windows) is a free IDE for Windows, Mac and Linux which will be sufficient for most code development you’re likely to do.
Output redirection and pipes
There are two very important, Unix-specific ways of manipulating text that have no clear analogue in graphical applications: output redirection and pipes. These concepts are key to using the command line effectively, and are best explained with specific examples.
When you run a command in the shell, it will usually print its output to the active terminal window; this is referred to as printing to standard output, or `stdout` for short. Sometimes a program will need to print error messages, which is referred to as printing to standard error, or `stderr`. Although `stdout` and `stderr` both print to the terminal window by default, it is possible to save one or both to a file, or to use the output of one program as the input for another program - this is referred to as output redirection. In bash, output redirection is represented by the syntax `prog > file`, which says that the standard output from `prog` will be saved to `file` instead of printed to the terminal. A command’s standard error can be redirected with a similar syntax: `prog 2> file`. Finally, an important warning: redirecting output to an existing file with `>` will completely overwrite its contents, which may not be what you intended to do. If you want to append the output of a command to a file (i.e. preserving the original contents), such as when keeping a log of some simulation run, you must instead use `>>` (e.g. `prog >> file`).
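Putting these three forms side by side (here `./prog` and the file names are just placeholders for whatever program you are running):

```
$ ./prog > results.txt     # stdout saved to results.txt, overwriting any existing contents
$ ./prog 2> errors.txt     # stderr saved to errors.txt
$ ./prog >> run.log        # stdout appended to the end of run.log, keeping what was already there
```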
Output redirection is wonderful for saving the results of programs for later use (it saves you manually copy-pasting the output of a simulation once it’s done), but is limited to saving output to a file. If we want to do something fancier, we can use a related concept called pipelines. Bash (and other shells) use the `|` character to indicate that the output of the preceding program should be redirected, or piped, into the input of the next. For example, if we want to pipe the output of `prog1` into `prog2` we would type `prog1 | prog2`. Multiple pipes can be used in a single command, and programs in a pipeline are run concurrently - `prog1` and `prog2` are started at the same time, and `prog2` processes data from `prog1` as soon as it becomes available. This means that combining programs in a pipeline is both more flexible than writing a single large program to do everything, and also faster, since it automatically exploits some of the available parallelism in the overall task.
This has all been very abstract, so let’s look at some concrete examples. A very common use for pipes (probably the one that I use most in terms of sheer frequency) is piping very large output streams to `less`, to make them easier to read and scroll through. As we saw in Getting around - navigating the file system, the command `tree` can produce a lot of output if run in a deeply nested directory. It’s much easier to read the output if we pipe it through `less` by doing `tree | less`. Similarly, if we only wanted to see the first few lines of output, we could do `tree | head`.
There’s more to pipes than just making output easier to read, though. Linux has a wide range of little utilities which each do one task, and are designed to be slotted into pipelines. To borrow a metaphor from the early days of Unix [1], pipelines are a way “of coupling programs like garden hose - screw in another segment when it becomes necessary to massage data in another way”. Essentially, pipelines allow us to do ad-hoc data analysis in the shell, without having to write our own tools from scratch in Python or Fortran.
For example, the `grep` command searches through a stream of text (either the contents of a file or the output of a command) and prints all lines containing a specific pattern (e.g. a string). Let’s say we want to search through the output of some program for the string “CH4” - we can either save the output to a file and search that file:

```
$ ./prog > output.txt
$ grep "CH4" output.txt
```

or we can compress these two steps into one command by piping the output into `grep`:

```
$ ./prog | grep "CH4"
```

which is easier to read, and will be much faster since `grep` will print the lines as they are produced, rather than having to wait for the program to finish. Furthermore, if we wanted to then save the first 15 matching lines to a file, we could extend our pipeline with `head` and output redirection:

```
$ ./prog | grep "CH4" | head -n 15 > output.txt
```
Pipes are one of the most important concepts covered in this document, and using them effectively is key to getting the most out of the Linux command-line.
Automate common tasks - command-line scripting
In addition to typing in commands and receiving the responses one at a time (so-called interactive use), bash supports the ability to write short programs called shell scripts which contain a sequence of commands to be executed automatically. Since this provides the ability to group a number of commands together and execute them in a “batch”, they are also often called batch scripts (which is the most common terminology on Windows).
A shell script is just a plain text file which contains some commands for bash to execute, and has functionally the same syntax as typing directly into the command line. Before it can be executed, though, the operating system (Linux) needs to know what type of script is contained in a text file. This is achieved with a construct known as a “shebang”: the characters `#!` followed immediately by the path to the shell program. On most systems this will be `/bin/bash`, so your shell scripts must start with the line `#!/bin/bash`. The shebang must be the first line in the script, otherwise it will not work. Additionally, the convention is to use the `.sh` suffix for shell script files (e.g. `script.sh`), but this is only to make it easier to remember what different files do - the Linux operating system does not care what you call the file as long as it has the shebang in the right place.
Any other lines starting with a “#” are ignored by the shell and do not affect execution of the script; they are comments which serve to document and explain what the script is doing. As with all programming, it is a good idea to write a comment whenever you’re doing something which may not be immediately obvious to somebody unfamiliar with the script (this could be one of your co-workers, or it could be you in six months time).
The rest of the script can contain any number of commands, which will be executed in sequence (i.e. in the same order as they appear in the file) and will have the same results as if you had entered the commands yourself: they will print output, modify files and, for certain commands, ask for confirmation before proceeding.
Once you’ve written your script, you need to tell the operating system that it’s a program to be run, rather than just a static text file. This is referred to as “marking it as executable” or “giving it executable permissions”, and is achieved by the command `chmod +x <script>` (`chmod` stands for “change mode” and the `+x` stands for “execute”). Once the script is executable, you can run it by typing out the name and path of the script. If you’re in the same directory as the script (the most common case), this can be shortened by using the `./` shortcut, so if we have a script called `script.sh` then we can execute it by typing `./script.sh`.
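As an illustration, here is a minimal (hypothetical) script which uses the pipeline tools from the previous section to pull temperature data out of a simulation output file; the file names are placeholders:

```bash
#!/bin/bash
# extract_temps.sh - save the first 100 "Temperature" lines from a simulation output

# lines starting with "#" (other than the shebang) are comments and are ignored by bash
grep "Temperature" output.txt | head -n 100 > temperatures.txt

echo "Finished extracting temperatures."
```

Save this as `extract_temps.sh`, mark it as executable with `chmod +x extract_temps.sh`, then run it with `./extract_temps.sh`.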
Shell-scripting has some very advanced features on top of just running commands (far too many to fit in this introduction), but the most important use-case for scripts when you’re starting out is to automate common work-flows and pipelines. Typing the same command (or set of commands) out multiple times in a row is tedious and error-prone (typos, accidentally using the wrong flag, etc), so if you find yourself using the same commands or pipelines more than three times then it’s a good idea to transform them into a script. Saving complicated pipelines or sequences of commands in a script also makes it easier to remember them later on - rather than needing to memorise the exact sequence of commands used to generate a file, you just need to look at the contents of the script.
Shell scripting is also important because it is the way we submit and run computational jobs on HPC clusters.
HPC clusters - the basics of submitting jobs
HPC clusters can have hundreds or thousands of users sharing the same set of resources, so they use software called a job scheduler to ensure everyone gets fair access to the cluster. In order to run simulations on the cluster, you need to create a job script, which is a program (written in the bash scripting language) that tells the scheduler what sort of job you’d like to run: how many CPUs you need, how much memory you think you’ll need, how long you think it’ll take, and what programs to run. The job scheduler then uses all of your requests to calculate a priority and places your job in a queue (which could contain dozens or hundreds of jobs from other users). When it’s your job’s turn, and there are enough free resources, the job scheduler will run your script. You can submit multiple jobs to the queue at the same time, potentially requiring different sets of resources for each job, and the scheduler will handle the queueing and running automatically.
Making a job script from scratch can be a little bit fiddly, and each cluster has its own way of handling jobs, with RCC and NCI using software called PBS Pro, while Pawsey uses SLURM as its job scheduler. Fortunately the major clusters have online documentation with example job scripts you can modify to suit your needs:
- RCC: Documentation + example script
- NCI: Gadi Jobs
- Pawsey: Job scheduling, example scripts
The above documentation also covers the special commands you’ll need to use to submit, cancel or check the status of compute jobs, which will again be different between clusters.
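To give a feel for what a job script looks like, below is a minimal SLURM-style sketch (the scheduler used on Setonix). The directive values, the project code `abc123` and the module/program names are all assumptions for illustration - always start from the example scripts in your cluster’s documentation, and note that PBS Pro scripts use different `#PBS` directives.

```bash
#!/bin/bash
#SBATCH --job-name=my_simulation     # name shown in the queue
#SBATCH --account=abc123             # hypothetical project code - use your own
#SBATCH --nodes=1                    # number of nodes to request
#SBATCH --ntasks=16                  # number of CPU cores (MPI tasks)
#SBATCH --time=02:00:00              # wall-time limit (hh:mm:ss)

# load whatever software the job needs (see the Software module system section)
module load namd/2.14

# run the actual calculation on the allocated cores
srun namd2 input.namd > output.log
```

With SLURM, a script like this is submitted with `sbatch job.sh` and its status checked with `squeue -u $USER`; the PBS Pro equivalents (`qsub`, `qstat`) are covered in the RCC and NCI documentation linked above.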
RCC systems allow you to run as many jobs as you like (as long as you don’t overwhelm the cluster), but computational jobs on the NCI or Pawsey systems are billed against your project’s allocation. Each project is given a budget of a certain number of service units (SUs), which represent the amount of resources and time a project is allowed to use in a quarter (i.e. three months), and are shared between all members of a project. Whenever anyone in the project runs a job, an amount of service units is deducted from the project’s budget, which can be checked by running `nci_account` on Gadi or `pawseyAccountBalance` on Setonix.
Regardless of the underlying job scheduler, you should never run large computational jobs such as simulations or compiling large codebases on the login nodes. Anything which may take more than a few minutes or use more than a few GB of RAM should be run as a compute job via the scheduler. The login nodes are shared between all users and do not have very much computational power, so running a large job on the login node will slow down or even crash other users’ sessions. This is a surefire way to get a cranky email from the system administrators, and may even result in your account being suspended. Don’t run on the login node; use the scheduler.
Software module system
Users of an HPC system often require specific versions of software for their workflows, while some software packages clash and cannot be used at the same time (for example, a particular program might compile with version X of a compiler, but not version X+1).
Instead of making every version of every program available to all users at all times, clusters instead use modules to allow you to pick and choose which software to use at a given time. Typically, a cluster will make a very limited set of essential programs available by default, with more specialised software such as compilers or simulation software available as optional modules.
To load a module (and thus make the program it contains available to use), run the command `module load <module_name>`, where `module_name` is the name of the module file to load. A module’s name usually has the form `<program>/<version>`, so to load version 2.14 of the program NAMD, you would do `module load namd/2.14`. To unload a module file, run `module unload <module_name>`, while `module swap <module1> <module2>` will unload `module1` and load `module2` in its place.

To list the modules available on the system, run `module avail`, while running `module avail <module>` will list all available versions of a specific module. If you’re not sure which module contains the program you’re interested in (or if the program is even installed), you can run `module search <query>`, which will search for any modules whose name or description matches a given query. Alternatively, you may want to only list the modules which you have loaded, which can be achieved by running `module list`.

Sometimes it’s not obvious what a module actually does or what software it provides. The command `module show <module_name>` displays information about a given module, including the programs and libraries it provides.
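A typical interaction with the module system might therefore look like the following (the NAMD module and version are just the example used above; the modules actually installed will differ between clusters):

```
$ module avail namd          # which versions of NAMD are installed?
$ module load namd/2.14      # make that version available in this session
$ module list                # confirm what is currently loaded
$ module show namd/2.14      # see what the module actually sets up
$ module unload namd/2.14    # remove it again when finished
```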
Transferring files to and from remote servers
At some point, you’ll want to transfer files between the cluster and your computer. The easiest way to do this is through an SFTP (Secure File Transfer Protocol) program running on your computer (not on the cluster), which starts a connection between your computer and the cluster and allows you to interactively select files to transfer back and forth.
There are both command-line and graphical SFTP programs; both are fine and it’s up to you which you prefer to use. For Mac, the easiest to use graphical SFTP program is Cyberduck, while for Windows the easiest is WinSCP. There is no graphical SFTP program which is available on all Linux distributions, but if you’re using Ubuntu then a good option is gFTP, which should be available in the Software Centre.
For all of the above programs, you’ll need to make a new connection to the cluster before you can start using them: in Cyberduck this is achieved by clicking the “New Connection” button in the main window, while WinSCP will automatically launch a wizard to do this when you start it. Make sure to select “SFTP” as the “File Protocol”, then enter the cluster’s address under “Host Name”, then enter your username and password. For example, if your username is `jsmith` and you want to transfer files to or from Setonix, you would enter `setonix.pawsey.org.au` as the Host Name and `jsmith` as the username.
The command-line program `sftp` is almost universally available, and only slightly less user-friendly than the graphical programs. Mac and Windows both have command-line clients which function the same way; on Mac the command is `sftp`, while on Windows it is invoked as `sftp.exe`. Connecting to a server is very similar to SSH - run the command `sftp <username@cluster.address>` in a new terminal window (or `sftp.exe <username@cluster.address>` if using Windows) and enter your password when prompted. You can move around and explore the cluster’s file system with the same commands you’d use in an SSH session (`cd`, `ls`, etc), and can navigate your personal (or local) computer’s file system by using the same commands prefaced with an “l” (`lcd`, `lls`, etc). To transfer files from the cluster to your computer, run `get <file>`. To transfer files from your computer to the cluster, run `put <file>`. In both cases, `<file>` can either be a bare filename, in which case `sftp` will transfer from the current working directory (on either the cluster or your computer, depending on whether you are using `put` or `get`), or you can specify the full path to the file. `sftp` also supports tab-completion for both commands and file names/paths.
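A short `sftp` session might look like the following sketch (the username, project code and file names are placeholders):

```
$ sftp jsmith@setonix.pawsey.org.au
sftp> cd /scratch/abc123/jsmith       # move around the cluster's file system
sftp> lcd ~/Documents/results         # move around your local file system
sftp> get output.txt                  # download: cluster -> your computer
sftp> put new_input.in                # upload: your computer -> cluster
sftp> exit
```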
Some useful commands
Finally, here are some useful commands, tips and tricks that didn’t quite fit elsewhere in this guide:
- Wildcards: wildcards are a useful shell construct which allows you to access or manipulate multiple files at once through pattern matching. Wildcards are represented by certain special characters, the most common of which are `*`, which matches zero or more of any character, and `?`, which matches exactly one character. This is best demonstrated through examples.

  Let’s say we wanted to remove all log files in the current directory, which are created by programs to keep details of their execution; they’re useful for debugging, but we may not need them once we’ve generated the data. Log files generally end with the suffix `.log`; examples might be `CH4.log`, `C2H6.log` and so on. Instead of manually typing in all file names like `rm CH4.log C2H6.log C3H8.log`, we can use a wildcard, `*.log`, which will match all files whose name ends in the characters `.log`. So the command `rm *.log` tells the shell to enumerate all files which end in `.log` and pass them to the command `rm`. We aren’t just limited to whole words, either: if we had a set of log files whose names have the form `<year>-<month>-<day>.log` (for example, the date on which they were generated), we could list all files from 2020 by doing `ls 2020-??-??.log` (remember that `?` matches exactly one character, so files like `2020-full.log` won’t be affected by this).

  Wildcards can be used for any command which requires a filename or path, and can greatly reduce the amount of typing needed to select large numbers of files for a particular operation. The only caveat is that the shell won’t warn you if you write a wildcard which catches files you didn’t intend it to, so there is always a danger of accidentally deleting important files. Consequently, if you’re using wildcards for a destructive operation (like `rm`-ing files), it’s a good idea to test which files the wildcard matches with `ls` and check that it’s what you expected (i.e. before you do `rm *.out*`, run `ls *.out*` and check that you haven’t caught anything you don’t want to delete).
- `find`: as its name might suggest, `find` finds files in the filesystem (try saying that five times fast). The syntax of `find` is somewhat intricate, but the basic usage requires you to specify a starting directory and the criteria to search for; `find` then searches recursively “down” from the starting point and prints any files which match the search criteria. The recursive searching is what separates it from a tool like `ls`, which only lists files in a single directory. For example, to find all files in the current directory and its children with the suffix `.log` you would type `find . -name "*.log"`: `.` is the target directory, `-name` tells `find` to match filenames, and `*.log` is a wildcard pattern which matches any file name ending in “.log”.
- `grep`: searches the contents of a file (or files) and prints all lines containing a particular pattern. Think of it like a command-line version of using `ctrl+f` to search a document in a word processor. In its most basic usage, `grep` accepts a pattern to search for and a list of files to search through (which could also be a wildcard), so to search for the string “Temperature” in the file “output.txt” you would do `grep Temperature output.txt`. By default, `grep` only searches for exact matches to the specified pattern and is case-sensitive, so `grep Temperature <file>` will not match the string “temperature”. `grep` supports a plethora of command-line options, so be sure to check out the man page if you need to do something complicated (do `man grep`), but some useful ones are:
  - `-n`: print the line number at which each match occurs, as well as the match.
  - `-i`: ignore case when matching strings, so “Temperature” and “temperature” would both match, for example.
  - `-v`: invert the match, so only lines not containing the specified pattern will be printed.
  - `-A <num>`, `-B <num>`: print `<num>` lines After or Before each matching line. By default, `grep` will only print the lines containing matches; with these options it also prints the surrounding lines (after or before each match) to give context for the matches.
  - `-o`: only print the matches themselves, not the lines containing them.
  - `-w`: only print matches which form a whole word, rather than matching all substrings. For example, `grep -w bash` only matches the complete word “bash” and would not match “bashrc”.
  - `-x` (or `--line-regexp`): like `-w`, but only print matches which span a whole line.

  In addition to plain text, `grep` also supports matching against regular expressions (often shortened to regex), which are like wildcards, but with much richer functionality. There is not enough space in this guide to explain regular expressions in any level of depth; check out this guide if you’re interested.
- `tee <file>`: prints its input to both standard output and the specified file. `tee` is almost always used as part of a pipeline where you want to see the output of a command and also save it to a file for later, e.g. `./prog | tee output.txt`.
- `echo <string>`: prints a string to the terminal. For example, `echo "Hello, World!"` prints “Hello, World!” to the terminal. This is mostly useful for printing status updates from a script (e.g. `echo "Removing log files..."`).
- `wc <file>` (short for “word count”): counts the lines, words and characters in a file (or group of files) and prints the counts to the terminal. By default, `wc` outputs one line per file of the form `<num_lines> <num_words> <num_characters> <filename>`, with an extra line for the total counts if more than one file is processed. The output can be restricted by passing `-l` (only count lines), `-w` (only count words) or `-m` (only count characters). If no files are specified, `wc` will operate on standard input (i.e. words typed in manually), and so it can be used in pipelines.
- `sort`: sorts a stream of text data supplied either from standard input or a file. `sort` can handle alphabetical, numerical (via the `-n` flag) or general floating point (`-g`) data, and it is possible to select which column to sort by via the `-k` flag. The column separator can be set via the `--field-separator=<sep>` option for multi-column input like a CSV file (where `<sep>` would be “,”).
- Bash variables: it’s possible to declare a variable in bash via the syntax `<variable>=<value>`, where `<value>` is any arbitrary string. Variables can be used to store and manipulate values to be used later in a script or interactive session. Their names can contain letters, underscores and numbers (importantly, not hyphens or asterisks), but cannot start with a number.

  There are two important notes on the syntax of variable assignment. First, the lack of space between the variable name and the “=” is significant, since bash will interpret the variable name as a command (and probably fail, since it does not exist yet) if there is a space (i.e. `variable = value`). Second, even though you can enter anything you like, bash will treat everything that’s not a command as a string, so numerical values have no special significance to bash (they may have significance when passed to certain commands like `sort` which do handle numerical values). By convention, shell variable names are written in UPPER CASE, although this is not a technical requirement.

  Variables are referenced by prepending their name with a “$”, e.g. `$variable`, which will cause bash to substitute the variable’s value in place of its reference. For example, if we have a variable `OUTPUT_FILE=output.txt` we can redirect a program’s output to it by `./prog > $OUTPUT_FILE`. You can also use the value of a variable inside a string, such as a filename, by enclosing the variable name in curly braces “${” and “}” (you still need the “$” in front), so if you wanted to make a filename based on the value of the variable `RUN_NUMBER` you could type something like `./prog > ${RUN_NUMBER}.txt`.

  Additionally, many programs read shell variables to determine their run-time behaviour. For example, programs using the OpenMP parallel programming framework (including LAMMPS and VASP) check the value of the variable `OMP_NUM_THREADS` to determine how many CPU cores to use in the calculation. In order for the value of a variable to be visible to any programs you run, it needs to be exported, which is most easily achieved by putting the keyword `export` in front of its declaration; instead of writing `OMP_NUM_THREADS=8` you would write `export OMP_NUM_THREADS=8`.
- `~/.bashrc` and aliases: you can configure bash and have it store variables and settings across sessions by modifying a file named `~/.bashrc`. Bash will execute all commands in `~/.bashrc` (including setting and exporting variables) at the start of each new shell session (e.g. logging in via SSH). There will be a specially-tuned default `~/.bashrc` file on the clusters, so it’s best to add new commands and configuration options after any pre-existing lines.

  There are a few general-purpose things you can add to your `~/.bashrc` to tailor the shell to your needs. The biggest of these are aliases, which allow you to create your own shortcuts for complicated shell commands. Aliases take the form `alias <new>="<old>"`, where `<new>` is the shortcut you wish to create and `<old>` is the original command (`<old>` must be enclosed in quotes). For example, if you regularly use `grep` with the `-n` option, you could add `alias grepn="grep -n"` to your `~/.bashrc`, so whenever you run `grepn` the shell will automatically execute `grep -n` instead. Aliases must be stored in `~/.bashrc` if you want to keep them active for all bash sessions; it’s possible to interactively define aliases while using bash, but they will disappear when you exit the shell and won’t be available in other terminal windows if you don’t save them.

  Finally, once you’ve made changes to your `.bashrc`, you’ll need to either reload bash (by starting a new terminal window, for example) or run `source ~/.bashrc` for the changes to take effect.
Easier SSH and SFTP logins for Mac and Linux: to save having to memorise and type out the username+address combination for every cluster you use, it’s useful to save the various addresses in the
.bashrc
on your personal computer. The best way to do this is by adding a line exporting the username+address as an environment variable, such asexport bunya="username@bunya.rcc.uq.edu.au"
(note the quotation marks). This would then allow you to log in to Bunya by doingssh $bunya
and to get files by doingsftp $bunya
. It’s a good idea to do this for all clusters you have access to, with each cluster getting its own variable on its own line in the.bashrc
file.
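To tie these pieces together, here is a minimal sketch of the kind of lines you might append to the end of your ~/.bashrc. The thread count, alias and cluster username are illustrative placeholders, so substitute your own values.

```bash
# --- personal additions (keep below any pre-existing lines) ---

# Exported so that it is visible to the programs we run (e.g. OpenMP codes):
export OMP_NUM_THREADS=8

# Shortcut for a frequently used command (the original command must be quoted):
alias grepn="grep -n"

# Cluster login shortcut, used as: ssh $bunya  or  sftp $bunya
export bunya="username@bunya.rcc.uq.edu.au"
```

Remember that these lines only take effect in new shell sessions, or after running source ~/.bashrc in the current one.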
Finding executables in non-standard directories
In order to execute programs, bash needs to know the directory in which their executable file is stored.
One way to do this is by supplying the path (full or relative) when executing the command, such as ~/Code/lammps/bin/lmp args, but what if you don't want to type out the full path every time you run the command?
Whenever you run a command without providing the full path, bash searches through a list of pre-defined
directories for an executable with that name and executes the first match it finds. This list is stored
in the environment variable PATH
, and takes the form of a list of directories (which must be absolute
paths) separated by colons (“:”). The best way to add new directories for bash to search when executing
commands is to add a line to your .bashrc
prepending the new directories to the existing list:
export PATH=/new/path/to/exe:$PATH
This command says to set the new value of PATH
to the new directory plus the rest of the existing
PATH
list. We can’t just set PATH=/new/path/to/exe
, since that will override the default set of
directories, so bash will no longer be able to find important system commands like ls
, making your
shell unusable.
Returning to our original example, if we want to be able to run LAMMPS by just typing lmp, we can add its installation directory to the list of directories bash will automatically search through by adding the following line to .bashrc:
export PATH=/home/user/Code/lammps/bin:$PATH
Note that we add the directory containing the executable, not the path to the executable file itself.
This can be repeated as many times as needed for as many programs as you like.
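As a quick sanity check (a minimal sketch, reusing the hypothetical LAMMPS path from above), you can ask bash where it now finds the command:

```bash
source ~/.bashrc   # reload the updated configuration (or open a new terminal)
echo $PATH         # the new directory should appear at the front of the list
which lmp          # should print /home/user/Code/lammps/bin/lmp
```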
Test your knowledge
One of the greatest demonstrations of the power of the UNIX shell came in the form of duelling magazine columns between two prominent computer scientists in the 1980s. A problem was posed to Don Knuth (a founder of the field of academic computer science) and Doug McIlroy (one of the original developers of the UNIX operating system): read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies. Knuth developed an incredibly intricate program, over ten pages long, from scratch, while McIlroy solved the same problem with a six-command shell pipeline. The solution and McIlroy's explanation are reproduced below:
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
If you are not a UNIX adept, you may need a little explanation, but not much, to understand this pipeline of processes. The plan is easy:
1) Make one-word lines by transliterating the complement (-c) of the alphabet into newlines (note the quoted newline), and squeezing out (-s) multiple newlines.
2) Transliterate upper case to lower case.
3) Sort to bring identical words together.
4) Replace each run of duplicate words with a single representative and include a count (-c).
5) Sort in reverse (-r) numeric (-n) order.
6) Pass through a stream editor; quit (q) after printing the number of lines designated by the script’s first parameter (${1}).
Playing with this pipeline is a great way to internalise the essential logic of the command line. What follows is a small set of guided problems, based on the above pipeline, to help you cement your understanding.
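Before diving into the guided problems, you can get a feel for the pipeline by feeding it a throwaway sentence. A sketch along these lines is a good first experiment (the ordering of words with equal counts may differ between systems):

```bash
printf 'The cat sat on the mat and the dog sat too\n' |
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 3q
```

With this input, "the" (3 occurrences) and "sat" (2) should top the list, followed by one of the words that appear once.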
Generating the input text
Before we can use our pipeline, we need to generate a set of input text to test it on. Let’s use the
manual/information page for bash. First, try running info bash
in the command line.
Q) Why will this output not work in our pipeline?
A) Because it is piping the output through a pager (on my computer it uses the antiquated more
, which
is like less
but with fewer features), not stdout
. Since the output is already being consumed by a
program, putting info bash
in a pipeline as is will not do us any good, since there will be nothing
for the next program to consume.
We will therefore need to tell info
to print to standard output, without piping through more
. We
also need to tell it to print every “bash” info page at once, rather than paging through them. This
is achieved by passing the -a
and -o -
options to info. Try running info bash -a -o -
on your
command line and observe the difference.
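Putting the two pieces together (a sketch; the cut-off of 10 is arbitrary), the complete pipeline applied to the bash info pages looks something like this:

```bash
info bash -a -o - | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q
```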
Transliterating
The next command in our pipeline is tr
, which stands for “translate” or “transliterate”. It’s used to
essentially swap certain characters, or sets of characters, with another set of characters, such as when
translating some file to all upper- or lower-case. In this case, we want to use tr
twice.
The first invocation of tr replaces every non-alphabetical character with a newline - a special character which inserts a line-break into the output (equivalent to pressing the "enter" key in a text editor or word-processor). This has the effect of separating out words and printing them on separate lines (since whitespace is non-alphabetical), as well as removing any numbers or special characters (like "-" or ":").
The second invocation of tr
replaces all upper-case characters (A-Z
) with their lower-case
equivalents (a-z
), to ensure that words are not double-counted due to capitalisation (e.g.
“Interpreted” and “interpreted” should be counted as a single word).
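To see the two invocations in isolation (a tiny sketch with a made-up input string), try something like:

```bash
echo "Bash 5.2 -- the GNU shell:" | tr -cs A-Za-z '\n'
# Bash
# the
# GNU
# shell
echo "Bash 5.2 -- the GNU shell:" | tr -cs A-Za-z '\n' | tr A-Z a-z
# bash
# the
# gnu
# shell
```

The digits, punctuation and whitespace all collapse into single newlines, and the second tr lower-cases whatever remains.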
Putting it all in a script
Now we’re ready to put our pipeline into a shell script to make it easier to use.
Q) Use the information in the “Automate common tasks” section to make an executable shell script to
hold your modified pipeline. Give it a descriptive name like word_histogram.sh
.
Now, you may notice that there’s one part of the pipeline that we haven’t discussed: sed ${1}q
. First,
sed <num>q
prints the first <num>
lines of its text input. It’s functionally the same as head
, but
McIlroy used it in his solution because head
hadn’t been written yet! You can replace sed ${1}q
with head -n ${1}
if you like.²
Now let’s talk about ${1}
. Numbered shell variables store the arguments which were passed to the shell
script when launched. The first argument is stored in $1
, the second is stored in $2
and so on, so
if we call our script with word_histogram.sh 10
, then ${1}
will have the value of 10
, and sed
will print the first 10 lines of the sorted list.
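If you get stuck, one possible version of the script is sketched below. The filename and the use of head -n instead of sed are choices on my part, not the only correct answer.

```bash
#!/bin/bash
# word_histogram.sh - print the N most frequently used words in the bash info pages.
# Usage: ./word_histogram.sh N
info bash -a -o - |
    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    head -n ${1}
```

Remember to make the script executable (chmod +x word_histogram.sh) before running it, e.g. ./word_histogram.sh 12 for the question below.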
Q) What are the 12 most frequently used words in the bash information page?
Find the frequency of a specific word
Finally, it might be useful to be able to specify a particular word and find its frequency in the info
pages. We can do this using grep
, with the -o
flag to print just the matching word (as opposed to
the whole line, which is the default behaviour). It may also be useful to invoke grep
with the -i
flag, which tells it to ignore case when finding matches.
Q) Modify your script so that it can take a word as input and count the number of times this word appears in the text. Compare the results to the original histogram pipeline for the word “shell”. What are some reasons it might give different answers? Is there a shorter way to achieve this than a simple substitution into the original pipeline?
A) As we have invoked it, grep
will match sub-strings, as well as whole words; grep -i "shell"
will match both “shell” and “subshell”. This is not the case for the tr
-pipeline, which does only
count whole words. We can fix this by passing -w
to grep
, which tells it to only match whole words.
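As a sketch of the "shorter way" hinted at in the question (assuming the word to count is passed as the script's first argument), you can skip the histogram entirely and simply count matches:

```bash
#!/bin/bash
# Hypothetical helper, e.g. word_count.sh - count occurrences of a single whole word
# in the bash info pages. Usage: ./word_count.sh shell
info bash -a -o - | grep -oiw "${1}" | wc -l
```

Here grep -o prints each match on its own line, -i ignores case, -w matches whole words only, and wc -l counts the resulting lines.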
Resources and further reading
Once you’re up and running on the clusters, you may want to check out the Software Carpentry workshops on the Unix shell: there’s a tutorial on the basics and a follow-up advanced course (which is currently only partially finished). You can either skim over the tutorials and work through specific sections as you need them, or do the course all in one go (Software Carpentry estimates that both workshops can be completed in a few hours).
In particular, it’s worth looking over the module on pipes and taking time to do the example problems.
Some other resources you may find useful:
- Learn bash in Y minutes: covers similar material to the Software Carpentry workshops, but is structured more like a reference manual than a tutorial. It’s quite terse, so it’s faster to work through than Software Carpentry, but requires more “reading between the lines” to get the most out of it.
- Mastering Linux Shell Scripting: A practical guide to Linux command-line, Bash scripting, and Shell programming. This is a more long-form introduction to Linux shell scripting. It’s useful as a learning resource with worked examples, and is available as an eBook through many university libraries.
- devhints bash cheat sheet: concise, comprehensive reference guide to bash syntax and concepts. Extremely useful for when you need to look up something you’ve forgotten how to do.
- Bruce Barnett’s UNIX tutorials: somewhat old but still useful tutorials dealing with “UNIX shell programming and various other arcane subjects of interest to wizards”. Covers many useful utilities and lesser known bash tricks.
1. From the memo by Douglas McIlroy proposing the addition of pipes to (original) Unix: http://doc.cat-v.org/unix/pipes/. ↩
2. sed (stands for "stream editor") is a utility to programmatically manipulate and edit streams of text (either text files or stdin) and is one of the most powerful shell tools in Linux. The keen reader may appreciate going over this unofficial documentation. ↩