Overview
Sed is a standard Linux utility that can be immensely helpful when working with text. One common use case is extracting substrings from a string. That being said, sed is not a text extractor; it is a text editor.
That means that rather than telling sed what sub-string/pattern you want, you need to tell sed how to reconstruct the string according to your desired pattern.
Let's say you have the string "time of completion: 14:24" and you want to extract the time. The sed command for this would be…
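As a minimal sketch, assuming GNU sed (for the -r flag) and the example string above, the full pipeline might look like this. The variable name result is only there for illustration.

```shell
# Assumes GNU sed: -r enables extended regex, -n suppresses automatic printing.
result=$(echo "time of completion: 14:24" | sed -rn 's/.*([0-9]{2}:[0-9]{2}).*/\1/p')
echo "$result"   # prints 14:24
```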
So, run against the example string, the command prints just the time: 14:24.
Breakdown
The Input String
Using echo, we create a text stream that is then piped to the next command, CMD, via the | pipe. Here CMD could be any command or utility that accepts a text stream, such as cat or less.
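To make the piping concrete, here is a small sketch with cat standing in for CMD (an illustrative choice; any stream-reading command would do). The variable name piped is hypothetical.

```shell
# cat stands in for CMD: it simply passes the stream it receives to its output.
piped=$(echo "time of completion: 14:24" | cat)
echo "$piped"   # prints: time of completion: 14:24
```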
Sed Flags
This is where the magic actually happens. The -r flag enables extended regular expressions, which are often useful, but the same task can also be done without it.
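A sketch of what the version without -r might look like, assuming the same example string. In basic regular expressions the group parentheses and the interval braces have to be escaped with backslashes.

```shell
# Basic regular expressions (no -r): \( \) for groups and \{ \} for intervals.
basic=$(echo "time of completion: 14:24" | sed -n 's/.*\([0-9]\{2\}:[0-9]\{2\}\).*/\1/p')
echo "$basic"   # prints 14:24
```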
This accomplishes the same task but is less readable, so when possible (any modern Linux system) the -r flag is preferable.
The other flag, -n, prevents the automatic printing of the pattern space. This flag is useful when the input string doesn't contain the desired pattern.
For example…
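Here is a minimal sketch of that case, assuming GNU sed and an input where the hour has been mistyped as q4. The variable name no_match is hypothetical.

```shell
# q4:24 does not match [0-9]{2}:[0-9]{2}, so no substitution is made
# and, with -n in effect, nothing is printed.
no_match=$(echo "time of completion: q4:24" | sed -rn 's/.*([0-9]{2}:[0-9]{2}).*/\1/p')
echo "[$no_match]"   # prints []
```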
We get no output because the time is invalid and doesn't match our format: q4:24.
Now let's see what happens if we don't use the -n flag…
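A sketch of the same invalid input without -n (and without the trailing p), again assuming GNU sed. The variable name without_n is hypothetical.

```shell
# No -n: sed prints the pattern space automatically, even though nothing matched.
without_n=$(echo "time of completion: q4:24" | sed -r 's/.*([0-9]{2}:[0-9]{2}).*/\1/')
echo "$without_n"   # prints: time of completion: q4:24
```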
Now, when the pattern doesn't match, the entire input is printed, which in most use cases is not the desired outcome.
Note: You may have noticed that in the example without the -n flag, the p at the end of the sed expression is missing. The p tells sed to print the result of the expression if a substitution was made (that is, if sed was able to match the pattern). If we exclude -n but keep the p, and the pattern is in the input string, the match will be printed twice, which is why the p is excluded.
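The double printing can be sketched as follows, assuming GNU sed and the valid example string. The variable name doubled is hypothetical.

```shell
# No -n but with p: the result is printed once by p and once automatically.
doubled=$(echo "time of completion: 14:24" | sed -r 's/.*([0-9]{2}:[0-9]{2}).*/\1/p')
echo "$doubled"
# prints:
# 14:24
# 14:24
```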
Regex
The regex (regular expression) is the pattern we’re trying to get sed to extract from our string. Let’s start off with the contents of the capture group.
[0-9]{2}:[0-9]{2}
- [0-9] specifies that we want a character that is between 0 and 9, so a numeric digit.
- {2} specifies how many, so [0-9]{2} equates to [0-9][0-9].
- Then the : is a literal character, since the substring 14:24 has the literal : in it.
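As a quick check of the {2} shorthand, this sketch (assuming GNU sed) replaces the time with the placeholder word MATCH using both spellings of the pattern. The variable names and the placeholder are illustrative only.

```shell
# Both patterns describe two digits, a literal colon, then two digits.
short=$(echo "14:24" | sed -r 's/[0-9]{2}:[0-9]{2}/MATCH/')
long=$(echo "14:24" | sed -r 's/[0-9][0-9]:[0-9][0-9]/MATCH/')
echo "$short $long"   # prints MATCH MATCH
```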
Then that regex is wrapped in parentheses to specify it as a capture group…
([0-9]{2}:[0-9]{2})
So what about the rest?
.*([0-9]{2}:[0-9]{2}).*
This goes back to what sed is and what it isn't: sed is an editor, not an extractor. So you need to account for the entire string, not just the pattern that you want to capture.
- The . means any character; it is a wildcard symbol
  - You can match a literal period explicitly via \.
- The * specifies how many, that being any (zero or more)
- .* essentially means capture everything
- .*(REGEX).* will match the entire string that contains the capture group (REGEX)
- .*([0-9]{2}:[0-9]{2}).* matches all strings that have an instance of the [0-9]{2}:[0-9]{2} pattern
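To see why the surrounding .* matters, this sketch (assuming GNU sed) runs the substitution with and without it. The variable names partial and whole are hypothetical.

```shell
# Without .* only the time itself is replaced (by itself), so the rest of the line survives.
partial=$(echo "time of completion: 14:24" | sed -rn 's/([0-9]{2}:[0-9]{2})/\1/p')
# With .* on both sides the whole line is matched and replaced by the capture group.
whole=$(echo "time of completion: 14:24" | sed -rn 's/.*([0-9]{2}:[0-9]{2}).*/\1/p')
echo "$partial"   # prints: time of completion: 14:24
echo "$whole"     # prints: 14:24
```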
Sed Expression
s/.*([0-9]{2}:[0-9]{2}).*/\1/p
In generic terms this is…
s/REGEX/OUTPUT PATTERN/p
- s is the command we are issuing to sed, that being the substitute command
  - This is by far the most commonly used sed command
- REGEX is our regex
- OUTPUT PATTERN is the pattern for our output
  - \1 refers to the first capture group, \2 to the second, \3 to the third, and so on
- Then after the last / we have any additional operations
  - p prints the result of the OUTPUT PATTERN
  - If no substitution is made (say the capture group was not in the input) then it will not print anything
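As an illustration of numbered capture groups, this sketch (assuming GNU sed) splits the time into two groups and rebuilds it with labels. The labels hours= and minutes= and the variable name labeled are made up for the example.

```shell
# Group 1 captures the hours, group 2 the minutes; the output pattern reuses both.
labeled=$(echo "time of completion: 14:24" | sed -rn 's/.*([0-9]{2}):([0-9]{2}).*/hours=\1 minutes=\2/p')
echo "$labeled"   # prints hours=14 minutes=24
```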
Putting it all together we get…
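One way to sketch the assembled command, assuming GNU sed, is to keep the regex and the output pattern in separate shell variables so each piece stays visible. The variable names input, regex, output, and final are hypothetical.

```shell
input="time of completion: 14:24"
regex='.*([0-9]{2}:[0-9]{2}).*'   # the whole line, with the time as capture group 1
output='\1'                       # rebuild the line from capture group 1 only
# s/REGEX/OUTPUT PATTERN/p with -r (extended regex) and -n (no automatic printing)
final=$(echo "$input" | sed -rn "s/${regex}/${output}/p")
echo "$final"   # prints 14:24
```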
We could modify the OUTPUT PATTERN to print the time twice, separated by a space…
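A minimal sketch of that change, assuming GNU sed and the same example string. The variable name twice is hypothetical.

```shell
# The output pattern references capture group 1 twice, separated by a space.
twice=$(echo "time of completion: 14:24" | sed -rn 's/.*([0-9]{2}:[0-9]{2}).*/\1 \1/p')
echo "$twice"   # prints 14:24 14:24
```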