Course Content#
Data Extraction Operations#
| Command | Function | Command | Function |
|:----:|:----|:----:|:----|
| cut | Split | grep | Search |
| sort | Sort | wc | Count characters, words, lines |
| uniq | Remove duplicates | tee | Bidirectional redirection |
| split | File splitting | xargs | Argument substitution |
| tr | Replace, compress, delete | | |
cut#
—— Kitchen Knife
- -d c Split by character c
- c can be surrounded by "" or '', very flexible
- Defaults to TAB
- -f [n/n-/n-m/-m] Select the n-th block / n-end blocks / n-m blocks / 1-m blocks
- -b [n-/n-m/-m] Bytes
- -c [n-/n-m/-m] Characters
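【Example】A quick sketch of the -d/-f combination [the sample strings here are made up]:
echo "a:b:c:d:e" | cut -d : -f 2      # b
echo "a:b:c:d:e" | cut -d ":" -f 3-   # c:d:e
echo "abcdef" | cut -c 2-4            # bcd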
grep#
Common operations can be found at: tldr grep
- -c Count matching lines [per file; lines, not occurrences]
- -n Prefix each matching line with its line number
- -v Output non-matching lines [complement]
- -C 3 Output 3 lines of context around each match
- -i Case insensitive
- [PS] -n and -c conflict, -c results take precedence
- Example
- Count the number of lines containing doubleliu3 in the path
- Use cut+awk or directly use awk
- View running processes related to hz
- ps -ef: Similar to Task Manager in Windows
- In comparison, ps aux [BSD-style options] displays more columns, such as CPU and memory usage
- [PS] In ps -ef, PPID is the parent process ID
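【Sketch】One hedged take on the two examples above [assuming "the path" means the $PATH variable; this variant uses tr+grep rather than awk]:
echo $PATH | tr ":" "\n" | grep -c "doubleliu3"   # count the entries containing doubleliu3
ps -ef | grep "hz"                                # view running processes related to hz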
sort#
Similar to Excel's sorting function, defaults to sorting by ASCII code based on the start of the line
- -t Separator; the default is the transition from non-blank to blank [runs of whitespace], not TAB
- -k Sort by which field
- -n Numeric sort
- -r Reverse sort
- -u uniq [but not convenient for counting]
- -f Fold case [ignore case; typical locale collation already ignores case]; -V Version sort [compares case-sensitively]
【Example】Sort user information by uid numerically
cat /etc/passwd | sort -t : -k 3 -n
【Tested】
- Defaults during sorting
- Skips special characters like underscores "_", "*", "/", etc.
- Also ignores case, placing lowercase before uppercase after sorting [-V sorts case-sensitively; see the check below]
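【Sketch】A small check of the case behavior [the first output assumes a UTF-8 locale; the C locale sorts strictly by byte value]:
printf "b\nA\na\nB\n" | sort      # a A b B  [case folded, lowercase first]
printf "b\nA\na\nB\n" | sort -V   # A B a b  [case-sensitive]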
wc#
- -l Line count
- -w Word count
- -m Character count; -c Byte count
【Example】
① Total number of recent logins to the system
last | grep -v "^$" | grep -v "begins" | wc -l
② Statistics related to the PATH variable: character count, word count, variable count, take the last variable
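One hedged take on ②:
echo -n $PATH | wc -m                 # character count [-n drops the trailing newline]
echo $PATH | tr ":" " " | wc -w       # word count after splitting on ":"
echo $PATH | tr ":" "\n" | wc -l      # number of variables [entries]
echo $PATH | tr ":" "\n" | tail -1    # take the last variable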
uniq#
Only consecutive duplicates count as duplicates, generally used with sort
- -i Ignore case
- -c Count
【Example】Count the number of recent user logins, sorted from largest to smallest
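One possible pipeline [it mirrors answer 【8】 below, which additionally filters reboot and shutdown; the grep filters assume the blank line and "begins" footer that last prints]:
last | grep -v "^$" | grep -v "begins" | cut -d " " -f 1 | sort | uniq -c | sort -n -r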
tee#
Displays in the terminal and writes to a file
- Overwrites the target file by default
- -a Append without overwriting
【Example】
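A minimal illustration [processes.txt is an arbitrary name]:
ps -ef | tee processes.txt | grep ssh   # the full listing goes to the file; only matches reach the screen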
split#
Splits a file, suitable for handling large files
- -l num Split by num lines
- -b size Split by size, 512 [default byte], 512k, 512m
- With -b, the first or last line of a chunk may be cut in half, so use the following method instead 👇
- ❗ -C size Split by at most size, ensuring not to break line data!
- -n num Split into num equal parts by size
【Example】Split the file list in /etc into a file every 20 lines
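A hedged sketch [the etc_ prefix is arbitrary; "-" makes split read standard input]:
ls /etc | split -l 20 - etc_   # produces etc_aa, etc_ab, ...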
xargs#
Argument substitution, for commands that do not read standard input; similar in effect to command substitution
- -exxx Stop reading input when the line "xxx" appears [EOF string; -E xxx in newer syntax]
- -p Ask before executing the entire command
- -n num Specify the number of parameters received each time⭐, rather than passing all at once
- Suitable for commands that can only read one parameter [like id] or related scenarios
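【Example】A sketch of -n 1 with a one-argument command like id:
cut -d : -f 1 /etc/passwd | xargs -n 1 id   # run id once per username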
tr#
Character replacement, compression, deletion for standard input, can refer to tldr tr
tr [options] <char/charset1> <char/charset2>
- Defaults to replacing charset1👉charset2, one-to-one correspondence
- Characters in charset1 that exceed charset2 are replaced by the last character of charset2
- -c Replace all characters not belonging to charset1👉charset2
- -s Squeeze: collapse consecutive repeats of a character 1 into a single character 1 【❗ When combined with -c, the squeezed set is charset2】
- -d Delete all characters belonging to charset1
- -t First truncate charset1 to the length of charset2, then replace one-to-one 【❗ Note the difference from the default behavior, which pads charset2 with its last character】
【Example】
① Simple usage
- -s also takes a charset argument; when combined with -c, the squeeze applies to charset2
- -t only cares about the characters that are one-to-one matched, ignoring the extra characters in charset1 compared to charset2
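A quick check of the difference [sample strings made up]:
echo "abcd" | tr "abcd" "xy"      # xyyy  [c and d fall back to the last character of charset2]
echo "abcd" | tr -t "abcd" "xy"   # xycd  [c and d are left untouched]
echo "aabbcc" | tr -s "ab"        # abcc  [only repeats of a and b are squeezed]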
② Word frequency statistics
- tr replace👉sort👉remove duplicates count👉sort👉display top few
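In command form [the head count is arbitrary; test.txt as in the exercise below]:
cat test.txt | tr -s " " "\n" | sort | uniq -c | sort -n -r | head -5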
Soft and Hard Links#
Hard links can reduce storage usage, while soft links have an additional inode and block
【Background Introduction】
- The ext4 file system has three components: inode, block, superblock
- inode: File node
- One inode corresponds to one file
- Stores file information [file permissions, owner, time...] and the actual location of the file [blocks location]
- If the inode cannot hold all block locations directly, it uses multi-level [indirect] blocks
- block: The actual storage location of the file, one block is generally 4096 bytes
- superblock: At least two, storing overall information of the file system [like inode, block...]
【Hard Link】 Equivalent to an alias
- Shares the same inode as the original file
- The number of links to the file > 1
- [PS]
- Deleting the alias does not affect the original file
- The current directory and parent directory are both hard links [special directories]
- However, hard links to directories are not allowed, since they could cause issues like infinite loops; refer to:
- Why are hard links to directories not allowed in UNIX/Linux?——StackExchange
- Why are hard links not allowed for directories?——StackExchange
- Why can't directories be hard linked in Linux?——Zhihu
【Soft Link】 Equivalent to a shortcut
- Creates a new file with its own inode, pointing to a block that stores the actual location [path] of the original file
- File type is link
- [PS] More commonly used than hard links
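【Example】A minimal demonstration [file names made up]:
echo "data" > orig.txt
ln orig.txt hard.txt               # hard link: same inode, link count becomes 2
ln -s orig.txt soft.txt            # soft link: new inode, file type l
ls -li orig.txt hard.txt soft.txt  # compare inode numbers and link counts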
Linear Sieve#
- Arrays in shell do not need to be initialized; they start empty
  - In fact, regardless of the variable, everything starts empty, with no type
- Learn the empty-check idiom ["$x"x == x] [appending a character keeps the test valid even when x is unset]
- Array indices can directly use variable names, without needing to wrap in ${}
【Performance Comparison with the Ordinary Prime Sieve】
Find the sum of primes from 2 to 20000 [a bash sketch follows the notes below]
- In shell, the linear sieve does not clearly outperform the ordinary prime sieve, possibly because:
  - Shell involves many system calls, so performance is not purely a matter of the algorithm's math
  - Executing shell scripts can max out the CPU, making it incomparable to C language performance
  - Modulus operations are expensive
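【Sketch】A minimal bash implementation of the linear [Euler] sieve using the notes above [run with bash, not sh]:
n=20000
sum=0
primes=()                                  # arrays start empty, no initialization needed
for ((i = 2; i <= n; i++)); do
    if [ "${composite[i]}"x == x ]; then   # empty-check idiom: i was never marked, so it is prime
        primes+=($i)
        sum=$((sum + i))
    fi
    for p in "${primes[@]}"; do
        ((i * p > n)) && break
        composite[i * p]=1                 # each composite is marked exactly once, by its smallest prime factor
        ((i % p == 0)) && break            # the key step that keeps the sieve linear
    done
done
echo $sum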
➕ SED Script Editing#
Mainly used for non-interactive [scripted] text editing; the command syntax resembles vim's
【Common Operations】 Replace [batch, by line, match], delete
# Replace the first string pattern matched in each line [supports regular expressions]
sed 's/{{regex}}/{{replace}}/' {{filename}}
# Replace the first match, but only on lines matching line_pattern
sed '/{{line_pattern}}/s/{{regex}}/{{replace}}/' {{filename}}
# Delete lines matching line_pattern
sed '/{{line_pattern}}/d' {{filename}}
# Can use other delimiters, like '#', to handle scenarios needing '/' character
sed 's#{{regex}}#{{replace}}#' {{filename}}
# Delete all lines between two matching patterns
sed '/{{regex}}/,/{{regex}}/d' {{filename}}
# [Special] Delete from the matching line to the end; "!d" inverts the range, deleting everything before the match instead
sed '/{{regex}}/,$d' {{filename}}
sed '/{{regex}}/,$!d' {{filename}}
- sed -i Write modifications to the file
- sed '.../g' Adding g at the end allows for global operations
- sed '.../d' Adding d at the end is for delete operations
- sed '/.../,/.../...' Commas can be used to address a range between two patterns [matching a paragraph]
- Refer to How to delete all lines between two matching patterns using sed?——Tencent Cloud Community
【Command Demonstration】
① The above common commands operate in sequence
② Used to replace certain configurations in configuration files, as sketched below
- Delete 👉 Add
- Back up before deletion
- Add an identifier #newadd during addition for easier subsequent sed operations
- Delete and then add, rather than replacing in place, to avoid cumbersome pattern matching [parameter values]
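【Sketch】A hedged example of the delete-then-add pattern [app.conf and max_connections are made-up names]:
sed -i.bak '/^max_connections=/d' app.conf      # delete the old line, keeping a backup app.conf.bak
echo "max_connections=200 #newadd" >> app.conf  # append the new value with the marker
sed -n '/#newadd/p' app.conf                    # later sed operations can target the marker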
In-Class Exercises#
- Find the sum of all numbers in the string: "1 2 3 4 5 6 7 9 a v 你好 . /8"
- Convert all uppercase letters in the file to lowercase: echo "ABCefg" >> test.log
- Find the last path in the PATH variable
- Use the last command to output all reboot information
- Sort the contents of /etc/passwd by username
- Sort the contents of /etc/passwd by uid
- Find the total number of system login users in the last 2 months on the cloud host
- Sort all usernames that logged in to the cloud host in the last 2 months by frequency and output the count
- Save every ten file and directory names in the local /etc directory to a file
- Output uid, gid, and groups of the 10th to 20th users stored in /etc/passwd
- View users in /etc/passwd by username, stop when reading the user 'sync', and output the user's uid, gid, and groups
- Word frequency statistics
① Use the following command to generate a text file test.txt
cat >> test.txt << xxx
nihao hello hello 你好
nihao
hello
ls
cd
world
pwd
xxx
② Count the word frequency in test.txt and output in descending order.
Answer
【1】
[Clever solution, needs to be run in bash]
echo "1 2 3 4 5 6 7 9 a v 你好 . /8" | tr -s -c 0-9 "\n" | echo $[`tr "\n" "+"`0]
[for loop]
sum=0
for i in `echo "1 2 3 4 5 6 7 9 a v 你好 . /8" | tr -s -c 0-9 "\n"`; do
    sum=$[$sum+$i]
done
echo $sum
[awk solution 1]
echo "1 2 3 4 5 6 7 9 a v 你好 . /8" | tr -s -c 0-9 "\n" | awk -v sum=0 '{sum += $1} END { print sum }'
[awk solution 2]
echo "1 2 3 4 5 6 7 9 a v 你好 . /8" | tr -s -c 0-9 " " | awk -v sum=0 '{for (i = 1; i <= NF; i++) {sum += $i} } END{print sum}'
【2】 cat test.log | tr A-Z a-z > test2.log && mv test2.log test.log [redirecting straight back into test.log would truncate it before it is read]
【3】 echo ${PATH} | tr ":" "\n" | tail -1
【4】 last | grep "reboot"
【5】 cat /etc/passwd | sort
【6】 cat /etc/passwd | sort -t : -k 3 -n
【7】 last -f /var/log/wtmp.1 -f /var/log/wtmp | grep -v "^$" | grep -v "begins" | grep -v "reboot" | grep -v "shutdown" | wc -l
【8】 last -f /var/log/wtmp.1 -f /var/log/wtmp | grep -v "^$" | grep -v "begins" | grep -v "reboot" | grep -v "shutdown" | cut -d " " -f 1 | sort | uniq -c | sort -n -r
【9】 ls /etc | split -l 10
【10】 cat /etc/passwd | head -20 | tail -10 | cut -d : -f 1 | xargs -n 1 id
【11】 cat /etc/passwd | cut -d : -f 1 | xargs -E "sync" -n 1 id
【12】
cat test.txt | tr -s " " "\n" | sort | uniq -c | sort -n -r
[If you want the word to be first and the count to be second, use awk to reverse]
cat test.txt | tr -s " " "\n" | sort | uniq -c | sort -n -r | awk '{print $2, $1}'
Tips#
- Commonly asked in interviews: Word frequency statistics
- tr replace👉sort👉remove duplicates count👉sort👉display top few