Bo2SS

Bo2SS

4 Sorting and Searching

Course Content#

Sorting#

  • Four Quadrant Principle: Stable/Unstable, Internal/External
    • Stable: The relative positions of identical elements remain unchanged before and after sorting
    • Internal: The data needs to be loaded into memory for sorting

[Consider sorting from small to large]

Stable Sorting#

  • Insertion (Internal)
    • Front: Sorted area; Back: Unsorted area
    • Find a position for the following numbers in the front
      • Compare and exchange one by one to achieve the effect of insertion
      • Do not directly exchange with the insertion position, as this will disrupt the order of the sorted area
    • Average time complexity: O(N^2)
  • Bubble (Internal)
    • Front: Unsorted area; Back: Sorted area
    • Move forward from the first element and find the largest element to put it at the front of the sorted area and the end of the unsorted area
    • Perform N - 1 rounds of bubbling, adding one element to the sorted area each round
      • Optimization: If no exchange operation is performed during a round of bubbling, the process can be terminated directly
    • Average time complexity: O(N^2)
      • Best time complexity: O(N) ← Already sorted
      • Worst time complexity: O(N^2) ← Completely reversed
    • Insertion sort can be understood as reverse bubble sort
  • Merge (External)
    • Premise: Merge two sorted arrays (with lengths m and n) into one sorted array
      • Time complexity: O(m + n)
    • Divide and conquer until two elements can be compared in pairs, then merge the sorted arrays in pairs
    • Time complexity (best, worst, and average): O(NlogN)
      • LogN layers in total, and the merge time for each layer is O(N)
    • External sorting: 2-way merge (above), k-way merge (multi-way merge, using heaps)
      • Refer to External Sorting-Wikipedia, the sorting method for each way can be arbitrary, and finally merge them
      • Actually, whether to use external sorting depends on whether you want to or need to, but it can be done

Unstable Sorting#

  • Selection (Internal)
    • Front: Sorted area; Back: Unsorted area
    • Select the smallest element from the unsorted area in each round and exchange it with the first element of the unsorted area
      • There may be cases where it exchanges with itself, so the exchange function cannot use XOR
    • Unstable: For example, 22'1, after the first exchange, 2 is moved to the back of 2', i.e., 12'2
    • Time complexity (best, worst, and average): O(N^2)
    • Generally better than bubble sort, the number of comparisons is similar, but the exchange in bubble sort is too frequent
  • Quick (Internal)
    • [Pivot, partition]
      • Take the first element as the pivot
      • [Tail pointer] Find an element smaller than the pivot from the back to the front and put it in the position of the first element (which is now empty), [Head pointer] Find an element larger than the pivot from the front to the back and put it in the just emptied position, repeat
      • Finally, the head and tail pointers coincide, pointing to an empty position, put the pivot in it
      • Then perform the above operations on the left and right parts of the pivot separately
    • Time complexity: O(NlogN)
      • When the sequence is in reverse order and the first element is selected as the pivot, it degenerates into selection sort, O(N^2)
    • Optimization
      • Randomly select the pivot
      • Reduce the use of recursion and use loops instead
      • Swap only when both the left and right values to be swapped are found

Searching#

  • Naive Binary Search
    • Prerequisite: Monotonic sequence
    • If the value to be searched exists multiple times in the sequence, binary search cannot distinguish which one is found
  • Special Binary Search
    • 11110000
      • Introduce a virtual head pointer with an index of -1
      • The calculation of mid needs to be +1 to avoid infinite loops, such as the case of 10
    • 00001111
      • Introduce a virtual tail pointer with an index of n [array range is 0 ~ n - 1]
    • Refer to 《Interview Written Test Algorithm》-2.Binary Search Topic
    • Here, an additional concept of virtual pointer is added, by changing the search boundary [-1 ~ n - 1 or 0 ~ n], it can reflect the situation when the value is not found
  • Ternary Search
    • Image
    • Solve the problem of finding extreme points of concave and convex functions
    • Divide the original problem into one-third of the size
    • Update strategy: Keep the smaller value and the other end of the interval; Stop condition: |m1 - m2| < ESPL
    • Time complexity: O(log[3]N), slightly slower than binary search O(log[2]N)

Hash Table#

  • Data structure used for searching
  • Structure definition
    • Need a continuous space to store elements
    • Consistent with the sequence table, the element type can be changed
  • Structure operations
    • Hash function: Can map any type of element to an integer value [array index]
      • Then the value can be stored in the corresponding array index
      • Array indexes can only be integers
    • Collision handling methods
      • Open addressing [common]: Look for empty spaces later, such as quadratic probing... but it is prone to data clustering problems
      • Rehashing: Also known as hashing, using multiple hash functions
      • Chaining [common]: Store elements of each position in a linked list structure
      • Establish a common overflow area: Put the conflicting elements together in a specific area
  • Time complexity of searching: Approaching O(1)
    • Other time-consuming operations: Hash function conversion, chaining [searching linked list elements], or other time-consuming operations caused by collision handling methods
    • Only arrays and sequence tables have O(1) time complexity
  • The space utilization of the hash table is generally 50~90%
    • There will always be collision situations in the hash function, so space needs to be reserved when allocating
    • When the utilization rate reaches 70%, it can be used in industry. The fewer collisions, the higher the utilization rate, and the better the hash function

Code Demonstration#

Stable Sorting#

  • Image
  • Image
  • Image
  • See comments and red box marks for details

  • To ensure the stability of stable sorting, pay attention to the == case and do not exchange [or ensure the original order]!

  • Note: When calling the TEST macro, the second array is num, which is a copy of arr in the macro definition!

Unstable Sorting#

Version 1.0: Selection Sort & Quick Sort (Regular Version)#

  • Image
  • Image
  • Pay special attention to the quicksort algorithm, there are many optimization points

    • Self-optimization: The conditions num[] > z or num[] < z in lines 46 and 48 can add == conditions. In this way, when encountering equal elements, they will not be exchanged, and it is not necessary to exchange them, which can ensure the stability of the algorithm as much as possible [although quicksort is unstable]
    • To be optimized: Pivot selection, recursion → loop, exchange method

Version 2.0: Optimized Quick Sort#

  • Image
  • Implement the three optimization points mentioned above

  • Pay attention to the == in several boundaries: x <= y, num[] < z, num[] > z

    • ❓How to understand the judgment of ==? See the following thinking point 2
  • Lines 39 and 40: Sort the right interval, and then shrink the right boundary

Speed Comparison of 2 Versions of Quick Sort#

  • Test random sequences and reverse sequences separately
    • For a random sequence of size 200,000: the difference between the two is minimal
    • For a reverse sequence of size 100,000: the difference is significant
      • Image
      • For a reverse sequence of 200,000, the original version of quicksort will cause a stack overflow

Building a Hash Table for Strings#

  • Hash function: BKDR-Hash, collision handling method: chaining

  • Image
  • Image
  • Use head insertion when inserting values

  • The space utilization of the hash table cannot reach 100%, so the space allocated needs to be expanded, doubled

  • Use calloc to allocate space for the hash table, which can set the first address of each position in the linked list to NULL, which is safe

  • The hash function can be designed by yourself

Points for Consideration#

  • My understanding: To ensure the stability of stable sorting, pay attention to the == case and do not exchange [or ensure the original order]!
  • ❓Regarding the thinking of the boundaries in the optimized version of quicksort
    • Image
    • The optimization ideas mentioned above can be implemented in the regular version, but there are problems in the optimization ideas mentioned above, while they can be implemented in the regular version
      • Lines 32-33: Change num[] < z and num[] > z to <= and >=, respectively, will cause infinite loops
        • For num[] <= z, it will cause an infinite loop
          • Example: 2 1, with the pivot value of 2, the left pointer x keeps moving to the right end and goes out of bounds, while the right pointer y remains unchanged. The condition while (l < r) in line 27 is true, and x returns to l, resulting in an infinite loop
        • For num[] >= z, it will cause a segmentation fault - stack overflow
          • Example: 1 2, with the pivot value of 1, the right pointer y keeps moving to the left end and becomes -1, while the left pointer x and r remain unchanged. The line 39 quick_sort(num, x, r) is recursively called repeatedly, resulting in a stack overflow
        • Analysis: x and y need to move, otherwise, it will cause a stack overflow if x does not move, and a infinite loop if y does not move
      • Lines 32-34: Change x <= y to x < y, will also cause a segmentation fault
        • Example: 2 3, with the pivot value of 2, at this time, x and y point to 2 and 3 respectively. After the loop, x and y both point to 2, and x and r remain unchanged. The line 39 quick_sort(num, x, r) is recursively called repeatedly, resulting in a stack overflow
        • Analysis: x and y need to be staggered, otherwise, it is similar to the case of special binary search 111000 without +1
      • Summary: Both x and y need to move, otherwise x will cause a stack overflow, and y will cause an infinite loop
      • Personal understanding, not well considered, welcome to discuss
  • Ternary Search - Exercise
    • Image
    • First determine the boundaries: INT32_MIN ~ INT32_MAX
    • Then follow the steps of ternary search
    • Boundary considerations
      • Estimate based on the formula - b / 2a, but there is some suspicion of cheating
      • Determine the range based on the two solutions of the quadratic function = 0, but it becomes complicated
      • The time complexity of ternary search is logarithmic, and the size of the boundary range has little effect

Tips#


Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.