Here we have created two DataFrames df
and full_df
which contain two columns and three columns respectively.
Check schema
Let’s check the schema of the above DataFrame full_df
The output shows that col1
is string type and array_col1
and array_col2
are array type:
root
|-- col1: string (nullable = true)
|-- array_col1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- array_col2: array (nullable = true)
| |-- element: integer (containsNull = true)
Similarly for DataFramedf
, the schema looks like this:
root
|-- col1: string (nullable = true)
|-- array_col2: array (nullable = true)
| |-- element: integer (containsNull = true)
In the examples that follow we will use df
for functions that take a single array as input and df_full
for functions that take two arrays as input.
Array functions
array_contains
If we need to find a particular element is present in array, we can use array_contains
function. This function returns true if the value is present in array and false otherwise.
Output:
+----+------------------+------+
|col1| array_col2|result|
+----+------------------+------+
| x| [1, 2, 3, 7, 7]| true|
| z|[3, 2, 8, 9, 4, 9]| true|
| a| [4, 5, 2, 8]| false|
+----+------------------+------+
We are looking for value ‘3’ in array column array_col2
and getting true for the first two rows where ‘3’ is present and false for last column because ‘3’ is not present.
array_distinct
This function returns only distinct values from an array and removes duplicate values.
Output:
+----+------------------+---------------+
|col1| array_col2| result|
+----+------------------+---------------+
| x| [1, 2, 3, 7, 7]| [1, 2, 3, 7]|
| z|[3, 2, 8, 9, 4, 9]|[3, 2, 8, 9, 4]|
| a| [4, 5, 2, 8]| [4, 5, 2, 8]|
+----+------------------+---------------+
Duplicate values got removed and only distinct values are present from array column result
.
array_except
This function returns the elements from first array which are not present in second array. This is logically equivalent to set subtract operation.
Output:
+----+------------------+------------------+---------+
|col1| array_col1| array_col2| result|
+----+------------------+------------------+---------+
| x| [4, 6, 7, 9, 2]| [1, 2, 3, 7, 7]|[4, 6, 9]|
| z|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|[7, 5, 1]|
| a| [3, 8, 5, 3]| [4, 5, 2, 8]| [3]|
+----+------------------+------------------+---------+
Column result
contains elements that are only present in array_col1
but not in array_col2
. For example, in the first row the result
column contains [4, 6, 9]
because these elements are present in array_col1
but not in array_col2
.
array_intersect
This function returns common elements from both arrays. This is logically equivalent to set intersection operation.
Output:
+----+------------------+------------------+------+
|col1| array_col1| array_col2|result|
+----+------------------+------------------+------+
| x| [4, 6, 7, 9, 2]| [1, 2, 3, 7, 7]|[7, 2]|
| z|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]| [4]|
| a| [3, 8, 5, 3]| [4, 5, 2, 8]|[8, 5]|
+----+------------------+------------------+------+
Column result
contains the elements that are common in both the array columns (array_col1
and array_col2
). For example, in the first row the result
column contains [7, 2]
because these elements are present in botharray_col1
and array_col2
.
array_join
This Function joins all the array elements based on delimiter defined as the second argument.
Output:
+----+------------------+-----------+
|col1| array_col2| result|
+----+------------------+-----------+
| x| [1, 2, 3, 7, 7]| 1,2,3,7,7|
| z|[3, 2, 8, 9, 4, 9]|3,2,8,9,4,9|
| a| [4, 5, 2, 8]| 4,5,2,8|
+----+------------------+-----------+
Column result
contains a string which is a concatenation of all the array elements using a specified delimiter (comma in this example).
Note: if there are any null values then we can replace with third argument (nullReplacement
) with any string value.
array_max
This function returns the maximum value from an array.
Output:
+----+------------------+------+
|col1| array_col2|result|
+----+------------------+------+
| x| [1, 2, 3, 7, 7]| 7|
| z|[3, 2, 8, 9, 4, 9]| 9|
| a| [4, 5, 2, 8]| 8|
+----+------------------+------+
Column result
contains the maximum value from each array in a row. For example, in the first row the result
column contains ‘7’ because this is the maximum element in array [1, 2, 3, 7, 7]
.
array_min
This function returns the minimum value from an array.
Output:
+----+------------------+------+
|col1| array_col2|result|
+----+------------------+------+
| x| [1, 2, 3, 7, 7]| 1|
| z|[3, 2, 8, 9, 4, 9]| 2|
| a| [4, 5, 2, 8]| 2|
+----+------------------+------+
Column result
contains the minimum value from each array in a row. For example, in the first row the result
column contains ‘1’ because this is the minimum element in array [1, 2, 3, 7, 7]
.
array_position
This function returns the position of first occurrence of a specified element. If the element is not present it returns 0.
Let’s try to find the position of element say ‘7’ from column array_col2
.
Output:
+----+------------------+------+
|col1| array_col2|result|
+----+------------------+------+
| x| [1, 2, 3, 7, 7]| 4|
| z|[3, 2, 8, 9, 4, 9]| 0|
| a| [4, 5, 2, 8]| 0|
+----+------------------+------+
In the first row we get position ‘4’ because ‘7’ occurs the first time in position four. For the rest of the rows, we get ‘0’ because ‘7’ is not present.
array_remove
This function removes all the occurrences of an element from an array.
Let’s remove the element ‘7’ from column array_col2
.
Output:
+----+------------------+------------------+
|col1| array_col2| result|
+----+------------------+------------------+
| x| [1, 2, 3, 7, 7]| [1, 2, 3]|
| z|[3, 2, 8, 9, 4, 9]|[3, 2, 8, 9, 4, 9]|
| a| [4, 5, 2, 8]| [4, 5, 2, 8]|
+----+------------------+------------------+
All occurrences of element ‘7’ are removed from array.
array_repeat
This function creates an array that is repeated as specified by second argument.
Output:
+----+------------------+----------------------------------------+ |col1|array_col2 |result | +----+------------------+----------------------------------------+ |x |[1, 2, 3, 7, 7] |[[1, 2, 3, 7, 7], [1, 2, 3, 7, 7]] | |z |[3, 2, 8, 9, 4, 9]|[[3, 2, 8, 9, 4, 9], [3, 2, 8, 9, 4, 9]]| |a |[4, 5, 2, 8] |[[4, 5, 2, 8], [4, 5, 2, 8]] | +----+------------------+----------------------------------------+
Array from array_col2
got repeated 2 times in result
column. For example, in the first row the result
column contains the array [1, 2, 3, 7, 7]
twice.
array_sort
This function sorts the elements of an array in ascending order.
Output:
+----+------------------+------------------+
|col1| array_col2| result|
+----+------------------+------------------+
| x| [1, 2, 3, 7, 7]| [1, 2, 3, 7, 7]|
| z|[3, 2, 8, 9, 4, 9]|[2, 3, 4, 8, 9, 9]|
| a| [4, 5, 2, 8]| [2, 4, 5, 8]|
+----+------------------+------------------+
Array in column result
got sorted in ascending order. For example, in the last row the column result contains [2, 4, 5, 8]
which is sorted in ascending order.
array_union
This function returns the union of all elements from the input arrays.
Output:
+------------------+------------------+------------------------+ |array_col1 |array_col2 |result |
+------------------+------------------+------------------------+
|[4, 6, 7, 9, 2] |[1, 2, 3, 7, 7] |[4, 6, 7, 9, 2, 1, 3] |
|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|[7, 5, 1, 4, 3, 2, 8, 9]|
|[3, 8, 5, 3] |[4, 5, 2, 8] |[3, 8, 5, 4, 2] |
+------------------+------------------+------------------------+
Note: dropped the col1
to fit the result here in code block.
Column result
contains the union of arrays from column array_col1
and array_col2
and contains distinct values only.
arrays_overlap
This function checks if at least one element is common/overlapping in arrays. It returns true if at least one element is common in both array and false otherwise. It returns null if at least one of the arrays is null.
Output:
+----+------------------+------------------+------+
|col1|array_col1 |array_col2 |result|
+----+------------------+------------------+------+
|x |[4, 6, 7, 9, 2] |[1, 2, 3, 7, 7] |true |
|z |[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|true |
|a |[3, 8, 5, 3] |[4, 5, 2, 8] |true |
+----+------------------+------------------+------+
All the values in result
column are true
because we have at-least one element common in array_col1
and array_col2
for all rows. For example, in the first row the result
column is true
because the elements ‘2’ and ‘7’ are present in both columns array_col1
and array_col2
.
arrays_zip
This function merges the i-th element of an array and returns array<struct>.
Since both the array columns have same numbers of values, let’s remove some values from one array column and see how it behaves with different values in the array with zip operation.
First, we will remove element ‘2’ from array column array_col2
and then try to zip column array_col1
with newly created column new_array_col
Output:
+------------------+---------------+-------------------------------+
|array_col1 |new_array_col |result |
+------------------+---------------+-------------------------------+
|[4, 6, 7, 9, 2] |[1, 3, 7, 7] |[[4, 1], [6, 3] ..., [2,]] |
|[7, 5, 1, 4, 7, 1]|[3, 8, 9, 4, 9]|[[7, 3], [5, 8] ..., [1,]] |
|[3, 8, 5, 3] |[4, 5, 8] |[[3, 4], ... [3,]] |
+------------------+---------------+-------------------------------+
In first row, first element of result
column is [4, 1] which is a zip of first element from array array_col1
(4) and new_array_col
(1). Also, last element of result
column is [2,] (which is a zip of 5-th element) and second value is blank because there is no 5-th element in first row of column new_array_col
.
Let’s check the type of the result
columns as well.
Output:
root
|-- result: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- array_col1: integer (nullable = true)
| | |-- new_array_col: integer (nullable = true)
Type of the result column is array<struct>.
concat
This function concatenates all the elements of both arrays into a single one.
Output:
+------------------+------------------+----------------------------+
|array_col1 |array_col2 |result |
+------------------+------------------+----------------------------+
|[4, 6, 7, 9, 2] |[1, 2, 3, 7, 7] |[4, 6, 7, 9, ..., 3, 7, 7] |
|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|[7, 5, 1, 4, ..., 9, 4, 9] |
|[3, 8, 5, 3] |[4, 5, 2, 8] |[3, 8, 5, 3, 4, 5, 2, 8] |
+------------------+------------------+----------------------------+
Column result
contains the array which is a concatenation of arrays in columns array_col1
and array_col2
.
Note: In order to fit the output we removed a few elements from result
column from the display.
element_at
This function returns the element at a specified index.
Let’s try to get the first element from each array.
Output:
+----+------------------+------+
|col1|array_col2 |result|
+----+------------------+------+
|x |[1, 2, 3, 7, 7] |1 |
|z |[3, 2, 8, 9, 4, 9]|3 |
|a |[4, 5, 2, 8] |4 |
+----+------------------+------+
Column result
contains the first element from each array. For example, in the first row the result
contains ‘1’ because this is first element in the array [1, 2, 3, 7, 7]
.
flatten
This function returns a single array from array of an arrays. If an array is more than 2 levels deep, it removes one level of nesting from an array.
Let’s first generate the nested array using the function array_repeat
as discussed above and then flatten the nested array.
Output:
+-----------------------------------+------------------------------+
|repeat |result |
+-----------------------------------+------------------------------+
|[[1, 2, 3, 7, 7], [1, 2, 3, 7, 7]] |[1, 2, 3, 7, 7, 1, 2, 3, 7, 7]|
|[[3, 2, 8, 9, 4], [3, 2, 8, 9, 4]] |[3, 2, 8, 9, 4, 3, 2, 8, 9, 4]|
|[[4, 5, 2, 8], [4, 5, 2, 8]] |[4, 5, 2, 8, 4, 5, 2, 8] |
+----------------------------------------+-------------------------+
Column result
contains all the values from an array of arrays from column repeat
but in a single array.
map_from_arrays
This function creates a map column. Elements of the first column will be used for keys and second column will be used for values.
Output:
+------------------+------------------+--------------------+
| array_col1| array_col2| result|
+------------------+------------------+--------------------+
| [4, 6, 7, 9, 2]| [1, 2, 3, 7, 7]|[4 -> 1, 6 -> 2, ...|
|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|[7 -> 3, 5 -> 2, ...|
| [3, 8, 5, 3]| [4, 5, 2, 8]|[3 -> 4, 8 -> 5, ...|
+------------------+------------------+--------------------+
Column result
contains the map generated from both the input arrays. The first element in first row is 4 -> 1
where ‘4’ is a key which is the first element from first column array_col1
and ‘1’ is the key’s value which is the first element from second column array_col2
.
reverse
This function reverses the order of elements in input array.
Output:
+----+------------------+------------------+
|col1| array_col2| result|
+----+------------------+------------------+
| x| [1, 2, 3, 7, 7]| [7, 7, 3, 2, 1]|
| z|[3, 2, 8, 9, 4, 9]|[9, 4, 9, 8, 2, 3]|
| a| [4, 5, 2, 8]| [8, 2, 5, 4]|
+----+------------------+------------------+
Column result
contains the reverse of array present in column array_col2
. For e.g in first row, result
contains [7, 7, 3, 2, 1]
which is reverse of array [1, 2, 3, 7, 7]
from column aray_col2
.
shuffle
This function shuffles the elements of array randomly.
Output:
+----+------------------+------------------+
|col1| array_col2| result|
+----+------------------+------------------+
| x| [1, 2, 3, 7, 7]| [2, 7, 1, 7, 3]|
| z|[3, 2, 8, 9, 4, 9]|[3, 8, 9, 4, 9, 2]|
| a| [4, 5, 2, 8]| [8, 4, 2, 5]|
+----+------------------+------------------+
Column result
contains shuffled elements from column array_col2
. In other words, order of elements in result
column is random. For example, in the first row the result
column contains [2, 7, 1, 7, 3]
which is the shuffled output of array [1, 2, 3, 7, 7]
from column array_col2
.
Note: Try executing the shuffle function multiple times. The order of values in the result
column will be different for each execution.
size
This function returns a number of elements in an array or map.
Output:
+----+------------------+------+
|col1| array_col2|result|
+----+------------------+------+
| x| [1, 2, 3, 7, 7]| 5|
| z|[3, 2, 8, 9, 4, 9]| 6|
| a| [4, 5, 2, 8]| 4|
+----+------------------+------+
Column result
contains the size(number of elements) of an array in column array_col2
. For example, in the first row the result
column contains ‘5’ because number of elements in [1, 2, 3, 7, 7]
is 5.
slice
This function slices the array into a sub-array. We can specify the start of the index as second argument and number of elements as third argument.
Note: Arrays in spark start with index 1. It also supports negative indexing to access the elements from last.
Let’s try to create a sub-array of 3 elements starting from index 2.
Output:
+----+------------------+---------+
|col1| array_col2| result|
+----+------------------+---------+
| x| [1, 2, 3, 7, 7]|[2, 3, 7]|
| z|[3, 2, 8, 9, 4, 9]|[2, 8, 9]|
| a| [4, 5, 2, 8]|[5, 2, 8]|
+----+------------------+---------+
In first row, result
contains sub-array [2, 3, 7] which is created with 3 elements from index 2 in [1, 2, 3, 7
, 7].
sort_array
This function sorts the array in ascending order by default. However, we can sort in descending order with second arg as asc=false
.
Output:
+----+------------------+------------------+
|col1| array_col2| result|
+----+------------------+------------------+
| x| [1, 2, 3, 7, 7]| [7, 7, 3, 2, 1]|
| z|[3, 2, 8, 9, 4, 9]|[9, 9, 8, 4, 3, 2]|
| a| [4, 5, 2, 8]| [8, 5, 4, 2]|
+----+------------------+------------------+
Column result
is sorted in descending order. For example, in the first row the result
column contains [7, 7, 3, 2, 1]
which is the descending sorted result of array[1, 2, 3, 7, 7]
from column array_col2
.
explode
This function creates a new row for each element of an array or map.
Let’s first create new column with fewer values to explode.
Output:
+----+---------+
|col1|slice_col|
+----+---------+
| x| [1, 2]|
| z| [3, 2]|
| a| [4, 5]|
+----+---------+
slice_col
contains 2 elements in an array. So upon explode, this generates 2 rows for each array.
Now let's try to explode the column slice_col
.
Output:
+----+---------+------+
|col1|slice_col|result|
+----+---------+------+
| x| [1, 2]| 1|
| x| [1, 2]| 2|
| z| [3, 2]| 3|
| z| [3, 2]| 2|
| a| [4, 5]| 4|
| a| [4, 5]| 5|
+----+---------+------+
Upon explode, 2 rows are generated for each element of an array in column slice_col
.
posexplode
This function creates a new row for each element with position of an array or map.
Let’s first create new column with fewer values to explode.
Now let’s try to explode the slice_col
with a position as well.
Output:
+----+---------+---+---+
|col1|slice_col|pos|col|
+----+---------+---+---+
|x |[1, 2] |0 |1 |
|x |[1, 2] |1 |2 |
|z |[3, 2] |0 |3 |
|z |[3, 2] |1 |2 |
|a |[4, 5] |0 |4 |
|a |[4, 5] |1 |5 |
+----+---------+---+---+
Here 2 more columns got generated as pos
(represent position) and col
(represent element value).
No comments:
Post a Comment