
Spark's two core Shuffles

Published Time: 2022-08-20 17:22:22
In the MapReduce framework, the Shuffle phase is a bridge between Map and Reduce. The Map phase outputs data to the Reduce phase through the Shuffle process.

Spark Shuffle has two implementations: one is hash-based Shuffle; the other is sort-based Shuffle. We first walk through their development history, which helps in understanding Shuffle:


Before Spark 1.1, Spark had only one Shuffle implementation: hash-based Shuffle. Spark 1.1 introduced sort-based Shuffle, and starting with Spark 1.2 the default implementation changed from hash-based to sort-based, i.e., the default ShuffleManager changed from hash to sort. In Spark 2.0, Hash Shuffle was removed entirely.


One of the main reasons Spark initially implemented Shuffle based on hashing was to avoid unnecessary sorting. In Hadoop MapReduce, sorting is a fixed step, yet many jobs do not need their data sorted; MapReduce sorts them anyway, incurring a lot of unnecessary overhead.


In the hash-based Shuffle implementation, each task in the Mapper phase generates one file for each task in the Reduce phase, which usually results in a large number of files: M * R intermediate files in total, where M is the number of Mapper-phase tasks and R is the number of Reduce-phase tasks. This requires a large number of random disk I/O operations and considerable memory overhead.
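The quadratic growth in file count is easy to make concrete with a quick calculation (the task counts below are hypothetical values chosen for illustration):

```python
# Hash-based Shuffle: every Mapper-phase task writes one file
# per Reduce-phase task, so the total is M * R.
def hash_shuffle_files(num_map_tasks: int, num_reduce_tasks: int) -> int:
    return num_map_tasks * num_reduce_tasks

# Even a modestly sized job produces a huge number of small files:
print(hash_shuffle_files(1000, 1000))  # 1000000
```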


To alleviate this problem, Spark 0.8.1 introduced the shuffle-consolidation mechanism (file merging) for the hash-based Shuffle implementation, which merges the intermediate files generated on the Mapper side. Setting the spark.shuffle.consolidateFiles=true property reduces the number of intermediate files: with consolidation enabled, tasks running on the same core reuse the same set of output files, so one file per Reduce-phase task is generated per core rather than per Mapper task.
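On those older releases, the flag could be set like any other Spark property, for example via spark-submit (a sketch; the job script name is hypothetical, and note this property applied only to the legacy hash-based Shuffle and has no effect on modern Spark versions):

```shell
# Enable shuffle file consolidation on a legacy (pre-1.6) Spark release.
# Applies only to the hash-based Shuffle implementation.
spark-submit \
  --conf spark.shuffle.consolidateFiles=true \
  my_job.py
```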


The unit of file generation thus becomes the execution slot: each Executor has C cores and each task is allocated T cores (default 1), so each Executor can run C/T tasks concurrently. The total number of files drops from M * R to E * (C/T) * R, where E is the number of Executors, C is the number of cores per Executor, and T is the number of cores allocated to each task.
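A hypothetical comparison of the two formulas (all cluster and task counts below are invented for illustration):

```python
# Without consolidation: each of the M map tasks writes R files.
def unconsolidated_files(m: int, r: int) -> int:
    return m * r

# With consolidation: each concurrent task slot (C // T per Executor)
# keeps one reusable output file per Reduce-phase task.
def consolidated_files(e: int, c: int, t: int, r: int) -> int:
    return e * (c // t) * r

# 1000 map tasks, 1000 reduce tasks, 10 Executors with 4 cores each,
# 1 core per task:
print(unconsolidated_files(1000, 1000))   # 1000000
print(consolidated_files(10, 4, 1, 1000)) # 40000
```

The file count no longer depends on the number of Mapper tasks, only on the number of concurrently running task slots.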


Spark 1.1 introduced Sort-based Shuffle:


In the hash-based Shuffle implementation, the number of intermediate files still depends on the number of Reduce-phase tasks, i.e., the Reduce-side parallelism, so even with consolidation the file count remains hard to control and the problem is not fully solved. To address it, Spark 1.1 introduced a sort-based Shuffle implementation, and after Spark 1.2 the default implementation changed from hash-based to sort-based Shuffle, i.e., the default ShuffleManager changed from hash to sort.


In sort-based Shuffle, a Mapper-phase task no longer generates a separate file for each Reduce-phase task. Instead, each Mapper-phase task writes all of its output to a single data file and generates an accompanying index file, which Reduce-phase tasks use to locate their portion of the data. Avoiding the proliferation of small files directly reduces random disk I/O and memory overhead. Since each Mapper-phase task produces exactly two files (one data file and one index file), the final file count is 2 * M, where M is the number of Mapper-phase tasks.
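A quick hypothetical comparison shows why this matters: the 2 * M total is independent of the Reduce-side parallelism R (task counts invented for illustration):

```python
# Sort-based Shuffle: each map task writes one data file plus one
# index file, so the total is 2 * M regardless of R.
def sort_shuffle_files(num_map_tasks: int) -> int:
    return 2 * num_map_tasks

# 1000 map tasks and 1000 reduce tasks:
print(1000 * 1000)             # 1000000  (hash-based, M * R)
print(sort_shuffle_files(1000)) # 2000    (sort-based, 2 * M)
```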


Starting with Spark 1.4, a Tungsten-sort-based Shuffle implementation was introduced into the Shuffle process. The optimizations from the Tungsten project can greatly improve Spark's data-processing performance.


Note: In some specific application scenarios, the Shuffle mechanism based on Hash may outperform the Shuffle mechanism based on Sort.

