AFAIK Hadoop MR only uses local storage for the shuffle between map and reduce; the output of each job is committed to shared storage. That's where writing to HDFS takes place. Writing to cloud storage is trickier due to non-POSIX semantics, especially around rename — on an object store, rename is copy-then-delete, so the classic commit-by-rename protocol is neither atomic nor cheap — plus the stores' tendency to throttle. And a complex SQL-equivalent statement can compile down to multiple MR jobs, each committing its result to that shared store.

Storage of intermediate shuffle data is managed by the YARN NodeManager, so it outlives the mapper/reducer processes. You can also plug in new shuffle services, e.g. for Spark (see the sketch below). This does bake in the assumption that hosts are long-lived, so it doesn't suit compute-only VMs running at spot prices.
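For reference, plugging Spark's external shuffle service into the NodeManager looks roughly like this in yarn-site.xml — a sketch based on the Spark-on-YARN docs, check against the version you actually run:

    <!-- yarn-site.xml: run Spark's shuffle service as a NodeManager
         aux service, alongside the stock MR shuffle handler. Shuffle
         files are then served by the NodeManager process and survive
         executor exit, same as MR's intermediate output. -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>

(You also have to put the spark-<version>-yarn-shuffle.jar on the NodeManager's classpath so that class can load.) Note this makes the NodeManager even more stateful — another reason the model assumes long-lived hosts.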