The problem arose because you used .option("mode", "overwrite"), which applies when reading data. For writing data, as in your case, use .mode("overwrite") instead. I used this and it worked fine:

    write_df = read_df.repartition(3).write.format("csv")\
        .option("header", "True")\
        .mode("overwrite")\
        .option("path", "/FileStore/tables/Write_Data/")\
        .save()

Then I ran dbutils.fs.ls("/FileStore/tables/Write_Data/") and it showed the entries too, post-repartitioning of the data.
Code syntax to overwrite the current data in Spark:

    final_transformation.repartition(4).write.format("csv")\
        .option("header", True)\
        .mode("overwrite")\
        .save("/FileStore/tables/Transformed_data_12_08_2024")
Hello sir, great lecture. I am facing one problem: in the last part where you were partitioning, I am not getting 3 files, just one entry with this output: [FileInfo(path='dbfs:/FileStore/tables/csv_write_repartition/*/', name='*/', size=0, modificationTime=0)]. Kindly help me.
No need. Whatever you need to become a DE is available for free. In the roadmap video you can find all the resources and technologies required to become a DE.
I mean that while writing with mode = overwrite, the first time we run the code it creates the file, but the next time we run it, it does not overwrite the previous file and instead gives an error that the file already exists. Ideally it should replace the previous file with the new one.
@@lucky_raiser Yes, there was some bug in the Community Edition! I had commented about it on another video, and @manish_kumar_1 also confirmed that he faced the same issue. I'm not able to recall how we got around it, sorry!
How can we optimize a DataFrame write to CSV when it's a large file? It takes a long time to write. Code: df.coalesce(1).write... (only one file is needed in the destination path).
I don't think you can do much in this case. All the optimization techniques apply before the final DataFrame is created; since you are merging all partitions into one at the end and then writing, there is no option left to optimize the write itself. If it is allowed, you can partition or bucket your data so that whenever you read that written DataFrame next time, it will query faster.
With save(), the data is saved as a file. With saveAsTable(), the data is also stored as files under the hood, but an entry is made in the Hive metastore, so when you run select * from table it looks like it has been saved as a table.
@@manish_kumar_1 Yes, correct. When we save data with saveAsTable(), the data gets saved, but under the hood it is still a file; the difference is that we are able to write SQL queries on top of it.
I am getting this error, can anyone help me please?

    write_df = df.repartition(3).write.format("csv")\
        .option("header", "True")\
        .mode("overwrite")\
        .option("path", "/FileStore/tables/write-1.csv/")\
        .save()

    AttributeError: 'NoneType' object has no attribute 'repartition'
While creating df, did you use .show() at the end? If so, just remove it, because it most probably returns None from there. This works:

    df = spark.read.format("csv")\
        .option("header", "true")\
        .option("mode", "PERMISSIVE")\
        .load("dbfs:/FileStore/tables/write_data_file.csv")

    df.write.format("csv")\
        .option("header", "true")\
        .mode("overwrite")\
        .option("path", "/FileStore/tables/csv_write/")\
        .save()