Posts

Create a web application using Streamlit and AgGrid

import streamlit as st
import numpy as np
import pandas as pd
import altair as alt
from st_aggrid import AgGrid, GridOptionsBuilder
from pyspark.sql import SparkSession
import re

# Wide layout; set_page_config may only be called once, so ignore a repeated call.
try:
    st.set_page_config(layout="wide")
except:
    pass

col1, col2 = st.columns([2, 8])
with col1:
    st.image('logo.jpg', width=100)

# Connect to the remote Spark cluster (placeholder connect string).
spark = SparkSession.builder \
    .remote("abc://host:xxx") \
    .enableHiveSupport().getOrCreate()

#df = spark.sql("SELECT * FROM db.table")
@st.cache_data()
def load_data():
    #df = pd.read_csv('C:/Users/Downloads/xx.csv')
    df1 = spark.sql("SELECT * FROM db.table")
    df = df1.toPandas()
    return df

df = load_data()
#df = df.toPandas()

new_title = '<center><p style="font-family:sans-serif; color:Black; font-size: 20px;"><h1 style="text-align:center">Self Service Portal</h1></p></center>'
st.markdown(new_title, unsafe_allow_html=True)
background_color = "#F0FF...
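The snippet above is cut off before the grid itself is drawn. A minimal sketch of how the loaded DataFrame could be rendered with the GridOptionsBuilder and AgGrid imports already present (the pagination and height settings are my assumptions, not part of the original post):

# Hedged continuation: render df in an interactive AgGrid table.
gb = GridOptionsBuilder.from_dataframe(df)
gb.configure_pagination()                         # assumed option: paginate long result sets
grid_options = gb.build()
AgGrid(df, gridOptions=grid_options, height=400)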

How to design a simple webpage using Code Studio

import streamlit as st
import pandas as pd
import numpy as np
import altair as alt
import dataiku

#st.title('Selfserve reports!')
cust_ds = dataiku.Dataset("name of dataset")
df = cust_ds.get_dataframe()

# create list of cities
cntry_lst = df["city"].unique()
# create list of year-months
year_lst = df["yearmonth"].unique()
#sel = st.selectbox('Select Market', cntry_lst, key='sel')
cust_seg = df["customer_segment_current"].unique()

sel = st.multiselect('Select City', cntry_lst, default=None, key='sel')
yrs = st.multiselect('Year(s)', year_lst, default=None, key='yrs')
segment = st.multiselect('Segment(s)', cust_seg, default=None, key='seg')

df = df[df["city"].isin(sel) & df["yearmonth"].isin(yrs)]
df2 = df[df["customer_segment_current"].isin(segment)]

# Select dim
dim = df2[['city'
           #,'yearmonth'
           , ...
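The post is truncated where the dimension columns are selected. A possible continuation, purely as a sketch (the chosen columns and the count-by-city chart are assumptions):

# Hedged continuation: show the filtered rows and a simple count by city.
dim = df2[['city', 'yearmonth', 'customer_segment_current']]
st.dataframe(dim)

chart = alt.Chart(df2).mark_bar().encode(x='city', y='count()')
st.altair_chart(chart, use_container_width=True)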

Regular expressions with Spark SQL

If you have data where you want to strip all special characters and digits out of a string, use the Spark SQL regular expression function regexp_replace.

Example:

Name
sbdgd&(&**,4343
ram7534- 5%7777

select regexp_replace(name, '[^a-zA-Z]', '') as name from table_name

Result:

Name
sbdgd
ram
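If you are already working in PySpark, the same cleanup can be done through the DataFrame API instead of a SQL string. A small sketch, assuming an existing SparkSession named spark and a table db.table_name with a name column (both placeholders):

from pyspark.sql import functions as F

df = spark.table("db.table_name")                                      # placeholder table
clean = df.withColumn("name", F.regexp_replace("name", "[^a-zA-Z]", ""))
clean.show()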

Root path does not exist

Problem: When you execute your query, you get the error below.

Root path does not exist
Root path of the dataset does not exist

Solution: Check your query; the likely cause is the following. The query uses a database.table name, but the input dataset is not synced with that database (or with any metastore DB).

Example: select * from db1.abc

Confirm that abc really is a table in database db1. It is possible that the data is saved on HDFS only; in that case, query the table without the database prefix, e.g. select * from abc
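A quick way to check which of the two forms applies, assuming a SparkSession named spark is already available (db1 and abc are the placeholder names from the example above):

spark.sql("SHOW TABLES IN db1").show()          # is abc registered under db1?
spark.sql("SHOW TABLES").show()                 # or is it only in the default location?
spark.sql("SELECT * FROM abc LIMIT 10").show()  # query without the database prefix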

Date format issue with Spark SQL

If you have a date format like

date_col
1/1/1978 0:00
6/6/1975 0:00
1/1/1978 0:00
1/1/1978 0:00
7/23/1988 0:00
1/1/1978 0:00
1/26/1961 0:00
3/3/1980 0:00
6/2/1984 0:00
1/2/1971 0:00

Solution:

select substring(TO_DATE(date_col, 'MM/dd/yy'), 0, 10) from table

Result:

1978-01-01
1975-06-06
..
1988-07-23
..
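Note that the pattern above relies on the lenient legacy parser; on Spark 3 with the default (CORRECTED) datetime parser it may return null or fail. Two hedged alternatives, assuming a SparkSession named spark and the same placeholder table and column names as above:

# Option 1 (assumption): spell out the incoming pattern, including the trailing time part.
spark.sql("SELECT to_date(date_col, 'M/d/yyyy H:mm') AS date_col FROM table").show()

# Option 2 (assumption): keep the original pattern but switch back to the legacy parser.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.sql("SELECT substring(to_date(date_col, 'MM/dd/yy'), 0, 10) AS date_col FROM table").show()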

Python UnicodeDecodeError with SAS file

If you are getting the error below while reading a SAS file with Python, or the output displays values as b'  value':

[13:47:17] [INFO] [dku.utils] - return lib.map_infer(values, mapper, convert=convert)
[13:47:17] [INFO] [dku.utils] - File "lib.pyx", line 2972, in pandas._libs.lib.map_infer
[13:47:17] [INFO] [dku.utils] - File "<string>", line 15, in <lambda>
[13:47:17] [INFO] [dku.utils] - UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 20: invalid start byte

Solution: Use the code below, which decodes the byte-string columns after loading:

import pandas as pd, numpy as np

df = pd.read_sas('path//20240710.sas7bdat')
for col in df.columns:
    if df[col].dtype == 'object':
        # Decode byte strings, dropping any bytes that are not valid UTF-8.
        df[col] = df[col].apply(lambda x: x.decode('utf-8', 'ignore') if isinstance(x, bytes) else x)
print(df)
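Alternatively, pandas can decode the byte strings while reading the file through the encoding argument of read_sas. Which codec is right depends on how the SAS dataset was written; latin-1 below is only an assumption:

import pandas as pd

# Decode at read time instead of fixing the columns afterwards; the codec is a guess.
df = pd.read_sas('path//20240710.sas7bdat', encoding='latin-1')
print(df.head())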

Hive partition sub folders HIVE_UNION_SUBDIR_1, HIVE_UNION_SUBDIR_2, HIVE_UNION_SUBDIR_8

Hive partitions have sub folders like HIVE_UNION_SUBDIR_1, HIVE_UNION_SUBDIR_2, HIVE_UNION_SUBDIR_3.

Problem: From Hive 1.2.0 onward, when you use UNION ALL in a query and the Tez execution engine is set in hive-site.xml (or the Hive configuration), Hive writes each branch of the union into its own sub folder such as HIVE_UNION_SUBDIR_1 on HDFS. When you then read the partitioned data with Spark SQL or another SQL engine, the resulting partitions come back blank or with 0 values.

Solution: Change the Hive configuration and set the property below. This worked for me in Dataiku.

hive.execution.engine=mr

This invokes MapReduce, which is a bit slower to process the job, but it stops the extra sub folders from being created on HDFS.
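For reference, the corresponding entry in hive-site.xml could look like the sketch below; where exactly you set it depends on your cluster and on how Dataiku manages the Hive configuration:

<property>
  <name>hive.execution.engine</name>
  <value>mr</value>
</property>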