Posts

Create a web application using Streamlit and AgGrid

import streamlit as st
import numpy as np
import pandas as pd
import altair as alt
from st_aggrid import AgGrid, GridOptionsBuilder
from pyspark.sql import SparkSession
import re

# Wide layout; set_page_config may only be called once, so ignore a repeated call.
try:
    st.set_page_config(layout="wide")
except:
    pass

col1, col2 = st.columns([2, 8])
with col1:
    st.image('logo.jpg', width=100)

# Connect to the remote Spark cluster (placeholder connect string).
spark = SparkSession.builder \
    .remote("abc://host:xxx") \
    .enableHiveSupport().getOrCreate()

#df = spark.sql("SELECT * FROM db.table")
@st.cache_data()
def load_data():
    #df = pd.read_csv('C:/Users/Downloads/xx.csv')
    df1 = spark.sql("SELECT * FROM db.table")
    df = df1.toPandas()
    return df

df = load_data()
#df = df.toPandas()

new_title = '<center><p style="font-family:sans-serif; color:Black; font-size: 20px;"><h1 style="text-align:center">Self Service Portal</h1></p></center>'
st.markdown(new_title, unsafe_allow_html=True)
background_color = "#F0FF...
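The snippet above is cut off before the grid itself is drawn. A minimal sketch of how the loaded DataFrame could be rendered with the GridOptionsBuilder and AgGrid imports already present (the pagination and height settings are my assumptions, not part of the original post):

# Hedged continuation: render df in an interactive AgGrid table.
gb = GridOptionsBuilder.from_dataframe(df)
gb.configure_pagination()                         # assumed option: paginate long result sets
grid_options = gb.build()
AgGrid(df, gridOptions=grid_options, height=400)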

How to design a simple webpage using Code Studio

import streamlit as st
import pandas as pd
import numpy as np
import altair as alt
import dataiku

#st.title('Selfserve reports!')
cust_ds = dataiku.Dataset("name of dataset")
df = cust_ds.get_dataframe()

# create list of cities
cntry_lst = df["city"].unique()
# create list of year-months
year_lst = df["yearmonth"].unique()
#sel = st.selectbox('Select Market', cntry_lst, key='sel')
cust_seg = df["customer_segment_current"].unique()

sel = st.multiselect('Select City', cntry_lst, default=None, key='sel')
yrs = st.multiselect('Year(s)', year_lst, default=None, key='yrs')
segment = st.multiselect('Segment(s)', cust_seg, default=None, key='seg')

df = df[df["city"].isin(sel) & df["yearmonth"].isin(yrs)]
df2 = df[df["customer_segment_current"].isin(segment)]

# Select dim
dim = df2[['city'
           #,'yearmonth'
           , ...
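The post is truncated where the dimension columns are selected. A possible continuation, purely as a sketch (the chosen columns and the count-by-city chart are assumptions):

# Hedged continuation: show the filtered rows and a simple count by city.
dim = df2[['city', 'yearmonth', 'customer_segment_current']]
st.dataframe(dim)

chart = alt.Chart(df2).mark_bar().encode(x='city', y='count()')
st.altair_chart(chart, use_container_width=True)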

Regular expressions with Spark SQL

If you have data where you want to strip all special characters and digits out of a string, use the Spark SQL regular expression function regexp_replace.

Example:

Name
sbdgd&(&**,4343
ram7534- 5%7777

select regexp_replace(name, '[^a-zA-Z]', '') as name from table_name

Result:

Name
sbdgd
ram
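If you are already working in PySpark, the same cleanup can be done through the DataFrame API instead of a SQL string. A small sketch, assuming an existing SparkSession named spark and a table db.table_name with a name column (both placeholders):

from pyspark.sql import functions as F

df = spark.table("db.table_name")                                      # placeholder table
clean = df.withColumn("name", F.regexp_replace("name", "[^a-zA-Z]", ""))
clean.show()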

Root path does not exist

Problem: When you execute your query, you get the error below.

Root path does not exist
Root path of the dataset does not exist

Solution: Check your query; the likely cause is the following. The query uses a database.table name, but the input dataset is not synced with that database (or with any metastore DB).

Example: select * from db1.abc

Confirm that abc really is a table in database db1. It is possible that the data is saved on HDFS only; in that case, query the table without the database prefix, e.g. select * from abc
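A quick way to check which of the two forms applies, assuming a SparkSession named spark is already available (db1 and abc are the placeholder names from the example above):

spark.sql("SHOW TABLES IN db1").show()          # is abc registered under db1?
spark.sql("SHOW TABLES").show()                 # or is it only in the default location?
spark.sql("SELECT * FROM abc LIMIT 10").show()  # query without the database prefix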

Date format issue with Spark SQL

If you have a date format like

date_col
1/1/1978 0:00
6/6/1975 0:00
1/1/1978 0:00
1/1/1978 0:00
7/23/1988 0:00
1/1/1978 0:00
1/26/1961 0:00
3/3/1980 0:00
6/2/1984 0:00
1/2/1971 0:00

Solution:

select substring(TO_DATE(date_col, 'MM/dd/yy'), 0, 10) from table

Result:

1978-01-01
1975-06-06
..
1988-07-23
..
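Note that the pattern above relies on the lenient legacy parser; on Spark 3 with the default (CORRECTED) datetime parser it may return null or fail. Two hedged alternatives, assuming a SparkSession named spark and the same placeholder table and column names as above:

# Option 1 (assumption): spell out the incoming pattern, including the trailing time part.
spark.sql("SELECT to_date(date_col, 'M/d/yyyy H:mm') AS date_col FROM table").show()

# Option 2 (assumption): keep the original pattern but switch back to the legacy parser.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.sql("SELECT substring(to_date(date_col, 'MM/dd/yy'), 0, 10) AS date_col FROM table").show()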

Python UnicodeDecodeError with SAS file

If you are getting the error below while reading a SAS file with Python, or the output displays values as b'  value':

[13:47:17] [INFO] [dku.utils] - return lib.map_infer(values, mapper, convert=convert)
[13:47:17] [INFO] [dku.utils] - File "lib.pyx", line 2972, in pandas._libs.lib.map_infer
[13:47:17] [INFO] [dku.utils] - File "<string>", line 15, in <lambda>
[13:47:17] [INFO] [dku.utils] - UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 20: invalid start byte

Solution: Use the code below, which decodes the byte-string columns after loading:

import pandas as pd, numpy as np

df = pd.read_sas('path//20240710.sas7bdat')
for col in df.columns:
    if df[col].dtype == 'object':
        # Decode byte strings, dropping any bytes that are not valid UTF-8.
        df[col] = df[col].apply(lambda x: x.decode('utf-8', 'ignore') if isinstance(x, bytes) else x)
print(df)
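Alternatively, pandas can decode the byte strings while reading the file through the encoding argument of read_sas. Which codec is right depends on how the SAS dataset was written; latin-1 below is only an assumption:

import pandas as pd

# Decode at read time instead of fixing the columns afterwards; the codec is a guess.
df = pd.read_sas('path//20240710.sas7bdat', encoding='latin-1')
print(df.head())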

Hive partition sub folders HIVE_UNION_SUBDIR_1, HIVE_UNION_SUBDIR_2, HIVE_UNION_SUBDIR_8

Hive partitions have sub folders like HIVE_UNION_SUBDIR_1, HIVE_UNION_SUBDIR_2, HIVE_UNION_SUBDIR_3.

Problem: From Hive 1.2.0 onward, when you use UNION ALL in a query and the Tez execution engine is set in hive-site.xml (or the Hive configuration), Hive writes each branch of the union into its own sub folder such as HIVE_UNION_SUBDIR_1 on HDFS. When you then read the partitioned data with Spark SQL or another SQL engine, the resulting partitions come back blank or with 0 values.

Solution: Change the Hive configuration and set the property below. This worked for me in Dataiku.

hive.execution.engine=mr

This invokes MapReduce, which is a bit slower to process the job, but it stops the extra sub folders from being created on HDFS.
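For reference, the corresponding entry in hive-site.xml could look like the sketch below; where exactly you set it depends on your cluster and on how Dataiku manages the Hive configuration:

<property>
  <name>hive.execution.engine</name>
  <value>mr</value>
</property>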