Powering Penguin Insights: A Practical Guide to Snowflake Cortex AI with DeepSeek-R1 LLM
Snowflake Cortex is a suite of AI features that use large language models (LLMs) to understand unstructured data, answer freeform questions, and provide intelligent assistance. This tutorial is a practical, hands-on guide to analyzing penguin data with it.
Let us get started.
Part A: Load the data to be analyzed
- Download penguins.csv to your local machine from here
- Load the CSV into a new Snowflake table (a programmatic alternative is sketched below)
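If you prefer to do the load step in code rather than through the UI, a minimal Snowpark sketch along the following lines should work. The connection parameters and the table name PENGUINS are assumptions; adapt them to your own account:

# Minimal sketch: load penguins.csv into a PENGUINS table using Snowpark.
# The connection parameters below are placeholders -- fill in your own details.
import pandas as pd
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Read the downloaded file and write it to Snowflake, creating the table if needed
pdf = pd.read_csv("penguins.csv")
session.write_pandas(pdf, "PENGUINS", auto_create_table=True, overwrite=True)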
At this point, you should see a new table called Penguins in your selected database.
Preview of the data (it has a total of 344 records):
Part B: Analysis code in Snowflake Notebook
Create a new Snowflake notebook — ANALYSIS_USING_LLM_PENGUINS
From the top right, import the relevant packages.
Implement the helper functions —
a) generate_deepseek_response() — This function calls the Snowflake Cortex COMPLETE function. The first parameter is the name of the reasoning model (DeepSeek-R1 in our case) and the second is the prompt to be sent to the model.
b) extract_think_content() — This function processes the response returned by the LLM, separating the model's reasoning (its <think> block) from the final answer.
# Helper functions
# json, re and the Snowpark session used below are imported / created in the
# Streamlit app cell further down; all notebook cells share the same namespace.

def generate_deepseek_response(prompt):
    # Wrap the prompt in instruction tags and send it to the Cortex COMPLETE
    # function, naming DeepSeek-R1 as the reasoning model
    cortex_prompt = f"'[INST] {prompt} [/INST]'"
    prompt_data = [{'role': 'user', 'content': cortex_prompt}]
    prompt_json = escape_sql_string(json.dumps(prompt_data))
    response = session.sql(
        "select snowflake.cortex.complete(?, ?)",
        params=['deepseek-r1', prompt_json]
    ).collect()[0][0]
    return response

def extract_think_content(response):
    # DeepSeek-R1 wraps its reasoning in <think>...</think> tags;
    # split that reasoning from the final answer
    think_pattern = r'<think>(.*?)</think>'
    think_match = re.search(think_pattern, response, re.DOTALL)
    if think_match:
        think_content = think_match.group(1).strip()
        main_response = re.sub(think_pattern, '', response, flags=re.DOTALL).strip()
        return think_content, main_response
    return None, response

def escape_sql_string(s):
    # Escape single quotes so the JSON prompt survives inside a SQL string literal
    return s.replace("'", "''")
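Before wiring these helpers into the app, you can sanity-check them with a one-off call. The question text below is just an example:

# Quick smoke test of the two helpers defined above
raw = generate_deepseek_response("In one sentence, what is a Gentoo penguin?")
thoughts, answer = extract_think_content(raw)
if thoughts:
    print("Model reasoning:", thoughts)
print("Final answer:", answer)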
Implement the code for the Streamlit app that takes user questions and sends them to the DeepSeek-R1 model —
# Streamlit app to send questions to the LLM
import streamlit as st
from snowflake.snowpark.context import get_active_session
import json
import pandas as pd
import re

# Write directly to the app
st.title("🐧 Ask about Penguins")

# Get the current credentials
session = get_active_session()

# penguinsData is assumed to reference the result of an earlier notebook cell
# that selects from the Penguins table
df = penguinsData.to_pandas()

user_queries = ["Which penguin has the longest bill length?",
                "Where do the heaviest penguins live?",
                "Which penguin has the shortest flippers?"]

question = st.selectbox("What would you like to know?", user_queries)
# question = st.text_input("Ask a question", user_queries[0])

prompt = [
    {
        'role': 'system',
        'content': 'You are a helpful assistant that uses provided data to answer natural language questions.'
    },
    {
        'role': 'user',
        'content': (
            f'The user has asked a question: {question}. '
            f'Please use this data to answer the question: {df.to_markdown(index=False)}'
        )
    },
    {
        # Generation settings; generate_deepseek_response() sends the whole list
        # as prompt text, so for Cortex to actually enforce them (including
        # Cortex Guard) they would normally go in a separate options argument to COMPLETE
        'temperature': 0.7,
        'max_tokens': 1000,
        'guardrails': True
    }
]

# Display the data used to answer the questions
df

if st.button("Submit"):
    status_container = st.status("Thinking ...", expanded=True)
    with status_container:
        response = generate_deepseek_response(prompt)
        think_content, main_response = extract_think_content(response)
        if think_content:
            st.write(think_content)
    status_container.update(label="Thoughts", state="complete", expanded=False)
    st.markdown(main_response)
For the question we asked, the model correctly responds:
The penguin with the longest bill length is Gentoo from Biscoe Island with a bill length of 59.6 mm (male)
One can try other questions as well.
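Since generate_deepseek_response() JSON-stringifies whatever it receives, you are not tied to the message-list format; a plain-string prompt works just as well. A minimal variant of the submit handler (same helpers, hypothetical prompt wording) could look like this:

if st.button("Submit"):
    # Flatten the instruction, the question and the data into one prompt string
    flat_prompt = (
        "You are a helpful assistant that uses provided data to answer natural language questions.\n"
        f"The user has asked a question: {question}.\n"
        f"Please use this data to answer the question:\n{df.to_markdown(index=False)}"
    )
    response = generate_deepseek_response(flat_prompt)
    think_content, main_response = extract_think_content(response)
    st.markdown(main_response)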
The advantage of using Snowflake Cortex AI is that accessing large language models (LLMs) is extremely easy, with no integrations or API keys to manage. Governance controls are also straightforward: Cortex Guard can filter out potentially inappropriate content.
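As an aside, the temperature, max_tokens and guardrails settings only take effect when COMPLETE receives them as a separate options argument; embedded in the prompt text (as in the app above) they are merely shown to the model. A minimal sketch of the options form, assuming the same session object, might look like this (with options, COMPLETE returns a JSON object containing choices and usage rather than plain text):

# Sketch: pass generation options (including Cortex Guard) as the third argument to COMPLETE
guarded_sql = """
select snowflake.cortex.complete(
    'deepseek-r1',
    [{'role': 'user', 'content': 'Which penguin has the longest bill length?'}],
    {'temperature': 0.7, 'max_tokens': 1000, 'guardrails': true}
)
"""
# The result is a JSON object (choices, usage, ...) rather than a bare answer string
print(session.sql(guarded_sql).collect()[0][0])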
Bonus
In case you are getting curious about the penguin species, here is how they look.
Conclusion
This tutorial demonstrates the following —
- Loading sample data from a local CSV into a Snowflake table
- Querying the data using Snowflake Cortex AI
- Using Cortex Guard & the DeepSeek-R1 LLM
If you like my tutorials, consider giving multiple claps & following me for more interesting reads.