AI code-writing assistant that understands data content and generates code

Recently I encountered a client who predicts that “in 6 months AI will be able to do much of the coding instead of humans”:

…in years you’ll be able to on the fly, ask the AI to purchase a server, or create a website with X website builder… and basically, I bet it will write code on the fly on your demand where it connects to these tool’s APIs to really make things happen. It could do this now for some easy stuff but it’s unreliable and will mess up.

Now we’ve encountered an interesting public repo called Sketch. It’s an AI code-writing assistant for Pandas (Python) users.

Sketch is a Python library that assists with code-writing for data mining. It supports a “standard” (hypothetical) data-analysis workflow, offering a natural-language interface that successfully navigates many tasks in the data stack landscape.

How it works

Sketch uses efficient approximation algorithms (data sketches) to quickly summarize your data and feed that information into language models. Right now it does this by summarizing the columns and writing these summary statistics as additional context for the code-writing prompt. In the future, the Sketch authors hope to feed these sketches directly into custom-made “data + language” foundation models to get more accurate results.
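
To make the idea of "summary statistics as prompt context" concrete, here is a minimal sketch in plain Python. It is not Sketch's actual implementation: the function names summarize_column and build_prompt are hypothetical, a dict of lists stands in for a DataFrame, and exact counts stand in for the approximate data sketches the library uses.

```python
from collections import Counter


def summarize_column(name, values, top_n=3):
    """Summarize one column with exact counts (a stand-in for a real data sketch)."""
    non_null = [v for v in values if v is not None]
    most_common = Counter(non_null).most_common(top_n)
    return {
        "name": name,
        "types": sorted({type(v).__name__ for v in non_null}),
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": [value for value, _ in most_common],
    }


def build_prompt(columns, question):
    """Write the column summaries as context lines, followed by the user's question."""
    lines = ["The dataframe has these columns:"]
    for name, values in columns.items():
        s = summarize_column(name, values)
        lines.append(
            f"- {s['name']}: types={s['types']}, rows={s['rows']}, "
            f"nulls={s['nulls']}, distinct={s['distinct']}, examples={s['top_values']}"
        )
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical toy data; None marks a missing value.
columns = {
    "Order ID": [1, 2, 3, 4],
    "Product": ["Cable", "Phone", "Cable", None],
}
prompt = build_prompt(columns, "Which columns are integer type?")
print(prompt)
```

The language model never sees the full table in this scheme, only the short summary text, which is why the summaries need to be cheap to compute and compact.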

Main functionality

1. sketch.ask

Ask is a basic question-answering system in Sketch. It returns a text answer based on the summary statistics and description of the data.

! pip install sketch

For example, suppose we have loaded a data set into memory:

import sketch
import pandas as pd
sales_data = pd.read_csv("https://gist.githubusercontent.com/bluecoconut/9ce2135aafb5c6ab2dc1d60ac595646e/raw/c93c3500a1f7fae469cba716f09358cfddea6343/sales_demo_with_pii_and_all_states.csv")
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185950 entries, 0 to 185949
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Order ID          185950 non-null  int64  
 1   Product           185950 non-null  object 
 2   Quantity Ordered  185950 non-null  float64
 3   Price Each        185950 non-null  float64
 4   Order Date        185950 non-null  object 
 5   Purchase Address  185950 non-null  object 
 6   Credit Card       185950 non-null  object 
 7   SSN               185950 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 11.3+ MB

Sketch can now be asked a question about this data:

sales_data.sketch.ask("Which columns are integer type?")

The reply is the following:

The columns that are integer type are: Order ID, Quantity Ordered, and Index.
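
Because the reply is generated text, it is worth checking against the data itself. In the info() output above, Quantity Ordered is float64, so only Order ID is an integer column. A minimal check, assuming a tiny stand-in DataFrame (the values are made up; only the dtypes mirror sales_data):

```python
import pandas as pd

# Hypothetical stand-in for sales_data with the same dtypes for three columns.
df = pd.DataFrame({
    "Order ID": [1001, 1002],          # int64
    "Quantity Ordered": [1.0, 2.0],    # float64
    "Product": ["Cable", "Phone"],     # object
})

# Keep only the columns whose dtype is an integer type.
integer_columns = [col for col in df.columns if pd.api.types.is_integer_dtype(df[col])]
print(integer_columns)
```

The same list comprehension can be run on sales_data directly to confirm or correct the model's answer.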

2. sketch.howto

This is the basic “code-writing” prompt in Sketch. It returns a code block that you should be able to copy, paste, and use as a starting point for your own code. For example:

sales_data.sketch.howto("Plot the sales versus time")

It’ll generate the following code:

# import libraries
import matplotlib.pyplot as plt
import pandas as pd

# read the dataframe
df = pd.read_csv('sales_data.csv')

# convert the Order Date column to datetime format
df['Order Date'] = pd.to_datetime(df['Order Date'])

# create a new column for the month of the order date
df['Month'] = df['Order Date'].dt.month

# group by month and sum up the total sales for each month
monthly_sales = df.groupby('Month').sum()['Quantity Ordered']

# plot the sales versus time 
plt.plot(monthly_sales)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Sales versus Time')
plt.show()

When we run the code, replacing the line df = pd.read_csv('sales_data.csv') with df = sales_data, we get a plot of monthly sales versus time.

See the examples in the video on the GitHub repo page.
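
Note that the generated code sums Quantity Ordered, that is, units sold rather than money. If "sales" should mean revenue, a possible adjustment is sketched below. The toy rows and the date format are assumptions for illustration, not taken from the real data set. Selecting the Revenue column before calling sum() also avoids aggregating the non-numeric columns, which may fail on recent pandas versions.

```python
import pandas as pd

# Hypothetical rows using the column names from sales_data; the date format is assumed.
df = pd.DataFrame({
    "Order Date": ["01/15/19 10:00", "01/20/19 12:30", "02/03/19 09:15"],
    "Quantity Ordered": [1.0, 2.0, 3.0],
    "Price Each": [10.0, 5.0, 2.0],
})

# Parse the dates and compute revenue per order line.
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%m/%d/%y %H:%M")
df["Revenue"] = df["Quantity Ordered"] * df["Price Each"]

# Group by calendar month and sum only the Revenue column.
monthly_revenue = df.groupby(df["Order Date"].dt.month)["Revenue"].sum()
monthly_revenue.index.name = "Month"
print(monthly_revenue)
```

The resulting series can be passed to plt.plot in place of monthly_sales in the generated code.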

3. sketch.apply

This is a more advanced prompt that is most useful for data generation. Use it to parse fields, generate new features, and more. For example:

df['review_keywords'] = df.sketch.apply("Keywords for the review [{{ review_text }}] of product [{{ product_name }}] (comma separated):")
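
The {{ ... }} placeholders in the template name DataFrame columns, so each row produces its own prompt. The snippet below only illustrates that substitution step in plain Python. It is an assumption about the behavior, not Sketch's code: render_prompt is a hypothetical helper and the rows are made up.

```python
import re

# Matches placeholders such as {{ review_text }} and captures the column name.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, row):
    """Replace each {{ column }} placeholder with the row's value for that column."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)


template = (
    "Keywords for the review [{{ review_text }}] "
    "of product [{{ product_name }}] (comma separated):"
)

# Hypothetical rows; in practice these would be the rows of the DataFrame.
rows = [
    {"review_text": "Battery lasts two days", "product_name": "Phone"},
    {"review_text": "Cable frayed after a week", "product_name": "USB-C Cable"},
]

prompts = [render_prompt(template, row) for row in rows]
for prompt in prompts:
    print(prompt)
```

Each rendered prompt would then be sent to the language model, and the completions collected into the new column.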

GPT-3 based

Sketch is built on GPT-3.

GPT-3 is the third generation of the GPT (Generative Pre-trained Transformer) series. With 175 billion parameters, it is significantly larger and more powerful than its predecessors.

Playground

There is a Sketch playground with examples on Google Colab.

Conclusion

As AI grows in its capabilities, we’ll see more code-writing tools. As for my own machine learning endeavours, I’d gladly apply Sketch to data mining problems.
