duckdb-nsql

![duckdb-nsql model](https://github.com/ollama/ollama/assets/3325447/b9217c78-0803-45fe-90cf-00bd76705a37)

DuckDB-NSQL is a 7 billion parameter text-to-SQL model designed specifically for SQL generation tasks.

It is based on Meta's original Llama-2 7B model, further pre-trained on a dataset of general SQL queries, and then fine-tuned on a dataset of DuckDB text-to-SQL pairs.

## Usage

### Example Prompt

```
Provided this schema:

CREATE TABLE taxi (
    VendorID bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    fare_amount double,
    extra double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double,
);

Give me taxis with more than 2 passengers
```

### Example output

```
SELECT * FROM taxi WHERE passenger_count > 2
```
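The prompt layout above can also be assembled programmatically. The following is a minimal sketch: `build_prompt` is a hypothetical helper name (not part of Ollama or the model's tooling), and the shortened schema is for illustration only.

```python
def build_prompt(schema: str, question: str) -> str:
    """Compose a prompt in the 'Provided this schema:' layout shown above."""
    # Schema and question are separated by blank lines, as in the example prompt.
    return f"Provided this schema:\n\n{schema.strip()}\n\n{question.strip()}"


# Shortened, illustrative schema (assumption: any CREATE TABLE text can be used).
schema = """CREATE TABLE taxi (
    VendorID bigint,
    passenger_count double
);"""

print(build_prompt(schema, "Give me taxis with more than 2 passengers"))
```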

## Setting the system prompt

This model expects the schema in the system prompt as input:

```
/set system """Here is the database schema that the SQL query will run on:
CREATE TABLE taxi (
    VendorID bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    fare_amount double,
    extra double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double,
);"""
```

Once the schema is provided in the system prompt, the model will use it in subsequent responses.

For the following prompt:

```
get all columns ending with _amount from taxi table
```

The model will output something like this:

```
SELECT COLUMNS('.*_amount') FROM taxi;
```

## API example

```
$ curl http://127.0.0.1:11434/api/generate -d '{
    "model": "duckdb-nsql:7b-q4_0",
    "system": "Here is the database schema that the SQL query will run on: CREATE TABLE taxi (VendorID bigint, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count double, trip_distance double, fare_amount double, extra double, tip_amount double, tolls_amount double, improvement_surcharge double, total_amount double,);",
    "prompt": "get all columns ending with _amount from taxi table"
}'
```
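The same request can be sent from Python using only the standard library. This is a sketch under a few assumptions: the Ollama server is listening on plain HTTP at its default local address, and `build_request` and `generate_sql` are hypothetical helper names. Setting `"stream": false` asks the API for a single JSON object instead of a stream of partial responses, which makes the result easier to parse.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
SYSTEM_PREFIX = "Here is the database schema that the SQL query will run on: "


def build_request(schema: str, question: str,
                  model: str = "duckdb-nsql:7b-q4_0") -> urllib.request.Request:
    """Build the POST request for /api/generate without sending it."""
    payload = {
        "model": model,
        "system": SYSTEM_PREFIX + schema,
        "prompt": question,
        "stream": False,  # one JSON object instead of a streamed response
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_sql(schema: str, question: str) -> str:
    """Send the request and return the generated SQL text."""
    with urllib.request.urlopen(build_request(schema, question)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate_sql(schema, "get all columns ending with _amount from taxi table")` requires a running server with the model pulled; `build_request` alone can be inspected offline.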

## Python library example

```
pip install ollama
```

```
import ollama

r = ollama.generate(
    model='duckdb-nsql:7b-q4_0',
    system='''Here is the database schema that the SQL query will run on:
CREATE TABLE taxi (
    VendorID bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    fare_amount double,
    extra double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double,
);''',
    prompt='get all columns ending with _amount from taxi table',
)

print(r['response'])
```
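When several questions target the same schema, the call above can be wrapped in a small helper. The sketch below is illustrative: `ask_sql` is a hypothetical name, and the `generate` parameter exists only so the helper can be exercised with a stand-in function when no Ollama server is available.

```python
SYSTEM_PREFIX = "Here is the database schema that the SQL query will run on:\n"


def ask_sql(question, schema, generate=None, model="duckdb-nsql:7b-q4_0"):
    """Return the model's SQL for `question`, given a schema string."""
    if generate is None:
        # Imported lazily so the helper can be defined without the package installed.
        import ollama
        generate = ollama.generate
    r = generate(model=model, system=SYSTEM_PREFIX + schema, prompt=question)
    # Trim surrounding whitespace from the generated text.
    return r["response"].strip()
```

For example, `ask_sql("get all columns ending with _amount from taxi table", schema)` sends one request per question while reusing the same schema text.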

## Training Data

The training data consists of 200k DuckDB text-to-SQL pairs, synthetically generated using Mixtral-8x7B-Instruct-v0.1 and guided by the DuckDB v0.9.2 documentation, plus text-to-SQL pairs from NSText2SQL that were transpiled to DuckDB SQL using sqlglot.

## Training Procedure

DuckDB-NSQL was trained using cross-entropy loss to maximize the likelihood of sequential inputs. For fine-tuning on text-to-SQL pairs, we compute the loss only over the SQL portion of each pair. The model was trained on 80GB A100s, leveraging data and model parallelism. We fine-tuned for 10 epochs.
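To illustrate what computing the loss only over the SQL portion means, here is a toy sketch in plain Python. It is not the authors' training code: the function name, the token probabilities, and the choice to average per token are all assumptions made for illustration. With one-hot targets, the cross-entropy for a token is the negative log-probability the model assigns to the correct token, so masking amounts to dropping the prompt tokens before averaging.

```python
import math


def sql_only_loss(token_log_probs, is_sql_token):
    """Mean negative log-likelihood over the tokens flagged as SQL."""
    if len(token_log_probs) != len(is_sql_token):
        raise ValueError("log-probs and mask must have the same length")
    # Keep only the SQL tokens; prompt (schema/question) tokens are ignored.
    selected = [-lp for lp, keep in zip(token_log_probs, is_sql_token) if keep]
    if not selected:
        raise ValueError("mask selects no SQL tokens")
    return sum(selected) / len(selected)


# Made-up log-probabilities of the correct token at each position.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
# First two positions belong to the prompt, last two to the SQL answer.
mask = [False, False, True, True]

loss = sql_only_loss(log_probs, mask)
print(loss)
```

In this toy case only the last two positions contribute, so the result is the average of `-log(0.5)` and `-log(0.125)`.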

## Intended Use and Limitations

The model was designed for text-to-SQL generation from a given table schema and a natural language prompt. It works best with the prompt format and outputs shown above. In contrast to existing text-to-SQL models, SQL generation is not constrained to SELECT statements: the model can generate any valid DuckDB SQL statement, including statements for official DuckDB extensions.

## References

[Hugging Face](https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1)
