From bcfbbc78a397ddba909fa389d028778a9217c497 Mon Sep 17 00:00:00 2001 From: Mr_zhangxiang <391634362@qq.com> Date: Thu, 26 Dec 2024 08:35:20 +0000 Subject: [PATCH] add duckdb/csv-files.md. Signed-off-by: Mr_zhangxiang <391634362@qq.com> --- duckdb/csv-files.md | 430 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 430 insertions(+) create mode 100644 duckdb/csv-files.md diff --git a/duckdb/csv-files.md b/duckdb/csv-files.md new file mode 100644 index 0000000..1b898cf --- /dev/null +++ b/duckdb/csv-files.md @@ -0,0 +1,430 @@

# CSV Import

## Examples

The following examples use the flights.csv file.

Read a CSV file from disk, auto-infer options:

```
SELECT * FROM 'flights.csv';
```

Use the read_csv function with custom options:

```
SELECT *
FROM read_csv('flights.csv',
    delim = '|',
    header = true,
    columns = {
        'FlightDate': 'DATE',
        'UniqueCarrier': 'VARCHAR',
        'OriginCityName': 'VARCHAR',
        'DestCityName': 'VARCHAR'
    });
```

Read a CSV from stdin, auto-infer options:

```
cat flights.csv | duckdb -c "SELECT * FROM read_csv('/dev/stdin')"
```

Read a CSV file into a table:

```
CREATE TABLE ontime (
    FlightDate DATE,
    UniqueCarrier VARCHAR,
    OriginCityName VARCHAR,
    DestCityName VARCHAR
);
COPY ontime FROM 'flights.csv';
```

Alternatively, create a table without manually specifying the schema using a CREATE TABLE ... AS SELECT statement:

```
CREATE TABLE ontime AS
    SELECT * FROM 'flights.csv';
```

We can use the FROM-first syntax to omit SELECT *:

```
CREATE TABLE ontime AS
    FROM 'flights.csv';
```

Write the result of a query to a CSV file:

```
COPY (SELECT * FROM ontime) TO 'flights.csv' WITH (HEADER, DELIMITER '|');
```

If we serialize the entire table, we can simply refer to it by its name:

```
COPY ontime TO 'flights.csv' WITH (HEADER, DELIMITER '|');
```

## CSV Loading

CSV loading, i.e., importing CSV files into the database, is a very common and yet surprisingly tricky task. While CSVs seem simple on the surface, there are many inconsistencies found within CSV files that can make loading them a challenge. CSV files come in many different varieties, are often corrupt, and do not have a schema. The CSV reader needs to cope with all of these situations.

The DuckDB CSV reader can automatically infer which configuration flags to use by analyzing the CSV file using the CSV sniffer. This works correctly in most situations and should be the first option attempted. In rare situations where the CSV reader cannot figure out the correct configuration, it is possible to manually configure the CSV reader to correctly parse the CSV file. See the auto detection page for more information.

## Parameters

Below are the parameters that can be passed to the CSV reader. All of them are accepted by the read_csv function; not all of them are accepted by the COPY statement.

| Name | Description | Type | Default |
| :--- | :--- | :--- | :--- |
| all_varchar | Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR. This option is only supported by the read_csv function. | BOOL | false |
| allow_quoted_nulls | Option to allow the conversion of quoted values to NULL values. | BOOL | true |
| auto_detect | Enables auto detection of CSV parameters. | BOOL | true |
| auto_type_candidates | This option allows you to specify the types that the sniffer will use when detecting CSV column types. The VARCHAR type is always included in the detected types (as a fallback option). See example. | TYPE[] | default types |
| columns | A struct that specifies the column names and column types contained within the CSV file (e.g., {'col1': 'INTEGER', 'col2': 'VARCHAR'}). Using this option implies that auto detection is not used. | STRUCT | (empty) |
| compression | The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.csv.gz will use gzip, t.csv will use none). Options are none, gzip, zstd. | VARCHAR | auto |
| dateformat | Specifies the date format to use when parsing dates. See Date Format. | VARCHAR | (empty) |
| decimal_separator | The decimal separator of numbers. | VARCHAR | . |
| delimiter | Specifies the delimiter character that separates columns within each row (line) of the file. Alias for sep. This option is only available in the COPY statement. | VARCHAR | , |
| delim | Specifies the delimiter character that separates columns within each row (line) of the file. Alias for sep. | VARCHAR | , |
| escape | Specifies the string that should appear before a data character sequence that matches the quote value. | VARCHAR | " |
| filename | Whether or not an extra filename column should be included in the result. | BOOL | false |
| force_not_null | Do not match the specified columns' values against the NULL string. In the default case where the NULL string is empty, this means that empty values will be read as zero-length strings rather than NULLs. | VARCHAR[] | [] |
| header | Specifies that the file contains a header line with the names of each column in the file. | BOOL | false |
| hive_partitioning | Whether or not to interpret the path as a Hive partitioned path. | BOOL | false |
| ignore_errors | Option to ignore any parsing errors encountered and instead skip the rows with errors. | BOOL | false |
| max_line_size | The maximum line size in bytes. | BIGINT | 2097152 |
| names | The column names as a list, see example. | VARCHAR[] | (empty) |
| new_line | Set the new line character(s) in the file. Options are '\r', '\n', or '\r\n'. Note that the CSV parser only distinguishes between single-character and double-character line delimiters. Therefore, it does not differentiate between '\r' and '\n'. | VARCHAR | (empty) |
| normalize_names | Boolean value that specifies whether or not column names should be normalized, removing any non-alphanumeric characters from them. | BOOL | false |
| null_padding | If this option is enabled, when a row lacks columns, it will pad the remaining columns on the right with NULL values. | BOOL | false |
| nullstr | Specifies the string that represents a NULL value or (since v0.10.2) a list of strings that represent a NULL value. | VARCHAR or VARCHAR[] | (empty) |
| parallel | Whether or not the parallel CSV reader is used. | BOOL | true |
| quote | Specifies the quoting string to be used when a data value is quoted. | VARCHAR | " |
| sample_size | The number of sample rows for auto detection of parameters. | BIGINT | 20480 |
| sep | Specifies the delimiter character that separates columns within each row (line) of the file. Alias for delim. | VARCHAR | , |
| skip | The number of lines at the top of the file to skip. | BIGINT | 0 |
| timestampformat | Specifies the date format to use when parsing timestamps. See Date Format. | VARCHAR | (empty) |
| types or dtypes | The column types as either a list (by position) or a struct (by name). Example here. | VARCHAR[] or STRUCT | (empty) |
| union_by_name | Whether the columns of multiple schemas should be unified by name, rather than by position. Note that using this option increases memory consumption. | BOOL | false |
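To see several of these parameters working together, here is a minimal sketch that reads a headerless, pipe-delimited file, names the columns explicitly, and treats the string NA as NULL. The file name and column names are illustrative, not part of the flights.csv example:

```
-- hypothetical headerless, pipe-delimited file with NA used for missing values
SELECT *
FROM read_csv('flights_raw.csv',
    delim = '|',
    header = false,
    names = ['FlightDate', 'UniqueCarrier', 'OriginCityName', 'DestCityName'],
    nullstr = 'NA');
```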
### auto_type_candidates Details

The auto_type_candidates option lets you specify the data types that should be considered by the CSV reader for column data type detection. Usage example:

```
SELECT * FROM read_csv('csv_file.csv', auto_type_candidates = ['BIGINT', 'DATE']);
```

The default value for the auto_type_candidates option is ['SQLNULL', 'BOOLEAN', 'BIGINT', 'DOUBLE', 'TIME', 'DATE', 'TIMESTAMP', 'VARCHAR'].

## CSV Functions

The read_csv function automatically attempts to figure out the correct configuration of the CSV reader using the CSV sniffer. It also automatically deduces the types of columns. If the CSV file has a header, it will use the names found in that header to name the columns. Otherwise, the columns will be named column0, column1, column2, .... An example with the flights.csv file:

```
SELECT * FROM read_csv('flights.csv');
```

| FlightDate | UniqueCarrier | OriginCityName | DestCityName |
| ---------- | ------------- | -------------- | --------------- |
| 1988-01-01 | AA | New York, NY | Los Angeles, CA |
| 1988-01-02 | AA | New York, NY | Los Angeles, CA |
| 1988-01-03 | AA | New York, NY | Los Angeles, CA |

The path can either be a relative path (relative to the current working directory) or an absolute path.

We can use read_csv to create a persistent table as well:

```
CREATE TABLE ontime AS
    SELECT * FROM read_csv('flights.csv');
DESCRIBE ontime;
```

| column_name | column_type | null | key | default | extra |
| -------------- | ----------- | ---- | ---- | ------- | ----- |
| FlightDate | DATE | YES | NULL | NULL | NULL |
| UniqueCarrier | VARCHAR | YES | NULL | NULL | NULL |
| OriginCityName | VARCHAR | YES | NULL | NULL | NULL |
| DestCityName | VARCHAR | YES | NULL | NULL | NULL |

The sniffer bases its detection on a sample of the file. We can change the number of rows it samples using the sample_size parameter:

```
SELECT * FROM read_csv('flights.csv', sample_size = 20_000);
```

If we set delim/sep, quote, escape, or header explicitly, we can bypass the automatic detection of that particular parameter:

```
SELECT * FROM read_csv('flights.csv', header = true);
```

Multiple files can be read at once by providing a glob or a list of files, as sketched below. Refer to the multiple files section for more information.
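For instance, a glob reads every matching file, while a list reads specific files. The paths here are illustrative:

```
-- read every CSV file in a (hypothetical) flights directory
SELECT * FROM read_csv('flights/*.csv');

-- read an explicit list of (hypothetical) files
SELECT * FROM read_csv(['flights_1988.csv', 'flights_1989.csv']);
```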
## Loading Using the COPY Statement

The COPY statement can be used to load data from a CSV file into a table. It has the same syntax as the COPY statement in PostgreSQL. To load the data using COPY, we must first create a table with the correct schema (one that matches the order of the columns in the CSV file and uses types that fit its values). COPY detects the CSV's configuration options automatically.

```
CREATE TABLE ontime (
    flightdate DATE,
    uniquecarrier VARCHAR,
    origincityname VARCHAR,
    destcityname VARCHAR
);
COPY ontime FROM 'flights.csv';
SELECT * FROM ontime;
```

| flightdate | uniquecarrier | origincityname | destcityname |
| ---------- | ------------- | -------------- | --------------- |
| 1988-01-01 | AA | New York, NY | Los Angeles, CA |
| 1988-01-02 | AA | New York, NY | Los Angeles, CA |
| 1988-01-03 | AA | New York, NY | Los Angeles, CA |

If we want to manually specify the CSV format, we can do so using the configuration options of COPY:

```
CREATE TABLE ontime (flightdate DATE, uniquecarrier VARCHAR, origincityname VARCHAR, destcityname VARCHAR);
COPY ontime FROM 'flights.csv' (DELIMITER '|', HEADER);
SELECT * FROM ontime;
```

## Reading Faulty CSV Files

DuckDB supports reading erroneous CSV files. For details, see the Reading Faulty CSV Files page.

## Limitations

The CSV reader only supports input files using UTF-8 character encoding. For CSV files using different encodings, use a tool such as the iconv command-line utility to convert them to UTF-8:

```
iconv -f ISO-8859-2 -t UTF-8 input.csv > input-utf-8.csv
```

## Order Preservation

The CSV reader respects the preserve_insertion_order configuration option. When true (the default), the order of the rows in the result set returned by the CSV reader is the same as the order of the corresponding lines read from the file(s). When false, there is no guarantee that the order is preserved.
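When row order does not matter for a workload, disabling the option relaxes that guarantee. A minimal sketch using DuckDB's configuration setting:

```
-- trade away ordering guarantees for this session
SET preserve_insertion_order = false;
SELECT * FROM 'flights.csv';
```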
+ +``` +CREATE TABLE ontime ( + flightdate DATE, + uniquecarrier VARCHAR, + origincityname VARCHAR, + destcityname VARCHAR +); +COPY ontime FROM 'flights.csv'; +SELECT * FROM ontime; +``` + +| flightdate | uniquecarrier | origincityname | destcityname | +| ---------- | ------------- | -------------- | --------------- | +| 1988-01-01 | AA | New York, NY | Los Angeles, CA | +| 1988-01-02 | AA | New York, NY | Los Angeles, CA | +| 1988-01-03 | AA | New York, NY | Los Angeles, CA | + +If we want to manually specify the CSV format, we can do so using the configuration options of COPY. + +``` +CREATE TABLE ontime (flightdate DATE, uniquecarrier VARCHAR, origincityname VARCHAR, destcityname VARCHAR); +COPY ontime FROM 'flights.csv' (DELIMITER '|', HEADER); +SELECT * FROM ontime; +``` + +## Reading Faulty CSV Files + +DuckDB supports reading erroneous CSV files. For details, see the Reading Faulty CSV Files page. + +## Limitations + +The CSV reader only supports input files using UTF-8 character encoding. For CSV files using different encodings, use e.g., the iconv command-line tool to convert them to UTF-8. For example: + +``` +iconv -f ISO-8859-2 -t UTF-8 input.csv > input-utf-8.csv +``` + +## Order Preservation + +The CSV reader respects the preserve_insertion_order configuration option to preserve insertion order. When true (the default), the order of the rows in the resultset returned by the CSV reader is the same as the order of the corresponding lines read from the file(s). When false, there is no guarantee that the order is preserved. + + + +# Reading Faulty CSV Files + +CSV files can come in all shapes and forms, with some presenting many errors that make the process of cleanly reading them inherently difficult. To help users read these files, DuckDB supports detailed error messages, the ability to skip faulty lines, and the possibility of storing faulty lines in a temporary table to assist users with a data cleaning step. + +## Structural Errors + +DuckDB supports the detection and skipping of several different structural errors. In this section, we will go over each error with an example. For the examples, consider the following table: + +``` +CREATE TABLE people (name VARCHAR, birth_date DATE); +``` + +DuckDB detects the following error types: + +- CAST: Casting errors occur when a column in the CSV file cannot be cast to the expected schema value. For example, the line Pedro,The 90s would cause an error since the string The 90s cannot be cast to a date. +- MISSING COLUMNS: This error occurs if a line in the CSV file has fewer columns than expected. In our example, we expect two columns; therefore, a row with just one value, e.g., Pedro, would cause this error. +- TOO MANY COLUMNS: This error occurs if a line in the CSV has more columns than expected. In our example, any line with more than two columns would cause this error, e.g., Pedro,01-01-1992,pdet. +- UNQUOTED VALUE: Quoted values in CSV lines must always be unquoted at the end; if a quoted value remains quoted throughout, it will cause an error. For example, assuming our scanner uses quote='"', the line "pedro"holanda, 01-01-1992 would present an unquoted value error. +- LINE SIZE OVER MAXIMUM: DuckDB has a parameter that sets the maximum line size a CSV file can have, which by default is set to 2,097,152 bytes. Assuming our scanner is set to max_line_size = 25, the line Pedro Holanda, 01-01-1992 would produce an error, as it exceeds 25 bytes. 
## Using the ignore_errors Option

There are cases where CSV files may have multiple structural errors, and users simply wish to skip these and read the correct data. Reading erroneous CSV files is possible by utilizing the ignore_errors option. With this option set, rows containing data that would otherwise cause the CSV parser to generate an error are ignored. In our example, we will demonstrate a CAST error, but note that any of the errors described in the Structural Errors section would cause the faulty line to be skipped.

For example, consider the following CSV file, faulty.csv:

```csv
Pedro,31
Oogie Boogie, three
```

If you read the CSV file, specifying that the first column is a VARCHAR and the second column is an INTEGER, loading the file would fail, as the string three cannot be converted to an INTEGER. For example, the following query will throw a casting error:

```
FROM read_csv('faulty.csv', columns = {'name': 'VARCHAR', 'age': 'INTEGER'});
```

However, with ignore_errors set, the second row of the file is skipped, outputting only the complete first row. For example:

```
FROM read_csv(
    'faulty.csv',
    columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
    ignore_errors = true
);
```

Outputs:

| name | age |
| ----- | ---- |
| Pedro | 31 |

One should note that the CSV parser is affected by the projection pushdown optimization. Hence, if we were to select only the name column, both rows would be considered valid, as the casting error on the age column would never occur. For example:

```
SELECT name
FROM read_csv('faulty.csv', columns = {'name': 'VARCHAR', 'age': 'INTEGER'});
```

Outputs:

| name |
| ------------ |
| Pedro |
| Oogie Boogie |
## Retrieving Faulty CSV Lines

Being able to read faulty CSV files is important, but for many data cleaning operations, it is also necessary to know exactly which lines are corrupted and what errors the parser discovered on them. For scenarios like these, it is possible to use DuckDB's CSV rejects table feature. By default, this feature creates two temporary tables:

1. reject_scans: stores information regarding the parameters of the CSV scanner.
2. reject_errors: stores information regarding each faulty CSV line and the scanner in which it occurred.

Note that any of the errors described in the Structural Errors section will be stored in the rejects tables. Also, if a line has multiple errors, multiple entries will be stored for the same line, one for each error.
### Reject Scans

The CSV reject scans table returns the following information:

| Column name | Description | Type |
| :---------------- | :--- | :------- |
| scan_id | The internal ID used in DuckDB to represent that scanner | UBIGINT |
| file_id | A scanner can run over multiple files, so file_id identifies a unique file within a scanner | UBIGINT |
| file_path | The file path | VARCHAR |
| delimiter | The delimiter used, e.g., ; | VARCHAR |
| quote | The quote used, e.g., " | VARCHAR |
| escape | The escape used, e.g., " | VARCHAR |
| newline_delimiter | The newline delimiter used, e.g., \r\n | VARCHAR |
| skip_rows | If any rows were skipped from the top of the file | UINTEGER |
| has_header | If the file has a header | BOOLEAN |
| columns | The schema of the file (i.e., all column names and types) | VARCHAR |
| date_format | The format used for date types | VARCHAR |
| timestamp_format | The format used for timestamp types | VARCHAR |
| user_arguments | Any extra scanner parameters manually set by the user | VARCHAR |

### Reject Errors

The CSV reject errors table returns the following information:

| Column name | Description | Type |
| :----------------- | :--- | :------ |
| scan_id | The internal ID used in DuckDB to represent that scanner, used to join with the reject scans table | UBIGINT |
| file_id | The file_id represents a unique file in a scanner, used to join with the reject scans table | UBIGINT |
| line | Line number in the CSV file where the error occurred | UBIGINT |
| line_byte_position | Byte position of the start of the line where the error occurred | UBIGINT |
| byte_position | Byte position where the error occurred | UBIGINT |
| column_idx | If the error happens in a specific column, the index of the column | UBIGINT |
| column_name | If the error happens in a specific column, the name of the column | VARCHAR |
| error_type | The type of the error that happened | ENUM |
| csv_line | The original CSV line | VARCHAR |
| error_message | The error message produced by DuckDB | VARCHAR |
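Because scan_id and file_id appear in both tables, they can be joined to see each faulty line next to the configuration of the scan that produced it. A sketch, using only the columns documented above:

```
-- pair each rejected line with the file it came from
SELECT s.file_path, e.line, e.column_name, e.error_type, e.error_message
FROM reject_errors e
JOIN reject_scans s USING (scan_id, file_id);
```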
## Parameters

The parameters listed below are used in the read_csv function to configure the rejects tables.

| Name | Description | Type | Default |
| :------------ | :--- | :------ | :------------ |
| store_rejects | If set to true, any errors in the file will be skipped and stored in the default rejects temporary tables. | BOOLEAN | false |
| rejects_scan | Name of the temporary table where the scan information of a faulty CSV file is stored. | VARCHAR | reject_scans |
| rejects_table | Name of the temporary table where the information about the faulty lines of a CSV file is stored. | VARCHAR | reject_errors |
| rejects_limit | Upper limit on the number of faulty records from a CSV file that will be recorded in the rejects table. 0 means no limit is applied. | BIGINT | 0 |

To store information about the faulty CSV lines in the rejects tables, simply set the store_rejects option to true. For example:

```
FROM read_csv(
    'faulty.csv',
    columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
    store_rejects = true
);
```

You can then query both the reject_scans and reject_errors tables to retrieve information about the rejected tuples. For example:

```
FROM reject_scans;
```

Outputs:

| scan_id | file_id | file_path | delimiter | quote | escape | newline_delimiter | skip_rows | has_header | columns | date_format | timestamp_format | user_arguments |
| ------- | ------- | ---------- | --------- | ----- | ------ | ----------------- | --------- | ---------: | ------------------------------------ | ----------- | ---------------- | ------------------ |
| 5 | 0 | faulty.csv | , | " | " | \n | 0 | false | {'name': 'VARCHAR','age': 'INTEGER'} | | | store_rejects=true |

```
FROM reject_errors;
```

Outputs:

| scan_id | file_id | line | line_byte_position | byte_position | column_idx | column_name | error_type | csv_line | error_message |
| ------- | ------- | ---- | ------------------ | ------------- | ---------- | ----------- | ---------- | ------------------- | ------------------------------------------------------------ |
| 5 | 0 | 2 | 10 | 23 | 2 | age | CAST | Oogie Boogie, three | Error when converting column "age". Could not convert string " three" to 'INTEGER' |
\ No newline at end of file
-- Gitee