04.01.2018       Issue 211 (01.01.2018 - 07.01.2018)       Interesting projects, tools, libraries

pydqc - a data quality check utility

Read >>




Experimental feature:

Below is the text of the article behind the link. It lets you quickly judge whether the link is worth reading.

Please note that the text behind the link and the text shown here may differ.

README.md

pydqc

Python automatic data quality check toolkit. It aims to relieve the pain of writing tedious code for general data understanding by:

  • Automatically generating a data summary report that contains useful statistical information for each column in a data table (useful for general data understanding).
  • Automatically summarizing the statistical differences between two data tables (useful for comparing a training set with a test set, comparing the same data table from two different snapshot dates, etc.).
  • It still needs some human help for data type inference.🙈

Motivation

"Today I don't feel like doing anything about data quality check, I just wanna lay in my bed. Don't feel like writing any tedious codes. So build a tool runs on its own."🎤🎵🎶
-modified The Lazy Song

Install pydqc

  • install py2nb
  • install dependencies: pip install -r requirements.txt
  • install pydqc:
git clone https://github.com/SauceCat/pydqc.git
cd pydqc
python setup.py install

How does it work?

For an input data table (a pandas DataFrame):

Step 1: data schema

  • function: pydqc.infer_schema.infer_schema(_data, fname, output_root='', sample_size=1.0, type_threshold=0.5, n_jobs=1, base_schema=None, base_schema_feature_colname='column', base_schema_dtype_colname='type')
    Infers data types for each column. pydqc recognizes four basic data types: 'key', 'str', 'date', and 'numeric'.

    • 'key': a column that doesn't have concrete meaning itself, but acts as a 'key' to link with other tables.
    • 'str': categorical column
    • 'date': datetime column
    • 'numeric': numeric column
      After inference, an Excel file named 'data_schema_XXX.xlsx' (where XXX is the 'fname' parameter) is generated. Check the generated file and modify the 'type' column where necessary (i.e., where infer_schema made a mistake). This is easy because the modification is done by selecting from a drop-down list.


    In this version, pydqc is not able to infer the 'key' type, so 'key' columns always need human modification.
    After the necessary modifications, save the schema as 'data_schema_XXX_mdf.xlsx' or under some other name different from the original one.
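
A minimal usage sketch of this step, based on the signature above (the CSV file name is made up for illustration):

import pandas as pd
from pydqc.infer_schema import infer_schema

# load the table to check (hypothetical file name)
data = pd.read_csv('sample_table.csv')

# infer a type ('key'/'str'/'date'/'numeric') for every column and
# generate 'data_schema_sample_table.xlsx' under output_root
infer_schema(data, fname='sample_table', output_root='',
             sample_size=1.0, type_threshold=0.5, n_jobs=1)

The hand-checked, renamed schema file then feeds into step 2.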

Step 2 (option 1): data summary

  • function: pydqc.data_summary.data_summary(table_schema, _table, fname, sample_size=1.0, feature_colname='column', dtype_colname='type', output_root='', n_jobs=1)
    Summarizes basic statistical information for each column based on the provided data type.

    • 'key' and 'str': sample value, rate of nan values, number of unique values, top 10 value count.
      example output: (screenshot omitted; see test/output in the repo)

    the only difference between the summaries for 'key' and 'str' is that pydqc doesn't sample 'key' columns (see the 'sample_size' parameter).

    • 'date': sample value, rate of nan values, number of unique values, minimum numeric value, mean numeric value, median numeric value, maximum numeric value, maximum date value, minimum date value, distribution graph for numeric values.
      example output: (screenshot omitted)

    the numeric value for a 'date' column is calculated as the time difference, in months, between the date value and today.

    • 'numeric': sample value, rate of nan values, number of unique values, minimum value, mean value, median value, maximum value, distribution graph (log10 scale automatically when absolute maximum value is larger than 1M).
      example output: (screenshot omitted)

    You can also turn the whole data summary process into a Jupyter notebook with the data_summary_notebook() function.
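
A rough sketch of this step, following the signature above; it assumes the hand-checked schema from step 1 is read back as a pandas DataFrame (file names are illustrative):

import pandas as pd
from pydqc.data_summary import data_summary

data = pd.read_csv('sample_table.csv')
# the modified schema saved at the end of step 1
table_schema = pd.read_excel('data_schema_sample_table_mdf.xlsx')

# write an excel report with per-column statistics; 'key' columns
# are never sampled, even when sample_size < 1.0
data_summary(table_schema, data, fname='sample_table',
             sample_size=1.0, output_root='', n_jobs=1)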

Step 2 (option 2): data compare

  • function: data_compare(_table1, _table2, _schema1, _schema2, fname, sample_size=1.0, feature_colname1='column', feature_colname2='column', dtype_colname1='type', dtype_colname2='type', output_root='', n_jobs=1)
    Compares statistical characteristics of the same columns in two different tables based on the provided data types. (Useful when comparing a training set with a test set, or the same table from two different snapshot dates.)

    • 'key': compare sample value, rate of nan values, number of unique values, size of intersection set, size of set only in table1, size of set only in table2, venn graph.
      example output: (screenshot omitted)

    • 'str': compare sample value, rate of nan values, number of unique values, size of intersection set, size of set only in table1, size of set only in table2, top 10 value counts.
      example output: (screenshot omitted)

    • 'date': compare sample value, rate of nan values, number of unique values, minimum numeric value, mean numeric value, median numeric value, maximum numeric value, maximum date value, minimum date value, distribution graph for numeric values.
      example output: (screenshot omitted)

    the numeric value for a 'date' column is calculated as the time difference, in months, between the date value and today.

    • 'numeric': compare sample value, rate of nan values, number of unique values, minimum value, mean value, median value, maximum value, distribution graph.
      example output: (screenshot omitted)

    You can also turn the whole data compare process into a Jupyter notebook with the data_compare_notebook() function.
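
A comparable sketch for data compare; the module path pydqc.data_compare is an assumption (the text above gives only the function name), and the file names are again illustrative:

import pandas as pd
from pydqc.data_compare import data_compare

table1 = pd.read_csv('train.csv')
table2 = pd.read_csv('test.csv')

# hand-checked schemas from step 1, one per table
schema1 = pd.read_excel('data_schema_train_mdf.xlsx')
schema2 = pd.read_excel('data_schema_test_mdf.xlsx')

# write an excel report contrasting the shared columns of the two tables
data_compare(table1, table2, schema1, schema2, fname='train_vs_test',
             sample_size=1.0, output_root='', n_jobs=1)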

Documentation

For details about the ideas, please refer to Introducing pydqc.
For description about the functions and parameters, please refer to pydqc functions and parameters.
For tests and demos, please refer to https://github.com/SauceCat/pydqc/tree/master/test.
For example outputs, please refer to https://github.com/SauceCat/pydqc/tree/master/test/output.


Contribution

If you have other ideas for general automatic data quality checks, pull requests are always welcome!🙋


