datacleaner package

Submodules

datacleaner.datacleaner module

This module contains the main class definition for the datacleaner package.

class datacleaner.datacleaner.DataCleaner(*, null_values=False, translations=False, transliterations=False, data_type=None, valid_types=None, filter_parens=False, filter_colon=False, **kwargs)[source]

Bases: object

DataCleaner is a class to clean and process raw data. It was originally written to clean data read in from datasets that often included random, human-introduced errors. As released, it provides a general framework for cleaning data, as well as subclasses that translate text into snake_case and/or CamelCase. Developers can extend DataCleaner by subclassing it and setting class variables, and/or by instantiating objects and overriding or setting the properties on those objects.

add_null_values(values)[source]

Add a value or a list of values to the null list.

add_translations(new_translations)[source]
add_transliteration_set(shorthand_set_name)[source]

Adds a predefined set of transliterations, selected by its shorthand name. COMMON_LATIN1_TRANSLITERATIONS tries to account for characters typically found in rich text formatting, such as angled quotes and dashes of different lengths (en dashes and em dashes).

add_transliterations(new_transliterations)[source]

Provide a list of tuples of additional character transliterations. Order of precedence is preserved, so transliterations can chain: [('a', 'b'), ('b', 'c')] will result in 'a' becoming 'c' in the end.
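
The chaining effect can be illustrated with a minimal standalone sketch. It does not call the package; `apply_pairs` is a hypothetical helper, and it assumes each pair is applied in list order as a plain substring replacement.

```python
def apply_pairs(text, pairs):
    """Apply (old, new) replacements in list order (hypothetical helper)."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


# 'a' is first turned into 'b', and every 'b' is then turned into 'c'.
chained = apply_pairs("banana", [("a", "b"), ("b", "c")])

# With the pairs swapped, the original 'a' characters stop at 'b'.
swapped = apply_pairs("banana", [("b", "c"), ("a", "b")])
```

Under these assumptions `chained` ends up with no 'a' or 'b' left, while `swapped` still contains 'b', which is why the order of the list matters.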

apply_transliterations(val)[source]

Apply all transliterations.

This happens before type conversion, so while the development use cases focus on single letter substitutions, it could also be leveraged to manipulate strings prior to running the type converters.

clean(val)[source]

Process val and return a cleaned value based on object properties. Note the following order of operations of the clean function:

  1. Any null mappings are applied.

  2. Any transliterations are applied.

  3. Strings are converted to more precise types (numbers, dates, etc.) where possible.

  4. Any translations are applied, as this should happen after all other conversions.

    Args:

    val (scalar|list): The value to clean. If a list, recursively processes each element.

    Returns:

    The cleaned value, or a list of cleaned values if val is a list.
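
The order of operations can be sketched as a small standalone function. This is an illustration only, not the package's implementation: `clean_sketch` and its keyword arguments are hypothetical, transliterations are assumed to be plain substring replacements, and type conversion is reduced to integers.

```python
def clean_sketch(val, null_values=(), transliterations=(), translations=None):
    """Hypothetical stand-in for clean(); not the package's implementation."""
    translations = translations or {}
    if isinstance(val, list):
        # Lists are processed element by element, recursively.
        return [
            clean_sketch(v, null_values, transliterations, translations)
            for v in val
        ]
    # 1. Null mappings: listed values become None.
    if val in null_values:
        return None
    if not isinstance(val, str):
        return val
    # 2. Transliterations, applied in list order.
    for old, new in transliterations:
        val = val.replace(old, new)
    # 3. Type conversion (this sketch only tries integers).
    try:
        return int(val)
    except ValueError:
        pass
    # 4. Translations of complete strings come last.
    return translations.get(val, val)


cleaned = clean_sketch(
    ["N/A", "1O", "yes"],
    null_values=("N/A",),
    transliterations=[("O", "0")],
    translations={"yes": "Y"},
)
```

Because the transliteration of the letter O to the digit 0 runs before type conversion, the string "1O" can be converted to an integer in this sketch.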

clean_string(val)[source]
convert_dates = False
convert_numbers = True
copy()[source]

Returns a copy of the object so that modifications to the returned object will not affect the source.

default_null_values = []
default_translations = {}
default_transliterations = []
join(values)[source]

An abstract function that takes an array of values and joins them together based on the criteria of the subclass. It is designed primarily to allow easy transformation between the formats of DataCleaner subclasses. For example, to convert a snake_case value to its CamelCase version: CamelCase.join(SnakeCase.tokenize(val))

null_values

A list of values that should be considered null; these return None when parsed.

Type: null_values property
nullable = True
static parse_boolean(x)[source]

Takes in a scalar and returns the boolean value equivalent to it, or None.

Parameters: x – A scalar that likely represents a boolean, or None.
Returns: None if x is None. True if x appears to represent a true value. False if x appears to represent a false value.
Raises: ValueError if x is an unknown type.
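
A minimal sketch of the behaviour described above, written as a standalone function. The accepted spellings in `TRUE_STRINGS` and `FALSE_STRINGS` are assumptions made for illustration, as is raising ValueError for unrecognised strings; the actual values recognised by parse_boolean are not listed in this reference.

```python
TRUE_STRINGS = {"true", "t", "yes", "y", "1"}    # assumed spellings
FALSE_STRINGS = {"false", "f", "no", "n", "0"}   # assumed spellings


def parse_boolean_sketch(x):
    """Hypothetical stand-in for parse_boolean()."""
    if x is None:
        return None
    if isinstance(x, bool):
        return x
    if isinstance(x, (int, float)):
        return x != 0
    if isinstance(x, str):
        text = x.strip().lower()
        if text in TRUE_STRINGS:
            return True
        if text in FALSE_STRINGS:
            return False
    # Anything else is treated as unknown in this sketch.
    raise ValueError(f"cannot interpret {x!r} as a boolean")
```
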
classmethod parse_date(s)[source]

Takes a scalar object and returns a datetime.date object after delegating to parse_datetime. See parse_datetime for further details on parsable types, acceptable args, and exceptions raised.

Returns: None if s is not true; otherwise the datetime.date of the parsed value.
static parse_datetime(s)[source]

Takes a scalar object and returns a datetime object if possible. Returns None for None or empty input, and raises an exception if the value cannot be parsed.

Parameters:

s – A scalar to parse into a datetime object to the degree that is possible. None is an allowed parameter, but will result in None being returned.

Returns:

None if s is None or not true. datetime.date or datetime.time if s is already one of these formats. datetime.datetime if s is parsable.

Raises:
  • ValueError if the value is not None and is unparseable.
  • TypeError if the type is not recognizable.
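
The return and error cases above can be sketched with the standard library alone. This is an assumption-laden illustration, not the package's parser: `parse_datetime_sketch` is a hypothetical name, and it only accepts ISO 8601 strings via `datetime.datetime.fromisoformat`, whereas the formats accepted by parse_datetime are not specified in this reference.

```python
import datetime


def parse_datetime_sketch(s):
    """Hypothetical stand-in for parse_datetime(); ISO 8601 strings only."""
    if s is None or s == "":
        return None
    # datetime.datetime is a subclass of datetime.date, so it passes through too.
    if isinstance(s, (datetime.date, datetime.time)):
        return s
    if isinstance(s, str):
        # fromisoformat raises ValueError when the string is unparseable.
        return datetime.datetime.fromisoformat(s.strip())
    raise TypeError(f"unsupported type: {type(s).__name__}")


parsed = parse_datetime_sketch("2021-03-04T05:06:07")
```
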
static parse_maybeint(val)[source]

Takes in a scalar and returns a string or an integer, or None.

Parameters: val – A scalar that likely represents a string or integer value, or None.
Returns: None if val is None. int if val appears to represent an integer value OR is a string representing a float with ".0" as the decimal. str if val appears to represent a non-numeric string value.
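
The rules above can be sketched as a standalone function. `parse_maybeint_sketch` is a hypothetical name, and the regular expression is an assumption about what "appears to represent an integer" means: an optional sign, digits, and an optional all-zero decimal part. Any other input is returned as a string in this sketch.

```python
import re

# Optional sign, digits, and an optional decimal part made only of zeros.
_INT_LIKE = re.compile(r"[+-]?\d+(?:\.0+)?")


def parse_maybeint_sketch(val):
    """Hypothetical stand-in for parse_maybeint()."""
    if val is None:
        return None
    text = str(val).strip()
    if _INT_LIKE.fullmatch(text):
        # Drop the ".0" suffix, if any, before converting.
        return int(text.split(".")[0])
    return text
```
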
classmethod parse_time(string)[source]

Takes a scalar object and returns a datetime.time object after delegating to parse_datetime. See parse_datetime for further details on parsable types, acceptable args, and exceptions raised.

Returns: None if string is not true; otherwise the datetime.time of the parsed value.
prepend_transliterations(new_transliterations)[source]

Inserts a list of transliterations at the front of the transliteration list, so that they are processed before the other (default) transliterations.

tokenize(val)[source]

An abstract function that takes a value (potentially cleaned) and breaks it apart based on the criteria of the subclass. It is designed primarily to allow easy transformation between the formats of DataCleaner subclasses. For example, to convert a snake_case value to its CamelCase version: CamelCase.join(SnakeCase.tokenize(val))
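
The tokenize-then-join pattern can be sketched with two standalone functions. These are hypothetical helpers rather than the package's SnakeCase and CamelCase classes, and they assume that tokens are separated by underscores and that the CamelCase form capitalises the first letter of every token.

```python
def snake_tokenize(val):
    """Split a snake_case string into word tokens (hypothetical helper)."""
    return [token for token in val.split("_") if token]


def camel_join(tokens):
    """Join tokens into CamelCase by capitalising each one (hypothetical helper)."""
    return "".join(token[:1].upper() + token[1:].lower() for token in tokens)


# Mirrors the shape of CamelCase.join(SnakeCase.tokenize(val)).
converted = camel_join(snake_tokenize("raw_data_value"))
```

Keeping tokenize and join separate means any format that can produce a token list can feed any format that can consume one.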

translations

A dict mapping complete strings to the values they should be translated to when encountered.

Type: translation property
transliterations

A list of tuples or lists in which the first element is a character or regexp to match and the second element is what the matched characters should be replaced with.

Type: transliterations property
datacleaner.datacleaner.underscorejoin(m)

datacleaner.snakecase module

class datacleaner.snakecase.SnakeCase(**kwargs)[source]

Bases: datacleaner.datacleaner.DataCleaner

clean_string(val)[source]
convert_numbers = False
data_type

alias of builtins.str

join(values)[source]

An abstract function that takes an array of values and joins them together based on the criteria of the subclass. It is designed primarily to allow easy transformation between the formats of DataCleaner subclasses. For example, to convert a snake_case value to its CamelCase version: CamelCase.join(SnakeCase.tokenize(val))

tokenize(val)[source]

An abstract function that takes a value (potentially cleaned) and breaks it apart based on the criteria of the subclass. It is designed primarily to allow easy transformation between the formats of DataCleaner subclasses. For example, to convert a snake_case value to its CamelCase version: CamelCase.join(SnakeCase.tokenize(val))

Module contents