datacleaner package¶
Submodules¶
datacleaner.datacleaner module¶
datacleaner is the main class definition for the datacleaner package.
-
class
datacleaner.datacleaner.DataCleaner(*, null_values=False, translations=False, transliterations=False, data_type=None, valid_types=None, filter_parens=False, filter_colon=False, **kwargs)[source]¶ Bases: object
DataCleaner is a class to clean and process raw data. It was originally written to clean and process data read in from datasets that often included random, human-introduced errors. As released, it provides a general framework for cleaning data, as well as subclasses that translate text into snake_case and/or CamelCase. Developers can extend DataCleaner by subclassing it and setting class variables, and/or by instantiating objects and overriding or setting the properties on those objects.
-
add_transliteration_set(shorthand_set_name)[source]¶ Adds a predefined set of transliterations, selected by its shorthand name. COMMON_LATIN1_TRANSLITERATIONS tries to account for characters typically found in rich text formatting, such as angled quotes and en and em dashes of different lengths.
-
add_transliterations(new_transliterations)[source]¶ Provide a list of tuples of additional character translations. Note that order of precedence is preserved, so transliterations can chain: [('a', 'b'), ('b', 'c')] will result in 'a' becoming 'c' in the end.
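The chaining behaviour can be illustrated with a minimal standalone sketch. It does not call the datacleaner package: apply_in_order is a hypothetical helper, and treating each pair as a (regexp, replacement) passed to re.sub is an assumption based on the description of the transliterations property.

```python
import re


def apply_in_order(val, transliterations):
    """Apply (pattern, replacement) pairs one after another, in list order.

    Hypothetical helper for illustration; not part of datacleaner.
    """
    for pattern, replacement in transliterations:
        val = re.sub(pattern, replacement, val)
    return val


# 'a' becomes 'b' first, then every 'b' (old and new) becomes 'c'.
chained = apply_in_order("banana", [("a", "b"), ("b", "c")])
print(chained)  # ccncnc

# Reversing the list changes the result, because order is preserved.
reversed_order = apply_in_order("banana", [("b", "c"), ("a", "b")])
print(reversed_order)  # cbnbnb
```

Because each pair sees the output of the previous one, placing a pair earlier or later in the list changes the final string.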
-
apply_transliterations(val)[source]¶ Apply all transliterations.
This happens before type conversion, so while the development use cases focus on single letter substitutions, it could also be leveraged to manipulate strings prior to running the type converters.
-
clean(val)[source]¶ Process val and return a cleaned value based on object properties. Note the following order of operations of the clean function:
1. Any null mappings are applied.
2. Any transliterations are applied.
3. Attempts are made to convert strings to more precise types (numbers, dates, etc.).
4. Any translations are applied, as this should happen after all other conversions.
- Args:
val (scalar|list): The value to clean. If a list, recursively processes each element.
- Returns:
the cleaned value or a list of cleaned values.
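The order of operations described for clean can be sketched as a standalone function. This is an illustration only, not the package's implementation: clean_sketch and its parameters are hypothetical, the type conversion step is reduced to an integer attempt, and transliterations are assumed to be (regexp, replacement) pairs.

```python
import re


def clean_sketch(val, null_values=(), transliterations=(), translations=None):
    """Illustrative pipeline mirroring the documented order (hypothetical)."""
    translations = translations or {}

    # Lists are processed element by element, recursively.
    if isinstance(val, list):
        return [
            clean_sketch(v, null_values, transliterations, translations)
            for v in val
        ]

    # 1. Null mappings: configured null values become None.
    if val is None or val in null_values:
        return None

    # 2. Transliterations: applied in order, before type conversion.
    if isinstance(val, str):
        for pattern, replacement in transliterations:
            val = re.sub(pattern, replacement, val)

    # 3. Type conversion: here only an int attempt, as a stand-in.
    if isinstance(val, str):
        try:
            val = int(val)
        except ValueError:
            pass

    # 4. Translations: whole-value lookups, applied last.
    return translations.get(val, val)


cleaned = clean_sketch(
    ["N/A", "1O", "yes"],
    null_values=["N/A"],
    transliterations=[("O", "0")],  # a letter O typed in place of a zero
    translations={"yes": "Y"},
)
print(cleaned)  # [None, 10, 'Y']
```

Running transliterations before type conversion is what lets the mistyped "1O" become the integer 10 in this sketch.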
-
convert_dates= False¶
-
convert_numbers= True¶
-
copy()[source]¶ Returns a copy of the object so that modifications to the returned object will not affect the source.
-
default_null_values= []¶
-
default_translations= {}¶
-
default_transliterations= []¶
-
join(values)[source]¶ An abstract function that takes an array of values and joins them together based on the criteria of the subclass. This is designed primarily to allow easy transformation between DataCleaner subclass formats, such as when you have values in snake_case and want to return the CamelCase version, for example: CamelCase.join(SnakeCase.tokenize(val))
-
null_values¶ A list of values that should be considered null; these return None when parsed.
Type: null_values property
-
nullable= True¶
-
static
parse_boolean(x)[source]¶ Takes in a scalar and returns the boolean value equivalent to it, or None.
Parameters: x – A scalar that likely represents a boolean, or None.
Returns: None if x is None. True if x appears to represent a true value. False if x appears to represent a false value.
Raises: ValueError if x is an unknown type.
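A minimal sketch of a parser that follows this contract is shown below. It is not the package's code: parse_boolean_sketch is hypothetical, the accepted string spellings are assumptions, and raising ValueError for an unrecognized string (not only for an unknown type) is also an assumption.

```python
# Assumed spellings; the real parser may accept a different set.
TRUE_STRINGS = {"true", "t", "yes", "y", "1"}
FALSE_STRINGS = {"false", "f", "no", "n", "0"}


def parse_boolean_sketch(x):
    """Return True, False, or None following the documented contract."""
    if x is None:
        return None
    if isinstance(x, bool):
        return x
    if isinstance(x, (int, float)):
        return x != 0
    if isinstance(x, str):
        text = x.strip().lower()
        if text in TRUE_STRINGS:
            return True
        if text in FALSE_STRINGS:
            return False
        # Assumption: unrecognized strings are rejected as well.
        raise ValueError(f"cannot interpret {x!r} as a boolean")
    raise ValueError(f"unknown type: {type(x).__name__}")


print(parse_boolean_sketch(" Yes "))  # True
print(parse_boolean_sketch(0))        # False
print(parse_boolean_sketch(None))     # None
```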
-
classmethod
parse_date(s)[source]¶ Takes a scalar object and returns a datetime.date object after delegating to parse_datetime. See parse_datetime for further details on parsable types, acceptable args, and exceptions raised.
Returns: None if s is falsy; otherwise the datetime.date of the parsed value.
-
static
parse_datetime(s)[source]¶ Takes a scalar object and returns a datetime object if possible; otherwise returns None or raises an exception if the value is not parsable.
Parameters: s – A scalar to parse into a datetime object to the degree that is possible. None is an allowed parameter, but will result in None being returned.
Returns: None if s is None or not true. datetime.date or datetime.time if s is already one of these formats. datetime.datetime if s is parsable.
Raises: - ValueError if the value is not None and is unparseable.
- TypeError if the type is not recognizable.
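The contract above can be sketched with the standard library alone. This is an illustration under assumptions, not the package's parser: parse_datetime_sketch and parse_date_sketch are hypothetical names, and only ISO 8601 strings are handled (via datetime.fromisoformat), whereas the real method may accept many more formats.

```python
from datetime import date, datetime, time


def parse_datetime_sketch(s):
    """Follow the documented None / passthrough / parse / raise behaviour."""
    # None or any falsy value (such as an empty string) yields None.
    if not s:
        return None
    # date and time objects pass through; datetime is a subclass of date.
    if isinstance(s, (date, time)):
        return s
    if isinstance(s, str):
        # Assumption: ISO 8601 only. Raises ValueError if unparseable.
        return datetime.fromisoformat(s.strip())
    raise TypeError(f"unrecognized type: {type(s).__name__}")


def parse_date_sketch(s):
    """Delegate to the datetime parser, then reduce the result to a date."""
    parsed = parse_datetime_sketch(s)
    if parsed is None:
        return None
    return parsed.date() if isinstance(parsed, datetime) else parsed


print(parse_datetime_sketch("2021-03-04T05:06:07"))  # 2021-03-04 05:06:07
print(parse_date_sketch("2021-03-04T05:06:07"))      # 2021-03-04
```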
-
static
parse_maybeint(val)[source]¶ Takes in a scalar and returns a string or an integer, or None.
Parameters: val – A scalar that likely represents a string or integer value, or None.
Returns: None if val is None. int if val appears to represent an integer value OR is a string representing a float with ".0" as the decimal. str if val appears to represent a non-numeric string value.
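A small sketch of the "maybe int" rule follows. It is a hypothetical reimplementation, not the package's code: parse_maybeint_sketch and its regular expression are assumptions, and returning other numeric-looking strings such as "3.5" unchanged is a guess, since the documentation does not cover that case.

```python
import re

# An optional sign, digits, and an optional all-zero decimal part.
_INT_LIKE = re.compile(r"[+-]?\d+(\.0+)?")


def parse_maybeint_sketch(val):
    """Return an int when val looks like one, otherwise a string or None."""
    if val is None:
        return None
    if isinstance(val, int):
        return val
    text = str(val).strip()
    if _INT_LIKE.fullmatch(text):
        # Drop a trailing ".0" before converting.
        return int(text.split(".")[0])
    return text


print(parse_maybeint_sketch("42"))    # 42
print(parse_maybeint_sketch("42.0"))  # 42
print(parse_maybeint_sketch("abc"))   # abc
```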
-
classmethod
parse_time(string)[source]¶ Takes a scalar object and returns a datetime.time object after delegating to parse_datetime. See parse_datetime for further details on parsable types, acceptable args, and exceptions raised.
Returns: None if string is falsy; otherwise the datetime.time of the parsed value.
-
prepend_transliterations(new_transliterations)[source]¶ Inserts a list of transliterations at the front of the transliteration list, so they are processed before the other/default transliterations
-
tokenize(val)[source]¶ An abstract function that takes a value (potentially cleaned) and breaks it apart based on the criteria of the subclass. This is designed primarily to allow easy transformation between DataCleaner subclass formats, such as when you have values in snake_case and want to return the CamelCase version, for example: CamelCase.join(SnakeCase.tokenize(val))
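The tokenize-then-join idea can be sketched with two standalone functions. These are hypothetical stand-ins for what a SnakeCase.tokenize and a CamelCase.join might do. The splitting rule and the capitalization of the first token are assumptions, not the package's documented behaviour.

```python
def snake_tokenize(val):
    """Split a snake_case value into tokens, ignoring empty pieces."""
    return [token for token in val.split("_") if token]


def camel_join(values):
    """Join tokens with each one capitalized (assumed CamelCase style)."""
    return "".join(token[:1].upper() + token[1:].lower() for token in values)


tokens = snake_tokenize("first_name")
print(tokens)              # ['first', 'name']
print(camel_join(tokens))  # FirstName
```

Keeping tokenize and join separate means any format that can tokenize its own values can be converted to any format that can join tokens.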
-
translations¶ indicates a dict mapping complete strings to what they should be translated to when encountered
Type: translation property
-
transliterations¶ A list of tuples or lists in which the first element is a character or regexp to match and the second element is what the matched characters should be replaced with.
Type: transliterations property
-
-
datacleaner.datacleaner.underscorejoin(m)¶
datacleaner.snakecase module¶
-
class
datacleaner.snakecase.SnakeCase(**kwargs)[source]¶ Bases: datacleaner.datacleaner.DataCleaner
-
convert_numbers= False¶
-
data_type¶ alias of builtins.str
-
join(values)[source]¶ An abstract function that takes an array of values and joins them together based on the criteria of the subclass. This is designed primarily to allow easy transformation between DataCleaner subclass formats, such as when you have values in snake_case and want to return the CamelCase version, for example: CamelCase.join(SnakeCase.tokenize(val))
-
tokenize(val)[source]¶ An abstract function that takes a value (potentially cleaned) and breaks it apart based on the criteria of the subclass. This is designed primarily to allow easy transformation between DataCleaner subclass formats, such as when you have values in snake_case and want to return the CamelCase version, for example: CamelCase.join(SnakeCase.tokenize(val))
-