How to Filter Only Printable Characters in a File on Bash (Linux) or Python


The hexdump shows that the dot in .[16D is actually an escape character, \x1b.

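Esc[nD is the ANSI escape code that moves the cursor back (left) n columns; it does not delete anything by itself. So Esc[16D moves the cursor back 16 columns, and the spaces that follow it overwrite the text that was there, which explains the cat output.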
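To make that effect concrete, here is a minimal sketch of what a terminal does with this data. It is not a terminal emulator: it assumes a single line, treats Esc[nD as "move the cursor n columns left", and ignores every other control sequence. The helper name render_line is made up for illustration.

```python
import re

# Matches ESC [ <digits> D, the "cursor back" control sequence.
CURSOR_BACK = re.compile(r'\x1b\[(\d*)D')

def render_line(data):
    """Render one line, honouring only ESC[nD (hypothetical helper)."""
    cells = []   # characters currently on the line
    col = 0      # cursor column
    pos = 0
    while pos < len(data):
        m = CURSOR_BACK.match(data, pos)
        if m:
            # A missing or zero count is treated as 1.
            n = max(1, int(m.group(1) or 1))
            col = max(0, col - n)
            pos = m.end()
            continue
        if col < len(cells):
            cells[col] = data[pos]     # overwrite an existing cell
        else:
            cells.append(data[pos])    # extend the line
        col += 1
        pos += 1
    return ''.join(cells).rstrip()

# Same bytes as the hex dump in the question: 16 spaces, then 2 spaces.
data = 'HELLO THIS IS THE TEST\x1b[16D' + ' ' * 16 + '\x1b[16D' + ' ' * 2
print(render_line(data))
```

The 16 spaces land on top of "THIS IS THE TEST", so only HELLO survives once trailing whitespace is stripped.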

There are various ways to remove ANSI escape codes from a file, either using Bash commands (e.g. using sed, as in Anubhava's answer) or Python.
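For the Python route, a regular expression is usually enough. The sketch below only covers CSI sequences (Esc[ followed by parameters and a final byte); other escape sequences would need their own patterns, and the name strip_csi is just an illustrative choice.

```python
import re

# CSI sequence: ESC [ , parameter bytes, intermediate bytes, one final byte.
CSI = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]')

def strip_csi(text):
    """Remove CSI escape sequences without interpreting them."""
    return CSI.sub('', text)

sample = 'HELLO \x1b[1mWORLD\x1b[0m'
print(strip_csi(sample))
```

Note that this removes the sequences rather than acting on them, so any text that a cursor movement was meant to overwrite stays in the output.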

However, in cases like this, it may be better to run the file through a terminal emulator to interpret any existing editing control sequences in the file, so you get the result the file's author intended after they applied those editing sequences.

One way to do that in Python is to use pyte, a Python module that implements a simple VTXXX-compatible terminal emulator. You can easily install it using pip, and its documentation is hosted on Read the Docs.

Here's a simple demo program that interprets the data given in the question. It's written for Python 2, but it's easy to adapt to Python 3. pyte is Unicode-aware, and its standard Stream class expects Unicode strings, but this example uses a ByteStream, so I can pass it a plain byte string.

#!/usr/bin/env python

''' pyte VTxxx terminal emulator demo

Interpret a byte string containing text and ANSI / VTxxx control sequences

Code adapted from the demo script in the pyte tutorial at
http://pyte.readthedocs.org/en/latest/tutorial.html#tutorial

Posted to http://stackoverflow.com/a/30571342/4014959

Written by PM 2Ring 2015.06.02
'''

import pyte

#hex dump of data
#00000000  48 45 4c 4c 4f 20 54 48  49 53 20 49 53 20 54 48  |HELLO THIS IS TH|
#00000010  45 20 54 45 53 54 1b 5b  31 36 44 20 20 20 20 20  |E TEST.[16D     |
#00000020  20 20 20 20 20 20 20 20  20 20 20 1b 5b 31 36 44  |           .[16D|
#00000030  20 20                                             |  |

#Per the hex dump: 16 spaces after the first Esc[16D, 2 after the second
data = 'HELLO THIS IS THE TEST\x1b[16D' + ' ' * 16 + '\x1b[16D' + ' ' * 2

#Create a default sized screen that tracks changed lines
screen = pyte.DiffScreen(80, 24)
screen.dirty.clear()
stream = pyte.ByteStream()
stream.attach(screen)
stream.feed(data)

#Get index of last line containing text
last = max(screen.dirty)

#Gather lines, stripping trailing whitespace
lines = [screen.display[i].rstrip() for i in range(last + 1)]

print '\n'.join(lines)

output

HELLO

hex dump of output

00000000  48 45 4c 4c 4f 0a                                 |HELLO.|

Trying to remove non-printable characters (junk values) from a UNIX file

Perhaps you could go with the complement of [:print:], which contains all printable characters:

tr -cd '[:print:]' < file > newfile

If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):

sed 's/[^[:print:]]//g' file
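If Python is an option, a similar filter can be sketched with str.isprintable(). This is only an approximation of [:print:]: Python decides printability from Unicode character categories rather than from the locale, and the function name keep_printable is made up here. Newlines are kept explicitly so the line structure survives.

```python
def keep_printable(text):
    """Drop characters that str.isprintable() rejects, but keep newlines."""
    return ''.join(ch for ch in text if ch.isprintable() or ch == '\n')

sample = 'caf\u00e9\x00 bar\x07\nnext\tline\x1b[0m\n'
print(keep_printable(sample))
```

Multi-byte characters such as é are kept because the filter works on decoded text, not on bytes. Tabs are dropped, as they are with [:print:].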

How to filter all words, which contain N or more characters?

egrep -o '[^ ]{N,}' <filename>

Find all non-space constructs at least N characters long. If you're concerned about "words" you might try [a-zA-Z].
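The same idea in Python, as a small sketch: re.findall plays the role of grep -o, and the minimum length is passed in as a parameter. The function name long_tokens and its char_class argument are illustrative, not part of any library.

```python
import re

def long_tokens(text, n, char_class=r'[^ \n]'):
    """Return every run of at least n characters drawn from char_class."""
    return re.findall(r'%s{%d,}' % (char_class, n), text)

text = "a quick brown fox jumped\nover it"
print(long_tokens(text, 5))

# Restricting the class to letters, as suggested for "words".
print(long_tokens("don't stop-believing", 5, r'[a-zA-Z]'))
```

The default class excludes newlines as well as spaces because grep works line by line, so a match there can never span two lines.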

How do I grep for all non-ASCII characters?

You can use the command:

grep --color='auto' -P -n "[\x80-\xFF]" file.xml

This will give you the line number, and will highlight non-ascii chars in red.

On some systems, depending on your settings, the above will not work; in that case you can grep for the inverse instead:

grep --color='auto' -P -n "[^\x00-\x7F]" file.xml

Note also that the important bit is the -P flag, which equates to --perl-regexp, so grep will interpret your pattern as a Perl regular expression. The grep man page also says that

this is highly experimental and grep -P may warn of unimplemented
features.
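Where grep -P is not available, a few lines of Python give a comparable report. This is a sketch under the assumption that the input has already been decoded to text, so the pattern matches characters rather than raw bytes; find_non_ascii is a hypothetical helper name.

```python
import re

NON_ASCII = re.compile(r'[^\x00-\x7F]')

def find_non_ascii(lines):
    """Yield (line_number, line) for each line with a non-ASCII character."""
    for number, line in enumerate(lines, start=1):
        if NON_ASCII.search(line):
            yield number, line.rstrip('\n')

sample = ['<a>plain</a>\n', '<b>na\u00efve</b>\n', '<c>ok</c>\n']
for number, line in find_non_ascii(sample):
    print('%d:%s' % (number, line))
```

With a real file you would pass the open file object instead of the sample list, since iterating over it yields lines in the same way.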

Removing non-displaying characters from a file

It looks like your file is encoded in UTF-16 rather than an 8-bit character set. The '^@' is a notation for ASCII NUL '\0', which usually spoils string matching.

One technique for lossless handling of this would be to use a filter to convert UTF-16 to UTF-8, and then use grep on the output. Hypothetically, if the command was 'utf16-utf8', you'd write:

utf16-utf8 weirdo | grep Lunch

As an appallingly crude approximation to 'utf16-utf8', you could consider:

tr -d '\0' < weirdo | grep Lunch

This deletes ASCII NUL characters from the input file and lets grep operate on the 'cleaned up' output. In theory, it might give you false positives; in practice, it probably won't.
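In Python the lossless route is short, because the utf-16 codec does the conversion. The sketch below builds its sample data in memory instead of reading the file from the question, and assumes the data starts with a byte order mark, as text encoded with Python's utf-16 codec does.

```python
# Build a UTF-16 sample in memory; a real file would be read in binary mode.
raw = 'Monday: Lunch at noon\nTuesday: Dinner\n'.encode('utf-16')

# In UTF-16 every ASCII character is paired with a NUL byte,
# so a plain byte search for the word does not match.
print(b'Lunch' in raw)

# Decode properly, then search the text line by line.
text = raw.decode('utf-16')
matches = [line for line in text.splitlines() if 'Lunch' in line]
print(matches)
```

Unlike deleting NUL bytes, decoding also handles characters outside ASCII correctly.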

Replace non-ASCII characters with a single space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
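Putting the two variants side by side shows the difference. The function names per_char and per_run are only labels for this comparison.

```python
import re

def per_char(text):
    # One space for every non-ASCII character.
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def per_run(text):
    # One space for every run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC\u00b7\u00b7DEF'
print(repr(per_char(sample)))
print(repr(per_run(sample)))
```

The first call leaves two spaces between ABC and DEF, the second leaves one.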

Remove non-ASCII characters from CSV

# -i (inplace)

sed -i 's/[\d128-\d255]//g' FILENAME

Remove non-ASCII characters in a file

If you want to use Perl, do it like this:

perl -pi -e 's/[^[:ascii:]]//g' filename

Detailed Explanation

The following explanation covers every part of the above command assuming the reader is unfamiliar with anything in the solution...

  • perl

    Runs the Perl interpreter. Perl is a programming language that is typically available on all Unix-like systems. This command needs to be run at a shell prompt.

  • -p

    The -p flag tells perl to iterate over every line in the input file, run the specified commands (described later) on each line, and then print the result. It is equivalent to wrapping your perl program in while(<>) { /* program... */; } continue { print; }. There's a similar -n flag that does the same but omits the continue { print; } block, so you'd use that if you wanted to do your own printing.

  • -i

    The -i flag tells perl that the input file is to be edited in place and output should go back into that file. This is important to actually modify the file. Omitting this flag will write the output to STDOUT which you can then redirect to a new file.

    Note that you cannot omit -i and redirect STDOUT to the input file as this will clobber the input file before it has been read. This is just how the shell works and has nothing to do with perl. The -i flag works around this intelligently.

    Perl and the shell allow you to combine multiple single character parameters into one which is why we can use -pi instead of -p -i

    The -i flag takes a single argument, which is a file extension to use if you want to make a backup of the original file, so if you used -i.bak, then perl would copy the input file to filename.bak before making changes. In this example I've omitted creating a backup because I expect you'll be using version control anyway :)

  • -e

    The -e flag tells perl that the next argument is a complete perl program encapsulated in a string. This is not always a good idea if you have a very long program as that can get unreadable, but with a single command program as we have here, its terseness can improve legibility.

    Note that we cannot combine the -e flag with the -i flag as both of them take in a single argument, and perl would assume that the second flag is the argument, so, for example, if we used -ie <program> <filename>, perl would assume <program> and <filename> are both input files and try to create <program>e and <filename>e assuming that e is the extension you want to use for the backup. This will fail as <program> is not really a file. The other way around (-ei) would also not work as perl would try to execute i as a program, which would fail compilation.

  • s/.../.../

    This is perl's regex based substitution operator. It takes in four arguments. The first comes before the operator, and if not specified, uses the default of $_. The second and third are between the / symbols. The fourth is after the final / and is g in this case.

    • $_ In our code, the first argument is $_ which is the default loop variable in perl. As mentioned above, the -p flag wraps our program in while(<>), which creates a while loop that reads one line at a time (<>) from the input. It implicitly assigns this line to $_, and all commands that take in a single argument will use this if not specified (eg: just calling print; will actually translate to print $_;). So, in our code, the s/.../.../ operator operates once on each line of the input file.

    • [^[:ascii:]] The second argument is the pattern to search for in the input string. This pattern is a regular expression, so anything enclosed within [] is a bracket expression. This section is probably the most complex part of this example, so we will discuss it in detail at the end.

    • <empty string> The third argument is the replacement string, which in our case is the empty string since we want to remove all non-ascii characters.

    • g The fourth argument is a modifier flag for the substitution operator. The g flag specifies that the substitution should be global across all matches in the input. Without this flag, only the first instance will be replaced. Other possible flags are i for case insensitive matches, s and m which are only relevant for multi-line strings (we have single line strings here), o which specifies that the pattern should be precompiled (which could be useful here for long files), and x which specifies that the pattern could include whitespace and comments to make it more readable (but we should not write our program on a single line if that is the case).

  • filename

    This is the input file that contains non-ascii characters that we'd like to strip out.

[^[:ascii:]]

So now let's discuss [^[:ascii:]] in more detail.

As mentioned above, [] in a regular expression specifies a bracket expression, which tells the regex engine to match a single character in the input that matches any one of the characters in the set of characters inside the expression. So, for example, [abc] will match either an a, or a b or a c, and it will match only a single character. Using ^ as the first character inverts the match, so [^abc] will match any one character that is not an a, b, or c.

But what about [:ascii:] inside the bracket expression?

If you have a unix based system available, run man 7 re_format at the command line to read the man page. If not, read the online version

[:ascii:] is a character class that represents the entire set of ascii characters, but this kind of a character class may only be used inside a bracket expression. The correct way to use this is [[:ascii:]] and it may be negated as with the abc case above or combined within a bracket expression with other characters, so, for example, [éç[:ascii:]] will match all ascii characters and also é and ç which are not ascii, and [^éç[:ascii:]] will match all characters that are not ascii and also not é or ç.
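The same negation logic can be tried out in Python, with one caveat: Python's re module does not support POSIX classes such as [:ascii:], so in this sketch the range \x00-\x7F stands in for it.

```python
import re

# \x00-\x7F stands in for [:ascii:], which Python's re does not provide.
non_ascii = re.compile(r'[^\x00-\x7F]')
# Same set, but é (\u00e9) and ç (\u00e7) are also excluded from the match.
non_ascii_except = re.compile(r'[^\u00e9\u00e7\x00-\x7F]')

sample = 'fa\u00e7ade caf\u00e9 na\u00efve'
print(non_ascii.sub('', sample))
print(non_ascii_except.sub('', sample))
```

The first substitution strips ç, é and ï; the second strips only ï, because é and ç are listed inside the negated set.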

Removing all special characters from a string in Bash

You can use tr to print only the printable characters from a string like below. Just use the below command on your input file.

tr -cd "[:print:]\n" < file1   

The -d flag deletes the characters in the set given as an argument from the input stream, and -c complements that set (inverts what's provided). So without -c the command would delete all printable characters from the input stream; with it, the command deletes the non-printable characters instead. We also include the newline character \n in the set to preserve the line endings of the input file. Leaving it out would produce the final output as one big line.

[:print:] is a POSIX character class that combines [:alnum:], [:punct:] and the space character. [:alnum:] is the same as [0-9A-Za-z], and [:punct:] includes the characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
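A rough Python counterpart of that tr command can be sketched on raw bytes. It assumes the C locale, where [:print:] is exactly the bytes 0x20 to 0x7E; the names KEEP, DELETE and printable_only are made up for this example.

```python
# Bytes 0x20..0x7E are printable ASCII; the newline is kept as well.
KEEP = bytes(range(0x20, 0x7F)) + b'\n'
# Everything else goes into the delete set, like the complement from tr -c.
DELETE = bytes(b for b in range(256) if b not in KEEP)

def printable_only(data):
    """Delete every byte that is not printable ASCII or a newline."""
    return data.translate(None, DELETE)

sample = b'abc\x00\x1b[0m\tdef\r\n\xc3\xa9!\n'
print(printable_only(sample))
```

Because it works byte by byte, this also drops the bytes of multi-byte UTF-8 characters, just as tr does in the C locale.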

Removing a small number of lines from a large file

You can use grep: -v inverts the match so that only non-matching lines are kept, -P enables Perl regex syntax, and [\x80-\xFF] is the byte range that covers everything outside ASCII.

grep -vP "[\x80-\xFF]" data.tsv > data-ASCII-only.tsv

See the question "How do I grep for all non-ASCII characters in UNIX" for more about searching for non-ASCII characters with grep.
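The same line filter can be sketched in Python. It assumes Python 3.7 or later, where str.isascii() is available, and it works on decoded text rather than raw bytes; ascii_lines is a hypothetical helper name.

```python
def ascii_lines(lines):
    """Keep only the lines that consist entirely of ASCII characters."""
    return [line for line in lines if line.isascii()]

sample = ['id\tname\n', '1\tZo\u00eb\n', '2\tBob\n']
print(ascii_lines(sample))
```

For a large file it may be preferable to iterate over the open file and write each kept line out immediately instead of building a list.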


