EXTRACT

Section: User Commands (1)
Updated: Dec 29, 2006
Index Return to Main Contents

NAME

extract - determine meta-information about a file

SYNOPSIS

extract [ -abdfghLnrsvV ] [ -B language ] [ -H hash-algorithm ] [ -l library ] [ -p type ] [ -x type ] file ...

DESCRIPTION

This manual page documents version 0.5.17 of the extract command.

extract tests each file specified in the argument list in an attempt to infer meta-information from it. Each file is subjected to the meta-data extraction libraries from libextractor.

libextractor classifies meta-information (also referred to as keywords) into types. A list of all types can be obtained with the -L option.

OPTIONS

-a: Do not remove any duplicates, even if the keywords match exactly and have the same type (i.e. because the same keyword was found by different extractor libraries).
-b: Display the output in BiBTeX format. This implies the -d option
-B LANG: Use the generic plaintext extractor for the language with the 2-letter language code LANG. Supported languages are DA (Danish), DE (German), EN (English), ES (Spanish), FI (Finnish), FR (French), GA (Gaelic), IT (Italian), NO (Norwegian) and SV (Swedish).
-d: Remove duplicates only if the types match exactly. By default, duplicates are removed if the types match or if one of the types is I unknown (in this case, the duplicate of unknown type is removed).
-f: add the filename(s) (without directory) to the list of keywords.
-g: Use grep-friendly output (all keywords on a single line for each file). Use the verbose option to print the filename first, followed by the keywords. Use the verbose option twice to also display the keyword types. This option will not print keyword types or non-textual metadata.
-h: Print a brief summary of the options.
-H ALGORITHM: Use the ALGORITHM to compute a hash of each file (possible algorithms are sha1 and md5).
-L: Print a list of all known keyword types.
-n: Do not use the default set of extractors (typically all standard extractors, currently mp3, ogg, jpg, gif, png, tiff, real, html, pdf and mime-types), use only the extractors specified with the .B -l option.
-r: Remove all duplicates disregarding differences in the keyword type.
-s: Split keywords at delimiters (space, comma, colon, etc.) and list split keywords to be of .I unknown type. This can also be done by loading the split-library. Using this option guarantees that the splitting is performed after all other libraries have been run. It is always performed before duplicate elimination.
-v: Print the version number and exit.
-V: Be verbose.
-B: Run the printable extractor (costly, generic extractor for binaries)
-l libraries: Use the specified libraries to extract keywords. The general format of libraries is .I [[-]LIBRARYNAME[:[-]LIBRARYNAME]*] where LIBRARYNAME is a libextractor compatible library and typically of the form .I libextractor_jpeg.so. The minus before the libraryname indicates that this library should be run after all the libraries that were specified so far. If the minus is missing, the library is run before all previously specified libraries.
-p type: Print only the keywords matching the specified type. By default, all keywords that are found and not removed as duplicates are printed.
-x type: Exclude keywords of the specified type from the output. By default, all keywords that are found and not removed as duplicates are printed.

EXAMPLES

$ extract test/test.jpg
comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1
mimetype - image/jpeg

$ extract -Vf -x comment test/test.jpg
Keywords for file test/test.jpg:
mimetype - image/jpeg
filename - test.jpg

$ extract -p comment test/test.jpg
comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1

$ extract -nV -l libextractor_png.so -p comment test/test.jpg test/test.png
Keywords for file test/test.jpg:
Keywords for file test/test.png:
comment - Testing keyword extraction

LEGAL NOTICE

libextractor and the extract tool are released under the GPL. libextractor is a GNU project.

BUGS

A couple of file-formats (on the order of 10^3) are not recognized...

AUTHORS

extract was originally written by Christian Grothoff <christian@grothoff.org> and Vidyut Samanta <vids@cs.ucla.edu>. Use <libextractor@gnu.org> to contact the current maintainer(s).

AVAILABILITY

You can obtain the original author's latest version from http://gnunet.org/libextractor/